Chapter 33. Factors affecting I/O and file system performance
The appropriate settings for storage and file system performance are highly dependent on the storage purpose.
I/O and file system performance can be affected by any of the following factors:
- Data write or read patterns
- Sequential or random
- Buffered or direct I/O
- Data alignment with underlying geometry
- Block size
- File system size
- Journal size and location
- Recording access times
- Ensuring data reliability
- Pre-fetching data
- Pre-allocating disk space
- File fragmentation
- Resource contention
33.1. Tools for monitoring and diagnosing I/O and file system issues
The following tools are available in Red Hat Enterprise Linux 8 for monitoring system performance and diagnosing performance problems related to I/O, file systems, and their configuration:
- vmstat reports on processes, memory, paging, block I/O, interrupts, and CPU activity across the entire system. It can help administrators determine whether the I/O subsystem is responsible for any performance issues. If analysis with vmstat shows that the I/O subsystem is responsible for reduced performance, administrators can use the iostat tool to determine the responsible I/O device. For a sketch of this workflow, see the example at the end of this section.
- iostat reports on I/O device load in your system. It is provided by the sysstat package.
- blktrace provides detailed information about how time is spent in the I/O subsystem. The companion utility blkparse reads the raw output from blktrace and produces a human-readable summary of the input and output operations recorded by blktrace.
- btt analyzes blktrace output and displays the amount of time that data spends in each area of the I/O stack, making it easier to spot bottlenecks in the I/O subsystem. This utility is provided as part of the blktrace package. Some of the important events tracked by the blktrace mechanism and analyzed by btt are:
  - Queuing of the I/O event (Q)
  - Dispatch of the I/O to the driver event (D)
  - Completion of the I/O event (C)
- iowatcher can use the blktrace output to graph I/O over time. It focuses on the Logical Block Address (LBA) of disk I/O, throughput in megabytes per second, the number of seeks per second, and I/O operations per second. This can help to identify when you are hitting the operations-per-second limit of a device.
- BPF Compiler Collection (BCC) is a library that facilitates the creation of extended Berkeley Packet Filter (eBPF) programs. The eBPF programs are triggered on events, such as disk I/O, TCP connections, and process creations. The BCC tools are installed in the /usr/share/bcc/tools/ directory. The following bcc-tools help to analyze performance:
  - biolatency summarizes the latency in block device I/O (disk I/O) as a histogram. This allows the distribution to be studied, including two modes for device cache hits and cache misses, and latency outliers.
  - biosnoop is a basic block I/O tracing tool that displays each I/O event along with the issuing process ID and the I/O latency. Using this tool, you can investigate disk I/O performance issues.
  - biotop displays the top processes performing block I/O operations in the kernel.
  - filelife traces the stat() syscalls.
  - fileslower traces slow synchronous file reads and writes.
  - filetop displays file reads and writes by process.
  - ext4slower, nfsslower, and xfsslower show file system operations slower than a certain threshold, which defaults to 10 ms.

  For more information, see Analyzing system performance with BPF Compiler Collection.
- bpftrace is a tracing language for eBPF used for analyzing performance issues. Like BCC, it provides trace utilities for system observation, which is useful for investigating I/O performance issues.
- The following SystemTap scripts may be useful in diagnosing storage or file system performance problems:
  - disktop.stp: Checks the status of reading or writing disk every 5 seconds and outputs the top ten entries during that period.
  - iotime.stp: Prints the amount of time spent on read and write operations, and the number of bytes read and written.
  - traceio.stp: Prints the top ten executables based on cumulative I/O traffic observed, every second.
  - traceio2.stp: Prints the executable name and process identifier as reads and writes to the specified device occur.
  - inodewatch.stp: Prints the executable name and process identifier each time a read or write occurs to the specified inode on the specified major or minor device.
  - inodewatch2.stp: Prints the executable name, process identifier, and attributes each time the attributes are changed on the specified inode on the specified major or minor device.
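The following is a minimal sketch of the triage workflow described above. The device name /dev/sda is illustrative, and the sysstat, blktrace, and bcc-tools packages are assumed to be installed:

```
# System-wide view: sustained high values in the 'b' (blocked processes)
# and 'wa' (I/O wait) columns suggest an I/O bottleneck.
vmstat 5 3

# Extended per-device statistics to find the responsible device.
iostat -dxz 5 3

# Trace the block layer on the suspect device for 30 seconds, then
# produce a human-readable summary and a btt timing breakdown.
blktrace -d /dev/sda -o trace -w 30
blkparse -i trace -d trace.bin | less
btt -i trace.bin

# Histogram of block I/O latency from the BCC collection.
/usr/share/bcc/tools/biolatency 10 1
```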
33.2. Available tuning options for formatting a file system
Some file system configuration decisions cannot be changed after the device is formatted.
The following are the options available before formatting a storage device:
Size
Create an appropriately-sized file system for your workload. Smaller file systems require less time and memory for file system checks. However, if a file system is too small, its performance suffers from high fragmentation.
Block size
The block is the unit of work for the file system. The block size determines how much data can be stored in a single block, and therefore the smallest amount of data that is written or read at one time.

The default block size is appropriate for most use cases. However, your file system performs better and stores data more efficiently if the block size, or the size of multiple blocks, is the same as or slightly larger than the amount of data that is typically read or written at one time. A small file still uses an entire block. Files can be spread across multiple blocks, but this can create additional runtime overhead.

Additionally, some file systems are limited to a certain number of blocks, which in turn limits the maximum size of the file system. Block size is specified as part of the file system options when formatting a device with the mkfs command. The parameter that specifies the block size varies with the file system.

Geometry
File system geometry is concerned with the distribution of data across a file system. If your system uses striped storage, such as RAID, you can improve performance by aligning data and metadata with the underlying storage geometry when you format the device.

Many devices export recommended geometry, which is then set automatically when the devices are formatted with a particular file system. If your device does not export these recommendations, or if you want to change the recommended settings, you must specify geometry manually when you format the device with the mkfs command. The parameters that specify file system geometry vary with the file system.
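As an illustration, the following sketch shows how block size and geometry might be specified at format time. The partition /dev/sdb1 and the RAID geometry (64 KB chunks across 4 data disks) are assumptions, not recommendations:

```
# XFS: 4 KB blocks, aligned to the assumed RAID stripe
# (su = chunk size, sw = number of data disks).
mkfs.xfs -b size=4096 -d su=64k,sw=4 /dev/sdb1

# ext4: stride and stripe-width are given in file system blocks;
# with 4 KB blocks, a 64 KB chunk corresponds to a stride of 16.
mkfs.ext4 -b 4096 -E stride=16,stripe-width=64 /dev/sdb1
```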
External journals
Journaling file systems document the changes that will be made during a write operation in a journal file prior to the operation being executed. This reduces the likelihood that a storage device will become corrupted in the event of a system crash or power failure, and speeds up the recovery process.
Red Hat does not recommend using the external journals option.
Metadata-intensive workloads involve very frequent updates to the journal. A larger journal uses more memory, but reduces the frequency of write operations. Additionally, you can improve the seek time of a device with a metadata-intensive workload by placing its journal on dedicated storage that is as fast as, or faster than, the primary storage.
Ensure that external journals are reliable. Losing an external journal device causes file system corruption. External journals must be created at format time, with journal devices being specified at mount time.
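For illustration only, the following sketch places a journal on a dedicated device. The device names /dev/sdc1 (fast journal device) and /dev/sdb1 (data device) and the mount point are assumptions:

```
# ext4: create a dedicated journal device, then a file system that uses it.
# The journal device and the file system must share the same block size.
mke2fs -O journal_dev -b 4096 /dev/sdc1
mkfs.ext4 -b 4096 -J device=/dev/sdc1 /dev/sdb1

# XFS: place the log on a dedicated device at format time,
# and specify the same log device at mount time.
mkfs.xfs -l logdev=/dev/sdc1 /dev/sdb1
mount -o logdev=/dev/sdc1 /dev/sdb1 /mnt/data
```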
33.3. Available tuning options for mounting a file system
The following are the options available to most file systems and can be specified as the device is mounted:
Access Time
Every time a file is read, its metadata is updated with the time at which access occurred (atime). This involves additional write I/O. The relatime option is the default atime setting for most file systems.

However, if updating this metadata is time consuming, and if accurate access time data is not required, you can mount the file system with the noatime mount option. This disables updates to metadata when a file is read. It also enables nodiratime behavior, which disables updates to metadata when a directory is read.
Disabling atime updates by using the noatime mount option can break applications that rely on them, for example, backup programs.
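A minimal example of mounting with atime updates disabled, assuming the illustrative device /dev/sdb1 and mount point /mnt/data:

```
# Disable atime updates for this mount (also enables nodiratime behavior).
mount -o noatime /dev/sdb1 /mnt/data

# Equivalent persistent entry in /etc/fstab:
# /dev/sdb1  /mnt/data  xfs  defaults,noatime  0 0
```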
Read-ahead
Read-ahead behavior speeds up file access by pre-fetching data that is likely to be needed soon and loading it into the page cache, where it can be retrieved more quickly than if it were on disk. The higher the read-ahead value, the further ahead the system pre-fetches data.

Red Hat Enterprise Linux attempts to set an appropriate read-ahead value based on what it detects about your file system. However, accurate detection is not always possible. For example, if a storage array presents itself to the system as a single LUN, the system detects the single LUN and does not set the appropriate read-ahead value for the array.

Workloads that involve heavy streaming of sequential I/O often benefit from high read-ahead values. The storage-related tuned profiles provided with Red Hat Enterprise Linux raise the read-ahead value, as does using LVM striping, but these adjustments are not always sufficient for all workloads.
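Read-ahead for a block device can be inspected and adjusted with the blockdev utility, which works in 512-byte sectors. The device name and value below are illustrative:

```
# Show the current read-ahead value, in 512-byte sectors.
blockdev --getra /dev/sdb

# Raise read-ahead to 4 MiB (8192 sectors) for a streaming workload.
blockdev --setra 8192 /dev/sdb
```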
33.4. Types of discarding unused blocks
Regularly discarding blocks that are not in use by the file system is a recommended practice for both solid-state disks and thinly-provisioned storage.
The following are the two methods of discarding unused blocks:
Batch discard
This type of discard is part of the fstrim command. It discards all unused blocks in a file system that match criteria specified by the administrator. Red Hat Enterprise Linux 8 supports batch discard on XFS and ext4 formatted devices that support physical discard operations.

Online discard
This type of discard operation is configured at mount time with the discard option, and runs in real time without user intervention. However, it only discards blocks that are transitioning from used to free. Red Hat Enterprise Linux 8 supports online discard on XFS and ext4 formatted devices.
Red Hat recommends batch discard, except where online discard is required to maintain performance, or where batch discard is not feasible for the system’s workload.
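The two methods look like this in practice; the device and mount point are illustrative:

```
# Batch discard: trim all unused blocks on one mounted file system.
fstrim /mnt/data

# Trim every mounted file system that supports discard.
fstrim --all

# Online discard: enable real-time discard at mount time instead.
mount -o discard /dev/sdb1 /mnt/data
```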
Pre-allocation
Pre-allocation marks disk space as being allocated to a file without writing any data into that space. This can be useful in limiting data fragmentation and the resulting poor read performance. Red Hat Enterprise Linux 8 supports pre-allocating space on XFS, ext4, and GFS2 file systems. Applications can also benefit from pre-allocating space by using the fallocate(2) glibc call.
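From the command line, the fallocate utility exposes the same mechanism; the path and size below are illustrative:

```
# Reserve 10 GiB for a file up front without writing data into the space.
fallocate --length 10G /mnt/data/prealloc.img
```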
33.5. Solid-state disks tuning considerations
Solid-state disks (SSD) use NAND flash chips rather than rotating magnetic platters to store persistent data. SSDs provide a constant access time for data across their full Logical Block Address range, and do not incur measurable seek costs like their rotating counterparts. They are more expensive per gigabyte of storage space and have a lower storage density, but they also have lower latency and greater throughput than HDDs.
Performance generally degrades as the used blocks on an SSD approach the capacity of the disk. The degree of degradation varies by vendor, but all devices experience degradation in this circumstance. Enabling discard behavior can help to alleviate this degradation. For more information, see Types of discarding unused blocks.
The default I/O scheduler and virtual memory options are suitable for use with SSDs. Consider the following factors when configuring settings that can affect SSD performance:
I/O Scheduler
Any I/O scheduler is expected to perform well with most SSDs. However, as with any other storage type, Red Hat recommends benchmarking to determine the optimal configuration for a given workload. When using SSDs, Red Hat advises changing the I/O scheduler only for benchmarking particular workloads. For instructions on how to switch between I/O schedulers, see the /usr/share/doc/kernel-version/Documentation/block/switching-sched.txt file.

For single-queue HBAs, the default I/O scheduler is deadline. For multi-queue HBAs, the default I/O scheduler is none. For information about how to set the I/O scheduler, see Setting the disk scheduler.

Virtual Memory
Like the I/O scheduler, the virtual memory (VM) subsystem requires no special tuning. Given the fast nature of I/O on SSDs, try turning down the vm.dirty_background_ratio and vm.dirty_ratio settings, as increased write-out activity does not usually have a negative impact on the latency of other operations on the disk (see the example at the end of this section). However, this tuning can generate more overall I/O, and is therefore not generally recommended without workload-specific testing.

Swap
An SSD can also be used as a swap device, and is likely to produce good page-out and page-in performance.
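The following sketch shows how the virtual memory settings mentioned above might be inspected and lowered. The values 5 and 20 are illustrative starting points, not recommendations, and should be validated against your workload:

```
# Inspect the current write-back thresholds.
sysctl vm.dirty_background_ratio vm.dirty_ratio

# Try lower values for fast SSD-backed storage (illustrative values).
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=20
```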
33.6. Generic block device tuning parameters
The generic tuning parameters listed here are available in the /sys/block/sdX/queue/ directory.
The following listed tuning parameters are separate from I/O scheduler tuning, and are applicable to all I/O schedulers:
add_random
Some I/O events contribute to the entropy pool for /dev/random. This parameter can be set to 0 if the overhead of these contributions becomes measurable.

iostats
By default, iostats is enabled and the default value is 1. Setting the iostats value to 0 disables the gathering of I/O statistics for the device, which removes a small amount of overhead from the I/O path. Setting iostats to 0 might slightly improve performance for very high performance devices, such as certain NVMe solid-state storage devices. It is recommended to leave iostats enabled unless otherwise specified for the given storage model by the vendor.

If you disable iostats, the I/O statistics for the device are no longer present within the /proc/diskstats file. The /proc/diskstats file is the source of I/O information for monitoring tools, such as sar or iostat. Therefore, if you disable the iostats parameter for a device, the device is no longer present in the output of I/O monitoring tools.

max_sectors_kb
Specifies the maximum size of an I/O request in kilobytes. The default value is 512 KB. The minimum value for this parameter is determined by the logical block size of the storage device. The maximum value for this parameter is determined by the value of max_hw_sectors_kb.

Red Hat recommends that max_sectors_kb always be a multiple of the optimal I/O size and the internal erase block size. Use a value of logical_block_size for either parameter if it is zero or not specified by the storage device.

nomerges
Most workloads benefit from request merging. However, disabling merges can be useful for debugging purposes. By default, the nomerges parameter is set to 0, which enables merging. To disable simple one-hit merging, set nomerges to 1. To disable all types of merging, set nomerges to 2.

nr_requests
The maximum allowed number of queued I/O requests. If the current I/O scheduler is none, this number can only be reduced; otherwise the number can be increased or reduced.

optimal_io_size
Some storage devices report an optimal I/O size through this parameter. If this value is reported, Red Hat recommends that applications issue I/O aligned to and in multiples of the optimal I/O size wherever possible.

read_ahead_kb
Defines the maximum number of kilobytes that the operating system may read ahead during a sequential read operation. As a result, the necessary information is already present within the kernel page cache for the next sequential read, which improves read I/O performance.

Device mappers often benefit from a high read_ahead_kb value. 128 KB for each device to be mapped is a good starting point, but increasing the read_ahead_kb value up to the request queue's max_sectors_kb of the disk might improve performance in application environments where sequential reading of large files takes place.

rotational
Some solid-state disks do not correctly advertise their solid-state status, and are mounted as traditional rotational disks. Manually set the rotational value to 0 to disable unnecessary seek-reducing logic in the scheduler.

rq_affinity
The default value of rq_affinity is 1. With this value, I/O completions are processed on a CPU core that is in the same CPU group as the core that issued the I/O request. To perform completions only on the processor that issued the I/O request, set rq_affinity to 2. To disable both of these behaviors, set it to 0.

scheduler
To set the scheduler or scheduler preference order for a particular storage device, edit the /sys/block/devname/queue/scheduler file, where devname is the name of the device that you want to configure.
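For illustration, the following sketch reads and writes several of the parameters described above. The device /dev/sda and the chosen values are assumptions, and writes to /sys require root privileges:

```
# List available schedulers; the active one is shown in brackets.
cat /sys/block/sda/queue/scheduler

# Select a different scheduler for the device.
echo kyber > /sys/block/sda/queue/scheduler

# Mark a misdetected SSD as non-rotational and stop entropy sampling.
echo 0 > /sys/block/sda/queue/rotational
echo 0 > /sys/block/sda/queue/add_random

# Raise read-ahead to 1024 KB for sequential-read-heavy workloads.
echo 1024 > /sys/block/sda/queue/read_ahead_kb
```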