Chapter 26. Factors affecting I/O and file system performance
The appropriate settings for storage and file system performance depend heavily on the purpose of the storage.
The following factors can affect I/O and file system performance:
- Data write or read patterns
- Sequential or random
- Buffered or direct I/O
- Data alignment with underlying geometry
- Block size
- File system size
- Journal size and location
- Recording access times
- Ensuring data reliability
- Pre-fetching data
- Pre-allocating disk space
- File fragmentation
- Resource contention
26.1. Tools for monitoring and diagnosing I/O and file system issues
Monitor and diagnose I/O and file system issues efficiently by using tools that track performance metrics, analyze device load, latency, and trace operations. These tools help to pinpoint bottlenecks and optimize system performance in Red Hat Enterprise Linux 10 environments.
The following tools are available in Red Hat Enterprise Linux 10 for monitoring system performance and diagnosing performance problems related to I/O and file systems:
- vmstat reports on processes, memory, paging, block I/O, interrupts, and CPU activity across the entire system. It can help administrators determine whether the I/O subsystem is responsible for any performance issues. If vmstat analysis shows that the I/O subsystem causes reduced performance, administrators can use iostat to identify the responsible I/O device.
- iostat reports on I/O device load in your system. It is provided by the sysstat package.
- blktrace provides detailed information about how time is spent in the I/O subsystem. The companion utility blkparse reads the raw output from blktrace and produces a human-readable summary of recorded input and output operations.
- btt analyzes blktrace output and displays the amount of time that data spends in each area of the I/O stack. This makes it easier to spot bottlenecks in the I/O subsystem. This utility is provided as part of the blktrace package. Some of the important events tracked by the blktrace mechanism and analyzed by btt are:
  - Queuing of the I/O event (Q)
  - Dispatch of the I/O to the driver event (D)
  - Completion of I/O event (C)
- iowatcher can use the blktrace output to graph I/O over time. It focuses on the Logical Block Address (LBA) of disk I/O, throughput, seeks per second, and I/O operations per second. This can help to identify when you are hitting the operations-per-second limit of a device.
- BPF Compiler Collection (BCC) is a library that facilitates the creation of extended Berkeley Packet Filter (eBPF) programs. eBPF programs are triggered on events, such as disk I/O, TCP connections, and process creations. The BCC tools are installed in the /usr/share/bcc/tools/ directory. The following bcc-tools help to analyze performance:
  - biolatency summarizes the latency of block device I/O (disk I/O) in a histogram. This allows the distribution to be studied, including two modes for device cache hits and for cache misses, and latency outliers.
  - biosnoop is a basic block I/O tracing tool that displays each I/O event along with the issuing process ID and I/O latency. Using this tool, you can investigate disk I/O performance issues.
  - biotop summarizes block I/O operations in the kernel by process, in a top-like display.
  - filelife traces the lifespan of short-lived files, that is, files that are created and then deleted while the tool is running.
  - fileslower traces slow synchronous file reads and writes.
  - filetop displays file reads and writes by process.
  - ext4slower, nfsslower, and xfsslower show file system operations slower than a certain threshold, which defaults to 10 ms.
- bpftrace is a tracing language for eBPF used for analyzing performance issues. It also provides trace utilities like BCC for system observation, which is useful for investigating I/O performance issues.
- The following SystemTap scripts may be useful in diagnosing storage or file system performance problems:
  - disktop.stp: Checks the status of reading or writing disk every 5 seconds and outputs the top ten entries during that period.
  - iotime.stp: Prints the amount of time spent on read and write operations, and the number of bytes read and written.
  - traceio.stp: Prints the top ten executables based on cumulative I/O traffic observed, every second.
  - traceio2.stp: Prints the executable name and process identifier as reads and writes to the specified device occur.
  - inodewatch.stp: Prints the executable name and process identifier each time a read or write occurs to the specified inode on the specified device.
  - inodewatch2.stp: Prints the executable name, process identifier, and attributes each time the attributes are changed on the specified inode on the specified device.
For more information, see:
- vmstat(8), iostat(1), blktrace(8), blkparse(1), btt(1), bpftrace, and iowatcher(1) man pages on your system.
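To make the Q, D, and C events more concrete, the following Python sketch computes, for each request, the time spent queued (Q to D) and the time spent in the driver and device (D to C). This is the kind of breakdown that btt reports. The event tuples and the function name are simplified, hypothetical stand-ins and do not reflect the actual blkparse output format.

```python
from collections import defaultdict


def latency_breakdown(events):
    """Return {sector: (queue_time, device_time)} for requests that have Q, D and C events.

    `events` is an iterable of (timestamp_seconds, action, sector) tuples.
    This tuple layout is a simplification for illustration only; it is not
    the real blkparse output format.
    """
    seen = defaultdict(dict)
    for timestamp, action, sector in events:
        if action in ("Q", "D", "C"):
            # Keep the first timestamp seen for each action on a sector.
            seen[sector].setdefault(action, timestamp)

    result = {}
    for sector, times in seen.items():
        if all(action in times for action in ("Q", "D", "C")):
            queue_time = times["D"] - times["Q"]   # time waiting before dispatch
            device_time = times["C"] - times["D"]  # driver and device service time
            result[sector] = (queue_time, device_time)
    return result


# Hypothetical sample: the request at sector 4096 has not completed yet.
sample = [
    (0.000, "Q", 2048),
    (0.002, "D", 2048),
    (0.010, "C", 2048),
    (0.001, "Q", 4096),
    (0.001, "D", 4096),
]
breakdown = latency_breakdown(sample)
for sector, (q2d, d2c) in breakdown.items():
    print(f"sector {sector}: Q2D={q2d * 1000:.1f} ms, D2C={d2c * 1000:.1f} ms")
```

A large Q-to-D share suggests time lost in queuing, while a large D-to-C share points at the device itself.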
26.2. Available tuning options for formatting a file system
Some file system configuration decisions cannot be changed after the device is formatted. These include the size, block size, geometry, and external journals.
The following are the details of the options that are available before formatting a storage device:
Size
Create an appropriately sized file system for your workload. Smaller file systems require less time and memory for file system checks. However, if a file system is too small, its performance suffers from high fragmentation.
Block size
The block is the unit of work for the file system. The block size determines how much data can be stored in a single block. It therefore sets the smallest data amount written or read at one time.
The default block size is appropriate for most use cases. However, your file system performs better if the block size matches the typical read or write amount. Optimal performance occurs when the block size equals or slightly exceeds the data typically accessed at once.
A small file still uses an entire block. Files can be spread across multiple blocks, but this can create additional runtime overhead.
Additionally, some file systems are limited to a certain number of blocks, which limits the maximum size of the file system. Block size is specified as part of the file system options when formatting a device with the mkfs command. The parameter that specifies the block size varies with the file system.
Geometry
File system geometry is concerned with the distribution of data across a file system. If your system uses striped storage like RAID, align data and metadata with the underlying storage geometry when formatting. This improves performance.
Many devices export recommended geometry, which is then set automatically when the devices are formatted with a particular file system. If your device does not export these recommendations, or you want to change them, specify geometry manually when formatting with mkfs.
The parameters that specify file system geometry vary with the file system.
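As a worked example of aligning a file system to striped storage, the Python sketch below derives alignment values from a RAID chunk size and the number of data-bearing disks. The function name and the example array are hypothetical. The option names in the trailing comment (stride and stripe-width for ext4, su and sw for XFS) are an assumption based on the mkfs.ext4(8) and mkfs.xfs(8) man pages; check them on your system before use.

```python
def stripe_geometry(chunk_kib, data_disks, block_size_bytes=4096):
    """Compute alignment values for a striped array.

    chunk_kib: RAID chunk (stripe unit) size in KiB.
    data_disks: number of disks holding data in one stripe (parity disks excluded).
    block_size_bytes: file system block size in bytes.
    """
    chunk_bytes = chunk_kib * 1024
    if chunk_bytes % block_size_bytes:
        raise ValueError("chunk size must be a multiple of the block size")
    stride = chunk_bytes // block_size_bytes  # file system blocks per chunk
    return {
        "stride_blocks": stride,
        "stripe_width_blocks": stride * data_disks,
        "full_stripe_kib": chunk_kib * data_disks,
    }


# Hypothetical 6-disk RAID 6 array: 4 data disks, 64 KiB chunk, 4 KiB blocks.
geometry = stripe_geometry(chunk_kib=64, data_disks=4)
print(geometry)
# Assumed mapping to mkfs options (verify against the man pages):
#   mkfs.ext4 -E stride=<stride_blocks>,stripe-width=<stripe_width_blocks>
#   mkfs.xfs  -d su=<chunk_kib>k,sw=<data_disks>
```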
External journals
Journaling file systems document changes in a journal file before running write operations. This reduces the likelihood of device corruption during system crashes or power failures. It also speeds up recovery.
It is preferable to not use the external journals option.
Metadata-intensive workloads involve very frequent updates to the journal. A larger journal uses more memory, but reduces the frequency of write operations. Additionally, you can improve the seek time of a device with a metadata-intensive workload by placing its journal on dedicated storage. Use storage as fast as, or faster than the primary storage.
Ensure that external journals are reliable. Losing an external journal device causes file system corruption. External journals must be created at format time, with journal devices being specified at mount time.
26.3. Available tuning options for mounting a file system
You can explore key tuning options for mounting file systems, including atime, noatime, and read-ahead settings, to select mount options that balance performance and functionality for different workloads.
The following options are available to most file systems and can be specified when the device is mounted:
Access Time
Every time a file is read, its metadata is updated with the time at which access occurred (atime). This involves additional write I/O. relatime is the default atime setting for most file systems.
However, if updating this metadata is time consuming, and if accurate access time data is not required, you can mount the file system with the noatime mount option. This disables updates to metadata when a file is read. It also enables nodiratime behavior, which disables updates to metadata when a directory is read.
Disabling atime updates by using the noatime mount option can break applications that rely on them, for example, backup programs.
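Before changing the access-time behavior, it can help to see which policy each mounted file system currently uses. The Python sketch below classifies the option string found in the fourth field of /proc/mounts-style text. The function names and the sample lines are illustrative assumptions.

```python
def atime_policy(mount_options):
    """Classify the access-time policy from a comma-separated mount option string."""
    options = set(mount_options.split(","))
    for policy in ("noatime", "relatime", "strictatime"):
        if policy in options:
            return policy
    return "unspecified"  # no atime-related option is listed in the string


def policies_from_mounts(mounts_text):
    """Map mount point -> atime policy for text in /proc/mounts format."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            # Fields: device, mount point, file system type, options, ...
            result[fields[1]] = atime_policy(fields[3])
    return result


# Illustrative sample; on a real system, read the text from /proc/mounts.
sample = (
    "/dev/sda1 / xfs rw,relatime,attr2,inode64 0 0\n"
    "/dev/sdb1 /data ext4 rw,noatime 0 0\n"
)
policies = policies_from_mounts(sample)
print(policies)
```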
Read-ahead
Read-ahead behavior speeds up file access by pre-fetching data that is likely to be needed soon and loading it into the page cache, where it can be retrieved more quickly than if it were on disk. The higher the read-ahead value, the further ahead the system pre-fetches data.
Red Hat Enterprise Linux attempts to set an appropriate read-ahead value based on what it detects about your file system. However, accurate detection is not always possible. For example, if a storage array presents itself to the system as a single LUN, the system detects the single LUN, and does not set the appropriate read-ahead value for an array.
Workloads that involve heavy streaming of sequential I/O often benefit from high read-ahead values. The storage-related tuned profiles provided with Red Hat Enterprise Linux raise the read-ahead value, as does using LVM striping, but these adjustments are not always sufficient for all workloads.
26.4. Discarding blocks that are unused
Regularly discarding blocks that are not in use by the file system is a good practice for both solid-state disks and thinly-provisioned storage.
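Discarding only helps on devices that accept discard requests. As a minimal sketch, the Python function below checks the discard_max_bytes attribute in a block device's queue directory, on the assumption that a value of 0 or a missing attribute means discard is not supported. The function name is hypothetical, and the demonstration runs against a temporary directory rather than the real /sys tree.

```python
import tempfile
from pathlib import Path


def supports_discard(queue_dir):
    """Return True if queue_dir reports a nonzero discard_max_bytes value."""
    attribute = Path(queue_dir) / "discard_max_bytes"
    try:
        return int(attribute.read_text().strip()) > 0
    except (OSError, ValueError):
        # Missing or unreadable attribute: treat as no discard support.
        return False


# Demonstration against a temporary directory standing in for sysfs.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "discard_max_bytes").write_text("2147450880\n")
    demo_supported = supports_discard(demo_dir)
    demo_missing = supports_discard(Path(demo_dir) / "no-such-device")

print(demo_supported, demo_missing)
```

On a real system, the argument would be a path such as /sys/block/sdX/queue, with sdX replaced by the device name.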
26.5. Solid-state disks tuning considerations
Solid-state disks (SSDs) use NAND flash chips rather than rotating magnetic platters to store persistent data. SSDs provide a constant access time for data across their full Logical Block Address range. They do not incur measurable seek costs like their rotating counterparts.
They are more expensive per gigabyte of storage space and have a lower storage density. However, they also have lower latency and greater throughput than hard disk drives (HDDs).
Performance generally degrades as the used blocks on an SSD approach the capacity of the disk. The degree of degradation varies by vendor, but all devices experience degradation in this circumstance. Enabling discard behavior can help to alleviate this degradation.
The default I/O scheduler and virtual memory options are suitable for use with SSDs. Consider the following factors when configuring settings that can affect SSD performance:
I/O Scheduler
Any I/O scheduler is expected to perform well with most SSDs. However, as with any other storage type, benchmark to determine the optimal configuration for a given workload. When using SSDs, change the I/O scheduler only to benchmark particular workloads.
For instructions on how to switch between I/O schedulers, see the /usr/share/doc/kernel-version/Documentation/block/switching-sched.txt file.
For single queue Host Bus Adapter (HBA), the default I/O scheduler is deadline. For multiple queue HBA, the default I/O scheduler is none.
Virtual Memory
Like the I/O scheduler, the virtual memory (VM) subsystem requires no special tuning. Given the fast nature of I/O on SSDs, try turning down the vm.dirty_background_ratio and vm.dirty_ratio settings. Increased write-out activity does not usually have a negative impact on the latency of other operations on the disk. However, this tuning can generate more overall I/O, and is therefore not generally preferable without workload-specific testing.
Swap
An SSD can also be used as a swap device, and is likely to produce good page-out and page-in performance.
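The effect of lowering the two dirty ratio settings can be estimated with simple arithmetic. The kernel applies these percentages to the amount of memory it considers available for dirty pages, so the Python sketch below is only an approximation. The function name and the 16 GiB figure are illustrative assumptions.

```python
def dirty_thresholds(available_bytes, background_ratio, dirty_ratio):
    """Approximate (background, blocking) dirty-page thresholds in bytes.

    background_ratio and dirty_ratio are percentages, as stored in the
    dirty_background_ratio and dirty_ratio settings.
    """
    for ratio in (background_ratio, dirty_ratio):
        if not 0 <= ratio <= 100:
            raise ValueError("ratios are percentages between 0 and 100")
    background = available_bytes * background_ratio // 100
    blocking = available_bytes * dirty_ratio // 100
    return background, blocking


GIB = 1024 ** 3
available = 16 * GIB  # hypothetical amount of memory available for dirty pages

for background_ratio, dirty_ratio in ((10, 20), (5, 10)):
    background, blocking = dirty_thresholds(available, background_ratio, dirty_ratio)
    print(f"{background_ratio}/{dirty_ratio}: background write-out from about "
          f"{background / GIB:.2f} GiB, writers throttled from about {blocking / GIB:.2f} GiB")
```

Halving both ratios halves the amount of dirty data that can accumulate, at the cost of more frequent write-out.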
26.6. Generic block device tuning parameters
The generic tuning parameters listed here are available in the /sys/block/sdX/queue/ directory.
The following listed tuning parameters are separate from I/O scheduler tuning, and are applicable to all I/O schedulers:
add_random
Some I/O events contribute to the entropy pool for /dev/random. This parameter can be set to 0 if the overhead of these contributions becomes measurable.
iostats
By default, iostats is enabled and the default value is 1. Setting iostats to 0 disables gathering of I/O statistics for the device. This removes a small amount of overhead from the I/O path.
Setting iostats to 0 might improve performance for high performance devices, such as certain Non-volatile Memory Express (NVMe) storage devices. It is preferable to leave iostats enabled unless otherwise specified for the given storage model by the vendor.
If you disable iostats, the I/O statistics for the device are no longer present within the /proc/diskstats file. The content of the /proc/diskstats file is the source of I/O information for I/O monitoring tools, such as sar or iostat. Therefore, if you disable the iostats parameter for a device, it is no longer present in the output of I/O monitoring tools.
max_sectors_kb
Specifies the maximum size of an I/O request in kilobytes. The default value is 512 KB. The minimum value for this parameter is determined by the logical block size of the storage device. The maximum value for this parameter is determined by the value of max_hw_sectors_kb.
max_sectors_kb must always be a multiple of the optimal I/O size and the internal erase block size. Use a value of logical_block_size for either parameter if they are zero or not specified by the storage device.
nomerges
Most workloads benefit from request merging. However, disabling merges can be useful for debugging purposes. By default, the nomerges parameter is set to 0, which enables merging. To disable simple one-hit merging, set nomerges to 1. To disable all types of merging, set nomerges to 2.
nr_requests
The maximum allowed number of queued I/O requests. If the current I/O scheduler is none, this number can only be reduced; otherwise the number can be increased or reduced.
optimal_io_size
Some storage devices report an optimal I/O size through this parameter. If this value is reported, applications should issue I/O aligned to and in multiples of the optimal I/O size wherever possible.
read_ahead_kb
Defines the maximum number of kilobytes that the operating system may read ahead during a sequential read operation. As a result, the necessary information is already present within the kernel page cache for the next sequential read. This improves read I/O performance.
Device mappers often benefit from a high read_ahead_kb value. 128 KB for each device to be mapped is a good starting point. Increasing the read_ahead_kb value up to the request queue's max_sectors_kb of the disk might improve performance where sequential reading of large files occurs.
rotational
Some solid-state disks do not correctly advertise their solid-state status, and are mounted as traditional rotational disks. Manually set the rotational value to 0 to disable unnecessary seek-reducing logic in the scheduler.
rq_affinity
The default value of rq_affinity is 1. It completes the I/O operations on one CPU core, which is in the same CPU group as the issuing CPU core. To perform completions only on the processor that issued the I/O request, set rq_affinity to 2. To disable both behaviors, set it to 0.
scheduler
To set the scheduler or scheduler preference order for a storage device, edit the /sys/block/devname/queue/scheduler file. Replace devname with the name of the device you want to configure.
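The parameters in this section are plain text files, so they are easy to inspect from a script before you change anything. The Python sketch below reads the integer-valued parameters from a queue directory and parses the scheduler file, on the assumption that the active scheduler is shown in square brackets (for example, "[mq-deadline] kyber bfq none"). The function names are hypothetical, and the demonstration uses a temporary directory instead of the real /sys tree.

```python
import tempfile
from pathlib import Path

# Integer-valued generic parameters described in this section.
PARAMETERS = (
    "add_random", "iostats", "max_sectors_kb", "nomerges", "nr_requests",
    "optimal_io_size", "read_ahead_kb", "rotational", "rq_affinity",
)


def read_queue_parameters(queue_dir):
    """Return {name: int} for the generic parameters present in queue_dir."""
    values = {}
    for name in PARAMETERS:
        path = Path(queue_dir) / name
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values


def parse_scheduler(text):
    """Split scheduler file content into (active scheduler, available schedulers)."""
    tokens = text.split()
    active = next((t.strip("[]") for t in tokens if t.startswith("[")), None)
    available = [t.strip("[]") for t in tokens]
    return active, available


# Demonstration against a temporary directory standing in for sysfs.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "read_ahead_kb").write_text("128\n")
    (Path(demo_dir) / "rotational").write_text("0\n")
    (Path(demo_dir) / "scheduler").write_text("[mq-deadline] kyber bfq none\n")
    demo_values = read_queue_parameters(demo_dir)
    demo_scheduler = parse_scheduler((Path(demo_dir) / "scheduler").read_text())

print(demo_values)
print(demo_scheduler)
```

On a real system, the argument would be a path such as /sys/block/sdX/queue. Changing a value means writing to the same file as root, and such a change does not persist across reboots on its own.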