Chapter 33. Factors affecting I/O and file system performance
The appropriate settings for storage and file system performance are highly dependent on the storage purpose.
I/O and file system performance can be affected by any of the following factors:
- Data write or read patterns
  - Sequential or random
  - Buffered or Direct IO
- Data alignment with underlying geometry
- Block size
- File system size
- Journal size and location
- Recording access times
- Ensuring data reliability
- Pre-fetching data
- Pre-allocating disk space
- File fragmentation
- Resource contention
33.1. Tools for monitoring and diagnosing I/O and file system issues
The following tools are available in Red Hat Enterprise Linux 8 for monitoring system performance and diagnosing performance problems related to I/O, file systems, and their configuration:
- vmstat reports on processes, memory, paging, block I/O, interrupts, and CPU activity across the entire system. It can help administrators determine whether the I/O subsystem is responsible for any performance issues. If analysis with vmstat shows that the I/O subsystem is responsible for reduced performance, administrators can use the iostat tool to determine the responsible I/O device.
- iostat reports on I/O device load in your system. It is provided by the sysstat package.
- blktrace provides detailed information about how time is spent in the I/O subsystem. The companion utility blkparse reads the raw output from blktrace and produces a human-readable summary of input and output operations recorded by blktrace.
- btt analyzes blktrace output and displays the amount of time that data spends in each area of the I/O stack, making it easier to spot bottlenecks in the I/O subsystem. This utility is provided as part of the blktrace package. Some of the important events tracked by the blktrace mechanism and analyzed by btt are:
  - Queuing of the I/O event (Q)
  - Dispatch of the I/O to the driver event (D)
  - Completion of the I/O event (C)
- iowatcher can use the blktrace output to graph I/O over time. It focuses on the Logical Block Address (LBA) of disk I/O, throughput in megabytes per second, the number of seeks per second, and I/O operations per second. This can help to identify when you are hitting the operations-per-second limit of a device.
- BPF Compiler Collection (BCC) is a library that facilitates the creation of extended Berkeley Packet Filter (eBPF) programs. The eBPF programs are triggered on events, such as disk I/O, TCP connections, and process creations. The BCC tools are installed in the /usr/share/bcc/tools/ directory. The following bcc-tools help to analyze performance:
  - biolatency summarizes the latency in block device I/O (disk I/O) as a histogram. This allows the distribution to be studied, including two modes for device cache hits and for cache misses, and latency outliers.
  - biosnoop is a basic block I/O tracing tool for displaying each I/O event along with the issuing process ID and the I/O latency. Using this tool, you can investigate disk I/O performance issues.
  - biotop shows the top processes performing block I/O operations in the kernel.
  - filelife traces the lifespan of short-lived files by tracking their creation and deletion.
  - fileslower traces slow synchronous file reads and writes.
  - filetop displays file reads and writes by process.
  - ext4slower, nfsslower, and xfsslower are tools that show file system operations slower than a certain threshold, which defaults to 10 ms.

  For more information, see Analyzing system performance with BPF Compiler Collection.
- bpftrace is a tracing language for eBPF used for analyzing performance issues. It also provides trace utilities like BCC for system observation, which is useful for investigating I/O performance issues.

The following SystemTap scripts may be useful in diagnosing storage or file system performance problems:
- disktop.stp: Checks the status of reading or writing disk every 5 seconds and outputs the top ten entries during that period.
- iotime.stp: Prints the amount of time spent on read and write operations, and the number of bytes read and written.
- traceio.stp: Prints the top ten executables based on cumulative I/O traffic observed, every second.
- traceio2.stp: Prints the executable name and process identifier as reads and writes to the specified device occur.
- inodewatch.stp: Prints the executable name and process identifier each time a read or write occurs to the specified inode on the specified major or minor device.
- inodewatch2.stp: Prints the executable name, process identifier, and attributes each time the attributes are changed on the specified inode on the specified major or minor device.
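As a quick, hedged illustration of how some of these tools fit together, the following commands capture a short blktrace sample, post-process it with blkparse and btt, and run one of the BCC tools. The device /dev/sda, the output name sda_trace, and the 30-second capture window are placeholder choices for this sketch; adapt them to your system.

```
# Capture 30 seconds of block-layer events from /dev/sda (placeholder device)
blktrace -d /dev/sda -w 30 -o sda_trace

# Produce a human-readable event listing plus a binary dump for btt
blkparse -i sda_trace -d sda_trace.bin

# Summarize time spent between queuing (Q), dispatch (D), and completion (C)
btt -i sda_trace.bin

# Trace block I/O latency as a histogram until interrupted with Ctrl+C
/usr/share/bcc/tools/biolatency
```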
Additional resources
- vmstat(8), iostat(1), blktrace(8), blkparse(1), btt(1), bpftrace, and iowatcher(1) man pages on your system
- Analyzing system performance with BPF Compiler Collection
33.2. Available tuning options for formatting a file system
Some file system configuration decisions cannot be changed after the device is formatted.
The following are the options available before formatting a storage device:
Size
- Create an appropriately-sized file system for your workload. Smaller file systems require less time and memory for file system checks. However, if a file system is too small, its performance suffers from high fragmentation.
Block size
The block is the unit of work for the file system. The block size determines how much data can be stored in a single block, and therefore the smallest amount of data that is written or read at one time.
The default block size is appropriate for most use cases. However, your file system performs better and stores data more efficiently if the block size or the size of multiple blocks is the same as or slightly larger than the amount of data that is typically read or written at one time. A small file still uses an entire block. Files can be spread across multiple blocks, but this can create additional runtime overhead.
Additionally, some file systems are limited to a certain number of blocks, which in turn limits the maximum size of the file system. Block size is specified as part of the file system options when formatting a device with the mkfs command. The parameter that specifies the block size varies with the file system.
Geometry
File system geometry is concerned with the distribution of data across a file system. If your system uses striped storage, like RAID, you can improve performance by aligning data and metadata with the underlying storage geometry when you format the device.
Many devices export recommended geometry, which is then set automatically when the devices are formatted with a particular file system. If your device does not export these recommendations, or you want to change the recommended settings, you must specify geometry manually when you format the device with the mkfs command. The parameters that specify file system geometry vary with the file system.
External journals
- Journaling file systems record the changes that will be made during a write operation in a journal file before the operation is executed. This reduces the likelihood that the file system will become corrupted in the event of a system crash or power failure, and speeds up the recovery process.
Red Hat does not recommend using the external journals option.
Metadata-intensive workloads involve very frequent updates to the journal. A larger journal uses more memory, but reduces the frequency of write operations. Additionally, you can improve the seek time of a device with a metadata-intensive workload by placing its journal on dedicated storage that is as fast as, or faster than, the primary storage.
Ensure that external journals are reliable. Losing an external journal device causes file system corruption. External journals must be created at format time, with journal devices being specified at mount time.
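The following sketch shows how these format-time options are typically passed to mkfs. The device names are placeholders, the stripe values must match your actual RAID layout, and the external journal example is included only for completeness, given the caveat above.

```
# Format ext4 with an explicit 4 KiB block size (placeholder device)
mkfs.ext4 -b 4096 /dev/sdb1

# Align XFS with a RAID stripe unit of 64 KiB across 4 data disks
mkfs.xfs -d su=64k,sw=4 /dev/sdb1

# Create a dedicated journal device, then an ext4 file system that uses it
mke2fs -O journal_dev /dev/sdc1
mkfs.ext4 -J device=/dev/sdc1 /dev/sdb1
```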
Additional resources
- mkfs(8) and mount(8) man pages on your system
- Overview of available file systems
33.3. Available tuning options for mounting a file system
The following options are available to most file systems and can be specified when the device is mounted:
Access Time
Every time a file is read, its metadata is updated with the time at which access occurred (atime). This involves additional write I/O. relatime is the default atime setting for most file systems.
However, if updating this metadata is time consuming, and if accurate access time data is not required, you can mount the file system with the noatime mount option. This disables updates to metadata when a file is read. It also enables nodiratime behavior, which disables updates to metadata when a directory is read.
Disabling atime updates by using the noatime mount option can break applications that rely on them, for example, backup programs.
Read-ahead
Read-ahead behavior speeds up file access by pre-fetching data that is likely to be needed soon and loading it into the page cache, where it can be retrieved more quickly than if it were on disk. The higher the read-ahead value, the further ahead the system pre-fetches data.
Red Hat Enterprise Linux attempts to set an appropriate read-ahead value based on what it detects about your file system. However, accurate detection is not always possible. For example, if a storage array presents itself to the system as a single LUN, the system detects the single LUN, and does not set the appropriate read-ahead value for an array.
Workloads that involve heavy streaming of sequential I/O often benefit from high read-ahead values. The storage-related tuned profiles provided with Red Hat Enterprise Linux raise the read-ahead value, as does using LVM striping, but these adjustments are not always sufficient for all workloads.
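As an illustrative sketch of both options (the device, mount point, and read-ahead value are assumptions for the example; blockdev expresses read-ahead in 512-byte sectors):

```
# Mount without access-time updates; noatime also enables nodiratime behavior
mount -o noatime /dev/sdb1 /mnt/data

# Inspect the current read-ahead value, in 512-byte sectors
blockdev --getra /dev/sdb

# Raise read-ahead to 4096 sectors (2 MiB) for a streaming workload
blockdev --setra 4096 /dev/sdb
```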
Additional resources
- mount(8), xfs(5), and ext4(5) man pages on your system
33.4. Types of discarding unused blocks
Regularly discarding blocks that are not in use by the file system is a recommended practice for both solid-state disks and thinly-provisioned storage.
The following are the two methods of discarding unused blocks:
Batch discard
- This type of discard is part of the fstrim command. It discards all unused blocks in a file system that match criteria specified by the administrator. Red Hat Enterprise Linux 8 supports batch discard on XFS and ext4 formatted devices that support physical discard operations.
Online discard
- This type of discard operation is configured at mount time with the discard option, and runs in real time without user intervention. However, it only discards blocks that are transitioning from used to free. Red Hat Enterprise Linux 8 supports online discard on XFS and ext4 formatted devices.
Red Hat recommends batch discard, except where online discard is required to maintain performance, or where batch discard is not feasible for the system’s workload.
Pre-allocation marks disk space as being allocated to a file without writing any data into that space. This can be useful in limiting data fragmentation and poor read performance. Red Hat Enterprise Linux 8 supports pre-allocating space on XFS, ext4, and GFS2 file systems. Applications can also benefit from pre-allocating space by using the fallocate(2) call.
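A minimal sketch of both discard methods and of pre-allocation; the device, mount point, and sizes are placeholder values:

```
# Batch discard: trim all unused blocks on one mounted file system
fstrim /mnt/data

# Batch discard on all supported mounted file systems, verbosely
fstrim -av

# Online discard: enable real-time discards at mount time
mount -o discard /dev/sdb1 /mnt/data

# Pre-allocate 1 GiB to a file without writing any data into it
fallocate -l 1G /mnt/data/prealloc.img
```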
Additional resources
- mount(8) and fallocate(2) man pages on your system
33.5. Solid-state disks tuning considerations
Solid-state disks (SSDs) use NAND flash chips rather than rotating magnetic platters to store persistent data. SSDs provide a constant access time for data across their full Logical Block Address range, and do not incur the measurable seek costs of their rotating counterparts. They are more expensive per gigabyte of storage space and have a lower storage density, but they also have lower latency and greater throughput than HDDs.
Performance generally degrades as the used blocks on an SSD approach the capacity of the disk. The degree of degradation varies by vendor, but all devices experience degradation in this circumstance. Enabling discard behavior can help to alleviate this degradation. For more information, see Types of discarding unused blocks.
The default I/O scheduler and virtual memory options are suitable for use with SSDs. Consider the following factors when configuring settings that can affect SSD performance:
I/O Scheduler
Any I/O scheduler is expected to perform well with most SSDs. However, as with any other storage type, Red Hat recommends benchmarking to determine the optimal configuration for a given workload. When using SSDs, Red Hat advises changing the I/O scheduler only for benchmarking particular workloads. For instructions on how to switch between I/O schedulers, see the /usr/share/doc/kernel-version/Documentation/block/switching-sched.txt file.
For a single-queue HBA, the default I/O scheduler is deadline. For a multi-queue HBA, the default I/O scheduler is none. For information about how to set the I/O scheduler, see Setting the disk scheduler.
Virtual Memory
- Like the I/O scheduler, the virtual memory (VM) subsystem requires no special tuning. Given the fast nature of I/O on SSDs, try turning down the vm.dirty_background_ratio and vm.dirty_ratio settings, as increased write-out activity does not usually have a negative impact on the latency of other operations on the disk. However, this tuning can generate more overall I/O, and is therefore not generally recommended without workload-specific testing.
Swap
- An SSD can also be used as a swap device, and is likely to produce good page-out and page-in performance.
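For example, the following commands show how to confirm the kernel's view of an SSD and inspect the dirty-page thresholds before adjusting them; sda and the value 5 are assumptions for this sketch:

```
# Confirm the kernel sees the device as non-rotational (0 = SSD)
cat /sys/block/sda/queue/rotational

# Show the active I/O scheduler; the bracketed entry is the current one
cat /sys/block/sda/queue/scheduler

# Inspect, then lower, the dirty-page write-back thresholds
sysctl vm.dirty_background_ratio vm.dirty_ratio
sysctl -w vm.dirty_background_ratio=5
```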
33.6. Generic block device tuning parameters
The generic tuning parameters listed here are available in the /sys/block/sdX/queue/ directory.
The following listed tuning parameters are separate from I/O scheduler tuning, and are applicable to all I/O schedulers:
add_random
- Some I/O events contribute to the entropy pool for /dev/random. This parameter can be set to 0 if the overhead of these contributions becomes measurable.
iostats
By default, iostats is enabled and the default value is 1. Setting the iostats value to 0 disables the gathering of I/O statistics for the device, which removes a small amount of overhead in the I/O path. Setting iostats to 0 might slightly improve performance for very high performance devices, such as certain NVMe solid-state storage devices. It is recommended to leave iostats enabled unless otherwise specified for the given storage model by the vendor.
If you disable iostats, the I/O statistics for the device are no longer present within the /proc/diskstats file. The content of the /proc/diskstats file is the source of I/O information for monitoring tools, such as sar or iostat. Therefore, if you disable the iostats parameter for a device, the device is no longer present in the output of I/O monitoring tools.
max_sectors_kb
Specifies the maximum size of an I/O request in kilobytes. The default value is 512 KB. The minimum value for this parameter is determined by the logical block size of the storage device. The maximum value for this parameter is determined by the value of max_hw_sectors_kb.
Red Hat recommends that max_sectors_kb always be a multiple of the optimal I/O size and the internal erase block size. Use a value of logical_block_size for either parameter if they are zero or not specified by the storage device.
nomerges
- Most workloads benefit from request merging. However, disabling merges can be useful for debugging purposes. By default, the nomerges parameter is set to 0, which enables merging. To disable simple one-hit merging, set nomerges to 1. To disable all types of merging, set nomerges to 2.
nr_requests
- The maximum allowed number of queued I/O requests. If the current I/O scheduler is none, this number can only be reduced; otherwise, the number can be increased or reduced.
optimal_io_size
- Some storage devices report an optimal I/O size through this parameter. If this value is reported, Red Hat recommends that applications issue I/O aligned to and in multiples of the optimal I/O size wherever possible.
read_ahead_kb
Defines the maximum number of kilobytes that the operating system may read ahead during a sequential read operation. As a result, the necessary information is already present within the kernel page cache for the next sequential read, which improves read I/O performance.
Device mappers often benefit from a high read_ahead_kb value. 128 KB for each device to be mapped is a good starting point, but increasing the read_ahead_kb value up to the max_sectors_kb of the disk's request queue might improve performance in application environments where sequential reading of large files takes place.
rotational
- Some solid-state disks do not correctly advertise their solid-state status, and are mounted as traditional rotational disks. Manually set the rotational value to 0 to disable unnecessary seek-reducing logic in the scheduler.
rq_affinity
- The default value of rq_affinity is 1. It completes the I/O operations on a CPU core that is in the same CPU group as the issuing CPU core. To perform completions only on the processor that issued the I/O request, set rq_affinity to 2. To disable both behaviors, set it to 0.
scheduler
- To set the scheduler or scheduler preference order for a particular storage device, edit the /sys/block/devname/queue/scheduler file, where devname is the name of the device you want to configure.
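All of these parameters are plain sysfs files, so they can be read and set with cat and echo, as in the following sketch. The device sda and the example values are assumptions; changes made this way do not persist across reboots unless applied by a udev rule or a tuned profile.

```
# Read the current maximum I/O request size, in KiB
cat /sys/block/sda/queue/max_sectors_kb

# Stop this device's I/O events from feeding the entropy pool
echo 0 > /sys/block/sda/queue/add_random

# Raise read-ahead to 512 KiB for a sequential-read workload
echo 512 > /sys/block/sda/queue/read_ahead_kb

# Select a scheduler for the device (available names vary by kernel)
echo mq-deadline > /sys/block/sda/queue/scheduler
```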