8.4. Configuration Tools
Red Hat Enterprise Linux provides a number of tools to assist administrators in configuring the storage and file systems. This section outlines the available tools and provides examples of how they can be used to solve I/O and file system related performance problems in Red Hat Enterprise Linux 7.
8.4.1. Configuring Tuning Profiles for Storage Performance
The Tuned service provides a number of profiles designed to improve performance for specific use cases. The following profiles are particularly useful for improving storage performance.
- latency-performance
- throughput-performance (the default)
To configure a profile on your system, run the following command, replacing name with the name of the profile you want to use.
$ tuned-adm profile name
The
tuned-adm recommend
command recommends an appropriate profile for your system.
For further details about these profiles or additional configuration options, see Section A.5, “tuned-adm”.
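As a minimal sketch, the following commands switch to the latency-performance profile and then report the currently active profile. Tuned profile changes take effect immediately, without a reboot.
# tuned-adm profile latency-performance
# tuned-adm active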
8.4.2. Setting the Default I/O Scheduler
The default I/O scheduler is the scheduler that is used if no other scheduler is explicitly specified for the device.
If no default scheduler is specified, the
cfq
scheduler is used for SATA drives, and the deadline
scheduler is used for all other drives. If you specify a default scheduler by following the instructions in this section, that default scheduler is applied to all devices.
To set the default I/O scheduler, you can use the Tuned tool, or modify the
/etc/default/grub
file manually.
Red Hat recommends using the Tuned tool to specify the default I/O scheduler on a booted system. To set the
elevator
parameter, enable the disk
plug-in. For information on the disk
plug-in, see Section 3.1.1, “Plug-ins” in the Tuned chapter.
To modify the default scheduler by using GRUB 2, append the elevator parameter to the kernel command line. You can set the parameter either at boot time, or persistently on a booted system by modifying the /etc/default/grub file, as described in Procedure 8.1, “Setting the Default I/O Scheduler by Using GRUB 2”.
Procedure 8.1. Setting the Default I/O Scheduler by Using GRUB 2
To set the default I/O Scheduler on a booted system and make the configuration persist after reboot:
- Add the elevator parameter to the GRUB_CMDLINE_LINUX line in the /etc/default/grub file.
# cat /etc/default/grub
...
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=vg00/lvroot rd.lvm.lv=vg00/lvswap elevator=noop"
...
In Red Hat Enterprise Linux 7, the available schedulers are deadline, noop, and cfq. For more information, see the cfq-iosched.txt and deadline-iosched.txt files in the documentation for your kernel, available after installing the kernel-doc package.
- Create a new configuration with the elevator parameter added.
The location of the GRUB 2 configuration file is different on systems with the BIOS firmware and on systems with UEFI. Use one of the following commands to recreate the GRUB 2 configuration file.
- On a system with the BIOS firmware, use:
# grub2-mkconfig -o /etc/grub2.cfg
- On a system with the UEFI firmware, use:
# grub2-mkconfig -o /etc/grub2-efi.cfg
- Reboot the system for the change to take effect.
For more information on version 2 of the GNU GRand Unified Bootloader (GRUB 2), see the Working with the GRUB 2 Boot Loader chapter of the Red Hat Enterprise Linux 7 System Administrator's Guide.
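After rebooting, you can confirm which scheduler a given device is using by reading its sysfs entry (sda here is a hypothetical device name); the scheduler currently in use is shown in square brackets:
# cat /sys/block/sda/queue/scheduler
noop [deadline] cfq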
8.4.3. Generic Block Device Tuning Parameters
The generic tuning parameters listed in this section are available within the
/sys/block/sdX/queue/
directory. The listed tuning parameters are separate from I/O scheduler tuning, and are applicable to all I/O schedulers.
- add_random
- Some I/O events contribute to the entropy pool for /dev/random. This parameter can be set to 0 if the overhead of these contributions becomes measurable.
- iostats
- The default value is 1 (enabled). Setting iostats to 0 disables the gathering of I/O statistics for the device, which removes a small amount of overhead in the I/O path. Setting iostats to 0 might slightly improve performance for very high performance devices, such as certain NVMe solid-state storage devices. It is recommended to leave iostats enabled unless otherwise specified for the given storage model by the vendor.
If you disable iostats, the I/O statistics for the device are no longer present within the /proc/diskstats file. The content of /proc/diskstats is the source of I/O information for monitoring tools, such as sar or iostat. Therefore, if you disable the iostats parameter for a device, the device is no longer present in the output of I/O monitoring tools.
- max_sectors_kb
- Specifies the maximum size of an I/O request in kilobytes. The default value is 512 KB. The minimum value for this parameter is determined by the logical block size of the storage device. The maximum value for this parameter is determined by the value of max_hw_sectors_kb.
Certain solid-state disks perform poorly when the I/O requests are larger than the internal erase block size. To determine whether this is the case for the solid-state disk model attached to the system, check with the hardware vendor, and follow their recommendations. Red Hat recommends max_sectors_kb to always be a multiple of the optimal I/O size and the internal erase block size. Use a value of logical_block_size for either parameter if they are zero or not specified by the storage device.
- nomerges
- Most workloads benefit from request merging. However, disabling merges can be useful for debugging purposes. By default, the nomerges parameter is set to 0, which enables merging. To disable simple one-hit merging, set nomerges to 1. To disable all types of merging, set nomerges to 2.
- nr_requests
- Specifies the maximum number of read and write requests that can be queued at one time. The default value is 128, which means that 128 read requests and 128 write requests can be queued before the next process to request a read or write is put to sleep.
For latency-sensitive applications, lower the value of this parameter and limit the command queue depth on the storage so that write-back I/O cannot fill the device queue with write requests. When the device queue fills, other processes attempting to perform I/O operations are put to sleep until queue space becomes available. Requests are then allocated in a round-robin manner, which prevents one process from continuously consuming all spots in the queue.
The maximum number of I/O operations within the I/O scheduler is nr_requests*2. As stated, nr_requests is applied separately for reads and writes. Note that nr_requests only applies to the I/O operations within the I/O scheduler and not to I/O operations already dispatched to the underlying device. Therefore, the maximum outstanding limit of I/O operations against a device is (nr_requests*2)+(queue_depth), where queue_depth is /sys/block/sdN/device/queue_depth, sometimes also referred to as the LUN queue depth. You can see this total outstanding number of I/O operations in, for example, the output of iostat in the avgqu-sz column.
- optimal_io_size
- Some storage devices report an optimal I/O size through this parameter. If this value is reported, Red Hat recommends that applications issue I/O aligned to and in multiples of the optimal I/O size wherever possible.
- read_ahead_kb
- Defines the maximum number of kilobytes that the operating system may read ahead during a sequential read operation. As a result, the likely-needed information is already present within the kernel page cache for the next sequential read, which improves read I/O performance.
Device mappers often benefit from a high read_ahead_kb value. 128 KB for each device to be mapped is a good starting point, but increasing the read_ahead_kb value up to 4–8 MB might improve performance in application environments where sequential reading of large files takes place.
- rotational
- Some solid-state disks do not correctly advertise their solid-state status, and are mounted as traditional rotational disks. If your solid-state device does not set this to 0 automatically, set it manually to disable unnecessary seek-reducing logic in the scheduler.
- rq_affinity
- By default, I/O completions can be processed on a different processor than the processor that issued the I/O request. Set rq_affinity to 1 to disable this ability and perform completions only on the processor that issued the I/O request. This can improve the effectiveness of processor data caching.
- scheduler
- To set the scheduler or scheduler preference order for a particular storage device, edit the /sys/block/devname/queue/scheduler file, where devname is the name of the device you want to configure.
# echo cfq > /sys/block/hda/queue/scheduler
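As a sketch of how these parameters are typically inspected and adjusted at run time (sdb is a hypothetical device name; values written to sysfs do not persist across reboots unless reapplied, for example by a Tuned profile or a udev rule):
# cat /sys/block/sdb/queue/read_ahead_kb
# echo 4096 > /sys/block/sdb/queue/read_ahead_kb
# echo 64 > /sys/block/sdb/queue/nr_requests
# echo 0 > /sys/block/sdb/queue/add_random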
8.4.4. Tuning the Deadline Scheduler
When
deadline
is in use, queued I/O requests are sorted into a read or write batch and then scheduled for execution in increasing LBA order. Read batches take precedence over write batches by default, as applications are more likely to block on read I/O. After a batch is processed, deadline
checks how long write operations have been starved of processor time and schedules the next read or write batch as appropriate.
The following parameters affect the behavior of the
deadline
scheduler.
- fifo_batch
- The number of read or write operations to issue in a single batch. The default value is 16. A higher value can increase throughput, but will also increase latency.
- front_merges
- If your workload will never generate front merges, this tunable can be set to 0. However, unless you have measured the overhead of this check, Red Hat recommends the default value of 1.
- read_expire
- The number of milliseconds in which a read request should be scheduled for service. The default value is 500 (0.5 seconds).
- write_expire
- The number of milliseconds in which a write request should be scheduled for service. The default value is 5000 (5 seconds).
- writes_starved
- The number of read batches that can be processed before processing a write batch. The higher this value is set, the greater the preference given to read batches.
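For example, assuming a hypothetical device sdb that is already using the deadline scheduler, these tunables can be inspected and changed through sysfs (the values shown are illustrative only):
# cat /sys/block/sdb/queue/iosched/fifo_batch
# echo 32 > /sys/block/sdb/queue/iosched/fifo_batch
# echo 250 > /sys/block/sdb/queue/iosched/read_expire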
8.4.5. Tuning the CFQ Scheduler
When CFQ is in use, processes are placed into three classes: real time, best effort, and idle. All real time processes are scheduled before any best effort processes, which are scheduled before any idle processes. By default, processes are classed as best effort. You can manually adjust the class of a process with the
ionice
command.
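For example (the process ID 12345 and the command shown are hypothetical), ionice can change the class of a running process or start a command in a chosen class, where class 1 is real time, class 2 is best effort, and class 3 is idle:
# ionice -c 3 -p 12345
# ionice -c 2 -n 7 /usr/local/bin/nightly-backup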
You can further adjust the behavior of the CFQ scheduler with the following parameters. These parameters are set on a per-device basis by altering the specified files under the
/sys/block/devname/queue/iosched
directory.
- back_seek_max
- The maximum distance in kilobytes that CFQ will perform a backward seek. The default value is 16 KB. Backward seeks typically damage performance, so large values are not recommended.
- back_seek_penalty
- The multiplier applied to backward seeks when the disk head is deciding whether to move forward or backward. The default value is 2. If the disk head position is at 1024 KB, and there are equidistant requests in the system (1008 KB and 1040 KB, for example), the back_seek_penalty is applied to backward seek distances and the disk moves forward.
- fifo_expire_async
- The length of time in milliseconds that an asynchronous (buffered write) request can remain unserviced. After this amount of time expires, a single starved asynchronous request is moved to the dispatch list. The default value is 250 milliseconds.
- fifo_expire_sync
- The length of time in milliseconds that a synchronous (read or O_DIRECT write) request can remain unserviced. After this amount of time expires, a single starved synchronous request is moved to the dispatch list. The default value is 125 milliseconds.
- group_idle
- This parameter is set to 0 (disabled) by default. When set to 1 (enabled), the cfq scheduler idles on the last process that is issuing I/O in a control group. This is useful when using proportional weight I/O control groups and when slice_idle is set to 0 (on fast storage).
- group_isolation
- This parameter is set to 0 (disabled) by default. When set to 1 (enabled), it provides stronger isolation between groups, but reduces throughput, as fairness is applied to both random and sequential workloads. When group_isolation is disabled (set to 0), fairness is provided to sequential workloads only. For more information, see the installed documentation in /usr/share/doc/kernel-doc-version/Documentation/cgroups/blkio-controller.txt.
- low_latency
- This parameter is set to 1 (enabled) by default. When enabled, cfq favors fairness over throughput by providing a maximum wait time of 300 ms for each process issuing I/O on a device. When this parameter is set to 0 (disabled), target latency is ignored and each process receives a full time slice.
- quantum
- This parameter defines the number of I/O requests that cfq sends to one device at one time, essentially limiting queue depth. The default value is 8 requests. The device being used may support greater queue depth, but increasing the value of quantum also increases latency, especially for large sequential write workloads.
- slice_async
- This parameter defines the length of the time slice (in milliseconds) allotted to each process issuing asynchronous I/O requests. The default value is 40 milliseconds.
- slice_idle
- This parameter specifies the length of time in milliseconds that cfq idles while waiting for further requests. The default value is 0 (no idling at the queue or service tree level). The default value is ideal for throughput on external RAID storage, but can degrade throughput on internal non-RAID storage as it increases the overall number of seek operations.
- slice_sync
- This parameter defines the length of the time slice (in milliseconds) allotted to each process issuing synchronous I/O requests. The default value is 100 ms.
8.4.5.1. Tuning CFQ for Fast Storage
The
cfq
scheduler is not recommended for hardware that does not suffer a large seek penalty, such as fast external storage arrays or solid-state disks. If your use case requires cfq
to be used on this storage, set the following configuration values:
- Set /sys/block/devname/queue/iosched/slice_idle to 0
- Set /sys/block/devname/queue/iosched/quantum to 64
- Set /sys/block/devname/queue/iosched/group_idle to 1
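For example, assuming a hypothetical device named sdb that is using the cfq scheduler, the settings above can be applied as follows (they do not persist across reboots unless reapplied):
# echo 0 > /sys/block/sdb/queue/iosched/slice_idle
# echo 64 > /sys/block/sdb/queue/iosched/quantum
# echo 1 > /sys/block/sdb/queue/iosched/group_idle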
8.4.6. Tuning the noop Scheduler
The
noop
I/O scheduler is primarily useful for CPU-bound systems that use fast storage. Also, the noop
I/O scheduler is commonly, but not exclusively, used on virtual machines when they are performing I/O operations to virtual disks.
There are no tunable parameters specific to the
noop
I/O scheduler.
8.4.7. Configuring File Systems for Performance
This section covers the tuning parameters specific to each file system supported in Red Hat Enterprise Linux 7. Parameters are divided according to whether their values should be configured when you format the storage device, or when you mount the formatted device.
Where loss in performance is caused by file fragmentation or resource contention, performance can generally be improved by reconfiguring the file system. However, in some cases the application may need to be altered. In this case, Red Hat recommends contacting Customer Support for assistance.
8.4.7.1. Tuning XFS
This section covers some of the tuning parameters available to XFS file systems at format and at mount time.
The default formatting and mount settings for XFS are suitable for most workloads. Red Hat recommends changing them only if specific configuration changes are expected to benefit your workload.
8.4.7.1.1. Formatting Options
For further details about any of these formatting options, see the man page:
$ man mkfs.xfs
- Directory block size
- The directory block size affects the amount of directory information that can be retrieved or modified per I/O operation. The minimum value for directory block size is the file system block size (4 KB by default). The maximum value for directory block size is 64 KB.
At a given directory block size, a larger directory requires more I/O than a smaller directory. A system with a larger directory block size also consumes more processing power per I/O operation than a system with a smaller directory block size. It is therefore recommended to have as small a directory and directory block size as possible for your workload.
Red Hat recommends the directory block sizes listed in Table 8.1, “Recommended Maximum Directory Entries for Directory Block Sizes” for file systems with no more than the listed number of entries for write-heavy and read-heavy workloads.
Table 8.1. Recommended Maximum Directory Entries for Directory Block Sizes
Directory block size    Max. entries (read-heavy)    Max. entries (write-heavy)
4 KB                    100,000–200,000              1,000,000–2,000,000
16 KB                   100,000–1,000,000            1,000,000–10,000,000
64 KB                   >1,000,000                   >10,000,000
For detailed information about the effect of directory block size on read and write workloads in file systems of different sizes, see the XFS documentation.
To configure directory block size, use the mkfs.xfs -n size option. See the mkfs.xfs man page for details.
- Allocation groups
- An allocation group is an independent structure that indexes free space and allocated inodes across a section of the file system. Each allocation group can be modified independently, allowing XFS to perform allocation and deallocation operations concurrently as long as concurrent operations affect different allocation groups. The number of concurrent operations that can be performed in the file system is therefore equal to the number of allocation groups. However, since the ability to perform concurrent operations is also limited by the number of processors able to perform the operations, Red Hat recommends that the number of allocation groups be greater than or equal to the number of processors in the system.
A single directory cannot be modified by multiple allocation groups simultaneously. Therefore, Red Hat recommends that applications that create and remove large numbers of files do not store all files in a single directory.
To configure allocation groups, use the mkfs.xfs -d option. See the mkfs.xfs man page for details.
- Growth constraints
- If you may need to increase the size of your file system after formatting time (either by adding more hardware or through thin-provisioning), you must carefully consider initial file layout, as allocation group size cannot be changed after formatting is complete.
Allocation groups must be sized according to the eventual capacity of the file system, not the initial capacity. The number of allocation groups in the fully-grown file system should not exceed several hundred, unless allocation groups are at their maximum size (1 TB). Therefore, for most file systems, the recommended maximum growth to allow for a file system is ten times the initial size.
Additional care must be taken when growing a file system on a RAID array, as the device size must be aligned to an exact multiple of the allocation group size so that new allocation group headers are correctly aligned on the newly added storage. The new storage must also have the same geometry as the existing storage, since geometry cannot be changed after formatting time, and therefore cannot be optimized for storage of a different geometry on the same block device.
- Inode size and inline attributes
- If the inode has sufficient space available, XFS can write attribute names and values directly into the inode. These inline attributes can be retrieved and modified up to an order of magnitude faster than retrieving separate attribute blocks, as additional I/O is not required.
The default inode size is 256 bytes. Only around 100 bytes of this is available for attribute storage, depending on the number of data extent pointers stored in the inode. Increasing inode size when you format the file system can increase the amount of space available for storing attributes.
Both attribute names and attribute values are limited to a maximum size of 254 bytes. If either name or value exceeds 254 bytes in length, the attribute is pushed to a separate attribute block instead of being stored inline.
To configure inode parameters, use the mkfs.xfs -i option. See the mkfs.xfs man page for details.
- RAID
- If software RAID is in use, mkfs.xfs automatically configures the file system with an appropriate stripe unit and width for the underlying hardware. However, stripe unit and width may need to be manually configured if hardware RAID is in use, as not all hardware RAID devices export this information. To configure stripe unit and width, use the mkfs.xfs -d option. See the mkfs.xfs man page for details.
- Log size
- Pending changes are aggregated in memory until a synchronization event is triggered, at which point they are written to the log. The size of the log determines the number of concurrent modifications that can be in progress at one time. It also determines the maximum amount of change that can be aggregated in memory, and therefore how often logged data is written to disk. A smaller log forces data to be written back to disk more frequently than a larger log. However, a larger log uses more memory to record pending modifications, so a system with limited memory will not benefit from a larger log.
Logs perform better when they are aligned to the underlying stripe unit; that is, they start and end at stripe unit boundaries. To align logs to the stripe unit, use the mkfs.xfs -d option. See the mkfs.xfs man page for details.
To configure the log size, use the following mkfs.xfs option, replacing logsize with the size of the log:
# mkfs.xfs -l size=logsize
For further details, see the mkfs.xfs man page:
$ man mkfs.xfs
- Log stripe unit
- Log writes on storage devices that use RAID5 or RAID6 layouts may perform better when they start and end at stripe unit boundaries (are aligned to the underlying stripe unit). mkfs.xfs attempts to set an appropriate log stripe unit automatically, but this depends on the RAID device exporting this information.
Setting a large log stripe unit can harm performance if your workload triggers synchronization events very frequently, because smaller writes need to be padded to the size of the log stripe unit, which can increase latency. If your workload is bound by log write latency, Red Hat recommends setting the log stripe unit to 1 block so that small log writes are not padded to the size of the log stripe unit.
The maximum supported log stripe unit is the size of the maximum log buffer size (256 KB). It is therefore possible that the underlying storage may have a larger stripe unit than can be configured on the log. In this case, mkfs.xfs issues a warning and sets a log stripe unit of 32 KB.
To configure the log stripe unit, use one of the following options, where N is the number of blocks to use as the stripe unit, and size is the size of the stripe unit in KB:
mkfs.xfs -l sunit=Nb
mkfs.xfs -l su=size
For further details, see the mkfs.xfs man page:
$ man mkfs.xfs
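As an illustrative sketch only (the device name /dev/sdb1 and all values are hypothetical and must be chosen for your own hardware and workload), several of the format-time options above can be combined in a single mkfs.xfs invocation:
# mkfs.xfs -d agcount=32,su=64k,sw=4 -i size=512 -l size=128m,su=64k -n size=16k /dev/sdb1
Here, agcount sets the number of allocation groups, su and sw describe a hypothetical RAID stripe geometry, -i size increases the inode size to leave more room for inline attributes, -l sets the log size and log stripe unit, and -n size sets the directory block size.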
8.4.7.1.2. Mount Options
- Inode allocation
- Highly recommended for file systems greater than 1 TB in size. The inode64 parameter configures XFS to allocate inodes and data across the entire file system. This ensures that inodes are not allocated largely at the beginning of the file system, and data is not largely allocated at the end of the file system, improving performance on large file systems.
- Log buffer size and number
- The larger the log buffer, the fewer I/O operations it takes to write all changes to the log. A larger log buffer can improve performance on systems with I/O-intensive workloads that do not have a non-volatile write cache.
The log buffer size is configured with the logbsize mount option, and defines the maximum amount of information that can be stored in the log buffer; if a log stripe unit is not set, buffer writes can be shorter than the maximum, and therefore there is no need to reduce the log buffer size for synchronization-heavy workloads. The default size of the log buffer is 32 KB. The maximum size is 256 KB; other supported sizes are 64 KB, 128 KB, or power-of-2 multiples of the log stripe unit between 32 KB and 256 KB.
The number of log buffers is defined by the logbufs mount option. The default value is 8 log buffers (the maximum), but as few as two log buffers can be configured. It is usually not necessary to reduce the number of log buffers, except on memory-bound systems that cannot afford to allocate memory to additional log buffers. Reducing the number of log buffers tends to reduce log performance, especially on workloads sensitive to log I/O latency.
- Delay change logging
- XFS has the option to aggregate changes in memory before writing them to the log. The delaylog parameter allows frequently modified metadata to be written to the log periodically instead of every time it changes. This option increases the potential number of operations lost in a crash and increases the amount of memory used to track metadata. However, it can also increase metadata modification speed and scalability by an order of magnitude, and does not reduce data or metadata integrity when fsync, fdatasync, or sync are used to ensure data and metadata is written to disk.
For more information on mount options, see the xfs man page:
$ man xfs
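As a sketch only (the device and mount point names are hypothetical), several of these options can be combined on the mount command line or in an /etc/fstab entry:
# mount -o inode64,logbsize=256k,logbufs=8 /dev/sdb1 /mnt/xfsdata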
8.4.7.2. Tuning ext4
This section covers some of the tuning parameters available to ext4 file systems at format and at mount time.
8.4.7.2.1. Formatting Options
- Inode table initialization
- Initializing all inodes in the file system can take a very long time on very large file systems. By default, the initialization process is deferred (lazy inode table initialization is enabled). However, if your system does not have an ext4 driver, lazy inode table initialization is disabled by default. It can be enabled by setting
lazy_itable_init
to 1. In this case, kernel processes continue to initialize the file system after it is mounted.
This section describes only some of the options available at format time. For further formatting parameters, see the
mkfs.ext4
man page:
$ man mkfs.ext4
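For example (a sketch only; /dev/sdb1 is a hypothetical device), lazy inode table initialization can be controlled explicitly at format time with the -E extended option:
# mkfs.ext4 -E lazy_itable_init=1 /dev/sdb1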
8.4.7.2.2. Mount Options
- Inode table initialization rate
- When lazy inode table initialization is enabled, you can control the rate at which initialization occurs by specifying a value for the init_itable parameter. The amount of time spent performing background initialization is approximately equal to 1 divided by the value of this parameter. The default value is 10.
- Automatic file synchronization
- Some applications do not correctly perform an fsync after renaming an existing file, or after truncating and rewriting. By default, ext4 automatically synchronizes files after each of these operations. However, this can be time consuming.
If this level of synchronization is not required, you can disable this behavior by specifying the noauto_da_alloc option at mount time. If noauto_da_alloc is set, applications must explicitly use fsync to ensure data persistence.
- Journal I/O priority
- By default, journal I/O has a priority of 3, which is slightly higher than the priority of normal I/O. You can control the priority of journal I/O with the journal_ioprio parameter at mount time. Valid values for journal_ioprio range from 0 to 7, with 0 being the highest priority I/O.
This section describes only some of the options available at mount time. For further mount options, see the
mount
man page:
$ man mount
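As a sketch only (the device, mount point, and values are hypothetical), these options are passed together at mount time:
# mount -o init_itable=20,noauto_da_alloc,journal_ioprio=5 /dev/sdb2 /mnt/ext4data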
8.4.7.3. Tuning Btrfs
Starting with Red Hat Enterprise Linux 7.0, Btrfs is provided as a Technology Preview. Tuning should always be done to optimize the system based on its current workload. For information on creation and mounting options, see the chapter on Btrfs in the Red Hat Enterprise Linux 7 Storage Administration Guide.
Data Compression
The default compression algorithm is zlib, but a specific workload can give a reason to change the compression algorithm. For example, if you have a single thread with heavy file I/O, the lzo algorithm may be preferable. Options at mount time are:
compress=zlib – the default option with a high compression ratio, safe for older kernels.
compress=lzo – faster compression than zlib, but a lower compression ratio.
compress=no – disables compression.
compress-force=method – enables compression even for files that do not compress well, such as videos and disk images. The available methods are zlib and lzo.
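For example (a sketch only; the device and mount point are hypothetical), the compression option is passed at mount time:
# mount -o compress=lzo /dev/sdb1 /mnt/btrfs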
Only files created or changed after the mount option is added will be compressed. To compress existing files, run the following command after you replace method with either zlib or lzo:
$ btrfs filesystem defragment -cmethod
For example, to recursively re-compress the contents of the file system root using lzo, run:
$ btrfs filesystem defragment -r -v -clzo /
8.4.7.4. Tuning GFS2
This section covers some of the tuning parameters available to GFS2 file systems at format and at mount time.
- Directory spacing
- All directories created in the top-level directory of the GFS2 mount point are automatically spaced to reduce fragmentation and increase write speed in those directories. To space another directory like a top-level directory, mark that directory with the T attribute, as shown, replacing dirname with the path to the directory you wish to space:
# chattr +T dirname
chattr is provided as part of the e2fsprogs package.
- Reduce contention
- GFS2 uses a global locking mechanism that can require communication between the nodes of a cluster. Contention for files and directories between multiple nodes lowers performance. You can minimize the risk of cross-cache invalidation by minimizing the areas of the file system that are shared between multiple nodes.