Product SiteDocumentation Site

7.4.2. Advanced tuning for XFS

Before changing XFS parameters, you need to understand why the default XFS parameters are causing performance problems. This involves understanding what your application is doing, and how the file system is reacting to those operations.
Observable performance problems that can be corrected or reduced by tuning are generally caused by file fragmentation or resource contention in the file system. There are different ways to address these problems, and in some cases fixing the problem will require that the application, rather than the file system configuration, be modified.
If you have not been through this process previously, it is recommended that you engage your local Red Hat support engineer for advice.
Optimizing for a large number of files
XFS imposes an arbitrary limit on the number of files that a file system can hold. In general, this limit is high enough that it will never be hit. If you know that the default limit will be insufficient ahead of time, you can increase the percentage of file system space allowed for inodes with the mkfs.xfs command. If you encounter the file limit after file system creation (usually indicated by ENOSPC errors when attempting to create a file or directory even though free space is available), you can adjust the limit with the xfs_growfs command.
Optimizing for a large number of files in a single directory
Directory block size is fixed for the life of a file system, and cannot be changed except upon initial formatting with mkfs. The minimum directory block is the file system block size, which defaults to MAX (4 KB, file system block size). In general, there is no reason to reduce the directory block size.
Because the directory structure is b-tree based, changing the block size affects the amount of directory information that can be retrieved or modified per physical I/O. The larger the directory becomes, the more I/O each operation requires at a given block size.
However, when larger directory block sizes are in use, more CPU is consumed by each modification operation compared to the same operation on a file system with a smaller directory block size. This means that for small directory sizes, large directory block sizes will result in lower modification performance. When the directory reaches a size where I/O is the performance-limiting factor, large block size directories perform better.
The default configuration of a 4 KB file system block size and a 4 KB directory block size is best for directories with up to 1-2 million entries with a name length of 20-40 bytes per entry. If your file system requires more entries, larger directory block sizes tend to perform better - a 16 KB block size is best for file systems with 1-10 million directory entries, and a 64 KB block size is best for file systems with over 10 million directory entries.
If the workload uses random directory lookups more than modifications (that is, directory reads are much more common or important than directory writes), then the above thresholds for increasing the block size are approximately one order of magnitude lower.
Optimizing for concurrency
Unlike other file systems, XFS can perform many types of allocation and deallocation operations concurrently provided that the operations are occurring on non-shared objects. Allocation or deallocation of extents can occur concurrently provided that the concurrent operations occur in different allocation groups. Similarly, allocation or deallocation of inodes can occur concurrently provided that the concurrent operations affect different allocation groups.
The number of allocation groups becomes important when using machines with a high CPU count and multi-threaded applications that attempt to perform operations concurrently. If only four allocation groups exist, then sustained, parallel metadata operations will only scale as far as those four CPUs (the concurrency limit provided by the system). For small file systems, ensure that the number of allocation groups is supported by the concurrency provided by the system. For large file systems (tens of terabytes and larger) the default formatting options generally create sufficient allocation groups to avoid limiting concurrency.
Applications must be aware of single points of contention in order to use the parallelism inherent in the structure of the XFS file system. It is not possible to modify a directory concurrently, so applications that create and remove large numbers of files should avoid storing all files in a single directory. Each directory created is placed in a different allocation group, so techniques such as hashing files over multiple sub-directories provide a more scalable storage pattern compared to using a single large directory.
Optimizing for applications that use extended attributes
XFS can store small attributes directly in the inode if space is available in the inode. If the attribute fits into the inode, then it can be retrieved and modified without requiring extra I/O to retrieve separate attribute blocks. The performance differential between in-line and out-of-line attributes can easily be an order of magnitude slower for out-of-line attributes.
For the default inode size of 256 bytes, roughly 100 bytes of attribute space is available depending on the number of data extent pointers also stored in the inode. The default inode size is really only useful for storing a small number of small attributes.
Increasing the inode size at mkfs time can increase the amount of space available for storing attributes in-line. A 512 byte inode size increases the space available for attributes to roughly 350 bytes; a 2 KB inode has roughly 1900 bytes of space available.
There is, however, a limit on the size of the individual attributes that can be stored in-line - there is a maximum size limit of 254 bytes for both the attribute name and the value (that is, an attribute with a name length of 254 bytes and a value length of 254 bytes will stay in-line). Exceeding these size limits forces the attributes out of line, even if there would have been enough space to store all the attributes in the inode.
Optimizing for sustained metadata modifications
The size of the log is the main factor in determining the achievable level of sustained metadata modification. The log device is circular, so before the tail can be overwritten all the modifications in the log must be written to the real locations on disk. This can involve a significant amount of seeking to write back all dirty metadata. The default configuration scales the log size in relation to the overall file system size, so in most cases log size will not require tuning.
A small log device will result in very frequent metadata writeback - the log will constantly be pushing on its tail to free up space and so frequently modified metadata will be frequently written to disk, causing operations to be slow.
Increasing the log size increases the time period between tail pushing events. This allows better aggregation of dirty metadata, resulting in better metadata writeback patterns, and less writeback of frequently modified metadata. The trade-off is that larger logs require more memory to track all outstanding changes in memory.
If you have a machine with limited memory, then large logs are not beneficial because memory constraints will cause metadata writeback long before the benefits of a large log can be realized. In these cases, smaller rather than larger logs will often provide better performance because metadata writeback from the log running out of space is more efficient than writeback driven by memory reclamation.
You should always try to align the log to the underlying stripe unit of the device that contains the file system. mkfs does this by default for MD and DM devices, but for hardware RAID it may need to be specified. Setting this correctly avoids all possibility of log I/O causing unaligned I/O and subsequent read-modify-write operations when writing modifications to disk.
Log operation can be further improved by editing mount options. Increasing the size of the in-memory log buffers (logbsize) increases the speed at which changes can be written to the log. The default log buffer size is MAX (32 KB, log stripe unit), and the maximum size is 256 KB. In general, a larger value results in faster performance. However, under fsync-heavy workloads, small log buffers can be noticeably faster than large buffers with a large stripe unit alignment.
The delaylog mount option also improves sustained metadata modification performance by reducing the number of changes to the log. It achieves this by aggregating individual changes in memory before writing them to the log: frequently modified metadata is written to the log periodically instead of on every modification. This option increases the memory usage of tracking dirty metadata and increases the potential lost operations when a crash occurs, but can improve metadata modification speed and scalability by an order of magnitude or more. Use of this option does not reduce data or metadata integrity when fsync, fdatasync or sync are used to ensure data and metadata is written to disk.