30.6. Tuning VDO

30.6.1. Introduction to VDO Tuning

As with tuning databases or other complex software, tuning VDO involves making trade-offs between numerous system constraints, and some experimentation is required. The primary controls available for tuning VDO are the number of threads assigned to different types of work, the CPU affinity settings for those threads, and cache settings.

30.6.2. Background on VDO Architecture

The VDO kernel driver is multi-threaded to improve performance by amortizing processing costs across multiple concurrent I/O requests. Rather than have one thread process an I/O request from start to finish, it delegates different stages of work to one or more threads or groups of threads, with messages passed between them as the I/O request makes its way through the pipeline. This way, one thread can serialize all access to a global data structure without having to lock and unlock it each time an I/O operation is processed. If the VDO driver is well-tuned, each time a thread completes a requested processing stage there will usually be another request queued up for that same processing. Keeping these threads busy reduces the overhead of context switching and scheduling, improving performance. Separate threads are also used for parts of the operating system that can block, such as enqueueing I/O operations to the underlying storage system or messages to UDS.
The various worker thread types used by VDO are:
Logical zone threads
The logical threads, with process names including the string kvdo:logQ, maintain the mapping between the logical block numbers (LBNs) presented to the user of the VDO device and the physical block numbers (PBNs) in the underlying storage system. They also implement locking such that two I/O operations attempting to write to the same block will not be processed concurrently. Logical zone threads are active during both read and write operations.
LBNs are divided into chunks (a block map page contains a bit over 3 MB of LBNs) and these chunks are grouped into zones that are divided up among the threads.
Processing should be distributed fairly evenly across the threads, though some unlucky access patterns may occasionally concentrate work in one thread or another. For example, frequent access to LBNs within a given block map page will cause one of the logical threads to process all of those operations.
The number of logical zone threads can be controlled using the --vdoLogicalThreads=thread count option of the vdo command.
Physical zone threads
Physical, or kvdo:physQ, threads manage data block allocation and maintain reference counts. They are active during write operations.
Like LBNs, PBNs are divided into chunks called slabs, which are further divided into zones and assigned to worker threads that distribute the processing load.
The number of physical zone threads can be controlled using the --vdoPhysicalThreads=thread count option of the vdo command.
I/O submission threads
kvdo:bioQ threads submit block I/O (bio) operations from VDO to the storage system. They take I/O requests enqueued by other VDO threads and pass them to the underlying device driver. These threads may communicate with and update data structures associated with the device, or set up requests for the device driver's kernel threads to process. Submitting I/O requests can block if the underlying device's request queue is full, so this work is done by dedicated threads to avoid processing delays.
If these threads are frequently shown in D state by the ps or top utilities, then VDO is keeping the storage system busy with I/O requests. This is generally good if the storage system can service multiple requests in parallel, as some SSDs can, or if the request processing is pipelined. If thread CPU utilization is very low during these periods, it may be possible to reduce the number of I/O submission threads.
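For example, the following is a minimal way to check those thread states; the grep pattern assumes the kvdo:bioQ naming described above:

    # List the VDO bio submission threads with their scheduler state;
    # "D" marks uninterruptible sleep, typically waiting on the storage device.
    ps -eLo pid,stat,comm | grep 'kvdo.*bioQ'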
CPU usage and memory contention depend on the device driver(s) beneath VDO. If CPU utilization per I/O request increases as more threads are added, check for CPU, memory, or lock contention in those device drivers.
The number of I/O submission threads can be controlled using the --vdoBioThreads=thread count option of the vdo command.
CPU-processing threads
kvdo:cpuQ threads exist to perform CPU-intensive work, such as computing hash values or compressing data blocks, that neither blocks nor requires exclusive access to data structures associated with other thread types.
The number of CPU-processing threads can be controlled using the --vdoCpuThreads=thread count option of the vdo command.
I/O acknowledgement threads
The kvdo:ackQ threads issue the callbacks to whatever sits atop VDO (for example, the kernel page cache, or application program threads doing direct I/O) to report completion of an I/O request. CPU time requirements and memory contention will be dependent on this other kernel-level code.
The number of acknowledgement threads can be controlled using the --vdoAckThreads=thread count option of the vdo command.
Non-scalable VDO kernel threads:
Deduplication thread
The kvdo:dedupeQ thread takes queued I/O requests and contacts UDS. Because the socket buffer can fill up if the server cannot process requests quickly enough, or if kernel memory is constrained by other system activity, this work is done by a separate thread so that if it blocks, other VDO processing can continue. There is also a timeout mechanism that skips an I/O request after a long delay (several seconds).
Journal thread
The kvdo:journalQ thread updates the recovery journal and schedules journal blocks for writing. A VDO device uses only one journal, so this work cannot be split across threads.
Packer thread
The kvdo:packerQ thread, active in the write path when compression is enabled, collects data blocks compressed by the kvdo:cpuQ threads to minimize wasted space. There is one packer data structure, and thus one packer thread, per VDO device.

30.6.3. Values to tune

30.6.3.1. CPU/memory

30.6.3.1.1. Logical, physical, cpu, ack thread counts
The logical, physical, cpu, and I/O acknowledgement work can be spread across multiple threads, the number of which can be specified during initial configuration and changed later if the VDO device is restarted.
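For example, thread counts might be set when a volume is created, or changed afterward and applied by restarting the volume. This is a sketch; the volume name vdo0 and backing device /dev/sdb are placeholders:

    # Create a VDO volume with explicit worker-thread counts.
    vdo create --name=vdo0 --device=/dev/sdb \
        --vdoLogicalThreads=3 --vdoPhysicalThreads=2 \
        --vdoCpuThreads=4 --vdoAckThreads=2

    # Change a count on an existing volume; the new value takes effect
    # the next time the device is restarted.
    vdo modify --name=vdo0 --vdoCpuThreads=6
    vdo stop --name=vdo0
    vdo start --name=vdo0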
One core, or one thread, can do a finite amount of work during a given time. Having one thread compute all data-block hash values, for example, would impose a hard limit on the number of data blocks that could be processed per second. Dividing the work across multiple threads (and cores) relieves that bottleneck.
As a thread or core approaches 100% usage, more work items will tend to queue up for processing. While this may leave the CPU with fewer idle cycles, queueing delays and latency for individual I/O requests will typically increase. According to some queueing theory models, utilization levels above 70% or 80% can lead to excessive delays several times longer than the normal processing time. Thus it may be helpful to distribute work further for a thread or core with 50% or higher utilization, even if those threads or cores are not always busy.
In the opposite case, where a thread or CPU is very lightly loaded (and thus very often asleep), supplying work for it to do is more likely to incur some additional cost. (A thread attempting to wake another thread must acquire a global lock on the scheduler's data structures, and may potentially send an inter-processor interrupt to transfer work to another core). As more cores are configured to run VDO threads, it becomes less likely that a given piece of data will be cached as work is moved between threads or as threads are moved between cores — so too much work distribution can also degrade performance.
The work performed by the logical, physical, and CPU threads per I/O request will vary based on the type of workload, so systems should be tested with the different types of workloads they are expected to service.
Write operations in sync mode involving successful deduplication will entail extra I/O operations (reading the previously stored data block), some CPU cycles (comparing the new data block with the stored one to confirm that they match), and journal updates (remapping the LBN to the previously-stored data block's PBN) compared to writes of new data. When duplication is detected in async mode, data write operations are avoided at the cost of the read and compare operations described above; only one journal update can happen per write, whether or not duplication is detected.
If compression is enabled, reads and writes of compressible data will require more processing by the CPU threads.
Blocks containing all zero bytes (a zero block) are treated specially, as they commonly occur. A special entry is used to represent such data in the block map, and the zero block is not written to or read from the storage device. Thus, tests that write or read all-zero blocks may produce misleading results. The same is true, to a lesser degree, of tests that write over zero blocks or uninitialized blocks (those that were never written since the VDO device was created) because reference count updates done by the physical threads are not required for zero or uninitialized blocks.
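Benchmarks should therefore write data that is neither all zeroes nor accidentally duplicated. The fio invocation below is one sketch, assuming a fio version that supports the dedupe_percentage and buffer_compress_percentage options; /dev/mapper/vdo0 is a placeholder:

    # Random 4 KB writes in which roughly half the blocks are duplicates
    # and buffers are about 50% compressible, so results are not skewed
    # by zero or uninitialized blocks.
    fio --name=vdo-test --filename=/dev/mapper/vdo0 \
        --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=32 \
        --size=4g --dedupe_percentage=50 --buffer_compress_percentage=50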
Acknowledging I/O operations is the only task that is not significantly affected by the type of work being done or the data being operated upon, as one callback is issued per I/O operation.
30.6.3.1.2. CPU Affinity and NUMA
Accessing memory across NUMA node boundaries takes longer than accessing memory on the local node. With Intel processors sharing the last-level cache between cores on a node, cache contention between nodes is a much greater problem than cache contention within a node.
Tools such as top cannot distinguish between CPU cycles that do work and cycles that are stalled; they interpret cache contention and slow memory accesses as actual work. As a result, moving a thread between nodes may appear to reduce the thread's apparent CPU utilization while increasing the number of operations it performs per second.
While many of VDO's kernel threads maintain data structures that are accessed by only one thread, they do frequently exchange messages about the I/O requests themselves. Contention may be high if VDO threads are run on multiple nodes, or if threads are reassigned from one node to another by the scheduler.
If practical, collect VDO threads on one node using the taskset utility. If other VDO-related work can also be run on the same node, that may further reduce contention. In that case, if one node lacks the CPU power to keep up with processing demands then memory contention must be considered when choosing threads to move onto other nodes. For example, if a storage device's driver has a significant number of data structures to maintain, it may help to move both the device's interrupt handling and VDO's I/O submissions (the bio threads that call the device's driver code) to another node. Keeping I/O acknowledgment (ack threads) and higher-level I/O submission threads (user-mode threads doing direct I/O, or the kernel's page cache flush thread) paired is also good practice.
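A minimal sketch of such pinning, assuming NUMA node 0 owns cores 0-7 (verify with numactl --hardware) and that the worker threads match the kvdo naming described earlier:

    # Pin every kvdo worker thread to the cores of NUMA node 0.
    for pid in $(ps -eLo lwp,comm | awk '/kvdo/ {print $1}'); do
        taskset -cp 0-7 "$pid"
    done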
30.6.3.1.3. Frequency throttling
If power consumption is not an issue, writing the string performance to the /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor files if they exist might produce better results. If these sysfs nodes do not exist, Linux or the system's BIOS may provide other options for configuring CPU frequency management.
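A sketch, to be run as root where the cpufreq interface is present:

    # Switch every CPU to the "performance" governor.
    for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > "$gov"
    done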
Performance measurements are further complicated by CPUs that dynamically vary their frequencies based on workload, because the time needed to accomplish a specific piece of work may vary due to other work the CPU has been doing, even without task switching or cache contention.

30.6.3.2. Caching

30.6.3.2.1. Block Map Cache
VDO caches a number of block map pages for efficiency. The cache size defaults to 128 MB, but it can be increased with the --blockMapCacheSize=megabytes option of the vdo command. Using a larger cache may produce significant benefits for random-access workloads.
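For example, the cache might be doubled for a heavily random-access workload. This is a sketch; vdo0 is a placeholder, and the new size takes effect when the device is next started:

    # Raise the block map cache from the 128 MB default to 256 MB.
    vdo modify --name=vdo0 --blockMapCacheSize=256M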
30.6.3.2.2. Read Cache
A second cache may be used for caching data blocks read from the storage system to verify VDO's deduplication advice. If similar data blocks are seen within a short time span, the number of I/O operations needed may be reduced.
The read cache also holds storage blocks containing compressed user data. If multiple compressible blocks were written within a short period of time, their compressed versions may be located within the same storage system block. Likewise, if they are read within a short time, caching may avoid the need for additional reads from the storage system.
The vdo command's --readCache={enabled | disabled} option controls whether a read cache is used. If enabled, the cache has a minimum size of 8 MB, but it can be increased with the --readCacheSize=megabytes option. Managing the read cache incurs a slight overhead, so it may not increase performance if the storage system is fast enough. The read cache is disabled by default.
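A sketch of enabling and sizing the read cache; vdo0 is again a placeholder:

    # Enable the read cache and grow it from the 8 MB minimum to 64 MB,
    # then restart the volume so the change takes effect.
    vdo modify --name=vdo0 --readCache=enabled --readCacheSize=64M
    vdo stop --name=vdo0
    vdo start --name=vdo0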

30.6.3.3. Storage System I/O

30.6.3.3.1. Bio Threads
For generic hard drives in a RAID configuration, one or two bio threads may be sufficient for submitting I/O operations. If the storage device driver requires its I/O submission threads to do significantly more work (updating driver data structures or communicating with the device) such that one or two threads are very busy and storage devices are often idle, the bio thread count can be increased to compensate. However, depending on the driver implementation, raising the thread count too high may lead to cache or spin lock contention. If device access timing is not uniform across all NUMA nodes, it may be helpful to run bio threads on the node "closest" to the storage device controllers.
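For PCI-attached controllers, sysfs can report which node a device is attached to. A sketch, with a placeholder PCI address:

    # NUMA node of the storage controller (-1 means no affinity reported).
    cat /sys/bus/pci/devices/0000:03:00.0/numa_node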
30.6.3.3.2. IRQ Handling
If a device driver does significant work in its interrupt handler and does not use a threaded IRQ handler, it may prevent the scheduler from providing the best performance. CPU time spent servicing hardware interrupts may look like normal VDO (or other) kernel thread execution in some ways. For example, if hardware IRQ handling required 30% of a core's cycles, a busy kernel thread on the same core could only use the remaining 70%. However, if the work queued up for that thread demanded 80% of the core's cycles, the thread would never catch up, and the scheduler might simply leave that thread to run impeded on that core instead of switching that thread to a less busy core.
Using such a device driver under a heavy VDO workload may require a large number of cycles to service hardware interrupts (the %hi indicator in the header of the top display). In that case it may help to assign IRQ handling to certain cores and adjust the CPU affinity of VDO kernel threads not to run on those cores.
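For example, interrupts can be steered through the /proc/irq interface. In this sketch the IRQ number and core list are placeholders, and the irqbalance service may need to be stopped so it does not overwrite the setting:

    # Find the IRQ numbers used by the storage device.
    grep nvme /proc/interrupts

    # Steer IRQ 42 onto cores 0-3; VDO thread affinity can then be set
    # to avoid those cores.
    echo 0-3 > /proc/irq/42/smp_affinity_list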

30.6.3.4. Maximum Discard Sectors

The maximum allowed size of DISCARD (TRIM) operations to a VDO device can be tuned via /sys/kvdo/max_discard_sectors, based on system usage. The default is 8 sectors (that is, one 4 KB block). Larger sizes may be specified, though VDO will still process them in a loop, one block at a time, ensuring that metadata updates for one discarded block are written to the journal and flushed to disk before starting on the next block.
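A sketch of inspecting and raising the limit as root; the value shown is only an example:

    # Show the current maximum discard size, in 512-byte sectors.
    cat /sys/kvdo/max_discard_sectors

    # Allow discards of up to 2048 sectors (1 MB) per request.
    echo 2048 > /sys/kvdo/max_discard_sectors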
When using a VDO volume as a local file system, Red Hat testing found that a small discard size works best, as the generic block-device code in the Linux kernel will break large discard requests into multiple smaller ones and submit them in parallel. If there is low I/O activity on the device, VDO can process many smaller requests concurrently and much more quickly than one large request.
If the VDO device is to be used as a SCSI target, the initiator and target software introduce additional factors to consider. If the target SCSI software is SCST, it reads the maximum discard size and relays it to the initiator. (Red Hat has not attempted to tune VDO configurations in conjunction with LIO SCSI target code.)
Because the Linux SCSI initiator code allows only one discard operation at a time, discard requests that exceed the maximum size would be broken into multiple smaller discards and sent, one at a time, to the target system (and to VDO). So, in addition to VDO processing a number of small discard operations in serial, the round-trip communication time between the two systems adds additional latency.
Setting a larger maximum discard size can reduce this communication overhead, though that larger request is passed in its entirety to VDO and processed one 4 KB block at a time. While there is no per-block communication delay, additional processing time for the larger block may cause the SCSI initiator software to time out.
For SCSI target usage, Red Hat recommends configuring the maximum discard size to be moderately large while still keeping the typical discard time well within the initiator's timeout setting. An extra round-trip cost every few seconds, for example, should not significantly affect performance, and SCSI initiators with timeouts of 30 or 60 seconds should not time out.

30.6.4. Identifying Bottlenecks

There are several key factors that affect VDO performance, and many tools available to identify those having the most impact.
Thread or CPU utilization above 70%, as seen in utilities such as top or ps, generally implies that too much work is being concentrated in one thread or on one CPU. However, in some cases it could mean that a VDO thread was scheduled to run on the CPU but no work actually happened; this scenario could occur with excessive hardware interrupt handler processing, memory contention between cores or NUMA nodes, or contention for a spin lock.
When using the top utility to examine system performance, Red Hat suggests running top -H to show all process threads separately and then entering the 1 f j keys, followed by the Enter/Return key; the top command then displays the load on individual CPU cores and identifies the CPU on which each process or thread last ran. This information can provide the following insights (see the example following the list):
  • If a core has low %id (idle) and %wa (waiting-for-I/O) values, it is being kept busy with work of some kind.
  • If the %hi value for a core is very low, that core is doing normal processing work, which is being load-balanced by the kernel scheduler. Adding more cores to that set may reduce the load as long as it does not introduce NUMA contention.
  • If the %hi for a core is more than a few percent and only one thread is assigned to that core, and %id and %wa are zero, the core is over-committed and the scheduler is not addressing the situation. In this case the kernel thread or the device interrupt handling should be reassigned to keep them on separate cores.
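The keystrokes described above correspond to the following invocation:

    # Show each thread on its own line; inside top, press "1" for per-core
    # statistics, then "f", "j", and Enter to add the last-used-CPU column.
    top -H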
The perf utility can examine the performance counters of many CPUs. Red Hat suggests using the perf top subcommand as a starting point to examine the work a thread or processor is doing. If, for example, the bioQ threads are spending many cycles trying to acquire spin locks, there may be too much contention in the device driver below VDO, and reducing the number of bioQ threads might alleviate the situation. High CPU use (in acquiring spin locks or elsewhere) could also indicate contention between NUMA nodes if, for example, the bioQ threads and the device interrupt handler are running on different nodes. If the processor supports them, counters such as stalled-cycles-backend, cache-misses, and node-load-misses may be of interest.
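A sketch of such an examination; the counter names vary by processor, and the process ID 1234 is a placeholder for a VDO thread ID:

    # Sample the hottest functions system-wide, including kernel symbols.
    perf top

    # Count stall and cache-miss events for one VDO thread for 10 seconds.
    perf stat -e stalled-cycles-backend,cache-misses,node-load-misses \
        -p 1234 -- sleep 10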
The sar utility can provide periodic reports on multiple system statistics. The sar -d 1 command reports block device utilization levels (the percentage of time they have at least one I/O operation in progress) and queue lengths (the number of I/O requests waiting) once per second. However, not all block device drivers can report such information, so sar's usefulness depends on the device drivers in use.
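For example:

    # Report block device utilization and average queue length once per
    # second until interrupted.
    sar -d 1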