
Chapter 6. Optimizing VDO performance


The VDO kernel driver speeds up tasks by using multiple threads. Instead of one thread doing everything for an I/O request, it splits the work into smaller parts assigned to different threads. These threads talk to each other as they handle the request. This way, one thread can handle shared data without constant locking and unlocking.

When one thread finishes a task, VDO already has another task ready for it. This keeps the threads busy and reduces the time spent switching tasks. VDO also uses separate threads for slower tasks, such as adding I/O operations to the queue or handling messages to the deduplication index.

6.1. VDO thread types

VDO uses various thread types to handle specific operations:

Logical zone threads (kvdo:logQ)
Maintain the mapping between the logical block numbers (LBNs) presented to the user of the VDO device and the physical block numbers (PBNs) in the underlying storage system. They also prevent concurrent writes to the same block. Logical threads are active during both read and write operations. Processing is generally evenly distributed; however, specific access patterns can occasionally concentrate work in one thread. For example, frequent access to LBNs in a specific block map page might make one logical thread handle all of those operations.
Physical zone threads (kvdo:physQ)
Handle data block allocation and reference counts during write operations.
I/O submission threads (kvdo:bioQ)
Handle the transfer of block I/O (bio) operations from VDO to the storage system. They handle I/O requests from other VDO threads and pass them to the underlying device driver. These threads interact with device-related data structures, create requests for device driver kernel threads, and prevent delays when I/O requests get blocked due to a full device request queue.
CPU-processing threads (kvdo:cpuQ)
Handle CPU-intensive tasks that do not block or need exclusive access to data structures tied to other thread types. These tasks include calculating hash values and compressing data blocks.
I/O acknowledgement threads (kvdo:ackQ)
Signal the completion of I/O requests to higher-level components, such as the kernel page cache or application threads performing direct I/O. Their CPU usage and impact on memory contention are influenced by kernel-level code.
Hash zone threads (kvdo:hashQ)
Coordinate I/O requests with matching hashes to handle potential deduplication tasks. Although they create and manage deduplication requests, they do not perform significant computations. A single hash zone thread is usually sufficient.
Deduplication thread (kvdo:dedupeQ)
Handles I/O requests and communicates with the deduplication index. This work is performed on a separate thread to prevent blocking. It also has a timeout mechanism to skip deduplication if the index does not respond quickly. There is only one deduplication thread per VDO device.
Journal thread (kvdo:journalQ)
Updates the recovery journal and schedules journal blocks for writing. This task cannot be divided among multiple threads. There is only one journal thread per VDO device.
Packer thread (kvdo:packerQ)
Works during write operations when compression is enabled. It collects compressed data blocks from the CPU threads to reduce wasted space. There is only one packer thread per VDO device.
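
You can list the VDO kernel threads that are currently running, together with the CPU each thread last ran on, by filtering the process list for the kvdo prefix. The following is a minimal sketch; depending on the VDO version, the thread names can also carry a device or zone index:

    $ ps -eo pid,psr,pcpu,comm | grep kvdo

In the output, psr is the CPU that the thread last ran on and pcpu is its CPU usage.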

6.2. Identifying performance bottlenecks

Identifying bottlenecks in VDO performance is crucial for optimizing system efficiency. One of the primary steps you can take is to determine whether the bottleneck lies in the CPU, memory, or the speed of the backing storage. After pinpointing the slowest component, you can develop strategies for enhancing performance.

To ensure that the root cause of the low performance is not a hardware issue, run tests with and without VDO in the storage stack.
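
For example, you can run the same synthetic workload against the VDO volume and against comparable storage without VDO, and compare the results. The following fio job is only a sketch: /dev/vg_name/vdo_volume is a placeholder for the device under test, the job parameters are a generic starting point, and the job overwrites the device, so run it only against scratch devices:

    # fio --name=vdo-test --filename=/dev/vg_name/vdo_volume --rw=randwrite \
      --bs=4k --direct=1 --ioengine=libaio --iodepth=32 --numjobs=4 \
      --runtime=60 --time_based --group_reporting

Repeat the run with --filename pointing at an equivalent device that does not include VDO, and compare the reported bandwidth and latency.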

The journalQ thread in VDO is a natural bottleneck, especially when the VDO volume is handling write operations. If you notice that another thread type has higher utilization than the journalQ thread, you can remediate this by adding more threads of that type.

6.2.1. Analyzing VDO performance with top

You can examine the performance of VDO threads by using the top utility.

Procedure

  1. Display the individual threads:

    $ top -H
    Note

    Tools such as top cannot differentiate between productive CPU cycles and cycles stalled due to cache or memory delays. These tools interpret cache contention and slow memory access as actual work. Moving threads between nodes can appear as reduced CPU utilization even while the number of operations per second increases.

  2. Press the f key to display the fields manager.
  3. Use the (↓) key to navigate to the P = Last Used Cpu (SMP) field.
  4. Press the spacebar to select the P = Last Used Cpu (SMP) field.
  5. Press the q key to close the fields manager. The top utility now displays the CPU load for individual cores and indicates which CPU each process or thread recently used. You can switch to per-CPU statistics by pressing 1.
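
If you need a non-interactive snapshot, for example to capture the output in a file, you can run top in batch mode and filter for the VDO threads. A minimal sketch, assuming the thread names contain kvdo:

    $ top -H -b -n 1 | grep kvdo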

Additional resources

  • top(1) man page

6.2.2. Interpreting top results

While analyzing the performance of VDO threads, use the following table to interpret results of the top utility.

Table 6.1. Interpreting top results
Values: Thread or CPU usage surpasses 70%.
Description: The thread or CPU is overloaded. High reported usage can also occur when a VDO thread is scheduled on a CPU but does little real work, for example because of excessive hardware interrupts, memory conflicts, or resource competition.
Suggestions: Increase the number of threads of the type running on this core.

Values: Low %id and %wa values.
Description: The core is actively handling tasks.
Suggestions: No action required.

Values: Low %hi value.
Description: The core is performing standard processing work.
Suggestions: Add more cores to improve the performance. Avoid NUMA conflicts.

Values: High %hi value, [a] only one thread assigned to the core, and %id and %wa values of zero.
Description: The core is over-committed.
Suggestions: Reassign kernel threads and device interrupt handling to different cores.

Values: kvdo:bioQ threads frequently in D state.
Description: VDO is consistently keeping the storage system busy with I/O requests. [b]
Suggestions: Reduce the number of I/O submission threads if the CPU utilization is very low.

Values: kvdo:bioQ threads frequently in S state.
Description: VDO has more kvdo:bioQ threads than it needs.
Suggestions: Reduce the number of kvdo:bioQ threads.

Values: High CPU utilization per I/O request.
Description: CPU utilization per I/O request increases with more threads.
Suggestions: Check for CPU, memory, or lock contention.

[a] More than a few percent.
[b] This is good if the storage system can handle multiple requests or if request processing is efficient.
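
Outside of top, you can also confirm the per-core %hi and idle values with a short mpstat sample. This is a sketch; mpstat is part of the sysstat package that the sar procedure later in this chapter installs:

    $ mpstat -P ALL 1 5

In the mpstat output, the %irq column corresponds to the %hi value in top.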

6.2.3. Analyzing VDO performance with perf

You can check the CPU performance of VDO by using the perf utility.

Prerequisites

  • The perf package is installed.

Procedure

  1. Display the performance profile:

    # perf top
  2. Analyze the CPU performance by interpreting perf results:

    Table 6.2. Interpreting perf results
    Values: kvdo:bioQ threads spend excessive cycles acquiring spin locks.
    Description: Too much contention might be occurring in the device driver below VDO.
    Suggestions: Reduce the number of kvdo:bioQ threads.

    Values: High CPU usage.
    Description: Contention between NUMA nodes.
    Suggestions: Check counters such as stalled-cycles-backend, cache-misses, and node-load-misses if they are supported by your processor. High miss rates might cause stalls, resembling high CPU usage in other tools, and indicate possible contention. Implement CPU affinity for the VDO kernel threads or IRQ affinity for interrupt handlers to restrict processing work to a single node.
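
    To sample the counters mentioned in the table system-wide for a short interval, you can use perf stat. This is only a sketch; event availability varies by processor, so drop any event that perf list does not report:

      # perf stat -a -e stalled-cycles-backend,cache-misses,node-load-misses sleep 30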

Additional resources

  • perf-top(1) man page

6.2.4. Analyzing VDO performance with sar

You can create periodic reports on VDO performance by using the sar utility.

Note

Not all block device drivers can provide the data needed by the sar utility. For example, devices such as MD RAID do not report the %util value.

Prerequisites

  • Install the sysstat utility:

    # yum install sysstat

Procedure

  1. Display the disk I/O statistics at 1-second intervals:

    $ sar -d 1
  2. Analyze the VDO performance by interpreting sar results:

    Table 6.3. Interpreting sar results
    Values: The %util value for the underlying storage device is well under 100%, VDO is busy at 100%, and the bioQ threads are using a lot of CPU time.
    Description: VDO has too few bioQ threads for a fast device.
    Suggestions: Add more bioQ threads. Note that certain storage drivers might slow down when you add bioQ threads due to spin lock contention.
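
    To watch the VDO device and its backing device side by side, you can filter the sar output by device name. This is only a sketch; vdo and sdb are placeholders for the names that sar prints for your VDO volume and its backing device:

      $ sar -d -p 1 5 | grep -E 'DEV|vdo|sdb'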

Additional resources

  • sar(1) man page

6.3. Redistributing VDO threads

VDO uses various thread pools for different tasks when handling requests. Optimal performance depends on setting the right number of threads in each pool, which varies based on available storage, CPU resources, and the type of workload. You can spread out VDO work across multiple threads to improve VDO performance.

VDO aims to maximize performance through parallelism. You can improve performance by allocating more threads to a bottlenecked task, depending on factors such as available CPU resources and the root cause of the bottleneck. High thread utilization (above 70-80%) can lead to delays, so increasing the thread count can help in such cases. However, an excessive number of threads might hinder performance and incur extra costs.

For optimal performance, carry out these actions:

  • Test VDO with various expected workloads to evaluate and optimize its performance.
  • Increase thread count for pools with more than 50% utilization.
  • Increase the number of cores available to VDO if the overall utilization is greater than 50%, even if the individual thread utilization is lower.
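
With an LVM-managed VDO volume, you can change the thread counts for most of these pools through the --vdosettings option of lvchange, in the same way as the block map cache size later in this chapter. The following is only a sketch; it assumes that your LVM version accepts setting names such as bio_threads and cpu_threads (they mirror the vdo_bio_threads and vdo_cpu_threads options in the allocation section of lvm.conf, without the vdo_ prefix), and vg_name/vdo_volume is a placeholder for your volume. As with the cache size, the new values take effect when the volume is next activated:

    # lvchange -an vg_name/vdo_volume
    # lvchange --vdosettings "bio_threads=8" vg_name/vdo_volume
    # lvchange -ay vg_name/vdo_volume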

6.3.1. Grouping VDO threads across NUMA nodes

Accessing memory across NUMA nodes is slower than local memory access. On Intel processors where cores share the last-level cache within a node, cache problems are more significant when data is shared between nodes than when it is shared within a single node. While many VDO kernel threads manage exclusive data structures, they frequently exchange messages about I/O requests. If VDO threads are spread across multiple nodes, or if the scheduler reassigns threads from one node to another, contention can occur, that is, multiple nodes compete for the same resources.

You can enhance VDO performance by grouping certain threads on the same NUMA nodes.

Group related threads together on one NUMA node
  • I/O acknowledgment (ackQ) threads
  • Higher-level I/O submission threads:

    • User-mode threads handling direct I/O
    • Kernel page cache flush thread
Optimize device access
  • If device access timing varies across NUMA nodes, run bioQ threads on the node closest to the storage device controllers
Minimize contention
  • Run I/O submissions and storage device interrupt processing on the same node as logQ or physQ threads.
  • Run other VDO-related work on the same node.
  • If one node cannot handle all VDO work, consider memory contention when moving threads to other nodes. For example, move the device interrupt handling and the bioQ threads to another node.
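
To see which CPUs belong to each NUMA node and which node a storage controller is attached to, you can query lscpu and sysfs. This is a sketch; 0000:3b:00.0 is a placeholder for the PCI address of your storage controller, and the numa_node attribute reports -1 when the platform does not provide locality information:

    $ lscpu | grep -i numa
    $ cat /sys/bus/pci/devices/0000:3b:00.0/numa_node

You can then pass the CPU list of that node to taskset, as described in the next section.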

6.3.2. Configuring the CPU affinity

You can improve VDO performance on certain storage device drivers if you adjust the CPU affinity of VDO threads.

When the interrupt (IRQ) handler of the storage device driver does substantial work and the driver does not use a threaded IRQ handler, it could limit the ability of the system scheduler to optimize VDO performance.

For optimal performance, carry out these actions:

  • Dedicate specific cores to IRQ handling and adjust VDO thread affinity if the core is overloaded. The core is overloaded if the %hi value is more than a few percent higher than on other cores.
  • Avoid running singleton VDO threads, like the kvdo:journalQ thread, on busy IRQ cores.
  • Keep other thread types off cores that are busy with IRQs only if the individual CPU use is high.
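
To find the cores that currently service the storage controller interrupts, and to move them, you can inspect /proc/interrupts and write to the per-IRQ smp_affinity_list files. This is only a sketch; nvme is a placeholder for the interrupt name of your storage driver, 42 for one of its IRQ numbers, and 2,3 for the cores that you dedicate to IRQ handling. Like taskset, this change does not persist across reboots, and drivers that use managed interrupt affinity reject the write:

    # grep nvme /proc/interrupts
    # echo 2,3 > /proc/irq/42/smp_affinity_list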
Note

The configuration does not persist across system reboots.

Procedure

  • Set the CPU affinity:

    # taskset -c <cpu-numbers> -p <process-id>

    Replace <cpu-numbers> with a comma-separated list of CPU numbers to which you want to assign the process. Replace <process-id> with the ID of the running process to which you want to set CPU affinity.

    Example 6.1. Setting CPU Affinity for kvdo processes on CPU cores 1 and 2

    # for pid in `ps -eo pid,comm | grep kvdo | awk '{ print $1 }'`
    do
        taskset -c "1,2" -p $pid
    done

Verification

  • Display the configured CPU affinity:

    # taskset -p <process-id>

    Replace <process-id> with the ID of the running process whose CPU affinity you want to display.

Additional resources

  • taskset(1) man page

6.4. Increasing block map cache size to enhance performance

You can enhance read and write performance by increasing the cache size for your VDO volume.

If you have extended read and write latencies or a significant volume of data read from storage that does not align with application requirements, you might need to adjust the cache size.

Warning

When you increase the block map cache size, the cache uses the amount of memory that you specify plus an additional 15%. Larger cache sizes use more RAM and can affect overall system stability.

The following example shows how to change the cache size from 128 MB to 640 MB on your system.

Procedure

  1. Check the current cache size of your VDO volume:

    # lvs -o vdo_block_map_cache_size
      VDOBlockMapCacheSize
                   128.00m
                   128.00m
  2. Deactivate the VDO volume:

    # lvchange -an vg_name/vdo_volume
  3. Change the VDO setting:

    # lvchange --vdosettings "block_map_cache_size_mb=640" vg_name/vdo_volume

    Replace 640 with your new cache size in megabytes.

    Note

    The cache size must be a multiple of 4096, in the range of 128 MB to 16 TB, and at least 16 MB per logical thread. Changes take effect the next time the VDO device is started. Already running devices are not affected.

  4. Activate the VDO volume:

    # lvchange -ay vg_name/vdo_volume

Verification

  • Check the current VDO volume configuration:

    # lvs -o vdo_block_map_cache_size vg_name/vdo_volume
      VDOBlockMapCacheSize
                   640.00m

Additional resources

  • lvchange(8) man page

6.5. Speeding up discard operations

VDO sets a maximum allowed size of DISCARD (TRIM) sectors for all VDO devices on the system. The default size is 8 sectors, which corresponds to one 4-KiB block. Increasing the DISCARD size can significantly improve the speed of the discard operations. However, there is a tradeoff between improving discard performance and maintaining the speed of other write operations.

The optimal DISCARD size varies depending on the storage stack. Both very large and very small DISCARD sectors can potentially degrade the performance. Conduct experiments with different values to discover one that delivers satisfactory results.

For a VDO volume that stores a local file system, it is optimal to use a DISCARD size of 8 sectors, which is the default setting. For a VDO volume that serves as a SCSI target, a moderately large DISCARD size, such as 2048 sectors (which corresponds to a 1 MB discard), works best. It is recommended that the maximum DISCARD size does not exceed 10240 sectors, which translates to a 5 MB discard. When choosing the size, make sure it is a multiple of 8, because VDO may not handle discards effectively if they are smaller than 8 sectors.

Procedure

  1. Set the new maximum size for the DISCARD sector:

    # echo <number-of-sectors> > /sys/kvdo/max_discard_sectors

    Replace <number-of-sectors> with the number of sectors. This setting persists until reboot.

  2. Optional: To make the DISCARD sector change persistent across reboots, create a custom systemd service:

    1. Create a new /etc/systemd/system/max_discard_sectors.service file with the following content:

      [Unit]
      Description=Set maximum DISCARD sector

      [Service]
      ExecStart=/usr/bin/bash -c 'echo <number-of-sectors> > /sys/kvdo/max_discard_sectors'

      [Install]
      WantedBy=multi-user.target

      Replace <number-of-sectors> with the number of sectors.

    2. Save the file and exit.
    3. Reload the service file:

      # systemctl daemon-reload
    4. Enable the new service:

      # systemctl enable max_discard_sectors.service

Verification

  • Optional: If you made the DISCARD sector change persistent, check if the max_discard_sectors.service is enabled:

    # systemctl is-enabled max_discard_sectors.service
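
  • Check the value currently in effect by reading the sysfs attribute that the procedure writes to. A quick sketch:

    # cat /sys/kvdo/max_discard_sectors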

6.6. Optimizing CPU frequency scaling

By default, RHEL uses CPU frequency scaling to save power and reduce heat when the CPU is not under heavy load. To prioritize performance over power savings, you can configure the CPU to operate at its maximum clock speed. This ensures that the CPU can handle data deduplication and compression processes with maximum efficiency. By running the CPU at its highest frequency, resource-intensive operations can be executed more quickly, potentially improving the overall performance of VDO in terms of data reduction and storage optimization.

Warning

Tuning CPU frequency scaling for higher performance can increase power consumption and heat generation. In inadequately cooled systems, this can cause overheating and might result in thermal throttling, which limits the performance gains.

Procedure

  1. Display available CPU governors:

    $ cpupower frequency-info -g
  2. Change the scaling governor to prioritize performance:

    # cpupower frequency-set -g performance

    This setting persists until reboot.

  3. Optional: To make the scaling governor change persistent across reboots, create a custom systemd service:

    1. Create a new /etc/systemd/system/cpufreq.service file with the following content:

      [Unit]
      Description=Set CPU scaling governor to performance
      
      [Service]
      ExecStart=/usr/bin/cpupower frequency-set -g performance
      
      [Install]
      WantedBy=multi-user.target
    2. Save the file and exit.
    3. Reload the service file:

      # systemctl daemon-reload
    4. Enable the new service:

      # systemctl enable cpufreq.service

Verification

  • Display the currently used CPU frequency policy:

    $ cpupower frequency-info -p
  • Optional: If you made the scaling governor change persistent, check if the cpufreq.service is enabled:

    # systemctl is-enabled cpufreq.service
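
  • Optional: Check the governor that each CPU is currently using directly from sysfs. This is a quick sketch that reads the standard cpufreq sysfs attributes:

    $ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor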