5.3. Redistributing VDO threads
VDO uses various thread pools for different tasks when handling requests. Optimal performance depends on setting the right number of threads in each pool, which varies based on available storage, CPU resources, and the type of workload.
VDO aims to maximize performance through parallelism, so spreading its work across multiple threads can improve performance. Allocating more threads to a bottlenecked task can help, depending on factors such as available CPU resources and the root cause of the bottleneck. Because high thread utilization (above 70-80%) leads to queuing delays, increasing the thread count helps in such cases. However, excessive threads might hinder performance and incur extra overhead.
For optimal performance, carry out these actions:
- Test VDO with various expected workloads to evaluate and optimize its performance.
- Increase thread count for pools with more than 50% utilization.
- Increase the number of cores available to VDO if the overall utilization is greater than 50%, even if the individual thread utilization is lower.
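To find saturated pools, sample per-thread CPU utilization while a representative workload is running. The following is a minimal sketch; it assumes the sysstat package provides the pidstat utility and that VDO kernel thread names contain vdo, which can vary by release:

# pidstat -t 5 1 | grep -i vdo

The %CPU column shows how busy each thread was during the 5-second sample; threads that consistently exceed the thresholds above belong to pools that are candidates for more threads.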
5.3.1. Grouping VDO threads across NUMA nodes
Accessing memory across NUMA nodes is slower than local access and can introduce cache contention. Grouping related VDO threads on the same NUMA node minimizes latency and resource contention, improving system performance.
When VDO kernel threads are distributed across NUMA nodes or reassigned by the scheduler, contention may arise from nodes competing for shared data structures or exchanging messages about I/O requests. This slows access and can degrade overall performance.
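Before grouping threads, inspect the NUMA topology to see which CPUs belong to each node and how costly cross-node access is. This assumes the numactl package is installed:

# numactl --hardware

The output lists the CPUs and memory attached to each node and a node distance matrix; larger distances indicate slower cross-node access.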
You can enhance VDO performance by grouping certain threads on the same NUMA node; a pinning example follows the list below.
- Group related threads together on one NUMA node:
  - I/O acknowledgment (ackQ) threads
  - Higher-level I/O submission threads:
    - User-mode threads handling direct I/O
    - Kernel page cache flush thread
- Optimize device access:
  - If device access timing varies across NUMA nodes, run bioQ threads on the node closest to the storage device controllers.
- Minimize contention:
  - Run I/O submissions and storage device interrupt processing on the same node as logQ or physQ threads.
  - Run other VDO-related work on the same node.
  - If one node cannot handle all VDO work, consider memory contention when moving threads to other nodes. For example, move the device interrupt handling and bioQ threads to another node.
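For example, to keep bioQ threads on the node closest to the storage controller, pin them to that node's CPUs. The following sketch assumes that node 1 holds CPUs 8-15 and is local to the controller, and that the thread names contain bioQ; verify both assumptions with numactl --hardware and ps on your system:

# for pid in $(ps -eo pid,comm | awk '/bioQ/ {print $1}'); do taskset -cp "8-15" "$pid"; done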
5.3.2. Configuring the CPU affinity
You can improve VDO performance with certain storage device drivers by adjusting the CPU affinity of VDO threads.
When the interrupt (IRQ) handler of the storage device driver does substantial work and the driver does not use a threaded IRQ handler, it can limit the ability of the system scheduler to optimize VDO performance.
For optimal performance, carry out these actions:
- Dedicate specific cores to IRQ handling, and adjust VDO thread affinity if a core is overloaded. A core is overloaded if its %hi value is more than a few percent higher than on other cores (see the sketch after this list).
- Avoid running singleton VDO threads, for example vdo:journalQ, on busy IRQ cores.
- Keep other thread types off cores busy with IRQs only if the individual CPU use is high.
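To identify overloaded IRQ cores and the IRQs that load them, combine per-CPU interrupt statistics with /proc/interrupts. The following sketch assumes the sysstat package provides the mpstat utility and an NVMe storage controller; substitute the name of your storage driver:

# mpstat -P ALL 5 1
# grep -i nvme /proc/interrupts

In the mpstat output, the %irq column corresponds to the %hi value that top reports. Cores whose %irq value is noticeably higher than on other cores are the ones to keep singleton VDO threads away from.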
The configuration does not persist across system reboots.
Procedure
Set the CPU affinity:
# taskset -cp <cpu_numbers> <process_id>

Replace <cpu_numbers> with a comma-separated list of CPU numbers to which you want to assign the process. Replace <process_id> with the ID of the running process to which you want to set CPU affinity.

For example, to set the CPU affinity for dm-vdo processes on CPU cores 1 and 2:

# for pid in $(ps -eo pid,comm | awk '/vdo/ {print $1}'); do taskset -cp "1,2" "$pid"; done
Verification
Display the affinity set:
# taskset -cp <process_id>

Replace <process_id> with the ID of the running process for which you want to display CPU affinity.
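For example, for a process pinned to CPU cores 1 and 2, the command prints a line similar to the following, where the PID is illustrative:

pid 2739's current affinity list: 1,2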