9.3. libvirt NUMA Tuning
Generally, optimal performance on NUMA systems is achieved by limiting guest size to the amount of resources on a single NUMA node. Avoid unnecessarily splitting resources across NUMA nodes.
Use the numastat tool to view per-NUMA-node memory statistics for processes and the operating system.
In the following example, the numastat tool shows four virtual machines with suboptimal memory alignment across NUMA nodes:
# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
51722 (qemu-kvm)     68     16    357   6936      2      3    147    598  8128
51747 (qemu-kvm)    245     11      5     18   5172   2532      1     92  8076
53736 (qemu-kvm)     62    432   1661    506   4851    136     22    445  8116
53773 (qemu-kvm)   1393      3      1      2     12      0      0   6702  8114
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
Total              1769    463   2024   7462  10037   2672    169   7837 32434
You can run numad to align the guests' CPUs and memory resources automatically. However, it is highly recommended to configure guest resource alignment using libvirt instead.
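For example, a guest's memory can be bound to a specific host NUMA node with the <numatune> element in the guest's XML configuration. The following is a minimal sketch that binds guest memory to host node 4; the nodeset value shown here is an example and depends on your host's topology:

<numatune>
  <memory mode='strict' nodeset='4'/>
</numatune>

With mode='strict', memory allocation fails if it cannot be satisfied from the specified nodes; the preferred and interleave modes relax this behavior.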
To verify that the memory has been aligned, run numastat -c qemu-kvm again. The following output shows successful resource alignment:
# numastat -c qemu-kvm

Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
51747 (qemu-kvm)      0      0      7      0   8072      0      1      0  8080
53736 (qemu-kvm)      0      0      7      0      0      0   8113      0  8120
53773 (qemu-kvm)      0      0      7      0      0      0      1   8110  8118
59065 (qemu-kvm)      0      0   8050      0      0      0      0      0  8051
---------------  ------ ------ ------ ------ ------ ------ ------ ------ -----
Total                 0      0   8072      0   8072      0   8114   8110 32368
Note
Running numastat with the -c option provides compact output; adding the -m option adds system-wide memory information on a per-node basis to the output. Refer to the numastat man page for more information.
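For example, the following command combines both options to show the compact per-process view together with system-wide per-node memory statistics:

# numastat -c -m qemu-kvm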
For optimal performance results, memory pinning should be used in combination with pinning of vCPU threads as well as other hypervisor threads.
9.3.1. NUMA vCPU Pinning
vCPU pinning provides similar advantages to task pinning on bare-metal systems. Since vCPUs run as user-space tasks on the host operating system, pinning increases cache efficiency. One example of this is an environment where all vCPU threads run on the same physical socket and therefore share an L3 cache domain.
Combining vCPU pinning with numatune can avoid NUMA misses. The performance impact of NUMA misses is significant, generally starting at a 10% performance hit or higher. vCPU pinning and numatune should be configured together.
If the virtual machine is performing storage or network I/O tasks, it can be beneficial to pin all vCPUs and memory to the same physical socket that is physically connected to the I/O adapter.
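To check which NUMA node a given I/O adapter is attached to, you can read the device's numa_node attribute in sysfs. The PCI address below is a placeholder; substitute the address of your adapter. A value of -1 means the node is not known:

# cat /sys/bus/pci/devices/0000:03:00.0/numa_node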
Note
The lstopo tool can be used to visualize NUMA topology. It can also help verify that vCPUs are binding to cores on the same physical socket. Refer to the following Knowledgebase article for more information on lstopo: https://access.redhat.com/site/solutions/62879.
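For example, assuming the hwloc package (and hwloc-gui for graphical output) is installed, the topology can be printed as plain text on the console or exported as an image:

# lstopo-no-graphics
# lstopo topology.png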
Important
Pinning causes increased complexity when there are many more vCPUs than physical cores.
The following example XML configuration pins a domain process to physical CPUs 0-7. Each vCPU thread is pinned to its own cpuset. For example, vCPU0 is pinned to physical CPU 0, vCPU1 is pinned to physical CPU 1, and so on:
<vcpu cpuset='0-7'>8</vcpu>
<cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='5' cpuset='5'/>
    <vcpupin vcpu='6' cpuset='6'/>
    <vcpupin vcpu='7' cpuset='7'/>
</cputune>
There is a direct relationship between the <vcpu> and <vcpupin> tags. If a <vcpupin> option is not specified, the value is determined automatically and inherited from the parent <vcpu> tag. The following configuration omits the <vcpupin> entry for vCPU 5. Hence, vCPU5 would be pinned to physical CPUs 0-7, as specified in the parent <vcpu> tag:
<vcpu cpuset='0-7'>8</vcpu>
<cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='6' cpuset='6'/>
    <vcpupin vcpu='7' cpuset='7'/>
</cputune>
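After the guest is defined with this configuration, the effective pinning can be verified at run time with virsh. This sketch assumes a running domain named guest1:

# virsh vcpupin guest1

Running virsh vcpupin with only the domain name lists the current physical CPU affinity of each vCPU, which should match the cpuset values configured above.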