9.3. libvirt NUMA Tuning
Generally, best performance on NUMA systems is achieved by limiting guest size to the amount of resources on a single NUMA node. Avoid unnecessarily splitting resources across NUMA nodes.
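One way to see the CPU and memory resources available on each host NUMA node when sizing guests is the numactl tool. For example:
# numactl --hardware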
Use the numastat tool to view per-NUMA-node memory statistics for processes and the operating system.
In the following example, the numastat tool shows four virtual machines with suboptimal memory alignment across NUMA nodes:
# numastat -c qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
--------------- ------ ------ ------ ------ ------ ------ ------ ------ -----
51722 (qemu-kvm) 68 16 357 6936 2 3 147 598 8128
51747 (qemu-kvm) 245 11 5 18 5172 2532 1 92 8076
53736 (qemu-kvm) 62 432 1661 506 4851 136 22 445 8116
53773 (qemu-kvm) 1393 3 1 2 12 0 0 6702 8114
--------------- ------ ------ ------ ------ ------ ------ ------ ------ -----
Total 1769 463 2024 7462 10037 2672 169 7837 32434
Run numad to align the guests' CPUs and memory resources automatically.
Then run numastat -c qemu-kvm again to view the results of running numad. The following output shows that resources have been aligned:
# numastat -c qemu-kvm
Per-node process memory usage (in MBs)
PID Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7 Total
--------------- ------ ------ ------ ------ ------ ------ ------ ------ -----
51747 (qemu-kvm) 0 0 7 0 8072 0 1 0 8080
53736 (qemu-kvm) 0 0 7 0 0 0 8113 0 8120
53773 (qemu-kvm) 0 0 7 0 0 0 1 8110 8118
59065 (qemu-kvm) 0 0 8050 0 0 0 0 0 8051
--------------- ------ ------ ------ ------ ------ ------ ------ ------ -----
Total 0 0 8072 0 8072 0 8114 8110 32368
Note
Running numastat with -c provides compact output; adding the -m option adds system-wide memory information on a per-node basis to the output. See the numastat man page for more information.
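For example, the following command combines the compact per-process view with the system-wide per-node memory information:
# numastat -c -m qemu-kvm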
9.3.1. Monitoring Memory per host NUMA Node
You can use the nodestats.py script to report the total memory and free memory for each NUMA node on a host. This script also reports how much memory is strictly bound to certain host nodes for each running domain. For example:
# /usr/share/doc/libvirt-python-2.0.0/examples/nodestats.py
NUMA stats
NUMA nodes: 0 1 2 3
MemTotal: 3950 3967 3937 3943
MemFree: 66 56 42 41
Domain 'rhel7-0':
Overall memory: 1536 MiB
Domain 'rhel7-1':
Overall memory: 2048 MiB
Domain 'rhel6':
Overall memory: 1024 MiB nodes 0-1
Node 0: 1024 MiB nodes 0-1
Domain 'rhel7-2':
Overall memory: 4096 MiB nodes 0-3
Node 0: 1024 MiB nodes 0
Node 1: 1024 MiB nodes 1
Node 2: 1024 MiB nodes 2
Node 3: 1024 MiB nodes 3
This example shows four host NUMA nodes, each containing approximately 4GB of RAM in total (MemTotal). Nearly all memory is consumed on each node (MemFree). There are four domains (virtual machines) running: domain 'rhel7-0' has 1.5GB of memory that is not pinned onto any specific host NUMA node. Domain 'rhel7-2', however, has 4GB of memory and 4 NUMA nodes that are pinned 1:1 to host nodes.
To print host NUMA node statistics, create a nodestats.py script for your environment. An example script can be found in the libvirt-python package files in /usr/share/doc/libvirt-python-version/examples/nodestats.py. The specific path to the script can be displayed by using the rpm -ql libvirt-python command.
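For example, the following filters the package file list for the script (the grep pattern is illustrative):
# rpm -ql libvirt-python | grep nodestats.py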
9.3.2. NUMA vCPU Pinning
vCPU pinning provides similar advantages to task pinning on bare metal systems. Since vCPUs run as user-space tasks on the host operating system, pinning increases cache efficiency. One example of this is an environment where all vCPU threads are running on the same physical socket, therefore sharing a L3 cache domain.
Note
In Red Hat Enterprise Linux versions 7.0 to 7.2, it is only possible to pin active vCPUs. However, with Red Hat Enterprise Linux 7.3, pinning inactive vCPUs is available as well.
Combining vCPU pinning with numatune can avoid NUMA misses. The performance impacts of NUMA misses are significant, generally starting at a 10% performance hit or higher. vCPU pinning and numatune should be configured together.
If the virtual machine is performing storage or network I/O tasks, it can be beneficial to pin all vCPUs and memory to the same physical socket that is physically connected to the I/O adapter.
Note
The lstopo tool can be used to visualize NUMA topology. It can also help verify that vCPUs are bound to cores on the same physical socket. See the following Knowledgebase article for more information on lstopo: https://access.redhat.com/site/solutions/62879.
Important
Pinning causes increased complexity where there are many more vCPUs than physical cores.
The following example XML configuration has a domain process pinned to physical CPUs 0-7. The vCPU thread is pinned to its own cpuset. For example, vCPU0 is pinned to physical CPU 0, vCPU1 is pinned to physical CPU 1, and so on:
<vcpu cpuset='0-7'>8</vcpu>
<cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='5' cpuset='5'/>
    <vcpupin vcpu='6' cpuset='6'/>
    <vcpupin vcpu='7' cpuset='7'/>
</cputune>
There is a direct relationship between the vcpu and vcpupin tags. If a vcpupin option is not specified, the value will be automatically determined and inherited from the parent vcpu tag option. The following configuration shows <vcpupin> for vcpu 5 missing. Hence, vCPU5 would be pinned to physical CPUs 0-7, as specified in the parent tag <vcpu>:
<vcpu cpuset='0-7'>8</vcpu>
<cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='6' cpuset='6'/>
    <vcpupin vcpu='7' cpuset='7'/>
</cputune>
Important
<vcpupin>, <numatune>, and <emulatorpin> should be configured together to achieve optimal, deterministic performance. For more information on the <numatune> tag, see Section 9.3.3, “Domain Processes”. For more information on the <emulatorpin> tag, see Section 9.3.6, “Using emulatorpin”.
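For example, a minimal sketch combining the three settings might look like the following (the CPU and node numbers are illustrative and assume the selected physical CPUs belong to host NUMA node 0):
<vcpu placement='static' cpuset='0-3'>4</vcpu>
<cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <emulatorpin cpuset='0-3'/>
</cputune>
<numatune>
    <memory mode='strict' nodeset='0'/>
</numatune>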
9.3.3. Domain Processes
As provided in Red Hat Enterprise Linux, libvirt uses libnuma to set memory binding policies for domain processes. The nodeset for these policies can be configured either as static (specified in the domain XML) or auto (configured by querying numad). See the following XML configuration for examples on how to configure these inside the <numatune> tag:
<numatune>
    <memory mode='strict' placement='auto'/>
</numatune>

<numatune>
    <memory mode='strict' nodeset='0,2-3'/>
</numatune>
libvirt uses sched_setaffinity(2) to set CPU binding policies for domain processes. The cpuset option can either be static (specified in the domain XML) or auto (configured by querying numad). See the following XML configuration for examples on how to configure these inside the <vcpu> tag:
<vcpu placement='auto'>8</vcpu>
<vcpu placement='static' cpuset='0-10,^5'>8</vcpu>
There are implicit inheritance rules between the placement mode you use for <vcpu> and <numatune>:
- The placement mode for <numatune> defaults to the same placement mode as <vcpu>, or to static if a <nodeset> is specified.
- Similarly, the placement mode for <vcpu> defaults to the same placement mode as <numatune>, or to static if <cpuset> is specified.
This means that CPU tuning and memory tuning for domain processes can be specified and defined separately, but they can also be configured to be dependent on the other's placement mode.
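For example, in the following sketch <numatune> specifies neither a placement nor a nodeset, so its placement is inherited from <vcpu> and memory is also placed automatically by numad (an illustrative combination):
<vcpu placement='auto'>8</vcpu>
<numatune>
    <memory mode='strict'/>
</numatune>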
It is also possible to configure your system with numad to boot a selected number of vCPUs without pinning all vCPUs at startup.
For example, to enable only 8 vCPUs at boot on a system with 32 vCPUs, configure the XML similar to the following:
<vcpu placement='auto' current='8'>32</vcpu>
Note
See the following URLs for more information on vcpu and numatune: http://libvirt.org/formatdomain.html#elementsCPUAllocation and http://libvirt.org/formatdomain.html#elementsNUMATuning
9.3.4. Domain vCPU Threads
In addition to tuning domain processes, libvirt also permits the setting of the pinning policy for each vcpu thread in the XML configuration. Set the pinning policy for each vcpu thread inside the <cputune> tags:
<cputune> <vcpupin vcpu="0" cpuset="1-4,ˆ2"/> <vcpupin vcpu="1" cpuset="0,1"/> <vcpupin vcpu="2" cpuset="2,3"/> <vcpupin vcpu="3" cpuset="0,4"/> </cputune>
In this tag, libvirt uses either cgroup or sched_setaffinity(2) to pin the vcpu thread to the specified cpuset.
Note
For more details on <cputune>, see the following URL: http://libvirt.org/formatdomain.html#elementsCPUTuning
In addition, if you need to set up a virtual machine with more vCPUs than a single NUMA node provides, configure the host so that the guest detects a NUMA topology on the host. This allows for a 1:1 mapping of CPUs, memory, and NUMA nodes. For example, this can be applied to a guest with 4 vCPUs and 6 GB of memory, and a host with the following NUMA settings:
4 available nodes (0-3)
Node 0: CPUs 0 4, size 4000 MiB
Node 1: CPUs 1 5, size 3999 MiB
Node 2: CPUs 2 6, size 4001 MiB
Node 3: CPUs 3 7, size 4005 MiB
In this scenario, use the following Domain XML setting:
<cputune> <vcpupin vcpu="0" cpuset="1"/> <vcpupin vcpu="1" cpuset="5"/> <vcpupin vcpu="2" cpuset="2"/> <vcpupin vcpu="3" cpuset="6"/> </cputune> <numatune> <memory mode="strict" nodeset="1-2"/> </numatune> <cpu> <numa> <cell id="0" cpus="0-1" memory="3" unit="GiB"/> <cell id="1" cpus="2-3" memory="3" unit="GiB"/> </numa> </cpu>
9.3.5. Using Cache Allocation Technology to Improve Performance
You can make use of Cache Allocation Technology (CAT) provided by the kernel on specific CPU models. This enables allocation of part of the host CPU's cache for vCPU threads, which improves real-time performance.
See the following XML configuration for an example on how to configure vCPU cache allocation inside the cachetune tag:
<domain>
    <cputune>
        <cachetune vcpus='0-1'>
            <cache id='0' level='3' type='code' size='3' unit='MiB'/>
            <cache id='0' level='3' type='data' size='3' unit='MiB'/>
        </cachetune>
    </cputune>
</domain>
The XML above configures the threads for vCPUs 0 and 1 to have 3 MiB from the first L3 cache (level='3' id='0') allocated, once for L3CODE and once for L3DATA.
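Cache allocation requires hardware and kernel support; one quick way to check whether the host CPU advertises L3 CAT is to look for the cat_l3 CPU flag. For example:
# grep -m1 -o cat_l3 /proc/cpuinfo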
Note
A single virtual machine can have multiple <cachetune> elements.
For more information, see cachetune in the upstream libvirt documentation.
9.3.6. Using emulatorpin
Another way of tuning the domain process pinning policy is to use the <emulatorpin> tag inside of <cputune>.
The <emulatorpin> tag specifies which host physical CPUs the emulator (a subset of a domain, not including vCPUs) will be pinned to, providing a method of setting a precise affinity for emulator thread processes. As a result, vhost threads run on the same subset of physical CPUs and memory, and therefore benefit from cache locality. For example:
<cputune>
    <emulatorpin cpuset="1-3"/>
</cputune>
Note
In Red Hat Enterprise Linux 7, automatic NUMA balancing is enabled by default. Automatic NUMA balancing reduces the need for manually tuning <emulatorpin>, since the vhost-net emulator thread follows the vCPU tasks more reliably. For more information about automatic NUMA balancing, see Section 9.2, “Automatic NUMA Balancing”.
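To check whether automatic NUMA balancing is currently enabled on the host (a value of 1 means enabled), you can read the corresponding kernel setting. For example:
# cat /proc/sys/kernel/numa_balancing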
9.3.7. Tuning vCPU Pinning with virsh
Important
These are example commands only. You will need to substitute values according to your environment.
The following example virsh command pins vCPU thread 1 of the rhel7 guest to physical CPU 2:
% virsh vcpupin rhel7 1 2
You can also obtain the current vcpu pinning configuration with the virsh command. For example:
% virsh vcpupin rhel7
9.3.8. Tuning Domain Process CPU Pinning with virsh
Important
These are example commands only. You will need to substitute values according to your environment.
The emulatorpin option applies CPU affinity settings to threads that are associated with each domain process. For complete pinning, you must use both virsh vcpupin (as shown previously) and virsh emulatorpin for each guest. For example:
% virsh emulatorpin rhel7 3-4
9.3.9. Tuning Domain Process Memory Policy with virsh
Domain process memory can be dynamically tuned. See the following example command:
% virsh numatune rhel7 --nodeset 0-10
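You can also display the current memory tuning parameters for a domain by running virsh numatune with only the domain name. For example:
% virsh numatune rhel7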
More examples of these commands can be found in the virsh man page.
9.3.10. Guest NUMA Topology
Guest NUMA topology can be specified using the <numa> tag inside the <cpu> tag in the guest virtual machine's XML. See the following example, and replace values accordingly:
<cpu>
    ...
    <numa>
        <cell cpus='0-3' memory='512000'/>
        <cell cpus='4-7' memory='512000'/>
    </numa>
    ...
</cpu>
Each <cell> element specifies a NUMA cell or a NUMA node. cpus specifies the CPU or range of CPUs that are part of the node, and memory specifies the node memory in kibibytes (blocks of 1024 bytes). Each cell or node is assigned a cellid or nodeid in increasing order starting from 0.
Important
When modifying the NUMA topology of a guest virtual machine with a configured topology of CPU sockets, cores, and threads, make sure that cores and threads belonging to a single socket are assigned to the same NUMA node. If threads or cores from the same socket are assigned to different NUMA nodes, the guest may fail to boot.
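For example, a minimal sketch in which each guest socket maps to a single NUMA cell might look like the following (the socket, core, and memory values are illustrative):
<cpu>
    <topology sockets='2' cores='2' threads='1'/>
    <numa>
        <cell id='0' cpus='0-1' memory='1048576'/>
        <cell id='1' cpus='2-3' memory='1048576'/>
    </numa>
</cpu>
Here vCPUs 0-1 belong to the first socket and vCPUs 2-3 to the second, so the cores of each socket stay within one NUMA cell.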
Warning
Using guest NUMA topology simultaneously with huge pages is not supported on Red Hat Enterprise Linux 7 and is only available in layered products such as Red Hat Virtualization or Red Hat OpenStack Platform.
9.3.11. NUMA Node Locality for PCI Devices
When starting a new virtual machine, it is important to know both the host NUMA topology and the PCI device affiliation to NUMA nodes, so that when PCI passthrough is requested, the guest is pinned onto the correct NUMA nodes for optimal memory performance.
For example, if a guest is pinned to NUMA nodes 0-1, but one of its PCI devices is affiliated with node 2, data transfers between the guest's nodes and the device's node will incur additional latency.
In Red Hat Enterprise Linux 7.1 and above, libvirt reports the NUMA node locality for PCI devices in the guest XML, enabling management applications to make better performance decisions.
This information is visible in the sysfs files in /sys/devices/pci*/*/numa_node. One way to verify these settings is to use the lstopo tool to report sysfs data:
# lstopo-no-graphics
Machine (126GB)
NUMANode L#0 (P#0 63GB)
Socket L#0 + L3 L#0 (20MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#4)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#6)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#8)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#10)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#12)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#14)
HostBridge L#0
PCIBridge
PCI 8086:1521
Net L#0 "em1"
PCI 8086:1521
Net L#1 "em2"
PCI 8086:1521
Net L#2 "em3"
PCI 8086:1521
Net L#3 "em4"
PCIBridge
PCI 1000:005b
Block L#4 "sda"
Block L#5 "sdb"
Block L#6 "sdc"
Block L#7 "sdd"
PCIBridge
PCI 8086:154d
Net L#8 "p3p1"
PCI 8086:154d
Net L#9 "p3p2"
PCIBridge
PCIBridge
PCIBridge
PCIBridge
PCI 102b:0534
GPU L#10 "card0"
GPU L#11 "controlD64"
PCI 8086:1d02
NUMANode L#1 (P#1 63GB)
Socket L#1 + L3 L#1 (20MB)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#1)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#3)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#5)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#7)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#9)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#11)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#13)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
HostBridge L#8
PCIBridge
PCI 1924:0903
Net L#12 "p1p1"
PCI 1924:0903
Net L#13 "p1p2"
PCIBridge
PCI 15b3:1003
Net L#14 "ib0"
Net L#15 "ib1"
OpenFabrics L#16 "mlx4_0"
This output shows:
- NICs em* and disks sd* are connected to NUMA node 0 and cores 0,2,4,6,8,10,12,14.
- NICs p1* and ib* are connected to NUMA node 1 and cores 1,3,5,7,9,11,13,15.
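As an alternative to lstopo, you can read the numa_node value for a specific device directly from sysfs; for example, for the em1 interface shown above (the interface name is taken from this example host):
# cat /sys/class/net/em1/device/numa_node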