9.3. libvirt NUMA Tuning
Generally, best performance on NUMA systems is achieved by limiting guest size to the amount of resources on a single NUMA node. Avoid unnecessarily splitting resources across NUMA nodes.
Use the numastat tool to view per-NUMA-node memory statistics for processes and the operating system.
In the following example, the numastat tool shows four virtual machines with suboptimal memory alignment across NUMA nodes:
Run numad to align the guests' CPUs and memory resources automatically.
Then run numastat -c qemu-kvm again to view the results of running numad. The following output shows that resources have been aligned:
Note
Running numastat with -c provides compact output; adding the -m option adds system-wide memory information on a per-node basis to the output. See the numastat man page for more information.
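The check, align, and recheck workflow described above can be driven with the following commands. This is a minimal sketch that assumes the guests run as qemu-kvm processes and that the numad package is installed:

# Show per-NUMA-node memory usage of all qemu-kvm processes (compact output)
numastat -c qemu-kvm

# Start the numad daemon so it can automatically align guest CPUs and memory
numad

# After numad has had time to migrate the guests, check the alignment again
numastat -c qemu-kvm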
9.3.1. Monitoring Memory per host NUMA Node
You can use the nodestats.py script to report the total memory and free memory for each NUMA node on a host. This script also reports how much memory is strictly bound to certain host nodes for each running domain. For example:
This example shows four host NUMA nodes, each containing approximately 4GB of RAM in total (MemTotal). Nearly all memory is consumed on each node (MemFree). There are four domains (virtual machines) running: domain 'rhel7-0' has 1.5GB of memory which is not pinned onto any specific host NUMA node. Domain 'rhel7-2', however, has 4GB of memory and 4 NUMA nodes which are pinned 1:1 to host nodes.
To print host NUMA node statistics, create a nodestats.py script for your environment. An example script can be found in the libvirt-python package files at /usr/share/doc/libvirt-python-version/examples/nodestats.py. The specific path to the script can be displayed by using the rpm -ql libvirt-python command.
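For example, a minimal sketch of locating and running the packaged example script; the directory name varies with the installed libvirt-python version, and the script is run here with the system Python interpreter:

# Locate the example script shipped with the libvirt-python package
rpm -ql libvirt-python | grep nodestats.py

# Run it to print memory statistics per host NUMA node
python /usr/share/doc/libvirt-python-*/examples/nodestats.py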
9.3.2. NUMA vCPU Pinning
vCPU pinning provides similar advantages to task pinning on bare metal systems. Since vCPUs run as user-space tasks on the host operating system, pinning increases cache efficiency. One example of this is an environment where all vCPU threads are running on the same physical socket, therefore sharing a L3 cache domain.
Note
In Red Hat Enterprise Linux versions 7.0 to 7.2, it is only possible to pin active vCPUs. However, with Red Hat Enterprise Linux 7.3, pinning inactive vCPUs is available as well.
Combining vCPU pinning with numatune can avoid NUMA misses. The performance impacts of NUMA misses are significant, generally starting at a 10% performance hit or higher. vCPU pinning and numatune should be configured together.
If the virtual machine is performing storage or network I/O tasks, it can be beneficial to pin all vCPUs and memory to the same physical socket that is physically connected to the I/O adapter.
Note
The lstopo tool can be used to visualize NUMA topology. It can also help verify that vCPUs are binding to cores on the same physical socket. See the following Knowledgebase article for more information on lstopo: https://access.redhat.com/site/solutions/62879.
Important
Pinning causes increased complexity where there are many more vCPUs than physical cores.
The following example XML configuration has a domain process pinned to physical CPUs 0-7. Each vCPU thread is pinned to its own cpuset. For example, vCPU0 is pinned to physical CPU 0, vCPU1 is pinned to physical CPU 1, and so on:
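A sketch of such a configuration (the host CPU numbers are illustrative and should match your host topology):

<vcpu cpuset='0-7'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/>
  <vcpupin vcpu='5' cpuset='5'/>
  <vcpupin vcpu='6' cpuset='6'/>
  <vcpupin vcpu='7' cpuset='7'/>
</cputune>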
There is a direct relationship between the vcpu and vcpupin tags. If a vcpupin option is not specified, the value will be automatically determined and inherited from the parent vcpu tag option. The following configuration shows <vcpupin> for vcpu 5 missing. Hence, vCPU5 would be pinned to physical CPUs 0-7, as specified in the parent tag <vcpu>:
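The same sketch with the <vcpupin> entry for vCPU 5 left out, for illustration:

<vcpu cpuset='0-7'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/>
  <vcpupin vcpu='6' cpuset='6'/>
  <vcpupin vcpu='7' cpuset='7'/>
</cputune>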
Important
<vcpupin>, <numatune>, and <emulatorpin> should be configured together to achieve optimal, deterministic performance. For more information on the <numatune> tag, see Section 9.3.3, “Domain Processes”. For more information on the <emulatorpin> tag, see Section 9.3.6, “Using emulatorpin”.
9.3.3. Domain Processes
As provided in Red Hat Enterprise Linux, libvirt uses libnuma to set memory binding policies for domain processes. The nodeset for these policies can be configured either as static (specified in the domain XML) or auto (configured by querying numad). See the following XML configuration for examples on how to configure these inside the <numatune> tag:
<numatune>
  <memory mode='strict' placement='auto'/>
</numatune>
<numatune>
  <memory mode='strict' nodeset='0,2-3'/>
</numatune>
libvirt uses sched_setaffinity(2) to set CPU binding policies for domain processes. The cpuset option can either be static (specified in the domain XML) or auto (configured by querying numad). See the following XML configuration for examples on how to configure these inside the <vcpu> tag:
<vcpu placement='auto'>8</vcpu>
<vcpu placement='static' cpuset='0-10,^5'>8</vcpu>
There are implicit inheritance rules between the placement mode you use for <vcpu> and <numatune>:
- The placement mode for <numatune> defaults to the same placement mode of <vcpu>, or to static if a <nodeset> is specified.
- Similarly, the placement mode for <vcpu> defaults to the same placement mode of <numatune>, or to static if <cpuset> is specified.
This means that CPU tuning and memory tuning for domain processes can be specified and defined separately, but they can also be configured to be dependent on the other's placement mode.
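As a sketch of the first rule: in the following fragment, <numatune> gives neither a placement nor a nodeset, so its placement is inherited from <vcpu> and resolves to 'auto'. Specifying a nodeset instead would make the memory placement default to 'static'.

<vcpu placement='auto'>8</vcpu>
<numatune>
  <memory mode='strict'/>
</numatune>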
It is also possible to configure your system with numad to boot a selected number of vCPUs without pinning all vCPUs at startup.
For example, to enable only 8 vCPUs at boot on a system with 32 vCPUs, configure the XML similar to the following:
<vcpu placement='auto' current='8'>32</vcpu>
Note
See the following URLs for more information on vcpu and numatune: http://libvirt.org/formatdomain.html#elementsCPUAllocation and http://libvirt.org/formatdomain.html#elementsNUMATuning
9.3.4. Domain vCPU Threads
In addition to tuning domain processes, libvirt also permits the setting of the pinning policy for each vcpu thread in the XML configuration. Set the pinning policy for each vcpu thread inside the <cputune> tags:
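A minimal sketch, assuming a four-vCPU guest; the host CPU numbers are illustrative:

<cputune>
  <vcpupin vcpu='0' cpuset='1'/>
  <vcpupin vcpu='1' cpuset='5'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='6'/>
</cputune>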
In this tag, libvirt uses either cgroup or sched_setaffinity(2) to pin the vcpu thread to the specified cpuset.
Note
For more details on <cputune>, see the following URL: http://libvirt.org/formatdomain.html#elementsCPUTuning
In addition, if you need to set up a virtual machine with more vCPUs than a single NUMA node can supply, configure the host so that the guest detects a NUMA topology on the host. This allows for 1:1 mapping of CPUs, memory, and NUMA nodes. For example, this can be applied with a guest with 4 vCPUs and 6 GB memory, and a host with the following NUMA settings:
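For illustration, assume a host layout along these lines (the node sizes and CPU numbering are hypothetical):

4 available nodes (0-3)
Node 0: CPUs 0 4, size 4000 MiB
Node 1: CPUs 1 5, size 3999 MiB
Node 2: CPUs 2 6, size 4001 MiB
Node 3: CPUs 3 7, size 4005 MiB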
In this scenario, use the following Domain XML setting:
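A sketch of such a configuration under the hypothetical layout above: the guest's 6 GB of memory is split into two 3 GiB guest NUMA cells bound to host nodes 1 and 2, and each vCPU is pinned to a host CPU on the node that backs its cell:

<vcpu placement='static'>4</vcpu>
<cputune>
  <!-- vCPUs 0-1 (guest cell 0) run on host node 1 CPUs; vCPUs 2-3 (cell 1) on host node 2 CPUs -->
  <vcpupin vcpu="0" cpuset="1"/>
  <vcpupin vcpu="1" cpuset="5"/>
  <vcpupin vcpu="2" cpuset="2"/>
  <vcpupin vcpu="3" cpuset="6"/>
  <!-- keep the emulator threads on the same nodes as the vCPUs -->
  <emulatorpin cpuset="1-2"/>
</cputune>
<numatune>
  <memory mode="strict" nodeset="1-2"/>
  <!-- guest cell 0 is backed by host node 1, guest cell 1 by host node 2 -->
  <memnode cellid="0" mode="strict" nodeset="1"/>
  <memnode cellid="1" mode="strict" nodeset="2"/>
</numatune>
<cpu>
  <numa>
    <cell id="0" cpus="0-1" memory="3" unit="GiB"/>
    <cell id="1" cpus="2-3" memory="3" unit="GiB"/>
  </numa>
</cpu>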
9.3.5. Using Cache Allocation Technology to Improve Performance
You can make use of Cache Allocation Technology (CAT) provided by the kernel on specific CPU models. This enables allocation of part of the host CPU's cache for vCPU threads, which improves real-time performance.
See the following XML configuration for an example on how to configure vCPU cache allocation inside the cachetune tag:
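A sketch of what such a configuration can look like; it assumes a host CPU with CAT support, and the allocation sizes are illustrative:

<cputune>
  <cachetune vcpus='0-1'>
    <!-- 3 MiB of the first L3 cache (id='0'), allocated separately for code and data -->
    <cache id='0' level='3' type='code' size='3' unit='MiB'/>
    <cache id='0' level='3' type='data' size='3' unit='MiB'/>
  </cachetune>
</cputune>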
The XML file above configures the thread for vCPUs 0 and 1 to have 3 MiB from the first L3 cache (level='3' id='0') allocated, once for the L3CODE and once for L3DATA.
Note
A single virtual machine can have multiple <cachetune> elements.
For more information, see cachetune in the upstream libvirt documentation.
9.3.6. Using emulatorpin
Another way of tuning the domain process pinning policy is to use the <emulatorpin> tag inside of <cputune>.
The <emulatorpin> tag specifies which host physical CPUs the emulator (a subset of a domain, not including vCPUs) will be pinned to. The <emulatorpin> tag provides a method of setting a precise affinity to emulator thread processes. As a result, vhost threads run on the same subset of physical CPUs and memory, and therefore benefit from cache locality. For example:
<cputune>
  <emulatorpin cpuset="1-3"/>
</cputune>
Note
In Red Hat Enterprise Linux 7, automatic NUMA balancing is enabled by default. Automatic NUMA balancing reduces the need for manually tuning <emulatorpin>, since the vhost-net emulator thread follows the vCPU tasks more reliably. For more information about automatic NUMA balancing, see Section 9.2, “Automatic NUMA Balancing”.
9.3.7. Tuning vCPU Pinning with virsh
Important
These are example commands only. You will need to substitute values according to your environment.
The following example virsh command pins the vcpu thread of the rhel7 guest that has an ID of 1 to physical CPU 2:
% virsh vcpupin rhel7 1 2
You can also obtain the current vcpu pinning configuration with the virsh command. For example:
% virsh vcpupin rhel7
9.3.8. Tuning Domain Process CPU Pinning with virsh
Important
These are example commands only. You will need to substitute values according to your environment.
The emulatorpin option applies CPU affinity settings to threads that are associated with each domain process. For complete pinning, you must use both virsh vcpupin (as shown previously) and virsh emulatorpin for each guest. For example:
% virsh emulatorpin rhel7 3-4
9.3.9. Tuning Domain Process Memory Policy with virsh
Domain process memory can be dynamically tuned. See the following example command:
% virsh numatune rhel7 --nodeset 0-10
More examples of these commands can be found in the virsh man page.
9.3.10. Guest NUMA Topology
Guest NUMA topology can be specified using the <numa> tag inside the <cpu> tag in the guest virtual machine's XML. See the following example, and replace values accordingly:
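A minimal sketch with two guest NUMA cells of 500 MiB each (the cell sizes and CPU ranges are illustrative):

<cpu>
  <numa>
    <cell cpus='0-3' memory='512000'/>
    <cell cpus='4-7' memory='512000'/>
  </numa>
</cpu>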
Each <cell> element specifies a NUMA cell or a NUMA node. cpus specifies the CPU or range of CPUs that are part of the node, and memory specifies the node memory in kibibytes (blocks of 1024 bytes). Each cell or node is assigned a cellid or nodeid in increasing order starting from 0.
Important
When modifying the NUMA topology of a guest virtual machine with a configured topology of CPU sockets, cores, and threads, make sure that cores and threads belonging to a single socket are assigned to the same NUMA node. If threads or cores from the same socket are assigned to different NUMA nodes, the guest may fail to boot.
Warning
Using guest NUMA topology simultaneously with huge pages is not supported on Red Hat Enterprise Linux 7 and is only available in layered products such as Red Hat Virtualization or Red Hat OpenStack Platform.
9.3.11. NUMA Node Locality for PCI Devices
When starting a new virtual machine, it is important to know both the host NUMA topology and the PCI device affiliation to NUMA nodes, so that when PCI passthrough is requested, the guest is pinned onto the correct NUMA nodes for optimal memory performance.
For example, if a guest is pinned to NUMA nodes 0-1, but one of its PCI devices is affiliated with node 2, data transfer between nodes will take some time.
In Red Hat Enterprise Linux 7.1 and above, libvirt reports the NUMA node locality for PCI devices in the guest XML, enabling management applications to make better performance decisions.
This information is visible in the sysfs files in /sys/devices/pci*/*/numa_node. One way to verify these settings is to use the lstopo tool to report sysfs data:
This output shows:
- NICs em* and disks sd* are connected to NUMA node 0 and cores 0,2,4,6,8,10,12,14.
- NICs p1* and ib* are connected to NUMA node 1 and cores 1,3,5,7,9,11,13,15.
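As a quick check from the shell, the sysfs values and the topology can be inspected directly; lstopo-no-graphics, from the hwloc package, prints the text form of the topology:

# Print the NUMA node recorded in sysfs for each PCI device (a value of -1 means none is reported)
grep . /sys/devices/pci*/*/numa_node

# Show the host topology, including where PCI devices attach, in text form
lstopo-no-graphics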