4.2. NUMA and Multi-Core Support
Originally, NUMA was a mechanism that connected single processors to multiple memory banks. As CPU manufacturers refined their processes and die sizes shrank, multiple CPU cores could be included in one package. These CPU cores were clustered so that each had equal access time to a local memory bank, and cache could be shared between the cores.
When multi-core packages are in use, applications should generally be bound to a socket or node rather than to a single CPU core, particularly if the application is multi-threaded. Binding multiple threads to a single CPU core is not recommended. First-level caches are generally quite small (around 32 KB), so when multiple threads run on a single CPU, the data each thread accesses can evict data that another thread accessed earlier. When the operating system switches between these threads, a large percentage of execution time is then spent on cache line replacement operations. This is known as cache thrashing. Binding an application to a node (and therefore to all CPUs belonging to that node) allows threads to share cache lines on multiple levels (first-, second- and third-level cache), minimizing the need for cache fill operations.
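As a rough illustration of node-level rather than core-level binding, the following Python sketch restricts a process to all CPUs of one node. It assumes a Linux system that exposes each node's CPUs in /sys/devices/system/node/nodeN/cpulist; the function names parse_cpulist and bind_to_node are hypothetical helpers, not part of any library. Note that this only sets CPU affinity and does not change where memory is allocated.

```python
import os


def parse_cpulist(text):
    """Expand a kernel CPU list such as "0-3,8-11" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list yields an empty set
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def bind_to_node(node, pid=0):
    """Restrict a process (pid 0 means the caller) to every CPU of one node.

    Assumption: the node's CPUs are listed in the sysfs file below.
    """
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(pid, cpus)  # Linux-only call
    return cpus
```

Calling bind_to_node(1) on the four-node example system later in this section would be expected to limit the calling process to CPUs 4, 5, 6, and 7, leaving the scheduler free to spread threads across those cores.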
The performance of a NUMA system can be improved primarily by ensuring that information travels efficiently. To do so, you must be aware of your system's topology: the CPUs, the memory banks, and the paths between them.
The system in Figure 4.1, “An example of NUMA topology” contains two NUMA nodes. Each node has four CPUs, a memory bank, and a memory controller. Any CPU on a node has direct access to the memory bank on that node. Following the arrows on Node 1, the steps are as follows:
A CPU (any of 0-3) presents the memory address to the local memory controller.
The memory controller sets up access to the memory address.
The CPU performs read or write operations on that memory address.
However, if a CPU on one node needs to access code or data that resides on the memory bank of a different NUMA node, the path it has to take is less direct:
A CPU (any of 0-3) presents the remote memory address to the local memory controller.
The CPU's request for that remote memory address is passed to a remote memory controller, local to the node containing that memory address.
The remote memory controller sets up access to the remote memory address.
The CPU performs read or write operations on that remote memory address.
Because every access in the second case must pass through more than one memory controller, accessing a remote memory address can take more than twice as long as accessing a local one. It is therefore important to ensure that information travels the shortest possible path.
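To get a feel for what remote access costs, the following Python sketch computes a weighted average access cost. It is a simplified model, not a measurement: the default costs (1.0 for local, 2.0 for remote) are assumptions that take the "twice as long" figure above as a lower bound, and the function name average_access_cost is hypothetical.

```python
def average_access_cost(remote_fraction, local_cost=1.0, remote_cost=2.0):
    """Average cost per memory access in arbitrary units.

    remote_fraction is the share of accesses (0.0 to 1.0) that go to a
    remote node. The default costs are illustrative assumptions only.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0.0 and 1.0")
    local_fraction = 1.0 - remote_fraction
    return local_fraction * local_cost + remote_fraction * remote_cost
```

Under these assumptions, an application that makes a quarter of its accesses to remote memory has an average cost of 1.25 units, that is, 25% more than an application whose memory is entirely local.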
The best way to ensure this is to use the numactl utility or numa(7) library calls to find the memory bank nearest to the CPU(s) on which your application runs, and to bind the application to that memory bank.
The numactl --hardware command lists the available hardware:
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 8189 MB
node 0 free: 7616 MB
node 1 cpus: 4 5 6 7
node 1 size: 8192 MB
node 1 free: 7756 MB
node 2 cpus: 8 9 10 11
node 2 size: 8192 MB
node 2 free: 7594 MB
node 3 cpus: 12 13 14 15
node 3 size: 8192 MB
node 3 free: 7756 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
This example system is a four-node NUMA system with 16 CPUs (four per node) and 32 GB of memory (8 GB per node).
The node distance matrix output by the numactl --hardware command shows the relative "cost" of accessing one node from another. In the example output, accessing node 0 from node 0 (a local access) has a cost of 10, while accessing any other node from node 0 has a cost of 20. This means that all other nodes are directly connected to node 0. Since the cost of accessing any node from any other node is 20, all four nodes in the example output are directly connected to each other.
Not every system has a connection between every node and every other node, so it is prudent to check the distance matrix on new systems to determine the relative costs of inter-node access.
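Checking the distance matrix can be scripted. The following Python sketch parses the "node distances:" part of numactl --hardware output that has already been captured as text, lists the other nodes from nearest to farthest, and reports whether all remote accesses cost the same. The function names and the SAMPLE string are illustrative assumptions, and the parser assumes the output layout shown above.

```python
SAMPLE = """\
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
"""


def parse_distances(output):
    """Return {from_node: {to_node: cost}} from `numactl --hardware` text."""
    lines = output.splitlines()
    start = next(i for i, line in enumerate(lines)
                 if line.strip() == "node distances:")
    # The header row names the destination nodes: "node 0 1 2 3".
    columns = [int(n) for n in lines[start + 1].split()[1:]]
    matrix = {}
    for line in lines[start + 2:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():
            break  # end of the matrix
        costs = [int(c) for c in rest.split()]
        matrix[int(label)] = dict(zip(columns, costs))
    return matrix


def nearest_nodes(matrix, node):
    """Other nodes ordered from cheapest to most expensive to reach."""
    others = [(cost, n) for n, cost in matrix[node].items() if n != node]
    return [n for cost, n in sorted(others)]


def uniform_remote_cost(matrix):
    """True if every access to a different node has the same cost."""
    remote = {cost for a, row in matrix.items()
              for b, cost in row.items() if a != b}
    return len(remote) <= 1
```

For the example system, uniform_remote_cost(parse_distances(SAMPLE)) is true, which matches the observation that every node is equally close to every other node. On a system where it is false, nearest_nodes indicates which node to prefer when local memory runs out.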
For example, to run an application using only the CPUs and memory of node 1:
$ numactl --membind 1 --cpunodebind 1 myapplication
This command runs a program called myapplication, binds it to memory node 1, and constrains its threads to only run on the CPUs on node 1 (CPUs 4, 5, 6, and 7). You could also do this using numa(7) calls from the application, but doing so reduces the administrator's ability to change the location of the application when deploying on different hardware.
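One way to keep placement under the administrator's control is to leave the node number out of the application and have a small launcher build the numactl command line. The following Python sketch shows the idea; it uses only the --membind and --cpunodebind options described above, and the environment variable APP_NUMA_NODE and both function names are hypothetical.

```python
import os


def numactl_command(argv, node):
    """Prefix argv so that it runs with CPUs and memory bound to one node."""
    node = str(int(node))  # raises ValueError for anything but a node number
    return ["numactl", "--membind", node, "--cpunodebind", node, *argv]


def run_on_node(argv, default_node=0):
    """Replace the current process with argv, bound to the configured node.

    Assumption: the administrator selects the node by setting the
    hypothetical APP_NUMA_NODE environment variable.
    """
    node = os.environ.get("APP_NUMA_NODE", default_node)
    os.execvp("numactl", numactl_command(argv, node))
```

With APP_NUMA_NODE=1 set in the environment, run_on_node(["myapplication"]) would execute numactl --membind 1 --cpunodebind 1 myapplication, and moving the application to different hardware only requires changing the variable.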