
Chapter 4. CPU

4.1. CPU and NUMA Topology
4.1.1. Using numactl and libnuma
4.2. NUMA and Multi-Core Support
4.3. NUMA enhancements in Red Hat Enterprise Linux 6
4.3.1. Bare-metal and scalability optimizations
4.3.2. Virtualization optimizations
4.4. CPU Scheduler
4.4.1. Realtime scheduling policies
4.4.2. Normal scheduling policies
4.4.3. Policy Selection
4.5. Tuned IRQs
The term CPU, which stands for central processing unit, is a misnomer for most systems: central implies single, whereas most modern systems have more than one processing unit, or core. Physically, CPUs are contained in a package attached to a motherboard in a socket. Each socket on the motherboard has various connections: to other CPU sockets, memory controllers, interrupt controllers, and other peripheral devices. To the operating system, a socket is a logical grouping of CPUs and associated resources. This concept is central to most of our discussions on CPU tuning.
Red Hat Enterprise Linux keeps a wealth of statistics about system CPU events; these statistics are useful in planning out a tuning strategy to improve CPU performance.

SMP and NUMA

Older computers had relatively few CPUs per system, which allowed an architecture known as Symmetric Multi-Processor (SMP). This meant that each CPU in the system had similar (or symmetric) access to available memory. In recent years, CPU count-per-socket has grown to the point that trying to give symmetric access to all RAM in the system has become very expensive. Most high CPU count systems these days have an architecture known as Non-Uniform Memory Access (NUMA) instead of SMP.
AMD processors have had this type of architecture for some time with their HyperTransport (HT) interconnects, while Intel has begun implementing NUMA in their QuickPath Interconnect (QPI) designs. NUMA and SMP systems are tuned differently, since on a NUMA system you need to account for the topology of the system when allocating resources for an application.

Threads

Inside the Linux operating system, the unit of execution is known as a thread. Threads have a register context, a stack, and a segment of executable code which they run on a CPU. It is the operating system's job to schedule these threads on the available CPUs.
The OS maximizes CPU utilization by load-balancing the threads across available cores. Since the OS is primarily concerned with keeping CPUs busy, it may not make optimal decisions with respect to application performance. Moving an application thread to a CPU on another socket may worsen performance more than simply waiting for the current CPU to become available, since memory access operations may slow drastically across sockets. For high-performance applications, it is usually better for the designer to determine where threads should be placed. Section 4.4, “CPU Scheduler” discusses how best to allocate CPUs and memory for executing application threads.

Interrupts

One of the less obvious (but nonetheless important) system events that can impact application performance is the interrupt (also known as an IRQ in Linux). Interrupts are handled by the operating system, and are used by peripherals to signal the arrival of data or the completion of an operation, such as a network write or a timer event.
The manner in which the OS or CPU that is executing application code handles an interrupt does not affect the application's functionality. However, it may impact the performance of the application. This chapter also discusses tips on preventing interrupts from adversely impacting application performance.

4.1. CPU and NUMA Topology

The first computer processors were uniprocessors, meaning that the system had a single CPU. The operating system created the illusion of executing processes in parallel by rapidly switching the single CPU from one thread of execution (process) to another. In the quest for increasing system performance, designers noted that increasing the clock rate to execute instructions faster only worked up to a point (usually the limit on creating a stable clock waveform with the technology of the time). In an effort to get more overall system performance, designers added another CPU to the system, allowing two parallel streams of execution. This trend of adding processors has continued over time.
Most early multiprocessor systems were designed so that each CPU had the same logical path to each memory location (usually a parallel bus). This let each CPU access any memory location in the same amount of time as any other CPU in the system. This type of architecture is known as a Symmetric Multi-Processor (SMP) system. SMP is fine for a small number of CPUs, but once the CPU count gets above a certain point (8 or 16), the number of parallel traces required to allow equal access to memory uses too much of the available board real estate, leaving less room for peripherals.
Two new concepts combined to allow for a higher number of CPUs in a system:
  1. Serial buses
  2. NUMA topologies
A serial bus is a single-wire communication path with a very high clock rate, which transfers data as packetized bursts. Hardware designers began to use serial buses as high-speed interconnects between CPUs, and between CPUs and memory controllers and other peripherals. This means that instead of requiring between 32 and 64 traces on the board from each CPU to the memory subsystem, there was now one trace, substantially reducing the amount of space required on the board.
At the same time, hardware designers were packing more transistors into the same space by reducing die sizes. Instead of putting individual CPUs directly onto the main board, they started packing them into a processor package as multi-core processors. Then, instead of trying to provide equal access to memory from each processor package, designers resorted to a Non-Uniform Memory Access (NUMA) strategy, where each package/socket combination has one or more dedicated memory areas for high-speed access. Each socket also has an interconnect to other sockets for slower access to the other sockets' memory.
As a simple NUMA example, suppose we have a two-socket motherboard, where each socket has been populated with a quad-core package. This means the total number of CPUs in the system is eight; four in each socket. Each socket also has an attached memory bank with four gigabytes of RAM, for a total system memory of eight gigabytes. For the purposes of this example, CPUs 0-3 are in socket 0, and CPUs 4-7 are in socket 1. Each socket in this example also corresponds to a NUMA node.
It might take three clock cycles for CPU 0 to access memory from bank 0: a cycle to present the address to the memory controller, a cycle to set up access to the memory location, and a cycle to read or write to the location. However, it might take six clock cycles for CPU 4 to access memory from the same location; because it is on a separate socket, it must go through two memory controllers: the local memory controller on socket 1, and then the remote memory controller on socket 0. If memory is contested on that location (that is, if more than one CPU is attempting to access the same location simultaneously), memory controllers need to arbitrate and serialize access to the memory, so memory access will take longer. Adding cache consistency (ensuring that local CPU caches contain the same data for the same memory location) complicates the process further.
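The two-socket example can be written down as a toy model. This is only a sketch of the illustrative numbers used above (three cycles for a local access, six for a remote one); the function and constant names are hypothetical, and real hardware latencies are neither this small nor this regular.

```python
# Toy model of the two-socket example: CPUs 0-3 sit on socket/node 0,
# CPUs 4-7 on socket/node 1. All numbers come from the example, not from
# measurements.
CORES_PER_SOCKET = 4
SOCKETS = 2
LOCAL_CYCLES = 3   # present address + set up access + read/write
REMOTE_CYCLES = 6  # same access routed through two memory controllers


def node_of_cpu(cpu: int) -> int:
    """Return the NUMA node (socket) that owns a CPU in the example layout."""
    if not 0 <= cpu < SOCKETS * CORES_PER_SOCKET:
        raise ValueError(f"CPU {cpu} is not part of the example system")
    return cpu // CORES_PER_SOCKET


def access_cycles(cpu: int, bank: int) -> int:
    """Uncontended cycles for a CPU to reach a memory bank in the example."""
    if not 0 <= bank < SOCKETS:
        raise ValueError(f"bank {bank} is not part of the example system")
    return LOCAL_CYCLES if node_of_cpu(cpu) == bank else REMOTE_CYCLES


if __name__ == "__main__":
    for cpu in (0, 4):
        print(f"CPU {cpu} -> bank 0: {access_cycles(cpu, 0)} cycles")
```

The model deliberately ignores contention and cache consistency, which the text identifies as further sources of delay.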
The latest high-end processors from both Intel (Xeon) and AMD (Opteron) have NUMA topologies. The AMD processors use an interconnect known as HyperTransport™ or HT, while Intel uses one named QuickPath Interconnect™ or QPI. The interconnects differ in how they physically connect to other interconnects, memory, or peripheral devices, but in effect they are a switch that allows transparent access to one connected device from another connected device. In this case, transparent refers to the fact that there is no special programming API required to use the interconnect, not a "no cost" option.
Because system architectures are so diverse, it is impractical to specifically characterize the performance penalty imposed by accessing non-local memory. We can say that each hop across an interconnect imposes a relatively constant performance penalty, so referencing a memory location that is two interconnect hops from the current CPU takes at least 2N + memory cycle time units, where N is the penalty per hop.
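The lower bound described here can be expressed as a one-line calculation. The sketch below assumes the simplified model from the text (a constant penalty N per hop plus the memory cycle time); the function name and the numbers in the usage line are hypothetical and chosen only for illustration.

```python
def min_remote_access_cost(hops: int, hop_penalty: float, memory_cycle: float) -> float:
    """Lower bound on access time: hops * N + memory cycle time.

    hops         -- number of interconnects between the CPU and the memory
    hop_penalty  -- N, the (assumed constant) penalty per hop
    memory_cycle -- time for the memory access itself
    All values use the same arbitrary time unit.
    """
    if hops < 0 or hop_penalty < 0 or memory_cycle < 0:
        raise ValueError("hops, hop_penalty and memory_cycle must be non-negative")
    return hops * hop_penalty + memory_cycle


if __name__ == "__main__":
    # Two hops with a made-up penalty of 5 units and a 3-unit memory cycle.
    print(min_remote_access_cost(2, 5, 3))
```

With zero hops the result reduces to the local memory cycle time, which matches the intuition that local access pays no interconnect penalty.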
Given this performance penalty, performance-sensitive applications should avoid regularly accessing remote memory in a NUMA topology system. The application should be set up so that it stays on a particular node and allocates memory from that node.
To do this, there are a few things that applications will need to know:
  1. What is the topology of the system?
  2. Where is the application currently executing?
  3. Where is the closest memory bank?
The topology of a system is how a system's components are connected: CPUs, memory, and peripheral devices. The /sys file system contains a wealth of information about how CPUs and devices are connected via NUMA interconnects.
The /sys/devices/system/cpu directory contains information about how CPUs are connected to one another. The /sys/devices/system/node directory contains information about the NUMA nodes in the system; in particular, the relative distances between nodes.
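As a minimal sketch of how an application might answer the first two questions, the following Python reads the per-node `cpulist` and `distance` files under /sys/devices/system/node and asks the kernel which CPUs the current process may run on. The helper names are hypothetical, and the code assumes a Linux kernel that exposes these files; on a system without them it returns an empty result rather than failing.

```python
import os
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")


def parse_cpulist(text: str) -> list:
    """Expand a kernel CPU list such as '0-3,8-11' into individual CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means the node has no CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_topology(root: Path = NODE_ROOT) -> dict:
    """Map each NUMA node number to its CPUs and its distances to all nodes."""
    topology = {}
    if not root.is_dir():
        return topology
    for entry in root.glob("node[0-9]*"):
        node_id = int(entry.name[len("node"):])
        topology[node_id] = {
            "cpus": parse_cpulist((entry / "cpulist").read_text()),
            "distances": [int(d) for d in (entry / "distance").read_text().split()],
        }
    return topology


def allowed_cpus() -> list:
    """CPUs the calling process may run on (empty list where unsupported)."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return []


if __name__ == "__main__":
    print("allowed CPUs:", allowed_cpus())
    for node, info in sorted(read_topology().items()):
        print(f"node {node}: cpus={info['cpus']} distances={info['distances']}")
```

Each entry in a node's distance list corresponds to one node in the system, so the smallest value normally identifies the node's own (closest) memory.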

4.1.1. Using numactl and libnuma

Red Hat Enterprise Linux provides two methods for modifying how an application runs on a NUMA system. The first is a low-level library named libnuma, which contains functions for determining the NUMA topology of the host system, setting the CPUs upon which a thread can execute, and setting which NUMA nodes a thread should use when allocating memory.
The second method is numactl, a utility program that uses libnuma to set up NUMA execution parameters for an arbitrary application. numactl takes that application (and the arguments of that application) as an argument of its own. numactl applies the NUMA options passed to it and then starts the specified application; for example:
numactl --cpubind=0 --membind=0,1 myprogram arg1 arg2
This example tells the operating system to execute the program myprogram with arguments arg1 and arg2, to bind the myprogram process to NUMA node 0, and to only allocate memory for myprogram from NUMA nodes 0 and 1.
Two useful options to numactl are the --show and --hardware options. These are information display options, and are not used when executing programs on specific nodes. The --show option shows the current NUMA policy defaults, while the --hardware option shows the available NUMA nodes, as well as the CPUs and memory banks available on each node.
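If these two reports need to be collected from a script, for example to record the NUMA layout alongside benchmark results, a small wrapper is enough. This is a sketch only: the function name is hypothetical, it assumes numactl may or may not be installed, and it returns the raw text without attempting to parse it.

```python
import shutil
import subprocess

# The two information-only options discussed above.
INFO_OPTIONS = ("--show", "--hardware")


def numactl_info(option: str):
    """Return the text printed by `numactl <option>`, or None if numactl is absent."""
    if option not in INFO_OPTIONS:
        raise ValueError(f"expected one of {INFO_OPTIONS}, got {option!r}")
    if shutil.which("numactl") is None:
        return None  # numactl is not installed or not on PATH
    result = subprocess.run(
        ["numactl", option], capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    for option in INFO_OPTIONS:
        report = numactl_info(option)
        print(report if report is not None else "numactl not found")
```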
Avoid making direct libnuma calls from within an application; doing so ties the application to a particular NUMA platform, or requires further application code to support various NUMA topologies. A more practical solution is to start your application with numactl and specify parameters that bind your threads to particular nodes/CPUs, and allocate memory from local memory banks. This lets you experiment with various NUMA settings to find those that yield the best performance for your application on a given platform. When the next generation of hardware is deployed, you can retest to verify that the settings are still valid, and, if not, adjust them without any changes to application code.
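One way to run such experiments is to generate the numactl command lines from a launcher script, so the application itself never changes. The sketch below builds argument lists only; the helper names are hypothetical, and it uses the `--cpunodebind` and `--membind` options described in numactl(8). Whether a given binding helps is something only measurement on the target hardware can show.

```python
def numactl_command(program, args=(), cpu_nodes=(0,), mem_nodes=(0,)):
    """Build an argv list that starts `program` under numactl with node bindings."""
    if not cpu_nodes or not mem_nodes:
        raise ValueError("at least one CPU node and one memory node are required")
    cpu_spec = ",".join(str(node) for node in cpu_nodes)
    mem_spec = ",".join(str(node) for node in mem_nodes)
    return [
        "numactl",
        f"--cpunodebind={cpu_spec}",  # run only on CPUs of these nodes
        f"--membind={mem_spec}",      # allocate memory only from these nodes
        program,
        *args,
    ]


def candidate_commands(program, args, nodes):
    """One command per node, each keeping CPUs and memory on the same node."""
    return [numactl_command(program, args, (node,), (node,)) for node in nodes]


if __name__ == "__main__":
    for argv in candidate_commands("myprogram", ["arg1", "arg2"], [0, 1]):
        print(" ".join(argv))
```

Each returned list can be handed to `subprocess.run` and timed, and the set of candidates can be regenerated when the hardware changes.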
For more information on numactl, refer to the numactl(8) man page: man 8 numactl. For information about the functions available from within the libnuma shared library, read the numa(3) man page: man 3 numa.