Este contenido no está disponible en el idioma seleccionado.
Monitoring and managing system status and performance
Optimizing system throughput, latency, and power consumption
Abstract
Providing feedback on Red Hat documentation Copiar enlaceEnlace copiado en el portapapeles!
We are committed to providing high-quality documentation and value your feedback. To help us improve, you can submit suggestions or report errors through the Red Hat Jira tracking system.
Procedure
Log in to the Jira website.
If you do not have an account, select the option to create one.
- Click Create in the top navigation bar.
- Enter a descriptive title in the Summary field.
- Enter your suggestion for improvement in the Description field. Include links to the relevant parts of the documentation.
- Click Create at the bottom of the window.
Chapter 1. Overview of performance monitoring options Copiar enlaceEnlace copiado en el portapapeles!
To monitor and optimize performance of your Red Hat Enterprise Linux system, you need to use specific tools.
- Performance Co-Pilot (pcp) is a suite of tools for monitoring, visualizing, storing, and analyzing system-level performance measurements. It supports the monitoring and management of real-time data, as well as logging and retrieval of historical data.
RHEL provides several command-line tools to monitor a system outside runlevel 5. The following are the built-in command-line tools:
The
procps-ngpackage provides following utilities:-
The
toputility gives a dynamic view of processes in a running system. It displays a variety of information, including a system summary and a list of tasks currently being managed by the Linux kernel. -
The
psutility captures a snapshot of the selected group of active processes. By default, the examined group is limited to processes that are owned by the current user and associated with the command line where thepscommand is executed. -
The virtual memory statistics (
vmstat) utility provides instant reports of processes, memory, paging, block input/output, interrupts, and CPU activity in the system.
-
The
-
The
sysstatpackage provides the system activity reporter (sar) utility, which collects and reports information about system activity for the current day.
-
perfuses hardware performance counters and kernel trace-points to track the impact of other commands and applications on a system. -
bcc-tools, a set of performance analysis utilities, built on top of Berkeley Packet Filter (BPF) Compiler Collection (BCC). It provides over 100 Extended BPF (eBPF) scripts that monitor kernel activities. For more information about each of these tools, see the man page describing how to use it and what functions it performs. -
The
kernel-toolspackage provides theturbostatutility that reports on processor topology, frequency, idle power-state statistics, temperature, and power usage on the Intel 64 processors. -
The
sysstatpackage provides theiostatutility that monitors and reports on loading of system I/O device to help administrators make decisions about how to balance I/O load between physical disks. -
The
irqbalanceutility distributes hardware interrupts across processors to improve system performance. -
The
numactlpackage provides thenumastatutility. By default,numastatdisplays per-node Non-Uniform Memory Access (NUMA) hit and miss system statistics from the kernel memory allocator. Highnuma_hitvalues and lownuma_missvalues indicate optimal performance. -
numadis an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within a system that dynamically improves NUMA resource allocation, management, and system performance. -
SystemTapis a programmable tracing/probing/debugging system for the kernel and userspace, and includes many brief scripts. -
valgrindruns uninstrumented userspace programs under supervision, to find memory errors, allocation statistics, concurrency violations. -
The
intel-cmt-catpackage provides thepqosutility to monitor and control CPU cache and memory bandwidth on recent Intel processors.
Chapter 2. Optimizing system performance with TuneD Copiar enlaceEnlace copiado en el portapapeles!
TuneD is a system tuning service designed to optimize system performance and power consumption by using predefined or custom profiles. It includes predefined profiles suited for different workloads, such as high throughput, low latency, and power saving.
2.1. TuneD profiles distributed with RHEL Copiar enlaceEnlace copiado en el portapapeles!
During installation, TuneD automatically selects the most suitable profile based on the system type. For example, the throughput-performance is selected for compute nodes, virtual-guest for virtual machines, and balanced for general systems. This ensures optimal performance for the environment.
Profiles are divided mainly into two categories:
- power-saving profiles that reduce power consumption with minimal performance impact, and
- performance-boosting profiles that optimize system resources for improved speed and responsiveness.
Following is the list of notable profiles distributed with RHEL to select based on the system load:
balanced- It is the default power-saving profile and is intended to be a compromise between performance and power consumption. It uses auto-scaling and auto-tuning whenever possible.
powersaveIt is a profile for maximum power saving performance and can throttle the performance to minimize the actual power consumption. It enables USB autosuspend, Wi-Fi power saving, and Aggressive Link Power Management (ALPM) power savings for SATA host adapters. It also schedules multi-core power savings for systems with a low wakeup rate and activates the
ondemandgovernor. It enables AC97 audio power saving or, depending on your system, HDA-Intel power savings with a 10 seconds timeout. If your system contains a supported Radeon graphics card with enabled KMS, the profile configures it to automatic power saving. It changes theenergy_performance_preferenceattribute to thepowersaveorpower energysetting. It also changes thescaling_governorpolicy attribute to either theondemandorpowersaveCPU governor.NoteIn some cases, the balanced profile is more efficient than powersave. For tasks such as video transcoding, running at full power completes the job faster. It also enables the system to enter an idle state and switch to efficient power-saving modes sooner. Throttling the machine reduces power during the task but extends its duration, potentially increasing overall energy use. Thus, the balanced profile is often a better choice.
throughput-performance-
A server profile optimized for high throughput that disables power savings mechanisms. It also enables
sysctlsettings to improve the throughput performance of the disk and network IO. accelerator-performance-
A profile that contains the same tuning as the
throughput-performanceprofile. Additionally, it locks the CPU to low C states so that the latency is less than 100us. This improves the performance of certain accelerators, such as GPUs. latency-performance-
A server profile optimized for low latency and disables power savings mechanisms and enables
sysctlsettings that improve latency. CPU governor is set to performance and the CPU is locked to the low C states (by PM QoS). network-latency-
A profile for low latency network tuning. It is based on the
latency-performanceprofile. It additionally disables transparent huge pages and NUMA balancing, and tunes several other network-related sysctl parameters. hpc-compute-
A profile optimized for high-performance computing. It is based on the
latency-performanceprofile. network-throughput-
A profile for throughput network tuning. It is based on the
throughput-performanceprofile. It additionally increases kernel network buffers. virtual-guest-
A profile designed for Red Hat Enterprise Linux virtual machines and VMWare guests based on the
throughput-performanceprofile. Among other tasks, it decreases virtual memory swappiness and increases disk readahead values. It does not disable disk barriers. virtual-host-
A profile designed for virtual hosts based on the
throughput-performanceprofile. Among other tasks, it decreases virtual memory swappiness, increases disk readahead values, and enables a more aggressive value of dirty pages writeback. oracle-
A profile optimized for Oracle database loads based on the
throughput-performanceprofile. It additionally disables transparent huge pages and modifies other performance-related kernel parameters. This profile is provided by the tuned-profiles-oracle package. desktop-
A profile optimized for desktops, based on the
balancedprofile. It additionally enables scheduler autogroups for better response of interactive applications. optimize-serial-consoleA profile that tunes down I/O activity to the serial console by reducing the printk value. This should make the serial console more responsive. This profile is intended to be used as an overlay on other profiles. For example:
# tuned-adm profile throughput-performance optimize-serial-consolemssql-
A profile provided for Microsoft SQL Server. It is based on the
throughput-performanceprofile. intel-sstA profile optimized for systems with user-defined Intel Speed Select Technology configurations. This profile is intended to be used as an overlay on other profiles. For example:
# tuned-adm profile cpu-partitioning intel-sstawsA profile optimized for AWS EC2 instances. It is based on the
throughput-performanceprofile.For more information about these profiles, see the
tuned-profilesman page on your system.
2.2. Real-time TuneD profiles distributed with RHEL Copiar enlaceEnlace copiado en el portapapeles!
RHEL ships some profiles specifically for systems running the real-time kernel. Without a special kernel build, they do not configure the system to be real-time. On RHEL, the profiles are available with additional repositories.
realtime-
Use on bare metal real-time systems. This is provided by the
tuned-profiles-realtimepackage, which is available from the RT or NFV repositories. realtime-virtual-host-
Use in a virtualization host configured for real-time. This is provided by the
tuned-profiles-nfv-hostpackage, which is available from the NFV repository. realtime-virtual-guest-
Use in a virtualization guest configured for real-time. This is provided by the
tuned-profiles-nfv-guestpackage, which is available from the NFV repository.
2.3. Installing and enabling TuneD Copiar enlaceEnlace copiado en el portapapeles!
As a system administrator, use TuneD to optimize the performance profile for a variety of use cases. You must install and enable TuneD to start using it.
After installation, the /usr/lib/tuned/profiles directory stores distribution-specific profiles. Each profile is stored in its own subdirectory. The profile includes the profile configuration file, named tuned.conf and optionally other files, such as helper scripts. Also, to customize a profile, copy the profile directory into /etc/tuned/profiles, which stores the files related to the custom profiles. In cases where profiles with the same name exist, the custom profile located in /etc/tuned/profiles takes precedence.
Prerequisites
- You have administrative privileges.
Procedure
Install the TuneD package:
# dnf install tunedEnable and start the TuneD service:
# systemctl enable --now tunedOptional: Install TuneD profiles for real-time systems:
For the TuneD profiles for real-time systems enable the
rhel-10repository.# subscription-manager repos --enable=rhel-10-for-x86_64-nfv-beta-rpms# dnf install tuned-profiles-realtime tuned-profiles-nfv
Verification
Check the activation status of the profile:
$ tuned-adm activeCurrent active profile: throughput-performanceCheck if the profile is active and applied:
$ tuned-adm verifyVerification succeeded, current system settings match the preset profile. See tuned log file ('/var/log/tuned/tuned.log') for details.
2.4. Managing TuneD profiles Copiar enlaceEnlace copiado en el portapapeles!
You can manage TuneD profiles in Red Hat Enterprise Linux, including listing available profiles, setting a specific profile, and disabling TuneD when necessary. Proper management of profiles optimizes system performance for specific workloads, such as using the cpu-partitioning profile for low-latency applications.
Prerequisites
- You have installed and enabled TuneD.
- You have administrative privileges.
Procedure
List all the available predefined TuneD profiles:
$ tuned-adm listOptional: To let TuneD recommend the most suitable profile for your system, use the following command:
# tuned-adm recommendthroughput-performanceActivate the profile:
# tuned-adm profile selected-profileAlternatively, you can activate a combination of multiple profiles:
# tuned-adm profile selected-profile1 selected-profile2View the current active TuneD profile on your system:
# tuned-adm activeCurrent active profile: selected-profileDisable all tunings temporarily:
# tuned-adm offThe tunings are applied again after the TuneD service restarts.Optional: Stop and disable the TuneD service permanently:
# systemctl disable --now tuned
2.5. Profile inheritance and variables for profile customization Copiar enlaceEnlace copiado en el portapapeles!
TuneD supports profile inheritance, variable usage, and built-in functions to create new or modify existing profiles efficiently.
The pre-installed and custom profiles are available in the following directories:
-
Pre-installed profiles: Located in
/usr/lib/tuned/profiles/ Custom profiles: Created in
/etc/tuned/profiles/- Profile Inheritance
TuneD profiles can inherit settings from other profiles by using the include option in the [main] section. Profile inheritance means Child profiles inherit parent settings, but can override or add parameters for custom requirements. For example, to create a profile based on the balanced profile that additionally sets ALPM to
min_powerinstead ofmedium_power, use:[main] include=balanced [scsi_host] alpm=min_power- Variables’ usage while customizing profiles
You can define variables while customizing profiles. Variables help to simplify configurations by reducing repetitive definitions. Define your own variables by creating the
[variables]section in a profile and by using:[variables] variable_name=valueTo expand the value of a variable in a profile, use:
${variable_name}Store the variables in different files and use these variables while customizing profiles by adding the filepath to the variable file. For example, add the following lines to the configuration file to consider variables from the separate file:
tuned.conf: [variables] include=/etc/tuned/my-variables.conf [bootloader] cmdline=isolcpus=${isolated_cores}Where
isolated_cores=1,2is added in my-variable.conf file as follows:
my-variable.conf: isolated_cores=1,2
2.6. Built-in functions for profile customization Copiar enlaceEnlace copiado en el portapapeles!
You can use the built-in functions in TuneD profiles to dynamically expand at runtime when a profile is activated. Use built-in functions with TuneD variables to modify and process values dynamically within a profile.
You can extend TuneD with custom functions by creating and integrating custom Python functions as plugins.
- Syntax to start a built-in function
${f:function_name:argument_1:argument_2}To retrieve the directory path where the profile and the
tuned.conffile are located, use thePROFILE_DIRvariable. It uses the following syntax:${i:PROFILE_DIR}- Example of isolating CPU cores by using a built-in function
[variables] non_isolated_cores=0,3-5 [bootloader] cmdline=isolcpus=${f:cpulist_invert:${non_isolated_cores}}In this example, the
${non_isolated_cores}variable expands to0,3-5. Thecpulist_invertfunction inverts the CPU list. On a system with 6 CPUs,0,3-5inverts to1,2, resulting in the kernel booting with theisolcpus=1,2option.Expand Table 2.1. Available built-in functions: Function name Description assertion
Compare two arguments. If they do not match, the function logs text from the first argument and cancels profile loading.
assertion_non_equal
Compare two arguments. If they match, the function logs text from the first argument and cancels profile loading.
calc_isolated_cores
Calculates and returns isolated cores. The argument specifies how many cores per socket to reserve for housekeeping. If not specified, one core per socket is reserved for housekeeping and the rest are isolated.
check_net_queue_count
Checks whether the user has specified a queue count for net devices. If not, it returns the number of housekeeping CPUs.
cpuinfo_check
Checks regular expressions against /proc/cpuinfo. Accepts arguments in the form: REGEX1, STR1, REGEX2, STR2, …[, STR_FALLBACK]. If REGEX1 matches something in /proc/cpuinfo, it expands to STR1; if REGEX2 matches, it expands to STR2. It stops on the first match. If no regular expression matches, it expands to STR_FALLBACK or an empty string if no fallback is provided.
cpulist2devs
Converts a CPU list to device strings.
cpulist2hex
Converts a CPU list to a hexadecimal CPU mask.
cpulist2hex_invert
Converts a CPU list to a hexadecimal CPU mask and inverts it.
cpulist_invert
Inverts a list of CPUs to make its complement. For example, on a system with 4 CPUs (0-3), the inversion of the list 0,2,3 is 1.
cpulist_online
Checks whether the CPUs from the list are online. Returns the list containing only online CPUs.
cpulist_pack
Packs a CPU list in the form of 1,2,3,5 to 1-3,5.
cpulist_present
Checks whether the CPUs from the list are present. Returns the list containing only present CPUs.
cpulist_unpack
Unpacks a CPU list in the form of 1-3,4 to 1,2,3,4.
exec
Executes a process and returns its output.
hex2cpulist
Converts a hexadecimal CPU mask to a CPU list.
intel_recommended_pstate
Checks the processor code name and returns the recommended
intel_pstateCPUFreq driver mode. "Active" is returned for newer generation processors not in thePROCESSOR_NAMElist.iscpu_check
Checks regular expressions against the output of
iscpu. Accepts arguments in the form: REGEX1, STR1, REGEX2, STR2, …[, STR_FALLBACK]. If REGEX1 matches something in the output, it expands to STR1; if REGEX2 matches, it expands to STR2. It stops on the first match. If no regular expression matches, it expands toSTR_FALLBACKor an empty string if no fallback is provided.package2cpus
Provides a CPU device list for a package (socket).
package2uncores
Provides an uncore device list for a package (socket).
regex_search_ternary
Ternary regular expression operator. Takes arguments in the form: STR1, REGEX, STR2, STR3. If REGEX matches STR1 (
re.searchis used), STR2 is returned; otherwise, STR3 is returned.log
Expands to the concatenation of arguments and logs the result, useful for debugging.
kb2s
Converts KB to disk sectors.
s2kb
Converts disk sectors to KB.
strip
Creates a string from all passed arguments and deletes both leading and trailing white spaces.
virt_check
Checks whether TuneD is running inside a virtual machine (VM) or on bare metal. Inside a VM, the function returns the first argument. On bare metal, the function returns the second argument, even in case of an error.
2.7. TuneD plugins Copiar enlaceEnlace copiado en el portapapeles!
TuneD profiles use plugins to monitor or optimize different devices on the system.
TuneD uses two types of plugins:
- Monitoring plugins
- Used for gathering system data such as, CPU load, disk I/O, and network traffic. The output of the monitoring plugins can be used by tuning plugins for dynamic tuning. Monitoring plugins are automatically instantiated whenever their metrics are needed by any of the enabled tuning plugins. The available monitoring plugins are:
disk- Gets disk load (number of IO operations) per device and measurement interval.
net- Gets network load (number of transferred packets) per network card and measurement interval.
load- Gets CPU load per CPU and measurement interval.
- Tuning plugins
- Each tuning plugin tunes an individual subsystem and takes several parameters that are populated from the TuneD profiles. Each subsystem can have multiple devices, such as multiple CPUs or network cards, handled by individual instances of the tuning plugins. Specific settings for individual devices are also supported. The available tuning plugins are:
acpi-
Configures the ACPI driver. Use the
platform_profileoption to set the ACPI platform profilesysfsattribute. It is a genericpower/performancepreference API for other drivers. Multiple profiles can be specified, separated by|. The first available profile is selected. audio-
Sets the autosuspend timeout for audio codecs to the value specified by the timeout option. Currently, the
snd_hda_intelandsnd_ac97_codeccodecs are supported. The value 0 means that autosuspend is disabled. You can also enforce the controller reset by setting the Boolean option reset_controller totrue. bootloader-
Adds options to the kernel command line. This plugin supports only the GRUB boot loader. Customized non-standard location of the GRUB configuration file can be specified by the
grub2_cfg_fileoption. The kernel options are added to the current GRUB configuration and its templates. The system needs to be rebooted for the kernel options to take effect. cpu-
Manages CPU governor and power settings by setting the CPU governor to the value specified by the
governoroption. It dynamically changes the Power Management Quality of Service (PM QoS) CPU Direct Memory Access (DMA) latency, according to the CPU load. disk-
Manages disk settings such as
apm,scheduler_quantum,readahead,readahead_multiply,spindown. eeepc_she- Dynamically sets the front-side bus (FSB) speed according to the CPU load.
irq-
Handles individual interrupt requests (IRQ) as devices and multiple plugin instances can be defined, each addressing different devices or irqs. The device names used by the plugin are
irq<n>, where<n>is the IRQ number. The special deviceDEFAULTcontrols values written to/proc/irq/default_smp_affinity, which applies to all non-active IRQs. irqbalance-
Manages settings for
irqbalance. The plugin configures CPUs which should be skipped when rebalancing IRQs in/etc/sysconfig/irqbalance. It then restarts irqbalance only if it was previously running. modules-
Applies custom kernel modules options. It can set parameters to kernel modules and creates
/etc/modprobe.d/tuned.conffile. The syntax ismodule=option1=value1 option2=value2…wheremoduleis the module name andoptionx=valuexare module options which might be present. mounts- Enables or disables barriers for mounted filesystems.
net- Configures Wake-on-LAN and interface speed by using the same syntax as the ethtool utility. Also, dynamically changes the interface speed according to the interface utilization.
rtentsk- Avoids inter-processor interruptions caused by enabling or disabling static keys. It has no options. When included, TuneD keeps an open socket with timestamping enabled, thus keeping the static key enabled.
selinux-
Tunes SELinux options. SELinux decisions, such as allowing or denying access, are cached. This cache is known as the Access Vector Cache (AVC). When using these cached decisions, SELinux policy rules need to be checked less, which increases performance. The
avc_cache_thresholdoption adjusts the maximum number of AVC entries. systemd-
Tunes systemd options. The
cpu_affinityoption sets CPU affinity in/etc/systemd/system.conf. This configures the CPU affinity for the service manager and the default CPU affinity for all forked off processes. Add a comma-separated list of CPUs with optional CPU ranges specified by the minus sign (-). scsi_host-
Tunes SCSI host settings (for example,
ALPM). scheduler- Offers a variety of options for the tuning of scheduling priorities, CPU core isolation, and process, thread, and IRQ affinities.
script- Runs external script or binary when loading or unloading a profile.
service- Handles various sysvinit, sysv-rc, openrc, and systemd services specified by the plugin options. Supported service handling commands are start, stop, enable, and disable.
sysctl-
Modifies kernel parameters. Use this plugin only if you want to change system settings that are not covered by other plugins available in TuneD. The syntax is name=value, where name is the same as the name provided by the
sysctlutility. sysfs-
Sets various
sysfssettings specified by the plugin options. The syntax is name=value, where name is thesysfspath to use. usb-
Adjusts USB autosuspend timeout. The value
0means that autosuspend is disabled. uncore-
Limits the maximum and minimum uncore frequency. The options
max_freq_khz,min_freq_khzcorrespond tosysfsfiles exposed by Intel uncore frequency driver. Their values can be specified in kHz or as a percentage of their configurable range. video-
Sets various powersave levels on video cards. Currently, only the
Radeoncards are supported. Thepowersavelevel can be specified by using theradeon_powersaveoption. Supported values are, default, auto, low, mid, high, dynpm, dpm-battery, dpm-balanced, and dpm-perfomance. vm-
Enables or disables transparent huge pages. Valid values for the transparent_hugepages option are:
always, never, madvise.
2.8. Creating a new TuneD profile Copiar enlaceEnlace copiado en el portapapeles!
You can create a new TuneD profile to customize your requirements for performance optimization.
Prerequisites
- The TuneD service is running. See Installing and enabling TuneD for details.
Procedure
In the
/etc/tuned/profilesdirectory, create a new directory named the same as the profile that you want to create:# mkdir /etc/tuned/profiles/my-profile-
In the new directory, create a file named
tuned.conf. Edit the file and add a
[main]section and plug in definitions in it, according to your requirements. For example, see the configuration of thebalancedprofile:[main] summary=General non-specialized TuneD profile [cpu] governor=conservative energy_perf_bias=normal [audio] timeout=10 [video] radeon_powersave=dpm-balanced, auto [scsi_host] alpm=medium_powerActivate the profile.
# tuned-adm profile my-profile
Verification
View the profile is active and applied:
$ tuned-adm activeCurrent active profile: my-profile$ tuned-adm verifyVerification succeeded, current system settings match the preset profile. See tuned log file ('/var/log/tuned/tuned.log') for details.
2.9. Modifying an existing TuneD profile Copiar enlaceEnlace copiado en el portapapeles!
You can modify the parameters of an existing profile to suit your custom requirements. A modified child profile can be created based on an existing TuneD profile.
Prerequisites
- The TuneD service is running. See Installing and enabling TuneD for details.
Procedure
In the
/etc/tuned/profilesdirectory, create a new directory named the same as the profile that you want to create:# mkdir /etc/tuned/profiles/modified-profileIn the new directory, create a file named
tuned.conf, and set the [main] section as follows:[main] include=parent-profileReplace parent-profile with the name of the profile you are modifying, for example:
throughput-performanceInclude your profile modifications. For example, to lower swappiness in the
throughput-performanceprofile, change the value ofvm.swappinessto5, instead of the default10, use:[main] include=throughput-performance [sysctl] vm.swappiness=5Activate the profile.
# tuned-adm profile modified-profile
Verification
View the profile is active and applied:
$ tuned-adm activeCurrent active profile: modified-profile$ tuned-adm verifyVerification succeeded, current system settings match the preset profile. See tuned log file ('/var/log/tuned/tuned.log') for details.
2.10. Enabling TuneD no-daemon mode Copiar enlaceEnlace copiado en el portapapeles!
You can run TuneD in no-daemon mode, which does not require any resident memory. In this mode, TuneD applies the settings and then exits. However, note that this mode disables certain features, including D-Bus and hot-plug support, and rollback functionality for settings.
Prerequisites
- You have root privileges or appropriate permissions to modify the system configuration.
Procedure
Edit the
/etc/tuned/tuned-main.conffile with the following changes:sudo vi /etc/tuned/tuned-main.confdaemon = 0- Save and close the file.
Restart the tuned service or reapply or switch profile:
# systemctl restart tuned# tuned-adm profile <tuned_profile>Verify the tuned status:
# tuned-adm activeIt seems that tuned daemon is not running, preset profile is not activated. Preset profile: <tuned_profile>
Chapter 3. Setting the disk scheduler Copiar enlaceEnlace copiado en el portapapeles!
The disk scheduler is responsible for ordering the I/O requests submitted to a storage device. You can configure the scheduler in several different ways:
- Set the scheduler using TuneD.
-
Set the scheduler using
udev. - Temporarily change the scheduler on a running system.
In Red Hat Enterprise Linux, block devices support only multi-queue scheduling. This enables the block layer performance to scale well with fast solid-state drives (SSDs) and multi-core systems.
3.1. Available disk schedulers Copiar enlaceEnlace copiado en el portapapeles!
Choosing the appropriate disk scheduler significantly impacts system performance, responsiveness, and the efficiency of I/O operations. Evaluate each scheduler’s characteristics to select the most suitable option for your specific workload and hardware.
The following multi-queue disk schedulers are supported in Red Hat Enterprise Linux:
none- Implements a first-in first-out (FIFO) scheduling algorithm. It merges requests at the generic block layer through a simple last-hit cache.
mq-deadlineAttempts to provide a guaranteed latency for requests from the point, at which requests reach the scheduler. The
mq-deadlinescheduler sorts queued I/O requests into a read or write batch. Later, it schedules them in an increasing logical block addressing (LBA) order. By default, read batches take precedence over write batches, because applications get block on read I/O operations. After processing a batch,mq-deadlinechecks for write starvation. It then schedules the next read or write batch accordingly.This scheduler is suitable for most use cases, but particularly those in which write operations are mostly asynchronous.
bfqTargets desktop systems and interactive tasks. The
bfqscheduler ensures that a single application is never using all of the bandwidth. In effect, the storage device is always as responsive as if it was idle. In its default configuration,bfqfocuses on delivering the lowest latency rather than achieving the maximum throughput.bfqis based oncfqcode. The scheduler assigns a budget of sectors to each process instead of granting a fixed time slice of disk access. This scheduler is suitable when copying large files do not make the system unresponsive.kyber- The scheduler achieves a latency goal by calculating the latency of every request in the block I/O layer. You can configure the target latencies for read, in the case of cache-misses, and synchronous write requests. This scheduler is suitable for fast devices, for example NVMe, SSD, or other low latency devices.
- Default disk scheduler
- Block devices use the default disk scheduler unless you specify another scheduler.
For non-volatile Memory Express (NVMe) block devices specifically, the default scheduler is none and Red Hat recommends not changing this.
The kernel selects a default disk scheduler based on the type of device. The automatically selected scheduler is typically the optimal setting. If you require a different scheduler, use udev rules or the TuneD application to configure it. Match the selected devices and switch the scheduler only for those devices.
3.2. Different disk schedulers for different use cases Copiar enlaceEnlace copiado en el portapapeles!
Use these disk schedulers as a starting point before performing analysis and tuning.
| Use case | Disk scheduler |
|---|---|
| Traditional HDD with a SCSI interface |
Use |
| High-performance SSD or a CPU-bound system with fast storage |
Use |
| Desktop or interactive tasks |
Use |
| Virtual guest |
Use |
3.3. Determining the active disk scheduler Copiar enlaceEnlace copiado en el portapapeles!
You can view the disk scheduler that is currently active on a given block device.
Procedure
Read the content of the
/sys/block/device/queue/schedulerfile:# cat /sys/block/device/queue/scheduler[mq-deadline] kyber bfq noneIn the file name, replace device with the block device name, for example
sdc. The active scheduler is listed in square brackets ([ ]).
3.4. Setting the disk scheduler using TuneD Copiar enlaceEnlace copiado en el portapapeles!
You can create and enable a TuneD profile that sets a given disk scheduler for selected block devices. The setting persists across system reboots. In the following commands and configuration, replace:
-
device with the name of the block device, for example
sdfand - selected-scheduler with the disk scheduler that you want to set for the device, for example bfq.
Prerequisites
- The TuneD service is installed and enabled. For details, see Installing and enabling TuneD.
Procedure
- Optional: Select an existing TuneD profile on which your profile will be based. For a list of available profiles, see TuneD profiles distributed with RHEL.
To see which profile is currently active, use:
# tuned-adm activeCreate a new directory to hold your TuneD profile:
# mkdir /etc/tuned/<new-profile>Find the system unique identifier of the selected block device:
# udevadm info --query=property --name=/dev/device | grep -E '(WWN|SERIAL)'ID_WWN=0x5002538d00000000_ ID_SERIAL=Generic-_SD_MMC_20120501030900000-0:0 ID_SERIAL_SHORT=20120501030900000NoteThis command returns all values identified as a World Wide Name (WWN) or serial number associated with the specified block device. Although using a WWN is preferred, it is not always available for every device. In such cases, you can use any value returned by the example command as the device system’s unique ID.
Create the
/etc/tuned/my-profile/tuned.confconfiguration file. In the file, set the following options:Optional: Include an existing profile:
[main] include=<existing-profile>Set the selected disk scheduler for the device that matches the WWN identifier:
[disk] devices_udev_regex=IDNAME=device system unique id elevator=selected-schedulerHere:
- Replace IDNAME with the name of the identifier being used (for example, ID_WWN).
Replace device system unique id with the value of the chosen identifier (for example,
0x5002538d00000000). To match multiple devices in the devices_udev_regex option, enclose the identifiers in parentheses and separate them with vertical bars:devices_udev_regex=(ID_WWN=0x5002538d00000000)|(ID_WWN=0x1234567800000000)
Enable your profile:
# tuned-adm profile <new-profile>
Verification
Verify that the TuneD profile is active and applied:
$ tuned-adm activeCurrent active profile: <profile>$ tuned-adm verifyVerification succeeded, current system settings match the preset profile. See TuneD log file ('/var/log/tuned/tuned.log') for details.Read the contents of the
/sys/block/device/queue/schedulerfile:$ cat /sys/block/device/queue/scheduler[mq-deadline] kyber bfq noneIn the file name, replace device with the block device name, for example sdc. The active scheduler is listed in square brackets (
[]).
3.5. Setting the disk scheduler using udev rules Copiar enlaceEnlace copiado en el portapapeles!
You can set a given disk scheduler for specific block devices by using the udev rules. The setting persists across system reboots. In the following commands and configuration, replace:
-
device with the name of the block device, for example
sdfand -
selected-scheduler with the disk scheduler that you want to set for the device, for example
bfq.
Procedure
Find the system unique identifier of the block device:
# $ udevadm info --name=/dev/device | grep -E '(WWN|SERIAL)'E: ID_WWN=0x5002538d00000000 E: ID_SERIAL=Generic-_SD_MMC_20120501030900000-0:0 E: ID_SERIAL_SHORT=20120501030900000NoteThis command returns all values identified as a World Wide Name (WWN) or serial number associated with the specified block device. Although using a WWN is preferred, it is not always available for every device. In such cases, you can use any value returned by the example command as the device system’s unique ID.
To configure the
udevrule, create the/etc/udev/rules.d/99-scheduler.rulesfile with the following content:ACTION=="add|change", SUBSYSTEM=="block", ENV{IDNAME}=="device system unique id", ATTR{queue/scheduler}="selected-scheduler"Here:
-
Replace IDNAME with the name of the identifier being used (for example,
ID_WWN). -
Replace device system unique id with the value of the chosen identifier (for example,
0x5002538d00000000).
-
Replace IDNAME with the name of the identifier being used (for example,
Reload
udevrules:# udevadm control --reload-rulesApply the scheduler configuration:
# udevadm trigger --type=devices --action=change
Verification
Verify the active scheduler:
# cat /sys/block/device/queue/scheduler
3.6. Temporarily setting a scheduler for a specific disk Copiar enlaceEnlace copiado en el portapapeles!
You can set a given disk scheduler for specific block devices. The setting does not persist across system reboots.
Procedure
Write the name of the selected scheduler to the
/sys/block/device/queue/schedulerfile:# echo selected-scheduler > /sys/block/device/queue/schedulerIn the file name, replace device with the block device name, for example sdc.
Verification
Verify that the scheduler is active on the device.
# cat /sys/block/device/queue/scheduler
Chapter 4. Setting up PCP Copiar enlaceEnlace copiado en el portapapeles!
Performance Co-Pilot (PCP) is a suite of tools, services, and libraries for managing and measuring system-level performance. You use Python, Perl, C++, and C interfaces to add performance metrics. Analysis tools use Python, C++, and C client APIs directly. Web applications can access performance data through a JSON interface. To analyze data patterns,compare live results with archived data.
- Features of PCP
- Light-weight distributed architecture useful during the centralized analysis of complex systems.
- Ability to monitor and manage real-time data.
- Ability to log and retrieve historical data.
- PCP has the following components
- The Performance Metric Collector Daemon (pmcd) collects performance data from the installed Performance Metric Domain Agents (PMDA). PMDAs can be individually loaded or unloaded on the system and are controlled by the PMCD on the same host.
-
Various client tools, such as
pminfoorpmstat, can retrieve, display, archive, and process this data on the same host or over the network. -
The
pcpandpcp-system-toolspackages provide the command-line tools and core functionality. -
The
pcp-guipackage provides the graphical application pmchart. -
The
grafana-pcppackage provides powerful web-based visualizations and alerting with Grafana.
4.1. Installing and enabling PCP Copiar enlaceEnlace copiado en el portapapeles!
Install the required packages and enable the PCP monitoring services to start using it. You can also automate the PCP installation by using the pcp-zeroconf package. For more information about installing PCP by using pcp-zeroconf, see Setting up PCP with pcp-zeroconf.
Procedure
Install the pcp package:
# dnf install pcpEnable and start the pmcd service on the host machine:
# systemctl enable pmcd# systemctl start pmcd
Verification
Verify if the PMCD process is running on the host:
# pcpPerformance Co-Pilot configuration on arm10.local: platform: Linux arm10.local 6.12.0-55.13.1.el10_0.aarch64 #1 SMP PREEMPT_DYNAMIC Mon May 19 07:29:57 UTC 2025 aarch64 hardware: 4 cpus, 1 disk, 1 node, 3579MB RAM timezone: JST-9 services: pmcd pmcd: Version 6.3.7-1, 12 agents, 6 clients pmda: root pmcd proc pmproxy xfs linux nfsclient mmv kvm jbd2 dm openmetrics
4.2. Deploying a minimal PCP configuration Copiar enlaceEnlace copiado en el portapapeles!
The minimal PCP configuration collects performance statistics on Red Hat Enterprise Linux. The configuration involves adding the minimum number of packages on a production system needed to gather data for further analysis.
Analyze the resulting tar.gz file and the archive of the pmlogger output by using various PCP tools. Then you can compare them with other performance data sources.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
Procedure
Update the pmlogger configuration:
# pmlogconf -r /var/lib/pcp/config/pmlogger/config.defaultStart the pmcd and pmlogger services:
# systemctl start pmcd.service# systemctl start pmlogger.service- Run the required operations to record the performance data.
Save the output and save it to a tar.gz file named based on the host name and the current date and time:
# cd /var/log/pcp/pmlogger/# tar -czf $(hostname).$(date +%F-%Hh%M).pcp.tar.gz $(hostname)- Extract this file and analyze the data using PCP tools.
4.3. System services and tools distributed with PCP Copiar enlaceEnlace copiado en el portapapeles!
The basic package pcp includes the system services and basic tools. You can install additional tools that are provided with the pcp-system-tools, pcp-gui, and pcp-devel packages.
Roles of system services distributed with PCP
pmcd- The Performance Metric Collector Daemon.
pmie- The Performance Metrics Inference Engine.
pmlogger- The performance metrics logger.
pmproxy- The realtime and historical performance metrics proxy, time series query and REST API service.
Tools distributed with base PCP package
pcp- Displays the current status of a Performance Co-Pilot installation.
pcp-check-
Activates, or deactivates core and optional components, such as
pmcd,pmlogger,pmproxy, and PMDAs. pcp-vmstat- Provides a high-level system performance overview every 5 seconds. Displays information about processes, memory, paging, block IO, traps, and CPU activity.
pmconfig- Displays the values of configuration parameters.
pmdiff- Compares average metric values across two archives within a time window to identify potential performance regressions.
pmdumplog- Displays control, metadata, index, and state information from a Performance Co-Pilot archive file.
pmfind- Finds PCP services on the network.
pmie- An inference engine that periodically evaluates a set of arithmetic, logical, and rule expressions. The metrics are collected either from a live system, or from a Performance Co-Pilot archive file.
pmieconf- Displays or sets configurable pmie variables.
pmiectl- Manages non-primary instances of pmie.
pminfo- Displays information about performance metrics. The metrics are collected either from a live system, or from a Performance Co-Pilot archive file.
pmlc- Interactively configures active pmlogger instances.
pmlogcheck- Identifies invalid data in a Performance Co-Pilot archive file.
pmlogconf- Creates and modifies a pmlogger configuration file.
pmlogctl- Manages non-primary instances of pmlogger.
pmloglabel- Verifies, modifies, or repairs the label of a Performance Co-Pilot archive file.
pmlogsummary- Calculates statistical information about performance metrics stored in a Performance Co-Pilot archive file.
pmprobe- Determines the availability of performance metrics.
pmsocks- Provides access to a Performance Co-Pilot hosted through a firewall.
pmstat- Periodically displays a brief summary of system performance.
pmstore- Modifies the values of performance metrics.
pmseries- Fast, scalable time series querying, using the facilities of PCP and a distributed key-value data store such as Valkey.
pmtrace- Provides a command-line interface to the trace PMDA.
pmval- An updating display of the current value of any performance metric.
Tools distributed with the separately installed pcp-system-tools package
pcp-atop- Shows the system-level occupation of the most critical hardware resources from the performance point of view: CPU, memory, disk, and network.
pcp-atopsar- Generates a system-level activity report over a variety of system resource utilization. The report is generated from a raw logfile previously recorded using pmlogger or the -w option of pcp-atop.
pcp-dmcache- Displays information about configured Device Mapper Cache targets. Metrics include device IOPS, utilization, and read/write hit/miss rates for each cache device.
pcp-dstat- Displays metrics of one system at a time. To display metrics of multiple systems, use --host option.
pcp-free- Reports on free and used memory in a system.
pcp-htop-
Lists all processes running on a system along with their command-line arguments. It is similar to the
topcommand, with vertical and horizontal scrolling and mouse interaction. pcp-ipcs- Displays information about the inter-process communication (IPC) facilities that the calling process has read access for.
pcp-mpstat- Reports CPU and interrupt-related statistics.
pcp-numastat- Displays NUMA allocation statistics from the kernel memory allocator.
pcp-pidstat- Displays information about individual tasks or processes running on the system. This includes CPU percentage, memory and stack usage, scheduling, and priority. Reports live data for the local host by default.
pcp-shping- Samples and reports on the shell-ping service metrics exported by the pmdashping Performance Metrics Domain Agent (PMDA).
pcp-ss- Displays socket statistics collected by the pmdasockets PMDA.
pcp-tapestat- Reports I/O statistics for tape devices.
pcp-uptime- Displays the system uptime, currently logged on users, and system load averages for the past 1, 5, and 15 minutes.
pcp-verify- Inspects various aspects of a Performance Co-Pilot collector’s configuration and ensures it is optimized for specific operations.
pcp-iostat- Reports I/O statistics for SCSI devices (by default) or device-mapper devices (with the -x device-mapper option).
pmrep- Reports on selected, easily customizable, performance metrics values.
Tools distributed with the separately installed pcp-gui package
pmchart- Plots performance metrics values available through the facilities of the PCP.
pmdumptext- Outputs the values of performance metrics collected live or from a PCP archive.
Tools distributed with the separately installed pcp-devel package
pmclient- Displays high-level system performance metrics by using the Performance Metrics Application Programming Interface (PMAPI).
pmdbg- Displays available Performance Co-Pilot debug control flags and their values.
pmerr- Displays available Performance Co-Pilot error codes and their corresponding error messages.
pcp-xsos- Gives a fast summary report for a system by using a single sample taken from either a PCP archive or live metrics from the system.
Other tools distributed as a separate packages
pcp-geolocate- Discovers collector system geographical labels.
pcp2openmetrics- A customizable performance metrics exporter tool from PCP to Open Metrics format. You can select any live or archived PCP metric for exporting by using either command line arguments or a configuration file.
4.4. PCP deployment architectures Copiar enlaceEnlace copiado en el portapapeles!
PCP supports multiple deployment architectures, based on the scale of the PCP deployment, and offers many options to accomplish advanced configurations.
Available scaling deployment configuration variants, determined by sizing factors and configuration options, include the following:
- Localhost
- Each service runs locally on the monitored machine. Starting a service without any configuration changes results in a default standalone deployment on the localhost. This configuration does not support scaling beyond a single node. However, Valkey can also run in a highly available and scalable clustered mode, where data is shared across multiple hosts. You can also deploy a Valkey cluster in the cloud or use a managed Valkey cluster from a cloud provider.
- Decentralized
- The only difference between localhost and decentralized configuration is the centralized Valkey service. In this model, the host runs pmlogger service on each monitored host and retrieves metrics from a local pmcd instance. A local pmproxy service then exports the performance metrics to a central Valkey instance.
Figure 4.1. Decentralized logging
- Centralized logging - pmlogger farm
-
For resource-constrained hosts, use a
pmloggerfarm. This deployment is also called centralized logging. A single logger host runspmloggerprocesses to collect metrics from multiple remote hosts. The centralized logger host runs thepmproxyservice, which discovers the resulting PCP archives and loads the metric data into a Valkey instance.
Figure 4.2. Centralized logging - pmlogger farm
- Federated - multiple pmlogger farms
- For large scale deployments, deploy multiple pmlogger farms in a federated fashion. For example, one pmlogger farm per rack or data center. Each pmlogger farm loads the metrics into a central Valkey instance.
Figure 4.3. Federated - multiple pmlogger farms
By default, the deployment configuration for Valkey is standalone, localhost. However, Valkey can optionally perform in a highly-available and highly scalable clustered fashion, where data is shared across multiple hosts. Another option is to deploy a Valkey cluster in the cloud, or to use a managed Valkey cluster from a cloud provider.
4.5. Factors affecting scaling in PCP logging Copiar enlaceEnlace copiado en el portapapeles!
The key factors influencing Performance Co-Pilot (PCP) logging are hardware resources, logged metrics, logging intervals, and post-upgrade archive management.
- Remote system size
-
Remote hardware such as CPUs, disks, and network interfaces, determines the volume of data collected by each
pmloggerinstance. - Logged metrics
-
The number and types of logged metrics significantly affect storage requirements. In particular, the
per-process proc.*metrics require a large amount of disk space. For example, a standardpcp-zeroconflogging at 10-second intervals uses 11 MB. Enabling proc metrics increases this to 155 MB. Additionally, the number of CPUs, block devices, and network interfaces impacts storage capacity requirements. - Logging interval
-
The frequency of metric logging determines storage usage. The expected daily PCP archive file sizes for each
pmloggerinstance are recorded in thepmlogger.logfile. PCP archives typically compress at 10:1. Use this ratio to estimate long-term disk space requirements for uncompressed data. - Managing archive updates with
pmlogrewrite -
After PCP upgrades,
pmlogrewriteupdates existing archives if changes are detected in the metric metadata between versions. The time required for this process scales linearly with the number of stored archives.
Chapter 5. Configuring pmda-openmetrics Copiar enlaceEnlace copiado en el portapapeles!
Performance Co-Pilot (PCP) is a system for monitoring and managing system performance. It includes several built-in agents called Performance Metric Domain Agents (PMDAs). These agents collect metrics from applications and services, such as PostgreSQL, Apache HTTPD, and KVM virtual machines.
5.1. Overview of pmda-openmetrics Copiar enlaceEnlace copiado en el portapapeles!
You can run custom or less common applications for which no out-of-the-box PMDA exists in the Red Hat repositories. In such scenarios, the pmda-openmetrics agent helps to bridge the gap. The pmda-openmetrics PMDA exposes performance metrics from arbitrary applications by converting OpenMetrics-style formatted text files into PCP-compatible metrics. OpenMetrics is a widely adopted format used by Prometheus and other monitoring tools, which makes integration easier.
You can run pmda-openmetrics to do the following tasks:
- Monitor custom applications that are not covered by existing PMDAs.
-
Integrate existing OpenMetrics or
Prometheus-exportedmetrics into the PCP framework. - Create quick models or test metrics for diagnostic or demonstration purposes.
5.2. Installing and configuring pmda-openmetrics Copiar enlaceEnlace copiado en el portapapeles!
You must install and configure pmda-openmetrics before you start using it. The following example demonstrates how to expose a single numeric value from a text file as a PCP metric by using pmda-openmetrics.
Prerequisites
-
PCP is installed and
pmcdis running. For more information, see Setting up PCP with pcp-zeroconf.
Procedure
Install the
pmda-openmetricsPMDA.# dnf -y install pcp-pmda-openmetrics# cd /var/lib/pcp/pmdas/openmetrics/# ./InstallCreate a sample OpenMetrics file.
# echo 'var1 {var2="var3"} 42' > /tmp/example.txtReplace example.txt with the file name that you want to use.
Verify that the file is created correctly.
# cat /tmp/example.txtRegister the metric file path with the OpenMetrics agent.
# echo "file:///tmp/example.txt" > /etc/pcp/openmetrics/example.urlVerify that the configuration is created correctly.
# cat /etc/pcp/openmetrics/example.urlConfigure
systemd-tmpfilesto create the necessary symlinks.# echo 'L+ /var/lib/pcp/pmdas/openmetrics/config.d/example.url - - - - ../../../../../../etc/pcp/openmetrics/example.url' \ > /usr/lib/tmpfiles.d/pcp-pmda-openmetrics-cust.confVerify that the symlinks are configured correctly.
# cat /usr/lib/tmpfiles.d/pcp-pmda-openmetrics-cust.confCreate the symlinks by applying the tmpfiles configuration.
# systemd-tmpfiles --create --remove /usr/lib/tmpfiles.d/pcp-pmda-openmetrics-cust.confVerify that the symlinks are created correctly.
# ls -al /var/lib/pcp/pmdas/openmetrics/config.d/Verify that the metric is correctly reported.
# pminfo -f openmetrics.example.var1inst [0 or "0 var2:var3"] value 42
Verification
-
Run
pcpand confirm thatopenmetricsis listed. -
Run
systemd-analyze cat-config tmpfiles.dand confirm thatexample.urlappears in the output. -
Use
pminfoto confirm the presence and value of the metric.
Chapter 6. Logging performance data with pmlogger Copiar enlaceEnlace copiado en el portapapeles!
With the PCP tool you can log the performance metric values and replay them later. You can use this data to perform retrospective performance analysis. Using the pmlogger tool, you can:
- Create the archived logs of selected metrics on the system
- Specify which metrics are recorded on the system and how often
6.1. Modifying the pmlogger configuration file with pmlogconf Copiar enlaceEnlace copiado en el portapapeles!
When the pmlogger service is running, PCP logs a default set of metrics on the host. Use the pmlogconf utility to check the default configuration. If the pmlogger configuration file does not exist, pmlogconf creates it with a default metric values.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
Procedure
Create or modify the
pmloggerconfiguration file:# pmlogconf -r /var/lib/pcp/config/pmlogger/config.defaultFollow
pmlogconfprompts to enable or disable groups of related performance metrics and to control the logging interval for each enabled group. For example,# pmlogconf -r /var/lib/pcp/config/pmlogger/config.defaultGroup: per logical block device activity Log this group? [y] Logging interval? [default]
6.2. Configuring pmlogger manually Copiar enlaceEnlace copiado en el portapapeles!
You can edit the pmlogger configuration file manually to create a tailored logging configuration with specific metrics and given intervals. The default pmlogger configuration file is /var/lib/pcp/config/pmlogger/config.default. The configuration file specifies which metrics are logged by the primary logging instance.
In manual configuration, you can:
- Record metrics which are not listed in the automatic configuration.
- Choose custom logging frequencies.
- Add PMDA with the application metrics.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
Procedure
Open and edit the
/var/lib/pcp/config/pmlogger/config.defaultfile to add specific metrics:# It is safe to make additions from here on ... # log mandatory on every 5 seconds { xfs.write xfs.write_bytes xfs.read xfs.read_bytes } log mandatory on every 10 seconds { xfs.allocs xfs.block_map xfs.transactions xfs.log } [access] disallow * : all; allow localhost : enquire;
6.3. Enabling the pmlogger service Copiar enlaceEnlace copiado en el portapapeles!
The pmlogger service must be started and enabled to log the metric values on the local machine.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
Procedure
Start and enable the
pmloggerservice:# systemctl start pmlogger# systemctl enable pmlogger
Verification
Verify that the
pmloggerservice is enabled:# pcpplatform: Linux arm10.local 6.12.0-55.13.1.el10_0.aarch64 #1 SMP PREEMPT_DYNAMIC Mon May 19 07:29:57 UTC 2025 aarch64 hardware: 4 cpus, 1 disk, 1 node, 3579MB RAM timezone: JST-9 services: pmcd pmcd: Version 6.3.7-1, 12 agents, 6 clients pmda: root pmcd proc pmproxy xfs linux nfsclient mmv kvm jbd2 dm openmetrics pmlogger: primary logger: /var/log/pcp/pmlogger/arm10.local/20250529.15.49 pmie: primary engine: /var/log/pcp/pmie/arm10.local/pmie.logFor more information, see the
/var/lib/pcp/config/pmlogger/config.defaultfile.
6.4. Setting up a client system for metrics collection Copiar enlaceEnlace copiado en el portapapeles!
You can configure a client system to enable a central server to collect performance metrics from clients running PCP.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
Procedure
Install the
pcp-system-toolspackage:# dnf install pcp-system-toolsConfigure an IP address for pmcd:
# echo "-i 192.168.4.62" >>/etc/pcp/pmcd/pmcd.optionsReplace 192.168.4.62 with the IP address, the client should listen on. By default,
pmcdis listening on the localhost.Configure the firewall to add the public zone permanently:
# firewall-cmd --permanent --zone=public --add-port=44321/tcpsuccess# firewall-cmd --reloadsuccessSet an SELinux boolean:
# setsebool -P pcp_bind_all_unreserved_ports onEnable the
pmcdandpmloggerservices:# systemctl enable pmcd pmlogger# systemctl restart pmcd pmlogger
Verification
Verify that the
pmcdis correctly listening on the configured IP address:# ss -tlp | grep 44321LISTEN 0 5 127.0.0.1:44321 0.0.0.0:* users:(("pmcd",pid=151595,fd=6)) LISTEN 0 5 192.168.4.62:44321 0.0.0.0:* users:(("pmcd",pid=151595,fd=0)) LISTEN 0 5 [::1]:44321 [::]:* users:(("pmcd",pid=151595,fd=7))
6.5. Setting up a central server to collect data Copiar enlaceEnlace copiado en el portapapeles!
You can create a central server to collect metrics from clients running PCP.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
- Client is configured for metrics collection. For more information, see Setting up a client system for metrics collection.
Procedure
Install the
pcp-system-toolspackage:# dnf install pcp-system-toolsCreate the
/etc/pcp/pmlogger/control.d/remotefile with the following content:# DO NOT REMOVE OR EDIT THE FOLLOWING LINE $version=1.1 192.168.4.13 n n PCP_ARCHIVE_DIR/rhel7u4a -r -T24h10m -c config.rhel7u4a 192.168.4.14 n n PCP_ARCHIVE_DIR/rhel6u10a -r -T24h10m -c config.rhel6u10a 192.168.4.62 n n PCP_ARCHIVE_DIR/rhel8u1a -r -T24h10m -c config.rhel8u1aReplace 192.168.4.13, 192.168.4.14 and 192.168.4.62 with the client IP addresses.
Enable the pmcd and pmlogger services:
# systemctl enable pmcd pmlogger# systemctl restart pmcd pmlogger
Verification
Verify that you can access the latest archive file from each directory:
# for i in /var/log/pcp/pmlogger/rhel*/*.0; do pmdumplog -L $i; doneLog Label (Log Format Version 2) Performance metrics from host rhel6u10a.local commencing Mon Nov 25 21:55:04.851 2019 ending Mon Nov 25 22:06:04.874 2019 Archive timezone: JST-9 PID for pmlogger: 24002 Log Label (Log Format Version 2) Performance metrics from host rhel7u4a commencing Tue Nov 26 06:49:24.954 2019 ending Tue Nov 26 07:06:24.979 2019 Archive timezone: CET-1 PID for pmlogger: 10941 [..]The archive files from the
/var/log/pcp/pmlogger/directory can be used for further analysis and graphing.
6.6. Systemd units and pmlogger Copiar enlaceEnlace copiado en el portapapeles!
The pmlogger deployments, whether local or a remote farm, include several associated systemd service and timer units. These units monitor pmlogger instances, restart failed processes, and manage archives through tasks like file compression.
The checking and housekeeping services typically deployed by pmlogger are:
- pmlogger_daily.service
-
Runs daily, soon after midnight by default, to aggregate, compress, and rotate one or more sets of PCP archives. Also culls archives older than the limit, 2 weeks by default. Triggered by the
pmlogger_daily.timerunit, which is required by thepmlogger.serviceunit. - pmlogger_check
-
Performs half-hourly checks that
pmloggerinstances are running. Restarts any missing instances and performs any required compression tasks. Triggered by thepmlogger_check.timerunit, which is required by thepmlogger.serviceunit. - pmlogger_farm_check
-
Checks the status of all configured
pmloggerinstances. Restarts any missing instances. Migrates all non-primary instances to thepmlogger_farmservice. Triggered by thepmlogger_farm_check.timer, which is required by thepmlogger_farm.serviceunit that is itself required by thepmlogger.service unit.
These services are managed through a series of positive dependencies, meaning that they are all enabled upon activating the primary pmlogger instance. Note that while pmlogger_daily.service is disabled by default, pmlogger_daily.timer being active through the dependency with pmlogger.service will trigger pmlogger_daily.service to run.
pmlogger_daily is also integrated with pmlogrewrite for automatically rewriting archives before merging. This helps to ensure metadata consistency amid changing production environments and PMDAs. For example, updating pmcd during logging can change metric semantics, creating new archives incompatible with previously recorded data. For more information, see the pmlogrewrite(1) man page on your system.
6.7. Managing systemd services triggered by pmlogger Copiar enlaceEnlace copiado en el portapapeles!
You can create an automated custom archive management system for data collected by your pmlogger instances by using the control files.
The primary pmlogger instance must be running on the same host as the pmcd it connects to. You do not need to configure a primary instance. If a central host collects data from several pmlogger instances connected to pmcd instances running on remote hosts, a primary instance might not be required.
Procedure
Do one of the following:
-
For the primary pmlogger instance, use
/etc/pcp/pmlogger/control.d/local For the remote hosts, use
/etc/pcp/pmlogger/control.d/remoteReplace remote with the file name that you want to use.
The file should contain one line for each host to be logged. The default format of the primary logger instance that is automatically created looks similar to:
# === LOGGER CONTROL SPECIFICATIONS === # #Host P? S? directory args # local primary logger LOCALHOSTNAME y n PCP_ARCHIVE_DIR/LOCALHOSTNAME -r -T24h10m -c config.default -v 100MbWhere the fields are:
- Host
- The name of the host to be logged.
- P?
-
Stands for “Primary?". This field indicates if the host is the primary logger instance, y, or not, n. Only one primary logger is permitted. It must run on the same host as its connected
pmcdinstance. - S?
-
Stands for “Socks?". This field indicates if this logger instance needs to use the
SOCKSprotocol to connect topmcdthrough a firewall, y, or not, n. - directory
- All archives associated with this line are created in this directory.
- args
-
Arguments passed to
pmlogger. The default values for theargsfield are: - -r
- Report the archive sizes and growth rate.
- T24h10m
-
Specifies when to end logging for each day. This is typically the time when
pmlogger_daily.serviceruns. The default value of24h10mindicates that logging should end 24 hours and 10 minutes after it begins, at the latest. - -c config.default
- Specifies which configuration file to use. This defines what metrics to record.
- -v 100Mb
-
Specifies the size at which point one data volume is filled and another is created. After it switches to the new archive, the previously recorded one will be compressed by either
pmlogger_dailyorpmlogger_check.
-
For the primary pmlogger instance, use
6.8. Replaying the PCP log archives with pmrep Copiar enlaceEnlace copiado en el portapapeles!
After recording the metric data, you can replay the PCP log archives. To export the logs to text files and import them into spreadsheets, use PCP utilities such as pcp2csv, pcp2xml, pmrep or pmlogsummary. By using the pmrep tool, you can:
- View the log files.
- Parse the selected PCP log archive and export the values into an ASCII table.
- Extract the entire archive log or only select metric values from the log by specifying individual metrics on the command line.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
-
The
pmloggerservice is enabled. For more information, see Enabling the pmlogger service. Installed the
pcp-guipackage.# dnf install pcp-gui
Procedure
Display the data on the metric:
$ pmrep --start @3:00am --archive 20211128 --interval 5seconds --samples 10 --output csv disk.dev.writeTime,"disk.dev.write-sda","disk.dev.write-sdb" 2021-11-28 03:00:00,, 2021-11-28 03:00:05,4.000,5.200 2021-11-28 03:00:10,1.600,7.600 2021-11-28 03:00:15,0.800,7.100 2021-11-28 03:00:20,16.600,8.400 2021-11-28 03:00:25,21.400,7.200 2021-11-28 03:00:30,21.200,6.800 2021-11-28 03:00:35,21.000,27.600 2021-11-28 03:00:40,12.400,33.800 2021-11-28 03:00:45,9.800,20.600The example displays the data on the
disk.dev.writemetric collected in an archive at a 5 second interval incomma-separated-valueformat. Replace 20211128 in this example with a filename containing thepmloggerarchive you want to display data for.
Chapter 7. Monitoring performance with Performance Co-Pilot Copiar enlaceEnlace copiado en el portapapeles!
Performance Co-Pilot (PCP) is a suite of tools, services, and libraries for monitoring, visualizing, storing, and analyzing system-level performance measurements. As a system administrator, you can monitor the system’s performance by using PCP in Red Hat Enterprise Linux.
7.1. Monitoring postfix with pmda-postfix Copiar enlaceEnlace copiado en el portapapeles!
You can monitor performance metrics of the postfix mail server with pmda-postfix. It helps to check how many emails are received per second.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
-
The
pmloggerservice is enabled. For more information, see Enabling the pmlogger service.
Procedure
Install the following packages:
Install the
pcp-system-tools:# dnf install pcp-system-toolsInstall the
pmda-postfixpackage to monitor postfix:# dnf install pcp-pmda-postfix postfixInstall the
loggingdaemon:# dnf install rsyslogInstall the mail client for testing:
# dnf install mutt
Enable the
postfixandrsyslogservices:# systemctl enable postfix rsyslog# systemctl restart postfix rsyslogEnable the SELinux boolean, so that
pmda-postfixcan access the required log files:# setsebool -P pcp_read_generic_logs=onInstall the PMDA:
# cd /var/lib/pcp/pmdas/postfix/# ./InstallUpdating the Performance Metrics Name Space (PMNS) ... Terminate PMDA if already installed ... Updating the PMCD control file, and notifying PMCD ... Waiting for pmcd to terminate ... Starting pmcd ... Check postfix metrics have appeared ... 7 metrics and 58 values
Verification
Verify the
pmda-postfixoperation:echo testmail | mutt rootVerify the available metrics:
# pminfo postfixpostfix.received postfix.sent postfix.queues.incoming postfix.queues.maildrop postfix.queues.hold postfix.queues.deferred postfix.queues.active
7.2. Visually tracing PCP log archives with the PCP Charts application Copiar enlaceEnlace copiado en el portapapeles!
After recording metric data, replay the PCP log archives as graphs. The metrics are sourced from one or more live hosts. You can also use metric data from PCP log archives as a source of historical data.
To customize the interface of the PCP Charts application to display performance metrics data, use line plots, bar graphs, or utilization graphs.
Using the PCP Charts application, you can:
- Replay the data in the PCP Charts application and use graphs to visualize the retrospective data alongside live data of the system.
- Plot performance metric values into graphs.
- Display multiple charts simultaneously.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
-
Logged performance data with the
pmlogger. For more information, see Logging performance data with pmlogger. Installed the
pcp-guipackage.# dnf install pcp-gui
Procedure
Launch the PCP Charts application from the command line:
# pmchartThe
pmtimeserver settings are located at the bottom. With start and pause buttons you can control:- The interval in which PCP polls the metric data
- The date and time for the metrics of historical data
- Click File and then New Chart to select metric from both the local machine and remote machines by specifying their host name or address. Advanced configuration options include the ability to manually set the axis values for the chart, and to manually choose the color of the plots.
Record the views created in the PCP Charts application:
Following are the options to take images or record the views created in the PCP Charts application:
- Click File and then Export to save an image of the current view.
- Click Record and then Start to start a recording.
- Click Record and then Stop to stop the recording. After stopping the recording, the recorded metrics are archived to be viewed later.
- Optional: In the PCP Charts application, the main configuration file, known as the view, stores the metadata associated with one or more charts. This metadata describes all chart aspects, including the metrics used and the chart columns.
Optional: Save the custom view configuration by clicking File and then Save View, and load the view configuration later.
The following example of the PCP Charts application view configuration file describes a stacking chart graph showing the total number of bytes read and written to the given XFS file system
loop1:# pmchartversion 1 chart title "Filesystem Throughput /loop1" style stacking antialiasing off plot legend Read rate metric xfs.read_bytes instance loop1 plot legend Write rate metric xfs.write_bytes instance loop1
7.3. Collecting data from SQL server by using PCP Copiar enlaceEnlace copiado en el portapapeles!
To monitor and analyze database performance issues, use the SQL Server agent in Performance Co-Pilot (PCP). You can collect data for Microsoft SQL Server by using PCP on your system.
Prerequisites
-
You have installed Microsoft SQL Server for Red Hat Enterprise Linux and established a
trustedconnection to an SQL server. - You have installed the Microsoft ODBC driver for SQL Server for Red Hat Enterprise Linux.
Procedure
Install PCP:
# dnf install pcp-zeroconfInstall packages required for the
pyodbcdriver:# dnf install gcc-c++ python3-devel unixODBC-devel# dnf install python3-pyodbcInstall the
mssqlagent:Install the Microsoft SQL Server domain agent for PCP:
# dnf install pcp-pmda-mssqlEdit the
/etc/pcp/mssql/mssql.conffile to configure the SQL server account’s username and password for the mssql agent. Ensure that the account you configure has access rights to performance data.username: user_name password: user_passwordReplace user_name with the SQL Server account username and user_password with the SQL Server user password for this account.
Install the agent:
# cd /var/lib/pcp/pmdas/mssql# ./InstallUpdating the Performance Metrics Name Space (PMNS) ... Terminate PMDA if already installed ... Updating the PMCD control file, and notifying PMCD ... Check mssql metrics have appeared ... 168 metrics and 598 values [...]
Verification
Using the
pcpcommand, verify if the SQL Server PMDA (mssql) is loaded and running:$ pcpPerformance Co-Pilot configuration on rhel.local: platform: Linux rhel.local 4.18.0-167.el8.x86_64 #1 SMP Sun Dec 15 01:24:23 UTC 2019 x86_64 hardware: 2 cpus, 1 disk, 1 node, 2770MB RAM timezone: PDT+7 services: pmcd pmproxy pmcd: Version 5.0.2-1, 12 agents, 4 clients pmda: root pmcd proc pmproxy xfs linux nfsclient mmv kvm mssql jbd2 dm pmlogger: primary logger: /var/log/pcp/pmlogger/rhel.local/20200326.16.31 pmie: primary engine: /var/log/pcp/pmie/rhel.local/pmie.logView the complete list of metrics that PCP can collect from the SQL Server:
# pminfo mssqlAfter viewing the list of metrics, you can report the rate of transactions. For example, to report on the overall transaction count per second, over a five second time window:
# pmval -t 1 -T 5 mssql.databases.transactions-
View the graphical chart of these metrics on your system by using the pmchart command. For more information, see Visually tracing PCP log archives with the PCP Charts application and
pcp(1),pminfo(1),pmval(1),pmchart(1), andpmdamssql(1)man pages on your system.
Chapter 8. Setting up graphical representation of PCP metrics Copiar enlaceEnlace copiado en el portapapeles!
Combine Performance Co-pilot (pcp), grafana, valkey, pcp bpftrace, and pcp vector to provide a graphical representation of the live or recorded PCP data.
8.1. Setting up PCP with pcp-zeroconf Copiar enlaceEnlace copiado en el portapapeles!
You can configure PCP on a system with the pcp-zeroconf package. Once the pcp-zeroconf package is installed, the system records the default set of metrics into archived files.
Procedure
Install the pcp-zeroconf package:
# dnf install pcp-zeroconf
Verification
Ensure that the pmlogger service is active, and starts archiving the metrics:
# pcp | grep pmloggerpmlogger: primary logger: /var/log/pcp/pmlogger/localhost.localdomain/20200401.00.12
8.2. Setting up a grafana-server Copiar enlaceEnlace copiado en el portapapeles!
Grafana generates graphs that are accessible from a browser. The grafana-server is a back-end server for the Grafana dashboard. It listens, by default, on all interfaces, and provides web services accessed through the web browser. The grafana-pcp plugin interacts with the pmproxy daemon in the backend.
Prerequisites
- PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
Procedure
Install the following packages:
# dnf install grafana grafana-pcpRestart and enable
grafana-server:# systemctl restart grafana-server# systemctl enable grafana-serverOpen the server’s firewall for network traffic to the Grafana service.
# firewall-cmd --permanent --add-service=grafanasuccess# firewall-cmd --reloadsuccess
Verification
Ensure that the
grafana-serveris listening and responding to requests:# ss -ntlp | grep 3000LISTEN 0 128 :3000 *: users:(("grafana-server",pid=19522,fd=7))Ensure that the grafana-pcp plugin is installed:
# grafana-cli plugins ls | grep performancecopilot-pcp-appperformancecopilot-pcp-app @ 5.2.2
8.3. Configuring valkey Copiar enlaceEnlace copiado en el portapapeles!
You can use the valkey data source to:
- View data archives
- Query time series using pmseries language
- Analyze data across multiple hosts
Prerequisites
- PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
-
The
grafana-serveris configured. For more information, see Setting up a grafana-server. -
Mail transfer agent, for example,
postfixis installed and configured.
Procedure
Install the
valkeypackage:# dnf install valkeyStart and enable the
pmproxyandvalkeyservices:# systemctl start pmproxy valkey# systemctl enable pmproxy valkeyRestart
grafana-server:# systemctl restart grafana-server
Verification
Ensure that the
pmproxyandvalkeyare working:# pmseries disk.dev.read2eb3e58d8f1e231361fb15cf1aa26fe534b4d9dfThis command does not return any data if the valkey package is not installed.
For details, see the
pmseries(1)man page on your system.
8.4. Accessing the Grafana web UI Copiar enlaceEnlace copiado en el portapapeles!
You can access the Grafana web interface. Using the Grafana web interface, you can:
- add Valkey, PCP bpftrace, and PCP Vector data sources
- create a dashboard
- view an overview of any useful metrics
- create alerts in Valkey
Prerequisites
- PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
-
The
grafana-serveris configured. For more information, see Setting up a grafana-server.
Procedure
On the client system, open a browser and access the
grafana-serveron port 3000 by using the http://192.0.2.0:3000 link.Replace 192.0.2.0 with your machine IP when accessing Grafana web UI from a remote machine, or with localhost when accessing the web UI locally.
- For the first login, enter admin in both the Email or username and Password fields.
- Grafana prompts to set a New password to create a secured account. If you want to set it later, click Skip.
- From the more options icon (☰) on the top left, click Administration > Plugins.
- In the Plugins tab, type performance co-pilot in the Search by name or type text box and then click the Performance Co-Pilot (PCP) plugin.
- In the Plugins / Performance Co-Pilot pane, click Enable.
Click the Grafana icon. The Grafana Home page is displayed.
Figure 8.1. Home Dashboard
NoteThe top corner of the screen has a Settings icon, but it controls the general Dashboard settings.
In the Grafana Home page, click Add your first data source to add Valkey, PCP bpftrace, and PCP Vector data sources.
- To add PCP valkey data source, view default dashboard, create a panel, and an alert rule, see Creating panels and alerts in PCP valkey data source.
- To add pcp bpftrace data source and view the default dashboard, see Viewing the PCP bpftrace System Analysis dashboard.
- To add pcp vector data source, view the default dashboard, and to view the vector checklist, see Viewing the PCP Vector Checklist.
Optional: From the menu, hover over the admin profile icon to change the Preferences including Edit Profile, Change Password, or to Sign out.
For more information, see the
grafana-cliandgrafana-serverman pages on your system.
8.5. Configuring secure connections for Valkey and PCP Copiar enlaceEnlace copiado en el portapapeles!
You can establish secure connections between Performance Co-Pilot (PCP), Grafana, and Valkey. Establishing secure connections between these components helps prevent unauthorized parties from accessing or modifying the data being collected and monitored.
Prerequisites
- PCP is installed. For more information, see Installing and enabling PCP.
- The Grafana server is configured. For more information, see Setting up a grafana-server.
- Valkey is installed. For more information, see Configuring Valkey.
-
The private client key is stored in the
/etc/valkey/client.keyfile. If you use a different path, modify the path in the corresponding steps of the procedure. For details about creating a private key, creating a certificate signing request (CSR), and requesting a certificate from a certificate authority (CA), see your CA’s documentation. -
The TLS client certificate is stored in the
/etc/valkey/client.crtfile. If you use a different path, modify the path in the corresponding steps of the procedure. -
The TLS server key is stored in the
/etc/valkey/valkey.keyfile. If you use a different path, modify the path in the corresponding steps of the procedure. -
The TLS server certificate is stored in the
/etc/valkey/valkey.crtfile. If you use a different path, modify the path in the corresponding steps of the procedure. -
The CA certificate is stored in the
/etc/valkey/ca.crtfile. If you use a different path, modify the path in the corresponding steps of the procedure. -
For the
pmproxydaemon, the private server key is stored in the/etc/pcp/tls/server.keyfile. If you use a different path, modify the path in the corresponding steps of the procedure.
Procedure
As a root user, open the
/etc/valkey/valkey.conffile and adjust the TLS/SSL options to reflect the following properties:port 0 tls-port 6379 tls-cert-file /etc/valkey/valkey.crt tls-key-file /etc/valkey/valkey.key tls-client-key-file /etc/valkey/client.key tls-client-cert-file /etc/valkey/client.crt tls-ca-cert-file /etc/valkey/ca.crtEnsure valkey can access the TLS certificates:
# su valkey -s /bin/bash -c \'ls -1 /etc/valkey/ca.crt /etc/valkey/valkey.key /etc/valkey/valkey.crt' /etc/valkey/ca.crt /etc/valkey/valkey.crt /etc/valkey/valkey.keyRestart the valkey server to apply the configuration changes:
# systemctl restart valkey
Verification
Confirm the TLS configuration works:
# valkey-cli --tls --cert /etc/valkey/client.crt \ --key /etc/valkey/client.key \ --cacert /etc/valkey/ca.crt <<< "PING"PONG Unsuccessful TLS configuration might result in the following error message: Could not negotiate a TLS connection: Invalid CA Certificate File/Directory
8.6. Creating panels and alerts in PCP Valkey data source Copiar enlaceEnlace copiado en el portapapeles!
After adding the PCP Valkey data source, you can view the dashboard with an overview of useful metrics. Add a query to visualize the load graph, and create alerts that help you view system issues after they occur.
Prerequisites
- The Valkey is configured. For more information, see Configuring Valkey.
-
The
grafana-serveris accessible. For more information, see Accessing the Grafana web UI.
Procedure
- Log into the Grafana web UI.
- In the Grafana Home page, click Add your first data source.
- In the Add data source pane, type valkey in the Filter by name or type text box and then click Valkey.
In the Data Sources / Valkey pane, perform the following:
- Add http://localhost:44322 in the URL field and then click Save & Test.
Click Dashboards tab > Import > Valkey: Host Overview to see a dashboard with an overview of any useful metrics.
Figure 8.2. Valkey: Host overview
Add a new panel:
- From the menu, hover over the Create icon > Dashboard > Add new panel icon to add a panel.
-
In the Query tab, select the Valkey from the query list instead of the selected default option and in the text field of A, enter metric, for example,
kernel.all.loadto visualize the kernel load graph. - Optional: Add Panel title and Description, and update other options from the Settings.
- Click Save to apply changes and save the dashboard. Add Dashboard name.
Click Apply to apply changes and go back to the dashboard.
Figure 8.3. Valkey query panel
Create an alert rule:
- In the Valkey query panel, click Alert and then click Create Alert.
- Edit the Name, Evaluate query, and For fields from the Rule, and specify the Conditions for your alert.
Click Save to apply changes and save the dashboard. Click Apply to apply changes and go back to the dashboard.
Figure 8.4. Creating alerts in the Valkey panel
- Optional: In the same panel, scroll down and click Delete icon to delete the created rule.
- Optional: From the menu, click Alerting icon to view the created alert rules with different alert statuses, to edit the alert rule, or to pause the existing rule from the Alert Rules tab. To add a notification channel for the created alert rule to receive an alert notification from Grafana, see Adding notification channels for alerts.
8.7. Adding notification channels for alerts Copiar enlaceEnlace copiado en el portapapeles!
Add notification channels to receive an alert notification from Grafana when rule conditions are met and the system requires attention. You can receive these alerts after selecting any one type from the supported list of notifiers.
The notifiers include DingDing, Discord, Email, Google Hangouts Chat, HipChat, Kafka REST Proxy, LINE, Microsoft Teams, OpsGenie, PagerDuty, Prometheus Alertmanager, Pushover, Sensu, Slack, Telegram, Threema Gateway, VictorOps, and webhook.
Prerequisites
-
The
grafana-serveris accessible. For more information, see Accessing the Grafana web UI. - An alert rule is created. For more information, see Creating panels and alerts in PCP valkey data source.
Configured SMTP and added a valid sender’s email address in the
grafana/grafana.inifile, for example:# vi /etc/grafana/grafana.ini[smtp] enabled = true from_address = abc@gmail.comReplace abc@gmail.com by a valid email address.
Restarted grafana-server after changing the configurations:
# systemctl restart grafana-server.service
Procedure
- From the menu, hover over the Alerting icon > click Notification channels > Add channel.
In the Add notification channel details pane, perform the following:
- Enter your name in the Name text box.
-
Select the communication Type, for example, Email and enter the email address. You can add multiple email addresses using the
;separator. - Optional: Configure Optional Email settings and Notification settings.
- Click Save.
Select a notification channel in the alert rule:
- From the menu, hover over the Alerting icon and then click Alert rules.
- From the Alert Rules tab, click the created alert rule.
- On the Notifications tab, select your notification channel name from the Send to option, and then add an alert message.
- Click Apply.
8.8. Setting up authentication between PCP components Copiar enlaceEnlace copiado en el portapapeles!
You can configure authentication by using the scram-sha-256 authentication mechanism. It is supported by PCP through the Simple Authentication Security Layer (SASL) framework.
Procedure
Install the SASL framework for the
scram-sha-256authentication mechanism:# dnf install cyrus-sasl-scram cyrus-sasl-libSpecify the supported authentication mechanism and the user database path in the
pmcd.conffile:# vi /etc/sasl2/pmcd.confmech_list: scram-sha-256 sasldb_path: /etc/pcp/passwd.dbCreate a new user:
# useradd -r metricsReplace metrics by your user name.
Add the created user in the user database:
# saslpasswd2 -a pmcd metricsPassword: Again (for verification):To add the created user, it is required to enter the metrics account password.
Set the permissions of the user database:
# chown root:pcp /etc/pcp/passwd.db# chmod 640 /etc/pcp/passwd.dbRestart the pmcd service:
# systemctl restart pmcd
Verification
Verify the SASL configuration:
# pminfo -f -h "pcp://127.0.0.1?username=metrics" disk.dev.readPassword: disk.dev.read inst [0 or "sda"] value 19540
8.9. Installing PCP bpftrace Copiar enlaceEnlace copiado en el portapapeles!
You can install the PCP bpftrace agent to introspect a system and to gather metrics from the kernel and user-space tracepoints. The bpftrace agent uses bpftrace scripts to gather the metrics. The bpftrace scripts use the enhanced Berkeley Packet Filter (eBPF).
Prerequisites
- PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
-
The
grafana-serveris configured. For more information, see Setting up a grafana-server. -
The
scram-sha-256authentication mechanism is configured. For more information, see Setting up authentication between PCP components.
Procedure
Install the
pcp-pmda-bpftracepackage:# dnf install pcp-pmda-bpftraceEdit the bpftrace.conf file and add the user:
# vi /var/lib/pcp/pmdas/bpftrace/bpftrace.conf[dynamic_scripts] enabled = true auth_enabled = true allowed_users = root,metricsReplace metrics by your user name from Setting up authentication between PCP components.
Install
bpftracePMDA:# cd /var/lib/pcp/pmdas/bpftrace/# ./InstallUpdating the Performance Metrics Name Space (PMNS) ... Terminate PMDA if already installed ... Updating the PMCD control file, and notifying PMCD … Check bpftrace metrics have appeared ... 7 metrics and 6 valuesThe
pmda-bpftraceis now installed, and can only be used after authenticating your user. For more information, see Viewing the PCP bpftrace System Analysis dashboard.
8.10. Viewing the PCP bpftrace System Analysis dashboard Copiar enlaceEnlace copiado en el portapapeles!
The PCP bpftrace data source accesses live data unavailable through standard pmlogger instances or archives. In the PCP bpftrace data source, view the dashboard with an overview of useful metrics.
Prerequisites
-
The PCP
bpftraceis installed. For more information, see Installing PCP bpftrace. -
The
grafana-serveris configured. For more information, see Setting up a grafana-server.
Procedure
- Log into the Grafana web UI.
- In the Grafana Home page, click Add your first data source.
-
In the Add data source pane, type
bpftracein the Filter by name or type text box and then click PCP bpftrace. In the Data Sources / PCP bpftrace pane, perform the following:
- Add http://localhost:44322 in the URL field.
- Toggle the Basic Auth option and add the created user credentials in the User and Password field.
Click Save & Test.
Figure 8.5. Adding PCP bpftrace in the data source
Click Dashboards tab > Import > PCP bpftrace: System Analysis to see a dashboard with an overview of any useful metrics.
Figure 8.6. PCP bpftrace: System Analysis
8.11. Installing PCP Vector Copiar enlaceEnlace copiado en el portapapeles!
You must install pcp vector before you start using it.
Prerequisites
- PCP is configured. For more information, see Setting up PCP with pcp-zeroconf.
-
The
grafana-serveris configured. For more information, see Setting up a grafana-server.
Procedure
Install the bcc PMDA:
# cd /var/lib/pcp/pmdas/bcc# ./Install[Wed Apr 1 00:27:48] pmdabcc(22341) Info: Initializing, currently in 'notready' state. [Wed Apr 1 00:27:48] pmdabcc(22341) Info: Enabled modules: [Wed Apr 1 00:27:48] pmdabcc(22341) Info: ['biolatency', 'sysfork', [...] Updating the Performance Metrics Name Space (PMNS) ... Terminate PMDA if already installed ... Updating the PMCD control file, and notifying PMCD … Check bcc metrics have appeared ... 1 warnings, 1 metrics and 0 values
8.12. Viewing the PCP Vector Checklist Copiar enlaceEnlace copiado en el portapapeles!
The PCP Vector data source displays live metrics and uses the pcp metrics. It analyzes data for individual hosts. After adding the PCP Vector data source, view the dashboard with an overview of useful metrics. Also, view related troubleshooting or reference links in the checklist.
Prerequisites
- The PCP Vector is installed. For more information, see Installing PCP Vector.
-
The
grafana-serveris accessible. For more information, see Accessing the Grafana web UI.
Procedure
- Log into the Grafana web UI.
- In the Grafana Home page, click Add your first data source.
- In the Add data source pane, type vector in the Filter by name or type text box and then click PCP Vector.
In the Data Sources / PCP Vector pane, perform the following:
- Add http://localhost:44322 in the URL field and then click Save & Test.
Click Dashboards tab > Import * PCP Vector: Host Overview to see a dashboard with an overview of any useful metrics.
Figure 8.7. PCP Vector: Host Overview
From the menu, hover over the Performance Co-Pilot plugin and then click PCP Vector Checklist.
In the PCP checklist, click help or warning icon to view the related troubleshooting or reference links.
Figure 8.8. Performance Co-Pilot / PCP Vector Checklist
8.13. Using heatmaps in Grafana Copiar enlaceEnlace copiado en el portapapeles!
Use heatmaps in Grafana to view data as histograms, identify trends and patterns in your data, and see how they change over time. Each heatmap column is a histogram, with cell colors showing the density of observations.
Prerequisites
- Valkey is configured. For more information see Configuring Valkey.
-
The
grafana-serveris accessible. For more information, see Accessing the Grafana web UI. - The PCP valkey data source is configured. For more information see Creating panels and alerts in PCP Valkey data source.
Procedure
- Hover the cursor over the Dashboards tab and click + New dashboard.
- In the Add panel menu, click Add a new panel.
In the Query tab:
- Select Valkey from the query list instead of the selected default option.
-
In the text field of A, enter a metric, for example,
kernel.all.loadto visualize the kernel load graph.
- Click the visualization dropdown menu, which is set to Time series by default, and then click Heatmap.
- Optional: In the Panel Options dropdown menu, add a Panel Title and Description.
In the Heatmap dropdown menu, under the Calculate from data setting, click Yes.
Figure 8.9. Heatmap
- Optional: In the Colors dropdown menu, change the Scheme from the default Orange and select the number of steps (color shades).
Optional: In the Tooltip dropdown menu, under the Show histogram (Y Axis) setting, click the toggle to display a cell’s position within its specific histogram when hovering your cursor over a cell in the heatmap. For example:
8.14. Managing SELinux booleans for Grafana Copiar enlaceEnlace copiado en el portapapeles!
The grafana-selinux subpackage includes several SELinux booleans that control Grafana’s access to external services. These booleans are off by default. You can turn them on or off based on your system requirements.
Prerequisites
-
You have installed
grafana-selinuxsubpackage. - You have root privileges.
Procedure
View all Grafana-related SELinux booleans:
# semanage boolean -l | grep grafanaEnable a SELinux boolean:
# setsebool [-P] <boolean_name> on- Replace <boolean_name> with the name of the SELinux boolean you want to enable.
- Use the -P option to make the change persistent across system reboots.
Optional: Turn off any SELinux option:
# setsebool [-P] <boolean_name> off
8.15. Troubleshooting Grafana issues Copiar enlaceEnlace copiado en el portapapeles!
At times, it is necessary to troubleshoot Grafana issues, such as Grafana not displaying data, a black dashboard, or similar issues.
Procedure
Verify that the
pmloggerservice is up and running by executing the following command:$ systemctl status pmloggerVerify if files were created or modified to the disk by executing the following command:
$ ls /var/log/pcp/pmlogger/$(hostname)/ -rlttotal 4024 -rw-r--r--. 1 pcp pcp 45996 Oct 13 2019 20191013.20.07.meta.xz -rw-r--r--. 1 pcp pcp 412 Oct 13 2019 20191013.20.07.index -rw-r--r--. 1 pcp pcp 32188 Oct 13 2019 20191013.20.07.0.xz -rw-r--r--. 1 pcp pcp 44756 Oct 13 2019 20191013.20.30-00.meta.xz [..]Verify that the
pmproxyservice is running by executing the following command:$ systemctl status pmproxyVerify that
pmproxyis running, time series support is enabled, and a connection to valkey is established by viewing the/var/log/pcp/pmproxy/pmproxy.logfile. Ensure that it contains the following text:pmproxy(1716) Info: valkey slots, command keys, schema version setup Here, 1716 is the PID of pmproxy, which will be different for every invocation of pmproxy. Verify if the valkey database contains any keys by executing the following command: $ valkey-cli dbsize (integer) 34837Verify if any PCP metrics are in the valkey database and pmproxy is able to access them by executing the following commands:
$ pmseries disk.dev.read2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df $ pmseries "disk.dev.read[count:10]" 2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df [Mon Jul 26 12:21:10.085468000 2021] 117971 70e83e88d4e1857a3a31605c6d1333755f2dd17c [Mon Jul 26 12:21:00.087401000 2021] 117758 70e83e88d4e1857a3a31605c6d1333755f2dd17c [Mon Jul 26 12:20:50.085738000 2021] 116688 70e83e88d4e1857a3a31605c6d1333755f2dd17c [...] $ valkey-cli --scan --pattern "*$(pmseries 'disk.dev.read')" pcp:metric.name:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df pcp:values:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df pcp:desc:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df pcp:labelvalue:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df pcp:instances:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9df pcp:labelflags:series:2eb3e58d8f1e231361fb15cf1aa26fe534b4d9dfVerify if there are any errors in the Grafana logs by executing the following command:
$ journalctl -e -u grafana-server-- Logs begin at Mon 2021-07-26 11:55:10 IST, end at Mon 2021-07-26 12:30:15 IST. -- Jul 26 11:55:17 localhost.localdomain systemd[1]: Starting Grafana instance... Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530 lvl=info msg="Starting Grafana" logger=server version=7.3.6 c> Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530 lvl=info msg="Config loaded from" logger=settings file=/usr/s> Jul 26 11:55:17 localhost.localdomain grafana-server[1171]: t=2021-07-26T11:55:17+0530 lvl=info msg="Config loaded from" logger=settings file=/etc/g> [...]
Chapter 9. Managing power consumption with PowerTOP Copiar enlaceEnlace copiado en el portapapeles!
Reducing the overall power consumption of computer systems helps to save cost. To optimize energy consumption, study system tasks and configure each component to match the performance requirements of that specific task.
Lowering the power consumption of a specific component or of the system as a whole leads to lower heat and performance. Proper power management results in:
- Heat reduction for servers and computing centers.
- Reduced secondary costs, including cooling, space, cables, generators, and uninterruptible power supplies (UPS).
- Extended battery life for laptops.
- Lower carbon dioxide output.
- Meeting government regulations or legal requirements regarding Green IT, for example, Energy Star.
- Meeting company guidelines for new systems.
9.1. Power management basics Copiar enlaceEnlace copiado en el portapapeles!
Effective power management is built on the following principles:
- An idle CPU should only wake up when needed
- Since Red Hat Enterprise Linux 6, the kernel runs tickless. It replaced periodic timer interrupts with on-demand interrupts. Therefore, idle CPUs are allowed to remain idle until a new task is queued for processing. CPUs that have entered lower power states can remain in these states longer.
However, benefits from this feature can be offset if your system has applications that create unnecessary timer events. Polling events, such as checks for volume changes or mouse movement, are examples of such events.
- Unused hardware and devices should be disabled completely
- This is true for devices that have moving parts, for example, hard disks. In addition, some applications may leave an unused but enabled device "open". When this occurs, the kernel assumes that the device is in use. It prevents the device from going into a power saving state.
- Low activity should translate to low wattage
- Power efficiency often depends on modern hardware and proper BIOS or UEFI configuration, especially on non-x86 architectures. Install the latest firmware. Enable power management features in the BIOS or system settings to improve power efficiency.
Some features to look for include:
- Collaborative Processor Performance Controls (CPPC) support for ARM64
- PowerNV support for IBM Power Systems
- Cool’n’Quiet
- ACPI (C-state)
- Smart
Red Hat Enterprise Linux automatically uses these features when supported and enabled in the BIOS.
- Different forms of CPU states and their effects
Modern CPUs together with Advanced Configuration and Power Interface (ACPI) provide different power states. The three different states are:
- Sleep (C-states)
- Frequency and voltage (P-states)
Heat output (T-states or thermal states)
A CPU running on the lowest sleep state consumes the least amount of energy, but it also takes considerably more time to wake it up from that state when needed. In very rare cases this can lead to the CPU having to wake up immediately every time it just went to sleep. This permanently busy CPU loses power savings available from more efficient idle states.
- A turned off machine uses the least amount of power
- One of the best ways to save power is to turn off systems. For example, companies can develop a corporate culture focused on "green IT" awareness under a guideline to turn off machines during breaks or when going home. Consolidate several physical servers into one bigger server and virtualize them by using the virtualization technology. The virtualization technology is shipped with Red Hat Enterprise Linux.
9.2. Audit and analysis overview Copiar enlaceEnlace copiado en el portapapeles!
The detailed manual audit, analysis, and tuning of a single system is rare. The time and cost spent to do so typically outweighs the benefits gained from these last pieces of system tuning.
Tuning is valuable for large groups of identical systems where you can reuse the same settings. For example, consider the deployment of thousands of desktop systems, or an HPC cluster where the machines are nearly identical.
Another reason for auditing and analysis is to provide a basis for comparison. Use the results to identify regressions or changes in system behavior and to avoid power consumption issues when hardware, BIOS, or software updates occur regularly. Generally, a thorough audit and analysis gives a better idea of what is really happening on a particular system.
Auditing and analyzing a system with regard to power consumption is relatively hard, even with the most modern systems available. Most systems do not provide the necessary means to measure power use by using software. Exceptions exist though:
- iLO management console of Hewlett Packard server systems has a power management module that you can access through the web.
- IBM provides a similar solution in their BladeCenter power management module.
- On some Dell systems, the IT Assistant offers power monitoring capabilities.
Other vendors are likely to offer similar capabilities for their server platforms. However, there is no single solution available that is supported by all vendors.
9.3. Tools for auditing Copiar enlaceEnlace copiado en el portapapeles!
Red Hat Enterprise Linux offers tools for system auditing and analysis. Use these tools to verify existing findings or gain in-depth information about specific system components.
Many of these tools are used for performance tuning, including:
- PowerTOP
-
PowerTOP identifies specific components of kernel and user-space applications that frequently wake up the CPU. Intel CPU’s Intel Hardware P-State (HWP) adjusts CPU frequency and voltage to regulate the power efficiency and performance. You can use the
powertopcommand as root to start the PowerTOP tool andpowertop --calibrateto calibrate the power estimation engine. diskdevstatandnetdevstatThese SystemTap tools collect detailed information about the disk and network activity of all applications running on a system. Use these statistics to detect power-wasting applications performing many small rather than few large I/O operations. Use the
dnf install tuned-utils-systemtap kernel-debuginfocommand to install thediskdevstatandnetdevstattools. To view detailed information about the disk and network activity, use:# diskdevstatPID UID DEV WRITE_CNT WRITE_MIN WRITE_MAX WRITE_AVG READ_CNT READ_MIN READ_MAX READ_AVG COMMAND 3575 1000 dm-2 59 0.000 0.365 0.006 5 0.000 0.000 0.000 mozStorage #5 3575 1000 dm-2 7 0.000 0.000 0.000 0 0.000 0.000 0.000 localStorage DB [...]# netdevstatPID UID DEV XMIT_CNT XMIT_MIN XMIT_MAX XMIT_AVG RECV_CNT RECV_MIN RECV_MAX RECV_AVG COMMAND 3572 991 enp0s31f6 40 0.000 0.882 0.108 0 0.000 0.000 0.000 openvpn 3575 1000 enp0s31f6 27 0.000 1.363 0.160 0 0.000 0.000 0.000 Socket Thread [...]With these commands, you can specify three parameters:
update_interval,total_duration, anddisplay_histogram.- TuneD
-
It is a profile-based system tuning tool. It uses the
udevdevice manager to monitor connected devices and enables both static and dynamic tuning of system settings. Use thetuned-adm recommendcommand to determine which profile Red Hat suggests as the most suitable for a particular product. - Virtual memory statistics (
vmstat) It is a tool provided by the
procps-ngpackage. It provides detailed information about processes, memory, paging, block I/O, traps, and CPU activity. Use thevmstatcommand to view this information:$ vmstatprocs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 1 0 0 5805576 380856 4852848 0 0 119 73 814 640 2 2 96 0 0View the active and inactive memory by using the
vmstat -acommand.iostatProvided by the
sysstatpackage, this tool is similar tovmstat, but only for monitoring I/O on block devices. It provides more verbose output and statistics. You can use the following command to monitor the system I/O:$ iostatavg-cpu: %user %nice %system %iowait %steal %idle 2.05 0.46 1.55 0.26 0.00 95.67 Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn nvme0n1 53.54 899.48 616.99 3445229 2363196 dm-0 42.84 753.72 238.71 2886921 914296 dm-1 0.03 0.60 0.00 2292 0 dm-2 24.15 143.12 379.80 548193 1454712blktraceProvides detailed information about how time is spent in the I/O subsystem. Use the following command to view this information in human readable format:
# blktrace -d /dev/dm-0 -o - | blkparse -i -253,0 1 1 0.000000000 17694 Q W 76423384 + 8 [kworker/u16:1] 253,0 2 1 0.001926913 0 C W 76423384 + 8 [0] [...]-
The first column,
253,0is the device major and minor tuple. - The second column displays sequence number of the event.
- The third column displays the CPU id for event occurrence.
-
The fourth column of
timestamps. - The fifth column of PID of the I/O-issuing process.
-
The sixth column displays the event type.
Qfor queued process, C for completed process. -
The seventh column,
Wfor write operation. -
The eighth column,
76423384, is the block number -
The ninth column
+ 8is the number of requested blocks as a number of bytes in I/O operation. -
The last field,
[kworker/u16:1], is the process name. By default, theblktracecommand runs forever until the process is explicitly killed. Use the-woption to specify the run-time duration.
-
The first column,
turbostatIt is provided by the
kernel-toolspackage. It reports on processor topology, frequency, idle power-state statistics, temperature, and power usage on x86-64 processors. Use the following command to view the summary:# turbostatCPUID(0): GenuineIntel 0x16 CPUID levels; 0x80000008 xlevels; family:model:stepping 0x6:8e:a (6:142:10) CPUID(1): SSE3 MONITOR SMX EIST TM2 TSC MSR ACPI-TM HT TM CPUID(6): APERF, TURBO, DTS, PTM, HWP, HWPnotify, HWPwindow, HWPepp, No-HWPpkg, EPB [...]By default,
turbostatprints a summary of counter results for the entire screen, followed by counter results every 5 seconds. Specify a different period between counter results with the-ioption. For example, use turbostat-i 10to print results every10seconds instead.turbostatis also useful for identifying servers that are inefficient in terms of power usage or idle time. It also helps to identify the rate of system management interrupts (SMIs) occurring on the system. Verify the effects of power management tuning withturbostat.cpupowerThe
cpupowerpackage contains a collection of tools to examine and tune power saving related features of processors. Use thecpupowercommand with thefrequency-info,frequency-set,idle-info,idle-set,set,info, and monitor options to display and set processor related values.For example, to view available
cpufreqgovernors, use:$ cpupower frequency-info --governorsanalyzing CPU 0: available cpufreq governors: performance powersave- GNOME Power Manager
- It is a daemon that is installed as part of the GNOME desktop environment. GNOME Power Manager notifies you of changes in your system’s power status, for example, a change from battery to AC power. It also reports battery status, and warns you when battery power is low.
pmda-denki-
pmda-denkimonitors power consumption. It’s part of the Performance Co-Pilot suite, and helps to monitor consumption over time, and visualize it.
9.4. The purpose of PowerTOP Copiar enlaceEnlace copiado en el portapapeles!
PowerTOP is a program that diagnoses issues related to power consumption and provides suggestions on how to extend battery lifetime. PowerTOP estimates total system power usage and consumption for individual processes, devices, and kernel handlers.
The tool can also identify specific components of kernel and user-space applications that frequently wake up the CPU. Red Hat Enterprise Linux uses version 2.x of PowerTOP.
9.5. Installing PowerTOP Copiar enlaceEnlace copiado en el portapapeles!
You can install PowerTop and start using it to diagnose issues related to power consumption.
Procedure
Open the command line and enter:
# dnf install powertop
9.6. Starting PowerTOP Copiar enlaceEnlace copiado en el portapapeles!
After you install it on your system, you can start using PowerTop to optimize system performance.
Prerequisites
- You are running your laptop on battery power.
Procedure
To run PowerTOP, enter:
# powertop
9.7. Calibrating PowerTOP Copiar enlaceEnlace copiado en el portapapeles!
You can use the PowerTOP calibration process to improve the accuracy of power consumption measurements on your laptop.
Procedure
On a laptop, you can calibrate the power estimation engine by running the following command:
# powertop --calibrateLet the calibration finish without interacting with the machine during the process. Calibration takes time because the process performs various tests, cycles through brightness levels and switches devices on and off. When the calibration process is completed, PowerTOP starts as normal. Let it run for approximately an hour to collect data.
When enough data is collected, power estimation figures will be displayed in the first column of the output table.
9.8. Setting the measuring interval Copiar enlaceEnlace copiado en el portapapeles!
By default, PowerTOP takes measurements in 20 seconds intervals. But, you can change this measuring frequency based on your requirements.
Procedure
To change the measuring frequency, run the
powertopcommand with the--timeoption:# powertop --time=<time_in_seconds>
9.9. Controlling CPU frequency drivers and modes Copiar enlaceEnlace copiado en el portapapeles!
While using the Intel P-State driver, PowerTOP only displays values in the Frequency Stats tab if the driver is in passive mode. But, even in this case, the values can be incomplete. In total, there are three possible modes of the Intel P-State driver:
- Active mode with Hardware P-States (HWP)
- Active mode without HWP
- Passive mode
Switching to the ACPI CPUfreq driver results in complete information being displayed by PowerTOP. However, it is recommended to keep your system on the default settings.
Procedure
View what driver is loaded and in what mode:
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver- intel_pstate is returned if the Intel P-State driver is loaded and in active mode.
- intel_cpufreq is returned if the Intel P-State driver is loaded and in passive mode.
- acpi-cpufreq is returned if the ACPI CPUfreq driver is loaded.
While using the Intel P-State driver:
Add the following argument to the kernel boot command line to force the driver to run in passive mode:
intel_pstate=passiveTo disable the Intel P-State driver and use the ACPI CPUfreq driver, add the following argument to the kernel boot command line:
intel_pstate=disable
9.10. Generating an HTML output Copiar enlaceEnlace copiado en el portapapeles!
You can also generate an HTML report apart from the powertop’s output in the command line.
Procedure
Run the
powertopcommand with the--htmloption:# powertop --html=<htmlfile.html>Replace the <htmlfile.html> parameter with the required name for the output file.
9.11. Optimizing power consumption Copiar enlaceEnlace copiado en el portapapeles!
You can use either the powertop service or the powertop2tuned utility to optimize power consumption.
9.11.1. Optimizing power consumption using the powertop service Copiar enlaceEnlace copiado en el portapapeles!
You can use the powertop service to automatically enable all PowerTOP’s suggestions from the Tunables tab on the boot.
Procedure
Enable the
powertopservice:# systemctl enable powertop
9.11.2. The powertop2tuned utility Copiar enlaceEnlace copiado en el portapapeles!
Use the powertop2tuned utility to create custom TuneD profiles from PowerTOP suggestions. By default, powertop2tuned creates profiles in the /etc/tuned/ directory, and bases the custom profile on the currently selected TuneD profile. For safety reasons, all PowerTOP tunings are initially disabled in the new profile. You can enable the tunings by uncommenting them in the /etc/tuned/profiles/profile_name/tuned.conf file.
Use the --enable or -e option to generate a new profile that enables most of the tunings suggested by PowerTOP. Certain potentially problematic tunings, such as the USB autosuspend, are disabled by default and need to be uncommented manually.
9.11.3. Optimizing power consumption using the powertop2tuned utility Copiar enlaceEnlace copiado en el portapapeles!
You can use the powertop2tuned utility to generate a custom TuneD profile that helps optimize your system’s power consumption.
Prerequisites
The
powertop2tunedutility is installed on the system.# dnf install tuned-utils
Procedure
Create a custom profile:
# powertop2tuned <new_profile_name>Activate the new profile:
# tuned-adm profile <new_profile_name>
9.11.4. Comparing powertop2tuned and powertop.service usage Copiar enlaceEnlace copiado en el portapapeles!
You can prefer optimizing power consumption with powertop2tuned over powertop.service for the following reasons:
-
The
powertop2tunedutility represents integration of PowerTOP into TuneD, which enables the benefit of both tools. -
The
powertop2tunedutility provides fine-grained control of enabled tuning. -
With
powertop2tuned, potentially dangerous tuning is not automatically enabled. -
With
powertop2tuned, rollback is possible without reboot.
Chapter 10. Getting started with perf Copiar enlaceEnlace copiado en el portapapeles!
As a system administrator, you can use the perf tool to collect and analyze performance data of your system. The perf user-space tool interfaces with the kernel-based subsystem Performance Counters for Linux (PCL).
perf uses the Performance Monitoring Unit (PMU) to measure, record, and monitor a variety of hardware and software events. perf also supports tracepoints, kprobes, and uprobes.
10.1. Installing perf Copiar enlaceEnlace copiado en el portapapeles!
You must install the perf user-space tool to start using the perf tool.
Procedure
Install the perf tool:
# dnf install perf
10.2. Common perf commands Copiar enlaceEnlace copiado en el portapapeles!
You can use the following commonly used perf commands to collect and analyze performance data.
- perf stat
- Provides overall statistics for common performance events, including instructions run and clock cycles used. Options allow for selection of events other than the default measurement events.
- perf record
-
Records performance data into a file,
perf.data, which can be later analyzed by using theperf reportcommand. - perf report
-
Reads and displays the performance data from the
perf.datafile created by theperf recordcommand. - perf list
- Lists the events available on a particular machine. These events vary based on performance monitoring hardware and software configuration of the system.
- perf top
- Performs a similar function to the top utility. It generates and displays a performance counter profile in real time.
- perf trace
-
Performs a similar function to the
stracetool. It monitors the system calls used by a specified thread or process and all signals received by that application. - perf help
-
Displays a complete list of the
perfcommands.
Chapter 11. Profiling CPU usage in real time with perf top Copiar enlaceEnlace copiado en el portapapeles!
You can use the perf top command to measure CPU usage of different functions in real time.
11.1. The purpose of perf top Copiar enlaceEnlace copiado en el portapapeles!
The perf top command performs real-time system profiling and functions similarly to the top utility. While top monitors process CPU usage, perf top displays CPU time by specific function. In its default state, perf top informs you about functions being used across all CPUs in both the user-space and the kernel-space. To use perf top, you need root access.
11.2. Profiling CPU usage with perf top Copiar enlaceEnlace copiado en el portapapeles!
You can use perf top to monitor real-time CPU usage and identify functions consuming the most processing time.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf. - You have root access.
Procedure
Start the perf top monitoring interface:
# perf topThe monitoring interface looks similar to the following: Samples: 8K of event 'cycles', 2000 Hz, Event count (approx.): 4579432780 lost: 0/0 drop: 0/0 Overhead Shared Object Symbol 2.20% [kernel] [k] do_syscall_64 2.17% [kernel] [k] module_get_kallsym 1.49% [kernel] [k] copy_user_enhanced_fast_string 1.37% libpthread-2.29.so [.] pthread_mutex_lock 1.31% [unknown] [.] 0000000000000000 1.07% [kernel] [k] psi_task_change 1.04% [kernel] [k] switch_mm_irqs_off 0.94% [kernel] [k] fget 0.74% [kernel] [k] entry_SYSCALL_64 0.69% [kernel] [k] syscall_return_via_sysret 0.69% libxul.so [.] 0x000000000113f9b0 0.67% [kernel] [k] kallsyms_expand_symbol.constprop.0 0.65% firefox [.] moz_xmalloc 0.65% libpthread-2.29.so [.] __pthread_mutex_unlock_usercnt 0.60% firefox [.] free 0.60% libxul.so [.] 0x000000000241d1cd 0.60% [kernel] [k] do_sys_poll 0.58% [kernel] [k] menu_select 0.56% [kernel] [k] _raw_spin_lock_irqsave 0.55% perf [.] 0x00000000002ae0f3In this example, the kernel function
do_syscall_64is using the most CPU time.
11.3. Perf output and symbol resolution Copiar enlaceEnlace copiado en el portapapeles!
The perf top monitoring interface provides a real-time view of CPU usage and function activity. Understanding its output helps identify performance bottlenecks and optimizes system behavior.
- Key columns in perf top output
- The interface displays several following columns:
- Overhead
-
Shows the percentage of CPU time consumed by a given function. This helps pinpoint the most
resource-intensiveoperations. - Shared Object
- Indicates the name of the program or library where the function resides.
- Symbol
Displays the name of the function or symbol.
- Functions running in kernel space are marked with [k].
- Functions running in user space are marked with [.].
- Causes of unresolved symbols in perf output
For kernel functions,
perfuses the information from the/proc/kallsymsfile to map the samples to their respective function names or symbols. For functions executed in the user space, however, you might see raw function addresses because the binary is stripped.This information can be included by installing the corresponding debuginfo package or by compiling the application with debugging enabled, such as using the
-goption ingcc. Once the necessary debug information is available,perfcan accurately map sampled addresses to human-readable function names during reporting.NoteAfter making debug information available, it is not necessary to re-run the
perf recordcommand. Running theperf reportcommand again will reflect the resolved symbols.
11.4. Enabling debug and source repositories Copiar enlaceEnlace copiado en el portapapeles!
To access essential debugging data for system components, enable debug and source repositories. RHEL disables these by default to save space. Enable them to install debuginfo packages required for performance measurement and deep system troubleshooting.
Procedure
Enable the source and debug information package channels:
# subscription-manager repos --enable rhel-10-for-$(uname -m)-baseos-debug-rpms# subscription-manager repos --enable rhel-10-for-$(uname -m)-baseos-source-rpms# subscription-manager repos --enable rhel-10-for-$(uname -m)-appstream-debug-rpms# subscription-manager repos --enable rhel-10-for-$(uname -m)-appstream-source-rpmsThe
$(uname -m)part is automatically replaced with a matching value for architecture of your system:Expand Table 11.1. Architecture name mappings Architecture name Value 64-bit Intel and AMD
x86_64
64-bit ARM
aarch64
IBM POWER
ppc64le
64-bit IBM Z
s390x
11.5. Getting debuginfo packages for an application or library by using GDB Copiar enlaceEnlace copiado en el portapapeles!
When you debug an issue and the installed package lacks the required debugging information, the GNU Debugger detects the missing information. It then suggests commands to install the corresponding debug packages.
Prerequisites
- The application or library you want to debug must be installed on the system.
-
GDB and the
debuginfo-installtool must be installed on the system. For details, see Setting up to debug applications. -
Repositories providing
debuginfoanddebugsourcepackages must be configured and enabled on the system. For details, see Enabling debug and source repositories.
Procedure
Start GDB attached to the application or library you want to debug. GDB automatically recognizes missing debugging information and suggests a command to run.
$ gdb -q /bin/lsReading symbols from /bin/ls...Reading symbols from .gnu_debugdata for /usr/bin/ls...(no debugging symbols found)...done. (no debugging symbols found)...done. Missing separate debuginfos, use: dnf debuginfo-install coreutils-9.5-6.el10.x86_64 (gdb)To exit GDB, type q and confirm with Enter.
(gdb) qRun the command suggested by GDB to install the required
debuginfopackages:# dnf debuginfo-install coreutils-9.5-6.el10.x86_64The dnf package management tool provides a summary of the changes, asks for confirmation and once you confirm, downloads and installs all the necessary files. In case GDB is not able to suggest the
debuginfopackage, follow the procedure described in Gettingdebuginfopackages for an application or library manually.
Chapter 12. Counting events during process execution with perf stat Copiar enlaceEnlace copiado en el portapapeles!
You can use the perf stat command to count hardware and software events during process execution.
12.1. The purpose of perf stat Copiar enlaceEnlace copiado en el portapapeles!
The perf stat command generates counts for hardware and software performance events. If you do not specify any events, then perf stat counts a set of common hardware and software events.
12.2. Counting events with perf stat Copiar enlaceEnlace copiado en el portapapeles!
You can use perf stat to count hardware and software event occurrences while a command is running and generate statistics of these counts. By default, perf stat operates in per-thread mode.
Prerequisites
-
You have the
perfuser space tool installed as described in Installing perf. - You have root access to count both user-space and kernel-space events.
Procedure
Count the events.
Running the
perf statcommand without root access will only count events occurring in the user space:$ perf stat lsDesktop Documents Downloads Music Pictures Public Templates Videos Performance counter stats for 'ls': 1.28 msec task-clock:u # 0.165 CPUs utilized 0 context-switches:u # 0.000 M/sec 0 cpu-migrations:u # 0.000 K/sec 104 page-faults:u # 0.081 M/sec 1,054,302 cycles:u # 0.823 GHz 1,136,989 instructions:u # 1.08 insn per cycle 228,531 branches:u # 178.447 M/sec 11,331 branch-misses:u # 4.96% of all branches 0.007754312 seconds time elapsed 0.000000000 seconds user 0.007717000 seconds sysIn the previous example,
perf statruns without root access the event names are followed by:u, indicating that these events were counted only in the user-space.To count both user-space and kernel-space events, enter:
# perf stat lsDesktop Documents Downloads Music Pictures Public Templates Videos Performance counter stats for 'ls': 3.09 msec task-clock # 0.119 CPUs utilized 18 context-switches # 0.006 M/sec 3 cpu-migrations # 0.969 K/sec 108 page-faults # 0.035 M/sec 6,576,004 cycles # 2.125 GHz 5,694,223 instructions # 0.87 insn per cycle 1,092,372 branches # 352.960 M/sec 31,515 branch-misses # 2.89% of all branches 0.026020043 seconds time elapsed 0.000000000 seconds user 0.014061000 seconds sys
By default,
perf statoperates in per-thread mode. To change to CPU-wide event counting, pass the-aoption toperf stat. To count CPU-wide events, you need root access:# perf stat -a ls
12.3. Interpretation of perf stat output Copiar enlaceEnlace copiado en el portapapeles!
perf stat executes a specified command and counts event occurrences during the command execution. It displays statistics of these counts in three columns:
- The number of occurrences counted for a given event.
- The name of the event that was counted.
-
When related metrics are available, a ratio or percentage is displayed after the hash sign (
#) in the right-most column. For example, in default mode,perf statcounts cycles and instructions to calculate and display instructions per cycle in the right-most column. Branch misses as a percentage of total branches show similar behavior in default counts.
12.4. Attaching perf stat to a running process Copiar enlaceEnlace copiado en el portapapeles!
You can attach perf stat to a running process. This instructs perf stat to count event occurrences only in the specified processes during the execution of a command.
Prerequisites
-
You have the
perfuser space tool installed as described in Installing perf.
Procedure
Attach
perf statto a running process:$ perf stat -p ID1,ID2 sleep secondsThe previous example counts events in the processes with the IDs of
ID1andID2for a time period ofsecondsseconds as dictated by using thesleepcommand.
Chapter 13. Recording and analyzing performance profiles with perf Copiar enlaceEnlace copiado en el portapapeles!
Use the perf tool to record performance data and analyze it at a later time.
13.1. The purpose of perf record Copiar enlaceEnlace copiado en el portapapeles!
The perf record command samples performance data and stores it in a perf.data file. You can then read and visualize data with other perf commands. perf.data is generated in the current directory. It can be accessed at a later time, possibly on a different machine.
You can profile the entire system by running perf record -a. This records performance data until you press Ctrl+C or for the duration of a specific command. You can profile a single application/process by running perf record ./your_app or by attaching to an already running one by using perf record -p PID.
13.2. Recording a performance profile Copiar enlaceEnlace copiado en el portapapeles!
You can use perf record to sample and record performance data. The level of data captured depends on the access level:
-
Without root access:
perf recordcollects data only in the user space. -
With root access:
perf recordcollects data in both the user and kernel space.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf. - You have root access if you want to collect data for user and kernel space.
Procedure
Do one of the following:
To sample and record the performance data for the user space:
$ perf record commandTo sample and record the performance data for kernel space (use root or sudo access):
# perf record commandReplace command with the actual command you want to profile. If you do not specify a command, then
perf recordsamples data until you manually stop it by pressing Ctrl+C.
13.3. Recording a performance profile in per-CPU mode Copiar enlaceEnlace copiado en el portapapeles!
Use perf record in per-CPU mode to sample and record performance data across all threads on a monitored CPU. It captures data from both user space and kernel space simultaneously. By default, per-CPU mode monitors all online CPUs.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf.
Procedure
Sample and record the performance data:
# perf record -a commandReplace command with the command you want to sample data during. If you do not specify a command, then
perf recordsamples data until you manually stop it by pressing Ctrl+C.
13.4. Capturing call graph data with perf record Copiar enlaceEnlace copiado en el portapapeles!
You can configure the perf record tool so that it records which function calls other functions in the performance profile. This helps to identify a bottleneck if several processes are calling the same function.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf.
Procedure
Sample and record performance data with the
--call-graphoption:$ perf record --call-graph method command-
Replace command with the command you want to sample data during. If you do not specify a command, then
perf recordsamples data until you manually stop it by pressing Ctrl+C. Replace method with one of the following unwinding methods:
- fp
-
Uses the frame pointer method. Depending on compiler optimization, such as with binaries built with the GCC option
--fomit-frame-pointer, this may not be able to unwind the stack. - dwarf
- Uses DWARF Call Frame Information to unwind the stack.
- lbr
- Uses the last branch record hardware on Intel processors.
-
Replace command with the command you want to sample data during. If you do not specify a command, then
13.5. Analyzing perf.data with perf report Copiar enlaceEnlace copiado en el portapapeles!
You can use perf report to display and analyze data from the perf.data file.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf. -
A
perf.datafile is stored in the current directory. -
You have the
rootaccess, if theperf.datafile was created with root access.
Procedure
Display the contents of the perf.data file for further analysis:
# perf reportSamples: 2K of event 'cycles', Event count (approx.): 235462960 Overhead Command Shared Object Symbol 2.36% kswapd0 [kernel.kallsyms] [k] page_vma_mapped_walk 2.13% sssd_kcm libc-2.28.so [.] memset_avx2_erms 2.13% perf [kernel.kallsyms] [k] smp_call_function_single 1.53% gnome-shell libc-2.28.so [.] strcmp_avx2 1.17% gnome-shell libglib-2.0.so.0.5600.4 [.] g_hash_table_lookup 0.93% Xorg libc-2.28.so [.] memmove_avx_unaligned_erms 0.89% gnome-shell libgobject-2.0.so.0.5600.4 [.] g_object_unref 0.87% kswapd0 [kernel.kallsyms] [k] page_referenced_one 0.86% gnome-shell libc-2.28.so [.] memmove_avx_unaligned_erms 0.83% Xorg [kernel.kallsyms] [k] alloc_vmap_area 0.63% gnome-shell libglib-2.0.so.0.5600.4 [.] g_slice_alloc 0.53% gnome-shell libgirepository-1.0.so.1.0.0 [.] g_base_info_unref 0.53% gnome-shell ld-2.28.so [.] _dl_find_dso_for_object 0.49% kswapd0 [kernel.kallsyms] [k] vma_interval_tree_iter_next 0.48% gnome-shell libpthread-2.28.so [.] pthread_getspecific 0.47% gnome-shell libgirepository-1.0.so.1.0.0 [.] 0x0000000000013b1d 0.45% gnome-shell libglib-2.0.so.0.5600.4 [.] g_slice_free1 0.45% gnome-shell libgobject-2.0.so.0.5600.4 [.] g_type_check_instance_is_fundamentally_a 0.44% gnome-shell libc-2.28.so [.] malloc 0.41% swapper [kernel.kallsyms] [k] apic_timer_interrupt 0.40% gnome-shell ld-2.28.so [.] _dl_lookup_symbol_x 0.39% kswapd0 [kernel.kallsyms] [k] raw_callee_save___pv_queued_spin_unlockFor more information, see the
perf-record(1)man page on your system.
13.6. Attaching perf record to a running process Copiar enlaceEnlace copiado en el portapapeles!
You can attach perf record to a running process. This instructs perf record to only sample and record performance data in the specified processes.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf.
Procedure
Attach
perf recordto a running process:$ perf record -p ID1,ID2 sleep secondsThis command samples and records performance data of the processes with the process ID’s
ID1andID2for a time period ofsecondsseconds as dictated by using the sleep command. You can also configureperfto record events in specific threads:$ perf record -t ID1,ID2 sleep secondsNoteWhen using the
-tflag and stipulating thread ID’s,perfdisables inheritance by default. You can enable inheritance by adding the--inheritoption.
13.7. Interpretation of perf report output Copiar enlaceEnlace copiado en el portapapeles!
The table displayed by running the perf report command sorts the data into the following several columns:
- Overhead
- Indicates what percentage of overall samples were collected in that particular function.
- Command
- Tells you which process the samples were collected from.
- Shared Object
- Displays the ELF image name where the samples come from (the name [kernel.kallsyms] is used when the samples come from the kernel).
- Symbol
- Displays the function name or symbol. In default mode, the functions are sorted in descending order with those with the highest overhead displayed first.
13.8. Generating a perf.data file readable on a different device Copiar enlaceEnlace copiado en el portapapeles!
You can use the perf tool to record performance data into a perf.data file to be analyzed on a different device.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf. -
The kernel
debuginfopackage is installed. For more information, see Getting debuginfo packages for an application or library using GDB.
Procedure
Capture performance data you are interested in investigating further:
# perf record -a --call-graph fp sleep <seconds>This example generates a
perf.dataover the entire system for a period of seconds seconds as dictated by the use of the sleep command. It would also capture call graph data using the frame pointer method.Generate an archive file containing debug symbols of the recorded data:
# perf archive
Verification
Verify that the archive file has been generated in your current active directory:
# ls perf.data*The output will display every file in your current directory that begins with
perf.data. The archive file will be named either:perf.data.tar.gzorperf.data.tar.bz2.
13.9. Analyzing a perf.data file from a different device Copiar enlaceEnlace copiado en el portapapeles!
You can use the perf tool to analyze a perf.data file that was generated on a different device.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf. -
You have
perf.dataand associated archive files generated on a different device.
Procedure
-
Copy both the
perf.datafile and the archive file into your current active directory. Extract the archive file into
~/.debug:# mkdir -p ~/.debug# tar xf perf.data.tar.bz2 -C ~/.debugNoteThe archive file might also be named
perf.data.tar.gz.Open the
perf.datafile for further analysis:# perf report
Chapter 14. Investigating busy CPUs with perf Copiar enlaceEnlace copiado en el portapapeles!
When investigating performance issues on a system, use perf to identify and monitor the busiest CPUs. This helps you focus your efforts.
14.1. Displaying CPU events counted with perf stat Copiar enlaceEnlace copiado en el portapapeles!
You can use perf stat to display which CPU events were counted on by disabling CPU count aggregation. You must count events in the system-wide mode by using the -a flag in order to use this functionality.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf.
Procedure
Count the events with CPU count aggregation disabled:
# perf stat -a -A sleep secondsThis command displays the count of a default set of common hardware and software events recorded for a time-period of seconds seconds, as dictated by using the sleep command, over each CPU in ascending order, starting with CPU0. As such, it may be useful to specify an event such as cycles:
# perf stat -a -A -e cycles sleep seconds
14.2. Displaying CPU samples taken with perf report Copiar enlaceEnlace copiado en el portapapeles!
The perf record command samples performance data and stores it in a perf.data file. You can read this file by using the perf report command. The perf record command also records the CPU on which each sample was taken. To view this information, configure the perf report command accordingly.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf. -
There is a
perf.datafile created with perf record in the current directory. If theperf.datafile was created with root access, you need to run perf report with root access too.
Procedure
Display the contents of the
perf.datafile for further analysis while sorting by CPU:# perf report --sort cpuYou can sort by CPU and command to display more detailed information about where CPU time is being spent:
# perf report --sort cpu,commThis example will list commands from all monitored CPUs by total overhead in descending order of overhead usage and identify the CPU the command was executed.
14.3. Displaying specific CPUs during profiling with perf top Copiar enlaceEnlace copiado en el portapapeles!
You can configure perf top to display specific CPUs and their relative usage while profiling your system in real time.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf.
Procedure
Start the perf top interface while sorting by CPU:
# perf top --sort cpuThis command lists CPUs and their respective overhead in descending order of overhead usage in real time. You can sort by CPU and command for more detailed information of where CPU time is being spent:
# perf top --sort cpu,commThis command lists commands by total overhead in descending order of overhead usage and identifies the CPU the command was executed on in real time.
14.4. Monitoring specific CPUs by using perf record and perf report Copiar enlaceEnlace copiado en el portapapeles!
Use perf to sample performance data from specific CPUs and analyze the results to identify CPU-specific bottlenecks.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf.
Procedure
Record performance data from specific CPUs:
Use the
-Coption withperf recordto target specific CPUs. The following examples demonstrate how to specify individual CPUs or a range.To sample data from selected CPUs (comma-separated):
# perf record -C 0,1 sleep <seconds>This command samples and records performance data from CPUs
0and1for the specified number of seconds.To sample data from a range of CPUs:
# perf record -C 0-2 sleep <seconds>This command samples and records performance data from CPUs
0,1, and2over the specified duration.
Analyze the recorded performance data:
Use the
perf reportcommand to read and analyze theperf.datafile.# perf reportThis command displays the contents of the
perf.datafile.NoteIf you want to see which CPU each sample was recorded on, see Displaying which CPU samples were taken on with perf report.
Chapter 15. Creating uprobes with perf Copiar enlaceEnlace copiado en el portapapeles!
Uprobes (user-space probes) provide dynamic instrumentation for user-space applications at runtime. They require no source code changes or recompilation.
There are two primary use cases where uprobes are useful:
- Debugging and performance analysis
-
Uprobesfunction similarly to watchpoints. You can insert them at specific locations in your application to count how often those code paths are executed. Additionally, they can capture rich context such as call stacks or variable values. Use them to identify performance bottlenecks or track bugs. - Event-based data collection
-
Uprobesact as switching events for mechanisms such as circular buffers. They can control when data is recorded or flushed based on execution triggers in user space.
Uprobes integrate seamlessly with perf, which can both consume existing uprobes and create new ones. This flexibility enables powerful, non-intrusive observability and profiling of user-space behavior alongside kernel-space instrumentation (by using kprobes).
15.1. Creating uprobes at the function level with perf Copiar enlaceEnlace copiado en el portapapeles!
Use the perf tool to create dynamic tracepoints at arbitrary points in a process or application. Use these tracepoints with perf stat or perf record to analyze process and application behavior.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf.
Procedure
Create the uprobe in the process or application you are interested in monitoring at a location of interest within the process or application:
# perf probe -x /path/to/executable -a functionAdded new event: probe_executable:function (on function in /path/to/executable) You can now use it in all perf tools, such as: perf record -e probe_executable:function -aR sleep 1
15.2. Creating uprobes on lines within a function with perf Copiar enlaceEnlace copiado en el portapapeles!
You can use the tracepoints conjunction with other perf tools such as perf stat and perf record to better understand the process or application behavior.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf. You have received the debugging symbols for your executable:
# objdump -t ./your_executable | headNoteTo do this, install the
debuginfopackage of the executable. If the executable is a locally developed application, compile the application with debugging information by using the-goption in GCC.
Procedure
View the function lines where you can place a
uprobe:$ perf probe -x ./your_executable -L main<main@/home/user/my_executable:0> 0 int main(int argc, const char *argv) 1 { int err; const char *cmd; char sbuf[STRERR_BUFSIZE]; / libsubcmd init */ 7 exec_cmd_init("perf", PREFIX, PERF_EXEC_PATH, EXEC_PATH_ENVIRONMENT); 8 pager_init(PERF_PAGER_ENVIRONMENT);Create the
uprobefor the function line that you want to monitor:# perf probe -x ./my_executable main:8Added new event: probe_my_executable:main_L8 (on main:8 in /home/user/my_executable) you can now use it in all perf tools, such as: Perf record -e probe_my_executable:main_L8 -aR sleep 1
15.3. Perf script output of data recorded over uprobes Copiar enlaceEnlace copiado en el portapapeles!
A common method to analyze data collected with uprobes is to run the perf script command. It which reads the perf.data file and displays a detailed trace of the recorded workload.
- In the perf script example output
-
A
uprobeis added to the functionisprime()in a program calledmy_prog ais a function argument added to theuprobe. Alternatively,acould be an arbitrary variable visible in the code scope of where you add youruprobe:# perf scriptmy_prog 1367 [007] 10802159.906593: probe_my_prog:isprime: (400551) a=2 my_prog 1367 [007] 10802159.906623: probe_my_prog:isprime: (400551) a=3 my_prog 1367 [007] 10802159.906625: probe_my_prog:isprime: (400551) a=4 my_prog 1367 [007] 10802159.906627: probe_my_prog:isprime: (400551) a=5 my_prog 1367 [007] 10802159.906629: probe_my_prog:isprime: (400551) a=6 my_prog 1367 [007] 10802159.906631: probe_my_prog:isprime: (400551) a=7 my_prog 1367 [007] 10802159.906633: probe_my_prog:isprime: (400551) a=13 my_prog 1367 [007] 10802159.906635: probe_my_prog:isprime: (400551) a=17 my_prog 1367 [007] 10802159.906637: probe_my_prog:isprime: (400551) a=19
-
A
Chapter 16. Profiling memory accesses with perf mem Copiar enlaceEnlace copiado en el portapapeles!
You can use the perf mem command to sample memory accesses on your system. The mem subcommand of the perf tool enables the sampling of memory accesses (loads and stores).
The perf mem command tracks memory latency, access types, and cache hits. Recording data symbols also identifies memory locations for these events.
16.1. Sampling memory access with perf mem Copiar enlaceEnlace copiado en el portapapeles!
You can use the perf mem command to sample memory accesses on your system. The command takes the same options as perf record and perf report as well as some options exclusive to the mem sub-command. The recorded data is stored in a perf.data file in the current directory for later analysis.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf.
Procedure
Sample the memory accesses:
# perf mem record -a sleep <seconds>This command samples memory accesses across all CPUs for a period of
<seconds>seconds as dictated by thesleepcommand. You can replace thesleepcommand for any command during which you want to sample memory access data. By default,perf memsamples both memory loads and stores. You can select only one memory operation by using the-toption and specifying eitherloadorstorebetweenperf memandrecord. For loads, information over the memory hierarchy level, TLB memory accesses, bus snoops, and memory locks is captured.Open the
perf.datafile for analysis:# perf mem reportIf you have used the example commands, the output is:
Available samples 35k cpu/mem-loads,ldlat=30/P 54k cpu/mem-stores/PThe
cpu/mem-loads,ldlat=30/Pline denotes data collected over memory loads and thecpu/mem-stores/Pline denotes data collected over memory stores. Highlight the category of interest and press Enter to view the data:Samples: 35K of event 'cpu/mem-loads,ldlat=30/P', Event count (approx.): 4067062 Overhead Samples Local Weight Memory access Symbol Shared Object Data Symbol Data Object Snoop TLB access Locked 0.07% 29 98 L1 or L1 hit [.] 0x000000000000a255 libspeexdsp.so.1.5.0 [.] 0x00007f697a3cd0f0 anon None L1 or L2 hit No 0.06% 26 97 L1 or L1 hit [.] 0x000000000000a255 libspeexdsp.so.1.5.0 [.] 0x00007f697a3cd0f0 anon None L1 or L2 hit No 0.06% 25 96 L1 or L1 hit [.] 0x000000000000a255 libspeexdsp.so.1.5.0 [.] 0x00007f697a3cd0f0 anon None L1 or L2 hit No 0.06% 1 2325 Uncached or N/A hit [k] pci_azx_readl [kernel.kallsyms] [k] 0xffffb092c06e9084 [kernel.kallsyms] None L1 or L2 hit No 0.06% 1 2247 Uncached or N/A hit [k] pci_azx_readl [kernel.kallsyms] [k] 0xffffb092c06e8164 [kernel.kallsyms] None L1 or L2 hit No 0.05% 1 2166 L1 or L1 hit [.] 0x00000000038140d6 libxul.so [.] 0x00007ffd7b84b4a8 [stack] None L1 or L2 hit No 0.05% 1 2117 Uncached or N/A hit [k] check_for_unclaimed_mmio [kernel.kallsyms] [k] 0xffffb092c1842300 [kernel.kallsyms] None L1 or L2 hit No 0.05% 22 95 L1 or L1 hit [.] 0x000000000000a255 libspeexdsp.so.1.5.0 [.] 0x00007f697a3cd0f0 anon None L1 or L2 hit No 0.05% 1 1898 L1 or L1 hit [.] 0x0000000002a30e07 libxul.so [.] 0x00007f610422e0e0 anon None L1 or L2 hit No 0.05% 1 1878 Uncached or N/A hit [k] pci_azx_readl [kernel.kallsyms] [k] 0xffffb092c06e8164 [kernel.kallsyms] None L2 miss No 0.04% 18 94 L1 or L1 hit [.] 0x000000000000a255 libspeexdsp.so.1.5.0 [.] 0x00007f697a3cd0f0 anon None L1 or L2 hit No 0.04% 1 1593 Local RAM or RAM hit [.] 0x00000000026f907d libxul.so [.] 0x00007f3336d50a80 anon Hit L2 miss No 0.03% 1 1399 L1 or L1 hit [.] 0x00000000037cb5f1 libxul.so [.] 0x00007fbe81ef5d78 libxul.so None L1 or L2 hit No 0.03% 1 1229 LFB or LFB hit [.] 0x0000000002962aad libxul.so [.] 0x00007fb6f1be2b28 anon None L2 miss No 0.03% 1 1202 LFB or LFB hit [.] __pthread_mutex_lock libpthread-2.29.so [.] 0x00007fb75583ef20 anon None L1 or L2 hit No 0.03% 1 1193 Uncached or N/A hit [k] pci_azx_readl [kernel.kallsyms] [k] 0xffffb092c06e9164 [kernel.kallsyms] None L2 miss No 0.03% 1 1191 L1 or L1 hit [k] azx_get_delay_from_lpib [kernel.kallsyms] [k] 0xffffb092ca7efcf0 [kernel.kallsyms] None L1 or L2 hit NoOptional: Sort your results to investigate different aspects of interest when displaying the data. For example, to sort data over memory loads by type of memory accesses occurring during the sampling period in descending order of overhead they account for:
# perf mem -t load report --sort=memFor example, the output can be:
Samples: 35K of event 'cpu/mem-loads,ldlat=30/P', Event count (approx.): 40670 Overhead Samples Memory access 31.53% 9725 LFB or LFB hit 29.70% 12201 L1 or L1 hit 23.03% 9725 L3 or L3 hit 12.91% 2316 Local RAM or RAM hit 2.37% 743 L2 or L2 hit 0.34% 9 Uncached or N/A hit 0.10% 69 I/O or N/A hit 0.02% 825 L3 miss
16.2. Interpretation of perf mem report output Copiar enlaceEnlace copiado en el portapapeles!
When you run the perf mem report command without any modifiers sorts the data into several columns:
- Overhead
- Indicates percentage of overall samples collected in that particular function.
- Samples
- Displays the number of samples accounted for by that row.
- Local Weight
- Displays the access latency in processor core cycles.
- Memory Access
- Displays the type of memory access that occurred.
- Symbol
- Displays the function name or symbol.
- Shared Object
-
Displays the ELF image name where the samples come from (the name
[kernel.kallsyms]is used when the samples come from the kernel). - Data Symbol
Displays the address of the memory location that row was targeting.
ImportantThe
Data Symbolcolumn might display a raw address due to dynamic allocation of memory or stack memory being accessed.- Snoop
- Displays bus transactions.
- TLB Access
- Displays TLB memory accesses.
- Locked
- Indicates if a function was or was not memory locked. In default mode, the functions are sorted in descending order with those with the highest overhead displayed first.
Chapter 17. Detecting false sharing Copiar enlaceEnlace copiado en el portapapeles!
False sharing occurs when a processor core on a Symmetric Multi Processing (SMP) system modifies different data items located on the same cache line. These processors access other data items that are not being shared between the processors.
Modified cache lines force other processors to invalidate and update their copies, even if they do not need the modified data.
You can use the perf c2c command to detect false sharing.
17.1. The purpose of perf c2c Copiar enlaceEnlace copiado en el portapapeles!
The c2c subcommand of the perf tool enables Shared Data Cache-to-Cache (C2C) analysis. You can use the perf c2c command to inspect cache-line contention to detect both true and false sharing.
Cache-line contention occurs when a processor core modifies data on a cache line shared by other processors. All other processors using this cache-line then must invalidate their copy and request an updated one. This can lead to degraded performance.
The perf c2c command provides the following information:
- Cache lines where contention has been detected
- Processes reading and writing the data
- Instructions causing the contention
- The Non-Uniform Memory Access (NUMA) nodes involved in the contention
17.2. Detecting cache-line contention with perf c2c Copiar enlaceEnlace copiado en el portapapeles!
You can use the perf c2c command to detect cache-line contention in a system. The perf c2c command supports the same options as perf record as well as some options exclusive to the c2c subcommand. The recorded data is stored in a perf.data file in the current directory for later analysis.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf.
Procedure
Use
perf c2cto detect cache-line contention:# perf c2c record -a sleep <seconds>This command samples and records cache-line contention data across all CPU’s for a period of
<seconds>as dictated by thesleepcommand. You can replace thesleepcommand with any command you want to collect cache-line contention data over.
17.3. Visualizing a perf.data file recorded with perf c2c record Copiar enlaceEnlace copiado en el portapapeles!
You can visualize the perf.data file that is recorded with the perf c2c command.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf. -
A
perf.datafile recorded by using theperf c2ccommand is available in the current directory. For more information, see Detecting cache-line contention with perf c2c.
Procedure
Open the
perf.datafile for analysis:# perf c2c report --stdioThis command visualizes the
perf.datafile into several graphs within the command line:================================================= Trace Event Information ================================================= Total records : 329219 Locked Load/Store Operations : 14654 Load Operations : 69679 Loads - uncacheable : 0 Loads - IO : 0 Loads - Miss : 3972 Loads - no mapping : 0 Load Fill Buffer Hit : 11958 Load L1D hit : 17235 Load L2D hit : 21 Load LLC hit : 14219 Load Local HITM : 3402 Load Remote HITM : 12757 Load Remote HIT : 5295 Load Local DRAM : 976 Load Remote DRAM : 3246 Load MESI State Exclusive : 4222 Load MESI State Shared : 0 Load LLC Misses : 22274 LLC Misses to Local DRAM : 4.4% LLC Misses to Remote DRAM : 14.6% LLC Misses to Remote cache (HIT) : 23.8% LLC Misses to Remote cache (HITM) : 57.3% Store Operations : 259539 Store - uncacheable : 0 Store - no mapping : 11 Store L1D Hit : 256696 Store L1D Miss : 2832 No Page Map Rejects : 2376 Unable to parse data source : 1 ================================================= Global Shared Cache Line Event Information ================================================= Total Shared Cache Lines : 55 Load HITs on shared lines : 55454 Fill Buffer Hits on shared lines : 10635 L1D hits on shared lines : 16415 L2D hits on shared lines : 0 LLC hits on shared lines : 8501 Locked Access on shared lines : 14351 Store HITs on shared lines : 109953 Store L1D hits on shared lines : 109449 Total Merged records : 126112 ================================================= c2c details ================================================= Events : cpu/mem-loads,ldlat=30/P : cpu/mem-stores/P Cachelines sort on : Remote HITMs Cacheline data grouping : offset,pid,iaddr ================================================= Shared Data Cache Line Table ================================================= # # Total Rmt ----- LLC Load Hitm ----- ---- Store Reference ---- --- Load Dram ---- LLC Total ----- Core Load Hit ----- -- LLC Load Hit -- # Index Cacheline records Hitm Total Lcl Rmt Total L1Hit L1Miss Lcl Rmt Ld Miss Loads FB L1 L2 Llc Rmt # ..... .................. ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... ....... ....... ....... ........ ........ # 0 0x602180 149904 77.09% 12103 2269 9834 109504 109036 468 727 2657 13747 40400 5355 16154 0 2875 529 1 0x602100 12128 22.20% 3951 1119 2832 0 0 0 65 200 3749 12128 5096 108 0 2056 652 2 0xffff883ffb6a7e80 260 0.09% 15 3 12 161 161 0 1 1 15 99 25 50 0 6 1 3 0xffffffff81aec000 157 0.07% 9 0 9 1 0 1 0 7 20 156 50 59 0 27 4 4 0xffffffff81e3f540 179 0.06% 9 1 8 117 97 20 0 10 25 62 11 1 0 24 7 ================================================= Shared Cache Line Distribution Pareto ================================================= # # ----- HITM ----- -- Store Refs -- Data address ---------- cycles ---------- cpu Shared # Num Rmt Lcl L1 Hit L1 Miss Offset Pid Code address rmt hitm lcl hitm load cnt Symbol Object Source:Line Node{cpu list} # ..... ....... ....... ....... ....... .................. ....... .................. ........ ........ ........ ........ ................... .................... ........................... .... # ------------------------------------------------------------- 0 9834 2269 109036 468 0x602180 ------------------------------------------------------------- 65.51% 55.88% 75.20% 0.00% 0x0 14604 0x400b4f 27161 26039 26017 9 [.] read_write_func no_false_sharing.exe false_sharing_example.c:144 0{0-1,4} 1{24-25,120} 2{48,54} 3{169} 0.41% 0.35% 0.00% 0.00% 0x0 14604 0x400b56 18088 12601 26671 9 [.] read_write_func no_false_sharing.exe false_sharing_example.c:145 0{0-1,4} 1{24-25,120} 2{48,54} 3{169} 0.00% 0.00% 24.80% 100.00% 0x0 14604 0x400b61 0 0 0 9 [.] read_write_func no_false_sharing.exe false_sharing_example.c:145 0{0-1,4} 1{24-25,120} 2{48,54} 3{169} 7.50% 9.92% 0.00% 0.00% 0x20 14604 0x400ba7 2470 1729 1897 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:154 1{122} 2{144} 17.61% 20.89% 0.00% 0.00% 0x28 14604 0x400bc1 2294 1575 1649 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:158 2{53} 3{170} 8.97% 12.96% 0.00% 0.00% 0x30 14604 0x400bdb 2325 1897 1828 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:162 0{96} 3{171} ------------------------------------------------------------- 1 2832 1119 0 0 0x602100 ------------------------------------------------------------- 29.13% 36.19% 0.00% 0.00% 0x20 14604 0x400bb3 1964 1230 1788 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:155 1{122} 2{144} 43.68% 34.41% 0.00% 0.00% 0x28 14604 0x400bcd 2274 1566 1793 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:159 2{53} 3{170} 27.19% 29.40% 0.00% 0.00% 0x30 14604 0x400be7 2045 1247 2011 2 [.] read_write_func no_false_sharing.exe false_sharing_example.c:163 0{96} 3{171}
17.4. Interpretation of perf c2c report output Copiar enlaceEnlace copiado en el portapapeles!
The visualization displayed by running the perf c2c report --stdio command sorts the data into several tables:
- Trace Events Information
-
Provides a high level summary of all the load and store samples collected by
perf c2c record. - Global Shared Cache Line Event Information
- Provides statistics over shared cache lines.
c2cDetails-
Provides information about what events were sampled and how the
perf c2c reportdata is organized. - Shared Data Cache Line Table
- Summarizes the hottest cache lines showing false sharing. Sorts by descending remote Hitm counts per cache line.
- Shared Cache Line Distribution Pareto
Provides a variety of information about each cache line experiencing contention:
-
The cache lines are numbered in the NUM column, starting at
0. - The virtual address of each cache line is contained in the Data address Offset column. It is then followed by the offset into the cache line where different accesses occurred.
- The Pid column contains the process ID.
- The Code Address column contains the instruction pointer code address.
- The columns under the cycles label show average load latencies.
- The cpu cnt column displays how many different CPUs samples came from. This is the number of different CPUs waiting for the data indexed at that given location.
- The Symbol column displays the function name or symbol.
- The Shared Object column displays the ELF image name where the samples come from ([kernel.kallsyms] is used when the samples come from the kernel).
- The Source:Line column displays the source file and line number.
- The Node{cpu list} column displays which specific CPUs samples came from for each node.
-
The cache lines are numbered in the NUM column, starting at
17.5. Detecting false sharing with perf c2c Copiar enlaceEnlace copiado en el portapapeles!
You can detect false sharing with the perf c2c command.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf. -
A
perf.datafile recorded by using theperf c2ccommand is available in the current directory. For more information, see Detecting cache-line contention with perf c2c.
Procedure
Open the
perf.datafile:# perf c2c report --stdioThis opens the
perf.datafile in the command line.In the Trace Event Information table, locate the row containing the values for LLC Misses to Remote Cache (HITM):
The percentage in the value column of the LLC Misses to Remote Cache (HITM) row represents the percentage of LLC misses that were occurring across NUMA nodes in modified cache-lines and is a key indicator that false sharing has occurred.
================================================= Trace Event Information ================================================= Total records : 329219 Locked Load/Store Operations : 14654 Load Operations : 69679 Loads - uncacheable : 0 Loads - IO : 0 Loads - Miss : 3972 Loads - no mapping : 0 Load Fill Buffer Hit : 11958 Load L1D hit : 17235 Load L2D hit : 21 Load LLC hit : 14219 Load Local HITM : 3402 Load Remote HITM : 12757 Load Remote HIT : 5295 Load Local DRAM : 976 Load Remote DRAM : 3246 Load MESI State Exclusive : 4222 Load MESI State Shared : 0 Load LLC Misses : 22274 LLC Misses to Local DRAM : 4.4% LLC Misses to Remote DRAM : 14.6% LLC Misses to Remote cache (HIT) : 23.8% LLC Misses to Remote cache (HITM) : 57.3% Store Operations : 259539 Store - uncacheable : 0 Store - no mapping : 11 Store L1D Hit : 256696 Store L1D Miss : 2832 No Page Map Rejects : 2376 Unable to parse data source : 1Inspect the Rmt column of the LLC Load Hitm field of the Shared Data Cache Line Table:
================================================= Shared Data Cache Line Table ================================================= # # Total Rmt ----- LLC Load Hitm ----- ---- Store Reference ---- --- Load Dram ---- LLC Total ----- Core Load Hit ----- -- LLC Load Hit -- # Index Cacheline records Hitm Total Lcl Rmt Total L1Hit L1Miss Lcl Rmt Ld Miss Loads FB L1 L2 Llc Rmt # ..... .................. ....... ....... ....... ....... ....... ....... ....... ....... ........ ........ ....... ....... ....... ....... ....... ........ ........ # 0 0x602180 149904 77.09% 12103 2269 9834 109504 109036 468 727 2657 13747 40400 5355 16154 0 2875 529 1 0x602100 12128 22.20% 3951 1119 2832 0 0 0 65 200 3749 12128 5096 108 0 2056 652 2 0xffff883ffb6a7e80 260 0.09% 15 3 12 161 161 0 1 1 15 99 25 50 0 6 1 3 0xffffffff81aec000 157 0.07% 9 0 9 1 0 1 0 7 20 156 50 59 0 27 4 4 0xffffffff81e3f540 179 0.06% 9 1 8 117 97 20 0 10 25 62 11 1 0 24 7The table is sorted in descending order by the amount of remote Hitm detected per cache line. A high number in the Rmt column of the LLC Load Hitm section indicates false sharing. Further inspection of the cache line where it occurred is required to debug the false sharing activity.
Chapter 18. Getting started with flamegraphs Copiar enlaceEnlace copiado en el portapapeles!
As a system administrator, you can use flamegraphs to create visualizations of system performance data recorded with the perf tool. As a software developer, you can use flamegraphs to create visualizations of application performance data recorded with the perf tool. Sampling stack traces is a common technique for profiling CPU performance with the perf tool. Unfortunately, the results of profiling stack traces with perf can be extremely verbose and labor-intensive to analyze. flamegraphs are visualizations created from data recorded with perf to make identifying hot code-paths faster and easier.
18.1. Installing flamegraphs Copiar enlaceEnlace copiado en el portapapeles!
You must install the required packages to start using flamegraphs.
Procedure
Install the flamegraphs package:
# dnf install js-d3-flame-graph
18.2. Creating flamegraphs over the entire system Copiar enlaceEnlace copiado en el portapapeles!
You can visualize performance data recorded over an entire system by using flamegraphs.
Prerequisites
-
flamegraphsare installed as described in Installing flamegraphs. -
The
perftool is installed as described in Installing perf.
Procedure
Record the data and create the visualization.
# perf script flamegraph -a -F 99 sleep 60This command samples and records performance data over the entire system for 60 seconds, as stipulated by use of the
sleepcommand, and then constructs the visualization which is stored in the current active directory asflamegraph.html. The command samples call-graph data by default and takes the same arguments as theperftool, in this particular case:- -a
- Stipulates to record data over the entire system.
- -F
- To set the sampling frequency per second.
Verification
For analysis, view the generated visualization:
# xdg-open flamegraph.htmlThis command opens the visualization in the default browser:
18.3. Creating flamegraphs over specific processes Copiar enlaceEnlace copiado en el portapapeles!
You can use flamegraphs to visualize performance data recorded over specific running processes.
Prerequisites
-
flamegraphsare installed as described in Installing flamegraphs. -
The
perftool is installed as described in Installing perf.
Procedure
Record the data and create the visualization.
# perf script flamegraph -a -F 99 -p ID1,ID2 sleep 60This command samples and records performance data of the processes with the process ID’s
ID1andID2for60seconds, as stipulated by use of thesleepcommand, and then constructs the visualization which will be stored in the current active directory asflamegraph.html. The command samples call-graph data by default and takes the same arguments as theperftool, in this particular case:- -a
- Stipulates to record data over the entire system.
- -F
- To set the sampling frequency per second.
- -p
- To stipulate specific process ID’s to sample and record data over.
Verification
For analysis, view the generated visualization:
# xdg-open flamegraph.htmlThis command opens the visualization in the default browser:
18.4. Analyzing stack traces and function hotspots in flamegraphs Copiar enlaceEnlace copiado en el portapapeles!
Each box in the flamegraph represents a different function in the stack. The y-axis shows the stack depth. The top box in each stack represents the function that was on-CPU and the remaining boxes represent ancestry. The x-axis displays the population of the sampled call-graph data.
The x-axis sorts functions by sample count, not time. Children appear in descending order of their total samples. Box width represents frequency on-CPU. Wider boxes indicate functions or ancestors captured in more samples.
Procedure
To reveal the names of functions which may have not been displayed previously and further investigate the data, click the box within the flamegraph to zoom into the stack at that given location:
To return to the default view of the flamegraph, click Reset Zoom.
ImportantBoxes representing user-space functions may be labeled as
Unknownin flamegraphs because the binary of the function is stripped. Install thedebuginfopackage or compile your local application with debugging information enabled. Use the-goption in GCC, to display the function names or symbols in such a situation.
Chapter 19. Monitoring processes for performance bottlenecks by using perf circular buffers Copiar enlaceEnlace copiado en el portapapeles!
You can create circular buffers that take event-specific snapshots of data with the perf tool. Use them to monitor performance bottlenecks in specific processes or parts of applications running on your system.
In such cases, perf only writes data to a perf.data file for later analysis if a specified event is detected.
19.1. Circular buffers and event-specific snapshots with perf Copiar enlaceEnlace copiado en el portapapeles!
Recording hours of perf data before a specific event may cause excessive overhead or prove impractical for analysis. In such cases, use perf record to create custom circular buffers that take snapshots after specific events.
The --overwrite option makes perf record store all data in an overwritable circular buffer. When the buffer gets full, perf record automatically overwrites the oldest records which, therefore, never get written to a perf.data file.
Using the --overwrite and --switch-output-event options together configures a circular buffer. It records and dumps data continuously until it detects the --switch-output-event trigger event. The trigger signals perf record to write data from the circular buffer to the perf.data file after an event. This collects specific data while reducing overhead by only writing relevant information to the perf.data file.
19.2. Collecting specific data to monitor for performance bottlenecks by using perf circular buffers Copiar enlaceEnlace copiado en el portapapeles!
With the perf tool, you can create circular buffers. These are triggered by events you specify to only collect data you are interested in. To create circular buffers that collect event-specific data, use the --overwrite and --switch-output-event options for perf.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf. You have placed a
uprobein the process or application you are interested in monitoring at a location of interest within the process or application:# perf probe -x </path/to/executable> -a functionAdded new event: probe_executable:function (on function in /path/to/executable) You can now use it in all perf tools, such as: perf record -e probe_executable:function -aR sleep 1
Procedure
Create the circular buffer with the
uprobeas the trigger event:# perf record --overwrite -e cycles --switch-output-event probe_executable:function ./executable[ perf record: dump data: Woken up 1 times ] [ perf record: Dump perf.data.2021021012231959 ] [ perf record: dump data: Woken up 1 times ] [ perf record: Dump perf.data.2021021012232008 ] [ perf record: dump data: Woken up 1 times ] [ perf record: Dump perf.data.2021021012232082 ] [ perf record: Captured and wrote 5.621 MB perf.data.<timestamp> ]This command initiates the executable and collects cpu cycles, specified after the
-eoption, untilperfdetects theuprobe, the trigger event specified after the--switch-output-eventoption. At that point,perfsnapshots circular buffer data and stores it in a uniqueperf.datafile identified by timestamp. This example produced a total of 2 snapshots, the lastperf.datafile was forced by pressing Ctrl+C.
Chapter 20. Managing tracepoints by using perf collector without restarting perf Copiar enlaceEnlace copiado en el portapapeles!
Use the control pipe interface to manage different tracepoints in a running perf collector. You can dynamically adjust what data you are collecting without having to stop or restart perf. This prevents performance data loss that occurs when you stop or restart recording processes.
20.1. Adding tracepoints to a running perf collector without stopping or restarting perf Copiar enlaceEnlace copiado en el portapapeles!
You can add tracepoints to a running perf collector by using the control pipe interface. perf adjusts the data being recorded without having to stop perf, therefore preventing the loss of performance data.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf.
Procedure
Configure the control pipe interface:
# mkfifo control ack perf.pipeRun
perf recordwith the control file configuration and events you are interested in enabling:# perf record --control=fifo:control,ack -D -1 --no-buffering -e 'sched:*' -o - > perf.pipeIn this example, declaring
'sched:*'after the-eoption startsperf recordwith scheduler events.In a second command line, start the read side of the control pipe:
# cat perf.pipe | perf --no-pager script -i -Starting the read side of the control pipe triggers the following message in the first command line:
Events disabledIn a third command line, enable a tracepoint by using the control file:
# echo 'enable sched:sched_process_fork' > controlThis command triggers
perfto scan the current event list in the control file for the declared event. If the event is present, the tracepoint is enabled and the following message appears in the first command line:event sched:sched_process_fork enabledOnce the tracepoint is enabled, the second command line displays the output from perf detecting the tracepoint:
bash 33349 [034] 149587.674295: sched:sched_process_fork: comm=bash pid=33349 child_comm=bash child_pid=34056
20.2. Removing tracepoints from a running perf collector without stopping or restarting perf Copiar enlaceEnlace copiado en el portapapeles!
You can remove tracepoints from a running perf collector by using the control pipe interface. It helps to reduce the scope of data you are collecting without having to stop perf and losing performance data.
Prerequisites
-
You have the
perfuser space tool installed. For more information, see Installing perf. -
You have added tracepoints to a running
perfcollector by using the control pipe interface. For more information, see Adding tracepoints to a running perf collector without stopping or restarting perf.
Procedure
Remove the tracepoint:
# echo 'disable sched:sched_process_fork' > controlThis example assumes you have previously loaded scheduler events into the control file and enabled the tracepoint
sched:sched_process_fork.This command triggers
perfto scan the current event list in the control file for the declared event. If found, the tracepoint is disabled and the following message appears in the command line used to configure the control pipe:# event sched:sched_process_fork disabled
Chapter 21. Optimizing the system performance in the web console Copiar enlaceEnlace copiado en el portapapeles!
You can set a performance profile in the RHEL web console to optimize the performance of the system for a selected task.
21.1. Performance tuning options in the web console Copiar enlaceEnlace copiado en el portapapeles!
Red Hat Enterprise Linux provides several performance profiles that optimize the system for:
- Systems using the desktop
- Throughput performance
- Latency performance
- Network performance
- Low-power consumption
- Virtual machines
The TuneD service optimizes system options to match the selected profile. In the web console, you can set your system’s performance profile.
21.2. Setting a performance profile in the web console Copiar enlaceEnlace copiado en el portapapeles!
Depending on the task you want to perform, use the web console to optimize system performance by setting a suitable profile.
Prerequisites
You have installed the RHEL 10 web console.
For instructions, see Installing and enabling the web console.
Procedure
- Log in to the RHEL 10 web console.
- Click Overview.
In the Configuration section, click the current performance profile.
- In the Change performance profile dialog box, set the required profile.
- Click Change profile.
21.3. Monitoring performance on the local system by using the web console Copiar enlaceEnlace copiado en el portapapeles!
The RHEL web console uses the Utilization Saturation and Errors (USE) Method for troubleshooting. The new performance metrics page has a historical view of your data organized chronologically with the newest data at the top. In the Metrics and history page, you can view events, errors, and graphical representation for resource utilization and saturation.
Prerequisites
You have installed the RHEL 10 web console.
For instructions, see Installing and enabling the web console.
-
The
cockpit-pcppackage is installed. -
The Performance Co-Pilot (PCP) services
pmlogger.serviceandpmproxy.serviceare enabled and running.
Procedure
- Log in to the RHEL 10 web console.
- Click Overview.
In the Usage section, click View metrics and history.
The Metrics and history section opens and displays: the current system configuration and usage the performance metrics in a graphical form over a user-specified time interval
21.4. Monitoring performance on several systems by using the web console and Grafana Copiar enlaceEnlace copiado en el portapapeles!
Grafana aggregates data from multiple systems to visualize their Performance Co-Pilot (PCP) metrics graphically. You can configure performance metrics monitoring and export for several systems in the web console interface.
Prerequisites
You have installed the RHEL 10 web console.
For instructions, see Installing and enabling the web console.
-
You have installed the
cockpit-pcppackage. -
The Performance Co-Pilot (PCP) services
pmlogger.serviceandpmproxy.serviceare enabled and running. - You have configured the Grafana dashboard. For more information, see Setting up a grafana-server.
You have installed the
valkeypackage.Alternatively, you can install the package from the web console interface later in the procedure.
Procedure
- Log in to the RHEL 10 web console.
- In the Overview page, click View metrics and history in the Usage table.
- Click the Metrics settings button.
Move the Export to network slider to the active position.
If you do not have the
valkeypackage installed, the web console prompts you to install it.-
To open the
pmproxyservice, select a zone from a drop-down list and click the Add pmproxy button. - Click Save.
Verification
- Click Networking.
- In the Firewall table, click the Edit rules and zones button.
Search for
pmproxyin your selected zone.ImportantRepeat this procedure on all systems you want to monitor for performance.
Chapter 22. Analyzing system performance with BPF Compiler Collection Copiar enlaceEnlace copiado en el portapapeles!
The BPF Compiler Collection (BCC) analyzes system performance by using Berkeley Packet Filter (BPF) functions to run custom kernel programs. BCC accesses system events for performance monitoring, tracing, and debugging of BPF programs, providing tools and libraries for extracting system insights.
22.1. Installing the bcc-tools package Copiar enlaceEnlace copiado en el portapapeles!
You can install the bcc-tools package to get the BPF Compiler Collection (BCC) library and related tools.
Procedure
Install bcc-tools.
# dnf install bcc-toolsThe BCC tools are installed in the
/usr/share/bcc/tools/directory.
Verification
Inspect the installed tools.
# ls -l /usr/share/bcc/tools/A list of tools installed appears. The
docdirectory in the listing provides documentation for each tool.
22.2. Examining the system processes with execsnoop Copiar enlaceEnlace copiado en el portapapeles!
The execsnoop tool from the BCC suite captures and displays new process execution events in real time. It is useful for observing which commands or binaries are being executed on a system, helping with debugging, auditing, and security monitoring.
Procedure
Run the
execsnoopprogram in one command line:# /usr/share/bcc/tools/execsnoopTo create a short-lived process of the
lscommand, in another command line, enter:$ ls /usr/share/bcc/tools/doc/The command line running
execsnoopshows the following output:PCOMM PID PPID RET ARGS ls 8382 8287 0 /usr/bin/ls --color=auto /usr/share/bcc/tools/doc/The
execsnoopprogram prints a line of output for each new process that consumes system resources. It even detects processes of programs that run very shortly, such asls, and most monitoring tools would not register them. Theexecsnoopoutput displays the following fields:- PCOMM
-
The parent process name. (
ls) - PID
- The process ID. (8382)
- PPID
- The parent process ID. (8287)
- RET
- The return value of the exec() system call (0), which loads program code into new processes.
- ARGS
The location of the started program with arguments.
For more information, see the
/usr/share/bcc/tools/doc/execsnoop_example.txtfile and theexec(3)man page on your system.
22.3. Tracking files opened by a command with opensnoop Copiar enlaceEnlace copiado en el portapapeles!
Use the opensnoop tool from BPF Compiler Collection (BCC) to monitor and log file access by a specific command in real time. This is useful for debugging, auditing, or understanding the runtime behavior of an application.
Procedure
In one command line, run the
opensnoopprogram to print the output for files opened only by the process of theunamecommand:# /usr/share/bcc/tools/opensnoop -n unameIn another command line, enter the command to open certain files:
$ unameThe command line running opensnoop shows the output similar to the following: PID COMM FD ERR PATH 8596 uname 3 0 /etc/ld.so.cache 8596 uname 3 0 /lib64/libc.so.6 8596 uname 3 0 /usr/lib/locale/locale-archive ...The
opensnoopprogram watches theopen()system call across the whole system, and prints a line of output for each file thatunametries to open along the way. Theopensnoopoutput displays the following fields:- PID
- The process ID. (8596)
- COMM
-
The process name. (
uname) - FD
- The file descriptor - a value that open() returns to refer to the open file. (3)
- ERR
- Any errors.
- PATH
The location of files that
open()tries to open.If a command tries to read a non-existent file, the FD column returns
-1. Also, the ERR column prints a value corresponding to the relevant error. Useopensnoopto identify an application that does not behave properly. For more information, see the/usr/share/bcc/tools/doc/opensnoop_example.txtfile andopen(2)man page on your system.
22.4. Monitoring the top processes performing I/O operations on the disk with biotop Copiar enlaceEnlace copiado en el portapapeles!
The biotop tool provides a real-time view of processes generating the most disk I/O activity. It identifies high-disk-usage applications, serving as a tool for performance monitoring and troubleshooting.
Procedure
Run the
biotopprogram in one command line with 30 as an argument to produce 30 second summary:# /usr/share/bcc/tools/biotop 30When you do not provide any argument, the output screen refreshes every 1 second by default.
In another command line, enter command to read the content from the local hard disk device and write the output to the
/dev/zerofile:# dd if=/dev/vda of=/dev/zeroThis step generates certain I/O traffic to illustrate
biotop. The command line runningbiotopshows an output similar to the following:PID COMM D MAJ MIN DISK I/O Kbytes AVGms 9568 dd R 252 0 vda 16294 14440636.0 3.69 48 kswapd0 W 252 0 vda 1763 120696.0 1.65 7571 gnome-shell R 252 0 vda 834 83612.0 0.33 1891 gnome-shell R 252 0 vda 1379 19792.0 0.15 7515 Xorg R 252 0 vda 280 9940.0 0.28 7579 llvmpipe-1 R 252 0 vda 228 6928.0 0.19 9515 gnome-control-c R 252 0 vda 62 6444.0 0.43 8112 gnome-terminal- R 252 0 vda 67 2572.0 1.54 7807 gnome-software R 252 0 vda 31 2336.0 0.73 9578 awk R 252 0 vda 17 2228.0 0.66 7578 llvmpipe-0 R 252 0 vda 156 2204.0 0.07 9581 pgrep R 252 0 vda 58 1748.0 0.42 7531 InputThread R 252 0 vda 30 1200.0 0.48 7504 gdbus R 252 0 vda 3 1164.0 0.30 1983 llvmpipe-1 R 252 0 vda 39 724.0 0.08 1982 llvmpipe-0 R 252 0 vda 36 652.0 0.06 ...The
biotopoutput displays the following fields:- PID
- The process ID. (9568)
- COMM
-
The process name. (
dd) - DISK
- The disk performs the read operations. (vda)
- I/O
- The number of read operations performed. (16294)
- Kbytes
- The amount of Kbytes reached by the read operations. (14,440,636)
- AVGms
The average I/O time of read operations. (3.69)
For more information, see the
/usr/share/bcc/tools/doc/biotop_example.txtfile and thedd(1)man page on your system.
22.5. Exposing unexpectedly slow file system operations with xfsslower Copiar enlaceEnlace copiado en el portapapeles!
The xfsslower measures the time spent by the XFS file system in performing read, write, open or sync (fsync) operations. The argument 1 ensures that the program shows only the operations that are slower than 1 ms.
Procedure
Run the
xfsslowerprogram in one command line:# /usr/share/bcc/tools/xfsslower 1When you do not provide any arguments,
xfsslowerdisplays operations slower than 10 ms by default.Use
vimin another command line to create a text file to start interaction with the XFS file system:$ vim textThe command line running xfsslower shows something similar upon saving the file from the previous step: TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME 13:07:14 b'bash' 4754 R 256 0 7.11 b'vim' 13:07:14 b'vim' 4754 R 832 0 4.03 b'libgpm.so.2.1.0' 13:07:14 b'vim' 4754 R 32 20 1.04 b'libgpm.so.2.1.0' 13:07:14 b'vim' 4754 R 1982 0 2.30 b'vimrc' 13:07:14 b'vim' 4754 R 1393 0 2.52 b'getscriptPlugin.vim' 13:07:45 b'vim' 4754 S 0 0 6.71 b'text' 13:07:45 b'pool' 2588 R 16 0 5.58 b'text’ ...Each line represents an operation in the file system, which took more time than a certain threshold.
xfsslowerdetects possible file system problems, which can take the form of unexpectedly slow operations. Thexfssloweroutput displays the following fields:- COMM
-
The process name. (
b’bash') - T
The operation type. (
R)- Read
- Write
- Open
- Sync
- OFF_KB
- The file offset in KB. (0)
- FILENAME
The file that is read, written, or synced.
For more information, see the
/usr/share/bcc/tools/doc/xfsslower_example.txtfile and thefsync(2)man page on your system.
Chapter 23. Configuring an operating system to optimize CPU utilization Copiar enlaceEnlace copiado en el portapapeles!
You can configure the operating system to optimize CPU utilization across their workloads.
23.1. Tools for monitoring and diagnosing processor issues Copiar enlaceEnlace copiado en el portapapeles!
The following are the tools available in Red Hat Enterprise Linux to monitor and diagnose processor-related performance issues:
-
numactlutility provides a number of options to manage processor and memory affinity. Thenumactlpackage includes thelibnumalibrary, which offers a simple programming interface to the NUMA policy supported by the kernel, and can be used for more fine-grained tuning than thenumactlapplication. -
numadis an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within a system to dynamically improve NUMA resource allocation and management. -
numastattool displays per-NUMA node memory statistics for the operating system and its processes and shows administrators whether the process memory is spread throughout a system or is centralized on specific nodes. This tool is provided by thenumactlpackage. pqosutility is available in theintel-cmt-catpackage. It monitors CPU cache and memory bandwidth on recent Intel processors. It monitors the following types of information:- The instructions per cycle (IPC)
- The count of last level cache MISSES
- The size in kilobytes that the program executing in a given CPU occupies in the LLC
- The bandwidth to local memory (MBL)
- The bandwidth to remote memory (MBR)
/proc/interruptsfile displays the following types of information:- Interrupt request (IRQ) number
- The number of similar interrupt requests handled by each processor in the system
- The type of interrupt sent
- A comma-separated list of devices that respond to the listed interrupt request
-
tasksettool is provided by theutil-linuxpackage. It enables administrators to retrieve and set the processor affinity of a running process, or launch a process with a specified processor affinity. -
turbostattool prints counter results at specified intervals to help administrators identify unexpected behavior in servers, such as excessive power usage, failure to enter deep sleep states, or system management interrupts (SMIs) being created unnecessarily. -
x86_energy_perf_policytool enables administrators to define the relative importance of performance and energy efficiency. Influence supported processors to balance performance and energy efficiency by using these specific hardware features.
23.2. Types of system topology Copiar enlaceEnlace copiado en el portapapeles!
In modern computing, the idea of a single CPU is a misleading one, as most modern systems have multiple processors. The topology of the system is the way these processors are connected to each other and to other system resources.
This can affect system and application performance and the tuning considerations for a system.
The following are the two primary types of topology used in modern computing:
- Symmetric Multi-Processor (SMP) topology
- SMP topology enables all processors to access memory in the same amount of time. Serialized memory access in SMP systems creates scaling constraints that are no longer acceptable. Therefore, practically all modern server systems are NUMA machines.
- Non-Uniform Memory Access (NUMA) topology
NUMA topology was developed more recently than SMP topology. In a NUMA system, multiple processors are physically grouped on a socket. Each socket has a dedicated area of memory, and the processors on that socket have local access to this memory. Together, the socket, its memory, and the associated processors form what is referred to as a node. Processors on the same node have high-speed access to the node’s memory bank, and slower access to memory banks on other nodes.
As a result, there is a performance penalty when accessing non-local memory. Thus, performance sensitive applications on a system with NUMA topology should access local memory. They should also avoid accessing remote memory wherever possible to ensure optimal performance.
Multi-threaded applications that are sensitive to performance may benefit from being configured to run on a specific NUMA node rather than a specific processor. Whether this is suitable depends on your system and the requirements of your application.
- If multiple application threads access the same cached data, then configuring those threads to run on the same processor may be suitable.
-
If multiple threads that access and cache different data run on the same processor, each thread may evict cached data accessed by a previous thread. This means that each thread 'misses' the cache and wastes execution time fetching data from memory and replacing it in the cache. Use the
perftool to check for an excessive number of cache misses.
23.3. Displaying system topologies Copiar enlaceEnlace copiado en el portapapeles!
You can understand the topology of a system by using a number of commands.
Procedure
To display an overview of your system topology:
$ numactl --hardwareavailable: 4 nodes (0-3) node 0 cpus: 0 4 8 12 16 20 24 28 32 36 node 0 size: 65415 MB node 0 free: 43971 MB [...]To gather the information about the CPU architecture, such as the number of CPUs, threads, cores, sockets, and NUMA nodes:
$ lscpuArchitecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 40 On-line CPU(s) list: 0-39 Thread(s) per core: 1 Core(s) per socket: 10 Socket(s): 4 NUMA node(s): 4 Vendor ID: GenuineIntel CPU family: 6 Model: 47 Model name: Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz Stepping: 2 CPU MHz: 2394.204 BogoMIPS: 4787.85 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 30720K NUMA node0 CPU(s): 0,4,8,12,16,20,24,28,32,36 NUMA node1 CPU(s): 2,6,10,14,18,22,26,30,34,38 NUMA node2 CPU(s): 1,5,9,13,17,21,25,29,33,37 NUMA node3 CPU(s): 3,7,11,15,19,23,27,31,35,39To view a graphical representation of your system:
# dnf install hwloc-gui# lstopo
To view the detailed textual output:
# dnf install hwloc# lstopo-no-graphicsMachine (15GB) Package L#0 + L3 L#0 (8192KB) L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 PU L#0 (P#0) PU L#1 (P#4) HostBridge L#0 PCI 8086:5917 GPU L#0 "renderD128" GPU L#1 "controlD64" GPU L#2 "card0" PCIBridge PCI 8086:24fd Net L#3 "wlp61s0" PCIBridge PCI 8086:f1a6 PCI 8086:15d7 Net L#4 "enp0s31f6"
23.4. Configuring kernel tick time Copiar enlaceEnlace copiado en el portapapeles!
By default, RHEL uses a tickless kernel. It does not interrupt idle CPUs to reduce power usage and allow new processors to take advantage of deep sleep states. RHEL also offers a dynamic tickless option, which is useful for latency-sensitive workloads, such as high performance computing or real-time computing. By default, the dynamic tickless option is disabled. You can use the cpu-partitioning TuneD profile to enable the dynamic tickless option for cores specified as isolated_cores.
Procedure
To enable dynamic tickless behavior in certain cores, specify those cores on the kernel command line with the
nohz_fullparameter. For example, on a 16 core system, enable thenohz_full=1-15kernel option:# grubby --update-kernel=ALL --args="nohz_full=1-15"This enables dynamic tickless behavior on cores 1 through 15, moving all timekeeping to the only unspecified core (core 0).
When the system boots, manually move the
rcuthreads to the non-latency-sensitive core, in this case core 0:# for i in pgrep rcu[^c] ; do taskset -pc 0 $i ; done-
Optional: Use the
isolcpusparameter on the kernel command line to isolate certain cores from user-space tasks. Optional: Set the CPU affinity for the kernel’s write-back
bdi-flushthreads to the housekeeping core:echo 1 > /sys/bus/workqueue/devices/writeback/cpumask
Verification
Once the system is rebooted, verify if
dynticksare enabled:# journalctl -xe | grep dynticksMar 15 18:34:54 rhel-server kernel: NO_HZ: Full dynticks CPUs: 1-15.Verify that the dynamic tickless configuration is working correctly:
# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 sleep 3This command measures ticks on CPU 1 while telling CPU 1 to sleep for 3 seconds. The default kernel timer configuration shows around 3100 ticks on a regular CPU:
# perf stat -C 0 -e irq_vectors:local_timer_entry taskset -c 0 sleep 3Performance counter stats for 'CPU(s) 0': 3,107 irq_vectors:local_timer_entry 3.001342790 seconds time elapsedWith the dynamic tickless kernel configured, you should see around 4 ticks instead:
# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 sleep 3Performance counter stats for 'CPU(s) 1': 4 irq_vectors:local_timer_entry 3.001544078 seconds time elapsed
23.5. Overview of an interrupt request Copiar enlaceEnlace copiado en el portapapeles!
An interrupt request (IRQ) is a signal for immediate attention sent from a piece of hardware to a processor. Each device in a system is assigned one or more IRQ numbers, which allow it to send unique interrupts.
When enabled interrupts, processors receiving an interrupt request pause execution of the current application thread to address the interrupt request.
Because an interrupt halts normal operation, high interrupt rates can severely degrade system performance. Reduce interrupt overhead by configuring interrupt affinity or by batching lower-priority interrupts using coalescing.
Interrupt requests have an associated affnity property, smp_affinity, which defines the processors that handle the interrupt request. To improve application performance and enable the specified interrupt and application threads to share cache lines, make the following improvements:
- Assign interrupt affinity.
- Process affinity to either the same processor or processors on the same core.
Modifying smp_affinity on supported systems enables hardware-level interrupt steering. The hardware routes interrupts to specific processors without kernel intervention.
23.6. Balancing interrupts manually Copiar enlaceEnlace copiado en el portapapeles!
If the BIOS exports Non-Uniform Memory Access (NUMA) topology, irqbalance serves interrupt requests on the node local to the requesting hardware.
Procedure
- Check which devices correspond to the interrupt requests that you want to configure.
Find the hardware specification for your platform. Check if the chipset on your system supports distributing interrupts.
- If the chipset supports distribution, you can configure interrupt delivery as described in the following steps. Additionally, check which algorithm your chipset uses to balance interrupts. Some BIOSes have options to configure interrupt delivery.
- If the chipset does not support distribution, your chipset always routes all interrupts to a single, static CPU. You cannot configure which CPU is used.
Check which Advanced Programmable Interrupt Controller (APIC) mode is in use on your system:
$ journalctl --dmesg | grep APIC- If your system uses a mode other than flat, you can see a line similar to Setting APIC routing to physical flat.
-
If you can see no such message, your system uses
flatmode. If your system uses
x2apicmode, you can disable it by adding thenox2apicoption to the kernel command line in the bootloader configuration.Only non-physical flat mode (flat) supports distributing interrupts to multiple CPUs. This mode is available only for systems with 8 CPUs or less.
-
Calculate the
smp_affinitymask. For more information about how to calculate thesmp_affinitymask, see Setting thesmp_affinitymask.
23.7. Setting the smp_affinity mask Copiar enlaceEnlace copiado en el portapapeles!
The smp_affinity value is stored as a hexadecimal bit mask representing all processors in the system. Each bit configures a different CPU. The least significant bit is CPU 0. The default value of the mask is f. It means that an interrupt request can be handled on any processor in the system.
Setting this value to 1 means that only processor 0 can handle the interrupt.
Procedure
In binary, use the value
1for CPUs that handle the interrupts. For example, to set CPU0and CPU7to handle interrupts, use0000000010000001as the binary code:Expand Table 23.1. Binary Bits for CPUs CPU
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
Binary
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
Convert the binary code to hexadecimal:
For example, to convert the binary code using Python:
>>> hex(int('0000000010000001', 2)) '0x81'On systems with more than 32 processors, you must delimit the
smp_affinityvalues for discrete 32 bit groups. For example, if you want only the first 32 processors of a 64 processor system to service an interrupt request, use0xffffffff,00000000.The interrupt affinity value for a particular interrupt request is stored in the associated
/proc/irq/irq_number/smp_affinityfile. Set thesmp_affinitymask in this file:# echo mask > /proc/irq/irq_number/smp_affinity
Chapter 24. Reviewing a system by using the tuna interface Copiar enlaceEnlace copiado en el portapapeles!
The tuna tool reduces complexity of performing tuning tasks. You can use tuna to adjust scheduler tunables parameters, tune thread priority, IRQ handlers, and isolate CPU cores and sockets. By using tuna, you can perform the following operations:
- List the CPUs on a system.
- List the interrupt requests (IRQs), currently running on a system.
- Change policy and priority information about threads.
- Display the current policies and priorities of a system.
24.1. Installing the tuna tool Copiar enlaceEnlace copiado en el portapapeles!
The tuna tool is designed to be used on a running system. The application-specific measurement tools can start analyzing system performance immediately after making the changes.
Procedure
Install the
tunatool:# dnf install tuna
Verification
Display the available tuna
CLIoptions:# tuna -hFor more information, see the
tuna(8)man page on your system.
24.2. Viewing the system status by using the tuna tool Copiar enlaceEnlace copiado en el portapapeles!
You can use the tuna command-line interface (CLI) tool to view the system status.
Prerequisites
-
The
tunatool is installed. For details, see Installing the tuna tool.
Procedure
View the current policies and priorities:
# tuna show_threadspid SCHED_ rtpri affinity cmd 1 OTHER 0 0,1 init 2 FIFO 99 0 migration/0 3 OTHER 0 0 ksoftirqd/0 4 FIFO 99 0 watchdog/0Alternatively, view a specific thread corresponding to a PID or matching a command name:
# tuna show_threads -t pid_or_cmd_listThe pid_or_cmd_list argument is a list of comma-separated PIDs or command-name patterns.
Depending on the scenario, perform one of the following actions:
- To tune CPUs by using the tuna CLI, complete the steps in Tuning CPUs by using the tuna tool.
- To tune the IRQs by using the tuna tool, complete the steps in Tuning IRQs by using the tuna tool.
Save the changed configuration:
# tuna save filenameThis command saves only currently running kernel threads. Processes that are not running are not saved.
24.3. Tuning CPUs by using the tuna tool Copiar enlaceEnlace copiado en el portapapeles!
tuna commands can manage individual CPUs. By using the tuna tool, you can perform the following actions:
- Isolate CPUs
- All tasks running on the specified CPU move to the next available CPU. Isolating a CPU makes this CPU unavailable by removing it from the affinity mask of all threads.
- Include CPUs
- Enables tasks to run on the specified CPU.
- Restore CPUs
- Restores the specified CPU to its previous configuration.
Prerequisites
-
The
tunatool is installed. For more information, see Installing the tuna tool.
Procedure
List all currently running processes:
# ps ax | awk 'BEGIN { ORS="," }{ print $1 }'PID,1,2,3,4,5,6,8,10,11,12,13,14,15,16,17,19Display the thread list in the
tunainterface:# tuna show_threads -t 'thread_list from above cmd'Depending on your scenario, use this command to manage CPU affinity of processes with one of the following actions:
# tuna [command] --cpus cpu_list-
The cpu_list argument is a list of comma-separated CPU numbers, for example,
--cpus 0,2. -
To add a specific CPU to the current cpu_list, use, for example,
--cpus +0.
-
The cpu_list argument is a list of comma-separated CPU numbers, for example,
Depending on your scenario, perform one of the following actions:
To isolate a CPU, enter:
# tuna isolate --cpus cpu_listTo include a CPU, enter:
# tuna include --cpus cpu_list
To use a system with four or more processors, make all
sshthreads run on CPU0and1and allhttpthreads on CPU2and3:# tuna move --cpus 0,1 -t ssh*# tuna move --cpus 2,3 -t http\*
Verification
Verify the applied changes by displaying the current configuration:
# tuna show_threads -t ssh*pid SCHED_ rtpri affinity voluntary nonvoluntary cmd 855 OTHER 0 0,1 23 15 sshd# tuna show_threads -t http\*pid SCHED_ rtpri affinity voluntary nonvoluntary cmd 855 OTHER 0 2,3 23 15 httpFor more information, see the
/proc/cpuinfofile and thetuna(8)man page on your system.
24.4. Tuning IRQs by using the tuna tool Copiar enlaceEnlace copiado en el portapapeles!
The /proc/interrupts file records interrupt counts per IRQ including the interrupt type and the device name.
Prerequisites
-
The
tunatool is installed. For more information, see Installing the tuna tool.
Procedure
View the current IRQs and their affinity:
# tuna show_irqs# users affinity 0 timer 0 1 i8042 0 7 parport0 0Specify the list of IRQs to be affected by a command:
# tuna <command> --irqs irq_list --cpus cpu_list- The irq_list argument is a list of comma-separated IRQ numbers or user-name patterns.
-
Replace
<command>with, for example,--spread.
Move an interrupt to a specified CPU:
# tuna show_irqs --irqs <128>users affinity 128 iwlwifi 0,1,2,3# tuna move --irqs 128 --cpus 3-
Replace 128 with the irq_list argument and
3with the cpu_list argument. -
The cpu_list argument is a list of comma-separated CPU numbers, for example,
--cpus 0,2. For more information, see Tuning CPUs by using the tuna tool.
-
Replace 128 with the irq_list argument and
Verification
Compare the state of the selected IRQs before and after moving any interrupt to a specified CPU:
# tuna show_irqs --irqs 128users affinity 128 iwlwifi 3For more information, see the
/proc/interruptsfile and thetuna(8)man page on your system.
Chapter 25. Configuring an operating system to optimize memory access Copiar enlaceEnlace copiado en el portapapeles!
You can configure the operating system to optimize memory access across workloads with the tools that are included in RHEL.
25.1. Tools for monitoring and diagnosing system memory issues Copiar enlaceEnlace copiado en el portapapeles!
The following tools are available in Red Hat Enterprise Linux for monitoring system performance and diagnosing performance issues related to system memory:
-
The
vmstattool, included in theprocps-ngpackage, displays reports of a system’s processes, memory, paging, block I/O, traps, disks, and CPU activity. It generates an instant report displaying the average of these events from the time the machine was last turned on, or since the previous report. The
valgrindframework provides instrumentation to user-space binaries. This framework includes several tools, which you can use to profile and analyze program performance, such as:The
memchecktool is the default tool invalgrind. It detects and reports several memory errors that can be difficult to detect and diagnose, such as:- Invalid Memory Access
- Use of undefined or uninitialized values
- Incorrectly freed heap memory
- Pointer overlap (Buffer overlap)
Memory leaks
NoteMemcheckcan only report these errors, it cannot prevent them from occurring. However,memchecklogs an error message immediately before the error occurs.
-
The
cachegrindtool simulates how an application interacts with a system’s cache hierarchy and branch predictor. It gathers statistics for the duration of application’s execution and displays a summary to the console. The
massiftool measures the heap space used by a specified application. It measures both useful space and any additional space allocated for bookkeeping and alignment purposes.For more information, see the
/usr/share/doc/valgrind-version/valgrind_manual.pdffile andvmstat(8)andvalgrind(1)man pages on your system.
25.2. Overview of system memory Copiar enlaceEnlace copiado en el portapapeles!
The Linux Kernel is designed to maximize the utilization of resources in system memory (RAM). Due to these design characteristics, Kernel operations use most system memory for the workload, while leaving a small portion free.
This free memory is reserved for: special system allocations and other low or high priority system services. The rest of the system memory is dedicated to the workload itself, and divided into the following two categories:
- File memory
Pages added in this category represent parts of files in permanent storage. These pages, from the page cache, can be mapped or unmapped in address spaces of an application. Applications can map files into their address space by using the
mmapsystem calls or operate on files by using the buffered I/O read or write system calls.Buffered I/O system calls, as well as applications that map pages directly, can re-use unmapped pages. As a result, the kernel caches these pages to avoid repeated, slow I/O operations, particularly during periods of low memory usage.
- Anonymous memory
- These pages are used by dynamic processes or have no related files in permanent storage. These pages back up the in-memory control structures of each task, such as the application stack and heap.
25.3. Virtual memory parameters Copiar enlaceEnlace copiado en el portapapeles!
The virtual memory parameters are listed in the /proc/sys/vm directory.
The following are the available virtual memory parameters:
vm.dirty_ratio- A percentage value. When this percentage of total system memory is modified, the system begins writing the modifications to the disk. The default value is 20 percent.
vm.dirty_background_ratio- A percentage value. When this percentage of total system memory is modified, the system begins writing the modifications to the disk in the background. The default value is 10 percent.
vm.overcommit_memory- Defines the conditions that determine whether a large memory request is accepted or denied. The default value is 0. The kernel grants memory requests based on total available RAM and swap, rejecting only large allocations. Otherwise, virtual memory allocations are granted, and can allow memory overcommitment.
- Setting the
overcommit_memoryparameter’s value - When set to 1, the kernel performs no memory overcommit handling. This increases the possibility of memory overload, but improves performance for memory-intensive tasks.
-
When set to 2, the kernel denies allocations exceeding the total swap plus the RAM percentage defined in
overcommit_ratio. This setting reduces the risk of overcommitting memory. Use this only for systems with the swap partition larger than physical memory.
vm.overcommit_ratio-
Specifies the percentage of physical RAM considered when
overcommit_memoryis set to 2. The default value is 50. vm.max_map_count- Defines the maximum number of memory map areas that a process can use. The default value is 65530. Increase this value if your application needs more memory map areas.
vm.min_free_kbytes-
Sets the size of the reserved free pages pool. It is also responsible for setting the
min_page,low_page, andhigh_pagethresholds. These thresholds govern the behavior of the page reclaim algorithms in the Linux kernel. It also specifies to keep the minimum number of free kilobytes across the system. This calculates a specific value for each low memory zone. Each zone is assigned a number of reserved free pages in proportion to its size. - Setting the
vm.min_free_kbytesparameter’s value - Increasing the parameter value effectively reduces the application working set usable memory. Therefore, you can use it for only kernel-driven workloads, where driver buffers need to be allocated in atomic contexts.
Decreasing the parameter value might render the kernel unable to service system requests if memory becomes heavily contended in the system.
WarningExtreme values can be detrimental to the system’s performance. Setting the
vm.min_free_kbytesto an extremely low value prevents the system from reclaiming memory effectively. This can result in system crashes and failure to service interrupts or other kernel services. However, settingvm.min_free_kbytestoo high considerably increases system reclaim activity, causing allocation latency due to a false direct reclaim state. This might cause the system to immediately enter an out-of-memory state. Thevm.min_free_kbytesparameter also sets a page reclaim watermark, calledmin_pages. This watermark is used as a factor when determining the two other memory watermarks,low_pages, andhigh_pages, that govern page reclaim algorithms./proc/PID/oom_adjIn the event, when a system runs out of memory and thepanic_on_oomparameter is set to 0, theoom_killerfunction kills processes, starting with the process that has the highestoom_score, until the system recovers. Theoom_adjparameter determines theoom_scoreof a process. This parameter is set per process identifier. A value of -17 disables theoom_killerfor that process. Other valid values range from -16 to 15.NoteProcesses created by an adjusted process inherit the
oom_scoreof that process.
vm.swappiness- The swappiness value, ranging from 0 to 200, controls whether the system prioritizes reclaiming memory from anonymous pages or the page cache.
- Setting the swappiness parameter’s value
- Higher values prefer file-mapped driven workloads while swapping out the less actively accessed processes’ anonymous mapped memory of RAM. File servers and streaming applications use this to keep data in memory, reducing I/O latency.
Low values prefer anonymous-mapped driven workloads while reclaiming the page cache (file mapped memory). This setting is useful for applications that do not depend heavily on file system information and use dynamically allocated private memory. Examples include mathematical and number-crunching applications and some hardware virtualization supervisors such as QEMU. The default value of the
vm.swappinessparameter is 60.WarningSetting the
vm.swappinessto 0 aggressively avoids swapping out anonymous memory to a disk. This increases the likelihood that theoom_killerfunction kills processes during memory or I/O-intensive workloads.
25.4. File system parameters Copiar enlaceEnlace copiado en el portapapeles!
The file system parameters are listed in the /proc/sys/fs directory. The following are the available file system parameters:
aio-max-nr- Defines the maximum number of events allowed in all active asynchronous input/output contexts. The default value is 65536, and modifying this value does not pre-allocate or resize any kernel data structures.
file-maxDetermines the maximum number of file handles for the entire system. On Red Hat Enterprise Linux 10, the default value is
9223372036854775807.NoteThe default value is set by
systemdand corresponds to the configurable maximum.
25.5. Kernel parameters Copiar enlaceEnlace copiado en el portapapeles!
The default values for the kernel parameters are located in the /proc/sys/kernel/ directory. These are set default values provided by the kernel or values specified by a user by using sysctl. The following kernel parameters configure limits for the msg* and shm* System V IPC (sysvipc) system calls:
msgmax-
Defines the maximum allowed size in bytes of any single message in a message queue. This value must not exceed the size of the queue (
msgmnb). Use thesysctl kernel.msgmaxcommand to determine the currentmsgmaxvalue on your system. msgmnb-
Defines the maximum size in bytes of a single message queue. Use the
sysctl msgmnbcommand to determine the currentmsgmnbvalue on your system. msgmni-
Defines the maximum number of message queue identifiers, and therefore the maximum number of queues. Use the
sysctl kernel.msgmnicommand to determine the currentmsgmnivalue on your system. shmall-
Defines the total amount of shared memory pages that can be used on the system at one time. For example, a page is 4096 bytes on the AMD64 and Intel 64 architecture. Use the
sysctl kernel.shmallcommand to determine the currentshmallvalue on your system. shmmax-
Defines the maximum size in bytes of a single shared memory segment allowed by the kernel. Shared memory segments up to 1Gb are now supported in the kernel. Use the
sysctl kernel.shmmaxcommand to determine the currentshmmaxvalue on your system. shmmni- Defines the system-wide maximum number of shared memory segments. The default value is 4096 on all systems.
Chapter 26. Getting started with SystemTap Copiar enlaceEnlace copiado en el portapapeles!
As a system administrator, use SystemTap to identify underlying causes of a bug or performance problem on a running Red Hat Enterprise Linux (RHEL) system. As an application developer, you can use SystemTap to closely monitor and analyze your application’s behavior within the RHEL environment.
26.1. The purpose of SystemTap Copiar enlaceEnlace copiado en el portapapeles!
SystemTap monitors operating system and kernel activities in fine detail through tracing and probing. SystemTap provides information similar to the output of tools such as netstat, ps, top, and iostat. However, SystemTap provides more filtering and analysis options for collected information.
In SystemTap scripts, specify the information that SystemTap gathers.
SystemTap tracks kernel activity, supplementing Linux monitoring tools with two primary attributes:
- Flexibility
- Develop SystemTap scripts to monitor kernel functions, system calls, and other kernel-space events. This makes SystemTap not just a tool, but a system for developing your own kernel-specific forensic and monitoring tools.
- Ease-of-Use
- With SystemTap, you can monitor kernel activity without having to recompile the kernel or reboot the system.
26.2. Installing SystemTap Copiar enlaceEnlace copiado en el portapapeles!
To begin using SystemTap, install the required packages. To use SystemTap with multiple kernels, install the matching kernel packages for each version.
Prerequisites
- You have enabled debug repositories as described in Enabling debug and source repositories.
Procedure
Install the required SystemTap packages:
# dnf install systemtapInstall the required kernel packages:
Using
stap-prep:# stap-prepIf
stap-prepdoes not work, install the required kernel packages manually:# dnf install kernel-debuginfo-$(uname -r) kernel-debuginfo-common-$(uname -m)-$(uname -r) kernel-devel-$(uname -r)$(uname -m)is automatically replaced with the hardware platform of your system and$(uname -r)is automatically replaced with the version of your running kernel.
Verification
If the kernel to be probed with SystemTap is currently in use, test if your installation was successful:
# stap -v -e 'probe kernel.function("vfs_read") {printf("read performed\n"); exit()}'For a successful SystemTap deployment, you see an output similar to the following:
Pass 1: parsed user script and 45 library script(s) in 340usr/0sys/358real ms. Pass 2: analyzed script: 1 probe(s), 1 function(s), 0 embed(s), 0 global(s) in 290usr/260sys/568real ms. Pass 3: translated to C into "/tmp/stapiArgLX/stap_e5886fa50499994e6a87aacdc43cd392_399.c" in 490usr/430sys/938real ms. Pass 4: compiled C into "stap_e5886fa50499994e6a87aacdc43cd392_399.ko" in 3310usr/430sys/3714real ms. Pass 5: starting run. read performed Pass 5: run completed in 10usr/40sys/73real ms.where:
-
Pass 5: starting runindicates that SystemTap successfully created the instrumentation to probe the kernel and runs the instrumentation. -
read performedindicates that SystemTap detected the specified event, in this example, a VFS read. -
Pass 5: run completed in <time> msindicates that SystemTap executed a valid handler. It displayed text and closed with no errors.
26.3. Privileges to run SystemTap Copiar enlaceEnlace copiado en el portapapeles!
Running SystemTap scripts requires elevated system privileges, but in some instances non-privileged users might need to run SystemTap instrumentation on their machine.
To allow users to build and run the SystemTap scripts without root access, add users to both of these user groups:
- stapdev
-
Members of this group can use
stapto run SystemTap scripts, orstaprunto run SystemTap instrumentation modules. Runningstapinvolves compiling SystemTap scripts into kernel modules and loading them into the kernel. This requires elevated privileges to the system, which are granted tostapdevmembers. These privileges also grant effective root access tostapdevmembers. Grantstapdevgroup membership only to users who can be trusted with root access. - stapusr
-
Members of this group can only use
staprunto run SystemTap instrumentation modules. In addition, these users can run those modules only from the/lib/modules/<kernel_version>/systemtap/directory. This directory must be owned and writable only by the root user.
26.4. Running SystemTap scripts Copiar enlaceEnlace copiado en el portapapeles!
You can run SystemTap scripts from standard input or from a file. Find sample scripts that are distributed with the installation of SystemTap in Sample SystemTap scripts or in the /usr/share/systemtap/examples directory.
Prerequisites
- SystemTap and the required kernel packages are installed as described in Installing Systemtap.
To run SystemTap scripts as a normal user, add the user to the SystemTap groups:
# usermod --append --groups stapdev,stapusr <user_name>
Procedure
Run the SystemTap script:
From standard input:
# stap -e "probe timer.s(1) {exit()}"From a file:
# stap <file_name>.stp
26.5. Sample SystemTap scripts Copiar enlaceEnlace copiado en el portapapeles!
You can find sample scripts that are distributed with the installation of SystemTap in the /usr/share/systemtap/examples directory.
Use the stap command to run different SystemTap scripts:
- Tracing function calls
You can use the
para-callgraph.stpSystemTap script to trace function calls and function returns.# stap para-callgraph.stp <argument1 argument2>The script takes two command-line arguments:
- The name of the function(s) whose entry/exit you are tracing.
An optional trigger function, which enables or disables tracing on a per-thread basis.
Tracing in each thread continues as long as the trigger function has not exited.
- Monitoring polling applications
You can use the
timeout.stpSystemTap script to identify and monitor which applications are polling. Knowing this, you can track unnecessary or excessive polling. It can help you to pinpoint areas for improvement in terms of CPU usage and power savings.# stap timeout.stpThis script tracks how many times each application uses
poll,select,epoll,itimer,futex,nanosleep, andSignalsystem calls over time.- Tracking system call volume per process
You can use the
syscalls_by_proc.stpSystemTap script to see what processes are performing the highest volume of system calls. It displays the top 20 processes by default.# stap syscalls_by_proc.stp- Tracing functions called in network socket code
You can use the
socket-trace.stpSystemTap script to trace functions called from the kernel’snet/socket.cfile. This helps you to identify how each process interacts with the network at the kernel level in fine detail.# stap socket-trace.stp- Tracking I/O time for each file read or write
Use the
iotime.stpSystemTap script to monitor time taken for each process to read from or write to any file. This helps you to determine what files are slow to load on a system.# stap iotime.stp- Tracking IRQ’s and other processes stealing cycles from a task
Use the
cycle_thief.stpSystemTap script to track the time a task is running and the amount of time it is not running. This helps you to identify which processes are stealing cycles from a task.# stap cycle_thief.stp -x pidFor more information, see the
/usr/share/systemtap/examplesdirectory.NoteYou can find more examples and information about SystemTap scripts in the
/usr/share/systemtap/examples/index.htmlfile. Open it in a web browser to see a list of all the available scripts and their descriptions.
26.6. SystemTap cross-instrumentation Copiar enlaceEnlace copiado en el portapapeles!
Cross-instrumentation builds SystemTap modules on one host for use on others without the full software suite. When you run a SystemTap script, a kernel module is built out of that script. SystemTap then loads the module into the kernel.
Normally, SystemTap scripts can run only on systems where SystemTap is deployed. To run SystemTap on ten systems, SystemTap needs to be deployed on all those systems. In some cases, this might be neither feasible nor practical. For example, corporate policies may prohibit compilers or debug data, preventing a standard SystemTap deployment on all target systems.
To work around this, use cross-instrumentation. Cross-instrumentation is the process of generating SystemTap instrumentation modules from a SystemTap script on one system to be used on another system. This process offers the following benefits:
-
The kernel information packages for various machines can be installed on a single host machine. Kernel packaging bugs may prevent the installation. In such cases, the
kernel-debuginfoandkernel-develpackages for the host system and target system must match. If a bug occurs, report the bug to Red Hat JIRA. Each target machine needs only one package to be installed to use the generated SystemTap instrumentation module:
systemtap-runtime. Instrumentation modules require the host and target systems to share identical architectures and distributions.- Terminology
instrumentation module
Build the SystemTap module on the host, then load it onto the target system’s kernel.
host system
The system on which the instrumentation modules (from SystemTap scripts) are compiled, to be loaded on target systems.
target system
The system in which the instrumentation module is being built (from SystemTap scripts).
target kernel
The kernel of the target system. This is the kernel that loads and runs the instrumentation module.
26.7. Initializing cross-instrumentation of SystemTap Copiar enlaceEnlace copiado en el portapapeles!
Initialize SystemTap cross-instrumentation to build modules on one system and deploy them to systems lacking the full suite.
Prerequisites
- SystemTap is installed on the host system as described in Installing Systemtap.
The
systemtap-runtimepackage is installed on each target system:# dnf install systemtap-runtime- Both the host and the target systems are the same architecture.
Both the host and the target systems are running the same major version of Red Hat Enterprise Linux (such as Red Hat Enterprise Linux 10).
ImportantKernel packaging bugs might prevent multiple
kernel-debuginfoandkernel-develpackages from being installed on one system. In such cases, the minor version for the host and the target systems must match. If a bug occurs, report the bug at Red Hat JIRA.
Procedure
Determine the kernel running on each target system:
$ uname -r- Repeat this step for each target system.
- On the host system, install the target kernel and related packages for each target system by the method described in Installing Systemtap.
Build an instrumentation module on the host system, copy this module to and run this module on on the target system either:
If an SSH connection can be made to the target system from the host system, use remote implementation:
# stap --remote <target_system> scriptYou must ensure an SSH connection can be made to the target system from the host system for this to be successful.
Manually:
Build the instrumentation module on the host system:
# stap -r <kernel_version> script -m <module_name> -p 4Here,
<kernel_version>is the version of the target kernel determined in step 1,scriptis the script to be converted into an instrumentation module, and<module_name>specifies the name of the instrumentation module. The-p4option tells SystemTap to not load and run the compiled module.Once the instrumentation module is compiled, copy it to the target system and load it by using the following command:
# staprun <module_name>.ko
Chapter 27. Tuning scheduling policy Copiar enlaceEnlace copiado en el portapapeles!
In Red Hat Enterprise Linux (RHEL), thread is the smallest unit of process execution. The system scheduler chooses which processor runs a thread and for how long, prioritizing overall system utilization. Consequently, thread scheduling might not optimized for specific application performance policy.
For example, if an application on a NUMA system is running on Node A when a processor on Node B becomes available. To keep the processor on Node B busy, the scheduler moves one of the application’s threads to Node B. However, the application thread still requires access to memory on Node A. Accessing this memory takes longer. The thread now runs on Node B, making Node A memory remote rather than local. Running on Node B might take longer than waiting for Node A. Local memory access often outweighs the benefit of migration.
27.1. Categories of scheduling policies Copiar enlaceEnlace copiado en el portapapeles!
Performance-sensitive applications often benefit from the designer or administrator determining where threads are run. The Linux scheduler implements a number of scheduling policies that determine where and for how long a thread runs. The following are the two major categories of scheduling policies:
- Normal policies
- Normal threads are used for tasks of normal priority.
- Realtime policies
- Real-time policies are used for time-sensitive tasks that must be completed without interruptions. Real-time threads are not subject to time slicing. This means the thread runs until they block, exit, voluntarily yield, or are preempted by a higher priority thread.
The lowest priority realtime thread is scheduled before any thread with a normal policy. For more information, see the sched(7), sched_setaffinity(2), sched_getaffinity(2), sched_setscheduler(2), and sched_getscheduler(2) man pages on your system.
27.2. Static priority scheduling with SCHED_FIFO Copiar enlaceEnlace copiado en el portapapeles!
The SCHED_FIFO, also called static priority scheduling, is a real-time policy that defines a fixed priority for each thread. This policy enables administrators to improve event response time and reduce latency. Avoid using this policy for an extended period of time for time-sensitive tasks.
The scheduler runs the highest-priority SCHED_FIFO thread from the list of ready threads. The SCHED_FIFO priority levels range from integer 1 to 99, where 99 is the highest. Start with a lower number and increase priority only when you identify latency issues.
Because real-time threads are not subject to time slicing, avoid setting a priority of 99. This keeps your process at the same priority level as migration and watchdog threads. If your thread goes into a computational loop and these threads are blocked, they cannot run. Systems with a single processor will eventually hang in this situation.
Administrators can limit SCHED_FIFO bandwidth to prevent realtime application programmers from initiating realtime tasks that monopolize the processor. The following are some of the parameters used in this policy:
/proc/sys/kernel/sched_rt_period_us- This parameter defines the time period, in microseconds, that is considered to be one hundred percent of the processor bandwidth. The default value is 1000000 μs or 1 second.
/proc/sys/kernel/sched_rt_runtime_us- This parameter defines the time period, in microseconds, that is devoted to running real-time threads. The default value is 950000 μs or 0.95 seconds.
27.3. Round robin priority scheduling with SCHED_RR Copiar enlaceEnlace copiado en el portapapeles!
The SCHED_RR is a round-robin variant of the SCHED_FIFO. This policy is useful when multiple threads need to run at the same priority level. Like SCHED_FIFO, SCHED_RR is a realtime policy that defines a fixed priority for each thread. The scheduler scans the list of all SCHED_RR threads in order of priority. It then schedules the highest priority thread that is ready to run.
However, unlike SCHED_FIFO, threads that have the same priority are scheduled in a round-robin style within a certain time slice. You can set the value of this time slice in milliseconds with the sched_rr_timeslice_ms kernel parameter in the /proc/sys/kernel/sched_rr_timeslice_ms file. The lowest value is 1 millisecond.
27.4. Normal scheduling with SCHED_OTHER Copiar enlaceEnlace copiado en el portapapeles!
The SCHED_OTHER is the default scheduling policy. SCHED_OTHER uses the Completely Fair Scheduler (CFS) to enable fair processor access to all threads scheduled with this policy. SCHED_OTHER is useful when there are a large number of threads or data throughput is a priority. It enables more efficient scheduling of threads over time.
The scheduler creates a dynamic priority list based on each thread’s niceness value. Administrators can change the niceness value of a process, but cannot change the scheduler’s dynamic priority list directly.
27.5. Setting scheduler policies Copiar enlaceEnlace copiado en el portapapeles!
You can check and adjust scheduler policies and priorities by using the chrt command line tool. It can start new processes with the expected properties or change the properties of a running process. It can also be used for setting the policy at runtime.
Procedure
View the process ID (PID) of the active processes:
# ps-
Use the
--pidor-poption with thepscommand to view the details of the particular PID. Check the scheduling policy, PID, and priority of a particular process:
# chrt -p 468pid 468's current scheduling policy: SCHED_FIFO pid 468's current scheduling priority: 85# chrt -p 476pid 476's current scheduling policy: SCHED_OTHER pid 476's current scheduling priority: 0Set the scheduling policy of a process, for example:
To set the process with PID 1000 to
SCHED_FIFO, with a priority of 50:# chrt -f -p 50 1000To set the process with PID 1000 to
SCHED_OTHER, with a priority of 0:# chrt -o -p 0 1000To set the process with PID 1000 to
SCHED_RR, with a priority of 10:# chrt -r -p 10 1000To start a new application with a particular policy and priority, specify the name of the application:
# chrt -f 36 /bin/my-app
27.6. Policy options for the chrt command Copiar enlaceEnlace copiado en el portapapeles!
You can view and set the scheduling policy of processes by using the chrt command. The following table describes the appropriate policy options, which can be used to set the scheduling policy of a process.
| Short option | Long option | Description |
|---|---|---|
|
|
|
Set schedule to |
|
|
|
Set schedule to |
|
|
|
Set schedule to |
27.7. Changing the priority of services during the boot process Copiar enlaceEnlace copiado en el portapapeles!
You can configure real-time priorities for services launched during the boot process by using the systemd service. The unit configuration directives are used to change the priority of a service during the boot process. The boot process priority can be changed by using the following directives in the service section:
-
CPUSchedulingPolicy=: Sets the CPU scheduling policy for executed processes. It is used to set other, fifo, and rr policies. -
CPUSchedulingPriority=: Sets the CPU scheduling priority for executed processes. The available priority range depends on the selected CPU scheduling policy. For real-time scheduling policies, an integer between 1 (lowest priority) and 99 (highest priority) can be used.
You can change the priority of a service during the boot process and by using the mcelog service.
Prerequisites
-
You have installed and enabled
tuned. For more information, see Installing and enabling TuneD.
Procedure
View the scheduling priorities of running threads:
# tuna --show_threadsthread ctxt_switches pid SCHED_ rtpri affinity voluntary nonvoluntary cmd 1 OTHER 0 0xff 3181 292 systemd 2 OTHER 0 0xff 254 0 kthreadd 3 OTHER 0 0xff 2 0 rcu_gp 4 OTHER 0 0xff 2 0 rcu_par_gp 6 OTHER 0 0 9 0 kworker/0:0H-kblockd 7 OTHER 0 0xff 1301 1 kworker/u16:0-events_unbound 8 OTHER 0 0xff 2 0 mm_percpu_wq 9 OTHER 0 0 266 0 ksoftirqd/0 [...]Create a supplementary
mcelogservice configuration directory file and insert the policy name and priority in this file:# cat << EOF > /etc/systemd/system/mcelog.service.d/priority.conf[Service] CPUSchedulingPolicy=fifo CPUSchedulingPriority=20 EOFReload the
systemdscripts configuration:# systemctl daemon-reloadRestart the
mcelogservice:# systemctl restart mcelog
Verification
Display the
mcelogpriority set bysystemdissue:# tuna -t mcelog -Pthread ctxt_switches pid SCHED_ rtpri affinity voluntary nonvoluntary cmd 826 FIFO 20 0,1,2,3 13 0 mcelog
27.8. Priority map Copiar enlaceEnlace copiado en el portapapeles!
Priorities are defined in groups, with some groups dedicated to certain kernel functions. For real-time scheduling policies, an integer between 1 (lowest priority) and 99 (highest priority) is used.
The following table describes the priority range, which can be used while setting the scheduling policy of a process.
| Priority | Threads | Description |
|---|---|---|
| 1 | Low priority kernel threads |
This priority is usually reserved for the tasks that need to be just prior to |
| 2 - 49 | Available for use | The range used for typical application priorities. |
| 50 | Default hard-IRQ value | |
| 51 - 98 | High priority threads | Use this range for threads that run periodically and must have quick response times. Do not use this range for CPU-bound threads as you will starve interrupts. |
| 99 | Watchdogs and migration | System threads that must run at the highest priority. |
27.9. TuneD cpu-partitioning profile Copiar enlaceEnlace copiado en el portapapeles!
For tuning Red Hat Enterprise Linux for latency-sensitive workloads, use the cpu-partitioning TuneD profile. In RHEL 9 and later, you can perform low-latency tuning more efficiently by using the cpu-partitioning TuneD profile.
This profile is easily customizable according to the requirements for individual low-latency applications. The following figure is an example to demonstrate how to use the cpu-partitioning profile. This example uses the CPU and node layout.
Configure the cpu-partitioning profile in the /etc/tuned/cpu-partitioning-variables.conf file by using the following configuration options:
- Isolated CPUs with load balancing
In the
cpu-partitioningfigure, the blocks numbered from 4 to 23, are the default isolated CPUs. The kernel scheduler’s process load balancing is enabled on these CPUs. It is designed for low-latency processes with multiple threads that need the kernel scheduler load balancing. Configure the cpu-partitioning profile in the/etc/tuned/cpu-partitioning-variables.conffile by using theisolated_cores=cpu-listoption. This option lists CPUs to isolate that will use the kernel scheduler load balancing.The list of isolated CPUs is comma-separated or specify a range using a dash, such as 3-5. This option is mandatory. Any CPU missing from this list is automatically considered a housekeeping CPU.
- Isolated CPUs without load balancing
In the cpu-partitioning figure, CPUs 2 and 3 are isolated and exclude kernel scheduler load balancing.
You can configure the
cpu-partitioningprofile in the/etc/tuned/cpu-partitioning-variables.conffile by using theno_balance_cores=cpu-listoption, which lists CPUs to isolate that will not use the kernel scheduler load balancing.Specifying the
no_balance_coresoption is optional, however any CPUs in this list must be a subset of the CPUs listed in theisolated_coreslist. Application threads using these CPUs need to be pinned individually to each CPU.- Housekeeping CPUs
-
Any CPU not isolated in the
cpu-partitioning-variables.conffile is automatically considered a housekeeping CPU. On the housekeeping CPUs, all services, daemons, user processes, movable kernel threads, interrupt handlers, and kernel timers are permitted to run.
27.10. Using the TuneD cpu-partitioning profile for low-latency tuning Copiar enlaceEnlace copiado en el portapapeles!
You can tune a system for low-latency by using the TuneD’s cpu-partitioning profile. The application in this case uses:
- One dedicated reader thread that reads data from the network will be pinned to CPU 2.
- A large number of threads that process this network data will be pinned to CPUs 4-23.
- A dedicated writer thread that writes the processed data to the network is pinned to CPU 3.
Prerequisites
-
You have installed the
cpu-partitioningTuneD profile by using thednf install tuned-profiles-cpu-partitioningcommand as root.
Procedure
Edit the
/etc/tuned/cpu-partitioning-variables.conffile with the following changes:Comment the
isolated_cores=${f:calc_isolated_cores:1}line:# isolated_cores=${f:calc_isolated_cores:1}Add the following information for isolated CPUS:
# All isolated CPUs: isolated_cores=2-23 # Isolated CPUs without the kernel’s scheduler load balancing: no_balance_cores=2,3Set the cpu-partitioning TuneD profile:
# tuned-adm profile cpu-partitioning
Reboot the system.
After rebooting, the system is tuned for low-latency, according to the isolation in the cpu-partitioning figure. The application can use taskset to pin the reader and writer threads to CPUs 2 and 3, and the remaining application threads on CPUs 4-23.
Verification
Verify that the isolated CPUs are not reflected in the Cpus_allowed_list field:
# cat /proc/self/status | grep CpuCpus_allowed: 003 Cpus_allowed_list: 0-1To see affinity of all processes, enter:
# ps -ae -o pid= | xargs -n 1 taskset -cppid 1's current affinity list: 0,1 pid 2's current affinity list: 0,1 pid 3's current affinity list: 0,1 pid 4's current affinity list: 0-5 pid 5's current affinity list: 0,1 pid 6's current affinity list: 0,1 pid 7's current affinity list: 0,1 pid 9's current affinity list: 0 ...NoteTuneD cannot change the affinity of some processes, mostly kernel processes. In this example, processes with PID 4 and 9 remain unchanged.
27.11. Customizing the cpu-partitioning TuneD profile Copiar enlaceEnlace copiado en el portapapeles!
You can extend the TuneD profile to make additional tuning changes. For example, the cpu-partitioning profile sets the CPUs to use cstate=1. To use the cpu-partitioning profile but to additionally change the CPU cstate from cstate1 to cstate0, follow this procedure.
Procedure
Create the
/etc/tuned/my_profiledirectory:# mkdir /etc/tuned/profiles/my_profileCreate a
tuned.conffile in this directory, and add the following content:# vi /etc/tuned/profiles/my_profile/tuned.conf[main] summary=Customized tuning on top of cpu-partitioning include=cpu-partitioning [cpu] force_latency=cstate.id:0|1Use the new profile.
# tuned-adm profile my_profileNoteIn the shared example, a reboot is not required. However, if the changes in the my_profile profile require a reboot to take effect, then reboot your machine.
Chapter 28. Profiling memory allocation with numastat Copiar enlaceEnlace copiado en el portapapeles!
The numastat tool provides detailed statistics on memory allocations within a system, presenting data for each NUMA node individually. This information is valuable for analyzing system memory performance and evaluating the efficacy of various memory policies.
28.1. Default numastat statistics Copiar enlaceEnlace copiado en el portapapeles!
By default, the numastat tool displays statistics over these categories of data for each NUMA node:
numa_hit- The number of pages that were successfully allocated to this node.
numa_missThe number of pages that were allocated on this node because of low memory on the intended node. Each
numa_missevent has a correspondingnuma_foreignevent on another node.NoteHigh
numa_hitvalues and lownuma_missvalues (relative to each other) indicate optimal performance.numa_foreign-
The number of pages initially intended for this node that were allocated to another node instead. Each
numa_foreignevent has a correspondingnuma_missevent on another node. interleave_hit- The number of interleave policy pages successfully allocated to this node.
local_node- The number of pages successfully allocated on this node by a process on this node.
other_node- The number of pages allocated on this node by a process on another node.
28.2. Viewing memory allocation with numastat Copiar enlaceEnlace copiado en el portapapeles!
You can view the memory allocation of the system by using the numastat tool.
Prerequisites
You have installed the
numactlpackage.# dnf install numactl
Procedure
View the memory allocation of your system:
$ numastatnode0 node1 numa_hit 76557759 92126519 numa_miss 30772308 30827638 numa_foreign 30827638 30772308 interleave_hit 106507 103832 local_node 76502227 92086995 other_node 30827840 30867162