Chapter 5. Important changes to external kernel parameters


This chapter provides system administrators with a summary of significant changes in the kernel distributed with Red Hat Enterprise Linux 9.2. These changes could include, for example, added or updated proc entries, sysctl and sysfs default values, boot parameters, kernel configuration options, or any noticeable behavior changes.

New kernel parameters

nomodeset

With this kernel parameter, you can disable kernel mode setting. DRM drivers will not perform display-mode changes or accelerated rendering. Only the system frame buffer will be available for use, and only if it was set up by the firmware or boot loader.

nomodeset is useful as a fallback, or for testing and debugging.

printk.console_no_auto_verbose

With this kernel parameter, you can disable the console loglevel raise on oops, panic, or lockdep-detected issues (only if lock debugging is on). Except on setups with a low baud rate on the serial console, keep this parameter set to 0 to provide more debug information.

  • Format: <bool>
  • Defaults to 0 (auto_verbose is enabled)
rcupdate.rcu_exp_cpu_stall_timeout=[KNL]

With this kernel parameter, you can set the timeout for expedited RCU CPU stall warning messages. The value is in milliseconds and the maximum allowed value is 21000 milliseconds.

Note that this value is adjusted to an arch timer tick resolution. Setting this to zero causes the value from rcupdate.rcu_cpu_stall_timeout to be used (after conversion from seconds to milliseconds).
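The fallback rule described above can be sketched in Python. This is only an illustration: the function name is hypothetical, capping at the documented maximum is an assumption about how out-of-range values are handled, and the adjustment to timer tick resolution is not modeled.

```python
MAX_EXP_STALL_TIMEOUT_MS = 21_000  # documented maximum, in milliseconds


def effective_exp_stall_timeout_ms(exp_timeout_ms: int, cpu_stall_timeout_s: int) -> int:
    """Return the expedited stall timeout in milliseconds (illustrative only)."""
    if exp_timeout_ms == 0:
        # Zero means: reuse rcupdate.rcu_cpu_stall_timeout, converted to milliseconds.
        exp_timeout_ms = cpu_stall_timeout_s * 1000
    # Assumption: values above the documented maximum are capped.
    return min(exp_timeout_ms, MAX_EXP_STALL_TIMEOUT_MS)


print(effective_exp_stall_timeout_ms(0, 15))    # 15000
print(effective_exp_stall_timeout_ms(500, 15))  # 500
```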

rcupdate.rcu_task_stall_info=[KNL]

With this parameter, you can set the initial timeout in jiffies for RCU task stall informational messages, which give some indication of the problem for those not patient enough to wait for ten minutes. Informational messages are only printed prior to the stall-warning message for a given grace period. Disable with a value less than or equal to zero.

  • Defaults to 10 seconds.
  • A change in value does not take effect until the beginning of the next grace period.
rcupdate.rcu_task_stall_info_mult=[KNL]

This parameter is a multiplier for the time interval between successive RCU task stall informational messages for a given RCU tasks grace period. This value is clamped to one through ten, inclusive.

It defaults to the value of three, so that the first informational message is printed 10 seconds into the grace period, the second at 40 seconds, the third at 160 seconds, and then the stall warning at 600 seconds would prevent a fourth at 640 seconds.
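The schedule quoted above (10, 40, and 160 seconds, with 640 seconds suppressed by the stall warning) can be reproduced with a short Python sketch. The model used here is an assumption chosen because it matches those figures: each new interval equals the multiplier times the time already elapsed in the grace period. The function name is hypothetical, and the kernel implementation may compute the intervals differently.

```python
from typing import List


def info_message_times(initial_s: int, mult: int, stall_warning_s: int) -> List[int]:
    """List the times (seconds into the grace period) of informational messages."""
    if initial_s <= 0:
        return []  # a value less than or equal to zero disables the messages
    mult = max(1, min(mult, 10))  # documented clamp: one through ten, inclusive
    times = []
    t = initial_s
    while t < stall_warning_s:
        times.append(t)
        t += mult * t  # assumed model: next interval = multiplier * elapsed time
    return times


print(info_message_times(10, 3, 600))  # [10, 40, 160]
```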

smp.csd_lock_timeout=[KNL]

With this parameter, you can specify the period of time in milliseconds that smp_call_function() and friends will wait for a CPU to release the CSD lock. This is useful when diagnosing bugs involving CPUs disabling interrupts for extended periods of time.

  • Defaults to 5,000 milliseconds.
  • Setting a value of zero disables this feature.
  • This feature may be more efficiently disabled using the csdlock_debug kernel parameter.
srcutree.big_cpu_lim=[KNL]

With this parameter, you can specify the number of CPUs constituting a large system, such that srcu_struct structures should immediately allocate an srcu_node array.

  • Defaults to 128.
  • Takes effect only when the low-order four bits of srcutree.convert_to_big are equal to 3 (decide at boot).
srcutree.convert_to_big=[KNL]

With this parameter, you can specify under what conditions an SRCU tree srcu_struct structure will be converted to big form, that is, with an rcu_node tree:

  • 0: Never.
  • 1: At init_srcu_struct() time.
  • 2: When rcutorture decides to.
  • 3: Decide at boot time (default).
  • 0x1X: Any of the above, plus conversion under high contention.

    Either way, the srcu_node tree will be sized based on the actual runtime number of CPUs (nr_cpu_ids) instead of the compile-time CONFIG_NR_CPUS.
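As an illustration of how the value is laid out, the following Python sketch splits a srcutree.convert_to_big value into the mode held in the low-order four bits and the 0x10 contention bit. The function name and the returned strings are hypothetical and are not part of the kernel.

```python
from typing import Tuple

MODES = {
    0: "never",
    1: "at init_srcu_struct() time",
    2: "when rcutorture decides to",
    3: "decide at boot time",
}


def decode_convert_to_big(value: int) -> Tuple[str, bool]:
    """Return (mode description, contention-based conversion enabled)."""
    mode = value & 0xF                  # low-order four bits select the mode
    on_contention = bool(value & 0x10)  # 0x10 bit adds conversion under high contention
    return MODES.get(mode, "unknown"), on_contention


print(decode_convert_to_big(0x13))  # ('decide at boot time', True)
print(decode_convert_to_big(1))     # ('at init_srcu_struct() time', False)
```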

srcutree.srcu_max_nodelay=[KNL]

With this parameter, you can specify the number of no-delay instances per jiffy for which the SRCU grace period worker thread will be rescheduled with zero delay. Beyond this limit, the worker thread will be rescheduled with a sleep delay of one jiffy.

srcutree.srcu_max_nodelay_phase=[KNL]

With this parameter, you can specify the number of non-sleeping polls of readers per grace-period phase. Beyond this limit, the grace period worker thread will be rescheduled with a sleep delay of one jiffy between each rescan of the readers, for a grace period phase.

srcutree.srcu_retry_check_delay=[KNL]

With this parameter, you can specify the number of microseconds of non-sleeping delay between each non-sleeping poll of readers.

srcutree.small_contention_lim=[KNL]

With this parameter, you can specify the number of update-side contention events per jiffy that will be tolerated before initiating a conversion of an srcu_struct structure to big form.

Note

The value of srcutree.convert_to_big must have the 0x10 bit set for contention-based conversions to occur.

Updated kernel parameters

crashkernel=size[KMG][@offset[KMG]]

[KNL] Using kexec, Linux can switch to a crash kernel upon panic. This parameter reserves the physical memory region [offset, offset + size] for that kernel image. If @offset is omitted, then a suitable offset is selected automatically.

[KNL, X86-64, ARM64] Select a region under 4G first, and fall back to reserving a region above 4G when @offset has not been specified.

For more details, see Documentation/admin-guide/kdump/kdump.rst.

crashkernel=size[KMG],low
  • [KNL, X86-64, ARM64] With this parameter, you can specify a low range under 4G for the second kernel. When crashkernel=X,high is passed, some amount of low memory is still required. For example, swiotlb requires at least 64M+32K of low memory, and enough extra low memory is needed to make sure that DMA buffers for 32-bit devices do not run out. The kernel tries to allocate a default amount of memory below 4G automatically. The default size is platform dependent.

    • x86: max(swiotlb_size_or_default() + 8MiB, 256MiB)
    • arm64: 128MiB

      A value of 0 disables low allocation.

      This parameter will be ignored when crashkernel=X,high is not used or memory reserved is below 4G.

  • [KNL, ARM64] With this parameter, you can specify a low range in the DMA zone for the crash dump kernel.

    This parameter will be ignored when crashkernel=X,high is not used.
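The platform-dependent defaults listed above can be worked through in a small Python sketch. The function name is hypothetical, and the 64 MiB default for the swiotlb size is an assumption used only to make the arithmetic concrete.

```python
MIB = 1024 * 1024


def default_low_reservation(arch: str, swiotlb_bytes: int = 64 * MIB) -> int:
    """Return the default low-memory reservation in bytes (illustrative only)."""
    if arch == "x86":
        # x86: max(swiotlb_size_or_default() + 8MiB, 256MiB)
        return max(swiotlb_bytes + 8 * MIB, 256 * MIB)
    if arch == "arm64":
        return 128 * MIB
    raise ValueError(f"no documented default for architecture: {arch}")


print(default_low_reservation("x86") // MIB)             # 256
print(default_low_reservation("x86", 512 * MIB) // MIB)  # 520
print(default_low_reservation("arm64") // MIB)           # 128
```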

deferred_probe_timeout=[KNL]

With this parameter, you can set a timeout in seconds for deferred probe to give up waiting on dependencies to probe. Only specific dependencies (subsystems or drivers) that have opted in will be ignored.

A timeout of 0 will time out at the end of initcalls. If the timeout has not expired, it is restarted by each successful driver registration. This option will also dump out devices still on the deferred probe list after retrying.

driver_async_probe=[KNL]

With this parameter, you can specify a list of driver names to be probed asynchronously. * (the asterisk) matches all driver names.

  • If * is specified, the rest of the listed driver names are those that will NOT match the *.

    Format: <driver_name1>,<driver_name2>…​
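The matching rule, including the inverted meaning of the list when * is present, can be sketched as follows. This Python helper is hypothetical and only mirrors the behavior described above; it is not the kernel's parsing code.

```python
def probes_async(param: str, driver: str) -> bool:
    """Decide whether a driver would be probed asynchronously for a given value."""
    names = {name for name in param.split(",") if name}
    if "*" in names:
        # With "*", the remaining names are the exceptions that do NOT match.
        return driver not in (names - {"*"})
    return driver in names


print(probes_async("nvme,ahci", "nvme"))  # True
print(probes_async("*,ahci", "ahci"))     # False
print(probes_async("*,ahci", "nvme"))     # True
```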

hugetlb_cma=[HW,CMA]

With this parameter, you can specify the size of a CMA area used for allocation of gigantic hugepages. Alternatively, using the node format, you can specify the size of a CMA area per node.

Format: nn[KMGTPE] or (node format) <node>:nn[KMGTPE][,<node>:nn[KMGTPE]]

Reserve a CMA area of given size and allocate gigantic hugepages using the CMA allocator. If enabled, the boot-time allocation of gigantic hugepages is skipped.

hugepages=[HW]

With this parameter, you can specify the number of HugeTLB pages to allocate at boot.

  • If this follows hugepagesz, it specifies the number of pages of hugepagesz to be allocated.
  • If this is the first HugeTLB parameter on the command line, it specifies the number of pages to allocate for the default huge page size.
  • If using node format, the number of pages to allocate per-node can be specified.

    See also Documentation/admin-guide/mm/hugetlbpage.rst.

    Format: <integer> or (node format) <node>:<integer>[,<node>:<integer>]
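The two accepted forms can be told apart with a short parser. The following Python sketch uses a hypothetical function name and performs no validation beyond integer conversion.

```python
from typing import Dict, Union


def parse_hugepages(value: str) -> Union[int, Dict[int, int]]:
    """Parse a hugepages= value in plain or node format (illustrative only)."""
    if ":" not in value:
        return int(value)  # plain form: a single page count
    per_node: Dict[int, int] = {}
    for entry in value.split(","):
        node, count = entry.split(":")
        per_node[int(node)] = int(count)  # node format: <node>:<integer>
    return per_node


print(parse_hugepages("512"))          # 512
print(parse_hugepages("0:256,1:128"))  # {0: 256, 1: 128}
```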

hugetlb_free_vmemmap=[KNL]

This parameter requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP to be enabled. It allows heavy hugetlb users to free up some more memory (7 * PAGE_SIZE for each 2MB hugetlb page).

  • Format: { [oO][Nn]/Y/y/1 | [oO][Ff]/N/n/0 (default) }
  • [oO][Nn]/Y/y/1: enable the feature
  • [oO][Ff]/N/n/0: disable the feature

    When the kernel is built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y, the default is on.

    Note

    This parameter is not compatible with memory_hotplug.memmap_on_memory. If both parameters are enabled, hugetlb_free_vmemmap takes precedence over memory_hotplug.memmap_on_memory.

ivrs_ioapic=[HW,X86-64]

This parameter provides an override to the IOAPIC-ID <-> DEVICE-ID mapping provided in the IVRS ACPI table.

By default, PCI segment is 0, and can be omitted. For example:

  • to map IOAPIC-ID decimal 10 to PCI device 00:14.0, write the parameter as:

    ivrs_ioapic[10]=00:14.0
  • to map IOAPIC-ID decimal 10 to PCI segment 0x1 and PCI device 00:14.0, write the parameter as:

    ivrs_ioapic[10]=0001:00:14.0
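The two example strings above can be generated with a small Python helper. The function name is hypothetical, and the hexadecimal field widths are an assumption based on the usual PCI address notation shown in the examples; the PCI segment is emitted only when it is nonzero.

```python
def ivrs_ioapic(ioapic_id: int, bus: int, dev: int, fn: int, segment: int = 0) -> str:
    """Build an ivrs_ioapic override string (illustrative only)."""
    device = f"{bus:02x}:{dev:02x}.{fn:x}"
    if segment:
        # Segment 0 is the default and can be omitted.
        device = f"{segment:04x}:{device}"
    return f"ivrs_ioapic[{ioapic_id}]={device}"


print(ivrs_ioapic(10, 0x00, 0x14, 0))             # ivrs_ioapic[10]=00:14.0
print(ivrs_ioapic(10, 0x00, 0x14, 0, segment=1))  # ivrs_ioapic[10]=0001:00:14.0
```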
ivrs_hpet=[HW,X86-64]

This parameter provides an override to the HPET-ID <-> DEVICE-ID mapping provided in the IVRS ACPI table.

By default, PCI segment is 0, and can be omitted. For example:

  • to map HPET-ID decimal 0 to PCI device 00:14.0, write the parameter as:

    ivrs_hpet[0]=00:14.0
  • to map HPET-ID decimal 10 to PCI segment 0x1 and PCI device 00:14.0, write the parameter as:

    ivrs_hpet[10]=0001:00:14.0
ivrs_acpihid=[HW,X86-64]

This parameter provides an override to the ACPI-HID:UID <-> DEVICE-ID mapping provided in the IVRS ACPI table.

For example, to map UART-HID:UID AMD0020:0 to PCI segment 0x1 and PCI device ID 00:14.5, write the parameter as:

ivrs_acpihid[0001:00:14.5]=AMD0020:0

By default, PCI segment is 0, and can be omitted. For example, for the PCI device 00:14.5 write the parameter as:

ivrs_acpihid[00:14.5]=AMD0020:0
kvm.eager_page_split=[KVM,X86]

With this parameter, you can control whether or not KVM will try to proactively split all huge pages during dirty logging.

Eager page splitting reduces interruptions to vCPU execution by eliminating the write-protection faults and MMU lock contention that would otherwise be required to split huge pages lazily. VM workloads that rarely perform writes or that write only to a small region of VM memory may benefit from disabling eager page splitting to allow huge pages to still be used for reads.

The behavior of eager page splitting depends on whether KVM_DIRTY_LOG_INITIALLY_SET is enabled or disabled.

  • If disabled, all huge pages in a memslot will be eagerly split when dirty logging is enabled on that memslot.
  • If enabled, eager page splitting will be performed during the KVM_CLEAR_DIRTY ioctl, and only for the pages being cleared.

    Eager page splitting is only supported when kvm.tdp_mmu=Y.

    Defaults to Y (on).

kvm-arm.mode=[KVM,ARM]

With this parameter, you can select one of KVM/arm64’s modes of operation.

  • none: Forcefully disable KVM.
  • nvhe: Standard nVHE-based mode, without support for protected guests.
  • protected: nVHE-based mode with support for guests whose state is kept private from the host.

    Defaults to VHE/nVHE based on hardware support.

nosmep=[X86,PPC64s]

With this parameter, you can disable SMEP (Supervisor Mode Execution Prevention) even if it is supported by the processor.

pci=option[,option…] [PCI]

With this parameter, you can specify various PCI subsystem options.

Some options herein operate on a specific device or a set of devices (<pci_dev>). These are specified in one of the following formats:

[<domain>:]<bus>:<dev>.<func>[/<dev>.<func>]*
pci:<vendor>:<device>[:<subvendor>:<subdevice>]
Note
  • The first format specifies a PCI bus/device/function address which may change if new hardware is inserted, if motherboard firmware changes, or due to changes caused by other kernel parameters. If the domain is left unspecified, it is taken to be zero. Optionally, a path to a device through multiple device and function addresses can be specified after the base address (this is more robust against renumbering issues).
  • The second format selects devices using IDs from the configuration space which may match multiple devices in the system.
  • earlydump: dump PCI config space before the kernel changes anything
  • off: [X86] do not probe for the PCI bus
  • bios: [X86-32] force use of PCI BIOS, do not access the hardware directly. Use this if your machine has a non-standard PCI host bridge.
  • nobios: [X86-32] disallow use of PCI BIOS, only direct hardware access methods are allowed. Use this if you experience crashes upon bootup and you suspect they are caused by the BIOS.
  • conf1: [X86] Force use of PCI Configuration Access Mechanism 1 (configuration address in IO port 0xCF8, data in IO port 0xCFC, both 32-bit).
  • conf2: [X86] Force use of PCI Configuration Access Mechanism 2 (IO port 0xCF8 is an 8-bit port for the function, IO port 0xCFA, also 8-bit, sets bus number. The config space is then accessed through ports 0xC000-0xCFFF).

  • noaer: [PCIE] If the PCIEAER kernel configuration parameter is enabled, this kernel boot option can be used to disable the use of PCIE advanced error reporting.
  • nodomains: [PCI] Disable support for multiple PCI root domains (aka PCI segments, in ACPI-speak).
  • nommconf: [X86] Disable use of MMCONFIG for PCI Configuration
  • check_enable_amd_mmconf: [X86] Check for and enable properly configured MMIO access to PCI config space on AMD family 10h CPU.
  • nomsi: [MSI] If the PCI_MSI kernel configuration parameter is enabled, this kernel boot option can be used to disable the use of MSI interrupts system-wide.
  • noioapicquirk: [APIC] Disable all boot interrupt quirks. Safety option to keep boot IRQs enabled. This should never be necessary.
  • ioapicreroute: [APIC] Enable rerouting of boot IRQs to the primary IO-APIC for bridges that cannot disable boot IRQs. This fixes a source of spurious IRQs when the system masks IRQs.
  • noioapicreroute: [APIC] Disable workaround that uses the boot IRQ equivalent of an IRQ that connects to a chipset where boot IRQs cannot be disabled. The opposite of ioapicreroute.
  • biosirq: [X86-32] Use PCI BIOS calls to get the interrupt routing table. These calls are known to be buggy on several machines and they hang the machine when used, but on other computers it is the only way to get the interrupt routing table. Try this option if the kernel is unable to allocate IRQs or discover secondary PCI buses on your Motherboard.
  • rom: [X86] Assign address space to expansion ROMs. Use with caution as certain devices share address decoders between ROMs and other resources.
  • norom: [X86] Do not assign address space to expansion ROMs that do not already have BIOS assigned address ranges.
  • nobar: [X86] Do not assign address space to the BARs that were not assigned by the BIOS.
  • irqmask=0xMMMM: [X86] Set a bit mask of IRQs allowed to be assigned automatically to PCI devices. You can make the kernel exclude IRQs of your ISA cards this way.
  • pirqaddr=0xAAAAA: [X86] Specify the physical address of the PIRQ table (normally generated by the BIOS) if it is outside the F0000h-100000h range.
  • lastbus=N: [X86] Scan all buses through bus #N. Can be useful if the kernel is unable to find your secondary buses and you want to tell it explicitly which ones they are.
  • assign-busses: [X86] Always assign all PCI bus numbers ourselves, overriding whatever the firmware may have done.
  • usepirqmask: [X86] Honor the possible IRQ mask stored in the BIOS $PIR table. This is needed on some systems with broken BIOSes, notably some HP Pavilion N5400 and Omnibook XE3 notebooks. This will have no effect if ACPI IRQ routing is enabled.
  • noacpi: [X86] Do not use ACPI for IRQ routing or for PCI scanning.
  • use_crs: [X86] Use PCI host bridge window information from ACPI. On BIOSes from 2008 or later, this is enabled by default. If you need to use this, please report a bug.
  • nocrs: [X86] Ignore PCI host bridge windows from ACPI. If you need to use this, please report a bug.
  • use_e820: [X86] Use E820 reservations to exclude parts of PCI host bridge windows. This is a workaround for BIOS defects in host bridge _CRS methods. If you need to use this, please report a bug to linux-pci@vger.kernel.org.
  • no_e820: [X86] Ignore E820 reservations for PCI host bridge windows. This is the default on modern hardware. If you need to use this, please report a bug to linux-pci@vger.kernel.org.
  • routeirq: Do IRQ routing for all PCI devices. This is normally done in pci_enable_device(), so this option is a temporary workaround for broken drivers that do not call it.
  • skip_isa_align: [X86] Do not align the I/O start address, so that more PCI cards can be handled.
  • noearly: [X86] Do not do any early type 1 scanning. This might help on some broken boards which machine check when some devices' config space is read. But various workarounds are disabled and some IOMMU drivers will not work.
  • bfsort: Sort PCI devices into breadth-first order. This sorting is done to get a device order compatible with older (<= 2.4) kernels.
  • nobfsort: Do not sort PCI devices into breadth-first order.
  • pcie_bus_tune_off: Disable PCIe MPS (Max Payload Size) tuning and use the BIOS-configured MPS defaults.
  • pcie_bus_safe: Set every device’s MPS to the largest value supported by all devices below the root complex.
  • pcie_bus_perf: Set device MPS to the largest allowable MPS based on its parent bus. Also set MRRS (Max Read Request Size) to the largest supported value (no larger than the MPS that the device or bus can support) for best performance.
  • pcie_bus_peer2peer: Set every device’s MPS to 128B, which every device is guaranteed to support. This configuration allows peer-to-peer DMA between any pair of devices, possibly at the cost of reduced performance. This also guarantees that hot-added devices will work.
  • cbiosize=nn[KMG]: The fixed amount of bus space which is reserved for the CardBus bridge’s IO window. The default value is 256 bytes.
  • cbmemsize=nn[KMG]: The fixed amount of bus space which is reserved for the CardBus bridge’s memory window. The default value is 64 megabytes.
  • resource_alignment=

    • Format: [<order of align>@]<pci_dev>[; …​]
    • Specifies alignment and device to reassign aligned memory resources. How to specify the device is described above. If <order of align> is not specified, PAGE_SIZE is used as alignment. A PCI-PCI bridge can be specified if resource windows need to be expanded. To specify the alignment for several instances of a device, the PCI vendor, device, subvendor, and subdevice may be specified, for example, 12@pci:8086:9c22:103c:198f for 4096-byte alignment.
  • ecrc=: Enable/disable PCIe ECRC (transaction layer end-to-end CRC checking).

    • bios: Use BIOS/firmware settings. This is the default.
    • off: Turn ECRC off
    • on: Turn ECRC on.
  • hpiosize=nn[KMG]: The fixed amount of bus space which is reserved for hotplug bridge’s IO window. Default size is 256 bytes.
  • hpmmiosize=nn[KMG]: The fixed amount of bus space which is reserved for hotplug bridge’s MMIO window. Default size is 2 megabytes.
  • hpmmioprefsize=nn[KMG]: The fixed amount of bus space which is reserved for hotplug bridge’s MMIO_PREF window. Default size is 2 megabytes.
  • hpmemsize=nn[KMG]: The fixed amount of bus space which is reserved for hotplug bridge’s MMIO and MMIO_PREF window. Default size is 2 megabytes.
  • hpbussize=nn: The minimum amount of additional bus numbers reserved for buses below a hotplug bridge. Default is 1.
  • realloc=: Enable/disable reallocating PCI bridge resources if allocations done by BIOS are too small to accommodate resources required by all child devices.

    • off: Turn realloc off
    • on: Turn realloc on
  • realloc: same as realloc=on
  • noari: do not use PCIe ARI.
  • noats: [PCIE, Intel-IOMMU, AMD-IOMMU] do not use PCIe ATS (and IOMMU device IOTLB).
  • pcie_scan_all: Scan all possible PCIe devices. Otherwise we only look for one device below a PCIe downstream port.
  • big_root_window: Try to add a big 64bit memory window to the PCIe root complex on AMD CPUs. Some GFX hardware can resize a BAR to allow access to all VRAM. Adding the window is slightly risky (it may conflict with unreported devices), so this taints the kernel.
  • disable_acs_redir=<pci_dev>[; …​]: Specify one or more PCI devices (in the format specified above) separated by semicolons. Each device specified will have the PCI ACS redirect capabilities forced off which will allow P2P traffic between devices through bridges without forcing it upstream. Note: this removes isolation between devices and may put more devices in an IOMMU group.
  • force_floating: [S390] Force usage of floating interrupts.
  • nomio: [S390] Do not use MIO instructions.
  • norid: [S390] ignore the RID field and force use of one PCI domain per PCI function
rcupdate.rcu_cpu_stall_timeout=[KNL]

With this parameter, you can set the timeout for RCU CPU stall warning messages. The value is in seconds and the maximum allowed value is 300 seconds.

rcupdate.rcu_task_stall_timeout=[KNL]

With this parameter, you can set the timeout in jiffies for RCU task stall warning messages. Disable with a value less than or equal to zero.

Defaults to 10 minutes.

A change in value does not take effect until the beginning of the next grace period.

retbleed=[X86]

With this parameter, you can control mitigation of RETBleed (Arbitrary Speculative Code Execution with Return Instructions) vulnerability.

AMD-based UNRET and IBPB mitigations alone do not stop sibling threads from influencing the predictions of other sibling threads. For that reason, STIBP is used on processors that support it, and SMT is mitigated on processors that do not.

  • off - no mitigation
  • auto - automatically select a mitigation
  • auto,nosmt - automatically select a mitigation, disabling SMT if necessary for the full mitigation (only on Zen1 and older without STIBP).
  • ibpb - On AMD, mitigate short speculation windows on basic block boundaries too. Safe, highest perf impact. It also enables STIBP if present. Not suitable on Intel.
  • ibpb,nosmt - Like ibpb above but will disable SMT when STIBP is not available. This is the alternative for systems which do not have STIBP.
  • unret - Force enable untrained return thunks, only effective on AMD f15h-f17h based systems.
  • unret,nosmt - Like unret, but will disable SMT when STIBP is not available. This is the alternative for systems which do not have STIBP.

    Selecting auto will choose a mitigation method at run time according to the CPU.

    Not specifying this option is equivalent to retbleed=auto.

swiotlb=[ARM,IA-64,PPC,MIPS,X86]

Format: { <int> [,<int>] | force | noforce }

  • <int> - Number of I/O TLB slabs
  • <int> - Second integer after comma. Number of swiotlb areas with their own lock. Will be rounded up to a power of 2.
  • force - Force the use of bounce buffers even if they would not be automatically used by the kernel
  • noforce - Never use bounce buffers (for debugging)
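The format above can be parsed with a few lines of Python. This is a sketch with hypothetical function names; it follows only the documented format, including rounding the number of areas up to a power of 2.

```python
from typing import Dict, Union


def round_up_pow2(n: int) -> int:
    """Round a positive integer up to the next power of 2."""
    return 1 << (n - 1).bit_length() if n > 1 else 1


def parse_swiotlb(value: str) -> Dict[str, Union[int, str]]:
    """Parse a swiotlb= value (illustrative only)."""
    if value in ("force", "noforce"):
        return {"mode": value}
    parts = value.split(",")
    result: Dict[str, Union[int, str]] = {"slabs": int(parts[0])}
    if len(parts) > 1:
        # Second integer: number of swiotlb areas, rounded up to a power of 2.
        result["areas"] = round_up_pow2(int(parts[1]))
    return result


print(parse_swiotlb("65536,3"))  # {'slabs': 65536, 'areas': 4}
print(parse_swiotlb("force"))    # {'mode': 'force'}
```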

New sysctl parameters

kernel.nmi_wd_lpm_factor (PPC only)

This factor represents the percentage added to watchdog_thresh when calculating the NMI watchdog timeout during an LPM. The soft lockup timeout is not impacted. Use this factor to apply to the NMI watchdog timeout (only when nmi_watchdog is set to 1).

  • A value of 0 means no change.
  • Defaults to 200, which means that the NMI watchdog is set to 30s (based on watchdog_thresh equal to 10).
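The 30-second figure follows from adding the given percentage of watchdog_thresh to watchdog_thresh itself. A minimal Python sketch of that arithmetic, with a hypothetical function name:

```python
def nmi_watchdog_timeout_lpm(watchdog_thresh_s: int, factor_percent: int) -> float:
    """NMI watchdog timeout during an LPM, in seconds (illustrative only)."""
    # The factor is the percentage of watchdog_thresh added on top of it.
    return watchdog_thresh_s * (1 + factor_percent / 100)


print(nmi_watchdog_timeout_lpm(10, 200))  # 30.0
print(nmi_watchdog_timeout_lpm(10, 0))    # 10.0
```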
net.core.txrehash

With this parameter, you can control default hash rethink behavior on listening socket when the SO_TXREHASH option is set to SOCK_TXREHASH_DEFAULT (that is, not overridden by setsockopt).

  • If set to 1 (default), hash rethink is performed on listening socket.
  • If set to 0, hash rethink is not performed.
net.sctp.reconf_enable - BOOLEAN

With this parameter, you can enable or disable the Stream Reconfiguration extension specified in RFC6525. This extension provides the ability to "reset" a stream and includes the parameters of Outgoing/Incoming SSN Reset, SSN/TSN Reset and Add Outgoing/Incoming Streams.

  • 1: Enable extension.
  • 0: Disable extension.
  • Defaults to 0.
net.sctp.intl_enable - BOOLEAN

With this parameter, you can enable or disable the User Message Interleaving extension specified in RFC8260. This extension allows the interleaving of user messages sent on different streams. With this feature enabled, the I-DATA chunk replaces the DATA chunk to carry user messages if also supported by the peer. Note that to use this feature, you must set this option to 1 and also set the socket options SCTP_FRAGMENT_INTERLEAVE to 2 and SCTP_INTERLEAVING_SUPPORTED to 1.

  • 1: Enable extension.
  • 0: Disable extension.
  • Defaults to 0.
net.sctp.ecn_enable - BOOLEAN

With this parameter, you can control the use of Explicit Congestion Notification (ECN) by SCTP. Like in TCP, ECN is used only when both ends of the SCTP connection indicate support for it. This feature is useful in avoiding losses due to congestion by allowing supporting routers to signal congestion before having to drop packets.

  • 1: Enable ECN.
  • 0: Disable ECN.
  • Defaults to 1.
vm.hugetlb_optimize_vmemmap

This knob is not available when the memory_hotplug.memmap_on_memory kernel parameter is configured or the size of struct page (a structure defined in include/linux/mm_types.h) is not a power of two (an unusual system configuration could result in this).

You can enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap pages associated with each HugeTLB page.

  • If enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from the buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages per 1GB HugeTLB page), whereas already allocated HugeTLB pages will not be optimized. When those optimized HugeTLB pages are freed from the HugeTLB pool to the buddy allocator, the vmemmap pages representing that range need to be remapped again and the vmemmap pages discarded earlier need to be reallocated again.
  • If your use case is that HugeTLB pages are allocated impromptu (for example, never explicitly allocating HugeTLB pages with nr_hugepages but only setting nr_overcommit_hugepages, so that those overcommitted HugeTLB pages are allocated impromptu) instead of being pulled from the HugeTLB pool, you should weigh the benefits of memory savings against the increased overhead (~2x slower than before) of allocating or freeing HugeTLB pages between the HugeTLB pool and the buddy allocator. Another behavior to note is that if the system is under heavy memory pressure, it could prevent the user from freeing HugeTLB pages from the HugeTLB pool to the buddy allocator, because the allocation of vmemmap pages could fail. You have to retry later if your system encounters this situation.
  • If disabled, the vmemmap pages of subsequent allocation of HugeTLB pages from the buddy allocator will not be optimized, meaning that the extra overhead at allocation time from the buddy allocator disappears, whereas already optimized HugeTLB pages will not be affected. If you want to make sure there are no optimized HugeTLB pages, you can set nr_hugepages to 0 first and then disable this. Note that writing 0 to nr_hugepages will make any in-use HugeTLB pages become surplus pages. So, those surplus pages are still optimized until they are no longer in use. You will need to wait for those surplus pages to be released before there are no optimized pages in the system.
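The figures of 7 pages per 2MB HugeTLB page and 4095 pages per 1GB HugeTLB page can be checked with a short calculation. The following Python sketch assumes a 4096-byte base page, a 64-byte struct page, and that one vmemmap page is retained per HugeTLB page; these values are assumptions chosen because they reproduce the quoted figures, and the function name is hypothetical.

```python
PAGE_SIZE = 4096       # assumed base page size, in bytes
STRUCT_PAGE_SIZE = 64  # assumed size of struct page, in bytes


def vmemmap_pages_freed(hugepage_bytes: int) -> int:
    """Number of vmemmap pages freed per HugeTLB page (illustrative only)."""
    base_pages = hugepage_bytes // PAGE_SIZE
    # Each base page is described by one struct page stored in the vmemmap.
    vmemmap_pages = base_pages * STRUCT_PAGE_SIZE // PAGE_SIZE
    return vmemmap_pages - 1  # assumption: one vmemmap page is kept


print(vmemmap_pages_freed(2 * 1024**2))  # 7
print(vmemmap_pages_freed(1024**3))      # 4095
```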
net.core.rps_default_mask

The default RPS CPU mask used on newly created network devices. An empty mask means RPS disabled by default.

Changed sysctl parameters

kernel.numa_balancing

With this parameter, you can enable, disable, and configure automatic page fault based NUMA memory balancing. Memory is moved automatically to nodes that access it often. The value to set can be the result of ORing the following:

  • 0: NUMA_BALANCING_DISABLED
  • 1: NUMA_BALANCING_NORMAL
  • 2: NUMA_BALANCING_MEMORY_TIERING

OR in NUMA_BALANCING_NORMAL to optimize page placement among different NUMA nodes to reduce remote accesses. On NUMA machines, there is a performance penalty if remote memory is accessed by a CPU. When this feature is enabled, the kernel samples which task thread is accessing memory by periodically unmapping pages and later trapping a page fault. At the time of the page fault, it is determined whether the data being accessed should be migrated to a local memory node.

OR in NUMA_BALANCING_MEMORY_TIERING to optimize page placement among different types of memory (represented as different NUMA nodes) to place the hot pages in the fast memory. This is also implemented based on unmapping and page faults.
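Because the value is a bitwise OR of the flags listed above, it can be decoded bit by bit. The following Python sketch uses a hypothetical function name and descriptive strings that are not part of the kernel.

```python
from typing import List

NUMA_BALANCING_DISABLED = 0
NUMA_BALANCING_NORMAL = 1
NUMA_BALANCING_MEMORY_TIERING = 2


def describe_numa_balancing(value: int) -> List[str]:
    """List the modes enabled by a kernel.numa_balancing value (illustrative only)."""
    if value == NUMA_BALANCING_DISABLED:
        return ["disabled"]
    modes = []
    if value & NUMA_BALANCING_NORMAL:
        modes.append("normal")
    if value & NUMA_BALANCING_MEMORY_TIERING:
        modes.append("memory tiering")
    return modes


print(describe_numa_balancing(NUMA_BALANCING_NORMAL | NUMA_BALANCING_MEMORY_TIERING))
# ['normal', 'memory tiering']
```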

net.ipv6.route.max_size

This parameter is now deprecated for IPv6 because garbage collection manages cached route entries.

net.sctp.sctp_wmem

This tunable was previously documented as not having any effect. Now, only the first value (min) is used; the default and max values are ignored.

  • min: Minimum size of send buffer that can be used by SCTP sockets. It is guaranteed to each SCTP socket (but not association) even under moderate memory pressure.
  • Defaults to 4K.