Chapter 5. Important changes to external kernel parameters
This chapter provides system administrators with a summary of significant changes in the kernel distributed with Red Hat Enterprise Linux 9.2. These changes could include, for example, added or updated proc entries, sysctl and sysfs default values, boot parameters, kernel configuration options, or any noticeable behavior changes.
New kernel parameters
- nomodeset
With this kernel parameter, you can disable kernel mode setting. DRM drivers will not perform display-mode changes or accelerated rendering. Only the system frame buffer will be available for use if this was set up by the firmware or boot loader.
nomodeset is useful as a fallback, or for testing and debugging.
- printk.console_no_auto_verbose
With this kernel parameter, you can disable the console loglevel raise on oops, panic, or lockdep-detected issues (only if lock debug is on). Except for setups with a low baud rate on the serial console, set this parameter to 0 to provide more debug information.
- Format: <bool>
- Defaults to 0 (auto_verbose is enabled)
- rcupdate.rcu_exp_cpu_stall_timeout=[KNL]
With this kernel parameter, you can set the timeout for expedited RCU CPU stall warning messages. The value is in milliseconds and the maximum allowed value is 21000 milliseconds.
Note that this value is adjusted to an arch timer tick resolution. Setting this to zero causes the value from rcupdate.rcu_cpu_stall_timeout to be used (after conversion from seconds to milliseconds).
- rcupdate.rcu_task_stall_info=[KNL]
With this parameter, you can set the initial timeout in jiffies for RCU task stall informational messages, which give some indication of the problem for those not patient enough to wait for ten minutes. Informational messages are only printed prior to the stall-warning message for a given grace period. Disable with a value less than or equal to zero.
- Defaults to 10 seconds.
- A change in value does not take effect until the beginning of the next grace period.
- rcupdate.rcu_task_stall_info_mult=[KNL]
This parameter is a multiplier for time interval between successive RCU task stall informational messages for a given RCU tasks grace period. This value is clamped to one through ten, inclusive.
It defaults to the value of three, so that the first informational message is printed 10 seconds into the grace period, the second at 40 seconds, the third at 160 seconds, and then the stall warning at 600 seconds would prevent a fourth at 640 seconds.
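The schedule described above can be reproduced with a short Python sketch. It assumes a simple model in which the gap to the next message is the multiplier times the time already elapsed; this model is chosen only because it yields the 10, 40, 160, 640 sequence quoted in the text and is not taken from the kernel source. The function name and its arguments are hypothetical.

```python
def info_message_times(first_s=10, mult=3, stall_warning_s=600):
    """Return the times, in seconds into the grace period, at which
    informational messages would be printed before the stall warning.

    Assumed model (not taken from the kernel source): the gap to the
    next message is `mult` times the time elapsed so far.
    """
    mult = max(1, min(10, mult))  # the value is clamped to one through ten
    times = []
    t = first_s
    while t < stall_warning_s:
        times.append(t)
        t += mult * t  # with mult=3: 10 -> 40 -> 160 -> 640
    return times


print(info_message_times())  # prints [10, 40, 160]
```

With the defaults, the fourth message would fall at 640 seconds, after the 600-second stall warning, so it is never printed.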
- smp.csd_lock_timeout=[KNL]
With this parameter, you can specify the period of time in milliseconds that smp_call_function() and friends will wait for a CPU to release the CSD lock. This is useful when diagnosing bugs involving CPUs disabling interrupts for extended periods of time.
- Defaults to 5,000 milliseconds.
- Setting a value of zero disables this feature.
- This feature may be more efficiently disabled using the csdlock_debug- kernel parameter.
- srcutree.big_cpu_lim=[KNL]
With this parameter, you can specify the number of CPUs constituting a large system, such that srcu_struct structures should immediately allocate an srcu_node array.
- Defaults to 128.
- Takes effect only when the low-order four bits of srcutree.convert_to_big are equal to 3 (decide at boot).
- srcutree.convert_to_big=[KNL]
With this parameter, you can specify under what conditions an SRCU tree srcu_struct structure will be converted to big form, that is, with an rcu_node tree:
- 0: Never.
- 1: At init_srcu_struct() time.
- 2: When rcutorture decides to.
- 3: Decide at boot time (default).
- 0x1X: Above plus if high contention.
Either way, the srcu_node tree will be sized based on the actual runtime number of CPUs (nr_cpu_ids) instead of the compile-time CONFIG_NR_CPUS.
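As an illustration of how a srcutree.convert_to_big value splits into a mode (the low-order four bits) and the contention flag (the 0x10 bit), here is a minimal Python sketch. The function and constant names are hypothetical and are used only for this example; the kernel does not expose such a helper.

```python
# Mode values as listed above, keyed by the low-order four bits.
MODES = {
    0: "never",
    1: "at init_srcu_struct() time",
    2: "when rcutorture decides to",
    3: "decide at boot time",
}
CONTENTION_BIT = 0x10  # the "0x1X" form: also convert under high contention


def decode_convert_to_big(value):
    """Split a convert_to_big value into (mode description, contention flag)."""
    mode = value & 0x0F
    return MODES.get(mode, "unknown"), bool(value & CONTENTION_BIT)


print(decode_convert_to_big(0x13))  # prints ('decide at boot time', True)
```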
- srcutree.srcu_max_nodelay=[KNL]
- With this parameter, you can specify the number of no-delay instances per jiffy for which the SRCU grace period worker thread will be rescheduled with zero delay. Beyond this limit, the worker thread will be rescheduled with a sleep delay of one jiffy.
- srcutree.srcu_max_nodelay_phase=[KNL]
- With this parameter, you can specify the number of non-sleeping polls of readers per grace-period phase. Beyond this limit, the grace period worker thread will be rescheduled with a sleep delay of one jiffy between each rescan of the readers, for a grace period phase.
- srcutree.srcu_retry_check_delay=[KNL]
- With this parameter, you can specify the number of microseconds of non-sleeping delay between each non-sleeping poll of readers.
- srcutree.small_contention_lim=[KNL]
With this parameter, you can specify the number of update-side contention events per jiffy that will be tolerated before initiating a conversion of an srcu_struct structure to big form.
Note: The value of srcutree.convert_to_big must have the 0x10 bit set for contention-based conversions to occur.
Updated kernel parameters
- crashkernel=size[KMG][@offset[KMG]]
[KNL] Using kexec, Linux can switch to a crash kernel upon panic. This parameter reserves the physical memory region [offset, offset + size] for that kernel image. If @offset is omitted, then a suitable offset is selected automatically.
[KNL, X86-64, ARM64] Select a region under 4G first, and fall back to reserving a region above 4G when @offset has not been specified.
For more details, see Documentation/admin-guide/kdump/kdump.rst.
- crashkernel=size[KMG],low
[KNL, X86-64, ARM64] With this parameter, you can specify a low range under 4G for the second kernel. When crashkernel=X,high is passed, the second kernel still requires some amount of low memory. For example, swiotlb requires at least 64M+32K of low memory, and enough extra low memory is needed to make sure that DMA buffers for 32-bit devices do not run out. The kernel tries to allocate a default size of memory below 4G automatically. The default size is platform dependent.
- x86: max(swiotlb_size_or_default() + 8MiB, 256MiB)
- arm64: 128MiB
- 0: to disable low allocation.
This parameter will be ignored when crashkernel=X,high is not used or memory reserved is below 4G.
[KNL, ARM64] With this parameter, you can specify a low range in the DMA zone for the crash dump kernel.
This parameter will be ignored when crashkernel=X,high is not used.
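The platform-dependent default can be worked through with a small Python sketch. The 64 MiB swiotlb size used here is an assumption made for illustration (the actual value returned by swiotlb_size_or_default() depends on the system and on the swiotlb= setting), and the function name is hypothetical.

```python
MIB = 1024 * 1024

# Assumption for illustration only: a 64 MiB swiotlb buffer.
ASSUMED_SWIOTLB_SIZE = 64 * MIB


def default_low_reservation(arch, swiotlb_size=ASSUMED_SWIOTLB_SIZE):
    """Default size, in bytes, reserved below 4G, per the formulas above."""
    if arch == "x86":
        return max(swiotlb_size + 8 * MIB, 256 * MIB)
    if arch == "arm64":
        return 128 * MIB
    raise ValueError(f"no default listed for architecture {arch!r}")


print(default_low_reservation("x86") // MIB)  # prints 256
```

Under this assumption the 256MiB floor dominates on x86; the swiotlb term only matters once the buffer exceeds 248 MiB.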
- deferred_probe_timeout=[KNL]
With this parameter, you can set a timeout in seconds for deferred probe to give up waiting on dependencies to probe. Only specific dependencies (subsystems or drivers) that have opted in will be ignored.
A timeout of 0 will time out at the end of initcalls. If the timeout has not expired, it will be restarted by each successful driver registration. This option will also dump out devices still on the deferred probe list after retrying.
- driver_async_probe=[KNL]
With this parameter, you can specify a list of driver names to be probed asynchronously. * (the asterisk) matches all driver names. If * is specified, the rest of the listed driver names are those that will NOT match the *.
Format: <driver_name1>,<driver_name2>…
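The matching rule can be sketched in Python as follows. This is only a model of the behavior described above, not kernel code; the function name and the driver names in the example are hypothetical.

```python
def probes_asynchronously(driver, param):
    """Model of the driver_async_probe matching rule described above.

    Without "*", only listed drivers are probed asynchronously.
    With "*", every driver is, except the other names in the list.
    """
    names = [name for name in param.split(",") if name]
    if "*" in names:
        excluded = set(names) - {"*"}
        return driver not in excluded
    return driver in names


# Hypothetical driver names, for illustration only.
print(probes_asynchronously("drv_a", "drv_a,drv_b"))  # prints True
print(probes_asynchronously("drv_a", "*,drv_a"))      # prints False
```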
- hugetlb_cma=[HW,CMA]
With this parameter, you can specify the size of a CMA area used for allocation of gigantic hugepages or, using node format, the size of a CMA area per node.
Format: nn[KMGTPE] or (node format) <node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
Reserve a CMA area of given size and allocate gigantic hugepages using the CMA allocator. If enabled, the boot-time allocation of gigantic hugepages is skipped.
- hugepages=[HW]
With this parameter, you can specify the number of HugeTLB pages to allocate at boot.
- If this follows hugepagesz, it specifies the number of pages of hugepagesz to be allocated.
- If this is the first HugeTLB parameter on the command line, it specifies the number of pages to allocate for the default huge page size.
If using node format, the number of pages to allocate per-node can be specified.
See also Documentation/admin-guide/mm/hugetlbpage.rst.
Format: <integer> or (node format) <node>:<integer>[,<node>:<integer>]
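To make the two accepted forms concrete, the following Python sketch parses a hugepages= value into either a single count or a per-node mapping. It is an illustration of the format only; the function name is hypothetical and the kernel's own parser may accept or reject inputs differently.

```python
def parse_hugepages(value):
    """Parse a hugepages= value.

    Returns an int for the plain form, or a {node: count} dict for
    the node format <node>:<integer>[,<node>:<integer>].
    """
    if ":" not in value:
        return int(value)
    per_node = {}
    for part in value.split(","):
        node, count = part.split(":")
        per_node[int(node)] = int(count)
    return per_node


print(parse_hugepages("512"))          # prints 512
print(parse_hugepages("0:256,1:128"))  # prints {0: 256, 1: 128}
```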
- hugetlb_free_vmemmap=[KNL]
This parameter requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP to be enabled. It allows heavy hugetlb users to free up some more memory (7 * PAGE_SIZE for each 2MB hugetlb page).
- Format: { [oO][Nn]/Y/y/1 | [oO][Ff]/N/n/0 (default) }
- [oO][Nn]/Y/y/1: enable the feature
- [oO][Ff]/N/n/0: disable the feature
When the kernel is built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y, this parameter defaults to on.
Note: This parameter is not compatible with memory_hotplug.memmap_on_memory. If both parameters are enabled, hugetlb_free_vmemmap takes precedence over memory_hotplug.memmap_on_memory.
- ivrs_ioapic=[HW,X86-64]
This parameter provides an override to the IOAPIC-ID <-> DEVICE-ID mapping provided in the IVRS ACPI table.
By default, the PCI segment is 0 and can be omitted. For example, to map IOAPIC-ID decimal 10 to PCI device 00:14.0, write the parameter as:
ivrs_ioapic[10]=00:14.0
To map IOAPIC-ID decimal 10 to PCI segment 0x1 and PCI device 00:14.0, write the parameter as:
ivrs_ioapic[10]=0001:00:14.0
- ivrs_hpet=[HW,X86-64]
This parameter provides an override to the HPET-ID <-> DEVICE-ID mapping provided in the IVRS ACPI table.
By default, the PCI segment is 0 and can be omitted. For example, to map HPET-ID decimal 0 to PCI device 00:14.0, write the parameter as:
ivrs_hpet[0]=00:14.0
To map HPET-ID decimal 10 to PCI segment 0x1 and PCI device 00:14.0, write the parameter as:
ivrs_hpet[10]=0001:00:14.0
- ivrs_acpihid=[HW,X86-64]
This parameter provides an override to the ACPI-HID:UID <-> DEVICE-ID mapping provided in the IVRS ACPI table.
For example, to map UART-HID:UID AMD0020:0 to PCI segment 0x1 and PCI device ID 00:14.5, write the parameter as:
ivrs_acpihid[0001:00:14.5]=AMD0020:0
By default, the PCI segment is 0 and can be omitted. For example, for the PCI device 00:14.5, write the parameter as:
ivrs_acpihid[00:14.5]=AMD0020:0
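The three ivrs_* overrides share one pattern: the PCI segment is written as four hexadecimal digits and is omitted when it is 0. The Python sketch below builds the strings shown in the examples above; the helper name and its arguments are hypothetical and exist only for this illustration.

```python
def ivrs_override(kind, ident, device, segment=0):
    """Build an ivrs_ioapic, ivrs_hpet, or ivrs_acpihid parameter string.

    kind    -- "ioapic", "hpet", or "acpihid"
    ident   -- IOAPIC-ID or HPET-ID (decimal), or an ACPI HID:UID string
    device  -- PCI device in bus:dev.func form, for example "00:14.0"
    segment -- PCI segment; omitted from the output when 0
    """
    if kind not in ("ioapic", "hpet", "acpihid"):
        raise ValueError(f"unknown override kind {kind!r}")
    target = device if segment == 0 else f"{segment:04x}:{device}"
    if kind == "acpihid":
        # For ivrs_acpihid the device goes in the brackets.
        return f"ivrs_acpihid[{target}]={ident}"
    return f"ivrs_{kind}[{ident}]={target}"


print(ivrs_override("ioapic", 10, "00:14.0", segment=1))
# prints ivrs_ioapic[10]=0001:00:14.0
```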
- kvm.eager_page_split=[KVM,X86]
With this parameter, you can control whether or not KVM will try to proactively split all huge pages during dirty logging.
Eager page splitting reduces interruptions to vCPU execution by eliminating the write-protection faults and MMU lock contention that would otherwise be required to split huge pages lazily. VM workloads that rarely perform writes or that write only to a small region of VM memory may benefit from disabling eager page splitting to allow huge pages to still be used for reads.
The behavior of eager page splitting depends on whether KVM_DIRTY_LOG_INITIALLY_SET is enabled or disabled.
- If disabled, all huge pages in a memslot will be eagerly split when dirty logging is enabled on that memslot.
- If enabled, eager page splitting will be performed during the KVM_CLEAR_DIRTY ioctl, and only for the pages being cleared.
Eager page splitting is only supported when kvm.tdp_mmu=Y.
Defaults to Y (on).
- kvm-arm.mode=[KVM,ARM]
With this parameter, you can select one of KVM/arm64’s modes of operation.
- none: Forcefully disable KVM.
- nvhe: Standard nVHE-based mode, without support for protected guests.
- protected: nVHE-based mode with support for guests whose state is kept private from the host.
Defaults to VHE/nVHE based on hardware support.
- nosmep=[X86,PPC64s]
With this parameter, you can disable SMEP (Supervisor Mode Execution Prevention) even if it is supported by the processor.
- pci=option[,option…] [PCI]
Various PCI subsystem options.
Some options herein operate on a specific device or a set of devices (<pci_dev>). These are specified in one of the following formats:
[<domain>:]<bus>:<dev>.<func>[/<dev>.<func>]*
pci:<vendor>:<device>[:<subvendor>:<subdevice>]
Note:
- The first format specifies a PCI bus/device/function address which may change if new hardware is inserted, if motherboard firmware changes, or due to changes caused by other kernel parameters. If the domain is left unspecified, it is taken to be zero. Optionally, a path to a device through multiple device and function addresses can be specified after the base address (this is more robust against renumbering issues).
- The second format selects devices using IDs from the configuration space which may match multiple devices in the system.
- earlydump: dump PCI config space before the kernel changes anything
- off: [X86] do not probe for the PCI bus
- bios: [X86-32] force use of PCI BIOS, do not access the hardware directly. Use this if your machine has a non-standard PCI host bridge.
- nobios: [X86-32] disallow use of PCI BIOS, only direct hardware access methods are allowed. Use this if you experience crashes upon bootup and you suspect they are caused by the BIOS.
- conf1: [X86] Force use of PCI Configuration Access Mechanism 1 (configuration address in IO port 0xCF8, data in IO port 0xCFC, both 32-bit).
- conf2: [X86] Force use of PCI Configuration Access Mechanism 2 (IO port 0xCF8 is an 8-bit port for the function, IO port 0xCFA, also 8-bit, sets bus number. The config space is then accessed through ports 0xC000-0xCFFF).
- See http://wiki.osdev.org/PCI for more info on the configuration access mechanisms.
- noaer: [PCIE] If the PCIEAER kernel configuration parameter is enabled, this kernel boot option can be used to disable the use of PCIE advanced error reporting.
- nodomains: [PCI] Disable support for multiple PCI root domains (aka PCI segments, in ACPI-speak).
- nommconf: [X86] Disable use of MMCONFIG for PCI Configuration
- check_enable_amd_mmconf [X86]: check for and enable properly configured MMIO access to PCI config space on AMD family 10h CPU
- nomsi: [MSI] If the PCI_MSI kernel configuration parameter is enabled, this kernel boot option can be used to disable the use of MSI interrupts system-wide.
- noioapicquirk: [APIC] Disable all boot interrupt quirks. Safety option to keep boot IRQs enabled. This should never be necessary.
- ioapicreroute: [APIC] Enable rerouting of boot IRQs to the primary IO-APIC for bridges that cannot disable boot IRQs. This fixes a source of spurious IRQs when the system masks IRQs.
- noioapicreroute: [APIC] Disable workaround that uses the boot IRQ equivalent of an IRQ that connects to a chipset where boot IRQs cannot be disabled. The opposite of ioapicreroute.
- biosirq: [X86-32] Use PCI BIOS calls to get the interrupt routing table. These calls are known to be buggy on several machines and they hang the machine when used, but on other computers it is the only way to get the interrupt routing table. Try this option if the kernel is unable to allocate IRQs or discover secondary PCI buses on your Motherboard.
- rom: [X86] Assign address space to expansion ROMs. Use with caution as certain devices share address decoders between ROMs and other resources.
- norom: [X86] Do not assign address space to expansion ROMs that do not already have BIOS assigned address ranges.
- nobar: [X86] Do not assign address space to the BARs that were not assigned by the BIOS.
- irqmask=0xMMMM: [X86] Set a bit mask of IRQs allowed to be assigned automatically to PCI devices. You can make the kernel exclude IRQs of your ISA cards this way.
- pirqaddr=0xAAAAA: [X86] Specify the physical address of the PIRQ table (normally generated by the BIOS) if it is outside the F0000h-100000h range.
- lastbus=N: [X86] Scan all buses through bus #N. Can be useful if the kernel is unable to find your secondary buses and you want to tell it explicitly which ones they are.
- assign-busses: [X86] Always assign all PCI bus numbers ourselves, overriding whatever the firmware may have done.
- usepirqmask: [X86] Honor the possible IRQ mask stored in the BIOS $PIR table. This is needed on some systems with broken BIOSes, notably some HP Pavilion N5400 and Omnibook XE3 notebooks. This will have no effect if ACPI IRQ routing is enabled.
- noacpi: [X86] Do not use ACPI for IRQ routing or for PCI scanning.
- use_crs: [X86] Use PCI host bridge window information from ACPI. On BIOSes from 2008 or later, this is enabled by default. If you need to use this, please report a bug.
- nocrs: [X86] Ignore PCI host bridge windows from ACPI. If you need to use this, please report a bug.
- use_e820: [X86] Use E820 reservations to exclude parts of PCI host bridge windows. This is a workaround for BIOS defects in host bridge _CRS methods. If you need to use this, please report a bug to linux-pci@vger.kernel.org.
- no_e820: [X86] Ignore E820 reservations for PCI host bridge windows. This is the default on modern hardware. If you need to use this, please report a bug to linux-pci@vger.kernel.org.
- routeirq: Do IRQ routing for all PCI devices. This is normally done in pci_enable_device(), so this option is a temporary workaround for broken drivers that do not call it.
- skip_isa_align: [X86] Do not align the I/O start address, so that more PCI cards can be handled.
- noearly: [X86] Do not do any early type 1 scanning. This might help on some broken boards which machine check when some devices' config space is read. But various workarounds are disabled and some IOMMU drivers will not work.
- bfsort: Sort PCI devices into breadth-first order. This sorting is done to get a device order compatible with older (<= 2.4) kernels.
- nobfsort: Do not sort PCI devices into breadth-first order.
- pcie_bus_tune_off: Disable PCIe MPS (Max Payload Size) tuning and use the BIOS-configured MPS defaults.
- pcie_bus_safe: Set every device’s MPS to the largest value supported by all devices below the root complex.
- pcie_bus_perf: Set device MPS to the largest allowable MPS based on its parent bus. Also set MRRS (Max Read Request Size) to the largest supported value (no larger than the MPS that the device or bus can support) for best performance.
- pcie_bus_peer2peer: Set every device’s MPS to 128B, which every device is guaranteed to support. This configuration allows peer-to-peer DMA between any pair of devices, possibly at the cost of reduced performance. This also guarantees that hot-added devices will work.
- cbiosize=nn[KMG]: The fixed amount of bus space which is reserved for the CardBus bridge’s IO window. The default value is 256 bytes.
- cbmemsize=nn[KMG]: The fixed amount of bus space which is reserved for the CardBus bridge’s memory window. The default value is 64 megabytes.
- resource_alignment=
Format: [<order of align>@]<pci_dev>[; …]
Specifies alignment and device to reassign aligned memory resources. How to specify the device is described above. If <order of align> is not specified, PAGE_SIZE is used as alignment. A PCI-PCI bridge can be specified if resource windows need to be expanded. To specify the alignment for several instances of a device, the PCI vendor, device, subvendor, and subdevice may be specified, for example, 12@pci:8086:9c22:103c:198f for 4096-byte alignment.
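The order of alignment is a power-of-two exponent, which is why an order of 12 gives 4096-byte alignment. A minimal Python sketch of reading one resource_alignment entry follows; the function name is hypothetical, and the 4096-byte fallback stands in for PAGE_SIZE under the assumption of 4 KiB pages.

```python
def parse_resource_alignment(entry, page_size=4096):
    """Return (alignment_in_bytes, pci_dev) for one resource_alignment entry.

    page_size is an assumption (4 KiB pages); it is used only when no
    order of alignment is given.
    """
    if "@" in entry:
        order, pci_dev = entry.split("@", 1)
        return 1 << int(order), pci_dev  # alignment is 2**order bytes
    return page_size, entry


print(parse_resource_alignment("12@pci:8086:9c22:103c:198f"))
# prints (4096, 'pci:8086:9c22:103c:198f')
```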
- ecrc=: Enable/disable PCIe ECRC (transaction layer end-to-end CRC checking).
- bios: Use BIOS/firmware settings. This is the default.
- off: Turn ECRC off
- on: Turn ECRC on.
- hpiosize=nn[KMG]: The fixed amount of bus space which is reserved for hotplug bridge’s IO window. Default size is 256 bytes.
- hpmmiosize=nn[KMG]: The fixed amount of bus space which is reserved for hotplug bridge’s MMIO window. Default size is 2 megabytes.
- hpmmioprefsize=nn[KMG]: The fixed amount of bus space which is reserved for hotplug bridge’s MMIO_PREF window. Default size is 2 megabytes.
- hpmemsize=nn[KMG]: The fixed amount of bus space which is reserved for hotplug bridge’s MMIO and MMIO_PREF window. Default size is 2 megabytes.
- hpbussize=nn: The minimum amount of additional bus numbers reserved for buses below a hotplug bridge. Default is 1.
- realloc=: Enable/disable reallocating PCI bridge resources if allocations done by BIOS are too small to accommodate resources required by all child devices.
- off: Turn realloc off
- on: Turn realloc on
- realloc: same as realloc=on
- noari: do not use PCIe ARI.
- noats: [PCIE, Intel-IOMMU, AMD-IOMMU] do not use PCIe ATS (and IOMMU device IOTLB).
- pcie_scan_all: Scan all possible PCIe devices. Otherwise we only look for one device below a PCIe downstream port.
- big_root_window: Try to add a big 64bit memory window to the PCIe root complex on AMD CPUs. Some GFX hardware can resize a BAR to allow access to all VRAM. Adding the window is slightly risky (it may conflict with unreported devices), so this taints the kernel.
- disable_acs_redir=<pci_dev>[; …]: Specify one or more PCI devices (in the format specified above) separated by semicolons. Each device specified will have the PCI ACS redirect capabilities forced off which will allow P2P traffic between devices through bridges without forcing it upstream. Note: this removes isolation between devices and may put more devices in an IOMMU group.
- force_floating: [S390] Force usage of floating interrupts.
- nomio: [S390] Do not use MIO instructions.
- norid: [S390] ignore the RID field and force use of one PCI domain per PCI function
- rcupdate.rcu_cpu_stall_timeout=[KNL]
- Set timeout for RCU CPU stall warning messages. The value is in seconds and the maximum allowed value is 300 seconds.
- rcupdate.rcu_task_stall_timeout=[KNL]
With this parameter, you can set the timeout in jiffies for RCU task stall warning messages. Disable with a value less than or equal to zero.
Defaults to 10 minutes.
A change in value does not take effect until the beginning of the next grace period.
- retbleed=[X86]
With this parameter, you can control mitigation of RETBleed (Arbitrary Speculative Code Execution with Return Instructions) vulnerability.
AMD-based UNRET and IBPB mitigations alone do not stop sibling threads from influencing the predictions of other sibling threads. For that reason, STIBP is used on processors that support it, and SMT is mitigated on processors that do not.
- off - no mitigation
- auto - automatically select a mitigation
- auto,nosmt - automatically select a mitigation, disabling SMT if necessary for the full mitigation (only on Zen1 and older without STIBP).
- ibpb - On AMD, mitigate short speculation windows on basic block boundaries too. Safe, highest perf impact. It also enables STIBP if present. Not suitable on Intel.
- ibpb,nosmt - Like ibpb above but will disable SMT when STIBP is not available. This is the alternative for systems which do not have STIBP.
- unret - Force enable untrained return thunks, only effective on AMD f15h-f17h based systems.
- unret,nosmt - Like unret, but will disable SMT when STIBP is not available. This is the alternative for systems which do not have STIBP.
Selecting auto will choose a mitigation method at run time according to the CPU.
Not specifying this option is equivalent to retbleed=auto.
- swiotlb=[ARM,IA-64,PPC,MIPS,X86]
Format: { <int> [,<int>] | force | noforce }
- <int> - Number of I/O TLB slabs
- <int> - Second integer after comma. Number of swiotlb areas with their own lock. Will be rounded up to a power of 2.
- force - Force the use of bounce buffers even if they would not be automatically used by the kernel
- noforce - Never use bounce buffers (for debugging)
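The rounding of the second integer can be illustrated with a short Python sketch that follows the format line above. It is a model for illustration, not the kernel's parser; the function names and the dictionary keys are hypothetical.

```python
def round_up_pow2(n):
    """Round a positive integer up to the nearest power of two."""
    if n < 1:
        raise ValueError("expected a positive integer")
    return 1 << (n - 1).bit_length()


def parse_swiotlb(value):
    """Parse a swiotlb= value of the form { <int> [,<int>] | force | noforce }."""
    if value in ("force", "noforce"):
        return {"mode": value}
    parts = value.split(",")
    result = {"slabs": int(parts[0])}
    if len(parts) > 1:
        # The number of areas is rounded up to a power of 2.
        result["areas"] = round_up_pow2(int(parts[1]))
    return result


print(parse_swiotlb("65536,5"))  # prints {'slabs': 65536, 'areas': 8}
```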
New sysctl parameters
- kernel.nmi_wd_lpm_factor (PPC only)
This factor represents the percentage added to watchdog_thresh when calculating the NMI watchdog timeout during an LPM. The soft lockup timeout is not impacted. This factor applies to the NMI watchdog timeout only when nmi_watchdog is set to 1.
- A value of 0 means no change.
- Defaults to 200, which means that the NMI watchdog is set to 30s (based on watchdog_thresh equal to 10).
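The 30s figure follows from adding 200 percent of watchdog_thresh to watchdog_thresh itself. A minimal Python sketch of that arithmetic, with a hypothetical function name:

```python
def nmi_timeout_during_lpm(watchdog_thresh=10, lpm_factor=200):
    """NMI watchdog timeout in seconds during an LPM.

    lpm_factor is the percentage of watchdog_thresh that is added
    on top of watchdog_thresh.
    """
    return watchdog_thresh + watchdog_thresh * lpm_factor // 100


print(nmi_timeout_during_lpm())  # prints 30
```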
- net.core.txrehash
With this parameter, you can control the default hash rethink behavior on a listening socket when the SO_TXREHASH option is set to SOCK_TXREHASH_DEFAULT (that is, not overridden by setsockopt).
- If set to 1 (default), hash rethink is performed on the listening socket.
- If set to 0, hash rethink is not performed.
- net.sctp.reconf_enable - BOOLEAN
With this parameter, you can enable or disable the extension of Stream Reconfiguration functionality specified in RFC6525. This extension provides the ability to "reset" a stream and includes the parameters of Outgoing/Incoming SSN Reset, SSN/TSN Reset, and Add Outgoing/Incoming Streams.
- 1: Enable extension.
- 0: Disable extension.
- Defaults to 0.
- net.sctp.intl_enable - BOOLEAN
With this parameter, you can enable or disable the extension of User Message Interleaving functionality specified in RFC8260. This extension allows the interleaving of user messages sent on different streams. With this feature enabled, the I-DATA chunk will replace the DATA chunk to carry user messages if also supported by the peer. Note that to use this feature, you must set this option to 1 and also set the socket options SCTP_FRAGMENT_INTERLEAVE to 2 and SCTP_INTERLEAVING_SUPPORTED to 1.
- 1: Enable extension.
- 0: Disable extension.
- Defaults to 0.
- net.sctp.ecn_enable - BOOLEAN
With this extension, you can control use of Explicit Congestion Notification (ECN) by SCTP. Like in TCP, ECN is used only when both ends of the SCTP connection indicate support for it. This feature is useful in avoiding losses due to congestion by allowing supporting routers to signal congestion before having to drop packets.
- 1: Enable ecn.
- 0: Disable ecn.
- Defaults to 1.
- vm.hugetlb_optimize_vmemmap
This knob is not available when the memory_hotplug.memmap_on_memory kernel parameter is configured or the size of struct page (a structure defined in include/linux/mm_types.h) is not a power of two (an unusual system configuration could result in this).
You can enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap pages associated with each HugeTLB page.
- If enabled, the vmemmap pages of subsequent allocations of HugeTLB pages from the buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages per 1GB HugeTLB page), whereas already allocated HugeTLB pages will not be optimized. When those optimized HugeTLB pages are freed from the HugeTLB pool to the buddy allocator, the vmemmap pages representing that range need to be remapped and the vmemmap pages discarded earlier need to be reallocated.
- If your use case is that HugeTLB pages are allocated impromptu (for example, never explicitly allocating HugeTLB pages with nr_hugepages but only setting nr_overcommit_hugepages, so that those overcommitted HugeTLB pages are allocated impromptu) instead of being pulled from the HugeTLB pool, you should weigh the benefits of memory savings against the additional overhead (~2x slower than before) of allocating or freeing HugeTLB pages between the HugeTLB pool and the buddy allocator. Another behavior to note is that if the system is under heavy memory pressure, it could prevent the user from freeing HugeTLB pages from the HugeTLB pool to the buddy allocator because the allocation of vmemmap pages could fail. You have to retry later if your system encounters this situation.
- If disabled, the vmemmap pages of subsequent allocations of HugeTLB pages from the buddy allocator will not be optimized, meaning that the extra overhead at allocation time from the buddy allocator disappears, whereas already optimized HugeTLB pages will not be affected. If you want to make sure there are no optimized HugeTLB pages, you can set nr_hugepages to 0 first and then disable this. Note that writing 0 to nr_hugepages will make any in-use HugeTLB pages become surplus pages. So, those surplus pages are still optimized until they are no longer in use. You will need to wait for those surplus pages to be released before there are no optimized pages in the system.
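The figures of 7 and 4095 pages can be derived with a short Python sketch. It rests on three assumptions that are not stated in the text: a 4 KiB base page size, a 64-byte struct page, and one vmemmap page being retained per HugeTLB page. The function name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def vmemmap_pages_saved(hugepage_size, page_size=4 * KIB, struct_page_size=64):
    """Number of vmemmap pages freed for one HugeTLB page.

    Assumptions: 4 KiB base pages, a 64-byte struct page, and one
    vmemmap page kept per HugeTLB page.
    """
    base_pages = hugepage_size // page_size              # struct pages needed
    vmemmap_pages = base_pages * struct_page_size // page_size
    return vmemmap_pages - 1                             # one page is retained


print(vmemmap_pages_saved(2 * MIB))  # prints 7
print(vmemmap_pages_saved(1 * GIB))  # prints 4095
```

Under the same assumptions, 1000 optimized 2MB HugeTLB pages free 7000 base pages, which is roughly 27 MiB.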
- net.core.rps_default_mask
- The default RPS CPU mask used on newly created network devices. An empty mask means RPS disabled by default.
Changed sysctl parameters
- kernel.numa_balancing
With this parameter, you can enable, disable, and configure automatic page fault based NUMA memory balancing. Memory is moved automatically to nodes that access it often. The value to set can be the result of ORing the following:
= =================================
0 NUMA_BALANCING_DISABLED
1 NUMA_BALANCING_NORMAL
2 NUMA_BALANCING_MEMORY_TIERING
= =================================
Or NUMA_BALANCING_NORMAL to optimize page placement among different NUMA nodes to reduce remote accessing. On NUMA machines, there is a performance penalty if remote memory is accessed by a CPU. When this feature is enabled, the kernel samples what task thread is accessing memory by periodically unmapping pages and later trapping a page fault. At the time of the page fault, it is determined if the data being accessed should be migrated to a local memory node.
Or NUMA_BALANCING_MEMORY_TIERING to optimize page placement among different types of memory (represented as different NUMA nodes) to place the hot pages in the fast memory. This is implemented based on unmapping and page fault, too.
- net.ipv6.route.max_size
- This is now deprecated for ipv6 as garbage collection manages cached route entries.
- net.sctp.sctp_wmem
This tunable previously was documented as not having any effect. Now, only the first value (min) is used; default and max are ignored.
- min: Minimum size of send buffer that can be used by SCTP sockets. It is guaranteed to each SCTP socket (but not association) even under moderate memory pressure.
- Defaults to 4K.