Chapter 5. Important changes to external kernel parameters
This chapter provides system administrators with a summary of significant changes in the kernel distributed with Red Hat Enterprise Linux 9.1. These changes could include, for example, added or updated proc entries, sysctl and sysfs default values, boot parameters, kernel configuration options, or any noticeable behavior changes.
New kernel parameters
- allow_mismatched_32bit_el0 = [ARM64]
With this parameter you can allow systems with mismatched 32-bit support at the EL0 level to run 32-bit applications. The set of CPUs supporting 32-bit EL0 is indicated by the /sys/devices/system/cpu/aarch32_el0 file. When this parameter is set, hot-unplug operations might be restricted.
For more information, see Documentation/arm64/asymmetric-32bit.rst.
- arm64.nomte = [ARM64]
With this parameter you can unconditionally disable Memory Tagging Extension (MTE) support.
- i8042.probe_defer = [HW]
With this parameter you can allow deferred probing on i8042 probe errors.
- idxd.tc_override = [HW]
With this parameter in the <bool> format, you can allow override of default traffic class configuration for the device.
The default value is set to false (0).
- kvm.eager_page_split = [KVM,X86]
With this parameter you can control whether KVM proactively splits all huge pages during dirty logging. Eager page splitting reduces interruptions to vCPU execution by eliminating the write-protection faults and Memory Management Unit (MMU) lock contention that are otherwise required to split huge pages lazily.
VM workloads that rarely perform writes or that write only to a small region of VM memory can benefit from disabling eager page splitting to allow huge pages to still be used for reads.
The behavior of eager page splitting depends on whether the KVM_DIRTY_LOG_INITIALLY_SET option is enabled or disabled.
  - If disabled, all huge pages in a memslot are eagerly split when dirty logging is enabled on that memslot.
  - If enabled, eager page splitting is performed during the KVM_CLEAR_DIRTY ioctl() system call, and only for the pages being cleared.
Eager page splitting currently only supports splitting huge pages mapped by the two dimensional paging (TDP) MMU.
The default value is set to Y (on).
- kvm.nx_huge_pages_recovery_period_ms = [KVM]
With this parameter you can control the time period at which KVM zaps 4 KiB pages back to huge pages.
  - If the value is a non-zero N, KVM zaps a portion of the pages every N milliseconds.
  - If the value is 0, KVM picks a period based on the ratio, such that a page is zapped after 1 hour on average.
The default value is set to 0.
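As a rough illustration of the default behavior, the following Python sketch computes the period that results when the value is 0. It assumes that the ratio is the value of kvm.nx_huge_pages_recovery_ratio and that KVM zaps 1/ratio of the pages in each period, so that one hour divided by the ratio gives the period. The function name is hypothetical and the exact computation in the kernel might differ.

```python
HOUR_MS = 60 * 60 * 1000  # one hour expressed in milliseconds


def effective_recovery_period_ms(period_ms: int, ratio: int) -> int:
    """Return the recovery period in milliseconds (0 means no recovery).

    Assumption: with period_ms == 0, the period is chosen so that zapping
    1/ratio of the pages in each period covers all pages in about one hour.
    """
    if ratio <= 0:
        return 0  # a ratio of 0 disables the recovery
    if period_ms > 0:
        return period_ms  # an explicit non-zero period is used as-is
    return HOUR_MS // ratio


if __name__ == "__main__":
    # Period derived for a value of 0 and a ratio of 60.
    print(effective_recovery_period_ms(0, 60))
```

Under these assumptions, a ratio of 60 corresponds to a period of one minute.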
- l1d_flush = [X86,INTEL]
With this parameter you can control mitigation for the L1D-based snooping vulnerability.
Certain CPUs are vulnerable to an exploit against CPU internal buffers which can, under certain conditions, forward information to a disclosure gadget. In vulnerable processors, the speculatively forwarded data can be used in a cache side channel attack to access data to which the attacker does not have direct access.
The available option is on, which means enable the interface for the mitigation.
- mmio_stale_data = [X86,INTEL]
With this parameter you can control mitigation for the Processor Memory-mapped I/O (MMIO) Stale Data vulnerabilities.
Processor MMIO Stale Data is a class of vulnerabilities that can expose data after an MMIO operation. Exposed data could originate or end in the same CPU buffers as those affected by Microarchitectural Data Sampling (MDS) and Transactional Asynchronous Abort (TAA). Therefore, similar to MDS and TAA, the mitigation is to clear the affected CPU buffers.
The available options are:
  - full: enable mitigation on vulnerable CPUs.
  - full,nosmt: enable mitigation and disable SMT on vulnerable CPUs.
  - off: unconditionally disable mitigation.
On MDS or TAA affected machines, mmio_stale_data=off can be prevented by an active MDS or TAA mitigation as these vulnerabilities are mitigated with the same mechanism. Thus, in order to disable this mitigation, you need to specify mds=off and tsx_async_abort=off, too.
Not specifying this option is equivalent to mmio_stale_data=full.
For more information, see Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst.
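The interaction between mmio_stale_data=off and the MDS and TAA mitigations can be checked with a short script. The following Python sketch is only an illustration: the function names are hypothetical, the parser ignores quoting, and it does not account for other options, such as mitigations=off, that might imply the same settings.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value  # a later occurrence overrides an earlier one
    return params


def mmio_off_is_effective(cmdline: str) -> bool:
    """Check the condition described for MDS or TAA affected machines.

    Returns True only if mmio_stale_data=off is accompanied by both
    mds=off and tsx_async_abort=off.
    """
    params = parse_cmdline(cmdline)
    required = ("mmio_stale_data", "mds", "tsx_async_abort")
    return all(params.get(name) == "off" for name in required)


if __name__ == "__main__":
    print(mmio_off_is_effective("ro quiet mmio_stale_data=off mds=off"))
```

To check a running system, you could pass the content of the /proc/cmdline file to mmio_off_is_effective().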
- random.trust_bootloader={on,off} = [KNL]
With this parameter you can enable or disable trusting the use of a seed passed by the boot loader (if available) to fully seed the kernel’s CRNG. The default behavior is controlled by the CONFIG_RANDOM_TRUST_BOOTLOADER option.
- rcupdate.rcu_task_collapse_lim = [KNL]
With this parameter you can set the maximum number of callbacks present at the beginning of a grace period that allows the RCU Tasks flavors to collapse back to using a single callback queue. This switching only occurs when the rcupdate.rcu_task_enqueue_lim option is set to the default value of -1.
- rcupdate.rcu_task_contend_lim = [KNL]
With this parameter you can set the minimum number of callback-queuing-time lock-contention events per jiffy required to cause the RCU Tasks flavors to switch to per-CPU callback queuing. This switching only occurs when the rcupdate.rcu_task_enqueue_lim option is set to the default value of -1.
- rcupdate.rcu_task_enqueue_lim = [KNL]
With this parameter you can set the number of callback queues to use for the RCU Tasks family of RCU flavors. The default value of -1 allows the number of callback queues to be adjusted automatically and dynamically.
This parameter is intended for use in testing.
- retbleed = [X86]
With this parameter you can control mitigation of the Arbitrary Speculative Code Execution with Return Instructions (RETBleed) vulnerability. The available options are:
  - off: no mitigation.
  - auto: automatically select a mitigation.
  - auto,nosmt: automatically select a mitigation, disabling SMT if necessary for the full mitigation (only on Zen1 and older without STIBP).
  - ibpb: mitigate short speculation windows on basic block boundaries too. Safe, highest performance impact.
  - unret: force enable untrained return thunks, only effective on AMD f15h-f17h based systems.
  - unret,nosmt: like the unret option, but will disable SMT when STIBP is not available.
Selecting the auto option chooses a mitigation method at run time according to the CPU.
Not specifying this option is equivalent to retbleed=auto.
- sev=option[,option…] = [X86-64]
For more information, see Documentation/x86/x86_64/boot-options.rst.
Updated kernel parameters
- acpi_sleep = [HW,ACPI]
Format: { s3_bios, s3_mode, s3_beep, s4_hwsig, s4_nohwsig, old_ordering, nonvs, sci_force_enable, nobl }
  - For more information on s3_bios and s3_mode, see Documentation/power/video.rst.
  - s3_beep is for debugging; it makes the PC’s speaker beep as soon as the kernel real-mode entry point is called.
  - s4_hwsig causes the kernel to check the ACPI hardware signature during resume from hibernation, and gracefully refuse to resume if it has changed. The default behavior is to allow resume and simply warn when the signature changes, unless the s4_hwsig option is enabled.
  - s4_nohwsig prevents ACPI hardware signature from being used, or even warned about, during resume.
  - old_ordering causes the ACPI 1.0 ordering of the _PTS control method, with respect to putting devices into low power states, to be enforced. The ACPI 2.0 ordering of _PTS is used by default.
  - nonvs prevents the kernel from saving and restoring the ACPI NVS memory during suspend, hibernation, and resume.
  - sci_force_enable causes the kernel to set SCI_EN directly on resume from S1/S3. Even though this behavior is contrary to the ACPI specifications, some corrupted systems do not work without it.
  - nobl causes the internal denylist of systems known to behave incorrectly in some ways with respect to system suspend and resume to be ignored. Use this option wisely.
For more information, see Documentation/power/video.rst.
- crashkernel=size[KMG],high = [KNL, X86-64, ARM64]
With this parameter you can allocate a physical memory region from the top as follows:
  - If the system has more than 4 GB RAM installed, a physical memory region can exceed 4 GB.
  - If the system has less than 4 GB RAM installed, a physical memory region will be allocated below 4 GB, if available.
This parameter is ignored if the crashkernel=X parameter is specified.
- crashkernel=size[KMG],low = [KNL, X86-64]
When you pass crashkernel=X,high, the kernel can allocate a physical memory region above 4 GB. This can cause the second kernel to fail on systems that require some amount of low memory (for example, swiotlb requires at least 64M+32K low memory) and enough extra low memory to make sure DMA buffers for 32-bit devices are not exhausted. The kernel tries to allocate at least 256 M below 4 GB automatically. With this parameter you can specify the low range under 4 GB for the second kernel instead.
  - 0: disables low allocation.
This parameter is ignored when crashkernel=X,high is not used or the reserved memory is below 4 GB.
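To make the size[KMG] notation concrete, the following Python sketch converts a value such as 256M,low into a number of bytes and a placement. It is a minimal illustration that handles only the size[KMG][,high|,low] form shown here, not any other crashkernel syntax; the function name is hypothetical.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_crashkernel(value: str):
    """Parse 'size[KMG][,high|,low]' into (size_in_bytes, placement).

    placement is 'high', 'low', or None when no suffix is given.
    """
    size, _, placement = value.partition(",")
    if placement not in ("", "high", "low"):
        raise ValueError(f"unsupported placement: {placement!r}")
    suffix = size[-1:].upper()
    if suffix in UNITS:
        size_bytes = int(size[:-1]) * UNITS[suffix]
    else:
        size_bytes = int(size)  # plain number of bytes
    return size_bytes, placement or None


if __name__ == "__main__":
    print(parse_crashkernel("256M,low"))
```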
- crashkernel=size[KMG],low = [KNL, ARM64]
With this parameter you can specify a low range in the DMA zone for the crash dump kernel. It will be ignored when crashkernel=X,high is not used or memory reserved is located in the DMA zones.
- kvm.nx_huge_pages_recovery_ratio = [KVM]
With this parameter you can control how many 4 KiB pages are periodically zapped back to huge pages:
  - 0: disables the recovery.
  - N: KVM will zap 1/Nth of the 4 KiB pages every period.
The default is set to 60.
- kvm-arm.mode = [KVM,ARM]
With this parameter you can select one of the KVM modes of operation:
  - none: forcefully disable KVM.
  - nvhe: standard nVHE-based mode, without support for protected guests.
  - protected: nVHE-based mode with support for guests whose state is kept private from the host. Not valid if the kernel is running in the EL2 level.
The default value is set to VHE/nVHE based on hardware support.
- mitigations = [X86,PPC,S390,ARM64]
With this parameter you can control optional mitigations for CPU vulnerabilities. This is a set of curated, arch-independent options, each of which is an aggregation of existing arch-specific options:
  - off: disable all optional CPU mitigations. This improves system performance, but it may also expose users to several CPU vulnerabilities.
    - Equivalent to: nopti [X86,PPC], kpti=0 [ARM64], nospectre_v1 [X86,PPC], nobp=0 [S390], nospectre_v2 [X86,PPC,S390,ARM64], spectre_v2_user=off [X86], spec_store_bypass_disable=off [X86,PPC], ssbd=force-off [ARM64], l1tf=off [X86], mds=off [X86], tsx_async_abort=off [X86], kvm.nx_huge_pages=off [X86], no_entry_flush [PPC], no_uaccess_flush [PPC], mmio_stale_data=off [X86].
    - Exceptions: This does not have any effect on kvm.nx_huge_pages when the kvm.nx_huge_pages=force option is specified.
  - auto (default): mitigate all CPU vulnerabilities, but leave SMT enabled, even if it is vulnerable.
    - Equivalent to: (default behavior)
  - auto,nosmt: mitigate all CPU vulnerabilities, disabling SMT if needed.
    - Equivalent to: l1tf=flush,nosmt [X86], mds=full,nosmt [X86], tsx_async_abort=full,nosmt [X86], mmio_stale_data=full,nosmt [X86]
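To see which individual options mitigations=off stands for on a given architecture, the equivalence list for the off value can be expressed as data. The following Python sketch is an illustration only: the table is transcribed from the list in this section, and the function name is hypothetical.

```python
# Options aggregated by mitigations=off, with the architecture tags
# given in this section.
MITIGATIONS_OFF = [
    ("nopti", {"X86", "PPC"}),
    ("kpti=0", {"ARM64"}),
    ("nospectre_v1", {"X86", "PPC"}),
    ("nobp=0", {"S390"}),
    ("nospectre_v2", {"X86", "PPC", "S390", "ARM64"}),
    ("spectre_v2_user=off", {"X86"}),
    ("spec_store_bypass_disable=off", {"X86", "PPC"}),
    ("ssbd=force-off", {"ARM64"}),
    ("l1tf=off", {"X86"}),
    ("mds=off", {"X86"}),
    ("tsx_async_abort=off", {"X86"}),
    ("kvm.nx_huge_pages=off", {"X86"}),
    ("no_entry_flush", {"PPC"}),
    ("no_uaccess_flush", {"PPC"}),
    ("mmio_stale_data=off", {"X86"}),
]


def expand_mitigations_off(arch: str) -> list:
    """Return the options that mitigations=off aggregates for one arch tag."""
    return [option for option, arches in MITIGATIONS_OFF if arch in arches]


if __name__ == "__main__":
    print(expand_mitigations_off("ARM64"))
    # prints ['kpti=0', 'nospectre_v2', 'ssbd=force-off']
```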
- rcu_nocbs[=cpu-list] = [KNL]
The optional argument is a CPU list.
In kernels built with CONFIG_RCU_NOCB_CPU=y, you can enable the no-callback CPU mode, which prevents such CPUs' callbacks from being invoked in softirq context. Invocation of such CPUs' RCU callbacks will instead be offloaded to rcuox/N kthreads created for that purpose, where x is p for RCU-preempt, s for RCU-sched, and g for the kthreads that mediate grace periods; and N is the CPU number. This reduces OS jitter on the offloaded CPUs, which can be useful for HPC and real-time workloads. It can also improve energy efficiency for asymmetric multiprocessors.
  - If a cpulist is passed as an argument, the specified list of CPUs is set to no-callback mode from boot.
  - If the = sign and the cpulist arguments are omitted, no CPU will be set to no-callback mode from boot but you can toggle the mode at runtime using cpusets.
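A CPU list is typically written as comma-separated CPU numbers and ranges, for example 1-3,7. The following Python sketch expands such a list into individual CPU numbers. It is a simplified illustration with a hypothetical function name, and it does not handle any additional forms that the kernel might accept.

```python
def parse_cpulist(cpulist: str) -> list:
    """Expand a list such as '1-3,7' into sorted CPU numbers."""
    cpus = set()
    for part in cpulist.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty input and stray commas
        first, dash, last = part.partition("-")
        low = int(first)
        high = int(last) if dash else low
        if high < low:
            raise ValueError(f"invalid range: {part!r}")
        cpus.update(range(low, high + 1))
    return sorted(cpus)


if __name__ == "__main__":
    print(parse_cpulist("1-3,7"))  # prints [1, 2, 3, 7]
```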
- rcutree.kthread_prio = [KNL,BOOT]
With this parameter you can set the SCHED_FIFO priority of the RCU per-CPU kthreads (rcuc/N). This value is also used for the priority of the RCU boost threads (rcub/N) and for the RCU grace-period kthreads (rcu_bh, rcu_preempt, and rcu_sched).
  - If RCU_BOOST is set, valid values are 1-99 and the default is 1, the least-favored priority.
  - If RCU_BOOST is not set, valid values are 0-99 and the default is 0, non-realtime operation.
When RCU_NOCB_CPU is set, you should adjust the priority of NOCB callback kthreads.
- rcutorture.fwd_progress = [KNL]
With this parameter you can specify the number of kthreads to be used for RCU grace-period forward-progress testing for the types of RCU supporting this notion.
The default is set to 1 kthread. Values less than zero or greater than the number of CPUs cause the number of CPUs to be used.
- spectre_v2 = [X86]
With this parameter you can control mitigation of the Spectre variant 2 (indirect branch speculation) vulnerability. The default operation protects the kernel from user space attacks.
  - on: unconditionally enable, implies spectre_v2_user=on
  - off: unconditionally disable, implies spectre_v2_user=off
  - auto: kernel detects whether your CPU model is vulnerable
Selecting on will, and auto may, choose a mitigation method at run time according to the CPU, the available microcode, the setting of the CONFIG_RETPOLINE configuration option, and the compiler with which the kernel was built.
Selecting on will also enable the mitigation against user space to user space task attacks.
Selecting off will disable both the kernel and the user space protections.
Specific mitigations can also be selected manually:
  - retpoline: replace indirect branches
  - retpoline,generic: Retpolines
  - retpoline,lfence: LFENCE; indirect branch
  - retpoline,amd: alias for retpoline,lfence
  - eibrs: enhanced IBRS
  - eibrs,retpoline: enhanced IBRS + Retpolines
  - eibrs,lfence: enhanced IBRS + LFENCE
  - ibrs: use IBRS to protect kernel
Not specifying this option is equivalent to spectre_v2=auto.
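Because a mitigation selected on the command line might be adjusted at run time, it can be useful to check what the kernel reports. The following Python sketch reads the status files under /sys/devices/system/cpu/vulnerabilities. This directory is not described in this chapter and is an assumption of the example, as is the hypothetical function name; files that do not exist on a given system are reported as None.

```python
from pathlib import Path

# Assumed location of the per-vulnerability status files.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def mitigation_status(name: str, base=VULN_DIR):
    """Return the reported status for one vulnerability, or None if absent."""
    try:
        return (Path(base) / name).read_text().strip()
    except OSError:
        return None  # file missing or unreadable on this system


if __name__ == "__main__":
    for name in ("spectre_v2", "retbleed", "mmio_stale_data"):
        print(name, "->", mitigation_status(name))
```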
New sysctl parameters
- max_rcu_stall_to_panic
When panic_on_rcu_stall is set to 1, this value determines the number of times that RCU can stall before panic() is called. When panic_on_rcu_stall is set to 0, this value has no effect.
- perf_user_access = [ARM64]
With this parameter you can control user space access for reading perf event counters.
  - When set to 1, user space can read performance monitor counter registers directly.
  - The default is set to 0, which means access disabled.
For more information, see Documentation/arm64/perf.rst.
- gro_normal_batch
With this parameter you can set the maximum number of the segments to batch up on output of GRO. When a packet exits GRO, either as a coalesced superframe or as an original packet which GRO has decided not to coalesce, it is placed on a per-NAPI list. This list is then passed to the stack when the number of segments reaches the gro_normal_batch limit.
- high_order_alloc_disable
With this parameter you can choose order-0 allocation. By default, the allocator for page fragments tries to use high order pages, that is order-3 on X86 systems. While the default behavior returns good results, in certain situations contention in page allocation and freeing occurs. This was especially true on older kernels (earlier than version 5.14), when high-order pages were not stored on per-CPU lists. This parameter is now mostly of historical importance.
The default value is 0.
- page_lock_unfairness
By specifying the value for this parameter you can determine the number of times that the page lock can be stolen from under a waiter. After the lock is stolen the number of times specified in this file, the fair lock handoff semantics will apply, and the waiter will only be awakened if the lock can be taken.
The default value is 5.
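The current value of a sysctl parameter can be read from the corresponding file under /proc/sys. The following Python sketch maps a dotted sysctl name to its file and reads it. The function names are hypothetical, and the namespaces used in the example (such as vm.page_lock_unfairness or kernel.perf_user_access) are assumptions, because this chapter lists only the short parameter names.

```python
from pathlib import Path


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file, for example vm.x to <root>/vm/x."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys"):
    """Return the value as a string, or None if the parameter is unavailable."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # The namespaces below are assumptions, not taken from this chapter.
    for name in ("vm.page_lock_unfairness", "kernel.perf_user_access"):
        print(name, "=", read_sysctl(name))
```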
Changed sysctl parameters
- urandom_min_reseed_secs
You can use this parameter to determine the minimum number of seconds between urandom pool reseeding. This file is writable for compatibility purposes, but writing to it has no effect on any RNG behavior.
- write_wakeup_threshold
When the entropy count drops below this threshold, expressed as a number of bits, processes waiting to write to the /dev/random file are woken up. This file is writable for compatibility purposes, but writing to it has no effect on any RNG behavior.