Search

Chapter 5. Important changes to external kernel parameters

download PDF

This chapter provides system administrators with a summary of significant changes in the kernel distributed with Red Hat Enterprise Linux 9.1. These changes could include for example added or updated proc entries, sysctl, and sysfs default values, boot parameters, kernel configuration options, or any noticeable behavior changes.

New kernel parameters

allow_mismatched_32bit_el0 = [ARM64]

With this parameter you can allow systems with mismatched 32-bit support at the EL0 level to run 32-bit applications. The set of CPUs supporting 32-bit EL0 is indicated by the /sys/devices/system/cpu/aarch32_el0 file. Also, you can restrict hot-unplug operations.

For more information, see Documentation/arm64/asymmetric-32bit.rst.

arm64.nomte = [ARM64]
With this parameter you can unconditionally disable Memory Tagging Extension (MTE) support.
i8042.probe_defer = [HW]
With this parameter you can allow deferred probing on i8042 probe errors.
idxd.tc_override = [HW]

With this parameter in the <bool> format, you can allow override of default traffic class configuration for the device.

The default value is set to false (0).

kvm.eager_page_split = [KVM,X86]

With this parameter you can control whether or not a KVM proactively splits all huge pages during dirty logging. Eager page splitting reduces interruptions to vCPU execution by eliminating the write-protection faults and Memory Management Unit (MMU) lock contention that is otherwise required to split huge pages lazily.

VM workloads that rarely perform writes or that write only to a small region of VM memory can benefit from disabling eager page splitting to allow huge pages to still be used for reads.

The behavior of eager page splitting depends on whether the KVM_DIRTY_LOG_INITIALLY_SET option is enabled or disabled.

  • If disabled, all huge pages in a memslot are eagerly split when dirty logging is enabled on that memslot.
  • If enabled, eager page splitting is performed during the KVM_CLEAR_DIRTY ioctl() system call, and only for the pages being cleared.

    Eager page splitting currently only supports splitting huge pages mapped by the two dimensional paging (TDP) MMU.

    The default value is set to Y (on).

kvm.nx_huge_pages_recovery_period_ms = [KVM]

With this parameter you can control the time period at which KVM zaps 4 KiB pages back to huge pages.

  • If the value is a non-zero N, KVM zaps a portion of the pages every N milliseconds.
  • If the value is 0, KVM picks a period based on the ratio, such that a page is zapped after 1 hour on average.

    The default value is set to 0.

l1d_flush = [X86,INTEL]

With this parameter you can control mitigation for L1D-based snooping vulnerability.

Certain CPUs are vulnerable to an exploit against CPU internal buffers which can, under certain conditions, forward information to a disclosure gadget. In vulnerable processors, the speculatively forwarded data can be used in a cache side channel attack, to access data to which the attacker does not have direct access.

The available option is on, which means enable the interface for the mitigation.

mmio_stale_data = [X86,INTEL]

With this parameter you can control mitigation for the Processor Memory-mapped I/O (MMIO) Stale Data vulnerabilities.

Processor MMIO Stale Data is a class of vulnerabilities that can expose data after an MMIO operation. Exposed data could originate or end in the same CPU buffers as affected by metadata server (MDS) and Transactional Asynchronous Abort (TAA). Therefore, similar to MDS and TAA, the mitigation is to clear the affected CPU buffers.

The available options are:

  • full: enable mitigation on vulnerable CPUs
  • full,nosmt: enable mitigation and disable SMT on vulnerable CPUs.
  • off: unconditionally disable mitigation

    On MDS or TAA affected machines, mmio_stale_data=off can be prevented by an active MDS or TAA mitigation as these vulnerabilities are mitigated with the same mechanism. Thus, in order to disable this mitigation, you need to specify mds=off and tsx_async_abort=off, too.

    Not specifying this option is equivalent to mmio_stale_data=full.

    For more information, see Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst.

random.trust_bootloader={on,off} = [KNL]
With this parameter you can enable or disable trusting the use of a seed passed by the boot loader (if available) to fully seed the kernel’s CRNG. The default behavior is controlled by the CONFIG_RANDOM_TRUST_BOOTLOADER option.
rcupdate.rcu_task_collapse_lim = [KNL]
With this parameter you can set the maximum number of callbacks present at the beginning of a grace period that allows the RCU Tasks flavors to collapse back to using a single callback queue. This switching only occurs when the rcupdate.rcu_task_enqueue_lim option is set to the default value of -1.
rcupdate.rcu_task_contend_lim = [KNL]
With this parameter you can set the minimum number of callback-queuing-time lock-contention events per jiffy required to cause the RCU Tasks flavors to switch to per-CPU callback queuing. This switching only occurs when the rcupdate.rcu_task_enqueue_lim option is set to the default value of -1.
rcupdate.rcu_task_enqueue_lim = [KNL]

With this parameter you can set the number of callback queues to use for the RCU Tasks family of RCU flavors. You can adjust the number of callback queues automatically and dynamically with the default value of -1.

This parameter is intended for use in testing.

retbleed = [X86]

With this parameter you can control mitigation of Arbitrary Speculative Code Execution with Return Instructions (RETBleed) vulnerability. The available options are:

  • off: no mitigation
  • auto: automatically select a mitigation
  • auto,nosmt: automatically select a mitigation, disabling SMT if necessary for the full mitigation (only on Zen1 and older without STIBP).
  • ibpb: mitigate short speculation windows on basic block boundaries too. Safe, highest performance impact.
  • unret: force enable untrained return thunks, only effective on AMD f15h-f17h based systems.
  • unret,nosmt: like the unret option, will disable SMT when STIBP is not available.

    Selecting the auto option chooses a mitigation method at run time according to the CPU.

    Not specifying this option is equivalent to retbleed=auto.

sev=option[,option…​] = [X86-64]
For more information, see Documentation/x86/x86_64/boot-options.rst.

Updated kernel parameters

acpi_sleep = [HW,ACPI]

Format: { s3_bios, s3_mode, s3_beep, s4_hwsig, s4_nohwsig, old_ordering, nonvs, sci_force_enable, nobl }

  • For more information on s3_bios and s3_mode, see Documentation/power/video.rst.
  • s3_beep is for debugging; it makes the PC’s speaker beep as soon as the kernel real-mode entry point is called.
  • s4_hwsig causes the kernel to check the ACPI hardware signature during resume from hibernation, and gracefully refuse to resume if it has changed. The default behavior is to allow resume and simply warn when the signature changes, unless the s4_hwsig option is enabled.
  • s4_nohwsig prevents ACPI hardware signature from being used, or even warned about, during resume. old_ordering causes the ACPI 1.0 ordering of the _PTS control method, with respect to putting devices into low power states, to be enforced. The ACPI 2.0 ordering of _PTS is used by default.
  • nonvs prevents the kernel from saving and restoring the ACPI NVS memory during suspend, hibernation, and resume.
  • sci_force_enable causes the kernel to set SCI_EN directly on resume from S1/S3. Even though this behavior is contrary to the ACPI specifications, some corrupted systems do not work without it.
  • nobl causes the internal denylist of systems known to behave incorrectly in some ways with respect to system suspend and resume to be ignored. Use this option wisely.

    For more information, see Documentation/power/video.rst.

crashkernel=size[KMG],high = [KNL, X86-64, ARM64]

With this parameter you can allocate physical memory region from top as follows:

  • If the system has more than 4 GB RAM installed, a physical memory region can exceed 4 GB.
  • If the system has less than 4 GB RAM installed, a physical memory region will be allocated below 4 GB, if available.

    This parameter is ignored if the crashkernel=X parameter is specified.

crashkernel=size[KMG],low = [KNL, X86-64]

When you pass crashkernel=X,high, the kernel can allocate a physical memory region above 4 GB. This causes the second kernel crash on systems that require some amount of low memory (for example, swiotlb requires at least 64M+32K low memory) and enough extra low memory to make sure DMA buffers for 32-bit devices are not exhausted. Kernel tries to allocate at least 256 M below 4 GB automatically. With this parameter you can specify the low range under 4 GB for the second kernel instead.

  • 0: disables low allocation. It will be ignored when crashkernel=X,high is not used or memory reserved is below 4 GB.
crashkernel=size[KMG],low = [KNL, ARM64]
With this parameter you can specify a low range in the DMA zone for the crash dump kernel. It will be ignored when crashkernel=X,high is not used or memory reserved is located in the DMA zones.
kvm.nx_huge_pages_recovery_ratio = [KVM]

With this parameter you can control how many 4 KiB pages are periodically zapped back to huge pages:

  • 0 disables the recovery
  • N KVM will zap 1/Nth of the 4 KiB pages every period.

    The default is set to 60.

kvm-arm.mode = [KVM,ARM]

With this parameter you can select one of KVM modes of operation:

  • none: forcefully disable KVM.
  • nvhe: standard nVHE-based mode, without support for protected guests.
  • protected: nVHE-based mode with support for guests whose state is kept private from the host. Not valid if the kernel is running in the EL2 level.

    The default value is set to VHE/nVHE based on hardware support.

mitigations = [X86,PPC,S390,ARM64]

With this parameter you can control optional mitigations for CPU vulnerabilities. This is a set of curated, arch-independent options, each of which is an aggregation of existing arch-specific options:

  • off: disable all optional CPU mitigations. This improves system performance, but it may also expose users to several CPU vulnerabilities.

    • Equivalent to: nopti [X86,PPC], kpti=0 [ARM64], nospectre_v1 [X86,PPC], nobp=0 [S390], nospectre_v2 [X86,PPC,S390,ARM64], spectre_v2_user=off [X86], spec_store_bypass_disable=off [X86,PPC], ssbd=force-off [ARM64], l1tf=off [X86], mds=off [X86], tsx_async_abort=off [X86], kvm.nx_huge_pages=off [X86], no_entry_flush [PPC], no_uaccess_flush [PPC], mmio_stale_data=off [X86].
    • Exceptions: This does not have any effect on kvm.nx_huge_pages when the kvm.nx_huge_pages=force option is specified.
  • auto (default): mitigate all CPU vulnerabilities, but leave SMT enabled, even if it is vulnerable.

    • Equivalent to: (default behavior)
  • auto,nosmt: mitigate all CPU vulnerabilities, disabling SMT if needed.

    • Equivalent to: l1tf=flush,nosmt [X86], mds=full,nosmt [X86], tsx_async_abort=full,nosmt [X86], mmio_stale_data=full,nosmt [X86]
rcu_nocbs[=cpu-list] = [KNL]

The optional argument is a CPU list.

In kernels built with CONFIG_RCU_NOCB_CPU=y, you can enable the no-callback CPU mode, which prevents such CPUs callbacks from being invoked in softirq context. Invocation of such CPUs' RCU callbacks will instead be offloaded to rcuox/N kthreads created for that purpose, where x is p for RCU-preempt, s for RCU-sched, and g for the kthreads that mediate grace periods; and N is the CPU number. This reduces OS jitter on the offloaded CPUs, which can be useful for HPC and real-time workloads. It can also improve energy efficiency for asymmetric multiprocessors.

  • If a cpulist is passed as an argument, the specified list of CPUs is set to no-callback mode from boot.
  • If the = sign and the cpulist arguments are omitted, no CPU will be set to no-callback mode from boot but you can toggle the mode at runtime using cpusets.
rcutree.kthread_prio = [KNL,BOOT]

With this parameter you can set the SCHED_FIFO priority of the RCU per-CPU kthreads (rcuc/N). This value is also used for the priority of the RCU boost threads (rcub/N) and for the RCU grace-period kthreads (rcu_bh, rcu_preempt, and rcu_sched).

  • If RCU_BOOST is set, valid values are 1-99 and the default is 1, the least-favored priority.
  • If RCU_BOOST is not set, valid values are 0-99 and the default is 0, non-realtime operation.

    When RCU_NOCB_CPU is set, you should adjust the priority of NOCB callback kthreads.

rcutorture.fwd_progress = [KNL]

With this parameter you can specify the number of kthreads to be used for RCU grace-period forward-progress testing for the types of RCU supporting this notion.

The default is set to 1 kthread. Values less than zero or greater than the number of CPUs cause the number of CPUs to be used.

spectre_v2 = [X86]

With this parameter you can control mitigation of Spectre variant 2 (indirect branch speculation) vulnerability. The default operation protects the kernel from user space attacks.

  • on: unconditionally enable, implies spectre_v2_user=on
  • off: unconditionally disable, implies spectre_v2_user=off
  • auto: kernel detects whether your CPU model is vulnerable
  • Selecting on will, and auto may, choose a mitigation method at run time according to the CPU, the available microcode, the setting of the CONFIG_RETPOLINE configuration option, and the compiler with which the kernel was built.
  • Selecting on will also enable the mitigation against user space to user space task attacks.
  • Selecting off will disable both the kernel and the user space protections.
  • Specific mitigations can also be selected manually:

    • retpoline: replace indirect branches
    • retpoline,generic: Retpolines
    • retpoline,lfence: LFENCE; indirect branch
    • retpoline,amd: alias for retpoline,lfence
    • eibrs: enhanced IBRS
    • eibrs,retpoline: enhanced IBRS + Retpolines
    • eibrs,lfence: enhanced IBRS + LFENCE
    • ibrs: use IBRS to protect kernel

      Not specifying this option is equivalent to spectre_v2=auto.

New sysctl parameters

max_rcu_stall_to_panic
When you set panic_on_rcu_stall to 1, you determine the number of times that RCU can stall before panic() is called. When you set panic_on_rcu_stall to 0, this value has no effect.
perf_user_access = [ARM64]

With this parameter you can control user space access for reading perf event counters.

  • When set to 1, user space can read performance monitor counter registers directly.
  • The default is set to 0, which means access disabled.

    For more information, see Documentation/arm64/perf.rst.

gro_normal_batch
With this parameter you can set the maximum number of the segments to batch up on output of GRO. When a packet exits GRO, either as a coalesced superframe or as an original packet which GRO has decided not to coalesce, it is placed on a per-NAPI list. This list is then passed to the stack when the number of segments reaches the gro_normal_batch limit.
high_order_alloc_disable

With this parameter you can choose order-0 allocation. By default, the allocator for page fragments tries to use high order pages, that is order-3 on X86 systems. While the default behavior returns good results, in certain situations a contention in page allocations and freeing occurs. This was especially true on older kernels (version 5.14 and higher) when high-order pages were not stored on per-CPU lists. This parameter exists now mostly of historical importance.

The default value is 0.

page_lock_unfairness

By specifying the value for this parameter you can determine the number of times that the page lock can be stolen from under a waiter. After the lock is stolen the number of times specified in this file, the fair lock handoff semantics will apply, and the waiter will only be awakened if the lock can be taken.

The default value is 5.

Changed sysctl parameters

urandom_min_reseed_secs
You can use this parameter to determine the minimum number of seconds between urandom pool reseeding. This file is writable for compatibility purposes, but writing to it has no effect on any RNG behavior.
write_wakeup_threshold
When the entropy count sinks below this threshold in a number of bits, you can wake up processes waiting to write to the /dev/random file. This file is writable for compatibility purposes, but writing to it has no effect on any RNG behavior.
Red Hat logoGithubRedditYoutubeTwitter

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

We help Red Hat users innovate and achieve their goals with our products and services with content they can trust.

Making open source more inclusive

Red Hat is committed to replacing problematic language in our code, documentation, and web properties. For more details, see the Red Hat Blog.

About Red Hat

We deliver hardened solutions that make it easier for enterprises to work across platforms and environments, from the core datacenter to the network edge.

© 2024 Red Hat, Inc.