Chapter 5. Important changes to external kernel parameters


This chapter provides system administrators with a summary of significant changes in the kernel distributed with Red Hat Enterprise Linux 9.5. These changes could include, for example, added or updated proc entries, sysctl, and sysfs default values, boot parameters, kernel configuration options, or any noticeable behavior changes.

New kernel parameters

numa_cma=<node>:nn[MG][,<node>:nn[MG]]

[KNL,CMA]

Sets the size of the kernel NUMA memory area for contiguous memory allocations (CMA) and reserves a CMA area on each specified node.

With NUMA CMA enabled, DMA users on node nid first try to allocate a buffer from the NUMA area located on node nid. If the allocation fails, they fall back to the global default memory area.
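
The numa_cma= syntax packs several node/size pairs into one value. The following Python sketch shows one way to parse it into per-node byte counts. parse_numa_cma is a hypothetical helper, not kernel code, and it accepts only the M and G suffixes shown in the syntax above (the kernel's own size parser may accept more).

```python
import re

# Multipliers for the size suffixes shown in the syntax: nn[MG].
_UNITS = {"M": 1 << 20, "G": 1 << 30}
# One "<node>:nn[MG]" entry, for example "1:512M".
_ENTRY = re.compile(r"(\d+):(\d+)([MG])")


def parse_numa_cma(value: str) -> dict[int, int]:
    """Parse "<node>:nn[MG][,<node>:nn[MG]]" into {node: size_in_bytes}.

    Hypothetical helper for illustration only; it is not kernel code.
    """
    areas: dict[int, int] = {}
    for entry in value.split(","):
        match = _ENTRY.fullmatch(entry.strip())
        if match is None:
            raise ValueError(f"malformed numa_cma entry: {entry!r}")
        node, size, unit = int(match[1]), int(match[2]), match[3]
        areas[node] = size * _UNITS[unit]
    return areas


if __name__ == "__main__":
    # 64 MiB of CMA on node 0 and 1 GiB on node 1.
    print(parse_numa_cma("0:64M,1:1G"))  # {0: 67108864, 1: 1073741824}
```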

reg_file_data_sampling=

[x86]

Controls mitigation for Register File Data Sampling (RFDS) vulnerability. RFDS is a CPU vulnerability which might allow userspace to infer kernel data values previously stored in floating point registers, vector registers, or integer registers. RFDS only affects Intel Atom processors.

Values:

  • on: Turns on the mitigation.
  • off: Turns off the mitigation.

This parameter overrides the compile-time default set by CONFIG_MITIGATION_RFDS. The mitigation cannot be disabled while other VERW-based mitigations (such as MDS) are enabled. To disable the RFDS mitigation, all VERW-based mitigations must be disabled.

For details see: Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst

locktorture.acq_writer_lim=

[KNL]

Set the time limit in jiffies for a lock acquisition. Acquisitions exceeding this limit will result in a splat once they do complete.

locktorture.bind_readers=

[KNL]

Specify the list of CPUs to which the readers are to be bound.

locktorture.bind_writers=

[KNL]

Specify the list of CPUs to which the writers are to be bound.

locktorture.call_rcu_chains=

[KNL]

Specify the number of self-propagating call_rcu() chains to set up. These are used to ensure that there is a high probability of an RCU grace period in progress at any given time. Defaults to 0, which disables these call_rcu() chains.

locktorture.long_hold=

[KNL]

Specify the duration in milliseconds for the occasional long-duration lock hold time. Defaults to 100 milliseconds. Select 0 to disable.

locktorture.nested_locks=

[KNL]

Specify the maximum lock nesting depth that locktorture is to exercise, up to a limit of 8 (MAX_NESTED_LOCKS). Specify zero to disable. Note that this parameter is ineffective on types of locks that do not support nested acquisition.

locktorture.rt_boost=

[KNL]

Do periodic testing of real-time lock priority boosting. Select 0 to disable, 1 to boost only rt_mutex, and 2 to boost unconditionally. Defaults to 2, which might seem to be an odd choice, but which should be harmless for non-real-time spinlocks, due to their disabling of preemption. Note that non-realtime mutexes disable boosting.

locktorture.writer_fifo=

[KNL]

Run the write-side locktorture kthreads at sched_set_fifo() real-time priority.

locktorture.rt_boost_factor=

[KNL]

Number that determines how often and for how long priority boosting is exercised. This is scaled down by the number of writers, so that the number of boosts per unit time remains roughly constant as the number of writers increases. However, the duration of each boost increases with the number of writers.

microcode.force_minrev=

[X86]

Format: <bool>

Enable or disable the microcode minimal revision enforcement for the runtime microcode loader.

module.async_probe=<bool>

[KNL]

When set to true, modules will use async probing by default. To enable or disable async probing for a specific module, use the module specific control that is documented under <module>.async_probe. When both module.async_probe and <module>.async_probe are specified, <module>.async_probe takes precedence for the specific module.

module.enable_dups_trace

[KNL]

When CONFIG_MODULE_DEBUG_AUTOLOAD_DUPS is set, this option makes duplicate request_module() calls trigger a WARN_ON() instead of a pr_warn(). If CONFIG_MODULE_DEBUG_AUTOLOAD_DUPS_TRACE is set, WARN_ON() is always issued and this option has no effect.

nfs.delay_retrans=

[NFS]

Specifies the number of times the NFSv4 client retries the request before returning an EAGAIN error, after a reply of NFS4ERR_DELAY from the server. Only applies if the softerr mount option is enabled, and the specified value is >= 0.

rcutree.do_rcu_barrier=

[KNL]

Request a call to rcu_barrier(). This is throttled so that userspace tests can safely hammer on the sysfs variable if they so choose. If triggered before the RCU grace-period machinery is fully active, this will error out with EAGAIN.

rcuscale.minruntime=

[KNL]

Set the minimum test run time in seconds. This does not affect the data-collection interval, but instead allows better measurement of things such as CPU consumption.

rcuscale.writer_holdoff_jiffies=

[KNL]

Additional write-side holdoff between grace periods, in jiffies. The default of zero means no holdoff.

rcupdate.rcu_cpu_stall_notifiers=

[KNL]

Provide RCU CPU stall notifiers, but see the warnings in the RCU_CPU_STALL_NOTIFIER Kconfig option’s help text. TL;DR: You almost certainly do not want rcupdate.rcu_cpu_stall_notifiers.

rcupdate.rcu_task_lazy_lim=

[KNL]

Number of callbacks on a given CPU that will cancel laziness on that CPU. Use -1 to disable cancellation of laziness, but be advised that doing so increases the danger of OOM due to callback flooding.

rcupdate.rcu_tasks_lazy_ms=

[KNL]

Sets the timeout in milliseconds for RCU Tasks asynchronous callback batching for call_rcu_tasks(). A negative value selects the default. A value of zero disables batching. Batching is always disabled for synchronize_rcu_tasks().

rcupdate.rcu_tasks_rude_lazy_ms=

[KNL]

Sets the timeout in milliseconds for RCU Tasks Rude asynchronous callback batching for call_rcu_tasks_rude(). A negative value selects the default. A value of zero disables batching. Batching is always disabled for synchronize_rcu_tasks_rude().

rcupdate.rcu_tasks_trace_lazy_ms=

[KNL]

Sets the timeout in milliseconds for RCU Tasks Trace asynchronous callback batching for call_rcu_tasks_trace(). A negative value selects the default. A value of zero disables batching. Batching is always disabled for synchronize_rcu_tasks_trace().
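
The three rcupdate.rcu_tasks*_lazy_ms parameters share the same value convention. A minimal Python sketch of that convention follows. The function name is hypothetical, and default_ms stands in for the kernel's built-in default, which this chapter does not state.

```python
from typing import Optional


def tasks_lazy_timeout_ms(param: int, default_ms: int) -> Optional[int]:
    """Interpret an rcupdate.rcu_tasks*_lazy_ms value.

    Returns the batching timeout in milliseconds, or None when batching is
    disabled. This applies to the call_rcu_tasks*() paths only; the
    synchronize_rcu_tasks*() paths never batch.
    """
    if param < 0:
        return default_ms  # negative: keep the built-in default
    if param == 0:
        return None        # zero: disable batching
    return param           # positive: use the given timeout
```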

spectre_bhi=

[X86]

Control mitigation of Branch History Injection (BHI) vulnerability. This setting affects the deployment of the HW BHI control and the SW BHB clearing sequence.

Values:

on
(default) Enable the HW or SW mitigation as needed.
off
Disable the mitigation.

unwind_debug

[X86-64]

Enable unwinder debug output. This can be useful for debugging certain unwinder error conditions, including corrupt stacks and bad/missing unwinder metadata.

workqueue.cpu_intensive_thresh_us=

Per-cpu work items which run for longer than this threshold are automatically considered CPU intensive and excluded from concurrency management to prevent them from noticeably delaying other per-cpu work items. Default is 10000 (10ms).

If CONFIG_WQ_CPU_INTENSIVE_REPORT is set, the kernel will report the work functions which violate this threshold repeatedly. They are likely good candidates for using WQ_UNBOUND work queues instead.

workqueue.cpu_intensive_warning_thresh=<uint>

If CONFIG_WQ_CPU_INTENSIVE_REPORT is set, the kernel reports the work functions that repeatedly exceed workqueue.cpu_intensive_thresh_us. To prevent spurious warnings, the kernel starts printing only after a work function has exceeded that threshold this many times.

The default is 4 times. 0 disables the warning.

workqueue.default_affinity_scope=

Select the default affinity scope to use for unbound work queues. Can be one of "cpu", "smt", "cache", "numa" and "system". Default is "cache". For more information, see the Affinity Scopes section in Documentation/core-api/workqueue.rst.

This can be changed after boot by writing to the matching /sys/module/workqueue/parameters file. All work queues with the "default" affinity scope will be updated accordingly.
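
Because the default scope can be changed after boot, a small Python sketch of doing so follows. It assumes the matching parameter file is /sys/module/workqueue/parameters/default_affinity_scope, following the usual module-parameter naming (check that it exists on your system). Writing to it requires root, and the function name is hypothetical.

```python
from pathlib import Path

# Scopes accepted by workqueue.default_affinity_scope, per the text above.
VALID_SCOPES = ("cpu", "smt", "cache", "numa", "system")

# Assumed path of the matching module parameter file; verify it on your system.
PARAM_FILE = Path("/sys/module/workqueue/parameters/default_affinity_scope")


def set_default_affinity_scope(scope: str, param_file: Path = PARAM_FILE) -> None:
    """Validate scope, then write it to the workqueue parameter file (needs root)."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"unknown affinity scope {scope!r}; expected one of {VALID_SCOPES}")
    # Workqueues that use the "default" affinity scope pick up the new value.
    param_file.write_text(scope)
```

For example, calling set_default_affinity_scope("numa") as root would switch the default scope for unbound workqueues to per-NUMA-node affinity. Validating the name first avoids a failed write to sysfs.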

xen_msr_safe=

[X86,XEN]

Format: <bool>

Select whether to always use non-faulting (safe) MSR access functions when running as a Xen PV guest. The default value is controlled by CONFIG_XEN_PV_MSR_SAFE.

Updated kernel parameters

clearcpuid=

X[,X...] [X86]

Disable CPUID feature X for the kernel. See arch/x86/include/asm/cpufeatures.h for the valid bit numbers X.

Note that the Linux-specific bits are not necessarily stable over kernel options, but the vendor-specific ones should be. X can also be a string as it appears in the flags: line of /proc/cpuinfo; such names do not have the instability issue described above. However, not all features have names in /proc/cpuinfo. Note that using this option taints the kernel.

Also note that user programs calling CPUID directly or using the feature without checking anything will still see it. This just prevents it from being used by the kernel or shown in /proc/cpuinfo. Also note the kernel might malfunction if you disable some critical bits.
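
Because only some features have names in /proc/cpuinfo, it can help to check a name against the flags: line before passing it to clearcpuid=. The following Python sketch does this. Both helpers are hypothetical, features without a name still have to be given by bit number, the check also rejects names of features the CPU does not currently report, and the sample cpuinfo text is made up for illustration.

```python
def cpuinfo_flags(cpuinfo_text: str) -> set[str]:
    """Collect feature names from the "flags" lines of /proc/cpuinfo text."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def clearcpuid_arg(names: list[str], cpuinfo_text: str) -> str:
    """Build a clearcpuid= argument, rejecting names absent from the flags line.

    Features that have no name in /proc/cpuinfo must be given by bit number
    instead; this sketch does not handle that case.
    """
    known = cpuinfo_flags(cpuinfo_text)
    missing = [name for name in names if name not in known]
    if missing:
        raise ValueError(f"not in /proc/cpuinfo flags: {missing}")
    return "clearcpuid=" + ",".join(names)


if __name__ == "__main__":
    # Made-up excerpt; on a real system, read the text from /proc/cpuinfo.
    sample = "processor\t: 0\nflags\t\t: fpu vme sse sse2 avx2\n"
    print(clearcpuid_arg(["avx2"], sample))  # clearcpuid=avx2
```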

cma_pernuma=nn[MG]

[KNL,CMA]

Sets the size of the kernel per-NUMA memory area for contiguous memory allocations. A value of 0 disables per-NUMA CMA altogether. If this option is not specified, the default value is 0. With per-NUMA CMA enabled, DMA users on node nid first try to allocate a buffer from the per-NUMA area located on node nid. If the allocation fails, they fall back to the global default memory area.

csdlock_debug=

[KNL]

Enable debug add-ons of cross-CPU function call handling. When switched on, additional debug data is printed to the console in case a hanging CPU is detected, and that CPU is pinged again to try to resolve the hang situation. The default value of this option depends on the CSD_LOCK_WAIT_DEBUG_DEFAULT Kconfig option.

<module>.async_probe[=<bool>]

[KNL]

If no <bool> value is specified or if the value specified is not a valid <bool>, enable asynchronous probe on this module. Otherwise, enable or disable asynchronous probe on this module as indicated by the <bool> value. See also: module.async_probe
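
The precedence rules for module.async_probe and <module>.async_probe can be summarized in code. The following Python sketch is a model, not kernel code: parse_bool approximates the kernel's kstrtobool(), which, as far as we know, looks only at the first one or two characters, and both function names are hypothetical.

```python
from typing import Optional


def parse_bool(value: str) -> Optional[bool]:
    """Approximation of the kernel's kstrtobool(): None means "not a valid bool"."""
    if not value:
        return None
    first = value[0].lower()
    if first in ("y", "t", "1"):
        return True
    if first in ("n", "f", "0"):
        return False
    if first == "o" and len(value) > 1:  # "on" / "off"
        second = value[1].lower()
        if second == "n":
            return True
        if second == "f":
            return False
    return None


def effective_async_probe(global_default: bool, module_arg: Optional[str]) -> bool:
    """Decide whether one module probes asynchronously.

    global_default: value of module.async_probe (False when not given).
    module_arg: None if <module>.async_probe is absent, "" if it is given
    without a value, otherwise the text after "=".
    """
    if module_arg is None:
        return global_default  # no per-module setting: use module.async_probe
    parsed = parse_bool(module_arg)
    # A missing or invalid value enables async probing for this module.
    return True if parsed is None else parsed
```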

earlycon=

[KNL]

Output early console device and options.

When used with no options, the early console is determined by the stdout-path property in the device tree’s chosen node, or by the ACPI SPCR table if the platform supports it.

cdns,<addr>[,options]

Start an early, polled-mode console on a Cadence (xuartps) serial port at the specified address. The only supported option is the baud rate. If the baud rate is not specified, the serial port must already be set up and configured.

uart[8250],io,<addr>[,options[,uartclk]]
uart[8250],mmio,<addr>[,options[,uartclk]]
uart[8250],mmio32,<addr>[,options[,uartclk]]
uart[8250],mmio32be,<addr>[,options[,uartclk]]
uart[8250],0x<addr>[,options]

Start an early, polled-mode console on the 8250/16550 UART at the specified I/O port or MMIO address. The MMIO inter-register address stride is either 8-bit (mmio) or 32-bit (mmio32 or mmio32be). If none of [io|mmio|mmio32|mmio32be] is given, <addr> is assumed to be equivalent to 'mmio'. 'options' are specified in the same format described for "console=ttyS<n>"; if unspecified, the hardware is not initialized. 'uartclk' is the UART clock frequency; if unspecified, it is set to 'BASE_BAUD' * 16.
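
The uart[8250] forms share a common field layout. The following Python sketch splits such a value into its fields and applies the documented defaults (mmio when no access type is given, and BASE_BAUD * 16 when uartclk is omitted). The helper and dataclass names are hypothetical, and the caller supplies BASE_BAUD because it is architecture-specific; on x86 it is defined as 1843200 / 16, that is 115200.

```python
from dataclasses import dataclass
from typing import Optional

# I/O access types accepted by the uart[8250] forms above.
IOTYPES = ("io", "mmio", "mmio32", "mmio32be")


@dataclass
class EarlyconUart:
    iotype: str
    addr: int
    options: Optional[str]  # e.g. "115200n8"; None leaves the hardware uninitialized
    uartclk: int            # UART clock frequency in Hz


def parse_earlycon_uart(value: str, base_baud: int) -> EarlyconUart:
    """Split a uart[8250] earlycon= value into its fields.

    Hypothetical helper. base_baud stands in for the architecture's BASE_BAUD
    and must be supplied by the caller.
    """
    fields = value.split(",")
    if fields[0] not in ("uart", "uart8250"):
        raise ValueError("expected a uart or uart8250 earlycon value")
    rest = fields[1:]
    if rest and rest[0] in IOTYPES:
        iotype, rest = rest[0], rest[1:]
    else:
        iotype = "mmio"  # bare 0x<addr> form is treated like mmio
    if not rest:
        raise ValueError("missing address")
    addr = int(rest[0], 0)  # accepts 0x-prefixed hex
    options = rest[1] if len(rest) > 1 and rest[1] else None
    # Default clock is BASE_BAUD * 16 when uartclk is omitted.
    uartclk = int(rest[2]) if len(rest) > 2 else base_baud * 16
    return EarlyconUart(iotype, addr, options, uartclk)


if __name__ == "__main__":
    # x86 defines BASE_BAUD as 1843200 / 16 = 115200.
    print(parse_earlycon_uart("uart8250,io,0x3f8,115200n8", 115200))
```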

earlyprintk=

[X86,SH,ARM,M68k,S390]

earlyprintk=vga
earlyprintk=sclp
earlyprintk=xen
earlyprintk=serial[,ttySn[,baudrate]]
earlyprintk=serial[,0x…[,baudrate]]
earlyprintk=ttySn[,baudrate]
earlyprintk=dbgp[debugController#]
earlyprintk=pciserial[,force],bus:device.function[,baudrate]
earlyprintk=xdbc[xhciController#]

earlyprintk is useful when the kernel crashes before the normal console is initialized. It is not enabled by default because it has some cosmetic problems.

Append ",keep" to not disable it when the real console takes over.

Only one of vga, efi, serial, or USB debug port can be used at a time.

Currently, only ttyS0 and ttyS1 can be specified by name. Other I/O ports can be explicitly specified on some architectures (x86 and arm at least) by replacing ttySn with an I/O port address, such as earlyprintk=serial,0x1008,115200. You can find the port for a given device in /proc/tty/driver/serial, for example: 2: uart:ST16650V2 port:00001008 irq:18.

Interaction with the standard serial driver is not very good.

The VGA and EFI output is eventually overwritten by the real console.

The xen option can only be used in Xen domains.

The sclp output can only be used on s390.

The optional "force" to "pciserial" enables use of a PCI device even when its classcode is not of the UART class.

iommu.strict=

[ARM64, X86, S390]

Configure TLB invalidation behaviour.

Format: { "0" | "1" }

0 - Lazy mode
Request that DMA unmap operations use deferred invalidation of hardware TLBs, for increased throughput at the cost of reduced device isolation. Will fall back to strict mode if not supported by the relevant IOMMU driver.
1 - Strict mode
DMA unmap operations invalidate IOMMU hardware TLBs synchronously.
unset
Use value of CONFIG_IOMMU_DEFAULT_DMA_{LAZY,STRICT}.
Note

On x86, strict mode specified via one of the legacy driver-specific options takes precedence.

mem_encrypt=

[x86_64]

AMD Secure Memory Encryption (SME) control.

Valid arguments: on, off

Default: off

mem_encrypt=on: Activate SME
mem_encrypt=off: Do not activate SME

mitigations=

[X86,PPC,S390,ARM64]

Control optional mitigations for CPU vulnerabilities. This is a set of curated, arch-independent options, each of which is an aggregation of existing arch-specific options.

Value: off

Disable all optional CPU mitigations. This improves system performance, but it might also expose users to several CPU vulnerabilities. Equivalent to:

  • if nokaslr then kpti=0 [ARM64]
  • gather_data_sampling=off [X86]
  • kvm.nx_huge_pages=off [X86]
  • l1tf=off [X86]
  • mds=off [X86]
  • mmio_stale_data=off [X86]
  • no_entry_flush [PPC]
  • no_uaccess_flush [PPC]
  • nobp=0 [S390]
  • nopti [X86,PPC]
  • nospectre_bhb [ARM64]
  • nospectre_v1 [X86,PPC]
  • nospectre_v2 [X86,PPC,S390,ARM64]
  • reg_file_data_sampling=off [X86]
  • retbleed=off [X86]
  • spec_rstack_overflow=off [X86]
  • spec_store_bypass_disable=off [X86,PPC]
  • spectre_bhi=off [X86]
  • spectre_v2_user=off [X86]
  • srbds=off [X86,INTEL]
  • ssbd=force-off [ARM64]
  • tsx_async_abort=off [X86]

nosmap

[PPC]

Disable SMAP (Supervisor Mode Access Prevention) even if it is supported by the processor.

nosmep

[PPC64s]

Disable SMEP (Supervisor Mode Execution Prevention) even if it is supported by the processor.

nox2apic

[x86_64,APIC]

Do not enable x2APIC mode.

Note

This parameter will be ignored on systems with the LEGACY_XAPIC_DISABLED bit set in the IA32_XAPIC_DISABLE_STATUS MSR.

panic_print=

Bitmask for printing system information when a panic occurs. You can choose a combination of the following bits:

  • bit 0: print all tasks info
  • bit 1: print system memory info
  • bit 2: print timer info
  • bit 3: print locks info if CONFIG_LOCKDEP is on
  • bit 4: print ftrace buffer
  • bit 5: print all printk messages in buffer
  • bit 6: print all CPUs backtrace (if available in the arch)
Important

This option might print a lot of lines, so older messages in the log might be lost. Use this option carefully, and consider setting up a larger log buffer with "log_buf_len" along with it.
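
Because panic_print is a bitmask, the value to pass is the sum of the chosen bits. The following Python sketch builds it from named flags; the flag and function names are hypothetical labels for the bits listed above. The same number applies to the panic_print sysctl described later in this chapter.

```python
from enum import IntFlag


class PanicPrint(IntFlag):
    """Bits of panic_print as listed above (names are illustrative)."""
    TASKS = 1 << 0       # bit 0: all tasks info
    MEMORY = 1 << 1      # bit 1: system memory info
    TIMERS = 1 << 2      # bit 2: timer info
    LOCKS = 1 << 3       # bit 3: locks info (only with CONFIG_LOCKDEP)
    FTRACE = 1 << 4      # bit 4: ftrace buffer
    PRINTK = 1 << 5      # bit 5: all printk messages in buffer
    ALL_CPU_BT = 1 << 6  # bit 6: all CPUs backtrace, if the arch supports it


def panic_print_arg(flags: PanicPrint) -> str:
    """Format the combined bits as a kernel command-line argument."""
    return f"panic_print={int(flags)}"


if __name__ == "__main__":
    # Memory info plus all-CPU backtraces: 2 + 64 = 66.
    print(panic_print_arg(PanicPrint.MEMORY | PanicPrint.ALL_CPU_BT))  # panic_print=66
```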

pcie_aspm=

[PCIE]

Forcibly enable or ignore PCIe Active State Power Management.

Value:

off
Do not touch ASPM configuration at all. Leave any configuration done by firmware unchanged.
force
Enable ASPM even on devices that claim not to support it.
Warning

Forcing ASPM on might cause system lockups.

s390_iommu=

[HW,S390]

Set s390 IOTLB flushing mode.

Value:

strict
With strict flushing, every unmap operation results in an IOTLB flush. The default is lazy flushing before reuse, which is faster. Deprecated; equivalent to iommu.strict=1.

spectre_v2=

[x86]

Control mitigation of Spectre variant 2 (indirect branch speculation) vulnerability. The default operation protects the kernel from user space attacks.

Value:

on
unconditionally enable, implies spectre_v2_user=on.
off
unconditionally disable, implies spectre_v2_user=off.
auto
kernel detects whether your CPU model is vulnerable.

Selecting 'on' will, and 'auto' might, choose a mitigation method at run time according to the CPU, the available microcode, the setting of the CONFIG_MITIGATION_RETPOLINE configuration option, and the compiler with which the kernel was built.

usbcore.quirks=

[USB]

A list of quirk entries to augment the built-in USB core quirk list. List entries are separated by commas. Each entry has the form VendorID:ProductID:Flags. The IDs are 4-digit hex numbers and Flags is a set of letters. Each letter will change the built-in quirk; setting it if it is clear and clearing it if it is set. The letters have the following meanings:

  • a = USB_QUIRK_STRING_FETCH_255 (string descriptors must not be fetched by using a 255-byte read);
  • b = USB_QUIRK_RESET_RESUME (device cannot resume correctly so reset it instead);
  • c = USB_QUIRK_NO_SET_INTF (device cannot handle Set-Interface requests);
  • d = USB_QUIRK_CONFIG_INTF_STRINGS (device cannot handle its Configuration or Interface strings);
  • e = USB_QUIRK_RESET (device cannot be reset (e.g. morph devices), do not use reset);
  • f = USB_QUIRK_HONOR_BNUMINTERFACES (device has more interface descriptions than the bNumInterfaces count, and cannot handle talking to these interfaces);
  • g = USB_QUIRK_DELAY_INIT (device needs a pause during initialization, after we read the device descriptor);
  • h = USB_QUIRK_LINEAR_UFRAME_INTR_BINTERVAL (For high speed and super speed interrupt endpoints, the USB 2.0 and USB 3.0 spec require the interval in microframes (1 microframe = 125 microseconds) to be calculated as interval = 2 ^ (bInterval-1). Devices with this quirk report their bInterval as the result of this calculation instead of the exponent variable used in the calculation);
  • i = USB_QUIRK_DEVICE_QUALIFIER (device cannot handle device_qualifier descriptor requests);
  • j = USB_QUIRK_IGNORE_REMOTE_WAKEUP (device generates spurious wakeup, ignore remote wakeup capability);
  • k = USB_QUIRK_NO_LPM (device cannot handle Link Power Management);
  • l = USB_QUIRK_LINEAR_FRAME_INTR_BINTERVAL (Device reports its bInterval as linear frames instead of the USB 2.0 calculation);
  • m = USB_QUIRK_DISCONNECT_SUSPEND (Device needs to be disconnected before suspend to prevent spurious wakeup);
  • n = USB_QUIRK_DELAY_CTRL_MSG (Device needs a pause after every control message);
  • o = USB_QUIRK_HUB_SLOW_RESET (Hub needs extra delay after resetting its port);
  • p = USB_QUIRK_SHORT_SET_ADDRESS_REQ_TIMEOUT (Reduce timeout of the SET_ADDRESS request from 5000 ms to 500 ms);

Example: quirks=0781:5580:bk,0a5c:5834:gij
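
The entry format is strict enough that it is worth validating before rebooting. The following Python sketch checks each VendorID:ProductID:Flags entry against the rules above and joins the entries into one argument. The function names are hypothetical. Remember that each letter toggles the built-in quirk rather than simply setting it, so validation cannot tell you whether an entry has the intended effect.

```python
import re

# Quirk letters a..p as listed above.
VALID_FLAGS = set("abcdefghijklmnop")
_HEX4 = re.compile(r"[0-9a-fA-F]{4}")


def quirk_entry(vendor: str, product: str, flags: str) -> str:
    """Validate and format one VendorID:ProductID:Flags entry."""
    for label, value in (("vendor", vendor), ("product", product)):
        if _HEX4.fullmatch(value) is None:
            raise ValueError(f"{label} ID must be 4 hex digits: {value!r}")
    unknown = set(flags) - VALID_FLAGS
    if unknown:
        raise ValueError(f"unknown quirk letters: {sorted(unknown)}")
    return f"{vendor.lower()}:{product.lower()}:{flags}"


def usbcore_quirks(entries: list[tuple[str, str, str]]) -> str:
    """Join validated entries into a usbcore.quirks= kernel argument."""
    return "usbcore.quirks=" + ",".join(quirk_entry(*e) for e in entries)


if __name__ == "__main__":
    # Same entries as the example above.
    print(usbcore_quirks([("0781", "5580", "bk"), ("0a5c", "5834", "gij")]))
    # usbcore.quirks=0781:5580:bk,0a5c:5834:gij
```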

Removed kernel parameters

  • [BUGS=X86] noclflush: Do not use the CLFLUSH instruction.
  • workqueue.disable_numa
  • [X86] noexec
  • [BUGS=X86-32] nosep: Disables x86 SYSENTER/SYSEXIT support.
  • [X86] nordrand: Disable kernel use of the RDRAND instruction.
  • thermal.nocrt

New sysctl parameters

oops_limit

Number of kernel oopses after which the kernel should panic when panic_on_oops is not set. Setting this to 0 disables checking the count. Setting this to 1 has the same effect as setting panic_on_oops=1. The default value is 10000.

warn_limit

Number of kernel warnings after which the kernel should panic when panic_on_warn is not set. Setting this to 0 disables checking the warning count. Setting this to 1 has the same effect as setting panic_on_warn=1. The default value is 0.
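
The counting rule shared by oops_limit and warn_limit can be modeled as follows. This Python sketch illustrates the documented behavior rather than reproducing kernel code, and LimitCounter is a hypothetical name.

```python
class LimitCounter:
    """Model of the oops_limit / warn_limit rule when panic_on_oops or
    panic_on_warn is not set."""

    def __init__(self, limit: int) -> None:
        self.limit = limit  # sysctl value; 0 disables the check
        self.count = 0      # oopses (or warnings) seen so far

    def record(self) -> bool:
        """Count one more event and return True if the kernel would panic."""
        self.count += 1
        return self.limit != 0 and self.count >= self.limit


if __name__ == "__main__":
    # limit=1 panics on the very first oops, like panic_on_oops=1.
    print(LimitCounter(1).record())  # True
    # limit=0 never panics, however many oopses occur.
    off = LimitCounter(0)
    print(any(off.record() for _ in range(100)))  # False
```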

kexec_load_limit_panic

This parameter specifies a limit to the number of times the syscalls kexec_load and kexec_file_load can be called with a crash image. It can only be set with a more restrictive value than the current one.

Value:

-1
Unlimited calls to kexec. This is the default setting.
N
Number of calls left.

kexec_load_limit_reboot

Similar functionality as kexec_load_limit_panic, but for a normal image.
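
The following Python sketch models the documented behavior of these two limits: -1 means unlimited, N is the number of calls left, and a new value is accepted only if it is more restrictive. It reflects our reading of the description (including rejecting negative writes and writes equal to the current limit), not kernel source, and the class name is hypothetical.

```python
class KexecLoadLimit:
    """Model of kexec_load_limit_panic / kexec_load_limit_reboot.

    -1 means unlimited (the default); N >= 0 is the number of calls left.
    """

    def __init__(self) -> None:
        self.limit = -1

    def set(self, value: int) -> None:
        """Accept only a value more restrictive than the current one."""
        if value < 0 or (self.limit != -1 and value >= self.limit):
            raise ValueError(f"{value} is not more restrictive than {self.limit}")
        self.limit = value

    def try_load(self) -> bool:
        """Return True if one more kexec_load/kexec_file_load call is allowed."""
        if self.limit == 0:
            return False
        if self.limit > 0:
            self.limit -= 1  # consume one of the remaining calls
        return True


if __name__ == "__main__":
    panic_limit = KexecLoadLimit()
    panic_limit.set(1)
    print(panic_limit.try_load(), panic_limit.try_load())  # True False
```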

numa_balancing_promote_rate_limit_MBps

Excessively high promotion or demotion throughput between different memory types might hurt application latency. You can use this parameter to rate-limit the promotion throughput: the per-node maximum promotion throughput, in MB/s, is limited to no more than the set value.

Set this parameter to less than 1/10 of the PMEM node write bandwidth.
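
The rule of thumb above is simple arithmetic. This Python sketch computes the largest whole MB/s value strictly below one tenth of a measured PMEM write bandwidth. The function name is hypothetical, and the bandwidth figure in the example is illustrative, not a measurement; the result would then be written to the numa_balancing_promote_rate_limit_MBps sysctl.

```python
import math


def promote_rate_limit_mbps(pmem_write_bw_mbps: float) -> int:
    """Largest whole MB/s value strictly below 1/10 of the PMEM write bandwidth."""
    if pmem_write_bw_mbps <= 0:
        raise ValueError("PMEM write bandwidth must be positive")
    return math.ceil(pmem_write_bw_mbps / 10) - 1


if __name__ == "__main__":
    # Illustrative bandwidth of 20000 MB/s: one tenth is 2000, so use 1999.
    print(promote_rate_limit_mbps(20000))  # 1999
```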

Updated sysctl parameters

kexec_load_disabled

A toggle indicating if the syscalls kexec_load and kexec_file_load have been disabled. This value defaults to 0 (false: kexec_*load enabled), but can be set to 1 (true: kexec_*load disabled).

Once true, kexec can no longer be used, and the toggle cannot be set back to false. This allows a kexec image to be loaded before the syscall is disabled, so that a system can set up (and later use) an image without it being altered. Generally used together with the modules_disabled sysctl.

panic_print

Bitmask for printing system information when a panic occurs. You can choose a combination of the following bits:

  • bit 0: print all tasks info
  • bit 1: print system memory info
  • bit 2: print timer info
  • bit 3: print locks info if CONFIG_LOCKDEP is on
  • bit 4: print ftrace buffer
  • bit 5: print all printk messages in buffer
  • bit 6: print all CPUs backtrace (if available in the arch)

sched_energy_aware

Enables or disables Energy Aware Scheduling (EAS). EAS starts automatically on platforms where it can run (that is, platforms with asymmetric CPU topologies and an Energy Model available). If your platform meets the requirements for EAS but you do not want to use it, change this value to 0. On non-EAS platforms, write operations fail and reads do not return anything.

© 2024 Red Hat, Inc.