Chapter 4. Using the Node Tuning Operator
Learn about the Node Tuning Operator and how you can use it to manage node-level tuning by orchestrating the tuned daemon.
4.1. About the Node Tuning Operator
The Node Tuning Operator helps you manage node-level tuning by orchestrating the Tuned daemon. The majority of high-performance applications require some level of kernel tuning. The Node Tuning Operator provides a unified management interface to users of node-level sysctls and more flexibility to add custom tuning specified by user needs. The Operator manages the containerized Tuned daemon for OpenShift Container Platform as a Kubernetes DaemonSet. It ensures the custom tuning specification is passed to all containerized Tuned daemons running in the cluster in the format that the daemons understand. The daemons run on all nodes in the cluster, one per node.
Node-level settings applied by the containerized Tuned daemon are rolled back on an event that triggers a profile change or when the containerized Tuned daemon is terminated gracefully by receiving and handling a termination signal.
The Node Tuning Operator is part of a standard OpenShift Container Platform installation in version 4.1 and later.
4.2. Accessing an example Node Tuning Operator specification
Use this process to access an example Node Tuning Operator specification.
Procedure
Run:
$ oc get Tuned/default -o yaml -n openshift-cluster-node-tuning-operator
The default CR is meant for delivering standard node-level tuning for the OpenShift Container Platform platform and it can only be modified to set the Operator Management state. Any other custom changes to the default CR will be overwritten by the Operator. For custom tuning, create your own Tuned CRs. Newly created CRs will be combined with the default CR and custom tuning applied to OpenShift Container Platform nodes based on node or Pod labels and profile priorities.
4.3. Default profiles set on a cluster
The following are the default profiles set on a cluster.
apiVersion: tuned.openshift.io/v1alpha1 kind: Tuned metadata: name: default namespace: openshift-cluster-node-tuning-operator spec: profile: - name: "openshift" data: | [main] summary=Optimize systems running OpenShift (parent profile) include=${f:virt_check:virtual-guest:throughput-performance} [selinux] avc_cache_threshold=8192 [net] nf_conntrack_hashsize=131072 [sysctl] net.ipv4.ip_forward=1 kernel.pid_max=>131072 net.netfilter.nf_conntrack_max=1048576 net.ipv4.neigh.default.gc_thresh1=8192 net.ipv4.neigh.default.gc_thresh2=32768 net.ipv4.neigh.default.gc_thresh3=65536 net.ipv6.neigh.default.gc_thresh1=8192 net.ipv6.neigh.default.gc_thresh2=32768 net.ipv6.neigh.default.gc_thresh3=65536 [sysfs] /sys/module/nvme_core/parameters/io_timeout=4294967295 /sys/module/nvme_core/parameters/max_retries=10 - name: "openshift-control-plane" data: | [main] summary=Optimize systems running OpenShift control plane include=openshift [sysctl] # ktune sysctl settings, maximizing i/o throughput # # Minimal preemption granularity for CPU-bound tasks: # (default: 1 msec# (1 + ilog(ncpus)), units: nanoseconds) kernel.sched_min_granularity_ns=10000000 # The total time the scheduler will consider a migrated process # "cache hot" and thus less likely to be re-migrated # (system default is 500000, i.e. 0.5 ms) kernel.sched_migration_cost_ns=5000000 # SCHED_OTHER wake-up granularity. # # Preemption granularity when tasks wake up. Lower the value to # improve wake-up latency and throughput for latency critical tasks. kernel.sched_wakeup_granularity_ns=4000000 - name: "openshift-node" data: | [main] summary=Optimize systems running OpenShift nodes include=openshift [sysctl] net.ipv4.tcp_fastopen=3 fs.inotify.max_user_watches=65536 - name: "openshift-control-plane-es" data: | [main] summary=Optimize systems running ES on OpenShift control-plane include=openshift-control-plane [sysctl] vm.max_map_count=262144 - name: "openshift-node-es" data: | [main] summary=Optimize systems running ES on OpenShift nodes include=openshift-node [sysctl] vm.max_map_count=262144 recommend: - profile: "openshift-control-plane-es" priority: 10 match: - label: "tuned.openshift.io/elasticsearch" type: "pod" match: - label: "node-role.kubernetes.io/master" - label: "node-role.kubernetes.io/infra" - profile: "openshift-node-es" priority: 20 match: - label: "tuned.openshift.io/elasticsearch" type: "pod" - profile: "openshift-control-plane" priority: 30 match: - label: "node-role.kubernetes.io/master" - label: "node-role.kubernetes.io/infra" - profile: "openshift-node" priority: 40
4.4. Verifying that the Tuned profiles are applied
Use this procedure to check which Tuned profiles are applied on every node.
Procedure
Check which Tuned Pods are running on each node:
$ oc get pods -n openshift-cluster-node-tuning-operator -o wide
Example output
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES cluster-node-tuning-operator-599489d4f7-k4hw4 1/1 Running 0 6d2h 10.129.0.76 ip-10-0-145-113.eu-west-3.compute.internal <none> <none> tuned-2jkzp 1/1 Running 1 6d3h 10.0.145.113 ip-10-0-145-113.eu-west-3.compute.internal <none> <none> tuned-g9mkx 1/1 Running 1 6d3h 10.0.147.108 ip-10-0-147-108.eu-west-3.compute.internal <none> <none> tuned-kbxsh 1/1 Running 1 6d3h 10.0.132.143 ip-10-0-132-143.eu-west-3.compute.internal <none> <none> tuned-kn9x6 1/1 Running 1 6d3h 10.0.163.177 ip-10-0-163-177.eu-west-3.compute.internal <none> <none> tuned-vvxwx 1/1 Running 1 6d3h 10.0.131.87 ip-10-0-131-87.eu-west-3.compute.internal <none> <none> tuned-zqrwq 1/1 Running 1 6d3h 10.0.161.51 ip-10-0-161-51.eu-west-3.compute.internal <none> <none>
Extract the profile applied from each Pod and match them against the previous list:
$ for p in `oc get pods -n openshift-cluster-node-tuning-operator -l openshift-app=tuned -o=jsonpath='{range .items[*]}{.metadata.name} {end}'`; do printf "\n*** $p ***\n" ; oc logs pod/$p -n openshift-cluster-node-tuning-operator | grep applied; done
Example output
*** tuned-2jkzp *** 2020-07-10 13:53:35,368 INFO tuned.daemon.daemon: static tuning from profile 'openshift-control-plane' applied *** tuned-g9mkx *** 2020-07-10 14:07:17,089 INFO tuned.daemon.daemon: static tuning from profile 'openshift-node' applied 2020-07-10 15:56:29,005 INFO tuned.daemon.daemon: static tuning from profile 'openshift-node-es' applied 2020-07-10 16:00:19,006 INFO tuned.daemon.daemon: static tuning from profile 'openshift-node' applied 2020-07-10 16:00:48,989 INFO tuned.daemon.daemon: static tuning from profile 'openshift-node-es' applied *** tuned-kbxsh *** 2020-07-10 13:53:30,565 INFO tuned.daemon.daemon: static tuning from profile 'openshift-node' applied 2020-07-10 15:56:30,199 INFO tuned.daemon.daemon: static tuning from profile 'openshift-node-es' applied *** tuned-kn9x6 *** 2020-07-10 14:10:57,123 INFO tuned.daemon.daemon: static tuning from profile 'openshift-node' applied 2020-07-10 15:56:28,757 INFO tuned.daemon.daemon: static tuning from profile 'openshift-node-es' applied *** tuned-vvxwx *** 2020-07-10 14:11:44,932 INFO tuned.daemon.daemon: static tuning from profile 'openshift-control-plane' applied *** tuned-zqrwq *** 2020-07-10 14:07:40,246 INFO tuned.daemon.daemon: static tuning from profile 'openshift-control-plane' applied
Custom profiles for custom tuning specification is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see https://access.redhat.com/support/offerings/techpreview/.
4.5. Custom tuning specification
The custom resource (CR) for the operator has two major sections. The first section, profile:
, is a list of Tuned profiles and their names. The second, recommend:
, defines the profile selection logic.
Multiple custom tuning specifications can co-exist as multiple CRs in the operator’s namespace. The existence of new CRs or the deletion of old CRs is detected by the Operator. All existing custom tuning specifications are merged and appropriate objects for the containerized Tuned daemons are updated.
Profile data
The profile:
section lists Tuned profiles and their names.
profile: - name: tuned_profile_1 data: | # Tuned profile specification [main] summary=Description of tuned_profile_1 profile [sysctl] net.ipv4.ip_forward=1 # ... other sysctl's or other Tuned daemon plug-ins supported by the containerized Tuned # ... - name: tuned_profile_n data: | # Tuned profile specification [main] summary=Description of tuned_profile_n profile # tuned_profile_n profile settings
Recommended profiles
The profile:
selection logic is defined by the recommend:
section of the CR:
recommend: - match: # optional; if omitted, profile match is assumed unless a profile with a higher matches first <match> # an optional array priority: <priority> # profile ordering priority, lower numbers mean higher priority (0 is the highest priority) profile: <tuned_profile_name> # e.g. tuned_profile_1 # ... - match: <match> priority: <priority> profile: <tuned_profile_name> # e.g. tuned_profile_n
If <match>
is omitted, a profile match (for example, true
) is assumed.
<match>
is an optional array recursively defined as follows:
- label: <label_name> # node or Pod label name value: <label_value> # optional node or Pod label value; if omitted, the presence of <label_name> is enough to match type: <label_type> # optional node or Pod type (`node` or `pod`); if omitted, `node` is assumed <match> # an optional <match> array
If <match>
is not omitted, all nested <match>
sections must also evaluate to true
. Otherwise, false
is assumed and the profile with the respective <match>
section will not be applied or recommended. Therefore, the nesting (child <match>
sections) works as logical AND operator. Conversely, if any item of the <match>
array matches, the entire <match>
array evaluates to true
. Therefore, the array acts as logical OR operator.
Example
- match: - label: tuned.openshift.io/elasticsearch match: - label: node-role.kubernetes.io/master - label: node-role.kubernetes.io/infra type: pod priority: 10 profile: openshift-control-plane-es - match: - label: node-role.kubernetes.io/master - label: node-role.kubernetes.io/infra priority: 20 profile: openshift-control-plane - priority: 30 profile: openshift-node
The CR above is translated for the containerized Tuned daemon into its recommend.conf
file based on the profile priorities. The profile with the highest priority (10
) is openshift-control-plane-es
and, therefore, it is considered first. The containerized Tuned daemon running on a given node looks to see if there is a Pod running on the same node with the tuned.openshift.io/elasticsearch
label set. If not, the entire <match>
section evaluates as false
. If there is such a Pod with the label, in order for the <match>
section to evaluate to true
, the node label also needs to be node-role.kubernetes.io/master
or node-role.kubernetes.io/infra
.
If the labels for the profile with priority 10
matched, openshift-control-plane-es
profile is applied and no other profile is considered. If the node/Pod label combination did not match, the second highest priority profile (openshift-control-plane
) is considered. This profile is applied if the containerized Tuned Pod runs on a node with labels node-role.kubernetes.io/master
or node-role.kubernetes.io/infra
.
Finally, the profile openshift-node
has the lowest priority of 30
. It lacks the <match>
section and, therefore, will always match. It acts as a profile catch-all to set openshift-node
profile, if no other profile with higher priority matches on a given node.
4.6. Custom tuning example
The following CR applies custom node-level tuning for OpenShift Container Platform nodes that run an ingress Pod with label tuned.openshift.io/ingress-pod-label=ingress-pod-label-value
. As an administrator, use the following command to create a custom Tuned CR.
Example
oc create -f- <<_EOF_ apiVersion: tuned.openshift.io/v1 kind: Tuned metadata: name: ingress namespace: openshift-cluster-node-tuning-operator spec: profile: - data: | [main] summary=A custom OpenShift ingress profile include=openshift-control-plane [sysctl] net.ipv4.ip_local_port_range="1024 65535" net.ipv4.tcp_tw_reuse=1 name: openshift-ingress recommend: - match: - label: tuned.openshift.io/ingress-pod-label value: "ingress-pod-label-value" type: pod priority: 10 profile: openshift-ingress _EOF_
4.7. Supported Tuned daemon plug-ins
Excluding the [main]
section, the following Tuned plug-ins are supported when using custom profiles defined in the profile:
section of the Tuned CR:
- audio
- cpu
- disk
- eeepc_she
- modules
- mounts
- net
- scheduler
- scsi_host
- selinux
- sysctl
- sysfs
- usb
- video
- vm
There is some dynamic tuning functionality provided by some of these plug-ins that is not supported. The following Tuned plug-ins are currently not supported:
- bootloader
- script
- systemd
See Available Tuned Plug-ins and Getting Started with Tuned for more information.