Scalability and performance


OpenShift Container Platform 4.11

Scaling your OpenShift Container Platform cluster and tuning performance in production environments

Red Hat OpenShift Documentation Team

Abstract

This document provides instructions for scaling your cluster and optimizing the performance of your OpenShift Container Platform environment.

Chapter 4. Using the Node Tuning Operator

Learn about the Node Tuning Operator and how you can use it to manage node-level tuning by orchestrating the tuned daemon.

4.1. About the Node Tuning Operator

The Node Tuning Operator helps you manage node-level tuning by orchestrating the TuneD daemon and achieves low latency performance by using the Performance Profile controller. The majority of high-performance applications require some level of kernel tuning. The Node Tuning Operator provides a unified management interface to users of node-level sysctls and more flexibility to add custom tuning based on user needs.

The Operator manages the containerized TuneD daemon for OpenShift Container Platform as a Kubernetes daemon set. It ensures the custom tuning specification is passed to all containerized TuneD daemons running in the cluster in the format that the daemons understand. The daemons run on all nodes in the cluster, one per node.
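
For example, you can confirm that the daemon set rolled out to every node by listing it in the Operator’s namespace. In a default installation the daemon set is named tuned, but verify the name in your own cluster:

$ oc get daemonset tuned -n openshift-cluster-node-tuning-operator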

Node-level settings applied by the containerized TuneD daemon are rolled back on an event that triggers a profile change or when the containerized TuneD daemon is terminated gracefully by receiving and handling a termination signal.

The Node Tuning Operator uses the Performance Profile controller to implement automatic tuning to achieve low latency performance for OpenShift Container Platform applications. The cluster administrator configures a performance profile to define node-level settings such as the following:

  • Updating the kernel to kernel-rt.
  • Choosing CPUs for housekeeping.
  • Choosing CPUs for running workloads.
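
The following is a minimal sketch of such a performance profile, shown for illustration only. The name, CPU ranges, and node selector are placeholder values; adjust them to your hardware and machine config pool:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: example-performanceprofile
spec:
  cpu:
    reserved: "0-1"   # CPUs set aside for housekeeping
    isolated: "2-15"  # CPUs dedicated to running workloads
  realTimeKernel:
    enabled: true     # update the kernel to kernel-rt
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""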

The Node Tuning Operator is part of a standard OpenShift Container Platform installation in version 4.1 and later.

Note

In earlier versions of OpenShift Container Platform, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator.

4.2. Accessing an example Node Tuning Operator specification

Use this process to access an example Node Tuning Operator specification.

Procedure

  • Run the following command to access an example Node Tuning Operator specification:

    oc get tuned.tuned.openshift.io/default -o yaml -n openshift-cluster-node-tuning-operator

The default CR is meant for delivering standard node-level tuning for the OpenShift Container Platform, and it can only be modified to set the Operator Management state. Any other custom changes to the default CR will be overwritten by the Operator. For custom tuning, create your own Tuned CRs. Newly created CRs are combined with the default CR, and custom tuning is applied to OpenShift Container Platform nodes based on node or pod labels and profile priorities.

Warning

While in certain situations the support for pod labels can be a convenient way of automatically delivering required tuning, this practice is strongly discouraged, especially in large-scale clusters. The default Tuned CR ships without pod label matching. If a custom profile is created with pod label matching, the functionality is enabled at that time. The pod label functionality will be deprecated in future versions of the Node Tuning Operator.

4.3. Default profiles set on a cluster

The following are the default profiles set on a cluster.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: default
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Optimize systems running OpenShift (provider specific parent profile)
      include=-provider-${f:exec:cat:/var/lib/tuned/provider},openshift
    name: openshift
  recommend:
  - profile: openshift-control-plane
    priority: 30
    match:
    - label: node-role.kubernetes.io/master
    - label: node-role.kubernetes.io/infra
  - profile: openshift-node
    priority: 40

Starting with OpenShift Container Platform 4.9, all OpenShift TuneD profiles are shipped with the TuneD package. You can use the oc exec command to view the contents of these profiles:

$ oc exec $tuned_pod -n openshift-cluster-node-tuning-operator -- find /usr/lib/tuned/openshift{,-control-plane,-node} -name tuned.conf -exec grep -H ^ {} \;

4.4. Verifying that the TuneD profiles are applied

Verify the TuneD profiles that are applied to your cluster node.

$ oc get profile.tuned.openshift.io -n openshift-cluster-node-tuning-operator

Example output

NAME             TUNED                     APPLIED   DEGRADED   AGE
master-0         openshift-control-plane   True      False      6h33m
master-1         openshift-control-plane   True      False      6h33m
master-2         openshift-control-plane   True      False      6h33m
worker-a         openshift-node            True      False      6h28m
worker-b         openshift-node            True      False      6h28m

  • NAME: Name of the Profile object. There is one Profile object per node, and its name matches the node name.
  • TUNED: Name of the desired TuneD profile to apply.
  • APPLIED: True if the TuneD daemon applied the desired profile (True/False/Unknown).
  • DEGRADED: True if any errors were reported during application of the TuneD profile (True/False/Unknown). If a profile is degraded, you can inspect its status as shown after this list.
  • AGE: Time elapsed since the creation of the Profile object.
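
If a profile reports DEGRADED as True, you can inspect the status conditions of the corresponding Profile object for details; for example, for the worker-a node shown above:

$ oc get profile.tuned.openshift.io worker-a -n openshift-cluster-node-tuning-operator -o yaml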

4.5. Custom tuning specification

The custom resource (CR) for the Operator has two major sections. The first section, profile:, is a list of TuneD profiles and their names. The second, recommend:, defines the profile selection logic.

Multiple custom tuning specifications can co-exist as multiple CRs in the Operator’s namespace. The existence of new CRs or the deletion of old CRs is detected by the Operator. All existing custom tuning specifications are merged and appropriate objects for the containerized TuneD daemons are updated.

Management state

The Operator Management state is set by adjusting the default Tuned CR. By default, the Operator is in the Managed state and the spec.managementState field is not present in the default Tuned CR. Valid values for the Operator Management state are as follows:

  • Managed: the Operator will update its operands as configuration resources are updated
  • Unmanaged: the Operator will ignore changes to the configuration resources
  • Removed: the Operator will remove its operands and resources the Operator provisioned
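
For example, to place the Operator in the Unmanaged state, you can add the field to the default Tuned CR with a merge patch such as the following; editing the CR with oc edit works equally well:

$ oc patch tuned.tuned.openshift.io/default -n openshift-cluster-node-tuning-operator --type merge -p '{"spec":{"managementState":"Unmanaged"}}'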

Profile data

The profile: section lists TuneD profiles and their names.

profile:
- name: tuned_profile_1
  data: |
    # TuneD profile specification
    [main]
    summary=Description of tuned_profile_1 profile

    [sysctl]
    net.ipv4.ip_forward=1
    # ... other sysctl's or other TuneD daemon plugins supported by the containerized TuneD

# ...

- name: tuned_profile_n
  data: |
    # TuneD profile specification
    [main]
    summary=Description of tuned_profile_n profile

    # tuned_profile_n profile settings

Recommended profiles

The profile: selection logic is defined by the recommend: section of the CR. The recommend: section is a list of items that recommend profiles based on selection criteria.

recommend:
<recommend-item-1>
# ...
<recommend-item-n>

The individual items of the list have the following structure:

- machineConfigLabels: 1
    <mcLabels> 2
  match: 3
    <match> 4
  priority: <priority> 5
  profile: <tuned_profile_name> 6
  operand: 7
    debug: <bool> 8
    tunedConfig:
      reapply_sysctl: <bool> 9
1
Optional.
2
A dictionary of key/value MachineConfig labels. The keys must be unique.
3
If omitted, profile match is assumed unless a profile with a higher priority matches first or machineConfigLabels is set.
4
An optional list.
5
Profile ordering priority. Lower numbers mean higher priority (0 is the highest priority).
6
A TuneD profile to apply on a match. For example tuned_profile_1.
7
Optional operand configuration.
8
Turn debugging on or off for the TuneD daemon. Options are true for on or false for off. The default is false.
9
Turn reapply_sysctl functionality on or off for the TuneD daemon. Options are true for on and false for off.

<match> is an optional list recursively defined as follows:

- label: <label_name> 1
  value: <label_value> 2
  type: <label_type> 3
    <match> 4
1
Node or pod label name.
2
Optional node or pod label value. If omitted, the presence of <label_name> is enough to match.
3
Optional object type (node or pod). If omitted, node is assumed.
4
An optional <match> list.

If <match> is not omitted, all nested <match> sections must also evaluate to true. Otherwise, false is assumed and the profile with the respective <match> section is not applied or recommended. Therefore, the nesting (child <match> sections) works as a logical AND operator. Conversely, if any item of the <match> list matches, the entire <match> list evaluates to true. Therefore, the list acts as a logical OR operator.

If machineConfigLabels is defined, machine config pool based matching is turned on for the given recommend: list item. <mcLabels> specifies the labels for a machine config. The machine config is created automatically to apply host settings, such as kernel boot parameters, for the profile <tuned_profile_name>. This involves finding all machine config pools with machine config selector matching <mcLabels> and setting the profile <tuned_profile_name> on all nodes that are assigned the found machine config pools. To target nodes that have both master and worker roles, you must use the master role.

The list items match and machineConfigLabels are connected by the logical OR operator. The match item is evaluated first in a short-circuit manner. Therefore, if it evaluates to true, the machineConfigLabels item is not considered.

Important

When using machine config pool based matching, it is advised to group nodes with the same hardware configuration into the same machine config pool. Not following this practice might result in TuneD operands calculating conflicting kernel parameters for two or more nodes sharing the same machine config pool.

Example: node or pod label based matching

- match:
  - label: tuned.openshift.io/elasticsearch
    match:
    - label: node-role.kubernetes.io/master
    - label: node-role.kubernetes.io/infra
    type: pod
  priority: 10
  profile: openshift-control-plane-es
- match:
  - label: node-role.kubernetes.io/master
  - label: node-role.kubernetes.io/infra
  priority: 20
  profile: openshift-control-plane
- priority: 30
  profile: openshift-node

The CR above is translated for the containerized TuneD daemon into its recommend.conf file based on the profile priorities. The profile with the highest priority (10) is openshift-control-plane-es and, therefore, it is considered first. The containerized TuneD daemon running on a given node looks to see if there is a pod running on the same node with the tuned.openshift.io/elasticsearch label set. If not, the entire <match> section evaluates as false. If there is such a pod with the label, in order for the <match> section to evaluate to true, the node label also needs to be node-role.kubernetes.io/master or node-role.kubernetes.io/infra.

If the labels for the profile with priority 10 matched, the openshift-control-plane-es profile is applied and no other profile is considered. If the node/pod label combination did not match, the second highest priority profile (openshift-control-plane) is considered. This profile is applied if the containerized TuneD pod runs on a node with the node-role.kubernetes.io/master or node-role.kubernetes.io/infra label.

Finally, the openshift-node profile has the lowest priority of 30. It lacks the <match> section and, therefore, always matches. It acts as a catch-all that sets the openshift-node profile if no other profile with a higher priority matches on a given node.

Decision workflow

Example: machine config pool based matching

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-node-custom
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift node profile with an additional kernel parameter
      include=openshift-node
      [bootloader]
      cmdline_openshift_node_custom=+skew_tick=1
    name: openshift-node-custom

  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-custom"
    priority: 20
    profile: openshift-node-custom

To minimize node reboots, label the target nodes with a label that the machine config pool’s node selector will match, then create the Tuned CR above, and finally create the custom machine config pool itself.
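
A sketch of such a custom machine config pool, matching the machineconfiguration.openshift.io/role: "worker-custom" label used above, might look like the following. The exact selectors depend on how you label your nodes and machine configs:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-custom
  labels:
    worker-custom: ""
spec:
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values: [worker, worker-custom]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-custom: ""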

Cloud provider-specific TuneD profiles

With this functionality, all Cloud provider-specific nodes can conveniently be assigned a TuneD profile specifically tailored to a given Cloud provider on an OpenShift Container Platform cluster. This can be accomplished without adding additional node labels or grouping nodes into machine config pools.

This functionality takes advantage of spec.providerID node object values in the form of <cloud-provider>://<cloud-provider-specific-id> and writes the file /var/lib/tuned/provider with the value <cloud-provider> in NTO operand containers. The content of this file is then used by TuneD to load the provider-<cloud-provider> profile if such a profile exists.

The openshift profile that both openshift-control-plane and openshift-node profiles inherit settings from is now updated to use this functionality through the use of conditional profile loading. Neither NTO nor TuneD currently ship any Cloud provider-specific profiles. However, it is possible to create a custom profile provider-<cloud-provider> that will be applied to all Cloud provider-specific cluster nodes.

Example GCE Cloud provider profile

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: provider-gce
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=GCE Cloud provider-specific profile
      # Your tuning for GCE Cloud provider goes here.
    name: provider-gce

Note

Due to profile inheritance, any setting specified in the provider-<cloud-provider> profile will be overwritten by the openshift profile and its child profiles.
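
To check which provider value was detected on a node, you can read the provider file from a TuneD operand container; for example:

$ oc exec $tuned_pod -n openshift-cluster-node-tuning-operator -- cat /var/lib/tuned/provider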

4.6. Custom tuning examples

Using TuneD profiles from the default CR

The following CR applies custom node-level tuning for OpenShift Container Platform nodes with label tuned.openshift.io/ingress-node-label set to any value.

Example: custom tuning using the openshift-control-plane TuneD profile

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: ingress
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=A custom OpenShift ingress profile
      include=openshift-control-plane
      [sysctl]
      net.ipv4.ip_local_port_range="1024 65535"
      net.ipv4.tcp_tw_reuse=1
    name: openshift-ingress
  recommend:
  - match:
    - label: tuned.openshift.io/ingress-node-label
    priority: 10
    profile: openshift-ingress

Important

Custom profile writers are strongly encouraged to include the default TuneD daemon profiles shipped within the default Tuned CR. The example above uses the default openshift-control-plane profile to accomplish this.

Using built-in TuneD profiles

Given the successful rollout of the NTO-managed daemon set, the TuneD operands all manage the same version of the TuneD daemon. To list the built-in TuneD profiles supported by the daemon, query any TuneD pod in the following way:

$ oc exec $tuned_pod -n openshift-cluster-node-tuning-operator -- find /usr/lib/tuned/ -name tuned.conf -printf '%h\n' | sed 's|^.*/||'

You can use the profile names retrieved by this command in your custom tuning specification.

Example: using built-in hpc-compute TuneD profile

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-node-hpc-compute
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift node profile for HPC compute workloads
      include=openshift-node,hpc-compute
    name: openshift-node-hpc-compute

  recommend:
  - match:
    - label: tuned.openshift.io/openshift-node-hpc-compute
    priority: 20
    profile: openshift-node-hpc-compute

In addition to the built-in hpc-compute profile, the example above includes the openshift-node TuneD daemon profile shipped within the default Tuned CR to use OpenShift-specific tuning for compute nodes.

4.7. Supported TuneD daemon plugins

Excluding the [main] section, the following TuneD plugins are supported when using custom profiles defined in the profile: section of the Tuned CR:

  • audio
  • cpu
  • disk
  • eeepc_she
  • modules
  • mounts
  • net
  • scheduler
  • scsi_host
  • selinux
  • sysctl
  • sysfs
  • usb
  • video
  • vm
  • bootloader

Some of these plugins provide dynamic tuning functionality that is not supported. The following TuneD plugins are currently not supported:

  • script
  • systemd

Note

The TuneD bootloader plugin only supports Red Hat Enterprise Linux CoreOS (RHCOS) worker nodes.

Chapter 5. Using CPU Manager and Topology Manager

CPU Manager manages groups of CPUs and constrains workloads to specific CPUs.

CPU Manager is useful for workloads that have some of these attributes:

  • Require as much CPU time as possible.
  • Are sensitive to processor cache misses.
  • Are low-latency network applications.
  • Coordinate with other processes and benefit from sharing a single processor cache.

Topology Manager collects hints from the CPU Manager, Device Manager, and other Hint Providers to align pod resources, such as CPU, SR-IOV VFs, and other device resources, for all Quality of Service (QoS) classes on the same non-uniform memory access (NUMA) node.

Topology Manager uses topology information from the collected hints to decide if a pod can be accepted or rejected on a node, based on the configured Topology Manager policy and pod resources requested.

Topology Manager is useful for workloads that use hardware accelerators to support latency-critical execution and high throughput parallel computation.

To use Topology Manager you must configure CPU Manager with the static policy.

5.1. Setting up CPU Manager

Procedure

  1. Optional: Label a node:

    # oc label node perf-node.example.com cpumanager=true
  2. Edit the MachineConfigPool of the nodes where CPU Manager should be enabled. In this example, all workers have CPU Manager enabled:

    # oc edit machineconfigpool worker
  3. Add a label to the worker machine config pool:

    metadata:
      creationTimestamp: 2020-xx-xxx
      generation: 3
      labels:
        custom-kubelet: cpumanager-enabled
  4. Create a KubeletConfig, cpumanager-kubeletconfig.yaml, custom resource (CR). Refer to the label created in the previous step to have the correct nodes updated with the new kubelet config. See the machineConfigPoolSelector section:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: cpumanager-enabled
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: cpumanager-enabled
      kubeletConfig:
         cpuManagerPolicy: static 1
         cpuManagerReconcilePeriod: 5s 2
    1
    Specify a policy:
    • none. This policy explicitly enables the existing default CPU affinity scheme, providing no affinity beyond what the scheduler does automatically. This is the default policy.
    • static. This policy allows containers in guaranteed pods that have integer CPU requests to be granted exclusive access to CPUs on the node. If static, you must use a lowercase s.
    2
    Optional. Specify the CPU Manager reconcile frequency. The default is 5s.
  5. Create the dynamic kubelet config:

    # oc create -f cpumanager-kubeletconfig.yaml

    This adds the CPU Manager feature to the kubelet config, and the Machine Config Operator (MCO) reboots the node if needed. A reboot is not needed to enable CPU Manager.

  6. Check for the merged kubelet config:

    # oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7

    Example output

           "ownerReferences": [
                {
                    "apiVersion": "machineconfiguration.openshift.io/v1",
                    "kind": "KubeletConfig",
                    "name": "cpumanager-enabled",
                    "uid": "7ed5616d-6b72-11e9-aae1-021e1ce18878"
                }
            ]

  7. Check the worker for the updated kubelet.conf:

    # oc debug node/perf-node.example.com
    sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager

    Example output

    cpuManagerPolicy: static        1
    cpuManagerReconcilePeriod: 5s   2

    1
    cpuManagerPolicy is defined when you create the KubeletConfig CR.
    2
    cpuManagerReconcilePeriod is defined when you create the KubeletConfig CR.
  8. Create a pod that requests a core or multiple cores. Both limits and requests must have their CPU value set to a whole integer. That is the number of cores that will be dedicated to this pod:

    # cat cpumanager-pod.yaml

    Example output

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: cpumanager-
    spec:
      containers:
      - name: cpumanager
        image: gcr.io/google_containers/pause-amd64:3.0
        resources:
          requests:
            cpu: 1
            memory: "1G"
          limits:
            cpu: 1
            memory: "1G"
      nodeSelector:
        cpumanager: "true"

  9. Create the pod:

    # oc create -f cpumanager-pod.yaml
  10. Verify that the pod is scheduled to the node that you labeled:

    # oc describe pod cpumanager

    Example output

    Name:               cpumanager-6cqz7
    Namespace:          default
    Priority:           0
    PriorityClassName:  <none>
    Node:  perf-node.example.com/xxx.xx.xx.xxx
    ...
     Limits:
          cpu:     1
          memory:  1G
        Requests:
          cpu:        1
          memory:     1G
    ...
    QoS Class:       Guaranteed
    Node-Selectors:  cpumanager=true

  11. Verify that the cgroups are set up correctly. Get the process ID (PID) of the pause process:

    # ├─init.scope
    │ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 17
    └─kubepods.slice
      ├─kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice
      │ ├─crio-b5437308f1a574c542bdf08563b865c0345c8f8c0b0a655612c.scope
      │ └─32706 /pause

    Pods of quality of service (QoS) tier Guaranteed are placed within the kubepods.slice. Pods of other QoS tiers end up in child cgroups of kubepods:

    # cd /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice/crio-b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope
    # for i in `ls cpuset.cpus tasks` ; do echo -n "$i "; cat $i ; done

    Example output

    cpuset.cpus 1
    tasks 32706

  12. Check the allowed CPU list for the task:

    # grep ^Cpus_allowed_list /proc/32706/status

    Example output

     Cpus_allowed_list:    1

  13. Verify that another pod (in this case, the pod in the best-effort QoS tier) on the system cannot run on the core allocated for the Guaranteed pod:

    # cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc494a073_6b77_11e9_98c0_06bba5c387ea.slice/crio-c56982f57b75a2420947f0afc6cafe7534c5734efc34157525fa9abbf99e3849.scope/cpuset.cpus
    0
    # oc describe node perf-node.example.com

    Example output

    ...
    Capacity:
     attachable-volumes-aws-ebs:  39
     cpu:                         2
     ephemeral-storage:           124768236Ki
     hugepages-1Gi:               0
     hugepages-2Mi:               0
     memory:                      8162900Ki
     pods:                        250
    Allocatable:
     attachable-volumes-aws-ebs:  39
     cpu:                         1500m
     ephemeral-storage:           124768236Ki
     hugepages-1Gi:               0
     hugepages-2Mi:               0
     memory:                      7548500Ki
     pods:                        250
      Namespace                               Name                           CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
      ---------                               ----                           ------------  ----------  ---------------  -------------  ---
      default                                 cpumanager-6cqz7               1 (66%)       1 (66%)     1G (12%)         1G (12%)       29m
    
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource                    Requests          Limits
      --------                    --------          ------
      cpu                         1440m (96%)       1 (66%)

    This VM has two CPU cores. The system-reserved setting reserves 500 millicores, meaning that half of one core is subtracted from the total capacity of the node to arrive at the Node Allocatable amount. You can see that Allocatable CPU is 1500 millicores. This means you can run one of the CPU Manager pods since each will take one whole core. A whole core is equivalent to 1000 millicores. If you try to schedule a second pod, the system will accept the pod, but it will never be scheduled:

    NAME                    READY   STATUS    RESTARTS   AGE
    cpumanager-6cqz7        1/1     Running   0          33m
    cpumanager-7qc2t        0/1     Pending   0          11s
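
    To see why the second pod remains in the Pending state, review its events; the scheduler typically reports a FailedScheduling event because the remaining allocatable CPU is less than one whole core:

    # oc describe pod cpumanager-7qc2t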

5.2. Topology Manager policies

Topology Manager aligns Pod resources of all Quality of Service (QoS) classes by collecting topology hints from Hint Providers, such as CPU Manager and Device Manager, and using the collected hints to align the Pod resources.

Topology Manager supports four allocation policies, which you assign in the KubeletConfig custom resource (CR) named cpumanager-enabled:

none policy
This is the default policy and does not perform any topology alignment.
best-effort policy
For each container in a pod with the best-effort topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager stores this and admits the pod to the node.
restricted policy
For each container in a pod with the restricted topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA Node affinity for that container. If the affinity is not preferred, Topology Manager rejects this pod from the node, resulting in a pod in a Terminated state with a pod admission failure.
single-numa-node policy
For each container in a pod with the single-numa-node topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA Node affinity is possible. If it is, the pod is admitted to the node. If a single NUMA Node affinity is not possible, the Topology Manager rejects the pod from the node. This results in a pod in a Terminated state with a pod admission failure.

5.3. Setting up Topology Manager

To use Topology Manager, you must configure an allocation policy in the KubeletConfig custom resource (CR) named cpumanager-enabled. This CR might already exist if you have set up CPU Manager. If it does not exist, create it.

Prerequisites

  • Configure the CPU Manager policy to be static.

Procedure

To activate Topology Manager:

  1. Configure the Topology Manager allocation policy in the custom resource.

    $ oc edit KubeletConfig cpumanager-enabled
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: cpumanager-enabled
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: cpumanager-enabled
      kubeletConfig:
         cpuManagerPolicy: static 1
         cpuManagerReconcilePeriod: 5s
         topologyManagerPolicy: single-numa-node 2
    1
    This parameter must be static with a lowercase s.
    2
    Specify your selected Topology Manager allocation policy. Here, the policy is single-numa-node. Acceptable values are: none, best-effort, restricted, single-numa-node.
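
After the updated kubelet configuration rolls out, you can optionally confirm the policy on a node, following the same pattern used for CPU Manager earlier; the node name is an example:

# oc debug node/perf-node.example.com
sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep topologyManager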

5.4. Pod interactions with Topology Manager policies

The example Pod specs below help illustrate pod interactions with Topology Manager.

The following pod runs in the BestEffort QoS class because no resource requests or limits are specified.

spec:
  containers:
  - name: nginx
    image: nginx

The next pod runs in the Burstable QoS class because requests are less than limits.

spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"

If the selected policy is anything other than none, Topology Manager would not consider either of these Pod specifications.

The last example pod below runs in the Guaranteed QoS class because requests are equal to limits.

spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"
      requests:
        memory: "200Mi"
        cpu: "2"
        example.com/device: "1"

Topology Manager would consider this pod. The Topology Manager would consult the hint providers, which are CPU Manager and Device Manager, to get topology hints for the pod.

Topology Manager will use this information to store the best topology for this container. In the case of this pod, CPU Manager and Device Manager will use this stored information at the resource allocation stage.

Chapter 6. Scheduling NUMA-aware workloads

Learn about NUMA-aware scheduling and how you can use it to deploy high performance workloads in an OpenShift Container Platform cluster.

Important

NUMA-aware scheduling is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

The NUMA Resources Operator allows you to schedule high-performance workloads in the same NUMA zone. It deploys a node resources exporting agent that reports on available cluster node NUMA resources, and a secondary scheduler that manages the workloads.

6.1. About NUMA-aware scheduling

Non-Uniform Memory Access (NUMA) is a compute platform architecture that allows different CPUs to access different regions of memory at different speeds. NUMA resource topology refers to the locations of CPUs, memory, and PCI devices relative to each other in the compute node. Co-located resources are said to be in the same NUMA zone. For high-performance applications, the cluster needs to process pod workloads in a single NUMA zone.

NUMA architecture allows a CPU with multiple memory controllers to use any available memory across CPU complexes, regardless of where the memory is located. This allows for increased flexibility at the expense of performance. A CPU processing a workload using memory that is outside its NUMA zone is slower than a workload processed in a single NUMA zone. Also, for I/O-constrained workloads, the network interface on a distant NUMA zone slows down how quickly information can reach the application. High-performance workloads, such as telecommunications workloads, cannot operate to specification under these conditions. NUMA-aware scheduling aligns the requested cluster compute resources (CPUs, memory, devices) in the same NUMA zone to process latency-sensitive or high-performance workloads efficiently. NUMA-aware scheduling also improves pod density per compute node for greater resource efficiency.

The default OpenShift Container Platform pod scheduler scheduling logic considers the available resources of the entire compute node, not individual NUMA zones. If the most restrictive resource alignment is requested in the kubelet topology manager, error conditions can occur when admitting the pod to a node. Conversely, if the most restrictive resource alignment is not requested, the pod can be admitted to the node without proper resource alignment, leading to worse or unpredictable performance. For example, runaway pod creation with Topology Affinity Error statuses can occur when the pod scheduler makes suboptimal scheduling decisions for guaranteed pod workloads by not knowing if the pod’s requested resources are available. Scheduling mismatch decisions can cause indefinite pod startup delays. Also, depending on the cluster state and resource allocation, poor pod scheduling decisions can cause extra load on the cluster because of failed startup attempts.

The NUMA Resources Operator deploys a custom NUMA resources secondary scheduler and other resources to mitigate the shortcomings of the default OpenShift Container Platform pod scheduler. The following diagram provides a high-level overview of NUMA-aware pod scheduling.

Figure 6.1. NUMA-aware scheduling overview

NodeResourceTopology API
The NodeResourceTopology API describes the available NUMA zone resources in each compute node.
NUMA-aware scheduler
The NUMA-aware secondary scheduler receives information about the available NUMA zones from the NodeResourceTopology API and schedules high-performance workloads on a node where it can be optimally processed.
Node topology exporter
The node topology exporter exposes the available NUMA zone resources for each compute node to the NodeResourceTopology API. The node topology exporter daemon tracks the resource allocation from the kubelet by using the PodResources API.
PodResources API
The PodResources API is local to each node and exposes the resource topology and available resources to the kubelet.


6.2. Installing the NUMA Resources Operator

The NUMA Resources Operator deploys resources that allow you to schedule NUMA-aware workloads and deployments. You can install the NUMA Resources Operator using the OpenShift Container Platform CLI or the web console.

6.2.1. Installing the NUMA Resources Operator using the CLI

As a cluster administrator, you can install the Operator using the CLI.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a namespace for the NUMA Resources Operator:

    1. Save the following YAML in the nro-namespace.yaml file:

      apiVersion: v1
      kind: Namespace
      metadata:
        name: openshift-numaresources
    2. Create the Namespace CR by running the following command:

      $ oc create -f nro-namespace.yaml
  2. Create the Operator group for the NUMA Resources Operator:

    1. Save the following YAML in the nro-operatorgroup.yaml file:

      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: numaresources-operator
        namespace: openshift-numaresources
      spec:
        targetNamespaces:
        - openshift-numaresources
    2. Create the OperatorGroup CR by running the following command:

      $ oc create -f nro-operatorgroup.yaml
  3. Create the subscription for the NUMA Resources Operator:

    1. Save the following YAML in the nro-sub.yaml file:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: numaresources-operator
        namespace: openshift-numaresources
      spec:
        channel: "4.11"
        name: numaresources-operator
        source: redhat-operators
        sourceNamespace: openshift-marketplace
    2. Create the Subscription CR by running the following command:

      $ oc create -f nro-sub.yaml

Verification

  1. Verify that the installation succeeded by inspecting the CSV resource in the openshift-numaresources namespace. Run the following command:

    $ oc get csv -n openshift-numaresources

    Example output

    NAME                             DISPLAY                  VERSION   REPLACES   PHASE
    numaresources-operator.v4.11.2   numaresources-operator   4.11.2               Succeeded
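
If the CSV does not reach the Succeeded phase, you can inspect the Subscription and InstallPlan resources in the same namespace for errors; for example:

$ oc get subscription,installplan -n openshift-numaresources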

6.2.2. Installing the NUMA Resources Operator using the web console

As a cluster administrator, you can install the NUMA Resources Operator using the web console.

Procedure

  1. Create a namespace for the NUMA Resources Operator:

    1. In the OpenShift Container Platform web console, click Administration → Namespaces.
    2. Click Create Namespace, enter openshift-numaresources in the Name field, and then click Create.
  2. Install the NUMA Resources Operator:

    1. In the OpenShift Container Platform web console, click Operators → OperatorHub.
    2. Choose NUMA Resources Operator from the list of available Operators, and then click Install.
    3. In the Installed Namespaces field, select the openshift-numaresources namespace, and then click Install.
  3. Optional: Verify that the NUMA Resources Operator installed successfully:

    1. Switch to the Operators → Installed Operators page.
    2. Ensure that NUMA Resources Operator is listed in the openshift-numaresources namespace with a Status of InstallSucceeded.

      Note

      During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.

      If the Operator does not appear as installed, to troubleshoot further:

      • Go to the Operators → Installed Operators page and inspect the Operator Subscriptions and Install Plans tabs for any failure or errors under Status.
      • Go to the Workloads → Pods page and check the logs for pods in the default project.

6.3. Creating the NUMAResourcesOperator custom resource

When you have installed the NUMA Resources Operator, create the NUMAResourcesOperator custom resource (CR) that instructs the NUMA Resources Operator to install all the cluster infrastructure needed to support the NUMA-aware scheduler, including daemon sets and APIs.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Install the NUMA Resources Operator.

Procedure

  1. Create the MachineConfigPool custom resource that enables custom kubelet configurations for worker nodes:

    1. Save the following YAML in the nro-machineconfig.yaml file:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        labels:
          cnf-worker-tuning: enabled
          machineconfiguration.openshift.io/mco-built-in: ""
          pools.operator.machineconfiguration.openshift.io/worker: ""
        name: worker
      spec:
        machineConfigSelector:
          matchLabels:
            machineconfiguration.openshift.io/role: worker
        nodeSelector:
          matchLabels:
            node-role.kubernetes.io/worker: ""
    2. Create the MachineConfigPool CR by running the following command:

      $ oc create -f nro-machineconfig.yaml
  2. Create the NUMAResourcesOperator custom resource:

    1. Save the following YAML in the nrop.yaml file:

      apiVersion: nodetopology.openshift.io/v1alpha1
      kind: NUMAResourcesOperator
      metadata:
        name: numaresourcesoperator
      spec:
        nodeGroups:
        - machineConfigPoolSelector:
            matchLabels:
              pools.operator.machineconfiguration.openshift.io/worker: "" 1
      1
      Should match the label applied to worker nodes in the related MachineConfigPool CR.
    2. Create the NUMAResourcesOperator CR by running the following command:

      $ oc create -f nrop.yaml

Verification

Verify that the NUMA Resources Operator deployed successfully by running the following command:

$ oc get numaresourcesoperators.nodetopology.openshift.io

Example output

NAME                    AGE
numaresourcesoperator   10m

6.4. Deploying the NUMA-aware secondary pod scheduler

After you install the NUMA Resources Operator, do the following to deploy the NUMA-aware secondary pod scheduler:

  • Configure the pod admittance policy for the required machine profile
  • Create the required machine config pool
  • Deploy the NUMA-aware secondary scheduler

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Install the NUMA Resources Operator.

Procedure

  1. Create the KubeletConfig custom resource that configures the pod admittance policy for the machine profile:

    1. Save the following YAML in the nro-kubeletconfig.yaml file:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: cnf-worker-tuning
      spec:
        machineConfigPoolSelector:
          matchLabels:
            cnf-worker-tuning: enabled
        kubeletConfig:
          cpuManagerPolicy: "static" 1
          cpuManagerReconcilePeriod: "5s"
          reservedSystemCPUs: "0,1"
          memoryManagerPolicy: "Static" 2
          evictionHard:
            memory.available: "100Mi"
          kubeReserved:
            memory: "512Mi"
          reservedMemory:
            - numaNode: 0
              limits:
                memory: "1124Mi"
          systemReserved:
            memory: "512Mi"
          topologyManagerPolicy: "single-numa-node" 3
          topologyManagerScope: "pod"
      1
      For cpuManagerPolicy, static must use a lowercase s.
      2
      For memoryManagerPolicy, Static must use an uppercase S.
      3
      topologyManagerPolicy must be set to single-numa-node.
    2. Create the KubeletConfig custom resource (CR) by running the following command:

      $ oc create -f nro-kubeletconfig.yaml
  2. Create the NUMAResourcesScheduler custom resource that deploys the NUMA-aware custom pod scheduler:

    1. Save the following YAML in the nro-scheduler.yaml file:

      apiVersion: nodetopology.openshift.io/v1alpha1
      kind: NUMAResourcesScheduler
      metadata:
        name: numaresourcesscheduler
      spec:
        imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.11"
    2. Create the NUMAResourcesScheduler CR by running the following command:

      $ oc create -f nro-scheduler.yaml

Verification

Verify that the required resources deployed successfully by running the following command:

$ oc get all -n openshift-numaresources

Example output

NAME                                                    READY   STATUS    RESTARTS   AGE
pod/numaresources-controller-manager-7575848485-bns4s   1/1     Running   0          13m
pod/numaresourcesoperator-worker-dvj4n                  2/2     Running   0          16m
pod/numaresourcesoperator-worker-lcg4t                  2/2     Running   0          16m
pod/secondary-scheduler-56994cf6cf-7qf4q                1/1     Running   0          16m
NAME                                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
daemonset.apps/numaresourcesoperator-worker   2         2         2       2            2           node-role.kubernetes.io/worker=   16m
NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/numaresources-controller-manager   1/1     1            1           13m
deployment.apps/secondary-scheduler                1/1     1            1           16m
NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/numaresources-controller-manager-7575848485   1         1         1       13m
replicaset.apps/secondary-scheduler-56994cf6cf                1         1         1       16m

6.5. Scheduling workloads with the NUMA-aware scheduler

You can schedule workloads with the NUMA-aware scheduler using Deployment CRs that specify the minimum required resources to process the workload.

The following example deployment uses NUMA-aware scheduling for a sample workload.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.

Procedure

  1. Get the name of the NUMA-aware scheduler that is deployed in the cluster by running the following command:

    $ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'

    Example output

    topo-aware-scheduler

  2. Create a Deployment CR that uses the scheduler named topo-aware-scheduler, for example:

    1. Save the following YAML in the nro-deployment.yaml file:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: numa-deployment-1
        namespace: openshift-numaresources
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: test
        template:
          metadata:
            labels:
              app: test
          spec:
            schedulerName: topo-aware-scheduler 1
            containers:
            - name: ctnr
              image: quay.io/openshifttest/hello-openshift:openshift
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  memory: "100Mi"
                  cpu: "10"
                requests:
                  memory: "100Mi"
                  cpu: "10"
            - name: ctnr2
              image: registry.access.redhat.com/rhel:latest
              imagePullPolicy: IfNotPresent
              command: ["/bin/sh", "-c"]
              args: [ "while true; do sleep 1h; done;" ]
              resources:
                limits:
                  memory: "100Mi"
                  cpu: "8"
                requests:
                  memory: "100Mi"
                  cpu: "8"
      1
      schedulerName must match the name of the NUMA-aware scheduler that is deployed in your cluster, for example topo-aware-scheduler.
    2. Create the Deployment CR by running the following command:

      $ oc create -f nro-deployment.yaml

Verification

  1. Verify that the deployment was successful:

    $ oc get pods -n openshift-numaresources

    Example output

    NAME                                                READY   STATUS    RESTARTS   AGE
    numa-deployment-1-56954b7b46-pfgw8                  2/2     Running   0          129m
    numaresources-controller-manager-7575848485-bns4s   1/1     Running   0          15h
    numaresourcesoperator-worker-dvj4n                  2/2     Running   0          18h
    numaresourcesoperator-worker-lcg4t                  2/2     Running   0          16h
    secondary-scheduler-56994cf6cf-7qf4q                1/1     Running   0          18h

  2. Verify that the topo-aware-scheduler is scheduling the deployed pod by running the following command:

    $ oc describe pod numa-deployment-1-56954b7b46-pfgw8 -n openshift-numaresources

    Example output

    Events:
      Type    Reason          Age   From                  Message
      ----    ------          ----  ----                  -------
      Normal  Scheduled       130m  topo-aware-scheduler  Successfully assigned openshift-numaresources/numa-deployment-1-56954b7b46-pfgw8 to compute-0.example.com

    Note

    Deployments that request more resources than are available for scheduling fail with a MinimumReplicasUnavailable error. The deployment succeeds when the required resources become available. Pods remain in the Pending state until the required resources are available.

  3. Verify that the expected allocated resources are listed for the node. Run the following command:

    $ oc describe noderesourcetopologies.topology.node.k8s.io

    Example output

    ...
    
    Zones:
      Costs:
        Name:   node-0
        Value:  10
        Name:   node-1
        Value:  21
      Name:     node-0
      Resources:
        Allocatable:  39
        Available:    21 1
        Capacity:     40
        Name:         cpu
        Allocatable:  6442450944
        Available:    6442450944
        Capacity:     6442450944
        Name:         hugepages-1Gi
        Allocatable:  134217728
        Available:    134217728
        Capacity:     134217728
        Name:         hugepages-2Mi
        Allocatable:  262415904768
        Available:    262206189568
        Capacity:     270146007040
        Name:         memory
      Type:           Node

    1
    The Available capacity is reduced because of the resources that have been allocated to the guaranteed pod.

    Resources consumed by guaranteed pods are subtracted from the available node resources listed under noderesourcetopologies.topology.node.k8s.io.

  4. Resource allocations for pods with a BestEffort or Burstable quality of service (qosClass) are not reflected in the NUMA node resources under noderesourcetopologies.topology.node.k8s.io. If a pod’s consumed resources are not reflected in the node resource calculation, verify that the pod has a qosClass of Guaranteed by running the following command:

    $ oc get pod <pod_name> -n <pod_namespace> -o jsonpath="{ .status.qosClass }"

    Example output

    Guaranteed
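
To review the QoS class of every pod in the namespace at once, you can use a custom-columns query; for example:

$ oc get pods -n openshift-numaresources -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass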

6.6. Troubleshooting NUMA-aware scheduling

To troubleshoot common problems with NUMA-aware pod scheduling, perform the following steps.

Prerequisites

  • Install the OpenShift Container Platform CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.

Procedure

  1. Verify that the noderesourcetopologies CRD is deployed in the cluster by running the following command:

    $ oc get crd | grep noderesourcetopologies

    Example output

    NAME                                                              CREATED AT
    noderesourcetopologies.topology.node.k8s.io                       2022-01-18T08:28:06Z

  2. Check that the NUMA-aware scheduler name matches the name specified in your NUMA-aware workloads by running the following command:

    $ oc get numaresourcesschedulers.nodetopology.openshift.io numaresourcesscheduler -o json | jq '.status.schedulerName'

    Example output

    topo-aware-scheduler

  3. Verify that NUMA-aware schedulable nodes have the noderesourcetopologies CR applied to them. Run the following command:

    $ oc get noderesourcetopologies.topology.node.k8s.io

    Example output

    NAME                    AGE
    compute-0.example.com   17h
    compute-1.example.com   17h

    Note

    The number of nodes should equal the number of worker nodes that are configured by the machine config pool (mcp) worker definition.

  4. Verify the NUMA zone granularity for all schedulable nodes by running the following command:

    $ oc get noderesourcetopologies.topology.node.k8s.io -o yaml

    Example output

    apiVersion: v1
    items:
    - apiVersion: topology.node.k8s.io/v1alpha1
      kind: NodeResourceTopology
      metadata:
        annotations:
          k8stopoawareschedwg/rte-update: periodic
        creationTimestamp: "2022-06-16T08:55:38Z"
        generation: 63760
        name: worker-0
        resourceVersion: "8450223"
        uid: 8b77be46-08c0-4074-927b-d49361471590
      topologyPolicies:
      - SingleNUMANodeContainerLevel
      zones:
      - costs:
        - name: node-0
          value: 10
        - name: node-1
          value: 21
        name: node-0
        resources:
        - allocatable: "38"
          available: "38"
          capacity: "40"
          name: cpu
        - allocatable: "134217728"
          available: "134217728"
          capacity: "134217728"
          name: hugepages-2Mi
        - allocatable: "262352048128"
          available: "262352048128"
          capacity: "270107316224"
          name: memory
        - allocatable: "6442450944"
          available: "6442450944"
          capacity: "6442450944"
          name: hugepages-1Gi
        type: Node
      - costs:
        - name: node-0
          value: 21
        - name: node-1
          value: 10
        name: node-1
        resources:
        - allocatable: "268435456"
          available: "268435456"
          capacity: "268435456"
          name: hugepages-2Mi
        - allocatable: "269231067136"
          available: "269231067136"
          capacity: "270573244416"
          name: memory
        - allocatable: "40"
          available: "40"
          capacity: "40"
          name: cpu
        - allocatable: "1073741824"
          available: "1073741824"
          capacity: "1073741824"
          name: hugepages-1Gi
        type: Node
    - apiVersion: topology.node.k8s.io/v1alpha1
      kind: NodeResourceTopology
      metadata:
        annotations:
          k8stopoawareschedwg/rte-update: periodic
        creationTimestamp: "2022-06-16T08:55:37Z"
        generation: 62061
        name: worker-1
        resourceVersion: "8450129"
        uid: e8659390-6f8d-4e67-9a51-1ea34bba1cc3
      topologyPolicies:
      - SingleNUMANodeContainerLevel
      zones: 1
      - costs:
        - name: node-0
          value: 10
        - name: node-1
          value: 21
        name: node-0
        resources: 2
        - allocatable: "38"
          available: "38"
          capacity: "40"
          name: cpu
        - allocatable: "6442450944"
          available: "6442450944"
          capacity: "6442450944"
          name: hugepages-1Gi
        - allocatable: "134217728"
          available: "134217728"
          capacity: "134217728"
          name: hugepages-2Mi
        - allocatable: "262391033856"
          available: "262391033856"
          capacity: "270146301952"
          name: memory
        type: Node
      - costs:
        - name: node-0
          value: 21
        - name: node-1
          value: 10
        name: node-1
        resources:
        - allocatable: "40"
          available: "40"
          capacity: "40"
          name: cpu
        - allocatable: "1073741824"
          available: "1073741824"
          capacity: "1073741824"
          name: hugepages-1Gi
        - allocatable: "268435456"
          available: "268435456"
          capacity: "268435456"
          name: hugepages-2Mi
        - allocatable: "269192085504"
          available: "269192085504"
          capacity: "270534262784"
          name: memory
        type: Node
    kind: List
    metadata:
      resourceVersion: ""
      selfLink: ""

    1
    Each stanza under zones describes the resources for a single NUMA zone.
    2
    resources describes the current state of the NUMA zone resources. Check that resources listed under items.zones.resources.available correspond to the exclusive NUMA zone resources allocated to each guaranteed pod.

6.6.1. Checking the NUMA-aware scheduler logs

Troubleshoot problems with the NUMA-aware scheduler by reviewing the logs. If required, you can increase the scheduler log level by modifying the spec.logLevel field of the NUMAResourcesScheduler resource. Acceptable values are Normal, Debug, and Trace, with Trace being the most verbose option.

Note

To change the log level of the secondary scheduler, delete the running scheduler resource and re-deploy it with the changed log level. The scheduler is unavailable for scheduling new workloads during this downtime.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Delete the currently running NUMAResourcesScheduler resource:

    1. Get the active NUMAResourcesScheduler by running the following command:

      $ oc get NUMAResourcesScheduler

      Example output

      NAME                     AGE
      numaresourcesscheduler   90m

    2. Delete the secondary scheduler resource by running the following command:

      $ oc delete NUMAResourcesScheduler numaresourcesscheduler

      Example output

      numaresourcesscheduler.nodetopology.openshift.io "numaresourcesscheduler" deleted

  2. Save the following YAML in the file nro-scheduler-debug.yaml. This example changes the log level to Debug:

    apiVersion: nodetopology.openshift.io/v1alpha1
    kind: NUMAResourcesScheduler
    metadata:
      name: numaresourcesscheduler
    spec:
      imageSpec: "registry.redhat.io/openshift4/noderesourcetopology-scheduler-container-rhel8:v4.11"
      logLevel: Debug
  3. Create the updated Debug logging NUMAResourcesScheduler resource by running the following command:

    $ oc create -f nro-scheduler-debug.yaml

    Example output

    numaresourcesscheduler.nodetopology.openshift.io/numaresourcesscheduler created

Verification steps

  1. Check that the NUMA-aware scheduler was successfully deployed:

    1. Run the following command to check that the CRD is created successfully:

      $ oc get crd | grep numaresourcesschedulers

      Example output

      NAME                                                              CREATED AT
      numaresourcesschedulers.nodetopology.openshift.io                 2022-02-25T11:57:03Z

    2. Check that the new custom scheduler is available by running the following command:

      $ oc get numaresourcesschedulers.nodetopology.openshift.io

      Example output

      NAME                     AGE
      numaresourcesscheduler   3h26m

  2. Check that the scheduler logs show the increased log level:

    1. Get the list of pods running in the openshift-numaresources namespace by running the following command:

      $ oc get pods -n openshift-numaresources

      Example output

      NAME                                               READY   STATUS    RESTARTS   AGE
      numaresources-controller-manager-d87d79587-76mrm   1/1     Running   0          46h
      numaresourcesoperator-worker-5wm2k                 2/2     Running   0          45h
      numaresourcesoperator-worker-pb75c                 2/2     Running   0          45h
      secondary-scheduler-7976c4d466-qm4sc               1/1     Running   0          21m

    2. Get the logs for the secondary scheduler pod by running the following command:

      $ oc logs secondary-scheduler-7976c4d466-qm4sc -n openshift-numaresources

      Example output

      ...
      I0223 11:04:55.614788       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Namespace total 11 items received
      I0223 11:04:56.609114       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.ReplicationController total 10 items received
      I0223 11:05:22.626818       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.StorageClass total 7 items received
      I0223 11:05:31.610356       1 reflector.go:535] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PodDisruptionBudget total 7 items received
      I0223 11:05:31.713032       1 eventhandlers.go:186] "Add event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"
      I0223 11:05:53.461016       1 eventhandlers.go:244] "Delete event for scheduled pod" pod="openshift-marketplace/certified-operators-thtvq"

6.6.2. Troubleshooting the resource topology exporter

Troubleshoot unexpected results in noderesourcetopologies objects by inspecting the corresponding resource-topology-exporter logs.

Note

It is recommended that NUMA resource topology exporter instances in the cluster are named after the nodes they refer to. For example, a worker node with the name worker should have a corresponding noderesourcetopologies object called worker.
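
To confirm this mapping, you can compare the list of noderesourcetopologies objects with the list of nodes. This is a minimal sketch; it assumes the NodeResourceTopology API (topology.node.k8s.io) used for the example output earlier in this chapter:

$ oc get noderesourcetopologies.topology.node.k8s.io

$ oc get nodes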

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Get the daemonsets managed by the NUMA Resources Operator. Each daemonset has a corresponding nodeGroup in the NUMAResourcesOperator CR. Run the following command:

    $ oc get numaresourcesoperators.nodetopology.openshift.io numaresourcesoperator -o jsonpath="{.status.daemonsets[0]}"

    Example output

    {"name":"numaresourcesoperator-worker","namespace":"openshift-numaresources"}

  2. Get the label for the daemonset of interest using the value for name from the previous step:

    $ oc get ds -n openshift-numaresources numaresourcesoperator-worker -o jsonpath="{.spec.selector.matchLabels}"

    Example output

    {"name":"resource-topology"}

  3. Get the pods using the resource-topology label by running the following command:

    $ oc get pods -n openshift-numaresources -l name=resource-topology -o wide

    Example output

    NAME                                 READY   STATUS    RESTARTS   AGE    IP            NODE
    numaresourcesoperator-worker-5wm2k   2/2     Running   0          2d1h   10.135.0.64   compute-0.example.com
    numaresourcesoperator-worker-pb75c   2/2     Running   0          2d1h   10.132.2.33   compute-1.example.com

  4. Examine the logs of the resource-topology-exporter container running on the worker pod that corresponds to the node you are troubleshooting. Run the following command:

    $ oc logs -n openshift-numaresources -c resource-topology-exporter numaresourcesoperator-worker-pb75c

    Example output

    I0221 13:38:18.334140       1 main.go:206] using sysinfo:
    reservedCpus: 0,1
    reservedMemory:
      "0": 1178599424
    I0221 13:38:18.334370       1 main.go:67] === System information ===
    I0221 13:38:18.334381       1 sysinfo.go:231] cpus: reserved "0-1"
    I0221 13:38:18.334493       1 sysinfo.go:237] cpus: online "0-103"
    I0221 13:38:18.546750       1 main.go:72]
    cpus: allocatable "2-103"
    hugepages-1Gi:
      numa cell 0 -> 6
      numa cell 1 -> 1
    hugepages-2Mi:
      numa cell 0 -> 64
      numa cell 1 -> 128
    memory:
      numa cell 0 -> 45758Mi
      numa cell 1 -> 48372Mi

6.6.3. Correcting a missing resource topology exporter config map

If you install the NUMA Resources Operator in a cluster with misconfigured cluster settings, in some circumstances, the Operator is shown as active but the logs of the resource topology exporter (RTE) daemon set pods show that the configuration for the RTE is missing, for example:

Info: couldn't find configuration in "/etc/resource-topology-exporter/config.yaml"

This log message indicates that the kubeletconfig with the required configuration was not properly applied in the cluster, resulting in a missing RTE configmap. For example, the following cluster is missing a numaresourcesoperator-worker configmap custom resource (CR):

$ oc get configmap

Example output

NAME                           DATA   AGE
0e2a6bd3.openshift-kni.io      0      6d21h
kube-root-ca.crt               1      6d21h
openshift-service-ca.crt       1      6d21h
topo-aware-scheduler-config    1      6d18h

In a correctly configured cluster, oc get configmap also returns a numaresourcesoperator-worker configmap CR.

Prerequisites

  • Install the OpenShift Container Platform CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Install the NUMA Resources Operator and deploy the NUMA-aware secondary scheduler.

Procedure

  1. Compare the values for spec.machineConfigPoolSelector.matchLabels in kubeletconfig and metadata.labels in the MachineConfigPool (mcp) worker CR using the following commands:

    1. Check the kubeletconfig labels by running the following command:

      $ oc get kubeletconfig -o yaml

      Example output

      machineConfigPoolSelector:
        matchLabels:
          cnf-worker-tuning: enabled

    2. Check the mcp labels by running the following command:

      $ oc get mcp worker -o yaml

      Example output

      labels:
        machineconfiguration.openshift.io/mco-built-in: ""
        pools.operator.machineconfiguration.openshift.io/worker: ""

      The cnf-worker-tuning: enabled label is not present in the MachineConfigPool object.

  2. Edit the MachineConfigPool CR to include the missing label (a non-interactive alternative is shown after this procedure), for example:

    $ oc edit mcp worker -o yaml

    Example output

    labels:
      machineconfiguration.openshift.io/mco-built-in: ""
      pools.operator.machineconfiguration.openshift.io/worker: ""
      cnf-worker-tuning: enabled

  3. Apply the label changes and wait for the cluster to apply the updated configuration.
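
    As an alternative to the interactive edit in step 2, you can apply the missing label non-interactively. This is a minimal sketch that assumes the cnf-worker-tuning: enabled label shown in the kubeletconfig example:

    $ oc label mcp worker cnf-worker-tuning=enabled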

Verification

  • Check that the missing numaresourcesoperator-worker configmap CR is applied:

    $ oc get configmap

    Example output

    NAME                           DATA   AGE
    0e2a6bd3.openshift-kni.io      0      6d21h
    kube-root-ca.crt               1      6d21h
    numaresourcesoperator-worker   1      5m
    openshift-service-ca.crt       1      6d21h
    topo-aware-scheduler-config    1      6d18h

Chapter 7. Scaling the Cluster Monitoring Operator

OpenShift Container Platform exposes metrics that the Cluster Monitoring Operator collects and stores in the Prometheus-based monitoring stack. As an administrator, you can view dashboards for system resources, containers, and component metrics in the OpenShift Container Platform web console by navigating to Observe → Dashboards.

7.1. Prometheus database storage requirements

Red Hat performed various tests for different scale sizes.

Note

The Prometheus storage requirements below are not prescriptive and should be used as a reference. Higher resource consumption might be observed in your cluster depending on workload activity and resource density, including the number of pods, containers, routes, or other resources exposing metrics collected by Prometheus.

Table 7.1. Prometheus Database storage requirements based on number of nodes/pods in the cluster
Number of nodes | Number of pods (2 containers per pod) | Prometheus storage growth per day | Prometheus storage growth per 15 days | Network (per tsdb chunk)
50 | 1800 | 6.3 GB | 94 GB | 16 MB
100 | 3600 | 13 GB | 195 GB | 26 MB
150 | 5400 | 19 GB | 283 GB | 36 MB
200 | 7200 | 25 GB | 375 GB | 46 MB

Approximately 20 percent of the expected size was added as overhead to ensure that the storage requirements do not exceed the calculated value.

The above calculation is for the default OpenShift Container Platform Cluster Monitoring Operator.

Note

CPU utilization has a minor impact. The ratio is approximately 1 core out of 40 per 50 nodes and 1800 pods.

Recommendations for OpenShift Container Platform

  • Use at least two infrastructure (infra) nodes.
  • Use at least three openshift-container-storage nodes with non-volatile memory express (SSD or NVMe) drives.

7.2. Configuring cluster monitoring

You can increase the storage capacity for the Prometheus component in the cluster monitoring stack.

Procedure

To increase the storage capacity for Prometheus:

  1. Create a YAML configuration file, cluster-monitoring-config.yaml. For example:

    apiVersion: v1
    kind: ConfigMap
    data:
      config.yaml: |
        prometheusK8s:
          retention: {{PROMETHEUS_RETENTION_PERIOD}} 1
          nodeSelector:
            node-role.kubernetes.io/infra: ""
          volumeClaimTemplate:
            spec:
              storageClassName: {{STORAGE_CLASS}} 2
              resources:
                requests:
                  storage: {{PROMETHEUS_STORAGE_SIZE}} 3
        alertmanagerMain:
          nodeSelector:
            node-role.kubernetes.io/infra: ""
          volumeClaimTemplate:
            spec:
              storageClassName: {{STORAGE_CLASS}} 4
              resources:
                requests:
                  storage: {{ALERTMANAGER_STORAGE_SIZE}} 5
    metadata:
      name: cluster-monitoring-config
      namespace: openshift-monitoring
    1
    The default value of Prometheus retention is PROMETHEUS_RETENTION_PERIOD=15d. Units are measured in time using one of these suffixes: s, m, h, d.
    2 4
    The storage class for your cluster.
    3
    A typical value is PROMETHEUS_STORAGE_SIZE=2000Gi. Storage values can be a plain integer or a fixed-point integer using one of these suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.
    5
    A typical value is ALERTMANAGER_STORAGE_SIZE=20Gi. Storage values can be a plain integer or a fixed-point integer using one of these suffixes: E, P, T, G, M, K. You can also use the power-of-two equivalents: Ei, Pi, Ti, Gi, Mi, Ki.
  2. Add values for the retention period, storage class, and storage sizes.
  3. Save the file.
  4. Apply the changes by running:

    $ oc create -f cluster-monitoring-config.yaml
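
For reference, a filled-in version of this config map might look like the following sketch. The retention period and storage sizes are the typical values from the callouts above, and the gp3-csi storage class name is only an assumption that must be replaced with a storage class that exists in your cluster:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      volumeClaimTemplate:
        spec:
          storageClassName: gp3-csi # assumption: use a storage class from your cluster
          resources:
            requests:
              storage: 2000Gi
    alertmanagerMain:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      volumeClaimTemplate:
        spec:
          storageClassName: gp3-csi # assumption: use a storage class from your cluster
          resources:
            requests:
              storage: 20Gi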

Chapter 8. Planning your environment according to object maximums

Consider the following tested object maximums when you plan your OpenShift Container Platform cluster.

These guidelines are based on the largest possible cluster. For smaller clusters, the maximums are lower. There are many factors that influence the stated thresholds, including the etcd version or storage data format.

Important

These guidelines apply to OpenShift Container Platform with software-defined networking (SDN), not Open Virtual Network (OVN).

In most cases, exceeding these numbers results in lower overall performance. It does not necessarily mean that the cluster will fail.

Warning

Clusters that experience rapid change, such as those with many starting and stopping pods, can have a lower practical maximum size than documented.

8.1. OpenShift Container Platform tested cluster maximums for major releases

Tested Cloud Platforms for OpenShift Container Platform 3.x: Red Hat OpenStack Platform (RHOSP), Amazon Web Services and Microsoft Azure. Tested Cloud Platforms for OpenShift Container Platform 4.x: Amazon Web Services, Microsoft Azure and Google Cloud Platform.

Maximum type | 3.x tested maximum | 4.x tested maximum
Number of nodes | 2,000 | 2,000 [1]
Number of pods [2] | 150,000 | 150,000
Number of pods per node | 250 | 500 [3]
Number of pods per core | There is no default value. | There is no default value.
Number of namespaces [4] | 10,000 | 10,000
Number of builds | 10,000 (Default pod RAM 512 Mi) - Pipeline Strategy | 10,000 (Default pod RAM 512 Mi) - Source-to-Image (S2I) build strategy
Number of pods per namespace [5] | 25,000 | 25,000
Number of routes and back ends per Ingress Controller | 2,000 per router | 2,000 per router
Number of secrets | 80,000 | 80,000
Number of config maps | 90,000 | 90,000
Number of services [6] | 10,000 | 10,000
Number of services per namespace | 5,000 | 5,000
Number of back-ends per service | 5,000 | 5,000
Number of deployments per namespace [5] | 2,000 | 2,000
Number of build configs | 12,000 | 12,000
Number of custom resource definitions (CRD) | There is no default value. | 512 [7]

  1. Pause pods were deployed to stress the control plane components of OpenShift Container Platform at 2000 node scale.
  2. The pod count displayed here is the number of test pods. The actual number of pods depends on the application’s memory, CPU, and storage requirements.
  3. This was tested on a cluster with 100 worker nodes with 500 pods per worker node. The default maxPods is still 250. To get to 500 maxPods, the cluster must be created with a maxPods set to 500 using a custom kubelet config. If you need 500 user pods, you need a hostPrefix of 22 because there are 10-15 system pods already running on the node. The maximum number of pods with attached persistent volume claims (PVC) depends on storage backend from where PVC are allocated. In our tests, only OpenShift Data Foundation v4 (OCS v4) was able to satisfy the number of pods per node discussed in this document.
  4. When there are a large number of active projects, etcd might suffer from poor performance if the keyspace grows excessively large and exceeds the space quota. Periodic maintenance of etcd, including defragmentation, is highly recommended to free etcd storage.
  5. There are a number of control loops in the system that must iterate over all objects in a given namespace as a reaction to some changes in state. Having a large number of objects of a given type in a single namespace can make those loops expensive and slow down processing given state changes. The limit assumes that the system has enough CPU, memory, and disk to satisfy the application requirements.
  6. Each service port and each service back-end has a corresponding entry in iptables. The number of back-ends of a given service impact the size of the endpoints objects, which impacts the size of data that is being sent all over the system.
  7. OpenShift Container Platform has a limit of 512 total custom resource definitions (CRD), including those installed by OpenShift Container Platform, products integrating with OpenShift Container Platform, and user-created CRDs. If there are more than 512 CRDs created, then there is a possibility that oc command requests may be throttled.
Note

Red Hat does not provide direct guidance on sizing your OpenShift Container Platform cluster. This is because determining whether your cluster is within the supported bounds of OpenShift Container Platform requires careful consideration of all the multidimensional factors that limit the cluster scale.
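
In relation to the custom resource definition limit in footnote 7, you can count how many CRDs currently exist in a cluster with a simple check (a sketch):

$ oc get crd --no-headers | wc -l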

8.2. OpenShift Container Platform environment and configuration on which the cluster maximums are tested

8.2.1. AWS cloud platform

Node | Flavor | vCPU | RAM (GiB) | Disk type | Disk size (GiB)/IOS | Count | Region
Control plane/etcd [1] | r5.4xlarge | 16 | 128 | gp3 | 220 | 3 | us-west-2
Infra [2] | m5.12xlarge | 48 | 192 | gp3 | 100 | 3 | us-west-2
Workload [3] | m5.4xlarge | 16 | 64 | gp3 | 500 [4] | 1 | us-west-2
Compute | m5.2xlarge | 8 | 32 | gp3 | 100 | 3/25/250/500 [5] | us-west-2

  1. gp3 disks with a baseline performance of 3000 IOPS and 125 MiB per second are used for control plane/etcd nodes because etcd is latency sensitive. gp3 volumes do not use burst performance.
  2. Infra nodes are used to host Monitoring, Ingress, and Registry components to ensure they have enough resources to run at large scale.
  3. Workload node is dedicated to run performance and scalability workload generators.
  4. Larger disk size is used so that there is enough space to store the large amounts of data that is collected during the performance and scalability test run.
  5. Cluster is scaled in iterations and performance and scalability tests are executed at the specified node counts.

8.2.2. IBM Power platform

Node | vCPU | RAM (GiB) | Disk type | Disk size (GiB)/IOS | Count
Control plane/etcd [1] | 16 | 32 | io1 | 120 / 10 IOPS per GiB | 3
Infra [2] | 16 | 64 | gp2 | 120 | 2
Workload [3] | 16 | 256 | gp2 | 120 [4] | 1
Compute | 16 | 64 | gp2 | 120 | 2 to 100 [5]

  1. io1 disks with 120 / 10 IOPS per GiB are used for control plane/etcd nodes as etcd is I/O intensive and latency sensitive.
  2. Infra nodes are used to host Monitoring, Ingress, and Registry components to ensure they have enough resources to run at large scale.
  3. Workload node is dedicated to run performance and scalability workload generators.
  4. Larger disk size is used so that there is enough space to store the large amounts of data that is collected during the performance and scalability test run.
  5. Cluster is scaled in iterations.

8.2.3. IBM Z platform

Node | vCPU [4] | RAM (GiB) [5] | Disk type | Disk size (GiB)/IOS | Count
Control plane/etcd [1,2] | 8 | 32 | ds8k | 300 / LCU 1 | 3
Compute [1,3] | 8 | 32 | ds8k | 150 / LCU 2 | 4 nodes (scaled to 100/250/500 pods per node)

  1. Nodes are distributed between two logical control units (LCUs) to optimize disk I/O load of the control plane/etcd nodes as etcd is I/O intensive and latency sensitive. Etcd I/O demand should not interfere with other workloads.
  2. Four compute nodes are used for the tests running several iterations with 100/250/500 pods at the same time. First, idling pods were used to evaluate if pods can be instanced. Next, a network and CPU demanding client/server workload were used to evaluate the stability of the system under stress. Client and server pods were pairwise deployed and each pair was spread over two compute nodes.
  3. No separate workload node was used. The workload simulates a microservice workload between two compute nodes.
  4. Physical number of processors used is six Integrated Facilities for Linux (IFLs).
  5. Total physical memory used is 512 GiB.

8.3. How to plan your environment according to tested cluster maximums

Important

Oversubscribing the physical resources on a node affects resource guarantees the Kubernetes scheduler makes during pod placement. Learn what measures you can take to avoid memory swapping.

Some of the tested maximums are stretched only in a single dimension. They will vary when many objects are running on the cluster.

The numbers noted in this documentation are based on Red Hat’s test methodology, setup, configuration, and tunings. These numbers can vary based on your own individual setup and environments.

While planning your environment, determine how many pods are expected to fit per node:

required pods per cluster / pods per node = total number of nodes needed

The default maximum number of pods per node is 250. However, the number of pods that fit on a node is dependent on the application itself. Consider the application’s memory, CPU, and storage requirements, as described in "How to plan your environment according to application requirements".

Example scenario

If you want to scope your cluster for 2200 pods per cluster, you would need at least five nodes, assuming that there are 500 maximum pods per node:

2200 / 500 = 4.4

If you increase the number of nodes to 20, then the pod distribution changes to 110 pods per node:

2200 / 20 = 110

Where:

required pods per cluster / total number of nodes = expected pods per node

OpenShift Container Platform comes with several system pods, such as SDN, DNS, Operators, and others, which run across every worker node by default. Therefore, the result of the above formula can vary.

8.4. How to plan your environment according to application requirements

Consider an example application environment:

Pod type | Pod quantity | Max memory | CPU cores | Persistent storage
apache | 100 | 500 MB | 0.5 | 1 GB
node.js | 200 | 1 GB | 1 | 1 GB
postgresql | 100 | 1 GB | 2 | 10 GB
JBoss EAP | 100 | 1 GB | 1 | 1 GB

Extrapolated requirements: 550 CPU cores, 450 GB RAM, and 1.4 TB storage.

Instance size for nodes can be modulated up or down, depending on your preference. Nodes are often resource overcommitted. In this deployment scenario, you can choose to run additional smaller nodes or fewer larger nodes to provide the same amount of resources. Factors such as operational agility and cost-per-instance should be considered.

Node type | Quantity | CPUs | RAM (GB)
Nodes (option 1) | 100 | 4 | 16
Nodes (option 2) | 50 | 8 | 32
Nodes (option 3) | 25 | 16 | 64

Some applications lend themselves well to overcommitted environments and some do not. Most Java applications and applications that use huge pages are examples of applications that do not tolerate overcommitment, because that memory cannot be used for other applications. In the example above, the environment would be roughly 30 percent overcommitted, a common ratio.

The application pods can access a service either by using environment variables or DNS. If using environment variables, the variables are injected by the kubelet for each active service when a pod is run on a node. A cluster-aware DNS server watches the Kubernetes API for new services and creates a set of DNS records for each one. If DNS is enabled throughout your cluster, then all pods should automatically be able to resolve services by their DNS name. Use DNS for service discovery if you must go beyond 5000 services. When environment variables are used for service discovery, the argument list exceeds the allowed length after 5000 services in a namespace, and the pods and deployments start failing. Disable the service links in the deployment’s service specification file to overcome this:

---
apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: deployment-config-template
  creationTimestamp:
  annotations:
    description: This template will create a deploymentConfig with 1 replica, 4 env vars and a service.
    tags: ''
objects:
- apiVersion: apps.openshift.io/v1
  kind: DeploymentConfig
  metadata:
    name: deploymentconfig${IDENTIFIER}
  spec:
    template:
      metadata:
        labels:
          name: replicationcontroller${IDENTIFIER}
      spec:
        enableServiceLinks: false
        containers:
        - name: pause${IDENTIFIER}
          image: "${IMAGE}"
          ports:
          - containerPort: 8080
            protocol: TCP
          env:
          - name: ENVVAR1_${IDENTIFIER}
            value: "${ENV_VALUE}"
          - name: ENVVAR2_${IDENTIFIER}
            value: "${ENV_VALUE}"
          - name: ENVVAR3_${IDENTIFIER}
            value: "${ENV_VALUE}"
          - name: ENVVAR4_${IDENTIFIER}
            value: "${ENV_VALUE}"
          resources: {}
          imagePullPolicy: IfNotPresent
          capabilities: {}
          securityContext:
            capabilities: {}
            privileged: false
        restartPolicy: Always
        serviceAccount: ''
    replicas: 1
    selector:
      name: replicationcontroller${IDENTIFIER}
    triggers:
    - type: ConfigChange
    strategy:
      type: Rolling
- apiVersion: v1
  kind: Service
  metadata:
    name: service${IDENTIFIER}
  spec:
    selector:
      name: replicationcontroller${IDENTIFIER}
    ports:
    - name: serviceport${IDENTIFIER}
      protocol: TCP
      port: 80
      targetPort: 8080
    clusterIP: ''
    type: ClusterIP
    sessionAffinity: None
  status:
    loadBalancer: {}
parameters:
- name: IDENTIFIER
  description: Number to append to the name of resources
  value: '1'
  required: true
- name: IMAGE
  description: Image to use for deploymentConfig
  value: gcr.io/google-containers/pause-amd64:3.0
  required: false
- name: ENV_VALUE
  description: Value to use for environment variables
  generate: expression
  from: "[A-Za-z0-9]{255}"
  required: false
labels:
  template: deployment-config-template

The number of application pods that can run in a namespace is dependent on the number of services and the length of the service name when the environment variables are used for service discovery. ARG_MAX on the system defines the maximum argument length for a new process and it is set to 2097152 bytes (2 MiB) by default. The kubelet injects environment variables into each pod scheduled to run in the namespace, including:

  • <SERVICE_NAME>_SERVICE_HOST=<IP>
  • <SERVICE_NAME>_SERVICE_PORT=<PORT>
  • <SERVICE_NAME>_PORT=tcp://<IP>:<PORT>
  • <SERVICE_NAME>_PORT_<PORT>_TCP=tcp://<IP>:<PORT>
  • <SERVICE_NAME>_PORT_<PORT>_TCP_PROTO=tcp
  • <SERVICE_NAME>_PORT_<PORT>_TCP_PORT=<PORT>
  • <SERVICE_NAME>_PORT_<PORT>_TCP_ADDR=<ADDR>

The pods in the namespace start to fail if the argument length exceeds the allowed value; the number of characters in each service name affects how quickly the limit is reached. For example, in a namespace with 5000 services, the limit on the service name is 33 characters, which enables you to run 5000 pods in the namespace.
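
To gauge how close a namespace is to this limit, you can inspect the environment that is actually injected into a running pod. This is a rough sketch; the pod and namespace names are placeholders, and it assumes the container image provides a shell:

$ oc exec <pod_name> -n <namespace> -- sh -c 'env | grep -c _SERVICE_HOST='

$ oc exec <pod_name> -n <namespace> -- sh -c 'env | wc -c'

The first command counts the services visible to the pod, and the second reports the approximate number of bytes that the injected environment occupies.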

Chapter 9. Optimizing storage

Optimizing storage helps to minimize storage use across all resources. By optimizing storage, administrators help ensure that existing storage resources are working in an efficient manner.

9.1. Available persistent storage options

Understand your persistent storage options so that you can optimize your OpenShift Container Platform environment.

Table 9.1. Available storage options
Storage type: Block

  • Presented to the operating system (OS) as a block device
  • Suitable for applications that need full control of storage and operate at a low level on files bypassing the file system
  • Also referred to as a Storage Area Network (SAN)
  • Non-shareable, which means that only one client at a time can mount an endpoint of this type

Examples: AWS EBS and VMware vSphere support dynamic persistent volume (PV) provisioning natively in OpenShift Container Platform.

Storage type: File

  • Presented to the OS as a file system export to be mounted
  • Also referred to as Network Attached Storage (NAS)
  • Concurrency, latency, file locking mechanisms, and other capabilities vary widely between protocols, implementations, vendors, and scales.

Examples: RHEL NFS, NetApp NFS [1], and Vendor NFS

Storage type: Object

  • Accessible through a REST API endpoint
  • Configurable for use in the OpenShift image registry
  • Applications must build their drivers into the application and/or container.

Examples: AWS S3

  1. NetApp NFS supports dynamic PV provisioning when using the Trident plugin.
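
To see which of these storage options are available for dynamic provisioning in your own cluster, you can list the configured storage classes:

$ oc get storageclass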

9.3. Data storage management

The following table summarizes the main directories that OpenShift Container Platform components write data to.

Table 9.3. Main directories for storing OpenShift Container Platform data
Directory: /var/log
Notes: Log files for all components.
Sizing: 10 to 30 GB.
Expected growth: Log files can grow quickly; size can be managed by growing disks or by using log rotate.

Directory: /var/lib/etcd
Notes: Used for etcd storage when storing the database.
Sizing: Less than 20 GB. Database can grow up to 8 GB.
Expected growth: Will grow slowly with the environment. Only storing metadata. Additional 20-25 GB for every additional 8 GB of memory.

Directory: /var/lib/containers
Notes: This is the mount point for the CRI-O runtime. Storage used for active container runtimes, including pods, and storage of local images. Not used for registry storage.
Sizing: 50 GB for a node with 16 GB memory. Note that this sizing should not be used to determine minimum cluster requirements. Additional 20-25 GB for every additional 8 GB of memory.
Expected growth: Growth is limited by capacity for running containers.

Directory: /var/lib/kubelet
Notes: Ephemeral volume storage for pods. This includes anything external that is mounted into a container at runtime. Includes environment variables, kube secrets, and data volumes not backed by persistent volumes.
Sizing: Varies
Expected growth: Minimal if pods requiring storage are using persistent volumes. If using ephemeral storage, this can grow quickly.
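
To check how much space these directories currently consume on a given node, you can use a debug pod. This is a sketch; the node name is a placeholder, and /var/lib/etcd is only populated on control plane nodes:

$ oc debug node/<node_name> -- chroot /host df -h /var/log /var/lib/etcd /var/lib/containers /var/lib/kubelet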

9.4. Optimizing storage performance for Microsoft Azure

OpenShift Container Platform and Kubernetes are sensitive to disk performance, and faster storage is recommended, particularly for etcd on the control plane nodes.

For production Azure clusters and clusters with intensive workloads, the virtual machine operating system disk for control plane machines should be able to sustain a tested and recommended minimum throughput of 5000 IOPS / 200 MBps. This throughput can be provided by having a minimum of 1 TiB Premium SSD (P30). In Azure and Azure Stack Hub, disk performance is directly dependent on SSD disk sizes. To achieve the throughput supported by a Standard_D8s_v3 virtual machine, or other similar machine types, and the target of 5000 IOPS, at least a P30 disk is required.

Host caching must be set to ReadOnly for low latency and high IOPS and throughput when reading data. Reading data from the cache, which is present either in the VM memory or in the local SSD disk, is much faster than reading from the disk, which is in the blob storage.

9.5. Additional resources

Chapter 10. Optimizing routing

The OpenShift Container Platform HAProxy router can be scaled or configured to optimize performance.

10.1. Baseline Ingress Controller (router) performance

The OpenShift Container Platform Ingress Controller, or router, is the ingress point for ingress traffic for applications and services that are configured using routes and ingresses.

When evaluating the performance of a single HAProxy router in terms of HTTP requests handled per second, the performance varies depending on many factors, in particular:

  • HTTP keep-alive/close mode
  • Route type
  • TLS session resumption client support
  • Number of concurrent connections per target route
  • Number of target routes
  • Back end server page size
  • Underlying infrastructure (network/SDN solution, CPU, and so on)

While performance in your specific environment will vary, Red Hat lab tests were performed on a public cloud instance of size 4 vCPU/16GB RAM. A single HAProxy router handling 100 routes terminated by back ends serving 1 kB static pages is able to handle the following number of transactions per second.

In HTTP keep-alive mode scenarios:

Encryption | LoadBalancerService | HostNetwork
none | 21515 | 29622
edge | 16743 | 22913
passthrough | 36786 | 53295
re-encrypt | 21583 | 25198

In HTTP close (no keep-alive) scenarios:

Encryption | LoadBalancerService | HostNetwork
none | 5719 | 8273
edge | 2729 | 4069
passthrough | 4121 | 5344
re-encrypt | 2320 | 2941

The default Ingress Controller configuration was used with the spec.tuningOptions.threadCount field set to 4. Two different endpoint publishing strategies were tested: Load Balancer Service and Host Network. TLS session resumption was used for encrypted routes. With HTTP keep-alive, a single HAProxy router is capable of saturating a 1 Gbit NIC at page sizes as small as 8 kB.

When running on bare metal with modern processors, you can expect roughly twice the performance of the public cloud instance above. This is because of the overhead introduced by the virtualization layer on public clouds, and it holds mostly true for private cloud-based virtualization as well. The following table is a guide to how many applications to use behind the router:

Number of applications | Application type
5-10 | static file/web server or caching proxy
100-1000 | applications generating dynamic content

In general, HAProxy can support routes for up to 1000 applications, depending on the technology in use. Ingress Controller performance might be limited by the capabilities and performance of the applications behind it, such as language or static versus dynamic content.

Ingress, or router, sharding should be used to serve more routes towards applications and help horizontally scale the routing tier.

For more information on Ingress sharding, see Configuring Ingress Controller sharding by using route labels and Configuring Ingress Controller sharding by using namespace labels.

For more information on tuningOptions, see Ingress Controller configuration parameters.

You can modify the Ingress Controller deployment using the information provided in Setting Ingress Controller thread count for threads and Ingress Controller configuration parameters for timeouts, and other tuning configurations in the Ingress Controller specification.
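
For example, the thread count used in the tests above can be adjusted on the default Ingress Controller with a merge patch similar to the following sketch; the value of 8 is only an illustration:

$ oc -n openshift-ingress-operator patch ingresscontroller/default --type=merge -p '{"spec":{"tuningOptions":{"threadCount":8}}}'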

Chapter 11. Optimizing networking

The OpenShift SDN uses Open vSwitch, virtual extensible LAN (VXLAN) tunnels, OpenFlow rules, and iptables. This network can be tuned by using jumbo frames, network interface controller (NIC) offloads, multi-queue, and ethtool settings.

OVN-Kubernetes uses Geneve (Generic Network Virtualization Encapsulation) instead of VXLAN as the tunnel protocol.

VXLAN provides benefits over VLANs, such as an increase in networks from 4096 to over 16 million, and layer 2 connectivity across physical networks. This allows for all pods behind a service to communicate with each other, even if they are running on different systems.

VXLAN encapsulates all tunneled traffic in user datagram protocol (UDP) packets. However, this leads to increased CPU utilization. Both the outer and inner packets are subject to normal checksumming rules to guarantee that data is not corrupted during transit. Depending on CPU performance, this additional processing overhead can cause a reduction in throughput and increased latency when compared to traditional, non-overlay networks.

Cloud, VM, and bare metal CPUs can generally handle much more than one Gbps of network throughput. When using higher bandwidth links such as 10 or 40 Gbps, reduced performance can occur. This is a known issue in VXLAN-based environments and is not specific to containers or OpenShift Container Platform. Any network that relies on VXLAN tunnels will perform similarly because of the VXLAN implementation.

If you are looking to push beyond one Gbps, you can:

  • Evaluate network plugins that implement different routing techniques, such as border gateway protocol (BGP).
  • Use VXLAN-offload capable network adapters. VXLAN-offload moves the packet checksum calculation and associated CPU overhead off of the system CPU and onto dedicated hardware on the network adapter. This frees up CPU cycles for use by pods and applications, and allows users to utilize the full bandwidth of their network infrastructure.

VXLAN-offload does not reduce latency. However, CPU utilization is reduced even in latency tests.

11.1. Optimizing the MTU for your network

There are two important maximum transmission units (MTUs): the network interface controller (NIC) MTU and the cluster network MTU.

The NIC MTU is only configured at the time of OpenShift Container Platform installation. The MTU must be less than or equal to the maximum supported value of the NIC of your network. If you are optimizing for throughput, choose the largest possible value. If you are optimizing for lowest latency, choose a lower value.

The OpenShift SDN network plugin overlay MTU must be less than the NIC MTU by 50 bytes at a minimum. This accounts for the SDN overlay header. So, on a normal ethernet network, this should be set to 1450. On a jumbo frame ethernet network, this should be set to 8950. These values should be set automatically by the Cluster Network Operator based on the NIC’s configured MTU. Therefore, cluster administrators do not typically update these values. Amazon Web Services (AWS) and bare-metal environments support jumbo frame ethernet networks. This setting will help throughput, especially with transmission control protocol (TCP).

For OVN and Geneve, the MTU must be less than the NIC MTU by 100 bytes at a minimum.

Note

This 50 byte overlay header is relevant to the OpenShift SDN network plugin. Other SDN solutions might require the value to be more or less.
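
To check the overlay MTU currently in effect, you can read the cluster network configuration. This sketch assumes the clusterNetworkMTU field reported in the status of the cluster Network object:

$ oc get network.config.openshift.io cluster -o jsonpath='{.status.clusterNetworkMTU}{"\n"}'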

11.3. Impact of IPsec

Because encrypting and decrypting traffic on node hosts uses CPU power, enabling encryption affects both throughput and CPU usage on the nodes, regardless of the IP security system being used.

IPsec encrypts traffic at the IP payload level, before it hits the NIC, protecting fields that would otherwise be used for NIC offloading. This means that some NIC acceleration features might not be usable when IPsec is enabled, which leads to decreased throughput and increased CPU usage.

11.4. Additional resources

Chapter 12. Managing bare metal hosts

When you install OpenShift Container Platform on a bare metal cluster, you can provision and manage bare metal nodes using machine and machineset custom resources (CRs) for bare metal hosts that exist in the cluster.

12.1. About bare metal hosts and nodes

To provision a Red Hat Enterprise Linux CoreOS (RHCOS) bare metal host as a node in your cluster, first create a MachineSet custom resource (CR) object that corresponds to the bare metal host hardware. Bare metal host machine sets describe infrastructure components specific to your configuration. You apply specific Kubernetes labels to these machine sets and then update the infrastructure components to run on only those machines.

Machine CRs are created automatically when you scale up the relevant MachineSet containing a metal3.io/autoscale-to-hosts annotation. OpenShift Container Platform uses Machine CRs to provision the bare metal node that corresponds to the host as specified in the MachineSet CR.

12.2. Maintaining bare metal hosts

You can maintain the details of the bare metal hosts in your cluster from the OpenShift Container Platform web console. Navigate to Compute → Bare Metal Hosts, and select a task from the Actions drop-down menu. Here you can manage items such as BMC details, boot MAC address for the host, enable power management, and so on. You can also review the details of the network interfaces and drives for the host.

You can move a bare metal host into maintenance mode. When you move a host into maintenance mode, the scheduler moves all managed workloads off the corresponding bare metal node. No new workloads are scheduled while in maintenance mode.

You can deprovision a bare metal host in the web console. Deprovisioning a host does the following actions:

  1. Annotates the bare metal host CR with cluster.k8s.io/delete-machine: true
  2. Scales down the related machine set
Note

Powering off the host without first moving the daemon set and unmanaged static pods to another node can cause service disruption and loss of data.

12.2.1. Adding a bare metal host to the cluster using the web console

You can add bare metal hosts to the cluster in the web console.

Prerequisites

  • Install an RHCOS cluster on bare metal.
  • Log in as a user with cluster-admin privileges.

Procedure

  1. In the web console, navigate to Compute → Bare Metal Hosts.
  2. Select Add Host → New with Dialog.
  3. Specify a unique name for the new bare metal host.
  4. Set the Boot MAC address.
  5. Set the Baseboard Management Controller (BMC) Address.
  6. Enter the user credentials for the host’s baseboard management controller (BMC).
  7. Select to power on the host after creation, and select Create.
  8. Scale up the number of replicas to match the number of available bare metal hosts. Navigate to Compute → MachineSets, and increase the number of machine replicas in the cluster by selecting Edit Machine count from the Actions drop-down menu.
Note

You can also manage the number of bare metal nodes using the oc scale command and the appropriate bare metal machine set.
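
For example, the following sketch scales the machine set to match the number of hosts; the machine set name and replica count are placeholders:

$ oc scale machineset <machineset_name> -n openshift-machine-api --replicas=<number_of_bare_metal_hosts>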

12.2.2. Adding a bare metal host to the cluster using YAML in the web console

You can add bare metal hosts to the cluster in the web console using a YAML file that describes the bare metal host.

Prerequisites

  • Install a RHCOS compute machine on bare metal infrastructure for use in the cluster.
  • Log in as a user with cluster-admin privileges.
  • Create a Secret CR for the bare metal host.

Procedure

  1. In the web console, navigate to Compute → Bare Metal Hosts.
  2. Select Add Host → New from YAML.
  3. Copy and paste the below YAML, modifying the relevant fields with the details of your host:

    apiVersion: metal3.io/v1alpha1
    kind: BareMetalHost
    metadata:
      name: <bare_metal_host_name>
    spec:
      online: true
      bmc:
        address: <bmc_address>
        credentialsName: <secret_credentials_name>  1
        disableCertificateVerification: True 2
      bootMACAddress: <host_boot_mac_address>
    1
    credentialsName must reference a valid Secret CR. The baremetal-operator cannot manage the bare metal host without a valid Secret referenced in the credentialsName. For more information about secrets and how to create them, see Understanding secrets.
    2
    Setting disableCertificateVerification to true disables TLS host validation between the cluster and the baseboard management controller (BMC).
  4. Select Create to save the YAML and create the new bare metal host.
  5. Scale up the number of replicas to match the number of available bare metal hosts. Navigate to Compute → MachineSets, and increase the number of machines in the cluster by selecting Edit Machine count from the Actions drop-down menu.

    Note

    You can also manage the number of bare metal nodes using the oc scale command and the appropriate bare metal machine set.

12.2.3. Automatically scaling machines to the number of available bare metal hosts

To automatically create the number of Machine objects that matches the number of available BareMetalHost objects, add a metal3.io/autoscale-to-hosts annotation to the MachineSet object.

Prerequisites

  • Install RHCOS bare metal compute machines for use in the cluster, and create corresponding BareMetalHost objects.
  • Install the OpenShift Container Platform CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Annotate the machine set that you want to configure for automatic scaling by adding the metal3.io/autoscale-to-hosts annotation. Replace <machineset> with the name of the machine set.

    $ oc annotate machineset <machineset> -n openshift-machine-api 'metal3.io/autoscale-to-hosts=<any_value>'

    Wait for the new scaled machines to start.

Note

When you use a BareMetalHost object to create a machine in the cluster and labels or selectors are subsequently changed on the BareMetalHost, the BareMetalHost object continues to be counted against the MachineSet that the Machine object was created from.

12.2.4. Removing bare metal hosts from the provisioner node

In certain circumstances, you might want to temporarily remove bare metal hosts from the provisioner node. For example, during provisioning when a bare metal host reboot is triggered by using the OpenShift Container Platform administration console or as a result of a Machine Config Pool update, OpenShift Container Platform logs into the integrated Dell Remote Access Controller (iDRAC) and issues a delete of the job queue.

To temporarily prevent this management behavior, add a baremetalhost.metal3.io/detached annotation to the MachineSet object.

Note

This annotation has an effect for only BareMetalHost objects that are in either Provisioned, ExternallyProvisioned or Ready/Available state.

Prerequisites

  • Install RHCOS bare metal compute machines for use in the cluster and create corresponding BareMetalHost objects.
  • Install the OpenShift Container Platform CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Annotate the compute machine set that you want to remove from the provisioner node by adding the baremetalhost.metal3.io/detached annotation.

    $ oc annotate machineset <machineset> -n openshift-machine-api 'baremetalhost.metal3.io/detached=true'

    Wait for the new machines to start.

    Note

    When you use a BareMetalHost object to create a machine in the cluster and labels or selectors are subsequently changed on the BareMetalHost, the BareMetalHost object continues to be counted against the MachineSet that the Machine object was created from.

  2. In the provisioning use case, remove the annotation after the reboot is complete by using the following command:

    $ oc annotate machineset <machineset> -n openshift-machine-api 'baremetalhost.metal3.io/detached-'

Chapter 13. Monitoring bare-metal events with the Bare Metal Event Relay

Important

Bare Metal Event Relay is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

13.1. About bare-metal events

Use the Bare Metal Event Relay to subscribe applications that run in your OpenShift Container Platform cluster to events that are generated on the underlying bare-metal host. The Redfish service publishes events on a node and transmits them on an advanced message queue to subscribed applications.

Bare-metal events are based on the open Redfish standard that is developed under the guidance of the Distributed Management Task Force (DMTF). Redfish provides a secure industry-standard protocol with a REST API. The protocol is used for the management of distributed, converged or software-defined resources and infrastructure.

Hardware-related events published through Redfish include:

  • Breaches of temperature limits
  • Server status
  • Fan status

Begin using bare-metal events by deploying the Bare Metal Event Relay Operator and subscribing your application to the service. The Bare Metal Event Relay Operator installs and manages the lifecycle of the Redfish bare-metal event service.

Note

The Bare Metal Event Relay works only with Redfish-capable devices on single-node clusters provisioned on bare-metal infrastructure.

13.2. How bare-metal events work

The Bare Metal Event Relay enables applications running on bare-metal clusters to respond quickly to Redfish hardware changes and failures such as breaches of temperature thresholds, fan failure, disk loss, power outages, and memory failure. These hardware events are delivered over a reliable low-latency transport channel based on Advanced Message Queuing Protocol (AMQP). The latency of the messaging service is between 10 and 20 milliseconds.

The Bare Metal Event Relay provides a publish-subscribe service for the hardware events, where multiple applications can use REST APIs to subscribe and consume the events. The Bare Metal Event Relay supports hardware that complies with Redfish OpenAPI v1.8 or higher.

13.2.1. Bare Metal Event Relay data flow

The following figure illustrates an example of bare-metal events data flow:

Figure 13.1. Bare Metal Event Relay data flow

13.2.1.1. Operator-managed pod

The Operator uses custom resources to manage the pod containing the Bare Metal Event Relay and its components using the HardwareEvent CR.

13.2.1.2. Bare Metal Event Relay

At startup, the Bare Metal Event Relay queries the Redfish API and downloads all the message registries, including custom registries. The Bare Metal Event Relay then begins to receive subscribed events from the Redfish hardware.

The Bare Metal Event Relay enables applications running on bare-metal clusters to respond quickly to Redfish hardware changes and failures such as breaches of temperature thresholds, fan failure, disk loss, power outages, and memory failure. The events are reported using the HardwareEvent CR.

13.2.1.3. Cloud native event

Cloud native events (CNE) is a REST API specification for defining the format of event data.

13.2.1.4. CNCF CloudEvents

CloudEvents is a vendor-neutral specification developed by the Cloud Native Computing Foundation (CNCF) for defining the format of event data.

13.2.1.5. AMQP dispatch router

The dispatch router is responsible for the message delivery service between publisher and subscriber. AMQP 1.0 qpid is an open standard that supports reliable, high-performance, fully-symmetrical messaging over the internet.

13.2.1.6. Cloud event proxy sidecar

The cloud event proxy sidecar container image is based on the O-RAN API specification and provides a publish-subscribe event framework for hardware events.

13.2.2. Redfish message parsing service

In addition to handling Redfish events, the Bare Metal Event Relay provides message parsing for events without a Message property. The proxy downloads all the Redfish message registries including vendor specific registries from the hardware when it starts. If an event does not contain a Message property, the proxy uses the Redfish message registries to construct the Message and Resolution properties and add them to the event before passing the event to the cloud events framework. This service allows Redfish events to have smaller message size and lower transmission latency.

13.2.3. Installing the Bare Metal Event Relay using the CLI

As a cluster administrator, you can install the Bare Metal Event Relay Operator by using the CLI.

Prerequisites

  • A cluster that is installed on bare-metal hardware with nodes that have a Redfish-enabled Baseboard Management Controller (BMC).
  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a namespace for the Bare Metal Event Relay.

    1. Save the following YAML in the bare-metal-events-namespace.yaml file:

      apiVersion: v1
      kind: Namespace
      metadata:
        name: openshift-bare-metal-events
        labels:
          name: openshift-bare-metal-events
          openshift.io/cluster-monitoring: "true"
    2. Create the Namespace CR:

      $ oc create -f bare-metal-events-namespace.yaml
  2. Create an Operator group for the Bare Metal Event Relay Operator.

    1. Save the following YAML in the bare-metal-events-operatorgroup.yaml file:

      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: bare-metal-event-relay-group
        namespace: openshift-bare-metal-events
      spec:
        targetNamespaces:
        - openshift-bare-metal-events
    2. Create the OperatorGroup CR:

      $ oc create -f bare-metal-events-operatorgroup.yaml
  3. Subscribe to the Bare Metal Event Relay.

    1. Save the following YAML in the bare-metal-events-sub.yaml file:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: bare-metal-event-relay-subscription
        namespace: openshift-bare-metal-events
      spec:
        channel: "stable"
        name: bare-metal-event-relay
        source: redhat-operators
        sourceNamespace: openshift-marketplace
    2. Create the Subscription CR:

      $ oc create -f bare-metal-events-sub.yaml

Verification

To verify that the Bare Metal Event Relay Operator is installed, run the following command:

$ oc get csv -n openshift-bare-metal-events -o custom-columns=Name:.metadata.name,Phase:.status.phase

Example output

Name                                                          Phase
bare-metal-event-relay.4.11.0-xxxxxxxxxxxx            Succeeded

13.2.4. Installing the Bare Metal Event Relay using the web console

As a cluster administrator, you can install the Bare Metal Event Relay Operator using the web console.

Prerequisites

  • A cluster that is installed on bare-metal hardware with nodes that have a Redfish-enabled Baseboard Management Controller (BMC).
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Install the Bare Metal Event Relay using the OpenShift Container Platform web console:

    1. In the OpenShift Container Platform web console, click Operators → OperatorHub.
    2. Choose Bare Metal Event Relay from the list of available Operators, and then click Install.
    3. On the Install Operator page, select or create a Namespace, select openshift-bare-metal-events, and then click Install.

Verification

Optional: You can verify that the Operator installed successfully by performing the following check:

  1. Switch to the Operators → Installed Operators page.
  2. Ensure that Bare Metal Event Relay is listed in the project with a Status of InstallSucceeded.

    Note

    During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.

If the Operator does not appear as installed, to troubleshoot further:

  • Go to the Operators → Installed Operators page and inspect the Operator Subscriptions and Install Plans tabs for any failure or errors under Status.
  • Go to the Workloads → Pods page and check the logs for pods in the project namespace.

13.3. Installing the AMQ messaging bus

To pass Redfish bare-metal event notifications between publisher and subscriber on a node, you must install and configure an AMQ messaging bus to run locally on the node. You do this by installing the AMQ Interconnect Operator for use in the cluster.

Prerequisites

  • Install the OpenShift Container Platform CLI (oc).
  • Log in as a user with cluster-admin privileges.

Procedure

  • Install the AMQ Interconnect Operator to its own amq-interconnect namespace.

Verification

  1. Verify that the AMQ Interconnect Operator is available and the required pods are running:

    $ oc get pods -n amq-interconnect

    Example output

    NAME                                    READY   STATUS    RESTARTS   AGE
    amq-interconnect-645db76c76-k8ghs       1/1     Running   0          23h
    interconnect-operator-5cb5fc7cc-4v7qm   1/1     Running   0          23h

  2. Verify that the required bare-metal event producer pod for the Bare Metal Event Relay is running in the openshift-bare-metal-events namespace:

    $ oc get pods -n openshift-bare-metal-events

    Example output

    NAME                                                            READY   STATUS    RESTARTS   AGE
    hw-event-proxy-operator-controller-manager-74d5649b7c-dzgtl     2/2     Running   0          25s

13.4. Subscribing to Redfish BMC bare-metal events for a cluster node

As a cluster administrator, you can subscribe to Redfish BMC events generated on a node in your cluster by creating a BMCEventSubscription custom resource (CR) for the node, creating a HardwareEvent CR for the event, and a Secret CR for the BMC.

13.4.1. Subscribing to bare-metal events

You can configure the baseboard management controller (BMC) to send bare-metal events to subscribed applications running in an OpenShift Container Platform cluster. Example Redfish bare-metal events include an increase in device temperature, or removal of a device. You subscribe applications to bare-metal events using a REST API.

Important

You can only create a BMCEventSubscription custom resource (CR) for physical hardware that supports Redfish and has a vendor interface set to redfish or idrac-redfish.

Note

Use the BMCEventSubscription CR to subscribe to predefined Redfish events. The Redfish standard does not provide an option to create specific alerts and thresholds. For example, to receive an alert event when an enclosure’s temperature exceeds 40° Celsius, you must manually configure the event according to the vendor’s recommendations.

Perform the following procedure to subscribe to bare-metal events for the node using a BMCEventSubscription CR.

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Get the user name and password for the BMC.
  • Deploy a bare-metal node with a Redfish-enabled Baseboard Management Controller (BMC) in your cluster, and enable Redfish events on the BMC.

    Note

    Enabling Redfish events on specific hardware is outside the scope of this information. For more information about enabling Redfish events for your specific hardware, consult the BMC manufacturer documentation.

Procedure

  1. Confirm that the node hardware has the Redfish EventService enabled by running the following curl command:

    $ curl https://<bmc_ip_address>/redfish/v1/EventService --insecure -H 'Content-Type: application/json' -u "<bmc_username>:<password>"

    where:

    bmc_ip_address
    is the IP address of the BMC where the Redfish events are generated.

    Example output

    {
       "@odata.context": "/redfish/v1/$metadata#EventService.EventService",
       "@odata.id": "/redfish/v1/EventService",
       "@odata.type": "#EventService.v1_0_2.EventService",
       "Actions": {
          "#EventService.SubmitTestEvent": {
             "EventType@Redfish.AllowableValues": ["StatusChange", "ResourceUpdated", "ResourceAdded", "ResourceRemoved", "Alert"],
             "target": "/redfish/v1/EventService/Actions/EventService.SubmitTestEvent"
          }
       },
       "DeliveryRetryAttempts": 3,
       "DeliveryRetryIntervalSeconds": 30,
       "Description": "Event Service represents the properties for the service",
       "EventTypesForSubscription": ["StatusChange", "ResourceUpdated", "ResourceAdded", "ResourceRemoved", "Alert"],
       "EventTypesForSubscription@odata.count": 5,
       "Id": "EventService",
       "Name": "Event Service",
       "ServiceEnabled": true,
       "Status": {
          "Health": "OK",
          "HealthRollup": "OK",
          "State": "Enabled"
       },
       "Subscriptions": {
          "@odata.id": "/redfish/v1/EventService/Subscriptions"
       }
    }

  2. Get the Bare Metal Event Relay service route for the cluster by running the following command:

    $ oc get route -n openshift-bare-metal-events

    Example output

    NAME             HOST/PORT                                                                                           PATH   SERVICES                 PORT   TERMINATION   WILDCARD
    hw-event-proxy   hw-event-proxy-openshift-bare-metal-events.apps.compute-1.example.com          hw-event-proxy-service   9087   edge          None

  3. Create a BMCEventSubscription resource to subscribe to the Redfish events:

    1. Save the following YAML in the bmc_sub.yaml file:

      apiVersion: metal3.io/v1alpha1
      kind: BMCEventSubscription
      metadata:
        name: sub-01
        namespace: openshift-machine-api
      spec:
         hostName: <hostname> 1
         destination: <proxy_service_url> 2
         context: ''
      1
      Specifies the name or UUID of the worker node where the Redfish events are generated.
      2
      Specifies the bare-metal event proxy service, for example, https://hw-event-proxy-openshift-bare-metal-events.apps.compute-1.example.com/webhook.
    2. Create the BMCEventSubscription CR:

      $ oc create -f bmc_sub.yaml
  4. Optional: To delete the BMC event subscription, run the following command:

    $ oc delete -f bmc_sub.yaml
  5. Optional: To manually create a Redfish event subscription without creating a BMCEventSubscription CR, run the following curl command, specifying the BMC username and password.

    $ curl -i -k -X POST -H "Content-Type: application/json" -d '{"Destination": "https://<proxy_service_url>", "Protocol" : "Redfish", "EventTypes": ["Alert"], "Context": "root"}' -u <bmc_username>:<password> 'https://<bmc_ip_address>/redfish/v1/EventService/Subscriptions' -v

    where:

    proxy_service_url
    is the bare-metal event proxy service, for example, https://hw-event-proxy-openshift-bare-metal-events.apps.compute-1.example.com/webhook.
    bmc_ip_address
    is the IP address of the BMC where the Redfish events are generated.

    Example output

    HTTP/1.1 201 Created
    Server: AMI MegaRAC Redfish Service
    Location: /redfish/v1/EventService/Subscriptions/1
    Allow: GET, POST
    Access-Control-Allow-Origin: *
    Access-Control-Expose-Headers: X-Auth-Token
    Access-Control-Allow-Headers: X-Auth-Token
    Access-Control-Allow-Credentials: true
    Cache-Control: no-cache, must-revalidate
    Link: <http://redfish.dmtf.org/schemas/v1/EventDestination.v1_6_0.json>; rel=describedby
    Link: <http://redfish.dmtf.org/schemas/v1/EventDestination.v1_6_0.json>
    Link: </redfish/v1/EventService/Subscriptions>; path=
    ETag: "1651135676"
    Content-Type: application/json; charset=UTF-8
    OData-Version: 4.0
    Content-Length: 614
    Date: Thu, 28 Apr 2022 08:47:57 GMT

13.4.2. Querying Redfish bare-metal event subscriptions with curl

Some hardware vendors limit the amount of Redfish hardware event subscriptions. You can query the number of Redfish event subscriptions by using curl.

Prerequisites

  • Get the user name and password for the BMC.
  • Deploy a bare-metal node with a Redfish-enabled Baseboard Management Controller (BMC) in your cluster, and enable Redfish hardware events on the BMC.

Procedure

  1. Check the current subscriptions for the BMC by running the following curl command:

    $ curl --globoff -H "Content-Type: application/json" -k -X GET --user <bmc_username>:<password> https://<bmc_ip_address>/redfish/v1/EventService/Subscriptions

    where:

    bmc_ip_address
    is the IP address of the BMC where the Redfish events are generated.

    Example output

    % Total % Received % Xferd Average Speed Time Time Time Current
    Dload Upload Total Spent Left Speed
    100 435 100 435 0 0 399 0 0:00:01 0:00:01 --:--:-- 399
    {
      "@odata.context": "/redfish/v1/$metadata#EventDestinationCollection.EventDestinationCollection",
      "@odata.etag": ""
      1651137375 "",
      "@odata.id": "/redfish/v1/EventService/Subscriptions",
      "@odata.type": "#EventDestinationCollection.EventDestinationCollection",
      "Description": "Collection for Event Subscriptions",
      "Members": [
      {
        "@odata.id": "/redfish/v1/EventService/Subscriptions/1"
      }],
      "Members@odata.count": 1,
      "Name": "Event Subscriptions Collection"
    }

    In this example, a single subscription is configured: /redfish/v1/EventService/Subscriptions/1.

  2. Optional: To remove the /redfish/v1/EventService/Subscriptions/1 subscription with curl, run the following command, specifying the BMC username and password:

    $ curl --globoff -L -w "%{http_code} %{url_effective}\n" -k -u <bmc_username>:<password> -H "Content-Type: application/json" -d '{}' -X DELETE https://<bmc_ip_address>/redfish/v1/EventService/Subscriptions/1

    where:

    bmc_ip_address
    is the IP address of the BMC where the Redfish events are generated.

13.4.3. Creating the bare-metal event and Secret CRs

To start using bare-metal events, create the HardwareEvent custom resource (CR) for the host where the Redfish hardware is present. Hardware events and faults are reported in the hw-event-proxy logs.
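
To inspect these logs, you can list the pods in the openshift-bare-metal-events namespace and follow the logs of the hardware event proxy pod; the pod name below is a placeholder:

$ oc get pods -n openshift-bare-metal-events

$ oc logs -f <hw_event_proxy_pod_name> -n openshift-bare-metal-events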

Prerequisites

  • Install the OpenShift CLI (oc).
  • Log in as a user with cluster-admin privileges.
  • Install the Bare Metal Event Relay.
  • Create a BMCEventSubscription CR for the BMC Redfish hardware.
Note

Multiple HardwareEvent resources are not permitted.

Procedure

  1. Create the HardwareEvent custom resource (CR):

    1. Save the following YAML in the hw-event.yaml file:

      apiVersion: "event.redhat-cne.org/v1alpha1"
      kind: "HardwareEvent"
      metadata:
        name: "hardware-event"
      spec:
        nodeSelector:
          node-role.kubernetes.io/hw-event: "" 1
        transportHost: "amqp://amq-router-service-name.amq-namespace.svc.cluster.local" 2
        logLevel: "debug" 3
        msgParserTimeout: "10" 4
      1
      Required. Use the nodeSelector field to target nodes with the specified label, for example, node-role.kubernetes.io/hw-event: "".
      2
      Required. AMQP host that delivers the events at the transport layer using the AMQP protocol.
      3
      Optional. The default value is debug. Sets the log level in hw-event-proxy logs. The following log levels are available: fatal, error, warning, info, debug, trace.
      4
      Optional. Sets the timeout value in milliseconds for the Message Parser. If a message parsing request is not responded to within the timeout duration, the original hardware event message is passed to the cloud native event framework. The default value is 10.
    2. Create the HardwareEvent CR:

      $ oc create -f hw-event.yaml
  2. Create a BMC username and password Secret CR that enables the hardware events proxy to access the Redfish message registry for the bare-metal host.

    1. Save the following YAML in the hw-event-bmc-secret.yaml file:

      apiVersion: v1
      kind: Secret
      metadata:
        name: redfish-basic-auth
      type: Opaque
      stringData: 1
        username: <bmc_username>
        password: <bmc_password>
        # BMC host DNS or IP address
        hostaddr: <bmc_host_ip_address>
      1
      Enter plain text values for the various items under stringData.
    2. Create the Secret CR:

      $ oc create -f hw-event-bmc-secret.yaml

13.5. Subscribing applications to bare-metal events REST API reference

Use the bare-metal events REST API to subscribe an application to the bare-metal events that are generated on the parent node.

Subscribe applications to Redfish events by using the resource address /cluster/node/<node_name>/redfish/event, where <node_name> is the cluster node running the application.

Deploy your cloud-event-consumer application container and cloud-event-proxy sidecar container in a separate application pod. The cloud-event-consumer application subscribes to the cloud-event-proxy container in the application pod.

Use the following API endpoints to subscribe the cloud-event-consumer application to Redfish events posted by the cloud-event-proxy container at http://localhost:8089/api/ocloudNotifications/v1/ in the application pod:

  • /api/ocloudNotifications/v1/subscriptions

    • POST: Creates a new subscription
    • GET: Retrieves a list of subscriptions
  • /api/ocloudNotifications/v1/subscriptions/<subscription_id>

    • PUT: Creates a new status ping request for the specified subscription ID
  • /api/ocloudNotifications/v1/health

    • GET: Returns the health status of ocloudNotifications API
Note

9089 is the default port for the cloud-event-consumer container deployed in the application pod. You can configure a different port for your application as required.

api/ocloudNotifications/v1/subscriptions
HTTP method

GET api/ocloudNotifications/v1/subscriptions

Description

Returns a list of subscriptions. If subscriptions exist, a 200 OK status code is returned along with the list of subscriptions.

Example API response

[
 {
  "id": "ca11ab76-86f9-428c-8d3a-666c24e34d32",
  "endpointUri": "http://localhost:9089/api/ocloudNotifications/v1/dummy",
  "uriLocation": "http://localhost:8089/api/ocloudNotifications/v1/subscriptions/ca11ab76-86f9-428c-8d3a-666c24e34d32",
  "resource": "/cluster/node/openshift-worker-0.openshift.example.com/redfish/event"
 }
]

HTTP method

POST api/ocloudNotifications/v1/subscriptions

Description

Creates a new subscription. If a subscription is successfully created, or if it already exists, a 201 Created status code is returned.

Table 13.1. Query parameters
Parameter       Type

subscription    data

Example payload

{
  "uriLocation": "http://localhost:8089/api/ocloudNotifications/v1/subscriptions",
  "resource": "/cluster/node/openshift-worker-0.openshift.example.com/redfish/event"
}
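
For example, a consumer application inside the pod could create this subscription with a call similar to the following sketch; the node name in the resource path is illustrative:

$ curl -X POST -H "Content-Type: application/json" \
    -d '{"uriLocation": "http://localhost:8089/api/ocloudNotifications/v1/subscriptions", "resource": "/cluster/node/openshift-worker-0.openshift.example.com/redfish/event"}' \
    http://localhost:8089/api/ocloudNotifications/v1/subscriptions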

api/ocloudNotifications/v1/subscriptions/<subscription_id>
HTTP method

GET api/ocloudNotifications/v1/subscriptions/<subscription_id>

Description

Returns details for the subscription with ID <subscription_id>

Table 13.2. Query parameters
Parameter           Type

<subscription_id>   string

Example API response

{
  "id":"ca11ab76-86f9-428c-8d3a-666c24e34d32",
  "endpointUri":"http://localhost:9089/api/ocloudNotifications/v1/dummy",
  "uriLocation":"http://localhost:8089/api/ocloudNotifications/v1/subscriptions/ca11ab76-86f9-428c-8d3a-666c24e34d32",
  "resource":"/cluster/node/openshift-worker-0.openshift.example.com/redfish/event"
}

api/ocloudNotifications/v1/health/
HTTP method

GET api/ocloudNotifications/v1/health/

Description

Returns the health status for the ocloudNotifications REST API.

Example API response

OK

Chapter 14. What huge pages do and how they are consumed by applications

14.1. What huge pages do

Memory is managed in blocks known as pages. On most systems, a page is 4Ki. 1Mi of memory is equal to 256 pages; 1Gi of memory is 262,144 pages, and so on. CPUs have a built-in memory management unit that manages a list of these pages in hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of virtual-to-physical page mappings. If the virtual address passed in a hardware instruction can be found in the TLB, the mapping can be determined quickly. If not, a TLB miss occurs, and the system falls back to slower, software-based address translation, resulting in performance issues. Because the size of the TLB is fixed, the only way to reduce the chance of a TLB miss is to increase the page size.

A huge page is a memory page that is larger than 4Ki. On x86_64 architectures, there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other architectures. To use huge pages, code must be written so that applications are aware of them. Transparent Huge Pages (THP) attempt to automate the management of huge pages without application knowledge, but they have limitations. In particular, they are limited to 2Mi page sizes. THP can lead to performance degradation on nodes with high memory utilization or fragmentation because of the defragmentation efforts of THP, which can lock memory pages. For this reason, some applications are designed to use (or recommend using) pre-allocated huge pages instead of THP.
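
For example, mapping 1Gi of memory requires 262,144 TLB entries with 4Ki pages, but only 512 entries with 2Mi pages, or a single entry with a 1Gi page. Larger pages therefore make it far more likely that a given mapping is already cached in the TLB.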

In OpenShift Container Platform, applications in a pod can allocate and consume pre-allocated huge pages.

14.2. How huge pages are consumed by apps

Nodes must pre-allocate huge pages in order for the node to report its huge page capacity. A node can only pre-allocate huge pages for a single size.

Huge pages can be consumed through container-level resource requirements using the resource name hugepages-<size>, where size is the most compact binary notation using integer values supported on a particular node. For example, if a node supports 2048KiB page sizes, it exposes a schedulable resource hugepages-2Mi. Unlike CPU or memory, huge pages do not support over-commitment.

apiVersion: v1
kind: Pod
metadata:
  generateName: hugepages-volume-
spec:
  containers:
  - securityContext:
      privileged: true
    image: rhel7:latest
    command:
    - sleep
    - inf
    name: example
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        hugepages-2Mi: 100Mi 1
        memory: "1Gi"
        cpu: "1"
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
1
Specify the amount of memory for hugepages as the exact amount to be allocated. Do not specify this value as the amount of memory for hugepages multiplied by the size of the page. For example, given a huge page size of 2MB, if you want to use 100MB of huge-page-backed RAM for your application, then you would allocate 50 huge pages. OpenShift Container Platform handles the math for you. As in the above example, you can specify 100MB directly.

Allocating huge pages of a specific size

Some platforms support multiple huge page sizes. To allocate huge pages of a specific size, precede the huge pages boot command parameters with a huge page size selection parameter hugepagesz=<size>. The <size> value must be specified in bytes with an optional scale suffix [kKmMgG]. The default huge page size can be defined with the default_hugepagesz=<size> boot parameter.
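
For example, to pre-allocate both 1G and 2M pages at boot and make 1G the default huge page size, the kernel arguments might look similar to the following; the page counts are illustrative:

default_hugepagesz=1G hugepagesz=1G hugepages=16 hugepagesz=2M hugepages=256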

Huge page requirements

  • Huge page requests must equal the limits. This is the default if limits are specified, but requests are not.
  • Huge pages are isolated at a pod scope. Container isolation is planned in a future iteration.
  • EmptyDir volumes backed by huge pages must not consume more huge page memory than the pod request.
  • Applications that consume huge pages via shmget() with SHM_HUGETLB must run with a supplemental group that matches /proc/sys/vm/hugetlb_shm_group, which you can check on the node as shown below.
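
For example, a minimal check of the configured group ID on a node:

$ cat /proc/sys/vm/hugetlb_shm_group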

14.3. Consuming huge pages resources using the Downward API

You can use the Downward API to inject information about the huge pages resources that are consumed by a container.

You can inject the resource allocation as environment variables, a volume plugin, or both. Applications that you develop and run in the container can determine the resources that are available by reading the environment variables or files in the specified volumes.

Procedure

  1. Create a hugepages-volume-pod.yaml file that is similar to the following example:

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: hugepages-volume-
      labels:
        app: hugepages-example
    spec:
      containers:
      - securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        image: rhel7:latest
        command:
        - sleep
        - inf
        name: example
        volumeMounts:
        - mountPath: /dev/hugepages
          name: hugepage
        - mountPath: /etc/podinfo
          name: podinfo
        resources:
          limits:
            hugepages-1Gi: 2Gi
            memory: "1Gi"
            cpu: "1"
          requests:
            hugepages-1Gi: 2Gi
        env:
        - name: REQUESTS_HUGEPAGES_1GI 1
          valueFrom:
            resourceFieldRef:
              containerName: example
              resource: requests.hugepages-1Gi
      volumes:
      - name: hugepage
        emptyDir:
          medium: HugePages
      - name: podinfo
        downwardAPI:
          items:
            - path: "hugepages_1G_request" <.>
              resourceFieldRef:
                containerName: example
                resource: requests.hugepages-1Gi
                divisor: 1Gi

    1
    Specifies to read the resource use from requests.hugepages-1Gi and expose the value as the REQUESTS_HUGEPAGES_1GI environment variable.
    2
    Specifies to read the resource use from requests.hugepages-1Gi and expose the value as the file /etc/podinfo/hugepages_1G_request.

  2. Create the pod from the hugepages-volume-pod.yaml file:

    $ oc create -f hugepages-volume-pod.yaml

Verification

  1. Check the value of the REQUESTS_HUGEPAGES_1GI environment variable:

    $ oc exec -it $(oc get pods -l app=hugepages-example -o jsonpath='{.items[0].metadata.name}') \
         -- env | grep REQUESTS_HUGEPAGES_1GI

    Example output

    REQUESTS_HUGEPAGES_1GI=2147483648

  2. Check the value of the /etc/podinfo/hugepages_1G_request file:

    $ oc exec -it $(oc get pods -l app=hugepages-example -o jsonpath='{.items[0].metadata.name}') \
         -- cat /etc/podinfo/hugepages_1G_request

    Example output

    2

14.4. Configuring huge pages

Nodes must pre-allocate huge pages used in an OpenShift Container Platform cluster. There are two ways of reserving huge pages: at boot time and at run time. Reserving at boot time increases the possibility of success because the memory has not yet been significantly fragmented. The Node Tuning Operator currently supports boot time allocation of huge pages on specific nodes.

14.4.1. At boot time

Procedure

To minimize node reboots, follow the steps below in the order shown:

  1. Label all nodes that need the same huge pages setting with the same label.

    $ oc label node <node_using_hugepages> node-role.kubernetes.io/worker-hp=
  2. Create a file with the following content and name it hugepages-tuned-boottime.yaml:

    apiVersion: tuned.openshift.io/v1
    kind: Tuned
    metadata:
      name: hugepages 1
      namespace: openshift-cluster-node-tuning-operator
    spec:
      profile: 2
      - data: |
          [main]
          summary=Boot time configuration for hugepages
          include=openshift-node
          [bootloader]
          cmdline_openshift_node_hugepages=hugepagesz=2M hugepages=50 3
        name: openshift-node-hugepages
    
      recommend:
      - machineConfigLabels: 4
          machineconfiguration.openshift.io/role: "worker-hp"
        priority: 30
        profile: openshift-node-hugepages
    1
    Set the name of the Tuned resource to hugepages.
    2
    Set the profile section to allocate huge pages.
    3
    Note the order of parameters is important as some platforms support huge pages of various sizes.
    4
    Enable machine config pool based matching.
  3. Create the Tuned hugepages object:

    $ oc create -f hugepages-tuned-boottime.yaml
  4. Create a file with the following content and name it hugepages-mcp.yaml:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: worker-hp
      labels:
        worker-hp: ""
    spec:
      machineConfigSelector:
        matchExpressions:
          - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-hp]}
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker-hp: ""
  5. Create the machine config pool:

    $ oc create -f hugepages-mcp.yaml

Given enough non-fragmented memory, all the nodes in the worker-hp machine config pool should now have 50 2Mi huge pages allocated.

$ oc get node <node_using_hugepages> -o jsonpath="{.status.allocatable.hugepages-2Mi}"
100Mi
Note

The TuneD bootloader plugin only supports Red Hat Enterprise Linux CoreOS (RHCOS) worker nodes.

14.5. Disabling Transparent Huge Pages

Transparent Huge Pages (THP) attempt to automate most aspects of creating, managing, and using huge pages. Since THP automatically manages the huge pages, this is not always handled optimally for all types of workloads. THP can lead to performance regressions, since many applications handle huge pages on their own. Therefore, consider disabling THP. The following steps describe how to disable THP using the Node Tuning Operator (NTO).

Procedure

  1. Create a file with the following content and name it thp-disable-tuned.yaml:

    apiVersion: tuned.openshift.io/v1
    kind: Tuned
    metadata:
      name: thp-workers-profile
      namespace: openshift-cluster-node-tuning-operator
    spec:
      profile:
      - data: |
          [main]
          summary=Custom tuned profile for OpenShift to turn off THP on worker nodes
          include=openshift-node
    
          [vm]
          transparent_hugepages=never
        name: openshift-thp-never-worker
    
      recommend:
      - match:
        - label: node-role.kubernetes.io/worker
        priority: 25
        profile: openshift-thp-never-worker
  2. Create the Tuned object:

    $ oc create -f thp-disable-tuned.yaml
  3. Check the list of active profiles:

    $ oc get profile -n openshift-cluster-node-tuning-operator

Verification

  • Log in to one of the nodes and do a regular THP check to verify if the nodes applied the profile successfully:

    $ cat /sys/kernel/mm/transparent_hugepage/enabled

    Example output

    always madvise [never]

Chapter 15. Low latency tuning

15.1. Understanding low latency

The emergence of Edge computing in the area of Telco / 5G plays a key role in reducing latency and congestion problems and improving application performance.

Simply put, latency determines how fast data (packets) moves from the sender to the receiver and returns to the sender after processing by the receiver. Maintaining a network architecture with the lowest possible latency is key to meeting the network performance requirements of 5G. Compared to 4G technology, with an average latency of 50 ms, 5G is targeted to reach latency numbers of 1 ms or less. This reduction in latency boosts wireless throughput by a factor of 10.

Many of the applications deployed in the Telco space require low latency and can tolerate only zero packet loss. Tuning for zero packet loss helps mitigate the inherent issues that degrade network performance. For more information, see Tuning for Zero Packet Loss in Red Hat OpenStack Platform (RHOSP).

The Edge computing initiative also comes into play in reducing latency. Think of it as being on the edge of the cloud and closer to the user. This greatly reduces the distance between the user and distant data centers, resulting in reduced application response times and performance latency.

Administrators must be able to manage their many Edge sites and local services in a centralized way so that all of the deployments can run at the lowest possible management cost. They also need an easy way to deploy and configure certain nodes of their cluster for real-time low latency and high-performance purposes. Low latency nodes are useful for applications such as Cloud-native Network Functions (CNF) and Data Plane Development Kit (DPDK).

OpenShift Container Platform currently provides mechanisms to tune software on an OpenShift Container Platform cluster for real-time running and low latency (around <20 microseconds reaction time). This includes tuning the kernel and OpenShift Container Platform set values, installing a kernel, and reconfiguring the machine. But this method requires setting up four different Operators and performing many configurations that are complex and prone to mistakes when done manually.

OpenShift Container Platform uses the Node Tuning Operator to implement automatic tuning to achieve low latency performance for OpenShift Container Platform applications. The cluster administrator uses this performance profile configuration to make these changes in a more reliable way. The administrator can specify whether to update the kernel to kernel-rt, reserve CPUs for cluster and operating system housekeeping duties, including pod infra containers, and isolate CPUs for application containers to run the workloads.

OpenShift Container Platform also supports workload hints for the Node Tuning Operator that can tune the PerformanceProfile to meet the demands of different industry environments. Workload hints are available for highPowerConsumption (very low latency at the cost of increased power consumption) and realTime (priority given to optimum latency). A combination of true/false settings for these hints can be used to deal with application-specific workload profiles and requirements.

Workload hints simplify the fine-tuning of performance to industry sector settings. Instead of a “one size fits all” approach, workload hints can cater to usage patterns such as placing priority on:

  • Low latency
  • Real-time capability
  • Efficient use of power

In an ideal world, all of those would be prioritized; in real life, some come at the expense of others. The Node Tuning Operator is now aware of the workload expectations and better able to meet the demands of the workload. The cluster administrator can specify the use case that the workload falls into. The Node Tuning Operator uses the PerformanceProfile to fine-tune the performance settings for the workload.

The environment in which an application is operating influences its behavior. For a typical data center with no strict latency requirements, only minimal default tuning is needed that enables CPU partitioning for some high performance workload pods. For data centers and workloads where latency is a higher priority, measures are still taken to optimize power consumption. The most complicated cases are clusters close to latency-sensitive equipment such as manufacturing machinery and software-defined radios. This last class of deployment is often referred to as Far edge. For Far edge deployments, ultra-low latency is the ultimate priority, and is achieved at the expense of power management.

In OpenShift Container Platform version 4.10 and previous versions, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance. Now this functionality is part of the Node Tuning Operator.

15.1.1. About hyperthreading for low latency and real-time applications

Hyperthreading is an Intel processor technology that allows a physical CPU processor core to function as two logical cores, executing two independent threads simultaneously. Hyperthreading allows for better system throughput for certain workload types where parallel processing is beneficial. The default OpenShift Container Platform configuration expects hyperthreading to be enabled.

For telecommunications applications, it is important to design your application infrastructure to minimize latency as much as possible. Hyperthreading can slow performance times and negatively affect throughput for compute intensive workloads that require low latency. Disabling hyperthreading ensures predictable performance and can decrease processing times for these workloads.

Note

Hyperthreading implementation and configuration differs depending on the hardware you are running OpenShift Container Platform on. Consult the relevant host hardware tuning information for more details of the hyperthreading implementation specific to that hardware. Disabling hyperthreading can increase the cost per core of the cluster.

15.2. Provisioning real-time and low latency workloads

Many industries and organizations need extremely high performance computing and might require low and predictable latency, especially in the financial and telecommunications industries. For these industries, with their unique requirements, OpenShift Container Platform provides the Node Tuning Operator to implement automatic tuning to achieve low latency performance and consistent response time for OpenShift Container Platform applications.

The cluster administrator can use this performance profile configuration to make these changes in a more reliable way. The administrator can specify whether to update the kernel to kernel-rt (real-time), reserve CPUs for cluster and operating system housekeeping duties, including pod infra containers, isolate CPUs for application containers to run the workloads, and disable unused CPUs to reduce power consumption.

Warning

The usage of execution probes in conjunction with applications that require guaranteed CPUs can cause latency spikes. It is recommended to use other probes, such as a properly configured set of network probes, as an alternative.

Note

In earlier versions of OpenShift Container Platform, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OpenShift Container Platform 4.11 and later, these functions are part of the Node Tuning Operator.

15.2.1. Known limitations for real-time

Note

In most deployments, kernel-rt is supported only on worker nodes when you use a standard cluster with three control plane nodes and three worker nodes. There are exceptions for compact and single nodes on OpenShift Container Platform deployments. For installations on a single node, kernel-rt is supported on the single control plane node.

To fully utilize the real-time mode, the containers must run with elevated privileges. See Set capabilities for a Container for information on granting privileges.

OpenShift Container Platform restricts the allowed capabilities, so you might need to create a SecurityContext as well.
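
For example, a real-time workload pod might add specific Linux capabilities in its container security context; the pod name, image placeholder, and the capabilities shown here (IPC_LOCK and SYS_NICE) are illustrative and depend on what your application actually requires:

apiVersion: v1
kind: Pod
metadata:
  name: realtime-example
spec:
  containers:
  - name: realtime-app
    image: <image-pull-spec>
    securityContext:
      capabilities:
        add: ["IPC_LOCK", "SYS_NICE"]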

Note

This procedure is fully supported with bare metal installations using Red Hat Enterprise Linux CoreOS (RHCOS) systems.

Establishing the right performance expectations refers to the fact that the real-time kernel is not a panacea. Its objective is consistent, low-latency determinism offering predictable response times. There is some additional kernel overhead associated with the real-time kernel. This is due primarily to handling hardware interrupts in separately scheduled threads. The increased overhead in some workloads results in some degradation in overall throughput. The exact amount of degradation is very workload dependent, ranging from 0% to 30%. However, it is the cost of determinism.

15.2.2. Provisioning a worker with real-time capabilities

  1. Optional: Add a node to the OpenShift Container Platform cluster. See Setting BIOS parameters for system tuning.
  2. Add the label worker-rt to the worker nodes that require the real-time capability by using the oc command.
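
    For example, the following command adds the label that the worker-rt machine config pool selects on; <node_name> is a placeholder for the worker node that you want to configure:

    $ oc label node <node_name> node-role.kubernetes.io/worker-rt=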
  3. Create a new machine config pool for real-time nodes:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: worker-rt
      labels:
        machineconfiguration.openshift.io/role: worker-rt
    spec:
      machineConfigSelector:
        matchExpressions:
          - {
               key: machineconfiguration.openshift.io/role,
               operator: In,
               values: [worker, worker-rt],
            }
      paused: false
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/worker-rt: ""

    Note that a machine config pool worker-rt is created for a group of nodes that have the label worker-rt.

  4. Add the node to the proper machine config pool by using node role labels.

    Note

    You must decide which nodes are configured with real-time workloads. You could configure all of the nodes in the cluster, or a subset of the nodes. The Node Tuning Operator expects all of the nodes to be part of a dedicated machine config pool. If you use all of the nodes, you must point the Node Tuning Operator to the worker node role label. If you use a subset, you must group the nodes into a new machine config pool.

  5. Create the PerformanceProfile with the proper set of housekeeping cores and realTimeKernel: enabled: true.
  6. You must set machineConfigPoolSelector in PerformanceProfile:

      apiVersion: performance.openshift.io/v2
      kind: PerformanceProfile
      metadata:
       name: example-performanceprofile
      spec:
      ...
        realTimeKernel:
          enabled: true
        nodeSelector:
           node-role.kubernetes.io/worker-rt: ""
        machineConfigPoolSelector:
           machineconfiguration.openshift.io/role: worker-rt
  7. Verify that a matching machine config pool exists with a label:

    $ oc describe mcp/worker-rt

    Example output

    Name:         worker-rt
    Namespace:
    Labels:       machineconfiguration.openshift.io/role=worker-rt

  8. OpenShift Container Platform will start configuring the nodes, which might involve multiple reboots. Wait for the nodes to settle. This can take a long time depending on the specific hardware you use, but 20 minutes per node is expected.
  9. Verify everything is working as expected.

15.2.3. Verifying the real-time kernel installation

Use this command to verify that the real-time kernel is installed:

$ oc get node -o wide

Note the worker with the role worker-rt that contains the string 4.18.0-305.30.1.rt7.102.el8_4.x86_64 cri-o://1.24.0-99.rhaos4.10.gitc3131de.el8:

NAME                               	STATUS   ROLES           	AGE 	VERSION                  	INTERNAL-IP
EXTERNAL-IP   OS-IMAGE                                       	KERNEL-VERSION
CONTAINER-RUNTIME
rt-worker-0.example.com	          Ready	 worker,worker-rt   5d17h   v1.24.0
128.66.135.107   <none>    	        Red Hat Enterprise Linux CoreOS 46.82.202008252340-0 (Ootpa)
4.18.0-305.30.1.rt7.102.el8_4.x86_64   cri-o://1.24.0-99.rhaos4.10.gitc3131de.el8
[...]

15.2.4. Creating a workload that works in real-time

Use the following procedures for preparing a workload that will use real-time capabilities.

Procedure

  1. Create a pod with a QoS class of Guaranteed.
  2. Optional: Disable CPU load balancing for DPDK.
  3. Assign a proper node selector.

When writing your applications, follow the general recommendations described in Application tuning and deployment.

15.2.5. Creating a pod with a QoS class of Guaranteed

Keep the following in mind when you create a pod that is given a QoS class of Guaranteed:

  • Every container in the pod must have a memory limit and a memory request, and they must be the same.
  • Every container in the pod must have a CPU limit and a CPU request, and they must be the same.

The following example shows the configuration file for a pod that has one container. The container has a memory limit and a memory request, both equal to 200 MiB. The container has a CPU limit and a CPU request, both equal to 1 CPU.

apiVersion: v1
kind: Pod
metadata:
  name: qos-demo
  namespace: qos-example
spec:
  containers:
  - name: qos-demo-ctr
    image: <image-pull-spec>
    resources:
      limits:
        memory: "200Mi"
        cpu: "1"
      requests:
        memory: "200Mi"
        cpu: "1"
  1. Create the pod:

    $ oc apply -f qos-pod.yaml --namespace=qos-example
  2. View detailed information about the pod:

    $ oc get pod qos-demo --namespace=qos-example --output=yaml

    Example output

    spec:
      containers:
        ...
    status:
      qosClass: Guaranteed

    Note

    If a container specifies its own memory limit, but does not specify a memory request, OpenShift Container Platform automatically assigns a memory request that matches the limit. Similarly, if a container specifies its own CPU limit, but does not specify a CPU request, OpenShift Container Platform automatically assigns a CPU request that matches the limit.

15.2.6. Optional: Disabling CPU load balancing for DPDK

Functionality to disable or enable CPU load balancing is implemented on the CRI-O level. The code under CRI-O disables or enables CPU load balancing only when the following requirements are met.

  • The pod must use the performance-<profile-name> runtime class. You can get the proper name by looking at the status of the performance profile, as shown here:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    ...
    status:
      ...
      runtimeClass: performance-manual
  • The pod must have the cpu-load-balancing.crio.io: true annotation.

The Node Tuning Operator is responsible for the creation of the high-performance runtime handler config snippet under relevant nodes and for the creation of the high-performance runtime class under the cluster. It has the same content as the default runtime handler except that it enables the CPU load balancing configuration functionality.

To disable the CPU load balancing for the pod, the Pod specification must include the following fields:

apiVersion: v1
kind: Pod
metadata:
  ...
  annotations:
    ...
    cpu-load-balancing.crio.io: "disable"
    ...
  ...
spec:
  ...
  runtimeClassName: performance-<profile_name>
  ...
Note

Only disable CPU load balancing when the CPU manager static policy is enabled and for pods with guaranteed QoS that use whole CPUs. Otherwise, disabling CPU load balancing can affect the performance of other containers in the cluster.

15.2.7. Assigning a proper node selector

The preferred way to assign a pod to nodes is to use the same node selector the performance profile used, as shown here:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  # ...
  nodeSelector:
    node-role.kubernetes.io/worker-rt: ""

For more information, see Placing pods on specific nodes using node selectors.

15.2.8. Scheduling a workload onto a worker with real-time capabilities

Use label selectors that match the nodes attached to the machine config pool that was configured for low latency by the Node Tuning Operator. For more information, see Assigning pods to nodes.

15.2.9. Reducing power consumption by taking CPUs offline

You can generally anticipate telecommunication workloads. When not all of the CPU resources are required, the Node Tuning Operator allows you to take unused CPUs offline to reduce power consumption by manually updating the performance profile.

To take unused CPUs offline, you must perform the following tasks:

  1. Set the offline CPUs in the performance profile and save the contents of the YAML file:

    Example performance profile with offlined CPUs

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: performance
    spec:
      additionalKernelArgs:
      - nmi_watchdog=0
      - audit=0
      - mce=off
      - processor.max_cstate=1
      - intel_idle.max_cstate=0
      - idle=poll
      cpu:
        isolated: "2-23,26-47"
        reserved: "0,1,24,25"
        offlined: "48-59" 1
      nodeSelector:
        node-role.kubernetes.io/worker-cnf: ""
      numa:
        topologyPolicy: single-numa-node
      realTimeKernel:
        enabled: true

    1
    Optional. You can list CPUs in the offlined field to take the specified CPUs offline.
  2. Apply the updated profile by running the following command:

    $ oc apply -f my-performance-profile.yaml

15.2.10. Managing device interrupt processing for guaranteed pod isolated CPUs

The Node Tuning Operator can manage host CPUs by dividing them into reserved CPUs for cluster and operating system housekeeping duties, including pod infra containers, and isolated CPUs for application containers to run the workloads. This allows you to set CPUs for low latency workloads as isolated.

Device interrupts are load balanced between all isolated and reserved CPUs to avoid CPUs being overloaded, with the exception of CPUs where there is a guaranteed pod running. Guaranteed pod CPUs are prevented from processing device interrupts when the relevant annotations are set for the pod.

In the performance profile, globallyDisableIrqLoadBalancing is used to manage whether device interrupts are processed or not. For certain workloads, the reserved CPUs are not always sufficient for dealing with device interrupts, and for this reason, device interrupts are not globally disabled on the isolated CPUs. By default, the Node Tuning Operator does not disable device interrupts on isolated CPUs.

To achieve low latency for workloads, some (but not all) pods require the CPUs they are running on to not process device interrupts. A pod annotation, irq-load-balancing.crio.io, is used to define whether device interrupts are processed or not. When configured, CRI-O disables device interrupts only as long as the pod is running.

15.2.10.1. Disabling CPU CFS quota

To reduce CPU throttling for individual guaranteed pods, create a pod specification with the annotation cpu-quota.crio.io: "disable". This annotation disables the CPU completely fair scheduler (CFS) quota at the pod run time. The following pod specification contains this annotation:

apiVersion: v1
kind: Pod
metadata:
  annotations:
      cpu-quota.crio.io: "disable"
spec:
    runtimeClassName: performance-<profile_name>
...
Note

Only disable CPU CFS quota when the CPU manager static policy is enabled and for pods with guaranteed QoS that use whole CPUs. Otherwise, disabling CPU CFS quota can affect the performance of other containers in the cluster.

15.2.10.2. Disabling global device interrupts handling in Node Tuning Operator

To configure Node Tuning Operator to disable global device interrupts for the isolated CPU set, set the globallyDisableIrqLoadBalancing field in the performance profile to true. When true, conflicting pod annotations are ignored. When false, IRQ loads are balanced across all CPUs.

A performance profile snippet illustrates this setting:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  globallyDisableIrqLoadBalancing: true
...
15.2.10.3. Disabling interrupt processing for individual pods

To disable interrupt processing for individual pods, ensure that globallyDisableIrqLoadBalancing is set to false in the performance profile. Then, in the pod specification, set the irq-load-balancing.crio.io pod annotation to disable. The following pod specification contains this annotation:

apiVersion: v1
kind: Pod
metadata:
  annotations:
      irq-load-balancing.crio.io: "disable"
spec:
    runtimeClassName: performance-<profile_name>
...

15.2.11. Upgrading the performance profile to use device interrupt processing

When you upgrade the Node Tuning Operator performance profile custom resource definition (CRD) from v1 or v1alpha1 to v2, globallyDisableIrqLoadBalancing is set to true on existing profiles.

Note

globallyDisableIrqLoadBalancing toggles whether IRQ load balancing will be disabled for the Isolated CPU set. When the option is set to true it disables IRQ load balancing for the Isolated CPU set. Setting the option to false allows the IRQs to be balanced across all CPUs.

15.2.11.1. Supported API Versions

The Node Tuning Operator supports v2, v1, and v1alpha1 for the performance profile apiVersion field. The v1 and v1alpha1 APIs are identical. The v2 API includes an optional boolean field globallyDisableIrqLoadBalancing with a default value of false.

15.2.11.1.1. Upgrading Node Tuning Operator API from v1alpha1 to v1

When upgrading Node Tuning Operator API version from v1alpha1 to v1, the v1alpha1 performance profiles are converted on-the-fly using a "None" Conversion strategy and served to the Node Tuning Operator with API version v1.

15.2.11.1.2. Upgrading Node Tuning Operator API from v1alpha1 or v1 to v2

When upgrading from an older Node Tuning Operator API version, the existing v1 and v1alpha1 performance profiles are converted using a conversion webhook that injects the globallyDisableIrqLoadBalancing field with a value of true.

15.3. Tuning nodes for low latency with the performance profile

The performance profile lets you control latency tuning aspects of nodes that belong to a certain machine config pool. After you specify your settings, the PerformanceProfile object is compiled into multiple objects that perform the actual node level tuning:

  • A MachineConfig file that manipulates the nodes.
  • A KubeletConfig file that configures the Topology Manager, the CPU Manager, and the OpenShift Container Platform nodes.
  • The Tuned profile that configures the Node Tuning Operator.

You can use a performance profile to specify whether to update the kernel to kernel-rt, to allocate huge pages, and to partition the CPUs for performing housekeeping duties or running workloads.

Note

You can manually create the PerformanceProfile object or use the Performance Profile Creator (PPC) to generate a performance profile. See the additional resources below for more information on the PPC.

Sample performance profile

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
 name: performance
spec:
 cpu:
  isolated: "4-15" 1
  reserved: "0-3" 2
 hugepages:
  defaultHugepagesSize: "1G"
  pages:
  - size: "1G"
    count: 16
    node: 0
 realTimeKernel:
  enabled: true  3
 numa:  4
  topologyPolicy: "best-effort"
 nodeSelector:
  node-role.kubernetes.io/worker-cnf: "" 5

1
Use this field to isolate specific CPUs to use with application containers for workloads. Set an even number of isolated CPUs to enable the pods to run without errors when hyperthreading is enabled.
2
Use this field to reserve specific CPUs to use with infra containers for housekeeping.
3
Use this field to install the real-time kernel on the node. Valid values are true or false. Setting the true value installs the real-time kernel.
4
Use this field to configure the topology manager policy. Valid values are none (default), best-effort, restricted, and single-numa-node. For more information, see Topology Manager Policies.
5
Use this field to specify a node selector to apply the performance profile to specific nodes.

15.3.1. Configuring huge pages

Nodes must pre-allocate huge pages used in an OpenShift Container Platform cluster. Use the Node Tuning Operator to allocate huge pages on a specific node.

OpenShift Container Platform provides a method for creating and allocating huge pages. The Node Tuning Operator provides an easier method for doing this by using the performance profile.

For example, in the hugepages pages section of the performance profile, you can specify multiple blocks of size, count, and, optionally, node:

hugepages:
   defaultHugepagesSize: "1G"
   pages:
   - size:  "1G"
     count:  4
     node:  0 1
1
node is the NUMA node in which the huge pages are allocated. If you omit node, the pages are evenly spread across all NUMA nodes.
Note

Wait for the relevant machine config pool status that indicates the update is finished.

These are the only configuration steps you need to do to allocate huge pages.

Verification

  • To verify the configuration, see the /proc/meminfo file on the node:

    $ oc debug node/ip-10-0-141-105.ec2.internal
    # grep -i huge /proc/meminfo

    Example output

    AnonHugePages:    ###### ##
    ShmemHugePages:        0 kB
    HugePages_Total:       2
    HugePages_Free:        2
    HugePages_Rsvd:        0
    HugePages_Surp:        0
    Hugepagesize:       #### ##
    Hugetlb:            #### ##

  • Use oc describe to report the new size:

    $ oc describe node worker-0.ocp4poc.example.com | grep -i huge

    Example output

                                       hugepages-1g=true
     hugepages-###:  ###
     hugepages-###:  ###

15.3.2. Allocating multiple huge page sizes

You can request huge pages with different sizes under the same container. This allows you to define more complicated pods consisting of containers with different huge page size needs.

For example, you can define sizes 1G and 2M and the Node Tuning Operator will configure both sizes on the node, as shown here:

spec:
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 1024
      node: 0
      size: 2M
    - count: 4
      node: 1
      size: 1G

15.3.3. Configuring a node for IRQ dynamic load balancing

To configure a cluster node to handle IRQ dynamic load balancing, do the following:

  1. Log in to the OpenShift Container Platform cluster as a user with cluster-admin privileges.
  2. Set the performance profile apiVersion to use performance.openshift.io/v2.
  3. Remove the globallyDisableIrqLoadBalancing field or set it to false.
  4. Set the appropriate isolated and reserved CPUs. The following snippet illustrates a profile that reserves 2 CPUs. IRQ load-balancing is enabled for pods running on the isolated CPU set:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: dynamic-irq-profile
    spec:
      cpu:
        isolated: 2-5
        reserved: 0-1
    ...
    Note

    When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.

  5. Create the pod that uses exclusive CPUs, and set irq-load-balancing.crio.io and cpu-quota.crio.io annotations to disable. For example:

    apiVersion: v1
    kind: Pod
    metadata:
      name: dynamic-irq-pod
      annotations:
         irq-load-balancing.crio.io: "disable"
         cpu-quota.crio.io: "disable"
    spec:
      containers:
      - name: dynamic-irq-pod
        image: "registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11"
        command: ["sleep", "10h"]
        resources:
          requests:
            cpu: 2
            memory: "200M"
          limits:
            cpu: 2
            memory: "200M"
      nodeSelector:
        node-role.kubernetes.io/worker-cnf: ""
      runtimeClassName: performance-dynamic-irq-profile
    ...
  6. Enter the pod runtimeClassName in the form performance-<profile_name>, where <profile_name> is the name from the PerformanceProfile YAML, in this example, performance-dynamic-irq-profile.
  7. Set the node selector to target a cnf-worker.
  8. Ensure the pod is running correctly. Status should be running, and the correct cnf-worker node should be set:

    $ oc get pod -o wide

    Expected output

    NAME              READY   STATUS    RESTARTS   AGE     IP             NODE          NOMINATED NODE   READINESS GATES
    dynamic-irq-pod   1/1     Running   0          5h33m   <ip-address>   <node-name>   <none>           <none>

  9. Get the CPUs that the pod configured for IRQ dynamic load balancing runs on:

    $ oc exec -it dynamic-irq-pod -- /bin/bash -c "grep Cpus_allowed_list /proc/self/status | awk '{print $2}'"

    Expected output

    Cpus_allowed_list:  2-3

  10. Ensure the node configuration is applied correctly. SSH into the node to verify the configuration.

    $ oc debug node/<node-name>

    Expected output

    Starting pod/<node-name>-debug ...
    To use host binaries, run `chroot /host`
    
    Pod IP: <ip-address>
    If you don't see a command prompt, try pressing enter.
    
    sh-4.4#

  11. Verify that you can use the node file system:

    sh-4.4# chroot /host

    Expected output

    sh-4.4#

  12. Ensure the default system CPU affinity mask does not include the dynamic-irq-pod CPUs, for example, CPUs 2 and 3.

    $ cat /proc/irq/default_smp_affinity

    Example output

    33

  13. Ensure the system IRQs are not configured to run on the dynamic-irq-pod CPUs:

    $ find /proc/irq/ -name smp_affinity_list -exec sh -c 'i="$1"; mask=$(cat $i); file=$(echo $i); echo $file: $mask' _ {} \;

    Example output

    /proc/irq/0/smp_affinity_list: 0-5
    /proc/irq/1/smp_affinity_list: 5
    /proc/irq/2/smp_affinity_list: 0-5
    /proc/irq/3/smp_affinity_list: 0-5
    /proc/irq/4/smp_affinity_list: 0
    /proc/irq/5/smp_affinity_list: 0-5
    /proc/irq/6/smp_affinity_list: 0-5
    /proc/irq/7/smp_affinity_list: 0-5
    /proc/irq/8/smp_affinity_list: 4
    /proc/irq/9/smp_affinity_list: 4
    /proc/irq/10/smp_affinity_list: 0-5
    /proc/irq/11/smp_affinity_list: 0
    /proc/irq/12/smp_affinity_list: 1
    /proc/irq/13/smp_affinity_list: 0-5
    /proc/irq/14/smp_affinity_list: 1
    /proc/irq/15/smp_affinity_list: 0
    /proc/irq/24/smp_affinity_list: 1
    /proc/irq/25/smp_affinity_list: 1
    /proc/irq/26/smp_affinity_list: 1
    /proc/irq/27/smp_affinity_list: 5
    /proc/irq/28/smp_affinity_list: 1
    /proc/irq/29/smp_affinity_list: 0
    /proc/irq/30/smp_affinity_list: 0-5

Some IRQ controllers do not support IRQ re-balancing and will always expose all online CPUs as the IRQ mask. These IRQ controllers effectively run on CPU 0. For more information on the host configuration, SSH into the host and run the following, replacing <irq-num> with the IRQ number that you want to query:

$ cat /proc/irq/<irq-num>/effective_affinity

15.3.4. Configuring hyperthreading for a cluster

To configure hyperthreading for an OpenShift Container Platform cluster, set the CPU threads in the performance profile to the same cores that are configured for the reserved or isolated CPU pools.

Note

If you configure a performance profile, and subsequently change the hyperthreading configuration for the host, ensure that you update the CPU isolated and reserved fields in the PerformanceProfile YAML to match the new configuration.

Warning

Disabling a previously enabled host hyperthreading configuration can cause the CPU core IDs listed in the PerformanceProfile YAML to be incorrect. This incorrect configuration can cause the node to become unavailable because the listed CPUs can no longer be found.

Prerequisites

  • Access to the cluster as a user with the cluster-admin role.
  • Install the OpenShift CLI (oc).

Procedure

  1. Ascertain which threads are running on what CPUs for the host you want to configure.

    You can view which threads are running on the host CPUs by logging in to the cluster and running the following command:

    $ lscpu --all --extended

    Example output

    CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
    0   0    0      0    0:0:0:0       yes    4800.0000 400.0000
    1   0    0      1    1:1:1:0       yes    4800.0000 400.0000
    2   0    0      2    2:2:2:0       yes    4800.0000 400.0000
    3   0    0      3    3:3:3:0       yes    4800.0000 400.0000
    4   0    0      0    0:0:0:0       yes    4800.0000 400.0000
    5   0    0      1    1:1:1:0       yes    4800.0000 400.0000
    6   0    0      2    2:2:2:0       yes    4800.0000 400.0000
    7   0    0      3    3:3:3:0       yes    4800.0000 400.0000

    In this example, there are eight logical CPU cores running on four physical CPU cores. CPU0 and CPU4 are running on physical Core0, CPU1 and CPU5 are running on physical Core 1, and so on.

    Alternatively, to view the threads that are set for a particular physical CPU core (cpu0 in the example below), open a command prompt and run the following:

    $ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

    Example output

    0,4

  2. Apply the isolated and reserved CPUs in the PerformanceProfile YAML. For example, you can set logical cores CPU0 and CPU4 as isolated, and logical cores CPU1 to CPU3 and CPU5 to CPU7 as reserved. When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.

    ...
      cpu:
        isolated: 0,4
        reserved: 1-3,5-7
    ...
    Note

    The reserved and isolated CPU pools must not overlap and together must span all available cores in the worker node.

Important

Hyperthreading is enabled by default on most Intel processors. If you enable hyperthreading, all of the threads that belong to a particular physical core must be assigned to the same CPU pool: either all isolated or all reserved.
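To confirm the sibling mapping for every online CPU in one pass before you assign the pools, you can read the same sysfs files for all CPUs at once. This is an optional check that uses standard sysfs paths; it is not part of the documented procedure:

$ grep . /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list

Each line of output shows a logical CPU and its hyperthread siblings, all of which must land in the same reserved or isolated pool.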

15.3.4.1. Disabling hyperthreading for low latency applications

When configuring clusters for low latency processing, consider whether you want to disable hyperthreading before you deploy the cluster. To disable hyperthreading, do the following:

  1. Create a performance profile that is appropriate for your hardware and topology.
  2. Set nosmt as an additional kernel argument. The following example performance profile illustrates this setting:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: example-performanceprofile
    spec:
      additionalKernelArgs:
        - nmi_watchdog=0
        - audit=0
        - mce=off
        - processor.max_cstate=1
        - idle=poll
        - intel_idle.max_cstate=0
        - nosmt
      cpu:
        isolated: 2-3
        reserved: 0-1
      hugepages:
        defaultHugepagesSize: 1G
        pages:
          - count: 2
            node: 0
            size: 1G
      nodeSelector:
        node-role.kubernetes.io/performance: ''
      realTimeKernel:
        enabled: true
    Note

    When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.

15.3.5. Understanding workload hints

The following table describes how combinations of power consumption and real-time settings impact latency.

Note

The following workload hints can be configured manually. You can also work with workload hints using the Performance Profile Creator. For more information about the performance profile, see the "Creating a performance profile" section. If the workload hint is configured manually and the realTime workload hint is not explicitly set, it defaults to true.

Performance Profile creator setting | Hint | Environment | Description

Default

workloadHints:
highPowerConsumption: false
realTime: false

High throughput cluster without latency requirements

Performance achieved through CPU partitioning only.

Low-latency

workloadHints:
highPowerConsumption: false
realTime: true

Regional datacenters

Both energy savings and low-latency are desirable: compromise between power management, latency and throughput.

Ultra-low-latency

workloadHints:
highPowerConsumption: true
realTime: true

Far edge clusters, latency critical workloads

Optimized for absolute minimal latency and maximum determinism at the cost of increased power consumption.

Per-pod power management

workloadHints:
realTime: true
highPowerConsumption: false
perPodPowerManagement: true

Critical and non-critical workloads

Allows for power management per pod.

Additional resources

15.3.6. Configuring workload hints manually

Procedure

  1. Create a PerformanceProfile appropriate for the environment’s hardware and topology as described in the table in "Understanding workload hints". Adjust the profile to match the expected workload. In this example, we tune for the lowest possible latency.
  2. Add the highPowerConsumption and realTime workload hints. Both are set to true here.

        apiVersion: performance.openshift.io/v2
        kind: PerformanceProfile
        metadata:
          name: workload-hints
        spec:
          ...
          workloadHints:
            highPowerConsumption: true 1
            realTime: true 2
    1
    If highPowerConsumption is true, the node is tuned for very low latency at the cost of increased power consumption.
    2
    Disables some debugging and monitoring features that can affect system latency.
Note

When the realTime workload hint flag is set to true in a performance profile, add the cpu-quota.crio.io: disable annotation to every guaranteed pod with pinned CPUs. This annotation is necessary to prevent the degradation of the process performance within the pod. If the realTime workload hint is not explicitly set then it defaults to true.
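For reference, the following is a minimal sketch of a guaranteed pod that carries this annotation. The pod name, image, runtime class, and CPU and memory values are placeholders; the runtime class name is assumed here to follow the performance-<profile_name> pattern for a profile named workload-hints:

apiVersion: v1
kind: Pod
metadata:
  name: example-latency-pod                      # placeholder name
  annotations:
    cpu-quota.crio.io: "disable"                 # disables the CPU CFS quota for this pod
spec:
  runtimeClassName: performance-workload-hints   # assumed runtime class created by the performance profile
  containers:
  - name: app
    image: registry.example.com/app:latest       # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: "1Gi"
      limits:
        cpu: "2"                                 # requests equal to limits gives the pod guaranteed QoS with pinned CPUs
        memory: "1Gi"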

15.3.7. Restricting CPUs for infra and application containers

Generic housekeeping and workload tasks use CPUs in a way that may impact latency-sensitive processes. By default, the container runtime uses all online CPUs to run all containers together, which can result in context switches and spikes in latency. Partitioning the CPUs prevents noisy processes from interfering with latency-sensitive processes by separating them from each other. The following table describes how processes run on a CPU after you have tuned the node using the Node Tuning Operator:

Table 15.1. Process' CPU assignments
Process type | Details

Burstable and BestEffort pods

Runs on any CPU except where low latency workload is running

Infrastructure pods

Runs on any CPU except where low latency workload is running

Interrupts

Redirects to reserved CPUs (optional in OpenShift Container Platform 4.7 and later)

Kernel processes

Pins to reserved CPUs

Latency-sensitive workload pods

Pins to a specific set of exclusive CPUs from the isolated pool

OS processes/systemd services

Pins to reserved CPUs

The allocatable capacity of cores on a node for pods of all QoS process types, Burstable, BestEffort, or Guaranteed, is equal to the capacity of the isolated pool. The capacity of the reserved pool is removed from the node’s total core capacity for use by the cluster and operating system housekeeping duties.

Example 1

A node features a capacity of 100 cores. Using a performance profile, the cluster administrator allocates 50 cores to the isolated pool and 50 cores to the reserved pool. The cluster administrator assigns 25 cores to QoS Guaranteed pods and 25 cores for BestEffort or Burstable pods. This matches the capacity of the isolated pool.

Example 2

A node features a capacity of 100 cores. Using a performance profile, the cluster administrator allocates 50 cores to the isolated pool and 50 cores to the reserved pool. The cluster administrator assigns 50 cores to QoS Guaranteed pods and one core for BestEffort or Burstable pods. This exceeds the capacity of the isolated pool by one core. Pod scheduling fails because of insufficient CPU capacity.
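To compare a node's allocatable CPU with the size of the isolated pool, one option is to query the node object directly. This is a general-purpose check rather than part of the documented procedure, and <node_name> is a placeholder:

$ oc get node <node_name> -o jsonpath='{.status.allocatable.cpu}'

After the performance profile is applied, the value reported here should correspond to the capacity of the isolated pool.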

The exact partitioning pattern to use depends on many factors like hardware, workload characteristics and the expected system load. Some sample use cases are as follows:

  • If the latency-sensitive workload uses specific hardware, such as a network interface controller (NIC), ensure that the CPUs in the isolated pool are as close as possible to this hardware. At a minimum, you should place the workload in the same Non-Uniform Memory Access (NUMA) node.
  • The reserved pool is used for handling all interrupts. If your workload depends on system networking, allocate a sufficiently sized reserved pool to handle all of the incoming packet interrupts. In 4.11 and later versions, workloads can optionally be labeled as sensitive.

The decision regarding which specific CPUs should be used for reserved and isolated partitions requires detailed analysis and measurements. Factors like NUMA affinity of devices and memory play a role. The selection also depends on the workload architecture and the specific use case.

Important

The reserved and isolated CPU pools must not overlap and together must span all available cores in the worker node.

To ensure that housekeeping tasks and workloads do not interfere with each other, specify two groups of CPUs in the spec section of the performance profile.

  • isolated - Specifies the CPUs for the application container workloads. These CPUs have the lowest latency. Processes in this group have no interruptions and can, for example, reach much higher DPDK zero packet loss bandwidth.
  • reserved - Specifies the CPUs for the cluster and operating system housekeeping duties. Threads in the reserved group are often busy. Do not run latency-sensitive applications in the reserved group. Latency-sensitive applications run in the isolated group.

Procedure

  1. Create a performance profile appropriate for the environment’s hardware and topology.
  2. Add the reserved and isolated parameters with the CPUs you want reserved and isolated for the infra and application containers:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: infra-cpus
    spec:
      cpu:
        reserved: "0-4,9" 1
        isolated: "5-8" 2
      nodeSelector: 3
        node-role.kubernetes.io/worker: ""
    1
    Specify which CPUs are for infra containers to perform cluster and operating system housekeeping duties.
    2
    Specify which CPUs are for application containers to run workloads.
    3
    Optional: Specify a node selector to apply the performance profile to specific nodes.

15.4. Reducing NIC queues using the Node Tuning Operator

The Node Tuning Operator allows you to adjust the network interface controller (NIC) queue count for each network device by configuring the performance profile. Device network queues allow the distribution of packets among different physical queues, and each queue gets a separate thread for packet processing.

In real-time or low latency systems, all the unnecessary interrupt request lines (IRQs) pinned to the isolated CPUs must be moved to reserved or housekeeping CPUs.

In deployments with applications that require system (OpenShift Container Platform) networking, or in mixed deployments with Data Plane Development Kit (DPDK) workloads, multiple queues are needed to achieve good throughput, and the number of NIC queues should be adjusted or left unchanged. For example, to achieve low latency, the number of NIC queues for DPDK-based workloads should be reduced to just the number of reserved or housekeeping CPUs.

By default, too many queues are created for each CPU, and these do not fit into the interrupt tables for the housekeeping CPUs when tuning for low latency. Reducing the number of queues makes proper tuning possible. A smaller number of queues means fewer interrupts, which then fit in the IRQ table.

Note

In earlier versions of OpenShift Container Platform, the Performance Addon Operator provided automatic, low latency performance tuning for applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator.

15.4.1. Adjusting the NIC queues with the performance profile

The performance profile lets you adjust the queue count for each network device.

Supported network devices:

  • Non-virtual network devices
  • Network devices that support multiple queues (channels)

Unsupported network devices:

  • Pure software network interfaces
  • Block devices
  • Intel DPDK virtual functions

Prerequisites

  • Access to the cluster as a user with the cluster-admin role.
  • Install the OpenShift CLI (oc).

Procedure

  1. Log in to the OpenShift Container Platform cluster running the Node Tuning Operator as a user with cluster-admin privileges.
  2. Create and apply a performance profile appropriate for your hardware and topology. For guidance on creating a profile, see the "Creating a performance profile" section.
  3. Edit this created performance profile:

    $ oc edit -f <your_profile_name>.yaml
  4. Populate the spec field with the net object. The object list can contain two fields:

    • userLevelNetworking is a required field specified as a boolean flag. If userLevelNetworking is true, the queue count is set to the reserved CPU count for all supported devices. The default is false.
    • devices is an optional field specifying a list of devices that will have the queues set to the reserved CPU count. If the device list is empty, the configuration applies to all network devices. The configuration is as follows:

      • interfaceName: This field specifies the interface name, and it supports shell-style wildcards, which can be positive or negative.

        • Example wildcard syntax is as follows: <string>.*
        • Negative rules are prefixed with an exclamation mark. To apply the net queue changes to all devices other than the excluded list, use !<device>, for example, !eno1.
      • vendorID: The network device vendor ID represented as a 16-bit hexadecimal number with a 0x prefix.
      • deviceID: The network device ID (model) represented as a 16-bit hexadecimal number with a 0x prefix.

        Note

        When a deviceID is specified, the vendorID must also be defined. A device that matches all of the device identifiers specified in a device entry (interfaceName, vendorID, or a pair of vendorID plus deviceID) qualifies as a network device. This network device then has its net queues count set to the reserved CPU count.

        When two or more devices are specified, the net queues count is set to any net device that matches one of them.

  5. Set the queue count to the reserved CPU count for all devices by using this example performance profile:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: manual
    spec:
      cpu:
        isolated: 3-51,55-103
        reserved: 0-2,52-54
      net:
        userLevelNetworking: true
      nodeSelector:
        node-role.kubernetes.io/worker-cnf: ""
  6. Set the queue count to the reserved CPU count for all devices matching any of the defined device identifiers by using this example performance profile:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: manual
    spec:
      cpu:
        isolated: 3-51,55-103
        reserved: 0-2,52-54
      net:
        userLevelNetworking: true
        devices:
        - interfaceName: "eth0"
        - interfaceName: "eth1"
        - vendorID: "0x1af4"
        - deviceID: "0x1000"
      nodeSelector:
        node-role.kubernetes.io/worker-cnf: ""
  7. Set the queue count to the reserved CPU count for all devices starting with the interface name eth by using this example performance profile:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: manual
    spec:
      cpu:
        isolated: 3-51,55-103
        reserved: 0-2,52-54
      net:
        userLevelNetworking: true
        devices:
        - interfaceName: "eth*"
      nodeSelector:
        node-role.kubernetes.io/worker-cnf: ""
  8. Set the queue count to the reserved CPU count for all devices with an interface named anything other than eno1 by using this example performance profile:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: manual
    spec:
      cpu:
        isolated: 3-51,55-103
        reserved: 0-2,52-54
      net:
        userLevelNetworking: true
        devices:
        - interfaceName: "!eno1"
      nodeSelector:
        node-role.kubernetes.io/worker-cnf: ""
  9. Set the queue count to the reserved CPU count for all devices that have an interface name eth0, vendorID of 0x1af4, and deviceID of 0x1000 by using this example performance profile:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: manual
    spec:
      cpu:
        isolated: 3-51,55-103
        reserved: 0-2,52-54
      net:
        userLevelNetworking: true
        devices:
        - interfaceName: "eth0"
        - vendorID: "0x1af4"
        - deviceID: "0x1000"
      nodeSelector:
        node-role.kubernetes.io/worker-cnf: ""
  10. Apply the updated performance profile:

    $ oc apply -f <your_profile_name>.yaml

Additional resources

15.4.2. Verifying the queue status

In this section, a number of examples illustrate different performance profiles and how to verify the changes are applied.

Example 1

In this example, the net queue count is set to the reserved CPU count (2) for all supported devices.

The relevant section from the performance profile is:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    reserved: 0-1  # total = 2
    isolated: 2-8
  net:
    userLevelNetworking: true
# ...
  • Display the status of the queues associated with a device using the following command:

    Note

    Run this command on the node where the performance profile was applied.

    $ ethtool -l <device>
  • Verify the queue status before the profile is applied:

    $ ethtool -l ens4

    Example output

    Channel parameters for ens4:
    Pre-set maximums:
    RX:         0
    TX:         0
    Other:      0
    Combined:   4
    Current hardware settings:
    RX:         0
    TX:         0
    Other:      0
    Combined:   4

  • Verify the queue status after the profile is applied:

    $ ethtool -l ens4

    Example output

    Channel parameters for ens4:
    Pre-set maximums:
    RX:         0
    TX:         0
    Other:      0
    Combined:   4
    Current hardware settings:
    RX:         0
    TX:         0
    Other:      0
    Combined:   2 1

1
The combined channel shows that the total count of reserved CPUs for all supported devices is 2. This matches what is configured in the performance profile.
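The ethtool checks in these examples must run on the node where the performance profile was applied. If you do not have direct SSH access to the node, one possible way to reach it is through a debug pod. This is a general OpenShift Container Platform pattern rather than part of this procedure, and <node_name> is a placeholder:

$ oc debug node/<node_name>
sh-4.4# chroot /host
sh-4.4# ethtool -l ens4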

Example 2

In this example, the net queue count is set to the reserved CPU count (2) for all supported network devices with a specific vendorID.

The relevant section from the performance profile is:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  cpu:
    reserved: 0-1  # total = 2
    isolated: 2-8
  net:
    userLevelNetworking: true
    devices:
    - vendorID: "0x1af4"
# ...
  • Display the status of the queues associated with a device using the following command:

    Note

    Run this command on the node where the performance profile was applied.

    $ ethtool -l <device>
  • Verify the queue status after the profile is applied:

    $ ethtool -l ens4

    Example output

    Channel parameters for ens4:
    Pre-set maximums:
    RX:         0
    TX:         0
    Other:      0
    Combined:   4
    Current hardware settings:
    RX:         0
    TX:         0
    Other:      0
    Combined:   2 1

1
The total count of reserved CPUs for all supported devices with vendorID=0x1af4 is 2. For example, if there is another network device ens2 with vendorID=0x1af4 it will also have total net queues of 2. This matches what is configured in the performance profile.

Example 3

In this example, the net queue count is set to the reserved CPU count (2) for all supported network devices that match any of the defined device identifiers.

The command udevadm info provides a detailed report on a device. In this example the devices are:

# udevadm info -p /sys/class/net/ens4
...
E: ID_MODEL_ID=0x1000
E: ID_VENDOR_ID=0x1af4
E: INTERFACE=ens4
...
# udevadm info -p /sys/class/net/eth0
...
E: ID_MODEL_ID=0x1002
E: ID_VENDOR_ID=0x1001
E: INTERFACE=eth0
...
  • Set the net queues to 2 for a device with interfaceName equal to eth0 and any devices that have a vendorID=0x1af4 with the following performance profile:

    apiVersion: performance.openshift.io/v2
    kind: PerformanceProfile
    metadata:
      name: performance
    spec:
      cpu:
        reserved: 0-1  # total = 2
        isolated: 2-8
      net:
        userLevelNetworking: true
        devices:
        - interfaceName: "eth0"
        - vendorID: "0x1af4"
    ...
  • Verify the queue status after the profile is applied:

    $ ethtool -l ens4

    Example output

    Channel parameters for ens4:
    Pre-set maximums:
    RX:         0
    TX:         0
    Other:      0
    Combined:   4
    Current hardware settings:
    RX:         0
    TX:         0
    Other:      0
    Combined:   2 1

    1
    The total count of reserved CPUs for all supported devices with vendorID=0x1af4 is set to 2. For example, if there is another network device ens2 with vendorID=0x1af4, it will also have the total net queues set to 2. Similarly, a device with interfaceName equal to eth0 will have total net queues set to 2.

15.4.3. Logging associated with adjusting NIC queues

Log messages detailing the assigned devices are recorded in the respective Tuned daemon logs. The following messages might be recorded to the /var/log/tuned/tuned.log file:

  • An INFO message is recorded detailing the successfully assigned devices:

    INFO tuned.plugins.base: instance net_test (net): assigning devices ens1, ens2, ens3
  • A WARNING message is recorded if none of the devices can be assigned:

    WARNING  tuned.plugins.base: instance net_test: no matching devices available
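The tuned.log file lives on the node itself. As with the ethtool checks earlier, if you do not have direct SSH access to the node, you can read the log through a debug pod. This is a sketch using standard oc debug commands, with <node_name> as a placeholder:

$ oc debug node/<node_name>
sh-4.4# chroot /host
sh-4.4# grep 'tuned.plugins.base' /var/log/tuned/tuned.log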

15.5. Debugging low latency CNF tuning status

The PerformanceProfile custom resource (CR) contains status fields for reporting tuning status and debugging latency degradation issues. These fields report on conditions that describe the state of the operator’s reconciliation functionality.

A typical issue can arise when one of the machine config pools that is attached to the performance profile is in a degraded state, causing the PerformanceProfile status to degrade. In this case, the machine config pool issues a failure message.

The Node Tuning Operator populates the performanceProfile.status.conditions status field:

Status:
  Conditions:
    Last Heartbeat Time:   2020-06-02T10:01:24Z
    Last Transition Time:  2020-06-02T10:01:24Z
    Status:                True
    Type:                  Available
    Last Heartbeat Time:   2020-06-02T10:01:24Z
    Last Transition Time:  2020-06-02T10:01:24Z
    Status:                True
    Type:                  Upgradeable
    Last Heartbeat Time:   2020-06-02T10:01:24Z
    Last Transition Time:  2020-06-02T10:01:24Z
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2020-06-02T10:01:24Z
    Last Transition Time:  2020-06-02T10:01:24Z
    Status:                False
    Type:                  Degraded

The Status field contains Conditions that specify Type values that indicate the status of the performance profile:

Available
All machine configs and Tuned profiles have been created successfully and are available to the cluster components that are responsible for processing them (NTO, MCO, Kubelet).
Upgradeable
Indicates whether the resources maintained by the Operator are in a state that is safe to upgrade.
Progressing
Indicates that the deployment process from the performance profile has started.
Degraded

Indicates an error if:

  • Validation of the performance profile has failed.
  • Creation of all relevant components did not complete successfully.

Each of these types contains the following fields:

Status
The state for the specific type (true or false).
Timestamp
The transition timestamp.
Reason string
The machine readable reason.
Message string
The human readable reason describing the state and error details, if any.
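To list these conditions from the command line, one option is a jsonpath query against the profile. This is a general oc usage sketch, and <profile_name> is a placeholder:

$ oc get performanceprofile <profile_name> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'

Each line of output shows a condition type, its status, and the machine readable reason.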

15.5.1. Machine config pools

A performance profile and its created products are applied to a node according to an associated machine config pool (MCP). The MCP holds valuable information about the progress of applying the machine configurations created by performance profiles that encompass kernel args, kube config, huge pages allocation, and deployment of rt-kernel. The Performance Profile controller monitors changes in the MCP and updates the performance profile status accordingly.

The only condition returned by the MCP to the performance profile status is when the MCP is Degraded, which leads to performanceProfile.status.condition.Degraded = true.

Example

The following example is for a performance profile with an associated machine config pool (worker-cnf) that was created for it:

  1. The associated machine config pool is in a degraded state:

    # oc get mcp

    Example output

    NAME         CONFIG                                                 UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master       rendered-master-2ee57a93fa6c9181b546ca46e1571d2d       True      False      False      3              3                   3                     0                      2d21h
    worker       rendered-worker-d6b2bdc07d9f5a59a6b68950acf25e5f       True      False      False      2              2                   2                     0                      2d21h
    worker-cnf   rendered-worker-cnf-6c838641b8a08fff08dbd8b02fb63f7c   False     True       True       2              1                   1                     1                      2d20h

  2. The describe section of the MCP shows the reason:

    # oc describe mcp worker-cnf

    Example output

      Message:               Node node-worker-cnf is reporting: "prepping update:
      machineconfig.machineconfiguration.openshift.io \"rendered-worker-cnf-40b9996919c08e335f3ff230ce1d170\" not
      found"
        Reason:                1 nodes are reporting degraded status on sync

  3. The degraded state should also appear under the performance profile status field marked as degraded = true:

    # oc describe performanceprofiles performance

    Example output

    Message: Machine config pool worker-cnf Degraded Reason: 1 nodes are reporting degraded status on sync.
    Machine config pool worker-cnf Degraded Message: Node yquinn-q8s5v-w-b-z5lqn.c.openshift-gce-devel.internal is
    reporting: "prepping update: machineconfig.machineconfiguration.openshift.io
    \"rendered-worker-cnf-40b9996919c08e335f3ff230ce1d170\" not found".    Reason:  MCPDegraded
       Status:  True
       Type:    Degraded

15.6. Collecting low latency tuning debugging data for Red Hat Support

When opening a support case, it is helpful to provide debugging information about your cluster to Red Hat Support.

The must-gather tool enables you to collect diagnostic information about your OpenShift Container Platform cluster, including node tuning, NUMA topology, and other information needed to debug issues with low latency setup.

For prompt support, supply diagnostic information for both OpenShift Container Platform and low latency tuning.

15.6.1. About the must-gather tool

The oc adm must-gather CLI command collects the information from your cluster that is most likely needed for debugging issues, such as:

  • Resource definitions
  • Audit logs
  • Service logs

You can specify one or more images when you run the command by including the --image argument. When you specify an image, the tool collects data related to that feature or product. When you run oc adm must-gather, a new pod is created on the cluster. The data is collected on that pod and saved in a new directory that starts with must-gather.local. This directory is created in your current working directory.

15.6.2. About collecting low latency tuning data

Use the oc adm must-gather CLI command to collect information about your cluster, including the following features and objects associated with low latency tuning:

  • The Node Tuning Operator namespaces and child objects.
  • MachineConfigPool and associated MachineConfig objects.
  • The Node Tuning Operator and associated Tuned objects.
  • Linux Kernel command line options.
  • CPU and NUMA topology
  • Basic PCI device information and NUMA locality.

To collect debugging information with must-gather, you must specify the Performance Addon Operator must-gather image:

--image=registry.redhat.io/openshift4/performance-addon-operator-must-gather-rhel8:v4.11
Note

In earlier versions of OpenShift Container Platform, the Performance Addon Operator provided automatic, low latency performance tuning for applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator. However, you must still use the performance-addon-operator-must-gather image when running the must-gather command.
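For example, a minimal invocation that collects only the low latency tuning data could look like the following. This is a sketch; the combined command that also gathers the default cluster data is shown in the next section:

$ oc adm must-gather \
  --image=registry.redhat.io/openshift4/performance-addon-operator-must-gather-rhel8:v4.11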

15.6.3. Gathering data about specific features

You can gather debugging information about specific features by using the oc adm must-gather CLI command with the --image or --image-stream argument. The must-gather tool supports multiple images, so you can gather data about more than one feature by running a single command.

Note

To collect the default must-gather data in addition to specific feature data, add the --image-stream=openshift/must-gather argument.

Note

In earlier versions of OpenShift Container Platform, the Performance Addon Operator provided automatic, low latency performance tuning for applications. In OpenShift Container Platform 4.11, these functions are part of the Node Tuning Operator. However, you must still use the performance-addon-operator-must-gather image when running the must-gather command.

Prerequisites

  • Access to the cluster as a user with the cluster-admin role.
  • The OpenShift Container Platform CLI (oc) installed.

Procedure

  1. Navigate to the directory where you want to store the must-gather data.
  2. Run the oc adm must-gather command with one or more --image or --image-stream arguments. For example, the following command gathers both the default cluster data and information specific to the Node Tuning Operator:

    $ oc adm must-gather \
     --image-stream=openshift/must-gather \ 1
     --image=registry.redhat.io/openshift4/performance-addon-operator-must-gather-rhel8:v4.11 2
    1
    The default OpenShift Container Platform must-gather image.
    2
    The must-gather image for low latency tuning diagnostics.
  3. Create a compressed file from the must-gather directory that was created in your working directory. For example, on a computer that uses a Linux operating system, run the following command:

     $ tar cvaf must-gather.tar.gz must-gather.local.5421342344627712289/ 1
    1
    Replace must-gather.local.5421342344627712289/ with the actual directory name.
  4. Attach the compressed file to your support case on the Red Hat Customer Portal.

Additional resources

Chapter 16. Performing latency tests for platform verification

You can use the Cloud-native Network Functions (CNF) tests image to run latency tests on a CNF-enabled OpenShift Container Platform cluster, where all the components required for running CNF workloads are installed. Run the latency tests to validate node tuning for your workload.

The cnf-tests container image is available at registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11.

Important

The cnf-tests image also includes several tests that are not supported by Red Hat at this time. Only the latency tests are supported by Red Hat.

16.1. Prerequisites for running latency tests

Your cluster must meet the following requirements before you can run the latency tests:

  1. You have configured a performance profile with the Node Tuning Operator.
  2. You have applied all the required CNF configurations in the cluster.
  3. You have a pre-existing MachineConfigPool CR applied in the cluster. The default worker pool is worker-cnf.

Additional resources

16.2. About discovery mode for latency tests

Use discovery mode to validate the functionality of a cluster without altering its configuration. Existing environment configurations are used for the tests. The tests can find the configuration items needed and use those items to execute the tests. If resources needed to run a specific test are not found, the test is skipped, providing an appropriate message to the user. After the tests are finished, no cleanup of the preconfigured configuration items is done, and the test environment can be immediately used for another test run.

Important

When running the latency tests, always run the tests with -e DISCOVERY_MODE=true and -ginkgo.focus set to the appropriate latency test. If you do not run the latency tests in discovery mode, your existing live cluster performance profile configuration will be modified by the test run.

Limiting the nodes used during tests

The nodes on which the tests are executed can be limited by specifying a NODES_SELECTOR environment variable, for example, -e NODES_SELECTOR=node-role.kubernetes.io/worker-cnf. Any resources created by the test are limited to nodes with matching labels.

Note

If you want to override the default worker pool, pass the -e ROLE_WORKER_CNF=<custom_worker_pool> variable to the command specifying an appropriate label.
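For example, a run that limits the tests to worker-cnf nodes and explicitly sets the worker pool might look like the following. This is a sketch that combines both variables with the test command shown in "Running the latency tests":

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
-e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance \
-e NODES_SELECTOR=node-role.kubernetes.io/worker-cnf -e ROLE_WORKER_CNF=worker-cnf \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
/usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"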

16.3. Measuring latency

The cnf-tests image uses three tools to measure the latency of the system:

  • hwlatdetect
  • cyclictest
  • oslat

Each tool has a specific use. Use the tools in sequence to achieve reliable test results.

hwlatdetect
Measures the baseline that the bare-metal hardware can achieve. Before proceeding with the next latency test, ensure that the latency reported by hwlatdetect meets the required threshold because you cannot fix hardware latency spikes by operating system tuning.
cyclictest
Verifies the real-time kernel scheduler latency after hwlatdetect passes validation. The cyclictest tool schedules a repeated timer and measures the difference between the desired and the actual trigger times. The difference can uncover basic issues with the tuning caused by interrupts or process priorities. The tool must run on a real-time kernel.
oslat
Behaves similarly to a CPU-intensive DPDK application and measures all the interruptions and disruptions to the busy loop that simulates CPU heavy data processing.

The tests introduce the following environment variables:

Table 16.1. Latency test environment variables
Environment variable | Description

LATENCY_TEST_DELAY

Specifies the amount of time in seconds after which the test starts running. You can use the variable to allow the CPU manager reconcile loop to update the default CPU pool. The default value is 0.

LATENCY_TEST_CPUS

Specifies the number of CPUs that the pod running the latency tests uses. If you do not set the variable, the default configuration includes all isolated CPUs.

LATENCY_TEST_RUNTIME

Specifies the amount of time in seconds that the latency test must run. The default value is 300 seconds.

HWLATDETECT_MAXIMUM_LATENCY

Specifies the maximum acceptable hardware latency in microseconds for the workload and operating system. If you do not set the value of HWLATDETECT_MAXIMUM_LATENCY or MAXIMUM_LATENCY, the tool compares the default expected threshold (20μs) and the actual maximum latency in the tool itself. Then, the test fails or succeeds accordingly.

CYCLICTEST_MAXIMUM_LATENCY

Specifies the maximum latency in microseconds that all threads expect before waking up during the cyclictest run. If you do not set the value of CYCLICTEST_MAXIMUM_LATENCY or MAXIMUM_LATENCY, the tool skips the comparison of the expected and the actual maximum latency.

OSLAT_MAXIMUM_LATENCY

Specifies the maximum acceptable latency in microseconds for the oslat test results. If you do not set the value of OSLAT_MAXIMUM_LATENCY or MAXIMUM_LATENCY, the tool skips the comparison of the expected and the actual maximum latency.

MAXIMUM_LATENCY

Unified variable that specifies the maximum acceptable latency in microseconds. Applicable for all available latency tools.

LATENCY_TEST_RUN

Boolean parameter that indicates whether the tests should run. LATENCY_TEST_RUN is set to false by default. To run the latency tests, set this value to true.

Note

Variables that are specific to a latency tool take precedence over unified variables. For example, if OSLAT_MAXIMUM_LATENCY is set to 30 microseconds and MAXIMUM_LATENCY is set to 10 microseconds, the oslat test will run with maximum acceptable latency of 30 microseconds.
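For example, using the values from the note above, a run like the following (a sketch based on the commands in "Running the latency tests") applies the 30 microsecond limit to the oslat test while the other tools fall back to the unified 10 microsecond value:

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
-e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance \
-e MAXIMUM_LATENCY=10 -e OSLAT_MAXIMUM_LATENCY=30 \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
/usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"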

16.4. Running the latency tests

Run the cluster latency tests to validate node tuning for your Cloud-native Network Functions (CNF) workload.

Important

Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.

Note

When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. To make the podman command work, append :Z to the volume mount; for example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.

Procedure

  1. Open a shell prompt in the directory containing the kubeconfig file.

    You provide the test image with a kubeconfig file in the current directory and its related $KUBECONFIG environment variable, mounted through a volume. This allows the running container to use the kubeconfig file from inside the container.

  2. Run the latency tests by entering the following command:

    $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
    /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"
  3. Optional: Append -ginkgo.dryRun to run the latency tests in dry-run mode. This is useful for checking which tests the suite would run.
  4. Optional: Append -ginkgo.v to run the tests with increased verbosity.
  5. Optional: To run the latency tests against a specific performance profile, run the following command, substituting appropriate values:

    $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    -e LATENCY_TEST_RUN=true -e FEATURES=performance -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \
    -e PERF_TEST_PROFILE=<performance_profile> registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
    /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"

    where:

    <performance_profile>
    Is the name of the performance profile you want to run the latency tests against.
    Important

    For valid latency test results, run the tests for at least 12 hours.

16.4.1. Running hwlatdetect

The hwlatdetect tool is available in the rt-kernel package with a regular subscription of Red Hat Enterprise Linux (RHEL) 8.x.

Important

Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.

Note

When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. To make the podman command work, append :Z to the volume mount; for example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.

Prerequisites

  • You have installed the real-time kernel in the cluster.
  • You have logged in to registry.redhat.io with your Customer Portal credentials.

Procedure

  • To run the hwlatdetect tests, run the following command, substituting variable values as appropriate:

    $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=worker-cnf \
    -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
    /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="hwlatdetect"

    The hwlatdetect test runs for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower than MAXIMUM_LATENCY (20 μs).

    If the results exceed the latency threshold, the test fails.

    Important

    For valid results, the test should run for at least 12 hours.

    Example failure output

    running /usr/bin/cnftests -ginkgo.v -ginkgo.focus=hwlatdetect
    I0908 15:25:20.023712      27 request.go:601] Waited for 1.046586367s due to client-side throttling, not priority and fairness, request: GET:https://api.hlxcl6.lab.eng.tlv2.redhat.com:6443/apis/imageregistry.operator.openshift.io/v1?timeout=32s
    Running Suite: CNF Features e2e integration tests
    =================================================
    Random Seed: 1662650718
    Will run 1 of 194 specs
    
    [...]
    
    • Failure [283.574 seconds]
    [performance] Latency Test
    /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:62
      with the hwlatdetect image
      /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:228
        should succeed [It]
        /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:236
    
        Log file created at: 2022/09/08 15:25:27
        Running on machine: hwlatdetect-b6n4n
        Binary: Built with gc go1.17.12 for linux/amd64
        Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
        I0908 15:25:27.160620       1 node.go:39] Environment information: /proc/cmdline: BOOT_IMAGE=(hd1,gpt3)/ostree/rhcos-c6491e1eedf6c1f12ef7b95e14ee720bf48359750ac900b7863c625769ef5fb9/vmlinuz-4.18.0-372.19.1.el8_6.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/c6491e1eedf6c1f12ef7b95e14ee720bf48359750ac900b7863c625769ef5fb9/0 ip=dhcp root=UUID=5f80c283-f6e6-4a27-9b47-a287157483b2 rw rootflags=prjquota boot=UUID=773bf59a-bafd-48fc-9a87-f62252d739d3 skew_tick=1 nohz=on rcu_nocbs=0-3 tuned.non_isolcpus=0000ffff,ffffffff,fffffff0 systemd.cpu_affinity=4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79 intel_iommu=on iommu=pt isolcpus=managed_irq,0-3 nohz_full=0-3 tsc=nowatchdog nosoftlockup nmi_watchdog=0 mce=off skew_tick=1 rcutree.kthread_prio=11 + +
        I0908 15:25:27.160830       1 node.go:46] Environment information: kernel version 4.18.0-372.19.1.el8_6.x86_64
        I0908 15:25:27.160857       1 main.go:50] running the hwlatdetect command with arguments [/usr/bin/hwlatdetect --threshold 1 --hardlimit 1 --duration 100 --window 10000000us --width 950000us]
        F0908 15:27:10.603523       1 main.go:53] failed to run hwlatdetect command; out: hwlatdetect:  test duration 100 seconds
           detector: tracer
           parameters:
                Latency threshold: 1us 1
                Sample window:     10000000us
                Sample width:      950000us
             Non-sampling period:  9050000us
                Output File:       None
    
        Starting test
        test finished
        Max Latency: 326us 2
        Samples recorded: 5
        Samples exceeding threshold: 5
        ts: 1662650739.017274507, inner:6, outer:6
        ts: 1662650749.257272414, inner:14, outer:326
        ts: 1662650779.977272835, inner:314, outer:12
        ts: 1662650800.457272384, inner:3, outer:9
        ts: 1662650810.697273520, inner:3, outer:2
    
    [...]
    
    JUnit report was created: /junit.xml/cnftests-junit.xml
    
    
    Summarizing 1 Failure:
    
    [Fail] [performance] Latency Test with the hwlatdetect image [It] should succeed
    /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:476
    
    Ran 1 of 194 Specs in 365.797 seconds
    FAIL! -- 0 Passed | 1 Failed | 0 Pending | 193 Skipped
    --- FAIL: TestTest (366.08s)
    FAIL

    1
    You can configure the latency threshold by using the MAXIMUM_LATENCY or the HWLATDETECT_MAXIMUM_LATENCY environment variables.
    2
    The maximum latency value measured during the test.
Example hwlatdetect test results

You can capture the following types of results:

  • Rough results that are gathered after each run to create a history of the impact of any changes made throughout the test.
  • The combined set of the rough tests with the best results and configuration settings.

Example of good results

hwlatdetect: test duration 3600 seconds
detector: tracer
parameters:
Latency threshold: 10us
Sample window: 1000000us
Sample width: 950000us
Non-sampling period: 50000us
Output File: None

Starting test
test finished
Max Latency: Below threshold
Samples recorded: 0

The hwlatdetect tool only provides output if a sample exceeds the specified threshold.

Example of bad results

hwlatdetect: test duration 3600 seconds
detector: tracer
parameters:
Latency threshold: 10us
Sample window: 1000000us
Sample width: 950000us
Non-sampling period: 50000us
Output File: None

Starting test
ts: 1610542421.275784439, inner:78, outer:81
ts: 1610542444.330561619, inner:27, outer:28
ts: 1610542445.332549975, inner:39, outer:38
ts: 1610542541.568546097, inner:47, outer:32
ts: 1610542590.681548531, inner:13, outer:17
ts: 1610543033.818801482, inner:29, outer:30
ts: 1610543080.938801990, inner:90, outer:76
ts: 1610543129.065549639, inner:28, outer:39
ts: 1610543474.859552115, inner:28, outer:35
ts: 1610543523.973856571, inner:52, outer:49
ts: 1610543572.089799738, inner:27, outer:30
ts: 1610543573.091550771, inner:34, outer:28
ts: 1610543574.093555202, inner:116, outer:63

The output of hwlatdetect shows that multiple samples exceed the threshold. However, the same output can indicate different results based on the following factors:

  • The duration of the test
  • The number of CPU cores
  • The host firmware settings
Warning

Before proceeding with the next latency test, ensure that the latency reported by hwlatdetect meets the required threshold. Fixing latencies introduced by hardware might require you to contact the system vendor support.

Not all latency spikes are hardware related. Ensure that you tune the host firmware to meet your workload requirements. For more information, see Setting firmware parameters for system tuning.

16.4.2. Running cyclictest

The cyclictest tool measures the real-time kernel scheduler latency on the specified CPUs.

Important

Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.

Note

When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. To make the podman command work, append :Z to the volume mount; for example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.

Prerequisites

  • You have logged in to registry.redhat.io with your Customer Portal credentials.
  • You have installed the real-time kernel in the cluster.
  • You have applied a cluster performance profile by using Node Tuning Operator.

Procedure

  • To perform the cyclictest, run the following command, substituting variable values as appropriate:

    $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=worker-cnf \
    -e LATENCY_TEST_CPUS=10 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
    /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="cyclictest"

    The command runs the cyclictest tool for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower than MAXIMUM_LATENCY (in this example, 20 μs). Latency spikes of 20 μs and above are generally not acceptable for telco RAN workloads.

    If the results exceed the latency threshold, the test fails.

    Important

    For valid results, the test should run for at least 12 hours.

    Example failure output

    running /usr/bin/cnftests -ginkgo.v -ginkgo.focus=cyclictest
    I0908 13:01:59.193776      27 request.go:601] Waited for 1.046228824s due to client-side throttling, not priority and fairness, request: GET:https://api.compute-1.example.com:6443/apis/packages.operators.coreos.com/v1?timeout=32s
    Running Suite: CNF Features e2e integration tests
    =================================================
    Random Seed: 1662642118
    Will run 1 of 194 specs
    
    [...]
    
    Summarizing 1 Failure:
    
    [Fail] [performance] Latency Test with the cyclictest image [It] should succeed
    /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:220
    
    Ran 1 of 194 Specs in 161.151 seconds
    FAIL! -- 0 Passed | 1 Failed | 0 Pending | 193 Skipped
    --- FAIL: TestTest (161.48s)
    FAIL

Example cyclictest results

The same output can indicate different results for different workloads. For example, spikes up to 18μs are acceptable for 4G DU workloads, but not for 5G DU workloads.

Example of good results

running cmd: cyclictest -q -D 10m -p 1 -t 16 -a 2,4,6,8,10,12,14,16,54,56,58,60,62,64,66,68 -h 30 -i 1000 -m
# Histogram
000000 000000   000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000
000001 000000   000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000
000002 579506   535967  418614  573648  532870  529897  489306  558076  582350  585188  583793  223781  532480  569130  472250  576043
More histogram entries ...
# Total: 000600000 000600000 000600000 000599999 000599999 000599999 000599998 000599998 000599998 000599997 000599997 000599996 000599996 000599995 000599995 000599995
# Min Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002
# Avg Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002
# Max Latencies: 00005 00005 00004 00005 00004 00004 00005 00005 00006 00005 00004 00005 00004 00004 00005 00004
# Histogram Overflows: 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000
# Histogram Overflow at cycle number:
# Thread 0:
# Thread 1:
# Thread 2:
# Thread 3:
# Thread 4:
# Thread 5:
# Thread 6:
# Thread 7:
# Thread 8:
# Thread 9:
# Thread 10:
# Thread 11:
# Thread 12:
# Thread 13:
# Thread 14:
# Thread 15:

Example of bad results

running cmd: cyclictest -q -D 10m -p 1 -t 16 -a 2,4,6,8,10,12,14,16,54,56,58,60,62,64,66,68 -h 30 -i 1000 -m
# Histogram
000000 000000   000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000
000001 000000   000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000  000000
000002 564632   579686  354911  563036  492543  521983  515884  378266  592621  463547  482764  591976  590409  588145  589556  353518
More histogram entries ...
# Total: 000599999 000599999 000599999 000599997 000599997 000599998 000599998 000599997 000599997 000599996 000599995 000599996 000599995 000599995 000599995 000599993
# Min Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002
# Avg Latencies: 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002 00002
# Max Latencies: 00493 00387 00271 00619 00541 00513 00009 00389 00252 00215 00539 00498 00363 00204 00068 00520
# Histogram Overflows: 00001 00001 00001 00002 00002 00001 00000 00001 00001 00001 00002 00001 00001 00001 00001 00002
# Histogram Overflow at cycle number:
# Thread 0: 155922
# Thread 1: 110064
# Thread 2: 110064
# Thread 3: 110063 155921
# Thread 4: 110063 155921
# Thread 5: 155920
# Thread 6:
# Thread 7: 110062
# Thread 8: 110062
# Thread 9: 155919
# Thread 10: 110061 155919
# Thread 11: 155918
# Thread 12: 155918
# Thread 13: 110060
# Thread 14: 110060
# Thread 15: 110059 155917

16.4.3. Running oslat

The oslat test simulates a CPU-intensive DPDK application and measures all the interruptions and disruptions to test how the cluster handles CPU heavy data processing.

Important

Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.

Note

When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. To make the podman command work, append :Z to the volume mount; for example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.

Prerequisites

  • You have logged in to registry.redhat.io with your Customer Portal credentials.
  • You have applied a cluster performance profile by using the Node Tuning Operator.

Procedure

  • To perform the oslat test, run the following command, substituting variable values as appropriate:

    $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=worker-cnf \
    -e LATENCY_TEST_CPUS=7 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
    /usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="oslat"

    LATENCY_TEST_CPUS specifies the list of CPUs to test with the oslat command.

    The command runs the oslat tool for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower than MAXIMUM_LATENCY (20 μs).

    If the results exceed the latency threshold, the test fails.

    Important

    For valid results, the test should run for at least 12 hours.

    Example failure output

    running /usr/bin/cnftests -ginkgo.v -ginkgo.focus=oslat
    I0908 12:51:55.999393      27 request.go:601] Waited for 1.044848101s due to client-side throttling, not priority and fairness, request: GET:https://compute-1.example.com:6443/apis/machineconfiguration.openshift.io/v1?timeout=32s
    Running Suite: CNF Features e2e integration tests
    =================================================
    Random Seed: 1662641514
    Will run 1 of 194 specs
    
    [...]
    
    • Failure [77.833 seconds]
    [performance] Latency Test
    /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:62
      with the oslat image
      /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:128
        should succeed [It]
        /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:153
    
        The current latency 304 is bigger than the expected one 1 : 1
    
    [...]
    
    Summarizing 1 Failure:
    
    [Fail] [performance] Latency Test with the oslat image [It] should succeed
    /remote-source/app/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/4_latency/latency.go:177
    
    Ran 1 of 194 Specs in 161.091 seconds
    FAIL! -- 0 Passed | 1 Failed | 0 Pending | 193 Skipped
    --- FAIL: TestTest (161.42s)
    FAIL

    1
    In this example, the measured latency is outside the maximum allowed value.

16.5. Generating a latency test failure report

Use the following procedure to generate a latency test failure report with information for troubleshooting.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have logged in as a user with cluster-admin privileges.

Procedure

  • Create a test failure report with information about the cluster state and resources for troubleshooting by passing the --report parameter with the path to where the report is dumped:

    $ podman run -v $(pwd)/:/kubeconfig:Z -v $(pwd)/reportdest:<report_folder_path> \
    -e KUBECONFIG=/kubeconfig/kubeconfig  -e DISCOVERY_MODE=true -e FEATURES=performance \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
    /usr/bin/test-run.sh --report <report_folder_path> \
    -ginkgo.focus="\[performance\]\ Latency\ Test"

    where:

    <report_folder_path>
    Is the path to the folder where the report is generated.

16.6. Generating a JUnit latency test report

Use the following procedure to generate a JUnit latency test report.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have logged in as a user with cluster-admin privileges.

Procedure

  • Create a JUnit-compliant XML report by passing the --junit parameter together with the path to where the report is dumped:

    $ podman run -v $(pwd)/:/kubeconfig:Z -v $(pwd)/junitdest:<junit_folder_path> \
    -e KUBECONFIG=/kubeconfig/kubeconfig -e DISCOVERY_MODE=true -e FEATURES=performance \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
    /usr/bin/test-run.sh --junit <junit_folder_path> \
    -ginkgo.focus="\[performance\]\ Latency\ Test"

    where:

    <junit_folder_path>
    Is the path to the folder where the JUnit report is generated.

16.7. Running latency tests on a single-node OpenShift cluster

You can run latency tests on single-node OpenShift clusters.

Important

Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.

Note

When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. To make the podman command work, append :Z to the volumes creation; for example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have logged in as a user with cluster-admin privileges.

Procedure

  • To run the latency tests on a single-node OpenShift cluster, run the following command:

    $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    -e DISCOVERY_MODE=true -e FEATURES=performance -e ROLE_WORKER_CNF=master \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
    /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"
    Note

    ROLE_WORKER_CNF=master is required because master is the only machine pool to which the node belongs. For more information about setting the required MachineConfigPool for the latency tests, see "Prerequisites for running latency tests".

    After running the test suite, all the dangling resources are cleaned up.

16.8. Running latency tests in a disconnected cluster

The CNF tests image can run tests in a disconnected cluster that is not able to reach external registries. This requires two steps:

  1. Mirroring the cnf-tests image to the custom disconnected registry.
  2. Instructing the tests to consume the images from the custom disconnected registry.
Mirroring the images to a custom registry accessible from the cluster

A mirror executable is shipped in the image to provide the input required by oc to mirror the test image to a local registry.

  1. Run this command from an intermediate machine that has access to the cluster and registry.redhat.io:

    $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
    /usr/bin/mirror -registry <disconnected_registry> | oc image mirror -f -

    where:

    <disconnected_registry>
    Is the disconnected mirror registry you have configured, for example, my.local.registry:5000/.
  2. When you have mirrored the cnf-tests image into the disconnected registry, you must override the original registry used to fetch the images when running the tests, for example:

    $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    -e DISCOVERY_MODE=true -e FEATURES=performance -e IMAGE_REGISTRY="<disconnected_registry>" \
    -e CNF_TESTS_IMAGE="cnf-tests-rhel8:v4.11" \
    <disconnected_registry>/cnf-tests-rhel8:v4.11 \
    /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"
Configuring the tests to consume images from a custom registry

You can run the latency tests with a custom test image and image registry by using the CNF_TESTS_IMAGE and IMAGE_REGISTRY variables.

  • To configure the latency tests to use a custom test image and image registry, run the following command:

    $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    -e IMAGE_REGISTRY="<custom_image_registry>" \
    -e CNF_TESTS_IMAGE="<custom_cnf-tests_image>" \
    -e FEATURES=performance \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/test-run.sh

    where:

    <custom_image_registry>
    is the custom image registry, for example, custom.registry:5000/.
    <custom_cnf-tests_image>
    is the custom cnf-tests image, for example, custom-cnf-tests-image:latest.
Mirroring images to the cluster OpenShift image registry

OpenShift Container Platform provides a built-in container image registry, which runs as a standard workload on the cluster.

Procedure

  1. Gain external access to the registry by exposing it with a route:

    $ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge
  2. Fetch the registry endpoint by running the following command:

    $ REGISTRY=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')
  3. Create a namespace for exposing the images:

    $ oc create ns cnftests
  4. Make the image stream available to all the namespaces used for tests. This is required to allow the test namespaces to fetch the images from the cnf-tests image stream. Run the following commands:

    $ oc policy add-role-to-user system:image-puller system:serviceaccount:cnf-features-testing:default --namespace=cnftests
    $ oc policy add-role-to-user system:image-puller system:serviceaccount:performance-addon-operators-testing:default --namespace=cnftests
  5. Retrieve the docker secret name and auth token by running the following commands:

    $ SECRET=$(oc -n cnftests get secret | grep builder-docker | awk {'print $1'})
    $ TOKEN=$(oc -n cnftests get secret $SECRET -o jsonpath="{.data['\.dockercfg']}" | base64 --decode | jq '.["image-registry.openshift-image-registry.svc:5000"].auth')
  6. Create a dockerauth.json file, for example:

    $ echo "{\"auths\": { \"$REGISTRY\": { \"auth\": $TOKEN } }}" > dockerauth.json
  7. Do the image mirroring:

    $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
    /usr/bin/mirror -registry $REGISTRY/cnftests | oc image mirror --insecure=true \
    -a=$(pwd)/dockerauth.json -f -
  8. Run the tests:

    $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    -e DISCOVERY_MODE=true -e FEATURES=performance -e IMAGE_REGISTRY=image-registry.openshift-image-registry.svc:5000/cnftests \
    cnf-tests-local:latest /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"
Mirroring a different set of test images

You can optionally change the default upstream images that are mirrored for the latency tests.

Procedure

  1. The mirror command tries to mirror the upstream images by default. You can override this behavior by passing a file with the following format to the container:

    [
        {
            "registry": "public.registry.io:5000",
            "image": "imageforcnftests:4.11"
        }
    ]
  2. Pass the file to the mirror command, for example, after saving it locally as images.json. With the following command, the local path is mounted in /kubeconfig inside the container so that the file can be passed to the mirror command.

    $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 /usr/bin/mirror \
    --registry "my.local.registry:5000/" --images "/kubeconfig/images.json" \
    |  oc image mirror -f -

16.9. Troubleshooting errors with the cnf-tests container

To run latency tests, the cluster must be accessible from within the cnf-tests container.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have logged in as a user with cluster-admin privileges.

Procedure

  • Verify that the cluster is accessible from inside the cnf-tests container by running the following command:

    $ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
    registry.redhat.io/openshift4/cnf-tests-rhel8:v4.11 \
    oc get nodes

    If this command does not work, the error might be related to DNS resolution, MTU size, or firewall access.
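
    If you need to narrow down the cause, the following host-side checks can help. This is a sketch only and assumes that oc, getent, and curl are available on the host that runs podman; it is not part of the cnf-tests suite:

    $ API_URL=$(oc whoami --show-server)
    $ API_HOST=$(echo "$API_URL" | sed -E 's#^https?://##; s#:[0-9]+$##')
    $ getent hosts "$API_HOST"                                          # checks DNS resolution of the API host
    $ curl -k -s -o /dev/null -w '%{http_code}\n' "$API_URL/healthz"    # any HTTP status code indicates the endpoint is reachable through the firewall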

Chapter 17. Improving cluster stability in high latency environments using worker latency profiles

If the cluster administrator has performed latency tests for platform verification, they might discover the need to adjust the operation of the cluster to ensure stability in cases of high latency. The cluster administrator needs to change only one parameter, recorded in a file, which controls four parameters that affect how supervisory processes read status and interpret the health of the cluster. Changing only one parameter provides cluster tuning in an easy, supportable manner.

The Kubelet process provides the starting point for monitoring cluster health. The Kubelet sets status values for all nodes in the OpenShift Container Platform cluster. The Kubernetes Controller Manager (kube controller) reads the status values every 10 seconds, by default. If the kube controller cannot read a node status value, it loses contact with that node after a configured period. The default behavior is:

  1. The node controller on the control plane updates the node health to Unhealthy and marks the node Ready condition as Unknown.
  2. In response, the scheduler stops scheduling pods to that node.
  3. The Node Lifecycle Controller adds a node.kubernetes.io/unreachable taint with a NoExecute effect to the node and schedules any pods on the node for eviction after five minutes, by default.

This behavior can cause problems if your network is prone to latency issues, especially if you have nodes at the network edge. In some cases, the Kubernetes Controller Manager might not receive an update from a healthy node due to network latency. Pods are then evicted from the node even though the node is healthy.

To avoid this problem, you can use worker latency profiles to adjust the frequency that the Kubelet and the Kubernetes Controller Manager wait for status updates before taking action. These adjustments help to ensure that your cluster runs properly if network latency between the control plane and the worker nodes is not optimal.

These worker latency profiles contain three sets of parameters that are pre-defined with carefully tuned values to control the reaction of the cluster to increased latency. You do not need to experimentally determine the best values manually.

You can configure worker latency profiles when installing a cluster or at any time you notice increased latency in your cluster network.

17.1. Understanding worker latency profiles

Worker latency profiles are carefully tuned sets of values for four parameters: node-status-update-frequency, node-monitor-grace-period, default-not-ready-toleration-seconds, and default-unreachable-toleration-seconds. These parameters allow you to control the reaction of the cluster to latency issues without needing to determine the best values manually.
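
As a quick, read-only check, you can display the profile that is currently configured by querying the nodes.config object described later in this chapter. This is a sketch, not a required step; the output is empty if no profile has been set explicitly:

$ oc get nodes.config/cluster -o jsonpath='{.spec.workerLatencyProfile}{"\n"}'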

Important

Setting these parameters manually is not supported. Incorrect parameter settings adversely affect cluster stability.

All worker latency profiles configure the following parameters:

node-status-update-frequency
Specifies how often the kubelet posts node status to the API server.
node-monitor-grace-period
Specifies the amount of time in seconds that the Kubernetes Controller Manager waits for an update from a kubelet before marking the node unhealthy and adding the node.kubernetes.io/not-ready or node.kubernetes.io/unreachable taint to the node.
default-not-ready-toleration-seconds
Specifies the amount of time in seconds after marking a node unhealthy that the Kube API Server Operator waits before evicting pods from that node.
default-unreachable-toleration-seconds
Specifies the amount of time in seconds after marking a node unreachable that the Kube API Server Operator waits before evicting pods from that node.

The following Operators monitor the changes to the worker latency profiles and respond accordingly:

  • The Machine Config Operator (MCO) updates the node-status-update-frequency parameter on the worker nodes.
  • The Kubernetes Controller Manager updates the node-monitor-grace-period parameter on the control plane nodes.
  • The Kubernetes API Server Operator updates the default-not-ready-toleration-seconds and default-unreachable-toleration-seconds parameters on the control plane nodes.

While the default configuration works in most cases, OpenShift Container Platform offers two other worker latency profiles for situations where the network is experiencing higher latency than usual. The three worker latency profiles are described in the following sections:

Default worker latency profile

With the Default profile, each Kubelet updates its status every 10 seconds (node-status-update-frequency). The Kube Controller Manager checks the status reported by the Kubelet every 5 seconds.

The Kubernetes Controller Manager waits 40 seconds for a status update from Kubelet before considering the Kubelet unhealthy. If no status is made available to the Kubernetes Controller Manager, it then marks the node with the node.kubernetes.io/not-ready or node.kubernetes.io/unreachable taint and evicts the pods on that node.

If a pod on that node has a toleration for the NoExecute taint, the pod continues to run for the period specified by tolerationSeconds. If the pod has no such toleration, it is evicted in 300 seconds (the default-not-ready-toleration-seconds and default-unreachable-toleration-seconds settings of the Kube API Server).
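
For example, a pod that tolerates the unreachable taint for a custom period carries a toleration such as the following. This is a generic Kubernetes sketch; the tolerationSeconds value of 120 is illustrative only:

tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 120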

Profile   Component                        Parameter                                Value
Default   kubelet                          node-status-update-frequency             10s
          Kubernetes Controller Manager    node-monitor-grace-period                40s
          Kubernetes API Server Operator   default-not-ready-toleration-seconds     300s
          Kubernetes API Server Operator   default-unreachable-toleration-seconds   300s

Medium worker latency profile

Use the MediumUpdateAverageReaction profile if the network latency is slightly higher than usual.

The MediumUpdateAverageReaction profile reduces the frequency of kubelet updates to 20 seconds and changes the period that the Kubernetes Controller Manager waits for those updates to 2 minutes. The pod eviction period for a pod on that node is reduced to 60 seconds. If the pod has the tolerationSeconds parameter, the eviction waits for the period specified by that parameter.

The Kubernetes Controller Manager waits for 2 minutes to consider a node unhealthy. In another minute, the eviction process starts.

Profile                       Component                        Parameter                                Value
MediumUpdateAverageReaction   kubelet                          node-status-update-frequency             20s
                              Kubernetes Controller Manager    node-monitor-grace-period                2m
                              Kubernetes API Server Operator   default-not-ready-toleration-seconds     60s
                              Kubernetes API Server Operator   default-unreachable-toleration-seconds   60s

Low worker latency profile

Use the LowUpdateSlowReaction profile if the network latency is extremely high.

The LowUpdateSlowReaction profile reduces the frequency of kubelet updates to 1 minute and changes the period that the Kubernetes Controller Manager waits for those updates to 5 minutes. The pod eviction period for a pod on that node is reduced to 60 seconds. If the pod has the tolerationSeconds parameter, the eviction waits for the period specified by that parameter.

The Kubernetes Controller Manager waits for 5 minutes to consider a node unhealthy. In another minute, the eviction process starts.

Profile                 Component                        Parameter                                Value
LowUpdateSlowReaction   kubelet                          node-status-update-frequency             1m
                        Kubernetes Controller Manager    node-monitor-grace-period                5m
                        Kubernetes API Server Operator   default-not-ready-toleration-seconds     60s
                        Kubernetes API Server Operator   default-unreachable-toleration-seconds   60s

17.2. Implementing worker latency profiles at cluster creation

Important

To edit the configuration of the installer, first use the command openshift-install create manifests to create the default node manifest and the other manifest YAML files. This file structure must exist before you can add workerLatencyProfile. The platform on which you are installing might have varying requirements. Refer to the Installing section of the documentation for your specific platform.

The workerLatencyProfile must be added to the manifest in the following sequence:

  1. Create the manifest needed to build the cluster, using a folder name appropriate for your installation.
  2. Create a YAML file to define config.node. The file must be in the manifests directory.
  3. When defining workerLatencyProfile in the manifest for the first time, specify any of the profiles at cluster creation time: Default, MediumUpdateAverageReaction or LowUpdateSlowReaction.

Verification

  • Here is an example manifest creation showing the spec.workerLatencyProfile Default value in the manifest file:

    $ openshift-install create manifests --dir=<cluster-install-dir>
  • Edit the manifest and add the value. In this example we use vi to show an example manifest file with the "Default" workerLatencyProfile value added:

    $ vi <cluster-install-dir>/manifests/config-node-default-profile.yaml

    Example output

    apiVersion: config.openshift.io/v1
    kind: Node
    metadata:
      name: cluster
    spec:
      workerLatencyProfile: "Default"

17.3. Using and changing worker latency profiles

To change a worker latency profile to deal with network latency, edit the node.config object to add the name of the profile. You can change the profile at any time as latency increases or decreases.

You must move one worker latency profile at a time. For example, you cannot move directly from the Default profile to the LowUpdateSlowReaction worker latency profile. You must move from the Default worker latency profile to the MediumUpdateAverageReaction profile first, then to LowUpdateSlowReaction. Similarly, when returning to the Default profile, you must move from the low profile to the medium profile first, then to Default.

Note

You can also configure worker latency profiles upon installing an OpenShift Container Platform cluster.

Procedure

To move from the default worker latency profile:

  1. Move to the medium worker latency profile:

    1. Edit the node.config object:

      $ oc edit nodes.config/cluster
    2. Add spec.workerLatencyProfile: MediumUpdateAverageReaction:

      Example node.config object

      apiVersion: config.openshift.io/v1
      kind: Node
      metadata:
        annotations:
          include.release.openshift.io/ibm-cloud-managed: "true"
          include.release.openshift.io/self-managed-high-availability: "true"
          include.release.openshift.io/single-node-developer: "true"
          release.openshift.io/create-only: "true"
        creationTimestamp: "2022-07-08T16:02:51Z"
        generation: 1
        name: cluster
        ownerReferences:
        - apiVersion: config.openshift.io/v1
          kind: ClusterVersion
          name: version
          uid: 36282574-bf9f-409e-a6cd-3032939293eb
        resourceVersion: "1865"
        uid: 0c0f7a4c-4307-4187-b591-6155695ac85b
      spec:
        workerLatencyProfile: MediumUpdateAverageReaction 1
      
      # ...

      1
      Specifies the medium worker latency policy.

      Scheduling on each worker node is disabled as the change is being applied.

  2. Optional: Move to the low worker latency profile:

    1. Edit the node.config object:

      $ oc edit nodes.config/cluster
    2. Change the spec.workerLatencyProfile value to LowUpdateSlowReaction:

      Example node.config object

      apiVersion: config.openshift.io/v1
      kind: Node
      metadata:
        annotations:
          include.release.openshift.io/ibm-cloud-managed: "true"
          include.release.openshift.io/self-managed-high-availability: "true"
          include.release.openshift.io/single-node-developer: "true"
          release.openshift.io/create-only: "true"
        creationTimestamp: "2022-07-08T16:02:51Z"
        generation: 1
        name: cluster
        ownerReferences:
        - apiVersion: config.openshift.io/v1
          kind: ClusterVersion
          name: version
          uid: 36282574-bf9f-409e-a6cd-3032939293eb
        resourceVersion: "1865"
        uid: 0c0f7a4c-4307-4187-b591-6155695ac85b
      spec:
        workerLatencyProfile: LowUpdateSlowReaction 1
      
      # ...

      1
      Specifies use of the low worker latency policy.

Scheduling on each worker node is disabled as the change is being applied.

Verification

  • When all nodes return to the Ready condition, you can use the following command to look in the Kubernetes Controller Manager to ensure it was applied:

    $ oc get KubeControllerManager -o yaml | grep -i workerlatency -A 5 -B 5

    Example output

    # ...
        - lastTransitionTime: "2022-07-11T19:47:10Z"
          reason: ProfileUpdated
          status: "False"
          type: WorkerLatencyProfileProgressing
        - lastTransitionTime: "2022-07-11T19:47:10Z" 1
          message: all static pod revision(s) have updated latency profile
          reason: ProfileUpdated
          status: "True"
          type: WorkerLatencyProfileComplete
        - lastTransitionTime: "2022-07-11T19:20:11Z"
          reason: AsExpected
          status: "False"
          type: WorkerLatencyProfileDegraded
        - lastTransitionTime: "2022-07-11T19:20:36Z"
          status: "False"
    # ...

    1
    Specifies that the profile is applied and active.

To change the medium profile to default or change the default to medium, edit the node.config object and set the spec.workerLatencyProfile parameter to the appropriate value.
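
As an alternative to interactive editing, a single oc patch command can apply the same change. This is a sketch using the nodes.config/cluster object shown in the procedure above:

$ oc patch nodes.config/cluster --type=merge \
-p '{"spec":{"workerLatencyProfile":"MediumUpdateAverageReaction"}}'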

17.4. Example steps for displaying resulting values of workerLatencyProfile

You can display the values in the workerLatencyProfile with the following commands.

Verification

  1. Check the default-not-ready-toleration-seconds and default-unreachable-toleration-seconds fields output by the Kube API Server:

    $ oc get KubeAPIServer -o yaml | grep -A 1 default-

    Example output

    default-not-ready-toleration-seconds:
    - "300"
    default-unreachable-toleration-seconds:
    - "300"

  2. Check the values of the node-monitor-grace-period field from the Kube Controller Manager:

    $ oc get KubeControllerManager -o yaml | grep -A 1 node-monitor

    Example output

    node-monitor-grace-period:
    - 40s

  3. Check the nodeStatusUpdateFrequency value from the Kubelet. Set the directory /host as the root directory within the debug shell. By changing the root directory to /host, you can run binaries contained in the host’s executable paths:

    $ oc debug node/<worker-node-name>
    $ chroot /host
    # cat /etc/kubernetes/kubelet.conf | grep nodeStatusUpdateFrequency

    Example output

      "nodeStatusUpdateFrequency": "10s"

These outputs validate the set of timing variables for the Worker Latency Profile.

Chapter 18. Topology Aware Lifecycle Manager for cluster updates

You can use the Topology Aware Lifecycle Manager (TALM) to manage the software lifecycle of multiple single-node OpenShift clusters. TALM uses Red Hat Advanced Cluster Management (RHACM) policies to perform changes on the target clusters.

Important

Topology Aware Lifecycle Manager is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

18.1. About the Topology Aware Lifecycle Manager configuration

The Topology Aware Lifecycle Manager (TALM) manages the deployment of Red Hat Advanced Cluster Management (RHACM) policies for one or more OpenShift Container Platform clusters. Using TALM in a large network of clusters allows the phased rollout of policies to the clusters in limited batches. This helps to minimize possible service disruptions when updating. With TALM, you can control the following actions:

  • The timing of the update
  • The number of RHACM-managed clusters
  • The subset of managed clusters to apply the policies to
  • The update order of the clusters
  • The set of policies remediated to the cluster
  • The order of policies remediated to the cluster

TALM supports the orchestration of the OpenShift Container Platform y-stream and z-stream updates, and day-two operations on y-streams and z-streams.

18.2. About managed policies used with Topology Aware Lifecycle Manager

The Topology Aware Lifecycle Manager (TALM) uses RHACM policies for cluster updates.

TALM can be used to manage the rollout of any policy CR where the remediationAction field is set to inform. Supported use cases include the following:

  • Manual user creation of policy CRs
  • Automatically generated policies from the PolicyGenTemplate custom resource definition (CRD)

For policies that update an Operator subscription with manual approval, TALM provides additional functionality that approves the installation of the updated Operator.
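
The following is a minimal sketch of an inform-mode RHACM policy of the kind that TALM can manage. The policy name, namespace, and Subscription details are illustrative only; the relevant requirement is that remediationAction is set to inform:

apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: example-subscription-policy
  namespace: default
spec:
  disabled: false
  remediationAction: inform
  policy-templates:
  - objectDefinition:
      apiVersion: policy.open-cluster-management.io/v1
      kind: ConfigurationPolicy
      metadata:
        name: example-subscription-config
      spec:
        remediationAction: inform
        severity: low
        object-templates:
        - complianceType: musthave
          objectDefinition:
            apiVersion: operators.coreos.com/v1alpha1
            kind: Subscription
            metadata:
              name: ptp-operator-subscription
              namespace: openshift-ptp
            spec:
              channel: "stable"
              name: ptp-operator
              source: redhat-operators
              sourceNamespace: openshift-marketplace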

For more information about managed policies, see Policy Overview in the RHACM documentation.

For more information about the PolicyGenTemplate CRD, see the "About the PolicyGenTemplate CRD" section in "Configuring managed clusters with policies and PolicyGenTemplate resources".

18.3. Installing the Topology Aware Lifecycle Manager by using the web console

You can use the OpenShift Container Platform web console to install the Topology Aware Lifecycle Manager.

Prerequisites

  • Install the latest version of the RHACM Operator.
  • Set up a hub cluster with a disconnected registry.
  • Log in as a user with cluster-admin privileges.

Procedure

  1. In the OpenShift Container Platform web console, navigate to Operators → OperatorHub.
  2. Search for the Topology Aware Lifecycle Manager from the list of available Operators, and then click Install.
  3. Keep the default selection of Installation mode ["All namespaces on the cluster (default)"] and Installed Namespace ("openshift-operators") to ensure that the Operator is installed properly.
  4. Click Install.

Verification

To confirm that the installation is successful:

  1. Navigate to the Operators → Installed Operators page.
  2. Check that the Operator is installed in the All Namespaces namespace and its status is Succeeded.

If the Operator is not installed successfully:

  1. Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
  2. Navigate to the Workloads → Pods page and check the logs in any containers in the cluster-group-upgrades-controller-manager pod that are reporting issues.

18.4. Installing the Topology Aware Lifecycle Manager by using the CLI

You can use the OpenShift CLI (oc) to install the Topology Aware Lifecycle Manager (TALM).

Prerequisites

  • Install the OpenShift CLI (oc).
  • Install the latest version of the RHACM Operator.
  • Set up a hub cluster with a disconnected registry.
  • Log in as a user with cluster-admin privileges.

Procedure

  1. Create a Subscription CR:

    1. Define the Subscription CR and save the YAML file, for example, talm-subscription.yaml:

      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: openshift-topology-aware-lifecycle-manager-subscription
        namespace: openshift-operators
      spec:
        channel: "stable"
        name: topology-aware-lifecycle-manager
        source: redhat-operators
        sourceNamespace: openshift-marketplace
    2. Create the Subscription CR by running the following command:

      $ oc create -f talm-subscription.yaml

Verification

  1. Verify that the installation succeeded by inspecting the CSV resource:

    $ oc get csv -n openshift-operators

    Example output

    NAME                                                   DISPLAY                            VERSION               REPLACES                           PHASE
    topology-aware-lifecycle-manager.4.11.x   Topology Aware Lifecycle Manager   4.11.x                                      Succeeded

  2. Verify that the TALM is up and running:

    $ oc get deploy -n openshift-operators

    Example output

    NAMESPACE                                          NAME                                             READY   UP-TO-DATE   AVAILABLE   AGE
    openshift-operators                                cluster-group-upgrades-controller-manager        1/1     1            1           14s

18.5. About the ClusterGroupUpgrade CR

The Topology Aware Lifecycle Manager (TALM) builds the remediation plan from the ClusterGroupUpgrade CR for a group of clusters. You can define the following specifications in a ClusterGroupUpgrade CR:

  • Clusters in the group
  • Blocking ClusterGroupUpgrade CRs
  • Applicable list of managed policies
  • Number of concurrent updates
  • Applicable canary updates
  • Actions to perform before and after the update
  • Update timing

As TALM works through remediation of the policies to the specified clusters, the ClusterGroupUpgrade CR can have the following states:

  • UpgradeNotStarted
  • UpgradeCannotStart
  • UpgradeNotCompleted
  • UpgradeTimedOut
  • UpgradeCompleted
  • PrecachingRequired
Note

After TALM completes a cluster update, the cluster does not update again under the control of the same ClusterGroupUpgrade CR. You must create a new ClusterGroupUpgrade CR in the following cases:

  • When you need to update the cluster again
  • When the cluster changes to non-compliant with the inform policy after being updated

18.5.1. The UpgradeNotStarted state

The initial state of the ClusterGroupUpgrade CR is UpgradeNotStarted.

TALM builds a remediation plan based on the following fields:

  • The clusterSelector field specifies the labels of the clusters that you want to update.
  • The clusters field specifies a list of clusters to update.
  • The canaries field specifies the clusters for canary updates.
  • The maxConcurrency field specifies the number of clusters to update in a batch.

You can use the clusters and the clusterSelector fields together to create a combined list of clusters.

The remediation plan starts with the clusters listed in the canaries field. Each canary cluster forms a single-cluster batch.

Note

Any failure during the update of a canary cluster stops the update process.

The ClusterGroupUpgrade CR transitions to the UpgradeNotCompleted state after the remediation plan is successfully created and after the enable field is set to true. At this point, TALM starts to update the non-compliant clusters with the specified managed policies.

Note

You can only make changes to the spec fields if the ClusterGroupUpgrade CR is either in the UpgradeNotStarted or the UpgradeCannotStart state.

Sample ClusterGroupUpgrade CR in the UpgradeNotStarted state

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-upgrade-complete
  namespace: default
spec:
  clusters: 1
  - spoke1
  enable: false
  managedPolicies: 2
  - policy1-common-cluster-version-policy
  - policy2-common-nto-sub-policy
  remediationStrategy: 3
    canaries: 4
      - spoke1
    maxConcurrency: 1 5
    timeout: 240
status: 6
  conditions:
  - message: The ClusterGroupUpgrade CR is not enabled
    reason: UpgradeNotStarted
    status: "False"
    type: Ready
  copiedPolicies:
  - cgu-upgrade-complete-policy1-common-cluster-version-policy
  - cgu-upgrade-complete-policy2-common-nto-sub-policy
  managedPoliciesForUpgrade:
  - name: policy1-common-cluster-version-policy
    namespace: default
  - name: policy2-common-nto-sub-policy
    namespace: default
  placementBindings:
  - cgu-upgrade-complete-policy1-common-cluster-version-policy
  - cgu-upgrade-complete-policy2-common-nto-sub-policy
  placementRules:
  - cgu-upgrade-complete-policy1-common-cluster-version-policy
  - cgu-upgrade-complete-policy2-common-nto-sub-policy
  remediationPlan:
  - - spoke1

1
Defines the list of clusters to update.
2
Lists the user-defined set of policies to remediate.
3
Defines the specifics of the cluster updates.
4
Defines the clusters for canary updates.
5
Defines the maximum number of concurrent updates in a batch. The number of remediation batches is the number of canary clusters, plus the number of remaining non-canary clusters divided by the maxConcurrency value; see the worked example after these callouts. The clusters that are already compliant with all the managed policies are excluded from the remediation plan.
6
Displays information about the status of the updates.
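
For example, assuming that the division is rounded up: with one canary cluster, nine other non-compliant clusters, and maxConcurrency: 4, the remediation plan contains one single-cluster canary batch followed by three batches of at most four clusters, for a total of four batches.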

18.5.2. The UpgradeCannotStart state

In the UpgradeCannotStart state, the update cannot start for the following reasons:

  • Blocking CRs are missing from the system
  • Blocking CRs have not yet finished

18.5.3. The UpgradeNotCompleted state

In the UpgradeNotCompleted state, TALM enforces the policies following the remediation plan defined in the UpgradeNotStarted state.

Enforcing the policies for subsequent batches starts immediately after all the clusters of the current batch are compliant with all the managed policies. If the batch times out, TALM moves on to the next batch. The timeout value of a batch is the spec.timeout field divided by the number of batches in the remediation plan.
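
For example, with the spec.timeout value of 240 minutes used in the samples in this chapter and a remediation plan of four batches, each batch times out after 60 minutes.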

Note

The managed policies apply in the order that they are listed in the managedPolicies field in the ClusterGroupUpgrade CR. One managed policy is applied to the specified clusters at a time. After the specified clusters comply with the current policy, the next managed policy is applied to the next non-compliant cluster.

Sample ClusterGroupUpgrade CR in the UpgradeNotCompleted state

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-upgrade-complete
  namespace: default
spec:
  clusters:
  - spoke1
  enable: true 1
  managedPolicies:
  - policy1-common-cluster-version-policy
  - policy2-common-nto-sub-policy
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
status: 2
  conditions:
  - message: The ClusterGroupUpgrade CR has upgrade policies that are still non compliant
    reason: UpgradeNotCompleted
    status: "False"
    type: Ready
  copiedPolicies:
  - cgu-upgrade-complete-policy1-common-cluster-version-policy
  - cgu-upgrade-complete-policy2-common-nto-sub-policy
  managedPoliciesForUpgrade:
  - name: policy1-common-cluster-version-policy
    namespace: default
  - name: policy2-common-nto-sub-policy
    namespace: default
  placementBindings:
  - cgu-upgrade-complete-policy1-common-cluster-version-policy
  - cgu-upgrade-complete-policy2-common-nto-sub-policy
  placementRules:
  - cgu-upgrade-complete-policy1-common-cluster-version-policy
  - cgu-upgrade-complete-policy2-common-nto-sub-policy
  remediationPlan:
  - - spoke1
  status:
    currentBatch: 1
    remediationPlanForBatch: 3
      spoke1: 0

1
The update starts when the value of the spec.enable field is true.
2
The status fields change accordingly when the update begins.
3
Lists the clusters in the batch and the index of the policy that is being currently applied to each cluster. The index of the policies starts with 0 and the index follows the order of the status.managedPoliciesForUpgrade list.
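
In the sample above, spoke1: 0 indicates that the first policy in the status.managedPoliciesForUpgrade list, policy1-common-cluster-version-policy, is currently being applied to spoke1.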

18.5.4. The UpgradeTimedOut state

In the UpgradeTimedOut state, TALM checks every hour if all the policies for the ClusterGroupUpgrade CR are compliant. The checks continue until the ClusterGroupUpgrade CR is deleted or the updates are completed. The periodic checks allow the updates to complete if they get prolonged due to network, CPU, or other issues.

TALM transitions to the UpgradeTimedOut state in two cases:

  • When the current batch contains canary updates and the cluster in the batch does not comply with all the managed policies within the batch timeout.
  • When the clusters do not comply with the managed policies within the timeout value specified in the remediationStrategy field.

If the policies are compliant, TALM transitions to the UpgradeCompleted state.

18.5.5. The UpgradeCompleted state

In the UpgradeCompleted state, the cluster updates are complete.

Sample ClusterGroupUpgrade CR in the UpgradeCompleted state

apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: cgu-upgrade-complete
  namespace: default
spec:
  actions:
    afterCompletion:
      deleteObjects: true 1
  clusters:
  - spoke1
  enable: true
  managedPolicies:
  - policy1-common-cluster-version-policy
  - policy2-common-nto-sub-policy
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
status: 2
  conditions:
  - message: The ClusterGroupUpgrade CR has all clusters compliant with all the managed policies
    reason: UpgradeCompleted
    status: "True"
    type: Ready
  managedPoliciesForUpgrade:
  - name: policy1-common-cluster-version-policy
    namespace: default
  - name: policy2-common-nto-sub-policy
    namespace: default
  remediationPlan:
  - - spoke1
  status:
    remediationPlanForBatch:
      spoke1: -2 3

1
The value of the spec.actions.afterCompletion.deleteObjects field is true by default. After the update is completed, TALM deletes the underlying RHACM objects that were created during the update. This option prevents the RHACM hub from continuously checking for compliance after a successful update.
2
The status fields show that the updates completed successfully.
3
Displays that all the policies are applied to the cluster.
The PrecachingRequired state

In the PrecachingRequired state, the clusters need to have images pre-cached before the update can start. For more information about pre-caching, see the "Using the container image pre-cache feature" section.

18.5.6. Blocking ClusterGroupUpgrade CRs

You can create multiple ClusterGroupUpgrade CRs and control their order of application.

For example, if you create ClusterGroupUpgrade CR C that blocks the start of ClusterGroupUpgrade CR A, then ClusterGroupUpgrade CR A cannot start until the status of ClusterGroupUpgrade CR C becomes UpgradeCompleted.

One ClusterGroupUpgrade CR can have multiple blocking CRs. In this case, all the blocking CRs must complete before the upgrade for the current CR can start.

Prerequisites

  • Install the Topology Aware Lifecycle Manager (TALM).
  • Provision one or more managed clusters.
  • Log in as a user with cluster-admin privileges.
  • Create RHACM policies in the hub cluster.

Procedure

  1. Save the content of the ClusterGroupUpgrade CRs in the cgu-a.yaml, cgu-b.yaml, and cgu-c.yaml files.

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-a
      namespace: default
    spec:
      blockingCRs: 1
      - name: cgu-c
        namespace: default
      clusters:
      - spoke1
      - spoke2
      - spoke3
      enable: false
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      remediationStrategy:
        canaries:
        - spoke1
        maxConcurrency: 2
        timeout: 240
    status:
      conditions:
      - message: The ClusterGroupUpgrade CR is not enabled
        reason: UpgradeNotStarted
        status: "False"
        type: Ready
      copiedPolicies:
      - cgu-a-policy1-common-cluster-version-policy
      - cgu-a-policy2-common-pao-sub-policy
      - cgu-a-policy3-common-ptp-sub-policy
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default
      - name: policy2-common-pao-sub-policy
        namespace: default
      - name: policy3-common-ptp-sub-policy
        namespace: default
      placementBindings:
      - cgu-a-policy1-common-cluster-version-policy
      - cgu-a-policy2-common-pao-sub-policy
      - cgu-a-policy3-common-ptp-sub-policy
      placementRules:
      - cgu-a-policy1-common-cluster-version-policy
      - cgu-a-policy2-common-pao-sub-policy
      - cgu-a-policy3-common-ptp-sub-policy
      remediationPlan:
      - - spoke1
      - - spoke2
    1
    Defines the blocking CRs. The cgu-a update cannot start until cgu-c is complete.
    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-b
      namespace: default
    spec:
      blockingCRs: 1
      - name: cgu-a
        namespace: default
      clusters:
      - spoke4
      - spoke5
      enable: false
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      - policy4-common-sriov-sub-policy
      remediationStrategy:
        maxConcurrency: 1
        timeout: 240
    status:
      conditions:
      - message: The ClusterGroupUpgrade CR is not enabled
        reason: UpgradeNotStarted
        status: "False"
        type: Ready
      copiedPolicies:
      - cgu-b-policy1-common-cluster-version-policy
      - cgu-b-policy2-common-pao-sub-policy
      - cgu-b-policy3-common-ptp-sub-policy
      - cgu-b-policy4-common-sriov-sub-policy
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default
      - name: policy2-common-pao-sub-policy
        namespace: default
      - name: policy3-common-ptp-sub-policy
        namespace: default
      - name: policy4-common-sriov-sub-policy
        namespace: default
      placementBindings:
      - cgu-b-policy1-common-cluster-version-policy
      - cgu-b-policy2-common-pao-sub-policy
      - cgu-b-policy3-common-ptp-sub-policy
      - cgu-b-policy4-common-sriov-sub-policy
      placementRules:
      - cgu-b-policy1-common-cluster-version-policy
      - cgu-b-policy2-common-pao-sub-policy
      - cgu-b-policy3-common-ptp-sub-policy
      - cgu-b-policy4-common-sriov-sub-policy
      remediationPlan:
      - - spoke4
      - - spoke5
      status: {}
    1
    The cgu-b update cannot start until cgu-a is complete.
    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-c
      namespace: default
    spec: 1
      clusters:
      - spoke6
      enable: false
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      - policy4-common-sriov-sub-policy
      remediationStrategy:
        maxConcurrency: 1
        timeout: 240
    status:
      conditions:
      - message: The ClusterGroupUpgrade CR is not enabled
        reason: UpgradeNotStarted
        status: "False"
        type: Ready
      copiedPolicies:
      - cgu-c-policy1-common-cluster-version-policy
      - cgu-c-policy4-common-sriov-sub-policy
      managedPoliciesCompliantBeforeUpgrade:
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default
      - name: policy4-common-sriov-sub-policy
        namespace: default
      placementBindings:
      - cgu-c-policy1-common-cluster-version-policy
      - cgu-c-policy4-common-sriov-sub-policy
      placementRules:
      - cgu-c-policy1-common-cluster-version-policy
      - cgu-c-policy4-common-sriov-sub-policy
      remediationPlan:
      - - spoke6
      status: {}
    1
    The cgu-c update does not have any blocking CRs. TALM starts the cgu-c update when the enable field is set to true.
  2. Create the ClusterGroupUpgrade CRs by running the following command for each relevant CR:

    $ oc apply -f <name>.yaml
  3. Start the update process by running the following command for each relevant CR:

    $ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/<name> \
    --type merge -p '{"spec":{"enable":true}}'

    The following examples show ClusterGroupUpgrade CRs where the enable field is set to true:

    Example for cgu-a with blocking CRs

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-a
      namespace: default
    spec:
      blockingCRs:
      - name: cgu-c
        namespace: default
      clusters:
      - spoke1
      - spoke2
      - spoke3
      enable: true
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      remediationStrategy:
        canaries:
        - spoke1
        maxConcurrency: 2
        timeout: 240
    status:
      conditions:
      - message: 'The ClusterGroupUpgrade CR is blocked by other CRs that have not yet
          completed: [cgu-c]' 1
        reason: UpgradeCannotStart
        status: "False"
        type: Ready
      copiedPolicies:
      - cgu-a-policy1-common-cluster-version-policy
      - cgu-a-policy2-common-pao-sub-policy
      - cgu-a-policy3-common-ptp-sub-policy
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default
      - name: policy2-common-pao-sub-policy
        namespace: default
      - name: policy3-common-ptp-sub-policy
        namespace: default
      placementBindings:
      - cgu-a-policy1-common-cluster-version-policy
      - cgu-a-policy2-common-pao-sub-policy
      - cgu-a-policy3-common-ptp-sub-policy
      placementRules:
      - cgu-a-policy1-common-cluster-version-policy
      - cgu-a-policy2-common-pao-sub-policy
      - cgu-a-policy3-common-ptp-sub-policy
      remediationPlan:
      - - spoke1
      - - spoke2
      status: {}

    1
    Shows the list of blocking CRs.

    Example for cgu-b with blocking CRs

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-b
      namespace: default
    spec:
      blockingCRs:
      - name: cgu-a
        namespace: default
      clusters:
      - spoke4
      - spoke5
      enable: true
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      - policy4-common-sriov-sub-policy
      remediationStrategy:
        maxConcurrency: 1
        timeout: 240
    status:
      conditions:
      - message: 'The ClusterGroupUpgrade CR is blocked by other CRs that have not yet
          completed: [cgu-a]' 1
        reason: UpgradeCannotStart
        status: "False"
        type: Ready
      copiedPolicies:
      - cgu-b-policy1-common-cluster-version-policy
      - cgu-b-policy2-common-pao-sub-policy
      - cgu-b-policy3-common-ptp-sub-policy
      - cgu-b-policy4-common-sriov-sub-policy
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default
      - name: policy2-common-pao-sub-policy
        namespace: default
      - name: policy3-common-ptp-sub-policy
        namespace: default
      - name: policy4-common-sriov-sub-policy
        namespace: default
      placementBindings:
      - cgu-b-policy1-common-cluster-version-policy
      - cgu-b-policy2-common-pao-sub-policy
      - cgu-b-policy3-common-ptp-sub-policy
      - cgu-b-policy4-common-sriov-sub-policy
      placementRules:
      - cgu-b-policy1-common-cluster-version-policy
      - cgu-b-policy2-common-pao-sub-policy
      - cgu-b-policy3-common-ptp-sub-policy
      - cgu-b-policy4-common-sriov-sub-policy
      remediationPlan:
      - - spoke4
      - - spoke5
      status: {}

    1
    Shows the list of blocking CRs.

    Example for cgu-c with blocking CRs

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-c
      namespace: default
    spec:
      clusters:
      - spoke6
      enable: true
      managedPolicies:
      - policy1-common-cluster-version-policy
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      - policy4-common-sriov-sub-policy
      remediationStrategy:
        maxConcurrency: 1
        timeout: 240
    status:
      conditions:
      - message: The ClusterGroupUpgrade CR has upgrade policies that are still non compliant 1
        reason: UpgradeNotCompleted
        status: "False"
        type: Ready
      copiedPolicies:
      - cgu-c-policy1-common-cluster-version-policy
      - cgu-c-policy4-common-sriov-sub-policy
      managedPoliciesCompliantBeforeUpgrade:
      - policy2-common-pao-sub-policy
      - policy3-common-ptp-sub-policy
      managedPoliciesForUpgrade:
      - name: policy1-common-cluster-version-policy
        namespace: default
      - name: policy4-common-sriov-sub-policy
        namespace: default
      placementBindings:
      - cgu-c-policy1-common-cluster-version-policy
      - cgu-c-policy4-common-sriov-sub-policy
      placementRules:
      - cgu-c-policy1-common-cluster-version-policy
      - cgu-c-policy4-common-sriov-sub-policy
      remediationPlan:
      - - spoke6
      status:
        currentBatch: 1
        remediationPlanForBatch:
          spoke6: 0

    1
    The cgu-c update does not have any blocking CRs.

18.6. Update policies on managed clusters

The Topology Aware Lifecycle Manager (TALM) remediates a set of inform policies for the clusters specified in the ClusterGroupUpgrade CR. TALM remediates inform policies by making enforce copies of the managed RHACM policies. Each copied policy has its own corresponding RHACM placement rule and RHACM placement binding.

One by one, TALM adds each cluster from the current batch to the placement rule that corresponds with the applicable managed policy. If a cluster is already compliant with a policy, TALM skips applying that policy on the compliant cluster. TALM then moves on to applying the next policy to the non-compliant cluster. After TALM completes the updates in a batch, all clusters are removed from the placement rules associated with the copied policies. Then, the update of the next batch starts.

If a spoke cluster does not report any compliant state to RHACM, the managed policies on the hub cluster can be missing status information that TALM needs. TALM handles these cases in the following ways:

  • If a policy’s status.compliant field is missing, TALM ignores the policy and adds a log entry. Then, TALM continues looking at the policy’s status.status field.
  • If a policy’s status.status is missing, TALM produces an error.
  • If a cluster’s compliance status is missing in the policy’s status.status field, TALM considers that cluster to be non-compliant with that policy.

For more information about RHACM policies, see Policy overview.

Additional resources

For more information about the PolicyGenTemplate CRD, see About the PolicyGenTemplate CRD.

18.6.1. Applying update policies to managed clusters

You can update your managed clusters by applying your policies.

Prerequisites

  • Install the Topology Aware Lifecycle Manager (TALM).
  • Provision one or more managed clusters.
  • Log in as a user with cluster-admin privileges.
  • Create RHACM policies in the hub cluster.

Procedure

  1. Save the contents of the ClusterGroupUpgrade CR in the cgu-1.yaml file.

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: cgu-1
      namespace: default
    spec:
      managedPolicies: 1
        - policy1-common-cluster-version-policy
        - policy2-common-nto-sub-policy
        - policy3-common-ptp-sub-policy
        - policy4-common-sriov-sub-policy
      enable: false
      clusters: 2
      - spoke1
      - spoke2
      - spoke5
      - spoke6
      remediationStrategy:
        maxConcurrency: 2 3
        timeout: 240 4
    1
    The name of the policies to apply.
    2
    The list of clusters to update.
    3
    The maxConcurrency field signifies the number of clusters updated at the same time.
    4
    The update timeout in minutes.
  2. Create the ClusterGroupUpgrade CR by running the following command:

    $ oc create -f cgu-1.yaml
    1. Check if the ClusterGroupUpgrade CR was created in the hub cluster by running the following command:

      $ oc get cgu --all-namespaces

      Example output

      NAMESPACE   NAME      AGE
      default     cgu-1     8m55s

    2. Check the status of the update by running the following command:

      $ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq

      Example output

      {
        "computedMaxConcurrency": 2,
        "conditions": [
          {
            "lastTransitionTime": "2022-02-25T15:34:07Z",
            "message": "The ClusterGroupUpgrade CR is not enabled", 1
            "reason": "UpgradeNotStarted",
            "status": "False",
            "type": "Ready"
          }
        ],
        "copiedPolicies": [
          "cgu-policy1-common-cluster-version-policy",
          "cgu-policy2-common-nto-sub-policy",
          "cgu-policy3-common-ptp-sub-policy",
          "cgu-policy4-common-sriov-sub-policy"
        ],
        "managedPoliciesContent": {
          "policy1-common-cluster-version-policy": "null",
          "policy2-common-nto-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"node-tuning-operator\",\"namespace\":\"openshift-cluster-node-tuning-operator\"}]",
          "policy3-common-ptp-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"ptp-operator-subscription\",\"namespace\":\"openshift-ptp\"}]",
          "policy4-common-sriov-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"sriov-network-operator-subscription\",\"namespace\":\"openshift-sriov-network-operator\"}]"
        },
        "managedPoliciesForUpgrade": [
          {
            "name": "policy1-common-cluster-version-policy",
            "namespace": "default"
          },
          {
            "name": "policy2-common-nto-sub-policy",
            "namespace": "default"
          },
          {
            "name": "policy3-common-ptp-sub-policy",
            "namespace": "default"
          },
          {
            "name": "policy4-common-sriov-sub-policy",
            "namespace": "default"
          }
        ],
        "managedPoliciesNs": {
          "policy1-common-cluster-version-policy": "default",
          "policy2-common-nto-sub-policy": "default",
          "policy3-common-ptp-sub-policy": "default",
          "policy4-common-sriov-sub-policy": "default"
        },
        "placementBindings": [
          "cgu-policy1-common-cluster-version-policy",
          "cgu-policy2-common-nto-sub-policy",
          "cgu-policy3-common-ptp-sub-policy",
          "cgu-policy4-common-sriov-sub-policy"
        ],
        "placementRules": [
          "cgu-policy1-common-cluster-version-policy",
          "cgu-policy2-common-nto-sub-policy",
          "cgu-policy3-common-ptp-sub-policy",
          "cgu-policy4-common-sriov-sub-policy"
        ],
        "precaching": {
          "spec": {}
        },
        "remediationPlan": [
          [
            "spoke1",
            "spoke2"
          ],
          [
            "spoke5",
            "spoke6"
          ]
        ],
        "status": {}
      }

      1
      The spec.enable field in the ClusterGroupUpgrade CR is set to false.
    3. Check the status of the policies by running the following command:

      $ oc get policies -A

      Example output

      NAMESPACE   NAME                                                 REMEDIATION ACTION   COMPLIANCE STATE   AGE
      default     cgu-policy1-common-cluster-version-policy            enforce                                 17m 1
      default     cgu-policy2-common-nto-sub-policy                    enforce                                 17m
      default     cgu-policy3-common-ptp-sub-policy                    enforce                                 17m
      default     cgu-policy4-common-sriov-sub-policy                  enforce                                 17m
      default     policy1-common-cluster-version-policy                inform               NonCompliant       15h
      default     policy2-common-nto-sub-policy                        inform               NonCompliant       15h
      default     policy3-common-ptp-sub-policy                        inform               NonCompliant       18m
      default     policy4-common-sriov-sub-policy                      inform               NonCompliant       18m

      1
      The spec.remediationAction field of policies currently applied on the clusters is set to enforce. The managed policies in inform mode from the ClusterGroupUpgrade CR remain in inform mode during the update.
  3. Change the value of the spec.enable field to true by running the following command:

    $ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/cgu-1 \
    --patch '{"spec":{"enable":true}}' --type=merge

Verification

  1. Check the status of the update again by running the following command:

    $ oc get cgu -n default cgu-1 -ojsonpath='{.status}' | jq

    Example output

    {
      "computedMaxConcurrency": 2,
      "conditions": [ 1
        {
          "lastTransitionTime": "2022-02-25T15:34:07Z",
          "message": "The ClusterGroupUpgrade CR has upgrade policies that are still non compliant",
          "reason": "UpgradeNotCompleted",
          "status": "False",
          "type": "Ready"
        }
      ],
      "copiedPolicies": [
        "cgu-policy1-common-cluster-version-policy",
        "cgu-policy2-common-nto-sub-policy",
        "cgu-policy3-common-ptp-sub-policy",
        "cgu-policy4-common-sriov-sub-policy"
      ],
      "managedPoliciesContent": {
        "policy1-common-cluster-version-policy": "null",
        "policy2-common-nto-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"node-tuning-operator\",\"namespace\":\"openshift-cluster-node-tuning-operator\"}]",
        "policy3-common-ptp-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"ptp-operator-subscription\",\"namespace\":\"openshift-ptp\"}]",
        "policy4-common-sriov-sub-policy": "[{\"kind\":\"Subscription\",\"name\":\"sriov-network-operator-subscription\",\"namespace\":\"openshift-sriov-network-operator\"}]"
      },
      "managedPoliciesForUpgrade": [
        {
          "name": "policy1-common-cluster-version-policy",
          "namespace": "default"
        },
        {
          "name": "policy2-common-nto-sub-policy",
          "namespace": "default"
        },
        {
          "name": "policy3-common-ptp-sub-policy",
          "namespace": "default"
        },
        {
          "name": "policy4-common-sriov-sub-policy",
          "namespace": "default"
        }
      ],
      "managedPoliciesNs": {
        "policy1-common-cluster-version-policy": "default",
        "policy2-common-nto-sub-policy": "default",
        "policy3-common-ptp-sub-policy": "default",
        "policy4-common-sriov-sub-policy": "default"
      },
      "placementBindings": [
        "cgu-policy1-common-cluster-version-policy",
        "cgu-policy2-common-nto-sub-policy",
        "cgu-policy3-common-ptp-sub-policy",
        "cgu-policy4-common-sriov-sub-policy"
      ],
      "placementRules": [
        "cgu-policy1-common-cluster-version-policy",
        "cgu-policy2-common-nto-sub-policy",
        "cgu-policy3-common-ptp-sub-policy",
        "cgu-policy4-common-sriov-sub-policy"
      ],
      "precaching": {
        "spec": {}
      },
      "remediationPlan": [
        [
          "spoke1",
          "spoke2"
        ],
        [
          "spoke5",
          "spoke6"
        ]
      ],
      "status": { 1
        "currentBatch": 1,
        "currentBatchStartedAt": "2022-02-25T15:54:16Z",
        "remediationPlanForBatch": {
          "spoke1": 0,
          "spoke2": 1
        },
        "startedAt": "2022-02-25T15:54:16Z"
      }
    }

    1
    Reflects the update progress of the current batch. Run this command again to receive updated information about the progress, or wait for completion as shown in the example after this procedure.
  2. If the policies include Operator subscriptions, you can check the installation progress directly on the single-node cluster.

    1. Export the KUBECONFIG file of the single-node cluster you want to check the installation progress for by running the following command:

      $ export KUBECONFIG=<cluster_kubeconfig_absolute_path>
    2. Check all the subscriptions present on the single-node cluster and look for the one in the policy you are trying to install through the ClusterGroupUpgrade CR by running the following command:

      $ oc get subs -A | grep -i <subscription_name>

      Example output for cluster-logging policy

      NAMESPACE                              NAME                         PACKAGE                      SOURCE             CHANNEL
      openshift-logging                      cluster-logging              cluster-logging              redhat-operators   stable

  3. If one of the managed policies includes a ClusterVersion CR, check the status of platform updates in the current batch by running the following command against the spoke cluster:

    $ oc get clusterversion

    Example output

    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.9.5     True        True          43s     Working towards 4.9.7: 71 of 735 done (9% complete)

  4. Check the Operator subscription by running the following command:

    $ oc get subs -n <operator_namespace> <operator_subscription> -ojsonpath="{.status}"
  5. Check the install plans present on the single-node cluster that are associated with the desired subscription by running the following command:

    $ oc get installplan -n <subscription_namespace>

    Example output for cluster-logging Operator

    NAMESPACE                              NAME            CSV                                 APPROVAL   APPROVED
    openshift-logging                      install-6khtw   cluster-logging.5.3.3-4             Manual     true 1

    1
    The install plans have their Approval field set to Manual and their Approved field changes from false to true after TALM approves the install plan.
    Note

    When TALM is remediating a policy that contains a subscription, it automatically approves any install plans attached to that subscription. Where multiple install plans are needed to bring the Operator to the latest known version, TALM might approve multiple install plans, upgrading through one or more intermediate versions to reach the final version.

  6. Check whether the cluster service version (CSV) for the Operator that the ClusterGroupUpgrade CR is installing has reached the Succeeded phase by running the following command:

    $ oc get csv -n <operator_namespace>

    Example output for OpenShift Logging Operator

    NAME                    DISPLAY                     VERSION   REPLACES   PHASE
    cluster-logging.5.4.2   Red Hat OpenShift Logging   5.4.2                Succeeded
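
If you prefer to wait for the whole update to finish instead of re-running the status command, you can block on the Ready condition reported in the ClusterGroupUpgrade CR status. This is a sketch that reuses the cgu-1 name and default namespace from this procedure; adjust the timeout to match your remediationStrategy settings:

$ oc wait clustergroupupgrade.ran.openshift.io/cgu-1 -n default --for=condition=Ready --timeout=4h

The command returns when the CR reports the Ready condition with status True, or fails if the timeout expires first.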

18.7. Creating a backup of cluster resources before upgrade

For single-node OpenShift, the Topology Aware Lifecycle Manager (TALM) can create a backup of a deployment before an upgrade. If the upgrade fails, you can recover the previous version and restore the cluster to a working state without having to reprovision applications.

The container image backup starts when the backup field is set to true in the ClusterGroupUpgrade CR.

The backup process can be in the following statuses:

BackupStatePreparingToStart
The first reconciliation pass is in progress. TALM deletes any spoke backup namespaces and hub view resources that were left over from a previous failed upgrade attempt.
BackupStateStarting
The backup prerequisites and backup job are being created.
BackupStateActive
The backup is in progress.
BackupStateSucceeded
The backup has succeeded.
BackupStateTimeout
The artifact backup was only partially completed.
BackupStateError
The backup has ended with a non-zero exit code.
Note

If the backup fails and enters the BackupStateTimeout or BackupStateError state, the cluster upgrade does not proceed.
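
To see which of these states each cluster has reached, you can filter the ClusterGroupUpgrade CR status from the hub cluster. The following sketch assumes that backup progress is reported under the .status.backup field of the CR, which can vary between TALM versions, and uses placeholder names:

$ oc get cgu -n <namespace> <cgu_name> -ojsonpath='{.status}' | jq '.backup'

The output lists the clusters included in the backup and the state that each cluster has reached.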

18.7.1. Creating a ClusterGroupUpgrade CR with backup

For single-node OpenShift, you can create a backup of a deployment before an upgrade. If the upgrade fails, you can use the upgrade-recovery.sh script generated by the Topology Aware Lifecycle Manager (TALM) to return the system to its preupgrade state. The backup consists of the following items:

Cluster backup
A snapshot of etcd and static pod manifests.
Content backup
Backups of folders, for example, /etc, /usr/local, /var/lib/kubelet.
Changed files backup
Any files managed by machine-config that have been changed.
Deployment
A pinned ostree deployment.
Images (Optional)
Any container images that are in use.

Prerequisites

  • Install the Topology Aware Lifecycle Manager (TALM).
  • Provision one or more managed clusters.
  • Log in as a user with cluster-admin privileges.
  • Install Red Hat Advanced Cluster Management (RHACM).
Note

It is highly recommended that you create a recovery partition. The following is an example SiteConfig custom resource (CR) for a recovery partition of 50 GB:

nodes:
  - hostName: "snonode.sno-worker-0.e2e.bos.redhat.com"
    role: "master"
    rootDeviceHints:
      hctl: "0:2:0:0"
      deviceName: /dev/sda
........
........
    # Disk /dev/sda: 893.3 GiB, 959119884288 bytes, 1873281024 sectors
    diskPartition:
      - device: /dev/sda
        partitions:
          - mount_point: /var/recovery
            size: 51200
            start: 800000
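
After the cluster is provisioned, you can confirm that the recovery partition was created and mounted, for example by checking the mount point from a debug pod on the node. The node name is a placeholder:

$ oc debug node/<node_name> -- chroot /host df -h /var/recovery

The output should show a partition of approximately the size requested in the SiteConfig CR mounted at /var/recovery.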

Procedure

  1. Save the contents of the ClusterGroupUpgrade CR with the backup field set to true in the clustergroupupgrades-group-du.yaml file:

    apiVersion: ran.openshift.io/v1alpha1
    kind: ClusterGroupUpgrade
    metadata:
      name: du-upgrade-4918
      namespace: ztp-group-du-sno
    spec:
      preCaching: true
      backup: true
      clusters:
      - cnfdb1
      - cnfdb2
      enable: false
      managedPolicies:
      - du-upgrade-platform-upgrade
      remediationStrategy:
        maxConcurrency: 2
        timeout: 240
  2. To start the update, apply the ClusterGroupUpgrade CR by running the following command:

    $ oc apply -f clustergroupupgrades-group-du.yaml
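
    After the CR is applied, you can follow backup and update progress from the hub cluster with the same status query pattern shown earlier in this chapter, for example:

    $ oc get cgu -n ztp-group-du-sno du-upgrade-4918 -ojsonpath='{.status}' | jq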