OpenShift Container Platform

Use the autoscaling/v2 API.

Specify a name for this horizontal pod autoscaler object.

Specify the API version of the object to scale:

For a ReplicationController, use v1.
For a DeploymentConfig, use apps.openshift.io/v1.
For a Deployment, ReplicaSet, or StatefulSet object, use apps/v1.

Specify the type of object. The object must be a Deployment, DeploymentConfig, ReplicaSet, ReplicationController, or StatefulSet.

Specify the name of the object to scale. The object must exist.

Specify the minimum number of replicas when scaling down.

Specify the maximum number of replicas when scaling up.

Use the metrics parameter for memory usage.

Specify memory for memory usage.

Set the type to Utilization.

Specify averageUtilization and a target average memory usage over all the pods, represented as a percent of requested memory. The target pods must have memory requests configured.

Optional: Specify a scaling policy to control the rate of scaling up or down.
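To make the averageUtilization target concrete, the following Python sketch computes memory use as a percentage of requested memory across a set of pods. The function name and the sample figures are hypothetical, and dividing total usage by total requests is a simplified model of how the metric is evaluated, not the autoscaler's actual implementation.

```python
MI = 1024 * 1024  # bytes in one mebibyte


def memory_utilization_percent(usage_bytes, request_bytes):
    """Return memory use across pods as a percentage of requested memory."""
    if len(usage_bytes) != len(request_bytes):
        raise ValueError("each pod needs both a usage and a request value")
    total_request = sum(request_bytes)
    if total_request == 0:
        # Utilization targets are undefined without memory requests.
        raise ValueError("pods must have memory requests configured")
    return 100 * sum(usage_bytes) / total_request


# Two hypothetical pods, each requesting 500Mi of memory.
usage = [300 * MI, 200 * MI]
requests = [500 * MI, 500 * MI]
print(memory_utilization_percent(usage, requests))  # 50.0
```

With an averageUtilization target of 50, this hypothetical workload would sit exactly at its target.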

Create the horizontal pod autoscaler by using a command similar to the following:
```
$ oc create -f <file-name>.yaml
```
For example:
```
$ oc create -f hpa.yaml
```
Example output
```
horizontalpodautoscaler.autoscaling/hpa-resource-metrics-memory created
```

Verification

Check that the horizontal pod autoscaler was created by using a command similar to the following:

$ oc get hpa hpa-resource-metrics-memory

Example output

NAME                          REFERENCE            TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
hpa-resource-metrics-memory   Deployment/example   2441216/500Mi   1         10        1          20m

Check the details of the horizontal pod autoscaler by using a command similar to the following:

$ oc describe hpa hpa-resource-metrics-memory

Example output

Name:                        hpa-resource-metrics-memory
Namespace:                   default
Labels:                      <none>
Annotations:                 <none>
CreationTimestamp:           Wed, 04 Mar 2020 16:31:37 +0530
Reference:                   Deployment/example
Metrics:                     ( current / target )
  resource memory on pods:   2441216 / 500Mi
Min replicas:                1
Max replicas:                10
Deployment pods:             1 current / 1 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from memory resource
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Type     Reason                   Age                 From                       Message
  ----     ------                   ----                ----                       -------
  Normal   SuccessfulRescale        6m34s               horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

2.4.6.4. Creating a horizontal pod autoscaler object for specific memory use

Using the OpenShift Container Platform CLI, you can create a horizontal pod autoscaler (HPA) to automatically scale an existing object. The HPA scales the pods associated with that object to maintain the average memory use that you specify.
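As a rough illustration of how an average-value target translates into a replica count, the following Python sketch applies the commonly documented horizontal pod autoscaler calculation: the current replica count multiplied by the ratio of the current average to the target average, rounded up and clamped to the configured minimum and maximum. The function name and sample values are hypothetical, and the sketch ignores details such as the controller's tolerance band and pod readiness.

```python
import math

MI = 1024 * 1024  # bytes in one mebibyte


def desired_replicas(current_replicas, current_avg, target_avg,
                     min_replicas, max_replicas):
    """Estimate the replica count needed to bring the average back to target."""
    if target_avg <= 0:
        raise ValueError("target average must be positive")
    raw = math.ceil(current_replicas * current_avg / target_avg)
    # Clamp the result to the minReplicas/maxReplicas range.
    return max(min_replicas, min(max_replicas, raw))


# Four pods averaging 750Mi against a 500Mi target.
print(desired_replicas(4, 750 * MI, 500 * MI, 1, 10))  # 6

# One pod using about 2.3Mi against a 500Mi target stays at the minimum.
print(desired_replicas(1, 2441216, 500 * MI, 1, 10))  # 1
```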

Note

Use a Deployment object or ReplicaSet object unless you need a specific feature or behavior provided by other objects.

Prerequisites

$ oc describe PodMetrics openshift-kube-scheduler-ip-10-0-135-131.ec2.internal

Example output

Name:         openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Namespace:    openshift-kube-scheduler
Labels:       <none>
Annotations:  <none>
API Version:  metrics.k8s.io/v1beta1
Containers:
  Name:  wait-for-host-port
  Usage:
    Memory:  0
  Name:      scheduler
  Usage:
    Cpu:     8m
    Memory:  45440Ki
Kind:        PodMetrics
Metadata:
  Creation Timestamp:  2019-05-23T18:47:56Z
  Self Link:           /apis/metrics.k8s.io/v1beta1/namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Timestamp:             2019-05-23T18:47:56Z
Window:                1m0s
Events:                <none>

Procedure

Create a HorizontalPodAutoscaler object similar to the following for an existing object:

apiVersion: autoscaling/v2 
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-resource-metrics-memory 
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1 
    kind: Deployment 
    name: example 
  minReplicas: 1 
  maxReplicas: 10 
  metrics: 
  - type: Resource
    resource:
      name: memory 
      target:
        type: AverageValue 
        averageValue: 500Mi 
  behavior: 
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
      - type: Percent
        value: 10
        periodSeconds: 60
      selectPolicy: Max

Use the autoscaling/v2 API.

Specify a name for this horizontal pod autoscaler object.

Specify the API version of the object to scale:

For a Deployment, ReplicaSet, or StatefulSet object, use apps/v1.
For a ReplicationController, use v1.
For a DeploymentConfig, use apps.openshift.io/v1.

Specify the type of object. The object must be a Deployment, DeploymentConfig, ReplicaSet, ReplicationController, or StatefulSet.

Specify the name of the object to scale. The object must exist.

Specify the minimum number of replicas when scaling down.

Specify the maximum number of replicas when scaling up.

Use the metrics parameter for memory usage.

Specify memory for memory usage.

Set the type to AverageValue.

Specify averageValue and a specific memory value.

Optional: Specify a scaling policy to control the rate of scaling up or down.
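The scaleDown policies in the example object can be read as a per-period limit on how many pods can be removed. The following Python sketch is a simplified model of that selection: a Pods policy allows a fixed number of removals, a Percent policy allows a share of the current replicas (rounded up here, which is an assumption), and selectPolicy chooses between them. The function name is hypothetical, and the sketch ignores periodSeconds and stabilizationWindowSeconds.

```python
def max_scale_down(current_replicas, policies, select_policy="Max"):
    """Return the largest number of pods that can be removed in one period."""
    allowed = []
    for policy in policies:
        if policy["type"] == "Pods":
            allowed.append(policy["value"])
        elif policy["type"] == "Percent":
            # Integer ceiling of current_replicas * value / 100.
            allowed.append(-((-current_replicas * policy["value"]) // 100))
        else:
            raise ValueError("unknown policy type: " + policy["type"])
    if select_policy == "Max":
        return max(allowed)
    if select_policy == "Min":
        return min(allowed)
    if select_policy == "Disabled":
        return 0
    raise ValueError("unknown selectPolicy: " + select_policy)


# The two policies from the example object.
policies = [
    {"type": "Pods", "value": 4},
    {"type": "Percent", "value": 10},
]
print(max_scale_down(80, policies))  # 8, the Percent policy allows more
print(max_scale_down(20, policies))  # 4, the Pods policy allows more
```

With selectPolicy set to Max, the policy that permits the larger change applies, so large workloads shrink by percentage while small ones shrink by the fixed pod count.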

Create the horizontal pod autoscaler by using a command similar to the following:
```
$ oc create -f <file-name>.yaml
```
For example:
```
$ oc create -f hpa.yaml
```
Example output
```
horizontalpodautoscaler.autoscaling/hpa-resource-metrics-memory created
```

Verification

Check that the horizontal pod autoscaler was created by using a command similar to the following:

$ oc get hpa hpa-resource-metrics-memory

Example output

NAME                          REFERENCE            TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
hpa-resource-metrics-memory   Deployment/example   2441216/500Mi   1         10        1          20m

Check the details of the horizontal pod autoscaler by using a command similar to the following:

$ oc describe hpa hpa-resource-metrics-memory

Example output

Name:                        hpa-resource-metrics-memory
Namespace:                   default
Labels:                      <none>
Annotations:                 <none>
CreationTimestamp:           Wed, 04 Mar 2020 16:31:37 +0530
Reference:                   Deployment/example
Metrics:                     ( current / target )
  resource memory on pods:   2441216 / 500Mi
Min replicas:                1
Max replicas:                10
Deployment pods:             1 current / 1 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from memory resource
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:
  Type     Reason                   Age                 From                       Message
  ----     ------                   ----                ----                       -------
  Normal   SuccessfulRescale        6m34s               horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

2.4.7. Understanding horizontal pod autoscaler status conditions by using the CLI

You can use the status conditions set to determine whether or not the horizontal pod autoscaler (HPA) is able to scale and whether or not it is currently restricted in any way.

The HPA status conditions are available with the v2 version of the autoscaling API.

The HPA responds with the following status conditions:

The AbleToScale condition indicates whether the HPA is able to fetch and update metrics, as well as whether any backoff-related conditions could prevent scaling.
- A True condition indicates that scaling is allowed.
- A False condition indicates that scaling is not allowed for the reason specified.
The ScalingActive condition indicates whether the HPA is enabled (for example, the replica count of the target is not zero) and is able to calculate desired metrics.
- A True condition indicates that metrics are working properly.
- A False condition generally indicates a problem with fetching metrics.
The ScalingLimited condition indicates that the desired scale was capped by the maximum or minimum of the horizontal pod autoscaler.
- A True condition indicates that you need to raise or lower the minimum or maximum replica count in order to scale.
- A False condition indicates that the requested scaling is allowed.
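The way these three conditions combine can be sketched in Python. The helper below is hypothetical: it takes a mapping of condition type to status string, as you might transcribe it from the Conditions table, and returns plain-language notes based only on the meanings described above.

```python
def summarize_conditions(conditions):
    """Translate HPA condition statuses into short diagnostic notes."""
    notes = []
    if conditions.get("AbleToScale") == "False":
        notes.append("scaling is not allowed; check the reported reason")
    if conditions.get("ScalingActive") == "False":
        notes.append("metrics could not be fetched or calculated")
    if conditions.get("ScalingLimited") == "True":
        notes.append("desired scale is capped by the minimum or maximum replica count")
    return notes


# Statuses from a hypothetical HPA that cannot read its metrics.
print(summarize_conditions({
    "AbleToScale": "True",
    "ScalingActive": "False",
}))
# ['metrics could not be fetched or calculated']
```

An empty list means none of the three conditions reports a problem.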

$ oc describe hpa cm-test

Example output

Name:                           cm-test
Namespace:                      prom
Labels:                         <none>
Annotations:                    <none>
CreationTimestamp:              Fri, 16 Jun 2017 18:09:22 +0000
Reference:                      ReplicationController/cm-test
Metrics:                        ( current / target )
  "http_requests" on pods:      66m / 500m
Min replicas:                   1
Max replicas:                   4
ReplicationController pods:     1 current / 1 desired
Conditions: 
  Type              Status    Reason              Message
  ----              ------    ------              -------
  AbleToScale       True      ReadyForNewScale    the last scale time was sufficiently old as to warrant a new scale
  ScalingActive     True      ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric http_request
  ScalingLimited    False     DesiredWithinRange  the desired replica count is within the acceptable range
Events:

The Conditions section of the output shows the horizontal pod autoscaler status messages.

The following is an example of a pod that is unable to scale:

Example output

Conditions:
  Type         Status  Reason          Message
  ----         ------  ------          -------
  AbleToScale  False   FailedGetScale  the HPA controller was unable to get the target's current scale: no matches for kind "ReplicationController" in group "apps"
Events:
  Type     Reason          Age               From                       Message
  ----     ------          ----              ----                       -------
  Warning  FailedGetScale  6s (x3 over 36s)  horizontal-pod-autoscaler  no matches for kind "ReplicationController" in group "apps"

The following is an example of a pod that could not obtain the needed metrics for scaling:

Example output

Conditions:
  Type                  Status    Reason                    Message
  ----                  ------    ------                    -------
  AbleToScale           True     SucceededGetScale          the HPA controller was able to get the target's current scale
  ScalingActive         False    FailedGetResourceMetric    the HPA was unable to compute the replica count: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API

The following is an example of a pod where the requested autoscaling was less than the required minimums:

Example output

Conditions:
  Type              Status    Reason              Message
  ----              ------    ------              -------
  AbleToScale       True      ReadyForNewScale    the last scale time was sufficiently old as to warrant a new scale
  ScalingActive     True      ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric http_request
  ScalingLimited    False     DesiredWithinRange  the desired replica count is within the acceptable range

2.4.7.1. Viewing horizontal pod autoscaler status conditions by using the CLI

You can view the status conditions set on a pod by the horizontal pod autoscaler (HPA).

Note

The horizontal pod autoscaler status conditions are available with the v2 version of the autoscaling API.

Prerequisites

$ oc describe PodMetrics openshift-kube-scheduler-ip-10-0-135-131.ec2.internal

Example output

Name:         openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Namespace:    openshift-kube-scheduler
Labels:       <none>
Annotations:  <none>
API Version:  metrics.k8s.io/v1beta1
Containers:
  Name:  wait-for-host-port
  Usage:
    Memory:  0
  Name:      scheduler
  Usage:
    Cpu:     8m
    Memory:  45440Ki
Kind:        PodMetrics
Metadata:
  Creation Timestamp:  2019-05-23T18:47:56Z
  Self Link:           /apis/metrics.k8s.io/v1beta1/namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Timestamp:             2019-05-23T18:47:56Z
Window:                1m0s
Events:                <none>

Procedure

To view the status conditions on a pod, use the following command with the name of the pod:

$ oc describe hpa <pod-name>

For example:

$ oc describe hpa cm-test

The conditions appear in the Conditions field in the output.

Example output

Name:                           cm-test
Namespace:                      prom
Labels:                         <none>
Annotations:                    <none>
CreationTimestamp:              Fri, 16 Jun 2017 18:09:22 +0000
Reference:                      ReplicationController/cm-test
Metrics:                        ( current / target )
  "http_requests" on pods:      66m / 500m
Min replicas:                   1
Max replicas:                   4
ReplicationController pods:     1 current / 1 desired
Conditions: 
  Type              Status    Reason              Message
  ----              ------    ------              -------
  AbleToScale       True      ReadyForNewScale    the last scale time was sufficiently old as to warrant a new scale
  ScalingActive     True      ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric http_request
  ScalingLimited    False     DesiredWithinRange  the desired replica count is within the acceptable range

2.5. Automatically adjust pod resource levels with the vertical pod autoscaler

The OpenShift Container Platform Vertical Pod Autoscaler Operator (VPA) automatically reviews the historic and current CPU and memory resources for containers in pods and can update the resource limits and requests based on the usage values it learns. The VPA uses individual custom resources (CR) to update all of the pods associated with a workload object, such as a Deployment, DeploymentConfig, StatefulSet, Job, DaemonSet, ReplicaSet, or ReplicationController, in a project.

The VPA helps you to understand the optimal CPU and memory usage for your pods and can automatically maintain pod resources through the pod lifecycle.

2.5.1. About the Vertical Pod Autoscaler Operator

The Vertical Pod Autoscaler Operator (VPA) is implemented as an API resource and a custom resource (CR). The CR determines the actions for the VPA to take with the pods associated with a specific workload object, such as a daemon set, replication controller, and so forth, in a project.

The VPA consists of three components, each of which has its own pod in the VPA namespace:

Recommender: The VPA recommender monitors the current and past resource consumption. Based on this data, the VPA recommender determines the optimal CPU and memory resources for the pods in the associated workload object.
Updater: The VPA updater checks if the pods in the associated workload object have the correct resources. If the resources are correct, the updater takes no action. If the resources are not correct, the updater deletes the pods so that their controllers can re-create them with the updated requests.
Admission controller: The VPA admission controller sets the correct resource requests on each new pod in the associated workload object. This applies whether the pod is new or the controller re-created the pod due to the VPA updater actions.

You can use the default recommender or use your own alternative recommender to autoscale based on your own algorithms.

The default recommender automatically computes historic and current CPU and memory usage for the containers in those pods. The default recommender uses this data to determine optimized resource limits and requests to ensure that these pods are operating efficiently at all times. For example, the default recommender suggests reduced resources for pods that are requesting more resources than they are using and increased resources for pods that are not requesting enough.

The VPA then automatically deletes any pods that are out of alignment with these recommendations one at a time, so that your applications can continue to serve requests with no downtime. The workload objects then redeploy the pods with the original resource limits and requests. The VPA uses a mutating admission webhook to update the pods with optimized resource limits and requests before admitting the pods to a node. If you do not want the VPA to delete pods, you can view the VPA resource limits and requests and manually update the pods as needed.

Note

By default, workload objects must specify a minimum of two replicas for the VPA to automatically delete their pods. Workload objects that specify fewer replicas than this minimum are not deleted. If you manually delete these pods, when the workload object redeploys the pods, the VPA updates the new pods with its recommendations. You can change this minimum by modifying the VerticalPodAutoscalerController object as shown in Changing the VPA minimum value.

For example, if you have a pod that uses 50% of the CPU but only requests 10%, the VPA determines that the pod is consuming more CPU than requested and deletes the pod. The workload object, such as a replica set, restarts the pod, and the VPA updates the new pod with its recommended resources.

Developers can use the VPA to help ensure that pods stay up during periods of high demand by scheduling pods onto nodes that have appropriate resources for each pod.

Administrators can use the VPA to better use cluster resources, such as preventing pods from reserving more CPU resources than needed. The VPA monitors the resources that workloads are actually using and adjusts the resource requirements so capacity is available to other workloads. The VPA also maintains the ratios between limits and requests specified in the initial container configuration.
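To illustrate what maintaining the ratio between limits and requests means, the following Python sketch scales a container's limit in proportion to a new recommended request. The function name and the millicore values are hypothetical, and the integer rounding is an assumption; the VPA's own rounding might differ.

```python
def scaled_limit(original_request, original_limit, recommended_request):
    """Scale a limit so that limit/request keeps its original ratio."""
    if original_request <= 0:
        raise ValueError("original request must be positive")
    # Integer arithmetic keeps millicore values exact where possible.
    return recommended_request * original_limit // original_request


# A container that requested 100m CPU with a 200m limit (ratio 2:1).
# If the recommended request becomes 250m, the limit scales to 500m.
print(scaled_limit(100, 200, 250))  # 500
```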

Note

If you stop running the VPA or delete a specific VPA CR in your cluster, the resource requests for the pods already modified by the VPA do not change. However, any new pods get the resources defined in the workload object, not the previous recommendations made by the VPA.

2.5.2. Installing the Vertical Pod Autoscaler Operator

You can use the OpenShift Container Platform web console to install the Vertical Pod Autoscaler Operator (VPA).

Procedure

In the OpenShift Container Platform web console, click Operators → OperatorHub.
Choose VerticalPodAutoscaler from the list of available Operators, and click Install.
On the Install Operator page, ensure that the Operator recommended namespace option is selected. This installs the Operator in the mandatory openshift-vertical-pod-autoscaler namespace, which is automatically created if it does not exist.
Click Install.

Verification

Verify the installation by listing the VPA components:
1. Navigate to Workloads → Pods.
2. Select the openshift-vertical-pod-autoscaler project from the drop-down menu and verify that there are four pods running.
3. Navigate to Workloads → Deployments to verify that there are four deployments running.

Optional: Verify the installation in the OpenShift Container Platform CLI using the following command:

$ oc get all -n openshift-vertical-pod-autoscaler

The output shows four pods and four deployments:

Example output

NAME                                                    READY   STATUS    RESTARTS   AGE
pod/vertical-pod-autoscaler-operator-85b4569c47-2gmhc   1/1     Running   0          3m13s
pod/vpa-admission-plugin-default-67644fc87f-xq7k9       1/1     Running   0          2m56s
pod/vpa-recommender-default-7c54764b59-8gckt            1/1     Running   0          2m56s
pod/vpa-updater-default-7f6cc87858-47vw9                1/1     Running   0          2m56s

NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/vpa-webhook   ClusterIP   172.30.53.206   <none>        443/TCP   2m56s

NAME                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/vertical-pod-autoscaler-operator   1/1     1            1           3m13s
deployment.apps/vpa-admission-plugin-default       1/1     1            1           2m56s
deployment.apps/vpa-recommender-default            1/1     1            1           2m56s
deployment.apps/vpa-updater-default                1/1     1            1           2m56s

NAME                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/vertical-pod-autoscaler-operator-85b4569c47   1         1         1       3m13s
replicaset.apps/vpa-admission-plugin-default-67644fc87f       1         1         1       2m56s
replicaset.apps/vpa-recommender-default-7c54764b59            1         1         1       2m56s
replicaset.apps/vpa-updater-default-7f6cc87858                1         1         1       2m56s

2.5.3. About using the Vertical Pod Autoscaler Operator

To use the Vertical Pod Autoscaler Operator (VPA), you create a VPA custom resource (CR) for a workload object in your cluster. The VPA learns and applies the optimal CPU and memory resources for the pods associated with that workload object. You can use a VPA with a deployment, stateful set, job, daemon set, replica set, or replication controller workload object. The VPA CR must be in the same project as the pods that you want to check.

You use the VPA CR to associate a workload object and specify the mode that the VPA operates in:

The Auto and Recreate modes automatically apply the VPA CPU and memory recommendations throughout the pod lifetime. The VPA deletes any pods in the project that are out of alignment with its recommendations. When the workload object redeploys the pods, the VPA updates the new pods with its recommendations.
The Initial mode automatically applies VPA recommendations only at pod creation.
The Off mode only provides recommended resource limits and requests. You can then manually apply the recommendations. The Off mode does not update pods.

You can also use the CR to opt certain containers out of VPA evaluation and updates.

For example, a pod has the following limits and requests:

resources:
  limits:
    cpu: 1
    memory: 500Mi
  requests:
    cpu: 500m
    memory: 100Mi

After creating a VPA that is set to Auto, the VPA learns the resource usage and deletes the pod. When redeployed, the pod uses the new resource limits and requests:

resources:
  limits:
    cpu: 50m
    memory: 1250Mi
  requests:
    cpu: 25m
    memory: 262144k
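
The numbers in this example are consistent with the VPA keeping each container's original limit-to-request ratio when it sets new requests. That proportional scaling is an assumption based on upstream VPA behavior and is not stated in this section. The following minimal sketch checks the arithmetic; the function name `scaled_limit` is hypothetical:

```python
from fractions import Fraction

def scaled_limit(old_request, old_limit, new_request):
    """Scale a limit so that limit/request stays the same as before (assumed behavior)."""
    return Fraction(old_limit) / Fraction(old_request) * Fraction(new_request)

MI = 1024 ** 2

# CPU in millicores: request 500m, limit 1 core (1000m), new request 25m.
cpu_limit = scaled_limit(500, 1000, 25)

# Memory in bytes: request 100Mi, limit 500Mi, new request 262144k (k = 1000).
mem_limit = scaled_limit(100 * MI, 500 * MI, 262144 * 1000)

print(cpu_limit, mem_limit / MI)  # prints "50 1250", that is 50m and 1250Mi
```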

You can view the VPA recommendations by using the following command:

$ oc get vpa <vpa-name> --output yaml

After a few minutes, the output shows the recommendations for CPU and memory requests, similar to the following:

Example output

...
status:
...
  recommendation:
    containerRecommendations:
    - containerName: frontend
      lowerBound:
        cpu: 25m
        memory: 262144k
      target:
        cpu: 25m
        memory: 262144k
      uncappedTarget:
        cpu: 25m
        memory: 262144k
      upperBound:
        cpu: 262m
        memory: "274357142"
    - containerName: backend
      lowerBound:
        cpu: 12m
        memory: 131072k
      target:
        cpu: 12m
        memory: 131072k
      uncappedTarget:
        cpu: 12m
        memory: 131072k
      upperBound:
        cpu: 476m
        memory: "498558823"
...

The output shows the recommended resources (target), the minimum recommended resources (lowerBound), the highest recommended resources (upperBound), and the most recent resource recommendations (uncappedTarget).

The VPA uses the lowerBound and upperBound values to determine if a pod needs updating. If a pod has resource requests less than the lowerBound values or more than the upperBound values, the VPA terminates and recreates the pod with the target values.
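
The bounds check described above can be sketched as follows. This is a simplified illustration, not the VPA implementation: the function names are hypothetical, the parser handles only the quantity suffixes that appear in this section, and the real updater might apply additional criteria before it deletes a pod.

```python
from fractions import Fraction

# Multipliers for the Kubernetes quantity suffixes that appear in this section.
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3,
    "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3,
    "m": Fraction(1, 1000),
}

def parse_quantity(text):
    """Convert a quantity such as '500m', '100Mi', or '262144k' to a Fraction."""
    text = str(text)
    # Check two-letter suffixes before one-letter suffixes.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Fraction(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Fraction(text)

def needs_update(requests, recommendation):
    """Return True if any request is below lowerBound or above upperBound."""
    for resource, value in requests.items():
        current = parse_quantity(value)
        if current < parse_quantity(recommendation["lowerBound"][resource]):
            return True
        if current > parse_quantity(recommendation["upperBound"][resource]):
            return True
    return False

# Bounds for the frontend container, copied from the example output.
frontend = {
    "lowerBound": {"cpu": "25m", "memory": "262144k"},
    "upperBound": {"cpu": "262m", "memory": "274357142"},
}

print(needs_update({"cpu": "500m", "memory": "100Mi"}, frontend))   # True
print(needs_update({"cpu": "25m", "memory": "262144k"}, frontend))  # False
```

With the original requests from the earlier example (500m CPU, 100Mi memory), the CPU request is above upperBound and the memory request is below lowerBound, so the pod is a candidate for recreation.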

2.5.3.1. Changing the VPA minimum value

By default, workload objects must specify a minimum of two replicas in order for the VPA to automatically delete and update their pods. As a result, workload objects that specify fewer than two replicas are not automatically acted upon by the VPA. The VPA does update new pods from these workload objects if a process external to the VPA restarts the pods. You can change this cluster-wide minimum value by modifying the minReplicas parameter in the VerticalPodAutoscalerController custom resource (CR).

For example, if you set minReplicas to 3, the VPA does not delete and update pods for workload objects that specify fewer than three replicas.

Note

If you set minReplicas to 1, the VPA can delete the only pod for a workload object that specifies only one replica. Use this setting with one-replica objects only if your workload can tolerate downtime whenever the VPA deletes a pod to adjust its resources. To avoid unwanted downtime with one-replica objects, configure the VPA CRs with the updateMode set to Initial, which automatically updates the pod only when a process external to the VPA restarts the pod, or Off, which you can use to update the pod manually at an appropriate time for your application.

Example VerticalPodAutoscalerController object

apiVersion: autoscaling.openshift.io/v1
kind: VerticalPodAutoscalerController
metadata:
  creationTimestamp: "2021-04-21T19:29:49Z"
  generation: 2
  name: default
  namespace: openshift-vertical-pod-autoscaler
  resourceVersion: "142172"
  uid: 180e17e9-03cc-427f-9955-3b4d7aeb2d59
spec:
  minReplicas: 3 # 1
  podMinCPUMillicores: 25
  podMinMemoryMb: 250
  recommendationOnly: false
  safetyMarginFraction: 0.15

1: Specify the minimum number of replicas in a workload object for the VPA to act on. The VPA does not automatically delete pods for any object that has fewer replicas than this minimum.
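
The interaction between minReplicas and the update mode can be summarized in a short sketch. The function name `vpa_may_evict` is hypothetical, and the logic only restates the rules in this section; it is not the VPA implementation.

```python
def vpa_may_evict(replicas, update_mode, min_replicas=2):
    """Return True if the VPA is allowed to delete pods of this workload."""
    # Only the Auto and Recreate modes delete running pods.
    if update_mode not in ("Auto", "Recreate"):
        return False
    # Workload objects below the cluster-wide minimum are left alone.
    return replicas >= min_replicas

print(vpa_may_evict(1, "Auto"))                  # False: below the default minimum of 2
print(vpa_may_evict(2, "Auto", min_replicas=3))  # False: minimum raised to 3
print(vpa_may_evict(1, "Auto", min_replicas=1))  # True: the only pod can be deleted
print(vpa_may_evict(5, "Initial"))               # False: Initial does not delete pods
```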

2.5.3.2. Automatically applying VPA recommendations

To use the VPA to automatically update pods, create a VPA CR for a specific workload object with updateMode set to Auto or Recreate.

When the pods are created for the workload object, the VPA constantly monitors the containers to analyze their CPU and memory needs. The VPA deletes any pods that do not meet the VPA recommendations for CPU and memory. When redeployed, the pods use the new resource limits and requests based on the VPA recommendations, honoring any pod disruption budget set for your applications. The recommendations are added to the status field of the VPA CR for reference.

Note

By default, workload objects must specify a minimum of two replicas in order for the VPA to automatically delete their pods. Workload objects that specify fewer replicas than this minimum are not deleted. If you manually delete these pods, the VPA updates the new pods with its recommendations when the workload object redeploys them. You can change this minimum by modifying the VerticalPodAutoscalerController object as shown in Changing the VPA minimum value.

Example VPA CR for the Auto mode

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment # 1
    name:       frontend # 2
  updatePolicy:
    updateMode: "Auto" # 3

1: The type of workload object you want this VPA CR to manage.
2: The name of the workload object you want this VPA CR to manage.
3: Set the mode to Auto or Recreate:

Auto. The VPA assigns resource requests on pod creation and updates the existing pods by terminating them when the requested resources differ significantly from the new recommendation.
Recreate. The VPA assigns resource requests on pod creation and updates the existing pods by terminating them when the requested resources differ significantly from the new recommendation. Use this mode rarely, only if you need to ensure that when the resource request changes the pods restart.

Note

Before a VPA can determine recommendations for resources and apply the recommended resources to new pods, operating pods must exist and be running in the project.

If a workload’s resource usage, such as CPU and memory, is consistent, the VPA can determine recommendations for resources in a few minutes. If a workload’s resource usage is inconsistent, the VPA must collect metrics at various resource usage intervals for the VPA to make an accurate recommendation.

2.5.3.3. Automatically applying VPA recommendations on pod creation

To use the VPA to apply the recommended resources only when a pod is first deployed, create a VPA CR for a specific workload object with updateMode set to Initial.

Then, manually delete any pods associated with the workload object that you want to use the VPA recommendations. In the Initial mode, the VPA does not delete pods and does not update the pods as it learns new resource recommendations.

Example VPA CR for the Initial mode

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment # 1
    name:       frontend # 2
  updatePolicy:
    updateMode: "Initial" # 3

1: The type of workload object you want this VPA CR to manage.
2: The name of the workload object you want this VPA CR to manage.
3: Set the mode to Initial. The VPA assigns resources when pods are created and does not change the resources during the lifetime of the pod.

Note

Before a VPA can determine recommended resources and apply the recommendations to new pods, operating pods must exist and be running in the project.

To obtain the most accurate recommendations from the VPA, wait at least 8 days for the pods to run and for the VPA to stabilize.

2.5.3.4. Manually applying VPA recommendations

To use the VPA to only determine the recommended CPU and memory values, create a VPA CR for a specific workload object with updateMode set to Off.

When the pods are created for that workload object, the VPA analyzes the CPU and memory needs of the containers and records those recommendations in the status field of the VPA CR. The VPA does not update the pods as it determines new resource recommendations.

Example VPA CR for the Off mode

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment # 1
    name:       frontend # 2
  updatePolicy:
    updateMode: "Off" # 3

1: The type of workload object you want this VPA CR to manage.
2: The name of the workload object you want this VPA CR to manage.
3: Set the mode to Off.

You can view the recommendations by using the following command.

$ oc get vpa <vpa-name> --output yaml

With the recommendations, you can edit the workload object to add CPU and memory requests, then delete and redeploy the pods by using the recommended resources.
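
As a minimal sketch of that manual step, the following code collects the target values per container from a VPA object. It assumes that you already loaded the command output into a Python dictionary, for example by requesting JSON output and parsing it with the json module. The function name `requests_from_recommendation` is hypothetical, and the sample data is copied from the example output earlier in this section.

```python
def requests_from_recommendation(vpa):
    """Map each container name to the 'target' values from a VPA object."""
    recommendations = (
        vpa.get("status", {})
        .get("recommendation", {})
        .get("containerRecommendations", [])
    )
    return {item["containerName"]: dict(item["target"]) for item in recommendations}

# Sample VPA status, reduced to the fields that this sketch reads.
vpa = {
    "status": {
        "recommendation": {
            "containerRecommendations": [
                {"containerName": "frontend",
                 "target": {"cpu": "25m", "memory": "262144k"}},
                {"containerName": "backend",
                 "target": {"cpu": "12m", "memory": "131072k"}},
            ]
        }
    }
}

for name, target in requests_from_recommendation(vpa).items():
    print(name, target)
```

You can then copy each container's values into the resources.requests field of the workload object.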

Note

Before a VPA can determine recommended resources and apply the recommendations to new pods, operating pods must exist and be running in the project.

To obtain the most accurate recommendations from the VPA, wait at least 8 days for the pods to run and for the VPA to stabilize.

2.5.3.5. Exempting containers from applying VPA recommendations

If your workload object has multiple containers and you do not want the VPA to evaluate and act on all of the containers, create a VPA CR for a specific workload object and add a resourcePolicy to opt specific containers out.

When the VPA updates the pods with recommended resources, any containers with a resourcePolicy are not updated and the VPA does not present recommendations for those containers in the pod.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment # 1
    name:       frontend # 2
  updatePolicy:
    updateMode: "Auto" # 3
  resourcePolicy: # 4
    containerPolicies:
    - containerName: my-opt-sidecar
      mode: "Off"

1: The type of workload object you want this VPA CR to manage.
2: The name of the workload object you want this VPA CR to manage.
3: Set the mode to Auto, Recreate, Initial, or Off. Use the Recreate mode rarely, only if you need to ensure that when the resource request changes the pods restart.
4: Specify the containers that you do not want updated by the VPA and set the mode to Off.

For example, a pod has two containers, both with the same resource requests and limits:

# ...
spec:
  containers:
  - name: frontend
    resources:
      limits:
        cpu: 1
        memory: 500Mi
      requests:
        cpu: 500m
        memory: 100Mi
  - name: backend
    resources:
      limits:
        cpu: "1"
        memory: 500Mi
      requests:
        cpu: 500m
        memory: 100Mi
# ...

After launching a VPA CR with the backend container set to opt-out, the VPA terminates and recreates the pod with the recommended resources applied only to the frontend container:

# ...
spec:
  containers:
  - name: frontend
    resources:
      limits:
        cpu: 50m
        memory: 1250Mi
      requests:
        cpu: 25m
        memory: 262144k
# ...
  - name: backend
    resources:
      limits:
        cpu: "1"
        memory: 500Mi
      requests:
        cpu: 500m
        memory: 100Mi
# ...

2.5.3.6. Custom memory bump-up after OOM event

If your cluster experiences an OOM (out of memory) event, the Vertical Pod Autoscaler Operator (VPA) increases the memory recommendation. The basis for the recommendation is the memory consumption observed during the OOM event and a specified multiplier value to prevent future crashes due to insufficient memory.

The recommendation is the higher of two calculations: the memory in use by the pod when the OOM event happened plus a specified number of bytes, or that memory multiplied by a specified ratio. The following formula represents the calculation:

recommendation = max(memory-usage-in-oom-event + oom-min-bump-up-bytes, memory-usage-in-oom-event * oom-bump-up-ratio)

You can configure the memory increase by specifying the following values in the recommender pod:

oom-min-bump-up-bytes. This value, in bytes, is a specific increase in memory after an OOM event occurs. The default is 100MiB.
oom-bump-up-ratio. This value is a multiplier that is applied to the memory in use when the OOM event occurred. The default value is 1.2, which is a 20% increase.

For example, if the pod memory usage during an OOM event is 100 MB, oom-min-bump-up-bytes is set to 150 MB, and oom-bump-up-ratio is 1.2, then after the OOM event the VPA recommends increasing the memory request for that pod to 250 MB (100 MB + 150 MB), because that value is higher than 120 MB (100 MB * 1.2).
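
The formula can be checked with a short calculation. The following sketch applies the formula as written above with the default values (100MiB and 1.2); the function name `oom_recommendation` and the sample usage figures are illustrative and are not taken from the recommender source.

```python
MIB = 1024 ** 2

def oom_recommendation(usage_bytes, min_bump_up_bytes=100 * MIB, bump_up_ratio=1.2):
    """Apply max(usage + min bump-up, usage * ratio) from the formula above."""
    return max(usage_bytes + min_bump_up_bytes, round(usage_bytes * bump_up_ratio))

small = oom_recommendation(200 * MIB)   # additive term wins: 300 MiB
large = oom_recommendation(1000 * MIB)  # ratio term wins: 1200 MiB

print(small // MIB, large // MIB)  # prints "300 1200"
```

With the default values, the two terms are equal at 500 MiB of usage. Below that point the fixed increase determines the recommendation, and above it the ratio does.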

Example recommender deployment object

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vpa-recommender-default
  namespace: openshift-vertical-pod-autoscaler
# ...
spec:
# ...
  template:
# ...
    spec:
      containers:
      - name: recommender
        args:
        - --oom-bump-up-ratio=2.0
        - --oom-min-bump-up-bytes=524288000
# ...

Additional resources

Understanding OOM kill policy

2.5.3.7. Using an alternative recommender

You can use your own recommender to autoscale based on your own algorithms. If you do not specify an alternative recommender, OpenShift Container Platform uses the default recommender, which suggests CPU and memory requests based on historical usage. Because there is no universal recommendation policy that applies to all types of workloads, you might want to create and deploy different recommenders for specific workloads.

For example, the default recommender might not accurately predict future resource usage when containers exhibit certain resource behaviors. Examples are cyclical patterns that alternate between usage spikes and idling as used by monitoring applications, or recurring and repeating patterns used with deep learning applications. Using the default recommender with these usage behaviors might result in significant over-provisioning and Out of Memory (OOM) kills for your applications.

Note

Instructions for how to create a recommender are beyond the scope of this documentation.

Procedure

To use an alternative recommender for your pods:

Create a service account for the alternative recommender and bind that service account to the required cluster role:

apiVersion: v1 # 1
kind: ServiceAccount
metadata:
  name: alt-vpa-recommender-sa
  namespace: <namespace_name>
---
apiVersion: rbac.authorization.k8s.io/v1 # 2
kind: ClusterRoleBinding
metadata:
  name: system:example-metrics-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:metrics-reader
subjects:
- kind: ServiceAccount
  name: alt-vpa-recommender-sa
  namespace: <namespace_name>
---
apiVersion: rbac.authorization.k8s.io/v1 # 3
kind: ClusterRoleBinding
metadata:
  name: system:example-vpa-actor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:vpa-actor
subjects:
- kind: ServiceAccount
  name: alt-vpa-recommender-sa
  namespace: <namespace_name>
---
apiVersion: rbac.authorization.k8s.io/v1 # 4
kind: ClusterRoleBinding
metadata:
  name: system:example-vpa-target-reader-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:vpa-target-reader
subjects:
- kind: ServiceAccount
  name: alt-vpa-recommender-sa
  namespace: <namespace_name>

1: Creates a service account for the recommender in the namespace where you deploy the recommender.
2: Binds the recommender service account to the metrics-reader role. Specify the namespace where you deploy the recommender.
3: Binds the recommender service account to the vpa-actor role. Specify the namespace where you deploy the recommender.
4: Binds the recommender service account to the vpa-target-reader role. Specify the namespace where you deploy the recommender.

To add the alternative recommender to the cluster, create a Deployment object similar to the following:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: alt-vpa-recommender
  namespace: <namespace_name>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alt-vpa-recommender
  template:
    metadata:
      labels:
        app: alt-vpa-recommender
    spec:
      containers: # 1
      - name: recommender
        image: quay.io/example/alt-recommender:latest # 2
        imagePullPolicy: Always
        resources:
          limits:
            cpu: 200m
            memory: 1000Mi
          requests:
            cpu: 50m
            memory: 500Mi
        ports:
        - name: prometheus
          containerPort: 8942
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - ALL
          seccompProfile:
            type: RuntimeDefault
      serviceAccountName: alt-vpa-recommender-sa # 3
      securityContext:
        runAsNonRoot: true

1: Creates a container for your alternative recommender.
2: Specifies your recommender image.
3: Associates the service account that you created for the recommender.

A new pod is created for the alternative recommender in the same namespace.

$ oc get pods

Example output

NAME                                        READY   STATUS    RESTARTS   AGE
frontend-845d5478d-558zf                    1/1     Running   0          4m25s
frontend-845d5478d-7z9gx                    1/1     Running   0          4m25s
frontend-845d5478d-b7l4j                    1/1     Running   0          4m25s
vpa-alt-recommender-55878867f9-6tp5v        1/1     Running   0          9s

Configure a Vertical Pod Autoscaler Operator (VPA) custom resource (CR) that includes the name of the alternative recommender Deployment object.

Example VPA CR to include the alternative recommender

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
  namespace: <namespace_name>
spec:
  recommenders:
    - name: alt-vpa-recommender # 1
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment
    name:       frontend # 2

1: Specifies the name of the alternative recommender deployment.
2: Specifies the name of an existing workload object you want this VPA to manage.

2.5.4. Using the Vertical Pod Autoscaler Operator

You can use the Vertical Pod Autoscaler Operator (VPA) by creating a VPA custom resource (CR). The CR indicates the pods to analyze and determines the actions for the VPA to take with those pods.

Prerequisites

Ensure the workload object that you want to autoscale exists.
Ensure that if you want to use an alternative recommender, a deployment including that recommender exists.

Procedure

To create a VPA CR for a specific workload object:

Change to the project that contains the workload object you want to scale.

Create a VPA CR YAML file:
```
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-recommender
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind:       Deployment # 1
    name:       frontend # 2
  updatePolicy:
    updateMode: "Auto" # 3
  resourcePolicy: # 4
    containerPolicies:
    - containerName: my-opt-sidecar
      mode: "Off"
  recommenders: # 5
    - name: my-recommender
```
1: Specify the type of workload object you want this VPA to manage: Deployment, StatefulSet, Job, DaemonSet, ReplicaSet, or ReplicationController.
2: Specify the name of an existing workload object you want this VPA to manage.
3: Specify the VPA mode:
Auto to automatically apply the recommended resources on pods associated with the workload object. The VPA terminates existing pods and creates new pods with the recommended resource limits and requests.
Recreate to automatically apply the recommended resources on pods associated with the workload object. The VPA terminates existing pods and creates new pods with the recommended resource limits and requests. Use the Recreate mode rarely, only if you need to ensure that the pods restart whenever the resource request changes.
Initial to automatically apply the recommended resources to newly-created pods associated with the workload object. The VPA does not update the pods as it learns new resource recommendations.
Off to only generate resource recommendations for the pods associated with the workload object. The VPA does not update the pods as it learns new resource recommendations and does not apply the recommendations to new pods.
4: Optional. Specify the containers you want to opt out and set the mode to Off.
5: Optional. Specify an alternative recommender.

Create the VPA CR:

$ oc create -f <file-name>.yaml

After a few moments, the VPA learns the resource usage of the containers in the pods associated with the workload object.

You can view the VPA recommendations by using the following command:

$ oc get vpa <vpa-name> --output yaml

The output shows the recommendations for CPU and memory requests, similar to the following:

Example output

...
status:

...

  recommendation:
    containerRecommendations:
    - containerName: frontend
      lowerBound: 
        cpu: 25m
        memory: 262144k
      target: 
        cpu: 25m
        memory: 262144k
      uncappedTarget: 
        cpu: 25m
        memory: 262144k
      upperBound: 
        cpu: 262m
        memory: "274357142"
    - containerName: backend
      lowerBound:
        cpu: 12m
        memory: 131072k
      target:
        cpu: 12m
        memory: 131072k
      uncappedTarget:
        cpu: 12m
        memory: 131072k
      upperBound:
        cpu: 476m
        memory: "498558823"

...

1: lowerBound is the minimum recommended resource level.
2: target is the recommended resource level.
3: upperBound is the highest recommended resource level.
4: uncappedTarget is the most recent resource recommendation.
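
The example output mixes notations: CPU values use millicores (25m), some memory values use a decimal suffix (262144k), and others are plain byte counts ("274357142"). To compare bounds, it can help to normalize them first. The following Python sketch shows one way to do that; the `parse_quantity` helper is hypothetical and handles only the subset of suffixes listed in the code.

```python
from decimal import Decimal

# Suffix multipliers for a subset of Kubernetes resource quantities.
# Assumption: only the suffixes below are needed; exponent forms such as
# "1e3" are not handled in this sketch.
_SUFFIXES = {
    "m": Decimal("0.001"),   # milli, used for CPU (25m = 0.025 cores)
    "k": Decimal(10) ** 3,   # decimal kilo
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
    "Ki": Decimal(2) ** 10,  # binary kibi
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity string such as '262144k' or '25m' to base units."""
    # Check two-character suffixes before one-character ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)  # plain number: bytes for memory, cores for CPU


# Values copied from the frontend container in the example output.
frontend = {
    "lowerBound": {"cpu": "25m", "memory": "262144k"},
    "target": {"cpu": "25m", "memory": "262144k"},
    "upperBound": {"cpu": "262m", "memory": "274357142"},
}

for bound, resources in frontend.items():
    cpu = parse_quantity(resources["cpu"])
    memory = parse_quantity(resources["memory"])
    print(f"{bound}: {cpu} cores, {memory} bytes")
```

With the values normalized, you can check that the target falls between lowerBound and upperBound for each container.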

2.5.5. Uninstalling the Vertical Pod Autoscaler Operator

You can remove the Vertical Pod Autoscaler Operator (VPA) from your OpenShift Container Platform cluster. After uninstalling, the resource requests for the pods that are already modified by an existing VPA custom resource (CR) do not change. The resources defined in the workload object, not the previous recommendations made by the VPA, are allocated to any new pods.

Note

You can remove a specific VPA CR by using the oc delete vpa <vpa-name> command. The same actions apply for resource requests as uninstalling the vertical pod autoscaler.

After removing the VPA, it is recommended that you remove the other components associated with the Operator to avoid potential issues.

Prerequisites

You installed the VPA.

Procedure

In the OpenShift Container Platform web console, click Operators → Installed Operators.
Switch to the openshift-vertical-pod-autoscaler project.
For the VerticalPodAutoscaler Operator, click the Options menu and select Uninstall Operator.
Optional: To remove all operands associated with the Operator, in the dialog box, select the Delete all operand instances for this operator checkbox.
Click Uninstall.
Optional: Use the OpenShift CLI to remove the VPA components:
1. Delete the VPA namespace:
  $ oc delete namespace openshift-vertical-pod-autoscaler
2. Delete the VPA custom resource definition (CRD) objects:
  $ oc delete crd verticalpodautoscalercheckpoints.autoscaling.k8s.io
  $ oc delete crd verticalpodautoscalercontrollers.autoscaling.openshift.io
  $ oc delete crd verticalpodautoscalers.autoscaling.k8s.io
  Deleting the CRDs removes the associated roles, cluster roles, and role bindings.
  Note
  This action removes from the cluster all user-created VPA CRs. If you re-install the VPA, you must create these objects again.
3. Delete the MutatingWebhookConfiguration object by running the following command:
  $ oc delete MutatingWebhookConfiguration vpa-webhook-config
4. Delete the VPA Operator:
  $ oc delete operator/vertical-pod-autoscaler.openshift-vertical-pod-autoscaler

2.6. Providing sensitive data to pods

Some applications need sensitive information, such as passwords and user names, that you do not want developers to have.

As an administrator, you can use Secret objects to provide this information without exposing that information in clear text.

2.6.1. Understanding secrets

The Secret object type provides a mechanism to hold sensitive information such as passwords, OpenShift Container Platform client configuration files, private source repository credentials, and so on. Secrets decouple sensitive content from the pods. You can mount secrets into containers using a volume plugin or the system can use secrets to perform actions on behalf of a pod.

Key properties include:

Secret data can be referenced independently from its definition.
Secret data volumes are backed by temporary file-storage facilities (tmpfs) and never come to rest on a node.
Secret data can be shared within a namespace.

YAML Secret object definition

apiVersion: v1
kind: Secret
metadata:
  name: test-secret
  namespace: my-namespace
type: Opaque 
data: 
  username: <username> 
  password: <password>
stringData: 
  hostname: myapp.mydomain.com

1: Indicates the structure of the secret’s key names and values.
2: The allowable format for the keys in the data field must meet the guidelines in the DNS_SUBDOMAIN value in the Kubernetes identifiers glossary.
3: The value associated with keys in the data map must be base64 encoded.
4: Entries in the stringData map are converted to base64 and the entry will then be moved to the data map automatically. This field is write-only; the value will only be returned via the data field.
5: The value associated with keys in the stringData map is made up of plain text strings.
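
The relationship between the data and stringData maps can be sketched in Python. The example below is illustrative only: `encode_value` and `merge_string_data` are hypothetical helpers, the sample user name is made up, and the merge is a simplified model of the conversion described above, not the server implementation. It assumes the upstream Kubernetes behavior that a stringData entry takes precedence when the same key is also present in data.

```python
import base64


def encode_value(plain: str) -> str:
    """Base64-encode a plain-text value for use in the data map."""
    return base64.b64encode(plain.encode("utf-8")).decode("ascii")


def merge_string_data(secret: dict) -> dict:
    """Return a copy of the secret with stringData folded into data."""
    data = dict(secret.get("data", {}))
    for key, plain in secret.get("stringData", {}).items():
        data[key] = encode_value(plain)  # stringData wins on a key collision
    merged = {k: v for k, v in secret.items() if k != "stringData"}
    merged["data"] = data
    return merged


secret = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {"name": "test-secret", "namespace": "my-namespace"},
    "type": "Opaque",
    "data": {"username": encode_value("admin")},  # sample value
    "stringData": {"hostname": "myapp.mydomain.com"},
}

stored = merge_string_data(secret)
print(stored["data"])
# Decoding a stored value recovers the original text.
print(base64.b64decode(stored["data"]["hostname"]).decode("utf-8"))
```

Base64 is an encoding, not encryption: anyone who can read the Secret object can decode the values.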

You must create a secret before creating the pods that depend on that secret.

When creating secrets:

Create a secret object with secret data.
Update the pod’s service account to allow the reference to the secret.
Create a pod, which consumes the secret as an environment variable or as a file (using a secret volume).

2.6.1.1. Types of secrets

The value in the type field indicates the structure of the secret’s key names and values. The type can be used to enforce the presence of user names and keys in the secret object. If you do not want validation, use the Opaque type, which is the default.

Specify one of the following types to trigger minimal server-side validation to ensure the presence of specific key names in the secret data:

kubernetes.io/basic-auth: Use with Basic authentication
kubernetes.io/dockercfg: Use as an image pull secret
kubernetes.io/dockerconfigjson: Use as an image pull secret
kubernetes.io/service-account-token: Use to obtain a legacy service account API token
kubernetes.io/ssh-auth: Use with SSH key authentication
kubernetes.io/tls: Use with TLS certificate authorities

Specify type: Opaque if you do not want validation, which means the secret does not claim to conform to any convention for key names or values. An opaque secret allows for unstructured key:value pairs that can contain arbitrary values.

Note

You can specify other arbitrary types, such as example.com/my-secret-type. These types are not enforced server-side, but indicate that the creator of the secret intended to conform to the key/value requirements of that type.

For examples of creating different types of secrets, see Understanding how to create secrets.
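
As a rough illustration of the key-presence check, the following Python sketch reports which expected keys a secret is missing before you submit it. The `REQUIRED_KEYS` table and the `missing_keys` helper are hypothetical; the key names are taken from the sections of this document and from common Kubernetes conventions, and the server-side check may be more lenient (for example, for basic authentication). The service account token type is omitted because it depends on an annotation rather than on data keys.

```python
# Key names each secret type is expected to carry in its data or stringData map.
# Assumption: compiled for illustration; the API server remains the authority.
REQUIRED_KEYS = {
    "kubernetes.io/basic-auth": {"username", "password"},
    "kubernetes.io/dockercfg": {".dockercfg"},
    "kubernetes.io/dockerconfigjson": {".dockerconfigjson"},
    "kubernetes.io/ssh-auth": {"ssh-privatekey"},
    "kubernetes.io/tls": {"tls.crt", "tls.key"},
}


def missing_keys(secret: dict) -> set:
    """Return the expected key names that the secret does not define."""
    secret_type = secret.get("type", "Opaque")
    present = set(secret.get("data", {})) | set(secret.get("stringData", {}))
    # Opaque and unknown custom types have no expected keys.
    return REQUIRED_KEYS.get(secret_type, set()) - present


incomplete_tls = {"type": "kubernetes.io/tls", "data": {"tls.crt": "<base64 certificate>"}}
print(missing_keys(incomplete_tls))
print(missing_keys({"type": "Opaque", "data": {"anything": "dmFsdWU="}}))
```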

2.6.1.2. Secret data keys

The keys in the data field of a secret must meet the guidelines for a DNS subdomain name.

2.6.1.3. About automatically generated service account token secrets

When a service account is created, a service account token secret is automatically generated for it. This service account token secret, along with an automatically generated docker configuration secret, is used to authenticate to the internal OpenShift Container Platform registry. Do not rely on these automatically generated secrets for your own use; they might be removed in a future OpenShift Container Platform release.

Note

Prior to OpenShift Container Platform 4.11, a second service account token secret was generated when a service account was created. This service account token secret was used to access the Kubernetes API.

Starting with OpenShift Container Platform 4.11, this second service account token secret is no longer created. This is because the LegacyServiceAccountTokenNoAutoGeneration upstream Kubernetes feature gate was enabled, which stops the automatic generation of secret-based service account tokens to access the Kubernetes API.

After upgrading to 4.12, any existing service account token secrets are not deleted and continue to function.

Workloads are automatically injected with a projected volume to obtain a bound service account token. If your workload needs an additional service account token, add an additional projected volume in your workload manifest. Bound service account tokens are more secure than service account token secrets for the following reasons:

Bound service account tokens have a bounded lifetime.
Bound service account tokens contain audiences.
Bound service account tokens can be bound to pods or secrets and the bound tokens are invalidated when the bound object is removed.

For more information, see Configuring bound service account tokens using volume projection.

You can also manually create a service account token secret to obtain a token, if the security exposure of a non-expiring token in a readable API object is acceptable to you. For more information, see Creating a service account token secret.

Additional resources

For information about requesting bound service account tokens, see Using bound service account tokens.
For information about creating a service account token secret, see Creating a service account token secret.

2.6.2. Understanding how to create secrets

As an administrator, you must create a secret before developers can create the pods that depend on that secret.

When creating secrets:

Create a secret object that contains the data you want to keep secret. The specific data required for each secret type is described in the following sections.

Example YAML object that creates an opaque secret

apiVersion: v1
kind: Secret
metadata:
  name: test-secret
type: Opaque 
data: 
  username: <username>
  password: <password>
stringData: 
  hostname: myapp.mydomain.com
  secret.properties: |
    property1=valueA
    property2=valueB

1: Specifies the type of secret.
2: Specifies base64-encoded values.
3: Specifies plain text values, which are encoded when the secret is stored.

Use either the data or stringData fields, not both.

Update the pod’s service account to reference the secret:
YAML of a service account that uses a secret
```
apiVersion: v1
kind: ServiceAccount
 ...
secrets:
- name: test-secret
```

Create a pod, which consumes the secret as an environment variable or as a file (using a secret volume):

YAML of a pod populating files in a volume with secret data

apiVersion: v1
kind: Pod
metadata:
  name: secret-example-pod
spec:
  containers:
    - name: secret-test-container
      image: busybox
      command: [ "/bin/sh", "-c", "cat /etc/secret-volume/*" ]
      volumeMounts: 
          - name: secret-volume
            mountPath: /etc/secret-volume 
            readOnly: true 
  volumes:
    - name: secret-volume
      secret:
        secretName: test-secret 
  restartPolicy: Never

1: Add a volumeMounts field to each container that needs the secret.
2: Specifies an unused directory name where you would like the secret to appear. Each key in the secret data map becomes the filename under mountPath.
3: Set to true. If true, this instructs the driver to provide a read-only volume.
4: Specifies the name of the secret.

YAML of a pod populating environment variables with secret data

apiVersion: v1
kind: Pod
metadata:
  name: secret-example-pod
spec:
  containers:
    - name: secret-test-container
      image: busybox
      command: [ "/bin/sh", "-c", "export" ]
      env:
        - name: TEST_SECRET_USERNAME_ENV_VAR
          valueFrom:
            secretKeyRef: 
              name: test-secret
              key: username
  restartPolicy: Never

1: Specifies the environment variable that consumes the secret key.

YAML of a build config populating environment variables with secret data

apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: secret-example-bc
spec:
  strategy:
    sourceStrategy:
      env:
      - name: TEST_SECRET_USERNAME_ENV_VAR
        valueFrom:
          secretKeyRef: 
            name: test-secret
            key: username
      from:
        kind: ImageStreamTag
        namespace: openshift
        name: 'cli:latest'

1: Specifies the environment variable that consumes the secret key.

2.6.2.1. Secret creation restrictions

To use a secret, a pod needs to reference the secret. A secret can be used with a pod in three ways:

To populate environment variables for containers.
As files in a volume mounted on one or more of its containers.
By kubelet when pulling images for the pod.

Volume type secrets write data into the container as a file using the volume mechanism. Image pull secrets use service accounts for the automatic injection of the secret into all pods in a namespace.

When a template contains a secret definition, the only way for the template to use the provided secret is to ensure that the secret volume sources are validated and that the specified object reference actually points to a Secret object. Therefore, you must create a secret before any pods that depend on it. The most effective way to ensure this is to inject the secret automatically through a service account.

Secret API objects reside in a namespace. They can only be referenced by pods in that same namespace.

Individual secrets are limited to 1MB in size. This is to discourage the creation of large secrets that could exhaust apiserver and kubelet memory. However, creation of a number of smaller secrets could also exhaust memory.
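
A pre-flight size check can catch an oversized secret before you submit it. The following Python sketch is only an approximation: it assumes that the limit is 1 MiB (1024 * 1024 bytes) measured over the decoded values, which this document states only as "1MB", and the helper names are hypothetical.

```python
import base64

# Assumption: the limit is treated as 1 MiB over the decoded values.
MAX_SECRET_BYTES = 1024 * 1024


def decoded_size(data: dict) -> int:
    """Total size in bytes of all base64-encoded values in a data map."""
    return sum(len(base64.b64decode(value)) for value in data.values())


def fits_limit(data: dict, limit: int = MAX_SECRET_BYTES) -> bool:
    """Report whether the decoded values stay within the size limit."""
    return decoded_size(data) <= limit


small = {"password": base64.b64encode(b"s3cr3t").decode("ascii")}
large = {"blob": base64.b64encode(b"x" * (MAX_SECRET_BYTES + 1)).decode("ascii")}
print(decoded_size(small), fits_limit(small))
print(decoded_size(large), fits_limit(large))
```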

2.6.2.2. Creating an opaque secret

As an administrator, you can create an opaque secret, which allows you to store unstructured key:value pairs that can contain arbitrary values.

Procedure

Create a Secret object in a YAML file on a control plane node.

For example:

apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque 
data:
  username: <username>
  password: <password>

1: Specifies an opaque secret.

Use the following command to create a Secret object:
```
$ oc create -f <filename>.yaml
```
To use the secret in a pod:
1. Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
2. Create the pod, which consumes the secret as an environment variable or as a file (using a secret volume), as shown in the "Understanding how to create secrets" section.

2.6.2.3. Creating a service account token secret

As an administrator, you can create a service account token secret, which allows you to distribute a service account token to applications that must authenticate to the API.

Note

It is recommended to obtain bound service account tokens using the TokenRequest API instead of using service account token secrets. The tokens obtained from the TokenRequest API are more secure than the tokens stored in secrets, because they have a bounded lifetime and are not readable by other API clients.

You should create a service account token secret only if you cannot use the TokenRequest API and if the security exposure of a non-expiring token in a readable API object is acceptable to you.

See the Additional resources section that follows for information on creating bound service account tokens.

Procedure

Create a Secret object in a YAML file on a control plane node:

Example secret object:

apiVersion: v1
kind: Secret
metadata:
  name: secret-sa-sample
  annotations:
    kubernetes.io/service-account.name: "sa-name" 
type: kubernetes.io/service-account-token

1: Specifies an existing service account name. If you are creating both the ServiceAccount and the Secret objects, create the ServiceAccount object first.
2: Specifies a service account token secret.

Use the following command to create the Secret object:
```
$ oc create -f <filename>.yaml
```
To use the secret in a pod:
1. Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
2. Create the pod, which consumes the secret as an environment variable or as a file (using a secret volume), as shown in the "Understanding how to create secrets" section.

2.6.2.4. Creating a basic authentication secret

As an administrator, you can create a basic authentication secret, which allows you to store the credentials needed for basic authentication. When using this secret type, the data parameter of the Secret object must contain the following keys encoded in the base64 format:

username: the user name for authentication
password: the password or token for authentication

Note

You can use the stringData parameter to use clear text content.

Procedure

Create a Secret object in a YAML file on a control plane node:

Example secret object

apiVersion: v1
kind: Secret
metadata:
  name: secret-basic-auth
type: kubernetes.io/basic-auth 
stringData:
  username: admin
  password: <password>

1: Specifies a basic authentication secret.
2: Specifies the basic authentication values to use.

Use the following command to create the Secret object:
```
$ oc create -f <filename>.yaml
```
To use the secret in a pod:
1. Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
2. Create the pod, which consumes the secret as an environment variable or as a file (using a secret volume), as shown in the "Understanding how to create secrets" section.

2.6.2.5. Creating an SSH authentication secret

As an administrator, you can create an SSH authentication secret, which allows you to store data used for SSH authentication. When using this secret type, the data parameter of the Secret object must contain the SSH credential to use.

Procedure

Create a Secret object in a YAML file on a control plane node:

Example secret object:

apiVersion: v1
kind: Secret
metadata:
  name: secret-ssh-auth
type: kubernetes.io/ssh-auth 
data:
  ssh-privatekey: | 
          MIIEpQIBAAKCAQEAulqb/Y ...

1: Specifies an SSH authentication secret.
2: Specifies the SSH key/value pair as the SSH credentials to use.

Use the following command to create the Secret object:
```
$ oc create -f <filename>.yaml
```
To use the secret in a pod:
1. Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
2. Create the pod, which consumes the secret as an environment variable or as a file (using a secret volume), as shown in the "Understanding how to create secrets" section.

2.6.2.6. Creating a Docker configuration secret

As an administrator, you can create a Docker configuration secret, which allows you to store the credentials for accessing a container image registry.

kubernetes.io/dockercfg. Use this secret type to store your local Docker configuration file. The data parameter of the secret object must contain the contents of a .dockercfg file encoded in the base64 format.
kubernetes.io/dockerconfigjson. Use this secret type to store your local Docker configuration JSON file. The data parameter of the secret object must contain the contents of a .docker/config.json file encoded in the base64 format.

Procedure

Create a Secret object in a YAML file on a control plane node.

Example Docker configuration secret object

apiVersion: v1
kind: Secret
metadata:
  name: secret-docker-cfg
  namespace: my-project
type: kubernetes.io/dockercfg
data:
  .dockercfg: bm5ubm5ubm5ubm5ubm5ubm5ubm5ubmdnZ2dnZ2dnZ2dnZ2dnZ2dnZ2cgYXV0aCBrZXlzCg==

1: Specifies that the secret is using a Docker configuration file.
2: Specifies the base64-encoded contents of a Docker configuration file.

Example Docker configuration JSON secret object

apiVersion: v1
kind: Secret
metadata:
  name: secret-docker-json
  namespace: my-project
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: bm5ubm5ubm5ubm5ubm5ubm5ubm5ubmdnZ2dnZ2dnZ2dnZ2dnZ2dnZ2cgYXV0aCBrZXlzCg==

1: Specifies that the secret is using a Docker configuration JSON file.
2: Specifies the base64-encoded contents of a Docker configuration JSON file.

Use the following command to create the Secret object:
```
$ oc create -f <filename>.yaml
```
To use the secret in a pod:
1. Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
2. Create the pod, which consumes the secret as an environment variable or as a file (using a secret volume), as shown in the "Understanding how to create secrets" section.
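
The base64-encoded value for a Docker configuration JSON secret can be produced programmatically. The following Python sketch assumes the common layout of a .docker/config.json file, with an auths map keyed by registry host and an auth field that holds the base64 encoding of `username:password`. The registry host, user name, and helper names are hypothetical.

```python
import base64
import json


def docker_config_json(registry: str, username: str, password: str) -> str:
    """Build the text of a .docker/config.json file for one registry."""
    auth = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    config = {
        "auths": {
            registry: {"username": username, "password": password, "auth": auth}
        }
    }
    return json.dumps(config)


def docker_json_secret(name: str, namespace: str, config_text: str) -> dict:
    """Wrap the configuration text in a kubernetes.io/dockerconfigjson Secret."""
    encoded = base64.b64encode(config_text.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": encoded},
    }


# Hypothetical registry host and credentials, for illustration only.
config_text = docker_config_json("registry.example.com", "builder", "<password>")
secret = docker_json_secret("secret-docker-json", "my-project", config_text)
print(json.dumps(secret, indent=2))
```

You can save the printed manifest to a file and pass it to `oc create -f`, which generally accepts JSON manifests as well as YAML.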

2.6.2.7. Creating a secret using the web console

You can create secrets using the web console.

Procedure

Navigate to Workloads → Secrets.
Click Create → From YAML.
1. Edit the YAML manually to your specifications, or drag and drop a file into the YAML editor. For example:
  apiVersion: v1
  kind: Secret
  metadata:
    name: example
    namespace: <namespace>
  type: Opaque
  data:
    username: <base64 encoded username>
    password: <base64 encoded password>
  stringData:
    hostname: myapp.mydomain.com
  1: This example specifies an opaque secret; however, you may see other secret types such as service account token secret, basic authentication secret, SSH authentication secret, or a secret that uses Docker configuration.
  2: Entries in the stringData map are converted to base64 and the entry will then be moved to the data map automatically. This field is write-only; the value will only be returned via the data field.
Click Create.
Click Add Secret to workload.
1. From the drop-down menu, select the workload to add.
2. Click Save.

2.6.3. Understanding how to update secrets

When you modify the value of a secret, the value used by an already running pod does not change dynamically. To change a secret, you must delete the original pod and create a new pod (perhaps with an identical PodSpec).

Updating a secret follows the same workflow as deploying a new container image. You can use the kubectl rolling-update command.

The resourceVersion value in a secret is not specified when it is referenced. Therefore, if a secret is updated at the same time as pods are starting, the version of the secret that is used for the pod is not defined.

Note

Currently, it is not possible to check the resource version of a secret object that was used when a pod was created. It is planned that pods will report this information, so that a controller could restart ones using an old resourceVersion. In the interim, do not update the data of existing secrets, but create new ones with distinct names.
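
One way to follow the advice to create new secrets with distinct names is to derive each name from the secret content. The following Python sketch shows that idea; the `versioned_name` helper and its naming scheme are a hypothetical convention, not an OpenShift Container Platform feature.

```python
import hashlib
import json


def versioned_name(base_name: str, data: dict, length: int = 10) -> str:
    """Derive a secret name that changes whenever the data changes."""
    # Serialize with sorted keys so that the same content always hashes the same way.
    canonical = json.dumps(data, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(canonical).hexdigest()
    return f"{base_name}-{digest[:length]}"


old_name = versioned_name("test-secret", {"password": "b2xk"})
new_name = versioned_name("test-secret", {"password": "bmV3"})
print(old_name)
print(new_name)
```

Because the name changes with the content, updating the pod template to reference the new name makes it explicit which version of the secret each pod uses.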

2.6.4. Creating and using secrets

As an administrator, you can create a service account token secret. This allows you to distribute a service account token to applications that must authenticate to the API.

Procedure

Create a service account in your namespace by running the following command:
```
$ oc create sa <service_account_name> -n <your_namespace>
```
Save the following YAML example to a file named service-account-token-secret.yaml. The example includes a Secret object configuration that you can use to generate a service account token:
```
apiVersion: v1
kind: Secret
metadata:
  name: <secret_name> 
  annotations:
    kubernetes.io/service-account.name: "sa-name" 
type: kubernetes.io/service-account-token 
```
1: Replace <secret_name> with the name of your service token secret.
2: Specifies an existing service account name. If you are creating both the ServiceAccount and the Secret objects, create the ServiceAccount object first.
3: Specifies a service account token secret type.
Generate the service account token by applying the file:
```
$ oc apply -f service-account-token-secret.yaml
```

Get the service account token from the secret by running the following command:

$ oc get secret <sa_token_secret> -o jsonpath='{.data.token}' | base64 --decode

Example output

ayJhbGciOiJSUzI1NiIsImtpZCI6IklOb2dtck1qZ3hCSWpoNnh5YnZhSE9QMkk3YnRZMVZoclFfQTZfRFp1YlUifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJkZWZhdWx0Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6ImJ1aWxkZXItdG9rZW4tdHZrbnIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoiYnVpbGRlciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjNmZGU2MGZmLTA1NGYtNDkyZi04YzhjLTNlZjE0NDk3MmFmNyIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpkZWZhdWx0OmJ1aWxkZXIifQ.OmqFTDuMHC_lYvvEUrjr1x453hlEEHYcxS9VKSzmRkP1SiVZWPNPkTWlfNRp6bIUZD3U6aN3N7dMSN0eI5hu36xPgpKTdvuckKLTCnelMx6cxOdAbrcw1mCmOClNscwjS1KO1kzMtYnnq8rXHiMJELsNlhnRyyIXRTtNBsy4t64T3283s3SLsancyx0gy0ujx-Ch3uKAKdZi5iT-I8jnnQ-ds5THDs2h65RJhgglQEmSxpHrLGZFmyHAQI-_SjvmHZPXEc482x3SkaQHNLqpmrpJorNqh1M8ZHKzlujhZgVooMvJmWPXTb2vnvi3DGn2XI-hZxl1yD2yGH1RBpYUHA

1: Replace <sa_token_secret> with the name of your service token secret.
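
The decoded token is a JSON Web Token (JWT) with three dot-separated segments. To inspect which service account a token belongs to, you can decode the middle segment. The following Python sketch does this without verifying the signature, so it is suitable only for inspection; the helper names and the sample claim values are hypothetical, and the sample token is built locally so that the example runs offline.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the payload (middle) segment of a JWT without verifying it."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    # JWT segments use URL-safe base64 without padding; restore the padding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def _segment(obj: dict) -> str:
    """Encode a dictionary as an unpadded URL-safe base64 JWT segment."""
    raw = json.dumps(obj).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Unsigned sample token with made-up claims.
sample = ".".join([
    _segment({"alg": "RS256"}),
    _segment({
        "iss": "kubernetes/serviceaccount",
        "sub": "system:serviceaccount:default:builder",
    }),
    "signature-placeholder",
])

claims = jwt_claims(sample)
print(claims["sub"])
```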

Use your service account token to authenticate with the API of your cluster:
```
$ curl -X GET <openshift_cluster_api> --header "Authorization: Bearer <token>"
```
1: Replace <openshift_cluster_api> with the OpenShift cluster API.
2: Replace <token> with the service account token that is output in the preceding command.
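
An application can send the same bearer token from code. The following Python sketch prepares an equivalent request with the standard library and prints it without sending it; the API URL, the token placeholder, and the `api_request` helper are hypothetical.

```python
import urllib.request


def api_request(api_url: str, token: str) -> urllib.request.Request:
    """Prepare a GET request that carries the token as a bearer credential."""
    return urllib.request.Request(
        api_url,
        method="GET",
        headers={"Authorization": f"Bearer {token}"},
    )


# Hypothetical API endpoint and token, for illustration only.
request = api_request("https://api.cluster.example.com:6443/apis", "<token>")
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
# To send the request, pass it to urllib.request.urlopen() together with a
# TLS context that trusts the certificate authority of the cluster.
```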

2.6.5. About using signed certificates with secrets

To secure communication to your service, you can configure OpenShift Container Platform to generate a signed serving certificate/key pair that you can add into a secret in a project.

A service serving certificate secret is intended to support complex middleware applications that need out-of-the-box certificates. It has the same settings as the server certificates generated by the administrator tooling for nodes and masters.

Service object configured for a service serving certificate secret

apiVersion: v1
kind: Service
metadata:
  name: registry
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: registry-cert
# ...

1: Specify the name for the certificate.

Other pods can trust cluster-created certificates (which are only signed for internal DNS names), by using the CA bundle in the /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt file that is automatically mounted in their pod.

The signature algorithm for this feature is x509.SHA256WithRSA. To manually rotate, delete the generated secret. A new certificate is created.

2.6.5.1. Generating signed certificates for use with secrets

To use a signed serving certificate/key pair with a pod, create or edit the service to add the service.beta.openshift.io/serving-cert-secret-name annotation, then add the secret to the pod.

Procedure

To create a service serving certificate secret:

Edit the Service object for your service.

Add the service.beta.openshift.io/serving-cert-secret-name annotation with the name you want to use for your secret.

kind: Service
apiVersion: v1
metadata:
  name: my-service
  annotations:
      service.beta.openshift.io/serving-cert-secret-name: my-cert 
spec:
  selector:
    app: MyApp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376

The certificate and key are in PEM format, stored in tls.crt and tls.key respectively.

Create the service:
```
$ oc create -f <file-name>.yaml
```

View the secret to make sure it was created:

View a list of all secrets:

$ oc get secrets

Example output

NAME                     TYPE                                  DATA      AGE
my-cert                  kubernetes.io/tls                     2         9m

View details on your secret:

oc describe secret my-cert

Example output

Name:         my-cert
Namespace:    openshift-console
Labels:       <none>
Annotations:  service.beta.openshift.io/expiry: 2023-03-08T23:22:40Z
              service.beta.openshift.io/originating-service-name: my-service
              service.beta.openshift.io/originating-service-uid: 640f0ec3-afc2-4380-bf31-a8c784846a11
              service.beta.openshift.io/expiry: 2023-03-08T23:22:40Z

Type:  kubernetes.io/tls

Data
====
tls.key:  1679 bytes
tls.crt:  2595 bytes

Edit your Pod spec with that secret.

apiVersion: v1
kind: Pod
metadata:
  name: my-service-pod
spec:
  containers:
  - name: mypod
    image: redis
    volumeMounts:
    - name: my-volume # must match the name of the volume defined below
      mountPath: "/etc/my-path"
  volumes:
  - name: my-volume
    secret:
      # Without an items list, every key in the secret (tls.crt and tls.key)
      # is projected as a file under the mount path.
      secretName: my-cert

When the secret is available, your pod runs. The certificate is valid for the internal service DNS name, <service.name>.<service.namespace>.svc.

The certificate/key pair is automatically replaced when it gets close to expiration. View the expiration date in the service.beta.openshift.io/expiry annotation on the secret, which is in RFC3339 format.
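As a minimal sketch of reading that annotation, the following Python snippet parses an RFC 3339 timestamp like the one in the example output and computes how long the certificate remains valid. The helper name time_until_expiry and the fixed value of now are illustrative assumptions and are not part of OpenShift Container Platform.

```python
from datetime import datetime, timedelta, timezone


def time_until_expiry(expiry: str, now: datetime) -> timedelta:
    """Return the time left before an RFC 3339 UTC timestamp (hypothetical helper)."""
    # The annotation value in the example output ends in "Z", which means UTC.
    expires_at = datetime.strptime(expiry, "%Y-%m-%dT%H:%M:%SZ")
    return expires_at.replace(tzinfo=timezone.utc) - now


# Value copied from the service.beta.openshift.io/expiry annotation shown earlier.
expiry_annotation = "2023-03-08T23:22:40Z"
# An assumed "current" time, fixed so that the result is reproducible.
now = datetime(2023, 3, 1, 0, 0, 0, tzinfo=timezone.utc)

remaining = time_until_expiry(expiry_annotation, now)
print(remaining.days)  # 7
```

This sketch only handles timestamps in the exact form shown in the example output. Other valid RFC 3339 forms, such as fractional seconds or numeric offsets, would need a more general parser.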

Note

In most cases, the service DNS name <service.name>.<service.namespace>.svc is not externally routable. The primary use of <service.name>.<service.namespace>.svc is for intracluster or intraservice communication, and with re-encrypt routes.

2.6.6. Troubleshooting secrets

Service certificate generation can fail with the following message, which is recorded in the service.beta.openshift.io/serving-cert-generation-error annotation of the service:

secret/ssl-key references serviceUID 62ad25ca-d703-11e6-9d6f-0e9c0057b608, which does not match 77b6dd80-d716-11e6-9d6f-0e9c0057b60

This error means that the service that generated the certificate no longer exists, or has a different serviceUID. To force certificate regeneration, remove the old secret and clear the service.beta.openshift.io/serving-cert-generation-error and service.beta.openshift.io/serving-cert-generation-error-num annotations on the service:

Delete the secret:
```
oc delete secret <secret_name>
```

Clear the annotations:

oc annotate service <service_name> service.beta.openshift.io/serving-cert-generation-error-

oc annotate service <service_name> service.beta.openshift.io/serving-cert-generation-error-num-

Note

The command that removes an annotation has a - after the name of the annotation to be removed.

2.7. Creating and using config maps

The following sections define config maps and how to create and use them.

2.7.1. Understanding config maps

Many applications require configuration by using some combination of configuration files, command-line arguments, and environment variables. In OpenShift Container Platform, these configuration artifacts are decoupled from image content to keep containerized applications portable.

The ConfigMap object provides mechanisms to inject containers with configuration data while keeping containers agnostic of OpenShift Container Platform. A config map can be used to store fine-grained information like individual properties or coarse-grained information like entire configuration files or JSON blobs.

The ConfigMap object holds key-value pairs of configuration data that can be consumed in pods or used to store configuration data for system components such as controllers. For example:

ConfigMap Object Definition

kind: ConfigMap
apiVersion: v1
metadata:
  creationTimestamp: 2016-02-18T19:14:38Z
  name: example-config
  namespace: my-namespace
data: 
  example.property.1: hello
  example.property.2: world
  example.property.file: |-
    property.1=value-1
    property.2=value-2
    property.3=value-3
binaryData:
  bar: L3Jvb3QvMTAw

1: The data field contains the configuration data.
2: The binaryData field holds data from a file that contains non-UTF8 data, for example, a binary Java keystore file. Enter the file data in Base64.

Note

You can use the binaryData field when you create a config map from a binary file, such as an image.
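To make the Base64 requirement concrete, the following Python sketch decodes the bar value from the example object and encodes a few arbitrary bytes the same way. The byte sequence in raw is an illustrative assumption that stands in for the contents of a binary file.

```python
import base64

# Value of the "bar" key under binaryData in the example object.
encoded = "L3Jvb3QvMTAw"

decoded = base64.b64decode(encoded)
print(decoded)  # b'/root/100'

# Encoding arbitrary bytes produces the kind of string that binaryData expects.
# These four bytes are not valid UTF-8 and are illustrative only.
raw = bytes([0xFF, 0xFE, 0x00, 0x01])
print(base64.b64encode(raw).decode("ascii"))
```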

Configuration data can be consumed in pods in a variety of ways. A config map can be used to:

Populate environment variable values in containers
Set command-line arguments in a container
Populate configuration files in a volume

Users and system components can store configuration data in a config map.

A config map is similar to a secret, but designed to more conveniently support working with strings that do not contain sensitive information.

Config map restrictions

A config map must be created before its contents can be consumed in pods.

Controllers can be written to tolerate missing configuration data. Consult individual components configured by using config maps on a case-by-case basis.

ConfigMap objects reside in a project.

They can only be referenced by pods in the same project.

The Kubelet only supports the use of a config map for pods it gets from the API server.

This includes any pods created by using the CLI, or indirectly from a replication controller. It does not include pods created by using the OpenShift Container Platform node’s --manifest-url flag, its --config flag, or its REST API because these are not common ways to create pods.

2.7.2. Creating a config map in the OpenShift Container Platform web console

You can create a config map in the OpenShift Container Platform web console.

Procedure

To create a config map as a cluster administrator:
1. In the Administrator perspective, select Workloads → Config Maps.
2. At the top right side of the page, select Create Config Map.
3. Enter the contents of your config map.
4. Select Create.
To create a config map as a developer:
1. In the Developer perspective, select Config Maps.
2. At the top right side of the page, select Create Config Map.
3. Enter the contents of your config map.
4. Select Create.

2.7.3. Creating a config map by using the CLI

You can use the following command to create a config map from directories, specific files, or literal values.

Procedure

Create a config map:

oc create configmap <configmap_name> [options]

2.7.3.1. Creating a config map from a directory

You can create a config map from a directory by using the --from-file flag. This method allows you to use multiple files within a directory to create a config map.

Each file in the directory is used to populate a key in the config map, where the name of the key is the file name, and the value of the key is the content of the file.

For example, the following command creates a config map with the contents of the example-files directory:

oc create configmap game-config --from-file=example-files/

View the keys in the config map:

oc describe configmaps game-config

Example output

Name:           game-config
Namespace:      default
Labels:         <none>
Annotations:    <none>

Data

game.properties:        158 bytes
ui.properties:          83 bytes

You can see that the two keys in the map are created from the file names in the directory specified in the command. The content of those keys might be large, so the output of oc describe only shows the names of the keys and their sizes.
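The file-name-to-key mapping described above can be sketched in Python as follows. This is only an illustration of the mapping, not how the oc client is implemented. The helper name config_map_data_from_dir is hypothetical, the file contents are shortened, and skipping subdirectories is a simplification made for this sketch.

```python
import tempfile
from pathlib import Path
from typing import Dict


def config_map_data_from_dir(directory: Path) -> Dict[str, str]:
    """Map each regular file in a directory to a key/value pair (hypothetical helper)."""
    data = {}
    for path in sorted(directory.iterdir()):
        if path.is_file():  # subdirectories are skipped in this sketch
            data[path.name] = path.read_text()  # key: file name, value: file content
    return data


with tempfile.TemporaryDirectory() as tmp:
    example_files = Path(tmp)
    # Shortened stand-ins for the example files used in this section.
    (example_files / "game.properties").write_text("enemies=aliens\nlives=3\n")
    (example_files / "ui.properties").write_text("color.good=purple\n")

    data = config_map_data_from_dir(example_files)

# Print key names and sizes, similar in spirit to the oc describe output.
for key, value in data.items():
    print(f"{key}: {len(value.encode())} bytes")
```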

Prerequisite

You must have a directory with files that contain the data you want to populate a config map with.

The following procedure uses these example files: game.properties and ui.properties:

cat example-files/game.properties

Example output

enemies=aliens
lives=3
enemies.cheat=true
enemies.cheat.level=noGoodRotten
secret.code.passphrase=UUDDLRLRBABAS
secret.code.allowed=true
secret.code.lives=30

cat example-files/ui.properties

Example output

color.good=purple
color.bad=yellow
allow.textmode=true
how.nice.to.look=fairlyNice

Procedure

Create a config map holding the content of each file in this directory by entering the following command:
```
oc create configmap game-config \
    --from-file=example-files/
```

Verification

Enter the oc get command for the object with the -o option to see the values of the keys:

oc get configmaps game-config -o yaml

Example output

apiVersion: v1
data:
  game.properties: |-
    enemies=aliens
    lives=3
    enemies.cheat=true
    enemies.cheat.level=noGoodRotten
    secret.code.passphrase=UUDDLRLRBABAS
    secret.code.allowed=true
    secret.code.lives=30
  ui.properties: |
    color.good=purple
    color.bad=yellow
    allow.textmode=true
    how.nice.to.look=fairlyNice
kind: ConfigMap
metadata:
  creationTimestamp: 2016-02-18T18:34:05Z
  name: game-config
  namespace: default
  resourceVersion: "407"
  selflink: /api/v1/namespaces/default/configmaps/game-config
  uid: 30944725-d66e-11e5-8cd0-68f728db1985

2.7.3.2. Creating a config map from a file

You can create a config map from a file by using the --from-file flag. You can pass the --from-file option multiple times to the CLI.

You can also specify the key to set in a config map for content imported from a file by passing a key=value expression to the --from-file option. For example:

oc create configmap game-config-3 --from-file=game-special-key=example-files/game.properties

Note

If you create a config map from a file, you can include files that contain non-UTF8 data. The data is placed in the binaryData field without being corrupted. OpenShift Container Platform detects binary files and transparently encodes the file as MIME. On the server, the MIME payload is decoded and stored without corrupting the data.

Prerequisite

You must have a directory with files that contain the data you want to populate a config map with.

The following procedure uses these example files: game.properties and ui.properties:

cat example-files/game.properties

Example output

enemies=aliens
lives=3
enemies.cheat=true
enemies.cheat.level=noGoodRotten
secret.code.passphrase=UUDDLRLRBABAS
secret.code.allowed=true
secret.code.lives=30

cat example-files/ui.properties

Example output

color.good=purple
color.bad=yellow
allow.textmode=true
how.nice.to.look=fairlyNice

Procedure

Create a config map by specifying specific files:

oc create configmap game-config-2 \
    --from-file=example-files/game.properties \
    --from-file=example-files/ui.properties

Create a config map by specifying a key-value pair:

oc create configmap game-config-3 \
    --from-file=game-special-key=example-files/game.properties

Verification

Enter the oc get command for the object with the -o option to see the values of the keys from the file:

oc get configmaps game-config-2 -o yaml

Example output

apiVersion: v1
data:
  game.properties: |-
    enemies=aliens
    lives=3
    enemies.cheat=true
    enemies.cheat.level=noGoodRotten
    secret.code.passphrase=UUDDLRLRBABAS
    secret.code.allowed=true
    secret.code.lives=30
  ui.properties: |
    color.good=purple
    color.bad=yellow
    allow.textmode=true
    how.nice.to.look=fairlyNice
kind: ConfigMap
metadata:
  creationTimestamp: 2016-02-18T18:52:05Z
  name: game-config-2
  namespace: default
  resourceVersion: "516"
  selflink: /api/v1/namespaces/default/configmaps/game-config-2
  uid: b4952dc3-d670-11e5-8cd0-68f728db1985

Enter the oc get command for the object with the -o option to see the values of the keys from the key-value pair:

oc get configmaps game-config-3 -o yaml

Example output

apiVersion: v1
data:
  game-special-key: |- 
    enemies=aliens
    lives=3
    enemies.cheat=true
    enemies.cheat.level=noGoodRotten
    secret.code.passphrase=UUDDLRLRBABAS
    secret.code.allowed=true
    secret.code.lives=30
kind: ConfigMap
metadata:
  creationTimestamp: 2016-02-18T18:54:22Z
  name: game-config-3
  namespace: default
  resourceVersion: "530"
  selflink: /api/v1/namespaces/default/configmaps/game-config-3
  uid: 05f8da22-d671-11e5-8cd0-68f728db1985

1: game-special-key is the key that you set in the preceding step.
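The difference between passing a plain path and a key=value expression to --from-file can be illustrated with a short Python sketch. The helper name parse_from_file is hypothetical. The sketch only models how the key is chosen, as described in this section, and is not the oc implementation.

```python
from pathlib import PurePosixPath
from typing import Tuple


def parse_from_file(argument: str) -> Tuple[str, str]:
    """Split a [key=]path argument into (key, path) (hypothetical helper)."""
    if "=" in argument:
        # An explicit key was supplied before the "=".
        key, path = argument.split("=", 1)
    else:
        # No explicit key: the file name becomes the key.
        path = argument
        key = PurePosixPath(path).name
    return key, path


print(parse_from_file("example-files/game.properties"))
# ('game.properties', 'example-files/game.properties')
print(parse_from_file("game-special-key=example-files/game.properties"))
# ('game-special-key', 'example-files/game.properties')
```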

2.7.3.3. Creating a config map from literal values

You can supply literal values for a config map.

The --from-literal option takes a key=value syntax, which allows literal values to be supplied directly on the command line.

Procedure

Create a config map by specifying a literal value:

oc create configmap special-config \
    --from-literal=special.how=very \
    --from-literal=special.type=charm

Verification

Enter the oc get command for the object with the -o option to see the values of the keys:

oc get configmaps special-config -o yaml

Example output

apiVersion: v1
data:
  special.how: very
  special.type: charm
kind: ConfigMap
metadata:
  creationTimestamp: 2016-02-18T19:14:38Z
  name: special-config
  namespace: default
  resourceVersion: "651"
  selflink: /api/v1/namespaces/default/configmaps/special-config
  uid: dadce046-d673-11e5-8cd0-68f728db1985

2.7.4. Use cases: Consuming config maps in pods

The following sections describe some use cases for consuming ConfigMap objects in pods.

2.7.4.1. Populating environment variables in containers by using config maps

You can use config maps to populate individual environment variables in containers or to populate environment variables in containers from all keys that form valid environment variable names.

As an example, consider the following config map:

ConfigMap with two environment variables

apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config 
  namespace: default 
data:
  special.how: very 
  special.type: charm

1: The name field is the name of the config map.
2: The namespace field is the project in which the config map resides. Config maps can only be referenced by pods in the same project.
3, 4: The special.how and special.type keys are the environment variables to inject.

ConfigMap with one environment variable

apiVersion: v1
kind: ConfigMap
metadata:
  name: env-config 
  namespace: default
data:
  log_level: INFO

1: The name field is the name of the config map.
2: The log_level key is the environment variable to inject.

Procedure

You can consume the keys of this ConfigMap in a pod using configMapKeyRef sections.

Sample Pod specification configured to inject specific environment variables

apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh", "-c", "env" ]
      env: 
        - name: SPECIAL_LEVEL_KEY 
          valueFrom:
            configMapKeyRef:
              name: special-config 
              key: special.how 
        - name: SPECIAL_TYPE_KEY
          valueFrom:
            configMapKeyRef:
              name: special-config 
              key: special.type 
              optional: true 
      envFrom: 
        - configMapRef:
            name: env-config 
  restartPolicy: Never

1: The env stanza pulls the specified environment variables from a ConfigMap.
2: The name of a pod environment variable that you are injecting a key’s value into.
3, 5: The configMapKeyRef.name field is the name of the ConfigMap to pull specific environment variables from.
4, 6: The configMapKeyRef.key field is the environment variable to pull from the ConfigMap.
7: The optional field makes the environment variable optional. When optional is set to true, the pod starts even if the specified ConfigMap and keys do not exist.
8: The envFrom stanza pulls all environment variables from a ConfigMap.
9: The configMapRef.name field is the name of the ConfigMap to pull all environment variables from.

When this pod is run, the pod logs will include the following output:

SPECIAL_LEVEL_KEY=very
log_level=INFO

Note

SPECIAL_TYPE_KEY=charm is not listed in the example output because optional: true is set.

2.7.4.2. Setting command-line arguments for container commands with config maps

You can use a config map to set the value of the commands or arguments in a container by using the Kubernetes substitution syntax $(VAR_NAME).

As an example, consider the following config map:

apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
  namespace: default
data:
  special.how: very
  special.type: charm

Procedure

To inject values into a command in a container, you must consume the keys you want to use as environment variables. Then you can refer to them in a container’s command using the $(VAR_NAME) syntax.

Sample pod specification configured to inject specific environment variables

apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh", "-c", "echo $(SPECIAL_LEVEL_KEY) $(SPECIAL_TYPE_KEY)" ] 
      env:
        - name: SPECIAL_LEVEL_KEY
          valueFrom:
            configMapKeyRef:
              name: special-config
              key: special.how
        - name: SPECIAL_TYPE_KEY
          valueFrom:
            configMapKeyRef:
              name: special-config
              key: special.type
  restartPolicy: Never

1: The command field references the keys that you consumed as environment variables by using the $(VAR_NAME) syntax.

When this pod is run, the output from the echo command run in the test-container container is as follows:

very charm
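The $(VAR_NAME) substitution happens before the shell in the container sees the command. The following Python sketch imitates that expansion under a few stated assumptions: variable names are limited to letters, digits, and underscores, references to undefined variables are left unchanged, and $$(VAR_NAME) is treated as an escaped literal. The function name expand is hypothetical, and the sketch is not the Kubernetes implementation.

```python
import re
from typing import Dict

# Matches an escaped reference "$$(NAME)" or a plain reference "$(NAME)".
# The allowed name characters are an assumption made for this sketch.
_REFERENCE = re.compile(r"\$(\$?)\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand(text: str, env: Dict[str, str]) -> str:
    """Expand $(VAR_NAME) references in text by using env (hypothetical helper)."""
    def replace(match):
        escaped, name = match.group(1), match.group(2)
        if escaped:
            return f"$({name})"  # "$$(NAME)" becomes a literal "$(NAME)"
        # References to undefined variables are left unchanged.
        return env.get(name, match.group(0))

    return _REFERENCE.sub(replace, text)


# Environment variables as they would be populated from the special-config config map.
env = {"SPECIAL_LEVEL_KEY": "very", "SPECIAL_TYPE_KEY": "charm"}
command = ["/bin/sh", "-c", "echo $(SPECIAL_LEVEL_KEY) $(SPECIAL_TYPE_KEY)"]

expanded = [expand(argument, env) for argument in command]
print(expanded[-1])  # echo very charm
```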

2.7.4.3. Injecting content into a volume by using config maps

You can inject content into a volume by using config maps.

Example ConfigMap object

apiVersion: v1
kind: ConfigMap
metadata:
  name: special-config
  namespace: default
data:
  special.how: very
  special.type: charm

Procedure

You have a couple of different options for injecting content into a volume by using config maps.

The most basic way to inject content into a volume by using a config map is to populate the volume with files where the key is the file name and the content of the file is the value of the key:

apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      # sh -c takes the whole command line as a single string
      command: [ "/bin/sh", "-c", "cat /etc/config/special.how" ]
      volumeMounts:
      - name: config-volume
        mountPath: /etc/config
  volumes:
    - name: config-volume
      configMap:
        name: special-config
  restartPolicy: Never

1: Name of the config map whose keys are projected as files in the volume.

When this pod is run, the output of the cat command will be:

very

You can also control the paths within the volume where config map keys are projected:

apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      # sh -c takes the whole command line as a single string
      command: [ "/bin/sh", "-c", "cat /etc/config/path/to/special-key" ]
      volumeMounts:
      - name: config-volume
        mountPath: /etc/config
  volumes:
    - name: config-volume
      configMap:
        name: special-config
        items:
        - key: special.how
          path: path/to/special-key
  restartPolicy: Never

1: Path, relative to the volume mount path, at which the config map key is projected.

When this pod is run, the output of the cat command will be:

very
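Both options can be modeled with a small Python sketch that writes config map keys as files into a directory. The helper name project and the temporary directories are illustrative assumptions. The sketch imitates the resulting file layout only and does not reflect how the kubelet populates volumes.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def project(data: Dict[str, str], mount_path: Path,
            items: Optional[List[Dict[str, str]]] = None) -> None:
    """Write config map keys as files under mount_path (hypothetical helper)."""
    if items is None:
        # Basic option: one file per key, named after the key.
        items = [{"key": key, "path": key} for key in data]
    for item in items:
        # With an items list, only the listed keys are written, at the given paths.
        target = mount_path / item["path"]
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(data[item["key"]])


data = {"special.how": "very", "special.type": "charm"}

with tempfile.TemporaryDirectory() as tmp:
    default_mount = Path(tmp) / "default"
    project(data, default_mount)
    default_files = sorted(p.name for p in default_mount.iterdir())

    custom_mount = Path(tmp) / "custom"
    project(data, custom_mount,
            items=[{"key": "special.how", "path": "path/to/special-key"}])
    custom_value = (custom_mount / "path/to/special-key").read_text()
    custom_has_type = (custom_mount / "special.type").exists()

print(default_files)  # ['special.how', 'special.type']
print(custom_value)   # very
```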

2.8. Using device plugins to access external resources with pods

Device plugins allow you to use a particular device type (GPU, InfiniBand, or other similar computing resources that require vendor-specific initialization and setup) in your OpenShift Container Platform pod without needing to write custom code.

2.8.1. Understanding device plugins

The device plugin provides a consistent and portable solution to consume hardware devices across clusters. The device plugin provides support for these devices through an extension mechanism, which makes these devices available to Containers, provides health checks of these devices, and securely shares them.

Important

OpenShift Container Platform supports the device plugin API, but the device plugin Containers are supported by individual vendors.

A device plugin is a gRPC service running on the nodes (external to the kubelet) that is responsible for managing specific hardware resources. Any device plugin must support the following remote procedure calls (RPCs):

service DevicePlugin {
      // GetDevicePluginOptions returns options to be communicated with Device
      // Manager
      rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}

      // ListAndWatch returns a stream of List of Devices
      // Whenever a Device state change or a Device disappears, ListAndWatch
      // returns the new list
      rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}

      // Allocate is called during container creation so that the Device
      // Plug-in can run device specific operations and instruct Kubelet
      // of the steps to make the Device available in the container
      rpc Allocate(AllocateRequest) returns (AllocateResponse) {}

      // PreStartContainer is called, if indicated by Device Plug-in during
      // registration phase, before each container start. Device plug-in
      // can run device specific operations such as resetting the device
      // before making devices available to the container
      rpc PreStartContainer(PreStartContainerRequest) returns (PreStartContainerResponse) {}
}

2.8.1.1. Example device plugins

Note

For a simple device plugin reference implementation, see the stub device plugin in the Device Manager code: vendor/k8s.io/kubernetes/pkg/kubelet/cm/deviceplugin/device_plugin_stub.go.

2.8.1.2. Methods for deploying a device plugin

Daemon sets are the recommended approach for device plugin deployments.
Upon start, the device plugin tries to create a UNIX domain socket at /var/lib/kubelet/device-plugins/ on the node to serve RPCs from Device Manager.
Because device plugins must manage hardware resources, access the host file system, and create sockets, they must run in a privileged security context.
More specific details regarding deployment steps can be found with each device plugin implementation.
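
The deployment approach described above can be sketched as a daemon set that runs the plugin in a privileged container and mounts the device plugin socket directory from the host. This is only an illustrative sketch: the names, namespace, and image are hypothetical, and a real plugin follows the deployment steps from its vendor.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-device-plugin          # hypothetical name
  namespace: example-device-plugin     # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: example-device-plugin
  template:
    metadata:
      labels:
        app: example-device-plugin
    spec:
      containers:
      - name: plugin
        image: registry.example.com/example-device-plugin:latest  # hypothetical image
        securityContext:
          privileged: true             # device plugins run in a privileged security context
        volumeMounts:
        - name: device-plugins
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugins
        hostPath:
          path: /var/lib/kubelet/device-plugins   # directory that holds kubelet.sock and the plugin sockets
```

On OpenShift Container Platform, a privileged pod typically also needs a service account that is allowed to use a security context constraint permitting privileged containers; that part is omitted here.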

2.8.2. Understanding the Device Manager

Device Manager provides a mechanism for advertising specialized node hardware resources with the help of plugins known as device plugins.

You can advertise specialized hardware without requiring any upstream code changes.

Important

OpenShift Container Platform supports the device plugin API, but the device plugin Containers are supported by individual vendors.

Device Manager advertises devices as Extended Resources. User pods can consume devices advertised by Device Manager by using the same limit/request mechanism that is used for requesting any other Extended Resource.

Upon start, the device plugin registers itself with Device Manager by invoking Register on the /var/lib/kubelet/device-plugins/kubelet.sock socket, and starts a gRPC service at /var/lib/kubelet/device-plugins/<plugin>.sock for serving Device Manager requests.

Device Manager, while processing a new registration request, invokes ListAndWatch remote procedure call (RPC) at the device plugin service. In response, Device Manager gets a list of Device objects from the plugin over a gRPC stream. Device Manager will keep watching on the stream for new updates from the plugin. On the plugin side, the plugin will also keep the stream open and whenever there is a change in the state of any of the devices, a new device list is sent to the Device Manager over the same streaming connection.

While handling a new pod admission request, Kubelet passes requested Extended Resources to the Device Manager for device allocation. Device Manager checks its database to verify whether a corresponding plugin exists. If the plugin exists and, according to the local cache, free allocatable devices are available, the Allocate RPC is invoked at that particular device plugin.

Additionally, device plugins can also perform several other device-specific operations, such as driver installation, device initialization, and device resets. These functionalities vary from implementation to implementation.
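
As a minimal sketch of the consumer side, a pod can request a device advertised by Device Manager in the same way it requests any other Extended Resource. The resource name example.com/device, the pod name, and the image are assumptions for illustration; the real resource name is whatever the device plugin advertises.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: device-consumer                # hypothetical pod name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # hypothetical image
    resources:
      limits:
        example.com/device: 1          # hypothetical Extended Resource advertised by a device plugin
```

When this pod is admitted, Kubelet passes the example.com/device request to Device Manager, which invokes Allocate on the matching plugin if a free device is available.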

2.8.3. Enabling Device Manager

Enable Device Manager to implement a device plugin to advertise specialized hardware without any upstream code changes.

Device Manager provides a mechanism for advertising specialized node hardware resources with the help of plugins known as device plugins.

Obtain the label associated with the static MachineConfigPool CRD for the type of node you want to configure. Perform one of the following steps:
1. View the machine config:
  # oc describe machineconfig <name>
  For example:
  # oc describe machineconfig 00-worker
  Example output
  Name:       00-worker
  Namespace:
  Labels:     machineconfiguration.openshift.io/role=worker # 1
  1: Label required for the Device Manager.

Procedure

Create a custom resource (CR) for your configuration change.

Sample configuration for a Device Manager CR

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: devicemgr # 1
spec:
  machineConfigPoolSelector:
    matchLabels:
      machineconfiguration.openshift.io: devicemgr # 2
  kubeletConfig:
    feature-gates:
      - DevicePlugins=true # 3

1: Assign a name to the CR.
2: Enter the label from the Machine Config Pool.
3: Set DevicePlugins to true.

Create the Device Manager:

$ oc create -f devicemgr.yaml

Example output

kubeletconfig.machineconfiguration.openshift.io/devicemgr created

Verify that Device Manager is enabled by confirming that /var/lib/kubelet/device-plugins/kubelet.sock is created on the node. This is the UNIX domain socket on which the Device Manager gRPC server listens for new plugin registrations. This socket file is created when the Kubelet is started only if Device Manager is enabled.

2.9. Including pod priority in pod scheduling decisions

You can enable pod priority and preemption in your cluster. Pod priority indicates the importance of a pod relative to other pods and queues the pods based on that priority. Pod preemption allows the cluster to evict, or preempt, lower-priority pods so that higher-priority pods can be scheduled if there is no available space on a suitable node. Pod priority also affects the scheduling order of pods and out-of-resource eviction ordering on the node.

To use priority and preemption, you create priority classes that define the relative weight of your pods. Then, reference a priority class in the pod specification to apply that weight for scheduling.

2.9.1. Understanding pod priority

When you use the Pod Priority and Preemption feature, the scheduler orders pending pods by their priority, and a pending pod is placed ahead of other pending pods with lower priority in the scheduling queue. As a result, the higher-priority pod might be scheduled sooner than pods with lower priority if its scheduling requirements are met. If a pod cannot be scheduled, the scheduler continues to schedule other lower-priority pods.

2.9.1.1. Pod priority classes

You can assign pods a priority class, which is a non-namespaced object that defines a mapping from a name to the integer value of the priority. The higher the value, the higher the priority.

A priority class object can take any 32-bit integer value smaller than or equal to 1000000000 (one billion). Reserve numbers larger than or equal to one billion for critical pods that must not be preempted or evicted. By default, OpenShift Container Platform has two reserved priority classes for critical system pods to have guaranteed scheduling.

$ oc get priorityclasses

Example output

NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-node-critical      2000001000   false            72m
system-cluster-critical   2000000000   false            72m
openshift-user-critical   1000000000   false            3d13h
cluster-logging           1000000      false            29s

system-node-critical - This priority class has a value of 2000001000 and is used for all pods that should never be evicted from a node. Examples of pods that have this priority class are sdn-ovs, sdn, and so forth. A number of critical components include the system-node-critical priority class by default, for example:
- master-api
- master-controller
- master-etcd
- sdn
- sdn-ovs
- sync
system-cluster-critical - This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances. For example, pods configured with the system-node-critical priority class can take priority. However, this priority class does ensure guaranteed scheduling. Examples of pods that can have this priority class are fluentd, add-on components like descheduler, and so forth. A number of critical components include the system-cluster-critical priority class by default, for example:
- fluentd
- metrics-server
- descheduler
openshift-user-critical - You can use the priorityClassName field with important pods that cannot limit their resource consumption and do not have predictable resource consumption behavior. Prometheus pods under the openshift-monitoring and openshift-user-workload-monitoring namespaces use the openshift-user-critical priorityClassName. Monitoring workloads use system-critical as their first priorityClass, but this causes problems when monitoring uses excessive memory and the nodes cannot evict them. As a result, monitoring drops priority to give the scheduler flexibility, moving heavy workloads around to keep critical nodes operating.
cluster-logging - This priority is used by Fluentd to make sure Fluentd pods are scheduled to nodes over other apps.

2.9.1.2. Pod priority names

After you have one or more priority classes, you can create pods that specify a priority class name in a Pod spec. The priority admission controller uses the priority class name field to populate the integer value of the priority. If the named priority class is not found, the pod is rejected.

2.9.2. Understanding pod preemption

When a developer creates a pod, the pod goes into a queue. If the developer configured the pod for pod priority or preemption, the scheduler picks a pod from the queue and tries to schedule the pod on a node. If the scheduler cannot find space on an appropriate node that satisfies all the specified requirements of the pod, preemption logic is triggered for the pending pod.

When the scheduler preempts one or more pods on a node, the nominatedNodeName field of the higher-priority Pod spec is set to the name of the node, along with the nodeName field. The scheduler uses the nominatedNodeName field to keep track of the resources reserved for pods and also provides information to the user about preemptions in the cluster.

After the scheduler preempts a lower-priority pod, the scheduler honors the graceful termination period of the pod. If another node becomes available while the scheduler is waiting for the lower-priority pod to terminate, the scheduler can schedule the higher-priority pod on that node. As a result, the nominatedNodeName field and nodeName field of the Pod spec might be different.

Also, if the scheduler preempts pods on a node and is waiting for termination, and a pod with a higher priority than the pending pod needs to be scheduled, the scheduler can schedule the higher-priority pod instead. In such a case, the scheduler clears the nominatedNodeName of the pending pod, making the pod eligible for another node.

Preemption does not necessarily remove all lower-priority pods from a node. The scheduler can schedule a pending pod by removing a portion of the lower-priority pods.

The scheduler considers a node for pod preemption only if the pending pod can be scheduled on the node.

2.9.2.1. Non-preempting priority classes

Pods with the preemption policy set to Never are placed in the scheduling queue ahead of lower-priority pods, but they cannot preempt other pods. A non-preempting pod waiting to be scheduled stays in the scheduling queue until sufficient resources are free and it can be scheduled. Non-preempting pods, like other pods, are subject to scheduler back-off. This means that if the scheduler tries unsuccessfully to schedule these pods, they are retried with lower frequency, allowing other pods with lower priority to be scheduled before them.

Non-preempting pods can still be preempted by other, high-priority pods.
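
A non-preempting priority class can be sketched as follows. The class name, value, and description are illustrative assumptions; the only setting that makes the class non-preempting is preemptionPolicy: Never.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting    # hypothetical name
value: 1000000                         # example value; pods are still queued ahead of lower-priority pods
preemptionPolicy: Never                # pods in this class do not preempt other pods
globalDefault: false
description: "Example class for pods that should be scheduled early but must not evict running pods."
```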

2.9.2.2. Pod preemption and other scheduler settings

If you enable pod priority and preemption, consider your other scheduler settings:

Pod priority and pod disruption budget: A pod disruption budget specifies the minimum number or percentage of replicas that must be up at a time. If you specify pod disruption budgets, OpenShift Container Platform respects them when preempting pods at a best effort level. The scheduler attempts to preempt pods without violating the pod disruption budget. If no such pods are found, lower-priority pods might be preempted despite their pod disruption budget requirements.
Pod priority and pod affinity: Pod affinity requires a new pod to be scheduled on the same node as other pods with the same label.

If a pending pod has inter-pod affinity with one or more of the lower-priority pods on a node, the scheduler cannot preempt the lower-priority pods without violating the affinity requirements. In this case, the scheduler looks for another node to schedule the pending pod. However, there is no guarantee that the scheduler can find an appropriate node, and the pending pod might not be scheduled.

To prevent this situation, carefully configure pod affinity with equal-priority pods.
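
For reference, a pod disruption budget of the kind the scheduler tries to respect during preemption might look like the following sketch. The object name, the app: example label, and the minAvailable value are assumptions for illustration.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb                    # hypothetical name
spec:
  minAvailable: 2                      # minimum number of matching replicas that must stay up
  selector:
    matchLabels:
      app: example                     # hypothetical label on the protected pods
```

Preemption honors this budget only on a best-effort basis, so pods covered by it can still be preempted when the scheduler finds no other candidates.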

2.9.2.3. Graceful termination of preempted pods

When preempting a pod, the scheduler waits for the pod graceful termination period to expire, allowing the pod to finish working and exit. If the pod does not exit after the period, the scheduler kills the pod. This graceful termination period creates a time gap between the point that the scheduler preempts the pod and the time when the pending pod can be scheduled on the node.

To minimize this gap, configure a small graceful termination period for lower-priority pods.
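
One way to apply this advice is to set a short terminationGracePeriodSeconds on pods that use a low priority class, as in the following sketch. The low-priority class, the pod name, and the image are hypothetical, and the 10-second value is only an example; when the field is omitted, Kubernetes uses a default of 30 seconds.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: low-priority-worker            # hypothetical pod name
spec:
  priorityClassName: low-priority      # hypothetical priority class that must already exist
  terminationGracePeriodSeconds: 10    # short grace period narrows the gap before the pending pod can start
  containers:
  - name: worker
    image: registry.example.com/worker:latest   # hypothetical image
```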

2.9.3. Configuring priority and preemption

You apply pod priority and preemption by creating a priority class object and associating pods to the priority by using the priorityClassName in your pod specs.

Note

You cannot add a priority class directly to an existing scheduled pod.

Procedure

To configure your cluster to use priority and preemption:

Create one or more priority classes:
1. Create a YAML file similar to the following:
  apiVersion: scheduling.k8s.io/v1
  kind: PriorityClass
  metadata:
    name: high-priority # 1
  value: 1000000 # 2
  preemptionPolicy: PreemptLowerPriority # 3
  globalDefault: false # 4
  description: "This priority class should be used for XYZ service pods only." # 5
  1: The name of the priority class object.
  2: The priority value of the object.
  3: Optional. Specifies whether this priority class is preempting or non-preempting. The preemption policy defaults to PreemptLowerPriority, which allows pods of that priority class to preempt lower-priority pods. If the preemption policy is set to Never, pods in that priority class are non-preempting.
  4: Optional. Specifies whether this priority class should be used for pods without a priority class name specified. This field is false by default. Only one priority class with globalDefault set to true can exist in the cluster. If there is no priority class with globalDefault:true, the priority of pods with no priority class name is zero. Adding a priority class with globalDefault:true affects only pods created after the priority class is added and does not change the priorities of existing pods.
  5: Optional. Describes which pods developers should use with this priority class. Enter an arbitrary text string.
2. Create the priority class:
  $ oc create -f <file-name>.yaml

Create a pod spec to include the name of a priority class:

Create a YAML file similar to the following:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority # 1

1: Specify the priority class to use with this pod.

Create the pod:
```
$ oc create -f <file-name>.yaml
```
You can add the priority name directly to the pod configuration or to a pod template.

2.10. Placing pods on specific nodes using node selectors

A node selector specifies a map of key-value pairs. The rules are defined using custom labels on nodes and selectors specified in pods.

For the pod to be eligible to run on a node, the pod must have the indicated key-value pairs as the label on the node.

If you are using node affinity and node selectors in the same pod configuration, see the important considerations below.

2.10.1. Using node selectors to control pod placement

You can use node selectors on pods and labels on nodes to control where the pod is scheduled. With node selectors, OpenShift Container Platform schedules the pods on nodes that contain matching labels.

You add labels to a node, a compute machine set, or a machine config. Adding the label to the compute machine set ensures that if the node or machine goes down, new nodes have the label. Labels added to a node or machine config do not persist if the node or machine goes down.

To add node selectors to an existing pod, add a node selector to the controlling object for that pod, such as a ReplicaSet object, DaemonSet object, StatefulSet object, Deployment object, or DeploymentConfig object. Any existing pods under that controlling object are recreated on a node with a matching label. If you are creating a new pod, you can add the node selector directly to the pod spec. If the pod does not have a controlling object, you must delete the pod, edit the pod spec, and recreate the pod.

Note

You cannot add a node selector directly to an existing scheduled pod.

Prerequisites

To add a node selector to existing pods, determine the controlling object for that pod. For example, the router-default-66d5cf9464-7pwkc pod is controlled by the router-default-66d5cf9464 replica set:

$ oc describe pod router-default-66d5cf9464-7pwkc

Example output

kind: Pod
apiVersion: v1
metadata:
#...
Name:               router-default-66d5cf9464-7pwkc
Namespace:          openshift-ingress
# ...
Controlled By:      ReplicaSet/router-default-66d5cf9464
# ...

The web console lists the controlling object under ownerReferences in the pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: router-default-66d5cf9464-7pwkc
# ...
  ownerReferences:
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: router-default-66d5cf9464
      uid: d81dd094-da26-11e9-a48a-128e7edf0312
      controller: true
      blockOwnerDeletion: true
# ...

Procedure

Add labels to a node by using a compute machine set or editing the node directly:

Use a MachineSet object to add labels to nodes managed by the compute machine set when a node is created:

Run the following command to add labels to a MachineSet object:

$ oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>":"<value>","<key>":"<value>"}}]'  -n openshift-machine-api

For example:

$ oc patch MachineSet abc612-msrtw-worker-us-east-1c  --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]'  -n openshift-machine-api

Tip

You can alternatively apply the following YAML to add labels to a compute machine set:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: xf2bd-infra-us-east-2a
  namespace: openshift-machine-api
spec:
  template:
    spec:
      metadata:
        labels:
          region: "east"
          type: "user-node"
#...

Verify that the labels are added to the MachineSet object by using the oc edit command:

For example:

$ oc edit MachineSet abc612-msrtw-worker-us-east-1c -n openshift-machine-api

Example MachineSet object

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet

# ...

spec:
# ...
  template:
    metadata:
# ...
    spec:
      metadata:
        labels:
          region: east
          type: user-node
# ...

Add labels directly to a node:

Edit the Node object for the node:

$ oc label nodes <name> <key>=<value>

For example, to label a node:

$ oc label nodes ip-10-0-142-25.ec2.internal type=user-node region=east

Tip

You can alternatively apply the following YAML to add labels to a node:

kind: Node
apiVersion: v1
metadata:
  name: hello-node-6fbccf8d9
  labels:
    type: "user-node"
    region: "east"
#...

Verify that the labels are added to the node:

$ oc get nodes -l type=user-node,region=east

Example output

NAME                          STATUS   ROLES    AGE   VERSION
ip-10-0-142-25.ec2.internal   Ready    worker   17m   v1.25.0

Add the matching node selector to a pod:

To add a node selector to existing and future pods, add a node selector to the controlling object for the pods:

Example ReplicaSet object with labels

kind: ReplicaSet
apiVersion: apps/v1
metadata:
  name: hello-node-6fbccf8d9
# ...
spec:
# ...
  template:
    metadata:
      creationTimestamp: null
      labels:
        ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
        pod-template-hash: 66d5cf9464
    spec:
      nodeSelector:
        kubernetes.io/os: linux
        node-role.kubernetes.io/worker: ''
        type: user-node # 1
#...

1: Add the node selector.

To add a node selector to a specific, new pod, add the selector to the Pod object directly:

Example Pod object with a node selector

apiVersion: v1
kind: Pod
metadata:
  name: hello-node-6fbccf8d9
#...
spec:
  nodeSelector:
    region: east
    type: user-node
#...

Note

You cannot add a node selector directly to an existing scheduled pod.

Chapter 3. Automatically scaling pods with the Custom Metrics Autoscaler Operator

3.1. Release notes

3.1.1. Custom Metrics Autoscaler Operator release notes

The release notes for the Custom Metrics Autoscaler Operator for Red Hat OpenShift describe new features and enhancements, deprecated features, and known issues.

The Custom Metrics Autoscaler Operator uses the Kubernetes-based Event Driven Autoscaler (KEDA) and is built on top of the OpenShift Container Platform horizontal pod autoscaler (HPA).

Note

The Custom Metrics Autoscaler Operator for Red Hat OpenShift is provided as an installable component, with a distinct release cycle from the core OpenShift Container Platform. The Red Hat OpenShift Container Platform Life Cycle Policy outlines release compatibility.

3.1.1.1. Supported versions

The following table defines the Custom Metrics Autoscaler Operator versions for each OpenShift Container Platform version.


Version	OpenShift Container Platform version	General availability
2.17.2	4.19	General availability
2.17.2	4.18	General availability
2.17.2	4.17	General availability
2.17.2	4.16	General availability
2.17.2	4.15	General availability
2.17.2	4.14	General availability
2.17.2	4.13	General availability
2.17.2	4.12	General availability

3.1.1.2. Custom Metrics Autoscaler Operator 2.17.2 release notes

Issued: 25 September 2025

This release of the Custom Metrics Autoscaler Operator 2.17.2 addresses Common Vulnerabilities and Exposures (CVEs). The following advisory is available for the Custom Metrics Autoscaler Operator:

RHSA-2025:16124

Important

Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of Kubernetes-based Event Driven Autoscaler (KEDA).

3.1.1.2.1. New features and enhancements

3.1.1.2.1.1. The KEDA controller is automatically created during installation

The KEDA controller is now automatically created when you install the Custom Metrics Autoscaler Operator. Previously, you needed to manually create the KEDA controller. You can edit the automatically-created KEDA controller, as needed. For more information, see Editing the Keda Controller CR.

3.1.1.2.1.2. Support for the Kubernetes workload trigger

The Custom Metrics Autoscaler Operator now supports using the Kubernetes workload trigger to scale pods based on the number of pods matching a specific label selector. For more information, see Understanding the Kubernetes workload trigger.
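
As a rough sketch of what a scaled object using the Kubernetes workload trigger might look like, the following example uses the upstream KEDA kubernetes-workload scaler fields. The object name, namespace, target deployment, label selector, and values are assumptions for illustration.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: workload-scaledobject          # hypothetical name
  namespace: my-namespace              # hypothetical namespace
spec:
  scaleTargetRef:
    name: example-deployment           # hypothetical Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: kubernetes-workload
    metadata:
      podSelector: 'app=frontend'      # hypothetical label selector for the pods to count
      value: '2'                       # target ratio of matching pods to replicas of the scaled workload
```

With these values, the autoscaler aims for roughly one replica of example-deployment for every two pods that match app=frontend.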

3.1.1.2.1.3. Support for bound service account tokens

The Custom Metrics Autoscaler Operator now supports bound service account tokens. Previously, the Operator supported only legacy service account tokens, which are being phased out in favor of bound service account tokens for security reasons. For more information, see Understanding custom metrics autoscaler trigger authentications.

3.1.1.2.2. Bug fixes

Previously, the KEDA controller did not support volume mounts. As a result, you could not use Kerberos with the Kafka scaler. With this fix, the KEDA controller now supports volume mounts. (OCPBUGS-42559)
Previously, the KEDA version in the keda-operator deployment object log reported that the Custom Metrics Autoscaler Operator was based on an incorrect KEDA version. With this fix, the correct KEDA version is reported in the log. (OCPBUGS-58129)

3.1.2. Release notes for past releases of the Custom Metrics Autoscaler Operator

The following release notes are for previous versions of the Custom Metrics Autoscaler Operator.

For the current version, see Custom Metrics Autoscaler Operator release notes.

3.1.2.1. Custom Metrics Autoscaler Operator 2.15.1-4 release notes

Issued: 31 March 2025

This release of the Custom Metrics Autoscaler Operator 2.15.1-4 addresses Common Vulnerabilities and Exposures (CVEs). The following advisory is available for the Custom Metrics Autoscaler Operator:

RHSA-2025:3501

Important

3.1.2.1.1. New features and enhancements

3.1.2.1.1.1. CMA multi-arch builds

With this version of the Custom Metrics Autoscaler Operator, you can now install and run the Operator on an ARM64 OpenShift Container Platform cluster.

3.1.2.2. Custom Metrics Autoscaler Operator 2.14.1-467 release notes

This release of the Custom Metrics Autoscaler Operator 2.14.1-467 addresses a CVE and provides a bug fix for running the Operator in an OpenShift Container Platform cluster. The following advisory is available for the Custom Metrics Autoscaler Operator: RHSA-2024:7348.

Important

3.1.2.2.1. Bug fixes

Previously, the root file system of the Custom Metrics Autoscaler Operator pod was writable, which is unnecessary and could present security issues. This update makes the pod root file system read-only, which addresses the potential security issue. (OCPBUGS-37989)

3.1.2.3. Custom Metrics Autoscaler Operator 2.14.1-454 release notes

This release of the Custom Metrics Autoscaler Operator 2.14.1-454 addresses a CVE and provides a new feature and bug fixes for running the Operator in an OpenShift Container Platform cluster. The following advisory is available for the Custom Metrics Autoscaler Operator: RHBA-2024:5865.

Important

3.1.2.3.1. New features and enhancements

3.1.2.3.1.1. Support for the Cron trigger with the Custom Metrics Autoscaler Operator

The Custom Metrics Autoscaler Operator can now use the Cron trigger to scale pods based on an hourly schedule. When your specified time frame starts, the Custom Metrics Autoscaler Operator scales pods to your desired amount. When the time frame ends, the Operator scales back down to the previous level.

For more information, see Understanding the Cron trigger.
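
A scaled object with a Cron trigger might be sketched as follows, using the upstream KEDA cron scaler fields. The object name, namespace, target deployment, time zone, schedule, and replica counts are assumptions for illustration.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cron-scaledobject              # hypothetical name
  namespace: my-namespace              # hypothetical namespace
spec:
  scaleTargetRef:
    name: example-deployment           # hypothetical Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 100
  triggers:
  - type: cron
    metadata:
      timezone: Asia/Kolkata           # example time zone name from the IANA time zone database
      start: "0 6 * * *"               # cron expression for the start of the time frame (06:00)
      end: "30 18 * * *"               # cron expression for the end of the time frame (18:30)
      desiredReplicas: "100"           # replica count to hold while the time frame is active
```

In this sketch, the workload is scaled to 100 replicas between 06:00 and 18:30 each day and returns to its previous level outside that window.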

3.1.2.3.2. Bug fixes

Previously, if you made changes to audit configuration parameters in the KedaController custom resource, the keda-metrics-server-audit-policy config map would not get updated. As a consequence, you could not change the audit configuration parameters after the initial deployment of the Custom Metrics Autoscaler. With this fix, changes to the audit configuration now render properly in the config map, allowing you to change the audit configuration any time after installation. (OCPBUGS-32521)

3.1.2.4. Custom Metrics Autoscaler Operator 2.13.1 release notes

This release of the Custom Metrics Autoscaler Operator 2.13.1-421 provides a new feature and a bug fix for running the Operator in an OpenShift Container Platform cluster. The following advisory is available for the Custom Metrics Autoscaler Operator: RHBA-2024:4837.

Important

3.1.2.4.1. New features and enhancements

3.1.2.4.1.1. Support for custom certificates with the Custom Metrics Autoscaler Operator

The Custom Metrics Autoscaler Operator can now use custom service CA certificates to connect securely to TLS-enabled metrics sources, such as an external Kafka cluster or an external Prometheus service. By default, the Operator uses automatically-generated service certificates to connect to on-cluster services only. There is a new field in the KedaController object that allows you to load custom server CA certificates for connecting to external services by using config maps.

For more information, see Custom CA certificates for the Custom Metrics Autoscaler.
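
As a hedged sketch, the KedaController object might reference config maps that hold custom CA certificates as shown below. The spec.operator.caConfigMaps field name and the openshift-keda namespace are assumptions based on the product documentation for this feature, and the config map names are hypothetical; verify the exact field against the KedaController CR reference for your Operator version.

```yaml
apiVersion: keda.sh/v1alpha1
kind: KedaController
metadata:
  name: keda
  namespace: openshift-keda            # assumed namespace of the Operator installation
spec:
  operator:
    caConfigMaps:                      # assumed field: config maps containing custom CA certificates
    - thanos-cert                      # hypothetical config map name
    - kafka-cert                       # hypothetical config map name
```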

3.1.2.4.2. Bug fixes

Previously, the custom-metrics-autoscaler and custom-metrics-autoscaler-adapter images were missing time zone information. As a consequence, scaled objects with cron triggers failed to work because the controllers were unable to find time zone information. With this fix, the image builds are updated to include time zone information. As a result, scaled objects containing cron triggers now function properly. Note that, as of this release, scaled objects containing cron triggers were not yet supported for the custom metrics autoscaler. (OCPBUGS-34018)

3.1.2.5. Custom Metrics Autoscaler Operator 2.12.1-394 release notes

This release of the Custom Metrics Autoscaler Operator 2.12.1-394 provides bug fixes for running the Operator in an OpenShift Container Platform cluster. The following advisory is available for this release: RHSA-2024:2901.

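Important

Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of KEDA.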

3.1.2.5.1. Bug fixes

Previously, the protojson.Unmarshal function entered into an infinite loop when unmarshaling certain forms of invalid JSON. This condition could occur when unmarshaling into a message that contains a google.protobuf.Any value or when the UnmarshalOptions.DiscardUnknown option is set. This release fixes this issue. (OCPBUGS-30305)
Previously, when parsing a multipart form, either explicitly with the Request.ParseMultipartForm method or implicitly with the Request.FormValue, Request.PostFormValue, or Request.FormFile method, the limits on the total size of the parsed form were not applied to the memory consumed. This could cause memory exhaustion. With this fix, the parsing process now correctly limits the maximum size of form lines while reading a single form line. (OCPBUGS-30360)
Previously, when following an HTTP redirect to a domain that is not on a matching subdomain or on an exact match of the initial domain, an HTTP client was expected to not forward sensitive headers, such as Authorization or Cookie. For example, a redirect from example.com to www.example.com would forward the Authorization header, but a redirect to www.example.org would not forward the header. However, a maliciously crafted HTTP redirect could cause sensitive headers to be forwarded unexpectedly. This release fixes this issue. (OCPBUGS-30365)
Previously, verifying a certificate chain that contains a certificate with an unknown public key algorithm caused the certificate verification process to panic. This condition affected all crypto/tls clients, as well as Transport Layer Security (TLS) servers that set the Config.ClientAuth parameter to the VerifyClientCertIfGiven or RequireAndVerifyClientCert value. The default behavior is for TLS servers to not verify client certificates. This release fixes this issue. (OCPBUGS-30370)
Previously, if errors returned from the MarshalJSON method contained user-controlled data, an attacker could have used the data to break the contextual auto-escaping behavior of the html/template package. This condition would allow for subsequent actions to inject unexpected content into the templates. This release fixes this issue. (OCPBUGS-30397)
Previously, the net/http and golang.org/x/net/http2 Go packages did not limit the number of CONTINUATION frames for an HTTP/2 request. This condition could result in excessive CPU consumption. This release fixes this issue. (OCPBUGS-30894)

3.1.2.6. Custom Metrics Autoscaler Operator 2.12.1-384 release notes

This release of the Custom Metrics Autoscaler Operator 2.12.1-384 provides a bug fix for running the Operator in an OpenShift Container Platform cluster. The following advisory is available for this release: RHBA-2024:2043.

Important

Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of KEDA.

3.1.2.6.1. Bug fixes

Previously, the custom-metrics-autoscaler and custom-metrics-autoscaler-adapter images were missing time zone information. As a consequence, scaled objects with cron triggers failed to work because the controllers were unable to find time zone information. With this fix, the image builds are updated to include time zone information. As a result, scaled objects containing cron triggers now function properly. (OCPBUGS-32395)

3.1.2.7. Custom Metrics Autoscaler Operator 2.12.1-376 release notes

This release of the Custom Metrics Autoscaler Operator 2.12.1-376 provides security updates and bug fixes for running the Operator in an OpenShift Container Platform cluster. The following advisory is available for this release: RHSA-2024:1812.

Important

Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of KEDA.

3.1.2.7.1. Bug fixes

Previously, if invalid values such as nonexistent namespaces were specified in scaled object metadata, the underlying scaler clients would not free, or close, their client descriptors, resulting in a slow memory leak. This fix properly closes the underlying client descriptors when there are errors, preventing memory from leaking. (OCPBUGS-30145)
Previously, the ServiceMonitor custom resource (CR) for the keda-metrics-apiserver pod was not functioning, because the CR referenced an incorrect metrics port name of http. This fix corrects the ServiceMonitor CR to reference the proper port name of metrics. As a result, the ServiceMonitor CR functions properly. (OCPBUGS-25806)

3.1.2.8. Custom Metrics Autoscaler Operator 2.11.2-322 release notes

This release of the Custom Metrics Autoscaler Operator 2.11.2-322 provides security updates and bug fixes for running the Operator in an OpenShift Container Platform cluster. The following advisory is available for this release: RHSA-2023:6144.

Important

Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of KEDA.

3.1.2.8.1. Bug fixes

Because the Custom Metrics Autoscaler Operator version 2.11.2-311 was released without a required volume mount in the Operator deployment, the Custom Metrics Autoscaler Operator pod would restart every 15 minutes. This fix adds the required volume mount to the Operator deployment. As a result, the Operator no longer restarts every 15 minutes. (OCPBUGS-22361)

3.1.2.9. Custom Metrics Autoscaler Operator 2.11.2-311 release notes

This release of the Custom Metrics Autoscaler Operator 2.11.2-311 provides new features and bug fixes for running the Operator in an OpenShift Container Platform cluster. The components of the Custom Metrics Autoscaler Operator 2.11.2-311 were released in RHBA-2023:5981.

Important

Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of KEDA.

3.1.2.9.1. New features and enhancements

3.1.2.9.1.1. Red Hat OpenShift Service on AWS and OpenShift Dedicated are now supported

The Custom Metrics Autoscaler Operator 2.11.2-311 can be installed on Red Hat OpenShift Service on AWS and OpenShift Dedicated managed clusters. Previous versions of the Custom Metrics Autoscaler Operator could be installed only in the openshift-keda namespace. This prevented the Operator from being installed on Red Hat OpenShift Service on AWS and OpenShift Dedicated clusters. This version of Custom Metrics Autoscaler allows installation to other namespaces such as openshift-operators or keda, enabling installation into Red Hat OpenShift Service on AWS and OpenShift Dedicated clusters.

3.1.2.9.2. Bug fixes

Previously, if the Custom Metrics Autoscaler Operator was installed and configured, but not in use, the OpenShift CLI reported the couldn’t get resource list for external.metrics.k8s.io/v1beta1: Got empty response for: external.metrics.k8s.io/v1beta1 error after any oc command was entered. The message, although harmless, could have caused confusion. With this fix, the Got empty response for: external.metrics… error no longer appears inappropriately. (OCPBUGS-15779)
Previously, any annotation or label change to objects managed by the Custom Metrics Autoscaler was reverted by the Custom Metrics Autoscaler Operator any time the Keda Controller was modified, for example after a configuration change. This caused continuous changing of labels in your objects. The Custom Metrics Autoscaler now uses its own annotation to manage labels and annotations, and annotations and labels are no longer inappropriately reverted. (OCPBUGS-15590)

3.1.2.10. Custom Metrics Autoscaler Operator 2.10.1-267 release notes

This release of the Custom Metrics Autoscaler Operator 2.10.1-267 provides new features and bug fixes for running the Operator in an OpenShift Container Platform cluster. The components of the Custom Metrics Autoscaler Operator 2.10.1-267 were released in RHBA-2023:4089.

Important

Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of KEDA.

3.1.2.10.1. Bug fixes

Previously, the custom-metrics-autoscaler and custom-metrics-autoscaler-adapter images did not contain time zone information. Because of this, scaled objects with cron triggers failed to work because the controllers were unable to find time zone information. With this fix, the image builds now include time zone information. As a result, scaled objects containing cron triggers now function properly. (OCPBUGS-15264)
Previously, the Custom Metrics Autoscaler Operator would attempt to take ownership of all managed objects, including objects in other namespaces and cluster-scoped objects. Because of this, the Custom Metrics Autoscaler Operator was unable to create the role binding for reading the credentials necessary to be an API server. This caused errors in the kube-system namespace. With this fix, the Custom Metrics Autoscaler Operator skips adding the ownerReference field to any object in another namespace or any cluster-scoped object. As a result, the role binding is now created without any errors. (OCPBUGS-15038)
Previously, the Custom Metrics Autoscaler Operator added an ownerReferences field to the openshift-keda namespace. While this did not cause functionality problems, the presence of this field could have caused confusion for cluster administrators. With this fix, the Custom Metrics Autoscaler Operator does not add the ownerReference field to the openshift-keda namespace. As a result, the openshift-keda namespace no longer has a superfluous ownerReference field. (OCPBUGS-15293)
Previously, if you used a Prometheus trigger configured with an authentication method other than pod identity, and the podIdentity parameter was set to none, the trigger would fail to scale. With this fix, the Custom Metrics Autoscaler for OpenShift now properly handles the none pod identity provider type. As a result, a Prometheus trigger configured with an authentication method other than pod identity, and with the podIdentity parameter set to none, now scales properly. (OCPBUGS-15274)

3.1.2.11. Custom Metrics Autoscaler Operator 2.10.1 release notes

This release of the Custom Metrics Autoscaler Operator 2.10.1 provides new features and bug fixes for running the Operator in an OpenShift Container Platform cluster. The components of the Custom Metrics Autoscaler Operator 2.10.1 were released in RHEA-2023:3199.

Important

Before installing this version of the Custom Metrics Autoscaler Operator, remove any previously installed Technology Preview versions or the community-supported version of KEDA.

3.1.2.11.1. New features and enhancements

3.1.2.11.1.1. Custom Metrics Autoscaler Operator general availability

The Custom Metrics Autoscaler Operator is now generally available as of Custom Metrics Autoscaler Operator version 2.10.1.

Important

Scaling by using a scaled job is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

3.1.2.11.1.2. Performance metrics

You can now use the Prometheus Query Language (PromQL) to query metrics on the Custom Metrics Autoscaler Operator.

3.1.2.11.1.3. Pausing the custom metrics autoscaling for scaled objects

You can now pause the autoscaling of a scaled object, as needed, and resume autoscaling when ready.
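
One way to pause autoscaling, sketched here under the assumption that the upstream KEDA paused-replicas annotation is used, is to annotate the scaled object. The object name and the replica count are hypothetical.

```yaml
# Sketch only: the object name and replica count are illustrative.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-scaledobject
  annotations:
    # While this annotation is present, the workload is held at 4 replicas
    # and autoscaling is paused. Remove the annotation to resume autoscaling.
    autoscaling.keda.sh/paused-replicas: "4"
# ...
```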

3.1.2.11.1.4. Replica fall back for scaled objects

You can now specify the number of replicas to fall back to if a scaled object fails to get metrics from the source.
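
A minimal sketch of a fallback configuration follows. The threshold and replica count are example values, not recommendations.

```yaml
# Sketch only: values are illustrative.
spec:
  fallback:
    failureThreshold: 3   # consecutive failures to get metrics before falling back
    replicas: 6           # replica count to use while the metrics source is failing
```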

3.1.2.11.1.5. Customizable HPA naming for scaled objects

You can now specify a custom name for the horizontal pod autoscaler in scaled objects.
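
For example, a scaled object could name its HPA as in the following sketch. The name shown is a hypothetical value.

```yaml
# Sketch only: the HPA name is an illustrative assumption.
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      name: keda-hpa-example   # custom name for the generated HPA object
```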

3.1.2.11.1.6. Activation and scaling thresholds

Because the horizontal pod autoscaler (HPA) cannot scale to or from 0 replicas, the Custom Metrics Autoscaler Operator does that scaling, after which the HPA performs the scaling. You can now specify when the HPA takes over autoscaling, based on the number of replicas. This allows for more flexibility with your scaling policies.

3.1.2.12. Custom Metrics Autoscaler Operator 2.8.2-174 release notes

This release of the Custom Metrics Autoscaler Operator 2.8.2-174 provides new features and bug fixes for running the Operator in an OpenShift Container Platform cluster. The components of the Custom Metrics Autoscaler Operator 2.8.2-174 were released in RHEA-2023:1683.

Important

The Custom Metrics Autoscaler Operator version 2.8.2-174 is a Technology Preview feature.

3.1.2.12.1. New features and enhancements

3.1.2.12.1.1. Operator upgrade support

You can now upgrade from a prior version of the Custom Metrics Autoscaler Operator. See "Changing the update channel for an Operator" in the "Additional resources" for information on upgrading an Operator.

3.1.2.12.1.2. must-gather support

You can now collect data about the Custom Metrics Autoscaler Operator and its components by using the OpenShift Container Platform must-gather tool. Currently, the process for using the must-gather tool with the Custom Metrics Autoscaler is different than for other operators. See "Gathering debugging data" in the "Additional resources" for more information.

3.1.2.13. Custom Metrics Autoscaler Operator 2.8.2 release notes

This release of the Custom Metrics Autoscaler Operator 2.8.2 provides new features and bug fixes for running the Operator in an OpenShift Container Platform cluster. The components of the Custom Metrics Autoscaler Operator 2.8.2 were released in RHSA-2023:1042.

Important

The Custom Metrics Autoscaler Operator version 2.8.2 is a Technology Preview feature.

3.1.2.13.1. New features and enhancements

3.1.2.13.1.1. Audit Logging

You can now gather and view audit logs for the Custom Metrics Autoscaler Operator and its associated components. Audit logs are security-relevant chronological sets of records that document the sequence of activities that have affected the system by individual users, administrators, or other components of the system.

3.1.2.13.1.2. Scale applications based on Apache Kafka metrics

You can now use the KEDA Apache Kafka trigger/scaler to scale deployments based on an Apache Kafka topic.
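
As a rough sketch, a Kafka trigger in a scaled object might look like the following. The topic, bootstrap server address, consumer group, and lag threshold are hypothetical values for illustration.

```yaml
# Sketch only: all values are illustrative assumptions.
triggers:
- type: kafka
  metadata:
    topic: my-topic
    bootstrapServers: my-cluster-kafka-bootstrap.openshift-operators.svc:9092
    consumerGroup: my-group
    lagThreshold: '10'   # consumer lag that triggers scaling
```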

3.1.2.13.1.3. Scale applications based on CPU metrics

You can now use the KEDA CPU trigger/scaler to scale deployments based on CPU metrics.

3.1.2.13.1.4. Scale applications based on memory metrics

You can now use the KEDA memory trigger/scaler to scale deployments based on memory metrics.
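
A memory trigger can be sketched as follows, assuming a utilization target of 60 percent as an example value. A CPU trigger has the same shape with the type set to cpu.

```yaml
# Sketch only: the target value is illustrative.
triggers:
- type: memory
  metricType: Utilization   # target is a percentage of the requested memory
  metadata:
    value: '60'
```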

3.2. Custom Metrics Autoscaler Operator overview

As a developer, you can use the Custom Metrics Autoscaler Operator for Red Hat OpenShift to specify how OpenShift Container Platform should automatically increase or decrease the number of pods for a deployment, stateful set, custom resource, or job based on custom metrics that are not based only on CPU or memory.

The Custom Metrics Autoscaler Operator is an optional Operator, based on the Kubernetes Event Driven Autoscaler (KEDA), that allows workloads to be scaled using additional metrics sources other than pod metrics.

The custom metrics autoscaler currently supports only the Prometheus, CPU, memory, Apache Kafka, and cron triggers.

The Custom Metrics Autoscaler Operator scales your pods up and down based on custom, external metrics from specific applications. Your other applications continue to use other scaling methods. You configure triggers, also known as scalers, which are the source of events and metrics that the custom metrics autoscaler uses to determine how to scale. The custom metrics autoscaler uses a metrics API to convert the external metrics to a form that OpenShift Container Platform can use. The custom metrics autoscaler creates a horizontal pod autoscaler (HPA) that performs the actual scaling.

To use the custom metrics autoscaler, you create a ScaledObject or ScaledJob object for a workload, which is a custom resource (CR) that defines the scaling metadata. You specify the deployment or job to scale, the source of the metrics to scale on (trigger), and other parameters such as the minimum and maximum replica counts allowed.
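
The pieces described above fit together roughly as in the following sketch. The object names and the replica limits are hypothetical. The trigger values reuse the Prometheus example shown later in this document.

```yaml
# Minimal sketch of a scaled object; names and limits are illustrative.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-scaledobject
  namespace: my-namespace
spec:
  scaleTargetRef:            # the workload to scale
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment
  minReplicaCount: 1         # lower bound for the replica count
  maxReplicaCount: 10        # upper bound for the replica count
  triggers:                  # the source of the metrics to scale on
  - type: prometheus
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: my-namespace
      query: sum(rate(http_requests_total{job="test-app"}[1m]))
      threshold: '5'
```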

Note

You can create only one scaled object or scaled job for each workload that you want to scale. Also, you cannot use a scaled object or scaled job and the horizontal pod autoscaler (HPA) on the same workload.

The custom metrics autoscaler, unlike the HPA, can scale to zero. If you set the minReplicaCount value in the custom metrics autoscaler CR to 0, the custom metrics autoscaler scales the workload down from 1 to 0 replicas or up from 0 replicas to 1. This is known as the activation phase. After scaling up to 1 replica, the HPA takes control of the scaling. This is known as the scaling phase.

Some triggers allow you to change the number of replicas that are scaled by the custom metrics autoscaler. In all cases, the parameter that configures the activation phase uses the same name as the scaling parameter, prefixed with activation. For example, if the threshold parameter configures scaling, activationThreshold would configure activation. Configuring the activation and scaling phases allows you more flexibility with your scaling policies. For example, you can configure a higher activation phase to prevent scaling up or down if the metric is particularly low.

The activation value takes priority over the scaling value when the two lead to different decisions. For example, if the threshold is set to 10 and the activationThreshold is set to 50, and the metric reports 40, the scaler is not active and the pods are scaled to zero even if the HPA requires 4 instances.
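
The threshold values from this example could be expressed in a Prometheus trigger as in the following sketch. The server address and query are reused from the Prometheus example in this document and are only placeholders here.

```yaml
# Sketch only: with these values, a metric of 40 leaves the scaler inactive,
# because 40 is below the activationThreshold of 50, even though 40 / 10
# would otherwise call for 4 replicas.
triggers:
- type: prometheus
  metadata:
    serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
    query: sum(rate(http_requests_total{job="test-app"}[1m]))
    threshold: '10'            # scaling phase: target value per replica
    activationThreshold: '50'  # activation phase: metric must exceed this value
```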

Figure 3.1. Custom metrics autoscaler workflow

You create or modify a scaled object custom resource for a workload on a cluster. The object contains the scaling configuration for that workload. Prior to accepting the new object, the OpenShift API server sends it to the custom metrics autoscaler admission webhooks process to ensure that the object is valid. If validation succeeds, the API server persists the object.
The custom metrics autoscaler controller watches for new or modified scaled objects. When the OpenShift API server notifies the controller of a change, the controller monitors any external trigger sources, also known as data sources, that are specified in the object for changes to the metrics data. One or more scalers request scaling data from the external trigger source. For example, for a Kafka trigger type, the controller uses the Kafka scaler to communicate with a Kafka instance to obtain the data requested by the trigger.
The controller creates a horizontal pod autoscaler object for the scaled object. As a result, the Horizontal Pod Autoscaler (HPA) Operator starts monitoring the scaling data associated with the trigger. The HPA requests scaling data from the cluster OpenShift API server endpoint.
The OpenShift API server endpoint is served by the custom metrics autoscaler metrics adapter. When the metrics adapter receives a request for custom metrics, it uses a gRPC connection to the controller to request the most recent trigger data received from the scaler.
The HPA makes scaling decisions based upon the data received from the metrics adapter and scales the workload up or down by increasing or decreasing the replicas.
As it operates, a workload can affect the scaling metrics. For example, if a workload is scaled up to handle work in a Kafka queue, the queue size decreases after the workload processes all the work. As a result, the workload is scaled down.
If the metrics are in a range specified by the minReplicaCount value, the custom metrics autoscaler controller disables all scaling, and leaves the replica count at a fixed level. If the metrics exceed that range, the custom metrics autoscaler controller enables scaling and allows the HPA to scale the workload. While scaling is disabled, the HPA does not take any action.

3.2.1. Custom CA certificates for the Custom Metrics Autoscaler

By default, the Custom Metrics Autoscaler Operator uses automatically-generated service CA certificates to connect to on-cluster services.

If you want to use off-cluster services that require custom CA certificates, you can add the required certificates to a config map. Then, add the config map to the KedaController custom resource as described in Installing the custom metrics autoscaler. The Operator loads those certificates on start-up and registers them as trusted by the Operator.

The config maps can contain one or more certificate files that contain one or more PEM-encoded CA certificates. Or, you can use separate config maps for each certificate file.

Note

If you later update the config map to add additional certificates, you must restart the keda-operator-* pod for the changes to take effect.
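
As an illustration, a config map that holds one PEM-encoded CA certificate file might look like the following sketch. The config map name and file name match the example used in the installation prerequisites. The certificate body is a placeholder.

```yaml
# Sketch only: the certificate content is a placeholder.
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-cert
  namespace: openshift-keda   # must be the namespace where the Operator is installed
data:
  ca-cert.pem: |
    -----BEGIN CERTIFICATE-----
    <certificate_body>
    -----END CERTIFICATE-----
```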

3.3. Installing the custom metrics autoscaler

You can use the OpenShift Container Platform web console to install the Custom Metrics Autoscaler Operator.

The installation creates the following five CRDs:

ClusterTriggerAuthentication
KedaController
ScaledJob
ScaledObject
TriggerAuthentication

The installation process also creates the KedaController custom resource (CR). You can modify the default KedaController CR, if needed. For more information, see "Editing the Keda Controller CR".

Note

If you are installing a Custom Metrics Autoscaler Operator version lower than 2.17.2, you must manually create the Keda Controller CR. You can use the procedure described in "Editing the Keda Controller CR" to create the CR.

3.3.1. Installing the custom metrics autoscaler

You can use the following procedure to install the Custom Metrics Autoscaler Operator.

Prerequisites

Remove any previously installed Technology Preview versions of the Custom Metrics Autoscaler Operator.
Remove any versions of the community-based KEDA.
Also, remove the KEDA 1.x custom resource definitions by running the following commands:
```
$ oc delete crd scaledobjects.keda.k8s.io
```
```
$ oc delete crd triggerauthentications.keda.k8s.io
```
Optional: If you need the Custom Metrics Autoscaler Operator to connect to off-cluster services, such as an external Kafka cluster or an external Prometheus service, put any required service CA certificates into a config map. The config map must exist in the same namespace where the Operator is installed. For example:
```
$ oc create configmap -n openshift-keda thanos-cert --from-file=ca-cert.pem
```

Procedure

In the OpenShift Container Platform web console, click Operators → OperatorHub.
Choose Custom Metrics Autoscaler from the list of available Operators, and click Install.
On the Install Operator page, ensure that the All namespaces on the cluster (default) option is selected for Installation Mode. This installs the Operator in all namespaces.
Ensure that the openshift-keda namespace is selected for Installed Namespace. OpenShift Container Platform creates the namespace, if not present in your cluster.
Click Install.
Verify the installation by listing the Custom Metrics Autoscaler Operator components:
1. Navigate to Workloads → Pods.
2. Select the openshift-keda project from the drop-down menu and verify that the custom-metrics-autoscaler-operator-* pod is running.
3. Navigate to Workloads → Deployments to verify that the custom-metrics-autoscaler-operator deployment is running.

Optional: Verify the installation in the OpenShift CLI by using the following command:

$ oc get all -n openshift-keda

The output appears similar to the following:

Example output

NAME                                                      READY   STATUS    RESTARTS   AGE
pod/custom-metrics-autoscaler-operator-5fd8d9ffd8-xt4xp   1/1     Running   0          18m

NAME                                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/custom-metrics-autoscaler-operator   1/1     1            1           18m

NAME                                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/custom-metrics-autoscaler-operator-5fd8d9ffd8   1         1         1       18m

3.3.2. Editing the Keda Controller CR

You can use the following procedure to modify the KedaController custom resource (CR), which is automatically installed during the installation of the Custom Metrics Autoscaler Operator.

Procedure

In the OpenShift Container Platform web console, click Operators → Installed Operators.
Click Custom Metrics Autoscaler.
On the Operator Details page, click the KedaController tab.

On the KedaController tab, click Create KedaController and edit the file.

kind: KedaController
apiVersion: keda.sh/v1alpha1
metadata:
  name: keda
  namespace: openshift-keda
spec:
  watchNamespace: '' # 1
  operator:
    logLevel: info # 2
    logEncoder: console # 3
    caConfigMaps: # 4
    - thanos-cert
    - kafka-cert
    volumeMounts: # 5
    - mountPath: /<path_to_directory>
      name: <name>
    volumes: # 6
    - name: <volume_name>
      emptyDir:
        medium: Memory
  metricsServer:
    logLevel: '0' # 7
    auditConfig: # 8
      logFormat: "json"
      logOutputVolumeClaim: "persistentVolumeClaimName"
      policy:
        rules:
        - level: Metadata
        omitStages: ["RequestReceived"]
        omitManagedFields: false
      lifetime:
        maxAge: "2"
        maxBackup: "1"
        maxSize: "50"
  serviceAccount: {}

1: Specifies a single namespace in which the Custom Metrics Autoscaler Operator scales applications. Leave the value empty to scale applications in all namespaces. The default value is empty.
2: Specifies the level of verbosity for the Custom Metrics Autoscaler Operator log messages. The allowed values are debug, info, and error. The default is info.
3: Specifies the logging format for the Custom Metrics Autoscaler Operator log messages. The allowed values are console or json. The default is console.
4: Optional: Specifies one or more config maps with CA certificates, which the Custom Metrics Autoscaler Operator can use to connect securely to TLS-enabled metrics sources.
5: Optional: Add the container mount path.
6: Optional: Add a volumes block to list each projected volume source.
7: Specifies the logging level for the Custom Metrics Autoscaler Metrics Server. The allowed values are 0 for info and 4 for debug. The default is 0.
8: Activates audit logging for the Custom Metrics Autoscaler Operator and specifies the audit policy to use, as described in the "Configuring audit logging" section.

Click Save to save the changes.
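
As a variation on the example above, the following sketch shows a KedaController CR that limits scaling to a single namespace and raises log verbosity for troubleshooting. The namespace name is hypothetical. All fields and allowed values are the ones described in this procedure.

```yaml
# Sketch only: my-namespace is an illustrative assumption.
kind: KedaController
apiVersion: keda.sh/v1alpha1
metadata:
  name: keda
  namespace: openshift-keda
spec:
  watchNamespace: my-namespace   # scale applications in this namespace only
  operator:
    logLevel: debug              # more verbose than the default, info
    logEncoder: json             # structured logs instead of the default, console
  metricsServer:
    logLevel: '4'                # 4 is debug; the default, 0, is info
```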

3.4. Understanding custom metrics autoscaler triggers

Triggers, also known as scalers, provide the metrics that the Custom Metrics Autoscaler Operator uses to scale your pods.

The custom metrics autoscaler currently supports the Prometheus, CPU, memory, Apache Kafka, and cron triggers.

You use a ScaledObject or ScaledJob custom resource to configure triggers for specific objects, as described in the sections that follow.

You can configure a certificate authority to use with your scaled objects or for all scalers in the cluster.

3.4.1. Understanding the Prometheus trigger

You can scale pods based on Prometheus metrics, which can use the installed OpenShift Container Platform monitoring or an external Prometheus server as the metrics source. See "Configuring the custom metrics autoscaler to use OpenShift Container Platform monitoring" for information on the configurations required to use the OpenShift Container Platform monitoring as a source for metrics.

Note

If Prometheus is collecting metrics from the application that the custom metrics autoscaler is scaling, do not set the minimum replicas to 0 in the custom resource. If there are no application pods, the custom metrics autoscaler does not have any metrics to scale on.

Example scaled object with a Prometheus target

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prom-scaledobject
  namespace: my-namespace
spec:
# ...
  triggers:
  - type: prometheus 
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092 
      namespace: kedatest 
      metricName: http_requests_total 
      threshold: '5' 
      query: sum(rate(http_requests_total{job="test-app"}[1m])) 
      authModes: basic 
      cortexOrgID: my-org 
      ignoreNullValues: false 
      unsafeSsl: false 
      timeout: 1000

1: Specifies Prometheus as the trigger type.
2: Specifies the address of the Prometheus server. This example uses OpenShift Container Platform monitoring.
3: Optional: Specifies the namespace of the object you want to scale. This parameter is mandatory if using OpenShift Container Platform monitoring as a source for the metrics.
4: Specifies the name to identify the metric in the external.metrics.k8s.io API. If you are using more than one trigger, all metric names must be unique.
5: Specifies the value that triggers scaling. Must be specified as a quoted string value.
6: Specifies the Prometheus query to use.
7: Specifies the authentication method to use. Prometheus scalers support bearer authentication (bearer), basic authentication (basic), or TLS authentication (tls). You configure the specific authentication parameters in a trigger authentication, as discussed in a following section. As needed, you can also use a secret.
8: Optional: Passes the X-Scope-OrgID header to multi-tenant Cortex or Mimir storage for Prometheus. This parameter is required only with multi-tenant Prometheus storage, to indicate which data Prometheus should return.
9: Optional: Specifies how the trigger should proceed if the Prometheus target is lost.
If true, the trigger continues to operate if the Prometheus target is lost. This is the default behavior.
If false, the trigger returns an error if the Prometheus target is lost.
10: Optional: Specifies whether the certificate check should be skipped. For example, you might skip the check if you are running in a test environment and using self-signed certificates at the Prometheus endpoint.
If false, the certificate check is performed. This is the default behavior.
If true, the certificate check is not performed.
Important: Skipping the check is not recommended.
11: Optional: Specifies an HTTP request timeout in milliseconds for the HTTP client used by this Prometheus trigger. This value overrides any global timeout setting.

3.4.1.1. Configuring GPU-based autoscaling with Prometheus and DCGM metrics

You can use the Custom Metrics Autoscaler with NVIDIA Data Center GPU Manager (DCGM) metrics to scale workloads based on GPU utilization. This is particularly useful for AI and machine learning workloads that require GPU resources.

Example scaled object with a Prometheus target for GPU-based autoscaling

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-scaledobject
  namespace: my-namespace
spec:
  scaleTargetRef:
    kind: Deployment
    name: gpu-deployment
  minReplicaCount: 1 
  maxReplicaCount: 5 
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: my-namespace
      metricName: gpu_utilization
      threshold: '90' 
      query: SUM(DCGM_FI_DEV_GPU_UTIL{instance=~".+", gpu=~".+"}) 
      authModes: bearer
    authenticationRef:
      name: keda-trigger-auth-prometheus

1: Specifies the minimum number of replicas to maintain. For GPU workloads, this should not be set to 0 to ensure that metrics continue to be collected.
2: Specifies the maximum number of replicas allowed during scale-up operations.
3: Specifies the GPU utilization percentage threshold that triggers scaling. When the average GPU utilization exceeds 90%, the autoscaler scales up the deployment.
4: Specifies a Prometheus query using NVIDIA DCGM metrics to monitor GPU utilization across all GPU devices. The DCGM_FI_DEV_GPU_UTIL metric provides GPU utilization percentages.

3.4.1.2. Configuring the custom metrics autoscaler to use OpenShift Container Platform monitoring

You can use the installed OpenShift Container Platform Prometheus monitoring as a source for the metrics used by the custom metrics autoscaler. However, there are some additional configurations you must perform.

For your scaled objects to be able to read the OpenShift Container Platform Prometheus metrics, you must use a trigger authentication or a cluster trigger authentication in order to provide the authentication information required. The following procedure differs depending on which trigger authentication method you use. For more information on trigger authentications, see "Understanding custom metrics autoscaler trigger authentications".

Note

These steps are not required for an external Prometheus source.

You must perform the following tasks, as described in this section:

Create a service account.
Create the trigger authentication.
Create a role.
Add that role to the service account.
Reference the token in the trigger authentication object used by Prometheus.

Prerequisites

OpenShift Container Platform monitoring must be installed.
Monitoring of user-defined workloads must be enabled in OpenShift Container Platform monitoring, as described in the Creating a user-defined workload monitoring config map section.
The Custom Metrics Autoscaler Operator must be installed.

Procedure

Change to the appropriate project:
```
$ oc project <project_name>
```
For <project_name>, specify one of the following projects:
If you are using a trigger authentication, specify the project with the object you want to scale.
If you are using a cluster trigger authentication, specify the openshift-keda project.
Create a service account if your cluster does not have one:
1. Create a service account object by using the following command:
  $ oc create serviceaccount thanos
  In this command, thanos specifies the name of the service account.
Create a trigger authentication with the service account token:
1. Create a YAML file similar to the following:
  apiVersion: keda.sh/v1alpha1
  kind: <authentication_method> # 1
  metadata:
    name: keda-trigger-auth-prometheus
  spec:
    boundServiceAccountToken: # 2
    - parameter: bearerToken # 3
      serviceAccountName: thanos # 4
  1: Specifies one of the following trigger authentication methods:
  If you are using a trigger authentication, specify TriggerAuthentication. This example configures a trigger authentication.
  If you are using a cluster trigger authentication, specify ClusterTriggerAuthentication.
  2: Specifies that this trigger authentication uses a bound service account token for authorization when connecting to the metrics endpoint.
  3: Specifies the authentication parameter to supply by using the token. Here, the example uses bearer authentication.
  4: Specifies the name of the service account to use.
2. Create the CR object:
  $ oc create -f <file-name>.yaml

Create a role for reading Thanos metrics:

Create a YAML file with the following parameters:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: thanos-metrics-reader
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch

Create the CR object:
```
$ oc create -f <file-name>.yaml
```

Create a role binding for reading Thanos metrics:
1. Create a YAML file similar to the following:
  apiVersion: rbac.authorization.k8s.io/v1
  kind: <binding_type> # 1
  metadata:
    name: thanos-metrics-reader # 2
    namespace: my-project # 3
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: thanos-metrics-reader
  subjects:
  - kind: ServiceAccount
    name: thanos # 4
    namespace: <namespace_name> # 5
  1: Specifies one of the following object types:
  If you are using a trigger authentication, specify RoleBinding.
  If you are using a cluster trigger authentication, specify ClusterRoleBinding.
  2: Specifies the name of the role you created.
  3: Specifies one of the following projects:
  If you are using a trigger authentication, specify the project with the object you want to scale.
  If you are using a cluster trigger authentication, specify the openshift-keda project.
  4: Specifies the name of the service account to bind to the role.
  5: Specifies the project where you previously created the service account.
2. Create the CR object:
  $ oc create -f <file-name>.yaml

You can now deploy a scaled object or scaled job to enable autoscaling for your application, as described in "Understanding how to add custom metrics autoscalers". To use OpenShift Container Platform monitoring as the source, in the trigger, or scaler, you must include the following parameters:

triggers.type must be prometheus
triggers.metadata.serverAddress must be https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
triggers.metadata.authModes must be bearer
triggers.metadata.namespace must be set to the namespace of the object to scale
triggers.authenticationRef must point to the trigger authentication resource specified in the previous step
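
Taken together, these requirements might look like the following scaled object. This is a minimal sketch rather than a tested configuration: the object name, the example-app deployment, the my-namespace project, the query, and the replica counts are assumptions chosen for illustration. The authenticationRef name matches the trigger authentication created in this procedure.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: prom-monitoring-scaledobject   # hypothetical name
  namespace: my-namespace              # project that contains the object to scale
spec:
  scaleTargetRef:
    kind: Deployment
    name: example-app                  # hypothetical deployment
  minReplicaCount: 1                   # kept above 0 so the application keeps emitting metrics
  maxReplicaCount: 5                   # assumed upper bound
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: my-namespace          # namespace of the object to scale
      metricName: http_requests_total
      threshold: '5'
      query: sum(rate(http_requests_total{job="example-app"}[1m]))   # hypothetical job label
      authModes: bearer
    authenticationRef:
      name: keda-trigger-auth-prometheus
```

If you created a cluster trigger authentication instead, the authenticationRef stanza also needs the kind parameter set to ClusterTriggerAuthentication, as noted in "Understanding custom metrics autoscaler trigger authentications".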

3.4.2. Understanding the CPU trigger

You can scale pods based on CPU metrics. This trigger uses cluster metrics as the source for metrics.

The custom metrics autoscaler scales the pods associated with an object to maintain the CPU usage that you specify. The autoscaler increases or decreases the number of replicas between the minimum and maximum numbers to maintain the specified CPU utilization across all pods. The CPU trigger considers the CPU utilization of the entire pod. If the pod has multiple containers, the CPU trigger considers the total CPU utilization of all containers in the pod.

Note

This trigger cannot be used with the ScaledJob custom resource.
When using a CPU trigger to scale an object, the object does not scale to 0, even if you are using multiple triggers.

Example scaled object with a CPU target

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cpu-scaledobject
  namespace: my-namespace
spec:
# ...
  triggers:
  - type: cpu 
    metricType: Utilization 
    metadata:
      value: '60' 
  minReplicaCount: 1

1: Specifies CPU as the trigger type.
2: Specifies the type of metric to use, either Utilization or AverageValue.
3: Specifies the value that triggers scaling. Must be specified as a quoted string value.
When using Utilization, the target value is the average of the resource metrics across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
When using AverageValue, the target value is the average of the metrics across all relevant pods.
4: Specifies the minimum number of replicas when scaling down. For a CPU trigger, enter a value of 1 or greater, because the HPA cannot scale to zero if you are using only CPU metrics.
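
For comparison, a CPU trigger that uses AverageValue instead of Utilization might look like the following sketch. The object name, the example-app deployment, the replica counts, and the 500m target are assumptions for illustration. With AverageValue, the value is an absolute quantity (here, an average of 500 millicores per pod) rather than a percentage of the requested CPU.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cpu-average-scaledobject   # hypothetical name
  namespace: my-namespace
spec:
  scaleTargetRef:
    kind: Deployment
    name: example-app              # hypothetical deployment
  minReplicaCount: 1               # 1 or greater, as required for a CPU trigger
  maxReplicaCount: 10              # assumed upper bound
  triggers:
  - type: cpu
    metricType: AverageValue
    metadata:
      value: '500m'                # assumed target: 500 millicores on average per pod
```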

3.4.3. Understanding the memory trigger

You can scale pods based on memory metrics. This trigger uses cluster metrics as the source for metrics.

The custom metrics autoscaler scales the pods associated with an object to maintain the average memory usage that you specify. The autoscaler increases and decreases the number of replicas between the minimum and maximum numbers to maintain the specified memory utilization across all pods. The memory trigger considers the memory utilization of the entire pod. If the pod has multiple containers, the memory utilization is the sum of all of the containers.

Note

This trigger cannot be used with the ScaledJob custom resource.
When using a memory trigger to scale an object, the object does not scale to 0, even if you are using multiple triggers.

Example scaled object with a memory target

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: memory-scaledobject
  namespace: my-namespace
spec:
# ...
  triggers:
  - type: memory 
    metricType: Utilization 
    metadata:
      value: '60' 
      containerName: api

1: Specifies memory as the trigger type.
2: Specifies the type of metric to use, either Utilization or AverageValue.
3: Specifies the value that triggers scaling. Must be specified as a quoted string value.
When using Utilization, the target value is the average of the resource metrics across all relevant pods, represented as a percentage of the requested value of the resource for the pods.
When using AverageValue, the target value is the average of the metrics across all relevant pods.
4: Optional: Specifies an individual container to scale, based on the memory utilization of only that container, rather than the entire pod. In this example, only the container named api is to be scaled.

3.4.4. Understanding the Kafka trigger

You can scale pods based on an Apache Kafka topic or other services that support the Kafka protocol. The custom metrics autoscaler does not scale higher than the number of Kafka partitions, unless you set the allowIdleConsumers parameter to true in the scaled object or scaled job.

Note

If the number of consumers in a consumer group exceeds the number of partitions in a topic, the extra consumers remain idle. To avoid this, by default the number of replicas does not exceed:

The number of partitions on a topic, if a topic is specified
The number of partitions of all topics in the consumer group, if no topic is specified
The maxReplicaCount specified in the scaled object or scaled job CR

You can use the allowIdleConsumers parameter to disable these default behaviors.

Example scaled object with a Kafka target

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaledobject
  namespace: my-namespace
spec:
# ...
  triggers:
  - type: kafka 
    metadata:
      topic: my-topic 
      bootstrapServers: my-cluster-kafka-bootstrap.openshift-operators.svc:9092 
      consumerGroup: my-group 
      lagThreshold: '10' 
      activationLagThreshold: '5' 
      offsetResetPolicy: latest 
      allowIdleConsumers: true 
      scaleToZeroOnInvalidOffset: false 
      excludePersistentLag: false 
      version: '1.0.0' 
      partitionLimitation: '1,2,10-20,31' 
      tls: enable

1: Specifies Kafka as the trigger type.
2: Specifies the name of the Kafka topic on which Kafka is processing the offset lag.
3: Specifies a comma-separated list of Kafka brokers to connect to.
4: Specifies the name of the Kafka consumer group used for checking the offset on the topic and processing the related lag.
5: Optional: Specifies the average target value that triggers scaling. Must be specified as a quoted string value. The default is 5.
6: Optional: Specifies the target value for the activation phase. Must be specified as a quoted string value.
7: Optional: Specifies the Kafka offset reset policy for the Kafka consumer. The available values are latest and earliest. The default is latest.
8: Optional: Specifies whether the number of Kafka replicas can exceed the number of partitions on a topic.
If true, the number of Kafka replicas can exceed the number of partitions on a topic. This allows for idle Kafka consumers.
If false, the number of Kafka replicas cannot exceed the number of partitions on a topic. This is the default.
9: Specifies how the trigger behaves when a Kafka partition does not have a valid offset.
If true, the consumers are scaled to zero for that partition.
If false, the scaler keeps a single consumer for that partition. This is the default.
10: Optional: Specifies whether the trigger includes or excludes partition lag for partitions whose current offset is the same as the current offset of the previous polling cycle.
If true, the scaler excludes partition lag in these partitions.
If false, the trigger includes all consumer lag in all partitions. This is the default.
11: Optional: Specifies the version of your Kafka brokers. Must be specified as a quoted string value. The default is 1.0.0.
12: Optional: Specifies a comma-separated list of partition IDs to scope the scaling on. If set, only the listed IDs are considered when calculating lag. Must be specified as a quoted string value. The default is to consider all partitions.
13: Optional: Specifies whether to use TLS client authentication for Kafka. The default is disable. For information on configuring TLS, see "Understanding custom metrics autoscaler trigger authentications".
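
Because most of these parameters are optional, a Kafka trigger can be much shorter than the full example. The following sketch sets only the topic, brokers, and consumer group, and relies on the defaults described in this section for everything else. The object name, the example-consumer deployment, and the replica counts are assumptions for illustration.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-minimal-scaledobject   # hypothetical name
  namespace: my-namespace
spec:
  scaleTargetRef:
    kind: Deployment
    name: example-consumer           # hypothetical consumer deployment
  minReplicaCount: 1
  maxReplicaCount: 10                # assumed; the partition count also caps replicas by default
  triggers:
  - type: kafka
    metadata:
      topic: my-topic
      bootstrapServers: my-cluster-kafka-bootstrap.openshift-operators.svc:9092
      consumerGroup: my-group
      # Not set, so the documented defaults apply:
      # lagThreshold '5', offsetResetPolicy latest,
      # allowIdleConsumers false, version '1.0.0', tls disable.
```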

3.4.5. Understanding the Cron trigger

You can scale pods based on a time range.

When the time range starts, the custom metrics autoscaler scales the pods associated with an object from the configured minimum number of pods to the specified number of desired pods. At the end of the time range, the pods are scaled back to the configured minimum. The time period must be configured in cron format.

The following example scales the pods associated with this scaled object from 0 to 100 from 6:00 AM to 6:30 PM India Standard Time.

Example scaled object with a Cron trigger

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cron-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: my-deployment
  minReplicaCount: 0 
  maxReplicaCount: 100 
  cooldownPeriod: 300
  triggers:
  - type: cron 
    metadata:
      timezone: Asia/Kolkata 
      start: "0 6 * * *" 
      end: "30 18 * * *" 
      desiredReplicas: "100"

1: Specifies the minimum number of pods to scale down to at the end of the time frame.
2: Specifies the maximum number of replicas when scaling up. This value should be the same as desiredReplicas. The default is 100.
3: Specifies a Cron trigger.
4: Specifies the timezone for the time frame. This value must be from the IANA Time Zone Database.
5: Specifies the start of the time frame.
6: Specifies the end of the time frame.
7: Specifies the number of pods to scale to between the start and end of the time frame. This value should be the same as maxReplicaCount.
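
As a second illustration, the following sketch keeps one replica outside business hours and scales to 10 replicas on weekdays between 8:00 AM and 6:00 PM. The object name, time zone, and replica counts are assumptions. The cron fields are minute, hour, day of month, month, and day of week, so 1-5 in the last field restricts the schedule to Monday through Friday.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cron-weekday-scaledobject   # hypothetical name
  namespace: my-namespace
spec:
  scaleTargetRef:
    name: my-deployment
  minReplicaCount: 1                # replicas outside the time frame
  maxReplicaCount: 10               # same as desiredReplicas
  triggers:
  - type: cron
    metadata:
      timezone: Europe/London       # assumed IANA time zone
      start: "0 8 * * 1-5"          # 8:00 AM, Monday through Friday
      end: "0 18 * * 1-5"           # 6:00 PM, Monday through Friday
      desiredReplicas: "10"
```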

3.4.6. Understanding the Kubernetes workload trigger

You can scale pods based on the number of pods matching a specific label selector.

The Custom Metrics Autoscaler Operator tracks the number of pods with a specific label that are in the same namespace, then calculates a relation based on the number of labeled pods to the pods for the scaled object. Using this relation, the Custom Metrics Autoscaler Operator scales the object according to the scaling policy in the ScaledObject or ScaledJob specification.

The pod count includes pods with a Succeeded or Failed phase.

For example, if you have a frontend deployment and a backend deployment, you can use a kubernetes-workload trigger to scale the backend deployment based on the number of frontend pods. If the number of frontend pods goes up, the Operator scales the backend pods to maintain the specified ratio. In this example, if there are 10 pods with the app=frontend pod selector, the Operator scales the backend pods to 20 in order to maintain the 0.5 relation set in the scaled object, because 10 / 20 = 0.5.

Example scaled object with a Kubernetes workload trigger

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: workload-scaledobject
  namespace: my-namespace
spec:
  triggers:
  - type: kubernetes-workload 
    metadata:
      podSelector: 'app=frontend' 
      value: '0.5' 
      activationValue: '3.1'

1: Specifies a Kubernetes workload trigger.
2: Specifies one or more pod selectors and/or set-based selectors, separated with commas, to use to get the pod count.
3: Specifies the target relation between the scaled workload and the number of pods that match the selector. The relation is calculated using the following formula:

relation = (pods that match the selector) / (scaled workload pods)

4: Optional: Specifies the target value for the scaler activation phase. The default is 0.
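
The example in this section omits the scale target. A fuller sketch for the frontend and backend scenario might look like the following, where the backend deployment name and the replica bounds are assumptions for illustration.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: workload-scaledobject
  namespace: my-namespace          # the frontend pods are counted in this namespace
spec:
  scaleTargetRef:
    kind: Deployment
    name: backend                  # hypothetical deployment that is scaled
  minReplicaCount: 1               # assumed lower bound
  maxReplicaCount: 50              # assumed upper bound
  triggers:
  - type: kubernetes-workload
    metadata:
      podSelector: 'app=frontend'  # pods that drive the scaling
      value: '0.5'                 # target relation: matching pods / scaled workload pods
```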

3.5. Understanding custom metrics autoscaler trigger authentications

A trigger authentication allows you to include authentication information in a scaled object or a scaled job that can be used by the associated containers. You can use trigger authentications to pass OpenShift Container Platform secrets, platform-native pod authentication mechanisms, environment variables, and so on.

You define a TriggerAuthentication object in the same namespace as the object that you want to scale. That trigger authentication can be used only by objects in that namespace.

Alternatively, to share credentials between objects in multiple namespaces, you can create a ClusterTriggerAuthentication object that can be used across all namespaces.

Trigger authentications and cluster trigger authentications use the same configuration. However, a cluster trigger authentication requires an additional kind parameter in the authentication reference of the scaled object.
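
As a sketch of that additional parameter, the trigger in a scaled object might reference a cluster trigger authentication as follows. The trigger type and the authentication name are borrowed from examples in this document. The fragment is illustrative and is not a complete scaled object.

```yaml
spec:
# ...
  triggers:
  - type: prometheus
    metadata:
      # ... trigger parameters ...
    authenticationRef:
      name: bound-service-account-token-triggerauthentication
      kind: ClusterTriggerAuthentication   # omit this line for a namespaced TriggerAuthentication
```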

Example trigger authentication that uses a bound service account token

kind: TriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata:
  name: secret-triggerauthentication
  namespace: my-namespace 
spec:
  boundServiceAccountToken: 
    - parameter: bearerToken
      serviceAccountName: thanos

1: Specifies the namespace of the object you want to scale.
2: Specifies that this trigger authentication uses a bound service account token for authorization when connecting to the metrics endpoint.
3: Specifies the name of the service account to use.

Example cluster trigger authentication that uses a bound service account token

kind: ClusterTriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata:
  name: bound-service-account-token-triggerauthentication 
spec:
  boundServiceAccountToken: 
    - parameter: bearerToken
      serviceAccountName: thanos

1: Specifies the name of the cluster trigger authentication. A cluster trigger authentication can be used across all namespaces, so no namespace is specified.
2: Specifies that this cluster trigger authentication uses a bound service account token for authorization when connecting to the metrics endpoint.
3: Specifies the name of the service account to use.

Example trigger authentication that uses a secret for Basic authentication

kind: TriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata:
  name: secret-triggerauthentication
  namespace: my-namespace 
spec:
  secretTargetRef: 
  - parameter: username 
    name: my-basic-secret 
    key: username 
  - parameter: password
    name: my-basic-secret
    key: password

1: Specifies the namespace of the object you want to scale.
2: Specifies that this trigger authentication uses a secret for authorization when connecting to the metrics endpoint.
3: Specifies the authentication parameter to supply by using the secret.
4: Specifies the name of the secret to use. See the following example secret for Basic authentication.
5: Specifies the key in the secret to use with the specified parameter.

Example secret for Basic authentication

apiVersion: v1
kind: Secret
metadata:
  name: my-basic-secret
  namespace: my-namespace
data:
  username: "dXNlcm5hbWU=" 
  password: "cGFzc3dvcmQ="

1: User name and password to supply to the trigger authentication. The values in the data stanza must be base-64 encoded.
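
If you prefer not to encode the values by hand, a Kubernetes secret can also carry plain-text values in a stringData stanza, which the API server encodes when it stores the object. The following sketch is an alternative form of the example secret. The literal values username and password are placeholders that correspond to the encoded strings in the example, and the namespace is assumed to be the one that contains the trigger authentication.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-basic-secret
  namespace: my-namespace   # assumed: same namespace as the trigger authentication
stringData:
  username: username        # placeholder; stored as dXNlcm5hbWU=
  password: password        # placeholder; stored as cGFzc3dvcmQ=
```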

Example trigger authentication that uses a secret for CA details

kind: TriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata:
  name: secret-triggerauthentication
  namespace: my-namespace 
spec:
  secretTargetRef: 
    - parameter: key 
      name: my-secret 
      key: client-key.pem 
    - parameter: ca 
      name: my-secret 
      key: ca-cert.pem

1: Specifies the namespace of the object you want to scale.
2: Specifies that this trigger authentication uses a secret for authorization when connecting to the metrics endpoint.
3: Specifies the type of authentication to use.
4: Specifies the name of the secret to use.
5: Specifies the key in the secret to use with the specified parameter.
6: Specifies the authentication parameter for a custom CA when connecting to the metrics endpoint.
7: Specifies the name of the secret to use. See the following example secret with certificate authority (CA) details.
8: Specifies the key in the secret to use with the specified parameter.

Example secret with certificate authority (CA) details

apiVersion: v1
kind: Secret
metadata:
  name: my-secret
  namespace: my-namespace
data:
  ca-cert.pem: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0... 
  client-cert.pem: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0... 
  client-key.pem: LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0t...

1: Specifies the TLS CA Certificate for authentication of the metrics endpoint. The value must be base-64 encoded.
2: Specifies the TLS certificates and key for TLS client authentication. The values must be base-64 encoded.

Example trigger authentication that uses a bearer token

kind: TriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata:
  name: token-triggerauthentication
  namespace: my-namespace 
spec:
  secretTargetRef: 
  - parameter: bearerToken 
    name: my-secret 
    key: bearerToken

1: Specifies the namespace of the object you want to scale.
2: Specifies that this trigger authentication uses a secret for authorization when connecting to the metrics endpoint.
3: Specifies the type of authentication to use.
4: Specifies the name of the secret to use. See the following example secret for a bearer token.
5: Specifies the key in the secret to use with the specified parameter.

Example secret for a bearer token

apiVersion: v1
kind: Secret
metadata:
  name: my-secret
  namespace: my-namespace
data:
  bearerToken: "<bearer_token>"

1: Specifies a bearer token to use with bearer authentication. The value must be base-64 encoded.

Example trigger authentication that uses an environment variable

kind: TriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata:
  name: env-var-triggerauthentication
  namespace: my-namespace 
spec:
  env: 
  - parameter: access_key 
    name: ACCESS_KEY 
    containerName: my-container

1: Specifies the namespace of the object you want to scale.
2: Specifies that this trigger authentication uses environment variables for authorization when connecting to the metrics endpoint.
3: Specifies the parameter to set with this variable.
4: Specifies the name of the environment variable.
5: Optional: Specifies a container that requires authentication. The container must be in the same resource as referenced by scaleTargetRef in the scaled object.

Example trigger authentication that uses pod authentication providers

kind: TriggerAuthentication
apiVersion: keda.sh/v1alpha1
metadata:
  name: pod-id-triggerauthentication
  namespace: my-namespace 
spec:
  podIdentity: 
    provider: aws-eks

1: Specifies the namespace of the object you want to scale.
2: Specifies that this trigger authentication uses a platform-native pod authentication when connecting to the metrics endpoint.
3: Specifies a pod identity. Supported values are none, azure, gcp, aws-eks, or aws-kiam. The default is none.

3.5.1. Using trigger authentications

To use a trigger authentication or a cluster trigger authentication, create the authentication as a custom resource, then add a reference to it in a scaled object or scaled job.

Prerequisites

The Custom Metrics Autoscaler Operator must be installed.
If you are using a bound service account token, the service account must exist.

If you are using a bound service account token, a role-based access control (RBAC) object that enables the Custom Metrics Autoscaler Operator to request service account tokens from the service account must exist.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: keda-operator-token-creator
  namespace: <namespace_name> # <1>
rules:
- apiGroups:
  - ""
  resources:
  - serviceaccounts/token
  verbs:
  - create
  resourceNames:
  - thanos # <2>
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: keda-operator-token-creator-binding
  namespace: <namespace_name> # <3>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: keda-operator-token-creator
subjects:
- kind: ServiceAccount
  name: keda-operator
  namespace: openshift-keda

1: Specifies the namespace of the service account.
2: Specifies the name of the service account.
3: Specifies the namespace of the service account.

If you are using a secret, the Secret object must exist.

Procedure

Create the TriggerAuthentication or ClusterTriggerAuthentication object.
1. Create a YAML file that defines the object:
  Example trigger authentication with a bound service account token
  kind: TriggerAuthentication
  apiVersion: keda.sh/v1alpha1
  metadata:
    name: prom-triggerauthentication
    namespace: my-namespace # <1>
  spec:
    boundServiceAccountToken: # <2>
    - parameter: token
      serviceAccountName: thanos # <3>

  1: Specifies the namespace of the object you want to scale.
  2: Specifies that this trigger authentication uses a bound service account token for authorization when connecting to the metrics endpoint.
  3: Specifies the name of the service account to use.
2. Create the TriggerAuthentication object:
  $ oc create -f <filename>.yaml

Create or edit a ScaledObject YAML file that uses the trigger authentication:

Create a YAML file that defines the object:

Example scaled object with a trigger authentication

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scaledobject
  namespace: my-namespace
spec:
  scaleTargetRef:
    name: example-deployment
  maxReplicaCount: 100
  minReplicaCount: 0
  pollingInterval: 30
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: kedatest # replace <NAMESPACE>
      metricName: http_requests_total
      threshold: '5'
      query: sum(rate(http_requests_total{job="test-app"}[1m]))
      authModes: "basic"
    authenticationRef:
      name: prom-triggerauthentication # <1>
      kind: TriggerAuthentication # <2>

1: Specify the name of your trigger authentication object.
2: Specify TriggerAuthentication. TriggerAuthentication is the default.

Example scaled object with a cluster trigger authentication

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scaledobject
  namespace: my-namespace
spec:
  scaleTargetRef:
    name: example-deployment
  maxReplicaCount: 100
  minReplicaCount: 0
  pollingInterval: 30
  triggers:
  - type: prometheus
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: kedatest # replace <NAMESPACE>
      metricName: http_requests_total
      threshold: '5'
      query: sum(rate(http_requests_total{job="test-app"}[1m]))
      authModes: "basic"
    authenticationRef:
      name: prom-cluster-triggerauthentication # <1>
      kind: ClusterTriggerAuthentication # <2>

1: Specify the name of your trigger authentication object.
2: Specify ClusterTriggerAuthentication.

Create the scaled object by running the following command:
```
$ oc apply -f <filename>
```

3.6. Understanding how to add custom metrics autoscalers

To add a custom metrics autoscaler, create a ScaledObject custom resource for a deployment, stateful set, or custom resource. Create a ScaledJob custom resource for a job.

You can create only one scaled object for each workload that you want to scale. Also, you cannot use a scaled object and the horizontal pod autoscaler (HPA) on the same workload.

3.6.1. Adding a custom metrics autoscaler to a workload

You can create a custom metrics autoscaler for a workload that is created by a Deployment, StatefulSet, or custom resource object.

Prerequisites

The Custom Metrics Autoscaler Operator must be installed.

If you use a custom metrics autoscaler for scaling based on CPU or memory:

Your cluster administrator must have properly configured cluster metrics. You can use the oc describe PodMetrics <pod-name> command to determine if metrics are configured. If metrics are configured, the output appears similar to the following, with CPU and Memory displayed under Usage.

$ oc describe PodMetrics openshift-kube-scheduler-ip-10-0-135-131.ec2.internal

Example output

Name:         openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Namespace:    openshift-kube-scheduler
Labels:       <none>
Annotations:  <none>
API Version:  metrics.k8s.io/v1beta1
Containers:
  Name:  wait-for-host-port
  Usage:
    Memory:  0
  Name:      scheduler
  Usage:
    Cpu:     8m
    Memory:  45440Ki
Kind:        PodMetrics
Metadata:
  Creation Timestamp:  2019-05-23T18:47:56Z
  Self Link:           /apis/metrics.k8s.io/v1beta1/namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Timestamp:             2019-05-23T18:47:56Z
Window:                1m0s
Events:                <none>

The pods associated with the object you want to scale must include specified memory and CPU limits. For example:

Example pod spec

apiVersion: v1
kind: Pod
# ...
spec:
  containers:
  - name: app
    image: images.my-company.example/app:v4
    resources:
      limits:
        memory: "128Mi"
        cpu: "500m"
# ...

Procedure

Create a YAML file similar to the following. Only the name <2>, object name <4>, and object kind <5> are required:
Example scaled object
```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "0" # <1>
  name: scaledobject # <2>
  namespace: my-namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1 # <3>
    name: example-deployment # <4>
    kind: Deployment # <5>
    envSourceContainerName: .spec.template.spec.containers[0] # <6>
  cooldownPeriod: 200 # <7>
  maxReplicaCount: 100 # <8>
  minReplicaCount: 0 # <9>
  metricsServer: # <10>
    auditConfig:
      logFormat: "json"
      logOutputVolumeClaim: "persistentVolumeClaimName"
      policy:
        rules:
        - level: Metadata
        omitStages: "RequestReceived"
        omitManagedFields: false
      lifetime:
        maxAge: "2"
        maxBackup: "1"
        maxSize: "50"
  fallback: # <11>
    failureThreshold: 3
    replicas: 6
    behavior: static # <12>
  pollingInterval: 30 # <13>
  advanced:
    restoreToOriginalReplicaCount: false # <14>
    horizontalPodAutoscalerConfig:
      name: keda-hpa-scale-down # <15>
      behavior: # <16>
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 100
            periodSeconds: 15
  triggers:
  - type: prometheus # <17>
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: kedatest
      metricName: http_requests_total
      threshold: '5'
      query: sum(rate(http_requests_total{job="test-app"}[1m]))
      authModes: basic
    authenticationRef: # <18>
      name: prom-triggerauthentication
      kind: TriggerAuthentication
```
1: Optional: Specifies that the Custom Metrics Autoscaler Operator is to scale the replicas to the specified value and stop autoscaling, as described in the "Pausing the custom metrics autoscaler for a workload" section.
2: Specifies a name for this custom metrics autoscaler.
3: Optional: Specifies the API version of the target resource. The default is apps/v1.
4: Specifies the name of the object that you want to scale.
5: Specifies the kind as Deployment, StatefulSet, or CustomResource.
6: Optional: Specifies the name of the container in the target resource from which the custom metrics autoscaler gets environment variables holding secrets and so forth. The default is .spec.template.spec.containers[0].
7: Optional: Specifies the period in seconds to wait after the last trigger is reported before scaling the deployment back to 0 if the minReplicaCount is set to 0. The default is 300.
8: Optional: Specifies the maximum number of replicas when scaling up. The default is 100.
9: Optional: Specifies the minimum number of replicas when scaling down.
10: Optional: Specifies the parameters for audit logs, as described in the "Configuring audit logging" section.
11: Optional: Specifies the number of replicas to fall back to if a scaler fails to get metrics from the source for the number of times defined by the failureThreshold parameter. For more information on fallback behavior, see the KEDA documentation.
12: Optional: Specifies the replica count to be used if a fallback occurs. Enter one of the following options or omit the parameter:
  - Enter static to use the number of replicas specified by the fallback.replicas parameter. This is the default.
  - Enter currentReplicas to maintain the current number of replicas.
  - Enter currentReplicasIfHigher to maintain the current number of replicas, if that number is higher than the fallback.replicas parameter. If the current number of replicas is lower than the fallback.replicas parameter, use the fallback.replicas value.
  - Enter currentReplicasIfLower to maintain the current number of replicas, if that number is lower than the fallback.replicas parameter. If the current number of replicas is higher than the fallback.replicas parameter, use the fallback.replicas value.
13: Optional: Specifies the interval in seconds to check each trigger on. The default is 30.
14: Optional: Specifies whether to scale back the target resource to the original replica count after the scaled object is deleted. The default is false, which keeps the replica count as it is when the scaled object is deleted.
15: Optional: Specifies a name for the horizontal pod autoscaler. The default is keda-hpa-{scaled-object-name}.
16: Optional: Specifies a scaling policy to use to control the rate to scale pods up or down, as described in the "Scaling policies" section.
17: Specifies the trigger to use as the basis for scaling, as described in the "Understanding the custom metrics autoscaler triggers" section. This example uses OpenShift Container Platform monitoring.
18: Optional: Specifies a trigger authentication or a cluster trigger authentication. For more information, see Understanding the custom metrics autoscaler trigger authentication in the Additional resources section.
  - Enter TriggerAuthentication to use a trigger authentication. This is the default.
  - Enter ClusterTriggerAuthentication to use a cluster trigger authentication.
Create the custom metrics autoscaler by running the following command:
```
$ oc create -f <filename>.yaml
```

Verification

View the command output to verify that the custom metrics autoscaler was created:
```
$ oc get scaledobject <scaled_object_name>
```
Example output
```
NAME            SCALETARGETKIND      SCALETARGETNAME        MIN   MAX   TRIGGERS     AUTHENTICATION               READY   ACTIVE   FALLBACK   AGE
scaledobject    apps/v1.Deployment   example-deployment     0     50    prometheus   prom-triggerauthentication   True    True     True       17s
```
Note the following fields in the output:
- TRIGGERS: Indicates the trigger, or scaler, that is being used.
- AUTHENTICATION: Indicates the name of any trigger authentication being used.
- READY: Indicates whether the scaled object is ready to start scaling:
  - If True, the scaled object is ready.
  - If False, the scaled object is not ready because of a problem in one or more of the objects you created.
- ACTIVE: Indicates whether scaling is taking place:
  - If True, scaling is taking place.
  - If False, scaling is not taking place because there are no metrics or there is a problem in one or more of the objects you created.
- FALLBACK: Indicates whether the custom metrics autoscaler is able to get metrics from the source:
  - If False, the custom metrics autoscaler is getting metrics.
  - If True, the custom metrics autoscaler is not getting metrics because there are no metrics or there is a problem in one or more of the objects you created.
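
The fallback behavior options described for the scaled object in this section can be sketched as a small function. This only models the documented options and is not the autoscaler's own implementation; the function name and arguments are assumptions made for illustration.

```python
def fallback_replica_count(behavior: str, current: int, fallback_replicas: int) -> int:
    """Return the replica count to use during a fallback, per the documented options."""
    if behavior == "static":
        # Always use the value of fallback.replicas.
        return fallback_replicas
    if behavior == "currentReplicas":
        # Keep whatever is currently running.
        return current
    if behavior == "currentReplicasIfHigher":
        # Keep the current count only if it exceeds fallback.replicas.
        return max(current, fallback_replicas)
    if behavior == "currentReplicasIfLower":
        # Keep the current count only if it is below fallback.replicas.
        return min(current, fallback_replicas)
    raise ValueError(f"unknown fallback behavior: {behavior}")


# With fallback.replicas set to 6, as in the example scaled object:
for name in ("static", "currentReplicas", "currentReplicasIfHigher", "currentReplicasIfLower"):
    print(name, fallback_replica_count(name, current=10, fallback_replicas=6))
```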

3.6.2. Adding a custom metrics autoscaler to a job

You can create a custom metrics autoscaler for any Job object.

Important

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Prerequisites

The Custom Metrics Autoscaler Operator must be installed.

Procedure

Create a YAML file similar to the following:

kind: ScaledJob
apiVersion: keda.sh/v1alpha1
metadata:
  name: scaledjob
  namespace: my-namespace
spec:
  jobTargetRef:
    activeDeadlineSeconds: 600 # <1>
    backoffLimit: 6 # <2>
    parallelism: 1 # <3>
    completions: 1 # <4>
    template: # <5>
      metadata:
        name: pi
      spec:
        containers:
        - name: pi
          image: perl
          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
  maxReplicaCount: 100 # <6>
  pollingInterval: 30 # <7>
  successfulJobsHistoryLimit: 5 # <8>
  failedJobsHistoryLimit: 5 # <9>
  envSourceContainerName: # <10>
  rolloutStrategy: gradual # <11>
  scalingStrategy: # <12>
    strategy: "custom"
    customScalingQueueLengthDeduction: 1
    customScalingRunningJobPercentage: "0.5"
    pendingPodConditions:
      - "Ready"
      - "PodScheduled"
      - "AnyOtherCustomPodCondition"
    multipleScalersCalculation: "max"
  triggers:
  - type: prometheus # <13>
    metadata:
      serverAddress: https://thanos-querier.openshift-monitoring.svc.cluster.local:9092
      namespace: kedatest
      metricName: http_requests_total
      threshold: '5'
      query: sum(rate(http_requests_total{job="test-app"}[1m]))
      authModes: "bearer"
    authenticationRef: # <14>
      name: prom-cluster-triggerauthentication

1: Specifies the maximum duration the job can run.
2: Specifies the number of retries for a job. The default is 6.
3: Optional: Specifies how many pod replicas a job should run in parallel. The default is 1.
  - For non-parallel jobs, leave unset. When unset, the default is 1.
4: Optional: Specifies how many successful pod completions are needed to mark a job completed.
  - For non-parallel jobs, leave unset. When unset, the default is 1.
  - For parallel jobs with a fixed completion count, specify the number of completions.
  - For parallel jobs with a work queue, leave unset. When unset, the default is the value of the parallelism parameter.
5: Specifies the template for the pod the controller creates.
6: Optional: Specifies the maximum number of replicas when scaling up. The default is 100.
7: Optional: Specifies the interval in seconds to check each trigger on. The default is 30.
8: Optional: Specifies how many successfully finished jobs should be kept. The default is 100.
9: Optional: Specifies how many failed jobs should be kept. The default is 100.
10: Optional: Specifies the name of the container in the target resource from which the custom metrics autoscaler gets environment variables holding secrets and so forth. The default is .spec.template.spec.containers[0].
11: Optional: Specifies whether existing jobs are terminated whenever a scaled job is being updated:
  - default: The autoscaler terminates an existing job if its associated scaled job is updated. The autoscaler recreates the job with the latest specs.
  - gradual: The autoscaler does not terminate an existing job if its associated scaled job is updated. The autoscaler creates new jobs with the latest specs.
12: Optional: Specifies a scaling strategy: default, custom, or accurate. The default is default. For more information, see the link in the "Additional resources" section that follows.
13: Specifies the trigger to use as the basis for scaling, as described in the "Understanding the custom metrics autoscaler triggers" section.
14: Optional: Specifies a trigger authentication or a cluster trigger authentication. For more information, see Understanding the custom metrics autoscaler trigger authentication in the Additional resources section.
  - Enter TriggerAuthentication to use a trigger authentication. This is the default.
  - Enter ClusterTriggerAuthentication to use a cluster trigger authentication.

Create the custom metrics autoscaler by running the following command:
```
$ oc create -f <filename>.yaml
```

Verification

View the command output to verify that the custom metrics autoscaler was created:
```
$ oc get scaledjob <scaled_job_name>
```
Example output
```
NAME        MAX   TRIGGERS     AUTHENTICATION              READY   ACTIVE    AGE
scaledjob   100   prometheus   prom-triggerauthentication  True    True      8s
```
Note the following fields in the output:
- TRIGGERS: Indicates the trigger, or scaler, that is being used.
- AUTHENTICATION: Indicates the name of any trigger authentication being used.
- READY: Indicates whether the scaled job is ready to start scaling:
  - If True, the scaled job is ready.
  - If False, the scaled job is not ready because of a problem in one or more of the objects you created.
- ACTIVE: Indicates whether scaling is taking place:
  - If True, scaling is taking place.
  - If False, scaling is not taking place because there are no metrics or there is a problem in one or more of the objects you created.

3.7. Pausing the custom metrics autoscaler for a scaled object

You can pause and restart the autoscaling of a workload, as needed.

For example, you might want to pause autoscaling before performing cluster maintenance or to avoid resource starvation by removing non-mission-critical workloads.

3.7.1. Pausing a custom metrics autoscaler

You can pause the autoscaling of a scaled object by adding the autoscaling.keda.sh/paused-replicas annotation to the custom metrics autoscaler for that scaled object. The custom metrics autoscaler scales the replicas for that workload to the specified value and pauses autoscaling until the annotation is removed.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "4"
# ...

Procedure

Use the following command to edit the ScaledObject CR for your workload:
```
$ oc edit ScaledObject scaledobject
```

Add the autoscaling.keda.sh/paused-replicas annotation with any value:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "4" # <1>
  creationTimestamp: "2023-02-08T14:41:01Z"
  generation: 1
  name: scaledobject
  namespace: my-project
  resourceVersion: '65729'
  uid: f5aec682-acdf-4232-a783-58b5b82f5dd0

1: Specifies that the Custom Metrics Autoscaler Operator is to scale the replicas to the specified value and stop autoscaling.

3.7.2. Restarting the custom metrics autoscaler for a scaled object

You can restart a paused custom metrics autoscaler by removing the autoscaling.keda.sh/paused-replicas annotation for that ScaledObject.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "4"
# ...

Procedure

Use the following command to edit the ScaledObject CR for your workload:
```
$ oc edit ScaledObject scaledobject
```

Remove the autoscaling.keda.sh/paused-replicas annotation.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "4" # <1>
  creationTimestamp: "2023-02-08T14:41:01Z"
  generation: 1
  name: scaledobject
  namespace: my-project
  resourceVersion: '65729'
  uid: f5aec682-acdf-4232-a783-58b5b82f5dd0

1: Remove this annotation to restart a paused custom metrics autoscaler.
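
The procedures in this section use oc edit. As an alternative sketch, the same annotation change can be expressed as a JSON merge patch, in which a null value removes a key. The helper names below are hypothetical, and applying the patch with oc patch and the --type=merge option is an assumption about your workflow rather than a step from this procedure.

```python
import json

ANNOTATION = "autoscaling.keda.sh/paused-replicas"


def pause_patch(replicas: int) -> str:
    """Build a JSON merge patch that sets the paused-replicas annotation."""
    # Annotation values are strings, so the replica count is quoted.
    return json.dumps({"metadata": {"annotations": {ANNOTATION: str(replicas)}}})


def resume_patch() -> str:
    """Build a JSON merge patch that removes the paused-replicas annotation."""
    # In a JSON merge patch, null deletes the key.
    return json.dumps({"metadata": {"annotations": {ANNOTATION: None}}})


print(pause_patch(4))
print(resume_patch())
```

Either string could then be passed as the patch body, for example with oc patch scaledobject scaledobject --type=merge -p '<patch>'.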

3.8. Gathering audit logs

You can gather audit logs, which are a security-relevant chronological set of records documenting the sequence of activities that have affected the system by individual users, administrators, or other components of the system.

For example, audit logs can help you understand where an autoscaling request is coming from. This is key information when backends are getting overloaded by autoscaling requests made by user applications and you need to determine which is the troublesome application.

3.8.1. Configuring audit logging

You can configure auditing for the Custom Metrics Autoscaler Operator by editing the KedaController custom resource. The logs are sent to an audit log file on a volume that is secured by using a persistent volume claim in the KedaController CR.

Prerequisites

The Custom Metrics Autoscaler Operator must be installed.

Procedure

Edit the KedaController custom resource to add the auditConfig stanza:
```
kind: KedaController
apiVersion: keda.sh/v1alpha1
metadata:
  name: keda
  namespace: openshift-keda
spec:
# ...
  metricsServer:
# ...
    auditConfig:
      logFormat: "json" # <1>
      logOutputVolumeClaim: "pvc-audit-log" # <2>
      policy:
        rules: # <3>
        - level: Metadata
        omitStages: "RequestReceived" # <4>
        omitManagedFields: false # <5>
      lifetime: # <6>
        maxAge: "2"
        maxBackup: "1"
        maxSize: "50"
```
1: Specifies the output format of the audit log, either legacy or json.
2: Specifies an existing persistent volume claim for storing the log data. All requests coming to the API server are logged to this persistent volume claim. If you leave this field empty, the log data is sent to stdout.
3: Specifies which events should be recorded and what data they should include:
  - None: Do not log events.
  - Metadata: Log only the metadata for the request, such as user, timestamp, and so forth. Do not log the request text and the response text. This is the default.
  - Request: Log only the metadata and the request text but not the response text. This option does not apply for non-resource requests.
  - RequestResponse: Log event metadata, request text, and response text. This option does not apply for non-resource requests.
4: Specifies stages for which no event is created.
5: Specifies whether to omit the managed fields of the request and response bodies from being written to the API audit log, either true to omit the fields or false to include the fields.
6: Specifies the size and lifespan of the audit logs.
  - maxAge: The maximum number of days to retain audit log files, based on the timestamp encoded in their filename.
  - maxBackup: The maximum number of audit log files to retain. Set to 0 to retain all audit log files.
  - maxSize: The maximum size in megabytes of an audit log file before it gets rotated.
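
When logFormat is set to json, each audit entry is a single-line JSON object, so the entries can be filtered by their level field rather than by plain text matching. The following is a minimal sketch under that assumption; the function name and the shortened sample lines are illustrative and are not taken from a real cluster.

```python
import json
from typing import Iterable, Iterator


def filter_audit_events(lines: Iterable[str], level: str) -> Iterator[dict]:
    """Yield parsed audit events whose "level" field equals the given level."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip log lines that are not JSON objects
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if event.get("level") == level:
            yield event


# Shortened sample lines modeled on the audit event format; values are illustrative.
sample = [
    "some non-JSON log line",
    '{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata",'
    '"stage":"ResponseComplete","requestURI":"/healthz","verb":"get",'
    '"user":{"username":"system:anonymous"}}',
]

for event in filter_audit_events(sample, "Metadata"):
    print(event["verb"], event["requestURI"], event["user"]["username"])
```

Reading the user and requestURI fields this way is one approach to finding which application is issuing autoscaling requests.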

Verification

View the audit log file directly:

Obtain the name of the keda-metrics-apiserver-* pod:

$ oc get pod -n openshift-keda

Example output

NAME                                                  READY   STATUS    RESTARTS   AGE
custom-metrics-autoscaler-operator-5cb44cd75d-9v4lv   1/1     Running   0          8m20s
keda-metrics-apiserver-65c7cc44fd-rrl4r               1/1     Running   0          2m55s
keda-operator-776cbb6768-zpj5b                        1/1     Running   0          2m55s

View the log data by using a command similar to the following:

$ oc logs keda-metrics-apiserver-<hash> -n openshift-keda | grep -i metadata

Optional: You can use the grep command to specify the log level to display: Metadata, Request, or RequestResponse.

For example:

$ oc logs keda-metrics-apiserver-65c7cc44fd-rrl4r -n openshift-keda | grep -i metadata

Example output

 ...
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"4c81d41b-3dab-4675-90ce-20b87ce24013","stage":"ResponseComplete","requestURI":"/healthz","verb":"get","user":{"username":"system:anonymous","groups":["system:unauthenticated"]},"sourceIPs":["10.131.0.1"],"userAgent":"kube-probe/1.26","responseStatus":{"metadata":{},"code":200},"requestReceivedTimestamp":"2023-02-16T13:00:03.554567Z","stageTimestamp":"2023-02-16T13:00:03.555032Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
 ...

Alternatively, you can view a specific log:

Use a command similar to the following to log in to the keda-metrics-apiserver-* pod:

$ oc rsh pod/keda-metrics-apiserver-<hash> -n openshift-keda

For example:

$ oc rsh pod/keda-metrics-apiserver-65c7cc44fd-rrl4r -n openshift-keda

Change to the /var/audit-policy/ directory:
```
sh-4.4$ cd /var/audit-policy/
```
List the available logs:
```
sh-4.4$ ls
```
Example output
```
log-2023.02.17-14:50  policy.yaml
```

View the log, as needed:

sh-4.4$ cat <log_name>/<pvc_name>|grep -i <log_level>

Optional: You can use the grep command to specify the log level to display: Metadata, Request, or RequestResponse.

For example:

sh-4.4$ cat log-2023.02.17-14:50/pvc-audit-log|grep -i Request

Example output

 ...
{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Request","auditID":"63e7f68c-04ec-4f4d-8749-bf1656572a41","stage":"ResponseComplete","requestURI":"/openapi/v2","verb":"get","user":{"username":"system:aggregator","groups":["system:authenticated"]},"sourceIPs":["10.128.0.1"],"responseStatus":{"metadata":{},"code":304},"requestReceivedTimestamp":"2023-02-17T13:12:55.035478Z","stageTimestamp":"2023-02-17T13:12:55.038346Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"system:discovery\" of ClusterRole \"system:discovery\" to Group \"system:authenticated\""}}
 ...
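
Because grep -i matches the search string anywhere in a line, a pattern such as metadata also matches events at other levels that merely contain a metadata key, as both example outputs above do. As a minimal sketch, assuming that each audit log line is a single JSON event shaped like those examples, the following Python snippet filters on the level field instead. The function name and the shortened sample lines are illustrative and are not part of the product.

```python
import json


def filter_audit_events(lines, level):
    """Return parsed audit events whose "level" field equals the given level."""
    events = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and non-JSON log output
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if event.get("kind") == "Event" and event.get("level") == level:
            events.append(event)
    return events


# Shortened, hypothetical lines modeled on the example output above.
sample = [
    '{"kind":"Event","level":"Metadata","stage":"ResponseComplete",'
    '"requestURI":"/healthz","verb":"get","responseStatus":{"metadata":{},"code":200}}',
    '{"kind":"Event","level":"Request","stage":"ResponseComplete",'
    '"requestURI":"/openapi/v2","verb":"get","responseStatus":{"metadata":{},"code":304}}',
    "not a JSON line",
]

for event in filter_audit_events(sample, "Request"):
    print(event["requestURI"], event["responseStatus"]["code"])
```

Both sample lines contain the string metadata, so a plain grep -i metadata would return both, whereas the filter returns only the event whose level is Request.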

3.9. Gathering debugging data

When opening a support case, it is helpful to provide debugging information about your cluster to Red Hat Support.

To help troubleshoot your issue, provide the following information:

Data gathered using the must-gather tool.
The unique cluster ID.

You can use the must-gather tool to collect data about the Custom Metrics Autoscaler Operator and its components, including the following items:

The openshift-keda namespace and its child objects.
The Custom Metrics Autoscaler Operator installation objects.
The Custom Metrics Autoscaler Operator CRD objects.

3.9.1. Gathering debugging data

The following command runs the must-gather tool for the Custom Metrics Autoscaler Operator:

$ oc adm must-gather --image="$(oc get packagemanifests openshift-custom-metrics-autoscaler-operator \
-n openshift-marketplace \
-o jsonpath='{.status.channels[?(@.name=="stable")].currentCSVDesc.annotations.containerImage}')"

Note

The standard OpenShift Container Platform must-gather command, oc adm must-gather, does not collect Custom Metrics Autoscaler Operator data.

Prerequisites

Access to the cluster as a user with the cluster-admin role.
The OpenShift Container Platform CLI (oc) installed.

Procedure

Navigate to the directory where you want to store the must-gather data.
Note
If your cluster is using a restricted network, you must take additional steps. If your mirror registry has a trusted CA, you must first add the trusted CA to the cluster. For all clusters on restricted networks, you must import the default must-gather image as an image stream by running the following command.
$ oc import-image is/must-gather -n openshift

Perform one of the following:

To get only the Custom Metrics Autoscaler Operator must-gather data, use the following command:

$ oc adm must-gather --image="$(oc get packagemanifests openshift-custom-metrics-autoscaler-operator \
-n openshift-marketplace \
-o jsonpath='{.status.channels[?(@.name=="stable")].currentCSVDesc.annotations.containerImage}')"

The custom image for the must-gather command is pulled directly from the Operator package manifests, so that it works on any cluster where the Custom Metrics Autoscaler Operator is available.

To gather the default must-gather data in addition to the Custom Metrics Autoscaler Operator information:

Use the following command to obtain the Custom Metrics Autoscaler Operator image and set it as an environment variable:

$ IMAGE="$(oc get packagemanifests openshift-custom-metrics-autoscaler-operator \
  -n openshift-marketplace \
  -o jsonpath='{.status.channels[?(@.name=="stable")].currentCSVDesc.annotations.containerImage}')"

Run the oc adm must-gather command with the Custom Metrics Autoscaler Operator image:

$ oc adm must-gather --image-stream=openshift/must-gather --image=${IMAGE}

Example 3.1. Example must-gather output for the Custom Metrics Autoscaler

└── openshift-keda
    ├── apps
    │   ├── daemonsets.yaml
    │   ├── deployments.yaml
    │   ├── replicasets.yaml
    │   └── statefulsets.yaml
    ├── apps.openshift.io
    │   └── deploymentconfigs.yaml
    ├── autoscaling
    │   └── horizontalpodautoscalers.yaml
    ├── batch
    │   ├── cronjobs.yaml
    │   └── jobs.yaml
    ├── build.openshift.io
    │   ├── buildconfigs.yaml
    │   └── builds.yaml
    ├── core
    │   ├── configmaps.yaml
    │   ├── endpoints.yaml
    │   ├── events.yaml
    │   ├── persistentvolumeclaims.yaml
    │   ├── pods.yaml
    │   ├── replicationcontrollers.yaml
    │   ├── secrets.yaml
    │   └── services.yaml
    ├── discovery.k8s.io
    │   └── endpointslices.yaml
    ├── image.openshift.io
    │   └── imagestreams.yaml
    ├── k8s.ovn.org
    │   ├── egressfirewalls.yaml
    │   └── egressqoses.yaml
    ├── keda.sh
    │   ├── kedacontrollers
    │   │   └── keda.yaml
    │   ├── scaledobjects
    │   │   └── example-scaledobject.yaml
    │   └── triggerauthentications
    │       └── example-triggerauthentication.yaml
    ├── monitoring.coreos.com
    │   └── servicemonitors.yaml
    ├── networking.k8s.io
    │   └── networkpolicies.yaml
    ├── openshift-keda.yaml
    ├── pods
    │   ├── custom-metrics-autoscaler-operator-58bd9f458-ptgwx
    │   │   ├── custom-metrics-autoscaler-operator
    │   │   │   └── custom-metrics-autoscaler-operator
    │   │   │       └── logs
    │   │   │           ├── current.log
    │   │   │           ├── previous.insecure.log
    │   │   │           └── previous.log
    │   │   └── custom-metrics-autoscaler-operator-58bd9f458-ptgwx.yaml
    │   ├── custom-metrics-autoscaler-operator-58bd9f458-thbsh
    │   │   └── custom-metrics-autoscaler-operator
    │   │       └── custom-metrics-autoscaler-operator
    │   │           └── logs
    │   ├── keda-metrics-apiserver-65c7cc44fd-6wq4g
    │   │   ├── keda-metrics-apiserver
    │   │   │   └── keda-metrics-apiserver
    │   │   │       └── logs
    │   │   │           ├── current.log
    │   │   │           ├── previous.insecure.log
    │   │   │           └── previous.log
    │   │   └── keda-metrics-apiserver-65c7cc44fd-6wq4g.yaml
    │   └── keda-operator-776cbb6768-fb6m5
    │       ├── keda-operator
    │       │   └── keda-operator
    │       │       └── logs
    │       │           ├── current.log
    │       │           ├── previous.insecure.log
    │       │           └── previous.log
    │       └── keda-operator-776cbb6768-fb6m5.yaml
    ├── policy
    │   └── poddisruptionbudgets.yaml
    └── route.openshift.io
        └── routes.yaml

Create a compressed file from the must-gather directory that was created in your working directory. For example, on a computer that uses a Linux operating system, run the following command:
```
$ tar cvaf must-gather.tar.gz must-gather.local.5421342344627712289/
```
Replace must-gather.local.5421342344627712289/ with the actual directory name.
Attach the compressed file to your support case on the Red Hat Customer Portal.

3.10. Viewing Operator metrics

The Custom Metrics Autoscaler Operator exposes ready-to-use metrics that it pulls from the on-cluster monitoring component. You can query the metrics by using the Prometheus Query Language (PromQL) to analyze and diagnose issues. All metrics are reset when the controller pod restarts.

3.10.1. Accessing performance metrics

You can access the metrics and run queries by using the OpenShift Container Platform web console.

Procedure

Select the Administrator perspective in the OpenShift Container Platform web console.
Select Observe → Metrics.
To create a custom query, add your PromQL query to the Expression field.
To add multiple queries, select Add Query.

3.10.1.1. Provided Operator metrics

The Custom Metrics Autoscaler Operator exposes the following metrics, which you can view by using the OpenShift Container Platform web console.

Table 3.1. Custom Metrics Autoscaler Operator metrics
Metric name	Description
`keda_scaler_activity`	Whether the particular scaler is active or inactive. A value of `1` indicates the scaler is active; a value of `0` indicates the scaler is inactive.
`keda_scaler_metrics_value`	The current value for each scaler’s metric, which is used by the Horizontal Pod Autoscaler (HPA) in computing the target average.
`keda_scaler_metrics_latency`	The latency of retrieving the current metric from each scaler.
`keda_scaler_errors`	The number of errors that have occurred for each scaler.
`keda_scaler_errors_total`	The total number of errors encountered for all scalers.
`keda_scaled_object_errors`	The number of errors that have occurred for each scaled object.
`keda_resource_totals`	The total number of Custom Metrics Autoscaler custom resources in each namespace for each custom resource type.
`keda_trigger_totals`	The total number of triggers by trigger type.

Custom Metrics Autoscaler Admission webhook metrics

The Custom Metrics Autoscaler Admission webhook also exposes the following Prometheus metrics.

Metric name	Description
`keda_scaled_object_validation_total`	The number of scaled object validations.
`keda_scaled_object_validation_errors`	The number of validation errors.

3.11. Removing the Custom Metrics Autoscaler Operator

You can remove the custom metrics autoscaler from your OpenShift Container Platform cluster. After removing the Custom Metrics Autoscaler Operator, remove other components associated with the Operator to avoid potential issues.

Note

Delete the KedaController custom resource (CR) first. If you do not delete the KedaController CR, OpenShift Container Platform can hang when you delete the openshift-keda project. If you delete the Custom Metrics Autoscaler Operator before deleting the CR, you are not able to delete the CR.

3.11.1. Uninstalling the Custom Metrics Autoscaler Operator

Use the following procedure to remove the custom metrics autoscaler from your OpenShift Container Platform cluster.

Prerequisites

The Custom Metrics Autoscaler Operator must be installed.

Procedure

In the OpenShift Container Platform web console, click Operators → Installed Operators.
Switch to the openshift-keda project.
Remove the KedaController custom resource.
1. Find the CustomMetricsAutoscaler Operator and click the KedaController tab.
2. Find the custom resource, and then click Delete KedaController.
3. Click Uninstall.
Remove the Custom Metrics Autoscaler Operator:
1. Click Operators → Installed Operators.
2. Find the CustomMetricsAutoscaler Operator and click the Options menu and select Uninstall Operator.
3. Click Uninstall.
Optional: Use the OpenShift CLI to remove the custom metrics autoscaler components:
1. Delete the custom metrics autoscaler CRDs:
  - clustertriggerauthentications.keda.sh
  - kedacontrollers.keda.sh
  - scaledjobs.keda.sh
  - scaledobjects.keda.sh
  - triggerauthentications.keda.sh
  $ oc delete crd clustertriggerauthentications.keda.sh kedacontrollers.keda.sh scaledjobs.keda.sh scaledobjects.keda.sh triggerauthentications.keda.sh
  Deleting the CRDs removes the associated roles, cluster roles, and role bindings. However, there might be a few cluster roles that must be manually deleted.
2. List any custom metrics autoscaler cluster roles:
  $ oc get clusterrole | grep keda.sh
3. Delete the listed custom metrics autoscaler cluster roles. For example:
  $ oc delete clusterrole keda.sh-v1alpha1-admin
4. List any custom metrics autoscaler cluster role bindings:
  $ oc get clusterrolebinding | grep keda.sh
5. Delete the listed custom metrics autoscaler cluster role bindings. For example:
  $ oc delete clusterrolebinding keda.sh-v1alpha1-admin
Delete the custom metrics autoscaler project:
```
$ oc delete project openshift-keda
```

Delete the Custom Metrics Autoscaler Operator:

$ oc delete operator/openshift-custom-metrics-autoscaler-operator.openshift-keda

Chapter 4. Controlling pod placement onto nodes (scheduling)

4.1. Controlling pod placement using the scheduler

Pod scheduling is an internal process that determines placement of new pods onto nodes within the cluster.

The scheduler watches for new pods as they are created and identifies the most suitable node to host each one. It then creates pod-to-node bindings for the pods by using the master API.

Default pod scheduling

OpenShift Container Platform comes with a default scheduler that serves the needs of most users. The default scheduler uses both inherent and customization tools to determine the best fit for a pod.

Advanced pod scheduling

In situations where you might want more control over where new pods are placed, the OpenShift Container Platform advanced scheduling features allow you to configure a pod so that the pod is required or has a preference to run on a particular node or alongside a specific pod.

You can control pod placement by using the following scheduling features:

Scheduler profiles
Pod affinity and anti-affinity rules
Node affinity
Node selectors
Taints and tolerations
Node overcommitment

4.1.1. About the default scheduler

The default OpenShift Container Platform pod scheduler is responsible for determining the placement of new pods onto nodes within the cluster. It reads data from the pod and finds a node that is a good fit based on configured profiles. It is completely independent and exists as a standalone solution. It does not modify the pod; it creates a binding for the pod that ties the pod to the particular node.

4.1.1.1. Understanding default scheduling

The existing generic scheduler is the default platform-provided scheduler engine that selects a node to host the pod in a three-step operation:

Filters the nodes: The available nodes are filtered based on the constraints or requirements specified. This is done by running each node through the list of filter functions called predicates, or filters.
Prioritizes the filtered list of nodes: Each node is passed through a series of priority, or scoring, functions that assign it a score from 0 to 10, with 0 indicating a bad fit and 10 indicating a good fit to host the pod. The scheduler configuration can also take a simple weight (a positive numeric value) for each scoring function. The score from each scoring function is multiplied by its weight (the default weight for most scores is 1), and the weighted scores are added together to give the total score for the node. Administrators can use the weight attribute to give higher importance to some scores.
Selects the best fit node: The nodes are sorted based on their scores and the node with the highest score is selected to host the pod. If multiple nodes have the same high score, then one of them is selected at random.
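
The three steps above can be sketched in a few lines of Python. This is an illustration of the filter, weighted-score, and select flow only, not the scheduler's actual code: the node fields (free_cpu, cpu, ssd), the filter, and the two scoring functions are hypothetical.

```python
import random


def schedule(pod, nodes, filters, scorers):
    """Filter the nodes, add up the weighted scores, and pick the best node."""
    # Step 1: keep only the nodes that pass every filter.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # no node fits, so the pod stays unscheduled

    # Step 2: total score = sum of (weight * score) over all scoring functions.
    totals = {
        n["name"]: sum(weight * score(pod, n) for score, weight in scorers)
        for n in feasible
    }

    # Step 3: highest total wins; ties are broken at random.
    best = max(totals.values())
    return random.choice([name for name, t in totals.items() if t == best])


# Hypothetical filter and scoring functions, each score in the range 0 to 10.
def has_enough_cpu(pod, node):
    return node["free_cpu"] >= pod["cpu"]

def free_cpu_score(pod, node):
    return int(10 * node["free_cpu"] / node["cpu"])

def ssd_score(pod, node):
    return 10 if node["ssd"] else 0


nodes = [
    {"name": "node-a", "free_cpu": 1, "cpu": 4, "ssd": True},
    {"name": "node-b", "free_cpu": 3, "cpu": 4, "ssd": False},
    {"name": "node-c", "free_cpu": 2, "cpu": 4, "ssd": True},
]
pod = {"cpu": 2}
scorers = [(free_cpu_score, 1), (ssd_score, 2)]  # SSD counts double

print(schedule(pod, nodes, [has_enough_cpu], scorers))  # node-c
```

Here node-a is filtered out, node-b totals 7, and node-c totals 5 + 2 × 10 = 25, so the weight on the SSD score decides the placement.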

4.1.2. Scheduler use cases

One of the important use cases for scheduling within OpenShift Container Platform is to support flexible affinity and anti-affinity policies.

4.1.2.1. Infrastructure topological levels

Administrators can define multiple topological levels for their infrastructure (nodes) by specifying labels on nodes. For example: region=r1, zone=z1, rack=s1.

These label names have no particular meaning and administrators are free to name their infrastructure levels anything, such as city/building/room. Also, administrators can define any number of levels for their infrastructure topology, with three levels usually being adequate (such as: regions → zones → racks). Administrators can specify affinity and anti-affinity rules at each of these levels in any combination.

4.1.2.2. Affinity

Administrators should be able to configure the scheduler to specify affinity at any topological level, or even at multiple levels. Affinity at a particular level indicates that all pods that belong to the same service are scheduled onto nodes that belong to the same level. This handles any latency requirements of applications by allowing administrators to ensure that peer pods do not end up being too geographically separated. If no node is available within the same affinity group to host the pod, then the pod is not scheduled.

If you need greater control over where the pods are scheduled, see Controlling pod placement on nodes using node affinity rules and Placing pods relative to other pods using affinity and anti-affinity rules.

These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.

4.1.2.3. Anti-affinity

Administrators should be able to configure the scheduler to specify anti-affinity at any topological level, or even at multiple levels. Anti-affinity (or 'spread') at a particular level indicates that all pods that belong to the same service are spread across nodes that belong to that level. This ensures that the application is well spread for high availability purposes. The scheduler tries to balance the service pods across all applicable nodes as evenly as possible.

These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.

4.2. Scheduling pods using a scheduler profile

You can configure OpenShift Container Platform to use a scheduling profile to schedule pods onto nodes within the cluster.

4.2.1. About scheduler profiles

You can specify a scheduler profile to control how pods are scheduled onto nodes.

The following scheduler profiles are available:

LowNodeUtilization: This profile attempts to spread pods evenly across nodes to get low resource usage per node. This profile provides the default scheduler behavior.
HighNodeUtilization: This profile attempts to place as many pods as possible on to as few nodes as possible. This minimizes node count and has high resource usage per node.

Note

Switching to the HighNodeUtilization scheduler profile will result in all pods of a ReplicaSet object being scheduled on the same node. This will add an increased risk for pod failure if the node fails.

NoScoring: This is a low-latency profile that strives for the quickest scheduling cycle by disabling all score plugins. This might sacrifice better scheduling decisions for faster ones.
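
To make the difference between the profiles concrete, the following Python sketch places one pod under each profile. It is a simplified model under stated assumptions, not the scheduler's implementation: it assumes that LowNodeUtilization favors the feasible node that ends up least utilized, that HighNodeUtilization favors the feasible node that ends up most utilized, and that NoScoring takes the first feasible node. The function name and the node data are hypothetical.

```python
def pick_node(nodes, pod_cpu, profile):
    """Choose a node for a pod; nodes maps name -> (used_cpu, total_cpu)."""
    # Only nodes with enough free CPU are candidates.
    feasible = {
        name: (used, total)
        for name, (used, total) in nodes.items()
        if total - used >= pod_cpu
    }
    if not feasible:
        return None

    def utilization_after(name):
        used, total = feasible[name]
        return (used + pod_cpu) / total

    if profile == "LowNodeUtilization":
        return min(feasible, key=utilization_after)  # spread pods out
    if profile == "HighNodeUtilization":
        return max(feasible, key=utilization_after)  # pack pods together
    if profile == "NoScoring":
        return next(iter(feasible))  # no scoring: first node that fits
    raise ValueError(f"unknown profile: {profile}")


nodes = {"node-a": (1, 8), "node-b": (6, 8), "node-c": (8, 8)}

print(pick_node(nodes, 1, "LowNodeUtilization"))   # node-a
print(pick_node(nodes, 1, "HighNodeUtilization"))  # node-b
```

In this model node-c is full and never a candidate, node-a is the emptiest node that fits, and node-b is the fullest node that still fits.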

4.2.2. Configuring a scheduler profile

You can configure the scheduler to use a scheduler profile.

Prerequisites

Access to the cluster as a user with the cluster-admin role.

Procedure

Edit the Scheduler object:
```
$ oc edit scheduler cluster
```

Specify the profile to use in the spec.profile field:

apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
#...
spec:
  mastersSchedulable: false
  profile: HighNodeUtilization # 1
#...

1: Set to LowNodeUtilization, HighNodeUtilization, or NoScoring.

Save the file to apply the changes.

4.3. Placing pods relative to other pods using affinity and anti-affinity rules

Affinity is a property of pods that controls the nodes on which they prefer to be scheduled. Anti-affinity is a property of pods that prevents a pod from being scheduled on a node.

In OpenShift Container Platform, pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key-value labels on other pods.

4.3.1. Understanding pod affinity

Pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key/value labels on other pods.

Pod affinity can tell the scheduler to locate a new pod on the same node as other pods if the label selector on the new pod matches the label on the current pod.
Pod anti-affinity can prevent the scheduler from locating a new pod on the same node as pods with the same labels if the label selector on the new pod matches the label on the current pod.

For example, using affinity rules, you could spread or pack pods within a service or relative to pods in other services. Anti-affinity rules allow you to prevent pods of a particular service from scheduling on the same nodes as pods of another service that are known to interfere with the performance of the pods of the first service. Or, you could spread the pods of a service across nodes, availability zones, or availability sets to reduce correlated failures.

Note

A label selector might match pods from multiple deployments. Use unique combinations of labels when configuring anti-affinity rules to avoid matching pods from unrelated deployments.

There are two types of pod affinity rules: required and preferred.

Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

Note

Depending on your pod priority and preemption settings, the scheduler might not be able to find an appropriate node for a pod without violating affinity requirements. If so, a pod might not be scheduled.

To prevent this situation, carefully configure pod affinity with equal-priority pods.

You configure pod affinity and anti-affinity through the Pod spec files. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, and then the scheduler attempts to meet the preferred rule.

The following example shows a Pod spec configured for pod affinity and anti-affinity.

In this example, the pod affinity rule indicates that the pod can schedule onto a node only if that node has at least one already-running pod with a label that has the key security and value S1. The pod anti-affinity rule says that the pod prefers to not schedule onto a node if that node is already running a pod with label having key security and value S2.

Sample Pod config file with pod affinity

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity: # 1
      requiredDuringSchedulingIgnoredDuringExecution: # 2
      - labelSelector:
          matchExpressions:
          - key: security # 3
            operator: In # 4
            values:
            - S1 # 5
        topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: docker.io/ocpqe/hello-pod

1: Stanza to configure pod affinity.
2: Defines a required rule.
3 5: The key and value (label) that must be matched to apply the rule.
4: The operator represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.
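
As a rough illustration of how the four operators relate an existing pod's labels to a matchExpressions list, consider the following Python sketch. It assumes standard label-selector semantics, in which NotIn also matches a pod that lacks the key. The helper names and the pod data are hypothetical, and the zone field stands in for the topology domain named by topologyKey.

```python
def matches(labels, expressions):
    """Return True if the labels satisfy every match expression."""
    for expr in expressions:
        key, op = expr["key"], expr["operator"]
        values = expr.get("values", [])
        if op == "In":
            ok = key in labels and labels[key] in values
        elif op == "NotIn":
            ok = key not in labels or labels[key] not in values
        elif op == "Exists":
            ok = key in labels
        elif op == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


def zones_satisfying_affinity(existing_pods, expressions):
    """Zones that already run at least one pod matching the selector."""
    return sorted(
        {p["zone"] for p in existing_pods if matches(p["labels"], expressions)}
    )


existing_pods = [
    {"name": "security-s1", "labels": {"security": "S1"}, "zone": "zone-a"},
    {"name": "security-s2", "labels": {"security": "S2"}, "zone": "zone-b"},
    {"name": "unlabeled", "labels": {}, "zone": "zone-c"},
]
rule = [{"key": "security", "operator": "In", "values": ["S1"]}]

print(zones_satisfying_affinity(existing_pods, rule))  # ['zone-a']
```

With the required rule from the sample file, only zone-a qualifies, because it is the only zone that runs a pod labeled security=S1.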

Sample Pod config file with pod anti-affinity

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-antiaffinity
spec:
  affinity:
    podAntiAffinity: # 1
      preferredDuringSchedulingIgnoredDuringExecution: # 2
      - weight: 100 # 3
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security # 4
              operator: In # 5
              values:
              - S2
          topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: docker.io/ocpqe/hello-pod

1: Stanza to configure pod anti-affinity.
2: Defines a preferred rule.
3: Specifies a weight for a preferred rule. The node with the highest weight is preferred.
4: Description of the pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.
5: The operator represents the relationship between the label on the existing pod and the set of values in the matchExpressions parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.
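
Both kinds of rule can appear in the same Pod spec. The following sketch combines the required pod affinity rule and the preferred pod anti-affinity rule from the two samples above. The pod and container names are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity-and-antiaffinity # hypothetical name
spec:
  affinity:
    podAffinity: # required: same zone as a pod labeled security=S1
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: topology.kubernetes.io/zone
    podAntiAffinity: # preferred: avoid nodes that run a pod labeled security=S2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: kubernetes.io/hostname
  containers:
  - name: hello-pod # hypothetical name
    image: docker.io/ocpqe/hello-pod
```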

Note

If labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod continues to run on the node.

4.3.2. Configuring a pod affinity rule

The following steps demonstrate a simple two-pod configuration that creates a pod with a label and a pod that uses affinity to allow scheduling with that pod.

Note

You cannot add an affinity directly to a scheduled pod.

Procedure

Create a pod with a specific label in the pod spec:

Create a YAML file with the following content:

apiVersion: v1
kind: Pod
metadata:
  name: security-s1
  labels:
    security: S1
spec:
  containers:
  - name: security-s1
    image: docker.io/ocpqe/hello-pod

Create the pod.
```
$ oc create -f <pod-spec>.yaml
```

When creating other pods, configure the following parameters to add the affinity:

Create a YAML file with the following content:

apiVersion: v1
kind: Pod
metadata:
  name: security-s1-east
#...
spec:
  affinity: # 1
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: # 2
      - labelSelector:
          matchExpressions:
          - key: security # 3
            values:
            - S1
            operator: In # 4
        topologyKey: topology.kubernetes.io/zone # 5
#...

1: Adds a pod affinity.
2: Configures the requiredDuringSchedulingIgnoredDuringExecution parameter or the preferredDuringSchedulingIgnoredDuringExecution parameter.
3: Specifies the key and values that must be met. If you want the new pod to be scheduled with the other pod, use the same key and values parameters as the label on the first pod.
4: Specifies an operator. The operator can be In, NotIn, Exists, or DoesNotExist. For example, use the operator In to require that the existing pod has the label with one of the listed values.
5: Specifies a topologyKey, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain.

Create the pod.
```
$ oc create -f <pod-spec>.yaml
```

4.3.3. Configuring a pod anti-affinity rule

The following steps demonstrate a simple two-pod configuration that creates a pod with a label and a pod that uses a preferred anti-affinity rule to attempt to prevent scheduling with that pod.

Note

You cannot add an affinity directly to a scheduled pod.

Procedure

Create a pod with a specific label in the pod spec:

Create a YAML file with the following content:

apiVersion: v1
kind: Pod
metadata:
  name: security-s1
  labels:
    security: S1
spec:
  containers:
  - name: security-s1
    image: docker.io/ocpqe/hello-pod

Create the pod.
```
$ oc create -f <pod-spec>.yaml
```

When creating other pods, configure the following parameters:

Create a YAML file with the following content:

apiVersion: v1
kind: Pod
metadata:
  name: security-s2-east
#...
spec:
  affinity: # 1
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution: # 2
      - weight: 100 # 3
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security # 4
              values:
              - S1
              operator: In # 5
          topologyKey: kubernetes.io/hostname # 6
#...

1: Adds a pod anti-affinity.
2: Configures the requiredDuringSchedulingIgnoredDuringExecution parameter or the preferredDuringSchedulingIgnoredDuringExecution parameter.
3: For a preferred rule, specifies a weight for the node, 1-100. The node with the highest weight is preferred.
4: Specifies the key and values that must be met. If you want the new pod to not be scheduled with the other pod, use the same key and values parameters as the label on the first pod.
5: Specifies an operator. The operator can be In, NotIn, Exists, or DoesNotExist. For example, use the operator In to require that the existing pod has the label with one of the listed values.
6: Specifies a topologyKey, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain.

Create the pod.
```
$ oc create -f <pod-spec>.yaml
```

4.3.4. Sample pod affinity and anti-affinity rules

The following examples demonstrate pod affinity and pod anti-affinity.

4.3.4.1. Pod Affinity

The following example demonstrates pod affinity for pods with matching labels and label selectors.

The pod team4 has the label team:4.

apiVersion: v1
kind: Pod
metadata:
  name: team4
  labels:
     team: "4"
#...
spec:
  containers:
  - name: ocp
    image: docker.io/ocpqe/hello-pod
#...

The pod team4a has the label selector team:4 under podAffinity.

apiVersion: v1
kind: Pod
metadata:
  name: team4a
#...
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: team
            operator: In
            values:
            - "4"
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pod-affinity
    image: docker.io/ocpqe/hello-pod
#...

The team4a pod is scheduled on the same node as the team4 pod.
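
A label selector that tests only for exact key and value pairs can also be written with matchLabels, which is equivalent to a matchExpressions entry that uses the In operator with a single value. The following is a sketch of the team4a affinity stanza rewritten this way:

```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            team: "4" # same match as key: team, operator: In, values: ["4"]
        topologyKey: kubernetes.io/hostname
```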

4.3.4.2. Pod Anti-affinity

The following example demonstrates pod anti-affinity for pods with matching labels and label selectors.

The pod pod-s1 has the label security:s1.

apiVersion: v1
kind: Pod
metadata:
  name: pod-s1
  labels:
    security: s1
#...
spec:
  containers:
  - name: ocp
    image: docker.io/ocpqe/hello-pod
#...

The pod pod-s2 has the label selector security:s1 under podAntiAffinity.

apiVersion: v1
kind: Pod
metadata:
  name: pod-s2
#...
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - s1
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pod-antiaffinity
    image: docker.io/ocpqe/hello-pod
#...

The pod pod-s2 cannot be scheduled on the same node as pod-s1.

4.3.4.3. Pod Affinity with no Matching Labels

The following example demonstrates pod affinity for pods without matching labels and label selectors.

The pod pod-s1 has the label security:s1.

apiVersion: v1
kind: Pod
metadata:
  name: pod-s1
  labels:
    security: s1
#...
spec:
  containers:
  - name: ocp
    image: docker.io/ocpqe/hello-pod
#...

The pod pod-s2 has the label selector security:s2.

apiVersion: v1
kind: Pod
metadata:
  name: pod-s2
#...
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - s2
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pod-affinity
    image: docker.io/ocpqe/hello-pod
#...

The pod pod-s2 is not scheduled unless there is a node with a pod that has the security:s2 label. If there is no other pod with that label, the new pod remains in a pending state:

Example output

NAME      READY     STATUS    RESTARTS   AGE       IP        NODE
pod-s2    0/1       Pending   0          32s       <none>

4.3.5. Using pod affinity and anti-affinity to control where an Operator is installed

By default, when you install an Operator, OpenShift Container Platform installs the Operator pod to one of your worker nodes randomly. However, there might be situations where you want that pod scheduled on a specific node or set of nodes.

The following examples describe situations where you might want to schedule an Operator pod to a specific node or set of nodes:

If an Operator requires a particular platform, such as amd64 or arm64
If an Operator requires a particular operating system, such as Linux or Windows
If you want Operators that work together scheduled on the same host or on hosts located on the same rack
If you want Operators dispersed throughout the infrastructure to avoid downtime due to network or hardware issues

You can control where an Operator pod is installed by adding a pod affinity or anti-affinity to the Operator’s Subscription object.

The following examples show how to use pod affinity to place the Custom Metrics Autoscaler Operator pod on nodes that run pods with a specific label, and how to use pod anti-affinity to keep it off such nodes:

Pod affinity example that places the Operator pod on one or more specific nodes

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-custom-metrics-autoscaler-operator
  namespace: openshift-keda
spec:
  name: my-package
  source: my-operators
  sourceNamespace: operator-registries
  config:
    affinity:
      podAffinity: # 1
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - test
          topologyKey: kubernetes.io/hostname
#...

1: A pod affinity that places the Operator’s pod on a node that has pods with the app=test label.

Pod anti-affinity example that prevents the Operator pod from being scheduled on one or more specific nodes

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-custom-metrics-autoscaler-operator
  namespace: openshift-keda
spec:
  name: my-package
  source: my-operators
  sourceNamespace: operator-registries
  config:
    affinity:
      podAntiAffinity: # 1
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: cpu
              operator: In
              values:
              - high
          topologyKey: kubernetes.io/hostname
#...

1: A pod anti-affinity that prevents the Operator’s pod from being scheduled on a node that has pods with the cpu=high label.

Procedure

To control the placement of an Operator pod, complete the following steps:

Install the Operator as usual.
If needed, ensure that your nodes are labeled to properly respond to the affinity.
Edit the Operator Subscription object to add an affinity:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-custom-metrics-autoscaler-operator
  namespace: openshift-keda
spec:
  name: my-package
  source: my-operators
  sourceNamespace: operator-registries
  config:
    affinity:
      podAntiAffinity: # 1
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - ip-10-0-185-229.ec2.internal
          topologyKey: topology.kubernetes.io/zone
#...

1: Add a podAffinity or podAntiAffinity.

Verification

To ensure that the pod is deployed on the specific node, run the following command:

$ oc get pods -o wide

Example output

NAME                                                  READY   STATUS    RESTARTS   AGE   IP            NODE                           NOMINATED NODE   READINESS GATES
custom-metrics-autoscaler-operator-5dcc45d656-bhshg   1/1     Running   0          50s   10.131.0.20   ip-10-0-185-229.ec2.internal   <none>           <none>

4.4. Controlling pod placement on nodes using node affinity rules

Affinity is a property of pods that controls the nodes on which they prefer to be scheduled.

In OpenShift Container Platform, node affinity is a set of rules used by the scheduler to determine where a pod can be placed. The rules are defined using custom labels on the nodes and label selectors specified in pods.

4.4.1. Understanding node affinity

Node affinity allows a pod to specify an affinity towards a group of nodes it can be placed on. The node does not have control over the placement.

For example, you could configure a pod to only run on a node with a specific CPU or in a specific availability zone.

There are two types of node affinity rules: required and preferred.

Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

Note

If labels on a node change at runtime so that a node affinity rule on a pod is no longer met, the pod continues to run on the node.

You configure node affinity through the Pod spec file. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, and the scheduler then attempts to meet the preferred rule.

The following example is a Pod spec with a rule that requires the pod be placed on a node with a label whose key is e2e-az-NorthSouth and whose value is either e2e-az-North or e2e-az-South:

Example pod configuration file with a node affinity required rule

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity: # 1
      requiredDuringSchedulingIgnoredDuringExecution: # 2
        nodeSelectorTerms:
        - matchExpressions:
          - key: e2e-az-NorthSouth # 3
            operator: In # 4
            values:
            - e2e-az-North # 5
            - e2e-az-South # 6
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod
#...

1: The stanza to configure node affinity.
2: Defines a required rule.
3 5 6: The key/value pair (label) that must be matched to apply the rule.
4: The operator represents the relationship between the label on the node and the set of values in the matchExpressions parameters in the Pod spec. This value can be In, NotIn, Exists, DoesNotExist, Lt, or Gt.
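
The Lt and Gt operators apply only to node affinity. They compare the node label value, interpreted as an integer, against a single value. The following is a minimal sketch, assuming a hypothetical custom node label example.com/cpu-cores that an administrator has set on each node. The pod name is also hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity-gt # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: example.com/cpu-cores # hypothetical label, not set by the platform
            operator: Gt
            values:
            - "8" # exactly one value, written as a string and compared as an integer
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod
```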

The following example is a Pod spec with a preferred rule stating that a node with a label whose key is e2e-az-EastWest and whose value is either e2e-az-East or e2e-az-West is preferred for the pod:

Example pod configuration file with a node affinity preferred rule

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity: # 1
      preferredDuringSchedulingIgnoredDuringExecution: # 2
      - weight: 1 # 3
        preference:
          matchExpressions:
          - key: e2e-az-EastWest # 4
            operator: In # 5
            values:
            - e2e-az-East # 6
            - e2e-az-West # 7
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod
#...

1: The stanza to configure node affinity.
2: Defines a preferred rule.
3: Specifies a weight for a preferred rule. The node with the highest weight is preferred.
4 6 7: The key/value pair (label) that must be matched to apply the rule.
5: The operator represents the relationship between the label on the node and the set of values in the matchExpressions parameters in the Pod spec. This value can be In, NotIn, Exists, DoesNotExist, Lt, or Gt.
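
A required rule and preferred rules can be combined in one Pod spec. In the following sketch, which reuses the labels from the two examples above, only nodes labeled e2e-az-North or e2e-az-South are candidates. Among those candidates, the scheduler adds up the weights of the preferred terms that each node satisfies and favors the node with the highest total. The pod name and the weights are illustrative choices:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity-combined # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: # must be met
        nodeSelectorTerms:
        - matchExpressions:
          - key: e2e-az-NorthSouth
            operator: In
            values:
            - e2e-az-North
            - e2e-az-South
      preferredDuringSchedulingIgnoredDuringExecution: # scored, not enforced
      - weight: 80 # illustrative weight
        preference:
          matchExpressions:
          - key: e2e-az-EastWest
            operator: In
            values:
            - e2e-az-East
      - weight: 20 # illustrative weight
        preference:
          matchExpressions:
          - key: e2e-az-EastWest
            operator: In
            values:
            - e2e-az-West
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod
```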

There is no explicit node anti-affinity concept, but using the NotIn or DoesNotExist operator replicates that behavior.
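
As a sketch of that approach, the following affinity stanza keeps a pod off nodes whose e2e-az-NorthSouth label is e2e-az-South. Note that NotIn also matches nodes that do not have the label at all:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: e2e-az-NorthSouth
            operator: NotIn # repels the pod from nodes with the listed values
            values:
            - e2e-az-South
```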

Note

If you are using node affinity and node selectors in the same pod configuration, note the following:

If you configure both nodeSelector and nodeAffinity, both conditions must be satisfied for the pod to be scheduled onto a candidate node.
If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions are satisfied.
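
These combination rules can be sketched in one Pod spec. In the following example, the node must carry the disktype=ssd label from nodeSelector and must also satisfy at least one of the two nodeSelectorTerms. Within the first term, both expressions must hold. The pod name and the disktype and gpu labels are hypothetical and serve only to illustrate the logic:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-combined-selectors # hypothetical name
spec:
  nodeSelector:
    disktype: ssd # hypothetical label; always required in addition to nodeAffinity
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Term 1: both expressions must match the same node (AND)
        - matchExpressions:
          - key: e2e-az-NorthSouth
            operator: In
            values:
            - e2e-az-North
          - key: gpu # hypothetical label
            operator: Exists
        # Term 2: an alternative to term 1 (OR)
        - matchExpressions:
          - key: e2e-az-NorthSouth
            operator: In
            values:
            - e2e-az-South
  containers:
  - name: with-combined-selectors
    image: docker.io/ocpqe/hello-pod
```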

4.4.2. Configuring a required node affinity rule

Required rules must be met before a pod can be scheduled on a node.

Procedure

The following steps demonstrate a simple configuration that labels a node and creates a pod that the scheduler is required to place on the node.

Add a label to a node using the oc label node command:

$ oc label node node1 e2e-az-name=e2e-az1

Tip

You can alternatively apply the following YAML to add the label:

kind: Node
apiVersion: v1
metadata:
  name: <node_name>
  labels:
    e2e-az-name: e2e-az1
#...

Create a pod with a node affinity rule in the pod spec:

Create a YAML file with the following content:

Note

You cannot add an affinity directly to a scheduled pod.

Example pod configuration file

apiVersion: v1
kind: Pod
metadata:
  name: s1
spec:
  affinity: # 1
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: # 2
        nodeSelectorTerms:
        - matchExpressions:
          - key: e2e-az-name # 3
            values:
            - e2e-az1
            - e2e-az2
            operator: In # 4
#...

1: Adds a node affinity.
2: Configures the requiredDuringSchedulingIgnoredDuringExecution parameter.
3: Specifies the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same key and values parameters as the label on the node.
4: Specifies an operator. The operator can be In, NotIn, Exists, DoesNotExist, Lt, or Gt. For example, use the operator In to require that the node has the label with one of the listed values.

Create the pod:
```
$ oc create -f <file-name>.yaml
```

4.4.3. Configuring a preferred node affinity rule

Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.

Procedure

The following steps demonstrate a simple configuration that labels a node and creates a pod that the scheduler tries to place on the node.

Add a label to a node using the oc label node command:
```
$ oc label node node1 e2e-az-name=e2e-az3
```

Create a pod with a node affinity rule in the pod spec:

Create a YAML file with the following content:

Note

You cannot add an affinity directly to a scheduled pod.

apiVersion: v1
kind: Pod
metadata:
  name: s1
spec:
  affinity: # 1
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution: # 2
      - weight: <weight> # 3
        preference:
          matchExpressions:
          - key: e2e-az-name # 4
            values:
            - e2e-az3
            operator: In # 5
#...

1: Adds a node affinity.
2: Configures the preferredDuringSchedulingIgnoredDuringExecution parameter.
3: Specifies a weight for the node, as a number 1-100. The node with the highest weight is preferred.
4: Specifies the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same key and values parameters as the label on the node.
5: Specifies an operator. The operator can be In, NotIn, Exists, DoesNotExist, Lt, or Gt. For example, use the operator In to require that the node has the label with one of the listed values.

Create the pod.
```
$ oc create -f <file-name>.yaml
```

4.4.4. Sample node affinity rules

The following examples demonstrate node affinity.

4.4.4.1. Node affinity with matching labels

The following example demonstrates node affinity for a node and pod with matching labels:

The Node1 node has the label zone:us:

$ oc label node node1 zone=us

Tip

You can alternatively apply the following YAML to add the label:

kind: Node
apiVersion: v1
metadata:
  name: <node_name>
  labels:
    zone: us
#...

The pod-s1 pod has the zone and us key/value pair under a required node affinity rule:

$ cat pod-s1.yaml

Example output

apiVersion: v1
kind: Pod
metadata:
  name: pod-s1
spec:
  containers:
    - image: "docker.io/ocpqe/hello-pod"
      name: hello-pod
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: "zone"
              operator: In
              values:
              - us
#...

The pod-s1 pod can be scheduled on Node1:

$ oc get pod -o wide

Example output

NAME     READY     STATUS       RESTARTS   AGE      IP      NODE
pod-s1   1/1       Running      0          4m       IP1     node1

4.4.4.2. Node affinity with no matching labels

The following example demonstrates node affinity for a node and pod without matching labels:

The Node1 node has the label zone:emea:

$ oc label node node1 zone=emea

Tip

You can alternatively apply the following YAML to add the label:

kind: Node
apiVersion: v1
metadata:
  name: <node_name>
  labels:
    zone: emea
#...

The pod-s1 pod has the zone and us key/value pair under a required node affinity rule:

$ cat pod-s1.yaml

Example output

apiVersion: v1
kind: Pod
metadata:
  name: pod-s1
spec:
  containers:
    - image: "docker.io/ocpqe/hello-pod"
      name: hello-pod
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: "zone"
              operator: In
              values:
              - us
#...

The pod-s1 pod cannot be scheduled on Node1:

$ oc describe pod pod-s1

Example output

...

Events:
 FirstSeen LastSeen Count From              SubObjectPath  Type                Reason
 --------- -------- ----- ----              -------------  --------            ------
 1m        33s      8     default-scheduler Warning        FailedScheduling    No nodes are available that match all of the following predicates:: MatchNodeSelector (1).

4.4.5. Using node affinity to control where an Operator is installed

The following examples describe situations where you might want to schedule an Operator pod to a specific node or set of nodes:

If an Operator requires a particular platform, such as amd64 or arm64
If an Operator requires a particular operating system, such as Linux or Windows
If you want Operators that work together scheduled on the same host or on hosts located on the same rack
If you want Operators dispersed throughout the infrastructure to avoid downtime due to network or hardware issues

You can control where an Operator pod is installed by adding node affinity constraints to the Operator’s Subscription object.

The following examples show how to use node affinity to install an instance of the Custom Metrics Autoscaler Operator to a specific node in the cluster:

Node affinity example that places the Operator pod on a specific node

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-custom-metrics-autoscaler-operator
  namespace: openshift-keda
spec:
  name: my-package
  source: my-operators
  sourceNamespace: operator-registries
  config:
    affinity:
      nodeAffinity: # 1
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - ip-10-0-163-94.us-west-2.compute.internal
#...

1: A node affinity that requires the Operator’s pod to be scheduled on a node named ip-10-0-163-94.us-west-2.compute.internal.

Node affinity example that places the Operator pod on a node with a specific platform

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-custom-metrics-autoscaler-operator
  namespace: openshift-keda
spec:
  name: my-package
  source: my-operators
  sourceNamespace: operator-registries
  config:
    affinity:
      nodeAffinity: # 1
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/arch
              operator: In
              values:
              - arm64
            - key: kubernetes.io/os
              operator: In
              values:
              - linux
#...

1: A node affinity that requires the Operator’s pod to be scheduled on a node with the kubernetes.io/arch=arm64 and kubernetes.io/os=linux labels.
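
The matching rule behind both examples can be sketched in Python. This is a minimal illustration rather than scheduler code: it assumes that only the In operator is used, and the function name and the sample node labels are hypothetical. Separate entries in nodeSelectorTerms are treated as alternatives, while every expression inside one term must match.

```python
def node_matches(node_labels, node_selector_terms):
    """Return True if the node labels satisfy at least one selector term.

    Simplified sketch: only the In operator is handled.
    """
    for term in node_selector_terms:
        expressions = term.get("matchExpressions", [])
        # Within one term, every expression must hold (logical AND).
        if expressions and all(
            expr["operator"] == "In"
            and node_labels.get(expr["key"]) in expr["values"]
            for expr in expressions
        ):
            return True  # separate terms are alternatives (logical OR)
    return False


# The platform example: both the arch and the os expression must match.
platform_terms = [{"matchExpressions": [
    {"key": "kubernetes.io/arch", "operator": "In", "values": ["arm64"]},
    {"key": "kubernetes.io/os", "operator": "In", "values": ["linux"]},
]}]

arm_node = {"kubernetes.io/arch": "arm64", "kubernetes.io/os": "linux"}  # hypothetical node
amd_node = {"kubernetes.io/arch": "amd64", "kubernetes.io/os": "linux"}  # hypothetical node

print(node_matches(arm_node, platform_terms))  # True
print(node_matches(amd_node, platform_terms))  # False
```

Because the two expressions sit in the same term, a Linux node with the wrong architecture is rejected even though one of the two labels matches.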

Procedure

To control the placement of an Operator pod, complete the following steps:

Install the Operator as usual.
If needed, ensure that your nodes are labeled to properly respond to the affinity.

Edit the Operator Subscription object to add an affinity:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-custom-metrics-autoscaler-operator
  namespace: openshift-keda
spec:
  name: my-package
  source: my-operators
  sourceNamespace: operator-registries
  config:
    affinity: # 1
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - ip-10-0-185-229.ec2.internal
#...

1: Add a nodeAffinity.

Verification

To ensure that the pod is deployed on the specific node, run the following command:

$ oc get pods -o wide

Example output

NAME                                                  READY   STATUS    RESTARTS   AGE   IP            NODE                           NOMINATED NODE   READINESS GATES
custom-metrics-autoscaler-operator-5dcc45d656-bhshg   1/1     Running   0          50s   10.131.0.20   ip-10-0-185-229.ec2.internal   <none>           <none>

4.5. Placing pods onto overcommitted nodes

In an overcommitted state, the sum of the container compute resource requests and limits exceeds the resources available on the system. Overcommitment might be desirable in development environments where a trade-off of guaranteed performance for capacity is acceptable.

Requests and limits enable administrators to allow and manage the overcommitment of resources on a node. The scheduler uses requests for scheduling your container and providing a minimum service guarantee. Limits constrain the amount of compute resource that may be consumed on your node.

4.5.1. Understanding overcommitment

OpenShift Container Platform administrators can control the level of overcommit and manage container density on nodes by configuring masters to override the ratio between request and limit set on developer containers. In conjunction with a per-project LimitRange object specifying limits and defaults, this adjusts the container limit and request to achieve the desired level of overcommit.

Note

These overrides have no effect if no limits have been set on containers. To ensure that the overrides apply, create a LimitRange object with default limits, either in each individual project or in the project template.

After these overrides, the container limits and requests must still be validated by any LimitRange object in the project. It is possible, for example, for developers to specify a limit close to the minimum limit, and have the request then be overridden below the minimum limit, causing the pod to be forbidden. This unfortunate user experience should be addressed with future work, but for now, configure this capability and LimitRange objects with caution.
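
The failure mode described above can be illustrated with a short Python sketch. The function name, the request_to_limit_percent parameter, and the sample values are hypothetical and chosen only to show the arithmetic; they are not taken from an OpenShift Container Platform API.

```python
def overridden_request(limit_mib, request_to_limit_percent, limit_range_min_mib):
    """Derive a request from the limit, then validate it against a minimum.

    Hypothetical model: the override sets request = limit * percent / 100,
    and a LimitRange minimum rejects any request that falls below it.
    """
    request_mib = limit_mib * request_to_limit_percent / 100
    if request_mib < limit_range_min_mib:
        raise ValueError(
            f"request {request_mib:g}Mi is below the LimitRange minimum "
            f"{limit_range_min_mib:g}Mi; the pod would be forbidden"
        )
    return request_mib


# A generous limit leaves the overridden request above the minimum.
print(overridden_request(512, 50, 128))  # 256.0

# A limit close to the minimum pushes the overridden request below it.
try:
    overridden_request(160, 50, 128)
except ValueError as err:
    print(err)
```

In the second call the developer's limit of 160Mi is valid on its own, but the derived request of 80Mi is not, which is the case the paragraph above warns about.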

4.5.2. Understanding nodes overcommitment

In an overcommitted environment, it is important to properly configure your node to provide the best system behavior.

When the node starts, it ensures that the kernel tunable flags for memory management are set properly. The kernel should never fail memory allocations unless it runs out of physical memory.

To ensure this behavior, OpenShift Container Platform configures the kernel to always overcommit memory by setting the vm.overcommit_memory parameter to 1, overriding the default operating system setting.

OpenShift Container Platform also configures the kernel not to panic when it runs out of memory by setting the vm.panic_on_oom parameter to 0. A setting of 0 instructs the kernel to call oom_killer in an Out of Memory (OOM) condition, which kills processes based on priority.

You can view the current setting by running the following commands on your nodes:

$ sysctl -a |grep commit

Example output

#...
vm.overcommit_memory = 1
#...

$ sysctl -a |grep panic

Example output

#...
vm.panic_on_oom = 0
#...

Note

The above flags should already be set on nodes, and no further action is required.

You can also perform the following configurations for each node:

Disable or enforce CPU limits using CPU CFS quotas
Reserve resources for system processes
Reserve memory across quality of service tiers

4.6. Controlling pod placement using node taints

Taints and tolerations allow a node to control which pods should (or should not) be scheduled on it.

4.6.1. Understanding taints and tolerations

A taint allows a node to refuse a pod to be scheduled unless that pod has a matching toleration.

You apply taints to a node through the Node specification (NodeSpec) and apply tolerations to a pod through the Pod specification (PodSpec). When you apply a taint to a node, the scheduler cannot place a pod on that node unless the pod can tolerate the taint.

Example taint in a node specification

apiVersion: v1
kind: Node
metadata:
  name: my-node
#...
spec:
  taints:
  - effect: NoExecute
    key: key1
    value: value1
#...

Example toleration in a Pod spec

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
#...
spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600
#...

Taints and tolerations consist of a key, value, and effect.

Parameter Description

key

The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

value

The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

effect

The effect is one of the following:

`NoSchedule` ^[1]	New pods that do not match the taint are not scheduled onto that node. Existing pods on the node remain.
`PreferNoSchedule`	New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to. Existing pods on the node remain.
`NoExecute`	New pods that do not match the taint cannot be scheduled onto that node. Existing pods on the node that do not have a matching toleration are removed.

operator

`Equal`	The `key`/`value`/`effect` parameters must match. This is the default.
`Exists`	The `key`/`effect` parameters must match. You must leave a blank `value` parameter, which matches any.

[1] If you add a NoSchedule taint to a control plane node, the node must have the node-role.kubernetes.io/master=:NoSchedule taint, which is added by default.

For example:

apiVersion: v1
kind: Node
metadata:
  annotations:
    machine.openshift.io/machine: openshift-machine-api/ci-ln-62s7gtb-f76d1-v8jxv-master-0
    machineconfiguration.openshift.io/currentConfig: rendered-master-cdc1ab7da414629332cc4c3926e6e59c
  name: my-node
#...
spec:
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
#...

A toleration matches a taint:

If the operator parameter is set to Equal:
- the key parameters are the same;
- the value parameters are the same;
- the effect parameters are the same.
If the operator parameter is set to Exists:
- the key parameters are the same;
- the effect parameters are the same.
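
The matching rules above can be expressed as a small Python predicate. This is a minimal sketch, not the scheduler's implementation: the function name is hypothetical, and the wildcard handling (a toleration with no effect, or an Exists toleration with no key) is an assumption based on the tolerate-all-taints case described later in this document.

```python
def toleration_matches(toleration, taint):
    """Return True if the toleration matches the taint (simplified sketch)."""
    operator = toleration.get("operator", "Equal")  # Equal is the default

    # Assumption: a toleration without an effect matches any effect.
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False

    key = toleration.get("key")
    if operator == "Exists":
        # Exists ignores the value; with no key it tolerates every taint.
        return key is None or key == taint["key"]
    if operator == "Equal":
        return (key == taint["key"]
                and toleration.get("value", "") == taint.get("value", ""))
    return False


taint = {"key": "key1", "value": "value1", "effect": "NoExecute"}

equal_tol = {"key": "key1", "operator": "Equal", "value": "value1", "effect": "NoExecute"}
exists_tol = {"key": "key1", "operator": "Exists", "effect": "NoExecute"}
wrong_value = {"key": "key1", "operator": "Equal", "value": "other", "effect": "NoExecute"}

print(toleration_matches(equal_tol, taint))    # True
print(toleration_matches(exists_tol, taint))   # True
print(toleration_matches(wrong_value, taint))  # False
```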

The following taints are built into OpenShift Container Platform:

node.kubernetes.io/not-ready: The node is not ready. This corresponds to the node condition Ready=False.
node.kubernetes.io/unreachable: The node is unreachable from the node controller. This corresponds to the node condition Ready=Unknown.
node.kubernetes.io/memory-pressure: The node has memory pressure issues. This corresponds to the node condition MemoryPressure=True.
node.kubernetes.io/disk-pressure: The node has disk pressure issues. This corresponds to the node condition DiskPressure=True.
node.kubernetes.io/network-unavailable: The node network is unavailable.
node.kubernetes.io/unschedulable: The node is unschedulable.
node.cloudprovider.kubernetes.io/uninitialized: When the node controller is started with an external cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint.
node.kubernetes.io/pid-pressure: The node has PID pressure. This corresponds to the node condition PIDPressure=True.
Important
OpenShift Container Platform does not set a default pid.available evictionHard.

4.6.1.1. Understanding how to use toleration seconds to delay pod evictions

You can specify how long a pod can remain bound to a node before being evicted by specifying the tolerationSeconds parameter in the Pod specification or MachineSet object. If a taint with the NoExecute effect is added to a node, a pod that tolerates the taint and specifies the tolerationSeconds parameter is not evicted until that time period expires.

Example pod specification with the tolerationSeconds parameter

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
#...
spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600
#...

Here, if this pod is running when a matching taint is added to the node, the pod stays bound to the node for 3,600 seconds and is then evicted. If the taint is removed before that time, the pod is not evicted.

4.6.1.2. Understanding how to use multiple taints

You can put multiple taints on the same node and multiple tolerations on the same pod. OpenShift Container Platform processes multiple taints and tolerations as follows:

Process the taints for which the pod has a matching toleration.
The remaining unmatched taints have the indicated effects on the pod:
- If there is at least one unmatched taint with effect NoSchedule, OpenShift Container Platform cannot schedule a pod onto that node.
- If there is no unmatched taint with effect NoSchedule but there is at least one unmatched taint with effect PreferNoSchedule, OpenShift Container Platform tries to not schedule the pod onto the node.
- If there is at least one unmatched taint with effect NoExecute, OpenShift Container Platform evicts the pod from the node if it is already running on the node, or the pod is not scheduled onto the node if it is not yet running on the node.
  - Pods that do not tolerate the taint are evicted immediately.
  - Pods that tolerate the taint without specifying tolerationSeconds in their Pod specification remain bound forever.
  - Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time.

For example:

Add the following taints to the node:

$ oc adm taint nodes node1 key1=value1:NoSchedule

$ oc adm taint nodes node1 key1=value1:NoExecute

$ oc adm taint nodes node1 key2=value2:NoSchedule

The pod has the following tolerations:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
#...
spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
#...

In this case, the pod cannot be scheduled onto the node, because there is no toleration matching the third taint. The pod continues running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the pod.
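
The worked example above can be reproduced with a short Python sketch. It is an illustration under simplifying assumptions: only the Equal operator is modeled, and the function names and result keys are hypothetical.

```python
def matches(toleration, taint):
    """Simplified Equal-operator match on key, value, and effect."""
    return all(toleration.get(f) == taint.get(f) for f in ("key", "value", "effect"))


def evaluate(taints, tolerations):
    """Summarize the effects of the taints that no toleration matches."""
    unmatched = [t for t in taints if not any(matches(tol, t) for tol in tolerations)]
    effects = {t["effect"] for t in unmatched}
    schedulable = not ({"NoSchedule", "NoExecute"} & effects)
    return {
        "schedulable": schedulable,
        # PreferNoSchedule only matters when nothing blocks scheduling outright.
        "preferred_against": schedulable and "PreferNoSchedule" in effects,
        "evicted_if_running": "NoExecute" in effects,
    }


node_taints = [
    {"key": "key1", "value": "value1", "effect": "NoSchedule"},
    {"key": "key1", "value": "value1", "effect": "NoExecute"},
    {"key": "key2", "value": "value2", "effect": "NoSchedule"},
]
pod_tolerations = [
    {"key": "key1", "operator": "Equal", "value": "value1", "effect": "NoSchedule"},
    {"key": "key1", "operator": "Equal", "value": "value1", "effect": "NoExecute"},
]

print(evaluate(node_taints, pod_tolerations))
# {'schedulable': False, 'preferred_against': False, 'evicted_if_running': False}
```

The only unmatched taint is key2=value2:NoSchedule, so new scheduling is blocked but a pod that is already running is left in place.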

4.6.1.3. Understanding pod scheduling and node conditions (taint node by condition)

The Taint Nodes By Condition feature, which is enabled by default, automatically taints nodes that report conditions such as memory pressure and disk pressure. If a node reports a condition, a taint is added until the condition clears. The taints have the NoSchedule effect, which means no pod can be scheduled on the node unless the pod has a matching toleration.

The scheduler checks for these taints on nodes before scheduling pods. If the taint is present, the pod is scheduled on a different node. Because the scheduler checks for taints and not the actual node conditions, you can configure the scheduler to ignore some of these node conditions by adding appropriate pod tolerations.

To ensure backward compatibility, the daemon set controller automatically adds the following tolerations to all daemons:

node.kubernetes.io/memory-pressure
node.kubernetes.io/disk-pressure
node.kubernetes.io/unschedulable (1.10 or later)
node.kubernetes.io/network-unavailable (host network only)

You can also add arbitrary tolerations to daemon sets.

Note

The control plane also adds the node.kubernetes.io/memory-pressure toleration on pods that have a QoS class. This is because Kubernetes manages pods in the Guaranteed or Burstable QoS classes. The new BestEffort pods do not get scheduled onto the affected node.

4.6.1.4. Understanding evicting pods by condition (taint-based evictions)

The Taint-Based Evictions feature, which is enabled by default, evicts pods from a node that experiences specific conditions, such as not-ready and unreachable. When a node experiences one of these conditions, OpenShift Container Platform automatically adds taints to the node, and starts evicting and rescheduling the pods on different nodes.

Taint Based Evictions have a NoExecute effect, where any pod that does not tolerate the taint is evicted immediately and any pod that does tolerate the taint will never be evicted, unless the pod uses the tolerationSeconds parameter.

The tolerationSeconds parameter allows you to specify how long a pod stays bound to a node that has a node condition. If the condition still exists after the tolerationSeconds period, the taint remains on the node and the pods with a matching toleration are evicted. If the condition clears before the tolerationSeconds period, pods with matching tolerations are not removed.

If you use the tolerationSeconds parameter with no value, pods are never evicted because of the not ready and unreachable node conditions.
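
The three eviction cases described above can be summarized in a minimal Python sketch. The function names and parameters are hypothetical, and the handling of a condition that clears at exactly the tolerationSeconds boundary is an assumption.

```python
def seconds_until_eviction(tolerates_taint, toleration_seconds=None):
    """Return the delay before eviction, or None if the pod is never evicted."""
    if not tolerates_taint:
        return 0  # no matching toleration: evicted immediately
    if toleration_seconds is None:
        return None  # tolerated with no time limit: never evicted
    return toleration_seconds


def is_evicted(tolerates_taint, toleration_seconds, condition_cleared_after):
    """Decide whether eviction happens before the taint is removed.

    condition_cleared_after is the number of seconds until the node
    condition clears, or None if it never clears.
    """
    delay = seconds_until_eviction(tolerates_taint, toleration_seconds)
    if delay is None:
        return False
    return condition_cleared_after is None or condition_cleared_after > delay


print(is_evicted(True, 300, 120))    # False: the condition clears within 300 seconds
print(is_evicted(True, 300, 900))    # True: the condition outlasts tolerationSeconds
print(is_evicted(True, None, None))  # False: tolerated indefinitely
print(is_evicted(False, None, 60))   # True: no matching toleration
```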

Note

OpenShift Container Platform evicts pods in a rate-limited way to prevent massive pod evictions in scenarios such as the master becoming partitioned from the nodes.

By default, if more than 55% of nodes in a given zone are unhealthy, the node lifecycle controller changes that zone’s state to PartialDisruption and the rate of pod evictions is reduced. For small clusters (by default, 50 nodes or fewer) in this state, nodes in this zone are not tainted and evictions are stopped.

For more information, see Rate limits on eviction in the Kubernetes documentation.
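
As a rough illustration of the default thresholds mentioned in this note, the following Python sketch classifies the eviction behavior for a zone. The function name and return values are hypothetical, and the real node lifecycle controller applies additional conditions that are not modeled here.

```python
UNHEALTHY_ZONE_THRESHOLD = 0.55  # from the text: more than 55% of nodes unhealthy
SMALL_CLUSTER_MAX_NODES = 50     # from the text: 50 nodes or fewer is a small cluster


def eviction_behavior(unhealthy_nodes, zone_nodes, cluster_nodes):
    """Return 'normal', 'reduced', or 'stopped' for a zone (simplified sketch)."""
    if zone_nodes <= 0:
        raise ValueError("zone_nodes must be positive")
    if unhealthy_nodes / zone_nodes <= UNHEALTHY_ZONE_THRESHOLD:
        return "normal"
    # The zone is treated as PartialDisruption from here on.
    if cluster_nodes <= SMALL_CLUSTER_MAX_NODES:
        return "stopped"
    return "reduced"


print(eviction_behavior(2, 10, 30))   # normal
print(eviction_behavior(6, 10, 30))   # stopped
print(eviction_behavior(6, 10, 200))  # reduced
```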

OpenShift Container Platform automatically adds a toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, unless the Pod configuration specifies either toleration.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
#...
spec:
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300 # 1
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
#...

1: These tolerations ensure that the default pod behavior is to remain bound for five minutes after one of these node condition problems is detected.

You can configure these tolerations as needed. For example, if you have an application with a lot of local state, you might want to keep the pods bound to the node for a longer time in the event of a network partition, allowing the partition to recover and avoiding pod eviction.

Pods spawned by a daemon set are created with NoExecute tolerations for the following taints with no tolerationSeconds:

node.kubernetes.io/unreachable
node.kubernetes.io/not-ready

As a result, daemon set pods are never evicted because of these node conditions.

4.6.1.5. Tolerating all taints

You can configure a pod to tolerate all taints by adding an operator: "Exists" toleration with no key and value parameters. Pods with this toleration are not removed from a node that has taints.

Pod spec for tolerating all taints

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
#...
spec:
  tolerations:
  - operator: "Exists"
#...

4.6.2. Adding taints and tolerations

You add tolerations to pods and taints to nodes to allow the nodes to control which pods should or should not be scheduled on them. For existing pods and nodes, add the toleration to the pod first and then add the taint to the node, so that pods are not removed from the node before you can add the toleration.

Procedure

Add a toleration to a pod by editing the Pod spec to include a tolerations stanza:

Sample pod configuration file with an Equal operator

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
#...
spec:
  tolerations:
  - key: "key1" # 1
    value: "value1"
    operator: "Equal"
    effect: "NoExecute"
    tolerationSeconds: 3600 # 2
#...

1: The toleration parameters, as described in the Taint and toleration components table.
2: The tolerationSeconds parameter specifies how long a pod can remain bound to a node before being evicted.

For example:

Sample pod configuration file with an Exists operator

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
#...
spec:
  tolerations:
  - key: "key1"
    operator: "Exists" # 1
    effect: "NoExecute"
    tolerationSeconds: 3600
#...

1: The Exists operator does not take a value.

This toleration matches any taint on a node that has the key key1 and the effect NoExecute, regardless of the taint value.

Add a taint to a node by using the following command with the parameters described in the Taint and toleration components table:

$ oc adm taint nodes <node_name> <key>=<value>:<effect>

For example:

$ oc adm taint nodes node1 key1=value1:NoExecute

This command places a taint on node1 that has key key1, value value1, and effect NoExecute.

Note

If you add a NoSchedule taint to a control plane node, the node must have the node-role.kubernetes.io/master=:NoSchedule taint, which is added by default.

For example:

apiVersion: v1
kind: Node
metadata:
  annotations:
    machine.openshift.io/machine: openshift-machine-api/ci-ln-62s7gtb-f76d1-v8jxv-master-0
    machineconfiguration.openshift.io/currentConfig: rendered-master-cdc1ab7da414629332cc4c3926e6e59c
  name: my-node
#...
spec:
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
#...

The tolerations on the pod match the taint on the node. A pod with either toleration can be scheduled onto node1.

4.6.2.1. Adding taints and tolerations using a compute machine set

You can add taints to nodes using a compute machine set. All nodes associated with the MachineSet object are updated with the taint. Tolerations respond to taints added by a compute machine set in the same manner as taints added directly to the nodes.

Procedure

Add a toleration to a pod by editing the Pod spec to include a tolerations stanza:

Sample pod configuration file with Equal operator

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
#...
spec:
  tolerations:
  - key: "key1" # 1
    value: "value1"
    operator: "Equal"
    effect: "NoExecute"
    tolerationSeconds: 3600 # 2
#...

1: The toleration parameters, as described in the Taint and toleration components table.
2: The tolerationSeconds parameter specifies how long a pod is bound to a node before being evicted.

For example:

Sample pod configuration file with Exists operator

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
#...
spec:
  tolerations:
  - key: "key1"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 3600
#...

Add the taint to the MachineSet object:

Edit the MachineSet YAML for the nodes you want to taint, or create a new MachineSet object:
```
$ oc edit machineset <machineset>
```

Add the taint to the spec.template.spec section:

Example taint in a compute machine set specification

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: my-machineset
#...
spec:
#...
  template:
#...
    spec:
      taints:
      - effect: NoExecute
        key: key1
        value: value1
#...

This example places a taint that has the key key1, value value1, and taint effect NoExecute on the nodes.

Scale down the compute machine set to 0:

$ oc scale --replicas=0 machineset <machineset> -n openshift-machine-api

Tip

You can alternatively apply the following YAML to scale the compute machine set:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <machineset>
  namespace: openshift-machine-api
spec:
  replicas: 0

Wait for the machines to be removed.

Scale up the compute machine set as needed:

$ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api

Or:

$ oc edit machineset <machineset> -n openshift-machine-api

Wait for the machines to start. The taint is added to the nodes associated with the MachineSet object.

4.6.2.2. Binding a user to a node using taints and tolerations

If you want to dedicate a set of nodes for exclusive use by a particular set of users, add a toleration to their pods. Then, add a corresponding taint to those nodes. The pods with the tolerations are allowed to use the tainted nodes or any other nodes in the cluster.

If you want to ensure that the pods are scheduled only to those tainted nodes, also add a label to the same set of nodes and add a node affinity to the pods so that the pods can only be scheduled onto nodes with that label.

Procedure

To configure a node so that users can use only that node:

Add a corresponding taint to those nodes:

For example:

$ oc adm taint nodes node1 dedicated=groupName:NoSchedule

Tip

You can alternatively apply the following YAML to add the taint:

kind: Node
apiVersion: v1
metadata:
  name: my-node
#...
spec:
  taints:
    - key: dedicated
      value: groupName
      effect: NoSchedule
#...

Add a toleration to the pods by writing a custom admission controller.

4.6.2.3. Creating a project with a node selector and toleration

You can create a project that uses a node selector and toleration, which are set as annotations, to control the placement of pods onto specific nodes. Any subsequent resources created in the project are then scheduled on nodes that have a taint matching the toleration.

Prerequisites

A label for node selection has been added to one or more nodes by using a compute machine set or editing the node directly.
A taint has been added to one or more nodes by using a compute machine set or editing the node directly.

Procedure

Create a Project resource definition, specifying a node selector and toleration in the metadata.annotations section:

Example project.yaml file

kind: Project
apiVersion: project.openshift.io/v1
metadata:
  name: <project_name> 
  annotations:
    openshift.io/node-selector: '<label>' 
    scheduler.alpha.kubernetes.io/defaultTolerations: >-
      [{"operator": "Exists", "effect": "NoSchedule", "key":
      "<key_name>"} 
      ]


metadata.name: The project name.
openshift.io/node-selector: The default node selector label.
scheduler.alpha.kubernetes.io/defaultTolerations: The toleration parameters, as described in the Taint and toleration components table. This example uses the NoSchedule effect, which allows existing pods on the node to remain, and the Exists operator, which does not take a value.

Use the oc apply command to create the project:
```
oc apply -f project.yaml
```

Any subsequent resources created in the <project_name> namespace should now be scheduled on the specified nodes.
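
Because the defaultTolerations annotation value is a JSON list stored as a string, it is easy to break by hand. The following Python sketch generates the manifest instead of typing it. The `project_manifest` helper and the `demo-project`, `region=east`, and `dedicated` values are assumptions chosen for illustration, not names from this procedure.

```python
import json


def project_manifest(name: str, node_selector: str, tolerations: list) -> dict:
    """Build a Project manifest with node selector and toleration annotations."""
    return {
        "kind": "Project",
        "apiVersion": "project.openshift.io/v1",
        "metadata": {
            "name": name,
            "annotations": {
                "openshift.io/node-selector": node_selector,
                # Annotation values are strings, so the toleration list is
                # serialized to JSON text rather than nested as an object.
                "scheduler.alpha.kubernetes.io/defaultTolerations":
                    json.dumps(tolerations),
            },
        },
    }


# Hypothetical values for illustration only.
manifest = project_manifest(
    "demo-project",
    "region=east",
    [{"operator": "Exists", "effect": "NoSchedule", "key": "dedicated"}],
)

# oc apply -f accepts JSON manifests as well as YAML ones.
print(json.dumps(manifest, indent=2))
```

Serializing with `json.dumps` guarantees balanced brackets and quoting in the annotation value.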

4.6.2.4. Controlling nodes with special hardware using taints and tolerations

In a cluster where a small subset of nodes have specialized hardware, you can use taints and tolerations to keep pods that do not need the specialized hardware off of those nodes, leaving the nodes for pods that do need the specialized hardware. You can also require pods that need specialized hardware to use specific nodes.

You can achieve this by adding a toleration to pods that need the special hardware and tainting the nodes that have the specialized hardware.
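
The difference between the NoSchedule and PreferNoSchedule effects can be sketched as follows. This is a simplified Python model that assumes plain dictionaries for taints and tolerations. The `classify_node` function and the result labels `excluded`, `avoided`, and `eligible` are hypothetical names for illustration and do not come from the scheduler.

```python
# Simplified model: how untolerated taints affect one node for one pod.

def _tolerated(taint: dict, tolerations: list) -> bool:
    """Return True if any toleration matches the taint."""
    for tol in tolerations:
        if tol.get("effect") and tol["effect"] != taint["effect"]:
            continue
        if tol.get("key") and tol["key"] != taint["key"]:
            continue
        if tol.get("operator", "Equal") == "Exists":
            return True
        if tol.get("value", "") == taint.get("value", ""):
            return True
    return False


def classify_node(node_taints: list, pod_tolerations: list) -> str:
    """Classify a node as excluded, avoided, or eligible for a pod."""
    untolerated = {t["effect"] for t in node_taints
                   if not _tolerated(t, pod_tolerations)}
    if untolerated & {"NoSchedule", "NoExecute"}:
        return "excluded"   # hard restriction: the pod cannot be placed here
    if "PreferNoSchedule" in untolerated:
        return "avoided"    # soft restriction: used only if nothing better fits
    return "eligible"


hard = [{"key": "disktype", "value": "ssd", "effect": "NoSchedule"}]
soft = [{"key": "disktype", "value": "ssd", "effect": "PreferNoSchedule"}]
ssd_pod = [{"key": "disktype", "operator": "Equal",
            "value": "ssd", "effect": "NoSchedule"}]

print(classify_node(hard, []))       # pod without a toleration
print(classify_node(soft, []))       # pod without a toleration
print(classify_node(hard, ssd_pod))  # pod that tolerates the taint
```

In this model, a toleration that names the NoSchedule effect does not match a PreferNoSchedule taint, so the effect in the toleration should correspond to the effect used when tainting the node.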

Procedure

To ensure nodes with specialized hardware are reserved for specific pods:

Add a toleration to pods that need the special hardware.

For example:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
#...
spec:
  tolerations:
    - key: "disktype"
      value: "ssd"
      operator: "Equal"
      effect: "NoSchedule"
#...

Taint the nodes that have the specialized hardware using one of the following commands:

oc adm taint nodes <node-name> disktype=ssd:NoSchedule


Or:

oc adm taint nodes <node-name> disktype=ssd:PreferNoSchedule


Tip

You can alternatively apply the following YAML to add the taint:

kind: Node
apiVersion: v1
metadata:
  name: my-node
#...
spec:
  taints:
    - key: disktype
      value: ssd
      effect: PreferNoSchedule
#...

4.6.3. Removing taints and tolerations

You can remove taints from nodes and tolerations from pods as needed. You should add the toleration to the pod first, then add the taint to the node to avoid pods being removed from the node before you can add the toleration.

Procedure

To remove taints and tolerations:

To remove a taint from a node:

oc adm taint nodes <node-name> <key>-


For example:

oc adm taint nodes ip-10-0-132-248.ec2.internal key1-


Example output

node/ip-10-0-132-248.ec2.internal untainted


To remove a toleration from a pod, edit the Pod spec to remove the toleration:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
#...
spec:
  tolerations:
  - key: "key2"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 3600
#...


4.7. Placing pods on specific nodes using node selectors

A node selector specifies a map of key/value pairs that are defined using custom labels on nodes and selectors specified in pods.

For the pod to be eligible to run on a node, the pod must have the same key/value node selector as the label on the node.

4.7.1. About node selectors

You can use a node selector to place specific pods on specific nodes, cluster-wide node selectors to place new pods on specific nodes anywhere in the cluster, and project node selectors to place new pods in a project on specific nodes.

For example, as a cluster administrator, you can create an infrastructure where application developers can deploy pods only onto the nodes closest to their geographical location by including a node selector in every pod they create. In this example, the cluster consists of five data centers spread across two regions. In the U.S., label the nodes as us-east, us-central, or us-west. In the Asia-Pacific region (APAC), label the nodes as apac-east or apac-west. The developers can add a node selector to the pods they create to ensure the pods get scheduled on those nodes.

A pod is not scheduled if the Pod object contains a node selector, but no node has a matching label.

Important

If you are using node selectors and node affinity in the same pod configuration, the following rules control pod placement onto nodes:

If you configure both nodeSelector and nodeAffinity, both conditions must be satisfied for the pod to be scheduled onto a candidate node.
If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions are satisfied.
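
The three rules above can be sketched in Python as a single check. This is a simplified model under stated assumptions: labels and selectors are plain dictionaries, only the In, NotIn, Exists, and DoesNotExist operators are handled, and the function names `expression_matches` and `node_is_candidate` are hypothetical.

```python
# Simplified model of combining nodeSelector with nodeAffinity terms.

def expression_matches(labels: dict, expr: dict) -> bool:
    """Evaluate one matchExpressions entry against node labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator not handled in this sketch: {op}")


def node_is_candidate(labels: dict, node_selector: dict, selector_terms: list) -> bool:
    """Apply nodeSelector AND (any term where all expressions match)."""
    # Rule 1: every nodeSelector pair must be present on the node.
    selector_ok = all(labels.get(k) == v for k, v in node_selector.items())
    if not selector_terms:
        return selector_ok  # no affinity terms configured
    # Rule 2: one satisfied term is enough.
    # Rule 3: within a term, all matchExpressions must be satisfied.
    affinity_ok = any(
        bool(term.get("matchExpressions")) and
        all(expression_matches(labels, e) for e in term["matchExpressions"])
        for term in selector_terms
    )
    return selector_ok and affinity_ok


labels = {"region": "east", "type": "user-node"}
terms = [
    {"matchExpressions": [{"key": "type", "operator": "In", "values": ["infra"]}]},
    {"matchExpressions": [
        {"key": "type", "operator": "In", "values": ["user-node"]},
        {"key": "gpu", "operator": "DoesNotExist"},
    ]},
]

print(node_is_candidate(labels, {"region": "east"}, terms))
print(node_is_candidate(labels, {"region": "west"}, terms))
```

In the first call the node satisfies the node selector and the second term, so it is a candidate. In the second call the node selector fails, so the affinity result does not matter.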

Node selectors on specific pods and nodes

You can control which node a specific pod is scheduled on by using node selectors and labels.

To use node selectors and labels, first label the node to avoid pods being descheduled, then add the node selector to the pod.

Note

You cannot add a node selector directly to an existing scheduled pod. You must label the object that controls the pod, such as a deployment config.

For example, the following Node object has the region: east label:

Sample Node object with a label

kind: Node
apiVersion: v1
metadata:
  name: ip-10-0-131-14.ec2.internal
  selfLink: /api/v1/nodes/ip-10-0-131-14.ec2.internal
  uid: 7bc2580a-8b8e-11e9-8e01-021ab4174c74
  resourceVersion: '478704'
  creationTimestamp: '2019-06-10T14:46:08Z'
  labels:
    kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/zone: us-east-1a
    node.openshift.io/os_version: '4.5'
    node-role.kubernetes.io/worker: ''
    failure-domain.beta.kubernetes.io/region: us-east-1
    node.openshift.io/os_id: rhcos
    beta.kubernetes.io/instance-type: m4.large
    kubernetes.io/hostname: ip-10-0-131-14
    beta.kubernetes.io/arch: amd64
    region: east 
    type: user-node
#...


1: Labels to match the pod node selector.

A pod has the type: user-node,region: east node selector:

Sample Pod object with node selectors

apiVersion: v1
kind: Pod
metadata:
  name: s1
#...
spec:
  nodeSelector: 
    region: east
    type: user-node
#...


1: Node selectors to match the node label. The node must have a label for each node selector.

When you create the pod using the example pod spec, it can be scheduled on the example node.

Default cluster-wide node selectors

With default cluster-wide node selectors, when you create a pod in that cluster, OpenShift Container Platform adds the default node selectors to the pod and schedules the pod on nodes with matching labels.

For example, the following Scheduler object has the default cluster-wide region=east and type=user-node node selectors:

Example Scheduler Operator Custom Resource

apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
#...
spec:
  defaultNodeSelector: type=user-node,region=east
#...


A node in that cluster has the type=user-node,region=east labels:

Example Node object

apiVersion: v1
kind: Node
metadata:
  name: ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4
#...
  labels:
    region: east
    type: user-node
#...


Example Pod object with a node selector

apiVersion: v1
kind: Pod
metadata:
  name: s1
#...
spec:
  nodeSelector:
    region: east
#...


When you create the pod using the example pod spec in the example cluster, the pod is created with the cluster-wide node selector and is scheduled on the labeled node:

Example pod list with the pod on the labeled node

NAME     READY   STATUS    RESTARTS   AGE   IP           NODE                                       NOMINATED NODE   READINESS GATES
pod-s1   1/1     Running   0          20s   10.131.2.6   ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4   <none>           <none>


Note

If the project where you create the pod has a project node selector, that selector takes precedence over a cluster-wide node selector. Your pod is not created or scheduled if its node selector conflicts with the project node selector.

Project node selectors

With project node selectors, when you create a pod in the project, OpenShift Container Platform adds the project node selectors to the pod and schedules the pod on a node with matching labels. If there is a cluster-wide default node selector, the project node selector takes precedence.

For example, the following project has the region=east node selector:

Example Namespace object

apiVersion: v1
kind: Namespace
metadata:
  name: east-region
  annotations:
    openshift.io/node-selector: "region=east"
#...


The following node has the type=user-node,region=east labels:

Example Node object

apiVersion: v1
kind: Node
metadata:
  name: ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4
#...
  labels:
    region: east
    type: user-node
#...


When you create the pod using the example pod spec in this example project, the pod is created with the project node selectors and is scheduled on the labeled node:

Example Pod object

apiVersion: v1
kind: Pod
metadata:
  namespace: east-region
#...
spec:
  nodeSelector:
    region: east
    type: user-node
#...


Example pod list with the pod on the labeled node

NAME     READY   STATUS    RESTARTS   AGE   IP           NODE                                       NOMINATED NODE   READINESS GATES
pod-s1   1/1     Running   0          20s   10.131.2.6   ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4   <none>           <none>


A pod in the project is not created or scheduled if the pod contains different node selectors. For example, if you deploy the following pod into the example project, it is not created:

Example Pod object with an invalid node selector

apiVersion: v1
kind: Pod
metadata:
  name: west-region
#...
spec:
  nodeSelector:
    region: west
#...


4.7.2. Using node selectors to control pod placement

Note

You cannot add a node selector directly to an existing scheduled pod.

Prerequisites

To add a node selector to existing pods, determine the controlling object for the pod. For example, the following command shows that the router-default-66d5cf9464-7pwkc pod is controlled by the router-default-66d5cf9464 replica set:

oc describe pod router-default-66d5cf9464-7pwkc


Example output

kind: Pod
apiVersion: v1
metadata:
#...
Name:               router-default-66d5cf9464-7pwkc
Namespace:          openshift-ingress
# ...
Controlled By:      ReplicaSet/router-default-66d5cf9464
# ...


The web console lists the controlling object under ownerReferences in the pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: router-default-66d5cf9464-7pwkc
# ...
  ownerReferences:
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: router-default-66d5cf9464
      uid: d81dd094-da26-11e9-a48a-128e7edf0312
      controller: true
      blockOwnerDeletion: true
# ...


Procedure

Add labels to a node by using a compute machine set or editing the node directly:

Use a MachineSet object to add labels to nodes managed by the compute machine set when a node is created:

Run the following command to add labels to a MachineSet object:

oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>":"<value>","<key>":"<value>"}}]'  -n openshift-machine-api

For example:

oc patch MachineSet abc612-msrtw-worker-us-east-1c  --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]'  -n openshift-machine-api
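
The value passed to -p must be valid JSON, with each label written as a "key":"value" pair. As a hedged sketch, the following Python helper builds the patch with `json.dumps` and quotes it for the shell, so the payload cannot be malformed. The `machineset_label_patch_command` function is a hypothetical helper, and it only builds the command string. It does not run `oc`.

```python
import json
import shlex


def machineset_label_patch_command(machineset: str, labels: dict,
                                   namespace: str = "openshift-machine-api") -> str:
    """Return an oc patch command line that adds labels to a MachineSet."""
    # JSON Patch document: one "add" operation on the node label map.
    patch = [{
        "op": "add",
        "path": "/spec/template/spec/metadata/labels",
        "value": labels,
    }]
    args = ["oc", "patch", "MachineSet", machineset,
            "--type=json", "-p", json.dumps(patch), "-n", namespace]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(args)


print(machineset_label_patch_command(
    "abc612-msrtw-worker-us-east-1c",
    {"type": "user-node", "region": "east"},
))
```

You can paste the printed command into a shell after reviewing it.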


Tip

You can alternatively apply the following YAML to add labels to a compute machine set:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: xf2bd-infra-us-east-2a
  namespace: openshift-machine-api
spec:
  template:
    spec:
      metadata:
        labels:
          region: "east"
          type: "user-node"
#...


Verify that the labels are added to the MachineSet object by using the oc edit command:

For example:

oc edit MachineSet abc612-msrtw-worker-us-east-1c -n openshift-machine-api


Example MachineSet object

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet

# ...

spec:
# ...
  template:
    metadata:
# ...
    spec:
      metadata:
        labels:
          region: east
          type: user-node
# ...


Add labels directly to a node:

Edit the Node object for the node:

oc label nodes <name> <key>=<value>


For example, to label a node:

oc label nodes ip-10-0-142-25.ec2.internal type=user-node region=east


Tip

You can alternatively apply the following YAML to add labels to a node:

kind: Node
apiVersion: v1
metadata:
  name: hello-node-6fbccf8d9
  labels:
    type: "user-node"
    region: "east"
#...


Verify that the labels are added to the node:

oc get nodes -l type=user-node,region=east


Example output

NAME                          STATUS   ROLES    AGE   VERSION
ip-10-0-142-25.ec2.internal   Ready    worker   17m   v1.25.0


Add the matching node selector to a pod:

To add a node selector to existing and future pods, add a node selector to the controlling object for the pods:

Example ReplicaSet object with labels

kind: ReplicaSet
apiVersion: apps/v1
metadata:
  name: hello-node-6fbccf8d9
# ...
spec:
# ...
  template:
    metadata:
      creationTimestamp: null
      labels:
        ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
        pod-template-hash: 66d5cf9464
    spec:
      nodeSelector:
        kubernetes.io/os: linux
        node-role.kubernetes.io/worker: ''
        type: user-node 
#...


1: Add the node selector.

To add a node selector to a specific, new pod, add the selector to the Pod object directly:

Example Pod object with a node selector

apiVersion: v1
kind: Pod
metadata:
  name: hello-node-6fbccf8d9
#...
spec:
  nodeSelector:
    region: east
    type: user-node
#...


Note

You cannot add a node selector directly to an existing scheduled pod.

4.7.3. Creating default cluster-wide node selectors

You can use default cluster-wide node selectors on pods together with labels on nodes to constrain all pods created in a cluster to specific nodes.

With cluster-wide node selectors, when you create a pod in that cluster, OpenShift Container Platform adds the default node selectors to the pod and schedules the pod on nodes with matching labels.

You configure cluster-wide node selectors by editing the Scheduler Operator custom resource (CR). You add labels to a node, a compute machine set, or a machine config. Adding the label to the compute machine set ensures that if the node or machine goes down, new nodes have the label. Labels added to a node or machine config do not persist if the node or machine goes down.

Note

You can add additional key/value pairs to a pod. But you cannot add a different value for a default key.

Procedure

To add a default cluster-wide node selector:

Edit the Scheduler Operator CR to add the default cluster-wide node selectors:
```
oc edit scheduler cluster
```
Example Scheduler Operator CR with a node selector
```
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
...
spec:
  defaultNodeSelector: type=user-node,region=east 
  mastersSchedulable: false
```
1: Add a node selector with the appropriate <key>:<value> pairs.
After making this change, wait for the pods in the openshift-kube-apiserver project to redeploy. This can take several minutes. The default cluster-wide node selector does not take effect until the pods redeploy.

Add labels to a node by using a compute machine set or editing the node directly:

Use a compute machine set to add labels to nodes managed by the compute machine set when a node is created:

Run the following command to add labels to a MachineSet object:

oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>":"<value>","<key>":"<value>"}}]'  -n openshift-machine-api

1: Add a <key>/<value> pair for each label.

For example:

oc patch MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]'  -n openshift-machine-api


Tip

You can alternatively apply the following YAML to add labels to a compute machine set:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <machineset>
  namespace: openshift-machine-api
spec:
  template:
    spec:
      metadata:
        labels:
          region: "east"
          type: "user-node"


Verify that the labels are added to the MachineSet object by using the oc edit command:

For example:

oc edit MachineSet abc612-msrtw-worker-us-east-1c -n openshift-machine-api


Example MachineSet object

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
  ...
spec:
  ...
  template:
    metadata:
  ...
    spec:
      metadata:
        labels:
          region: east
          type: user-node
  ...


Redeploy the nodes associated with that compute machine set by scaling down to 0 and scaling up the nodes:

For example:

oc scale --replicas=0 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api


oc scale --replicas=1 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api


When the nodes are ready and available, verify that the label is added to the nodes by using the oc get command:

oc get nodes -l <key>=<value>


For example:

oc get nodes -l type=user-node


Example output

NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-l8nry52-f76d1-hl7m7-worker-c-vmqzp   Ready    worker   61s   v1.25.0


Add labels directly to a node:

Edit the Node object for the node:

oc label nodes <name> <key>=<value>


For example, to label a node:

oc label nodes ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49 type=user-node region=east


Tip

You can alternatively apply the following YAML to add labels to a node:

kind: Node
apiVersion: v1
metadata:
  name: <node_name>
  labels:
    type: "user-node"
    region: "east"


Verify that the labels are added to the node using the oc get command:

oc get nodes -l <key>=<value>,<key>=<value>


For example:

oc get nodes -l type=user-node,region=east


Example output

NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49   Ready    worker   17m   v1.25.0


4.7.4. Creating project-wide node selectors

You can use node selectors in a project together with labels on nodes to constrain all pods created in that project to the labeled nodes.

When you create a pod in this project, OpenShift Container Platform adds the project node selectors to the pod and schedules the pod on a node with matching labels. If there is a cluster-wide default node selector, the project node selector takes precedence.

You add node selectors to a project by editing the Namespace object to add the openshift.io/node-selector parameter. You add labels to a node, a compute machine set, or a machine config. Adding the label to the compute machine set ensures that if the node or machine goes down, new nodes have the label. Labels added to a node or machine config do not persist if the node or machine goes down.

A pod is not scheduled if the Pod object contains a node selector, but no project has a matching node selector. When you create a pod from that spec, you receive an error similar to the following message:

Example error message

Error from server (Forbidden): error when creating "pod.yaml": pods "pod-4" is forbidden: pod node label selector conflicts with its project node label selector


Note

You can add additional key/value pairs to a pod. But you cannot add a different value for a project key.
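
The merge behavior described in this note can be sketched in Python. This is a simplified model, not the admission plugin's code. It assumes the selector string uses the `key=value,key=value` form shown in the annotation, and the `parse_selector`, `effective_node_selector`, and `SelectorConflict` names are hypothetical.

```python
# Simplified model of merging a project node selector into a pod.

class SelectorConflict(Exception):
    """Raised when a pod sets a different value for a project selector key."""


def parse_selector(text: str) -> dict:
    """Parse 'key=value,key=value' into a dict."""
    pairs = (item.split("=", 1) for item in text.split(",") if item.strip())
    return {key.strip(): value.strip() for key, value in pairs}


def effective_node_selector(pod_selector: dict, project_annotation,
                            cluster_default: str = "") -> dict:
    """Return the selector the pod ends up with, or raise on a conflict."""
    # A project selector, when set, takes precedence over the cluster default.
    source = project_annotation if project_annotation is not None else cluster_default
    enforced = parse_selector(source)
    for key, value in enforced.items():
        if key in pod_selector and pod_selector[key] != value:
            raise SelectorConflict(
                f"pod sets {key}={pod_selector[key]}, project requires {key}={value}")
    # Extra pod keys are kept; enforced keys are added.
    return {**pod_selector, **enforced}


print(effective_node_selector({"type": "user-node"}, "region=east"))

try:
    effective_node_selector({"region": "west"}, "region=east")
except SelectorConflict as err:
    print("rejected:", err)
```

The first call shows a pod adding its own key alongside the project key. The second call corresponds to the Forbidden error in the example message, where the pod sets a different value for a project key.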

Procedure

To add a default project node selector:

Create a namespace or edit an existing namespace to add the openshift.io/node-selector parameter:

oc edit namespace <name>


Example output

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/node-selector: "type=user-node,region=east" 
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: kube:admin
    openshift.io/sa.scc.mcs: s0:c30,c5
    openshift.io/sa.scc.supplemental-groups: 1000880000/10000
    openshift.io/sa.scc.uid-range: 1000880000/10000
  creationTimestamp: "2021-05-10T12:35:04Z"
  labels:
    kubernetes.io/metadata.name: demo
  name: demo
  resourceVersion: "145537"
  uid: 3f8786e3-1fcb-42e3-a0e3-e2ac54d15001
spec:
  finalizers:
  - kubernetes


1: Add the openshift.io/node-selector with the appropriate <key>:<value> pairs.

Add labels to a node by using a compute machine set or editing the node directly:

Use a MachineSet object to add labels to nodes managed by the compute machine set when a node is created:

Run the following command to add labels to a MachineSet object:

oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>":"<value>","<key>":"<value>"}}]'  -n openshift-machine-api

For example:

oc patch MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]'  -n openshift-machine-api


Tip

You can alternatively apply the following YAML to add labels to a compute machine set:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <machineset>
  namespace: openshift-machine-api
spec:
  template:
    spec:
      metadata:
        labels:
          region: "east"
          type: "user-node"

Verify that the labels are added to the MachineSet object by using the oc edit command:

For example:

$ oc edit MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api

Example output

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
...
spec:
...
  template:
    metadata:
...
    spec:
      metadata:
        labels:
          region: east
          type: user-node

Redeploy the nodes associated with that compute machine set:

For example:

$ oc scale --replicas=0 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api

$ oc scale --replicas=1 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api

When the nodes are ready and available, verify that the label is added to the nodes by using the oc get command:

$ oc get nodes -l <key>=<value>

For example:

$ oc get nodes -l type=user-node,region=east

Example output

NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-l8nry52-f76d1-hl7m7-worker-c-vmqzp   Ready    worker   61s   v1.25.0

Add labels directly to a node:

Edit the Node object to add labels:

$ oc label <resource> <name> <key>=<value>

For example, to label a node:

$ oc label nodes ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49 type=user-node region=east

Tip

You can alternatively apply the following YAML to add labels to a node:

kind: Node
apiVersion: v1
metadata:
  name: <node_name>
  labels:
    type: "user-node"
    region: "east"

Verify that the labels are added to the Node object using the oc get command:

$ oc get nodes -l <key>=<value>

For example:

$ oc get nodes -l type=user-node,region=east

Example output

NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49   Ready    worker   17m   v1.25.0

4.8. Controlling pod placement by using pod topology spread constraints

You can use pod topology spread constraints to provide fine-grained control over the placement of your pods across nodes, zones, regions, or other user-defined topology domains. Distributing pods across failure domains can help to achieve high availability and more efficient resource utilization.

4.8.1. Example use cases

As an administrator, I want my workload to automatically scale between two and fifteen pods. I want to ensure that when there are only two pods, they are not placed on the same node, to avoid a single point of failure.
As an administrator, I want to distribute my pods evenly across multiple infrastructure zones to reduce latency and network costs. I want to ensure that my cluster can self-heal if issues arise.

4.8.2. Important considerations

Pods in an OpenShift Container Platform cluster are managed by workload controllers such as deployments, stateful sets, or daemon sets. These controllers define the desired state for a group of pods, including how they are distributed and scaled across the nodes in the cluster. You should set the same pod topology spread constraints on all pods in a group to avoid confusion. When using a workload controller, such as a deployment, the pod template typically handles this for you.
Mixing different pod topology spread constraints can make OpenShift Container Platform behavior confusing and troubleshooting more difficult. You can avoid this by ensuring that all nodes in a topology domain are consistently labeled. OpenShift Container Platform automatically populates well-known labels, such as kubernetes.io/hostname. This helps avoid the need for manual labeling of nodes. These labels provide essential topology information, ensuring consistent node labeling across the cluster.
Only pods within the same namespace are matched and grouped together when spreading due to a constraint.
You can specify multiple pod topology spread constraints, but you must ensure that they do not conflict with each other. All pod topology spread constraints must be satisfied for a pod to be placed.

4.8.3. Understanding skew and maxSkew

Skew refers to the difference in the number of pods that match a specified label selector across different topology domains, such as zones or nodes.

The skew is calculated for each domain by taking the absolute difference between the number of matching pods in that domain and the number of matching pods in the domain with the fewest pods scheduled. Setting a maxSkew value guides the scheduler to maintain a balanced pod distribution.

4.8.3.1. Example skew calculation

You have three zones (A, B, and C), and you want to distribute your pods evenly across these zones. Zone A has 5 pods, zone B has 3 pods, and zone C has 2 pods. To find the skew, subtract the number of pods in the domain with the fewest pods scheduled (zone C, with 2 pods) from the number of pods currently in each zone. The skew for zone A is therefore 3, the skew for zone B is 1, and the skew for zone C is 0.
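
The same arithmetic can be checked with a few lines of Python. This is a minimal sketch for illustration only: the function name skew_per_domain and the dictionary of pod counts are hypothetical and are not part of any OpenShift Container Platform API.

```python
def skew_per_domain(pod_counts):
    """Return {domain: skew}, where skew is the domain's pod count
    minus the pod count of the least-populated domain."""
    if not pod_counts:
        return {}
    lowest = min(pod_counts.values())
    return {domain: count - lowest for domain, count in pod_counts.items()}


# Pod counts from the example: zone A has 5 pods, zone B has 3, zone C has 2.
zones = {"A": 5, "B": 3, "C": 2}
print(skew_per_domain(zones))  # {'A': 3, 'B': 1, 'C': 0}
```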

4.8.3.2. The maxSkew parameter

The maxSkew parameter defines the maximum allowable difference, or skew, in the number of pods between any two topology domains. If maxSkew is set to 1, the number of pods in any topology domain should not differ by more than 1 from any other domain. If the skew exceeds maxSkew, the scheduler attempts to place new pods in a way that reduces the skew, adhering to the constraints.

Using the previous example skew calculation, the skew for zone A (3) exceeds the default maxSkew value of 1. The scheduler places new pods in zone B and zone C to reduce the skew and achieve a more balanced distribution, so that no topology domain exceeds a skew of 1.
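
The following Python sketch models how a maxSkew limit narrows the choice of domain for each new pod. It is a simplified model under stated assumptions, not the scheduler's implementation: it assumes that the incoming pod is counted in the candidate domain before the skew is compared with maxSkew, and it ignores everything else that the scheduler considers, such as node resources, taints, and affinity. The function names eligible_domains and place_pods are hypothetical.

```python
def eligible_domains(pod_counts, max_skew=1):
    """Return the domains that can take one more matching pod.

    A domain qualifies if its pod count, including the incoming pod,
    exceeds the current lowest count by no more than max_skew.
    """
    if max_skew < 1:
        raise ValueError("maxSkew must be at least 1")
    lowest = min(pod_counts.values())
    return sorted(
        domain for domain, count in pod_counts.items()
        if (count + 1) - lowest <= max_skew
    )


def place_pods(pod_counts, new_pods, max_skew=1):
    """Place new pods one at a time into the least-populated eligible domain."""
    counts = dict(pod_counts)
    for _ in range(new_pods):
        candidates = eligible_domains(counts, max_skew)
        target = min(candidates, key=lambda domain: counts[domain])
        counts[target] += 1
    return counts


zones = {"A": 5, "B": 3, "C": 2}
print(eligible_domains(zones))  # ['C']
print(place_pods(zones, 5))     # {'A': 5, 'B': 5, 'C': 5}
```

In this model, only zone C can accept the first new pod. Zone B becomes eligible after zone C catches up, and zone A receives no pods until the other zones reach its count.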

4.8.4. Example configurations for pod topology spread constraints

You can specify which pods to group together, which topology domains they are spread among, and the acceptable skew.

The following examples demonstrate pod topology spread constraint configurations.

Example to distribute pods that match the specified labels based on their zone

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    region: us-east
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  topologySpreadConstraints:
  - maxSkew: 1 # 1
    topologyKey: topology.kubernetes.io/zone # 2
    whenUnsatisfiable: DoNotSchedule # 3
    labelSelector: # 4
      matchLabels:
        region: us-east # 5
    matchLabelKeys:
      - my-pod-label # 6
  containers:
  - image: "docker.io/ocpqe/hello-pod"
    name: hello-pod
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: [ALL]

1: The maximum difference in number of pods between any two topology domains. The default is 1, and you cannot specify a value of 0.
2: The key of a node label. Nodes with this key and identical value are considered to be in the same topology.
3: How to handle a pod if it does not satisfy the spread constraint. The default is DoNotSchedule, which tells the scheduler not to schedule the pod. Set to ScheduleAnyway to schedule the pod regardless. In that case, the scheduler still prioritizes placements that honor the skew so that the cluster does not become more imbalanced.
4: Pods that match this label selector are counted and recognized as a group when spreading to satisfy the constraint. Be sure to specify a label selector, otherwise no pods can be matched.
5: Be sure that this Pod spec also sets its labels to match this label selector if you want it to be counted properly in the future.
6: A list of pod label keys to select which pods to calculate spreading over.

Example demonstrating a single pod topology spread constraint

kind: Pod
apiVersion: v1
metadata:
  name: my-pod
  labels:
    region: us-east
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        region: us-east
  containers:
  - image: "docker.io/ocpqe/hello-pod"
    name: hello-pod

The previous example defines a Pod spec with one pod topology spread constraint. It matches on pods labeled region: us-east, distributes among zones, specifies a skew of 1, and does not schedule the pod if it does not meet these requirements.

Example demonstrating multiple pod topology spread constraints

kind: Pod
apiVersion: v1
metadata:
  name: my-pod-2
  labels:
    region: us-east
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: node
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        region: us-east
  - maxSkew: 1
    topologyKey: rack
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        region: us-east
  containers:
  - image: "docker.io/ocpqe/hello-pod"
    name: hello-pod

The previous example defines a Pod spec with two pod topology spread constraints. Both match on pods labeled region: us-east, specify a skew of 1, and do not schedule the pod if it does not meet these requirements.

The first constraint distributes pods based on a user-defined label node, and the second constraint distributes pods based on a user-defined label rack. Both constraints must be met for the pod to be scheduled.
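
The requirement that both constraints hold can be sketched in Python as follows. This is an illustrative model only: the dictionary layout, the podCounts key, the node and rack names, and the function name node_satisfies_constraints are all hypothetical. The sketch also assumes that a node that lacks the label named by topologyKey cannot satisfy the constraint.

```python
def node_satisfies_constraints(node_labels, constraints):
    """Return True only if placing one more matching pod on the node
    keeps every constraint within its maxSkew."""
    for constraint in constraints:
        domain = node_labels.get(constraint["topologyKey"])
        if domain is None:
            # Assumption: a node without the topology label is not eligible.
            return False
        counts = constraint["podCounts"]  # matching pods per domain (hypothetical field)
        lowest = min(counts.values())
        if (counts.get(domain, 0) + 1) - lowest > constraint["maxSkew"]:
            return False
    return True


# Hypothetical cluster: node-1 and node-2 are in rack-1, node-3 is in rack-2.
constraints = [
    {"topologyKey": "node", "maxSkew": 1,
     "podCounts": {"node-1": 1, "node-2": 0, "node-3": 1}},
    {"topologyKey": "rack", "maxSkew": 1,
     "podCounts": {"rack-1": 1, "rack-2": 1}},
]

print(node_satisfies_constraints({"node": "node-1", "rack": "rack-1"}, constraints))  # False
print(node_satisfies_constraints({"node": "node-2", "rack": "rack-1"}, constraints))  # True
```

Here node-1 fails the node constraint because it already runs one matching pod while node-2 runs none. The node-2 placement passes both the node check and the rack check.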

4.9. Evicting pods using the descheduler

While the scheduler is used to determine the most suitable node to host a new pod, the descheduler can be used to evict a running pod so that the pod can be rescheduled onto a more suitable node.

4.9.1. About the descheduler

You can use the descheduler to evict pods based on specific strategies so that the pods can be rescheduled onto more appropriate nodes.

You can benefit from descheduling running pods in situations such as the following:

Nodes are underutilized or overutilized.
Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.
Node failure requires pods to be moved.
New nodes are added to clusters.
Pods have been restarted too many times.

Important

The descheduler does not schedule replacement of evicted pods. The scheduler automatically performs this task for the evicted pods.

When the descheduler decides to evict pods from a node, it employs the following general mechanism:

Pods in the openshift-* and kube-system namespaces are never evicted.
Critical pods with priorityClassName set to system-cluster-critical or system-node-critical are never evicted.
Static, mirrored, or stand-alone pods that are not part of a replication controller, replica set, deployment, StatefulSet, or job are never evicted because these pods will not be recreated.
Pods associated with daemon sets are never evicted.
Pods with local storage are never evicted.
Best effort pods are evicted before burstable and guaranteed pods.
All types of pods with the descheduler.alpha.kubernetes.io/evict annotation are eligible for eviction. This annotation is used to override checks that prevent eviction, and the user can select which pod is evicted. Users should know how and if the pod will be recreated.
Pods subject to a pod disruption budget (PDB) are not evicted if descheduling would violate that PDB. Pods are evicted by using the eviction subresource so that PDBs are honored.
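
The rules in the preceding list can be sketched as a filter followed by an ordering step. The following Python is a rough illustration and is not the descheduler's code: the Pod class and its fields are a hypothetical, simplified view of a pod, the evict annotation is modeled as overriding every other check (an assumption of this sketch), and the PDB check is omitted because the eviction subresource enforces it at eviction time.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch

EVICT_ANNOTATION = "descheduler.alpha.kubernetes.io/evict"
PROTECTED_NAMESPACES = ("openshift-*", "kube-system")
CRITICAL_PRIORITY_CLASSES = {"system-cluster-critical", "system-node-critical"}
# Owners that recreate a pod after it is evicted.
RECREATING_OWNERS = {"ReplicationController", "ReplicaSet", "Deployment", "StatefulSet", "Job"}
# Best effort pods are evicted before burstable and guaranteed pods.
QOS_EVICTION_ORDER = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}


@dataclass
class Pod:
    """Hypothetical, simplified view of a pod; not a Kubernetes API type."""
    name: str
    namespace: str
    owner_kind: str = ""  # empty for static, mirrored, or stand-alone pods
    priority_class_name: str = ""
    qos_class: str = "BestEffort"
    uses_local_storage: bool = False
    annotations: dict = field(default_factory=dict)


def is_evictable(pod):
    """Apply the listed rules to one pod."""
    if EVICT_ANNOTATION in pod.annotations:
        return True  # assumption: the annotation overrides all checks below
    if any(fnmatch(pod.namespace, pattern) for pattern in PROTECTED_NAMESPACES):
        return False
    if pod.priority_class_name in CRITICAL_PRIORITY_CLASSES:
        return False
    if pod.owner_kind not in RECREATING_OWNERS:
        return False  # stand-alone pods and daemon set pods
    if pod.uses_local_storage:
        return False
    return True


def eviction_order(pods):
    """Return evictable pods, best effort first."""
    candidates = [pod for pod in pods if is_evictable(pod)]
    return sorted(candidates, key=lambda pod: QOS_EVICTION_ORDER[pod.qos_class])


pods = [
    Pod("db", "shop", "StatefulSet", qos_class="Guaranteed"),
    Pod("web", "shop", "ReplicaSet", qos_class="Burstable"),
    Pod("batch", "shop", "Job"),
    Pod("dns", "openshift-dns", "DaemonSet"),
    Pod("cache", "shop", "ReplicaSet", uses_local_storage=True),
    Pod("debug", "shop", annotations={EVICT_ANNOTATION: "true"}),
]
print([pod.name for pod in eviction_order(pods)])  # ['batch', 'debug', 'web', 'db']
```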

4.9.2. Descheduler profiles

The following descheduler profiles are available:

AffinityAndTaints

This profile evicts pods that violate inter-pod anti-affinity, node affinity, and node taints.

It enables the following strategies:

RemovePodsViolatingInterPodAntiAffinity: removes pods that are violating inter-pod anti-affinity.
RemovePodsViolatingNodeAffinity: removes pods that are violating node affinity.
RemovePodsViolatingNodeTaints: removes pods that are violating NoSchedule taints on nodes.
Pods with a node affinity type of requiredDuringSchedulingIgnoredDuringExecution are removed.

TopologyAndDuplicates

This profile evicts pods in an effort to evenly spread similar pods, or pods of the same topology domain, among nodes.

It enables the following strategies:

RemovePodsViolatingTopologySpreadConstraint: finds unbalanced topology domains and tries to evict pods from larger ones when DoNotSchedule constraints are violated.
RemoveDuplicates: ensures that there is only one pod associated with a replica set, replication controller, deployment, or job running on the same node. If there are more, those duplicate pods are evicted for better pod distribution in a cluster.

LifecycleAndUtilization

This profile evicts long-running pods and balances resource usage between nodes.

It enables the following strategies:

RemovePodsHavingTooManyRestarts: removes pods whose containers have been restarted too many times.
Pods are removed when the sum of restarts over all containers (including init containers) is more than 100.
LowNodeUtilization: finds nodes that are underutilized and evicts pods, if possible, from overutilized nodes in the hope that the recreated pods are scheduled on these underutilized nodes.
A node is considered underutilized if its usage is below 20% for all thresholds (CPU, memory, and number of pods).
A node is considered overutilized if its usage is above 50% for any of the thresholds (CPU, memory, and number of pods).
PodLifeTime: evicts pods that are too old.
By default, pods that are older than 24 hours are removed. You can customize the pod lifetime value.
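
The LowNodeUtilization thresholds above combine differently for the two cases: a node must be below the low threshold on every resource to count as underutilized, but above the high threshold on only one resource to count as overutilized. The following Python sketch illustrates that logic. The function name classify_node, the usage dictionary, and the "appropriately utilized" label for the in-between case are hypothetical.

```python
def classify_node(usage, low=20.0, high=50.0):
    """Classify a node from its usage percentages.

    usage maps each threshold (CPU, memory, and number of pods) to a
    percentage. The defaults mirror the 20% and 50% values in the text.
    """
    if all(value < low for value in usage.values()):
        return "underutilized"
    if any(value > high for value in usage.values()):
        return "overutilized"
    return "appropriately utilized"


print(classify_node({"cpu": 10, "memory": 15, "pods": 5}))   # underutilized
print(classify_node({"cpu": 10, "memory": 65, "pods": 5}))   # overutilized
print(classify_node({"cpu": 10, "memory": 35, "pods": 5}))   # appropriately utilized
```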

SoftTopologyAndDuplicates

This profile is the same as TopologyAndDuplicates, except that pods with soft topology constraints, such as whenUnsatisfiable: ScheduleAnyway, are also considered for eviction.

Note

Do not enable both SoftTopologyAndDuplicates and TopologyAndDuplicates. Enabling both results in a conflict.

EvictPodsWithLocalStorage

This profile allows pods with local storage to be eligible for eviction.

EvictPodsWithPVC

This profile allows pods with persistent volume claims to be eligible for eviction. If you are using Kubernetes NFS Subdir External Provisioner, you must add an excluded namespace for the namespace where the provisioner is installed.

4.9.3. Installing the descheduler

The descheduler is not available by default. To enable the descheduler, you must install the Kube Descheduler Operator from OperatorHub and enable one or more descheduler profiles.

By default, the descheduler runs in predictive mode, which means that it only simulates pod evictions. You must change the mode to automatic for the descheduler to perform the pod evictions.

Important

If you have enabled hosted control planes in your cluster, set a custom priority threshold to lower the chance that pods in the hosted control plane namespaces are evicted. Set the priority threshold class name to hypershift-control-plane, because it has the lowest priority value (100000000) of the hosted control plane priority classes.

Prerequisites

Cluster administrator privileges.
Access to the OpenShift Container Platform web console.

Procedure

Log in to the OpenShift Container Platform web console.
Create the required namespace for the Kube Descheduler Operator.
1. Navigate to Administration → Namespaces and click Create Namespace.
2. Enter openshift-kube-descheduler-operator in the Name field, enter openshift.io/cluster-monitoring=true in the Labels field to enable descheduler metrics, and click Create.
Install the Kube Descheduler Operator.
1. Navigate to Operators → OperatorHub.
2. Type Kube Descheduler Operator into the filter box.
3. Select the Kube Descheduler Operator and click Install.
4. On the Install Operator page, select A specific namespace on the cluster. Select openshift-kube-descheduler-operator from the drop-down menu.
5. Adjust the values for the Update Channel and Approval Strategy to the desired values.
6. Click Install.
Create a descheduler instance.
1. From the Operators → Installed Operators page, click the Kube Descheduler Operator.
2. Select the Kube Descheduler tab and click Create KubeDescheduler.
3. Edit the settings as necessary.
  1. To evict pods instead of simulating the evictions, change the Mode field to Automatic.
  2. Expand the Profiles section to select one or more profiles to enable. The AffinityAndTaints profile is enabled by default. Click Add Profile to select additional profiles.
    Note
    Do not enable both TopologyAndDuplicates and SoftTopologyAndDuplicates. Enabling both results in a conflict.
  3. Optional: Expand the Profile Customizations section to set optional configurations for the descheduler.
    Set a custom pod lifetime value for the LifecycleAndUtilization profile. Use the podLifetime field to set a numerical value and a valid unit (s, m, or h). The default pod lifetime is 24 hours (24h).
    Set a custom priority threshold to consider pods for eviction only if their priority is lower than a specified priority level. Use the thresholdPriority field to set a numerical priority threshold or use the thresholdPriorityClassName field to specify a certain priority class name.
    Note
    Do not specify both thresholdPriority and thresholdPriorityClassName for the descheduler.
    Set specific namespaces to exclude or include from descheduler operations. Expand the namespaces field and add namespaces to the excluded or included list. You can set either a list of namespaces to exclude or a list of namespaces to include, but not both. Note that protected namespaces (openshift-*, kube-system, hypershift) are excluded by default.
    Important
    The LowNodeUtilization strategy does not support namespace exclusion. If the LifecycleAndUtilization profile is set, which enables the LowNodeUtilization strategy, then no namespaces are excluded, not even the protected namespaces. To avoid evictions from the protected namespaces while the LowNodeUtilization strategy is enabled, set the priority class name to system-cluster-critical or system-node-critical.
    Experimental: Set thresholds for underutilization and overutilization for the LowNodeUtilization strategy. Use the devLowNodeUtilizationThresholds field to set one of the following values:
    Low: 10% underutilized and 30% overutilized
    Medium: 20% underutilized and 50% overutilized (Default)
    High: 40% underutilized and 70% overutilized
    Note
    This setting is experimental and should not be used in a production environment.
  4. Optional: Use the Descheduling Interval Seconds field to change the number of seconds between descheduler runs. The default is 3600 seconds.
4. Click Create.

You can also configure the profiles and settings for the descheduler later using the OpenShift CLI (oc). If you did not adjust the profiles when creating the descheduler instance from the web console, the AffinityAndTaints profile is enabled by default.
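
As an illustration of the podLifetime customization described in the procedure, the following Python sketch parses a value such as 48h and checks whether a pod has outlived it. This is an assumption-laden sketch, not the Operator's parser: it accepts only a single whole number followed by one unit (s, m, or h), and the function names parse_pod_lifetime and exceeds_lifetime are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

_SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600}


def parse_pod_lifetime(value="24h"):
    """Convert a string such as '48h' into a timedelta.

    Assumption: exactly one whole number followed by s, m, or h.
    """
    match = re.fullmatch(r"(\d+)([smh])", value)
    if match is None:
        raise ValueError(f"unsupported pod lifetime: {value!r}")
    amount, unit = match.groups()
    return timedelta(seconds=int(amount) * _SECONDS_PER_UNIT[unit])


def exceeds_lifetime(created_at, now, lifetime="24h"):
    """Return True if the pod is older than the configured lifetime."""
    return now - created_at > parse_pod_lifetime(lifetime)


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
now = created + timedelta(hours=25)
print(exceeds_lifetime(created, now))          # True with the 24h default
print(exceeds_lifetime(created, now, "48h"))   # False
```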

4.9.4. Configuring descheduler profiles

You can configure which profiles the descheduler uses to evict pods.

Prerequisites

Cluster administrator privileges

Procedure

Edit the KubeDescheduler object:

$ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator

Specify one or more profiles in the spec.profiles section.
```
apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 3600
  logLevel: Normal
  managementState: Managed
  operatorLogLevel: Normal
  mode: Predictive # 1
  profileCustomizations:
    namespaces: # 2
      excluded:
      - my-namespace
    podLifetime: 48h # 3
    thresholdPriorityClassName: my-priority-class-name # 4
  profiles: # 5
  - AffinityAndTaints
  - TopologyAndDuplicates # 6
  - LifecycleAndUtilization
  - EvictPodsWithLocalStorage
  - EvictPodsWithPVC
```

1: Optional: By default, the descheduler does not evict pods. To evict pods, set mode to Automatic.
2: Optional: Set a list of user-created namespaces to include or exclude from descheduler operations. Use excluded to set a list of namespaces to exclude or use included to set a list of namespaces to include. Note that protected namespaces (openshift-*, kube-system, hypershift) are excluded by default.
Important
The LowNodeUtilization strategy does not support namespace exclusion. If the LifecycleAndUtilization profile is set, which enables the LowNodeUtilization strategy, then no namespaces are excluded, not even the protected namespaces. To avoid evictions from the protected namespaces while the LowNodeUtilization strategy is enabled, set the priority class name to system-cluster-critical or system-node-critical.
3: Optional: Enable a custom pod lifetime value for the LifecycleAndUtilization profile. Valid units are s, m, or h. The default pod lifetime is 24 hours.
4: Optional: Specify a priority threshold to consider pods for eviction only if their priority is lower than the specified level. Use the thresholdPriority field to set a numerical priority threshold (for example, 10000) or use the thresholdPriorityClassName field to specify a certain priority class name (for example, my-priority-class-name). If you specify a priority class name, it must already exist or the descheduler will throw an error. Do not set both thresholdPriority and thresholdPriorityClassName.
5: Add one or more profiles to enable. Available profiles: AffinityAndTaints, TopologyAndDuplicates, LifecycleAndUtilization, SoftTopologyAndDuplicates, EvictPodsWithLocalStorage, and EvictPodsWithPVC.
6: Do not enable both TopologyAndDuplicates and SoftTopologyAndDuplicates. Enabling both results in a conflict.

You can enable multiple profiles. The order in which the profiles are specified is not important.
Save the file to apply the changes.
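
Before saving, it can help to check an edited spec against the restrictions stated in this section. The following Python sketch performs such a check on a dictionary shaped like the spec of the KubeDescheduler object. It is a hypothetical client-side helper, not validation that the Operator performs: the function name check_descheduler_spec and the message strings are illustrative, and only the rules stated in this section are covered.

```python
KNOWN_PROFILES = {
    "AffinityAndTaints",
    "TopologyAndDuplicates",
    "LifecycleAndUtilization",
    "SoftTopologyAndDuplicates",
    "EvictPodsWithLocalStorage",
    "EvictPodsWithPVC",
}


def check_descheduler_spec(spec):
    """Return a list of problems found in a KubeDescheduler spec dictionary."""
    problems = []
    profiles = spec.get("profiles") or []
    for name in profiles:
        if name not in KNOWN_PROFILES:
            problems.append(f"unknown profile: {name}")
    if {"TopologyAndDuplicates", "SoftTopologyAndDuplicates"} <= set(profiles):
        problems.append("TopologyAndDuplicates conflicts with SoftTopologyAndDuplicates")

    custom = spec.get("profileCustomizations") or {}
    if "thresholdPriority" in custom and "thresholdPriorityClassName" in custom:
        problems.append("set only one of thresholdPriority and thresholdPriorityClassName")

    namespaces = custom.get("namespaces") or {}
    if namespaces.get("included") and namespaces.get("excluded"):
        problems.append("set either included or excluded namespaces, not both")
    return problems


# The spec from the example above, written as a Python dictionary.
spec = {
    "deschedulingIntervalSeconds": 3600,
    "mode": "Predictive",
    "profileCustomizations": {
        "namespaces": {"excluded": ["my-namespace"]},
        "podLifetime": "48h",
        "thresholdPriorityClassName": "my-priority-class-name",
    },
    "profiles": [
        "AffinityAndTaints",
        "TopologyAndDuplicates",
        "LifecycleAndUtilization",
        "EvictPodsWithLocalStorage",
        "EvictPodsWithPVC",
    ],
}
print(check_descheduler_spec(spec))  # []
```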

4.9.5. Configuring the descheduler interval

You can configure the amount of time between descheduler runs. The default is 3600 seconds (one hour).

Prerequisites

Cluster administrator privileges

Procedure

Edit the KubeDescheduler object:

$ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator

Update the deschedulingIntervalSeconds field to the desired value:

apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 3600 # 1
...

1: Set the number of seconds between descheduler runs. A value of 0 in this field runs the descheduler once and exits.

Save the file to apply the changes.

4.9.6. Uninstalling the descheduler

You can remove the descheduler from your cluster by removing the descheduler instance and uninstalling the Kube Descheduler Operator. This procedure also cleans up the KubeDescheduler CRD and openshift-kube-descheduler-operator namespace.

Prerequisites

Cluster administrator privileges.
Access to the OpenShift Container Platform web console.

Procedure

Log in to the OpenShift Container Platform web console.
Delete the descheduler instance.
1. From the Operators → Installed Operators page, click Kube Descheduler Operator.
2. Select the Kube Descheduler tab.
3. Click the Options menu next to the cluster entry and select Delete KubeDescheduler.
4. In the confirmation dialog, click Delete.
Uninstall the Kube Descheduler Operator.
1. Navigate to Operators → Installed Operators.
2. Click the Options menu next to the Kube Descheduler Operator entry and select Uninstall Operator.
3. In the confirmation dialog, click Uninstall.
Delete the openshift-kube-descheduler-operator namespace.
1. Navigate to Administration → Namespaces.
2. Enter openshift-kube-descheduler-operator into the filter box.
3. Click the Options menu next to the openshift-kube-descheduler-operator entry and select Delete Namespace.
4. In the confirmation dialog, enter openshift-kube-descheduler-operator and click Delete.
Delete the KubeDescheduler CRD.
1. Navigate to Administration → Custom Resource Definitions.
2. Enter KubeDescheduler into the filter box.
3. Click the Options menu next to the KubeDescheduler entry and select Delete CustomResourceDefinition.
4. In the confirmation dialog, click Delete.

4.10. Secondary scheduler

4.10.1. Secondary scheduler overview

You can install the Secondary Scheduler Operator to run a custom secondary scheduler alongside the default scheduler to schedule pods.

4.10.1.1. About the Secondary Scheduler Operator

The Secondary Scheduler Operator for Red Hat OpenShift provides a way to deploy a custom secondary scheduler in OpenShift Container Platform. The secondary scheduler runs alongside the default scheduler to schedule pods. Pod configurations can specify which scheduler to use.

The custom scheduler must have the /bin/kube-scheduler binary and be based on the Kubernetes scheduling framework.

Important

You can use the Secondary Scheduler Operator to deploy a custom secondary scheduler in OpenShift Container Platform, but Red Hat does not directly support the functionality of the custom secondary scheduler.

The Secondary Scheduler Operator creates the default roles and role bindings required by the secondary scheduler. You can specify which scheduling plugins to enable or disable by configuring the KubeSchedulerConfiguration resource for the secondary scheduler.

4.10.2. Secondary Scheduler Operator for Red Hat OpenShift release notes

The Secondary Scheduler Operator for Red Hat OpenShift allows you to deploy a custom secondary scheduler in your OpenShift Container Platform cluster.

These release notes track the development of the Secondary Scheduler Operator for Red Hat OpenShift.

For more information, see About the Secondary Scheduler Operator.

4.10.2.1. Release notes for Secondary Scheduler Operator for Red Hat OpenShift 1.1.4

Issued: 26 November 2024

The following advisory is available for the Secondary Scheduler Operator for Red Hat OpenShift 1.1.4:

RHEA-2024:10113

4.10.2.1.1. Bug fixes

This release of the Secondary Scheduler Operator addresses Common Vulnerabilities and Exposures (CVEs).

4.10.2.1.2. Known issues

Currently, you cannot deploy additional resources, such as config maps, CRDs, or RBAC policies through the Secondary Scheduler Operator. Any resources other than roles and role bindings that are required by your custom secondary scheduler must be applied externally. (WRKLDS-645)

4.10.2.2. Release notes for Secondary Scheduler Operator for Red Hat OpenShift 1.1.3

Issued: 26 October 2023

The following advisory is available for the Secondary Scheduler Operator for Red Hat OpenShift 1.1.3:

RHSA-2023:5933

4.10.2.2.1. Bug fixes

This release of the Secondary Scheduler Operator addresses Common Vulnerabilities and Exposures (CVEs).

4.10.2.2.2. Known issues

Currently, you cannot deploy additional resources, such as config maps, CRDs, or RBAC policies through the Secondary Scheduler Operator. Any resources other than roles and role bindings that are required by your custom secondary scheduler must be applied externally. (WRKLDS-645)

4.10.2.3. Release notes for Secondary Scheduler Operator for Red Hat OpenShift 1.1.2

Issued: 23 August 2023

The following advisory is available for the Secondary Scheduler Operator for Red Hat OpenShift 1.1.2:

RHSA-2023:4657

4.10.2.3.1. Bug fixes

This release of the Secondary Scheduler Operator addresses several Common Vulnerabilities and Exposures (CVEs).

4.10.2.3.2. Known issues

Currently, you cannot deploy additional resources, such as config maps, CRDs, or RBAC policies, through the Secondary Scheduler Operator. Any resources other than roles and role bindings that are required by your custom secondary scheduler must be applied externally. (WRKLDS-645)

4.10.2.4. Release notes for Secondary Scheduler Operator for Red Hat OpenShift 1.1.0

Issued: 1 September 2022

The following advisory is available for the Secondary Scheduler Operator for Red Hat OpenShift 1.1.0:

RHSA-2022:6152

4.10.2.4.1. New features and enhancements

The Secondary Scheduler Operator security context configuration has been updated to comply with pod security admission enforcement.

4.10.2.4.2. Known issues

Currently, you cannot deploy additional resources, such as config maps, CRDs, or RBAC policies, through the Secondary Scheduler Operator. Any resources other than roles and role bindings that are required by your custom secondary scheduler must be applied externally. (BZ#2071684)

4.10.3. Scheduling pods using a secondary scheduler

You can run a custom secondary scheduler in OpenShift Container Platform by installing the Secondary Scheduler Operator, deploying the secondary scheduler, and setting the secondary scheduler in the pod definition.

4.10.3.1. Installing the Secondary Scheduler Operator

You can use the web console to install the Secondary Scheduler Operator for Red Hat OpenShift.

Prerequisites

You have access to the cluster with cluster-admin privileges.
You have access to the OpenShift Container Platform web console.

Procedure

Log in to the OpenShift Container Platform web console.
Create the required namespace for the Secondary Scheduler Operator for Red Hat OpenShift.
1. Navigate to Administration → Namespaces and click Create Namespace.
2. Enter openshift-secondary-scheduler-operator in the Name field and click Create.
Install the Secondary Scheduler Operator for Red Hat OpenShift.
1. Navigate to Operators → OperatorHub.
2. Enter Secondary Scheduler Operator for Red Hat OpenShift into the filter box.
3. Select the Secondary Scheduler Operator for Red Hat OpenShift and click Install.
4. On the Install Operator page:
  1. The Update channel is set to stable, which installs the latest stable release of the Secondary Scheduler Operator for Red Hat OpenShift.
  2. Select A specific namespace on the cluster and select openshift-secondary-scheduler-operator from the drop-down menu.
  3. Select an Update approval strategy.
    The Automatic strategy allows Operator Lifecycle Manager (OLM) to automatically update the Operator when a new version is available.
    The Manual strategy requires a user with appropriate credentials to approve the Operator update.
  4. Click Install.

Verification

Navigate to Operators → Installed Operators.
Verify that Secondary Scheduler Operator for Red Hat OpenShift is listed with a Status of Succeeded.

4.10.3.2. Deploying a secondary scheduler

After you have installed the Secondary Scheduler Operator, you can deploy a secondary scheduler.

Prerequisites

You have access to the cluster with cluster-admin privileges.
You have access to the OpenShift Container Platform web console.
The Secondary Scheduler Operator for Red Hat OpenShift is installed.

Procedure

Create a config map to hold the configuration for the secondary scheduler.

Navigate to Workloads → ConfigMaps.
Click Create ConfigMap.

In the YAML editor, enter the config map definition that contains the necessary KubeSchedulerConfiguration configuration. For example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: "secondary-scheduler-config" # 1
  namespace: "openshift-secondary-scheduler-operator" # 2
data:
  "config.yaml": |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration # 3
    leaderElection:
      leaderElect: false
    profiles:
      - schedulerName: secondary-scheduler # 4
        plugins: # 5
          score:
            disabled:
              - name: NodeResourcesBalancedAllocation
              - name: NodeResourcesLeastAllocated

1: The name of the config map. This is used in the Scheduler Config field when creating the SecondaryScheduler CR.
2: The config map must be created in the openshift-secondary-scheduler-operator namespace.
3: The KubeSchedulerConfiguration resource for the secondary scheduler. For more information, see KubeSchedulerConfiguration in the Kubernetes API documentation.
4: The name of the secondary scheduler. Pods that set their spec.schedulerName field to this value are scheduled with this secondary scheduler.
5: The plugins to enable or disable for the secondary scheduler. For a list of default scheduling plugins, see Scheduling plugins in the Kubernetes documentation.

Click Create.

Create the SecondaryScheduler CR:
1. Navigate to Operators → Installed Operators.
2. Select Secondary Scheduler Operator for Red Hat OpenShift.
3. Select the Secondary Scheduler tab and click Create SecondaryScheduler.
4. The Name field defaults to cluster; do not change this name.
5. The Scheduler Config field defaults to secondary-scheduler-config. Ensure that this value matches the name of the config map created earlier in this procedure.
6. In the Scheduler Image field, enter the image name for your custom scheduler.
  Important
  Red Hat does not directly support the functionality of your custom secondary scheduler.
7. Click Create.
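
For reference, the form fields in the previous steps correspond to fields on the SecondaryScheduler custom resource. The following manifest is a minimal sketch of what the web console form produces, not a verified schema: it assumes that the CR belongs to the operator.openshift.io/v1 API group and that the fields are named spec.schedulerConfig and spec.schedulerImage. The image reference is a placeholder. Check the YAML view in the web console before applying a manifest like this one.

```yaml
# Sketch only: the API group and field names are assumptions; verify them against the CRD in your cluster.
apiVersion: operator.openshift.io/v1
kind: SecondaryScheduler
metadata:
  name: cluster                                       # do not change this name
  namespace: openshift-secondary-scheduler-operator
spec:
  schedulerConfig: secondary-scheduler-config         # name of the config map created earlier
  schedulerImage: <registry>/<custom_scheduler>:<tag> # placeholder for your custom scheduler image
```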

4.10.3.3. Scheduling a pod using the secondary scheduler

To schedule a pod using the secondary scheduler, set the schedulerName field in the pod definition.

Prerequisites

You have access to the cluster with cluster-admin privileges.
You have access to the OpenShift Container Platform web console.
The Secondary Scheduler Operator for Red Hat OpenShift is installed.
A secondary scheduler is configured.

Procedure

Log in to the OpenShift Container Platform web console.
Navigate to Workloads → Pods.
Click Create Pod.

In the YAML editor, enter the desired pod configuration and add the schedulerName field:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
spec:
  containers:
    - name: nginx
      image: nginx:1.14.2
      ports:
        - containerPort: 80
  schedulerName: secondary-scheduler # 1

1: The schedulerName field must match the name that is defined in the config map when you configured the secondary scheduler.

Click Create.
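
The same field applies to pods that a workload controller creates. As a minimal sketch, assuming a hypothetical Deployment named nginx-secondary, the schedulerName field belongs in the pod template (spec.template.spec), not at the top level of the Deployment spec:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-secondary          # hypothetical name
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-secondary
  template:
    metadata:
      labels:
        app: nginx-secondary
    spec:
      schedulerName: secondary-scheduler   # must match the schedulerName in the config map
      containers:
        - name: nginx
          image: nginx:1.14.2
          ports:
            - containerPort: 80
```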

Verification

Describe the pod using the following command:

$ oc describe pod nginx -n default

Example output

Name:         nginx
Namespace:    default
Priority:     0
Node:         ci-ln-t0w4r1k-72292-xkqs4-worker-b-xqkxp/10.0.128.3
...
Events:
  Type    Reason          Age   From                 Message
  ----    ------          ----  ----                 -------
  Normal  Scheduled       12s   secondary-scheduler  Successfully assigned default/nginx to ci-ln-t0w4r1k-72292-xkqs4-worker-b-xqkxp
...

In the events table, find the event with a message similar to Successfully assigned <namespace>/<pod_name> to <node_name>.
In the "From" column, verify that the event was generated from the secondary scheduler and not the default scheduler.
Note
You can also check the secondary-scheduler-* pod logs in the openshift-secondary-scheduler-operator namespace to verify that the pod was scheduled by the secondary scheduler.

4.10.4. Uninstalling the Secondary Scheduler Operator

You can remove the Secondary Scheduler Operator for Red Hat OpenShift from OpenShift Container Platform by uninstalling the Operator and removing its related resources.

4.10.4.1. Uninstalling the Secondary Scheduler Operator

You can uninstall the Secondary Scheduler Operator for Red Hat OpenShift by using the web console.

Prerequisites

You have access to the cluster with cluster-admin privileges.
You have access to the OpenShift Container Platform web console.
The Secondary Scheduler Operator for Red Hat OpenShift is installed.

Procedure

Log in to the OpenShift Container Platform web console.
Uninstall the Secondary Scheduler Operator for Red Hat OpenShift.
1. Navigate to Operators → Installed Operators.
2. Click the Options menu next to the Secondary Scheduler Operator entry and click Uninstall Operator.
3. In the confirmation dialog, click Uninstall.

4.10.4.2. Removing Secondary Scheduler Operator resources

Optionally, after uninstalling the Secondary Scheduler Operator for Red Hat OpenShift, you can remove its related resources from your cluster.

Prerequisites

You have access to the cluster with cluster-admin privileges.
You have access to the OpenShift Container Platform web console.

Procedure

Log in to the OpenShift Container Platform web console.
Remove CRDs that were installed by the Secondary Scheduler Operator:
1. Navigate to Administration → CustomResourceDefinitions.
2. Enter SecondaryScheduler in the Name field to filter the CRDs.
3. Click the Options menu next to the SecondaryScheduler CRD and select Delete Custom Resource Definition.
Remove the openshift-secondary-scheduler-operator namespace.
1. Navigate to Administration → Namespaces.
2. Click the Options menu next to the openshift-secondary-scheduler-operator namespace and select Delete Namespace.
3. In the confirmation dialog, enter openshift-secondary-scheduler-operator in the field and click Delete.

Chapter 5. Using Jobs and DaemonSets

5.1. Running background tasks on nodes automatically with daemon sets

As an administrator, you can create and use daemon sets to run replicas of a pod on specific or all nodes in an OpenShift Container Platform cluster.

A daemon set ensures that all (or some) nodes run a copy of a pod. As nodes are added to the cluster, pods are added to them. As nodes are removed from the cluster, those pods are removed through garbage collection. Deleting a daemon set cleans up the pods it created.

You can use daemon sets to create shared storage, run a logging pod on every node in your cluster, or deploy a monitoring agent on every node.

For security reasons, the cluster administrators and the project administrators can create daemon sets.

For more information on daemon sets, see the Kubernetes documentation.

Important

Daemon set scheduling is incompatible with a project’s default node selector. If you fail to disable the default node selector, the daemon set is restricted by merging with it. This results in frequent pod re-creation on the nodes that were unselected by the merged node selector, which in turn puts unwanted load on the cluster.

5.1.1. Scheduled by default scheduler

A daemon set ensures that all eligible nodes run a copy of a pod. Normally, the node that a pod runs on is selected by the Kubernetes scheduler. However, daemon set pods are created and scheduled by the daemon set controller. That introduces the following issues:

Inconsistent pod behavior: Normal pods waiting to be scheduled are created and in Pending state, but daemon set pods are not created in Pending state. This is confusing to the user.
Pod preemption is handled by the default scheduler: When preemption is enabled, the daemon set controller makes scheduling decisions without considering pod priority and preemption.

The ScheduleDaemonSetPods feature, enabled by default in OpenShift Container Platform, lets you schedule daemon sets using the default scheduler instead of the daemon set controller, by adding the NodeAffinity term to the daemon set pods, instead of the spec.nodeName term. The default scheduler is then used to bind the pod to the target host. If node affinity of the daemon set pod already exists, it is replaced. The daemon set controller only performs these operations when creating or modifying daemon set pods, and no changes are made to the spec.template of the daemon set.

kind: Pod
apiVersion: v1
metadata:
  name: hello-node-6fbccf8d9-9tmzr
#...
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - target-host-name
#...

In addition, a node.kubernetes.io/unschedulable:NoSchedule toleration is added automatically to daemon set pods. The default scheduler ignores unschedulable Nodes when scheduling daemon set pods.
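
As an illustration, the automatically added toleration would appear in the spec of a daemon set pod roughly as follows. This is a sketch in standard Kubernetes toleration syntax, not output captured from a cluster, and it omits any other tolerations that the controller might add:

```yaml
spec:
  tolerations:
  - key: node.kubernetes.io/unschedulable   # taint set on cordoned nodes
    operator: Exists
    effect: NoSchedule
```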

5.1.2. Creating daemonsets

When creating daemon sets, the nodeSelector field is used to indicate the nodes on which the daemon set should deploy replicas.

Prerequisites

Before you start using daemon sets, disable the default project-wide node selector in your namespace, by setting the namespace annotation openshift.io/node-selector to an empty string:

$ oc patch namespace myproject -p \
    '{"metadata": {"annotations": {"openshift.io/node-selector": ""}}}'

Tip

You can alternatively apply the following YAML to disable the default project-wide node selector for a namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: <namespace>
  annotations:
    openshift.io/node-selector: ''
#...

If you are creating a new project, overwrite the default node selector:
```
$ oc adm new-project <name> --node-selector=""
```

Procedure

To create a daemon set:

Define the daemon set YAML file:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hello-daemonset
spec:
  selector:
    matchLabels:
      name: hello-daemonset # 1
  template:
    metadata:
      labels:
        name: hello-daemonset # 2
    spec:
      nodeSelector: # 3
        role: worker
      containers:
      - image: openshift/hello-openshift
        imagePullPolicy: Always
        name: registry
        ports:
        - containerPort: 80
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
      serviceAccount: default
      terminationGracePeriodSeconds: 10
#...

1: The label selector that determines which pods belong to the daemon set.
2: The pod template’s labels. These must match the label selector above.
3: The node selector that determines on which nodes pod replicas should be deployed. A matching label must be present on the node.

Create the daemon set object:
```
$ oc create -f daemonset.yaml
```

To verify that the pods were created, and that each node has a pod replica:

Find the daemon set pods:

$ oc get pods

Example output

hello-daemonset-cx6md   1/1       Running   0          2m
hello-daemonset-e3md9   1/1       Running   0          2m

View the pods to verify the pod has been placed onto the node:

$ oc describe pod/hello-daemonset-cx6md|grep Node

Example output

Node:        openshift-node01.hostname.com/10.14.20.134

$ oc describe pod/hello-daemonset-e3md9|grep Node

Example output

Node:        openshift-node02.hostname.com/10.14.20.137

Important

If you update a daemon set pod template, the existing pod replicas are not affected.
If you delete a daemon set and then create a new daemon set with a different template but the same label selector, it recognizes any existing pod replicas as having matching labels and thus does not update them or create new replicas despite a mismatch in the pod template.
If you change node labels, the daemon set adds pods to nodes that match the new labels and deletes pods from nodes that do not match the new labels.

To update a daemon set, force new pod replicas to be created by deleting the old replicas or nodes.

5.2. Running tasks in pods using jobs

A job executes a task in your OpenShift Container Platform cluster.

A job tracks the overall progress of a task and updates its status with information about active, succeeded, and failed pods. Deleting a job will clean up any pod replicas it created. Jobs are part of the Kubernetes API, which can be managed with oc commands like other object types.

Sample Job specification

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 1 # 1
  completions: 1 # 2
  activeDeadlineSeconds: 1800 # 3
  backoffLimit: 6 # 4
  template: # 5
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: OnFailure # 6
#...

1: The number of pod replicas a job should run in parallel.
2: The number of successful pod completions needed to mark a job completed.
3: The maximum duration the job can run.
4: The number of retries for a job.
5: The template for the pod the controller creates.
6: The restart policy of the pod.

Additional resources

Jobs in the Kubernetes documentation

5.2.1. Understanding jobs and cron jobs

A job tracks the overall progress of a task and updates its status with information about active, succeeded, and failed pods. Deleting a job cleans up any pods it created. Jobs are part of the Kubernetes API, which can be managed with oc commands like other object types.

There are two possible resource types that allow creating run-once objects in OpenShift Container Platform:

Job

A regular job is a run-once object that creates a task and ensures the job finishes.

There are three main types of task suitable to run as a job:

Non-parallel jobs:
- A job that starts only one pod, unless the pod fails.
- The job is complete as soon as its pod terminates successfully.
Parallel jobs with a fixed completion count:
- A job that starts multiple pods.
- The job represents the overall task and is complete when there is one successful pod for each value in the range 1 to the completions value.
Parallel jobs with a work queue:
- A job with multiple parallel worker processes in a given pod.
- OpenShift Container Platform coordinates pods to determine what each should work on, or you use an external queue service.
- Each pod is independently capable of determining whether or not all peer pods are complete and that the entire job is done.
- When any pod from the job terminates with success, no new pods are created.
- When at least one pod has terminated with success and all pods are terminated, the job is successfully completed.
- When any pod has exited with success, no other pod should be doing any work for this task or writing any output. Pods should all be in the process of exiting.
  For more information about how to make use of the different types of job, see Job Patterns in the Kubernetes documentation.

Cron job

A job can be scheduled to run multiple times, using a cron job.

A cron job builds on a regular job by allowing you to specify how the job should be run. Cron jobs are part of the Kubernetes API, which can be managed with oc commands like other object types.

Cron jobs are useful for creating periodic and recurring tasks, like running backups or sending emails. Cron jobs can also schedule individual tasks for a specific time, such as if you want to schedule a job for a low activity period. A cron job creates a Job object based on the timezone configured on the control plane node that runs the cronjob controller.

Warning

A cron job creates a Job object approximately once per execution time of its schedule, but there are circumstances in which it fails to create a job or two jobs might be created. Therefore, jobs must be idempotent and you must configure history limits.

5.2.1.1. Understanding how to create jobs

Both resource types require a job configuration that consists of the following key parts:

A pod template, which describes the pod that OpenShift Container Platform creates.
The parallelism parameter, which specifies how many pods running in parallel at any point in time should execute a job.
- For non-parallel jobs, leave unset. When unset, defaults to 1.
The completions parameter, specifying how many successful pod completions are needed to finish a job.
- For non-parallel jobs, leave unset. When unset, defaults to 1.
- For parallel jobs with a fixed completion count, specify a value.
- For parallel jobs with a work queue, leave unset. When unset, defaults to the parallelism value.
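
As a sketch of how these two parameters combine, the following hypothetical job is a parallel job with a fixed completion count. The name process-items, the busybox image, and the command are illustrative and do not come from this document. The job needs five successful pod completions and runs at most two pods at a time:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: process-items            # hypothetical name
spec:
  completions: 5                 # the job is complete after five pods succeed
  parallelism: 2                 # at most two pods run at the same time
  template:
    spec:
      containers:
      - name: worker
        image: busybox           # illustrative image
        command: ["sh", "-c", "echo processing one item && sleep 5"]
      restartPolicy: OnFailure
```

Removing completions and keeping parallelism gives the work-queue form described above. Omitting both gives a non-parallel job.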

5.2.1.2. Understanding how to set a maximum duration for jobs

When defining a job, you can define its maximum duration by setting the activeDeadlineSeconds field. It is specified in seconds and is not set by default. When not set, there is no maximum duration enforced.

The maximum duration is counted from the time when the first pod is scheduled in the system, and defines how long a job can be active. It tracks the overall time of an execution. After reaching the specified timeout, the job is terminated by OpenShift Container Platform.

5.2.1.3. Understanding how to set a job back off policy for pod failure

A job can be considered failed after a set number of retries, due to a logical error in configuration or other similar reasons. Failed pods associated with the job are recreated by the controller with an exponential back off delay (10s, 20s, 40s …) capped at six minutes. The limit is reset if no new failed pods appear between controller checks.

Use the spec.backoffLimit parameter to set the number of retries for a job.

5.2.1.4. Understanding how to configure a cron job to remove artifacts

Cron jobs can leave behind artifact resources such as jobs or pods. It is important to configure history limits so that old jobs and their pods are properly cleaned up. Two fields within the cron job’s spec are responsible for that:

.spec.successfulJobsHistoryLimit. The number of successful finished jobs to retain (defaults to 3).
.spec.failedJobsHistoryLimit. The number of failed finished jobs to retain (defaults to 1).

Tip

Delete cron jobs that you no longer need:
```
$ oc delete cronjob/<cron_job_name>
```
Doing this prevents them from generating unnecessary artifacts.
You can suspend further executions by setting the spec.suspend field to true. All subsequent executions are suspended until you reset the field to false.
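
A minimal sketch of these settings together, shown as a fragment of a CronJob spec. The values are illustrative rather than recommended:

```yaml
spec:
  successfulJobsHistoryLimit: 1   # keep only the most recent successful job
  failedJobsHistoryLimit: 3       # keep the last three failed jobs for debugging
  suspend: true                   # pause new executions until reset to false
```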

5.2.1.5. Known limitations

The job specification restart policy only applies to the pods, and not the job controller. However, the job controller is hard-coded to keep retrying jobs to completion.

As such, restartPolicy: Never or --restart=Never results in the same behavior as restartPolicy: OnFailure or --restart=OnFailure. That is, when a job fails it is restarted automatically until it succeeds (or is manually discarded). The policy only sets which subsystem performs the restart.

With the Never policy, the job controller performs the restart. With each attempt, the job controller increments the number of failures in the job status and creates new pods. This means that with each failed attempt, the number of pods increases.

With the OnFailure policy, kubelet performs the restart. Each attempt does not increment the number of failures in the job status. In addition, kubelet retries failed jobs by starting pods on the same nodes.
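
To make the difference concrete, the following sketch shows a hypothetical job whose container always exits with an error. The name, image, and command are illustrative. With restartPolicy: Never, each retry is a new pod created by the job controller, so the failed pods accumulate and remain available for inspection:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: always-fails             # hypothetical name
spec:
  backoffLimit: 3                # number of retries before the job is marked failed
  template:
    spec:
      containers:
      - name: fail
        image: busybox           # illustrative image
        command: ["sh", "-c", "exit 1"]
      restartPolicy: Never       # the job controller creates a new pod for each attempt
```

Changing restartPolicy to OnFailure in this sketch would instead leave the restart to kubelet on the same node.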

5.2.2. Creating jobs

You create a job in OpenShift Container Platform by creating a job object.

Procedure

To create a job:

Create a YAML file similar to the following:
```
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 1 # 1
  completions: 1 # 2
  activeDeadlineSeconds: 1800 # 3
  backoffLimit: 6 # 4
  template: # 5
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: OnFailure # 6
#...
```
1: Optional: Specify how many pod replicas a job should run in parallel; defaults to 1.
- For non-parallel jobs, leave unset. When unset, defaults to 1.
2: Optional: Specify how many successful pod completions are needed to mark a job completed.
- For non-parallel jobs, leave unset. When unset, defaults to 1.
- For parallel jobs with a fixed completion count, specify the number of completions.
- For parallel jobs with a work queue, leave unset. When unset, defaults to the parallelism value.
3: Optional: Specify the maximum duration the job can run.
4: Optional: Specify the number of retries for a job. This field defaults to six.
5: Specify the template for the pod the controller creates.
6: Specify the restart policy of the pod:
- Never. Do not restart the job.
- OnFailure. Restart the job only if it fails.
- Always. Always restart the job.
For details on how OpenShift Container Platform uses restart policy with failed containers, see the Example States in the Kubernetes documentation.
Create the job:
```
$ oc create -f <file-name>.yaml
```

Note

You can also create and launch a job from a single command using oc create job. The following command creates and launches a job similar to the one specified in the previous example:

$ oc create job pi --image=perl -- perl -Mbignum=bpi -wle 'print bpi(2000)'

5.2.3. Creating cron jobs

You create a cron job in OpenShift Container Platform by creating a job object.

Procedure

To create a cron job:

Create a YAML file similar to the following:
```
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pi
spec:
  schedule: "*/1 * * * *" # 1
  timeZone: Etc/UTC # 2
  concurrencyPolicy: "Replace" # 3
  startingDeadlineSeconds: 200 # 4
  suspend: true # 5
  successfulJobsHistoryLimit: 3 # 6
  failedJobsHistoryLimit: 1 # 7
  jobTemplate: # 8
    spec:
      template:
        metadata:
          labels: # 9
            parent: "cronjobpi"
        spec:
          containers:
          - name: pi
            image: perl
            command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
          restartPolicy: OnFailure # 10
#...
```
1: Schedule for the job specified in cron format. In this example, the job runs every minute.
2: An optional time zone for the schedule. See List of tz database time zones for valid options. If not specified, the Kubernetes controller manager interprets the schedule relative to its local time zone. This setting is offered as a Technology Preview.
3: An optional concurrency policy, specifying how to treat concurrent jobs within a cron job. Only one of the following concurrency policies may be specified. If not specified, this defaults to allowing concurrent executions.
- Allow allows cron jobs to run concurrently.
- Forbid forbids concurrent runs, skipping the next run if the previous has not finished yet.
- Replace cancels the currently running job and replaces it with a new one.
4: An optional deadline (in seconds) for starting the job if it misses its scheduled time for any reason. Missed job executions are counted as failed ones. If not specified, there is no deadline.
5: An optional flag allowing the suspension of a cron job. If set to true, all subsequent executions are suspended.
6: The number of successful finished jobs to retain (defaults to 3).
7: The number of failed finished jobs to retain (defaults to 1).
8: Job template. This is similar to the job example.
9: Sets a label for jobs spawned by this cron job.
10: The restart policy of the pod. This does not apply to the job controller.
Note
The .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit fields are optional. These fields specify how many completed and failed jobs should be kept. By default, they are set to 3 and 1 respectively. Setting a limit to 0 corresponds to keeping none of the corresponding kind of jobs after they finish.
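The retention rule described in this note can be sketched in Python. This is only an illustrative model, not the cron job controller's implementation: the FinishedJob class and the jobs_to_prune function are hypothetical names introduced for this sketch, and completion times are assumed to be plain integers.

```python
from dataclasses import dataclass


@dataclass
class FinishedJob:
    # Hypothetical record for illustration; not a Kubernetes API type.
    name: str
    succeeded: bool
    finished_at: int  # completion time, for example a Unix timestamp


def jobs_to_prune(jobs, successful_limit=3, failed_limit=1):
    """Return the names of finished jobs that exceed the history limits.

    Successful jobs are listed first, then failed jobs; within each
    group the oldest jobs come first. A limit of 0 keeps none.
    """
    pruned = []
    for succeeded, limit in ((True, successful_limit), (False, failed_limit)):
        group = sorted(
            (job for job in jobs if job.succeeded == succeeded),
            key=lambda job: job.finished_at,
        )
        excess = len(group) - limit
        if excess > 0:
            pruned.extend(job.name for job in group[:excess])
    return pruned
```

With the default limits, five successful jobs and two failed jobs leave the three newest successful jobs and the newest failed job in place.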
Create the cron job:
```
$ oc create -f <file-name>.yaml
```

Note

You can also create and launch a cron job from a single command using oc create cronjob. The following command creates and launches a cron job similar to the one specified in the previous example:

$ oc create cronjob pi --image=perl --schedule='*/1 * * * *' -- perl -Mbignum=bpi -wle 'print bpi(2000)'

With oc create cronjob, the --schedule option accepts schedules in cron format.

Chapter 6. Working with nodes

6.1. Viewing and listing the nodes in your OpenShift Container Platform cluster

You can list all the nodes in your cluster to obtain information such as status, age, memory usage, and details about the nodes.

When you perform node management operations, the CLI interacts with node objects that are representations of actual node hosts. The master uses the information from node objects to validate nodes with health checks.

6.1.1. About listing all the nodes in a cluster

You can get detailed information on the nodes in the cluster.

The following command lists all nodes:

$ oc get nodes

The following example is a cluster with healthy nodes:

$ oc get nodes

Example output

NAME                   STATUS    ROLES     AGE       VERSION
master.example.com     Ready     master    7h        v1.25.0
node1.example.com      Ready     worker    7h        v1.25.0
node2.example.com      Ready     worker    7h        v1.25.0

The following example is a cluster with one unhealthy node:

$ oc get nodes

Example output

NAME                   STATUS                      ROLES     AGE       VERSION
master.example.com     Ready                       master    7h        v1.25.0
node1.example.com      NotReady,SchedulingDisabled worker    7h        v1.25.0
node2.example.com      Ready                       worker    7h        v1.25.0

The conditions that trigger a NotReady status are shown later in this section.
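
If you capture the output of oc get nodes in a script, a small parser can flag the nodes that need attention. The following Python sketch is one possible approach, not an official tool: the unhealthy_nodes function name is hypothetical, and the sketch assumes the default column layout shown above with no spaces inside a cell. It reports every node whose STATUS is anything other than a plain Ready, so cordoned nodes (Ready,SchedulingDisabled) are reported too.

```python
def unhealthy_nodes(table: str) -> dict:
    """Map node name to STATUS for every node that is not plainly Ready."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    result = {}
    for line in lines[1:]:
        cells = line.split()
        status = cells[status_col]
        # STATUS can be a comma-separated list such as NotReady,SchedulingDisabled.
        if status.split(",") != ["Ready"]:
            result[cells[name_col]] = status
    return result


sample = """\
NAME                   STATUS                      ROLES     AGE       VERSION
master.example.com     Ready                       master    7h        v1.25.0
node1.example.com      NotReady,SchedulingDisabled worker    7h        v1.25.0
node2.example.com      Ready                       worker    7h        v1.25.0
"""

print(unhealthy_nodes(sample))
# prints {'node1.example.com': 'NotReady,SchedulingDisabled'}
```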

The -o wide option provides additional information on nodes.

$ oc get nodes -o wide

Example output

NAME                STATUS   ROLES    AGE    VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
master.example.com  Ready    master   171m   v1.25.0   10.0.129.108   <none>        Red Hat Enterprise Linux CoreOS 48.83.202103210901-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.25.0-30.rhaos4.10.gitf2f339d.el8-dev
node1.example.com   Ready    worker   72m    v1.25.0   10.0.129.222   <none>        Red Hat Enterprise Linux CoreOS 48.83.202103210901-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.25.0-30.rhaos4.10.gitf2f339d.el8-dev
node2.example.com   Ready    worker   164m   v1.25.0   10.0.142.150   <none>        Red Hat Enterprise Linux CoreOS 48.83.202103210901-0 (Ootpa)   4.18.0-240.15.1.el8_3.x86_64   cri-o://1.25.0-30.rhaos4.10.gitf2f339d.el8-dev

The following command lists information about a single node:

$ oc get node <node>

For example:

$ oc get node node1.example.com

Example output

NAME                   STATUS    ROLES     AGE       VERSION
node1.example.com      Ready     worker    7h        v1.25.0

The following command provides more detailed information about a specific node, including the reason for the current condition:

$ oc describe node <node>

For example:

$ oc describe node node1.example.com

Example output

Name:               node1.example.com 
Roles:              worker 
Labels:             beta.kubernetes.io/arch=amd64   
                    beta.kubernetes.io/instance-type=m4.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    kubernetes.io/hostname=ip-10-0-140-16
                    node-role.kubernetes.io/worker=
Annotations:        cluster.k8s.io/machine: openshift-machine-api/ahardin-worker-us-east-2a-q5dzc  
                    machineconfiguration.openshift.io/currentConfig: worker-309c228e8b3a92e2235edd544c62fea8
                    machineconfiguration.openshift.io/desiredConfig: worker-309c228e8b3a92e2235edd544c62fea8
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 13 Feb 2019 11:05:57 -0500
Taints:             <none>  
Unschedulable:      false
Conditions:                 
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Wed, 13 Feb 2019 15:09:42 -0500   Wed, 13 Feb 2019 11:05:57 -0500   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Wed, 13 Feb 2019 15:09:42 -0500   Wed, 13 Feb 2019 11:05:57 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 13 Feb 2019 15:09:42 -0500   Wed, 13 Feb 2019 11:05:57 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 13 Feb 2019 15:09:42 -0500   Wed, 13 Feb 2019 11:05:57 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 13 Feb 2019 15:09:42 -0500   Wed, 13 Feb 2019 11:07:09 -0500   KubeletReady                 kubelet is posting ready status
Addresses:   
  InternalIP:   10.0.140.16
  InternalDNS:  ip-10-0-140-16.us-east-2.compute.internal
  Hostname:     ip-10-0-140-16.us-east-2.compute.internal
Capacity:    
 attachable-volumes-aws-ebs:  39
 cpu:                         2
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      8172516Ki
 pods:                        250
Allocatable:
 attachable-volumes-aws-ebs:  39
 cpu:                         1500m
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      7558116Ki
 pods:                        250
System Info:    
 Machine ID:                              63787c9534c24fde9a0cde35c13f1f66
 System UUID:                             EC22BF97-A006-4A58-6AF8-0A38DEEA122A
 Boot ID:                                 f24ad37d-2594-46b4-8830-7f7555918325
 Kernel Version:                          3.10.0-957.5.1.el7.x86_64
 OS Image:                                Red Hat Enterprise Linux CoreOS 410.8.20190520.0 (Ootpa)
 Operating System:                        linux
 Architecture:                            amd64
 Container Runtime Version:               cri-o://1.25.0-0.6.dev.rhaos4.3.git9ad059b.el8-rc2
 Kubelet Version:                         v1.25.0
 Kube-Proxy Version:                      v1.25.0
PodCIDR:                                  10.128.4.0/24
ProviderID:                               aws:///us-east-2a/i-04e87b31dc6b3e171
Non-terminated Pods:                      (12 in total)  
  Namespace                               Name                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                               ----                                   ------------  ----------  ---------------  -------------
  openshift-cluster-node-tuning-operator  tuned-hdl5q                            0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-dns                           dns-default-l69zr                      0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-image-registry                node-ca-9hmcg                          0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-ingress                       router-default-76455c45c-c5ptv         0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-machine-config-operator       machine-config-daemon-cvqw9            20m (1%)      0 (0%)      50Mi (0%)        0 (0%)
  openshift-marketplace                   community-operators-f67fh              0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-monitoring                    alertmanager-main-0                    50m (3%)      50m (3%)    210Mi (2%)       10Mi (0%)
  openshift-monitoring                    node-exporter-l7q8d                    10m (0%)      20m (1%)    20Mi (0%)        40Mi (0%)
  openshift-monitoring                    prometheus-adapter-75d769c874-hvb85    0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-multus                        multus-kw8w5                           0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-sdn                           ovs-t4dsn                              100m (6%)     0 (0%)      300Mi (4%)       0 (0%)
  openshift-sdn                           sdn-g79hg                              100m (6%)     0 (0%)      200Mi (2%)       0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         380m (25%)   270m (18%)
  memory                      880Mi (11%)  250Mi (3%)
  attachable-volumes-aws-ebs  0            0
Events:     
  Type     Reason                   Age                From                      Message
  ----     ------                   ----               ----                      -------
  Normal   NodeHasSufficientPID     6d (x5 over 6d)    kubelet, m01.example.com  Node m01.example.com status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  6d                 kubelet, m01.example.com  Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  6d (x6 over 6d)    kubelet, m01.example.com  Node m01.example.com status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    6d (x6 over 6d)    kubelet, m01.example.com  Node m01.example.com status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientDisk    6d (x6 over 6d)    kubelet, m01.example.com  Node m01.example.com status is now: NodeHasSufficientDisk
  Normal   NodeHasSufficientPID     6d                 kubelet, m01.example.com  Node m01.example.com status is now: NodeHasSufficientPID
  Normal   Starting                 6d                 kubelet, m01.example.com  Starting kubelet.
#...

Name: The name of the node.
Roles: The role of the node, either master or worker.
Labels: The labels applied to the node.
Annotations: The annotations applied to the node.
Taints: The taints applied to the node.
Conditions: The node conditions and status. The Conditions stanza lists the Ready, PIDPressure, MemoryPressure, DiskPressure, and OutOfDisk status. These conditions are described later in this section.
Addresses: The IP address and hostname of the node.
Capacity and Allocatable: The pod resources and allocatable resources.
System Info: Information about the node host.
Non-terminated Pods: The pods on the node.
Events: The events reported by the node.

Note

The control plane label is not automatically added to newly created or updated master nodes. If you want to use the control plane label for your nodes, you can manually configure the label. For more information, see Understanding how to update labels on nodes in the Additional resources section.

Among the information shown for nodes, the following node conditions appear in the output of the commands shown in this section:

Table 6.1. Node Conditions
Condition	Description
`Ready`	If `true`, the node is healthy and ready to accept pods. If `false`, the node is not healthy and is not accepting pods. If `unknown`, the node controller has not received a heartbeat from the node for the `node-monitor-grace-period` (the default is 40 seconds).
`DiskPressure`	If `true`, the disk capacity is low.
`MemoryPressure`	If `true`, the node memory is low.
`PIDPressure`	If `true`, there are too many processes on the node.
`OutOfDisk`	If `true`, the node has insufficient free space for adding new pods.
`NetworkUnavailable`	If `true`, the network for the node is not correctly configured.
`NotReady`	If `true`, one of the underlying components, such as the container runtime or network, is experiencing issues or is not yet configured.
`SchedulingDisabled`	Pods cannot be scheduled for placement on the node.
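
The way these conditions combine into a health summary can be sketched in Python. The sketch assumes a node object shaped like the output of oc get node <node> -o json, where each entry in status.conditions has a type and a string status of True, False, or Unknown, and where spec.unschedulable is set on cordoned nodes. The node_problems function name is hypothetical.

```python
# Conditions that signal a problem when their status is "True".
PROBLEM_WHEN_TRUE = {"DiskPressure", "MemoryPressure", "PIDPressure", "NetworkUnavailable"}


def node_problems(node: dict) -> list:
    """Return a list of problem labels for one node object; empty means healthy."""
    problems = []
    for condition in node.get("status", {}).get("conditions", []):
        ctype = condition["type"]
        status = condition["status"]
        if ctype == "Ready" and status != "True":
            # Covers both "False" (not healthy) and "Unknown" (no recent heartbeat).
            problems.append(f"Ready={status}")
        elif ctype in PROBLEM_WHEN_TRUE and status == "True":
            problems.append(ctype)
    if node.get("spec", {}).get("unschedulable"):
        problems.append("SchedulingDisabled")
    return problems


example_node = {
    "spec": {"unschedulable": True},
    "status": {
        "conditions": [
            {"type": "MemoryPressure", "status": "True"},
            {"type": "DiskPressure", "status": "False"},
            {"type": "Ready", "status": "Unknown"},
        ]
    },
}

print(node_problems(example_node))
# prints ['MemoryPressure', 'Ready=Unknown', 'SchedulingDisabled']
```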

6.1.2. Listing pods on a node in your cluster

You can list all the pods on a specific node.

Procedure

To list all or selected pods on selected nodes:

$ oc get pod --selector=<nodeSelector>

For example:

$ oc get pod --selector=kubernetes.io/os

Or:

$ oc get pod -l=<nodeSelector>

For example:

$ oc get pod -l kubernetes.io/os=linux

To list all pods on a specific node, including terminated pods:

$ oc get pod --all-namespaces --field-selector=spec.nodeName=<nodename>

6.1.3. Viewing memory and CPU usage statistics on your nodes

You can display usage statistics about nodes, which provide the runtime environments for containers. These usage statistics include CPU, memory, and storage consumption.

Prerequisites

You must have cluster-reader permission to view the usage statistics.
Metrics must be installed to view the usage statistics.

Procedure

To view the usage statistics:

$ oc adm top nodes

Example output

NAME                                   CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%
ip-10-0-12-143.ec2.compute.internal    1503m        100%      4533Mi          61%
ip-10-0-132-16.ec2.compute.internal    76m          5%        1391Mi          18%
ip-10-0-140-137.ec2.compute.internal   398m         26%       2473Mi          33%
ip-10-0-142-44.ec2.compute.internal    656m         43%       6119Mi          82%
ip-10-0-146-165.ec2.compute.internal   188m         12%       3367Mi          45%
ip-10-0-19-62.ec2.compute.internal     896m         59%       5754Mi          77%
ip-10-0-44-193.ec2.compute.internal    632m         42%       5349Mi          72%

To view the usage statistics for nodes with labels:
```
$ oc adm top node --selector=''
```
You must specify the selector (label query) to filter on. The selector supports the =, ==, and != operators.
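To act on the usage statistics in a script, you can parse the oc adm top nodes output. The following Python sketch is one possible approach under stated assumptions: it expects the five-column layout shown in the example output above, with percentage values in the CPU% and MEMORY% columns. The busy_nodes function name and the 80 percent default thresholds are hypothetical choices for illustration.

```python
def busy_nodes(table: str, cpu_threshold: int = 80, memory_threshold: int = 80) -> list:
    """Return (name, cpu_percent, memory_percent) for nodes at or above a threshold."""
    lines = [line for line in table.splitlines() if line.strip()]
    busy = []
    for line in lines[1:]:  # skip the header row
        name, _cpu_cores, cpu_pct, _memory_bytes, memory_pct = line.split()
        cpu = int(cpu_pct.rstrip("%"))
        memory = int(memory_pct.rstrip("%"))
        if cpu >= cpu_threshold or memory >= memory_threshold:
            busy.append((name, cpu, memory))
    return busy


top_output = """\
NAME                                   CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%
ip-10-0-12-143.ec2.compute.internal    1503m        100%      4533Mi          61%
ip-10-0-142-44.ec2.compute.internal    656m         43%       6119Mi          82%
ip-10-0-19-62.ec2.compute.internal     896m         59%       5754Mi          77%
"""

for name, cpu, memory in busy_nodes(top_output):
    print(f"{name}: cpu={cpu}% memory={memory}%")
```

The sketch does not handle rows where metrics are missing and a column holds a non-numeric placeholder; a production script would need to skip or report such rows.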

6.2. Working with nodes

As an administrator, you can perform several tasks to make your clusters more efficient.

6.2.1. Understanding how to evacuate pods on nodes

Evacuating pods allows you to migrate all or selected pods from a given node or nodes.

You can only evacuate pods backed by a replication controller. The replication controller creates new pods on other nodes and removes the existing pods from the specified node(s).

Bare pods, meaning those not backed by a replication controller, are unaffected by default. You can evacuate a subset of pods by specifying a pod-selector. Pod selectors are based on labels, so all the pods with the specified label will be evacuated.

Procedure

Mark the nodes unschedulable before performing the pod evacuation.
1. Mark the node as unschedulable:
  $ oc adm cordon <node1>
  Example output
  node/<node1> cordoned
2. Check that the node status is Ready,SchedulingDisabled:
  $ oc get node <node1>
  Example output
  NAME      STATUS                     ROLES    AGE   VERSION
  <node1>   Ready,SchedulingDisabled   worker   1d    v1.25.0
Evacuate the pods using one of the following methods:
- Evacuate all or selected pods on one or more nodes:
  $ oc adm drain <node1> <node2> [--pod-selector=<pod_selector>]
- Force the deletion of bare pods by using the --force option. When set to true, deletion continues even if there are pods not managed by a replication controller, replica set, job, daemon set, or stateful set:
  $ oc adm drain <node1> <node2> --force=true
- Set a period of time, in seconds, for each pod to terminate gracefully by using the --grace-period option. If the value is negative, the default value specified in the pod is used:
  $ oc adm drain <node1> <node2> --grace-period=-1
- Ignore pods managed by daemon sets by setting the --ignore-daemonsets flag to true:
  $ oc adm drain <node1> <node2> --ignore-daemonsets=true
- Set the length of time to wait before giving up by using the --timeout flag. A value of 0 sets an infinite length of time:
  $ oc adm drain <node1> <node2> --timeout=5s
- Delete pods even if there are pods using emptyDir volumes by setting the --delete-emptydir-data flag to true. Local data is deleted when the node is drained:
  $ oc adm drain <node1> <node2> --delete-emptydir-data=true
- List the objects that would be migrated, without actually performing the evacuation, by setting the --dry-run option to true:
  $ oc adm drain <node1> <node2> --dry-run=true
Instead of specifying specific node names (for example, <node1> <node2>), you can use the --selector=<node_selector> option to evacuate pods on selected nodes.
Mark the node as schedulable when done.
```
$ oc adm uncordon <node1>
```
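The drain options listed in this procedure can be combined in one command. The following Python sketch assembles such a command line from keyword arguments without running it. It uses only the flags named in this section; the drain_command function and its parameter names are hypothetical helpers for illustration.

```python
def drain_command(nodes, pod_selector=None, force=False, grace_period=None,
                  ignore_daemonsets=False, timeout=None,
                  delete_emptydir_data=False, dry_run=False):
    """Build an 'oc adm drain' argument list; the command is not executed."""
    if not nodes:
        raise ValueError("at least one node name is required")
    cmd = ["oc", "adm", "drain", *nodes]
    if pod_selector:
        cmd.append(f"--pod-selector={pod_selector}")
    if force:
        cmd.append("--force=true")
    if grace_period is not None:
        cmd.append(f"--grace-period={grace_period}")
    if ignore_daemonsets:
        cmd.append("--ignore-daemonsets=true")
    if timeout is not None:
        cmd.append(f"--timeout={timeout}")
    if delete_emptydir_data:
        cmd.append("--delete-emptydir-data=true")
    if dry_run:
        cmd.append("--dry-run=true")
    return cmd


print(" ".join(drain_command(["node1.example.com"], ignore_daemonsets=True, timeout="5s")))
# prints oc adm drain node1.example.com --ignore-daemonsets=true --timeout=5s
```

Returning a list rather than a string keeps each argument separate, so the result can be passed to subprocess.run without shell quoting concerns.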

6.2.2. Understanding how to update labels on nodes

You can update any label on a node.

Node labels are not persisted after a node is deleted even if the node is backed up by a Machine.

Note

Any change to a MachineSet object is not applied to existing machines owned by the compute machine set. For example, labels edited or added to an existing MachineSet object are not propagated to existing machines and nodes associated with the compute machine set.

The following command adds or updates labels on a node:

$ oc label node <node> <key_1>=<value_1> ... <key_n>=<value_n>

For example:

$ oc label nodes webconsole-7f7f6 unhealthy=true

Tip

Alternatively, you can apply the following YAML to add the label:

kind: Node
apiVersion: v1
metadata:
  name: webconsole-7f7f6
  labels:
    unhealthy: 'true'
#...

The following command adds or updates a label on all pods in the namespace:
```
$ oc label pods --all <key_1>=<value_1>
```
For example:
```
$ oc label pods --all status=unhealthy
```

Important

In OpenShift Container Platform 4.12 and later, newly installed clusters include both the node-role.kubernetes.io/control-plane and node-role.kubernetes.io/master labels on control plane nodes by default.

In OpenShift Container Platform versions earlier than 4.12, the node-role.kubernetes.io/control-plane label is not added by default. Therefore, you must manually add the node-role.kubernetes.io/control-plane label to control plane nodes in clusters upgraded from earlier versions.
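
For an upgraded cluster, you can work out which nodes still need the label before running any commands. The following Python sketch takes a mapping of node names to their labels, which is an assumed input shape you would build from oc get nodes -o json, and returns the oc label commands to run. The label_commands function name is hypothetical.

```python
MASTER_LABEL = "node-role.kubernetes.io/master"
CONTROL_PLANE_LABEL = "node-role.kubernetes.io/control-plane"


def label_commands(nodes: dict) -> list:
    """Return 'oc label' commands for master nodes that lack the control-plane label.

    nodes maps each node name to a dict of its labels.
    """
    commands = []
    for name, labels in sorted(nodes.items()):
        if MASTER_LABEL in labels and CONTROL_PLANE_LABEL not in labels:
            # A trailing '=' sets the label with an empty value.
            commands.append(f"oc label node {name} {CONTROL_PLANE_LABEL}=")
    return commands


cluster = {
    "master-0": {MASTER_LABEL: ""},
    "master-1": {MASTER_LABEL: "", CONTROL_PLANE_LABEL: ""},
    "worker-0": {"node-role.kubernetes.io/worker": ""},
}

for command in label_commands(cluster):
    print(command)
# prints oc label node master-0 node-role.kubernetes.io/control-plane=
```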

6.2.3. Understanding how to mark nodes as unschedulable or schedulable

By default, healthy nodes with a Ready status are marked as schedulable, which means that you can place new pods on the node. Manually marking a node as unschedulable blocks any new pods from being scheduled on the node. Existing pods on the node are not affected.

The following command marks a node or nodes as unschedulable:

$ oc adm cordon <node>

For example:

$ oc adm cordon node1.example.com

Example output

node/node1.example.com cordoned

NAME                 LABELS                                        STATUS
node1.example.com    kubernetes.io/hostname=node1.example.com      Ready,SchedulingDisabled

The following command marks a currently unschedulable node or nodes as schedulable:
```
$ oc adm uncordon <node1>
```
Alternatively, instead of specifying specific node names (for example, <node>), you can use the --selector=<node_selector> option to mark selected nodes as schedulable or unschedulable.

6.2.4. Handling errors in single-node OpenShift clusters when the node reboots without draining application pods

In single-node OpenShift clusters and in OpenShift Container Platform clusters in general, a node can reboot without first being drained. When this happens, an application pod that requests devices can fail with the UnexpectedAdmissionError error. Deployment, ReplicaSet, or DaemonSet errors are reported because the application pods that require those devices start before the pod serving those devices. You cannot control the order of pod restarts.

While this behavior is to be expected, it can cause a pod to remain on the cluster even though it has failed to deploy successfully. The pod continues to report UnexpectedAdmissionError. This issue is mitigated by the fact that application pods are typically included in a Deployment, ReplicaSet, or DaemonSet. If a pod is in this error state, it is of little concern because another instance should be running. Belonging to a Deployment, ReplicaSet, or DaemonSet guarantees the successful creation and execution of subsequent pods and ensures the successful deployment of the application.

There is ongoing work upstream to ensure that such pods are gracefully terminated. Until that work is resolved, run the following command for a single-node OpenShift cluster to remove the failed pods:

$ oc delete pods --field-selector status.phase=Failed -n <POD_NAMESPACE>

Note

The option to drain the node is unavailable for single-node OpenShift clusters.
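
Before deleting, it can help to confirm which failed pods actually report UnexpectedAdmissionError. The following is a sketch only: the function names are hypothetical, and it assumes that the error appears in the status.reason field of the failed pod, which you should verify on your cluster.

```shell
# Hypothetical helper: list failed pods in a namespace together with their
# status.reason field, one "<pod_name> <reason>" pair per line.
list_failed_pods() {
  local namespace="$1"
  oc get pods -n "$namespace" --field-selector status.phase=Failed \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.reason}{"\n"}{end}'
}

# Hypothetical helper: read "<pod_name> <reason>" pairs on stdin and print
# only the names of pods whose reason is UnexpectedAdmissionError.
filter_admission_errors() {
  awk '$2 == "UnexpectedAdmissionError" { print $1 }'
}
```

Running list_failed_pods <POD_NAMESPACE> | filter_admission_errors prints the candidate pod names without deleting anything.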

6.2.5. Deleting nodes

6.2.5.1. Deleting nodes from a cluster

To delete a node from the OpenShift Container Platform cluster, scale down the appropriate MachineSet object.

Important

When a cluster is integrated with a cloud provider, you must delete the corresponding machine to delete a node. Do not try to use the oc delete node command for this task.

When you delete a node by using the CLI, the node object is deleted in Kubernetes, but the pods that exist on the node are not deleted. Any bare pods that are not backed by a replication controller become inaccessible to OpenShift Container Platform. Pods backed by replication controllers are rescheduled to other available nodes. You must delete local manifest pods.

Note

If you are running a cluster on bare metal, you cannot delete a node by editing MachineSet objects. Compute machine sets are only available when a cluster is integrated with a cloud provider. Instead, you must unschedule and drain the node before manually deleting it.

Procedure

View the compute machine sets that are in the cluster by running the following command:
```
$ oc get machinesets -n openshift-machine-api
```
The compute machine sets are listed in the form of <cluster-id>-worker-<aws-region-az>.

Scale down the compute machine set by using one of the following methods:

Specify the number of replicas to scale down to by running the following command:

$ oc scale --replicas=2 machineset <machine-set-name> -n openshift-machine-api

Edit the compute machine set custom resource by running the following command:

$ oc edit machineset <machine-set-name> -n openshift-machine-api

Example output

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  # ...
  name: <machine-set-name>
  namespace: openshift-machine-api
  # ...
spec:
  replicas: 2 
  # ...

1: Specify the number of replicas to scale down to.

6.2.5.2. Deleting nodes from a bare metal cluster

When you delete a node using the CLI, the node object is deleted in Kubernetes, but the pods that exist on the node are not deleted. Any bare pods not backed by a replication controller become inaccessible to OpenShift Container Platform. Pods backed by replication controllers are rescheduled to other available nodes. You must delete local manifest pods.

Procedure

Delete a node from an OpenShift Container Platform cluster running on bare metal by completing the following steps:

Mark the node as unschedulable:
```
$ oc adm cordon <node_name>
```
Drain all pods on the node:
```
$ oc adm drain <node_name> --force=true
```
This step might fail if the node is offline or unresponsive. Even if the node does not respond, it might still be running a workload that writes to shared storage. To avoid data corruption, power down the physical hardware before you proceed.
Delete the node from the cluster:
```
$ oc delete node <node_name>
```
Although the node object is now deleted from the cluster, it can still rejoin the cluster after reboot or if the kubelet service is restarted. To permanently delete the node and all its data, you must decommission the node.
If you powered down the physical hardware, turn it back on so that the node can rejoin the cluster.
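
The three commands in this procedure can be chained so that a later step never runs after an earlier one fails. This is a minimal sketch: remove_bare_metal_node is a hypothetical function name, and the sketch does not cover powering down the hardware when the drain step fails.

```shell
# Hypothetical helper: run the cordon, drain, and delete steps in order and
# stop at the first failure so that a node is never deleted undrained.
# Usage: remove_bare_metal_node <node_name>
remove_bare_metal_node() {
  local node_name="${1:-}"
  if [ -z "$node_name" ]; then
    echo "usage: remove_bare_metal_node <node_name>" >&2
    return 64
  fi
  oc adm cordon "$node_name" &&
    oc adm drain "$node_name" --force=true &&
    oc delete node "$node_name"
}
```

If the drain step returns an error, the function stops before oc delete node, which leaves the node cordoned so that you can investigate.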

6.3. Managing nodes

OpenShift Container Platform uses a KubeletConfig custom resource (CR) to manage the configuration of nodes. When you create an instance of a KubeletConfig object, a managed machine config is created to override settings on the node.

Note

Logging in to remote machines for the purpose of changing their configuration is not supported.

6.3.1. Modifying nodes

To make configuration changes to a cluster or machine pool, you must create a custom resource definition (CRD) or kubeletConfig object. OpenShift Container Platform uses the Machine Config Controller to watch for changes introduced through the CRD and to apply those changes to the cluster.

Note

Because the fields in a kubeletConfig object are passed directly to the kubelet from upstream Kubernetes, the validation of those fields is handled directly by the kubelet itself. Please refer to the relevant Kubernetes documentation for the valid values for these fields. Invalid values in the kubeletConfig object can render cluster nodes unusable.

Procedure

Obtain the label associated with the static CRD, Machine Config Pool, for the type of node you want to configure. Perform one of the following steps:

Check current labels of the desired machine config pool.

For example:

$ oc get machineconfigpool --show-labels

Example output

NAME      CONFIG                                             UPDATED   UPDATING   DEGRADED   LABELS
master    rendered-master-e05b81f5ca4db1d249a1bf32f9ec24fd   True      False      False      operator.machineconfiguration.openshift.io/required-for-upgrade=
worker    rendered-worker-f50e78e1bc06d8e82327763145bfcf62   True      False      False

Add a custom label to the desired machine config pool.
For example:
```
$ oc label machineconfigpool worker custom-kubelet=enabled
```

Create a kubeletconfig custom resource (CR) for your configuration change.

For example:

Sample configuration for a custom-config CR

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-config 
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled 
  kubeletConfig: 
    podsPerCore: 10
    maxPods: 250
    systemReserved:
      cpu: 2000m
      memory: 1Gi
#...

1: Assign a name to the CR.
2: Specify the label that selects where the configuration change applies. This is the label that you added to the machine config pool.
3: Specify the new values for the kubelet settings that you want to change.

Create the CR object.
```
$ oc create -f <file-name>
```
For example:
```
$ oc create -f master-kube-config.yaml
```

Most Kubelet Configuration options can be set by the user. The following options are not allowed to be overwritten:

CgroupDriver
ClusterDNS
ClusterDomain
StaticPodPath

Note

If a single node contains more than 50 images, pod scheduling might be imbalanced across nodes. This is because the list of images on a node is shortened to 50 by default. You can disable the image limit by editing the KubeletConfig object and setting the value of nodeStatusMaxImages to -1.
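
As an illustration of the nodeStatusMaxImages setting, the following sketch prints a KubeletConfig manifest that sets the value to -1. The function name render_image_limit_config and the CR name disable-image-limit are hypothetical. In practice you might add the field to a KubeletConfig object that already targets the pool instead of creating a second one.

```shell
# Hypothetical helper: print a KubeletConfig manifest that disables the
# per-node image list limit for machine config pools that carry the label
# <label_key>: <label_value>.
# Usage: render_image_limit_config <label_key> <label_value>
render_image_limit_config() {
  local label_key="$1" label_value="$2"
  cat <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: disable-image-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      ${label_key}: ${label_value}
  kubeletConfig:
    nodeStatusMaxImages: -1
EOF
}
```

For example, render_image_limit_config custom-kubelet enabled | oc create -f - would create the object for the pool labeled earlier in this procedure.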

6.3.2. Configuring control plane nodes as schedulable

You can configure control plane nodes to be schedulable, meaning that new pods are allowed for placement on the control plane nodes. By default, control plane nodes are not schedulable.

You can set the control plane nodes to be schedulable, but you must retain the worker nodes.

Note

You can deploy OpenShift Container Platform with no worker nodes on a bare metal cluster. In this case, the control plane nodes are marked schedulable by default.

You can allow or disallow control plane nodes to be schedulable by configuring the mastersSchedulable field.

Important

When you configure control plane nodes from the default unschedulable to schedulable, additional subscriptions are required. This is because control plane nodes then become worker nodes.

Procedure

Edit the schedulers.config.openshift.io resource.
```
$ oc edit schedulers.config.openshift.io cluster
```

Configure the mastersSchedulable field.

apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  creationTimestamp: "2019-09-10T03:04:05Z"
  generation: 1
  name: cluster
  resourceVersion: "433"
  selfLink: /apis/config.openshift.io/v1/schedulers/cluster
  uid: a636d30a-d377-11e9-88d4-0a60097bee62
spec:
  mastersSchedulable: false 
status: {}
#...

1: Set to true to allow control plane nodes to be schedulable, or false to disallow control plane nodes to be schedulable.

Save the file to apply the changes.
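
If you prefer a non-interactive change over oc edit, the same field can be set with a merge patch. The following is a sketch under that assumption, and set_masters_schedulable is a hypothetical function name.

```shell
# Hypothetical helper: set spec.mastersSchedulable on the cluster Scheduler
# resource without opening an editor.
# Usage: set_masters_schedulable <true|false>
set_masters_schedulable() {
  local value="${1:-}"
  case "$value" in
    true|false) ;;
    *) echo "usage: set_masters_schedulable <true|false>" >&2; return 64 ;;
  esac
  oc patch schedulers.config.openshift.io cluster --type merge \
    --patch "{\"spec\":{\"mastersSchedulable\":${value}}}"
}
```

The argument check rejects anything other than true or false so that a typo cannot reach the cluster as an invalid patch.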

6.3.3. Setting SELinux booleans

OpenShift Container Platform allows you to enable and disable an SELinux boolean on a Red Hat Enterprise Linux CoreOS (RHCOS) node. The following procedure explains how to modify SELinux booleans on nodes using the Machine Config Operator (MCO). This procedure uses container_manage_cgroup as the example boolean. You can modify this value to whichever boolean you need.

Prerequisites

You have installed the OpenShift CLI (oc).

Procedure

Create a new YAML file with a MachineConfig object, displayed in the following example:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-setsebool
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - contents: |
          [Unit]
          Description=Set SELinux booleans
          Before=kubelet.service

          [Service]
          Type=oneshot
          ExecStart=/sbin/setsebool container_manage_cgroup=on
          RemainAfterExit=true

          [Install]
          WantedBy=multi-user.target graphical.target
        enabled: true
        name: setsebool.service
#...

Create the new MachineConfig object by running the following command:
```
$ oc create -f 99-worker-setsebool.yaml
```

Note

Applying any changes to the MachineConfig object causes all affected nodes to gracefully reboot after the change is applied.
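
After the nodes reboot, you might want to confirm that the boolean is set. The following sketch assumes that the getsebool utility is available on the host, and both function names are hypothetical.

```shell
# Hypothetical helper: print the current value of an SELinux boolean on a node.
# Usage: get_node_sebool <node_name> [<boolean>]
get_node_sebool() {
  local node_name="$1" boolean="${2:-container_manage_cgroup}"
  oc debug "node/${node_name}" -- chroot /host getsebool "$boolean"
}

# Hypothetical helper: read getsebool output on stdin and succeed only if
# the boolean is reported as on.
sebool_is_on() {
  grep -q -- '--> on$'
}
```

For example, get_node_sebool <node_name> | sebool_is_on returns a zero exit status when container_manage_cgroup is enabled on that node.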

6.3.4. Adding kernel arguments to nodes

In some special cases, you might want to add kernel arguments to a set of nodes in your cluster. Do this only with caution and a clear understanding of the implications of the arguments you set.

Warning

Improper use of kernel arguments can result in your systems becoming unbootable.

Examples of kernel arguments you could set include:

nosmt: Disables symmetric multithreading (SMT) in the kernel. Multithreading allows multiple logical threads for each CPU. You could consider nosmt in multi-tenant environments to reduce risks from potential cross-thread attacks. By disabling SMT, you essentially choose security over performance.
systemd.unified_cgroup_hierarchy: Enables Linux control group version 2 (cgroup v2). cgroup v2 is the next version of the kernel control group and offers multiple improvements.
Important
OpenShift Container Platform cgroups version 2 support is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
enforcing=0: Configures Security Enhanced Linux (SELinux) to run in permissive mode. In permissive mode, the system acts as if SELinux is enforcing the loaded security policy, including labeling objects and emitting access denial entries in the logs, but it does not actually deny any operations. While not supported for production systems, permissive mode can be helpful for debugging.
Warning
Disabling SELinux on RHCOS in production is not supported. Once SELinux has been disabled on a node, it must be re-provisioned before re-inclusion in a production cluster.

See Kernel.org kernel parameters for a list and descriptions of kernel arguments.

In the following procedure, you create a MachineConfig object that identifies:

A set of machines to which you want to add the kernel argument. In this case, machines with a worker role.
Kernel arguments that are appended to the end of the existing kernel arguments.
A label that indicates where in the list of machine configs the change is applied.

Prerequisites

You have administrative privileges on a working OpenShift Container Platform cluster.

Procedure

List existing MachineConfig objects for your OpenShift Container Platform cluster to determine how to label your machine config:

$ oc get MachineConfig

Example output

NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
00-worker                                          52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
01-master-container-runtime                        52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
01-master-kubelet                                  52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
01-worker-container-runtime                        52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
01-worker-kubelet                                  52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
99-master-generated-registries                     52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
99-master-ssh                                                                                 3.2.0             40m
99-worker-generated-registries                     52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
99-worker-ssh                                                                                 3.2.0             40m
rendered-master-23e785de7587df95a4b517e0647e5ab7   52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
rendered-worker-5d596d9293ca3ea80c896a1191735bb1   52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m

Create a MachineConfig object file that identifies the kernel argument (for example, 05-worker-kernelarg-selinuxpermissive.yaml):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 05-worker-kernelarg-selinuxpermissive
spec:
  kernelArguments:
    - enforcing=0

1: Applies the new kernel argument only to worker nodes.
2: Named to identify where it fits among the machine configs (05) and what it does (adds a kernel argument to configure SELinux permissive mode).
3: Identifies the exact kernel argument as enforcing=0.

Create the new machine config:

$ oc create -f 05-worker-kernelarg-selinuxpermissive.yaml

Check the machine configs to see that the new one was added:

$ oc get MachineConfig

Example output

NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
00-worker                                          52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
01-master-container-runtime                        52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
01-master-kubelet                                  52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
01-worker-container-runtime                        52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
01-worker-kubelet                                  52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
05-worker-kernelarg-selinuxpermissive                                                         3.2.0             105s
99-master-generated-registries                     52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
99-master-ssh                                                                                 3.2.0             40m
99-worker-generated-registries                     52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
99-worker-ssh                                                                                 3.2.0             40m
rendered-master-23e785de7587df95a4b517e0647e5ab7   52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m
rendered-worker-5d596d9293ca3ea80c896a1191735bb1   52dd3ba6a9a527fc3ab42afac8d12b693534c8c9   3.2.0             33m

Check the nodes:

$ oc get nodes

Example output

NAME                           STATUS                     ROLES    AGE   VERSION
ip-10-0-136-161.ec2.internal   Ready                      worker   28m   v1.25.0
ip-10-0-136-243.ec2.internal   Ready                      master   34m   v1.25.0
ip-10-0-141-105.ec2.internal   Ready,SchedulingDisabled   worker   28m   v1.25.0
ip-10-0-142-249.ec2.internal   Ready                      master   34m   v1.25.0
ip-10-0-153-11.ec2.internal    Ready                      worker   28m   v1.25.0
ip-10-0-153-150.ec2.internal   Ready                      master   34m   v1.25.0

You can see that scheduling on each worker node is disabled as the change is being applied.

Check that the kernel argument worked by going to one of the worker nodes and listing the kernel command-line arguments (in /proc/cmdline on the host):

$ oc debug node/ip-10-0-141-105.ec2.internal

Example output

Starting pod/ip-10-0-141-105ec2internal-debug ...
To use host binaries, run `chroot /host`

sh-4.2# cat /host/proc/cmdline
BOOT_IMAGE=/ostree/rhcos-... console=tty0 console=ttyS0,115200n8
rootflags=defaults,prjquota rw root=UUID=fd0... ostree=/ostree/boot.0/rhcos/16...
coreos.oem.id=qemu coreos.oem.id=ec2 ignition.platform.id=ec2 enforcing=0

sh-4.2# exit

You should see the enforcing=0 argument added to the other kernel arguments.
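
To script this check instead of reading the output by eye, you can test the command line for an exact argument. The following is a small sketch, and has_kernel_arg is a hypothetical function name.

```shell
# Hypothetical helper: read a kernel command line on stdin and succeed only
# if it contains the exact argument given as the first parameter.
# The command line is split on whitespace so that, for example,
# "enforcing=0" does not match "enforcing=01".
has_kernel_arg() {
  tr -s '[:space:]' '\n' | grep -Fxq -- "$1"
}
```

For example, oc debug node/<node_name> -- chroot /host cat /proc/cmdline | has_kernel_arg enforcing=0 returns a zero exit status when the argument is present.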

6.3.5. Enabling swap memory use on nodes

Important

Enabling swap memory use on nodes is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

You can enable swap memory use for OpenShift Container Platform workloads on a per-node basis.

Warning

Enabling swap memory can negatively impact workload performance and out-of-resource handling. Do not enable swap memory on control plane nodes.

To enable swap memory, create a KubeletConfig custom resource (CR) to set the swapBehavior parameter. You can set limited or unlimited swap memory:

Limited: Use the LimitedSwap value to limit how much swap memory workloads can use. Any workloads on the node that are not managed by OpenShift Container Platform can still use swap memory. The LimitedSwap behavior depends on whether the node is running with Linux control groups version 1 (cgroup v1) or version 2 (cgroup v2):
- cgroup v2: OpenShift Container Platform workloads can use any combination of memory and swap, up to the pod’s memory limit, if set.
- cgroup v1: OpenShift Container Platform workloads cannot use swap memory.
Unlimited: Use the UnlimitedSwap value to allow workloads to use as much swap memory as they request, up to the system limit.

Because the kubelet will not start in the presence of swap memory without this configuration, you must enable swap memory in OpenShift Container Platform before enabling swap memory on the nodes. If there is no swap memory present on a node, enabling swap memory in OpenShift Container Platform has no effect.

Prerequisites

You have a running OpenShift Container Platform cluster that uses version 4.10 or later.
You are logged in to the cluster as a user with administrative privileges.
You have enabled the TechPreviewNoUpgrade feature set on the cluster (see Nodes → Working with clusters → Enabling features using feature gates).
Note
Enabling the TechPreviewNoUpgrade feature set cannot be undone and prevents minor version updates. These feature sets are not recommended on production clusters.
If cgroup v2 is enabled on a node, you must enable swap accounting on the node, by setting the swapaccount=1 kernel argument.

Procedure

Apply a custom label to the machine config pool where you want to allow swap memory.
```
$ oc label machineconfigpool worker kubelet-swap=enabled
```

Create a custom resource (CR) to enable and configure swap settings.

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: swap-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      kubelet-swap: enabled
  kubeletConfig:
    failSwapOn: false 
    memorySwap:
      swapBehavior: LimitedSwap 
#...

1: Set to false to enable swap memory use on the associated nodes. Set to true to disable swap memory use.
2: Specify the swap memory behavior. If unspecified, the default is LimitedSwap.

Enable swap memory on the machines.

6.3.6. Migrating control plane nodes from one RHOSP host to another

You can run a script that moves a control plane node from one Red Hat OpenStack Platform (RHOSP) node to another.

Prerequisites

The environment variable OS_CLOUD refers to a clouds entry that has administrative credentials in a clouds.yaml file.
The environment variable KUBECONFIG refers to a configuration that contains administrative OpenShift Container Platform credentials.

Procedure

From a command line, run the following script:

Check for admin OpenStack credentials
Check for admin OpenShift credentials
Drain the node
Power off the server
Verify the server is shut off
Migrate the node
Resize the VM
Wait for the resize confirm to finish
Restart the VM
Wait for the node to show up as Ready:
Uncordon the node
Wait for cluster operators to stabilize

#!/usr/bin/env bash

set -Eeuo pipefail

if [ $# -lt 1 ]; then
	echo "Usage: '$0 node_name'"
	exit 64
fi

# Check for admin OpenStack credentials
openstack server list --all-projects >/dev/null || { >&2 echo "The script needs OpenStack admin credentials. Exiting"; exit 77; }

# Check for admin OpenShift credentials
oc adm top node >/dev/null || { >&2 echo "The script needs OpenShift admin credentials. Exiting"; exit 77; }

set -x

declare -r node_name="$1"
declare server_id
server_id="$(openstack server list --all-projects -f value -c ID -c Name | grep "$node_name" | cut -d' ' -f1)"
readonly server_id

# Drain the node
oc adm cordon "$node_name"
oc adm drain "$node_name" --delete-emptydir-data --ignore-daemonsets --force

# Power off the server
oc debug "node/${node_name}" -- chroot /host shutdown -h 1

# Verify the server is shut off
until openstack server show "$server_id" -f value -c status | grep -q 'SHUTOFF'; do sleep 5; done

# Migrate the node
openstack server migrate --wait "$server_id"

# Resize the VM
openstack server resize confirm "$server_id"

# Wait for the resize confirm to finish
until openstack server show "$server_id" -f value -c status | grep -q 'SHUTOFF'; do sleep 5; done

# Restart the VM
openstack server start "$server_id"

# Wait for the node to show up as Ready:
until oc get node "$node_name" | grep -q "^${node_name}[[:space:]]\+Ready"; do sleep 5; done

# Uncordon the node
oc adm uncordon "$node_name"

# Wait for cluster operators to stabilize
until oc get co -o go-template='statuses: {{ range .items }}{{ range .status.conditions }}{{ if eq .type "Degraded" }}{{ if ne .status "False" }}DEGRADED{{ end }}{{ else if eq .type "Progressing"}}{{ if ne .status "False" }}PROGRESSING{{ end }}{{ else if eq .type "Available"}}{{ if ne .status "True" }}NOTAVAILABLE{{ end }}{{ end }}{{ end }}{{ end }}' | grep -qv '\(DEGRADED\|PROGRESSING\|NOTAVAILABLE\)'; do sleep 5; done

If the script completes, the control plane machine is migrated to a new RHOSP node.
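
The `until` loops in the script poll without a time limit, so a server or node that never reaches the expected state leaves the script waiting indefinitely. As a minimal sketch, and not part of the script above, a hypothetical helper named `wait_until` could bound each wait by a maximum number of attempts. The helper name, its arguments, and the limits shown are assumptions made for illustration.

```shell
# wait_until MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds and gives up
# after MAX_ATTEMPTS tries, sleeping INTERVAL_SECONDS between tries.
wait_until() {
  local max_attempts="$1" interval="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Gave up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
}

# A command that succeeds immediately returns on the first attempt.
if wait_until 3 0 true; then
  echo "condition met"
fi
```

For example, wrapping a status check in a function such as `server_is_shutoff` (a hypothetical name) and calling `wait_until 120 5 server_is_shutoff` would stop waiting after roughly 10 minutes instead of looping forever.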

6.4. Managing the maximum number of pods per node

In OpenShift Container Platform, you can configure the number of pods that can run on a node based on the number of processor cores on the node, a hard limit, or both. If you use both options, the lower of the two values limits the number of pods on a node.

Exceeding these values can result in:

Increased CPU utilization.
Slow pod scheduling.
Potential out-of-memory scenarios, depending on the amount of memory in the node.
Exhausting the pool of IP addresses.
Resource overcommitting, leading to poor user application performance.

Important

In Kubernetes, a pod that is holding a single container actually uses two containers. The second container is used to set up networking prior to the actual container starting. Therefore, a system running 10 pods will actually have 20 containers running.

Note

Disk IOPS throttling from the cloud provider might have an impact on CRI-O and kubelet. They might get overloaded when a large number of I/O-intensive pods are running on the nodes. It is recommended that you monitor the disk I/O on the nodes and use volumes with sufficient throughput for the workload.

The podsPerCore parameter sets the number of pods the node can run based on the number of processor cores on the node. For example, if podsPerCore is set to 10 on a node with 4 processor cores, the maximum number of pods allowed on the node will be 40.

kubeletConfig:
  podsPerCore: 10

Setting podsPerCore to 0 disables this limit. The default is 0. The value of the podsPerCore parameter cannot exceed the value of the maxPods parameter.

The maxPods parameter sets the number of pods the node can run to a fixed value, regardless of the properties of the node.

kubeletConfig:
  maxPods: 250
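
The interaction between the two parameters can be checked with a little arithmetic. The following is a minimal sketch, assuming a hypothetical helper named `effective_max_pods` that takes the core count, the podsPerCore value, and the maxPods value. It only mirrors the rules described above: the lower of the two limits applies, and a podsPerCore value of 0 disables the per-core limit.

```shell
# effective_max_pods CORES PODS_PER_CORE MAX_PODS
# Hypothetical helper: prints the pod limit that applies to a node.
effective_max_pods() {
  local cores="$1" pods_per_core="$2" max_pods="$3"
  local per_core_limit=$((cores * pods_per_core))
  # podsPerCore=0 disables the per-core limit, so only maxPods applies.
  if [ "$pods_per_core" -eq 0 ] || [ "$per_core_limit" -gt "$max_pods" ]; then
    echo "$max_pods"
  else
    echo "$per_core_limit"
  fi
}

effective_max_pods 4 10 250    # prints 40
effective_max_pods 32 10 250   # prints 250
effective_max_pods 4 0 250     # prints 250
```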

6.4.1. Configuring the maximum number of pods per node

Two parameters control the maximum number of pods that can be scheduled to a node: podsPerCore and maxPods. If you use both options, the lower of the two limits the number of pods on a node.

For example, if podsPerCore is set to 10 on a node with 4 processor cores, the maximum number of pods allowed on the node will be 40.

Prerequisites

Obtain the label associated with the static MachineConfigPool CRD for the type of node you want to configure by entering the following command:

$ oc edit machineconfigpool <name>

For example:

$ oc edit machineconfigpool worker

Example output

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2022-11-16T15:34:25Z"
  generation: 4
  labels:
    pools.operator.machineconfiguration.openshift.io/worker: "" # (1)
  name: worker
#...

1: The label appears under Labels.

Tip

If the label is not present, add a key/value pair such as:

$ oc label machineconfigpool worker custom-kubelet=small-pods

Procedure

Create a custom resource (CR) for your configuration change.
Sample configuration for a max-pods CR
```
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods # (1)
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: "" # (2)
  kubeletConfig:
    podsPerCore: 10 # (3)
    maxPods: 250 # (4)
#...
```
1: Assign a name to the CR.
2: Specify the label from the machine config pool.
3: Specify the number of pods the node can run based on the number of processor cores on the node.
4: Specify the number of pods the node can run to a fixed value, regardless of the properties of the node.
Note
Setting podsPerCore to 0 disables this limit.
In the preceding example, podsPerCore is set to 10 and maxPods is set to 250. Unless the node has 25 or more cores, podsPerCore is the limiting factor.
Run the following command to create the CR:
```
$ oc create -f <file_name>.yaml
```

Verification

List the MachineConfigPool CRDs to see if the change is applied. The UPDATING column reports True if the change is picked up by the Machine Config Controller:

$ oc get machineconfigpools

Example output

NAME     CONFIG                        UPDATED   UPDATING   DEGRADED
master   master-9cc2c72f205e103bb534   False     False      False
worker   worker-8cecd1236b33ee3f8a5e   False     True       False

Once the change is complete, the UPDATED column reports True.

$ oc get machineconfigpools

Example output

NAME     CONFIG                        UPDATED   UPDATING   DEGRADED
master   master-9cc2c72f205e103bb534   False     True       False
worker   worker-8cecd1236b33ee3f8a5e   True      False      False
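
Reading the UPDATED column by eye works for a single check, but a script needs a pass or fail result. The following is a minimal sketch, assuming a hypothetical helper named `pool_updated` and assuming the column order shown in the example output (NAME, CONFIG, UPDATED as the third field). It reads the command output from standard input, so it can be tried against the sample text without a cluster.

```shell
# pool_updated POOL_NAME
# Hypothetical helper: reads `oc get machineconfigpools` output on stdin
# and succeeds only when the UPDATED column for POOL_NAME is True.
pool_updated() {
  awk -v pool="$1" '$1 == pool && $3 == "True" { ok = 1 } END { exit !ok }'
}

# Sample text copied from the example output above.
sample_output='NAME     CONFIG                        UPDATED   UPDATING   DEGRADED
master   master-9cc2c72f205e103bb534   False     True       False
worker   worker-8cecd1236b33ee3f8a5e   True      False      False'

if printf '%s\n' "$sample_output" | pool_updated worker; then
  echo "worker pool is updated"   # printed for the sample text
fi
```

Against a live cluster, the same check would read from the command directly, for example `oc get machineconfigpools | pool_updated worker`.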

6.5. Using the Node Tuning Operator

Learn about the Node Tuning Operator and how you can use it to manage node-level tuning by orchestrating the TuneD daemon.

The Node Tuning Operator helps you manage node-level tuning by orchestrating the TuneD daemon and achieves low latency performance by using the Performance Profile controller. The majority of high-performance applications require some level of kernel tuning. The Node Tuning Operator provides a unified management interface to users of node-level sysctls and more flexibility to add custom tuning specified by user needs.

The Operator manages the containerized TuneD daemon for OpenShift Container Platform as a Kubernetes daemon set. It ensures the custom tuning specification is passed to all containerized TuneD daemons running in the cluster in the format that the daemons understand. The daemons run on all nodes in the cluster, one per node.

Node-level settings applied by the containerized TuneD daemon are rolled back on an event that triggers a profile change or when the containerized TuneD daemon is terminated gracefully by receiving and handling a termination signal.

The Node Tuning Operator uses the Performance Profile controller to implement automatic tuning to achieve low latency performance for OpenShift Container Platform applications. The cluster administrator configures a performance profile to define node-level settings such as the following:

Updating the kernel to kernel-rt.
Choosing CPUs for housekeeping.
Choosing CPUs for running workloads.

Note

Currently, disabling CPU load balancing is not supported by cgroup v2. As a result, you might not get the desired behavior from performance profiles if you have cgroup v2 enabled. Enabling cgroup v2 is not recommended if you are using performance profiles.

The Node Tuning Operator is part of a standard OpenShift Container Platform installation in version 4.1 and later.

Note

In earlier versions of OpenShift Container Platform, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator.

6.5.1. Accessing an example Node Tuning Operator specification

Use this process to access an example Node Tuning Operator specification.

Procedure

Run the following command to access an example Node Tuning Operator specification:

$ oc get tuned.tuned.openshift.io/default -o yaml -n openshift-cluster-node-tuning-operator

The default CR is meant for delivering standard node-level tuning for OpenShift Container Platform, and it can only be modified to set the Operator Management state. Any other custom changes to the default CR are overwritten by the Operator. For custom tuning, create your own Tuned CRs. Newly created CRs are combined with the default CR, and custom tuning is applied to OpenShift Container Platform nodes based on node or pod labels and profile priorities.

Warning

While in certain situations the support for pod labels can be a convenient way of automatically delivering required tuning, this practice is strongly discouraged, especially in large-scale clusters. The default Tuned CR ships without pod label matching. If a custom profile is created with pod label matching, then the functionality will be enabled at that time. The pod label functionality will be deprecated in future versions of the Node Tuning Operator.

6.5.2. Custom tuning specification

The custom resource (CR) for the Operator has two major sections. The first section, profile:, is a list of TuneD profiles and their names. The second, recommend:, defines the profile selection logic.

Multiple custom tuning specifications can co-exist as multiple CRs in the Operator’s namespace. The existence of new CRs or the deletion of old CRs is detected by the Operator. All existing custom tuning specifications are merged and appropriate objects for the containerized TuneD daemons are updated.

Management state

The Operator Management state is set by adjusting the default Tuned CR. By default, the Operator is in the Managed state and the spec.managementState field is not present in the default Tuned CR. Valid values for the Operator Management state are as follows:

Managed: the Operator will update its operands as configuration resources are updated
Unmanaged: the Operator will ignore changes to the configuration resources
Removed: the Operator will remove its operands and resources the Operator provisioned

Profile data

The profile: section lists TuneD profiles and their names.

profile:
- name: tuned_profile_1
  data: |
    # TuneD profile specification
    [main]
    summary=Description of tuned_profile_1 profile

    [sysctl]
    net.ipv4.ip_forward=1
    # ... other sysctl's or other TuneD daemon plugins supported by the containerized TuneD

# ...

- name: tuned_profile_n
  data: |
    # TuneD profile specification
    [main]
    summary=Description of tuned_profile_n profile

    # tuned_profile_n profile settings

Recommended profiles

The profile: selection logic is defined by the recommend: section of the CR. The recommend: section is a list of items that recommend profiles based on selection criteria.

recommend:
<recommend-item-1>
# ...
<recommend-item-n>

The individual items of the list:

- machineConfigLabels: # (1)
    <mcLabels> # (2)
  match: # (3)
    <match> # (4)
  priority: <priority> # (5)
  profile: <tuned_profile_name> # (6)
  operand: # (7)
    debug: <bool> # (8)
    tunedConfig:
      reapply_sysctl: <bool> # (9)

1: Optional.
2: A dictionary of key/value MachineConfig labels. The keys must be unique.
3: If omitted, profile match is assumed unless a profile with a higher priority matches first or machineConfigLabels is set.
4: An optional list.
5: Profile ordering priority. Lower numbers mean higher priority (0 is the highest priority).
6: A TuneD profile to apply on a match. For example tuned_profile_1.
7: Optional operand configuration.
8: Turn debugging on or off for the TuneD daemon. Options are true for on or false for off. The default is false.
9: Turn reapply_sysctl functionality on or off for the TuneD daemon. Options are true for on and false for off.

<match> is an optional list recursively defined as follows:

- label: <label_name> # (1)
  value: <label_value> # (2)
  type: <label_type> # (3)
    <match> # (4)

1: Node or pod label name.
2: Optional node or pod label value. If omitted, the presence of <label_name> is enough to match.
3: Optional object type (node or pod). If omitted, node is assumed.
4: An optional <match> list.

If <match> is not omitted, all nested <match> sections must also evaluate to true. Otherwise, false is assumed and the profile with the respective <match> section is not applied or recommended. Therefore, the nesting (child <match> sections) works as a logical AND operator. Conversely, if any item of the <match> list matches, the entire <match> list evaluates to true. Therefore, the list acts as a logical OR operator.

If machineConfigLabels is defined, machine config pool based matching is turned on for the given recommend: list item. <mcLabels> specifies the labels for a machine config. The machine config is created automatically to apply host settings, such as kernel boot parameters, for the profile <tuned_profile_name>. This involves finding all machine config pools with machine config selector matching <mcLabels> and setting the profile <tuned_profile_name> on all nodes that are assigned the found machine config pools. To target nodes that have both master and worker roles, you must use the master role.

The list items match and machineConfigLabels are connected by the logical OR operator. The match item is evaluated first in a short-circuit manner. Therefore, if it evaluates to true, the machineConfigLabels item is not considered.

Important

When using machine config pool based matching, it is advised to group nodes with the same hardware configuration into the same machine config pool. Not following this practice might result in TuneD operands calculating conflicting kernel parameters for two or more nodes sharing the same machine config pool.

Example: node or pod label based matching

- match:
  - label: tuned.openshift.io/elasticsearch
    match:
    - label: node-role.kubernetes.io/master
    - label: node-role.kubernetes.io/infra
    type: pod
  priority: 10
  profile: openshift-control-plane-es
- match:
  - label: node-role.kubernetes.io/master
  - label: node-role.kubernetes.io/infra
  priority: 20
  profile: openshift-control-plane
- priority: 30
  profile: openshift-node

The CR above is translated for the containerized TuneD daemon into its recommend.conf file based on the profile priorities. The profile with the highest priority (10) is openshift-control-plane-es and, therefore, it is considered first. The containerized TuneD daemon running on a given node looks to see if there is a pod running on the same node with the tuned.openshift.io/elasticsearch label set. If not, the entire <match> section evaluates as false. If there is such a pod with the label, in order for the <match> section to evaluate to true, the node label also needs to be node-role.kubernetes.io/master or node-role.kubernetes.io/infra.

If the labels for the profile with priority 10 matched, the openshift-control-plane-es profile is applied and no other profile is considered. If the node/pod label combination did not match, the second highest priority profile (openshift-control-plane) is considered. This profile is applied if the containerized TuneD pod runs on a node with labels node-role.kubernetes.io/master or node-role.kubernetes.io/infra.

Finally, the profile openshift-node has the lowest priority of 30. It lacks the <match> section and, therefore, always matches. It acts as a catch-all profile that sets the openshift-node profile if no other profile with a higher priority matches on a given node.
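
The evaluation order described above can be traced with a small shell function. This is a minimal sketch that hard-codes the three rules of the example. It is not how the Operator or the TuneD daemon implements matching, and the function names `has_label` and `recommend_profile` are hypothetical. Labels are passed as space-separated lists of label names.

```shell
# has_label LABELS NAME
# Succeeds when NAME appears in the space-separated LABELS list.
has_label() {
  local label
  for label in $1; do
    [ "$label" = "$2" ] && return 0
  done
  return 1
}

# recommend_profile NODE_LABELS POD_LABELS
# Hypothetical helper: prints the profile the example rules would select.
recommend_profile() {
  local node_labels="$1" pod_labels="$2" cp_or_infra="no"
  # Items of one match list are ORed: master OR infra.
  if has_label "$node_labels" "node-role.kubernetes.io/master" ||
     has_label "$node_labels" "node-role.kubernetes.io/infra"; then
    cp_or_infra="yes"
  fi
  # Priority 10: the pod label AND the nested node label match.
  if has_label "$pod_labels" "tuned.openshift.io/elasticsearch" &&
     [ "$cp_or_infra" = "yes" ]; then
    echo "openshift-control-plane-es"
  # Priority 20: node label match only.
  elif [ "$cp_or_infra" = "yes" ]; then
    echo "openshift-control-plane"
  # Priority 30: no match section, so it always matches.
  else
    echo "openshift-node"
  fi
}

recommend_profile "node-role.kubernetes.io/infra" "tuned.openshift.io/elasticsearch"  # prints openshift-control-plane-es
recommend_profile "node-role.kubernetes.io/master" ""                                 # prints openshift-control-plane
recommend_profile "node-role.kubernetes.io/worker" ""                                 # prints openshift-node
```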

Example: machine config pool based matching

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-node-custom
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift node profile with an additional kernel parameter
      include=openshift-node
      [bootloader]
      cmdline_openshift_node_custom=+skew_tick=1
    name: openshift-node-custom

  recommend:
  - machineConfigLabels:
      machineconfiguration.openshift.io/role: "worker-custom"
    priority: 20
    profile: openshift-node-custom

To minimize node reboots, label the target nodes with a label the machine config pool’s node selector will match, then create the Tuned CR above and finally create the custom machine config pool itself.

Cloud provider-specific TuneD profiles

With this functionality, all Cloud provider-specific nodes can conveniently be assigned a TuneD profile specifically tailored to a given Cloud provider on an OpenShift Container Platform cluster. This can be accomplished without adding additional node labels or grouping nodes into machine config pools.

This functionality takes advantage of spec.providerID node object values in the form of <cloud-provider>://<cloud-provider-specific-id> and writes the file /var/lib/tuned/provider with the value <cloud-provider> in NTO operand containers. The content of this file is then used by TuneD to load the provider-<cloud-provider> profile if such a profile exists.
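
To illustrate the spec.providerID format only, the following minimal sketch extracts the <cloud-provider> part with shell parameter expansion and builds the matching profile name. The function name `provider_from_id` and the sample ID are hypothetical, and the sketch makes no claim about how the Operator itself parses the value.

```shell
# provider_from_id PROVIDER_ID
# Hypothetical helper: prints the part of the ID before the first "://".
provider_from_id() {
  printf '%s\n' "${1%%://*}"
}

# The sample ID below is made up for illustration.
provider="$(provider_from_id "gce://example-project/us-central1-a/example-node")"
echo "provider-${provider}"   # prints provider-gce
```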

The openshift profile that both openshift-control-plane and openshift-node profiles inherit settings from is now updated to use this functionality through the use of conditional profile loading. Neither NTO nor TuneD currently ships any Cloud provider-specific profiles. However, it is possible to create a custom profile provider-<cloud-provider> that will be applied to all Cloud provider-specific cluster nodes.

Example GCE Cloud provider profile

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: provider-gce
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=GCE Cloud provider-specific profile
      # Your tuning for GCE Cloud provider goes here.
    name: provider-gce

Note

Due to profile inheritance, any setting specified in the provider-<cloud-provider> profile will be overwritten by the openshift profile and its child profiles.

6.5.3. Default profiles set on a cluster

The following are the default profiles set on a cluster.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: default
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - data: |
      [main]
      summary=Optimize systems running OpenShift (provider specific parent profile)
      include=-provider-${f:exec:cat:/var/lib/tuned/provider},openshift
    name: openshift
  recommend:
  - profile: openshift-control-plane
    priority: 30
    match:
    - label: node-role.kubernetes.io/master
    - label: node-role.kubernetes.io/infra
  - profile: openshift-node
    priority: 40

Starting with OpenShift Container Platform 4.9, all OpenShift TuneD profiles are shipped with the TuneD package. You can use the oc exec command to view the contents of these profiles:

$ oc exec $tuned_pod -n openshift-cluster-node-tuning-operator -- find /usr/lib/tuned/openshift{,-control-plane,-node} -name tuned.conf -exec grep -H ^ {} \;

6.5.4. Supported TuneD daemon plugins

Excluding the [main] section, the following TuneD plugins are supported when using custom profiles defined in the profile: section of the Tuned CR:

audio
cpu
disk
eeepc_she
modules
mounts
net
scheduler
scsi_host
selinux
sysctl
sysfs
usb
video
vm
bootloader

There is some dynamic tuning functionality provided by some of these plugins that is not supported. The following TuneD plugins are currently not supported:

script
systemd

Note

The TuneD bootloader plugin only supports Red Hat Enterprise Linux CoreOS (RHCOS) worker nodes.

6.6. Remediating, fencing, and maintaining nodes

When node-level failures occur, such as when the kernel hangs or network interface controllers (NICs) fail, the work required from the cluster does not decrease, and workloads from affected nodes need to be restarted somewhere. Failures affecting these workloads risk data loss, corruption, or both. It is important to isolate the node, known as fencing, before initiating recovery of the workload, known as remediation, and recovery of the node.

For more information on remediation, fencing, and maintaining nodes, see the Workload Availability for Red Hat OpenShift documentation.

6.7. Understanding node rebooting

To reboot a node without causing an outage for applications running on the platform, it is important to first evacuate the pods. For pods that are made highly available by the routing tier, nothing else needs to be done. For other pods needing storage, typically databases, it is critical to ensure that they can remain in operation with one pod temporarily going offline. While implementing resiliency for stateful pods is different for each application, in all cases it is important to configure the scheduler to use node anti-affinity to ensure that the pods are properly spread across available nodes.

Another challenge is how to handle nodes that are running critical infrastructure such as the router or the registry. The same node evacuation process applies, though it is important to understand certain edge cases.

6.7.1. About rebooting nodes running critical infrastructure

When rebooting nodes that host critical OpenShift Container Platform infrastructure components, such as router pods, registry pods, and monitoring pods, ensure that there are at least three nodes available to run these components.

The following scenario demonstrates how service interruptions can occur with applications running on OpenShift Container Platform when only two nodes are available:

Node A is marked unschedulable and all pods are evacuated.
The registry pod running on that node is now redeployed on node B. Node B is now running both registry pods.
Node B is now marked unschedulable and is evacuated.
The service exposing the two pod endpoints on node B loses all endpoints, for a brief period of time, until they are redeployed to node A.

When using three nodes for infrastructure components, this process does not result in a service disruption. However, due to pod scheduling, the last node that is evacuated and brought back into rotation does not have a registry pod. One of the other nodes has two registry pods. To schedule the third registry pod on the last node, use pod anti-affinity to prevent the scheduler from locating two registry pods on the same node.

Additional information

For more information on pod anti-affinity, see Placing pods relative to other pods using affinity and anti-affinity rules.

6.7.2. Rebooting a node using pod anti-affinity

Pod anti-affinity is slightly different from node anti-affinity. Node anti-affinity can be violated if there are no other suitable locations to deploy a pod. Pod anti-affinity can be set to either required or preferred.

With this in place, if only two infrastructure nodes are available and one is rebooted, the container image registry pod is prevented from running on the other node. oc get pods reports the pod as unready until a suitable node is available. Once a node is available and all pods are back in ready state, the next node can be restarted.

Procedure

To reboot a node using pod anti-affinity:

Edit the pod specification to configure pod anti-affinity:

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-antiaffinity
spec:
  affinity:
    podAntiAffinity: # (1)
      preferredDuringSchedulingIgnoredDuringExecution: # (2)
      - weight: 100 # (3)
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: registry # (4)
              operator: In # (5)
              values:
              - default
          topologyKey: kubernetes.io/hostname
#...

1: Stanza to configure pod anti-affinity.
2: Defines a preferred rule.
3: Specifies a weight for a preferred rule. The node with the highest weight is preferred.
4: Description of the pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.
5: The operator represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.

This example assumes the container image registry pod has a label of registry=default. Pod anti-affinity can use any Kubernetes match expression.

Enable the MatchInterPodAffinity scheduler predicate in the scheduling policy file.
Perform a graceful restart of the node.

6.7.3. Understanding how to reboot nodes running routers

In most cases, a pod running an OpenShift Container Platform router exposes a host port.

The PodFitsPorts scheduler predicate ensures that no router pods using the same port can run on the same node, and pod anti-affinity is achieved. If the routers are relying on IP failover for high availability, there is nothing else that is needed.

For router pods relying on an external service such as AWS Elastic Load Balancing for high availability, it is that service’s responsibility to react to router pod restarts.

In rare cases, a router pod may not have a host port configured. In those cases, it is important to follow the recommended restart process for infrastructure nodes.

6.7.4. Rebooting a node gracefully

Before rebooting a node, it is recommended to back up etcd data to avoid any data loss on the node.

Note

For single-node OpenShift clusters that require users to perform the oc login command rather than having the certificates in the kubeconfig file to manage the cluster, the oc adm commands might not be available after cordoning and draining the node. This is because the openshift-oauth-apiserver pod is not running due to the cordon. You can use SSH to access the nodes as indicated in the following procedure.

In a single-node OpenShift cluster, pods cannot be rescheduled when cordoning and draining. However, doing so gives the pods, especially your workload pods, time to properly stop and release associated resources.

Procedure

To perform a graceful restart of a node:

Mark the node as unschedulable:
```
$ oc adm cordon <node1>
```

Drain the node to remove all the running pods:

$ oc adm drain <node1> --ignore-daemonsets --delete-emptydir-data --force

You might receive errors that pods associated with custom pod disruption budgets (PDB) cannot be evicted.

Example error

error when evicting pods/"rails-postgresql-example-1-72v2w" -n "rails" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

In this case, run the drain command again, adding the --disable-eviction flag, which bypasses the PDB checks:

$ oc adm drain <node1> --ignore-daemonsets --delete-emptydir-data --force --disable-eviction

Access the node in debug mode:
```
$ oc debug node/<node1>
```
Change your root directory to /host:
```
$ chroot /host
```
Restart the node:
```
$ systemctl reboot
```
In a moment, the node enters the NotReady state.
Note
With some single-node OpenShift clusters, the oc commands might not be available after you cordon and drain the node because the openshift-oauth-apiserver pod is not running. You can use SSH to connect to the node and perform the reboot.
$ ssh core@<master-node>.<cluster_name>.<base_domain>
$ sudo systemctl reboot
After the reboot is complete, mark the node as schedulable by running the following command:
```
$ oc adm uncordon <node1>
```
Note
With some single-node OpenShift clusters, the oc commands might not be available after you cordon and drain the node because the openshift-oauth-apiserver pod is not running. You can use SSH to connect to the node and uncordon it.
$ ssh core@<target_node>
$ sudo oc adm uncordon <node> --kubeconfig /etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost.kubeconfig

Verify that the node is ready:

$ oc get node <node1>

Example output

NAME    STATUS  ROLES    AGE     VERSION
<node1> Ready   worker   6d22h   v1.18.3+b0068a8

Additional information

For information on etcd data backup, see Backing up etcd data.

6.8. Freeing node resources using garbage collection

As an administrator, you can use OpenShift Container Platform to ensure that your nodes are running efficiently by freeing up resources through garbage collection.

The OpenShift Container Platform node performs two types of garbage collection:

Container garbage collection: Removes terminated containers.
Image garbage collection: Removes images not referenced by any running pods.

6.8.1. Understanding how terminated containers are removed through garbage collection

Container garbage collection removes terminated containers by using eviction thresholds.

When eviction thresholds are set for garbage collection, the node tries to keep any container for any pod accessible from the API. If the pod has been deleted, the containers will be as well. Containers are preserved as long as the pod is not deleted and the eviction threshold is not reached. If the node is under disk pressure, it removes containers, and their logs are no longer accessible using oc logs.

eviction-soft - A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period.
eviction-hard - A hard eviction threshold has no grace period, and if observed, OpenShift Container Platform takes immediate action.
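
The difference between the two threshold types can be illustrated with a minimal Python sketch. The function name should_evict and its parameters are hypothetical and only model the rule described above; this is not the kubelet implementation.

```python
from typing import Optional


def should_evict(available: float, threshold: float,
                 seconds_below: float, grace_period: Optional[float]) -> bool:
    """Return True when an eviction threshold counts as met.

    available     -- current value of the signal (for example, MiB of free memory)
    threshold     -- configured eviction threshold, in the same unit
    seconds_below -- how long the signal has stayed below the threshold
    grace_period  -- grace period in seconds for a soft threshold, None for a hard one
    """
    if available >= threshold:
        return False  # the signal is healthy, nothing to do
    if grace_period is None:
        return True   # hard threshold: act immediately
    return seconds_below >= grace_period  # soft threshold: wait out the grace period


# Soft threshold of 500 MiB with a 90 second grace period.
print(should_evict(450, 500, 30, 90))    # False: grace period has not elapsed
print(should_evict(450, 500, 120, 90))   # True: below the threshold for longer than the grace period
# Hard threshold of 200 MiB.
print(should_evict(150, 200, 0, None))   # True: no grace period applies
```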

The following table lists the eviction thresholds:

Table 6.2. Variables for configuring container garbage collection
Node condition	Eviction signal	Description
MemoryPressure	`memory.available`	The available memory on the node.
DiskPressure	`nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, `imagefs.inodesFree`	The available disk space or inodes on the node root file system, `nodefs`, or image file system, `imagefs`.

Note

For evictionHard you must specify all of these parameters. If you do not specify all parameters, only the specified parameters are applied and the garbage collection will not function properly.

If a node is oscillating above and below a soft eviction threshold, but not exceeding its associated grace period, the corresponding node condition would constantly oscillate between true and false. As a consequence, the scheduler could make poor scheduling decisions.

To protect against this oscillation, use the eviction-pressure-transition-period flag to control how long OpenShift Container Platform must wait before transitioning out of a pressure condition. After an eviction threshold is met, OpenShift Container Platform does not toggle the pressure condition back to false until the threshold has remained unmet for the specified period.

Note

Setting the evictionPressureTransitionPeriod parameter to 0 configures the default value of 5 minutes. You cannot set an eviction pressure transition period to zero seconds.
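
The damping effect of the transition period can be sketched in Python. This is a simplified model under stated assumptions, not the kubelet's code: the function reported_condition and the sample data are hypothetical, and a transition_period of 0 in this sketch means "no hold" purely for comparison (in OpenShift Container Platform, setting the parameter to 0 selects the 5 minute default).

```python
def reported_condition(samples, transition_period):
    """Model how a pressure condition is reported over time.

    samples           -- list of (timestamp_seconds, threshold_met) tuples, ordered by time
    transition_period -- seconds the threshold must stay unmet before the
                         condition is reported as False again
    """
    reported = []
    last_met = None  # timestamp of the most recent sample with the threshold met
    for timestamp, met in samples:
        if met:
            last_met = timestamp
            reported.append(True)
        elif last_met is not None and timestamp - last_met < transition_period:
            reported.append(True)   # still inside the hold window
        else:
            reported.append(False)
    return reported


# The signal flaps around the threshold once a minute, then recovers.
samples = [(0, True), (60, False), (120, True), (180, False),
           (240, False), (300, False), (360, False)]

print(reported_condition(samples, 0))    # condition flaps with the signal
print(reported_condition(samples, 180))  # condition stays True until 180 s after the last breach
```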

6.8.2. Understanding how images are removed through garbage collection

Image garbage collection removes images that are not referenced by any running pods.

OpenShift Container Platform determines which images to remove from a node based on the disk usage that is reported by cAdvisor.

The policy for image garbage collection is based on two conditions:

The percent of disk usage (expressed as an integer) which triggers image garbage collection. The default is 85.
The percent of disk usage (expressed as an integer) to which image garbage collection attempts to reduce usage. The default is 80.

For image garbage collection, you can modify any of the following variables using a custom resource.

Table 6.3. Variables for configuring image garbage collection
Setting	Description
`imageMinimumGCAge`	The minimum age for an unused image before the image is removed by garbage collection. The default is 2m.
`imageGCHighThresholdPercent`	The percent of disk usage, expressed as an integer, which triggers image garbage collection. The default is 85. This value must be greater than the `imageGCLowThresholdPercent` value.
`imageGCLowThresholdPercent`	The percent of disk usage, expressed as an integer, to which image garbage collection attempts to free. The default is 80. This value must be less than the `imageGCHighThresholdPercent` value.

Two lists of images are retrieved in each garbage collector run:

A list of images currently running in at least one pod.
A list of images available on a host.

As new containers are run, new images appear. All images are marked with a time stamp. If the image is running (the first list above) or is newly detected (the second list above), it is marked with the current time. The remaining images are already marked from previous runs. All images are then sorted by the time stamp.

Once the collection starts, the oldest images get deleted first until the stopping criterion is met.
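
The selection logic described above can be sketched as follows. This is an illustrative model only: the Image class, the images_to_remove function, and the assumption that collection triggers when usage reaches the high threshold are hypothetical and are not taken from the kubelet source.

```python
from dataclasses import dataclass


@dataclass
class Image:
    name: str
    size: int         # disk space used by the image, in any consistent unit
    last_used: float  # time stamp assigned during garbage collector runs
    in_use: bool      # True if at least one pod is running the image


def images_to_remove(images, disk_used, disk_capacity, now,
                     high_pct=85, low_pct=80, min_age=120.0):
    """Return the names of images to delete, oldest first."""
    if 100 * disk_used / disk_capacity < high_pct:
        return []  # below the high threshold: garbage collection is not triggered
    target = disk_capacity * low_pct / 100  # stop once usage falls to the low threshold
    removed = []
    unused = sorted((i for i in images if not i.in_use), key=lambda i: i.last_used)
    for image in unused:
        if disk_used <= target:
            break  # stopping criterion met
        if now - image.last_used < min_age:
            continue  # younger than the minimum age (imageMinimumGCAge)
        removed.append(image.name)
        disk_used -= image.size
    return removed


images = [
    Image("a", size=5, last_used=10, in_use=False),
    Image("b", size=5, last_used=20, in_use=False),
    Image("c", size=10, last_used=5, in_use=True),
    Image("d", size=5, last_used=30, in_use=False),
]
# Disk is 90% full: the two oldest unused images are enough to reach 80%.
print(images_to_remove(images, disk_used=90, disk_capacity=100, now=1000))
```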

6.8.3. Configuring garbage collection for containers and images

As an administrator, you can configure how OpenShift Container Platform performs garbage collection by creating a KubeletConfig object for each machine config pool.

Note

OpenShift Container Platform supports only one KubeletConfig object for each machine config pool.

You can configure any combination of the following:

Soft eviction for containers
Hard eviction for containers
Eviction for images

Container garbage collection removes terminated containers. Image garbage collection removes images that are not referenced by any running pods.

Prerequisites

Obtain the label associated with the static MachineConfigPool CRD for the type of node you want to configure by entering the following command:

$ oc edit machineconfigpool <name>

For example:

$ oc edit machineconfigpool worker

Example output

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2022-11-16T15:34:25Z"
  generation: 4
  labels:
    pools.operator.machineconfiguration.openshift.io/worker: "" # <1>
  name: worker
#...

1: The label appears under Labels.

Tip

If the label is not present, add a key/value pair such as:

$ oc label machineconfigpool worker custom-kubelet=small-pods

Procedure

Create a custom resource (CR) for your configuration change.

Important

If there is one file system, or if /var/lib/kubelet and /var/lib/containers/ are in the same file system, the settings with the highest values trigger evictions, as those are met first. The file system triggers the eviction.

Sample configuration for a container garbage collection CR:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-kubeconfig # <1>
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: "" # <2>
  kubeletConfig:
    evictionSoft: # <3>
      memory.available: "500Mi" # <4>
      nodefs.available: "10%"
      nodefs.inodesFree: "5%"
      imagefs.available: "15%"
      imagefs.inodesFree: "10%"
    evictionSoftGracePeriod: # <5>
      memory.available: "1m30s"
      nodefs.available: "1m30s"
      nodefs.inodesFree: "1m30s"
      imagefs.available: "1m30s"
      imagefs.inodesFree: "1m30s"
    evictionHard: # <6>
      memory.available: "200Mi"
      nodefs.available: "5%"
      nodefs.inodesFree: "4%"
      imagefs.available: "10%"
      imagefs.inodesFree: "5%"
    evictionPressureTransitionPeriod: 3m # <7>
    imageMinimumGCAge: 5m # <8>
    imageGCHighThresholdPercent: 80 # <9>
    imageGCLowThresholdPercent: 75 # <10>
#...

1: Name for the object.
2: Specify the label from the machine config pool.
3: For container garbage collection: Type of eviction: evictionSoft or evictionHard.
4: For container garbage collection: Eviction thresholds based on a specific eviction trigger signal.
5: For container garbage collection: Grace periods for the soft eviction. This parameter does not apply to eviction-hard.
6: For container garbage collection: Eviction thresholds based on a specific eviction trigger signal. For evictionHard you must specify all of these parameters. If you do not specify all parameters, only the specified parameters are applied and the garbage collection will not function properly.
7: For container garbage collection: The duration to wait before transitioning out of an eviction pressure condition. Setting the evictionPressureTransitionPeriod parameter to 0 configures the default value of 5 minutes.
8: For image garbage collection: The minimum age for an unused image before the image is removed by garbage collection.
9: For image garbage collection: Image garbage collection is triggered at the specified percent of disk usage (expressed as an integer). This value must be greater than the imageGCLowThresholdPercent value.
10: For image garbage collection: Image garbage collection attempts to free resources to the specified percent of disk usage (expressed as an integer). This value must be less than the imageGCHighThresholdPercent value.
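
The constraints in the callouts above can be checked before the CR is created. The following Python sketch is one hypothetical way to do that; the function check_gc_settings is not part of any OpenShift tooling, and it expects the kubeletConfig section already loaded as a dictionary.

```python
REQUIRED_SIGNALS = {
    "memory.available",
    "nodefs.available",
    "nodefs.inodesFree",
    "imagefs.available",
    "imagefs.inodesFree",
}


def check_gc_settings(kubelet_config: dict) -> list:
    """Return a list of problems found in a kubeletConfig mapping."""
    problems = []

    # evictionHard must list every eviction signal.
    hard = kubelet_config.get("evictionHard")
    if hard is not None:
        missing = REQUIRED_SIGNALS - set(hard)
        if missing:
            problems.append("evictionHard is missing: " + ", ".join(sorted(missing)))

    # Every soft threshold needs a matching grace period.
    soft = kubelet_config.get("evictionSoft", {})
    grace = kubelet_config.get("evictionSoftGracePeriod", {})
    without_grace = set(soft) - set(grace)
    if without_grace:
        problems.append("evictionSoft without grace period: " + ", ".join(sorted(without_grace)))

    # The high image GC threshold must be greater than the low threshold.
    high = kubelet_config.get("imageGCHighThresholdPercent")
    low = kubelet_config.get("imageGCLowThresholdPercent")
    if high is not None and low is not None and high <= low:
        problems.append("imageGCHighThresholdPercent must be greater than imageGCLowThresholdPercent")

    return problems


incomplete = {
    "evictionSoft": {"memory.available": "500Mi"},
    "evictionHard": {"memory.available": "200Mi"},
    "imageGCHighThresholdPercent": 75,
    "imageGCLowThresholdPercent": 80,
}
for problem in check_gc_settings(incomplete):
    print(problem)
```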

Run the following command to create the CR:

$ oc create -f <file_name>.yaml

For example:

$ oc create -f gc-container.yaml

Example output

kubeletconfig.machineconfiguration.openshift.io/gc-container created

Verification

Verify that garbage collection is active by entering the following command. The machine config pool that you specified in the custom resource appears with UPDATING as True until the change is fully implemented:

$ oc get machineconfigpool

Example output

NAME     CONFIG                                   UPDATED   UPDATING
master   rendered-master-546383f80705bd5aeaba93   True      False
worker   rendered-worker-b4c51bb33ccaae6fc4a6a5   False     True

6.9. Allocating resources for nodes in an OpenShift Container Platform cluster

To provide more reliable scheduling and minimize node resource overcommitment, reserve a portion of the CPU and memory resources for use by the underlying node components, such as kubelet and kube-proxy, and the remaining system components, such as sshd and NetworkManager. By specifying the resources to reserve, you provide the scheduler with more information about the remaining CPU and memory resources that a node has available for use by pods. You can allow OpenShift Container Platform to automatically determine the optimal system-reserved CPU and memory resources for your nodes or you can manually determine and set the best resources for your nodes.

Important

To manually set resource values, you must use a kubelet config CR. You cannot use a machine config CR.

6.9.1. Understanding how to allocate resources for nodes

CPU and memory resources reserved for node components in OpenShift Container Platform are based on two node settings:

Setting	Description
`kube-reserved`	This setting is not used with OpenShift Container Platform. Add the CPU and memory resources that you planned to reserve to the `system-reserved` setting.
`system-reserved`	This setting identifies the resources to reserve for the node components and system components, such as CRI-O and Kubelet. The default settings depend on the OpenShift Container Platform and Machine Config Operator versions. Confirm the default `systemReserved` parameter on the `machine-config-operator` repository.

If a flag is not set, the defaults are used. If none of the flags are set, the allocated resource is set to the node’s capacity as it was before the introduction of allocatable resources.

Note

Any CPUs specifically reserved using the reservedSystemCPUs parameter are not available for allocation using kube-reserved or system-reserved.

6.9.1.1. How OpenShift Container Platform computes allocated resources

An allocated amount of a resource is computed based on the following formula:

[Allocatable] = [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds]

Note

The withholding of Hard-Eviction-Thresholds from Allocatable improves system reliability because the value for Allocatable is enforced for pods at the node level.

If Allocatable is negative, it is set to 0.
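
As a minimal sketch of the formula, including the clamp to zero, consider the following Python function. The name allocatable and the choice of MiB as the unit are illustrative assumptions.

```python
def allocatable(node_capacity: int, system_reserved: int, hard_eviction_threshold: int) -> int:
    """Compute the allocatable amount of a resource.

    All arguments use the same unit (MiB in the examples below).
    A negative result is clamped to 0.
    """
    return max(0, node_capacity - system_reserved - hard_eviction_threshold)


# A node with 32Gi of memory, 3Gi system-reserved, and a 100Mi hard eviction threshold.
print(allocatable(32 * 1024, 3 * 1024, 100))  # 29596 MiB

# Reservations larger than the capacity yield 0 rather than a negative value.
print(allocatable(1024, 2048, 100))  # 0
```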

Each node reports the system resources that are used by the container runtime and kubelet. To simplify configuring the system-reserved parameter, view the resource use for the node by using the node summary API. The node summary is available at /api/v1/nodes/<node>/proxy/stats/summary.

6.9.1.2. How nodes enforce resource constraints

The node is able to limit the total amount of resources that pods can consume based on the configured allocatable value. This feature significantly improves the reliability of the node by preventing pods from using CPU and memory resources that are needed by system services such as the container runtime and node agent. To improve node reliability, administrators should reserve resources based on a target for resource use.

The node enforces resource constraints by using a new cgroup hierarchy that enforces quality of service. All pods are launched in a dedicated cgroup hierarchy that is separate from system daemons.

Administrators should treat system daemons similar to pods that have a guaranteed quality of service. System daemons can burst within their bounding control groups and this behavior must be managed as part of cluster deployments. Reserve CPU and memory resources for system daemons by specifying the amount of CPU and memory resources in system-reserved.

Enforcing system-reserved limits can prevent critical system services from receiving CPU and memory resources. As a result, a critical system service can be ended by the out-of-memory killer. The recommendation is to enforce system-reserved only if you have profiled the nodes exhaustively to determine precise estimates and you are confident that critical system services can recover if any process in that group is ended by the out-of-memory killer.

6.9.1.3. Understanding Eviction Thresholds

If a node is under memory pressure, it can impact the entire node and all pods running on the node. For example, a system daemon that uses more than its reserved amount of memory can trigger an out-of-memory event. To avoid or reduce the probability of system out-of-memory events, the node provides out-of-resource handling.

You can reserve some memory using the --eviction-hard flag. The node attempts to evict pods whenever memory availability on the node drops below the absolute value or percentage. If system daemons do not exist on a node, pods are limited to the memory capacity - eviction-hard. For this reason, resources set aside as a buffer for eviction before reaching out of memory conditions are not available for pods.

The following is an example to illustrate the impact of node allocatable for memory:

Node capacity is 32Gi
--system-reserved is 3Gi
--eviction-hard is set to 100Mi.

For this node, the effective node allocatable value is 28.9Gi. If the node and system components use all their reservation, the memory available for pods is 28.9Gi, and kubelet evicts pods when it exceeds this threshold.

If you enforce node allocatable, 28.9Gi, with top-level cgroups, then pods can never exceed 28.9Gi. Evictions are not performed unless system daemons consume more than 3.1Gi of memory.

If system daemons do not use up all their reservation, with the above example, pods would face memcg OOM kills from their bounding cgroup before node evictions kick in. To better enforce QoS under this situation, the node applies the hard eviction thresholds to the top-level cgroup for all pods to be Node Allocatable + Eviction Hard Thresholds.

If system daemons do not use up all their reservation, the node will evict pods whenever they consume more than 28.9Gi of memory. If eviction does not occur in time, a pod will be OOM killed if pods consume 29Gi of memory.
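
The figures in this example can be reproduced with a short calculation. The following Python sketch works in MiB and rounds to one decimal place of GiB; the variable names are illustrative only.

```python
MI_PER_GI = 1024

capacity_mi = 32 * MI_PER_GI        # node capacity: 32Gi
system_reserved_mi = 3 * MI_PER_GI  # --system-reserved: 3Gi
eviction_hard_mi = 100              # --eviction-hard: 100Mi

# Node allocatable: what the scheduler can hand out to pods.
allocatable_mi = capacity_mi - system_reserved_mi - eviction_hard_mi

# Limit applied to the top-level pod cgroup: Node Allocatable + Eviction Hard Thresholds.
pods_cgroup_limit_mi = allocatable_mi + eviction_hard_mi

# Memory outside the pod allocation: system-reserved plus the eviction buffer.
outside_pods_mi = capacity_mi - allocatable_mi


def to_gi(mebibytes: int) -> float:
    """Convert MiB to GiB, rounded to one decimal place."""
    return round(mebibytes / MI_PER_GI, 1)


print(to_gi(allocatable_mi))        # 28.9
print(to_gi(pods_cgroup_limit_mi))  # 29.0
print(to_gi(outside_pods_mi))       # 3.1
```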

6.9.1.4. How the scheduler determines resource availability

The scheduler uses the value of node.Status.Allocatable instead of node.Status.Capacity to decide if a node will become a candidate for pod scheduling.

By default, the node will report its machine capacity as fully schedulable by the cluster.

6.9.2. Automatically allocating resources for nodes

OpenShift Container Platform can automatically determine the optimal system-reserved CPU and memory resources for nodes associated with a specific machine config pool and update the nodes with those values when the nodes start. By default, the system-reserved CPU is 500m and system-reserved memory is 1Gi.

To automatically determine and allocate the system-reserved resources on nodes, create a KubeletConfig custom resource (CR) to set the autoSizingReserved: true parameter. A script on each node calculates the optimal values for the respective reserved resources based on the installed CPU and memory capacity on each node. The script takes into account that increased capacity requires a corresponding increase in the reserved resources.

Automatically determining the optimal system-reserved settings ensures that your cluster is running efficiently and prevents node failure due to resource starvation of system components, such as CRI-O and kubelet, without your needing to manually calculate and update the values.

This feature is disabled by default.

Prerequisites

Obtain the label associated with the static MachineConfigPool object for the type of node you want to configure by entering the following command:

$ oc edit machineconfigpool <name>

For example:

$ oc edit machineconfigpool worker

Example output

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2022-11-16T15:34:25Z"
  generation: 4
  labels:
    pools.operator.machineconfiguration.openshift.io/worker: "" # <1>
  name: worker
#...

1: The label appears under Labels.

Tip

If an appropriate label is not present, add a key/value pair such as:

$ oc label machineconfigpool worker custom-kubelet=small-pods

Procedure

Create a custom resource (CR) for your configuration change:
Sample configuration for a resource allocation CR
```
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: dynamic-node # <1>
spec:
  autoSizingReserved: true # <2>
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: "" # <3>
#...
```
1: Assign a name to CR.
2: Add the autoSizingReserved parameter set to true to allow OpenShift Container Platform to automatically determine and allocate the system-reserved resources on the nodes associated with the specified label. To disable automatic allocation on those nodes, set this parameter to false.
3: Specify the label from the machine config pool that you configured in the "Prerequisites" section. You can choose any desired labels for the machine config pool, such as custom-kubelet: small-pods, or the default label, pools.operator.machineconfiguration.openshift.io/worker: "".
The previous example enables automatic resource allocation on all worker nodes. OpenShift Container Platform drains the nodes, applies the kubelet config, and restarts the nodes.
Create the CR by entering the following command:
```
$ oc create -f <file_name>.yaml
```

Verification

Log in to a node you configured by entering the following command:
```
$ oc debug node/<node_name>
```
Set /host as the root directory within the debug shell:
```
# chroot /host
```
View the /etc/node-sizing.env file:
Example output
```
SYSTEM_RESERVED_MEMORY=3Gi
SYSTEM_RESERVED_CPU=0.08
```
The kubelet uses the system-reserved values in the /etc/node-sizing.env file. In the previous example, the worker nodes are allocated 0.08 CPU and 3Gi of memory. It can take several minutes for the optimal values to appear.

6.9.3. Manually allocating resources for nodes

OpenShift Container Platform supports the CPU and memory resource types for allocation. The ephemeral-storage resource type is also supported. For the cpu type, you specify the resource quantity in units of cores, such as 200m, 0.5, or 1. For memory and ephemeral-storage, you specify the resource quantity in units of bytes, such as 200Ki, 50Mi, or 5Gi. By default, the system-reserved CPU is 500m and system-reserved memory is 1Gi.

As an administrator, you can set these values by using a kubelet config custom resource (CR) through a set of <resource_type>=<resource_quantity> pairs (e.g., cpu=200m,memory=512Mi).
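
To make the pair notation concrete, the following Python sketch splits a string such as cpu=200m,memory=512Mi into a mapping and converts a CPU quantity to cores. The helper names parse_reserved and cpu_cores are hypothetical and exist only for illustration; the sketch does not validate resource types or memory units.

```python
def parse_reserved(spec: str) -> dict:
    """Split '<resource_type>=<resource_quantity>' pairs into a dictionary."""
    result = {}
    for pair in spec.split(","):
        key, sep, value = pair.strip().partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected <resource_type>=<resource_quantity>, got {pair!r}")
        result[key] = value
    return result


def cpu_cores(quantity: str) -> float:
    """Convert a CPU quantity such as '200m', '0.5', or '1' to a number of cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores
    return float(quantity)


reserved = parse_reserved("cpu=200m,memory=512Mi")
print(reserved)                    # {'cpu': '200m', 'memory': '512Mi'}
print(cpu_cores(reserved["cpu"]))  # 0.2
```

The resulting keys correspond to the entries placed under systemReserved in the KubeletConfig CR.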

Important

You must use a kubelet config CR to manually set resource values. You cannot use a machine config CR.

For details on the recommended system-reserved values, refer to the recommended system-reserved values.

Prerequisites

Obtain the label associated with the static MachineConfigPool CRD for the type of node you want to configure by entering the following command:

$ oc edit machineconfigpool <name>

For example:

$ oc edit machineconfigpool worker

Example output

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2022-11-16T15:34:25Z"
  generation: 4
  labels:
    pools.operator.machineconfiguration.openshift.io/worker: "" # <1>
  name: worker
#...

1: The label appears under Labels.

Tip

If the label is not present, add a key/value pair such as:

$ oc label machineconfigpool worker custom-kubelet=small-pods

Procedure

Create a custom resource (CR) for your configuration change.

Sample configuration for a resource allocation CR

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-allocatable # <1>
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: "" # <2>
  kubeletConfig:
    systemReserved: # <3>
      cpu: 1000m
      memory: 1Gi
#...

1: Assign a name to CR.
2: Specify the label from the machine config pool.
3: Specify the resources to reserve for the node components and system components.

Run the following command to create the CR:
```
$ oc create -f <file_name>.yaml
```

6.10. Allocating specific CPUs for nodes in a cluster

When using the static CPU Manager policy, you can reserve specific CPUs for use by specific nodes in your cluster. For example, on a system with 24 CPUs, you could reserve CPUs numbered 0 - 3 for the control plane allowing the compute nodes to use CPUs 4 - 23.

6.10.1. Reserving CPUs for nodes
Copy link

To explicitly define a list of CPUs that are reserved for specific nodes, create a KubeletConfig custom resource (CR) to define the reservedSystemCPUs parameter. This list supersedes the CPUs that might be reserved using the systemReserved parameter.

Procedure

Obtain the label associated with the machine config pool (MCP) for the type of node you want to configure:

$ oc describe machineconfigpool <name>

For example:

$ oc describe machineconfigpool worker

Example output

Name:         worker
Namespace:
Labels:       machineconfiguration.openshift.io/mco-built-in=
              pools.operator.machineconfiguration.openshift.io/worker= <1>
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
#...

1: Get the MCP label.

Create a YAML file for the KubeletConfig CR:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-reserved-cpus # <1>
spec:
  kubeletConfig:
    reservedSystemCPUs: "0,1,2,3" # <2>
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: "" # <3>
#...

1: Specify a name for the CR.
2: Specify the core IDs of the CPUs you want to reserve for the nodes associated with the MCP.
3: Specify the label from the MCP.
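
To see which CPUs remain after a reservation, the list can be expanded with a few lines of Python. This is an illustrative sketch: the function parse_cpu_list is hypothetical, and support for ranges such as 0-3 is an assumption based on the Linux cpuset list notation rather than something stated in this section.

```python
def parse_cpu_list(spec: str) -> set:
    """Expand a CPU list such as '0,1,2,3' or '0-3,8' into a set of CPU IDs."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return cpus


# On a system with 24 CPUs, reserving CPUs 0,1,2,3 leaves CPUs 4 through 23.
reserved = parse_cpu_list("0,1,2,3")
remaining = sorted(set(range(24)) - reserved)
print(remaining[0], remaining[-1], len(remaining))  # 4 23 20
```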

Create the CR object:
```
$ oc create -f <file_name>.yaml
```

6.11. Enabling TLS security profiles for the kubelet

You can use a TLS (Transport Layer Security) security profile to define which TLS ciphers are required by the kubelet when it is acting as an HTTP server. The kubelet uses its HTTP/GRPC server to communicate with the Kubernetes API server, which sends commands to pods, gathers logs, and runs exec commands on pods through the kubelet.

A TLS security profile defines the TLS ciphers that the Kubernetes API server must use when connecting with the kubelet to protect communication between the kubelet and the Kubernetes API server.

Note

By default, when the kubelet acts as a client with the Kubernetes API server, it automatically negotiates the TLS parameters with the API server.

6.11.1. Understanding TLS security profiles

You can use a TLS (Transport Layer Security) security profile to define which TLS ciphers are required by various OpenShift Container Platform components. The OpenShift Container Platform TLS security profiles are based on Mozilla recommended configurations.

You can specify one of the following TLS security profiles for each component:


Table 6.4. TLS security profiles
Profile	Description
`Old`	This profile is intended for use with legacy clients or libraries. The profile is based on the Old backward compatibility recommended configuration. The `Old` profile requires a minimum TLS version of 1.0. Note: For the Ingress Controller, the minimum TLS version is converted from 1.0 to 1.1.
`Intermediate`	This profile is the default TLS security profile for the Ingress Controller, kubelet, and control plane. The profile is based on the Intermediate compatibility recommended configuration. The `Intermediate` profile requires a minimum TLS version of 1.2. Note: This profile is the recommended configuration for the majority of clients.
`Modern`	This profile is intended for use with modern clients that have no need for backward compatibility. This profile is based on the Modern compatibility recommended configuration. The `Modern` profile requires a minimum TLS version of 1.3.
`Custom`	This profile allows you to define the TLS version and ciphers to use. Warning: Use caution when using a `Custom` profile, because invalid configurations can cause problems.

Note

When using one of the predefined profile types, the effective profile configuration is subject to change between releases. For example, given a specification to use the Intermediate profile deployed on release X.Y.Z, an upgrade to release X.Y.Z+1 might cause a new profile configuration to be applied, resulting in a rollout.
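
As a minimal sketch of how a predefined profile is selected, the fragment below shows only the `tlsSecurityProfile` stanza for the Intermediate profile. It assumes that the stanza sits under `spec` in the configuration resource of the component, as in the kubelet samples in this chapter. The parent resource differs by component and is deliberately omitted.

```yaml
# Fragment only: place this under `spec` of the component's configuration resource.
tlsSecurityProfile:
  type: Intermediate   # name of the predefined profile
  intermediate: {}     # field matching the type; predefined profiles take an empty object
```

A predefined profile carries no cipher list of its own in the resource. The ciphers and minimum TLS version come from the profile definition shipped with the release, which is why an upgrade can change the effective configuration.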

6.11.2. Configuring the TLS security profile for the kubelet

To configure a TLS security profile for the kubelet when it is acting as an HTTP server, create a KubeletConfig custom resource (CR) to specify a predefined or custom TLS security profile for specific nodes. If a TLS security profile is not configured, the default TLS security profile is Intermediate.

Sample KubeletConfig CR that configures the Old TLS security profile on worker nodes

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
# ...
spec:
  tlsSecurityProfile:
    old: {}
    type: Old
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
# ...


You can see the ciphers and the minimum TLS version of the configured TLS security profile in the kubelet.conf file on a configured node.

Prerequisites

You have access to the cluster as a user with the cluster-admin role.

Procedure

Create a KubeletConfig CR to configure the TLS security profile:

Sample KubeletConfig CR for a Custom profile

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-kubelet-tls-security-profile
spec:
  tlsSecurityProfile:
    type: Custom 
    custom: 
      ciphers: 
      - ECDHE-ECDSA-CHACHA20-POLY1305
      - ECDHE-RSA-CHACHA20-POLY1305
      - ECDHE-RSA-AES128-GCM-SHA256
      - ECDHE-ECDSA-AES128-GCM-SHA256
      minTLSVersion: VersionTLS11
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: "" 
#...


Specify the TLS security profile type (Old, Intermediate, or Custom). The default is Intermediate.

Specify the appropriate field for the selected type:

old: {}
intermediate: {}
custom:

For the custom type, specify a list of TLS ciphers and minimum accepted TLS version.