This documentation is for a release that is no longer maintained
See documentation for the latest supported version 3 or the latest supported version 4.Nodes
Configuring and managing nodes in OpenShift Container Platform
Abstract
Chapter 1. Overview of nodes Copy linkLink copied to clipboard!
1.1. About nodes Copy linkLink copied to clipboard!
A node is a virtual or bare-metal machine in a Kubernetes cluster. Worker nodes host your application containers, grouped as pods. The control plane nodes run services that are required to control the Kubernetes cluster. In OpenShift Container Platform, the control plane nodes contain more than just the Kubernetes services for managing the OpenShift Container Platform cluster.
Having stable and healthy nodes in a cluster is fundamental to the smooth functioning of your hosted application. In OpenShift Container Platform, you can access, manage, and monitor a node through the Node
object representing the node. Using the OpenShift CLI (oc
) or the web console, you can perform the following operations on a node.
Read operations
The read operations allow an administrator or a developer to get information about nodes in an OpenShift Container Platform cluster.
- List all the nodes in a cluster.
- Get information about a node, such as memory and CPU usage, health, status, and age.
- List pods running on a node.
Management operations
As an administrator, you can easily manage a node in an OpenShift Container Platform cluster through several tasks:
-
Add or update node labels. A label is a key-value pair applied to a
Node
object. You can control the scheduling of pods using labels. -
Change node configuration using a custom resource definition (CRD), or the
kubeletConfig
object. -
Configure nodes to allow or disallow the scheduling of pods. Healthy worker nodes with a
Ready
status allow pod placement by default while the control plane nodes do not; you can change this default behavior by configuring the worker nodes to be unschedulable and the control plane nodes to be schedulable. -
Allocate resources for nodes using the
system-reserved
setting. You can allow OpenShift Container Platform to automatically determine the optimalsystem-reserved
CPU and memory resources for your nodes, or you can manually determine and set the best resources for your nodes. - Configure the number of pods that can run on a node based on the number of processor cores on the node, a hard limit, or both.
- Reboot a node gracefully using pod anti-affinity.
- Delete a node from a cluster by scaling down the cluster using a machine set. To delete a node from a bare-metal cluster, you must first drain all pods on the node and then manually delete the node.
Enhancement operations
OpenShift Container Platform allows you to do more than just access and manage nodes; as an administrator, you can perform the following tasks on nodes to make the cluster more efficient, application-friendly, and to provide a better environment for your developers.
- Manage node-level tuning for high-performance applications that require some level of kernel tuning by using the Node Tuning Operator.
- Run background tasks on nodes automatically with daemon sets. You can create and use daemon sets to create shared storage, run a logging pod on every node, or deploy a monitoring agent on all nodes.
- Free node resources using garbage collection. You can ensure that your nodes are running efficiently by removing terminated containers and the images not referenced by any running pods.
- Add kernel arguments to a set of nodes.
- Configure an OpenShift Container Platform cluster to have worker nodes at the network edge (remote worker nodes). For information on the challenges of having remote worker nodes in an OpenShift Container Platform cluster and some recommended approaches for managing pods on a remote worker node, see Using remote worker nodes at the network edge.
1.2. About pods Copy linkLink copied to clipboard!
A pod is one or more containers deployed together on a node. As a cluster administrator, you can define a pod, assign it to run on a healthy node that is ready for scheduling, and manage. A pod runs as long as the containers are running. You cannot change a pod once it is defined and is running. Some operations you can perform when working with pods are:
Read operations
As an administrator, you can get information about pods in a project through the following tasks:
- List pods associated with a project, including information such as the number of replicas and restarts, current status, and age.
- View pod usage statistics such as CPU, memory, and storage consumption.
Management operations
The following list of tasks provides an overview of how an administrator can manage pods in an OpenShift Container Platform cluster.
Control scheduling of pods using the advanced scheduling features available in OpenShift Container Platform:
- Node-to-pod binding rules such as pod affinity, node affinity, and anti-affinity.
- Node labels and selectors.
- Taints and tolerations.
- Pod topology spread constraints.
- Custom schedulers.
- Configure the descheduler to evict pods based on specific strategies so that the scheduler reschedules the pods to more appropriate nodes.
- Configure how pods behave after a restart using pod controllers and restart policies.
- Limit both egress and ingress traffic on a pod.
- Add and remove volumes to and from any object that has a pod template. A volume is a mounted file system available to all the containers in a pod. Container storage is ephemeral; you can use volumes to persist container data.
Enhancement operations
You can work with pods more easily and efficiently with the help of various tools and features available in OpenShift Container Platform. The following operations involve using those tools and features to better manage pods.
Operation | User | More information |
---|---|---|
Create and use a horizontal pod autoscaler. | Developer | You can use a horizontal pod autoscaler to specify the minimum and the maximum number of pods you want to run, as well as the CPU utilization or memory utilization your pods should target. Using a horizontal pod autoscaler, you can automatically scale pods. |
Administrator and developer | As an administrator, use a vertical pod autoscaler to better use cluster resources by monitoring the resources and the resource requirements of workloads. As a developer, use a vertical pod autoscaler to ensure your pods stay up during periods of high demand by scheduling pods to nodes that have enough resources for each pod. | |
Provide access to external resources using device plug-ins. | Administrator | A device plug-in is a gRPC service running on nodes (external to the kubelet), which manages specific hardware resources. You can deploy a device plug-in to provide a consistent and portable solution to consume hardware devices across clusters. |
Provide sensitive data to pods using the | Administrator |
Some applications need sensitive information, such as passwords and usernames. You can use the |
1.3. About containers Copy linkLink copied to clipboard!
A container is the basic unit of an OpenShift Container Platform application, which comprises the application code packaged along with its dependencies, libraries, and binaries. Containers provide consistency across environments and multiple deployment targets: physical servers, virtual machines (VMs), and private or public cloud.
Linux container technologies are lightweight mechanisms for isolating running processes and limiting access to only designated resources. As an administrator, You can perform various tasks on a Linux container, such as:
OpenShift Container Platform provides specialized containers called Init containers. Init containers run before application containers and can contain utilities or setup scripts not present in an application image. You can use an Init container to perform tasks before the rest of a pod is deployed.
Apart from performing specific tasks on nodes, pods, and containers, you can work with the overall OpenShift Container Platform cluster to keep the cluster efficient and the application pods highly available.
Chapter 2. Working with pods Copy linkLink copied to clipboard!
2.1. Using pods Copy linkLink copied to clipboard!
A pod is one or more containers deployed together on one host, and the smallest compute unit that can be defined, deployed, and managed.
2.1.1. Understanding pods Copy linkLink copied to clipboard!
Pods are the rough equivalent of a machine instance (physical or virtual) to a Container. Each pod is allocated its own internal IP address, therefore owning its entire port space, and containers within pods can share their local storage and networking.
Pods have a lifecycle; they are defined, then they are assigned to run on a node, then they run until their container(s) exit or they are removed for some other reason. Pods, depending on policy and exit code, might be removed after exiting, or can be retained in order to enable access to the logs of their containers.
OpenShift Container Platform treats pods as largely immutable; changes cannot be made to a pod definition while it is running. OpenShift Container Platform implements changes by terminating an existing pod and recreating it with modified configuration, base image(s), or both. Pods are also treated as expendable, and do not maintain state when recreated. Therefore pods should usually be managed by higher-level controllers, rather than directly by users.
For the maximum number of pods per OpenShift Container Platform node host, see the Cluster Limits.
Bare pods that are not managed by a replication controller will be not rescheduled upon node disruption.
2.1.2. Example pod configurations Copy linkLink copied to clipboard!
OpenShift Container Platform leverages the Kubernetes concept of a pod, which is one or more containers deployed together on one host, and the smallest compute unit that can be defined, deployed, and managed.
The following is an example definition of a pod from a Rails application. It demonstrates many features of pods, most of which are discussed in other topics and thus only briefly mentioned here:
Pod
object definition (YAML)
- 1
- Pods can be "tagged" with one or more labels, which can then be used to select and manage groups of pods in a single operation. The labels are stored in key/value format in the
metadata
hash. - 2
- The pod restart policy with possible values
Always
,OnFailure
, andNever
. The default value isAlways
. - 3
- OpenShift Container Platform defines a security context for containers which specifies whether they are allowed to run as privileged containers, run as a user of their choice, and more. The default context is very restrictive but administrators can modify this as needed.
- 4
containers
specifies an array of one or more container definitions.- 5
- The container specifies where external storage volumes are mounted within the container. In this case, there is a volume for storing access to credentials the registry needs for making requests against the OpenShift Container Platform API.
- 6
- Specify the volumes to provide for the pod. Volumes mount at the specified path. Do not mount to the container root,
/
, or any path that is the same in the host and the container. This can corrupt your host system if the container is sufficiently privileged, such as the host/dev/pts
files. It is safe to mount the host by using/host
. - 7
- Each container in the pod is instantiated from its own container image.
- 8
- Pods making requests against the OpenShift Container Platform API is a common enough pattern that there is a
serviceAccount
field for specifying which service account user the pod should authenticate as when making the requests. This enables fine-grained access control for custom infrastructure components. - 9
- The pod defines storage volumes that are available to its container(s) to use. In this case, it provides an ephemeral volume for a
secret
volume containing the default service account tokens.If you attach persistent volumes that have high file counts to pods, those pods can fail or can take a long time to start. For more information, see When using Persistent Volumes with high file counts in OpenShift, why do pods fail to start or take an excessive amount of time to achieve "Ready" state?.
This pod definition does not include attributes that are filled by OpenShift Container Platform automatically after the pod is created and its lifecycle begins. The Kubernetes pod documentation has details about the functionality and purpose of pods.
2.2. Viewing pods Copy linkLink copied to clipboard!
As an administrator, you can view the pods in your cluster and to determine the health of those pods and the cluster as a whole.
2.2.1. About pods Copy linkLink copied to clipboard!
OpenShift Container Platform leverages the Kubernetes concept of a pod, which is one or more containers deployed together on one host, and the smallest compute unit that can be defined, deployed, and managed. Pods are the rough equivalent of a machine instance (physical or virtual) to a container.
You can view a list of pods associated with a specific project or view usage statistics about pods.
2.2.2. Viewing pods in a project Copy linkLink copied to clipboard!
You can view a list of pods associated with the current project, including the number of replica, the current status, number or restarts and the age of the pod.
Procedure
To view the pods in a project:
Change to the project:
oc project <project-name>
$ oc project <project-name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Run the following command:
oc get pods
$ oc get pods
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc get pods -n openshift-console
$ oc get pods -n openshift-console
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME READY STATUS RESTARTS AGE console-698d866b78-bnshf 1/1 Running 2 165m console-698d866b78-m87pm 1/1 Running 2 165m
NAME READY STATUS RESTARTS AGE console-698d866b78-bnshf 1/1 Running 2 165m console-698d866b78-m87pm 1/1 Running 2 165m
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Add the
-o wide
flags to view the pod IP address and the node where the pod is located.oc get pods -o wide
$ oc get pods -o wide
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE console-698d866b78-bnshf 1/1 Running 2 166m 10.128.0.24 ip-10-0-152-71.ec2.internal <none> console-698d866b78-m87pm 1/1 Running 2 166m 10.129.0.23 ip-10-0-173-237.ec2.internal <none>
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE console-698d866b78-bnshf 1/1 Running 2 166m 10.128.0.24 ip-10-0-152-71.ec2.internal <none> console-698d866b78-m87pm 1/1 Running 2 166m 10.129.0.23 ip-10-0-173-237.ec2.internal <none>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.2.3. Viewing pod usage statistics Copy linkLink copied to clipboard!
You can display usage statistics about pods, which provide the runtime environments for containers. These usage statistics include CPU, memory, and storage consumption.
Prerequisites
-
You must have
cluster-reader
permission to view the usage statistics. - Metrics must be installed to view the usage statistics.
Procedure
To view the usage statistics:
Run the following command:
oc adm top pods
$ oc adm top pods
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc adm top pods -n openshift-console
$ oc adm top pods -n openshift-console
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME CPU(cores) MEMORY(bytes) console-7f58c69899-q8c8k 0m 22Mi console-7f58c69899-xhbgg 0m 25Mi downloads-594fcccf94-bcxk8 3m 18Mi downloads-594fcccf94-kv4p6 2m 15Mi
NAME CPU(cores) MEMORY(bytes) console-7f58c69899-q8c8k 0m 22Mi console-7f58c69899-xhbgg 0m 25Mi downloads-594fcccf94-bcxk8 3m 18Mi downloads-594fcccf94-kv4p6 2m 15Mi
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Run the following command to view the usage statistics for pods with labels:
oc adm top pod --selector=''
$ oc adm top pod --selector=''
Copy to Clipboard Copied! Toggle word wrap Toggle overflow You must choose the selector (label query) to filter on. Supports
=
,==
, and!=
.
2.2.4. Viewing resource logs Copy linkLink copied to clipboard!
You can view the log for various resources in the OpenShift CLI (oc) and web console. Logs read from the tail, or end, of the log.
Prerequisites
- Access to the OpenShift CLI (oc).
Procedure (UI)
In the OpenShift Container Platform console, navigate to Workloads → Pods or navigate to the pod through the resource you want to investigate.
NoteSome resources, such as builds, do not have pods to query directly. In such instances, you can locate the Logs link on the Details page for the resource.
- Select a project from the drop-down menu.
- Click the name of the pod you want to investigate.
- Click Logs.
Procedure (CLI)
View the log for a specific pod:
oc logs -f <pod_name> -c <container_name>
$ oc logs -f <pod_name> -c <container_name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow where:
-f
- Optional: Specifies that the output follows what is being written into the logs.
<pod_name>
- Specifies the name of the pod.
<container_name>
- Optional: Specifies the name of a container. When a pod has more than one container, you must specify the container name.
For example:
oc logs ruby-58cd97df55-mww7r
$ oc logs ruby-58cd97df55-mww7r
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc logs -f ruby-57f7f4855b-znl92 -c ruby
$ oc logs -f ruby-57f7f4855b-znl92 -c ruby
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The contents of log files are printed out.
View the log for a specific resource:
oc logs <object_type>/<resource_name>
$ oc logs <object_type>/<resource_name>
1 Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Specifies the resource type and name.
For example:
oc logs deployment/ruby
$ oc logs deployment/ruby
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The contents of log files are printed out.
2.3. Configuring an OpenShift Container Platform cluster for pods Copy linkLink copied to clipboard!
As an administrator, you can create and maintain an efficient cluster for pods.
By keeping your cluster efficient, you can provide a better environment for your developers using such tools as what a pod does when it exits, ensuring that the required number of pods is always running, when to restart pods designed to run only once, limit the bandwidth available to pods, and how to keep pods running during disruptions.
2.3.1. Configuring how pods behave after restart Copy linkLink copied to clipboard!
A pod restart policy determines how OpenShift Container Platform responds when Containers in that pod exit. The policy applies to all Containers in that pod.
The possible values are:
-
Always
- Tries restarting a successfully exited Container on the pod continuously, with an exponential back-off delay (10s, 20s, 40s) until the pod is restarted. The default isAlways
. -
OnFailure
- Tries restarting a failed Container on the pod with an exponential back-off delay (10s, 20s, 40s) capped at 5 minutes. -
Never
- Does not try to restart exited or failed Containers on the pod. Pods immediately fail and exit.
After the pod is bound to a node, the pod will never be bound to another node. This means that a controller is necessary in order for a pod to survive node failure:
Condition | Controller Type | Restart Policy |
---|---|---|
Pods that are expected to terminate (such as batch computations) | Job |
|
Pods that are expected to not terminate (such as web servers) | Replication controller |
|
Pods that must run one-per-machine | Daemon set | Any |
If a Container on a pod fails and the restart policy is set to OnFailure
, the pod stays on the node and the Container is restarted. If you do not want the Container to restart, use a restart policy of Never
.
If an entire pod fails, OpenShift Container Platform starts a new pod. Developers must address the possibility that applications might be restarted in a new pod. In particular, applications must handle temporary files, locks, incomplete output, and so forth caused by previous runs.
Kubernetes architecture expects reliable endpoints from cloud providers. When a cloud provider is down, the kubelet prevents OpenShift Container Platform from restarting.
If the underlying cloud provider endpoints are not reliable, do not install a cluster using cloud provider integration. Install the cluster as if it was in a no-cloud environment. It is not recommended to toggle cloud provider integration on or off in an installed cluster.
For details on how OpenShift Container Platform uses restart policy with failed Containers, see the Example States in the Kubernetes documentation.
2.3.2. Limiting the bandwidth available to pods Copy linkLink copied to clipboard!
You can apply quality-of-service traffic shaping to a pod and effectively limit its available bandwidth. Egress traffic (from the pod) is handled by policing, which simply drops packets in excess of the configured rate. Ingress traffic (to the pod) is handled by shaping queued packets to effectively handle data. The limits you place on a pod do not affect the bandwidth of other pods.
Procedure
To limit the bandwidth on a pod:
Write an object definition JSON file, and specify the data traffic speed using
kubernetes.io/ingress-bandwidth
andkubernetes.io/egress-bandwidth
annotations. For example, to limit both pod egress and ingress bandwidth to 10M/s:Limited
Pod
object definitionCopy to Clipboard Copied! Toggle word wrap Toggle overflow Create the pod using the object definition:
oc create -f <file_or_dir_path>
$ oc create -f <file_or_dir_path>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.3.3. Understanding how to use pod disruption budgets to specify the number of pods that must be up Copy linkLink copied to clipboard!
A pod disruption budget is part of the Kubernetes API, which can be managed with oc
commands like other object types. They allow the specification of safety constraints on pods during operations, such as draining a node for maintenance.
PodDisruptionBudget
is an API object that specifies the minimum number or percentage of replicas that must be up at a time. Setting these in projects can be helpful during node maintenance (such as scaling a cluster down or a cluster upgrade) and is only honored on voluntary evictions (not on node failures).
A PodDisruptionBudget
object’s configuration consists of the following key parts:
- A label selector, which is a label query over a set of pods.
An availability level, which specifies the minimum number of pods that must be available simultaneously, either:
-
minAvailable
is the number of pods must always be available, even during a disruption. -
maxUnavailable
is the number of pods can be unavailable during a disruption.
-
A maxUnavailable
of 0%
or 0
or a minAvailable
of 100%
or equal to the number of replicas is permitted but can block nodes from being drained.
You can check for pod disruption budgets across all projects with the following:
oc get poddisruptionbudget --all-namespaces
$ oc get poddisruptionbudget --all-namespaces
Example output
NAMESPACE NAME MIN-AVAILABLE SELECTOR another-project another-pdb 4 bar=foo test-project my-pdb 2 foo=bar
NAMESPACE NAME MIN-AVAILABLE SELECTOR
another-project another-pdb 4 bar=foo
test-project my-pdb 2 foo=bar
The PodDisruptionBudget
is considered healthy when there are at least minAvailable
pods running in the system. Every pod above that limit can be evicted.
Depending on your pod priority and preemption settings, lower-priority pods might be removed despite their pod disruption budget requirements.
2.3.3.1. Specifying the number of pods that must be up with pod disruption budgets Copy linkLink copied to clipboard!
You can use a PodDisruptionBudget
object to specify the minimum number or percentage of replicas that must be up at a time.
Procedure
To configure a pod disruption budget:
Create a YAML file with the an object definition similar to the following:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
PodDisruptionBudget
is part of thepolicy/v1beta1
API group.- 2
- The minimum number of pods that must be available simultaneously. This can be either an integer or a string specifying a percentage, for example,
20%
. - 3
- A label query over a set of resources. The result of
matchLabels
andmatchExpressions
are logically conjoined.
Or:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
PodDisruptionBudget
is part of thepolicy/v1beta1
API group.- 2
- The maximum number of pods that can be unavailable simultaneously. This can be either an integer or a string specifying a percentage, for example,
20%
. - 3
- A label query over a set of resources. The result of
matchLabels
andmatchExpressions
are logically conjoined.
Run the following command to add the object to project:
oc create -f </path/to/file> -n <project_name>
$ oc create -f </path/to/file> -n <project_name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.3.4. Preventing pod removal using critical pods Copy linkLink copied to clipboard!
There are a number of core components that are critical to a fully functional cluster, but, run on a regular cluster node rather than the master. A cluster might stop working properly if a critical add-on is evicted.
Pods marked as critical are not allowed to be evicted.
Procedure
To make a pod critical:
Create a
Pod
spec or edit existing pods to include thesystem-cluster-critical
priority class:spec: template: metadata: name: critical-pod priorityClassName: system-cluster-critical
spec: template: metadata: name: critical-pod priorityClassName: system-cluster-critical
1 Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Default priority class for pods that should never be evicted from a node.
Alternatively, you can specify
system-node-critical
for pods that are important to the cluster but can be removed if necessary.Create the pod:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.4. Automatically scaling pods with the horizontal pod autoscaler Copy linkLink copied to clipboard!
As a developer, you can use a horizontal pod autoscaler (HPA) to specify how OpenShift Container Platform should automatically increase or decrease the scale of a replication controller or deployment configuration, based on metrics collected from the pods that belong to that replication controller or deployment configuration.
2.4.1. Understanding horizontal pod autoscalers Copy linkLink copied to clipboard!
You can create a horizontal pod autoscaler to specify the minimum and maximum number of pods you want to run, as well as the CPU utilization or memory utilization your pods should target.
Autoscaling for Memory Utilization is a Technology Preview feature only.
After you create a horizontal pod autoscaler, OpenShift Container Platform begins to query the CPU and/or memory resource metrics on the pods. When these metrics are available, the horizontal pod autoscaler computes the ratio of the current metric utilization with the desired metric utilization, and scales up or down accordingly. The query and scaling occurs at a regular interval, but can take one to two minutes before metrics become available.
For replication controllers, this scaling corresponds directly to the replicas of the replication controller. For deployment configurations, scaling corresponds directly to the replica count of the deployment configuration. Note that autoscaling applies only to the latest deployment in the Complete
phase.
OpenShift Container Platform automatically accounts for resources and prevents unnecessary autoscaling during resource spikes, such as during start up. Pods in the unready
state have 0 CPU
usage when scaling up and the autoscaler ignores the pods when scaling down. Pods without known metrics have 0% CPU
usage when scaling up and 100% CPU
when scaling down. This allows for more stability during the HPA decision. To use this feature, you must configure readiness checks to determine if a new pod is ready for use.
In order to use horizontal pod autoscalers, your cluster administrator must have properly configured cluster metrics.
2.4.1.1. Supported metrics Copy linkLink copied to clipboard!
The following metrics are supported by horizontal pod autoscalers:
Metric | Description | API version |
---|---|---|
CPU utilization | Number of CPU cores used. Can be used to calculate a percentage of the pod’s requested CPU. |
|
Memory utilization | Amount of memory used. Can be used to calculate a percentage of the pod’s requested memory. |
|
For memory-based autoscaling, memory usage must increase and decrease proportionally to the replica count. On average:
- An increase in replica count must lead to an overall decrease in memory (working set) usage per-pod.
- A decrease in replica count must lead to an overall increase in per-pod memory usage.
Use the OpenShift Container Platform web console to check the memory behavior of your application and ensure that your application meets these requirements before using memory-based autoscaling.
The following example shows autoscaling for the image-registry
DeploymentConfig
object. The initial deployment requires 3 pods. The HPA object increased that minimum to 5 and will increase the pods up to 7 if CPU usage on the pods reaches 75%:
oc autoscale dc/image-registry --min=5 --max=7 --cpu-percent=75
$ oc autoscale dc/image-registry --min=5 --max=7 --cpu-percent=75
Example output
horizontalpodautoscaler.autoscaling/image-registry autoscaled
horizontalpodautoscaler.autoscaling/image-registry autoscaled
Sample HPA for the image-registry
DeploymentConfig
object with minReplicas
set to 3
View the new state of the deployment:
oc get dc image-registry
$ oc get dc image-registry
Copy to Clipboard Copied! Toggle word wrap Toggle overflow There are now 5 pods in the deployment:
Example output
NAME REVISION DESIRED CURRENT TRIGGERED BY image-registry 1 5 5 config
NAME REVISION DESIRED CURRENT TRIGGERED BY image-registry 1 5 5 config
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.4.1.2. Scaling policies Copy linkLink copied to clipboard!
The autoscaling/v2beta2
API allows you to add scaling policies to a horizontal pod autoscaler. A scaling policy controls how the OpenShift Container Platform horizontal pod autoscaler (HPA) scales pods. Scaling policies allow you to restrict the rate that HPAs scale pods up or down by setting a specific number or specific percentage to scale in a specified period of time. You can also define a stabilization window, which uses previously computed desired states to control scaling if the metrics are fluctuating. You can create multiple policies for the same scaling direction, and determine which policy is used, based on the amount of change. You can also restrict the scaling by timed iterations. The HPA scales pods during an iteration, then performs scaling, as needed, in further iterations.
Sample HPA object with a scaling policy
- 1
- Specifies the direction for the scaling policy, either
scaleDown
orscaleUp
. This example creates a policy for scaling down. - 2
- Defines the scaling policy.
- 3
- Determines if the policy scales by a specific number of pods or a percentage of pods during each iteration. The default value is
pods
. - 4
- Determines the amount of scaling, either the number of pods or percentage of pods, during each iteration. There is no default value for scaling down by number of pods.
- 5
- Determines the length of a scaling iteration. The default value is
15
seconds. - 6
- The default value for scaling down by percentage is 100%.
- 7
- Determines which policy to use first, if multiple policies are defined. Specify
Max
to use the policy that allows the highest amount of change,Min
to use the policy that allows the lowest amount of change, orDisabled
to prevent the HPA from scaling in that policy direction. The default value isMax
. - 8
- Determines the time period the HPA should look back at desired states. The default value is
0
. - 9
- This example creates a policy for scaling up.
- 10
- The amount of scaling up by the number of pods. The default value for scaling up the number of pods is 4%.
- 11
- The amount of scaling up by the percentage of pods. The default value for scaling up by percentage is 100%.
Example policy for scaling down
In this example, when the number of pods is greater than 40, the percent-based policy is used for scaling down, as that policy results in a larger change, as required by the selectPolicy
.
If there are 80 pod replicas, in the first iteration the HPA reduces the pods by 8, which is 10% of the 80 pods (based on the type: Percent
and value: 10
parameters), over one minute (periodSeconds: 60
). For the next iteration, the number of pods is 72. The HPA calculates that 10% of the remaining pods is 7.2, which it rounds up to 8 and scales down 8 pods. On each subsequent iteration, the number of pods to be scaled is re-calculated based on the number of remaining pods. When the number of pods falls below 40, the pods-based policy is applied, because the pod-based number is greater than the percent-based number. The HPA reduces 4 pods at a time (type: Pods
and value: 4
), over 30 seconds (periodSeconds: 30
), until there are 20 replicas remaining (minReplicas
).
The selectPolicy: Disabled
parameter prevents the HPA from scaling up the pods. You can manually scale up by adjusting the number of replicas in the replica set or deployment set, if needed.
If set, you can view the scaling policy by using the oc edit
command:
oc edit hpa hpa-resource-metrics-memory
$ oc edit hpa hpa-resource-metrics-memory
Example output
2.4.2. Creating a horizontal pod autoscaler by using the web console Copy linkLink copied to clipboard!
From the web console, you can create a horizontal pod autoscaler (HPA) that specifies the minimum and maximum number of pods you want to run on a deployment. You can also define the amount of CPU or memory usage that your pods should target..
An HPA cannot be added to deployments that are part of an Operator-backed service, Knative service, or Helm chart.
Procedure
To create an HPA in the web console:
- In the Topology view, click the node to reveal the side pane.
From the Actions drop-down list, select Add HorizontalPodAutoscaler to open the Add HorizontalPodAutoscaler form.
Figure 2.1. Add HorizontalPodAutoscaler
From the Add HorizontalPodAutoscaler form, define the name, minimum and maximum pod limits, the CPU and memory usage, and click Save.
NoteIf any of the values for CPU and memory usage are missing, a warning is displayed.
To edit an HPA in the web console:
- In the Topology view, click the node to reveal the side pane.
- From the Actions drop-down list, select Edit HorizontalPodAutoscaler to open the Edit Horizontal Pod Autoscaler form.
- From the Edit Horizontal Pod Autoscaler form, edit the minimum and maximum pod limits and the CPU and memory usage, and click Save.
While creating or editing the horizontal pod autoscaler in the web console, you can switch from Form view to YAML view.
To remove an HPA in the web console:
- In the Topology view, click the node to reveal the side panel.
- From the Actions drop-down list, select Remove HorizontalPodAutoscaler.
- In the confirmation pop-up window, click Remove to remove the HPA.
2.4.3. Creating a horizontal pod autoscaler for CPU utilization by using the CLI Copy linkLink copied to clipboard!
You can create a horizontal pod autoscaler (HPA) for an existing Deployment
, DeploymentConfig
, ReplicaSet
, ReplicationController
, or StatefulSet
object that automatically scales the pods associated with that object to maintain the CPU usage you specify.
The HPA increases and decreases the number of replicas between the minimum and maximum numbers to maintain the specified CPU utilization across all pods.
When autoscaling for CPU utilization, you can use the oc autoscale
command and specify the minimum and maximum number of pods you want to run at any given time and the average CPU utilization your pods should target. If you do not specify a minimum, the pods are given default values from the OpenShift Container Platform server. To autoscale for a specific CPU value, create a HorizontalPodAutoscaler
object with the target CPU and pod limits.
Prerequisites
In order to use horizontal pod autoscalers, your cluster administrator must have properly configured cluster metrics. You can use the oc describe PodMetrics <pod-name>
command to determine if metrics are configured. If metrics are configured, the output appears similar to the following, with Cpu
and Memory
displayed under Usage
.
oc describe PodMetrics openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
$ oc describe PodMetrics openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Example output
Procedure
To create a horizontal pod autoscaler for CPU utilization:
Perform one of the following:
To scale based on the percent of CPU utilization, create a
HorizontalPodAutoscaler
object for an existing object:oc autoscale <object_type>/<name> \ --min <number> \ --max <number> \ --cpu-percent=<percent>
$ oc autoscale <object_type>/<name> \
1 --min <number> \
2 --max <number> \
3 --cpu-percent=<percent>
4 Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Specify the type and name of the object to autoscale. The object must exist and be a
Deployment
,DeploymentConfig
/dc
,ReplicaSet
/rs
,ReplicationController
/rc
, orStatefulSet
. - 2
- Optionally, specify the minimum number of replicas when scaling down.
- 3
- Specify the maximum number of replicas when scaling up.
- 4
- Specify the target average CPU utilization over all the pods, represented as a percent of requested CPU. If not specified or negative, a default autoscaling policy is used.
For example, the following command shows autoscaling for the
image-registry
DeploymentConfig
object. The initial deployment requires 3 pods. The HPA object increased that minimum to 5 and will increase the pods up to 7 if CPU usage on the pods reaches 75%:oc autoscale dc/image-registry --min=5 --max=7 --cpu-percent=75
$ oc autoscale dc/image-registry --min=5 --max=7 --cpu-percent=75
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To scale for a specific CPU value, create a YAML file similar to the following for an existing object:
Create a YAML file similar to the following:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Use the
autoscaling/v2beta2
API. - 2
- Specify a name for this horizontal pod autoscaler object.
- 3
- Specify the API version of the object to scale:
-
For a
ReplicationController
, usev1
. -
For a
DeploymentConfig
, useapps.openshift.io/v1
. -
For a
Deployment
,ReplicaSet
,Statefulset
object, useapps/v1
.
-
For a
- 4
- Specify the type of object. The object must be a
Deployment
,DeploymentConfig
/dc
,ReplicaSet
/rs
,ReplicationController
/rc
, orStatefulSet
. - 5
- Specify the name of the object to scale. The object must exist.
- 6
- Specify the minimum number of replicas when scaling down.
- 7
- Specify the maximum number of replicas when scaling up.
- 8
- Use the
metrics
parameter for memory utilization. - 9
- Specify
cpu
for CPU utilization. - 10
- Set to
AverageValue
. - 11
- Set to
averageValue
with the targeted CPU value.
Create the horizontal pod autoscaler:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Verify that the horizontal pod autoscaler was created:
oc get hpa cpu-autoscale
$ oc get hpa cpu-autoscale
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE cpu-autoscale ReplicationController/example 173m/500m 1 10 1 20m
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE cpu-autoscale ReplicationController/example 173m/500m 1 10 1 20m
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.4.4. Creating a horizontal pod autoscaler object for memory utilization by using the CLI Copy linkLink copied to clipboard!
You can create a horizontal pod autoscaler (HPA) for an existing DeploymentConfig
object or ReplicationController
object that automatically scales the pods associated with that object in order to maintain the average memory utilization you specify, either a direct value or a percentage of requested memory.
The HPA increases and decreases the number of replicas between the minimum and maximum numbers to maintain the specified memory utilization across all pods.
For memory utilization, you can specify the minimum and maximum number of pods and the average memory utilization your pods should target. If you do not specify a minimum, the pods are given default values from the OpenShift Container Platform server.
Autoscaling for memory utilization is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs), might not be functionally complete, and Red Hat does not recommend to use them for production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information on Red Hat Technology Preview features support scope, see https://access.redhat.com/support/offerings/techpreview/.
Prerequisites
In order to use horizontal pod autoscalers, your cluster administrator must have properly configured cluster metrics. You can use the oc describe PodMetrics <pod-name>
command to determine if metrics are configured. If metrics are configured, the output appears similar to the following, with Cpu
and Memory
displayed under Usage
.
oc describe PodMetrics openshift-kube-scheduler-ip-10-0-129-223.compute.internal -n openshift-kube-scheduler
$ oc describe PodMetrics openshift-kube-scheduler-ip-10-0-129-223.compute.internal -n openshift-kube-scheduler
Example output
Procedure
To create a horizontal pod autoscaler for memory utilization:
Create a YAML file for one of the following:
To scale for a specific memory value, create a
HorizontalPodAutoscaler
object similar to the following for an existingReplicationController
object or replication controller:Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Use the
autoscaling/v2beta2
API. - 2
- Specify a name for this horizontal pod autoscaler object.
- 3
- Specify the API version of the object to scale:
-
For a replication controller, use
v1
, -
For a
DeploymentConfig
object, useapps.openshift.io/v1
.
-
For a replication controller, use
- 4
- Specify the kind of object to scale, either
ReplicationController
orDeploymentConfig
. - 5
- Specify the name of the object to scale. The object must exist.
- 6
- Specify the minimum number of replicas when scaling down.
- 7
- Specify the maximum number of replicas when scaling up.
- 8
- Use the
metrics
parameter for memory utilization. - 9
- Specify
memory
for memory utilization. - 10
- Set the type to
AverageValue
. - 11
- Specify
averageValue
and a specific memory value. - 12
- Optional: Specify a scaling policy to control the rate of scaling up or down.
To scale for a percentage, create a
HorizontalPodAutoscaler
object similar to the following:Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Use the
autoscaling/v2beta2
API. - 2
- Specify a name for this horizontal pod autoscaler object.
- 3
- Specify the API version of the object to scale:
-
For a replication controller, use
v1
, -
For a
DeploymentConfig
object, useapps.openshift.io/v1
.
-
For a replication controller, use
- 4
- Specify the kind of object to scale, either
ReplicationController
orDeploymentConfig
. - 5
- Specify the name of the object to scale. The object must exist.
- 6
- Specify the minimum number of replicas when scaling down.
- 7
- Specify the maximum number of replicas when scaling up.
- 8
- Use the
metrics
parameter for memory utilization. - 9
- Specify
memory
for memory utilization. - 10
- Set to
Utilization
. - 11
- Specify
averageUtilization
and a target average memory utilization over all the pods, represented as a percent of requested memory. The target pods must have memory requests configured. - 12
- Optional: Specify a scaling policy to control the rate of scaling up or down.
Create the horizontal pod autoscaler:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc create -f hpa.yaml
$ oc create -f hpa.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
horizontalpodautoscaler.autoscaling/hpa-resource-metrics-memory created
horizontalpodautoscaler.autoscaling/hpa-resource-metrics-memory created
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the horizontal pod autoscaler was created:
oc get hpa hpa-resource-metrics-memory
$ oc get hpa hpa-resource-metrics-memory
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE hpa-resource-metrics-memory ReplicationController/example 2441216/500Mi 1 10 1 20m
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE hpa-resource-metrics-memory ReplicationController/example 2441216/500Mi 1 10 1 20m
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc describe hpa hpa-resource-metrics-memory
$ oc describe hpa hpa-resource-metrics-memory
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.4.5. Understanding horizontal pod autoscaler status conditions by using the CLI Copy linkLink copied to clipboard!
You can use the status conditions set to determine whether or not the horizontal pod autoscaler (HPA) is able to scale and whether or not it is currently restricted in any way.
The HPA status conditions are available with the v2beta1
version of the autoscaling API.
The HPA responds with the following status conditions:
The
AbleToScale
condition indicates whether HPA is able to fetch and update metrics, as well as whether any backoff-related conditions could prevent scaling.-
A
True
condition indicates scaling is allowed. -
A
False
condition indicates scaling is not allowed for the reason specified.
-
A
The
ScalingActive
condition indicates whether the HPA is enabled (for example, the replica count of the target is not zero) and is able to calculate desired metrics.-
A
True
condition indicates metrics is working properly. -
A
False
condition generally indicates a problem with fetching metrics.
-
A
The
ScalingLimited
condition indicates that the desired scale was capped by the maximum or minimum of the horizontal pod autoscaler.-
A
True
condition indicates that you need to raise or lower the minimum or maximum replica count in order to scale. A
False
condition indicates that the requested scaling is allowed.oc describe hpa cm-test
$ oc describe hpa cm-test
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- The horizontal pod autoscaler status messages.
-
A
The following is an example of a pod that is unable to scale:
Example output
The following is an example of a pod that could not obtain the needed metrics for scaling:
Example output
Conditions: Type Status Reason Message ---- ------ ------ ------- AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True SucceededGetScale the HPA controller was able to get the target's current scale
ScalingActive False FailedGetResourceMetric the HPA was unable to compute the replica count: failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource metrics API
The following is an example of a pod where the requested autoscaling was less than the required minimums:
Example output
2.4.5.1. Viewing horizontal pod autoscaler status conditions by using the CLI Copy linkLink copied to clipboard!
You can view the status conditions set on a pod by the horizontal pod autoscaler (HPA).
The horizontal pod autoscaler status conditions are available with the v2beta1
version of the autoscaling API.
Prerequisites
In order to use horizontal pod autoscalers, your cluster administrator must have properly configured cluster metrics. You can use the oc describe PodMetrics <pod-name>
command to determine if metrics are configured. If metrics are configured, the output appears similar to the following, with Cpu
and Memory
displayed under Usage
.
oc describe PodMetrics openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
$ oc describe PodMetrics openshift-kube-scheduler-ip-10-0-135-131.ec2.internal
Example output
Procedure
To view the status conditions on a pod, use the following command with the name of the pod:
oc describe hpa <pod-name>
$ oc describe hpa <pod-name>
For example:
oc describe hpa cm-test
$ oc describe hpa cm-test
The conditions appear in the Conditions
field in the output.
Example output
2.5. Automatically adjust pod resource levels with the vertical pod autoscaler Copy linkLink copied to clipboard!
The OpenShift Container Platform Vertical Pod Autoscaler Operator (VPA) automatically reviews the historic and current CPU and memory resources for containers in pods and can update the resource limits and requests based on the usage values it learns. The VPA uses individual custom resources (CR) to update all of the pods associated with a workload object, such as a Deployment
, DeploymentConfig
, StatefulSet
, Job
, DaemonSet
, ReplicaSet
, or ReplicationController
, in a project.
The VPA helps you to understand the optimal CPU and memory usage for your pods and can automatically maintain pod resources through the pod lifecycle.
vertical pod autoscaler is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see https://access.redhat.com/support/offerings/techpreview/.
2.5.1. About the Vertical Pod Autoscaler Operator Copy linkLink copied to clipboard!
The Vertical Pod Autoscaler Operator (VPA) is implemented as an API resource and a custom resource (CR). The CR determines the actions the Vertical Pod Autoscaler Operator should take with the pods associated with a specific workload object, such as a daemon set, replication controller, and so forth, in a project.
The VPA automatically computes historic and current CPU and memory usage for the containers in those pods and uses this data to determine optimized resource limits and requests to ensure that these pods are operating efficiently at all times. For example, the VPA reduces resources for pods that are requesting more resources than they are using and increases resources for pods that are not requesting enough.
The VPA automatically deletes any pods that are out of alignment with its recommendations one at a time, so that your applications can continue to serve requests with no downtime. The workload objects then re-deploy the pods with the original resource limits and requests. The VPA uses a mutating admission webhook to update the pods with optimized resource limits and requests before the pods are admitted to a node. If you do not want the VPA to delete pods, you can view the VPA resource limits and requests and manually update the pods as needed.
For example, if you have a pod that uses 50% of the CPU but only requests 10%, the VPA determines that the pod is consuming more CPU than requested and deletes the pod. The workload object, such as replica set, restarts the pods and the VPA updates the new pod with its recommended resources.
For developers, you can use the VPA to help ensure your pods stay up during periods of high demand by scheduling pods onto nodes that have appropriate resources for each pod.
Administrators can use the VPA to better utilize cluster resources, such as preventing pods from reserving more CPU resources than needed. The VPA monitors the resources that workloads are actually using and adjusts the resource requirements so capacity is available to other workloads. The VPA also maintains the ratios between limits and requests that are specified in initial container configuration.
If you stop running the VPA or delete a specific VPA CR in your cluster, the resource requests for the pods already modified by the VPA do not change. Any new pods get the resources defined in the workload object, not the previous recommendations made by the VPA.
2.5.2. Installing the Vertical Pod Autoscaler Operator Copy linkLink copied to clipboard!
You can use the OpenShift Container Platform web console to install the Vertical Pod Autoscaler Operator (VPA).
Procedure
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Choose VerticalPodAutoscaler from the list of available Operators, and click Install.
-
On the Install Operator page, ensure that the Operator recommended namespace option is selected. This installs the Operator in the mandatory
openshift-vertical-pod-autoscaler
namespace, which is automatically created if it does not exist. - Click Install.
Verify the installation by listing the VPA Operator components:
- Navigate to Workloads → Pods.
-
Select the
openshift-vertical-pod-autoscaler
project from the drop-down menu and verify that there are four pods running. - Navigate to Workloads → Deployments to verify that there are four deployments running.
Optional. Verify the installation in the OpenShift Container Platform CLI using the following command:
oc get all -n openshift-vertical-pod-autoscaler
$ oc get all -n openshift-vertical-pod-autoscaler
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The output shows four pods and four deplyoments:
Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.5.3. About Using the Vertical Pod Autoscaler Operator Copy linkLink copied to clipboard!
To use the Vertical Pod Autoscaler Operator (VPA), you create a VPA custom resource (CR) for a workload object in your cluster. The VPA learns and applies the optimal CPU and memory resources for the pods associated with that workload object. You can use a VPA with a deployment, stateful set, job, daemon set, replica set, or replication controller workload object. The VPA CR must be in the same project as the pods you want to monitor.
You use the VPA CR to associate a workload object and specify which mode the VPA operates in:
-
The
Auto
andRecreate
modes automatically apply the VPA CPU and memory recommendations throughout the pod lifetime. The VPA deletes any pods in the project that are out of alignment with its recommendations. When redeployed by the workload object, the VPA updates the new pods with its recommendations. -
The
Initial
mode automatically applies VPA recommendations only at pod creation. -
The
Off
mode only provides recommended resource limits and requests, allowing you to manually apply the recommendations. Theoff
mode does not update pods.
You can also use the CR to opt-out certain containers from VPA evaluation and updates.
For example, a pod has the following limits and requests:
After creating a VPA that is set to auto
, the VPA learns the resource usage and deletes the pod. When redeployed, the pod uses the new resource limits and requests:
You can view the VPA recommendations using the following command:
oc get vpa <vpa-name> --output yaml
$ oc get vpa <vpa-name> --output yaml
After a few minutes, the output shows the recommendations for CPU and memory requests, similar to the following:
Example output
The output shows the recommended resources, target
, the minimum recommended resources, lowerBound
, the highest recommended resources, upperBound
, and the most recent resource recommendations, uncappedTarget
.
The VPA uses the lowerBound
and upperBound
values to determine if a pod needs to be updated. If a pod has resource requests below the lowerBound
values or above the upperBound
values, the VPA terminates and recreates the pod with the target
values.
2.5.3.1. Automatically applying VPA recommendations Copy linkLink copied to clipboard!
To use the VPA to automatically update pods, create a VPA CR for a specific workload object with updateMode
set to Auto
or Recreate
.
When the pods are created for the workload object, the VPA constantly monitors the containers to analyze their CPU and memory needs. The VPA deletes any pods that do not meet the VPA recommendations for CPU and memory. When redeployed, the pods use the new resource limits and requests based on the VPA recommendations, honoring any pod disruption budget set for your applications. The recommendations are added to the status
field of the VPA CR for reference.
The workload object must specify a minimum of two replicas in order for the VPA to monitor and update the pods. If the workload object specifies one replica, the VPA does not delete the pod to prevent application downtime. You can manually delete the pod to use the recommended resources.
Example VPA CR for the Auto
mode
- 1 1
- The type of workload object you want this VPA CR to manage.
- 2
- The name of the workload object you want this VPA CR to manage.
- 3
- Set the mode to
Auto
orRecreate
:-
Auto
. The VPA assigns resource requests on pod creation and updates the existing pods by terminating them when the requested resources differ significantly from the new recommendation. -
Recreate
. The VPA assigns resource requests on pod creation and updates the existing pods by terminating them when the requested resources differ significantly from the new recommendation. This mode should be used rarely, only if you need to ensure that the pods are restarted whenever the resource request changes.
-
There must be operating pods in the project before the VPA can determine recommended resources and apply the recommendations to new pods.
2.5.3.2. Automatically applying VPA recommendations on pod creation Copy linkLink copied to clipboard!
To use the VPA to apply the recommended resources only when a pod is first deployed, create a VPA CR for a specific workload object with updateMode
set to Initial
.
Then, manually delete any pods associated with the workload object that you want to use the VPA recommendations. In the Initial
mode, the VPA does not delete pods and does not update the pods as it learns new resource recommendations.
Example VPA CR for the Initial
mode
There must be operating pods in the project before a VPA can determine recommended resources and apply the recommendations to new pods.
2.5.3.3. Manually applying VPA recommendations Copy linkLink copied to clipboard!
To use the VPA to only determine the recommended CPU and memory values, create a VPA CR for a specific workload object with updateMode
set to off
.
When the pods are created for that workload object, the VPA analyzes the CPU and memory needs of the containers and records those recommendations in the status
field of the VPA CR. The VPA does not update the pods as it determines new resource recommendations.
Example VPA CR for the Off
mode
You can view the recommendations using the following command.
oc get vpa <vpa-name> --output yaml
$ oc get vpa <vpa-name> --output yaml
With the recommendations, you can edit the workload object to add CPU and memory requests, then delete and redeploy the pods using the recommended resources.
There must be operating pods in the project before a VPA can determine recommended resources.
2.5.3.4. Exempting containers from applying VPA recommendations Copy linkLink copied to clipboard!
If your workload object has multiple containers and you do not want the VPA to evaluate and act on all of the containers, create a VPA CR for a specific workload object and add a resourcePolicy
to opt-out specific containers.
When the VPA updates the pods with recommended resources, any containers with a resourcePolicy
are not updated and the VPA does not present recommendations for those containers in the pod.
- 1
- The type of workload object you want this VPA CR to manage.
- 2
- The name of the workload object you want this VPA CR to manage.
- 3
- Set the mode to
Auto
,Recreate
, orOff
. TheRecreate
mode should be used rarely, only if you need to ensure that the pods are restarted whenever the resource request changes. - 4
- Specify the containers you want to opt-out and set
mode
toOff
.
For example, a pod has two containers, the same resource requests and limits:
After launching a VPA CR with the backend
container set to opt-out, the VPA terminates and recreates the pod with the recommended resources applied only to the frontend
container:
2.5.4. Using the Vertical Pod Autoscaler Operator Copy linkLink copied to clipboard!
You can use the Vertical Pod Autoscaler Operator (VPA) by creating a VPA custom resource (CR). The CR indicates which pods it should analyze and determines the actions the VPA should take with those pods.
Procedure
To create a VPA CR for a specific workload object:
Change to the project where the workload object you want to scale is located.
Create a VPA CR YAML file:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Specify the type of workload object you want this VPA to manage:
Deployment
,StatefulSet
,Job
,DaemonSet
,ReplicaSet
, orReplicationController
. - 2
- Specify the name of an existing workload object you want this VPA to manage.
- 3
- Specify the VPA mode:
-
auto
to automatically apply the recommended resources on pods associated with the controller. The VPA terminates existing pods and creates new pods with the recommended resource limits and requests. -
recreate
to automatically apply the recommended resources on pods associated with the workload object. The VPA terminates existing pods and creates new pods with the recommended resource limits and requests. Therecreate
mode should be used rarely, only if you need to ensure that the pods are restarted whenever the resource request changes. -
initial
to automatically apply the recommended resources when pods associated with the workload object are created. The VPA does not update the pods as it learns new resource recommendations. -
off
to only generate resource recommendations for the pods associated with the workload object. The VPA does not update the pods as it learns new resource recommendations and does not apply the recommendations to new pods.
-
- 4
- Optional. Specify the containers you want to opt-out and set the mode to
Off
.
Create the VPA CR:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow After a few moments, the VPA learns the resource usage of the containers in the pods associated with the workload object.
You can view the VPA recommendations using the following command:
oc get vpa <vpa-name> --output yaml
$ oc get vpa <vpa-name> --output yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The output shows the recommendations for CPU and memory requests, similar to the following:
Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.5.5. Uninstalling the Vertical Pod Autoscaler Operator Copy linkLink copied to clipboard!
You can remove the Vertical Pod Autoscaler Operator (VPA) from your OpenShift Container Platform cluster. After uninstalling, the resource requests for the pods already modified by an existing VPA CR do not change. Any new pods get the resources defined in the workload object, not the previous recommendations made by the Vertical Pod Autoscaler Operator.
You can remove a specific VPA using the oc delete vpa <vpa-name>
command. The same actions apply for resource requests as uninstalling the vertical pod autoscaler.
After removing the VPA Operator, it is recommended that you remove other components associated with the Operator to avoid potential issues.
Prerequisites
- The Vertical Pod Autoscaler Operator must be installed.
Procedure
- In the OpenShift Container Platform web console, click Operators → Installed Operators.
- Switch to the openshift-vertical-pod-autoscaler project.
- Find the VerticalPodAutoscaler Operator and click the Options menu. Select Uninstall Operator.
- In the dialog box, click Uninstall.
- Optional: To remove all operands associated with the Operator, in the dialog box, select Delete all operand instances for this operator checkbox.
- Click Uninstall.
Optional: Use the OpenShift CLI to remove the VPA components:
Delete the VPA mutating webhook configuration:
oc delete mutatingwebhookconfigurations/vpa-webhook-config
$ oc delete mutatingwebhookconfigurations/vpa-webhook-config
Copy to Clipboard Copied! Toggle word wrap Toggle overflow List any VPA custom resources:
oc get verticalpodautoscalercheckpoints.autoscaling.k8s.io,verticalpodautoscalercontrollers.autoscaling.openshift.io,verticalpodautoscalers.autoscaling.k8s.io -o wide --all-namespaces
$ oc get verticalpodautoscalercheckpoints.autoscaling.k8s.io,verticalpodautoscalercontrollers.autoscaling.openshift.io,verticalpodautoscalers.autoscaling.k8s.io -o wide --all-namespaces
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Delete the listed VPA custom resources. For example:
oc delete verticalpodautoscalercheckpoint.autoscaling.k8s.io/vpa-recommender-httpd -n my-project
$ oc delete verticalpodautoscalercheckpoint.autoscaling.k8s.io/vpa-recommender-httpd -n my-project
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc delete verticalpodautoscalercontroller.autoscaling.openshift.io/default -n openshift-vertical-pod-autoscaler
$ oc delete verticalpodautoscalercontroller.autoscaling.openshift.io/default -n openshift-vertical-pod-autoscaler
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc delete verticalpodautoscaler.autoscaling.k8s.io/vpa-recommender -n my-project
$ oc delete verticalpodautoscaler.autoscaling.k8s.io/vpa-recommender -n my-project
Copy to Clipboard Copied! Toggle word wrap Toggle overflow List any VPA custom resource definitions (CRDs):
oc get crd
$ oc get crd
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Delete the listed VPA CRDs:
oc delete crd verticalpodautoscalercheckpoints.autoscaling.k8s.io verticalpodautoscalercontrollers.autoscaling.openshift.io verticalpodautoscalers.autoscaling.k8s.io
$ oc delete crd verticalpodautoscalercheckpoints.autoscaling.k8s.io verticalpodautoscalercontrollers.autoscaling.openshift.io verticalpodautoscalers.autoscaling.k8s.io
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Deleting the CRDs removes the associated roles, cluster roles, and role bindings. However, there might be a few cluster roles that must be manually deleted.
List any VPA cluster roles:
oc get clusterrole | grep openshift-vertical-pod-autoscaler
$ oc get clusterrole | grep openshift-vertical-pod-autoscaler
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
openshift-vertical-pod-autoscaler-6896f-admin 2022-02-02T15:29:55Z openshift-vertical-pod-autoscaler-6896f-edit 2022-02-02T15:29:55Z openshift-vertical-pod-autoscaler-6896f-view 2022-02-02T15:29:55Z
openshift-vertical-pod-autoscaler-6896f-admin 2022-02-02T15:29:55Z openshift-vertical-pod-autoscaler-6896f-edit 2022-02-02T15:29:55Z openshift-vertical-pod-autoscaler-6896f-view 2022-02-02T15:29:55Z
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Delete the listed VPA cluster roles. For example:
oc delete clusterrole openshift-vertical-pod-autoscaler-6896f-admin openshift-vertical-pod-autoscaler-6896f-edit openshift-vertical-pod-autoscaler-6896f-view
$ oc delete clusterrole openshift-vertical-pod-autoscaler-6896f-admin openshift-vertical-pod-autoscaler-6896f-edit openshift-vertical-pod-autoscaler-6896f-view
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Delete the VPA Operator:
oc delete operator/vertical-pod-autoscaler.openshift-vertical-pod-autoscaler
$ oc delete operator/vertical-pod-autoscaler.openshift-vertical-pod-autoscaler
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.6. Providing sensitive data to pods Copy linkLink copied to clipboard!
Some applications need sensitive information, such as passwords and user names, that you do not want developers to have.
As an administrator, you can use Secret
objects to provide this information without exposing that information in clear text.
2.6.1. Understanding secrets Copy linkLink copied to clipboard!
The Secret
object type provides a mechanism to hold sensitive information such as passwords, OpenShift Container Platform client configuration files, private source repository credentials, and so on. Secrets decouple sensitive content from the pods. You can mount secrets into containers using a volume plug-in or the system can use secrets to perform actions on behalf of a pod.
Key properties include:
- Secret data can be referenced independently from its definition.
- Secret data volumes are backed by temporary file-storage facilities (tmpfs) and never come to rest on a node.
- Secret data can be shared within a namespace.
YAML Secret
object definition
- 1
- Indicates the structure of the secret’s key names and values.
- 2
- The allowable format for the keys in the
data
field must meet the guidelines in the DNS_SUBDOMAIN value in the Kubernetes identifiers glossary. - 3
- The value associated with keys in the
data
map must be base64 encoded. - 4
- Entries in the
stringData
map are converted to base64 and the entry will then be moved to thedata
map automatically. This field is write-only; the value will only be returned via thedata
field. - 5
- The value associated with keys in the
stringData
map is made up of plain text strings.
You must create a secret before creating the pods that depend on that secret.
When creating secrets:
- Create a secret object with secret data.
- Update the pod’s service account to allow the reference to the secret.
-
Create a pod, which consumes the secret as an environment variable or as a file (using a
secret
volume).
2.6.1.1. Types of secrets Copy linkLink copied to clipboard!
The value in the type
field indicates the structure of the secret’s key names and values. The type can be used to enforce the presence of user names and keys in the secret object. If you do not want validation, use the opaque
type, which is the default.
Specify one of the following types to trigger minimal server-side validation to ensure the presence of specific key names in the secret data:
-
kubernetes.io/service-account-token
. Uses a service account token. -
kubernetes.io/basic-auth
. Use with Basic Authentication. -
kubernetes.io/ssh-auth
. Use with SSH Key Authentication. -
kubernetes.io/tls
. Use with TLS certificate authorities.
Specify type: Opaque
if you do not want validation, which means the secret does not claim to conform to any convention for key names or values. An opaque secret, allows for unstructured key:value
pairs that can contain arbitrary values.
You can specify other arbitrary types, such as example.com/my-secret-type
. These types are not enforced server-side, but indicate that the creator of the secret intended to conform to the key/value requirements of that type.
For examples of different secret types, see the code samples in Using Secrets.
2.6.1.2. Secret data keys Copy linkLink copied to clipboard!
Secret keys must be in a DNS subdomain.
2.6.2. Understanding how to create secrets Copy linkLink copied to clipboard!
As an administrator you must create a secret before developers can create the pods that depend on that secret.
When creating secrets:
Create a secret object that contains the data you want to keep secret. The specific data required for each secret type is descibed in the following sections.
Example YAML object that creates an opaque secret
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Use either the
data
orstringdata
fields, not both.Update the pod’s service account to reference the secret:
YAML of a service account that uses a secret
apiVersion: v1 kind: ServiceAccount ... secrets: - name: test-secret
apiVersion: v1 kind: ServiceAccount ... secrets: - name: test-secret
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a pod, which consumes the secret as an environment variable or as a file (using a
secret
volume):YAML of a pod populating files in a volume with secret data
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Add a
volumeMounts
field to each container that needs the secret. - 2
- Specifies an unused directory name where you would like the secret to appear. Each key in the secret data map becomes the filename under
mountPath
. - 3
- Set to
true
. If true, this instructs the driver to provide a read-only volume. - 4
- Specifies the name of the secret.
YAML of a pod populating environment variables with secret data
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Specifies the environment variable that consumes the secret key.
YAML of a build config populating environment variables with secret data
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Specifies the environment variable that consumes the secret key.
2.6.2.1. Secret creation restrictions Copy linkLink copied to clipboard!
To use a secret, a pod needs to reference the secret. A secret can be used with a pod in three ways:
- To populate environment variables for containers.
- As files in a volume mounted on one or more of its containers.
- By kubelet when pulling images for the pod.
Volume type secrets write data into the container as a file using the volume mechanism. Image pull secrets use service accounts for the automatic injection of the secret into all pods in a namespaces.
When a template contains a secret definition, the only way for the template to use the provided secret is to ensure that the secret volume sources are validated and that the specified object reference actually points to a Secret
object. Therefore, a secret needs to be created before any pods that depend on it. The most effective way to ensure this is to have it get injected automatically through the use of a service account.
Secret API objects reside in a namespace. They can only be referenced by pods in that same namespace.
Individual secrets are limited to 1MB in size. This is to discourage the creation of large secrets that could exhaust apiserver and kubelet memory. However, creation of a number of smaller secrets could also exhaust memory.
2.6.2.2. Creating an opaque secret Copy linkLink copied to clipboard!
As an administrator, you can create an opaque secret, which allows you to store unstructured key:value
pairs that can contain arbitrary values.
Procedure
Create a
Secret
object in a YAML file on a control plane node.For example:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Specifies an opaque secret.
Use the following command to create a
Secret
object:oc create -f <filename>.yaml
$ oc create -f <filename>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To use the secret in a pod:
- Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
-
Create the pod, which consumes the secret as an environment variable or as a file (using a
secret
volume), as shown in the "Understanding how to create secrets" section.
2.6.2.3. Creating a service account token secret Copy linkLink copied to clipboard!
As an administrator, you can create a service account token secret, which allows you to distribute a service account token to applications that must authenticate to the API.
It is recommended to obtain bound service account tokens using the TokenRequest API instead of using service account token secrets. The tokens obtained from the TokenRequest API are more secure than the tokens stored in secrets, because they have a bounded lifetime and are not readable by other API clients.
You should create a service account token secret only if you cannot use the TokenRequest API and if the security exposure of a non-expiring token in a readable API object is acceptable to you.
See the Additional resources section that follows for information on creating bound service account tokens.
Procedure
Create a
Secret
object in a YAML file on a control plane node:Example
secret
object:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Use the following command to create the
Secret
object:oc create -f <filename>.yaml
$ oc create -f <filename>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To use the secret in a pod:
- Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
-
Create the pod, which consumes the secret as an environment variable or as a file (using a
secret
volume), as shown in the "Understanding how to create secrets" section.
2.6.2.4. Creating a basic authentication secret Copy linkLink copied to clipboard!
As an administrator, you can create a basic authentication secret, which allows you to store the credentials needed for basic authentication. When using this secret type, the data
parameter of the Secret
object must contain the following keys encoded in the base64 format:
-
username
: the user name for authentication -
password
: the password or token for authentication
You can use the stringData
parameter to use clear text content.
Procedure
Create a
Secret
object in a YAML file on a control plane node:Example
secret
objectCopy to Clipboard Copied! Toggle word wrap Toggle overflow Use the following command to create the
Secret
object:oc create -f <filename>.yaml
$ oc create -f <filename>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To use the secret in a pod:
- Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
-
Create the pod, which consumes the secret as an environment variable or as a file (using a
secret
volume), as shown in the "Understanding how to create secrets" section.
2.6.2.5. Creating an SSH authentication secret Copy linkLink copied to clipboard!
As an administrator, you can create an SSH authentication secret, which allows you to store data used for SSH authentication. When using this secret type, the data
parameter of the Secret
object must contain the SSH credential to use.
Procedure
Create a
Secret
object in a YAML file on a control plane node:Example
secret
object:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Use the following command to create the
Secret
object:oc create -f <filename>.yaml
$ oc create -f <filename>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To use the secret in a pod:
- Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
-
Create the pod, which consumes the secret as an environment variable or as a file (using a
secret
volume), as shown in the "Understanding how to create secrets" section.
2.6.2.6. Creating a Docker configuration secret Copy linkLink copied to clipboard!
As an administrator, you can create a Docker configuration secret, which allows you to store the credentials for accessing a container image registry.
-
kubernetes.io/dockercfg
. Use this secret type to store your local Docker configuration file. Thedata
parameter of thesecret
object must contain the contents of a.dockercfg
file encoded in the base64 format. -
kubernetes.io/dockerconfigjson
. Use this secret type to store your local Docker configuration JSON file. Thedata
parameter of thesecret
object must contain the contents of a.docker/config.json
file encoded in the base64 format.
Procedure
Create a
Secret
object in a YAML file on a control plane node.Example Docker configuration
secret
objectCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example Docker configuration JSON
secret
objectCopy to Clipboard Copied! Toggle word wrap Toggle overflow Use the following command to create the
Secret
objectoc create -f <filename>.yaml
$ oc create -f <filename>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To use the secret in a pod:
- Update the pod’s service account to reference the secret, as shown in the "Understanding how to create secrets" section.
-
Create the pod, which consumes the secret as an environment variable or as a file (using a
secret
volume), as shown in the "Understanding how to create secrets" section.
2.6.3. Understanding how to update secrets Copy linkLink copied to clipboard!
When you modify the value of a secret, the value (used by an already running pod) will not dynamically change. To change a secret, you must delete the original pod and create a new pod (perhaps with an identical PodSpec).
Updating a secret follows the same workflow as deploying a new Container image. You can use the kubectl rolling-update
command.
The resourceVersion
value in a secret is not specified when it is referenced. Therefore, if a secret is updated at the same time as pods are starting, the version of the secret that is used for the pod is not defined.
Currently, it is not possible to check the resource version of a secret object that was used when a pod was created. It is planned that pods will report this information, so that a controller could restart ones using an old resourceVersion
. In the interim, do not update the data of existing secrets, but create new ones with distinct names.
2.6.4. About using signed certificates with secrets Copy linkLink copied to clipboard!
To secure communication to your service, you can configure OpenShift Container Platform to generate a signed serving certificate/key pair that you can add into a secret in a project.
A service serving certificate secret is intended to support complex middleware applications that need out-of-the-box certificates. It has the same settings as the server certificates generated by the administrator tooling for nodes and masters.
Service Pod
spec configured for a service serving certificates secret.
- 1
- Specify the name for the certificate
Other pods can trust cluster-created certificates (which are only signed for internal DNS names), by using the CA bundle in the /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt file that is automatically mounted in their pod.
The signature algorithm for this feature is x509.SHA256WithRSA
. To manually rotate, delete the generated secret. A new certificate is created.
2.6.4.1. Generating signed certificates for use with secrets Copy linkLink copied to clipboard!
To use a signed serving certificate/key pair with a pod, create or edit the service to add the service.beta.openshift.io/serving-cert-secret-name
annotation, then add the secret to the pod.
Procedure
To create a service serving certificate secret:
-
Edit the
Pod
spec for your service. Add the
service.beta.openshift.io/serving-cert-secret-name
annotation with the name you want to use for your secret.Copy to Clipboard Copied! Toggle word wrap Toggle overflow The certificate and key are in PEM format, stored in
tls.crt
andtls.key
respectively.Create the service:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow View the secret to make sure it was created:
View a list of all secrets:
oc get secrets
$ oc get secrets
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME TYPE DATA AGE my-cert kubernetes.io/tls 2 9m
NAME TYPE DATA AGE my-cert kubernetes.io/tls 2 9m
Copy to Clipboard Copied! Toggle word wrap Toggle overflow View details on your secret:
oc describe secret my-cert
$ oc describe secret my-cert
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Edit your
Pod
spec with that secret.Copy to Clipboard Copied! Toggle word wrap Toggle overflow When it is available, your pod will run. The certificate will be good for the internal service DNS name,
<service.name>.<service.namespace>.svc
.The certificate/key pair is automatically replaced when it gets close to expiration. View the expiration date in the
service.beta.openshift.io/expiry
annotation on the secret, which is in RFC3339 format.NoteIn most cases, the service DNS name
<service.name>.<service.namespace>.svc
is not externally routable. The primary use of<service.name>.<service.namespace>.svc
is for intracluster or intraservice communication, and with re-encrypt routes.
2.6.5. Troubleshooting secrets Copy linkLink copied to clipboard!
If a service certificate generation fails with (service’s service.beta.openshift.io/serving-cert-generation-error
annotation contains):
secret/ssl-key references serviceUID 62ad25ca-d703-11e6-9d6f-0e9c0057b608, which does not match 77b6dd80-d716-11e6-9d6f-0e9c0057b60
secret/ssl-key references serviceUID 62ad25ca-d703-11e6-9d6f-0e9c0057b608, which does not match 77b6dd80-d716-11e6-9d6f-0e9c0057b60
The service that generated the certificate no longer exists, or has a different serviceUID
. You must force certificates regeneration by removing the old secret, and clearing the following annotations on the service service.beta.openshift.io/serving-cert-generation-error
, service.beta.openshift.io/serving-cert-generation-error-num
:
Delete the secret:
oc delete secret <secret_name>
$ oc delete secret <secret_name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Clear the annotations:
oc annotate service <service_name> service.beta.openshift.io/serving-cert-generation-error-
$ oc annotate service <service_name> service.beta.openshift.io/serving-cert-generation-error-
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc annotate service <service_name> service.beta.openshift.io/serving-cert-generation-error-num-
$ oc annotate service <service_name> service.beta.openshift.io/serving-cert-generation-error-num-
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
The command removing annotation has a -
after the annotation name to be removed.
2.7. Creating and using config maps Copy linkLink copied to clipboard!
The following sections define config maps and how to create and use them.
2.7.1. Understanding config maps Copy linkLink copied to clipboard!
Many applications require configuration using some combination of configuration files, command line arguments, and environment variables. In OpenShift Container Platform, these configuration artifacts are decoupled from image content in order to keep containerized applications portable.
The ConfigMap
object provides mechanisms to inject containers with configuration data while keeping containers agnostic of OpenShift Container Platform. A config map can be used to store fine-grained information like individual properties or coarse-grained information like entire configuration files or JSON blobs.
The ConfigMap
API object holds key-value pairs of configuration data that can be consumed in pods or used to store configuration data for system components such as controllers. For example:
ConfigMap
Object Definition
You can use the binaryData
field when you create a config map from a binary file, such as an image.
Configuration data can be consumed in pods in a variety of ways. A config map can be used to:
- Populate environment variable values in containers
- Set command-line arguments in a container
- Populate configuration files in a volume
Users and system components can store configuration data in a config map.
A config map is similar to a secret, but designed to more conveniently support working with strings that do not contain sensitive information.
Config map restrictions
A config map must be created before its contents can be consumed in pods.
Controllers can be written to tolerate missing configuration data. Consult individual components configured by using config maps on a case-by-case basis.
ConfigMap
objects reside in a project.
They can only be referenced by pods in the same project.
The Kubelet only supports the use of a config map for pods it gets from the API server.
This includes any pods created by using the CLI, or indirectly from a replication controller. It does not include pods created by using the OpenShift Container Platform node’s --manifest-url
flag, its --config
flag, or its REST API because these are not common ways to create pods.
2.7.2. Creating a config map in the OpenShift Container Platform web console Copy linkLink copied to clipboard!
You can create a config map in the OpenShift Container Platform web console.
Procedure
To create a config map as a cluster administrator:
-
In the Administrator perspective, select
Workloads
→Config Maps
. - At the top right side of the page, select Create Config Map.
- Enter the contents of your config map.
- Select Create.
-
In the Administrator perspective, select
To create a config map as a developer:
-
In the Developer perspective, select
Config Maps
. - At the top right side of the page, select Create Config Map.
- Enter the contents of your config map.
- Select Create.
-
In the Developer perspective, select
2.7.3. Creating a config map by using the CLI Copy linkLink copied to clipboard!
You can use the following command to create a config map from directories, specific files, or literal values.
Procedure
Create a config map:
oc create configmap <configmap_name> [options]
$ oc create configmap <configmap_name> [options]
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.7.3.1. Creating a config map from a directory Copy linkLink copied to clipboard!
You can create a config map from a directory. This method allows you to use multiple files within a directory to create a config map.
Procedure
The following example procedure outlines how to create a config map from a directory.
Start with a directory with some files that already contain the data with which you want to populate a config map:
ls example-files
$ ls example-files
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
game.properties ui.properties
game.properties ui.properties
Copy to Clipboard Copied! Toggle word wrap Toggle overflow cat example-files/game.properties
$ cat example-files/game.properties
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow cat example-files/ui.properties
$ cat example-files/ui.properties
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
color.good=purple color.bad=yellow allow.textmode=true how.nice.to.look=fairlyNice
color.good=purple color.bad=yellow allow.textmode=true how.nice.to.look=fairlyNice
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a config map holding the content of each file in this directory by entering the following command:
oc create configmap game-config \ --from-file=example-files/
$ oc create configmap game-config \ --from-file=example-files/
Copy to Clipboard Copied! Toggle word wrap Toggle overflow When the
--from-file
option points to a directory, each file directly in that directory is used to populate a key in the config map, where the name of the key is the file name, and the value of the key is the content of the file.For example, the previous command creates the following config map:
oc describe configmaps game-config
$ oc describe configmaps game-config
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow You can see that the two keys in the map are created from the file names in the directory specified in the command. Because the content of those keys might be large, the output of
oc describe
only shows the names of the keys and their sizes.Enter the
oc get
command for the object with the-o
option to see the values of the keys:oc get configmaps game-config -o yaml
$ oc get configmaps game-config -o yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.7.3.2. Creating a config map from a file Copy linkLink copied to clipboard!
You can create a config map from a file.
Procedure
The following example procedure outlines how to create a config map from a file.
If you create a config map from a file, you can include files containing non-UTF8 data that are placed in this field without corrupting the non-UTF8 data. OpenShift Container Platform detects binary files and transparently encodes the file as MIME
. On the server, the MIME
payload is decoded and stored without corrupting the data.
You can pass the --from-file
option multiple times to the CLI. The following example yields equivalent results to the creating from directories example.
Create a config map by specifying a specific file:
oc create configmap game-config-2 \ --from-file=example-files/game.properties \ --from-file=example-files/ui.properties
$ oc create configmap game-config-2 \ --from-file=example-files/game.properties \ --from-file=example-files/ui.properties
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify the results:
oc get configmaps game-config-2 -o yaml
$ oc get configmaps game-config-2 -o yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
You can specify the key to set in a config map for content imported from a file. This can be set by passing a key=value
expression to the --from-file
option. For example:
Create a config map by specifying a key-value pair:
oc create configmap game-config-3 \ --from-file=game-special-key=example-files/game.properties
$ oc create configmap game-config-3 \ --from-file=game-special-key=example-files/game.properties
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify the results:
oc get configmaps game-config-3 -o yaml
$ oc get configmaps game-config-3 -o yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- This is the key that you set in the preceding step.
2.7.3.3. Creating a config map from literal values Copy linkLink copied to clipboard!
You can supply literal values for a config map.
Procedure
The --from-literal
option takes a key=value
syntax that allows literal values to be supplied directly on the command line.
Create a config map by specifying a literal value:
oc create configmap special-config \ --from-literal=special.how=very \ --from-literal=special.type=charm
$ oc create configmap special-config \ --from-literal=special.how=very \ --from-literal=special.type=charm
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify the results:
oc get configmaps special-config -o yaml
$ oc get configmaps special-config -o yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.7.4. Use cases: Consuming config maps in pods Copy linkLink copied to clipboard!
The following sections describe some uses cases when consuming ConfigMap
objects in pods.
2.7.4.1. Populating environment variables in containers by using config maps Copy linkLink copied to clipboard!
Config maps can be used to populate individual environment variables in containers or to populate environment variables in containers from all keys that form valid environment variable names.
As an example, consider the following config map:
ConfigMap
with two environment variables
ConfigMap
with one environment variable
Procedure
You can consume the keys of this
ConfigMap
in a pod usingconfigMapKeyRef
sections.Sample
Pod
specification configured to inject specific environment variablesCopy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Stanza to pull the specified environment variables from a
ConfigMap
. - 2
- Name of a Pod environment variable that you are injecting a key’s value into.
- 3 5
- Name of the
ConfigMap
to pull specific environment variables from. - 4 6
- Environment variable to pull from the
ConfigMap
. - 7
- Makes the environment variable optional. As optional, the Pod will be started even if the specified
ConfigMap
and keys do not exist. - 8
- Stanza to pull all environment variables from a
ConfigMap
. - 9
- Name of the
ConfigMap
to pull all environment variables from.
When this Pod is run, the Pod logs will include the following output:
SPECIAL_LEVEL_KEY=very log_level=INFO
SPECIAL_LEVEL_KEY=very log_level=INFO
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
SPECIAL_TYPE_KEY=charm
is not listed in the example output because optional: true
is set.
2.7.4.2. Setting command-line arguments for container commands with config maps Copy linkLink copied to clipboard!
A config map can also be used to set the value of the commands or arguments in a container. This is accomplished by using the Kubernetes substitution syntax $(VAR_NAME)
. Consider the following config map:
Procedure
To inject values into a command in a container, you must consume the keys you want to use as environment variables, as in the consuming ConfigMaps in environment variables use case. Then you can refer to them in a container’s command using the
$(VAR_NAME)
syntax.Sample
Pod
specification configured to inject specific environment variablesCopy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Inject the values into a command in a container using the keys you want to use as environment variables.
When this pod is run, the output from the echo command run in the test-container container is as follows:
very charm
very charm
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.7.4.3. Injecting content into a volume by using config maps Copy linkLink copied to clipboard!
You can inject content into a volume by using config maps.
Example ConfigMap
custom resource (CR)
Procedure
You have a couple different options for injecting content into a volume by using config maps.
The most basic way to inject content into a volume by using a config map is to populate the volume with files where the key is the file name and the content of the file is the value of the key:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- File containing key.
When this pod is run, the output of the cat command will be:
very
very
Copy to Clipboard Copied! Toggle word wrap Toggle overflow You can also control the paths within the volume where config map keys are projected:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Path to config map key.
When this pod is run, the output of the cat command will be:
very
very
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
2.8. Using device plug-ins to access external resources with pods Copy linkLink copied to clipboard!
Device plug-ins allow you to use a particular device type (GPU, InfiniBand, or other similar computing resources that require vendor-specific initialization and setup) in your OpenShift Container Platform pod without needing to write custom code.
2.8.1. Understanding device plug-ins Copy linkLink copied to clipboard!
The device plug-in provides a consistent and portable solution to consume hardware devices across clusters. The device plug-in provides support for these devices through an extension mechanism, which makes these devices available to Containers, provides health checks of these devices, and securely shares them.
OpenShift Container Platform supports the device plug-in API, but the device plug-in Containers are supported by individual vendors.
A device plug-in is a gRPC service running on the nodes (external to the kubelet
) that is responsible for managing specific hardware resources. Any device plug-in must support following remote procedure calls (RPCs):
Example device plug-ins
For easy device plug-in reference implementation, there is a stub device plug-in in the Device Manager code: vendor/k8s.io/kubernetes/pkg/kubelet/cm/deviceplugin/device_plugin_stub.go.
2.8.1.1. Methods for deploying a device plug-in Copy linkLink copied to clipboard!
- Daemon sets are the recommended approach for device plug-in deployments.
- Upon start, the device plug-in will try to create a UNIX domain socket at /var/lib/kubelet/device-plugin/ on the node to serve RPCs from Device Manager.
- Since device plug-ins must manage hardware resources, access to the host file system, as well as socket creation, they must be run in a privileged security context.
- More specific details regarding deployment steps can be found with each device plug-in implementation.
2.8.2. Understanding the Device Manager Copy linkLink copied to clipboard!
Device Manager provides a mechanism for advertising specialized node hardware resources with the help of plug-ins known as device plug-ins.
You can advertise specialized hardware without requiring any upstream code changes.
OpenShift Container Platform supports the device plug-in API, but the device plug-in Containers are supported by individual vendors.
Device Manager advertises devices as Extended Resources. User pods can consume devices, advertised by Device Manager, using the same Limit/Request mechanism, which is used for requesting any other Extended Resource.
Upon start, the device plug-in registers itself with Device Manager invoking Register
on the /var/lib/kubelet/device-plugins/kubelet.sock and starts a gRPC service at /var/lib/kubelet/device-plugins/<plugin>.sock for serving Device Manager requests.
Device Manager, while processing a new registration request, invokes ListAndWatch
remote procedure call (RPC) at the device plug-in service. In response, Device Manager gets a list of Device objects from the plug-in over a gRPC stream. Device Manager will keep watching on the stream for new updates from the plug-in. On the plug-in side, the plug-in will also keep the stream open and whenever there is a change in the state of any of the devices, a new device list is sent to the Device Manager over the same streaming connection.
While handling a new pod admission request, Kubelet passes requested Extended Resources
to the Device Manager for device allocation. Device Manager checks in its database to verify if a corresponding plug-in exists or not. If the plug-in exists and there are free allocatable devices as well as per local cache, Allocate
RPC is invoked at that particular device plug-in.
Additionally, device plug-ins can also perform several other device-specific operations, such as driver installation, device initialization, and device resets. These functionalities vary from implementation to implementation.
2.8.3. Enabling Device Manager Copy linkLink copied to clipboard!
Enable Device Manager to implement a device plug-in to advertise specialized hardware without any upstream code changes.
Device Manager provides a mechanism for advertising specialized node hardware resources with the help of plug-ins known as device plug-ins.
Obtain the label associated with the static
MachineConfigPool
CRD for the type of node you want to configure. Perform one of the following steps:View the machine config:
oc describe machineconfig <name>
# oc describe machineconfig <name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc describe machineconfig 00-worker
# oc describe machineconfig 00-worker
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Name: 00-worker Namespace: Labels: machineconfiguration.openshift.io/role=worker
Name: 00-worker Namespace: Labels: machineconfiguration.openshift.io/role=worker
1 Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Label required for the Device Manager.
Procedure
Create a custom resource (CR) for your configuration change.
Sample configuration for a Device Manager CR
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the Device Manager:
oc create -f devicemgr.yaml
$ oc create -f devicemgr.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
kubeletconfig.machineconfiguration.openshift.io/devicemgr created
kubeletconfig.machineconfiguration.openshift.io/devicemgr created
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - Ensure that Device Manager was actually enabled by confirming that /var/lib/kubelet/device-plugins/kubelet.sock is created on the node. This is the UNIX domain socket on which the Device Manager gRPC server listens for new plug-in registrations. This sock file is created when the Kubelet is started only if Device Manager is enabled.
2.9. Including pod priority in pod scheduling decisions Copy linkLink copied to clipboard!
You can enable pod priority and preemption in your cluster. Pod priority indicates the importance of a pod relative to other pods and queues the pods based on that priority. pod preemption allows the cluster to evict, or preempt, lower-priority pods so that higher-priority pods can be scheduled if there is no available space on a suitable node pod priority also affects the scheduling order of pods and out-of-resource eviction ordering on the node.
To use priority and preemption, you create priority classes that define the relative weight of your pods. Then, reference a priority class in the pod specification to apply that weight for scheduling.
2.9.1. Understanding pod priority Copy linkLink copied to clipboard!
When you use the Pod Priority and Preemption feature, the scheduler orders pending pods by their priority, and a pending pod is placed ahead of other pending pods with lower priority in the scheduling queue. As a result, the higher priority pod might be scheduled sooner than pods with lower priority if its scheduling requirements are met. If a pod cannot be scheduled, scheduler continues to schedule other lower priority pods.
2.9.1.1. Pod priority classes Copy linkLink copied to clipboard!
You can assign pods a priority class, which is a non-namespaced object that defines a mapping from a name to the integer value of the priority. The higher the value, the higher the priority.
A priority class object can take any 32-bit integer value smaller than or equal to 1000000000 (one billion). Reserve numbers larger than one billion for critical pods that should not be preempted or evicted. By default, OpenShift Container Platform has two reserved priority classes for critical system pods to have guaranteed scheduling.
oc get priorityclasses
$ oc get priorityclasses
Example output
NAME VALUE GLOBAL-DEFAULT AGE cluster-logging 1000000 false 29s system-cluster-critical 2000000000 false 72m system-node-critical 2000001000 false 72m
NAME VALUE GLOBAL-DEFAULT AGE
cluster-logging 1000000 false 29s
system-cluster-critical 2000000000 false 72m
system-node-critical 2000001000 false 72m
system-node-critical - This priority class has a value of 2000001000 and is used for all pods that should never be evicted from a node. Examples of pods that have this priority class are
sdn-ovs
,sdn
, and so forth. A number of critical components include thesystem-node-critical
priority class by default, for example:- master-api
- master-controller
- master-etcd
- sdn
- sdn-ovs
- sync
system-cluster-critical - This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances. For example, pods configured with the
system-node-critical
priority class can take priority. However, this priority class does ensure guaranteed scheduling. Examples of pods that can have this priority class are fluentd, add-on components like descheduler, and so forth. A number of critical components include thesystem-cluster-critical
priority class by default, for example:- fluentd
- metrics-server
- descheduler
- cluster-logging - This priority is used by Fluentd to make sure Fluentd pods are scheduled to nodes over other apps.
2.9.1.2. Pod priority names Copy linkLink copied to clipboard!
After you have one or more priority classes, you can create pods that specify a priority class name in a Pod
spec. The priority admission controller uses the priority class name field to populate the integer value of the priority. If the named priority class is not found, the pod is rejected.
2.9.2. Understanding pod preemption Copy linkLink copied to clipboard!
When a developer creates a pod, the pod goes into a queue. If the developer configured the pod for pod priority or preemption, the scheduler picks a pod from the queue and tries to schedule the pod on a node. If the scheduler cannot find space on an appropriate node that satisfies all the specified requirements of the pod, preemption logic is triggered for the pending pod.
When the scheduler preempts one or more pods on a node, the nominatedNodeName
field of higher-priority Pod
spec is set to the name of the node, along with the nodename
field. The scheduler uses the nominatedNodeName
field to keep track of the resources reserved for pods and also provides information to the user about preemptions in the clusters.
After the scheduler preempts a lower-priority pod, the scheduler honors the graceful termination period of the pod. If another node becomes available while scheduler is waiting for the lower-priority pod to terminate, the scheduler can schedule the higher-priority pod on that node. As a result, the nominatedNodeName
field and nodeName
field of the Pod
spec might be different.
Also, if the scheduler preempts pods on a node and is waiting for termination, and a pod with a higher-priority pod than the pending pod needs to be scheduled, the scheduler can schedule the higher-priority pod instead. In such a case, the scheduler clears the nominatedNodeName
of the pending pod, making the pod eligible for another node.
Preemption does not necessarily remove all lower-priority pods from a node. The scheduler can schedule a pending pod by removing a portion of the lower-priority pods.
The scheduler considers a node for pod preemption only if the pending pod can be scheduled on the node.
2.9.2.1. Pod preemption and other scheduler settings Copy linkLink copied to clipboard!
If you enable pod priority and preemption, consider your other scheduler settings:
- Pod priority and pod disruption budget
- A pod disruption budget specifies the minimum number or percentage of replicas that must be up at a time. If you specify pod disruption budgets, OpenShift Container Platform respects them when preempting pods at a best effort level. The scheduler attempts to preempt pods without violating the pod disruption budget. If no such pods are found, lower-priority pods might be preempted despite their pod disruption budget requirements.
- Pod priority and pod affinity
- Pod affinity requires a new pod to be scheduled on the same node as other pods with the same label.
If a pending pod has inter-pod affinity with one or more of the lower-priority pods on a node, the scheduler cannot preempt the lower-priority pods without violating the affinity requirements. In this case, the scheduler looks for another node to schedule the pending pod. However, there is no guarantee that the scheduler can find an appropriate node and pending pod might not be scheduled.
To prevent this situation, carefully configure pod affinity with equal-priority pods.
2.9.2.2. Graceful termination of preempted pods Copy linkLink copied to clipboard!
When preempting a pod, the scheduler waits for the pod graceful termination period to expire, allowing the pod to finish working and exit. If the pod does not exit after the period, the scheduler kills the pod. This graceful termination period creates a time gap between the point that the scheduler preempts the pod and the time when the pending pod can be scheduled on the node.
To minimize this gap, configure a small graceful termination period for lower-priority pods.
2.9.3. Configuring priority and preemption Copy linkLink copied to clipboard!
You apply pod priority and preemption by creating a priority class object and associating pods to the priority using the priorityClassName
in your Pod
specs.
Sample priority class object
- 1
- The name of the priority class object.
- 2
- The priority value of the object.
- 3
- Optional field that indicates whether this priority class should be used for pods without a priority class name specified. This field is
false
by default. Only one priority class withglobalDefault
set totrue
can exist in the cluster. If there is no priority class withglobalDefault:true
, the priority of pods with no priority class name is zero. Adding a priority class withglobalDefault:true
affects only pods created after the priority class is added and does not change the priorities of existing pods. - 4
- Optional arbitrary text string that describes which pods developers should use with this priority class.
Procedure
To configure your cluster to use priority and preemption:
Create one or more priority classes:
- Specify a name and value for the priority.
-
Optionally specify the
globalDefault
field in the priority class and a description.
Create a
Pod
spec or edit existing pods to include the name of a priority class, similar to the following:Sample
Pod
spec with priority class nameCopy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Specify the priority class to use with this pod.
Create the pod:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow You can add the priority name directly to the pod configuration or to a pod template.
2.10. Placing pods on specific nodes using node selectors Copy linkLink copied to clipboard!
A node selector specifies a map of key-value pairs. The rules are defined using custom labels on nodes and selectors specified in pods.
For the pod to be eligible to run on a node, the pod must have the indicated key-value pairs as the label on the node.
If you are using node affinity and node selectors in the same pod configuration, see the important considerations below.
2.10.1. Using node selectors to control pod placement Copy linkLink copied to clipboard!
You can use node selectors on pods and labels on nodes to control where the pod is scheduled. With node selectors, OpenShift Container Platform schedules the pods on nodes that contain matching labels.
You add labels to a node, a machine set, or a machine config. Adding the label to the machine set ensures that if the node or machine goes down, new nodes have the label. Labels added to a node or machine config do not persist if the node or machine goes down.
To add node selectors to an existing pod, add a node selector to the controlling object for that pod, such as a ReplicaSet
object, DaemonSet
object, StatefulSet
object, Deployment
object, or DeploymentConfig
object. Any existing pods under that controlling object are recreated on a node with a matching label. If you are creating a new pod, you can add the node selector directly to the Pod
spec.
You cannot add a node selector directly to an existing scheduled pod.
Prerequisites
To add a node selector to existing pods, determine the controlling object for that pod. For example, the router-default-66d5cf9464-m2g75
pod is controlled by the router-default-66d5cf9464
replica set:
The web console lists the controlling object under ownerReferences
in the pod YAML:
Procedure
Add labels to a node by using a machine set or editing the node directly:
Use a
MachineSet
object to add labels to nodes managed by the machine set when a node is created:Run the following command to add labels to a
MachineSet
object:oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>"="<value>","<key>"="<value>"}}]' -n openshift-machine-api
$ oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>"="<value>","<key>"="<value>"}}]' -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc patch MachineSet abc612-msrtw-worker-us-east-1c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]' -n openshift-machine-api
$ oc patch MachineSet abc612-msrtw-worker-us-east-1c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]' -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the labels are added to the
MachineSet
object by using theoc edit
command:For example:
oc edit MachineSet abc612-msrtw-worker-us-east-1c -n openshift-machine-api
$ oc edit MachineSet abc612-msrtw-worker-us-east-1c -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example
MachineSet
objectCopy to Clipboard Copied! Toggle word wrap Toggle overflow
Add labels directly to a node:
Edit the
Node
object for the node:oc label nodes <name> <key>=<value>
$ oc label nodes <name> <key>=<value>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example, to label a node:
oc label nodes ip-10-0-142-25.ec2.internal type=user-node region=east
$ oc label nodes ip-10-0-142-25.ec2.internal type=user-node region=east
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the labels are added to the node:
oc get nodes -l type=user-node,region=east
$ oc get nodes -l type=user-node,region=east
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME STATUS ROLES AGE VERSION ip-10-0-142-25.ec2.internal Ready worker 17m v1.18.3+002a51f
NAME STATUS ROLES AGE VERSION ip-10-0-142-25.ec2.internal Ready worker 17m v1.18.3+002a51f
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Add the matching node selector to a pod:
To add a node selector to existing and future pods, add a node selector to the controlling object for the pods:
Example
ReplicaSet
object with labelsCopy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Add the node selector.
To add a node selector to a specific, new pod, add the selector to the
Pod
object directly:Example
Pod
object with a node selectorCopy to Clipboard Copied! Toggle word wrap Toggle overflow NoteYou cannot add a node selector directly to an existing scheduled pod.
Chapter 3. Controlling pod placement onto nodes (scheduling) Copy linkLink copied to clipboard!
3.1. Controlling pod placement using the scheduler Copy linkLink copied to clipboard!
Pod scheduling is an internal process that determines placement of new pods onto nodes within the cluster.
The scheduler code has a clean separation that watches new pods as they get created and identifies the most suitable node to host them. It then creates bindings (pod to node bindings) for the pods using the master API.
- Default pod scheduling
- OpenShift Container Platform comes with a default scheduler that serves the needs of most users. The default scheduler uses both inherent and customization tools to determine the best fit for a pod.
- Advanced pod scheduling
In situations where you might want more control over where new pods are placed, the OpenShift Container Platform advanced scheduling features allow you to configure a pod so that the pod is required or has a preference to run on a particular node or alongside a specific pod by.
- Using pod affinity and anti-affinity rules.
- Controlling pod placement with pod affinity.
- Controlling pod placement with node affinity.
- Placing pods on overcomitted nodes.
- Controlling pod placement with node selectors.
- Controlling pod placement with taints and tolerations.
3.1.1. Scheduler Use Cases Copy linkLink copied to clipboard!
One of the important use cases for scheduling within OpenShift Container Platform is to support flexible affinity and anti-affinity policies.
3.1.1.1. Infrastructure Topological Levels Copy linkLink copied to clipboard!
Administrators can define multiple topological levels for their infrastructure (nodes) by specifying labels on nodes. For example: region=r1
, zone=z1
, rack=s1
.
These label names have no particular meaning and administrators are free to name their infrastructure levels anything, such as city/building/room. Also, administrators can define any number of levels for their infrastructure topology, with three levels usually being adequate (such as: regions
→ zones
→ racks
). Administrators can specify affinity and anti-affinity rules at each of these levels in any combination.
3.1.1.2. Affinity Copy linkLink copied to clipboard!
Administrators should be able to configure the scheduler to specify affinity at any topological level, or even at multiple levels. Affinity at a particular level indicates that all pods that belong to the same service are scheduled onto nodes that belong to the same level. This handles any latency requirements of applications by allowing administrators to ensure that peer pods do not end up being too geographically separated. If no node is available within the same affinity group to host the pod, then the pod is not scheduled.
If you need greater control over where the pods are scheduled, see Controlling pod placement on nodes using node affinity rules and Placing pods relative to other pods using affinity and anti-affinity rules.
These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.
3.1.1.3. Anti-Affinity Copy linkLink copied to clipboard!
Administrators should be able to configure the scheduler to specify anti-affinity at any topological level, or even at multiple levels. Anti-affinity (or 'spread') at a particular level indicates that all pods that belong to the same service are spread across nodes that belong to that level. This ensures that the application is well spread for high availability purposes. The scheduler tries to balance the service pods across all applicable nodes as evenly as possible.
If you need greater control over where the pods are scheduled, see Controlling pod placement on nodes using node affinity rules and Placing pods relative to other pods using affinity and anti-affinity rules.
These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.
3.2. Configuring the default scheduler to control pod placement Copy linkLink copied to clipboard!
The default OpenShift Container Platform pod scheduler is responsible for determining placement of new pods onto nodes within the cluster. It reads data from the pod and tries to find a node that is a good fit based on configured policies. It is completely independent and exists as a standalone/pluggable solution. It does not modify the pod and just creates a binding for the pod that ties the pod to the particular node.
A selection of predicates and priorities defines the policy for the scheduler. See Modifying scheduler policy for a list of predicates and priorities.
Sample default scheduler object
- 1
- You can specify the name of a custom scheduler policy file.
- 2
- Optional: Specify a default node selector to restrict pod placement to specific nodes. The default node selector is applied to the pods created in all namespaces. Pods can be scheduled on nodes with labels that match the default node selector and any existing pod node selectors. Namespaces having project-wide node selectors are not impacted even if this field is set.
3.2.1. Understanding default scheduling Copy linkLink copied to clipboard!
The existing generic scheduler is the default platform-provided scheduler engine that selects a node to host the pod in a three-step operation:
- Filters the Nodes
- The available nodes are filtered based on the constraints or requirements specified. This is done by running each node through the list of filter functions called predicates.
- Prioritize the Filtered List of Nodes
- This is achieved by passing each node through a series of priority_ functions that assign it a score between 0 - 10, with 0 indicating a bad fit and 10 indicating a good fit to host the pod. The scheduler configuration can also take in a simple weight (positive numeric value) for each priority function. The node score provided by each priority function is multiplied by the weight (default weight for most priorities is 1) and then combined by adding the scores for each node provided by all the priorities. This weight attribute can be used by administrators to give higher importance to some priorities.
- Select the Best Fit Node
- The nodes are sorted based on their scores and the node with the highest score is selected to host the pod. If multiple nodes have the same high score, then one of them is selected at random.
3.2.1.1. Understanding Scheduler Policy Copy linkLink copied to clipboard!
The selection of the predicate and priorities defines the policy for the scheduler.
The scheduler configuration file is a JSON file, which must be named policy.cfg
, that specifies the predicates and priorities the scheduler will consider.
In the absence of the scheduler policy file, the default scheduler behavior is used.
The predicates and priorities defined in the scheduler configuration file completely override the default scheduler policy. If any of the default predicates and priorities are required, you must explicitly specify the functions in the policy configuration.
Sample scheduler config map
- 1
- The
GeneralPredicates
predicate represents thePodFitsResources
,HostName
,PodFitsHostPorts
, andMatchNodeSelector
predicates. Because you are not allowed to configure the same predicate multiple times, theGeneralPredicates
predicate cannot be used alongside any of the four represented predicates.
3.2.2. Creating a scheduler policy file Copy linkLink copied to clipboard!
You can change the default scheduling behavior by creating a JSON file with the desired predicates and priorities. You then generate a config map from the JSON file and point the cluster
Scheduler object to use the config map.
Procedure
To configure the scheduler policy:
Create a JSON file named
policy.cfg
with the desired predicates and priorities.Sample scheduler JSON file
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a config map based on the scheduler JSON file:
oc create configmap -n openshift-config --from-file=policy.cfg <configmap-name>
$ oc create configmap -n openshift-config --from-file=policy.cfg <configmap-name>
1 Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Enter a name for the config map.
For example:
oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy
$ oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
configmap/scheduler-policy created
configmap/scheduler-policy created
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Edit the Scheduler Operator custom resource to add the config map:
oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"<configmap-name>"}}}' --type=merge
$ oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"<configmap-name>"}}}' --type=merge
1 Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Specify the name of the config map.
For example:
oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"scheduler-policy"}}}' --type=merge
$ oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"scheduler-policy"}}}' --type=merge
Copy to Clipboard Copied! Toggle word wrap Toggle overflow After making the change to the
Scheduler
config resource, wait for theopenshift-kube-apiserver
pods to redeploy. This can take several minutes. Until the pods redeploy, new scheduler does not take effect.Verify the scheduler policy is configured by viewing the log of a scheduler pod in the
openshift-kube-scheduler
namespace. The following command checks for the predicates and priorities that are being registered by the scheduler:oc logs <scheduler-pod> | grep predicates
$ oc logs <scheduler-pod> | grep predicates
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc logs openshift-kube-scheduler-ip-10-0-141-29.ec2.internal | grep predicates
$ oc logs openshift-kube-scheduler-ip-10-0-141-29.ec2.internal | grep predicates
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Creating scheduler with fit predicates 'map[MaxGCEPDVolumeCount:{} MaxAzureDiskVolumeCount:{} CheckNodeUnschedulable:{} NoDiskConflict:{} NoVolumeZoneConflict:{} GeneralPredicates:{} MaxCSIVolumeCountPred:{} CheckVolumeBinding:{} MaxEBSVolumeCount:{} MatchInterPodAffinity:{} PodToleratesNodeTaints:{}]' and priority functions 'map[InterPodAffinityPriority:{} LeastRequestedPriority:{} ServiceSpreadingPriority:{} ImageLocalityPriority:{} SelectorSpreadPriority:{} EqualPriority:{} BalancedResourceAllocation:{} NodePreferAvoidPodsPriority:{} NodeAffinityPriority:{} TaintTolerationPriority:{}]'
Creating scheduler with fit predicates 'map[MaxGCEPDVolumeCount:{} MaxAzureDiskVolumeCount:{} CheckNodeUnschedulable:{} NoDiskConflict:{} NoVolumeZoneConflict:{} GeneralPredicates:{} MaxCSIVolumeCountPred:{} CheckVolumeBinding:{} MaxEBSVolumeCount:{} MatchInterPodAffinity:{} PodToleratesNodeTaints:{}]' and priority functions 'map[InterPodAffinityPriority:{} LeastRequestedPriority:{} ServiceSpreadingPriority:{} ImageLocalityPriority:{} SelectorSpreadPriority:{} EqualPriority:{} BalancedResourceAllocation:{} NodePreferAvoidPodsPriority:{} NodeAffinityPriority:{} TaintTolerationPriority:{}]'
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.2.3. Modifying scheduler policies Copy linkLink copied to clipboard!
You change scheduling behavior by creating or editing your scheduler policy config map in the openshift-config
project. Add and remove predicates and priorities to the config map to create a scheduler policy.
Procedure
To modify the current custom scheduling, use one of the following methods:
Edit the scheduler policy config map:
oc edit configmap <configmap-name> -n openshift-config
$ oc edit configmap <configmap-name> -n openshift-config
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc edit configmap scheduler-policy -n openshift-config
$ oc edit configmap scheduler-policy -n openshift-config
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow It can take a few minutes for the scheduler to restart the pods with the updated policy.
Change the policies and predicates being used:
Remove the scheduler policy config map:
oc delete configmap -n openshift-config <name>
$ oc delete configmap -n openshift-config <name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc delete configmap -n openshift-config scheduler-policy
$ oc delete configmap -n openshift-config scheduler-policy
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Edit the
policy.cfg
file to add and remove policies and predicates as needed.For example:
vi policy.cfg
$ vi policy.cfg
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Re-create the scheduler policy config map based on the scheduler JSON file:
oc create configmap -n openshift-config --from-file=policy.cfg <configmap-name>
$ oc create configmap -n openshift-config --from-file=policy.cfg <configmap-name>
1 Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Enter a name for the config map.
For example:
oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy
$ oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
configmap/scheduler-policy created
configmap/scheduler-policy created
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.2.3.1. Understanding the scheduler predicates Copy linkLink copied to clipboard!
Predicates are rules that filter out unqualified nodes.
There are several predicates provided by default in OpenShift Container Platform. Some of these predicates can be customized by providing certain parameters. Multiple predicates can be combined to provide additional filtering of nodes.
3.2.3.1.1. Static Predicates Copy linkLink copied to clipboard!
These predicates do not take any configuration parameters or inputs from the user. These are specified in the scheduler configuration using their exact name.
3.2.3.1.1.1. Default Predicates Copy linkLink copied to clipboard!
The default scheduler policy includes the following predicates:
The NoVolumeZoneConflict
predicate checks that the volumes a pod requests are available in the zone.
{"name" : "NoVolumeZoneConflict"}
{"name" : "NoVolumeZoneConflict"}
The MaxEBSVolumeCount
predicate checks the maximum number of volumes that can be attached to an AWS instance.
{"name" : "MaxEBSVolumeCount"}
{"name" : "MaxEBSVolumeCount"}
The MaxAzureDiskVolumeCount
predicate checks the maximum number of Azure Disk Volumes.
{"name" : "MaxAzureDiskVolumeCount"}
{"name" : "MaxAzureDiskVolumeCount"}
The PodToleratesNodeTaints
predicate checks if a pod can tolerate the node taints.
{"name" : "PodToleratesNodeTaints"}
{"name" : "PodToleratesNodeTaints"}
The CheckNodeUnschedulable
predicate checks if a pod can be scheduled on a node with Unschedulable
spec.
{"name" : "CheckNodeUnschedulable"}
{"name" : "CheckNodeUnschedulable"}
The CheckVolumeBinding
predicate evaluates if a pod can fit based on the volumes, it requests, for both bound and unbound PVCs.
- For PVCs that are bound, the predicate checks that the corresponding PV’s node affinity is satisfied by the given node.
- For PVCs that are unbound, the predicate searched for available PVs that can satisfy the PVC requirements and that the PV node affinity is satisfied by the given node.
The predicate returns true if all bound PVCs have compatible PVs with the node, and if all unbound PVCs can be matched with an available and node-compatible PV.
{"name" : "CheckVolumeBinding"}
{"name" : "CheckVolumeBinding"}
The NoDiskConflict
predicate checks if the volume requested by a pod is available.
{"name" : "NoDiskConflict"}
{"name" : "NoDiskConflict"}
The MaxGCEPDVolumeCount
predicate checks the maximum number of Google Compute Engine (GCE) Persistent Disks (PD).
{"name" : "MaxGCEPDVolumeCount"}
{"name" : "MaxGCEPDVolumeCount"}
The MaxCSIVolumeCountPred
predicate determines how many Container Storage Interface (CSI) volumes should be attached to a node and whether that number exceeds a configured limit.
{"name" : "MaxCSIVolumeCountPred"}
{"name" : "MaxCSIVolumeCountPred"}
The MatchInterPodAffinity
predicate checks if the pod affinity/anti-affinity rules permit the pod.
{"name" : "MatchInterPodAffinity"}
{"name" : "MatchInterPodAffinity"}
3.2.3.1.1.2. Other Static Predicates Copy linkLink copied to clipboard!
OpenShift Container Platform also supports the following predicates:
The CheckNode-*
predicates cannot be used if the Taint Nodes By Condition feature is enabled. The Taint Nodes By Condition feature is enabled by default.
The CheckNodeCondition
predicate checks if a pod can be scheduled on a node reporting out of disk, network unavailable, or not ready conditions.
{"name" : "CheckNodeCondition"}
{"name" : "CheckNodeCondition"}
The CheckNodeLabelPresence
predicate checks if all of the specified labels exist on a node, regardless of their value.
{"name" : "CheckNodeLabelPresence"}
{"name" : "CheckNodeLabelPresence"}
The checkServiceAffinity
predicate checks that ServiceAffinity labels are homogeneous for pods that are scheduled on a node.
{"name" : "checkServiceAffinity"}
{"name" : "checkServiceAffinity"}
The PodToleratesNodeNoExecuteTaints
predicate checks if a pod tolerations can tolerate a node NoExecute
taints.
{"name" : "PodToleratesNodeNoExecuteTaints"}
{"name" : "PodToleratesNodeNoExecuteTaints"}
3.2.3.1.2. General Predicates Copy linkLink copied to clipboard!
The following general predicates check whether non-critical predicates and essential predicates pass. Non-critical predicates are the predicates that only non-critical pods must pass and essential predicates are the predicates that all pods must pass.
The default scheduler policy includes the general predicates.
Non-critical general predicates
The PodFitsResources
predicate determines a fit based on resource availability (CPU, memory, GPU, and so forth). The nodes can declare their resource capacities and then pods can specify what resources they require. Fit is based on requested, rather than used resources.
{"name" : "PodFitsResources"}
{"name" : "PodFitsResources"}
Essential general predicates
The PodFitsHostPorts
predicate determines if a node has free ports for the requested pod ports (absence of port conflicts).
{"name" : "PodFitsHostPorts"}
{"name" : "PodFitsHostPorts"}
The HostName
predicate determines fit based on the presence of the Host parameter and a string match with the name of the host.
{"name" : "HostName"}
{"name" : "HostName"}
The MatchNodeSelector
predicate determines fit based on node selector (nodeSelector) queries defined in the pod.
{"name" : "MatchNodeSelector"}
{"name" : "MatchNodeSelector"}
3.2.3.2. Understanding the scheduler priorities Copy linkLink copied to clipboard!
Priorities are rules that rank nodes according to preferences.
A custom set of priorities can be specified to configure the scheduler. There are several priorities provided by default in OpenShift Container Platform. Other priorities can be customized by providing certain parameters. Multiple priorities can be combined and different weights can be given to each in order to impact the prioritization.
3.2.3.2.1. Static Priorities Copy linkLink copied to clipboard!
Static priorities do not take any configuration parameters from the user, except weight. A weight is required to be specified and cannot be 0 or negative.
These are specified in the scheduler policy config map in the openshift-config
project.
3.2.3.2.1.1. Default Priorities Copy linkLink copied to clipboard!
The default scheduler policy includes the following priorities. Each of the priority function has a weight of 1
except NodePreferAvoidPodsPriority
, which has a weight of 10000
.
The NodeAffinityPriority
priority prioritizes nodes according to node affinity scheduling preferences
{"name" : "NodeAffinityPriority", "weight" : 1}
{"name" : "NodeAffinityPriority", "weight" : 1}
The TaintTolerationPriority
priority prioritizes nodes that have a fewer number of intolerable taints on them for a pod. An intolerable taint is one which has key PreferNoSchedule
.
{"name" : "TaintTolerationPriority", "weight" : 1}
{"name" : "TaintTolerationPriority", "weight" : 1}
The ImageLocalityPriority
priority prioritizes nodes that already have requested pod container’s images.
{"name" : "ImageLocalityPriority", "weight" : 1}
{"name" : "ImageLocalityPriority", "weight" : 1}
The SelectorSpreadPriority
priority looks for services, replication controllers (RC), replication sets (RS), and stateful sets that match the pod, then finds existing pods that match those selectors. The scheduler favors nodes that have fewer existing matching pods. Then, it schedules the pod on a node with the smallest number of pods that match those selectors as the pod being scheduled.
{"name" : "SelectorSpreadPriority", "weight" : 1}
{"name" : "SelectorSpreadPriority", "weight" : 1}
The InterPodAffinityPriority
priority computes a sum by iterating through the elements of weightedPodAffinityTerm
and adding weight to the sum if the corresponding PodAffinityTerm is satisfied for that node. The node(s) with the highest sum are the most preferred.
{"name" : "InterPodAffinityPriority", "weight" : 1}
{"name" : "InterPodAffinityPriority", "weight" : 1}
The LeastRequestedPriority
priority favors nodes with fewer requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes nodes that have the highest available/remaining capacity.
{"name" : "LeastRequestedPriority", "weight" : 1}
{"name" : "LeastRequestedPriority", "weight" : 1}
The BalancedResourceAllocation
priority favors nodes with balanced resource usage rate. It calculates the difference between the consumed CPU and memory as a fraction of capacity, and prioritizes the nodes based on how close the two metrics are to each other. This should always be used together with LeastRequestedPriority
.
{"name" : "BalancedResourceAllocation", "weight" : 1}
{"name" : "BalancedResourceAllocation", "weight" : 1}
The NodePreferAvoidPodsPriority
priority ignores pods that are owned by a controller other than a replication controller.
{"name" : "NodePreferAvoidPodsPriority", "weight" : 10000}
{"name" : "NodePreferAvoidPodsPriority", "weight" : 10000}
3.2.3.2.1.2. Other Static Priorities Copy linkLink copied to clipboard!
OpenShift Container Platform also supports the following priorities:
The EqualPriority
priority gives an equal weight of 1
to all nodes, if no priority configurations are provided. We recommend using this priority only for testing environments.
{"name" : "EqualPriority", "weight" : 1}
{"name" : "EqualPriority", "weight" : 1}
The MostRequestedPriority
priority prioritizes nodes with most requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the maximum of the average of the fraction of requested to capacity.
{"name" : "MostRequestedPriority", "weight" : 1}
{"name" : "MostRequestedPriority", "weight" : 1}
The ServiceSpreadingPriority
priority spreads pods by minimizing the number of pods belonging to the same service onto the same machine.
{"name" : "ServiceSpreadingPriority", "weight" : 1}
{"name" : "ServiceSpreadingPriority", "weight" : 1}
3.2.3.2.2. Configurable Priorities Copy linkLink copied to clipboard!
You can configure these priorities in the scheduler policy config map, in the openshift-config
namespace, to add labels to affect how the priorities work.
The type of the priority function is identified by the argument that they take. Since these are configurable, multiple priorities of the same type (but different configuration parameters) can be combined as long as their user-defined names are different.
For information on using these priorities, see Modifying Scheduler Policy.
The ServiceAntiAffinity
priority takes a label and ensures a good spread of the pods belonging to the same service across the group of nodes based on the label values. It gives the same score to all nodes that have the same value for the specified label. It gives a higher score to nodes within a group with the least concentration of pods.
For example:
In some situations using the ServiceAntiAffinity
parameter based on custom labels does not spread pod as expected. See this Red Hat Solution.
The labelPreference
parameter gives priority based on the specified label. If the label is present on a node, that node is given priority. If no label is specified, priority is given to nodes that do not have a label. If multiple priorities with the labelPreference
parameter are set, all of the priorities must have the same weight.
3.2.4. Sample Policy Configurations Copy linkLink copied to clipboard!
The configuration below specifies the default scheduler configuration, if it were to be specified using the scheduler policy file.
In all of the sample configurations below, the list of predicates and priority functions is truncated to include only the ones that pertain to the use case specified. In practice, a complete/meaningful scheduler policy should include most, if not all, of the default predicates and priorities listed above.
The following example defines three topological levels, region (affinity) → zone (affinity) → rack (anti-affinity):
The following example defines three topological levels, city
(affinity) → building
(anti-affinity) → room
(anti-affinity):
The following example defines a policy to only use nodes with the 'region' label defined and prefer nodes with the 'zone' label defined:
The following example combines both static and configurable predicates and also priorities:
3.3. Placing pods relative to other pods using affinity and anti-affinity rules Copy linkLink copied to clipboard!
Affinity is a property of pods that controls the nodes on which they prefer to be scheduled. Anti-affinity is a property of pods that prevents a pod from being scheduled on a node.
In OpenShift Container Platform pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key/value labels on other pods.
3.3.1. Understanding pod affinity Copy linkLink copied to clipboard!
Pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key/value labels on other pods.
- Pod affinity can tell the scheduler to locate a new pod on the same node as other pods if the label selector on the new pod matches the label on the current pod.
- Pod anti-affinity can prevent the scheduler from locating a new pod on the same node as pods with the same labels if the label selector on the new pod matches the label on the current pod.
For example, using affinity rules, you could spread or pack pods within a service or relative to pods in other services. Anti-affinity rules allow you to prevent pods of a particular service from scheduling on the same nodes as pods of another service that are known to interfere with the performance of the pods of the first service. Or, you could spread the pods of a service across nodes or availability zones to reduce correlated failures.
There are two types of pod affinity rules: required and preferred.
Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.
Depending on your pod priority and preemption settings, the scheduler might not be able to find an appropriate node for a pod without violating affinity requirements. If so, a pod might not be scheduled.
To prevent this situation, carefully configure pod affinity with equal-priority pods.
You configure pod affinity/anti-affinity through the Pod
spec files. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, then attempts to meet the preferred rule.
The following example shows a Pod
spec configured for pod affinity and anti-affinity.
In this example, the pod affinity rule indicates that the pod can schedule onto a node only if that node has at least one already-running pod with a label that has the key security
and value S1
. The pod anti-affinity rule says that the pod prefers to not schedule onto a node if that node is already running a pod with label having key security
and value S2
.
Sample Pod
config file with pod affinity
- 1
- Stanza to configure pod affinity.
- 2
- Defines a required rule.
- 3 5
- The key and value (label) that must be matched to apply the rule.
- 4
- The operator represents the relationship between the label on the existing pod and the set of values in the
matchExpression
parameters in the specification for the new pod. Can beIn
,NotIn
,Exists
, orDoesNotExist
.
Sample Pod
config file with pod anti-affinity
- 1
- Stanza to configure pod anti-affinity.
- 2
- Defines a preferred rule.
- 3
- Specifies a weight for a preferred rule. The node with the highest weight is preferred.
- 4
- Description of the pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.
- 5
- The operator represents the relationship between the label on the existing pod and the set of values in the
matchExpression
parameters in the specification for the new pod. Can beIn
,NotIn
,Exists
, orDoesNotExist
.
If labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod continues to run on the node.
3.3.2. Configuring a pod affinity rule Copy linkLink copied to clipboard!
The following steps demonstrate a simple two-pod configuration that creates pod with a label and a pod that uses affinity to allow scheduling with that pod.
Procedure
Create a pod with a specific label in the
Pod
spec:Copy to Clipboard Copied! Toggle word wrap Toggle overflow When creating other pods, edit the
Pod
spec as follows:-
Use the
podAffinity
stanza to configure therequiredDuringSchedulingIgnoredDuringExecution
parameter orpreferredDuringSchedulingIgnoredDuringExecution
parameter: Specify the key and value that must be met. If you want the new pod to be scheduled with the other pod, use the same
key
andvalue
parameters as the label on the first pod.Copy to Clipboard Copied! Toggle word wrap Toggle overflow -
Specify an
operator
. The operator can beIn
,NotIn
,Exists
, orDoesNotExist
. For example, use the operatorIn
to require the label to be in the node. -
Specify a
topologyKey
, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain.
-
Use the
Create the pod.
oc create -f <pod-spec>.yaml
$ oc create -f <pod-spec>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.3.3. Configuring a pod anti-affinity rule Copy linkLink copied to clipboard!
The following steps demonstrate a simple two-pod configuration that creates pod with a label and a pod that uses an anti-affinity preferred rule to attempt to prevent scheduling with that pod.
Procedure
Create a pod with a specific label in the
Pod
spec:Copy to Clipboard Copied! Toggle word wrap Toggle overflow -
When creating other pods, edit the
Pod
spec to set the following parameters: Use the
podAntiAffinity
stanza to configure therequiredDuringSchedulingIgnoredDuringExecution
parameter orpreferredDuringSchedulingIgnoredDuringExecution
parameter:- Specify a weight for the node, 1-100. The node that with highest weight is preferred.
Specify the key and values that must be met. If you want the new pod to not be scheduled with the other pod, use the same
key
andvalue
parameters as the label on the first pod.Copy to Clipboard Copied! Toggle word wrap Toggle overflow - For a preferred rule, specify a weight, 1-100.
-
Specify an
operator
. The operator can beIn
,NotIn
,Exists
, orDoesNotExist
. For example, use the operatorIn
to require the label to be in the node.
-
Specify a
topologyKey
, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain. Create the pod.
oc create -f <pod-spec>.yaml
$ oc create -f <pod-spec>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.3.4. Sample pod affinity and anti-affinity rules Copy linkLink copied to clipboard!
The following examples demonstrate pod affinity and pod anti-affinity.
3.3.4.1. Pod Affinity Copy linkLink copied to clipboard!
The following example demonstrates pod affinity for pods with matching labels and label selectors.
The pod team4 has the label
team:4
.Copy to Clipboard Copied! Toggle word wrap Toggle overflow The pod team4a has the label selector
team:4
underpodAffinity
.Copy to Clipboard Copied! Toggle word wrap Toggle overflow - The team4a pod is scheduled on the same node as the team4 pod.
3.3.4.2. Pod Anti-affinity Copy linkLink copied to clipboard!
The following example demonstrates pod anti-affinity for pods with matching labels and label selectors.
The pod pod-s1 has the label
security:s1
.Copy to Clipboard Copied! Toggle word wrap Toggle overflow The pod pod-s2 has the label selector
security:s1
underpodAntiAffinity
.Copy to Clipboard Copied! Toggle word wrap Toggle overflow -
The pod pod-s2 cannot be scheduled on the same node as
pod-s1
.
3.3.4.3. Pod Affinity with no Matching Labels Copy linkLink copied to clipboard!
The following example demonstrates pod affinity for pods without matching labels and label selectors.
The pod pod-s1 has the label
security:s1
.Copy to Clipboard Copied! Toggle word wrap Toggle overflow The pod pod-s2 has the label selector
security:s2
.Copy to Clipboard Copied! Toggle word wrap Toggle overflow The pod pod-s2 is not scheduled unless there is a node with a pod that has the
security:s2
label. If there is no other pod with that label, the new pod remains in a pending state:Example output
NAME READY STATUS RESTARTS AGE IP NODE pod-s2 0/1 Pending 0 32s <none>
NAME READY STATUS RESTARTS AGE IP NODE pod-s2 0/1 Pending 0 32s <none>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.4. Controlling pod placement on nodes using node affinity rules Copy linkLink copied to clipboard!
Affinity is a property of pods that controls the nodes on which they prefer to be scheduled.
In OpenShift Container Platform node affinity is a set of rules used by the scheduler to determine where a pod can be placed. The rules are defined using custom labels on the nodes and label selectors specified in pods.
3.4.1. Understanding node affinity Copy linkLink copied to clipboard!
Node affinity allows a pod to specify an affinity towards a group of nodes it can be placed on. The node does not have control over the placement.
For example, you could configure a pod to only run on a node with a specific CPU or in a specific availability zone.
There are two types of node affinity rules: required and preferred.
Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.
If labels on a node change at runtime that results in an node affinity rule on a pod no longer being met, the pod continues to run on the node.
You configure node affinity through the Pod
spec file. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, then attempts to meet the preferred rule.
The following example is a Pod
spec with a rule that requires the pod be placed on a node with a label whose key is e2e-az-NorthSouth
and whose value is either e2e-az-North
or e2e-az-South
:
Example pod configuration file with a node affinity required rule
- 1
- The stanza to configure node affinity.
- 2
- Defines a required rule.
- 3 5 6
- The key/value pair (label) that must be matched to apply the rule.
- 4
- The operator represents the relationship between the label on the node and the set of values in the
matchExpression
parameters in thePod
spec. This value can beIn
,NotIn
,Exists
, orDoesNotExist
,Lt
, orGt
.
The following example is a node specification with a preferred rule that a node with a label whose key is e2e-az-EastWest
and whose value is either e2e-az-East
or e2e-az-West
is preferred for the pod:
Example pod configuration file with a node affinity preferred rule
- 1
- The stanza to configure node affinity.
- 2
- Defines a preferred rule.
- 3
- Specifies a weight for a preferred rule. The node with highest weight is preferred.
- 4 6 7
- The key/value pair (label) that must be matched to apply the rule.
- 5
- The operator represents the relationship between the label on the node and the set of values in the
matchExpression
parameters in thePod
spec. This value can beIn
,NotIn
,Exists
, orDoesNotExist
,Lt
, orGt
.
There is no explicit node anti-affinity concept, but using the NotIn
or DoesNotExist
operator replicates that behavior.
If you are using node affinity and node selectors in the same pod configuration, note the following:
-
If you configure both
nodeSelector
andnodeAffinity
, both conditions must be satisfied for the pod to be scheduled onto a candidate node. -
If you specify multiple
nodeSelectorTerms
associated withnodeAffinity
types, then the pod can be scheduled onto a node if one of thenodeSelectorTerms
is satisfied. -
If you specify multiple
matchExpressions
associated withnodeSelectorTerms
, then the pod can be scheduled onto a node only if allmatchExpressions
are satisfied.
3.4.2. Configuring a required node affinity rule Copy linkLink copied to clipboard!
Required rules must be met before a pod can be scheduled on a node.
Procedure
The following steps demonstrate a simple configuration that creates a node and a pod that the scheduler is required to place on the node.
Add a label to a node using the
oc label node
command:oc label node node1 e2e-az-name=e2e-az1
$ oc label node node1 e2e-az-name=e2e-az1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow In the
Pod
spec, use thenodeAffinity
stanza to configure therequiredDuringSchedulingIgnoredDuringExecution
parameter:-
Specify the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same
key
andvalue
parameters as the label in the node. Specify an
operator
. The operator can beIn
,NotIn
,Exists
,DoesNotExist
,Lt
, orGt
. For example, use the operatorIn
to require the label to be in the node:Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
-
Specify the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same
Create the pod:
oc create -f e2e-az2.yaml
$ oc create -f e2e-az2.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.4.3. Configuring a preferred node affinity rule Copy linkLink copied to clipboard!
Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.
Procedure
The following steps demonstrate a simple configuration that creates a node and a pod that the scheduler tries to place on the node.
Add a label to a node using the
oc label node
command:oc label node node1 e2e-az-name=e2e-az3
$ oc label node node1 e2e-az-name=e2e-az3
Copy to Clipboard Copied! Toggle word wrap Toggle overflow In the
Pod
spec, use thenodeAffinity
stanza to configure thepreferredDuringSchedulingIgnoredDuringExecution
parameter:- Specify a weight for the node, as a number 1-100. The node with highest weight is preferred.
Specify the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same
key
andvalue
parameters as the label in the node:Copy to Clipboard Copied! Toggle word wrap Toggle overflow -
Specify an
operator
. The operator can beIn
,NotIn
,Exists
,DoesNotExist
,Lt
, orGt
. For example, use the OperatorIn
to require the label to be in the node.
Create the pod.
oc create -f e2e-az3.yaml
$ oc create -f e2e-az3.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.4.4. Sample node affinity rules Copy linkLink copied to clipboard!
The following examples demonstrate node affinity.
3.4.4.1. Node affinity with matching labels Copy linkLink copied to clipboard!
The following example demonstrates node affinity for a node and pod with matching labels:
The Node1 node has the label
zone:us
:oc label node node1 zone=us
$ oc label node node1 zone=us
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The pod-s1 pod has the
zone
andus
key/value pair under a required node affinity rule:cat pod-s1.yaml
$ cat pod-s1.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The pod-s1 pod can be scheduled on Node1:
oc get pod -o wide
$ oc get pod -o wide
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME READY STATUS RESTARTS AGE IP NODE pod-s1 1/1 Running 0 4m IP1 node1
NAME READY STATUS RESTARTS AGE IP NODE pod-s1 1/1 Running 0 4m IP1 node1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.4.4.2. Node affinity with no matching labels Copy linkLink copied to clipboard!
The following example demonstrates node affinity for a node and pod without matching labels:
The Node1 node has the label
zone:emea
:oc label node node1 zone=emea
$ oc label node node1 zone=emea
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The pod-s1 pod has the
zone
andus
key/value pair under a required node affinity rule:cat pod-s1.yaml
$ cat pod-s1.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The pod-s1 pod cannot be scheduled on Node1:
oc describe pod pod-s1
$ oc describe pod pod-s1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.5. Placing pods onto overcommited nodes Copy linkLink copied to clipboard!
In an overcommited state, the sum of the container compute resource requests and limits exceeds the resources available on the system. Overcommitment might be desirable in development environments where a trade-off of guaranteed performance for capacity is acceptable.
Requests and limits enable administrators to allow and manage the overcommitment of resources on a node. The scheduler uses requests for scheduling your container and providing a minimum service guarantee. Limits constrain the amount of compute resource that may be consumed on your node.
3.5.1. Understanding overcommitment Copy linkLink copied to clipboard!
Requests and limits enable administrators to allow and manage the overcommitment of resources on a node. The scheduler uses requests for scheduling your container and providing a minimum service guarantee. Limits constrain the amount of compute resource that may be consumed on your node.
OpenShift Container Platform administrators can control the level of overcommit and manage container density on nodes by configuring masters to override the ratio between request and limit set on developer containers. In conjunction with a per-project LimitRange
object specifying limits and defaults, this adjusts the container limit and request to achieve the desired level of overcommit.
That these overrides have no effect if no limits have been set on containers. Create a LimitRange
object with default limits, per individual project, or in the project template, in order to ensure that the overrides apply.
After these overrides, the container limits and requests must still be validated by any LimitRange
object in the project. It is possible, for example, for developers to specify a limit close to the minimum limit, and have the request then be overridden below the minimum limit, causing the pod to be forbidden. This unfortunate user experience should be addressed with future work, but for now, configure this capability and LimitRange
objects with caution.
3.5.2. Understanding nodes overcommitment Copy linkLink copied to clipboard!
In an overcommitted environment, it is important to properly configure your node to provide best system behavior.
When the node starts, it ensures that the kernel tunable flags for memory management are set properly. The kernel should never fail memory allocations unless it runs out of physical memory.
To ensure this behavior, OpenShift Container Platform configures the kernel to always overcommit memory by setting the vm.overcommit_memory
parameter to 1
, overriding the default operating system setting.
OpenShift Container Platform also configures the kernel not to panic when it runs out of memory by setting the vm.panic_on_oom
parameter to 0
. A setting of 0 instructs the kernel to call oom_killer in an Out of Memory (OOM) condition, which kills processes based on priority
You can view the current setting by running the following commands on your nodes:
sysctl -a |grep commit
$ sysctl -a |grep commit
Example output
vm.overcommit_memory = 1
vm.overcommit_memory = 1
sysctl -a |grep panic
$ sysctl -a |grep panic
Example output
vm.panic_on_oom = 0
vm.panic_on_oom = 0
The above flags should already be set on nodes, and no further action is required.
You can also perform the following configurations for each node:
- Disable or enforce CPU limits using CPU CFS quotas
- Reserve resources for system processes
- Reserve memory across quality of service tiers
3.6. Controlling pod placement using node taints Copy linkLink copied to clipboard!
Taints and tolerations allow the node to control which pods should (or should not) be scheduled on them.
3.6.1. Understanding taints and tolerations Copy linkLink copied to clipboard!
A taint allows a node to refuse a pod to be scheduled unless that pod has a matching toleration.
You apply taints to a node through the Node
specification (NodeSpec
) and apply tolerations to a pod through the Pod
specification (PodSpec
). When you apply a taint a node, the scheduler cannot place a pod on that node unless the pod can tolerate the taint.
Example taint in a node specification
Example toleration in a Pod
spec
Taints and tolerations consist of a key, value, and effect.
Parameter | Description | ||||||
---|---|---|---|---|---|---|---|
|
The | ||||||
|
The | ||||||
| The effect is one of the following:
| ||||||
|
|
If you add a
NoSchedule
taint to a control plane node (also known as the master node) the node must have thenode-role.kubernetes.io/master=:NoSchedule
taint, which is added by default.For example:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
A toleration matches a taint:
If the
operator
parameter is set toEqual
:-
the
key
parameters are the same; -
the
value
parameters are the same; -
the
effect
parameters are the same.
-
the
If the
operator
parameter is set toExists
:-
the
key
parameters are the same; -
the
effect
parameters are the same.
-
the
The following taints are built into OpenShift Container Platform:
-
node.kubernetes.io/not-ready
: The node is not ready. This corresponds to the node conditionReady=False
. -
node.kubernetes.io/unreachable
: The node is unreachable from the node controller. This corresponds to the node conditionReady=Unknown
. -
node.kubernetes.io/memory-pressure
: The node has memory pressure issues. This corresponds to the node conditionMemoryPressure=True
. -
node.kubernetes.io/disk-pressure
: The node has disk pressure issues. This corresponds to the node conditionDiskPressure=True
. -
node.kubernetes.io/network-unavailable
: The node network is unavailable. -
node.kubernetes.io/unschedulable
: The node is unschedulable. -
node.cloudprovider.kubernetes.io/uninitialized
: When the node controller is started with an external cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint. node.kubernetes.io/pid-pressure
: The node has pid pressure. This corresponds to the node conditionPIDPressure=True
.ImportantOpenShift Container Platform does not set a default pid.available
evictionHard
.
3.6.1.1. Understanding how to use toleration seconds to delay pod evictions Copy linkLink copied to clipboard!
You can specify how long a pod can remain bound to a node before being evicted by specifying the tolerationSeconds
parameter in the Pod
specification or MachineSet
object. If a taint with the NoExecute
effect is added to a node, a pod that does tolerate the taint, which has the tolerationSeconds
parameter, the pod is not evicted until that time period expires.
Example output
Here, if this pod is running but does not have a matching toleration, the pod stays bound to the node for 3,600 seconds and then be evicted. If the taint is removed before that time, the pod is not evicted.
3.6.1.2. Understanding how to use multiple taints Copy linkLink copied to clipboard!
You can put multiple taints on the same node and multiple tolerations on the same pod. OpenShift Container Platform processes multiple taints and tolerations as follows:
- Process the taints for which the pod has a matching toleration.
The remaining unmatched taints have the indicated effects on the pod:
-
If there is at least one unmatched taint with effect
NoSchedule
, OpenShift Container Platform cannot schedule a pod onto that node. -
If there is no unmatched taint with effect
NoSchedule
but there is at least one unmatched taint with effectPreferNoSchedule
, OpenShift Container Platform tries to not schedule the pod onto the node. If there is at least one unmatched taint with effect
NoExecute
, OpenShift Container Platform evicts the pod from the node if it is already running on the node, or the pod is not scheduled onto the node if it is not yet running on the node.- Pods that do not tolerate the taint are evicted immediately.
-
Pods that tolerate the taint without specifying
tolerationSeconds
in theirPod
specification remain bound forever. -
Pods that tolerate the taint with a specified
tolerationSeconds
remain bound for the specified amount of time.
-
If there is at least one unmatched taint with effect
For example:
Add the following taints to the node:
oc adm taint nodes node1 key1=value1:NoSchedule
$ oc adm taint nodes node1 key1=value1:NoSchedule
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc adm taint nodes node1 key1=value1:NoExecute
$ oc adm taint nodes node1 key1=value1:NoExecute
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc adm taint nodes node1 key2=value2:NoSchedule
$ oc adm taint nodes node1 key2=value2:NoSchedule
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The pod has the following tolerations:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
In this case, the pod cannot be scheduled onto the node, because there is no toleration matching the third taint. The pod continues running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the pod.
3.6.1.3. Understanding pod scheduling and node conditions (taint node by condition) Copy linkLink copied to clipboard!
The Taint Nodes By Condition feature, which is enabled by default, automatically taints nodes that report conditions such as memory pressure and disk pressure. If a node reports a condition, a taint is added until the condition clears. The taints have the NoSchedule
effect, which means no pod can be scheduled on the node unless the pod has a matching toleration.
The scheduler checks for these taints on nodes before scheduling pods. If the taint is present, the pod is scheduled on a different node. Because the scheduler checks for taints and not the actual node conditions, you configure the scheduler to ignore some of these node conditions by adding appropriate pod tolerations.
To ensure backward compatibility, the daemon set controller automatically adds the following tolerations to all daemons:
- node.kubernetes.io/memory-pressure
- node.kubernetes.io/disk-pressure
- node.kubernetes.io/unschedulable (1.10 or later)
- node.kubernetes.io/network-unavailable (host network only)
You can also add arbitrary tolerations to daemon sets.
The control plane also adds the node.kubernetes.io/memory-pressure
toleration on pods that have a QoS class. This is because Kubernetes manages pods in the Guaranteed
or Burstable
QoS classes. The new BestEffort
pods do not get scheduled onto the affected node.
3.6.1.4. Understanding evicting pods by condition (taint-based evictions) Copy linkLink copied to clipboard!
The Taint-Based Evictions feature, which is enabled by default, evicts pods from a node that experiences specific conditions, such as not-ready
and unreachable
. When a node experiences one of these conditions, OpenShift Container Platform automatically adds taints to the node, and starts evicting and rescheduling the pods on different nodes.
Taint Based Evictions have a NoExecute
effect, where any pod that does not tolerate the taint is evicted immediately and any pod that does tolerate the taint will never be evicted, unless the pod uses the tolerationSeconds
parameter.
The tolerationSeconds
parameter allows you to specify how long a pod stays bound to a node that has a node condition. If the condition still exists after the tolerationSeconds
period, the taint remains on the node and the pods with a matching toleration are evicted. If the condition clears before the tolerationSeconds
period, pods with matching tolerations are not removed.
If you use the tolerationSeconds
parameter with no value, pods are never evicted because of the not ready and unreachable node conditions.
OpenShift Container Platform evicts pods in a rate-limited way to prevent massive pod evictions in scenarios such as the master becoming partitioned from the nodes.
By default, if more than 55% of nodes in a given zone are unhealthy, the node lifecycle controller changes that zone’s state to PartialDisruption
and the rate of pod evictions is reduced. For small clusters (by default, 50 nodes or less) in this state, nodes in this zone are not tainted and evictions are stopped.
For more information, see Rate limits on eviction in the Kubernetes documentation.
OpenShift Container Platform automatically adds a toleration for node.kubernetes.io/not-ready
and node.kubernetes.io/unreachable
with tolerationSeconds=300
, unless the Pod
configuration specifies either toleration.
- 1
- These tolerations ensure that the default pod behavior is to remain bound for five minutes after one of these node conditions problems is detected.
You can configure these tolerations as needed. For example, if you have an application with a lot of local state, you might want to keep the pods bound to node for a longer time in the event of network partition, allowing for the partition to recover and avoiding pod eviction.
Pods spawned by a daemon set are created with NoExecute
tolerations for the following taints with no tolerationSeconds
:
-
node.kubernetes.io/unreachable
-
node.kubernetes.io/not-ready
As a result, daemon set pods are never evicted because of these node conditions.
3.6.1.5. Tolerating all taints Copy linkLink copied to clipboard!
You can configure a pod to tolerate all taints by adding an operator: "Exists"
toleration with no key
and value
parameters. Pods with this toleration are not removed from a node that has taints.
Pod
spec for tolerating all taints
3.6.2. Adding taints and tolerations Copy linkLink copied to clipboard!
You add tolerations to pods and taints to nodes to allow the node to control which pods should or should not be scheduled on them. For existing pods and nodes, you should add the toleration to the pod first, then add the taint to the node to avoid pods being removed from the node before you can add the toleration.
Procedure
Add a toleration to a pod by editing the
Pod
spec to include atolerations
stanza:Sample pod configuration file with an Equal operator
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
Sample pod configuration file with an Exists operator
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- The
Exists
operator does not take avalue
.
This example places a taint on
node1
that has keykey1
, valuevalue1
, and taint effectNoExecute
.Add a taint to a node by using the following command with the parameters described in the Taint and toleration components table:
oc adm taint nodes <node_name> <key>=<value>:<effect>
$ oc adm taint nodes <node_name> <key>=<value>:<effect>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc adm taint nodes node1 key1=value1:NoExecute
$ oc adm taint nodes node1 key1=value1:NoExecute
Copy to Clipboard Copied! Toggle word wrap Toggle overflow This command places a taint on
node1
that has keykey1
, valuevalue1
, and effectNoExecute
.NoteIf you add a
NoSchedule
taint to a control plane node (also known as the master node) the node must have thenode-role.kubernetes.io/master=:NoSchedule
taint, which is added by default.For example:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The tolerations on the pod match the taint on the node. A pod with either toleration can be scheduled onto
node1
.
3.6.2.1. Adding taints and tolerations using a machine set Copy linkLink copied to clipboard!
You can add taints to nodes using a machine set. All nodes associated with the MachineSet
object are updated with the taint. Tolerations respond to taints added by a machine set in the same manner as taints added directly to the nodes.
Procedure
Add a toleration to a pod by editing the
Pod
spec to include atolerations
stanza:Sample pod configuration file with
Equal
operatorCopy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
Sample pod configuration file with
Exists
operatorCopy to Clipboard Copied! Toggle word wrap Toggle overflow Add the taint to the
MachineSet
object:Edit the
MachineSet
YAML for the nodes you want to taint or you can create a newMachineSet
object:oc edit machineset <machineset>
$ oc edit machineset <machineset>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Add the taint to the
spec.template.spec
section:Example taint in a machine set specification
Copy to Clipboard Copied! Toggle word wrap Toggle overflow This example places a taint that has the key
key1
, valuevalue1
, and taint effectNoExecute
on the nodes.Scale down the machine set to 0:
oc scale --replicas=0 machineset <machineset> -n openshift-machine-api
$ oc scale --replicas=0 machineset <machineset> -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Wait for the machines to be removed.
Scale up the machine set as needed:
oc scale --replicas=2 machineset <machineset> -n openshift-machine-api
$ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Wait for the machines to start. The taint is added to the nodes associated with the
MachineSet
object.
3.6.2.2. Binding a user to a node using taints and tolerations Copy linkLink copied to clipboard!
If you want to dedicate a set of nodes for exclusive use by a particular set of users, add a toleration to their pods. Then, add a corresponding taint to those nodes. The pods with the tolerations are allowed to use the tainted nodes, or any other nodes in the cluster.
If you want ensure the pods are scheduled to only those tainted nodes, also add a label to the same set of nodes and add a node affinity to the pods so that the pods can only be scheduled onto nodes with that label.
Procedure
To configure a node so that users can use only that node:
Add a corresponding taint to those nodes:
For example:
oc adm taint nodes node1 dedicated=groupName:NoSchedule
$ oc adm taint nodes node1 dedicated=groupName:NoSchedule
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - Add a toleration to the pods by writing a custom admission controller.
3.6.2.3. Controlling nodes with special hardware using taints and tolerations Copy linkLink copied to clipboard!
In a cluster where a small subset of nodes have specialized hardware, you can use taints and tolerations to keep pods that do not need the specialized hardware off of those nodes, leaving the nodes for pods that do need the specialized hardware. You can also require pods that need specialized hardware to use specific nodes.
You can achieve this by adding a toleration to pods that need the special hardware and tainting the nodes that have the specialized hardware.
Procedure
To ensure nodes with specialized hardware are reserved for specific pods:
Add a toleration to pods that need the special hardware.
For example:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Taint the nodes that have the specialized hardware using one of the following commands:
oc adm taint nodes <node-name> disktype=ssd:NoSchedule
$ oc adm taint nodes <node-name> disktype=ssd:NoSchedule
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Or:
oc adm taint nodes <node-name> disktype=ssd:PreferNoSchedule
$ oc adm taint nodes <node-name> disktype=ssd:PreferNoSchedule
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.6.3. Removing taints and tolerations Copy linkLink copied to clipboard!
You can remove taints from nodes and tolerations from pods as needed. You should add the toleration to the pod first, then add the taint to the node to avoid pods being removed from the node before you can add the toleration.
Procedure
To remove taints and tolerations:
To remove a taint from a node:
oc adm taint nodes <node-name> <key>-
$ oc adm taint nodes <node-name> <key>-
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc adm taint nodes ip-10-0-132-248.ec2.internal key1-
$ oc adm taint nodes ip-10-0-132-248.ec2.internal key1-
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
node/ip-10-0-132-248.ec2.internal untainted
node/ip-10-0-132-248.ec2.internal untainted
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To remove a toleration from a pod, edit the
Pod
spec to remove the toleration:Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.7. Placing pods on specific nodes using node selectors Copy linkLink copied to clipboard!
A node selector specifies a map of key/value pairs that are defined using custom labels on nodes and selectors specified in pods.
For the pod to be eligible to run on a node, the pod must have the same key/value node selector as the label on the node.
3.7.1. About node selectors Copy linkLink copied to clipboard!
You can use node selectors on pods and labels on nodes to control where the pod is scheduled. With node selectors, OpenShift Container Platform schedules the pods on nodes that contain matching labels.
You can use a node selector to place specific pods on specific nodes, cluster-wide node selectors to place new pods on specific nodes anywhere in the cluster, and project node selectors to place new pods in a project on specific nodes.
For example, as a cluster administrator, you can create an infrastructure where application developers can deploy pods only onto the nodes closest to their geographical location by including a node selector in every pod they create. In this example, the cluster consists of five data centers spread across two regions. In the U.S., label the nodes as us-east
, us-central
, or us-west
. In the Asia-Pacific region (APAC), label the nodes as apac-east
or apac-west
. The developers can add a node selector to the pods they create to ensure the pods get scheduled on those nodes.
A pod is not scheduled if the Pod
object contains a node selector, but no node has a matching label.
If you are using node selectors and node affinity in the same pod configuration, the following rules control pod placement onto nodes:
-
If you configure both
nodeSelector
andnodeAffinity
, both conditions must be satisfied for the pod to be scheduled onto a candidate node. -
If you specify multiple
nodeSelectorTerms
associated withnodeAffinity
types, then the pod can be scheduled onto a node if one of thenodeSelectorTerms
is satisfied. -
If you specify multiple
matchExpressions
associated withnodeSelectorTerms
, then the pod can be scheduled onto a node only if allmatchExpressions
are satisfied.
- Node selectors on specific pods and nodes
You can control which node a specific pod is scheduled on by using node selectors and labels.
To use node selectors and labels, first label the node to avoid pods being descheduled, then add the node selector to the pod.
NoteYou cannot add a node selector directly to an existing scheduled pod. You must label the object that controls the pod, such as deployment config.
For example, the following
Node
object has theregion: east
label:Sample
Node
object with a labelCopy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Label to match the pod node selector.
A pod has the
type: user-node,region: east
node selector:Sample
Pod
object with node selectorsCopy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Node selectors to match the node label.
When you create the pod using the example pod spec, it can be scheduled on the example node.
- Default cluster-wide node selectors
With default cluster-wide node selectors, when you create a pod in that cluster, OpenShift Container Platform adds the default node selectors to the pod and schedules the pod on nodes with matching labels.
For example, the following
Scheduler
object has the default cluster-wideregion=east
andtype=user-node
node selectors:Example Scheduler Operator Custom Resource
Copy to Clipboard Copied! Toggle word wrap Toggle overflow A node in that cluster has the
type=user-node,region=east
labels:Example
Node
objectCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
Pod
object with a node selectorCopy to Clipboard Copied! Toggle word wrap Toggle overflow When you create the pod using the example pod spec in the example cluster, the pod is created with the cluster-wide node selector and is scheduled on the labeled node:
Example pod list with the pod on the labeled node
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES pod-s1 1/1 Running 0 20s 10.131.2.6 ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4 <none> <none>
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES pod-s1 1/1 Running 0 20s 10.131.2.6 ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4 <none> <none>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow NoteIf the project where you create the pod has a project node selector, that selector takes preference over a cluster-wide node selector. Your pod is not created or scheduled if the pod does not have the project node selector.
- Project node selectors
With project node selectors, when you create a pod in this project, OpenShift Container Platform adds the node selectors to the pod and schedules the pods on a node with matching labels. If there is a cluster-wide default node selector, a project node selector takes preference.
For example, the following project has the
region=east
node selector:Example
Namespace
objectCopy to Clipboard Copied! Toggle word wrap Toggle overflow The following node has the
type=user-node,region=east
labels:Example
Node
objectCopy to Clipboard Copied! Toggle word wrap Toggle overflow When you create the pod using the example pod spec in this example project, the pod is created with the project node selectors and is scheduled on the labeled node:
Example
Pod
objectCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example pod list with the pod on the labeled node
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES pod-s1 1/1 Running 0 20s 10.131.2.6 ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4 <none> <none>
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES pod-s1 1/1 Running 0 20s 10.131.2.6 ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4 <none> <none>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow A pod in the project is not created or scheduled if the pod contains different node selectors. For example, if you deploy the following pod into the example project, it is not be created:
Example
Pod
object with an invalid node selectorCopy to Clipboard Copied! Toggle word wrap Toggle overflow
3.7.2. Using node selectors to control pod placement Copy linkLink copied to clipboard!
You can use node selectors on pods and labels on nodes to control where the pod is scheduled. With node selectors, OpenShift Container Platform schedules the pods on nodes that contain matching labels.
You add labels to a node, a machine set, or a machine config. Adding the label to the machine set ensures that if the node or machine goes down, new nodes have the label. Labels added to a node or machine config do not persist if the node or machine goes down.
To add node selectors to an existing pod, add a node selector to the controlling object for that pod, such as a ReplicaSet
object, DaemonSet
object, StatefulSet
object, Deployment
object, or DeploymentConfig
object. Any existing pods under that controlling object are recreated on a node with a matching label. If you are creating a new pod, you can add the node selector directly to the Pod
spec.
You cannot add a node selector directly to an existing scheduled pod.
Prerequisites
To add a node selector to existing pods, determine the controlling object for that pod. For example, the router-default-66d5cf9464-m2g75
pod is controlled by the router-default-66d5cf9464
replica set:
The web console lists the controlling object under ownerReferences
in the pod YAML:
Procedure
Add labels to a node by using a machine set or editing the node directly:
Use a
MachineSet
object to add labels to nodes managed by the machine set when a node is created:Run the following command to add labels to a
MachineSet
object:oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>"="<value>","<key>"="<value>"}}]' -n openshift-machine-api
$ oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>"="<value>","<key>"="<value>"}}]' -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc patch MachineSet abc612-msrtw-worker-us-east-1c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]' -n openshift-machine-api
$ oc patch MachineSet abc612-msrtw-worker-us-east-1c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]' -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the labels are added to the
MachineSet
object by using theoc edit
command:For example:
oc edit MachineSet abc612-msrtw-worker-us-east-1c -n openshift-machine-api
$ oc edit MachineSet abc612-msrtw-worker-us-east-1c -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example
MachineSet
objectCopy to Clipboard Copied! Toggle word wrap Toggle overflow
Add labels directly to a node:
Edit the
Node
object for the node:oc label nodes <name> <key>=<value>
$ oc label nodes <name> <key>=<value>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example, to label a node:
oc label nodes ip-10-0-142-25.ec2.internal type=user-node region=east
$ oc label nodes ip-10-0-142-25.ec2.internal type=user-node region=east
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the labels are added to the node:
oc get nodes -l type=user-node,region=east
$ oc get nodes -l type=user-node,region=east
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME STATUS ROLES AGE VERSION ip-10-0-142-25.ec2.internal Ready worker 17m v1.18.3+002a51f
NAME STATUS ROLES AGE VERSION ip-10-0-142-25.ec2.internal Ready worker 17m v1.18.3+002a51f
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Add the matching node selector to a pod:
To add a node selector to existing and future pods, add a node selector to the controlling object for the pods:
Example
ReplicaSet
object with labelsCopy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Add the node selector.
To add a node selector to a specific, new pod, add the selector to the
Pod
object directly:Example
Pod
object with a node selectorCopy to Clipboard Copied! Toggle word wrap Toggle overflow NoteYou cannot add a node selector directly to an existing scheduled pod.
3.7.3. Creating default cluster-wide node selectors Copy linkLink copied to clipboard!
You can use default cluster-wide node selectors on pods together with labels on nodes to constrain all pods created in a cluster to specific nodes.
With cluster-wide node selectors, when you create a pod in that cluster, OpenShift Container Platform adds the default node selectors to the pod and schedules the pod on nodes with matching labels.
You configure cluster-wide node selectors by editing the Scheduler Operator custom resource (CR). You add labels to a node, a machine set, or a machine config. Adding the label to the machine set ensures that if the node or machine goes down, new nodes have the label. Labels added to a node or machine config do not persist if the node or machine goes down.
You can add additional key/value pairs to a pod. But you cannot add a different value for a default key.
Procedure
To add a default cluster-wide node selector:
Edit the Scheduler Operator CR to add the default cluster-wide node selectors:
oc edit scheduler cluster
$ oc edit scheduler cluster
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example Scheduler Operator CR with a node selector
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Add a node selector with the appropriate
<key>:<value>
pairs.
After making this change, wait for the pods in the
openshift-kube-apiserver
project to redeploy. This can take several minutes. The default cluster-wide node selector does not take effect until the pods redeploy.Add labels to a node by using a machine set or editing the node directly:
Use a machine set to add labels to nodes managed by the machine set when a node is created:
Run the following command to add labels to a
MachineSet
object:oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>"="<value>","<key>"="<value>"}}]' -n openshift-machine-api
$ oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>"="<value>","<key>"="<value>"}}]' -n openshift-machine-api
1 Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Add a
<key>/<value>
pair for each label.
For example:
oc patch MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]' -n openshift-machine-api
$ oc patch MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]' -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the labels are added to the
MachineSet
object by using theoc edit
command:For example:
oc edit MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
$ oc edit MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Redeploy the nodes associated with that machine set by scaling down to
0
and scaling up the nodes:For example:
oc scale --replicas=0 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
$ oc scale --replicas=0 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc scale --replicas=1 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
$ oc scale --replicas=1 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow When the nodes are ready and available, verify that the label is added to the nodes by using the
oc get
command:oc get nodes -l <key>=<value>
$ oc get nodes -l <key>=<value>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc get nodes -l type=user-node
$ oc get nodes -l type=user-node
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME STATUS ROLES AGE VERSION ci-ln-l8nry52-f76d1-hl7m7-worker-c-vmqzp Ready worker 61s v1.18.3+002a51f
NAME STATUS ROLES AGE VERSION ci-ln-l8nry52-f76d1-hl7m7-worker-c-vmqzp Ready worker 61s v1.18.3+002a51f
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Add labels directly to a node:
Edit the
Node
object for the node:oc label nodes <name> <key>=<value>
$ oc label nodes <name> <key>=<value>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example, to label a node:
oc label nodes ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49 type=user-node region=east
$ oc label nodes ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49 type=user-node region=east
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the labels are added to the node using the
oc get
command:oc get nodes -l <key>=<value>,<key>=<value>
$ oc get nodes -l <key>=<value>,<key>=<value>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc get nodes -l type=user-node,region=east
$ oc get nodes -l type=user-node,region=east
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME STATUS ROLES AGE VERSION ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49 Ready worker 17m v1.18.3+002a51f
NAME STATUS ROLES AGE VERSION ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49 Ready worker 17m v1.18.3+002a51f
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.7.4. Creating project-wide node selectors Copy linkLink copied to clipboard!
You can use node selectors in a project together with labels on nodes to constrain all pods created in that project to the labeled nodes.
When you create a pod in this project, OpenShift Container Platform adds the node selectors to the pods in the project and schedules the pods on a node with matching labels in the project. If there is a cluster-wide default node selector, a project node selector takes preference.
You add node selectors to a project by editing the Namespace
object to add the openshift.io/node-selector
parameter. You add labels to a node, a machine set, or a machine config. Adding the label to the machine set ensures that if the node or machine goes down, new nodes have the label. Labels added to a node or machine config do not persist if the node or machine goes down.
A pod is not scheduled if the Pod
object contains a node selector, but no project has a matching node selector. When you create a pod from that spec, you receive an error similar to the following message:
Example error message
Error from server (Forbidden): error when creating "pod.yaml": pods "pod-4" is forbidden: pod node label selector conflicts with its project node label selector
Error from server (Forbidden): error when creating "pod.yaml": pods "pod-4" is forbidden: pod node label selector conflicts with its project node label selector
You can add additional key/value pairs to a pod. But you cannot add a different value for a project key.
Procedure
To add a default project node selector:
Create a namespace or edit an existing namespace to add the
openshift.io/node-selector
parameter:oc edit namespace <name>
$ oc edit namespace <name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Add the
openshift.io/node-selector
with the appropriate<key>:<value>
pairs.
Add labels to a node by using a machine set or editing the node directly:
Use a
MachineSet
object to add labels to nodes managed by the machine set when a node is created:Run the following command to add labels to a
MachineSet
object:oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>"="<value>","<key>"="<value>"}}]' -n openshift-machine-api
$ oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>"="<value>","<key>"="<value>"}}]' -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc patch MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]' -n openshift-machine-api
$ oc patch MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]' -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the labels are added to the
MachineSet
object by using theoc edit
command:For example:
oc edit MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
$ oc edit MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Redeploy the nodes associated with that machine set:
For example:
oc scale --replicas=0 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
$ oc scale --replicas=0 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc scale --replicas=1 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
$ oc scale --replicas=1 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow When the nodes are ready and available, verify that the label is added to the nodes by using the
oc get
command:oc get nodes -l <key>=<value>
$ oc get nodes -l <key>=<value>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc get nodes -l type=user-node,region=east
$ oc get nodes -l type=user-node,region=east
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME STATUS ROLES AGE VERSION ci-ln-l8nry52-f76d1-hl7m7-worker-c-vmqzp Ready worker 61s v1.18.3+002a51f
NAME STATUS ROLES AGE VERSION ci-ln-l8nry52-f76d1-hl7m7-worker-c-vmqzp Ready worker 61s v1.18.3+002a51f
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Add labels directly to a node:
Edit the
Node
object to add labels:oc label <resource> <name> <key>=<value>
$ oc label <resource> <name> <key>=<value>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example, to label a node:
oc label nodes ci-ln-l8nry52-f76d1-hl7m7-worker-c-tgq49 type=user-node region=east
$ oc label nodes ci-ln-l8nry52-f76d1-hl7m7-worker-c-tgq49 type=user-node region=east
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the labels are added to the
Node
object using theoc get
command:oc get nodes -l <key>=<value>
$ oc get nodes -l <key>=<value>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc get nodes -l type=user-node,region=east
$ oc get nodes -l type=user-node,region=east
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME STATUS ROLES AGE VERSION ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49 Ready worker 17m v1.18.3+002a51f
NAME STATUS ROLES AGE VERSION ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49 Ready worker 17m v1.18.3+002a51f
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.8. Controlling pod placement by using pod topology spread constraints Copy linkLink copied to clipboard!
You can use pod topology spread constraints to control the placement of your pods across nodes, zones, regions, or other user-defined topology domains.
3.8.1. About pod topology spread constraints Copy linkLink copied to clipboard!
By using a pod topology spread constraint, you provide fine-grained control over the distribution of pods across failure domains to help achieve high availability and more efficient resource utilization.
OpenShift Container Platform administrators can label nodes to provide topology information, such as regions, zones, nodes, or other user-defined domains. After these labels are set on nodes, users can then define pod topology spread constraints to control the placement of pods across these topology domains.
You specify which pods to group together, which topology domains they are spread among, and the acceptable skew. Only pods within the same namespace are matched and grouped together when spreading due to a constraint.
3.8.2. Configuring pod topology spread constraints Copy linkLink copied to clipboard!
The following steps demonstrate how to configure pod topology spread constraints to distribute pods that match the specified labels based on their zone.
You can specify multiple pod topology spread constraints, but you must ensure that they do not conflict with each other. All pod topology spread constraints must be satisfied for a pod to be placed.
Prerequisites
- A cluster administrator has added the required labels to nodes.
Procedure
Create a
Pod
spec and specify a pod topology spread constraint:Example
pod-spec.yaml
fileCopy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- The maximum difference in number of pods between any two topology domains. The default is
1
, and you cannot specify a value of0
. - 2
- The key of a node label. Nodes with this key and identical value are considered to be in the same topology.
- 3
- How to handle a pod if it does not satisfy the spread constraint. The default is
DoNotSchedule
, which tells the scheduler not to schedule the pod. Set toScheduleAnyway
to still schedule the pod, but the scheduler prioritizes honoring the skew to not make the cluster more imbalanced. - 4
- Pods that match this label selector are counted and recognized as a group when spreading to satisfy the constraint. Be sure to specify a label selector, otherwise no pods can be matched.
- 5
- Be sure that this
Pod
spec also sets its labels to match this label selector if you want it to be counted properly in the future.
Create the pod:
oc create -f pod-spec.yaml
$ oc create -f pod-spec.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.8.3. Example pod topology spread constraints Copy linkLink copied to clipboard!
The following examples demonstrate pod topology spread constraint configurations.
3.8.3.1. Single pod topology spread constraint example Copy linkLink copied to clipboard!
This example Pod
spec defines one pod topology spread constraint. It matches on pods labeled foo:bar
, distributes among zones, specifies a skew of 1
, and does not schedule the pod if it does not meet these requirements.
3.8.3.2. Multiple pod topology spread constraints example Copy linkLink copied to clipboard!
This example Pod
spec defines two pod topology spread constraints. Both match on pods labeled foo:bar
, specify a skew of 1
, and do not schedule the pod if it does not meet these requirements.
The first constraint distributes pods based on a user-defined label node
, and the second constraint distributes pods based on a user-defined label rack
. Both constraints must be met for the pod to be scheduled.
3.9. Running a custom scheduler Copy linkLink copied to clipboard!
You can run multiple custom schedulers alongside the default scheduler and configure which scheduler to use for each pod.
It is supported to use a custom scheduler with OpenShift Container Platform, but Red Hat does not directly support the functionality of the custom scheduler.
For information on how to configure the default scheduler, see Configuring the default scheduler to control pod placement.
To schedule a given pod using a specific scheduler, specify the name of the scheduler in that Pod
specification.
3.9.1. Deploying a custom scheduler Copy linkLink copied to clipboard!
To include a custom scheduler in your cluster, include the image for a custom scheduler in a deployment.
Prerequisites
-
You have access to the cluster as a user with the
cluster-admin
role. You have a scheduler binary.
NoteInformation on how to create a scheduler binary is outside the scope of this document. For an example, see Configure Multiple Schedulers in the Kubernetes documentation. Note that the actual functionality of your custom scheduler is not supported by Red Hat.
- You have created an image containing the scheduler binary and pushed it to a registry.
Procedure
Create a file that contains the deployment resources for the custom scheduler:
Example
custom-scheduler.yaml
fileCopy to Clipboard Copied! Toggle word wrap Toggle overflow - 1 2 3 4
- This procedure uses the
kube-system
namespace, but you can use the namespace of your choosing. - 5
- The command for your custom scheduler might require different arguments. For example, you can pass configuration as a mounted volume using the
--config
argument. - 6
- Specify the container image that you created for the custom scheduler.
Create the deployment resources in the cluster:
oc create -f custom-scheduler.yaml
$ oc create -f custom-scheduler.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Verification
Verify that the scheduler pod is running:
oc get pods -n kube-system
$ oc get pods -n kube-system
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The custom scheduler pod is listed as
Running
:NAME READY STATUS RESTARTS AGE custom-scheduler-6cd7c4b8bc-854zb 1/1 Running 0 2m
NAME READY STATUS RESTARTS AGE custom-scheduler-6cd7c4b8bc-854zb 1/1 Running 0 2m
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.9.2. Deploying pods using a custom scheduler Copy linkLink copied to clipboard!
After the custom scheduler is deployed in your cluster, you can configure pods to use that scheduler instead of the default scheduler.
Each scheduler has a separate view of resources in a cluster. For that reason, each scheduler should operate over its own set of nodes.
If two or more schedulers operate on the same node, they might intervene with each other and schedule more pods on the same node than there are available resources for. Pods might get rejected due to insufficient resources in this case.
Prerequisites
-
You have access to the cluster as a user with the
cluster-admin
role. - The custom scheduler has been deployed in the cluster.
Procedure
If your cluster uses role-based access control (RBAC), add the custom scheduler name to the
system:kube-scheduler
cluster role.Edit the
system:kube-scheduler
cluster role:oc edit clusterrole system:kube-scheduler
$ oc edit clusterrole system:kube-scheduler
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Add the name of the custom scheduler to the
resourceNames
lists for theleases
andendpoints
resources:Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Create a
Pod
configuration and specify the name of the custom scheduler in theschedulerName
parameter:Example
custom-scheduler-example.yaml
fileCopy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- The name of the custom scheduler to use, which is
custom-scheduler
in this example. When no scheduler name is supplied, the pod is automatically scheduled using the default scheduler.
Create the pod:
oc create -f custom-scheduler-example.yaml
$ oc create -f custom-scheduler-example.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Verification
Enter the following command to check that the pod was created:
oc get pod custom-scheduler-example
$ oc get pod custom-scheduler-example
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The
custom-scheduler-example
pod is listed in the output:NAME READY STATUS RESTARTS AGE custom-scheduler-example 1/1 Running 0 4m
NAME READY STATUS RESTARTS AGE custom-scheduler-example 1/1 Running 0 4m
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Enter the following command to check that the custom scheduler has scheduled the pod:
oc describe pod custom-scheduler-example
$ oc describe pod custom-scheduler-example
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The scheduler,
custom-scheduler
, is listed as shown in the following truncated output:Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled <unknown> custom-scheduler Successfully assigned default/custom-scheduler-example to <node_name>
Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled <unknown> custom-scheduler Successfully assigned default/custom-scheduler-example to <node_name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.10. Evicting pods using the descheduler Copy linkLink copied to clipboard!
While the scheduler is used to determine the most suitable node to host a new pod, the descheduler can be used to evict a running pod so that the pod can be rescheduled onto a more suitable node.
The descheduler is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see https://access.redhat.com/support/offerings/techpreview/.
3.10.1. About the descheduler Copy linkLink copied to clipboard!
You can use the descheduler to evict pods based on specific strategies so that the pods can be rescheduled onto more appropriate nodes.
You can benefit from descheduling running pods in situations such as the following:
- Nodes are underutilized or overutilized.
- Pod and node affinity requirements, such as taints or labels, have changed and the original scheduling decisions are no longer appropriate for certain nodes.
- Node failure requires pods to be moved.
- New nodes are added to clusters.
- Pods have been restarted too many times.
The descheduler does not schedule replacement of evicted pods. The scheduler automatically performs this task for the evicted pods.
When the descheduler decides to evict pods from a node, it employs the following general mechanism:
-
Critical pods with
priorityClassName
set tosystem-cluster-critical
orsystem-node-critical
are never evicted. - Static, mirrored, or stand-alone pods that are not part of a replication controller, replica set, deployment, or job are never evicted because these pods will not be recreated.
- Pods associated with daemon sets are never evicted.
- Pods with local storage are never evicted.
- Best effort pods are evicted before burstable and guaranteed pods.
-
All types of pods with the
descheduler.alpha.kubernetes.io/evict
annotation are evicted. This annotation is used to override checks that prevent eviction, and the user can select which pod is evicted. Users should know how and if the pod will be recreated. - Pods subject to pod disruption budget (PDB) are not evicted if descheduling violates its pod disruption budget (PDB). The pods are evicted by using eviction subresource to handle PDB.
3.10.2. Descheduler strategies Copy linkLink copied to clipboard!
The following descheduler strategies are available:
- Low node utilization
The
LowNodeUtilization
strategy finds nodes that are underutilized and evicts pods, if possible, from other nodes in the hope that recreation of evicted pods will be scheduled on these underutilized nodes.The underutilization of nodes is determined by several configurable threshold parameters: CPU, memory, and number of pods. If a node’s usage is below the configured thresholds for all parameters (CPU, memory, and number of pods), then the node is considered to be underutilized.
You can also set a target threshold for CPU, memory, and number of pods. If a node’s usage is above the configured target thresholds for any of the parameters, then the node’s pods might be considered for eviction.
Additionally, you can use the
NumberOfNodes
parameter to set the strategy to activate only when the number of underutilized nodes is above the configured value. This can be helpful in large clusters where a few nodes might be underutilized frequently or for a short period of time.- Duplicate pods
The
RemoveDuplicates
strategy ensures that there is only one pod associated with a replica set, replication controller, deployment, or job running on same node. If there are more, then those duplicate pods are evicted for better spreading of pods in a cluster.This situation could occur after a node failure, when a pod is moved to another node, leading to more than one pod associated with a replica set, replication controller, deployment, or job on that node. After the failed node is ready again, this strategy evicts the duplicate pod.
This strategy has an optional parameter,
ExcludeOwnerKinds
, that allows you to specify a list ofKind
types. If a pod has any of these types listed as anOwnerRef
, that pod is not considered for eviction.- Violation of inter-pod anti-affinity
The
RemovePodsViolatingInterPodAntiAffinity
strategy ensures that pods violating inter-pod anti-affinity are removed from nodes.This situation could occur when anti-affinity rules are created for pods that are already running on the same node.
- Violation of node affinity
The
RemovePodsViolatingNodeAffinity
strategy ensures that pods violating node affinity are removed from nodes.This situation could occur if a node no longer satisfies a pod’s affinity rule. If another node is available that satisfies the affinity rule, then the pod is evicted.
- Violation of node taints
The
RemovePodsViolatingNodeTaints
strategy ensures that pods violatingNoSchedule
taints on nodes are removed.This situation could occur if a pod is set to tolerate a taint
key=value:NoSchedule
and is running on a tainted node. If the node’s taint is updated or removed, the taint is no longer satisfied by the pod’s tolerations and the pod is evicted.- Too many restarts
The
RemovePodsHavingTooManyRestarts
strategy ensures that pods that have been restarted too many times are removed from nodes.This situation could occur if a pod is scheduled on a node that is unable to start it. For example, if the node is having network issues and is unable to mount a networked persistent volume, then the pod should be evicted so that it can be scheduled on another node. Another example is if the pod is crashlooping.
This strategy has two configurable parameters:
PodRestartThreshold
andIncludingInitContainers
. If a pod is restarted more than the configuredPodRestartThreshold
value, then the pod is evicted. You can use theIncludingInitContainers
parameter to specify whether restarts for Init Containers should be calculated into thePodRestartThreshold
value.- Pod life time
The
PodLifeTime
strategy evicts pods that are too old.After a pod reaches the age, in seconds, set by the
MaxPodLifeTimeSeconds
parameter, it is evicted.
3.10.3. Installing the descheduler Copy linkLink copied to clipboard!
The descheduler is not available by default. To enable the descheduler, you must install the Kube Descheduler Operator from OperatorHub. After the Kube Descheduler Operator is installed, you can then configure the eviction strategies.
Prerequisites
- Cluster administrator privileges.
- Access to the OpenShift Container Platform web console.
Procedure
- Log in to the OpenShift Container Platform web console.
Create the required namespace for the Kube Descheduler Operator.
- Navigate to Administration → Namespaces and click Create Namespace.
-
Enter
openshift-kube-descheduler-operator
in the Name field and click Create.
Install the Kube Descheduler Operator.
- Navigate to Operators → OperatorHub.
- Type Kube Descheduler Operator into the filter box.
- Select the Kube Descheduler Operator and click Install.
- On the Install Operator page, select A specific namespace on the cluster. Select openshift-kube-descheduler-operator from the drop-down menu.
- Adjust the values for the Update Channel and Approval Strategy to the desired values.
- Click Install.
Create a descheduler instance.
- From the Operators → Installed Operators page, click the Kube Descheduler Operator.
- Select the Kube Descheduler tab and click Create KubeDescheduler.
- Edit the settings as necessary and click Create.
You can now configure the strategies for the descheduler. There are no strategies enabled by default.
3.10.4. Configuring descheduler strategies Copy linkLink copied to clipboard!
You can configure which strategies the descheduler uses to evict pods.
Prerequisites
- Cluster administrator privileges.
Procedure
Edit the
KubeDescheduler
object:oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
$ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Specify one or more strategies in the
spec.strategies
section.Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- The
LowNodeUtilization
strategy provides additional parameters, such asCPUThreshold
andMemoryThreshold
, that you can optionally configure. - 2
- The
RemoveDuplicates
strategy provides an optional parameter,ExcludeOwnerKinds
. - 3
- The
RemovePodsHavingTooManyRestarts
strategy requires thePodRestartThreshold
parameter to be set. It also provides the optionalIncludingInitContainers
parameter. - 4
- The
RemovePodsViolatingInterPodAntiAffinity
,RemovePodsViolatingNodeAffinity
, andRemovePodsViolatingNodeTaints
strategies do not have any additional parameters to configure. - 5
- The
PodLifeTime
strategy requires theMaxPodLifeTimeSeconds
parameter to be set.
You can enable multiple strategies and the order that the strategies are specified in is not important.
- Save the file to apply the changes.
3.10.5. Filtering pods by namespace Copy linkLink copied to clipboard!
You can configure whether or not pods are considered for eviction based on their namespace. Only the following descheduler strategies support namespace filtering:
-
PodLifeTime
-
RemovePodsHavingTooManyRestarts
-
RemovePodsViolatingInterPodAntiAffinity
-
RemovePodsViolatingNodeAffinity
-
RemovePodsViolatingNodeTaints
You can use the IncludeNamespaces
parameter to specify which namespaces that a descheduler strategy should be run on, or you can use the ExcludeNamespaces
parameter to specify which namespaces that a descheduler strategy should not be run on.
Prerequisites
- Cluster administrator privileges.
Procedure
Edit the
KubeDescheduler
object:oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
$ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Add either the
IncludeNamespaces
orExcludeNamespaces
parameter to one or more strategies:Copy to Clipboard Copied! Toggle word wrap Toggle overflow - Save the file to apply the changes.
3.10.6. Filtering pods by priority Copy linkLink copied to clipboard!
You can configure descheduler strategies to consider pods for eviction only if their priority is lower than a specified priority level. Pods that are higher than the specified priority threshold are not considered for eviction.
You can use the ThresholdPriority
parameter to set a numerical priority threshold, or you can use the ThresholdPriorityClassName
parameter to specify a certain priority class name.
Prerequisites
- Cluster administrator privileges.
Procedure
Edit the
KubeDescheduler
object:oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
$ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Add either the
ThresholdPriority
orThresholdPriorityClassName
parameter to one or more strategies:Copy to Clipboard Copied! Toggle word wrap Toggle overflow - Save the file to apply the changes.
3.10.7. Configuring additional descheduler settings Copy linkLink copied to clipboard!
You can configure additional settings for the descheduler, such as how frequently it runs.
Prerequisites
- Cluster administrator privileges.
Procedure
Edit the
KubeDescheduler
object:oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
$ oc edit kubedeschedulers.operator.openshift.io cluster -n openshift-kube-descheduler-operator
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Configure additional settings as necessary:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - Save the file to apply the changes.
3.10.8. Uninstalling the descheduler Copy linkLink copied to clipboard!
You can remove the descheduler from your cluster by removing the descheduler instance and uninstalling the Kube Descheduler Operator. This procedure also cleans up the KubeDescheduler
CRD and openshift-kube-descheduler-operator
namespace.
Prerequisites
- Cluster administrator privileges.
- Access to the OpenShift Container Platform web console.
Procedure
- Log in to the OpenShift Container Platform web console.
Delete the descheduler instance.
- From the Operators → Installed Operators page, click Kube Descheduler Operator.
- Select the Kube Descheduler tab.
-
Click the Options menu
next to the cluster entry and select Delete KubeDescheduler.
- In the confirmation dialog, click Delete.
Uninstall the Kube Descheduler Operator.
- Navigate to Operators → Installed Operators,
-
Click the Options menu
next to the Kube Descheduler Operator entry and select Uninstall Operator.
- In the confirmation dialog, click Uninstall.
Delete the
openshift-kube-descheduler-operator
namespace.- Navigate to Administration → Namespaces.
-
Enter
openshift-kube-descheduler-operator
into the filter box. -
Click the Options menu
next to the openshift-kube-descheduler-operator entry and select Delete Namespace.
-
In the confirmation dialog, enter
openshift-kube-descheduler-operator
and click Delete.
Delete the
KubeDescheduler
CRD.- Navigate to Administration → Custom Resource Definitions.
-
Enter
KubeDescheduler
into the filter box. -
Click the Options menu
next to the KubeDescheduler entry and select Delete CustomResourceDefinition.
- In the confirmation dialog, click Delete.
Chapter 4. Using Jobs and DaemonSets Copy linkLink copied to clipboard!
4.1. Running background tasks on nodes automatically with daemon sets Copy linkLink copied to clipboard!
As an administrator, you can create and use daemon sets to run replicas of a pod on specific or all nodes in an OpenShift Container Platform cluster.
A daemon set ensures that all (or some) nodes run a copy of a pod. As nodes are added to the cluster, pods are added to the cluster. As nodes are removed from the cluster, those pods are removed through garbage collection. Deleting a daemon set will clean up the pods it created.
You can use daemon sets to create shared storage, run a logging pod on every node in your cluster, or deploy a monitoring agent on every node.
For security reasons, only cluster administrators can create daemon sets.
For more information on daemon sets, see the Kubernetes documentation.
Daemon set scheduling is incompatible with project’s default node selector. If you fail to disable it, the daemon set gets restricted by merging with the default node selector. This results in frequent pod recreates on the nodes that got unselected by the merged node selector, which in turn puts unwanted load on the cluster.
4.1.1. Scheduled by default scheduler Copy linkLink copied to clipboard!
A daemon set ensures that all eligible nodes run a copy of a pod. Normally, the node that a pod runs on is selected by the Kubernetes scheduler. However, previously daemon set pods are created and scheduled by the daemon set controller. That introduces the following issues:
-
Inconsistent pod behavior: Normal pods waiting to be scheduled are created and in Pending state, but daemon set pods are not created in
Pending
state. This is confusing to the user. - Pod preemption is handled by default scheduler. When preemption is enabled, the daemon set controller will make scheduling decisions without considering pod priority and preemption.
The ScheduleDaemonSetPods feature, enabled by default in OpenShift Container Platform, lets you to schedule daemon sets using the default scheduler instead of the daemon set controller, by adding the NodeAffinity
term to the daemon set pods, instead of the spec.nodeName
term. The default scheduler is then used to bind the pod to the target host. If node affinity of the daemon set pod already exists, it is replaced. The daemon set controller only performs these operations when creating or modifying daemon set pods, and no changes are made to the spec.template
of the daemon set.
In addition, a node.kubernetes.io/unschedulable:NoSchedule
toleration is added automatically to daemon set pods. The default scheduler ignores unschedulable Nodes when scheduling daemon set pods.
4.1.2. Creating daemonsets Copy linkLink copied to clipboard!
When creating daemon sets, the nodeSelector
field is used to indicate the nodes on which the daemon set should deploy replicas.
Prerequisites
Before you start using daemon sets, disable the default project-wide node selector in your namespace, by setting the namespace annotation
openshift.io/node-selector
to an empty string:oc patch namespace myproject -p \ '{"metadata": {"annotations": {"openshift.io/node-selector": ""}}}'
$ oc patch namespace myproject -p \ '{"metadata": {"annotations": {"openshift.io/node-selector": ""}}}'
Copy to Clipboard Copied! Toggle word wrap Toggle overflow If you are creating a new project, overwrite the default node selector:
`oc adm new-project <name> --node-selector=""`.
`oc adm new-project <name> --node-selector=""`.
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Procedure
To create a daemon set:
Define the daemon set yaml file:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the daemon set object:
oc create -f daemonset.yaml
$ oc create -f daemonset.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To verify that the pods were created, and that each node has a pod replica:
Find the daemonset pods:
oc get pods
$ oc get pods
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
hello-daemonset-cx6md 1/1 Running 0 2m hello-daemonset-e3md9 1/1 Running 0 2m
hello-daemonset-cx6md 1/1 Running 0 2m hello-daemonset-e3md9 1/1 Running 0 2m
Copy to Clipboard Copied! Toggle word wrap Toggle overflow View the pods to verify the pod has been placed onto the node:
oc describe pod/hello-daemonset-cx6md|grep Node
$ oc describe pod/hello-daemonset-cx6md|grep Node
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Node: openshift-node01.hostname.com/10.14.20.134
Node: openshift-node01.hostname.com/10.14.20.134
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc describe pod/hello-daemonset-e3md9|grep Node
$ oc describe pod/hello-daemonset-e3md9|grep Node
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Node: openshift-node02.hostname.com/10.14.20.137
Node: openshift-node02.hostname.com/10.14.20.137
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
- If you update a daemon set pod template, the existing pod replicas are not affected.
- If you delete a daemon set and then create a new daemon set with a different template but the same label selector, it recognizes any existing pod replicas as having matching labels and thus does not update them or create new replicas despite a mismatch in the pod template.
- If you change node labels, the daemon set adds pods to nodes that match the new labels and deletes pods from nodes that do not match the new labels.
To update a daemon set, force new pod replicas to be created by deleting the old replicas or nodes.
4.2. Running tasks in pods using jobs Copy linkLink copied to clipboard!
A job executes a task in your OpenShift Container Platform cluster.
A job tracks the overall progress of a task and updates its status with information about active, succeeded, and failed pods. Deleting a job will clean up any pod replicas it created. Jobs are part of the Kubernetes API, which can be managed with oc
commands like other object types.
Sample Job specification
See the Kubernetes documentation for more information about jobs.
4.2.1. Understanding jobs and cron jobs Copy linkLink copied to clipboard!
A job tracks the overall progress of a task and updates its status with information about active, succeeded, and failed pods. Deleting a job cleans up any pods it created. Jobs are part of the Kubernetes API, which can be managed with oc
commands like other object types.
There are two possible resource types that allow creating run-once objects in OpenShift Container Platform:
- Job
- A regular job is a run-once object that creates a task and ensures the job finishes.
There are three main types of task suitable to run as a job:
Non-parallel jobs:
- A job that starts only one pod, unless the pod fails.
- The job is complete as soon as its pod terminates successfully.
Parallel jobs with a fixed completion count:
- a job that starts multiple pods.
-
The job represents the overall task and is complete when there is one successful pod for each value in the range
1
to thecompletions
value.
Parallel jobs with a work queue:
- A job with multiple parallel worker processes in a given pod.
- OpenShift Container Platform coordinates pods to determine what each should work on or use an external queue service.
- Each pod is independently capable of determining whether or not all peer pods are complete and that the entire job is done.
- When any pod from the job terminates with success, no new pods are created.
- When at least one pod has terminated with success and all pods are terminated, the job is successfully completed.
- When any pod has exited with success, no other pod should be doing any work for this task or writing any output. Pods should all be in the process of exiting.
For more information about how to make use of the different types of job, see Job Patterns in the Kubernetes documentation.
- Cron job
- A job can be scheduled to run multiple times, using a cron job.
A cron job builds on a regular job by allowing you to specify how the job should be run. Cron jobs are part of the Kubernetes API, which can be managed with oc
commands like other object types.
Cron jobs are useful for creating periodic and recurring tasks, like running backups or sending emails. Cron jobs can also schedule individual tasks for a specific time, such as if you want to schedule a job for a low activity period. A cron job creates a Job
object based on the timezone configured on the control plane node that runs the cronjob controller.
A cron job creates a Job
object approximately once per execution time of its schedule, but there are circumstances in which it fails to create a job or two jobs might be created. Therefore, jobs must be idempotent and you must configure history limits.
4.2.1.1. Understanding how to create jobs Copy linkLink copied to clipboard!
Both resource types require a job configuration that consists of the following key parts:
- A pod template, which describes the pod that OpenShift Container Platform creates.
The
parallelism
parameter, which specifies how many pods running in parallel at any point in time should execute a job.-
For non-parallel jobs, leave unset. When unset, defaults to
1
.
-
For non-parallel jobs, leave unset. When unset, defaults to
The
completions
parameter, specifying how many successful pod completions are needed to finish a job.-
For non-parallel jobs, leave unset. When unset, defaults to
1
. - For parallel jobs with a fixed completion count, specify a value.
-
For parallel jobs with a work queue, leave unset. When unset defaults to the
parallelism
value.
-
For non-parallel jobs, leave unset. When unset, defaults to
4.2.1.2. Understanding how to set a maximum duration for jobs Copy linkLink copied to clipboard!
When defining a job, you can define its maximum duration by setting the activeDeadlineSeconds
field. It is specified in seconds and is not set by default. When not set, there is no maximum duration enforced.
The maximum duration is counted from the time when a first pod gets scheduled in the system, and defines how long a job can be active. It tracks overall time of an execution. After reaching the specified timeout, the job is terminated by OpenShift Container Platform.
4.2.1.3. Understanding how to set a job back off policy for pod failure Copy linkLink copied to clipboard!
A job can be considered failed, after a set amount of retries due to a logical error in configuration or other similar reasons. Failed pods associated with the job are recreated by the controller with an exponential back off delay (10s
, 20s
, 40s
…) capped at six minutes. The limit is reset if no new failed pods appear between controller checks.
Use the spec.backoffLimit
parameter to set the number of retries for a job.
4.2.1.4. Understanding how to configure a cron job to remove artifacts Copy linkLink copied to clipboard!
Cron jobs can leave behind artifact resources such as jobs or pods. As a user it is important to configure history limits so that old jobs and their pods are properly cleaned. There are two fields within cron job’s spec responsible for that:
-
.spec.successfulJobsHistoryLimit
. The number of successful finished jobs to retain (defaults to 3). -
.spec.failedJobsHistoryLimit
. The number of failed finished jobs to retain (defaults to 1).
Delete cron jobs that you no longer need:
oc delete cronjob/<cron_job_name>
$ oc delete cronjob/<cron_job_name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Doing this prevents them from generating unnecessary artifacts.
-
You can suspend further executions by setting the
spec.suspend
to true. All subsequent executions are suspended until you reset tofalse
.
4.2.1.5. Known limitations Copy linkLink copied to clipboard!
The job specification restart policy only applies to the pods, and not the job controller. However, the job controller is hard-coded to keep retrying jobs to completion.
As such, restartPolicy: Never
or --restart=Never
results in the same behavior as restartPolicy: OnFailure
or --restart=OnFailure
. That is, when a job fails it is restarted automatically until it succeeds (or is manually discarded). The policy only sets which subsystem performs the restart.
With the Never
policy, the job controller performs the restart. With each attempt, the job controller increments the number of failures in the job status and create new pods. This means that with each failed attempt, the number of pods increases.
With the OnFailure
policy, kubelet performs the restart. Each attempt does not increment the number of failures in the job status. In addition, kubelet will retry failed jobs starting pods on the same nodes.
4.2.2. Creating jobs Copy linkLink copied to clipboard!
You create a job in OpenShift Container Platform by creating a job object.
Procedure
To create a job:
Create a YAML file similar to the following:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Optionally, specify how many pod replicas a job should run in parallel; defaults to
1
.-
For non-parallel jobs, leave unset. When unset, defaults to
1
.
-
For non-parallel jobs, leave unset. When unset, defaults to
Optionally, specify how many successful pod completions are needed to mark a job completed.
-
For non-parallel jobs, leave unset. When unset, defaults to
1
. - For parallel jobs with a fixed completion count, specify the number of completions.
-
For parallel jobs with a work queue, leave unset. When unset defaults to the
parallelism
value.
-
For non-parallel jobs, leave unset. When unset, defaults to
- Optionally, specify the maximum duration the job can run.
- Optionally, specify the number of retries for a job. This field defaults to six.
- Specify the template for the pod the controller creates.
Specify the restart policy of the pod:
-
Never
. Do not restart the job. -
OnFailure
. Restart the job only if it fails. Always
. Always restart the job.For details on how OpenShift Container Platform uses restart policy with failed containers, see the Example States in the Kubernetes documentation.
-
Create the job:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
You can also create and launch a job from a single command using oc create job
. The following command creates and launches a job similar to the one specified in the previous example:
oc create job pi --image=perl -- perl -Mbignum=bpi -wle 'print bpi(2000)'
$ oc create job pi --image=perl -- perl -Mbignum=bpi -wle 'print bpi(2000)'
4.2.3. Creating cron jobs Copy linkLink copied to clipboard!
You create a cron job in OpenShift Container Platform by creating a job object.
Procedure
To create a cron job:
Create a YAML file similar to the following:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1 1
- Schedule for the job specified in cron format. In this example, the job will run every minute.
- 2 2
- An optional concurrency policy, specifying how to treat concurrent jobs within a cron job. Only one of the following concurrent policies may be specified. If not specified, this defaults to allowing concurrent executions.
-
Allow
allows cron jobs to run concurrently. -
Forbid
forbids concurrent runs, skipping the next run if the previous has not finished yet. -
Replace
cancels the currently running job and replaces it with a new one.
-
- 3 3
- An optional deadline (in seconds) for starting the job if it misses its scheduled time for any reason. Missed jobs executions will be counted as failed ones. If not specified, there is no deadline.
- 4 4
- An optional flag allowing the suspension of a cron job. If set to
true
, all subsequent executions will be suspended. - 5 5
- The number of successful finished jobs to retain (defaults to 3).
- 6 6
- The number of failed finished jobs to retain (defaults to 1).
- 7
- Job template. This is similar to the job example.
- 8
- Sets a label for jobs spawned by this cron job.
- 9
- The restart policy of the pod. This does not apply to the job controller.Note
The
.spec.successfulJobsHistoryLimit
and.spec.failedJobsHistoryLimit
fields are optional. These fields specify how many completed and failed jobs should be kept. By default, they are set to3
and1
respectively. Setting a limit to0
corresponds to keeping none of the corresponding kind of jobs after they finish.
Create the cron job:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
You can also create and launch a cron job from a single command using oc create cronjob
. The following command creates and launches a cron job similar to the one specified in the previous example:
oc create cronjob pi --image=perl --schedule='*/1 * * * *' -- perl -Mbignum=bpi -wle 'print bpi(2000)'
$ oc create cronjob pi --image=perl --schedule='*/1 * * * *' -- perl -Mbignum=bpi -wle 'print bpi(2000)'
With oc create cronjob
, the --schedule
option accepts schedules in cron format.
Chapter 5. Working with nodes Copy linkLink copied to clipboard!
5.1. Viewing and listing the nodes in your OpenShift Container Platform cluster Copy linkLink copied to clipboard!
You can list all the nodes in your cluster to obtain information such as status, age, memory usage, and details about the nodes.
When you perform node management operations, the CLI interacts with node objects that are representations of actual node hosts. The master uses the information from node objects to validate nodes with health checks.
5.1.1. About listing all the nodes in a cluster Copy linkLink copied to clipboard!
You can get detailed information on the nodes in the cluster.
The following command lists all nodes:
oc get nodes
$ oc get nodes
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The following example is a cluster with healthy nodes:
oc get nodes
$ oc get nodes
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME STATUS ROLES AGE VERSION master.example.com Ready master 7h v1.19.0 node1.example.com Ready worker 7h v1.19.0 node2.example.com Ready worker 7h v1.19.0
NAME STATUS ROLES AGE VERSION master.example.com Ready master 7h v1.19.0 node1.example.com Ready worker 7h v1.19.0 node2.example.com Ready worker 7h v1.19.0
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The following example is a cluster with one unhealthy node:
oc get nodes
$ oc get nodes
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME STATUS ROLES AGE VERSION master.example.com Ready master 7h v1.20.0 node1.example.com NotReady,SchedulingDisabled worker 7h v1.20.0 node2.example.com Ready worker 7h v1.20.0
NAME STATUS ROLES AGE VERSION master.example.com Ready master 7h v1.20.0 node1.example.com NotReady,SchedulingDisabled worker 7h v1.20.0 node2.example.com Ready worker 7h v1.20.0
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The conditions that trigger a
NotReady
status are shown later in this section.The
-o wide
option provides additional information on nodes.oc get nodes -o wide
$ oc get nodes -o wide
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME master.example.com Ready master 171m v1.20.0+39c0afe 10.0.129.108 <none> Red Hat Enterprise Linux CoreOS 48.83.202103210901-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.21.0-30.rhaos4.8.gitf2f339d.el8-dev node1.example.com Ready worker 72m v1.20.0+39c0afe 10.0.129.222 <none> Red Hat Enterprise Linux CoreOS 48.83.202103210901-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.21.0-30.rhaos4.8.gitf2f339d.el8-dev node2.example.com Ready worker 164m v1.20.0+39c0afe 10.0.142.150 <none> Red Hat Enterprise Linux CoreOS 48.83.202103210901-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.21.0-30.rhaos4.8.gitf2f339d.el8-dev
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME master.example.com Ready master 171m v1.20.0+39c0afe 10.0.129.108 <none> Red Hat Enterprise Linux CoreOS 48.83.202103210901-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.21.0-30.rhaos4.8.gitf2f339d.el8-dev node1.example.com Ready worker 72m v1.20.0+39c0afe 10.0.129.222 <none> Red Hat Enterprise Linux CoreOS 48.83.202103210901-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.21.0-30.rhaos4.8.gitf2f339d.el8-dev node2.example.com Ready worker 164m v1.20.0+39c0afe 10.0.142.150 <none> Red Hat Enterprise Linux CoreOS 48.83.202103210901-0 (Ootpa) 4.18.0-240.15.1.el8_3.x86_64 cri-o://1.21.0-30.rhaos4.8.gitf2f339d.el8-dev
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The following command lists information about a single node:
oc get node <node>
$ oc get node <node>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc get node node1.example.com
$ oc get node node1.example.com
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME STATUS ROLES AGE VERSION node1.example.com Ready worker 7h v1.20.0
NAME STATUS ROLES AGE VERSION node1.example.com Ready worker 7h v1.20.0
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The following command provides more detailed information about a specific node, including the reason for the current condition:
oc describe node <node>
$ oc describe node <node>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc describe node node1.example.com
$ oc describe node node1.example.com
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- The name of the node.
- 2
- The role of the node, either
master
orworker
. - 3
- The labels applied to the node.
- 4
- The annotations applied to the node.
- 5
- The taints applied to the node.
- 6
- The node conditions and status. The
conditions
stanza lists theReady
,PIDPressure
,PIDPressure
,MemoryPressure
,DiskPressure
andOutOfDisk
status. These condition are described later in this section. - 7
- The IP address and hostname of the node.
- 8
- The pod resources and allocatable resources.
- 9
- Information about the node host.
- 10
- The pods on the node.
- 11
- The events reported by the node.
Among the information shown for nodes, the following node conditions appear in the output of the commands shown in this section:
Condition | Description |
---|---|
|
If |
|
If |
|
If |
|
If |
|
If |
|
If |
|
If |
| Pods cannot be scheduled for placement on the node. |
5.1.2. Listing pods on a node in your cluster Copy linkLink copied to clipboard!
You can list all the pods on a specific node.
Procedure
To list all or selected pods on one or more nodes:
oc describe node <node1> <node2>
$ oc describe node <node1> <node2>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc describe node ip-10-0-128-218.ec2.internal
$ oc describe node ip-10-0-128-218.ec2.internal
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To list all or selected pods on selected nodes:
oc describe --selector=<node_selector>
$ oc describe --selector=<node_selector>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc describe node --selector=kubernetes.io/os
$ oc describe node --selector=kubernetes.io/os
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Or:
oc describe -l=<pod_selector>
$ oc describe -l=<pod_selector>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc describe node -l node-role.kubernetes.io/worker
$ oc describe node -l node-role.kubernetes.io/worker
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To list all pods on a specific node, including terminated pods:
oc get pod --all-namespaces --field-selector=spec.nodeName=<nodename>
$ oc get pod --all-namespaces --field-selector=spec.nodeName=<nodename>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
5.1.3. Viewing memory and CPU usage statistics on your nodes Copy linkLink copied to clipboard!
You can display usage statistics about nodes, which provide the runtime environments for containers. These usage statistics include CPU, memory, and storage consumption.
Prerequisites
-
You must have
cluster-reader
permission to view the usage statistics. - Metrics must be installed to view the usage statistics.
Procedure
To view the usage statistics:
oc adm top nodes
$ oc adm top nodes
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To view the usage statistics for nodes with labels:
oc adm top node --selector=''
$ oc adm top node --selector=''
Copy to Clipboard Copied! Toggle word wrap Toggle overflow You must choose the selector (label query) to filter on. Supports
=
,==
, and!=
.
5.2. Working with nodes Copy linkLink copied to clipboard!
As an administrator, you can perform a number of tasks to make your clusters more efficient.
5.2.1. Understanding how to evacuate pods on nodes Copy linkLink copied to clipboard!
Evacuating pods allows you to migrate all or selected pods from a given node or nodes.
You can only evacuate pods backed by a replication controller. The replication controller creates new pods on other nodes and removes the existing pods from the specified node(s).
Bare pods, meaning those not backed by a replication controller, are unaffected by default. You can evacuate a subset of pods by specifying a pod-selector. Pod selectors are based on labels, so all the pods with the specified label will be evacuated.
Procedure
Mark the nodes unschedulable before performing the pod evacuation.
Mark the node as unschedulable:
oc adm cordon <node1>
$ oc adm cordon <node1>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
node/<node1> cordoned
node/<node1> cordoned
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Check that the node status is
Ready,SchedulingDisabled
:oc get node <node1>
$ oc get node <node1>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME STATUS ROLES AGE VERSION <node1> Ready,SchedulingDisabled worker 1d v1.24.0
NAME STATUS ROLES AGE VERSION <node1> Ready,SchedulingDisabled worker 1d v1.24.0
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Evacuate the pods using one of the following methods:
Evacuate all or selected pods on one or more nodes:
oc adm drain <node1> <node2> [--pod-selector=<pod_selector>]
$ oc adm drain <node1> <node2> [--pod-selector=<pod_selector>]
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Force the deletion of bare pods using the
--force
option. When set totrue
, deletion continues even if there are pods not managed by a replication controller, replica set, job, daemon set, or stateful set:oc adm drain <node1> <node2> --force=true
$ oc adm drain <node1> <node2> --force=true
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Set a period of time in seconds for each pod to terminate gracefully, use
--grace-period
. If negative, the default value specified in the pod will be used:oc adm drain <node1> <node2> --grace-period=-1
$ oc adm drain <node1> <node2> --grace-period=-1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Ignore pods managed by daemon sets using the
--ignore-daemonsets
flag set totrue
:oc adm drain <node1> <node2> --ignore-daemonsets=true
$ oc adm drain <node1> <node2> --ignore-daemonsets=true
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Set the length of time to wait before giving up using the
--timeout
flag. A value of0
sets an infinite length of time:oc adm drain <node1> <node2> --timeout=5s
$ oc adm drain <node1> <node2> --timeout=5s
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Delete pods even if there are pods using emptyDir using the
--delete-local-data
flag set totrue
. Local data is deleted when the node is drained:oc adm drain <node1> <node2> --delete-local-data=true
$ oc adm drain <node1> <node2> --delete-local-data=true
Copy to Clipboard Copied! Toggle word wrap Toggle overflow List objects that will be migrated without actually performing the evacuation, using the
--dry-run
option set totrue
:oc adm drain <node1> <node2> --dry-run=true
$ oc adm drain <node1> <node2> --dry-run=true
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Instead of specifying specific node names (for example,
<node1> <node2>
), you can use the--selector=<node_selector>
option to evacuate pods on selected nodes.
Mark the node as schedulable when done.
oc adm uncordon <node1>
$ oc adm uncordon <node1>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
5.2.2. Understanding how to update labels on nodes Copy linkLink copied to clipboard!
You can update any label on a node.
Node labels are not persisted after a node is deleted even if the node is backed up by a Machine.
Any change to a MachineSet
object is not applied to existing machines owned by the machine set. For example, labels edited or added to an existing MachineSet
object are not propagated to existing machines and nodes associated with the machine set.
The following command adds or updates labels on a node:
oc label node <node> <key_1>=<value_1> ... <key_n>=<value_n>
$ oc label node <node> <key_1>=<value_1> ... <key_n>=<value_n>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc label nodes webconsole-7f7f6 unhealthy=true
$ oc label nodes webconsole-7f7f6 unhealthy=true
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The following command updates all pods in the namespace:
oc label pods --all <key_1>=<value_1>
$ oc label pods --all <key_1>=<value_1>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc label pods --all status=unhealthy
$ oc label pods --all status=unhealthy
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
5.2.3. Understanding how to mark nodes as unschedulable or schedulable Copy linkLink copied to clipboard!
By default, healthy nodes with a Ready
status are marked as schedulable, meaning that new pods are allowed for placement on the node. Manually marking a node as unschedulable blocks any new pods from being scheduled on the node. Existing pods on the node are not affected.
The following command marks a node or nodes as unschedulable:
Example output
oc adm cordon <node>
$ oc adm cordon <node>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc adm cordon node1.example.com
$ oc adm cordon node1.example.com
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
node/node1.example.com cordoned NAME LABELS STATUS node1.example.com kubernetes.io/hostname=node1.example.com Ready,SchedulingDisabled
node/node1.example.com cordoned NAME LABELS STATUS node1.example.com kubernetes.io/hostname=node1.example.com Ready,SchedulingDisabled
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The following command marks a currently unschedulable node or nodes as schedulable:
oc adm uncordon <node1>
$ oc adm uncordon <node1>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Alternatively, instead of specifying specific node names (for example,
<node>
), you can use the--selector=<node_selector>
option to mark selected nodes as schedulable or unschedulable.
5.2.4. Configuring control plane nodes as schedulable Copy linkLink copied to clipboard!
You can configure control plane nodes (also known as the master nodes) to be schedulable, meaning that new pods are allowed for placement on the master nodes. By default, control plane nodes are not schedulable.
You can set the masters to be schedulable, but must retain the worker nodes.
You can deploy OpenShift Container Platform with no worker nodes on a bare metal cluster. In this case, the control plane nodes are marked schedulable by default.
You can allow or disallow control plane nodes to be schedulable by configuring the mastersSchedulable
field.
+
When you configure control plane nodes from the default unschedulable to schedulable, additional subscriptions are required. This is because control plane nodes then become worker nodes.
Procedure
Edit the
schedulers.config.openshift.io
resource.oc edit schedulers.config.openshift.io cluster
$ oc edit schedulers.config.openshift.io cluster
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Configure the
mastersSchedulable
field.Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Set to
true
to allow control plane nodes to be schedulable, orfalse
to disallow control plane nodes to be schedulable.
- Save the file to apply the changes.
5.2.5. Deleting nodes Copy linkLink copied to clipboard!
5.2.5.1. Deleting nodes from a cluster Copy linkLink copied to clipboard!
When you delete a node using the CLI, the node object is deleted in Kubernetes, but the pods that exist on the node are not deleted. Any bare pods not backed by a replication controller become inaccessible to OpenShift Container Platform. Pods backed by replication controllers are rescheduled to other available nodes. You must delete local manifest pods.
Procedure
To delete a node from the OpenShift Container Platform cluster, edit the appropriate MachineSet
object:
If you are running cluster on bare metal, you cannot delete a node by editing MachineSet
objects. Machine sets are only available when a cluster is integrated with a cloud provider. Instead you must unschedule and drain the node before manually deleting it.
View the machine sets that are in the cluster:
oc get machinesets -n openshift-machine-api
$ oc get machinesets -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The machine sets are listed in the form of <clusterid>-worker-<aws-region-az>.
Scale the machine set:
oc scale --replicas=2 machineset <machineset> -n openshift-machine-api
$ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
For more information on scaling your cluster using a machine set, see Manually scaling a machine set.
5.2.5.2. Deleting nodes from a bare metal cluster Copy linkLink copied to clipboard!
When you delete a node using the CLI, the node object is deleted in Kubernetes, but the pods that exist on the node are not deleted. Any bare pods not backed by a replication controller become inaccessible to OpenShift Container Platform. Pods backed by replication controllers are rescheduled to other available nodes. You must delete local manifest pods.
Procedure
Delete a node from an OpenShift Container Platform cluster running on bare metal by completing the following steps:
Mark the node as unschedulable:
oc adm cordon <node_name>
$ oc adm cordon <node_name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Drain all pods on the node:
oc adm drain <node_name> --force=true
$ oc adm drain <node_name> --force=true
Copy to Clipboard Copied! Toggle word wrap Toggle overflow This step might fail if the node is offline or unresponsive. Even if the node does not respond, it might still be running a workload that writes to shared storage. To avoid data corruption, power down the physical hardware before you proceed.
Delete the node from the cluster:
oc delete node <node_name>
$ oc delete node <node_name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Although the node object is now deleted from the cluster, it can still rejoin the cluster after reboot or if the kubelet service is restarted. To permanently delete the node and all its data, you must decommission the node.
- If you powered down the physical hardware, turn it back on so that the node can rejoin the cluster.
5.2.6. Setting SELinux booleans Copy linkLink copied to clipboard!
OpenShift Container Platform allows you to enable and disable an SELinux boolean on a Red Hat Enterprise Linux CoreOS (RHCOS) node. The following procedure explains how to modify SELinux booleans on nodes using the Machine Config Operator (MCO). This procedure uses container_manage_cgroup
as the example boolean. You can modify this value to whichever boolean you need.
Prerequisites
- You have installed the OpenShift CLI (oc).
Procedure
Create a new YAML file with a
MachineConfig
object, displayed in the following example:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the new
MachineConfig
object by running the following command:oc create -f 99-worker-setsebool.yaml
$ oc create -f 99-worker-setsebool.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Applying any changes to the MachineConfig
object causes all affected nodes to gracefully reboot after the change is applied.
5.2.7. Adding kernel arguments to nodes Copy linkLink copied to clipboard!
In some special cases, you might want to add kernel arguments to a set of nodes in your cluster. This should only be done with caution and clear understanding of the implications of the arguments you set.
Improper use of kernel arguments can result in your systems becoming unbootable.
Examples of kernel arguments you could set include:
- enforcing=0: Configures Security Enhanced Linux (SELinux) to run in permissive mode. In permissive mode, the system acts as if SELinux is enforcing the loaded security policy, including labeling objects and emitting access denial entries in the logs, but it does not actually deny any operations. While not supported for production systems, permissive mode can be helpful for debugging.
-
nosmt: Disables symmetric multithreading (SMT) in the kernel. Multithreading allows multiple logical threads for each CPU. You could consider
nosmt
in multi-tenant environments to reduce risks from potential cross-thread attacks. By disabling SMT, you essentially choose security over performance.
See Kernel.org kernel parameters for a list and descriptions of kernel arguments.
In the following procedure, you create a MachineConfig
object that identifies:
- A set of machines to which you want to add the kernel argument. In this case, machines with a worker role.
- Kernel arguments that are appended to the end of the existing kernel arguments.
- A label that indicates where in the list of machine configs the change is applied.
Prerequisites
- Have administrative privilege to a working OpenShift Container Platform cluster.
Procedure
List existing
MachineConfig
objects for your OpenShift Container Platform cluster to determine how to label your machine config:oc get MachineConfig
$ oc get MachineConfig
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a
MachineConfig
object file that identifies the kernel argument (for example,05-worker-kernelarg-selinuxpermissive.yaml
)Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the new machine config:
oc create -f 05-worker-kernelarg-selinuxpermissive.yaml
$ oc create -f 05-worker-kernelarg-selinuxpermissive.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Check the machine configs to see that the new one was added:
oc get MachineConfig
$ oc get MachineConfig
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Check the nodes:
oc get nodes
$ oc get nodes
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow You can see that scheduling on each worker node is disabled as the change is being applied.
Check that the kernel argument worked by going to one of the worker nodes and listing the kernel command line arguments (in
/proc/cmdline
on the host):oc debug node/ip-10-0-141-105.ec2.internal
$ oc debug node/ip-10-0-141-105.ec2.internal
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow You should see the
enforcing=0
argument added to the other kernel arguments.
5.3. Managing nodes Copy linkLink copied to clipboard!
OpenShift Container Platform uses a KubeletConfig custom resource (CR) to manage the configuration of nodes. By creating an instance of a KubeletConfig
object, a managed machine config is created to override setting on the node.
Logging in to remote machines for the purpose of changing their configuration is not supported.
5.3.1. Modifying nodes Copy linkLink copied to clipboard!
To make configuration changes to a cluster, or machine pool, you must create a custom resource definition (CRD), or kubeletConfig
object. OpenShift Container Platform uses the Machine Config Controller to watch for changes introduced through the CRD to apply the changes to the cluster.
Because the fields in a kubeletConfig
object are passed directly to the kubelet from upstream Kubernetes, the validation of those fields is handled directly by the kubelet itself. Please refer to the relevant Kubernetes documentation for the valid values for these fields. Invalid values in the kubeletConfig
object can render cluster nodes unusable.
Procedure
Obtain the label associated with the static CRD, Machine Config Pool, for the type of node you want to configure. Perform one of the following steps:
Check current labels of the desired machine config pool.
For example:
oc get machineconfigpool --show-labels
$ oc get machineconfigpool --show-labels
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME CONFIG UPDATED UPDATING DEGRADED LABELS master rendered-master-e05b81f5ca4db1d249a1bf32f9ec24fd True False False operator.machineconfiguration.openshift.io/required-for-upgrade= worker rendered-worker-f50e78e1bc06d8e82327763145bfcf62 True False False
NAME CONFIG UPDATED UPDATING DEGRADED LABELS master rendered-master-e05b81f5ca4db1d249a1bf32f9ec24fd True False False operator.machineconfiguration.openshift.io/required-for-upgrade= worker rendered-worker-f50e78e1bc06d8e82327763145bfcf62 True False False
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Add a custom label to the desired machine config pool.
For example:
oc label machineconfigpool worker custom-kubelet=enabled
$ oc label machineconfigpool worker custom-kubelet=enabled
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Create a
kubeletconfig
custom resource (CR) for your configuration change.For example:
Sample configuration for a custom-config CR
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the CR object.
oc create -f <file-name>
$ oc create -f <file-name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc create -f master-kube-config.yaml
$ oc create -f master-kube-config.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Most Kubelet Configuration options can be set by the user. The following options are not allowed to be overwritten:
- CgroupDriver
- ClusterDNS
- ClusterDomain
- RuntimeRequestTimeout
- StaticPodPath
5.4. Managing the maximum number of pods per node Copy linkLink copied to clipboard!
In OpenShift Container Platform, you can configure the number of pods that can run on a node based on the number of processor cores on the node, a hard limit or both. If you use both options, the lower of the two limits the number of pods on a node.
Exceeding these values can result in:
- Increased CPU utilization by OpenShift Container Platform.
- Slow pod scheduling.
- Potential out-of-memory scenarios, depending on the amount of memory in the node.
- Exhausting the IP address pool.
- Resource overcommitting, leading to poor user application performance.
A pod that is holding a single container actually uses two containers. The second container sets up networking prior to the actual container starting. As a result, a node running 10 pods actually has 20 containers running.
The podsPerCore
parameter limits the number of pods the node can run based on the number of processor cores on the node. For example, if podsPerCore
is set to 10
on a node with 4 processor cores, the maximum number of pods allowed on the node is 40.
The maxPods
parameter limits the number of pods the node can run to a fixed value, regardless of the properties of the node.
5.4.1. Configuring the maximum number of pods per node Copy linkLink copied to clipboard!
Two parameters control the maximum number of pods that can be scheduled to a node: podsPerCore
and maxPods
. If you use both options, the lower of the two limits the number of pods on a node.
For example, if podsPerCore
is set to 10
on a node with 4 processor cores, the maximum number of pods allowed on the node will be 40.
Prerequisites
Obtain the label associated with the static
MachineConfigPool
CRD for the type of node you want to configure. Perform one of the following steps:View the machine config pool:
oc describe machineconfigpool <name>
$ oc describe machineconfigpool <name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc describe machineconfigpool worker
$ oc describe machineconfigpool worker
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- If a label has been added it appears under
labels
.
If the label is not present, add a key/value pair:
oc label machineconfigpool worker custom-kubelet=small-pods
$ oc label machineconfigpool worker custom-kubelet=small-pods
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Procedure
Create a custom resource (CR) for your configuration change.
Sample configuration for a
max-pods
CRCopy to Clipboard Copied! Toggle word wrap Toggle overflow NoteSetting
podsPerCore
to0
disables this limit.In the above example, the default value for
podsPerCore
is10
and the default value formaxPods
is250
. This means that unless the node has 25 cores or more, by default,podsPerCore
will be the limiting factor.List the
MachineConfigPool
CRDs to see if the change is applied. TheUPDATING
column reportsTrue
if the change is picked up by the Machine Config Controller:oc get machineconfigpools
$ oc get machineconfigpools
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME CONFIG UPDATED UPDATING DEGRADED master master-9cc2c72f205e103bb534 False False False worker worker-8cecd1236b33ee3f8a5e False True False
NAME CONFIG UPDATED UPDATING DEGRADED master master-9cc2c72f205e103bb534 False False False worker worker-8cecd1236b33ee3f8a5e False True False
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Once the change is complete, the
UPDATED
column reportsTrue
.oc get machineconfigpools
$ oc get machineconfigpools
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME CONFIG UPDATED UPDATING DEGRADED master master-9cc2c72f205e103bb534 False True False worker worker-8cecd1236b33ee3f8a5e True False False
NAME CONFIG UPDATED UPDATING DEGRADED master master-9cc2c72f205e103bb534 False True False worker worker-8cecd1236b33ee3f8a5e True False False
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
5.5. Using the Node Tuning Operator Copy linkLink copied to clipboard!
Learn about the Node Tuning Operator and how you can use it to manage node-level tuning by orchestrating the tuned daemon.
The Node Tuning Operator helps you manage node-level tuning by orchestrating the Tuned daemon. The majority of high-performance applications require some level of kernel tuning. The Node Tuning Operator provides a unified management interface to users of node-level sysctls and more flexibility to add custom tuning specified by user needs.
The Operator manages the containerized Tuned daemon for OpenShift Container Platform as a Kubernetes daemon set. It ensures the custom tuning specification is passed to all containerized Tuned daemons running in the cluster in the format that the daemons understand. The daemons run on all nodes in the cluster, one per node.
Node-level settings applied by the containerized Tuned daemon are rolled back on an event that triggers a profile change or when the containerized Tuned daemon is terminated gracefully by receiving and handling a termination signal.
The Node Tuning Operator is part of a standard OpenShift Container Platform installation in version 4.1 and later.
5.5.1. Accessing an example Node Tuning Operator specification Copy linkLink copied to clipboard!
Use this process to access an example Node Tuning Operator specification.
Procedure
Run:
oc get Tuned/default -o yaml -n openshift-cluster-node-tuning-operator
$ oc get Tuned/default -o yaml -n openshift-cluster-node-tuning-operator
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
The default CR is meant for delivering standard node-level tuning for the OpenShift Container Platform platform and it can only be modified to set the Operator Management state. Any other custom changes to the default CR will be overwritten by the Operator. For custom tuning, create your own Tuned CRs. Newly created CRs will be combined with the default CR and custom tuning applied to OpenShift Container Platform nodes based on node or pod labels and profile priorities.
While in certain situations the support for pod labels can be a convenient way of automatically delivering required tuning, this practice is discouraged and strongly advised against, especially in large-scale clusters. The default Tuned CR ships without pod label matching. If a custom profile is created with pod label matching, then the functionality will be enabled at that time. The pod label functionality might be deprecated in future versions of the Node Tuning Operator.
5.5.2. Custom tuning specification Copy linkLink copied to clipboard!
The custom resource (CR) for the Operator has two major sections. The first section, profile:
, is a list of Tuned profiles and their names. The second, recommend:
, defines the profile selection logic.
Multiple custom tuning specifications can co-exist as multiple CRs in the Operator’s namespace. The existence of new CRs or the deletion of old CRs is detected by the Operator. All existing custom tuning specifications are merged and appropriate objects for the containerized Tuned daemons are updated.
Management state
The Operator Management state is set by adjusting the default Tuned CR. By default, the Operator is in the Managed state and the spec.managementState
field is not present in the default Tuned CR. Valid values for the Operator Management state are as follows:
- Managed: the Operator will update its operands as configuration resources are updated
- Unmanaged: the Operator will ignore changes to the configuration resources
- Removed: the Operator will remove its operands and resources the Operator provisioned
Profile data
The profile:
section lists Tuned profiles and their names.
Recommended profiles
The profile:
selection logic is defined by the recommend:
section of the CR. The recommend:
section is a list of items to recommend the profiles based on a selection criteria.
recommend: <recommend-item-1> # ... <recommend-item-n>
recommend:
<recommend-item-1>
# ...
<recommend-item-n>
The individual items of the list:
- 1
- Optional.
- 2
- A dictionary of key/value
MachineConfig
labels. The keys must be unique. - 3
- If omitted, profile match is assumed unless a profile with a higher priority matches first or
machineConfigLabels
is set. - 4
- An optional list.
- 5
- Profile ordering priority. Lower numbers mean higher priority (
0
is the highest priority). - 6
- A Tuned profile to apply on a match. For example
tuned_profile_1
.
<match>
is an optional list recursively defined as follows:
- label: <label_name> value: <label_value> type: <label_type> <match>
- label: <label_name>
value: <label_value>
type: <label_type>
<match>
If <match>
is not omitted, all nested <match>
sections must also evaluate to true
. Otherwise, false
is assumed and the profile with the respective <match>
section will not be applied or recommended. Therefore, the nesting (child <match>
sections) works as logical AND operator. Conversely, if any item of the <match>
list matches, the entire <match>
list evaluates to true
. Therefore, the list acts as logical OR operator.
If machineConfigLabels
is defined, machine config pool based matching is turned on for the given recommend:
list item. <mcLabels>
specifies the labels for a machine config. The machine config is created automatically to apply host settings, such as kernel boot parameters, for the profile <tuned_profile_name>
. This involves finding all machine config pools with machine config selector matching <mcLabels>
and setting the profile <tuned_profile_name>
on all nodes that are assigned the found machine config pools. To target nodes that have both master and worker roles, you must use the master role.
The list items match
and machineConfigLabels
are connected by the logical OR operator. The match
item is evaluated first in a short-circuit manner. Therefore, if it evaluates to true
, the machineConfigLabels
item is not considered.
When using machine config pool based matching, it is advised to group nodes with the same hardware configuration into the same machine config pool. Not following this practice might result in Tuned operands calculating conflicting kernel parameters for two or more nodes sharing the same machine config pool.
Example: node or pod label based matching
The CR above is translated for the containerized Tuned daemon into its recommend.conf
file based on the profile priorities. The profile with the highest priority (10
) is openshift-control-plane-es
and, therefore, it is considered first. The containerized Tuned daemon running on a given node looks to see if there is a pod running on the same node with the tuned.openshift.io/elasticsearch
label set. If not, the entire <match>
section evaluates as false
. If there is such a pod with the label, in order for the <match>
section to evaluate to true
, the node label also needs to be node-role.kubernetes.io/master
or node-role.kubernetes.io/infra
.
If the labels for the profile with priority 10
matched, openshift-control-plane-es
profile is applied and no other profile is considered. If the node/pod label combination did not match, the second highest priority profile (openshift-control-plane
) is considered. This profile is applied if the containerized Tuned pod runs on a node with labels node-role.kubernetes.io/master
or node-role.kubernetes.io/infra
.
Finally, the profile openshift-node
has the lowest priority of 30
. It lacks the <match>
section and, therefore, will always match. It acts as a profile catch-all to set openshift-node
profile, if no other profile with higher priority matches on a given node.
Example: machine config pool based matching
To minimize node reboots, label the target nodes with a label the machine config pool’s node selector will match, then create the Tuned CR above and finally create the custom machine config pool itself.
5.5.3. Default profiles set on a cluster Copy linkLink copied to clipboard!
The following are the default profiles set on a cluster.
5.5.4. Supported Tuned daemon plug-ins Copy linkLink copied to clipboard!
Excluding the [main]
section, the following Tuned plug-ins are supported when using custom profiles defined in the profile:
section of the Tuned CR:
- audio
- cpu
- disk
- eeepc_she
- modules
- mounts
- net
- scheduler
- scsi_host
- selinux
- sysctl
- sysfs
- usb
- video
- vm
There is some dynamic tuning functionality provided by some of these plug-ins that is not supported. The following Tuned plug-ins are currently not supported:
- bootloader
- script
- systemd
See Available Tuned Plug-ins and Getting Started with Tuned for more information.
5.6. Understanding node rebooting Copy linkLink copied to clipboard!
To reboot a node without causing an outage for applications running on the platform, it is important to first evacuate the pods. For pods that are made highly available by the routing tier, nothing else needs to be done. For other pods needing storage, typically databases, it is critical to ensure that they can remain in operation with one pod temporarily going offline. While implementing resiliency for stateful pods is different for each application, in all cases it is important to configure the scheduler to use node anti-affinity to ensure that the pods are properly spread across available nodes.
Another challenge is how to handle nodes that are running critical infrastructure such as the router or the registry. The same node evacuation process applies, though it is important to understand certain edge cases.
5.6.1. About rebooting nodes running critical infrastructure Copy linkLink copied to clipboard!
When rebooting nodes that host critical OpenShift Container Platform infrastructure components, such as router pods, registry pods, and monitoring pods, ensure that there are at least three nodes available to run these components.
The following scenario demonstrates how service interruptions can occur with applications running on OpenShift Container Platform when only two nodes are available:
- Node A is marked unschedulable and all pods are evacuated.
- The registry pod running on that node is now redeployed on node B. Node B is now running both registry pods.
- Node B is now marked unschedulable and is evacuated.
- The service exposing the two pod endpoints on node B loses all endpoints, for a brief period of time, until they are redeployed to node A.
When using three nodes for infrastructure components, this process does not result in a service disruption. However, due to pod scheduling, the last node that is evacuated and brought back into rotation does not have a registry pod. One of the other nodes has two registry pods. To schedule the third registry pod on the last node, use pod anti-affinity to prevent the scheduler from locating two registry pods on the same node.
Additional information
- For more information on pod anti-affinity, see Placing pods relative to other pods using affinity and anti-affinity rules.
5.6.2. Rebooting a node using pod anti-affinity Copy linkLink copied to clipboard!
Pod anti-affinity is slightly different than node anti-affinity. Node anti-affinity can be violated if there are no other suitable locations to deploy a pod. Pod anti-affinity can be set to either required or preferred.
With this in place, if only two infrastructure nodes are available and one is rebooted, the container image registry pod is prevented from running on the other node. oc get pods
reports the pod as unready until a suitable node is available. Once a node is available and all pods are back in ready state, the next node can be restarted.
Procedure
To reboot a node using pod anti-affinity:
Edit the node specification to configure pod anti-affinity:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Stanza to configure pod anti-affinity.
- 2
- Defines a preferred rule.
- 3
- Specifies a weight for a preferred rule. The node with the highest weight is preferred.
- 4
- Description of the pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.
- 5
- The operator represents the relationship between the label on the existing pod and the set of values in the
matchExpression
parameters in the specification for the new pod. Can beIn
,NotIn
,Exists
, orDoesNotExist
.
This example assumes the container image registry pod has a label of
registry=default
. Pod anti-affinity can use any Kubernetes match expression.-
Enable the
MatchInterPodAffinity
scheduler predicate in the scheduling policy file. - Perform a graceful restart of the node.
5.6.3. Understanding how to reboot nodes running routers Copy linkLink copied to clipboard!
In most cases, a pod running an OpenShift Container Platform router exposes a host port.
The PodFitsPorts
scheduler predicate ensures that no router pods using the same port can run on the same node, and pod anti-affinity is achieved. If the routers are relying on IP failover for high availability, there is nothing else that is needed.
For router pods relying on an external service such as AWS Elastic Load Balancing for high availability, it is that service’s responsibility to react to router pod restarts.
In rare cases, a router pod may not have a host port configured. In those cases, it is important to follow the recommended restart process for infrastructure nodes.
5.6.4. Rebooting a node gracefully Copy linkLink copied to clipboard!
Before rebooting a node, it is recommended to backup etcd data to avoid any data loss on the node.
Procedure
To perform a graceful restart of a node:
Mark the node as unschedulable:
oc adm cordon <node1>
$ oc adm cordon <node1>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Drain the node to remove all the running pods:
oc adm drain <node1> --ignore-daemonsets --delete-local-data
$ oc adm drain <node1> --ignore-daemonsets --delete-local-data
Copy to Clipboard Copied! Toggle word wrap Toggle overflow You might receive errors that pods associated with custom pod disruption budgets (PDB) cannot be evicted.
Example error
error when evicting pods/"rails-postgresql-example-1-72v2w" -n "rails" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
error when evicting pods/"rails-postgresql-example-1-72v2w" -n "rails" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
Copy to Clipboard Copied! Toggle word wrap Toggle overflow In this case, run the drain command again, adding the
disable-eviction
flag, which bypasses the PDB checks:oc adm drain <node1> --ignore-daemonsets --delete-emptydir-data --force --disable-eviction
$ oc adm drain <node1> --ignore-daemonsets --delete-emptydir-data --force --disable-eviction
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Access the node in debug mode:
oc debug node/<node1>
$ oc debug node/<node1>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Change your root directory to the host:
chroot /host
$ chroot /host
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Restart the node:
systemctl reboot
$ systemctl reboot
Copy to Clipboard Copied! Toggle word wrap Toggle overflow In a moment, the node enters the
NotReady
state.After the reboot is complete, mark the node as schedulable by running the following command:
oc adm uncordon <node1>
$ oc adm uncordon <node1>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the node is ready:
oc get node <node1>
$ oc get node <node1>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME STATUS ROLES AGE VERSION <node1> Ready worker 6d22h v1.18.3+b0068a8
NAME STATUS ROLES AGE VERSION <node1> Ready worker 6d22h v1.18.3+b0068a8
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Additional information
For information on etcd data backup, see Backing up etcd data.
5.7. Freeing node resources using garbage collection Copy linkLink copied to clipboard!
As an administrator, you can use OpenShift Container Platform to ensure that your nodes are running efficiently by freeing up resources through garbage collection.
The OpenShift Container Platform node performs two types of garbage collection:
- Container garbage collection: Removes terminated containers.
- Image garbage collection: Removes images not referenced by any running pods.
5.7.1. Understanding how terminated containers are removed through garbage collection Copy linkLink copied to clipboard!
Container garbage collection can be performed using eviction thresholds.
When eviction thresholds are set for garbage collection, the node tries to keep any container for any pod accessible from the API. If the pod has been deleted, the containers will be as well. Containers are preserved as long the pod is not deleted and the eviction threshold is not reached. If the node is under disk pressure, it will remove containers and their logs will no longer be accessible using oc logs
.
- eviction-soft - A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period.
- eviction-hard - A hard eviction threshold has no grace period, and if observed, OpenShift Container Platform takes immediate action.
The following table lists the eviction thresholds:
Node condition | Eviction signal | Description |
---|---|---|
MemoryPressure |
| The available memory on the node. |
DiskPressure |
|
The available disk space or inodes on the node root file system, |
For evictionHard
you must specify all of these parameters. If you do not specify all parameters, only the specified parameters are applied and the garbage collection will not function properly.
If a node is oscillating above and below a soft eviction threshold, but not exceeding its associated grace period, the corresponding node would constantly oscillate between true
and false
. As a consequence, the scheduler could make poor scheduling decisions.
To protect against this oscillation, use the eviction-pressure-transition-period
flag to control how long OpenShift Container Platform must wait before transitioning out of a pressure condition. OpenShift Container Platform will not set an eviction threshold as being met for the specified pressure condition for the period specified before toggling the condition back to false.
5.7.2. Understanding how images are removed through garbage collection Copy linkLink copied to clipboard!
Image garbage collection relies on disk usage as reported by cAdvisor on the node to decide which images to remove from the node.
The policy for image garbage collection is based on two conditions:
- The percent of disk usage (expressed as an integer) which triggers image garbage collection. The default is 85.
- The percent of disk usage (expressed as an integer) to which image garbage collection attempts to free. Default is 80.
For image garbage collection, you can modify any of the following variables using a custom resource.
Setting | Description |
---|---|
| The minimum age for an unused image before the image is removed by garbage collection. The default is 2m. |
| The percent of disk usage, expressed as an integer, which triggers image garbage collection. The default is 85. |
| The percent of disk usage, expressed as an integer, to which image garbage collection attempts to free. The default is 80. |
Two lists of images are retrieved in each garbage collector run:
- A list of images currently running in at least one pod.
- A list of images available on a host.
As new containers are run, new images appear. All images are marked with a time stamp. If the image is running (the first list above) or is newly detected (the second list above), it is marked with the current time. The remaining images are already marked from the previous spins. All images are then sorted by the time stamp.
Once the collection starts, the oldest images get deleted first until the stopping criterion is met.
5.7.3. Configuring garbage collection for containers and images Copy linkLink copied to clipboard!
As an administrator, you can configure how OpenShift Container Platform performs garbage collection by creating a kubeletConfig
object for each machine config pool.
OpenShift Container Platform supports only one kubeletConfig
object for each machine config pool.
You can configure any combination of the following:
- Soft eviction for containers
- Hard eviction for containers
- Eviction for images
Prerequisites
Obtain the label associated with the static
MachineConfigPool
CRD for the type of node you want to configure. Perform one of the following steps:View the machine config pool:
oc describe machineconfigpool <name>
$ oc describe machineconfigpool <name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc describe machineconfigpool worker
$ oc describe machineconfigpool worker
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Name: worker Namespace: Labels: custom-kubelet=small-pods
Name: worker Namespace: Labels: custom-kubelet=small-pods
1 Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- If a label has been added it appears under
Labels
.
If the label is not present, add a key/value pair:
oc label machineconfigpool worker custom-kubelet=small-pods
$ oc label machineconfigpool worker custom-kubelet=small-pods
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Procedure
Create a custom resource (CR) for your configuration change.
ImportantIf there is one file system, or if
/var/lib/kubelet
and/var/lib/containers/
are in the same file system, the settings with the highest values trigger evictions, as those are met first. The file system triggers the eviction.Sample configuration for a container garbage collection CR:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Name for the object.
- 2
- Selector label.
- 3
- Type of eviction:
evictionSoft
orevictionHard
. - 4
- Eviction thresholds based on a specific eviction trigger signal.
- 5
- Grace periods for the soft eviction. This parameter does not apply to
eviction-hard
. - 6
- Eviction thresholds based on a specific eviction trigger signal. For
evictionHard
you must specify all of these parameters. If you do not specify all parameters, only the specified parameters are applied and the garbage collection will not function properly. - 7
- The duration to wait before transitioning out of an eviction pressure condition.
- 8
- The minimum age for an unused image before the image is removed by garbage collection.
- 9
- The percent of disk usage (expressed as an integer) that triggers image garbage collection.
- 10
- The percent of disk usage (expressed as an integer) that image garbage collection attempts to free.
Create the object:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc create -f gc-container.yaml
$ oc create -f gc-container.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
kubeletconfig.machineconfiguration.openshift.io/gc-container created
kubeletconfig.machineconfiguration.openshift.io/gc-container created
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that garbage collection is active. The Machine Config Pool you specified in the custom resource appears with
UPDATING
as 'true` until the change is fully implemented:oc get machineconfigpool
$ oc get machineconfigpool
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME CONFIG UPDATED UPDATING master rendered-master-546383f80705bd5aeaba93 True False worker rendered-worker-b4c51bb33ccaae6fc4a6a5 False True
NAME CONFIG UPDATED UPDATING master rendered-master-546383f80705bd5aeaba93 True False worker rendered-worker-b4c51bb33ccaae6fc4a6a5 False True
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
5.8. Allocating resources for nodes in an OpenShift Container Platform cluster Copy linkLink copied to clipboard!
To provide more reliable scheduling and minimize node resource overcommitment, reserve a portion of the CPU and memory resources for use by the underlying node components, such as kubelet
and kube-proxy
, and the remaining system components, such as sshd
and NetworkManager
. By specifying the resources to reserve, you provide the scheduler with more information about the remaining CPU and memory resources that a node has available for use by pods.
5.8.1. Understanding how to allocate resources for nodes Copy linkLink copied to clipboard!
CPU and memory resources reserved for node components in OpenShift Container Platform are based on two node settings:
Setting | Description |
---|---|
|
This setting is not used with OpenShift Container Platform. Add the CPU and memory resources that you planned to reserve to the |
|
This setting identifies the resources to reserve for the node components and system components. The default settings depend on the OpenShift Container Platform and Machine Config Operator versions. Confirm the default |
If a flag is not set, the defaults are used. If none of the flags are set, the allocated resource is set to the node’s capacity as it was before the introduction of allocatable resources.
Any CPUs specifically reserved using the reservedSystemCPUs
parameter are not available for allocation using kube-reserved
or system-reserved
.
5.8.1.1. How OpenShift Container Platform computes allocated resources Copy linkLink copied to clipboard!
An allocated amount of a resource is computed based on the following formula:
[Allocatable] = [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds]
[Allocatable] = [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds]
The withholding of Hard-Eviction-Thresholds
from Allocatable
improves system reliability because the value for Allocatable
is enforced for pods at the node level.
If Allocatable
is negative, it is set to 0
.
Each node reports the system resources that are used by the container runtime and kubelet. To simplify configuring the system-reserved
parameter, view the resource use for the node by using the node summary API. The node summary is available at /api/v1/nodes/<node>/proxy/stats/summary
.
5.8.1.2. How nodes enforce resource constraints Copy linkLink copied to clipboard!
The node is able to limit the total amount of resources that pods can consume based on the configured allocatable value. This feature significantly improves the reliability of the node by preventing pods from using CPU and memory resources that are needed by system services such as the container runtime and node agent. To improve node reliability, administrators should reserve resources based on a target for resource use.
The node enforces resource constraints by using a new cgroup hierarchy that enforces quality of service. All pods are launched in a dedicated cgroup hierarchy that is separate from system daemons.
Administrators should treat system daemons similar to pods that have a guaranteed quality of service. System daemons can burst within their bounding control groups and this behavior must be managed as part of cluster deployments. Reserve CPU and memory resources for system daemons by specifying the amount of CPU and memory resources in system-reserved
.
Enforcing system-reserved
limits can prevent critical system services from receiving CPU and memory resources. As a result, a critical system service can be ended by the out-of-memory killer. The recommendation is to enforce system-reserved
only if you have profiled the nodes exhaustively to determine precise estimates and you are confident that critical system services can recover if any process in that group is ended by the out-of-memory killer.
5.8.1.3. Understanding Eviction Thresholds Copy linkLink copied to clipboard!
If a node is under memory pressure, it can impact the entire node and all pods running on the node. For example, a system daemon that uses more than its reserved amount of memory can trigger an out-of-memory event. To avoid or reduce the probability of system out-of-memory events, the node provides out-of-resource handling.
You can reserve some memory using the --eviction-hard
flag. The node attempts to evict pods whenever memory availability on the node drops below the absolute value or percentage. If system daemons do not exist on a node, pods are limited to the memory capacity - eviction-hard
. For this reason, resources set aside as a buffer for eviction before reaching out of memory conditions are not available for pods.
The following is an example to illustrate the impact of node allocatable for memory:
-
Node capacity is
32Gi
-
--system-reserved is
3Gi
-
--eviction-hard is set to
100Mi
.
For this node, the effective node allocatable value is 28.9Gi
. If the node and system components use all their reservation, the memory available for pods is 28.9Gi
, and kubelet evicts pods when it exceeds this threshold.
If you enforce node allocatable, 28.9Gi
, with top-level cgroups, then pods can never exceed 28.9Gi
. Evictions are not performed unless system daemons consume more than 3.1Gi
of memory.
If system daemons do not use up all their reservation, with the above example, pods would face memcg OOM kills from their bounding cgroup before node evictions kick in. To better enforce QoS under this situation, the node applies the hard eviction thresholds to the top-level cgroup for all pods to be Node Allocatable + Eviction Hard Thresholds
.
If system daemons do not use up all their reservation, the node will evict pods whenever they consume more than 28.9Gi
of memory. If eviction does not occur in time, a pod will be OOM killed if pods consume 29Gi
of memory.
5.8.1.4. How the scheduler determines resource availability Copy linkLink copied to clipboard!
The scheduler uses the value of node.Status.Allocatable
instead of node.Status.Capacity
to decide if a node will become a candidate for pod scheduling.
By default, the node will report its machine capacity as fully schedulable by the cluster.
5.8.2. Configuring allocated resources for nodes Copy linkLink copied to clipboard!
OpenShift Container Platform supports the CPU and memory resource types for allocation. The ephemeral-resource
resource type is supported as well. For the cpu
type, the resource quantity is specified in units of cores, such as 200m
, 0.5
, or 1
. For memory
and ephemeral-storage
, it is specified in units of bytes, such as 200Ki
, 50Mi
, or 5Gi
.
As an administrator, you can set these using a custom resource (CR) through a set of <resource_type>=<resource_quantity>
pairs (e.g., cpu=200m,memory=512Mi
).
For details on the recommended system-reserved
values, refer to the recommended system-reserved values.
Prerequisites
Obtain the label associated with the static
MachineConfigPool
CRD for the type of node you want to configure. Perform one of the following steps:View the Machine Config Pool:
oc describe machineconfigpool <name>
$ oc describe machineconfigpool <name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc describe machineconfigpool worker
$ oc describe machineconfigpool worker
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- If a label has been added it appears under
labels
.
If the label is not present, add a key/value pair:
oc label machineconfigpool worker custom-kubelet=small-pods
$ oc label machineconfigpool worker custom-kubelet=small-pods
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
5.9. Allocating specific CPUs for nodes in a cluster Copy linkLink copied to clipboard!
When using the static CPU Manager policy, you can reserve specific CPUs for use by specific nodes in your cluster. For example, on a system with 24 CPUs, you could reserve CPUs numbered 0 - 3 for the control plane allowing the compute nodes to use CPUs 4 - 23.
5.9.1. Reserving CPUs for nodes Copy linkLink copied to clipboard!
To explicitly define a list of CPUs that are reserved for specific nodes, create a KubeletConfig
custom resource (CR) to define the reservedSystemCPUs
parameter. This list supersedes the CPUs that might be reserved using the systemReserved
and kubeReserved
parameters.
Procedure
Obtain the label associated with the machine config pool (MCP) for the type of node you want to configure:
oc describe machineconfigpool <name>
$ oc describe machineconfigpool <name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc describe machineconfigpool worker
$ oc describe machineconfigpool worker
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Get the MCP label.
Create a YAML file for the
KubeletConfig
CR:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the CR object:
oc create -f <file_name>.yaml
$ oc create -f <file_name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
5.10. Machine Config Daemon metrics Copy linkLink copied to clipboard!
The Machine Config Daemon is a part of the Machine Config Operator. It runs on every node in the cluster. The Machine Config Daemon manages configuration changes and updates on each of the nodes.
5.10.1. Machine Config Daemon metrics Copy linkLink copied to clipboard!
Beginning with OpenShift Container Platform 4.3, the Machine Config Daemon provides a set of metrics. These metrics can be accessed using the Prometheus Cluster Monitoring stack.
The following table describes this set of metrics.
Metrics marked with *
in the *Name* and Description columns represent serious errors that might cause performance problems. Such problems might prevent updates and upgrades from proceeding.
While some entries contain commands for getting specific logs, the most comprehensive set of logs is available using the oc adm must-gather
command.
Name | Format | Description | Notes |
---|---|---|---|
|
| Shows the OS that MCD is running on, such as RHCOS or RHEL. In case of RHCOS, the version is provided. | |
|
| Shows the number of successful SSH authentications into the node. | The non-zero value shows that someone might have made manual changes to the node. Such changes might cause irreconcilable errors due to the differences between the state on the disk and the state defined in the machine configuration. |
|
| Logs errors received during failed drain. * |
While drains might need multiple tries to succeed, terminal failed drains prevent updates from proceeding. The For further investigation, see the logs by running:
|
|
| Logs errors encountered during pivot. * | Pivot errors might prevent OS upgrades from proceeding. For further investigation, run this command to access the node and see all its logs:
Alternatively, you can run this command to only see the logs from the
|
|
| State of Machine Config Daemon for the indicated node. Possible states are "Done", "Working", and "Degraded". In case of "Degraded", the reason is included. | For further investigation, see the logs by running:
|
|
| Logs kubelet health failures. * | This is expected to be empty, with failure count of 0. If failure count exceeds 2, the error indicating threshold is exceeded. This indicates a possible issue with the health of the kubelet. For further investigation, run this command to access the node and see all its logs:
|
|
| Logs the failed reboots and the corresponding errors. * | This is expected to be empty, which indicates a successful reboot. For further investigation, see the logs by running:
|
|
| Logs success or failure of configuration updates and the corresponding errors. |
The expected value is For further investigation, see the logs by running:
|
Chapter 6. Working with containers Copy linkLink copied to clipboard!
6.1. Understanding Containers Copy linkLink copied to clipboard!
The basic units of OpenShift Container Platform applications are called containers. Linux container technologies are lightweight mechanisms for isolating running processes so that they are limited to interacting with only their designated resources.
Many application instances can be running in containers on a single host without visibility into each others' processes, files, network, and so on. Typically, each container provides a single service (often called a "micro-service"), such as a web server or a database, though containers can be used for arbitrary workloads.
The Linux kernel has been incorporating capabilities for container technologies for years. OpenShift Container Platform and Kubernetes add the ability to orchestrate containers across multi-host installations.
About containers and RHEL kernel memory
Due to Red Hat Enterprise Linux (RHEL) behavior, a container on a node with high CPU usage might seem to consume more memory than expected. The higher memory consumption could be caused by the kmem_cache
in the RHEL kernel. The RHEL kernel creates a kmem_cache
for each cgroup. For added performance, the kmem_cache
contains a cpu_cache
, and a node cache for any NUMA nodes. These caches all consume kernel memory.
The amount of memory stored in those caches is proportional to the number of CPUs that the system uses. As a result, a higher number of CPUs results in a greater amount of kernel memory being held in these caches. Higher amounts of kernel memory in these caches can cause OpenShift Container Platform containers to exceed the configured memory limits, resulting in the container being killed.
To avoid losing containers due to kernel memory issues, ensure that the containers request sufficient memory. You can use the following formula to estimate the amount of memory consumed by the kmem_cache
, where nproc
is the number of processing units available that are reported by the nproc
command. The lower limit of container requests should be this value plus the container memory requirements:
$(nproc) X 1/2 MiB
$(nproc) X 1/2 MiB
6.2. Using Init Containers to perform tasks before a pod is deployed Copy linkLink copied to clipboard!
OpenShift Container Platform provides init containers, which are specialized containers that run before application containers and can contain utilities or setup scripts not present in an app image.
6.2.1. Understanding Init Containers Copy linkLink copied to clipboard!
You can use an Init Container resource to perform tasks before the rest of a pod is deployed.
A pod can have Init Containers in addition to application containers. Init containers allow you to reorganize setup scripts and binding code.
An Init Container can:
- Contain and run utilities that are not desirable to include in the app Container image for security reasons.
- Contain utilities or custom code for setup that is not present in an app image. For example, there is no requirement to make an image FROM another image just to use a tool like sed, awk, python, or dig during setup.
- Use Linux namespaces so that they have different filesystem views from app containers, such as access to secrets that application containers are not able to access.
Each Init Container must complete successfully before the next one is started. So, Init Containers provide an easy way to block or delay the startup of app containers until some set of preconditions are met.
For example, the following are some ways you can use Init Containers:
Wait for a service to be created with a shell command like:
for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit 1
for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit 1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Register this pod with a remote server from the downward API with a command like:
curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d ‘instance=$()&ip=$()’
$ curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d ‘instance=$()&ip=$()’
Copy to Clipboard Copied! Toggle word wrap Toggle overflow -
Wait for some time before starting the app Container with a command like
sleep 60
. - Clone a git repository into a volume.
- Place values into a configuration file and run a template tool to dynamically generate a configuration file for the main app Container. For example, place the POD_IP value in a configuration and generate the main app configuration file using Jinja.
See the Kubernetes documentation for more information.
6.2.2. Creating Init Containers Copy linkLink copied to clipboard!
The following example outlines a simple pod which has two Init Containers. The first waits for myservice
and the second waits for mydb
. After both containers complete, the pod begins.
Procedure
Create a YAML file for the Init Container:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a YAML file for the
myservice
service.Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a YAML file for the
mydb
service.Copy to Clipboard Copied! Toggle word wrap Toggle overflow Run the following command to create the
myapp-pod
:oc create -f myapp.yaml
$ oc create -f myapp.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
pod/myapp-pod created
pod/myapp-pod created
Copy to Clipboard Copied! Toggle word wrap Toggle overflow View the status of the pod:
oc get pods
$ oc get pods
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME READY STATUS RESTARTS AGE myapp-pod 0/1 Init:0/2 0 5s
NAME READY STATUS RESTARTS AGE myapp-pod 0/1 Init:0/2 0 5s
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Note that the pod status indicates it is waiting
Run the following commands to create the services:
oc create -f mydb.yaml
$ oc create -f mydb.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc create -f myservice.yaml
$ oc create -f myservice.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow View the status of the pod:
oc get pods
$ oc get pods
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME READY STATUS RESTARTS AGE myapp-pod 1/1 Running 0 2m
NAME READY STATUS RESTARTS AGE myapp-pod 1/1 Running 0 2m
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.3. Using volumes to persist container data Copy linkLink copied to clipboard!
Files in a container are ephemeral. As such, when a container crashes or stops, the data is lost. You can use volumes to persist the data used by the containers in a pod. A volume is directory, accessible to the Containers in a pod, where data is stored for the life of the pod.
6.3.1. Understanding volumes Copy linkLink copied to clipboard!
Volumes are mounted file systems available to pods and their containers which may be backed by a number of host-local or network attached storage endpoints. Containers are not persistent by default; on restart, their contents are cleared.
To ensure that the file system on the volume contains no errors and, if errors are present, to repair them when possible, OpenShift Container Platform invokes the fsck
utility prior to the mount
utility. This occurs when either adding a volume or updating an existing volume.
The simplest volume type is emptyDir
, which is a temporary directory on a single machine. Administrators may also allow you to request a persistent volume that is automatically attached to your pods.
emptyDir
volume storage may be restricted by a quota based on the pod’s FSGroup, if the FSGroup parameter is enabled by your cluster administrator.
6.3.2. Working with volumes using the OpenShift Container Platform CLI Copy linkLink copied to clipboard!
You can use the CLI command oc set volume
to add and remove volumes and volume mounts for any object that has a pod template like replication controllers or deployment configs. You can also list volumes in pods or any object that has a pod template.
The oc set volume
command uses the following general syntax:
oc set volume <object_selection> <operation> <mandatory_parameters> <options>
$ oc set volume <object_selection> <operation> <mandatory_parameters> <options>
- Object selection
-
Specify one of the following for the
object_selection
parameter in theoc set volume
command:
Syntax | Description | Example |
---|---|---|
|
Selects |
|
|
Selects |
|
|
Selects resources of type |
|
|
Selects all resources of type |
|
| File name, directory, or URL to file to use to edit the resource. |
|
- Operation
-
Specify
--add
or--remove
for theoperation
parameter in theoc set volume
command. - Mandatory parameters
- Any mandatory parameters are specific to the selected operation and are discussed in later sections.
- Options
- Any options are specific to the selected operation and are discussed in later sections.
6.3.3. Listing volumes and volume mounts in a pod Copy linkLink copied to clipboard!
You can list volumes and volume mounts in pods or pod templates:
Procedure
To list volumes:
oc set volume <object_type>/<name> [options]
$ oc set volume <object_type>/<name> [options]
List volume supported options:
Option | Description | Default |
---|---|---|
| Name of the volume. | |
|
Select containers by name. It can also take wildcard |
|
For example:
To list all volumes for pod p1:
oc set volume pod/p1
$ oc set volume pod/p1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To list volume v1 defined on all deployment configs:
oc set volume dc --all --name=v1
$ oc set volume dc --all --name=v1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.3.4. Adding volumes to a pod Copy linkLink copied to clipboard!
You can add volumes and volume mounts to a pod.
Procedure
To add a volume, a volume mount, or both to pod templates:
oc set volume <object_type>/<name> --add [options]
$ oc set volume <object_type>/<name> --add [options]
Option | Description | Default |
---|---|---|
| Name of the volume. | Automatically generated, if not specified. |
|
Name of the volume source. Supported values: |
|
|
Select containers by name. It can also take wildcard |
|
|
Mount path inside the selected containers. Do not mount to the container root, | |
|
Host path. Mandatory parameter for | |
|
Name of the secret. Mandatory parameter for | |
|
Name of the configmap. Mandatory parameter for | |
|
Name of the persistent volume claim. Mandatory parameter for | |
|
Details of volume source as a JSON string. Recommended if the desired volume source is not supported by | |
|
Display the modified objects instead of updating them on the server. Supported values: | |
| Output the modified objects with the given version. |
|
For example:
To add a new volume source emptyDir to the registry
DeploymentConfig
object:oc set volume dc/registry --add
$ oc set volume dc/registry --add
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To add volume v1 with secret secret1 for replication controller r1 and mount inside the containers at /data:
oc set volume rc/r1 --add --name=v1 --type=secret --secret-name='secret1' --mount-path=/data
$ oc set volume rc/r1 --add --name=v1 --type=secret --secret-name='secret1' --mount-path=/data
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To add existing persistent volume v1 with claim name pvc1 to deployment configuration dc.json on disk, mount the volume on container c1 at /data, and update the
DeploymentConfig
object on the server:oc set volume -f dc.json --add --name=v1 --type=persistentVolumeClaim \ --claim-name=pvc1 --mount-path=/data --containers=c1
$ oc set volume -f dc.json --add --name=v1 --type=persistentVolumeClaim \ --claim-name=pvc1 --mount-path=/data --containers=c1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To add a volume v1 based on Git repository https://github.com/namespace1/project1 with revision 5125c45f9f563 for all replication controllers:
oc set volume rc --all --add --name=v1 \ --source='{"gitRepo": {
$ oc set volume rc --all --add --name=v1 \ --source='{"gitRepo": { "repository": "https://github.com/namespace1/project1", "revision": "5125c45f9f563" }}'
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.3.5. Updating volumes and volume mounts in a pod Copy linkLink copied to clipboard!
You can modify the volumes and volume mounts in a pod.
Procedure
Updating existing volumes using the --overwrite
option:
oc set volume <object_type>/<name> --add --overwrite [options]
$ oc set volume <object_type>/<name> --add --overwrite [options]
For example:
To replace existing volume v1 for replication controller r1 with existing persistent volume claim pvc1:
oc set volume rc/r1 --add --overwrite --name=v1 --type=persistentVolumeClaim --claim-name=pvc1
$ oc set volume rc/r1 --add --overwrite --name=v1 --type=persistentVolumeClaim --claim-name=pvc1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To change the
DeploymentConfig
object d1 mount point to /opt for volume v1:oc set volume dc/d1 --add --overwrite --name=v1 --mount-path=/opt
$ oc set volume dc/d1 --add --overwrite --name=v1 --mount-path=/opt
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.3.6. Removing volumes and volume mounts from a pod Copy linkLink copied to clipboard!
You can remove a volume or volume mount from a pod.
Procedure
To remove a volume from pod templates:
oc set volume <object_type>/<name> --remove [options]
$ oc set volume <object_type>/<name> --remove [options]
Option | Description | Default |
---|---|---|
| Name of the volume. | |
|
Select containers by name. It can also take wildcard |
|
| Indicate that you want to remove multiple volumes at once. | |
|
Display the modified objects instead of updating them on the server. Supported values: | |
| Output the modified objects with the given version. |
|
For example:
To remove a volume v1 from the
DeploymentConfig
object d1:oc set volume dc/d1 --remove --name=v1
$ oc set volume dc/d1 --remove --name=v1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To unmount volume v1 from container c1 for the
DeploymentConfig
object d1 and remove the volume v1 if it is not referenced by any containers on d1:oc set volume dc/d1 --remove --name=v1 --containers=c1
$ oc set volume dc/d1 --remove --name=v1 --containers=c1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To remove all volumes for replication controller r1:
oc set volume rc/r1 --remove --confirm
$ oc set volume rc/r1 --remove --confirm
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.3.7. Configuring volumes for multiple uses in a pod Copy linkLink copied to clipboard!
You can configure a volume to allows you to share one volume for multiple uses in a single pod using the volumeMounts.subPath
property to specify a subPath
value inside a volume instead of the volume’s root.
Procedure
View the list of files in the volume, run the
oc rsh
command:oc rsh <pod>
$ oc rsh <pod>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
ls /path/to/volume/subpath/mount
sh-4.2$ ls /path/to/volume/subpath/mount example_file1 example_file2 example_file3
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Specify the
subPath
:Example
Pod
spec withsubPath
parameterCopy to Clipboard Copied! Toggle word wrap Toggle overflow
6.4. Mapping volumes using projected volumes Copy linkLink copied to clipboard!
A projected volume maps several existing volume sources into the same directory.
The following types of volume sources can be projected:
- Secrets
- Config Maps
- Downward API
All sources are required to be in the same namespace as the pod.
6.4.1. Understanding projected volumes Copy linkLink copied to clipboard!
Projected volumes can map any combination of these volume sources into a single directory, allowing the user to:
- automatically populate a single volume with the keys from multiple secrets, config maps, and with downward API information, so that I can synthesize a single directory with various sources of information;
- populate a single volume with the keys from multiple secrets, config maps, and with downward API information, explicitly specifying paths for each item, so that I can have full control over the contents of that volume.
The following general scenarios show how you can use projected volumes.
- Config map, secrets, Downward API.
-
Projected volumes allow you to deploy containers with configuration data that includes passwords. An application using these resources could be deploying Red Hat OpenStack Platform (RHOSP) on Kubernetes. The configuration data might have to be assembled differently depending on if the services are going to be used for production or for testing. If a pod is labeled with production or testing, the downward API selector
metadata.labels
can be used to produce the correct RHOSP configs. - Config map + secrets.
- Projected volumes allow you to deploy containers involving configuration data and passwords. For example, you might execute a config map with some sensitive encrypted tasks that are decrypted using a vault password file.
- ConfigMap + Downward API.
-
Projected volumes allow you to generate a config including the pod name (available via the
metadata.name
selector). This application can then pass the pod name along with requests in order to easily determine the source without using IP tracking. - Secrets + Downward API.
-
Projected volumes allow you to use a secret as a public key to encrypt the namespace of the pod (available via the
metadata.namespace
selector). This example allows the Operator to use the application to deliver the namespace information securely without using an encrypted transport.
6.4.1.1. Example Pod specs Copy linkLink copied to clipboard!
The following are examples of Pod
specs for creating projected volumes.
Pod with a secret, a Downward API, and a config map
- 1
- Add a
volumeMounts
section for each container that needs the secret. - 2
- Specify a path to an unused directory where the secret will appear.
- 3
- Set
readOnly
totrue
. - 4
- Add a
volumes
block to list each projected volume source. - 5
- Specify any name for the volume.
- 6
- Set the execute permission on the files.
- 7
- Add a secret. Enter the name of the secret object. Each secret you want to use must be listed.
- 8
- Specify the path to the secrets file under the
mountPath
. Here, the secrets file is in /projected-volume/my-group/my-username. - 9
- Add a Downward API source.
- 10
- Add a ConfigMap source.
- 11
- Set the mode for the specific projection
If there are multiple containers in the pod, each container needs a volumeMounts
section, but only one volumes
section is needed.
Pod with multiple secrets with a non-default permission mode set
The defaultMode
can only be specified at the projected level and not for each volume source. However, as illustrated above, you can explicitly set the mode
for each individual projection.
6.4.1.2. Pathing Considerations Copy linkLink copied to clipboard!
- Collisions Between Keys when Configured Paths are Identical
If you configure any keys with the same path, the pod spec will not be accepted as valid. In the following example, the specified path for
mysecret
andmyconfigmap
are the same:Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Consider the following situations related to the volume file paths.
- Collisions Between Keys without Configured Paths
- The only run-time validation that can occur is when all the paths are known at pod creation, similar to the above scenario. Otherwise, when a conflict occurs the most recent specified resource will overwrite anything preceding it (this is true for resources that are updated after pod creation as well).
- Collisions when One Path is Explicit and the Other is Automatically Projected
- In the event that there is a collision due to a user specified path matching data that is automatically projected, the latter resource will overwrite anything preceding it as before
6.4.2. Configuring a Projected Volume for a Pod Copy linkLink copied to clipboard!
When creating projected volumes, consider the volume file path situations described in Understanding projected volumes.
The following example shows how to use a projected volume to mount an existing secret volume source. The steps can be used to create a user name and password secrets from local files. You then create a pod that runs one container, using a projected volume to mount the secrets into the same shared directory.
Procedure
To use a projected volume to mount an existing secret volume source.
Create files containing the secrets, entering the following, replacing the password and user information as appropriate:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The
user
andpass
values can be any valid string that is base64 encoded.The following example shows
admin
in base64:echo -n "admin" | base64
$ echo -n "admin" | base64
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
YWRtaW4=
YWRtaW4=
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The following example shows the password
1f2d1e2e67df
in base64:.echo -n "1f2d1e2e67df" | base64
$ echo -n "1f2d1e2e67df" | base64
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
MWYyZDFlMmU2N2Rm
MWYyZDFlMmU2N2Rm
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Use the following command to create the secrets:
oc create -f <secrets-filename>
$ oc create -f <secrets-filename>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc create -f secret.yaml
$ oc create -f secret.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
secret "mysecret" created
secret "mysecret" created
Copy to Clipboard Copied! Toggle word wrap Toggle overflow You can check that the secret was created using the following commands:
oc get secret <secret-name>
$ oc get secret <secret-name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc get secret mysecret
$ oc get secret mysecret
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME TYPE DATA AGE mysecret Opaque 2 17h
NAME TYPE DATA AGE mysecret Opaque 2 17h
Copy to Clipboard Copied! Toggle word wrap Toggle overflow oc get secret <secret-name> -o yaml
$ oc get secret <secret-name> -o yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc get secret mysecret -o yaml
$ oc get secret mysecret -o yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a pod configuration file similar to the following that includes a
volumes
section:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the pod from the configuration file:
oc create -f <your_yaml_file>.yaml
$ oc create -f <your_yaml_file>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc create -f secret-pod.yaml
$ oc create -f secret-pod.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
pod "test-projected-volume" created
pod "test-projected-volume" created
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the pod container is running, and then watch for changes to the pod:
oc get pod <name>
$ oc get pod <name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc get pod test-projected-volume
$ oc get pod test-projected-volume
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The output should appear similar to the following:
Example output
NAME READY STATUS RESTARTS AGE test-projected-volume 1/1 Running 0 14s
NAME READY STATUS RESTARTS AGE test-projected-volume 1/1 Running 0 14s
Copy to Clipboard Copied! Toggle word wrap Toggle overflow In another terminal, use the
oc exec
command to open a shell to the running container:oc exec -it <pod> <command>
$ oc exec -it <pod> <command>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc exec -it test-projected-volume -- /bin/sh
$ oc exec -it test-projected-volume -- /bin/sh
Copy to Clipboard Copied! Toggle word wrap Toggle overflow In your shell, verify that the
projected-volumes
directory contains your projected sources:/ # ls
/ # ls
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
bin home root tmp dev proc run usr etc projected-volume sys var
bin home root tmp dev proc run usr etc projected-volume sys var
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.5. Allowing containers to consume API objects Copy linkLink copied to clipboard!
The Downward API is a mechanism that allows containers to consume information about API objects without coupling to OpenShift Container Platform. Such information includes the pod’s name, namespace, and resource values. Containers can consume information from the downward API using environment variables or a volume plug-in.
6.5.1. Expose pod information to Containers using the Downward API Copy linkLink copied to clipboard!
The Downward API contains such information as the pod’s name, project, and resource values. Containers can consume information from the downward API using environment variables or a volume plug-in.
Fields within the pod are selected using the FieldRef
API type. FieldRef
has two fields:
Field | Description |
---|---|
| The path of the field to select, relative to the pod. |
|
The API version to interpret the |
Currently, the valid selectors in the v1 API include:
Selector | Description |
---|---|
| The pod’s name. This is supported in both environment variables and volumes. |
| The pod’s namespace.This is supported in both environment variables and volumes. |
| The pod’s labels. This is only supported in volumes and not in environment variables. |
| The pod’s annotations. This is only supported in volumes and not in environment variables. |
| The pod’s IP. This is only supported in environment variables and not volumes. |
The apiVersion
field, if not specified, defaults to the API version of the enclosing pod template.
6.5.2. Understanding how to consume container values using the downward API Copy linkLink copied to clipboard!
You containers can consume API values using environment variables or a volume plug-in. Depending on the method you choose, containers can consume:
- Pod name
- Pod project/namespace
- Pod annotations
- Pod labels
Annotations and labels are available using only a volume plug-in.
6.5.2.1. Consuming container values using environment variables Copy linkLink copied to clipboard!
When using a container’s environment variables, use the EnvVar
type’s valueFrom
field (of type EnvVarSource
) to specify that the variable’s value should come from a FieldRef
source instead of the literal value specified by the value
field.
Only constant attributes of the pod can be consumed this way, as environment variables cannot be updated once a process is started in a way that allows the process to be notified that the value of a variable has changed. The fields supported using environment variables are:
- Pod name
- Pod project/namespace
Procedure
To use environment variables
Create a
pod.yaml
file:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the pod from the
pod.yaml
file:oc create -f pod.yaml
$ oc create -f pod.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Check the container’s logs for the
MY_POD_NAME
andMY_POD_NAMESPACE
values:oc logs -p dapi-env-test-pod
$ oc logs -p dapi-env-test-pod
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.5.2.2. Consuming container values using a volume plug-in Copy linkLink copied to clipboard!
You containers can consume API values using a volume plug-in.
Containers can consume:
- Pod name
- Pod project/namespace
- Pod annotations
- Pod labels
Procedure
To use the volume plug-in:
Create a
volume-pod.yaml
file:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the pod from the
volume-pod.yaml
file:oc create -f volume-pod.yaml
$ oc create -f volume-pod.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Check the container’s logs and verify the presence of the configured fields:
oc logs -p dapi-volume-test-pod
$ oc logs -p dapi-volume-test-pod
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.5.3. Understanding how to consume container resources using the Downward API Copy linkLink copied to clipboard!
When creating pods, you can use the Downward API to inject information about computing resource requests and limits so that image and application authors can correctly create an image for specific environments.
You can do this using environment variable or a volume plug-in.
6.5.3.1. Consuming container resources using environment variables Copy linkLink copied to clipboard!
When creating pods, you can use the Downward API to inject information about computing resource requests and limits using environment variables.
Procedure
To use environment variables:
When creating a pod configuration, specify environment variables that correspond to the contents of the
resources
field in thespec.container
field:Copy to Clipboard Copied! Toggle word wrap Toggle overflow If the resource limits are not included in the container configuration, the downward API defaults to the node’s CPU and memory allocatable values.
Create the pod from the
pod.yaml
file:oc create -f pod.yaml
$ oc create -f pod.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.5.3.2. Consuming container resources using a volume plug-in Copy linkLink copied to clipboard!
When creating pods, you can use the Downward API to inject information about computing resource requests and limits using a volume plug-in.
Procedure
To use the Volume Plug-in:
When creating a pod configuration, use the
spec.volumes.downwardAPI.items
field to describe the desired resources that correspond to thespec.resources
field:Copy to Clipboard Copied! Toggle word wrap Toggle overflow If the resource limits are not included in the container configuration, the Downward API defaults to the node’s CPU and memory allocatable values.
Create the pod from the
volume-pod.yaml
file:oc create -f volume-pod.yaml
$ oc create -f volume-pod.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.5.4. Consuming secrets using the Downward API Copy linkLink copied to clipboard!
When creating pods, you can use the downward API to inject secrets so image and application authors can create an image for specific environments.
Procedure
Create a
secret.yaml
file:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a
Secret
object from thesecret.yaml
file:oc create -f secret.yaml
$ oc create -f secret.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a
pod.yaml
file that references theusername
field from the aboveSecret
object:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the pod from the
pod.yaml
file:oc create -f pod.yaml
$ oc create -f pod.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Check the container’s logs for the
MY_SECRET_USERNAME
value:oc logs -p dapi-env-test-pod
$ oc logs -p dapi-env-test-pod
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.5.5. Consuming configuration maps using the Downward API Copy linkLink copied to clipboard!
When creating pods, you can use the Downward API to inject configuration map values so image and application authors can create an image for specific environments.
Procedure
Create a
configmap.yaml
file:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a
ConfigMap
object from theconfigmap.yaml
file:oc create -f configmap.yaml
$ oc create -f configmap.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a
pod.yaml
file that references the aboveConfigMap
object:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the pod from the
pod.yaml
file:oc create -f pod.yaml
$ oc create -f pod.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Check the container’s logs for the
MY_CONFIGMAP_VALUE
value:oc logs -p dapi-env-test-pod
$ oc logs -p dapi-env-test-pod
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.5.6. Referencing environment variables Copy linkLink copied to clipboard!
When creating pods, you can reference the value of a previously defined environment variable by using the $()
syntax. If the environment variable reference can not be resolved, the value will be left as the provided string.
Procedure
Create a
pod.yaml
file that references an existingenvironment variable
:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the pod from the
pod.yaml
file:oc create -f pod.yaml
$ oc create -f pod.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Check the container’s logs for the
MY_ENV_VAR_REF_ENV
value:oc logs -p dapi-env-test-pod
$ oc logs -p dapi-env-test-pod
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.5.7. Escaping environment variable references Copy linkLink copied to clipboard!
When creating a pod, you can escape an environment variable reference by using a double dollar sign. The value will then be set to a single dollar sign version of the provided value.
Procedure
Create a
pod.yaml
file that references an existingenvironment variable
:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the pod from the
pod.yaml
file:oc create -f pod.yaml
$ oc create -f pod.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Check the container’s logs for the
MY_NEW_ENV
value:oc logs -p dapi-env-test-pod
$ oc logs -p dapi-env-test-pod
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.6. Copying files to or from an OpenShift Container Platform container Copy linkLink copied to clipboard!
You can use the CLI to copy local files to or from a remote directory in a container using the rsync
command.
6.6.1. Understanding how to copy files Copy linkLink copied to clipboard!
The oc rsync
command, or remote sync, is a useful tool for copying database archives to and from your pods for backup and restore purposes. You can also use oc rsync
to copy source code changes into a running pod for development debugging, when the running pod supports hot reload of source files.
oc rsync <source> <destination> [-c <container>]
$ oc rsync <source> <destination> [-c <container>]
6.6.1.1. Requirements Copy linkLink copied to clipboard!
- Specifying the Copy Source
-
The source argument of the
oc rsync
command must point to either a local directory or a pod directory. Individual files are not supported.
When specifying a pod directory the directory name must be prefixed with the pod name:
<pod name>:<dir>
<pod name>:<dir>
If the directory name ends in a path separator (/
), only the contents of the directory are copied to the destination. Otherwise, the directory and its contents are copied to the destination.
- Specifying the Copy Destination
-
The destination argument of the
oc rsync
command must point to a directory. If the directory does not exist, butrsync
is used for copy, the directory is created for you. - Deleting Files at the Destination
-
The
--delete
flag may be used to delete any files in the remote directory that are not in the local directory. - Continuous Syncing on File Change
-
Using the
--watch
option causes the command to monitor the source path for any file system changes, and synchronizes changes when they occur. With this argument, the command runs forever.
Synchronization occurs after short quiet periods to ensure a rapidly changing file system does not result in continuous synchronization calls.
When using the --watch
option, the behavior is effectively the same as manually invoking oc rsync
repeatedly, including any arguments normally passed to oc rsync
. Therefore, you can control the behavior via the same flags used with manual invocations of oc rsync
, such as --delete
.
6.6.2. Copying files to and from containers Copy linkLink copied to clipboard!
Support for copying local files to or from a container is built into the CLI.
Prerequisites
When working with oc rsync
, note the following:
- rsync must be installed
-
The
oc rsync
command uses the localrsync
tool if present on the client machine and the remote container.
If rsync
is not found locally or in the remote container, a tar archive is created locally and sent to the container where the tar utility is used to extract the files. If tar is not available in the remote container, the copy will fail.
The tar copy method does not provide the same functionality as oc rsync
. For example, oc rsync
creates the destination directory if it does not exist and only sends files that are different between the source and the destination.
In Windows, the cwRsync
client should be installed and added to the PATH for use with the oc rsync
command.
Procedure
To copy a local directory to a pod directory:
oc rsync <local-dir> <pod-name>:/<remote-dir> -c <container-name>
$ oc rsync <local-dir> <pod-name>:/<remote-dir> -c <container-name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc rsync /home/user/source devpod1234:/src -c user-container
$ oc rsync /home/user/source devpod1234:/src -c user-container
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To copy a pod directory to a local directory:
oc rsync devpod1234:/src /home/user/source
$ oc rsync devpod1234:/src /home/user/source
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
oc rsync devpod1234:/src/status.txt /home/user/
$ oc rsync devpod1234:/src/status.txt /home/user/
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.6.3. Using advanced Rsync features Copy linkLink copied to clipboard!
The oc rsync
command exposes fewer command line options than standard rsync
. In the case that you want to use a standard rsync
command line option that is not available in oc rsync
, for example the --exclude-from=FILE
option, it might be possible to use standard rsync
's --rsh
(-e
) option or RSYNC_RSH
environment variable as a workaround, as follows:
rsync --rsh='oc rsh' --exclude-from=FILE SRC POD:DEST
$ rsync --rsh='oc rsh' --exclude-from=FILE SRC POD:DEST
or:
Export the RSYNC_RSH
variable:
export RSYNC_RSH='oc rsh'
$ export RSYNC_RSH='oc rsh'
Then, run the rsync command:
rsync --exclude-from=FILE SRC POD:DEST
$ rsync --exclude-from=FILE SRC POD:DEST
Both of the above examples configure standard rsync
to use oc rsh
as its remote shell program to enable it to connect to the remote pod, and are an alternative to running oc rsync
.
6.7. Executing remote commands in an OpenShift Container Platform container Copy linkLink copied to clipboard!
You can use the CLI to execute remote commands in an OpenShift Container Platform container.
6.7.1. Executing remote commands in containers Copy linkLink copied to clipboard!
Support for remote container command execution is built into the CLI.
Procedure
To run a command in a container:
oc exec <pod> [-c <container>] <command> [<arg_1> ... <arg_n>]
$ oc exec <pod> [-c <container>] <command> [<arg_1> ... <arg_n>]
For example:
oc exec mypod date
$ oc exec mypod date
Example output
Thu Apr 9 02:21:53 UTC 2015
Thu Apr 9 02:21:53 UTC 2015
For security purposes, the oc exec
command does not work when accessing privileged containers except when the command is executed by a cluster-admin
user.
6.7.2. Protocol for initiating a remote command from a client Copy linkLink copied to clipboard!
Clients initiate the execution of a remote command in a container by issuing a request to the Kubernetes API server:
/proxy/nodes/<node_name>/exec/<namespace>/<pod>/<container>?command=<command>
/proxy/nodes/<node_name>/exec/<namespace>/<pod>/<container>?command=<command>
In the above URL:
-
<node_name>
is the FQDN of the node. -
<namespace>
is the project of the target pod. -
<pod>
is the name of the target pod. -
<container>
is the name of the target container. -
<command>
is the desired command to be executed.
For example:
/proxy/nodes/node123.openshift.com/exec/myns/mypod/mycontainer?command=date
/proxy/nodes/node123.openshift.com/exec/myns/mypod/mycontainer?command=date
Additionally, the client can add parameters to the request to indicate if:
- the client should send input to the remote container’s command (stdin).
- the client’s terminal is a TTY.
- the remote container’s command should send output from stdout to the client.
- the remote container’s command should send output from stderr to the client.
After sending an exec
request to the API server, the client upgrades the connection to one that supports multiplexed streams; the current implementation uses HTTP/2.
The client creates one stream each for stdin, stdout, and stderr. To distinguish among the streams, the client sets the streamType
header on the stream to one of stdin
, stdout
, or stderr
.
The client closes all streams, the upgraded connection, and the underlying connection when it is finished with the remote command execution request.
6.8. Using port forwarding to access applications in a container Copy linkLink copied to clipboard!
OpenShift Container Platform supports port forwarding to pods.
6.8.1. Understanding port forwarding Copy linkLink copied to clipboard!
You can use the CLI to forward one or more local ports to a pod. This allows you to listen on a given or random port locally, and have data forwarded to and from given ports in the pod.
Support for port forwarding is built into the CLI:
oc port-forward <pod> [<local_port>:]<remote_port> [...[<local_port_n>:]<remote_port_n>]
$ oc port-forward <pod> [<local_port>:]<remote_port> [...[<local_port_n>:]<remote_port_n>]
The CLI listens on each local port specified by the user, forwarding using the protocol described below.
Ports may be specified using the following formats:
| The client listens on port 5000 locally and forwards to 5000 in the pod. |
| The client listens on port 6000 locally and forwards to 5000 in the pod. |
| The client selects a free local port and forwards to 5000 in the pod. |
OpenShift Container Platform handles port-forward requests from clients. Upon receiving a request, OpenShift Container Platform upgrades the response and waits for the client to create port-forwarding streams. When OpenShift Container Platform receives a new stream, it copies data between the stream and the pod’s port.
Architecturally, there are options for forwarding to a pod’s port. The supported OpenShift Container Platform implementation invokes nsenter
directly on the node host to enter the pod’s network namespace, then invokes socat
to copy data between the stream and the pod’s port. However, a custom implementation could include running a helper pod that then runs nsenter
and socat
, so that those binaries are not required to be installed on the host.
6.8.2. Using port forwarding Copy linkLink copied to clipboard!
You can use the CLI to port-forward one or more local ports to a pod.
Procedure
Use the following command to listen on the specified port in a pod:
oc port-forward <pod> [<local_port>:]<remote_port> [...[<local_port_n>:]<remote_port_n>]
$ oc port-forward <pod> [<local_port>:]<remote_port> [...[<local_port_n>:]<remote_port_n>]
For example:
Use the following command to listen on ports
5000
and6000
locally and forward data to and from ports5000
and6000
in the pod:oc port-forward <pod> 5000 6000
$ oc port-forward <pod> 5000 6000
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Forwarding from 127.0.0.1:5000 -> 5000 Forwarding from [::1]:5000 -> 5000 Forwarding from 127.0.0.1:6000 -> 6000 Forwarding from [::1]:6000 -> 6000
Forwarding from 127.0.0.1:5000 -> 5000 Forwarding from [::1]:5000 -> 5000 Forwarding from 127.0.0.1:6000 -> 6000 Forwarding from [::1]:6000 -> 6000
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Use the following command to listen on port
8888
locally and forward to5000
in the pod:oc port-forward <pod> 8888:5000
$ oc port-forward <pod> 8888:5000
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Forwarding from 127.0.0.1:8888 -> 5000 Forwarding from [::1]:8888 -> 5000
Forwarding from 127.0.0.1:8888 -> 5000 Forwarding from [::1]:8888 -> 5000
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Use the following command to listen on a free port locally and forward to
5000
in the pod:oc port-forward <pod> :5000
$ oc port-forward <pod> :5000
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Forwarding from 127.0.0.1:42390 -> 5000 Forwarding from [::1]:42390 -> 5000
Forwarding from 127.0.0.1:42390 -> 5000 Forwarding from [::1]:42390 -> 5000
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Or:
oc port-forward <pod> 0:5000
$ oc port-forward <pod> 0:5000
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.8.3. Protocol for initiating port forwarding from a client Copy linkLink copied to clipboard!
Clients initiate port forwarding to a pod by issuing a request to the Kubernetes API server:
/proxy/nodes/<node_name>/portForward/<namespace>/<pod>
/proxy/nodes/<node_name>/portForward/<namespace>/<pod>
In the above URL:
-
<node_name>
is the FQDN of the node. -
<namespace>
is the namespace of the target pod. -
<pod>
is the name of the target pod.
For example:
/proxy/nodes/node123.openshift.com/portForward/myns/mypod
/proxy/nodes/node123.openshift.com/portForward/myns/mypod
After sending a port forward request to the API server, the client upgrades the connection to one that supports multiplexed streams; the current implementation uses Hyptertext Transfer Protocol Version 2 (HTTP/2).
The client creates a stream with the port
header containing the target port in the pod. All data written to the stream is delivered via the kubelet to the target pod and port. Similarly, all data sent from the pod for that forwarded connection is delivered back to the same stream in the client.
The client closes all streams, the upgraded connection, and the underlying connection when it is finished with the port forwarding request.
6.9. Using sysctls in containers Copy linkLink copied to clipboard!
Sysctl settings are exposed via Kubernetes, allowing users to modify certain kernel parameters at runtime for namespaces within a container. Only sysctls that are namespaced can be set independently on pods. If a sysctl is not namespaced, called node-level, you must use another method of setting the sysctl, such as the Node Tuning Operator. Moreover, only those sysctls considered safe are whitelisted by default; you can manually enable other unsafe sysctls on the node to be available to the user.
6.9.1. About sysctls Copy linkLink copied to clipboard!
In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available via the /proc/sys/ virtual process file system. The parameters cover various subsystems, such as:
- kernel (common prefix: kernel.)
- networking (common prefix: net.)
- virtual memory (common prefix: vm.)
- MDADM (common prefix: dev.)
More subsystems are described in Kernel documentation. To get a list of all parameters, run:
sudo sysctl -a
$ sudo sysctl -a
6.9.1.1. Namespaced versus node-level sysctls Copy linkLink copied to clipboard!
A number of sysctls are namespaced in the Linux kernels. This means that you can set them independently for each pod on a node. Being namespaced is a requirement for sysctls to be accessible in a pod context within Kubernetes.
The following sysctls are known to be namespaced:
- kernel.shm*
- kernel.msg*
- kernel.sem
- fs.mqueue.*
Additionally, most of the sysctls in the net.* group are known to be namespaced. Their namespace adoption differs based on the kernel version and distributor.
Sysctls that are not namespaced are called node-level and must be set manually by the cluster administrator, either by means of the underlying Linux distribution of the nodes, such as by modifying the /etc/sysctls.conf file, or by using a daemon set with privileged containers. You can use the Node Tuning Operator to set node-level sysctls.
Consider marking nodes with special sysctls as tainted. Only schedule pods onto them that need those sysctl settings. Use the taints and toleration feature to mark the nodes.
6.9.1.2. Safe versus unsafe sysctls Copy linkLink copied to clipboard!
Sysctls are grouped into safe and unsafe sysctls.
For a sysctl to be considered safe, it must use proper namespacing and must be properly isolated between pods on the same node. This means that if you set a sysctl for one pod it must not:
- Influence any other pod on the node
- Harm the node’s health
- Gain CPU or memory resources outside of the resource limits of a pod
OpenShift Container Platform supports, or whitelists, the following sysctls in the safe set:
- kernel.shm_rmid_forced
- net.ipv4.ip_local_port_range
- net.ipv4.tcp_syncookies
- net.ipv4.ping_group_range
All safe sysctls are enabled by default. You can use a sysctl in a pod by modifying the Pod
spec.
Any sysctl not whitelisted by OpenShift Container Platform is considered unsafe for OpenShift Container Platform. Note that being namespaced alone is not sufficient for the sysctl to be considered safe.
All unsafe sysctls are disabled by default, and the cluster administrator must manually enable them on a per-node basis. Pods with disabled unsafe sysctls are scheduled but do not launch.
oc get pod
$ oc get pod
Example output
NAME READY STATUS RESTARTS AGE hello-pod 0/1 SysctlForbidden 0 14s
NAME READY STATUS RESTARTS AGE
hello-pod 0/1 SysctlForbidden 0 14s
6.9.2. Setting sysctls for a pod Copy linkLink copied to clipboard!
You can set sysctls on pods using the pod’s securityContext
. The securityContext
applies to all containers in the same pod.
Safe sysctls are allowed by default. A pod with unsafe sysctls fails to launch on any node unless the cluster administrator explicitly enables unsafe sysctls for that node. As with node-level sysctls, use the taints and toleration feature or labels on nodes to schedule those pods onto the right nodes.
The following example uses the pod securityContext
to set a safe sysctl kernel.shm_rmid_forced
and two unsafe sysctls, net.core.somaxconn
and kernel.msgmax
. There is no distinction between safe and unsafe sysctls in the specification.
To avoid destabilizing your operating system, modify sysctl parameters only after you understand their effects.
Procedure
To use safe and unsafe sysctls:
Modify the YAML file that defines the pod and add the
securityContext
spec, as shown in the following example:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the pod:
oc apply -f <file-name>.yaml
$ oc apply -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow If the unsafe sysctls are not allowed for the node, the pod is scheduled, but does not deploy:
oc get pod
$ oc get pod
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME READY STATUS RESTARTS AGE hello-pod 0/1 SysctlForbidden 0 14s
NAME READY STATUS RESTARTS AGE hello-pod 0/1 SysctlForbidden 0 14s
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.9.3. Enabling unsafe sysctls Copy linkLink copied to clipboard!
A cluster administrator can allow certain unsafe sysctls for very special situations such as high performance or real-time application tuning.
If you want to use unsafe sysctls, a cluster administrator must enable them individually for a specific type of node. The sysctls must be namespaced.
You can further control which sysctls can be set in pods by specifying lists of sysctls or sysctl patterns in the forbiddenSysctls
and allowedUnsafeSysctls
fields of the Security Context Constraints.
-
The
forbiddenSysctls
option excludes specific sysctls. -
The
allowedUnsafeSysctls
option controls specific needs such as high performance or real-time application tuning.
Due to their nature of being unsafe, the use of unsafe sysctls is at-your-own-risk and can lead to severe problems, such as improper behavior of containers, resource shortage, or breaking a node.
Procedure
Add a label to the machine config pool where the containers where containers with the unsafe sysctls will run:
oc edit machineconfigpool worker
$ oc edit machineconfigpool worker
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Add a
key: pair
label.
Create a
KubeletConfig
custom resource (CR):Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the object:
oc apply -f set-sysctl-worker.yaml
$ oc apply -f set-sysctl-worker.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow A new
MachineConfig
object named in the99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet
format is created.Wait for the cluster to reboot using the
machineconfigpool
objectstatus
fields:For example:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow A message similar to the following appears when the cluster is ready:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow When the cluster is ready, check for the merged
KubeletConfig
object in the newMachineConfig
object:oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
$ oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Copy to Clipboard Copied! Toggle word wrap Toggle overflow You can now add unsafe sysctls to pods as needed.
Chapter 7. Working with clusters Copy linkLink copied to clipboard!
7.1. Viewing system event information in an OpenShift Container Platform cluster Copy linkLink copied to clipboard!
Events in OpenShift Container Platform are modeled based on events that happen to API objects in an OpenShift Container Platform cluster.
7.1.1. Understanding events Copy linkLink copied to clipboard!
Events allow OpenShift Container Platform to record information about real-world events in a resource-agnostic manner. They also allow developers and administrators to consume information about system components in a unified way.
7.1.2. Viewing events using the CLI Copy linkLink copied to clipboard!
You can get a list of events in a given project using the CLI.
Procedure
To view events in a project use the following command:
oc get events [-n <project>]
$ oc get events [-n <project>]
1 Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- The name of the project.
For example:
oc get events -n openshift-config
$ oc get events -n openshift-config
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow To view events in your project from the OpenShift Container Platform console.
- Launch the OpenShift Container Platform console.
- Click Home → Events and select your project.
Move to resource that you want to see events. For example: Home → Projects → <project-name> → <resource-name>.
Many objects, such as pods and deployments, have their own Events tab as well, which shows events related to that object.
7.1.3. List of events Copy linkLink copied to clipboard!
This section describes the events of OpenShift Container Platform.
Name | Description |
---|---|
| Failed pod configuration validation. |
Name | Description |
---|---|
| Back-off restarting failed the container. |
| Container created. |
| Pull/Create/Start failed. |
| Killing the container. |
| Container started. |
| Preempting other pods. |
| Container runtime did not stop the pod within specified grace period. |
Name | Description |
---|---|
| Container is unhealthy. |
Name | Description |
---|---|
| Back off Ctr Start, image pull. |
| The image’s NeverPull Policy is violated. |
| Failed to pull the image. |
| Failed to inspect the image. |
| Successfully pulled the image or the container image is already present on the machine. |
| Pulling the image. |
Name | Description |
---|---|
| Free disk space failed. |
| Invalid disk capacity. |
Name | Description |
---|---|
| Volume mount failed. |
| Host network not supported. |
| Host/port conflict. |
| Kubelet setup failed. |
| Undefined shaper. |
| Node is not ready. |
| Node is not schedulable. |
| Node is ready. |
| Node is schedulable. |
| Node selector mismatch. |
| Out of disk. |
| Node rebooted. |
| Starting kubelet. |
| Failed to attach volume. |
| Failed to detach volume. |
| Failed to expand/reduce volume. |
| Successfully expanded/reduced volume. |
| Failed to expand/reduce file system. |
| Successfully expanded/reduced file system. |
| Failed to unmount volume. |
| Failed to map a volume. |
| Failed unmaped device. |
| Volume is already mounted. |
| Volume is successfully detached. |
| Volume is successfully mounted. |
| Volume is successfully unmounted. |
| Container garbage collection failed. |
| Image garbage collection failed. |
| Failed to enforce System Reserved Cgroup limit. |
| Enforced System Reserved Cgroup limit. |
| Unsupported mount option. |
| Pod sandbox changed. |
| Failed to create pod sandbox. |
| Failed pod sandbox status. |
Name | Description |
---|---|
| Pod sync failed. |
Name | Description |
---|---|
| There is an OOM (out of memory) situation on the cluster. |
Name | Description |
---|---|
| Failed to stop a pod. |
| Failed to create a pod container. |
| Failed to make pod data directories. |
| Network is not ready. |
|
Error creating: |
|
Created pod: |
|
Error deleting: |
|
Deleted pod: |
Name | Description |
---|---|
SelectorRequired | Selector is required. |
| Could not convert selector into a corresponding internal selector object. |
| HPA was unable to compute the replica count. |
| Unknown metric source type. |
| HPA was able to successfully calculate a replica count. |
| Failed to convert the given HPA. |
| HPA controller was unable to get the target’s current scale. |
| HPA controller was able to get the target’s current scale. |
| Failed to compute desired number of replicas based on listed metrics. |
|
New size: |
|
New size: |
| Failed to update status. |
Name | Description |
---|---|
| Starting OpenShift-SDN. |
| The pod’s network interface has been lost and the pod will be stopped. |
Name | Description |
---|---|
|
The service-port |
Name | Description |
---|---|
| There are no persistent volumes available and no storage class is set. |
| Volume size or class is different from what is requested in claim. |
| Error creating recycler pod. |
| Occurs when volume is recycled. |
| Occurs when pod is recycled. |
| Occurs when volume is deleted. |
| Error when deleting the volume. |
| Occurs when volume for the claim is provisioned either manually or via external software. |
| Failed to provision volume. |
| Error cleaning provisioned volume. |
| Occurs when the volume is provisioned successfully. |
| Delay binding until pod scheduling. |
Name | Description |
---|---|
| Handler failed for pod start. |
| Handler failed for pre-stop. |
| Pre-stop hook unfinished. |
Name | Description |
---|---|
| Failed to cancel deployment. |
| Canceled deployment. |
| Created new replication controller. |
| No available Ingress IP to allocate to service. |
Name | Description |
---|---|
|
Failed to schedule pod: |
|
By |
|
Successfully assigned |
Name | Description |
---|---|
| This daemon set is selecting all pods. A non-empty selector is required. |
|
Failed to place pod on |
|
Found failed daemon pod |
Name | Description |
---|---|
| Error creating load balancer. |
| Deleting load balancer. |
| Ensuring load balancer. |
| Ensured load balancer. |
|
There are no available nodes for |
|
Lists the new |
|
Lists the new IP address. For example, |
|
Lists external IP address. For example, |
|
Lists the new UID. For example, |
|
Lists the new |
|
Lists the new |
| Updated load balancer with new hosts. |
| Error updating load balancer with new hosts. |
| Deleting load balancer. |
| Error deleting load balancer. |
| Deleted load balancer. |
7.2. Estimating the number of pods your OpenShift Container Platform nodes can hold Copy linkLink copied to clipboard!
As a cluster administrator, you can use the cluster capacity tool to view the number of pods that can be scheduled to increase the current resources before they become exhausted, and to ensure any future pods can be scheduled. This capacity comes from an individual node host in a cluster, and includes CPU, memory, disk space, and others.
7.2.1. Understanding the OpenShift Container Platform cluster capacity tool Copy linkLink copied to clipboard!
The cluster capacity tool simulates a sequence of scheduling decisions to determine how many instances of an input pod can be scheduled on the cluster before it is exhausted of resources to provide a more accurate estimation.
The remaining allocatable capacity is a rough estimation, because it does not count all of the resources being distributed among nodes. It analyzes only the remaining resources and estimates the available capacity that is still consumable in terms of a number of instances of a pod with given requirements that can be scheduled in a cluster.
Also, pods might only have scheduling support on particular sets of nodes based on its selection and affinity criteria. As a result, the estimation of which remaining pods a cluster can schedule can be difficult.
You can run the cluster capacity analysis tool as a stand-alone utility from the command line, or as a job in a pod inside an OpenShift Container Platform cluster. Running it as job inside of a pod enables you to run it multiple times without intervention.
7.2.2. Running the cluster capacity tool on the command line Copy linkLink copied to clipboard!
You can run the OpenShift Container Platform cluster capacity tool from the command line to estimate the number of pods that can be scheduled onto your cluster.
Prerequisites
- Run the OpenShift Cluster Capacity Tool, which is available as a container image from the Red Hat Ecosystem Catalog.
Create a sample
Pod
spec file, which the tool uses for estimating resource usage. Thepodspec
specifies its resource requirements aslimits
orrequests
. The cluster capacity tool takes the pod’s resource requirements into account for its estimation analysis.An example of the
Pod
spec input is:Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Procedure
To use the cluster capacity tool on the command line:
From the terminal, log in to the Red Hat Registry:
podman login registry.redhat.io
$ podman login registry.redhat.io
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Pull the cluster capacity tool image:
podman pull registry.redhat.io/openshift4/ose-cluster-capacity
$ podman pull registry.redhat.io/openshift4/ose-cluster-capacity
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Run the cluster capacity tool:
podman run -v $HOME/.kube:/kube:Z -v $(pwd):/cc:Z ose-cluster-capacity \ /bin/cluster-capacity --kubeconfig /kube/config --podspec /cc/pod-spec.yaml \ --verbose
$ podman run -v $HOME/.kube:/kube:Z -v $(pwd):/cc:Z ose-cluster-capacity \ /bin/cluster-capacity --kubeconfig /kube/config --podspec /cc/pod-spec.yaml \ --verbose
1 Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- You can also add the
--verbose
option to output a detailed description of how many pods can be scheduled on each node in the cluster.
Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow In the above example, the number of estimated pods that can be scheduled onto the cluster is 88.
7.2.3. Running the cluster capacity tool as a job inside a pod Copy linkLink copied to clipboard!
Running the cluster capacity tool as a job inside of a pod has the advantage of being able to be run multiple times without needing user intervention. Running the cluster capacity tool as a job involves using a ConfigMap
object.
Prerequisites
Download and install the cluster capacity tool.
Procedure
To run the cluster capacity tool:
Create the cluster role:
cat << EOF| oc create -f -
$ cat << EOF| oc create -f -
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the service account:
oc create sa cluster-capacity-sa
$ oc create sa cluster-capacity-sa
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Add the role to the service account:
oc adm policy add-cluster-role-to-user cluster-capacity-role \ system:serviceaccount:default:cluster-capacity-sa
$ oc adm policy add-cluster-role-to-user cluster-capacity-role \ system:serviceaccount:default:cluster-capacity-sa
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Define and create the
Pod
spec:Copy to Clipboard Copied! Toggle word wrap Toggle overflow The cluster capacity analysis is mounted in a volume using a
ConfigMap
object namedcluster-capacity-configmap
to mount input pod spec filepod.yaml
into a volumetest-volume
at the path/test-pod
.If you haven’t created a
ConfigMap
object, create one before creating the job:oc create configmap cluster-capacity-configmap \ --from-file=pod.yaml=pod.yaml
$ oc create configmap cluster-capacity-configmap \ --from-file=pod.yaml=pod.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the job using the below example of a job specification file:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- A required environment variable letting the cluster capacity tool know that it is running inside a cluster as a pod.
Thepod.yaml
key of theConfigMap
object is the same as thePod
spec file name, though it is not required. By doing this, the input pod spec file can be accessed inside the pod as/test-pod/pod.yaml
.
Run the cluster capacity image as a job in a pod:
oc create -f cluster-capacity-job.yaml
$ oc create -f cluster-capacity-job.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Check the job logs to find the number of pods that can be scheduled in the cluster:
oc logs jobs/cluster-capacity-job
$ oc logs jobs/cluster-capacity-job
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
7.3. Restrict resource consumption with limit ranges Copy linkLink copied to clipboard!
By default, containers run with unbounded compute resources on an OpenShift Container Platform cluster. With limit ranges, you can restrict resource consumption for specific objects in a project:
- pods and containers: You can set minimum and maximum requirements for CPU and memory for pods and their containers.
-
Image streams: You can set limits on the number of images and tags in an
ImageStream
object. - Images: You can limit the size of images that can be pushed to an internal registry.
- Persistent volume claims (PVC): You can restrict the size of the PVCs that can be requested.
If a pod does not meet the constraints imposed by the limit range, the pod cannot be created in the namespace.
7.3.1. About limit ranges Copy linkLink copied to clipboard!
A limit range, defined by a LimitRange
object, restricts resource consumption in a project. In the project you can set specific resource limits for a pod, container, image, image stream, or persistent volume claim (PVC).
All requests to create and modify resources are evaluated against each LimitRange
object in the project. If the resource violates any of the enumerated constraints, the resource is rejected.
The following shows a limit range object for all components: pod, container, image, image stream, or PVC. You can configure limits for any or all of these components in the same object. You create a different limit range object for each project where you want to control resources.
Sample limit range object for a container
7.3.1.1. About component limits Copy linkLink copied to clipboard!
The following examples show limit range parameters for each component. The examples are broken out for clarity. You can create a single LimitRange
object for any or all components as necessary.
7.3.1.1.1. Container limits Copy linkLink copied to clipboard!
A limit range allows you to specify the minimum and maximum CPU and memory that each container in a pod can request for a specific project. If a container is created in the project, the container CPU and memory requests in the Pod
spec must comply with the values set in the LimitRange
object. If not, the pod does not get created.
-
The container CPU or memory request and limit must be greater than or equal to the
min
resource constraint for containers that are specified in theLimitRange
object. The container CPU or memory request and limit must be less than or equal to the
max
resource constraint for containers that are specified in theLimitRange
object.If the
LimitRange
object defines amax
CPU, you do not need to define a CPUrequest
value in thePod
spec. But you must specify a CPUlimit
value that satisfies the maximum CPU constraint specified in the limit range.The ratio of the container limits to requests must be less than or equal to the
maxLimitRequestRatio
value for containers that is specified in theLimitRange
object.If the
LimitRange
object defines amaxLimitRequestRatio
constraint, any new containers must have both arequest
and alimit
value. OpenShift Container Platform calculates the limit-to-request ratio by dividing thelimit
by therequest
. This value should be a non-negative integer greater than 1.For example, if a container has
cpu: 500
in thelimit
value, andcpu: 100
in therequest
value, the limit-to-request ratio forcpu
is5
. This ratio must be less than or equal to themaxLimitRequestRatio
.
If the Pod
spec does not specify a container resource memory or limit, the default
or defaultRequest
CPU and memory values for containers specified in the limit range object are assigned to the container.
Container LimitRange
object definition
- 1
- The name of the LimitRange object.
- 2
- The maximum amount of CPU that a single container in a pod can request.
- 3
- The maximum amount of memory that a single container in a pod can request.
- 4
- The minimum amount of CPU that a single container in a pod can request.
- 5
- The minimum amount of memory that a single container in a pod can request.
- 6
- The default amount of CPU that a container can use if not specified in the
Pod
spec. - 7
- The default amount of memory that a container can use if not specified in the
Pod
spec. - 8
- The default amount of CPU that a container can request if not specified in the
Pod
spec. - 9
- The default amount of memory that a container can request if not specified in the
Pod
spec. - 10
- The maximum limit-to-request ratio for a container.
7.3.1.1.2. Pod limits Copy linkLink copied to clipboard!
A limit range allows you to specify the minimum and maximum CPU and memory limits for all containers across a pod in a given project. To create a container in the project, the container CPU and memory requests in the Pod
spec must comply with the values set in the LimitRange
object. If not, the pod does not get created.
If the Pod
spec does not specify a container resource memory or limit, the default
or defaultRequest
CPU and memory values for containers specified in the limit range object are assigned to the container.
Across all containers in a pod, the following must hold true:
-
The container CPU or memory request and limit must be greater than or equal to the
min
resource constraints for pods that are specified in theLimitRange
object. -
The container CPU or memory request and limit must be less than or equal to the
max
resource constraints for pods that are specified in theLimitRange
object. -
The ratio of the container limits to requests must be less than or equal to the
maxLimitRequestRatio
constraint specified in theLimitRange
object.
Pod LimitRange
object definition
- 1
- The name of the limit range object.
- 2
- The maximum amount of CPU that a pod can request across all containers.
- 3
- The maximum amount of memory that a pod can request across all containers.
- 4
- The minimum amount of CPU that a pod can request across all containers.
- 5
- The minimum amount of memory that a pod can request across all containers.
- 6
- The maximum limit-to-request ratio for a container.
7.3.1.1.3. Image limits Copy linkLink copied to clipboard!
A LimitRange
object allows you to specify the maximum size of an image that can be pushed to an internal registry.
When pushing images to an internal registry, the following must hold true:
-
The size of the image must be less than or equal to the
max
size for images that is specified in theLimitRange
object.
Image LimitRange
object definition
To prevent blobs that exceed the limit from being uploaded to the registry, the registry must be configured to enforce quotas.
The image size is not always available in the manifest of an uploaded image. This is especially the case for images built with Docker 1.10 or higher and pushed to a v2 registry. If such an image is pulled with an older Docker daemon, the image manifest is converted by the registry to schema v1 lacking all the size information. No storage limit set on images prevent it from being uploaded.
The issue is being addressed.
7.3.1.1.4. Image stream limits Copy linkLink copied to clipboard!
A LimitRange
object allows you to specify limits for image streams.
For each image stream, the following must hold true:
-
The number of image tags in an
ImageStream
specification must be less than or equal to theopenshift.io/image-tags
constraint in theLimitRange
object. -
The number of unique references to images in an
ImageStream
specification must be less than or equal to theopenshift.io/images
constraint in the limit range object.
Imagestream LimitRange
object definition
The openshift.io/image-tags
resource represents unique image references. Possible references are an ImageStreamTag
, an ImageStreamImage
and a DockerImage
. Tags can be created using the oc tag
and oc import-image
commands. No distinction is made between internal and external references. However, each unique reference tagged in an ImageStream
specification is counted just once. It does not restrict pushes to an internal container image registry in any way, but is useful for tag restriction.
The openshift.io/images
resource represents unique image names recorded in image stream status. It allows for restriction of a number of images that can be pushed to the internal registry. Internal and external references are not distinguished.
7.3.1.1.5. Persistent volume claim limits Copy linkLink copied to clipboard!
A LimitRange
object allows you to restrict the storage requested in a persistent volume claim (PVC).
Across all persistent volume claims in a project, the following must hold true:
-
The resource request in a persistent volume claim (PVC) must be greater than or equal the
min
constraint for PVCs that is specified in theLimitRange
object. -
The resource request in a persistent volume claim (PVC) must be less than or equal the
max
constraint for PVCs that is specified in theLimitRange
object.
PVC LimitRange
object definition
7.3.2. Creating a Limit Range Copy linkLink copied to clipboard!
To apply a limit range to a project:
Create a
LimitRange
object with your required specifications:Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Specify a name for the
LimitRange
object. - 2
- To set limits for a pod, specify the minimum and maximum CPU and memory requests as needed.
- 3
- To set limits for a container, specify the minimum and maximum CPU and memory requests as needed.
- 4
- Optional. For a container, specify the default amount of CPU or memory that a container can use, if not specified in the
Pod
spec. - 5
- Optional. For a container, specify the default amount of CPU or memory that a container can request, if not specified in the
Pod
spec. - 6
- Optional. For a container, specify the maximum limit-to-request ratio that can be specified in the
Pod
spec. - 7
- To set limits for an Image object, set the maximum size of an image that can be pushed to an internal registry.
- 8
- To set limits for an image stream, set the maximum number of image tags and references that can be in the
ImageStream
object file, as needed. - 9
- To set limits for a persistent volume claim, set the minimum and maximum amount of storage that can be requested.
Create the object:
oc create -f <limit_range_file> -n <project>
$ oc create -f <limit_range_file> -n <project>
1 Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Specify the name of the YAML file you created and the project where you want the limits to apply.
7.3.3. Viewing a limit Copy linkLink copied to clipboard!
You can view any limits defined in a project by navigating in the web console to the project’s Quota page.
You can also use the CLI to view limit range details:
Get the list of
LimitRange
object defined in the project. For example, for a project called demoproject:oc get limits -n demoproject
$ oc get limits -n demoproject
Copy to Clipboard Copied! Toggle word wrap Toggle overflow NAME CREATED AT resource-limits 2020-07-15T17:14:23Z
NAME CREATED AT resource-limits 2020-07-15T17:14:23Z
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Describe the
LimitRange
object you are interested in, for example theresource-limits
limit range:oc describe limits resource-limits -n demoproject
$ oc describe limits resource-limits -n demoproject
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Copy to Clipboard Copied! Toggle word wrap Toggle overflow
7.3.4. Deleting a Limit Range Copy linkLink copied to clipboard!
To remove any active LimitRange
object to no longer enforce the limits in a project:
Run the following command:
oc delete limits <limit_name>
$ oc delete limits <limit_name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
7.4. Configuring cluster memory to meet container memory and risk requirements Copy linkLink copied to clipboard!
As a cluster administrator, you can help your clusters operate efficiently through managing application memory by:
- Determining the memory and risk requirements of a containerized application component and configuring the container memory parameters to suit those requirements.
- Configuring containerized application runtimes (for example, OpenJDK) to adhere optimally to the configured container memory parameters.
- Diagnosing and resolving memory-related error conditions associated with running in a container.
7.4.1. Understanding managing application memory Copy linkLink copied to clipboard!
It is recommended to fully read the overview of how OpenShift Container Platform manages Compute Resources before proceeding.
For each kind of resource (memory, CPU, storage), OpenShift Container Platform allows optional request and limit values to be placed on each container in a pod.
Note the following about memory requests and memory limits:
Memory request
- The memory request value, if specified, influences the OpenShift Container Platform scheduler. The scheduler considers the memory request when scheduling a container to a node, then fences off the requested memory on the chosen node for the use of the container.
- If a node’s memory is exhausted, OpenShift Container Platform prioritizes evicting its containers whose memory usage most exceeds their memory request. In serious cases of memory exhaustion, the node OOM killer may select and kill a process in a container based on a similar metric.
- The cluster administrator can assign quota or assign default values for the memory request value.
- The cluster administrator can override the memory request values that a developer specifies, in order to manage cluster overcommit.
Memory limit
- The memory limit value, if specified, provides a hard limit on the memory that can be allocated across all the processes in a container.
- If the memory allocated by all of the processes in a container exceeds the memory limit, the node Out of Memory (OOM) killer will immediately select and kill a process in the container.
- If both memory request and limit are specified, the memory limit value must be greater than or equal to the memory request.
- The cluster administrator can assign quota or assign default values for the memory limit value.
-
The minimum memory limit is 12 MB. If a container fails to start due to a
Cannot allocate memory
pod event, the memory limit is too low. Either increase or remove the memory limit. Removing the limit allows pods to consume unbounded node resources.
7.4.1.1. Managing application memory strategy Copy linkLink copied to clipboard!
The steps for sizing application memory on OpenShift Container Platform are as follows:
Determine expected container memory usage
Determine expected mean and peak container memory usage, empirically if necessary (for example, by separate load testing). Remember to consider all the processes that may potentially run in parallel in the container: for example, does the main application spawn any ancillary scripts?
Determine risk appetite
Determine risk appetite for eviction. If the risk appetite is low, the container should request memory according to the expected peak usage plus a percentage safety margin. If the risk appetite is higher, it may be more appropriate to request memory according to the expected mean usage.
Set container memory request
Set container memory request based on the above. The more accurately the request represents the application memory usage, the better. If the request is too high, cluster and quota usage will be inefficient. If the request is too low, the chances of application eviction increase.
Set container memory limit, if required
Set container memory limit, if required. Setting a limit has the effect of immediately killing a container process if the combined memory usage of all processes in the container exceeds the limit, and is therefore a mixed blessing. On the one hand, it may make unanticipated excess memory usage obvious early ("fail fast"); on the other hand it also terminates processes abruptly.
Note that some OpenShift Container Platform clusters may require a limit value to be set; some may override the request based on the limit; and some application images rely on a limit value being set as this is easier to detect than a request value.
If the memory limit is set, it should not be set to less than the expected peak container memory usage plus a percentage safety margin.
Ensure application is tuned
Ensure application is tuned with respect to configured request and limit values, if appropriate. This step is particularly relevant to applications which pool memory, such as the JVM. The rest of this page discusses this.
7.4.2. Understanding OpenJDK settings for OpenShift Container Platform Copy linkLink copied to clipboard!
The default OpenJDK settings do not work well with containerized environments. As a result, some additional Java memory settings must always be provided whenever running the OpenJDK in a container.
The JVM memory layout is complex, version dependent, and describing it in detail is beyond the scope of this documentation. However, as a starting point for running OpenJDK in a container, at least the following three memory-related tasks are key:
- Overriding the JVM maximum heap size.
- Encouraging the JVM to release unused memory to the operating system, if appropriate.
- Ensuring all JVM processes within a container are appropriately configured.
Optimally tuning JVM workloads for running in a container is beyond the scope of this documentation, and may involve setting multiple additional JVM options.
7.4.2.1. Understanding how to override the JVM maximum heap size Copy linkLink copied to clipboard!
For many Java workloads, the JVM heap is the largest single consumer of memory. Currently, the OpenJDK defaults to allowing up to 1/4 (1/-XX:MaxRAMFraction
) of the compute node’s memory to be used for the heap, regardless of whether the OpenJDK is running in a container or not. It is therefore essential to override this behavior, especially if a container memory limit is also set.
There are at least two ways the above can be achieved:
If the container memory limit is set and the experimental options are supported by the JVM, set
-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap
.NoteThe
UseCGroupMemoryLimitForHeap
option has been removed in JDK 11. Use-XX:+UseContainerSupport
instead.This sets
-XX:MaxRAM
to the container memory limit, and the maximum heap size (-XX:MaxHeapSize
/-Xmx
) to 1/-XX:MaxRAMFraction
(1/4 by default).Directly override one of
-XX:MaxRAM
,-XX:MaxHeapSize
or-Xmx
.This option involves hard-coding a value, but has the advantage of allowing a safety margin to be calculated.
7.4.2.2. Understanding how to encourage the JVM to release unused memory to the operating system Copy linkLink copied to clipboard!
By default, the OpenJDK does not aggressively return unused memory to the operating system. This may be appropriate for many containerized Java workloads, but notable exceptions include workloads where additional active processes co-exist with a JVM within a container, whether those additional processes are native, additional JVMs, or a combination of the two.
The OpenShift Container Platform Jenkins maven slave image uses the following JVM arguments to encourage the JVM to release unused memory to the operating system:
-XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90.
-XX:+UseParallelGC
-XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4
-XX:AdaptiveSizePolicyWeight=90.
These arguments are intended to return heap memory to the operating system whenever allocated memory exceeds 110% of in-use memory (-XX:MaxHeapFreeRatio
), spending up to 20% of CPU time in the garbage collector (-XX:GCTimeRatio
). At no time will the application heap allocation be less than the initial heap allocation (overridden by -XX:InitialHeapSize
/ -Xms
). Detailed additional information is available Tuning Java’s footprint in OpenShift (Part 1), Tuning Java’s footprint in OpenShift (Part 2), and at OpenJDK and Containers.
7.4.2.3. Understanding how to ensure all JVM processes within a container are appropriately configured Copy linkLink copied to clipboard!
In the case that multiple JVMs run in the same container, it is essential to ensure that they are all configured appropriately. For many workloads it will be necessary to grant each JVM a percentage memory budget, leaving a perhaps substantial additional safety margin.
Many Java tools use different environment variables (JAVA_OPTS
, GRADLE_OPTS
, MAVEN_OPTS
, and so on) to configure their JVMs and it can be challenging to ensure that the right settings are being passed to the right JVM.
The JAVA_TOOL_OPTIONS
environment variable is always respected by the OpenJDK, and values specified in JAVA_TOOL_OPTIONS
will be overridden by other options specified on the JVM command line. By default, to ensure that these options are used by default for all JVM workloads run in the slave image, the OpenShift Container Platform Jenkins maven slave image sets:
JAVA_TOOL_OPTIONS="-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -Dsun.zip.disableMemoryMapping=true"
JAVA_TOOL_OPTIONS="-XX:+UnlockExperimentalVMOptions
-XX:+UseCGroupMemoryLimitForHeap -Dsun.zip.disableMemoryMapping=true"
The UseCGroupMemoryLimitForHeap
option has been removed in JDK 11. Use -XX:+UseContainerSupport
instead.
This does not guarantee that additional options are not required, but is intended to be a helpful starting point.
7.4.3. Finding the memory request and limit from within a pod Copy linkLink copied to clipboard!
An application wishing to dynamically discover its memory request and limit from within a pod should use the Downward API.
Procedure
Configure the pod to add the
MEMORY_REQUEST
andMEMORY_LIMIT
stanzas:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the pod:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Access the pod using a remote shell:
oc rsh test
$ oc rsh test
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Check that the requested values were applied:
env | grep MEMORY | sort
$ env | grep MEMORY | sort
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
MEMORY_LIMIT=536870912 MEMORY_REQUEST=402653184
MEMORY_LIMIT=536870912 MEMORY_REQUEST=402653184
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
The memory limit value can also be read from inside the container by the /sys/fs/cgroup/memory/memory.limit_in_bytes
file.
7.4.4. Understanding OOM kill policy Copy linkLink copied to clipboard!
OpenShift Container Platform can kill a process in a container if the total memory usage of all the processes in the container exceeds the memory limit, or in serious cases of node memory exhaustion.
When a process is Out of Memory (OOM) killed, this might result in the container exiting immediately. If the container PID 1 process receives the SIGKILL, the container will exit immediately. Otherwise, the container behavior is dependent on the behavior of the other processes.
For example, a container process exited with code 137, indicating it received a SIGKILL signal.
If the container does not exit immediately, an OOM kill is detectable as follows:
Access the pod using a remote shell:
oc rsh test
# oc rsh test
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Run the following command to see the current OOM kill count in
/sys/fs/cgroup/memory/memory.oom_control
:grep '^oom_kill ' /sys/fs/cgroup/memory/memory.oom_control
$ grep '^oom_kill ' /sys/fs/cgroup/memory/memory.oom_control oom_kill 0
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Run the following command to provoke an OOM kill:
sed -e '' </dev/zero
$ sed -e '' </dev/zero
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Killed
Killed
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Run the following command to view the exit status of the
sed
command:echo $?
$ echo $?
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
137
137
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The
137
code indicates the container process exited with code 137, indicating it received a SIGKILL signal.Run the following command to see that the OOM kill counter in
/sys/fs/cgroup/memory/memory.oom_control
incremented:grep '^oom_kill ' /sys/fs/cgroup/memory/memory.oom_control
$ grep '^oom_kill ' /sys/fs/cgroup/memory/memory.oom_control oom_kill 1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow If one or more processes in a pod are OOM killed, when the pod subsequently exits, whether immediately or not, it will have phase Failed and reason OOMKilled. An OOM-killed pod might be restarted depending on the value of
restartPolicy
. If not restarted, controllers such as the replication controller will notice the pod’s failed status and create a new pod to replace the old one.Use the follwing command to get the pod status:
oc get pod test
$ oc get pod test
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
NAME READY STATUS RESTARTS AGE test 0/1 OOMKilled 0 1m
NAME READY STATUS RESTARTS AGE test 0/1 OOMKilled 0 1m
Copy to Clipboard Copied! Toggle word wrap Toggle overflow If the pod has not restarted, run the following command to view the pod:
oc get pod test -o yaml
$ oc get pod test -o yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow If restarted, run the following command to view the pod:
oc get pod test -o yaml
$ oc get pod test -o yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
7.4.5. Understanding pod eviction Copy linkLink copied to clipboard!
OpenShift Container Platform may evict a pod from its node when the node’s memory is exhausted. Depending on the extent of memory exhaustion, the eviction may or may not be graceful. Graceful eviction implies the main process (PID 1) of each container receiving a SIGTERM signal, then some time later a SIGKILL signal if the process has not exited already. Non-graceful eviction implies the main process of each container immediately receiving a SIGKILL signal.
An evicted pod has phase Failed and reason Evicted. It will not be restarted, regardless of the value of restartPolicy
. However, controllers such as the replication controller will notice the pod’s failed status and create a new pod to replace the old one.
oc get pod test
$ oc get pod test
Example output
NAME READY STATUS RESTARTS AGE test 0/1 Evicted 0 1m
NAME READY STATUS RESTARTS AGE
test 0/1 Evicted 0 1m
oc get pod test -o yaml
$ oc get pod test -o yaml
Example output
... status: message: 'Pod The node was low on resource: [MemoryPressure].' phase: Failed reason: Evicted
...
status:
message: 'Pod The node was low on resource: [MemoryPressure].'
phase: Failed
reason: Evicted
7.5. Configuring your cluster to place pods on overcommitted nodes Copy linkLink copied to clipboard!
In an overcommitted state, the sum of the container compute resource requests and limits exceeds the resources available on the system. For example, you might want to use overcommitment in development environments where a trade-off of guaranteed performance for capacity is acceptable.
Containers can specify compute resource requests and limits. Requests are used for scheduling your container and provide a minimum service guarantee. Limits constrain the amount of compute resource that can be consumed on your node.
The scheduler attempts to optimize the compute resource use across all nodes in your cluster. It places pods onto specific nodes, taking the pods' compute resource requests and nodes' available capacity into consideration.
OpenShift Container Platform administrators can control the level of overcommit and manage container density on nodes. You can configure cluster-level overcommit using the ClusterResourceOverride Operator to override the ratio between requests and limits set on developer containers. In conjunction with node overcommit and project memory and CPU limits and defaults, you can adjust the resource limit and request to achieve the desired level of overcommit.
In OpenShift Container Platform, you must enable cluster-level overcommit. Node overcommitment is enabled by default. See Disabling overcommitment for a node.
7.5.1. Resource requests and overcommitment Copy linkLink copied to clipboard!
For each compute resource, a container may specify a resource request and limit. Scheduling decisions are made based on the request to ensure that a node has enough capacity available to meet the requested value. If a container specifies limits, but omits requests, the requests are defaulted to the limits. A container is not able to exceed the specified limit on the node.
The enforcement of limits is dependent upon the compute resource type. If a container makes no request or limit, the container is scheduled to a node with no resource guarantees. In practice, the container is able to consume as much of the specified resource as is available with the lowest local priority. In low resource situations, containers that specify no resource requests are given the lowest quality of service.
Scheduling is based on resources requested, while quota and hard limits refer to resource limits, which can be set higher than requested resources. The difference between request and limit determines the level of overcommit; for instance, if a container is given a memory request of 1Gi and a memory limit of 2Gi, it is scheduled based on the 1Gi request being available on the node, but could use up to 2Gi; so it is 200% overcommitted.
7.5.2. Cluster-level overcommit using the Cluster Resource Override Operator Copy linkLink copied to clipboard!
The Cluster Resource Override Operator is an admission webhook that allows you to control the level of overcommit and manage container density across all the nodes in your cluster. The Operator controls how nodes in specific projects can exceed defined memory and CPU limits.
You must install the Cluster Resource Override Operator using the OpenShift Container Platform console or CLI as shown in the following sections. During the installation, you create a ClusterResourceOverride
custom resource (CR), where you set the level of overcommit, as shown in the following example:
- 1
- The name must be
cluster
. - 2
- Optional. If a container memory limit has been specified or defaulted, the memory request is overridden to this percentage of the limit, between 1-100. The default is 50.
- 3
- Optional. If a container CPU limit has been specified or defaulted, the CPU request is overridden to this percentage of the limit, between 1-100. The default is 25.
- 4
- Optional. If a container memory limit has been specified or defaulted, the CPU limit is overridden to a percentage of the memory limit, if specified. Scaling 1Gi of RAM at 100 percent is equal to 1 CPU core. This is processed prior to overriding the CPU request (if configured). The default is 200.
The Cluster Resource Override Operator overrides have no effect if limits have not been set on containers. Create a LimitRange
object with default limits per individual project or configure limits in Pod
specs for the overrides to apply.
When configured, overrides can be enabled per-project by applying the following label to the Namespace object for each project:
The Operator watches for the ClusterResourceOverride
CR and ensures that the ClusterResourceOverride
admission webhook is installed into the same namespace as the operator.
7.5.2.1. Installing the Cluster Resource Override Operator using the web console Copy linkLink copied to clipboard!
You can use the OpenShift Container Platform web console to install the Cluster Resource Override Operator to help control overcommit in your cluster.
Prerequisites
-
The Cluster Resource Override Operator has no effect if limits have not been set on containers. You must specify default limits for a project using a
LimitRange
object or configure limits inPod
specs for the overrides to apply.
Procedure
To install the Cluster Resource Override Operator using the OpenShift Container Platform web console:
In the OpenShift Container Platform web console, navigate to Home → Projects
- Click Create Project.
-
Specify
clusterresourceoverride-operator
as the name of the project. - Click Create.
Navigate to Operators → OperatorHub.
- Choose ClusterResourceOverride Operator from the list of available Operators and click Install.
- On the Install Operator page, make sure A specific Namespace on the cluster is selected for Installation Mode.
- Make sure clusterresourceoverride-operator is selected for Installed Namespace.
- Select an Update Channel and Approval Strategy.
- Click Install.
On the Installed Operators page, click ClusterResourceOverride.
- On the ClusterResourceOverride Operator details page, click Create Instance.
On the Create ClusterResourceOverride page, edit the YAML template to set the overcommit values as needed:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- The name must be
cluster
. - 2
- Optional. Specify the percentage to override the container memory limit, if used, between 1-100. The default is 50.
- 3
- Optional. Specify the percentage to override the container CPU limit, if used, between 1-100. The default is 25.
- 4
- Optional. Specify the percentage to override the container memory limit, if used. Scaling 1Gi of RAM at 100 percent is equal to 1 CPU core. This is processed prior to overriding the CPU request, if configured. The default is 200.
- Click Create.
Check the current state of the admission webhook by checking the status of the cluster custom resource:
- On the ClusterResourceOverride Operator page, click cluster.
On the ClusterResourceOverride Details age, click YAML. The
mutatingWebhookConfigurationRef
section appears when the webhook is called.Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Reference to the
ClusterResourceOverride
admission webhook.
7.5.2.2. Installing the Cluster Resource Override Operator using the CLI Copy linkLink copied to clipboard!
You can use the OpenShift Container Platform CLI to install the Cluster Resource Override Operator to help control overcommit in your cluster.
Prerequisites
-
The Cluster Resource Override Operator has no effect if limits have not been set on containers. You must specify default limits for a project using a
LimitRange
object or configure limits inPod
specs for the overrides to apply.
Procedure
To install the Cluster Resource Override Operator using the CLI:
Create a namespace for the Cluster Resource Override Operator:
Create a
Namespace
object YAML file (for example,cro-namespace.yaml
) for the Cluster Resource Override Operator:apiVersion: v1 kind: Namespace metadata: name: clusterresourceoverride-operator
apiVersion: v1 kind: Namespace metadata: name: clusterresourceoverride-operator
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the namespace:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc create -f cro-namespace.yaml
$ oc create -f cro-namespace.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Create an Operator group:
Create an
OperatorGroup
object YAML file (for example, cro-og.yaml) for the Cluster Resource Override Operator:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the Operator Group:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc create -f cro-og.yaml
$ oc create -f cro-og.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Create a subscription:
Create a
Subscription
object YAML file (for example, cro-sub.yaml) for the Cluster Resource Override Operator:Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the subscription:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc create -f cro-sub.yaml
$ oc create -f cro-sub.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Create a
ClusterResourceOverride
custom resource (CR) object in theclusterresourceoverride-operator
namespace:Change to the
clusterresourceoverride-operator
namespace.oc project clusterresourceoverride-operator
$ oc project clusterresourceoverride-operator
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a
ClusterResourceOverride
object YAML file (for example, cro-cr.yaml) for the Cluster Resource Override Operator:Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- The name must be
cluster
. - 2
- Optional. Specify the percentage to override the container memory limit, if used, between 1-100. The default is 50.
- 3
- Optional. Specify the percentage to override the container CPU limit, if used, between 1-100. The default is 25.
- 4
- Optional. Specify the percentage to override the container memory limit, if used. Scaling 1Gi of RAM at 100 percent is equal to 1 CPU core. This is processed prior to overriding the CPU request, if configured. The default is 200.
Create the
ClusterResourceOverride
object:oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc create -f cro-cr.yaml
$ oc create -f cro-cr.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Verify the current state of the admission webhook by checking the status of the cluster custom resource.
oc get clusterresourceoverride cluster -n clusterresourceoverride-operator -o yaml
$ oc get clusterresourceoverride cluster -n clusterresourceoverride-operator -o yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The
mutatingWebhookConfigurationRef
section appears when the webhook is called.Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Reference to the
ClusterResourceOverride
admission webhook.
7.5.2.3. Configuring cluster-level overcommit Copy linkLink copied to clipboard!
The Cluster Resource Override Operator requires a ClusterResourceOverride
custom resource (CR) and a label for each project where you want the Operator to control overcommit.
Prerequisites
-
The Cluster Resource Override Operator has no effect if limits have not been set on containers. You must specify default limits for a project using a
LimitRange
object or configure limits inPod
specs for the overrides to apply.
Procedure
To modify cluster-level overcommit:
Edit the
ClusterResourceOverride
CR:Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Optional. Specify the percentage to override the container memory limit, if used, between 1-100. The default is 50.
- 2
- Optional. Specify the percentage to override the container CPU limit, if used, between 1-100. The default is 25.
- 3
- Optional. Specify the percentage to override the container memory limit, if used. Scaling 1Gi of RAM at 100 percent is equal to 1 CPU core. This is processed prior to overriding the CPU request, if configured. The default is 200.
Ensure the following label has been added to the Namespace object for each project where you want the Cluster Resource Override Operator to control overcommit:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- Add this label to each project.
7.5.3. Node-level overcommit Copy linkLink copied to clipboard!
You can use various ways to control overcommit on specific nodes, such as quality of service (QOS) guarantees, CPU limits, or reserve resources. You can also disable overcommit for specific nodes and specific projects.
7.5.3.1. Understanding compute resources and containers Copy linkLink copied to clipboard!
The node-enforced behavior for compute resources is specific to the resource type.
7.5.3.1.1. Understanding container CPU requests Copy linkLink copied to clipboard!
A container is guaranteed the amount of CPU it requests and is additionally able to consume excess CPU available on the node, up to any limit specified by the container. If multiple containers are attempting to use excess CPU, CPU time is distributed based on the amount of CPU requested by each container.
For example, if one container requested 500m of CPU time and another container requested 250m of CPU time, then any extra CPU time available on the node is distributed among the containers in a 2:1 ratio. If a container specified a limit, it will be throttled not to use more CPU than the specified limit. CPU requests are enforced using the CFS shares support in the Linux kernel. By default, CPU limits are enforced using the CFS quota support in the Linux kernel over a 100ms measuring interval, though this can be disabled.
7.5.3.1.2. Understanding container memory requests Copy linkLink copied to clipboard!
A container is guaranteed the amount of memory it requests. A container can use more memory than requested, but once it exceeds its requested amount, it could be terminated in a low memory situation on the node. If a container uses less memory than requested, it will not be terminated unless system tasks or daemons need more memory than was accounted for in the node’s resource reservation. If a container specifies a limit on memory, it is immediately terminated if it exceeds the limit amount.
7.5.3.2. Understanding overcomitment and quality of service classes Copy linkLink copied to clipboard!
A node is overcommitted when it has a pod scheduled that makes no request, or when the sum of limits across all pods on that node exceeds available machine capacity.
In an overcommitted environment, it is possible that the pods on the node will attempt to use more compute resource than is available at any given point in time. When this occurs, the node must give priority to one pod over another. The facility used to make this decision is referred to as a Quality of Service (QoS) Class.
For each compute resource, a container is divided into one of three QoS classes with decreasing order of priority:
Priority | Class Name | Description |
---|---|---|
1 (highest) | Guaranteed | If limits and optionally requests are set (not equal to 0) for all resources and they are equal, then the container is classified as Guaranteed. |
2 | Burstable | If requests and optionally limits are set (not equal to 0) for all resources, and they are not equal, then the container is classified as Burstable. |
3 (lowest) | BestEffort | If requests and limits are not set for any of the resources, then the container is classified as BestEffort. |
Memory is an incompressible resource, so in low memory situations, containers that have the lowest priority are terminated first:
- Guaranteed containers are considered top priority, and are guaranteed to only be terminated if they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
- Burstable containers under system memory pressure are more likely to be terminated once they exceed their requests and no other BestEffort containers exist.
- BestEffort containers are treated with the lowest priority. Processes in these containers are first to be terminated if the system runs out of memory.
7.5.3.2.1. Understanding how to reserve memory across quality of service tiers Copy linkLink copied to clipboard!
You can use the qos-reserved
parameter to specify a percentage of memory to be reserved by a pod in a particular QoS level. This feature attempts to reserve requested resources to exclude pods from lower OoS classes from using resources requested by pods in higher QoS classes.
OpenShift Container Platform uses the qos-reserved
parameter as follows:
-
A value of
qos-reserved=memory=100%
will prevent theBurstable
andBestEffort
QOS classes from consuming memory that was requested by a higher QoS class. This increases the risk of inducing OOM onBestEffort
andBurstable
workloads in favor of increasing memory resource guarantees forGuaranteed
andBurstable
workloads. -
A value of
qos-reserved=memory=50%
will allow theBurstable
andBestEffort
QOS classes to consume half of the memory requested by a higher QoS class. -
A value of
qos-reserved=memory=0%
will allow aBurstable
andBestEffort
QoS classes to consume up to the full node allocatable amount if available, but increases the risk that aGuaranteed
workload will not have access to requested memory. This condition effectively disables this feature.
7.5.3.3. Understanding swap memory and QOS Copy linkLink copied to clipboard!
You can disable swap by default on your nodes to preserve quality of service (QOS) guarantees. Otherwise, physical resources on a node can oversubscribe, affecting the resource guarantees the Kubernetes scheduler makes during pod placement.
For example, if two guaranteed pods have reached their memory limit, each container could start using swap memory. Eventually, if there is not enough swap space, processes in the pods can be terminated due to the system being oversubscribed.
Failing to disable swap results in nodes not recognizing that they are experiencing MemoryPressure, resulting in pods not receiving the memory they made in their scheduling request. As a result, additional pods are placed on the node to further increase memory pressure, ultimately increasing your risk of experiencing a system out of memory (OOM) event.
If swap is enabled, any out-of-resource handling eviction thresholds for available memory will not work as expected. Take advantage of out-of-resource handling to allow pods to be evicted from a node when it is under memory pressure, and rescheduled on an alternative node that has no such pressure.
7.5.3.4. Understanding nodes overcommitment Copy linkLink copied to clipboard!
In an overcommitted environment, it is important to properly configure your node to provide best system behavior.
When the node starts, it ensures that the kernel tunable flags for memory management are set properly. The kernel should never fail memory allocations unless it runs out of physical memory.
To ensure this behavior, OpenShift Container Platform configures the kernel to always overcommit memory by setting the vm.overcommit_memory
parameter to 1
, overriding the default operating system setting.
OpenShift Container Platform also configures the kernel not to panic when it runs out of memory by setting the vm.panic_on_oom
parameter to 0
. A setting of 0 instructs the kernel to call oom_killer in an Out of Memory (OOM) condition, which kills processes based on priority
You can view the current setting by running the following commands on your nodes:
sysctl -a |grep commit
$ sysctl -a |grep commit
Example output
vm.overcommit_memory = 1
vm.overcommit_memory = 1
sysctl -a |grep panic
$ sysctl -a |grep panic
Example output
vm.panic_on_oom = 0
vm.panic_on_oom = 0
The above flags should already be set on nodes, and no further action is required.
You can also perform the following configurations for each node:
- Disable or enforce CPU limits using CPU CFS quotas
- Reserve resources for system processes
- Reserve memory across quality of service tiers
7.5.3.5. Disabling or enforcing CPU limits using CPU CFS quotas Copy linkLink copied to clipboard!
Nodes by default enforce specified CPU limits using the Completely Fair Scheduler (CFS) quota support in the Linux kernel.
If you disable CPU limit enforcement, it is important to understand the impact on your node:
- If a container has a CPU request, the request continues to be enforced by CFS shares in the Linux kernel.
- If a container does not have a CPU request, but does have a CPU limit, the CPU request defaults to the specified CPU limit, and is enforced by CFS shares in the Linux kernel.
- If a container has both a CPU request and limit, the CPU request is enforced by CFS shares in the Linux kernel, and the CPU limit has no impact on the node.
Prerequisites
Obtain the label associated with the static
MachineConfigPool
CRD for the type of node you want to configure. Perform one of the following steps:View the machine config pool:
oc describe machineconfigpool <name>
$ oc describe machineconfigpool <name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc describe machineconfigpool worker
$ oc describe machineconfigpool worker
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- If a label has been added it appears under
labels
.
If the label is not present, add a key/value pair:
oc label machineconfigpool worker custom-kubelet=small-pods
$ oc label machineconfigpool worker custom-kubelet=small-pods
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
7.5.3.6. Reserving resources for system processes Copy linkLink copied to clipboard!
To provide more reliable scheduling and minimize node resource overcommitment, each node can reserve a portion of its resources for use by system daemons that are required to run on your node for your cluster to function. In particular, it is recommended that you reserve resources for incompressible resources such as memory.
Procedure
To explicitly reserve resources for non-pod processes, allocate node resources by specifying resources available for scheduling. For more details, see Allocating Resources for Nodes.
7.5.3.7. Disabling overcommitment for a node Copy linkLink copied to clipboard!
When enabled, overcommitment can be disabled on each node.
Procedure
To disable overcommitment in a node run the following command on that node:
sysctl -w vm.overcommit_memory=0
$ sysctl -w vm.overcommit_memory=0
7.5.4. Project-level limits Copy linkLink copied to clipboard!
To help control overcommit, you can set per-project resource limit ranges, specifying memory and CPU limits and defaults for a project that overcommit cannot exceed.
For information on project-level resource limits, see Additional resources.
Alternatively, you can disable overcommitment for specific projects.
7.5.4.1. Disabling overcommitment for a project Copy linkLink copied to clipboard!
When enabled, overcommitment can be disabled per-project. For example, you can allow infrastructure components to be configured independently of overcommitment.
Procedure
To disable overcommitment in a project:
- Edit the project object file
Add the following annotation:
quota.openshift.io/cluster-resource-override-enabled: "false"
quota.openshift.io/cluster-resource-override-enabled: "false"
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create the project object:
oc create -f <file-name>.yaml
$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
7.6. Enabling OpenShift Container Platform features using FeatureGates Copy linkLink copied to clipboard!
As an administrator, you can use feature gates to enable features that are not part of the default set of features.
7.6.1. Understanding feature gates Copy linkLink copied to clipboard!
You can use the FeatureGate
custom resource (CR) to enable specific feature sets in your cluster. A feature set is a collection of OpenShift Container Platform features that are not enabled by default.
You can activate the following feature set by using the FeatureGate
CR:
-
IPv6DualStackNoUpgrade
. This feature gate enables the dual-stack networking mode in your cluster. Dual-stack networking supports the use of IPv4 and IPv6 simultaneously. Enabling this feature set is not supported, cannot be undone, and prevents updates. This feature set is not recommended on production clusters.
7.6.2. Enabling feature sets using the web console Copy linkLink copied to clipboard!
You can use the OpenShift Container Platform web console to enable feature sets for all of the nodes in a cluster by editing the FeatureGate
custom resource (CR).
Procedure
To enable feature sets:
- In the OpenShift Container Platform web console, switch to the Administration → Custom Resource Definitions page.
- On the Custom Resource Definitions page, click FeatureGate.
- On the Custom Resource Definition Details page, click the Instances tab.
- Click the cluster feature gate, then click the YAML tab.
Edit the cluster instance to add specific feature sets:
Sample Feature Gate custom resource
Copy to Clipboard Copied! Toggle word wrap Toggle overflow After you save the changes, new machine configs are created, the machine config pools are updated, and scheduling on each node is disabled while the change is being applied.
NoteEnabling the
IPv6DualStackNoUpgrade
feature set cannot be undone and prevents updates. This feature set is not recommended on production clusters.
Verification
You can verify that the feature gates are enabled by looking at the kubelet.conf
file on a node after the nodes return to the ready state.
- From the Administrator perspective in the web console, navigate to Compute → Nodes.
- Select a node.
- In the Node details page, click Terminal.
In the terminal window, change your root directory to the host:
chroot /host
sh-4.2# chroot /host
Copy to Clipboard Copied! Toggle word wrap Toggle overflow View the
kubelet.conf
file:cat /etc/kubernetes/kubelet.conf
sh-4.2# cat /etc/kubernetes/kubelet.conf
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Sample output
... featureGates: InsightsOperatorPullingSCA: true, LegacyNodeRoleBehavior: false ...
... featureGates: InsightsOperatorPullingSCA: true, LegacyNodeRoleBehavior: false ...
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The features that are listed as
true
are enabled on your cluster.NoteThe features listed vary depending upon the OpenShift Container Platform version.
7.6.3. Enabling feature sets using the CLI Copy linkLink copied to clipboard!
You can use the OpenShift CLI (oc
) to enable feature sets for all of the nodes in a cluster by editing the FeatureGate
custom resource (CR).
Prerequisites
-
You have installed the OpenShift CLI (
oc
).
Procedure
To enable feature sets:
Edit the
FeatureGate
CR namedcluster
:oc edit featuregate cluster
$ oc edit featuregate cluster
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Sample FeatureGate custom resource
Copy to Clipboard Copied! Toggle word wrap Toggle overflow After you save the changes, new machine configs are created, the machine config pools are updated, and scheduling on each node is disabled while the change is being applied.
NoteEnabling the
IPv6DualStackNoUpgrade
feature set cannot be undone and prevents updates. This feature set is not recommended on production clusters.
Verification
You can verify that the feature gates are enabled by looking at the kubelet.conf
file on a node after the nodes return to the ready state.
Start a debug session for a node:
oc debug node/<node_name>
$ oc debug node/<node_name>
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Change your root directory to the host:
chroot /host
sh-4.2# chroot /host
Copy to Clipboard Copied! Toggle word wrap Toggle overflow View the
kubelet.conf
file:cat /etc/kubernetes/kubelet.conf
sh-4.2# cat /etc/kubernetes/kubelet.conf
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Sample output
... featureGates: InsightsOperatorPullingSCA: true, LegacyNodeRoleBehavior: false ...
... featureGates: InsightsOperatorPullingSCA: true, LegacyNodeRoleBehavior: false ...
Copy to Clipboard Copied! Toggle word wrap Toggle overflow The features that are listed as
true
are enabled on your cluster.NoteThe features listed vary depending upon the OpenShift Container Platform version.
Chapter 8. Remote worker nodes on the network edge Copy linkLink copied to clipboard!
8.1. Using remote worker nodes at the network edge Copy linkLink copied to clipboard!
You can configure OpenShift Container Platform clusters with nodes located at your network edge. In this topic, they are called remote worker nodes. A typical cluster with remote worker nodes combines on-premise master and worker nodes with worker nodes in other locations that connect to the cluster. This topic is intended to provide guidance on best practices for using remote worker nodes and does not contain specific configuration details.
There are multiple use cases across different industries, such as telecommunications, retail, manufacturing, and government, for using a deployment pattern with remote worker nodes. For example, you can separate and isolate your projects and workloads by combining the remote worker nodes into Kubernetes zones.
However, having remote worker nodes can introduce higher latency, intermittent loss of network connectivity, and other issues. Among the challenges in a cluster with remote worker node are:
- Network separation: The OpenShift Container Platform control plane and the remote worker nodes must be able communicate with each other. Because of the distance between the control plane and the remote worker nodes, network issues could prevent this communication. See Network separation with remote worker nodes for information on how OpenShift Container Platform responds to network separation and for methods to diminish the impact to your cluster.
- Power outage: Because the control plane and remote worker nodes are in separate locations, a power outage at the remote location or at any point between the two can negatively impact your cluster. See Power loss on remote worker nodes for information on how OpenShift Container Platform responds to a node losing power and for methods to diminish the impact to your cluster.
- Latency spikes or temporary reduction in throughput: As with any network, any changes in network conditions between your cluster and the remote worker nodes can negatively impact your cluster. These types of situations are beyond the scope of this documentation.
Note the following limitations when planning a cluster with remote worker nodes:
- Remote worker nodes are supported on only bare metal clusters with user-provisioned infrastructure.
- OpenShift Container Platform does not support remote worker nodes that use a different cloud provider than the on-premise cluster uses.
- Moving workloads from one Kubernetes zone to a different Kubernetes zone can be problematic due to system and environment issues, such as a specific type of memory not being available in a different zone.
- Proxies and firewalls can present additional limitations that are beyond the scope of this document. Refer to the relevant OpenShift Container Platform documentation for how to address such limitations, such as Configuring your firewall.
- You are responsible for configuring and maintaining L2/L3-level network connectivity between the control plane and the network-edge nodes.
8.1.1. Network separation with remote worker nodes Copy linkLink copied to clipboard!
All nodes send heartbeats to the Kubernetes Controller Manager Operator (kube controller) in the OpenShift Container Platform cluster every 10 seconds. If the cluster does not receive heartbeats from a node, OpenShift Container Platform responds using several default mechanisms.
OpenShift Container Platform is designed to be resilient to network partitions and other disruptions. You can mitigate some of the more common disruptions, such as interruptions from software upgrades, network splits, and routing issues. Mitigation strategies include ensuring that pods on remote worker nodes request the correct amount of CPU and memory resources, configuring an appropriate replication policy, using redundancy across zones, and using Pod Disruption Budgets on workloads.
If the kube controller loses contact with a node after a configured period, the node controller on the control plane updates the node health to Unhealthy
and marks the node Ready
condition as Unknown
. In response, the scheduler stops scheduling pods to that node. The on-premise node controller adds a node.kubernetes.io/unreachable
taint with a NoExecute
effect to the node and schedules pods on the node for eviction after five minutes, by default.
If a workload controller, such as a Deployment
object or StatefulSet
object, is directing traffic to pods on the unhealthy node and other nodes can reach the cluster, OpenShift Container Platform routes the traffic away from the pods on the node. Nodes that cannot reach the cluster do not get updated with the new traffic routing. As a result, the workloads on those nodes might continue to attempt to reach the unhealthy node.
You can mitigate the effects of connection loss by:
- using daemon sets to create pods that tolerate the taints
- using static pods that automatically restart if a node goes down
- using Kubernetes zones to control pod eviction
- configuring pod tolerations to delay or avoid pod eviction
- configuring the kubelet to control the timing of when it marks nodes as unhealthy.
For more information on using these objects in a cluster with remote worker nodes, see About remote worker node strategies.
8.1.2. Power loss on remote worker nodes Copy linkLink copied to clipboard!
If a remote worker node loses power or restarts ungracefully, OpenShift Container Platform responds using several default mechanisms.
If the Kubernetes Controller Manager Operator (kube controller) loses contact with a node after a configured period, the control plane updates the node health to Unhealthy
and marks the node Ready
condition as Unknown
. In response, the scheduler stops scheduling pods to that node. The on-premise node controller adds a node.kubernetes.io/unreachable
taint with a NoExecute
effect to the node and schedules pods on the node for eviction after five minutes, by default.
On the node, the pods must be restarted when the node recovers power and reconnects with the control plane.
If you want the pods to restart immediately upon restart, use static pods.
After the node restarts, the kubelet also restarts and attempts to restart the pods that were scheduled on the node. If the connection to the control plane takes longer than the default five minutes, the control plane cannot update the node health and remove the node.kubernetes.io/unreachable
taint. On the node, the kubelet terminates any running pods. When these conditions are cleared, the scheduler can start scheduling pods to that node.
You can mitigate the effects of power loss by:
- using daemon sets to create pods that tolerate the taints
- using static pods that automatically restart with a node
- configuring pods tolerations to delay or avoid pod eviction
- configuring the kubelet to control the timing of when the node controller marks nodes as unhealthy.
For more information on using these objects in a cluster with remote worker nodes, see About remote worker node strategies.
8.1.3. Remote worker node strategies Copy linkLink copied to clipboard!
If you use remote worker nodes, consider which objects to use to run your applications.
It is recommend to use daemon sets or static pods based on the behavior you want in the event of network issues or power loss. In addition, you can use Kubernetes zones and tolerations to control or avoid pod evictions if the control plane cannot reach remote worker nodes.
- Daemon sets
- Daemon sets are the best approach to managing pods on remote worker nodes for the following reasons:
-
Daemon sets do not typically need rescheduling behavior. If a node disconnects from the cluster, pods on the node can continue to run. OpenShift Container Platform does not change the state of daemon set pods, and leaves the pods in the state they last reported. For example, if a daemon set pod is in the
Running
state, when a node stops communicating, the pod keeps running and is assumed to be running by OpenShift Container Platform. Daemon set pods, by default, are created with
NoExecute
tolerations for thenode.kubernetes.io/unreachable
andnode.kubernetes.io/not-ready
taints with notolerationSeconds
value. These default values ensure that daemon set pods are never evicted if the control plane cannot reach a node. For example:Tolerations added to daemon set pods by default
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - Daemon sets can use labels to ensure that a workload runs on a matching worker node.
- You can use an OpenShift Container Platform service endpoint to load balance daemon set pods.
Daemon sets do not schedule pods after a reboot of the node if OpenShift Container Platform cannot reach the node.
- Static pods
- If you want pods restart if a node reboots, after a power loss for example, consider static pods. The kubelet on a node automatically restarts static pods as node restarts.
Static pods cannot use secrets and config maps.
- Kubernetes zones
- Kubernetes zones can slow down the rate or, in some cases, completely stop pod evictions.
When the control plane cannot reach a node, the node controller, by default, applies node.kubernetes.io/unreachable
taints and evicts pods at a rate of 0.1 nodes per second. However, in a cluster that uses Kubernetes zones, pod eviction behavior is altered.
If a zone is fully disrupted, where all nodes in the zone have a Ready
condition that is False
or Unknown
, the control plane does not apply the node.kubernetes.io/unreachable
taint to the nodes in that zone.
For partially disrupted zones, where more than 55% of the nodes have a False
or Unknown
condition, the pod eviction rate is reduced to 0.01 nodes per second. Nodes in smaller clusters, with fewer than 50 nodes, are not tainted. Your cluster must have more than three zones for these behavior to take effect.
You assign a node to a specific zone by applying the topology.kubernetes.io/region
label in the node specification.
Sample node labels for Kubernetes zones
kind: Node apiVersion: v1 metadata: labels: topology.kubernetes.io/region=east
kind: Node
apiVersion: v1
metadata:
labels:
topology.kubernetes.io/region=east
KubeletConfig
objects
You can adjust the amount of time that the kubelet checks the state of each node.
To set the interval that affects the timing of when the on-premise node controller marks nodes with the Unhealthy
or Unreachable
condition, create a KubeletConfig
object that contains the node-status-update-frequency
and node-status-report-frequency
parameters.
The kubelet on each node determines the node status as defined by the node-status-update-frequency
setting and reports that status to the cluster based on the node-status-report-frequency
setting. By default, the kubelet determines the pod status every 10 seconds and reports the status every minute. However, if the node state changes, the kubelet reports the change to the cluster immediately. OpenShift Container Platform uses the node-status-report-frequency
setting only when the Node Lease feature gate is enabled, which is the default state in OpenShift Container Platform clusters. If the Node Lease feature gate is disabled, the node reports its status based on the node-status-update-frequency
setting.
Example kubelet config
- 1
- Specify the type of node type to which this
KubeletConfig
object applies using the label from theMachineConfig
object. - 2
- Specify the frequency that the kubelet checks the status of a node associated with this
MachineConfig
object. The default value is10s
. If you change this default, thenode-status-report-frequency
value is changed to the same value. - 3
- Specify the frequency that the kubelet reports the status of a node associated with this
MachineConfig
object. The default value is1m
.
The node-status-update-frequency
parameter works with the node-monitor-grace-period
and pod-eviction-timeout
parameters.
-
The
node-monitor-grace-period
parameter specifies how long OpenShift Container Platform waits after a node associated with aMachineConfig
object is markedUnhealthy
if the controller manager does not receive the node heartbeat. Workloads on the node continue to run after this time. If the remote worker node rejoins the cluster afternode-monitor-grace-period
expires, pods continue to run. New pods can be scheduled to that node. Thenode-monitor-grace-period
interval is40s
. Thenode-status-update-frequency
value must be lower than thenode-monitor-grace-period
value. -
The
pod-eviction-timeout
parameter specifies the amount of time OpenShift Container Platform waits after marking a node that is associated with aMachineConfig
object asUnreachable
to start marking pods for eviction. Evicted pods are rescheduled on other nodes. If the remote worker node rejoins the cluster afterpod-eviction-timeout
expires, the pods running on the remote worker node are terminated because the node controller has evicted the pods on-premise. Pods can then be rescheduled to that node. Thepod-eviction-timeout
interval is5m0s
.
Modifying the node-monitor-grace-period
and pod-eviction-timeout
parameters is not supported.
- Tolerations
-
You can use pod tolerations to mitigate the effects if the on-premise node controller adds a
node.kubernetes.io/unreachable
taint with aNoExecute
effect to a node it cannot reach.
A taint with the NoExecute
effect affects pods that are running on the node in the following ways:
- Pods that do not tolerate the taint are queued for eviction.
-
Pods that tolerate the taint without specifying a
tolerationSeconds
value in their toleration specification remain bound forever. -
Pods that tolerate the taint with a specified
tolerationSeconds
value remain bound for the specified amount of time. After the time elapses, the pods are queued for eviction.
You can delay or avoid pod eviction by configuring pods tolerations with the NoExecute
effect for the node.kubernetes.io/unreachable
and node.kubernetes.io/not-ready
taints.
Example toleration in a pod spec
OpenShift Container Platform uses the tolerationSeconds
value after the pod-eviction-timeout
value elapses.
- Other types of OpenShift Container Platform objects
- You can use replica sets, deployments, and replication controllers. The scheduler can reschedule these pods onto other nodes after the node is disconnected for five minutes. Rescheduling onto other nodes can be beneficial for some workloads, such as REST APIs, where an administrator can guarantee a specific number of pods are running and accessible.
When working with remote worker nodes, rescheduling pods on different nodes might not be acceptable if remote worker nodes are intended to be reserved for specific functions.
stateful sets do not get restarted when there is an outage. The pods remain in the terminating
state until the control plane can acknowledge that the pods are terminated.
To avoid scheduling a to a node that does not have access to the same type of persistent storage, OpenShift Container Platform cannot migrate pods that require persistent volumes to other zones in the case of network separation.
Legal Notice
Copy linkLink copied to clipboard!
Copyright © 2025 Red Hat
OpenShift documentation is licensed under the Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0).
Modified versions must remove all Red Hat trademarks.
Portions adapted from https://github.com/kubernetes-incubator/service-catalog/ with modifications by Red Hat.
Red Hat, Red Hat Enterprise Linux, the Red Hat logo, the Shadowman logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
Java® is a registered trademark of Oracle and/or its affiliates.
XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries.
MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries.
Node.js® is an official trademark of Joyent. Red Hat Software Collections is not formally related to or endorsed by the official Joyent Node.js open source or commercial project.
The OpenStack® Word Mark and OpenStack logo are either registered trademarks/service marks or trademarks/service marks of the OpenStack Foundation, in the United States and other countries and are used with the OpenStack Foundation’s permission. We are not affiliated with, endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.
All other trademarks are the property of their respective owners.