Chapter 4. Controlling pod placement onto nodes (scheduling)
4.1. Controlling pod placement using the scheduler
Pod scheduling is an internal process that determines placement of new pods onto nodes within the cluster.
The scheduler code has a clean separation that watches new pods as they get created and identifies the most suitable node to host them. It then creates bindings (pod to node bindings) for the pods using the master API.
- Default pod scheduling
- OpenShift Dedicated comes with a default scheduler that serves the needs of most users. The default scheduler uses both inherent and customization tools to determine the best fit for a pod.
- Advanced pod scheduling
In situations where you might want more control over where new pods are placed, the OpenShift Dedicated advanced scheduling features allow you to configure a pod so that the pod is required or has a preference to run on a particular node or alongside a specific pod.
You can control pod placement by using the following scheduling features:
4.1.1. About the default scheduler
The default OpenShift Dedicated pod scheduler is responsible for determining the placement of new pods onto nodes within the cluster. It reads data from the pod and finds a node that is a good fit based on configured profiles. It is completely independent and exists as a standalone solution. It does not modify the pod; it creates a binding for the pod that ties the pod to the particular node.
4.1.1.1. Understanding default scheduling
The existing generic scheduler is the default platform-provided scheduler engine that selects a node to host the pod in a three-step operation:
- Filters the nodes
- The available nodes are filtered based on the constraints or requirements specified. This is done by running each node through the list of filter functions called predicates, or filters.
- Prioritizes the filtered list of nodes
- This is achieved by passing each node through a series of priority, or scoring, functions that assign it a score between 0 - 10, with 0 indicating a bad fit and 10 indicating a good fit to host the pod. The scheduler configuration can also take in a simple weight (positive numeric value) for each scoring function. The node score provided by each scoring function is multiplied by the weight (default weight for most scores is 1) and then combined by adding the scores for each node provided by all the scores. This weight attribute can be used by administrators to give higher importance to some scores.
- Selects the best fit node
- The nodes are sorted based on their scores and the node with the highest score is selected to host the pod. If multiple nodes have the same high score, then one of them is selected at random.
4.1.2. Scheduler use cases
One of the important use cases for scheduling within OpenShift Dedicated is to support flexible affinity and anti-affinity policies.
4.1.2.1. Affinity
Administrators should be able to configure the scheduler to specify affinity at any topological level, or even at multiple levels. Affinity at a particular level indicates that all pods that belong to the same service are scheduled onto nodes that belong to the same level. This handles any latency requirements of applications by allowing administrators to ensure that peer pods do not end up being too geographically separated. If no node is available within the same affinity group to host the pod, then the pod is not scheduled.
If you need greater control over where the pods are scheduled, see Controlling pod placement on nodes using node affinity rules and Placing pods relative to other pods using affinity and anti-affinity rules.
These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.
4.1.2.2. Anti-affinity
Administrators should be able to configure the scheduler to specify anti-affinity at any topological level, or even at multiple levels. Anti-affinity (or 'spread') at a particular level indicates that all pods that belong to the same service are spread across nodes that belong to that level. This ensures that the application is well spread for high availability purposes. The scheduler tries to balance the service pods across all applicable nodes as evenly as possible.
If you need greater control over where the pods are scheduled, see Controlling pod placement on nodes using node affinity rules and Placing pods relative to other pods using affinity and anti-affinity rules.
These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.
4.2. Placing pods relative to other pods using affinity and anti-affinity rules
Affinity is a property of pods that controls the nodes on which they prefer to be scheduled. Anti-affinity is a property of pods that prevents a pod from being scheduled on a node.
In OpenShift Dedicated, pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key-value labels on other pods.
4.2.1. Understanding pod affinity
Pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key/value labels on other pods.
- Pod affinity can tell the scheduler to locate a new pod on the same node as other pods if the label selector on the new pod matches the label on the current pod.
- Pod anti-affinity can prevent the scheduler from locating a new pod on the same node as pods with the same labels if the label selector on the new pod matches the label on the current pod.
For example, using affinity rules, you could spread or pack pods within a service or relative to pods in other services. Anti-affinity rules allow you to prevent pods of a particular service from scheduling on the same nodes as pods of another service that are known to interfere with the performance of the pods of the first service. Or, you could spread the pods of a service across nodes, availability zones, or availability sets to reduce correlated failures.
A label selector might match pods with multiple pod deployments. Use unique combinations of labels when configuring anti-affinity rules to avoid matching pods.
There are two types of pod affinity rules: required and preferred.
Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.
Depending on your pod priority and preemption settings, the scheduler might not be able to find an appropriate node for a pod without violating affinity requirements. If so, a pod might not be scheduled.
To prevent this situation, carefully configure pod affinity with equal-priority pods.
You configure pod affinity/anti-affinity through the Pod
spec files. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, then attempts to meet the preferred rule.
The following example shows a Pod
spec configured for pod affinity and anti-affinity.
In this example, the pod affinity rule indicates that the pod can schedule onto a node only if that node has at least one already-running pod with a label that has the key security
and value S1
. The pod anti-affinity rule says that the pod prefers to not schedule onto a node if that node is already running a pod with label having key security
and value S2
.
Sample Pod
config file with pod affinity
apiVersion: v1 kind: Pod metadata: name: with-pod-affinity spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault affinity: podAffinity: 1 requiredDuringSchedulingIgnoredDuringExecution: 2 - labelSelector: matchExpressions: - key: security 3 operator: In 4 values: - S1 5 topologyKey: topology.kubernetes.io/zone containers: - name: with-pod-affinity image: docker.io/ocpqe/hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL]
- 1
- Stanza to configure pod affinity.
- 2
- Defines a required rule.
- 3 5
- The key and value (label) that must be matched to apply the rule.
- 4
- The operator represents the relationship between the label on the existing pod and the set of values in the
matchExpression
parameters in the specification for the new pod. Can beIn
,NotIn
,Exists
, orDoesNotExist
.
Sample Pod
config file with pod anti-affinity
apiVersion: v1 kind: Pod metadata: name: with-pod-antiaffinity spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault affinity: podAntiAffinity: 1 preferredDuringSchedulingIgnoredDuringExecution: 2 - weight: 100 3 podAffinityTerm: labelSelector: matchExpressions: - key: security 4 operator: In 5 values: - S2 topologyKey: kubernetes.io/hostname containers: - name: with-pod-affinity image: docker.io/ocpqe/hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL]
- 1
- Stanza to configure pod anti-affinity.
- 2
- Defines a preferred rule.
- 3
- Specifies a weight for a preferred rule. The node with the highest weight is preferred.
- 4
- Description of the pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.
- 5
- The operator represents the relationship between the label on the existing pod and the set of values in the
matchExpression
parameters in the specification for the new pod. Can beIn
,NotIn
,Exists
, orDoesNotExist
.
If labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod continues to run on the node.
4.2.2. Configuring a pod affinity rule
The following steps demonstrate a simple two-pod configuration that creates pod with a label and a pod that uses affinity to allow scheduling with that pod.
You cannot add an affinity directly to a scheduled pod.
Procedure
Create a pod with a specific label in the pod spec:
Create a YAML file with the following content:
apiVersion: v1 kind: Pod metadata: name: security-s1 labels: security: S1 spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault containers: - name: security-s1 image: docker.io/ocpqe/hello-pod securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault
Create the pod.
$ oc create -f <pod-spec>.yaml
When creating other pods, configure the following parameters to add the affinity:
Create a YAML file with the following content:
apiVersion: v1 kind: Pod metadata: name: security-s1-east # ... spec: affinity: 1 podAffinity: requiredDuringSchedulingIgnoredDuringExecution: 2 - labelSelector: matchExpressions: - key: security 3 values: - S1 operator: In 4 topologyKey: topology.kubernetes.io/zone 5 # ...
- 1
- Adds a pod affinity.
- 2
- Configures the
requiredDuringSchedulingIgnoredDuringExecution
parameter or thepreferredDuringSchedulingIgnoredDuringExecution
parameter. - 3
- Specifies the
key
andvalues
that must be met. If you want the new pod to be scheduled with the other pod, use the samekey
andvalues
parameters as the label on the first pod. - 4
- Specifies an
operator
. The operator can beIn
,NotIn
,Exists
, orDoesNotExist
. For example, use the operatorIn
to require the label to be in the node. - 5
- Specify a
topologyKey
, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain.
Create the pod.
$ oc create -f <pod-spec>.yaml
4.2.3. Configuring a pod anti-affinity rule
The following steps demonstrate a simple two-pod configuration that creates pod with a label and a pod that uses an anti-affinity preferred rule to attempt to prevent scheduling with that pod.
You cannot add an affinity directly to a scheduled pod.
Procedure
Create a pod with a specific label in the pod spec:
Create a YAML file with the following content:
apiVersion: v1 kind: Pod metadata: name: security-s1 labels: security: S1 spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault containers: - name: security-s1 image: docker.io/ocpqe/hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL]
Create the pod.
$ oc create -f <pod-spec>.yaml
When creating other pods, configure the following parameters:
Create a YAML file with the following content:
apiVersion: v1 kind: Pod metadata: name: security-s2-east # ... spec: # ... affinity: 1 podAntiAffinity: preferredDuringSchedulingIgnoredDuringExecution: 2 - weight: 100 3 podAffinityTerm: labelSelector: matchExpressions: - key: security 4 values: - S1 operator: In 5 topologyKey: kubernetes.io/hostname 6 # ...
- 1
- Adds a pod anti-affinity.
- 2
- Configures the
requiredDuringSchedulingIgnoredDuringExecution
parameter or thepreferredDuringSchedulingIgnoredDuringExecution
parameter. - 3
- For a preferred rule, specifies a weight for the node, 1-100. The node that with highest weight is preferred.
- 4
- Specifies the
key
andvalues
that must be met. If you want the new pod to not be scheduled with the other pod, use the samekey
andvalues
parameters as the label on the first pod. - 5
- Specifies an
operator
. The operator can beIn
,NotIn
,Exists
, orDoesNotExist
. For example, use the operatorIn
to require the label to be in the node. - 6
- Specifies a
topologyKey
, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain.
Create the pod.
$ oc create -f <pod-spec>.yaml
4.2.4. Sample pod affinity and anti-affinity rules
The following examples demonstrate pod affinity and pod anti-affinity.
4.2.4.1. Pod Affinity
The following example demonstrates pod affinity for pods with matching labels and label selectors.
The pod team4 has the label
team:4
.apiVersion: v1 kind: Pod metadata: name: team4 labels: team: "4" # ... spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault containers: - name: ocp image: docker.io/ocpqe/hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL] # ...
The pod team4a has the label selector
team:4
underpodAffinity
.apiVersion: v1 kind: Pod metadata: name: team4a # ... spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault affinity: podAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: team operator: In values: - "4" topologyKey: kubernetes.io/hostname containers: - name: pod-affinity image: docker.io/ocpqe/hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL] # ...
- The team4a pod is scheduled on the same node as the team4 pod.
4.2.4.2. Pod Anti-affinity
The following example demonstrates pod anti-affinity for pods with matching labels and label selectors.
The pod pod-s1 has the label
security:s1
.apiVersion: v1 kind: Pod metadata: name: pod-s1 labels: security: s1 # ... spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault containers: - name: ocp image: docker.io/ocpqe/hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL] # ...
The pod pod-s2 has the label selector
security:s1
underpodAntiAffinity
.apiVersion: v1 kind: Pod metadata: name: pod-s2 # ... spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault affinity: podAntiAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: security operator: In values: - s1 topologyKey: kubernetes.io/hostname containers: - name: pod-antiaffinity image: docker.io/ocpqe/hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL] # ...
-
The pod pod-s2 cannot be scheduled on the same node as
pod-s1
.
4.2.4.3. Pod Affinity with no Matching Labels
The following example demonstrates pod affinity for pods without matching labels and label selectors.
The pod pod-s1 has the label
security:s1
.apiVersion: v1 kind: Pod metadata: name: pod-s1 labels: security: s1 # ... spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault containers: - name: ocp image: docker.io/ocpqe/hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL] # ...
The pod pod-s2 has the label selector
security:s2
.apiVersion: v1 kind: Pod metadata: name: pod-s2 # ... spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault affinity: podAffinity: requiredDuringSchedulingIgnoredDuringExecution: - labelSelector: matchExpressions: - key: security operator: In values: - s2 topologyKey: kubernetes.io/hostname containers: - name: pod-affinity image: docker.io/ocpqe/hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL] # ...
The pod pod-s2 is not scheduled unless there is a node with a pod that has the
security:s2
label. If there is no other pod with that label, the new pod remains in a pending state:Example output
NAME READY STATUS RESTARTS AGE IP NODE pod-s2 0/1 Pending 0 32s <none>
4.3. Controlling pod placement on nodes using node affinity rules
Affinity is a property of pods that controls the nodes on which they prefer to be scheduled.
In OpenShift Dedicated node affinity is a set of rules used by the scheduler to determine where a pod can be placed. The rules are defined using custom labels on the nodes and label selectors specified in pods.
4.3.1. Understanding node affinity
Node affinity allows a pod to specify an affinity towards a group of nodes it can be placed on. The node does not have control over the placement.
For example, you could configure a pod to only run on a node with a specific CPU or in a specific availability zone.
There are two types of node affinity rules: required and preferred.
Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.
If labels on a node change at runtime that results in an node affinity rule on a pod no longer being met, the pod continues to run on the node.
You configure node affinity through the Pod
spec file. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, then attempts to meet the preferred rule.
The following example is a Pod
spec with a rule that requires the pod be placed on a node with a label whose key is e2e-az-NorthSouth
and whose value is either e2e-az-North
or e2e-az-South
:
Example pod configuration file with a node affinity required rule
apiVersion: v1 kind: Pod metadata: name: with-node-affinity spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault affinity: nodeAffinity: 1 requiredDuringSchedulingIgnoredDuringExecution: 2 nodeSelectorTerms: - matchExpressions: - key: e2e-az-NorthSouth 3 operator: In 4 values: - e2e-az-North 5 - e2e-az-South 6 containers: - name: with-node-affinity image: docker.io/ocpqe/hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL] # ...
- 1
- The stanza to configure node affinity.
- 2
- Defines a required rule.
- 3 5 6
- The key/value pair (label) that must be matched to apply the rule.
- 4
- The operator represents the relationship between the label on the node and the set of values in the
matchExpression
parameters in thePod
spec. This value can beIn
,NotIn
,Exists
, orDoesNotExist
,Lt
, orGt
.
The following example is a node specification with a preferred rule that a node with a label whose key is e2e-az-EastWest
and whose value is either e2e-az-East
or e2e-az-West
is preferred for the pod:
Example pod configuration file with a node affinity preferred rule
apiVersion: v1 kind: Pod metadata: name: with-node-affinity spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault affinity: nodeAffinity: 1 preferredDuringSchedulingIgnoredDuringExecution: 2 - weight: 1 3 preference: matchExpressions: - key: e2e-az-EastWest 4 operator: In 5 values: - e2e-az-East 6 - e2e-az-West 7 containers: - name: with-node-affinity image: docker.io/ocpqe/hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL] # ...
- 1
- The stanza to configure node affinity.
- 2
- Defines a preferred rule.
- 3
- Specifies a weight for a preferred rule. The node with highest weight is preferred.
- 4 6 7
- The key/value pair (label) that must be matched to apply the rule.
- 5
- The operator represents the relationship between the label on the node and the set of values in the
matchExpression
parameters in thePod
spec. This value can beIn
,NotIn
,Exists
, orDoesNotExist
,Lt
, orGt
.
There is no explicit node anti-affinity concept, but using the NotIn
or DoesNotExist
operator replicates that behavior.
If you are using node affinity and node selectors in the same pod configuration, note the following:
-
If you configure both
nodeSelector
andnodeAffinity
, both conditions must be satisfied for the pod to be scheduled onto a candidate node. -
If you specify multiple
nodeSelectorTerms
associated withnodeAffinity
types, then the pod can be scheduled onto a node if one of thenodeSelectorTerms
is satisfied. -
If you specify multiple
matchExpressions
associated withnodeSelectorTerms
, then the pod can be scheduled onto a node only if allmatchExpressions
are satisfied.
4.3.2. Configuring a required node affinity rule
Required rules must be met before a pod can be scheduled on a node.
Procedure
The following steps demonstrate a simple configuration that creates a node and a pod that the scheduler is required to place on the node.
Create a pod with a specific label in the pod spec:
Create a YAML file with the following content:
NoteYou cannot add an affinity directly to a scheduled pod.
Example output
apiVersion: v1 kind: Pod metadata: name: s1 spec: affinity: 1 nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: 2 nodeSelectorTerms: - matchExpressions: - key: e2e-az-name 3 values: - e2e-az1 - e2e-az2 operator: In 4 #...
- 1
- Adds a pod affinity.
- 2
- Configures the
requiredDuringSchedulingIgnoredDuringExecution
parameter. - 3
- Specifies the
key
andvalues
that must be met. If you want the new pod to be scheduled on the node you edited, use the samekey
andvalues
parameters as the label in the node. - 4
- Specifies an
operator
. The operator can beIn
,NotIn
,Exists
, orDoesNotExist
. For example, use the operatorIn
to require the label to be in the node.
Create the pod:
$ oc create -f <file-name>.yaml
4.3.3. Configuring a preferred node affinity rule
Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.
Procedure
The following steps demonstrate a simple configuration that creates a node and a pod that the scheduler tries to place on the node.
Create a pod with a specific label:
Create a YAML file with the following content:
NoteYou cannot add an affinity directly to a scheduled pod.
apiVersion: v1 kind: Pod metadata: name: s1 spec: affinity: 1 nodeAffinity: preferredDuringSchedulingIgnoredDuringExecution: 2 - weight: 3 preference: matchExpressions: - key: e2e-az-name 4 values: - e2e-az3 operator: In 5 #...
- 1
- Adds a pod affinity.
- 2
- Configures the
preferredDuringSchedulingIgnoredDuringExecution
parameter. - 3
- Specifies a weight for the node, as a number 1-100. The node with highest weight is preferred.
- 4
- Specifies the
key
andvalues
that must be met. If you want the new pod to be scheduled on the node you edited, use the samekey
andvalues
parameters as the label in the node. - 5
- Specifies an
operator
. The operator can beIn
,NotIn
,Exists
, orDoesNotExist
. For example, use the operatorIn
to require the label to be in the node.
Create the pod.
$ oc create -f <file-name>.yaml
4.3.4. Sample node affinity rules
The following examples demonstrate node affinity.
4.3.4.1. Node affinity with matching labels
The following example demonstrates node affinity for a node and pod with matching labels:
The Node1 node has the label
zone:us
:$ oc label node node1 zone=us
TipYou can alternatively apply the following YAML to add the label:
kind: Node apiVersion: v1 metadata: name: <node_name> labels: zone: us #...
The pod-s1 pod has the
zone
andus
key/value pair under a required node affinity rule:$ cat pod-s1.yaml
Example output
apiVersion: v1 kind: Pod metadata: name: pod-s1 spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault containers: - image: "docker.io/ocpqe/hello-pod" name: hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL] affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: "zone" operator: In values: - us #...
The pod-s1 pod can be scheduled on Node1:
$ oc get pod -o wide
Example output
NAME READY STATUS RESTARTS AGE IP NODE pod-s1 1/1 Running 0 4m IP1 node1
4.3.4.2. Node affinity with no matching labels
The following example demonstrates node affinity for a node and pod without matching labels:
The Node1 node has the label
zone:emea
:$ oc label node node1 zone=emea
TipYou can alternatively apply the following YAML to add the label:
kind: Node apiVersion: v1 metadata: name: <node_name> labels: zone: emea #...
The pod-s1 pod has the
zone
andus
key/value pair under a required node affinity rule:$ cat pod-s1.yaml
Example output
apiVersion: v1 kind: Pod metadata: name: pod-s1 spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault containers: - image: "docker.io/ocpqe/hello-pod" name: hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL] affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - matchExpressions: - key: "zone" operator: In values: - us #...
The pod-s1 pod cannot be scheduled on Node1:
$ oc describe pod pod-s1
Example output
... Events: FirstSeen LastSeen Count From SubObjectPath Type Reason --------- -------- ----- ---- ------------- -------- ------ 1m 33s 8 default-scheduler Warning FailedScheduling No nodes are available that match all of the following predicates:: MatchNodeSelector (1).
4.4. Placing pods onto overcommited nodes
In an overcommited state, the sum of the container compute resource requests and limits exceeds the resources available on the system. Overcommitment might be desirable in development environments where a trade-off of guaranteed performance for capacity is acceptable.
Requests and limits enable administrators to allow and manage the overcommitment of resources on a node. The scheduler uses requests for scheduling your container and providing a minimum service guarantee. Limits constrain the amount of compute resource that may be consumed on your node.
4.4.1. Understanding overcommitment
Requests and limits enable administrators to allow and manage the overcommitment of resources on a node. The scheduler uses requests for scheduling your container and providing a minimum service guarantee. Limits constrain the amount of compute resource that may be consumed on your node.
OpenShift Dedicated administrators can control the level of overcommit and manage container density on nodes by configuring masters to override the ratio between request and limit set on developer containers. In conjunction with a per-project LimitRange
object specifying limits and defaults, this adjusts the container limit and request to achieve the desired level of overcommit.
That these overrides have no effect if no limits have been set on containers. Create a LimitRange
object with default limits, per individual project, or in the project template, to ensure that the overrides apply.
After these overrides, the container limits and requests must still be validated by any LimitRange
object in the project. It is possible, for example, for developers to specify a limit close to the minimum limit, and have the request then be overridden below the minimum limit, causing the pod to be forbidden. This unfortunate user experience should be addressed with future work, but for now, configure this capability and LimitRange
objects with caution.
4.4.2. Understanding nodes overcommitment
In an overcommitted environment, it is important to properly configure your node to provide best system behavior.
When the node starts, it ensures that the kernel tunable flags for memory management are set properly. The kernel should never fail memory allocations unless it runs out of physical memory.
To ensure this behavior, OpenShift Dedicated configures the kernel to always overcommit memory by setting the vm.overcommit_memory
parameter to 1
, overriding the default operating system setting.
OpenShift Dedicated also configures the kernel not to panic when it runs out of memory by setting the vm.panic_on_oom
parameter to 0
. A setting of 0 instructs the kernel to call oom_killer in an Out of Memory (OOM) condition, which kills processes based on priority.
You can view the current setting by running the following commands on your nodes:
$ sysctl -a |grep commit
Example output
#... vm.overcommit_memory = 0 #...
$ sysctl -a |grep panic
Example output
#... vm.panic_on_oom = 0 #...
The above flags should already be set on nodes, and no further action is required.
You can also perform the following configurations for each node:
- Disable or enforce CPU limits using CPU CFS quotas
- Reserve resources for system processes
- Reserve memory across quality of service tiers
4.5. Placing pods on specific nodes using node selectors
A node selector specifies a map of key/value pairs that are defined using custom labels on nodes and selectors specified in pods.
For the pod to be eligible to run on a node, the pod must have the same key/value node selector as the label on the node.
4.5.1. About node selectors
You can use node selectors on pods and labels on nodes to control where the pod is scheduled. With node selectors, OpenShift Dedicated schedules the pods on nodes that contain matching labels.
You can use a node selector to place specific pods on specific nodes, cluster-wide node selectors to place new pods on specific nodes anywhere in the cluster, and project node selectors to place new pods in a project on specific nodes.
For example, as a cluster administrator, you can create an infrastructure where application developers can deploy pods only onto the nodes closest to their geographical location by including a node selector in every pod they create. In this example, the cluster consists of five data centers spread across two regions. In the U.S., label the nodes as us-east
, us-central
, or us-west
. In the Asia-Pacific region (APAC), label the nodes as apac-east
or apac-west
. The developers can add a node selector to the pods they create to ensure the pods get scheduled on those nodes.
A pod is not scheduled if the Pod
object contains a node selector, but no node has a matching label.
If you are using node selectors and node affinity in the same pod configuration, the following rules control pod placement onto nodes:
-
If you configure both
nodeSelector
andnodeAffinity
, both conditions must be satisfied for the pod to be scheduled onto a candidate node. -
If you specify multiple
nodeSelectorTerms
associated withnodeAffinity
types, then the pod can be scheduled onto a node if one of thenodeSelectorTerms
is satisfied. -
If you specify multiple
matchExpressions
associated withnodeSelectorTerms
, then the pod can be scheduled onto a node only if allmatchExpressions
are satisfied.
- Node selectors on specific pods and nodes
You can control which node a specific pod is scheduled on by using node selectors and labels.
To use node selectors and labels, first label the node to avoid pods being descheduled, then add the node selector to the pod.
NoteYou cannot add a node selector directly to an existing scheduled pod. You must label the object that controls the pod, such as deployment config.
For example, the following
Node
object has theregion: east
label:Sample
Node
object with a labelkind: Node apiVersion: v1 metadata: name: ip-10-0-131-14.ec2.internal selfLink: /api/v1/nodes/ip-10-0-131-14.ec2.internal uid: 7bc2580a-8b8e-11e9-8e01-021ab4174c74 resourceVersion: '478704' creationTimestamp: '2019-06-10T14:46:08Z' labels: kubernetes.io/os: linux topology.kubernetes.io/zone: us-east-1a node.openshift.io/os_version: '4.5' node-role.kubernetes.io/worker: '' topology.kubernetes.io/region: us-east-1 node.openshift.io/os_id: rhcos node.kubernetes.io/instance-type: m4.large kubernetes.io/hostname: ip-10-0-131-14 kubernetes.io/arch: amd64 region: east 1 type: user-node #...
- 1
- Labels to match the pod node selector.
A pod has the
type: user-node,region: east
node selector:Sample
Pod
object with node selectorsapiVersion: v1 kind: Pod metadata: name: s1 #... spec: nodeSelector: 1 region: east type: user-node #...
- 1
- Node selectors to match the node label. The node must have a label for each node selector.
When you create the pod using the example pod spec, it can be scheduled on the example node.
- Default cluster-wide node selectors
With default cluster-wide node selectors, when you create a pod in that cluster, OpenShift Dedicated adds the default node selectors to the pod and schedules the pod on nodes with matching labels.
For example, the following
Scheduler
object has the default cluster-wideregion=east
andtype=user-node
node selectors:Example Scheduler Operator Custom Resource
apiVersion: config.openshift.io/v1 kind: Scheduler metadata: name: cluster #... spec: defaultNodeSelector: type=user-node,region=east #...
A node in that cluster has the
type=user-node,region=east
labels:Example
Node
objectapiVersion: v1 kind: Node metadata: name: ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4 #... labels: region: east type: user-node #...
Example
Pod
object with a node selectorapiVersion: v1 kind: Pod metadata: name: s1 #... spec: nodeSelector: region: east #...
When you create the pod using the example pod spec in the example cluster, the pod is created with the cluster-wide node selector and is scheduled on the labeled node:
Example pod list with the pod on the labeled node
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES pod-s1 1/1 Running 0 20s 10.131.2.6 ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4 <none> <none>
NoteIf the project where you create the pod has a project node selector, that selector takes preference over a cluster-wide node selector. Your pod is not created or scheduled if the pod does not have the project node selector.
- Project node selectors
With project node selectors, when you create a pod in this project, OpenShift Dedicated adds the node selectors to the pod and schedules the pods on a node with matching labels. If there is a cluster-wide default node selector, a project node selector takes preference.
For example, the following project has the
region=east
node selector:Example
Namespace
objectapiVersion: v1 kind: Namespace metadata: name: east-region annotations: openshift.io/node-selector: "region=east" #...
The following node has the
type=user-node,region=east
labels:Example
Node
objectapiVersion: v1 kind: Node metadata: name: ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4 #... labels: region: east type: user-node #...
When you create the pod using the example pod spec in this example project, the pod is created with the project node selectors and is scheduled on the labeled node:
Example
Pod
objectapiVersion: v1 kind: Pod metadata: namespace: east-region #... spec: nodeSelector: region: east type: user-node #...
Example pod list with the pod on the labeled node
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES pod-s1 1/1 Running 0 20s 10.131.2.6 ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4 <none> <none>
A pod in the project is not created or scheduled if the pod contains different node selectors. For example, if you deploy the following pod into the example project, it is not created:
Example
Pod
object with an invalid node selectorapiVersion: v1 kind: Pod metadata: name: west-region #... spec: nodeSelector: region: west #...
4.5.2. Using node selectors to control pod placement
You can use node selectors on pods and labels on nodes to control where the pod is scheduled. With node selectors, OpenShift Dedicated schedules the pods on nodes that contain matching labels.
You add labels to a node, a compute machine set, or a machine config. Adding the label to the compute machine set ensures that if the node or machine goes down, new nodes have the label. Labels added to a node or machine config do not persist if the node or machine goes down.
To add node selectors to an existing pod, add a node selector to the controlling object for that pod, such as a ReplicaSet
object, DaemonSet
object, StatefulSet
object, Deployment
object, or DeploymentConfig
object. Any existing pods under that controlling object are recreated on a node with a matching label. If you are creating a new pod, you can add the node selector directly to the pod spec. If the pod does not have a controlling object, you must delete the pod, edit the pod spec, and recreate the pod.
You cannot add a node selector directly to an existing scheduled pod.
Prerequisites
To add a node selector to existing pods, determine the controlling object for that pod. For example, the router-default-66d5cf9464-m2g75
pod is controlled by the router-default-66d5cf9464
replica set:
$ oc describe pod router-default-66d5cf9464-7pwkc
Example output
kind: Pod apiVersion: v1 metadata: # ... Name: router-default-66d5cf9464-7pwkc Namespace: openshift-ingress # ... Controlled By: ReplicaSet/router-default-66d5cf9464 # ...
The web console lists the controlling object under ownerReferences
in the pod YAML:
apiVersion: v1 kind: Pod metadata: name: router-default-66d5cf9464-7pwkc # ... ownerReferences: - apiVersion: apps/v1 kind: ReplicaSet name: router-default-66d5cf9464 uid: d81dd094-da26-11e9-a48a-128e7edf0312 controller: true blockOwnerDeletion: true # ...
Procedure
Add the matching node selector to a pod:
To add a node selector to existing and future pods, add a node selector to the controlling object for the pods:
Example
ReplicaSet
object with labelskind: ReplicaSet apiVersion: apps/v1 metadata: name: hello-node-6fbccf8d9 # ... spec: # ... template: metadata: creationTimestamp: null labels: ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default pod-template-hash: 66d5cf9464 spec: nodeSelector: kubernetes.io/os: linux node-role.kubernetes.io/worker: '' type: user-node 1 # ...
- 1
- Add the node selector.
To add a node selector to a specific, new pod, add the selector to the
Pod
object directly:Example
Pod
object with a node selectorapiVersion: v1 kind: Pod metadata: name: hello-node-6fbccf8d9 # ... spec: nodeSelector: region: east type: user-node # ...
NoteYou cannot add a node selector directly to an existing scheduled pod.
4.6. Controlling pod placement by using pod topology spread constraints
You can use pod topology spread constraints to provide fine-grained control over the placement of your pods across nodes, zones, regions, or other user-defined topology domains. Distributing pods across failure domains can help to achieve high availability and more efficient resource utilization.
4.6.1. Example use cases
- As an administrator, I want my workload to automatically scale between two to fifteen pods. I want to ensure that when there are only two pods, they are not placed on the same node, to avoid a single point of failure.
- As an administrator, I want to distribute my pods evenly across multiple infrastructure zones to reduce latency and network costs. I want to ensure that my cluster can self-heal if issues arise.
4.6.2. Important considerations
- Pods in an OpenShift Dedicated cluster are managed by workload controllers such as deployments, stateful sets, or daemon sets. These controllers define the desired state for a group of pods, including how they are distributed and scaled across the nodes in the cluster. You should set the same pod topology spread constraints on all pods in a group to avoid confusion. When using a workload controller, such as a deployment, the pod template typically handles this for you.
-
Mixing different pod topology spread constraints can make OpenShift Dedicated behavior confusing and troubleshooting more difficult. You can avoid this by ensuring that all nodes in a topology domain are consistently labeled. OpenShift Dedicated automatically populates well-known labels, such as
kubernetes.io/hostname
. This helps avoid the need for manual labeling of nodes. These labels provide essential topology information, ensuring consistent node labeling across the cluster. - Only pods within the same namespace are matched and grouped together when spreading due to a constraint.
- You can specify multiple pod topology spread constraints, but you must ensure that they do not conflict with each other. All pod topology spread constraints must be satisfied for a pod to be placed.
4.6.3. Understanding skew and maxSkew
Skew refers to the difference in the number of pods that match a specified label selector across different topology domains, such as zones or nodes.
The skew is calculated for each domain by taking the absolute difference between the number of pods in that domain and the number of pods in the domain with the lowest amount of pods scheduled. Setting a maxSkew
value guides the scheduler to maintain a balanced pod distribution.
4.6.3.1. Example skew calculation
You have three zones (A, B, and C), and you want to distribute your pods evenly across these zones. If zone A has 5 pods, zone B has 3 pods, and zone C has 2 pods, to find the skew, you can subtract the number of pods in the domain with the lowest amount of pods scheduled from the number of pods currently in each zone. This means that the skew for zone A is 3, the skew for zone B is 1, and the skew for zone C is 0.
4.6.3.2. The maxSkew parameter
The maxSkew
parameter defines the maximum allowable difference, or skew, in the number of pods between any two topology domains. If maxSkew
is set to 1
, the number of pods in any topology domain should not differ by more than 1 from any other domain. If the skew exceeds maxSkew
, the scheduler attempts to place new pods in a way that reduces the skew, adhering to the constraints.
Using the previous example skew calculation, the skew values exceed the default maxSkew
value of 1
. The scheduler places new pods in zone B and zone C to reduce the skew and achieve a more balanced distribution, ensuring that no topology domain exceeds the skew of 1.
4.6.4. Example configurations for pod topology spread constraints
You can specify which pods to group together, which topology domains they are spread among, and the acceptable skew.
The following examples demonstrate pod topology spread constraint configurations.
Example to distribute pods that match the specified labels based on their zone
apiVersion: v1 kind: Pod metadata: name: my-pod labels: region: us-east spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault topologySpreadConstraints: - maxSkew: 1 1 topologyKey: topology.kubernetes.io/zone 2 whenUnsatisfiable: DoNotSchedule 3 labelSelector: 4 matchLabels: region: us-east 5 matchLabelKeys: - my-pod-label 6 containers: - image: "docker.io/ocpqe/hello-pod" name: hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL]
- 1
- The maximum difference in number of pods between any two topology domains. The default is
1
, and you cannot specify a value of0
. - 2
- The key of a node label. Nodes with this key and identical value are considered to be in the same topology.
- 3
- How to handle a pod if it does not satisfy the spread constraint. The default is
DoNotSchedule
, which tells the scheduler not to schedule the pod. Set toScheduleAnyway
to still schedule the pod, but the scheduler prioritizes honoring the skew to not make the cluster more imbalanced. - 4
- Pods that match this label selector are counted and recognized as a group when spreading to satisfy the constraint. Be sure to specify a label selector, otherwise no pods can be matched.
- 5
- Be sure that this
Pod
spec also sets its labels to match this label selector if you want it to be counted properly in the future. - 6
- A list of pod label keys to select which pods to calculate spreading over.
Example demonstrating a single pod topology spread constraint
kind: Pod apiVersion: v1 metadata: name: my-pod labels: region: us-east spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault topologySpreadConstraints: - maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: DoNotSchedule labelSelector: matchLabels: region: us-east containers: - image: "docker.io/ocpqe/hello-pod" name: hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL]
The previous example defines a Pod
spec with a one pod topology spread constraint. It matches on pods labeled region: us-east
, distributes among zones, specifies a skew of 1
, and does not schedule the pod if it does not meet these requirements.
Example demonstrating multiple pod topology spread constraints
kind: Pod apiVersion: v1 metadata: name: my-pod-2 labels: region: us-east spec: securityContext: runAsNonRoot: true seccompProfile: type: RuntimeDefault topologySpreadConstraints: - maxSkew: 1 topologyKey: node whenUnsatisfiable: DoNotSchedule labelSelector: matchLabels: region: us-east - maxSkew: 1 topologyKey: rack whenUnsatisfiable: DoNotSchedule labelSelector: matchLabels: region: us-east containers: - image: "docker.io/ocpqe/hello-pod" name: hello-pod securityContext: allowPrivilegeEscalation: false capabilities: drop: [ALL]
The previous example defines a Pod
spec with two pod topology spread constraints. Both match on pods labeled region: us-east
, specify a skew of 1
, and do not schedule the pod if it does not meet these requirements.
The first constraint distributes pods based on a user-defined label node
, and the second constraint distributes pods based on a user-defined label rack
. Both constraints must be met for the pod to be scheduled.