Chapter 12. Scheduling resources
12.1. Using node selectors to move logging resources
A node selector specifies a map of key/value pairs that are defined using custom labels on nodes and selectors specified in pods.
For the pod to be eligible to run on a node, the pod must have the same key/value node selector as the label on the node.
12.1.1. About node selectors
You can use node selectors on pods and labels on nodes to control where the pod is scheduled. With node selectors, OpenShift Container Platform schedules the pods on nodes that contain matching labels.
You can use a node selector to place specific pods on specific nodes, cluster-wide node selectors to place new pods on specific nodes anywhere in the cluster, and project node selectors to place new pods in a project on specific nodes.
For example, as a cluster administrator, you can create an infrastructure where application developers can deploy pods only onto the nodes closest to their geographical location by including a node selector in every pod they create. In this example, the cluster consists of five data centers spread across two regions. In the U.S., label the nodes as us-east
, us-central
, or us-west
. In the Asia-Pacific region (APAC), label the nodes as apac-east
or apac-west
. The developers can add a node selector to the pods they create to ensure the pods get scheduled on those nodes.
A pod is not scheduled if the Pod
object contains a node selector, but no node has a matching label.
If you are using node selectors and node affinity in the same pod configuration, the following rules control pod placement onto nodes:
-
If you configure both
nodeSelector
andnodeAffinity
, both conditions must be satisfied for the pod to be scheduled onto a candidate node. -
If you specify multiple
nodeSelectorTerms
associated withnodeAffinity
types, then the pod can be scheduled onto a node if one of thenodeSelectorTerms
is satisfied. -
If you specify multiple
matchExpressions
associated withnodeSelectorTerms
, then the pod can be scheduled onto a node only if allmatchExpressions
are satisfied.
- Node selectors on specific pods and nodes
You can control which node a specific pod is scheduled on by using node selectors and labels.
To use node selectors and labels, first label the node to avoid pods being descheduled, then add the node selector to the pod.
NoteYou cannot add a node selector directly to an existing scheduled pod. You must label the object that controls the pod, such as deployment config.
For example, the following
Node
object has theregion: east
label:Sample
Node
object with a labelkind: Node apiVersion: v1 metadata: name: ip-10-0-131-14.ec2.internal selfLink: /api/v1/nodes/ip-10-0-131-14.ec2.internal uid: 7bc2580a-8b8e-11e9-8e01-021ab4174c74 resourceVersion: '478704' creationTimestamp: '2019-06-10T14:46:08Z' labels: kubernetes.io/os: linux failure-domain.beta.kubernetes.io/zone: us-east-1a node.openshift.io/os_version: '4.5' node-role.kubernetes.io/worker: '' failure-domain.beta.kubernetes.io/region: us-east-1 node.openshift.io/os_id: rhcos beta.kubernetes.io/instance-type: m4.large kubernetes.io/hostname: ip-10-0-131-14 beta.kubernetes.io/arch: amd64 region: east 1 type: user-node #...
- 1
- Labels to match the pod node selector.
A pod has the
type: user-node,region: east
node selector:Sample
Pod
object with node selectorsapiVersion: v1 kind: Pod metadata: name: s1 #... spec: nodeSelector: 1 region: east type: user-node #...
- 1
- Node selectors to match the node label. The node must have a label for each node selector.
When you create the pod using the example pod spec, it can be scheduled on the example node.
- Default cluster-wide node selectors
With default cluster-wide node selectors, when you create a pod in that cluster, OpenShift Container Platform adds the default node selectors to the pod and schedules the pod on nodes with matching labels.
For example, the following
Scheduler
object has the default cluster-wideregion=east
andtype=user-node
node selectors:Example Scheduler Operator Custom Resource
apiVersion: config.openshift.io/v1 kind: Scheduler metadata: name: cluster #... spec: defaultNodeSelector: type=user-node,region=east #...
A node in that cluster has the
type=user-node,region=east
labels:Example
Node
objectapiVersion: v1 kind: Node metadata: name: ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4 #... labels: region: east type: user-node #...
Example
Pod
object with a node selectorapiVersion: v1 kind: Pod metadata: name: s1 #... spec: nodeSelector: region: east #...
When you create the pod using the example pod spec in the example cluster, the pod is created with the cluster-wide node selector and is scheduled on the labeled node:
Example pod list with the pod on the labeled node
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES pod-s1 1/1 Running 0 20s 10.131.2.6 ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4 <none> <none>
NoteIf the project where you create the pod has a project node selector, that selector takes preference over a cluster-wide node selector. Your pod is not created or scheduled if the pod does not have the project node selector.
- Project node selectors
With project node selectors, when you create a pod in this project, OpenShift Container Platform adds the node selectors to the pod and schedules the pods on a node with matching labels. If there is a cluster-wide default node selector, a project node selector takes preference.
For example, the following project has the
region=east
node selector:Example
Namespace
objectapiVersion: v1 kind: Namespace metadata: name: east-region annotations: openshift.io/node-selector: "region=east" #...
The following node has the
type=user-node,region=east
labels:Example
Node
objectapiVersion: v1 kind: Node metadata: name: ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4 #... labels: region: east type: user-node #...
When you create the pod using the example pod spec in this example project, the pod is created with the project node selectors and is scheduled on the labeled node:
Example
Pod
objectapiVersion: v1 kind: Pod metadata: namespace: east-region #... spec: nodeSelector: region: east type: user-node #...
Example pod list with the pod on the labeled node
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES pod-s1 1/1 Running 0 20s 10.131.2.6 ci-ln-qg1il3k-f76d1-hlmhl-worker-b-df2s4 <none> <none>
A pod in the project is not created or scheduled if the pod contains different node selectors. For example, if you deploy the following pod into the example project, it is not be created:
Example
Pod
object with an invalid node selectorapiVersion: v1 kind: Pod metadata: name: west-region #... spec: nodeSelector: region: west #...
12.1.2. Moving logging resources
You can configure the Red Hat OpenShift Logging Operator to deploy the pods for logging components, such as Elasticsearch and Kibana, to different nodes. You cannot move the Red Hat OpenShift Logging Operator pod from its installed location.
For example, you can move the Elasticsearch pods to a separate node because of high CPU, memory, and disk requirements.
Prerequisites
- You have installed the Red Hat OpenShift Logging Operator and the OpenShift Elasticsearch Operator.
Procedure
Edit the
ClusterLogging
custom resource (CR) in theopenshift-logging
project:$ oc edit ClusterLogging instance
Example
ClusterLogging
CRapiVersion: logging.openshift.io/v1 kind: ClusterLogging # ... spec: logStore: elasticsearch: nodeCount: 3 nodeSelector: 1 node-role.kubernetes.io/infra: '' tolerations: - effect: NoSchedule key: node-role.kubernetes.io/infra value: reserved - effect: NoExecute key: node-role.kubernetes.io/infra value: reserved redundancyPolicy: SingleRedundancy resources: limits: cpu: 500m memory: 16Gi requests: cpu: 500m memory: 16Gi storage: {} type: elasticsearch managementState: Managed visualization: kibana: nodeSelector: 2 node-role.kubernetes.io/infra: '' tolerations: - effect: NoSchedule key: node-role.kubernetes.io/infra value: reserved - effect: NoExecute key: node-role.kubernetes.io/infra value: reserved proxy: resources: null replicas: 1 resources: null type: kibana # ...
Verification
To verify that a component has moved, you can use the oc get pod -o wide
command.
For example:
You want to move the Kibana pod from the
ip-10-0-147-79.us-east-2.compute.internal
node:$ oc get pod kibana-5b8bdf44f9-ccpq9 -o wide
Example output
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES kibana-5b8bdf44f9-ccpq9 2/2 Running 0 27s 10.129.2.18 ip-10-0-147-79.us-east-2.compute.internal <none> <none>
You want to move the Kibana pod to the
ip-10-0-139-48.us-east-2.compute.internal
node, a dedicated infrastructure node:$ oc get nodes
Example output
NAME STATUS ROLES AGE VERSION ip-10-0-133-216.us-east-2.compute.internal Ready master 60m v1.24.0 ip-10-0-139-146.us-east-2.compute.internal Ready master 60m v1.24.0 ip-10-0-139-192.us-east-2.compute.internal Ready worker 51m v1.24.0 ip-10-0-139-241.us-east-2.compute.internal Ready worker 51m v1.24.0 ip-10-0-147-79.us-east-2.compute.internal Ready worker 51m v1.24.0 ip-10-0-152-241.us-east-2.compute.internal Ready master 60m v1.24.0 ip-10-0-139-48.us-east-2.compute.internal Ready infra 51m v1.24.0
Note that the node has a
node-role.kubernetes.io/infra: ''
label:$ oc get node ip-10-0-139-48.us-east-2.compute.internal -o yaml
Example output
kind: Node apiVersion: v1 metadata: name: ip-10-0-139-48.us-east-2.compute.internal selfLink: /api/v1/nodes/ip-10-0-139-48.us-east-2.compute.internal uid: 62038aa9-661f-41d7-ba93-b5f1b6ef8751 resourceVersion: '39083' creationTimestamp: '2020-04-13T19:07:55Z' labels: node-role.kubernetes.io/infra: '' ...
To move the Kibana pod, edit the
ClusterLogging
CR to add a node selector:apiVersion: logging.openshift.io/v1 kind: ClusterLogging # ... spec: # ... visualization: kibana: nodeSelector: 1 node-role.kubernetes.io/infra: '' proxy: resources: null replicas: 1 resources: null type: kibana
- 1
- Add a node selector to match the label in the node specification.
After you save the CR, the current Kibana pod is terminated and new pod is deployed:
$ oc get pods
Example output
NAME READY STATUS RESTARTS AGE cluster-logging-operator-84d98649c4-zb9g7 1/1 Running 0 29m elasticsearch-cdm-hwv01pf7-1-56588f554f-kpmlg 2/2 Running 0 28m elasticsearch-cdm-hwv01pf7-2-84c877d75d-75wqj 2/2 Running 0 28m elasticsearch-cdm-hwv01pf7-3-f5d95b87b-4nx78 2/2 Running 0 28m collector-42dzz 1/1 Running 0 28m collector-d74rq 1/1 Running 0 28m collector-m5vr9 1/1 Running 0 28m collector-nkxl7 1/1 Running 0 28m collector-pdvqb 1/1 Running 0 28m collector-tflh6 1/1 Running 0 28m kibana-5b8bdf44f9-ccpq9 2/2 Terminating 0 4m11s kibana-7d85dcffc8-bfpfp 2/2 Running 0 33s
The new pod is on the
ip-10-0-139-48.us-east-2.compute.internal
node:$ oc get pod kibana-7d85dcffc8-bfpfp -o wide
Example output
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES kibana-7d85dcffc8-bfpfp 2/2 Running 0 43s 10.131.0.22 ip-10-0-139-48.us-east-2.compute.internal <none> <none>
After a few moments, the original Kibana pod is removed.
$ oc get pods
Example output
NAME READY STATUS RESTARTS AGE cluster-logging-operator-84d98649c4-zb9g7 1/1 Running 0 30m elasticsearch-cdm-hwv01pf7-1-56588f554f-kpmlg 2/2 Running 0 29m elasticsearch-cdm-hwv01pf7-2-84c877d75d-75wqj 2/2 Running 0 29m elasticsearch-cdm-hwv01pf7-3-f5d95b87b-4nx78 2/2 Running 0 29m collector-42dzz 1/1 Running 0 29m collector-d74rq 1/1 Running 0 29m collector-m5vr9 1/1 Running 0 29m collector-nkxl7 1/1 Running 0 29m collector-pdvqb 1/1 Running 0 29m collector-tflh6 1/1 Running 0 29m kibana-7d85dcffc8-bfpfp 2/2 Running 0 62s
12.1.3. Additional resources
12.2. Using taints and tolerations to control logging pod placement
Taints and tolerations allow the node to control which pods should (or should not) be scheduled on them.
12.2.1. Understanding taints and tolerations
A taint allows a node to refuse a pod to be scheduled unless that pod has a matching toleration.
You apply taints to a node through the Node
specification (NodeSpec
) and apply tolerations to a pod through the Pod
specification (PodSpec
). When you apply a taint a node, the scheduler cannot place a pod on that node unless the pod can tolerate the taint.
Example taint in a node specification
apiVersion: v1 kind: Node metadata: name: my-node #... spec: taints: - effect: NoExecute key: key1 value: value1 #...
Example toleration in a Pod
spec
apiVersion: v1 kind: Pod metadata: name: my-pod #... spec: tolerations: - key: "key1" operator: "Equal" value: "value1" effect: "NoExecute" tolerationSeconds: 3600 #...
Taints and tolerations consist of a key, value, and effect.
Parameter | Description | ||||||
---|---|---|---|---|---|---|---|
|
The | ||||||
|
The | ||||||
| The effect is one of the following:
| ||||||
|
|
If you add a
NoSchedule
taint to a control plane node, the node must have thenode-role.kubernetes.io/master=:NoSchedule
taint, which is added by default.For example:
apiVersion: v1 kind: Node metadata: annotations: machine.openshift.io/machine: openshift-machine-api/ci-ln-62s7gtb-f76d1-v8jxv-master-0 machineconfiguration.openshift.io/currentConfig: rendered-master-cdc1ab7da414629332cc4c3926e6e59c name: my-node #... spec: taints: - effect: NoSchedule key: node-role.kubernetes.io/master #...
A toleration matches a taint:
If the
operator
parameter is set toEqual
:-
the
key
parameters are the same; -
the
value
parameters are the same; -
the
effect
parameters are the same.
-
the
If the
operator
parameter is set toExists
:-
the
key
parameters are the same; -
the
effect
parameters are the same.
-
the
The following taints are built into OpenShift Container Platform:
-
node.kubernetes.io/not-ready
: The node is not ready. This corresponds to the node conditionReady=False
. -
node.kubernetes.io/unreachable
: The node is unreachable from the node controller. This corresponds to the node conditionReady=Unknown
. -
node.kubernetes.io/memory-pressure
: The node has memory pressure issues. This corresponds to the node conditionMemoryPressure=True
. -
node.kubernetes.io/disk-pressure
: The node has disk pressure issues. This corresponds to the node conditionDiskPressure=True
. -
node.kubernetes.io/network-unavailable
: The node network is unavailable. -
node.kubernetes.io/unschedulable
: The node is unschedulable. -
node.cloudprovider.kubernetes.io/uninitialized
: When the node controller is started with an external cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint. node.kubernetes.io/pid-pressure
: The node has pid pressure. This corresponds to the node conditionPIDPressure=True
.ImportantOpenShift Container Platform does not set a default pid.available
evictionHard
.
12.2.2. Using tolerations to control log store pod placement
By default, log store pods have the following toleration configurations:
Elasticsearch log store pods default tolerations
apiVersion: v1 kind: Pod metadata: name: elasticsearch-example namespace: openshift-logging spec: # ... tolerations: - effect: NoSchedule key: node.kubernetes.io/disk-pressure operator: Exists - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 300 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 300 - effect: NoSchedule key: node.kubernetes.io/memory-pressure operator: Exists # ...
LokiStack log store pods default tolerations
apiVersion: v1 kind: Pod metadata: name: lokistack-example namespace: openshift-logging spec: # ... tolerations: - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists tolerationSeconds: 300 - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists tolerationSeconds: 300 - effect: NoSchedule key: node.kubernetes.io/memory-pressure operator: Exists # ...
You can configure a toleration for log store pods by adding a taint and then modifying the tolerations
syntax in the ClusterLogging
custom resource (CR).
Prerequisites
- You have installed the Red Hat OpenShift Logging Operator.
-
You have installed the OpenShift CLI (
oc
). - You have deployed an internal log store that is either Elasticsearch or LokiStack.
Procedure
Add a taint to a node where you want to schedule the logging pods, by running the following command:
$ oc adm taint nodes <node_name> <key>=<value>:<effect>
Example command
$ oc adm taint nodes node1 lokistack=node:NoExecute
This example places a taint on
node1
that has keylokistack
, valuenode
, and taint effectNoExecute
. Nodes with theNoExecute
effect schedule only pods that match the taint and remove existing pods that do not match.Edit the
logstore
section of theClusterLogging
CR to configure a toleration for the log store pods:Example
ClusterLogging
CRapiVersion: logging.openshift.io/v1 kind: ClusterLogging metadata: # ... spec: # ... logStore: type: lokistack elasticsearch: nodeCount: 1 tolerations: - key: lokistack 1 operator: Exists 2 effect: NoExecute 3 tolerationSeconds: 6000 4 # ...
This toleration matches the taint created by the oc adm taint
command. A pod with this toleration can be scheduled onto node1
.
12.2.3. Using tolerations to control the log visualizer pod placement
You can use a specific key/value pair that is not on other pods to ensure that only the Kibana pod can run on the specified node.
Prerequisites
-
You have installed the Red Hat OpenShift Logging Operator, the OpenShift Elasticsearch Operator, and the OpenShift CLI (
oc
).
Procedure
Add a taint to a node where you want to schedule the log visualizer pod by running the following command:
$ oc adm taint nodes <node_name> <key>=<value>:<effect>
Example command
$ oc adm taint nodes node1 kibana=node:NoExecute
This example places a taint on
node1
that has keykibana
, valuenode
, and taint effectNoExecute
. You must use theNoExecute
taint effect.NoExecute
schedules only pods that match the taint and remove existing pods that do not match.Edit the
visualization
section of theClusterLogging
CR to configure a toleration for the Kibana pod:apiVersion: logging.openshift.io/v1 kind: ClusterLogging metadata: # ... spec: # ... visualization: type: kibana kibana: tolerations: - key: kibana 1 operator: Exists 2 effect: NoExecute 3 tolerationSeconds: 6000 4 resources: limits: memory: 2Gi requests: cpu: 100m memory: 1Gi replicas: 1 # ...
This toleration matches the taint created by the oc adm taint
command. A pod with this toleration would be able to schedule onto node1
.
12.2.4. Using tolerations to control log collector pod placement
By default, log collector pods have the following tolerations
configuration:
apiVersion: v1 kind: Pod metadata: name: collector-example namespace: openshift-logging spec: # ... collection: type: vector tolerations: - effect: NoSchedule key: node-role.kubernetes.io/master operator: Exists - effect: NoSchedule key: node.kubernetes.io/disk-pressure operator: Exists - effect: NoExecute key: node.kubernetes.io/not-ready operator: Exists - effect: NoExecute key: node.kubernetes.io/unreachable operator: Exists - effect: NoSchedule key: node.kubernetes.io/memory-pressure operator: Exists - effect: NoSchedule key: node.kubernetes.io/pid-pressure operator: Exists - effect: NoSchedule key: node.kubernetes.io/unschedulable operator: Exists # ...
Prerequisites
-
You have installed the Red Hat OpenShift Logging Operator and OpenShift CLI (
oc
).
Procedure
Add a taint to a node where you want logging collector pods to schedule logging collector pods by running the following command:
$ oc adm taint nodes <node_name> <key>=<value>:<effect>
Example command
$ oc adm taint nodes node1 collector=node:NoExecute
This example places a taint on
node1
that has keycollector
, valuenode
, and taint effectNoExecute
. You must use theNoExecute
taint effect.NoExecute
schedules only pods that match the taint and removes existing pods that do not match.Edit the
collection
stanza of theClusterLogging
custom resource (CR) to configure a toleration for the logging collector pods:apiVersion: logging.openshift.io/v1 kind: ClusterLogging metadata: # ... spec: # ... collection: type: vector tolerations: - key: collector 1 operator: Exists 2 effect: NoExecute 3 tolerationSeconds: 6000 4 resources: limits: memory: 2Gi requests: cpu: 100m memory: 1Gi # ...
This toleration matches the taint created by the oc adm taint
command. A pod with this toleration can be scheduled onto node1
.