Chapter 2. Controlling pod placement onto nodes (scheduling)
2.1. Controlling pod placement using the scheduler
Pod scheduling is an internal process that determines placement of new pods onto nodes within the cluster.
The scheduler is a cleanly separated component that watches for new pods as they are created and identifies the most suitable node to host them. It then creates pod-to-node bindings for the pods by using the master API.
- Default pod scheduling
- OpenShift Container Platform comes with a default scheduler that serves the needs of most users. The default scheduler uses both inherent and customizable tools to determine the best fit for a pod.
- Advanced pod scheduling
In situations where you might want more control over where new pods are placed, the OpenShift Container Platform advanced scheduling features allow you to configure a pod so that the pod is required or has a preference to run on a particular node or alongside a specific pod by:
- Using pod affinity and anti-affinity rules.
- Controlling pod placement with pod affinity.
- Controlling pod placement with node affinity.
- Placing pods on overcommitted nodes.
- Controlling pod placement with node selectors.
- Controlling pod placement with taints and tolerations.
2.1.1. Scheduler Use Cases
One of the important use cases for scheduling within OpenShift Container Platform is to support flexible affinity and anti-affinity policies.
2.1.1.1. Infrastructure Topological Levels
Administrators can define multiple topological levels for their infrastructure (nodes) by specifying labels on nodes. For example: region=r1, zone=z1, rack=s1.
These label names have no particular meaning and administrators are free to name their infrastructure levels anything, such as city/building/room. Also, administrators can define any number of levels for their infrastructure topology, with three levels usually being adequate (such as regions, zones, and racks). Administrators can specify affinity and anti-affinity rules at each of these levels in any combination.
2.1.1.2. Affinity
Administrators should be able to configure the scheduler to specify affinity at any topological level, or even at multiple levels. Affinity at a particular level indicates that all pods that belong to the same service are scheduled onto nodes that belong to the same level. This handles any latency requirements of applications by allowing administrators to ensure that peer pods do not end up being too geographically separated. If no node is available within the same affinity group to host the pod, then the pod is not scheduled.
If you need greater control over where the pods are scheduled, see Using Node Affinity and Using Pod Affinity and Anti-affinity.
These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.
2.1.1.3. Anti-Affinity
Administrators should be able to configure the scheduler to specify anti-affinity at any topological level, or even at multiple levels. Anti-affinity (or 'spread') at a particular level indicates that all pods that belong to the same service are spread across nodes that belong to that level. This ensures that the application is well spread for high availability purposes. The scheduler tries to balance the service pods across all applicable nodes as evenly as possible.
If you need greater control over where the pods are scheduled, see Using Node Affinity and Using Pod Affinity and Anti-affinity.
These advanced scheduling features allow administrators to specify which node a pod can be scheduled on and to force or reject scheduling relative to other pods.
2.2. Configuring the default scheduler to control pod placement
The default OpenShift Container Platform pod scheduler is responsible for determining placement of new pods onto nodes within the cluster. It reads data from the pod and tries to find a node that is a good fit based on configured policies. It is completely independent and exists as a standalone/pluggable solution. It does not modify the pod and just creates a binding for the pod that ties the pod to the particular node.
A selection of predicates and priorities defines the policy for the scheduler. See Modifying scheduler policy for a list of predicates and priorities.
Sample default scheduler object
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  creationTimestamp: 2019-05-20T15:39:01Z
  generation: 1
  name: cluster
  resourceVersion: "1491"
  selfLink: /apis/config.openshift.io/v1/schedulers/cluster
  uid: 6435dd99-7b15-11e9-bd48-0aec821b8e34
spec:
  policy:
    name: scheduler-policy
  defaultNodeSelector: type=user-node,region=east
2.2.1. Understanding default scheduling
The existing generic scheduler is the default platform-provided scheduler engine that selects a node to host the pod in a three-step operation:
- Filters the Nodes
- The available nodes are filtered based on the constraints or requirements specified. This is done by running each node through the list of filter functions called predicates.
- Prioritize the Filtered List of Nodes
- This is achieved by passing each node through a series of priority functions that assign it a score between 0 and 10, with 0 indicating a bad fit and 10 indicating a good fit to host the pod. The scheduler configuration can also take in a simple weight (positive numeric value) for each priority function. The node score provided by each priority function is multiplied by the weight (the default weight for most priorities is 1), and the weighted scores from all the priorities are then added together for each node. Administrators can use this weight attribute to give higher importance to some priorities.
- Select the Best Fit Node
- The nodes are sorted based on their scores and the node with the highest score is selected to host the pod. If multiple nodes have the same high score, then one of them is selected at random.
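The three steps above can be sketched in a few lines of Python. This is only an illustrative model, not the scheduler's actual code: the `schedule` function, the dictionary-based pod and node shapes, and the two toy priority functions (`more_free_cpu`, `has_ssd`) are all hypothetical names invented for this example.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Node = Dict[str, object]
Pod = Dict[str, object]
Predicate = Callable[[Pod, Node], bool]   # True if the node is acceptable
Priority = Callable[[Pod, Node], int]     # score from 0 (bad fit) to 10 (good fit)


def schedule(pod: Pod, nodes: List[Node], predicates: List[Predicate],
             priorities: List[Tuple[Priority, int]]) -> Optional[Node]:
    """Pick a node for the pod: filter, score with weights, select the best."""
    # Step 1: keep only the nodes that pass every predicate.
    feasible = [n for n in nodes if all(p(pod, n) for p in predicates)]
    if not feasible:
        return None  # no node fits, so the pod is not scheduled

    # Step 2: each priority score is multiplied by its weight, then summed.
    totals = [sum(weight * fn(pod, n) for fn, weight in priorities)
              for n in feasible]

    # Step 3: highest total wins; ties are broken at random.
    best = max(totals)
    return random.choice([n for n, t in zip(feasible, totals) if t == best])


# Hypothetical predicate and priorities, for illustration only.
def fits_cpu(pod: Pod, node: Node) -> bool:
    return node["free_cpu"] >= pod["cpu"]

def more_free_cpu(pod: Pod, node: Node) -> int:
    return min(10, int(node["free_cpu"]))

def has_ssd(pod: Pod, node: Node) -> int:
    return 10 if node["ssd"] else 0


NODES = [
    {"name": "a", "free_cpu": 4, "ssd": True},
    {"name": "b", "free_cpu": 1, "ssd": True},
    {"name": "c", "free_cpu": 8, "ssd": False},
]
PREDICATES = [fits_cpu]
PRIORITIES = [(more_free_cpu, 1), (has_ssd, 2)]  # (function, weight)

chosen = schedule({"cpu": 2}, NODES, PREDICATES, PRIORITIES)
print(chosen["name"])
```

In this toy run, node b is filtered out because it lacks CPU, and the weight of 2 on `has_ssd` lets node a (4 + 2×10 = 24) outrank node c (8 + 0 = 8), which shows how a weight gives one priority more influence than another.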
2.2.1.1. Understanding Scheduler Policy
The selection of the predicates and priorities defines the policy for the scheduler.
The scheduler configuration file is a JSON file, which must be named policy.cfg, that specifies the predicates and priorities the scheduler will consider.
In the absence of the scheduler policy file, the default scheduler behavior is used.
The predicates and priorities defined in the scheduler configuration file completely override the default scheduler policy. If any of the default predicates and priorities are required, you must explicitly specify the functions in the policy configuration.
Sample scheduler ConfigMap
apiVersion: v1
data:
  policy.cfg: |
    {
      "kind" : "Policy",
      "apiVersion" : "v1",
      "predicates" : [
        {"name" : "MaxGCEPDVolumeCount"},
        {"name" : "GeneralPredicates"},
        {"name" : "MaxAzureDiskVolumeCount"},
        {"name" : "MaxCSIVolumeCountPred"},
        {"name" : "CheckVolumeBinding"},
        {"name" : "MaxEBSVolumeCount"},
        {"name" : "PodFitsResources"},
        {"name" : "MatchInterPodAffinity"},
        {"name" : "CheckNodeUnschedulable"},
        {"name" : "NoDiskConflict"},
        {"name" : "NoVolumeZoneConflict"},
        {"name" : "MatchNodeSelector"},
        {"name" : "HostName"},
        {"name" : "PodToleratesNodeTaints"}
      ],
      "priorities" : [
        {"name" : "LeastRequestedPriority", "weight" : 1},
        {"name" : "BalancedResourceAllocation", "weight" : 1},
        {"name" : "ServiceSpreadingPriority", "weight" : 1},
        {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
        {"name" : "NodeAffinityPriority", "weight" : 1},
        {"name" : "TaintTolerationPriority", "weight" : 1},
        {"name" : "ImageLocalityPriority", "weight" : 1},
        {"name" : "SelectorSpreadPriority", "weight" : 1},
        {"name" : "InterPodAffinityPriority", "weight" : 1},
        {"name" : "EqualPriority", "weight" : 1}
      ]
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2019-09-17T08:42:33Z"
  name: scheduler-policy
  namespace: openshift-config
  resourceVersion: "59500"
  selfLink: /api/v1/namespaces/openshift-config/configmaps/scheduler-policy
  uid: 17ee8865-d927-11e9-b213-02d1e1709840
2.2.2. Creating a scheduler policy file
You can change the default scheduling behavior by creating a JSON file with the desired predicates and priorities. You then generate a ConfigMap from the JSON file and point the cluster Scheduler object to use the ConfigMap.
Procedure
To configure the scheduler policy:
Create a JSON file named policy.cfg with the desired predicates and priorities.
Sample scheduler JSON file
{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    {"name" : "PodFitsHostPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "NoVolumeZoneConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "MaxEBSVolumeCount"},
    {"name" : "MaxAzureDiskVolumeCount"},
    {"name" : "checkServiceAffinity"},
    {"name" : "PodToleratesNodeNoExecuteTaints"},
    {"name" : "MaxGCEPDVolumeCount"},
    {"name" : "MatchInterPodAffinity"},
    {"name" : "PodToleratesNodeTaints"},
    {"name" : "HostName"}
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
  ]
}
Create a ConfigMap based on the scheduler JSON file:
$ oc create configmap -n openshift-config --from-file=policy.cfg <configmap-name> 1
- 1
- Enter a name for the ConfigMap.
For example:
$ oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy
configmap/scheduler-policy created
apiVersion: v1
data:
  policy.cfg: |
    {
      "kind" : "Policy",
      "apiVersion" : "v1",
      "predicates" : [
        {"name" : "MaxGCEPDVolumeCount"},
        {"name" : "GeneralPredicates"},
        {"name" : "MaxAzureDiskVolumeCount"},
        {"name" : "MaxCSIVolumeCountPred"},
        {"name" : "CheckVolumeBinding"},
        {"name" : "MaxEBSVolumeCount"},
        {"name" : "PodFitsResources"},
        {"name" : "MatchInterPodAffinity"},
        {"name" : "CheckNodeUnschedulable"},
        {"name" : "NoDiskConflict"},
        {"name" : "NoVolumeZoneConflict"},
        {"name" : "MatchNodeSelector"},
        {"name" : "HostName"},
        {"name" : "PodToleratesNodeTaints"}
      ],
      "priorities" : [
        {"name" : "LeastRequestedPriority", "weight" : 1},
        {"name" : "BalancedResourceAllocation", "weight" : 1},
        {"name" : "ServiceSpreadingPriority", "weight" : 1},
        {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
        {"name" : "NodeAffinityPriority", "weight" : 1},
        {"name" : "TaintTolerationPriority", "weight" : 1},
        {"name" : "ImageLocalityPriority", "weight" : 1},
        {"name" : "SelectorSpreadPriority", "weight" : 1},
        {"name" : "InterPodAffinityPriority", "weight" : 1},
        {"name" : "EqualPriority", "weight" : 1}
      ]
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2019-09-17T08:42:33Z"
  name: scheduler-policy
  namespace: openshift-config
  resourceVersion: "59500"
  selfLink: /api/v1/namespaces/openshift-config/configmaps/scheduler-policy
  uid: 17ee8865-d927-11e9-b213-02d1e1709840
Edit the Scheduler Operator Custom Resource to add the ConfigMap:
$ oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"<configmap-name>"}}}' 1
- 1
- Specify the name of the ConfigMap.
For example:
$ oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"scheduler-policy"}}}'
After making the change to the Scheduler config resource, wait for the openshift-kube-apiserver pods to redeploy. This can take several minutes. Until the pods redeploy, the new scheduler policy does not take effect.
Verify the scheduler policy is configured by viewing the log of a scheduler pod in the openshift-kube-scheduler namespace. The following command checks for the predicates and priorities that are being registered by the scheduler:
$ oc logs <scheduler-pod> | grep predicates
For example:
$ oc logs openshift-kube-scheduler-ip-10-0-141-29.ec2.internal | grep predicates
Creating scheduler with fit predicates 'map[MaxGCEPDVolumeCount:{} MaxAzureDiskVolumeCount:{} CheckNodeUnschedulable:{} NoDiskConflict:{} NoVolumeZoneConflict:{} MatchNodeSelector:{} GeneralPredicates:{} MaxCSIVolumeCountPred:{} CheckVolumeBinding:{} MaxEBSVolumeCount:{} PodFitsResources:{} MatchInterPodAffinity:{} HostName:{} PodToleratesNodeTaints:{}]' and priority functions 'map[InterPodAffinityPriority:{} LeastRequestedPriority:{} ServiceSpreadingPriority:{} ImageLocalityPriority:{} SelectorSpreadPriority:{} EqualPriority:{} BalancedResourceAllocation:{} NodePreferAvoidPodsPriority:{} NodeAffinityPriority:{} TaintTolerationPriority:{}]'
2.2.3. Modifying scheduler policies
You can change the scheduling behavior by creating or editing your scheduler policy ConfigMap in the openshift-config project. Add and remove predicates and priorities in the ConfigMap to create a scheduler policy.
Procedure
To modify the current custom scheduling, use one of the following methods:
Edit the scheduler policy ConfigMap:
$ oc edit configmap <configmap-name> -n openshift-config
For example:
$ oc edit configmap scheduler-policy -n openshift-config
apiVersion: v1
data:
  policy.cfg: |
    {
      "kind" : "Policy",
      "apiVersion" : "v1",
      "predicates" : [
        {"name" : "MaxGCEPDVolumeCount"},
        {"name" : "GeneralPredicates"},
        {"name" : "MaxAzureDiskVolumeCount"},
        {"name" : "MaxCSIVolumeCountPred"},
        {"name" : "CheckVolumeBinding"},
        {"name" : "MaxEBSVolumeCount"},
        {"name" : "PodFitsResources"},
        {"name" : "MatchInterPodAffinity"},
        {"name" : "CheckNodeUnschedulable"},
        {"name" : "NoDiskConflict"},
        {"name" : "NoVolumeZoneConflict"},
        {"name" : "MatchNodeSelector"},
        {"name" : "HostName"},
        {"name" : "PodToleratesNodeTaints"}
      ],
      "priorities" : [
        {"name" : "LeastRequestedPriority", "weight" : 1},
        {"name" : "BalancedResourceAllocation", "weight" : 1},
        {"name" : "ServiceSpreadingPriority", "weight" : 1},
        {"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
        {"name" : "NodeAffinityPriority", "weight" : 1},
        {"name" : "TaintTolerationPriority", "weight" : 1},
        {"name" : "ImageLocalityPriority", "weight" : 1},
        {"name" : "SelectorSpreadPriority", "weight" : 1},
        {"name" : "InterPodAffinityPriority", "weight" : 1},
        {"name" : "EqualPriority", "weight" : 1}
      ]
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2019-09-17T17:44:19Z"
  name: scheduler-policy
  namespace: openshift-config
  resourceVersion: "15370"
  selfLink: /api/v1/namespaces/openshift-config/configmaps/scheduler-policy
It can take a few minutes for the scheduler to restart the pods with the updated policy.
Change the predicates and priorities being used:
Remove the scheduler policy ConfigMap:
$ oc delete configmap -n openshift-config <name>
For example:
$ oc delete configmap -n openshift-config scheduler-policy
Edit the policy.cfg file to add and remove predicates and priorities as needed.
For example:
$ vi policy.cfg
apiVersion: v1
data:
  policy.cfg: |
    {
      "kind" : "Policy",
      "apiVersion" : "v1",
      "predicates" : [
        {"name" : "PodFitsHostPorts"},
        {"name" : "PodFitsResources"},
        {"name" : "NoDiskConflict"},
        {"name" : "NoVolumeZoneConflict"},
        {"name" : "MatchNodeSelector"},
        {"name" : "MaxEBSVolumeCount"},
        {"name" : "MaxAzureDiskVolumeCount"},
        {"name" : "CheckVolumeBinding"},
        {"name" : "CheckServiceAffinity"},
        {"name" : "PodToleratesNodeNoExecuteTaints"},
        {"name" : "MaxGCEPDVolumeCount"},
        {"name" : "MatchInterPodAffinity"},
        {"name" : "PodToleratesNodeTaints"},
        {"name" : "HostName"}
      ],
      "priorities" : [
        {"name" : "LeastRequestedPriority", "weight" : 2},
        {"name" : "BalancedResourceAllocation", "weight" : 2},
        {"name" : "ServiceSpreadingPriority", "weight" : 2},
        {"name" : "EqualPriority", "weight" : 2}
      ]
    }
Re-create the scheduler policy ConfigMap based on the scheduler JSON file:
$ oc create configmap -n openshift-config --from-file=policy.cfg <configmap-name> 1
- 1
- Enter a name for the ConfigMap.
For example:
$ oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy
configmap/scheduler-policy created
2.2.3.1. Understanding the scheduler predicates
Predicates are rules that filter out unqualified nodes.
There are several predicates provided by default in OpenShift Container Platform. Some of these predicates can be customized by providing certain parameters. Multiple predicates can be combined to provide additional filtering of nodes.
2.2.3.1.1. Static Predicates
These predicates do not take any configuration parameters or inputs from the user. These are specified in the scheduler configuration using their exact name.
2.2.3.1.1.1. Default Predicates
The default scheduler policy includes the following predicates:
NoVolumeZoneConflict checks that the volumes a pod requests are available in the zone.
{"name" : "NoVolumeZoneConflict"}
MaxEBSVolumeCount checks the maximum number of volumes that can be attached to an AWS instance.
{"name" : "MaxEBSVolumeCount"}
MaxAzureDiskVolumeCount checks the maximum number of Azure Disk Volumes.
{"name" : "MaxAzureDiskVolumeCount"}
PodToleratesNodeTaints checks if a pod can tolerate the node taints.
{"name" : "PodToleratesNodeTaints"}
CheckNodeUnschedulable checks if a pod can be scheduled on a node with Unschedulable spec.
{"name" : "CheckNodeUnschedulable"}
CheckVolumeBinding evaluates if a pod can fit based on the volumes it requests, for both bound and unbound PVCs.
- For PVCs that are bound, the predicate checks that the corresponding PV's node affinity is satisfied by the given node.
- For PVCs that are unbound, the predicate searches for available PVs that can satisfy the PVC requirements and checks that the PV node affinity is satisfied by the given node.
The predicate returns true if all bound PVCs have compatible PVs with the node, and if all unbound PVCs can be matched with an available and node-compatible PV.
{"name" : "CheckVolumeBinding"}
NoDiskConflict checks if the volume requested by a pod is available.
{"name" : "NoDiskConflict"}
MaxGCEPDVolumeCount checks the maximum number of Google Compute Engine (GCE) Persistent Disks (PD).
{"name" : "MaxGCEPDVolumeCount"}
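MaxCSIVolumeCountPred checks the maximum number of Container Storage Interface (CSI) volumes that can be attached to a node.
{"name" : "MaxCSIVolumeCountPred"}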
MatchInterPodAffinity checks if the pod affinity/anti-affinity rules permit the pod.
{"name" : "MatchInterPodAffinity"}
2.2.3.1.1.2. Other Static Predicates
OpenShift Container Platform also supports the following predicates:
The CheckNode-* predicates cannot be used if the Taint Nodes By Condition feature is enabled. The Taint Nodes By Condition feature is enabled by default.
CheckNodeCondition checks if a pod can be scheduled on a node reporting out of disk, network unavailable, or not ready conditions.
{"name" : "CheckNodeCondition"}
CheckNodeLabelPresence checks if all of the specified labels exist on a node, regardless of their value.
{"name" : "CheckNodeLabelPresence"}
checkServiceAffinity checks that ServiceAffinity labels are homogeneous for pods that are scheduled on a node.
{"name" : "checkServiceAffinity"}
PodToleratesNodeNoExecuteTaints checks if a pod's tolerations can tolerate a node's NoExecute taints.
{"name" : "PodToleratesNodeNoExecuteTaints"}
2.2.3.1.2. General Predicates
The following general predicates check whether non-critical predicates and essential predicates pass. Non-critical predicates are the predicates that only non-critical pods must pass and essential predicates are the predicates that all pods must pass.
The default scheduler policy includes the general predicates.
Non-critical general predicates
PodFitsResources determines a fit based on resource availability (CPU, memory, GPU, and so forth). The nodes can declare their resource capacities and then pods can specify what resources they require. Fit is based on requested, rather than used resources.
{"name" : "PodFitsResources"}
Essential general predicates
PodFitsHostPorts determines if a node has free ports for the requested pod ports (absence of port conflicts).
{"name" : "PodFitsHostPorts"}
HostName determines fit based on the presence of the Host parameter and a string match with the name of the host.
{"name" : "HostName"}
MatchNodeSelector determines fit based on node selector (nodeSelector) queries defined in the pod.
{"name" : "MatchNodeSelector"}
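As a rough illustration of what three of the general predicates above decide, the following Python sketch models each one as a small boolean function. The function names, argument shapes, and units (CPU in millicores) are assumptions made for this example; they are not the scheduler's real data structures.

```python
from typing import Dict, Set


def pod_fits_resources(requests: Dict[str, int],
                       allocatable: Dict[str, int],
                       already_requested: Dict[str, int]) -> bool:
    """Fit is based on requested resources, not on what is actually in use."""
    return all(
        already_requested.get(resource, 0) + amount <= allocatable.get(resource, 0)
        for resource, amount in requests.items()
    )


def pod_fits_host_ports(wanted_ports: Set[int], ports_in_use: Set[int]) -> bool:
    """The node fits only if none of the requested host ports is taken."""
    return wanted_ports.isdisjoint(ports_in_use)


def match_node_selector(node_selector: Dict[str, str],
                        node_labels: Dict[str, str]) -> bool:
    """Every key/value pair in the pod's nodeSelector must be on the node."""
    return all(node_labels.get(key) == value
               for key, value in node_selector.items())


# A node with 2000 millicores where 1600 are already requested by other pods.
print(pod_fits_resources({"cpu": 500}, {"cpu": 2000}, {"cpu": 1600}))
print(pod_fits_host_ports({8080}, {443, 8080}))
print(match_node_selector({"region": "east"}, {"region": "east", "zone": "z1"}))
```

A node is kept as a candidate only when every configured predicate returns true, so a single failing check (for example, a host port conflict) is enough to filter the node out.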
2.2.3.2. Understanding the scheduler priorities
Priorities are rules that rank nodes according to preferences.
A custom set of priorities can be specified to configure the scheduler. There are several priorities provided by default in OpenShift Container Platform. Other priorities can be customized by providing certain parameters. Multiple priorities can be combined and different weights can be given to each in order to impact the prioritization.
2.2.3.2.1. Static Priorities
Static priorities do not take any configuration parameters from the user, except weight. A weight is required to be specified and cannot be 0 or negative.
These are specified in the scheduler policy ConfigMap in the openshift-config project.
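Because a missing, zero, or negative weight is not allowed, it can be useful to check a policy.cfg file before creating the ConfigMap. The following Python sketch shows one way to do that; the `check_priority_weights` helper and the sample policy are hypothetical and are not part of OpenShift Container Platform or the `oc` client.

```python
import json
from typing import List


def check_priority_weights(policy_json: str) -> List[str]:
    """Return a list of problems found in the priorities of a policy file."""
    policy = json.loads(policy_json)
    problems = []
    for entry in policy.get("priorities", []):
        name = entry.get("name", "<unnamed>")
        weight = entry.get("weight")
        if weight is None:
            problems.append(f"{name}: weight is missing")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif (isinstance(weight, bool)
              or not isinstance(weight, (int, float))
              or weight <= 0):
            problems.append(f"{name}: weight must be a positive number, got {weight!r}")
    return problems


SAMPLE = """
{
  "kind": "Policy",
  "apiVersion": "v1",
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "EqualPriority", "weight": 0},
    {"name": "ImageLocalityPriority"}
  ]
}
"""

for problem in check_priority_weights(SAMPLE):
    print(problem)
```

The check is deliberately narrow: it only looks at weights and does not verify that the priority names are ones the scheduler recognizes.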
2.2.3.2.1.1. Default Priorities
The default scheduler policy includes the following priorities. Each of the priority functions has a weight of 1 except NodePreferAvoidPodsPriority, which has a weight of 10000.
NodeAffinityPriority prioritizes nodes according to node affinity scheduling preferences.
{"name" : "NodeAffinityPriority", "weight" : 1}
TaintTolerationPriority prioritizes nodes that have fewer intolerable taints on them for a pod. An intolerable taint is one that has the effect PreferNoSchedule.
{"name" : "TaintTolerationPriority", "weight" : 1}
ImageLocalityPriority prioritizes nodes that already have the container images that the pod requests.
{"name" : "ImageLocalityPriority", "weight" : 1}
SelectorSpreadPriority looks for services, replication controllers (RC), replica sets (RS), and stateful sets that match the pod, then finds existing pods that match those selectors. The scheduler favors nodes that have fewer existing matching pods, so it schedules the pod on a node with the smallest number of pods that match the same selectors as the pod being scheduled.
{"name" : "SelectorSpreadPriority", "weight" : 1}
InterPodAffinityPriority computes a sum by iterating through the elements of weightedPodAffinityTerm and adding weight to the sum if the corresponding PodAffinityTerm is satisfied for that node. The node(s) with the highest sum are the most preferred.
{"name" : "InterPodAffinityPriority", "weight" : 1}
LeastRequestedPriority favors nodes with fewer requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes nodes that have the highest available/remaining capacity.
{"name" : "LeastRequestedPriority", "weight" : 1}
BalancedResourceAllocation favors nodes with balanced resource usage rate. It calculates the difference between the consumed CPU and memory as a fraction of capacity, and prioritizes the nodes based on how close the two metrics are to each other. This should always be used together with LeastRequestedPriority.
{"name" : "BalancedResourceAllocation", "weight" : 1}
NodePreferAvoidPodsPriority ignores pods that are owned by a controller other than a replication controller.
{"name" : "NodePreferAvoidPodsPriority", "weight" : 10000}
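The descriptions of LeastRequestedPriority and BalancedResourceAllocation above can be made concrete with a small calculation. The formulas in this Python sketch are one plausible reading of those descriptions, scaled to the 0 to 10 range that priority functions use; they are an assumption for illustration, and the scheduler's exact arithmetic and rounding might differ.

```python
def least_requested_score(requested_cpu: float, capacity_cpu: float,
                          requested_mem: float, capacity_mem: float) -> float:
    """Higher score for nodes with more remaining CPU and memory capacity."""
    def free_fraction(requested: float, capacity: float) -> float:
        if capacity <= 0 or requested > capacity:
            return 0.0
        return (capacity - requested) / capacity

    # Average the free fractions of CPU and memory, then scale to 0..10.
    return 10 * (free_fraction(requested_cpu, capacity_cpu)
                 + free_fraction(requested_mem, capacity_mem)) / 2


def balanced_allocation_score(requested_cpu: float, capacity_cpu: float,
                              requested_mem: float, capacity_mem: float) -> float:
    """Higher score when the CPU and memory usage fractions are close together."""
    if capacity_cpu <= 0 or capacity_mem <= 0:
        return 0.0
    cpu_fraction = requested_cpu / capacity_cpu
    mem_fraction = requested_mem / capacity_mem
    if cpu_fraction >= 1 or mem_fraction >= 1:
        return 0.0  # a fully requested resource is treated as the worst fit
    return 10 * (1 - abs(cpu_fraction - mem_fraction))


# Node with 4000 millicores and 8 GiB; pods request 1000 millicores and 2 GiB.
print(least_requested_score(1000, 4000, 2, 8))
print(balanced_allocation_score(1000, 4000, 2, 8))
# Same node, but CPU is 75% requested while memory is only 25% requested.
print(balanced_allocation_score(3000, 4000, 2, 8))
```

Under these assumed formulas, the two priorities complement each other: the first rewards emptier nodes, and the second penalizes nodes where one resource is much more heavily requested than the other.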
2.2.3.2.1.2. Other Static Priorities
OpenShift Container Platform also supports the following priorities:
EqualPriority gives an equal weight of 1 to all nodes, if no priority configurations are provided. We recommend using this priority only for testing environments.
{"name" : "EqualPriority", "weight" : 1}
MostRequestedPriority prioritizes nodes with most requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the maximum of the average of the fraction of requested to capacity.
{"name" : "MostRequestedPriority", "weight" : 1}
ServiceSpreadingPriority spreads pods by minimizing the number of pods belonging to the same service onto the same machine.
{"name" : "ServiceSpreadingPriority", "weight" : 1}
2.2.3.2.2. Configurable Priorities
You can configure these priorities in the scheduler policy ConfigMap, in the openshift-config project, to add labels that affect how the priorities work.
The type of the priority function is identified by the argument that they take. Since these are configurable, multiple priorities of the same type (but different configuration parameters) can be combined as long as their user-defined names are different.
For information on using these priorities, see Modifying Scheduler Policy.
ServiceAntiAffinity takes a label and ensures a good spread of the pods belonging to the same service across the group of nodes based on the label values. It gives the same score to all nodes that have the same value for the specified label. It gives a higher score to nodes within a group with the least concentration of pods.
{
  "kind": "Policy",
  "apiVersion": "v1",
  "priorities": [
    {
      "name": "<name>",
      "weight" : 1,
      "argument": {
        "serviceAntiAffinity": {
          "label": "<label>"
        }
      }
    }
  ]
}
For example:
{
  "kind": "Policy",
  "apiVersion": "v1",
  "priorities": [
    {
      "name": "RackSpread",
      "weight" : 1,
      "argument": {
        "serviceAntiAffinity": {
          "label": "rack"
        }
      }
    }
  ]
}
In some situations, using ServiceAntiAffinity based on custom labels does not spread pods as expected. See this Red Hat Solution.
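To make the ServiceAntiAffinity behavior described above more tangible, the following Python sketch scores nodes by how few pods of the same service already run in their label group. The scoring formula, the function name, and the decision to give unlabeled nodes a score of 0 are assumptions for illustration only, not the scheduler's documented implementation.

```python
from collections import Counter
from typing import Dict, List


def service_anti_affinity_scores(label: str,
                                 node_labels: Dict[str, Dict[str, str]],
                                 service_pod_nodes: List[str]) -> Dict[str, float]:
    """Score each node 0..10; groups with fewer pods of the service score higher.

    node_labels maps node name -> labels; service_pod_nodes lists the node
    name of every existing pod that belongs to the same service.
    """
    # Count the service's pods per label value (for example, per rack).
    pods_per_group: Counter = Counter()
    for node_name in service_pod_nodes:
        value = node_labels.get(node_name, {}).get(label)
        if value is not None:
            pods_per_group[value] += 1
    total = sum(pods_per_group.values())

    scores = {}
    for node_name, labels in node_labels.items():
        value = labels.get(label)
        if value is None:
            scores[node_name] = 0.0      # assumption: unlabeled nodes rank last
        elif total == 0:
            scores[node_name] = 10.0     # no pods of the service exist yet
        else:
            scores[node_name] = 10.0 * (total - pods_per_group[value]) / total
    return scores


NODE_LABELS = {
    "n1": {"rack": "r1"},
    "n2": {"rack": "r1"},
    "n3": {"rack": "r2"},
}
# Two pods of the service run on n1 and one runs on n3.
scores = service_anti_affinity_scores("rack", NODE_LABELS, ["n1", "n1", "n3"])
print(scores)
```

Nodes n1 and n2 share rack r1 and therefore receive the same score, while n3 in the less crowded rack r2 scores higher, which matches the "same score within a group, higher score for the least concentrated group" behavior described above.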
The labelPreference parameter gives priority based on the specified label. If the label is present on a node, that node is given priority. If no label is specified, priority is given to nodes that do not have a label.
{
  "kind": "Policy",
  "apiVersion": "v1",
  "priorities": [
    {
      "name": "<name>",
      "weight" : 1,
      "argument": {
        "labelPreference": {
          "label": "<label>",
          "presence": true
        }
      }
    }
  ]
}
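The interaction between the label and presence fields of labelPreference can be sketched as a single comparison. This Python function is a simplified model with an assumed all-or-nothing score of 10 or 0; the function name and the scoring values are hypothetical.

```python
from typing import Dict


def label_preference_score(node_labels: Dict[str, str],
                           label: str, presence: bool) -> int:
    """Return 10 when the node matches the preference, otherwise 0.

    With presence=True, nodes that carry the label are preferred.
    With presence=False, nodes that do not carry the label are preferred.
    The label's value is ignored; only its existence matters.
    """
    has_label = label in node_labels
    return 10 if has_label == presence else 0


print(label_preference_score({"zone": "z1"}, "zone", True))
print(label_preference_score({"rack": "s1"}, "zone", True))
print(label_preference_score({"rack": "s1"}, "zone", False))
```

Because the result is still multiplied by the configured weight and added to the other priority scores, a label preference influences the ranking without excluding any node.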
2.2.4. Sample Policy Configurations
The configuration below specifies the default scheduler configuration, if it were to be specified using the scheduler policy file.
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {
      "name": "RegionZoneAffinity",
      "argument": {
        "serviceAffinity": {
          "labels": "region, zone"
        }
      }
    }
  ],
  "priorities": [
    {
      "name": "RackSpread",
      "weight" : 1,
      "argument": {
        "serviceAntiAffinity": {
          "label": "rack"
        }
      }
    }
  ]
}
In all of the sample configurations below, the list of predicates and priority functions is truncated to include only the ones that pertain to the use case specified. In practice, a complete/meaningful scheduler policy should include most, if not all, of the default predicates and priorities listed above.
The following example defines three topological levels: region (affinity), zone (affinity), and rack (anti-affinity):
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {
      "name": "RegionZoneAffinity",
      "argument": {
        "serviceAffinity": {
          "label": "region, zone"
        }
      }
    }
  ],
  "priorities": [
    {
      "name": "RackSpread",
      "weight" : 1,
      "argument": {
        "serviceAntiAffinity": {
          "label": "rack"
        }
      }
    }
  ]
}
The following example defines three topological levels: city (affinity), building (anti-affinity), and room (anti-affinity):
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {
      "name": "CityAffinity",
      "argument": {
        "serviceAffinity": {
          "label": "city"
        }
      }
    }
  ],
  "priorities": [
    {
      "name": "BuildingSpread",
      "weight" : 1,
      "argument": {
        "serviceAntiAffinity": {
          "label": "building"
        }
      }
    },
    {
      "name": "RoomSpread",
      "weight" : 1,
      "argument": {
        "serviceAntiAffinity": {
          "label": "room"
        }
      }
    }
  ]
}
The following example defines a policy to only use nodes with the 'region' label defined and prefer nodes with the 'zone' label defined:
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {
      "name": "RequireRegion",
      "argument": {
        "labelPreference": {
          "label": "region",
          "presence": true
        }
      }
    }
  ],
  "priorities": [
    {
      "name": "ZonePreferred",
      "weight" : 1,
      "argument": {
        "labelPreference": {
          "label": "zone",
          "presence": true
        }
      }
    }
  ]
}
The following example combines both static and configurable predicates and also priorities:
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {
      "name": "RegionAffinity",
      "argument": {
        "serviceAffinity": {
          "label": "region"
        }
      }
    },
    {
      "name": "RequireRegion",
      "argument": {
        "labelsPresence": {
          "label": "region",
          "presence": true
        }
      }
    },
    {
      "name": "BuildingNodesAvoid",
      "argument": {
        "labelsPresence": {
          "label": "building",
          "presence": false
        }
      }
    },
    {"name" : "PodFitsPorts"},
    {"name" : "MatchNodeSelector"}
  ],
  "priorities": [
    {
      "name": "ZoneSpread",
      "weight" : 2,
      "argument": {
        "serviceAntiAffinity": {
          "label": "zone"
        }
      }
    },
    {
      "name": "ZonePreferred",
      "weight" : 1,
      "argument": {
        "labelPreference": {
          "label": "zone",
          "presence": true
        }
      }
    },
    {"name" : "ServiceSpreadingPriority", "weight" : 1}
  ]
}
2.3. Placing pods relative to other pods using affinity and anti-affinity rules
Affinity is a property of pods that controls the nodes on which they prefer to be scheduled. Anti-affinity is a property of pods that prevents a pod from being scheduled on a node.
In OpenShift Container Platform pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key/value labels on other pods.
2.3.1. Understanding pod affinity
Pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key/value labels on other pods.
- Pod affinity can tell the scheduler to locate a new pod on the same node as other pods if the label selector on the new pod matches the label on the current pod.
- Pod anti-affinity can prevent the scheduler from locating a new pod on the same node as pods with the same labels if the label selector on the new pod matches the label on the current pod.
For example, using affinity rules, you could spread or pack pods within a service or relative to pods in other services. Anti-affinity rules allow you to prevent pods of a particular service from scheduling on the same nodes as pods of another service that are known to interfere with the performance of the pods of the first service. Or, you could spread the pods of a service across nodes or availability zones to reduce correlated failures.
There are two types of pod affinity rules: required and preferred.
Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.
Depending on your pod priority and preemption settings, the scheduler might not be able to find an appropriate node for a pod without violating affinity requirements. If so, a pod might not be scheduled.
To prevent this situation, carefully configure pod affinity with equal-priority pods.
You configure pod affinity/anti-affinity through the pod specification files. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, and the scheduler then attempts to meet the preferred rule.
The following example shows a pod specification configured for pod affinity and anti-affinity.
In this example, the pod affinity rule indicates that the pod can schedule onto a node only if that node has at least one already-running pod with a label that has the key security and value S1. The pod anti-affinity rule says that the pod prefers to not schedule onto a node if that node is already running a pod with label having key security and value S2.
Sample pod config file with pod affinity
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity: # 1
      requiredDuringSchedulingIgnoredDuringExecution: # 2
      - labelSelector:
          matchExpressions:
          - key: security # 3
            operator: In # 4
            values:
            - S1 # 5
        topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: docker.io/ocpqe/hello-pod
- 1
- Stanza to configure pod affinity.
- 2
- Defines a required rule.
- 3 5
- The key and value (label) that must be matched to apply the rule.
- 4
- The operator represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.
Sample pod config file with pod anti-affinity
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-antiaffinity
spec:
  affinity:
    podAntiAffinity: # 1
      preferredDuringSchedulingIgnoredDuringExecution: # 2
      - weight: 100 # 3
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security # 4
              operator: In # 5
              values:
              - S2
          topologyKey: kubernetes.io/hostname
  containers:
  - name: with-pod-affinity
    image: docker.io/ocpqe/hello-pod
- 1: Stanza to configure pod anti-affinity.
- 2: Defines a preferred rule.
- 3: Specifies a weight for a preferred rule. The node with the highest weight is preferred.
- 4: Description of the pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.
- 5: The operator represents the relationship between the label on the existing pod and the set of values in the matchExpressions parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.
If labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod continues to run on the node.
2.3.2. Configuring a pod affinity rule
The following steps demonstrate a simple two-pod configuration that creates a pod with a label and a second pod that uses affinity to allow scheduling with the first pod.
Procedure
Create a pod with a specific label in the pod specification:
$ cat team4.yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-s1
  labels:
    security: S1
spec:
  containers:
  - name: security-s1
    image: docker.io/ocpqe/hello-pod
When creating other pods, edit the pod specification as follows:
- Use the podAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter or preferredDuringSchedulingIgnoredDuringExecution parameter:
  - Specify the key and value that must be met. If you want the new pod to be scheduled with the other pod, use the same key and value parameters as the label on the first pod.
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: failure-domain.beta.kubernetes.io/zone
  - Specify an operator. The operator can be In, NotIn, Exists, or DoesNotExist. For example, use the operator In to require that the label on the existing pod has one of the listed values.
  - Specify a topologyKey, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain.
Create the pod.
$ oc create -f <pod-spec>.yaml
2.3.3. Configuring a pod anti-affinity rule
The following steps demonstrate a simple two-pod configuration that creates a pod with a label and a second pod that uses an anti-affinity preferred rule to attempt to prevent scheduling with the first pod.
Procedure
Create a pod with a specific label in the pod specification:
$ cat team4.yaml
apiVersion: v1
kind: Pod
metadata:
  name: security-s2
  labels:
    security: S2
spec:
  containers:
  - name: security-s2
    image: docker.io/ocpqe/hello-pod
When creating other pods, edit the pod specification to set the following parameters:
- Use the podAntiAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter or preferredDuringSchedulingIgnoredDuringExecution parameter:
  - For a preferred rule, specify a weight for the node, 1-100. The node with the highest weight is preferred.
  - Specify the key and values that must be met. If you want the new pod to not be scheduled with the other pod, use the same key and value parameters as the label on the first pod.
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: kubernetes.io/hostname
  - Specify an operator. The operator can be In, NotIn, Exists, or DoesNotExist. For example, use the operator In to require that the label on the existing pod has one of the listed values.
  - Specify a topologyKey, which is a prepopulated Kubernetes label that the system uses to denote such a topology domain.
Create the pod.
$ oc create -f <pod-spec>.yaml
2.3.4. Sample pod affinity and anti-affinity rules
The following examples demonstrate pod affinity and pod anti-affinity.
2.3.4.1. Pod Affinity
The following example demonstrates pod affinity for pods with matching labels and label selectors.
The pod team4 has the label team:4.
$ cat team4.yaml
apiVersion: v1
kind: Pod
metadata:
  name: team4
  labels:
    team: "4"
spec:
  containers:
  - name: ocp
    image: docker.io/ocpqe/hello-pod
The pod team4a has the label selector team:4 under podAffinity.
$ cat pod-team4a.yaml
apiVersion: v1
kind: Pod
metadata:
  name: team4a
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: team
            operator: In
            values:
            - "4"
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pod-affinity
    image: docker.io/ocpqe/hello-pod
- The team4a pod is scheduled on the same node as the team4 pod.
2.3.4.2. Pod Anti-affinity
The following example demonstrates pod anti-affinity for pods with matching labels and label selectors.
The pod pod-s1 has the label security:s1.
$ cat pod-s1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-s1
  labels:
    security: s1
spec:
  containers:
  - name: ocp
    image: docker.io/ocpqe/hello-pod
The pod pod-s2 has the label selector security:s1 under podAntiAffinity.
$ cat pod-s2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-s2
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - s1
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pod-antiaffinity
    image: docker.io/ocpqe/hello-pod
- The pod pod-s2 cannot be scheduled on the same node as pod-s1.
2.3.4.3. Pod Affinity with no Matching Labels
The following example demonstrates pod affinity for pods without matching labels and label selectors.
The pod pod-s1 has the label security:s1.
$ cat pod-s1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-s1
  labels:
    security: s1
spec:
  containers:
  - name: ocp
    image: docker.io/ocpqe/hello-pod
The pod pod-s2 has the label selector security:s2.
$ cat pod-s2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-s2
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - s2
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pod-affinity
    image: docker.io/ocpqe/hello-pod
The pod pod-s2 is not scheduled unless there is a node with a pod that has the security:s2 label. If there is no other pod with that label, the new pod remains in a pending state:
NAME      READY     STATUS    RESTARTS   AGE       IP        NODE
pod-s2    0/1       Pending   0          32s       <none>
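The role of topologyKey in the samples above can be sketched as follows. This Python illustration is a simplified model under stated assumptions, not the scheduler's implementation: the function name, the node names, and the zone label are hypothetical, and the selector is reduced to a map of label key to allowed values (the In operator only).

```python
def nodes_satisfying_pod_affinity(nodes, pods, selector, topology_key):
    """Return the nodes that share a topology domain with a matching pod.

    nodes: {node_name: node_labels}
    pods: list of (pod_labels, node_name) for pods that are already running
    selector: {label_key: set_of_allowed_values}
    """
    domains = set()
    for pod_labels, node_name in pods:
        if all(pod_labels.get(k) in allowed for k, allowed in selector.items()):
            value = nodes[node_name].get(topology_key)
            if value is not None:
                domains.add(value)
    return sorted(n for n, labels in nodes.items() if labels.get(topology_key) in domains)


nodes = {
    "node-a": {"zone": "z1"},
    "node-b": {"zone": "z1"},
    "node-c": {"zone": "z2"},
}
running = [({"security": "s1"}, "node-a")]

# A pod that requires affinity to security=s1 can land anywhere in zone z1.
print(nodes_satisfying_pod_affinity(nodes, running, {"security": {"s1"}}, "zone"))
# No running pod has security=s2, so no node qualifies and the pod stays Pending.
print(nodes_satisfying_pod_affinity(nodes, running, {"security": {"s2"}}, "zone"))
```

With kubernetes.io/hostname as the topology key, each node is its own domain, which is why the samples above place pods on (or keep them off) the exact node of the matching pod.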
2.4. Controlling pod placement on nodes using node affinity rules
Affinity is a property of pods that controls the nodes on which they prefer to be scheduled.
In OpenShift Container Platform, node affinity is a set of rules used by the scheduler to determine where a pod can be placed. The rules are defined using custom labels on the nodes and label selectors specified in pods.
2.4.1. Understanding node affinity
Node affinity allows a pod to specify an affinity towards a group of nodes it can be placed on. The node does not have control over the placement.
For example, you could configure a pod to only run on a node with a specific CPU or in a specific availability zone.
There are two types of node affinity rules: required and preferred.
Required rules must be met before a pod can be scheduled on a node. Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.
If labels on a node change at runtime such that a node affinity rule on a pod is no longer met, the pod continues to run on the node.
You configure node affinity through the pod specification file. You can specify a required rule, a preferred rule, or both. If you specify both, the node must first meet the required rule, and the scheduler then attempts to meet the preferred rule.
The following example is a pod specification with a rule that requires the pod be placed on a node with a label whose key is e2e-az-NorthSouth and whose value is either e2e-az-North or e2e-az-South:
Sample pod configuration file with a node affinity required rule
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity: # 1
      requiredDuringSchedulingIgnoredDuringExecution: # 2
        nodeSelectorTerms:
        - matchExpressions:
          - key: e2e-az-NorthSouth # 3
            operator: In # 4
            values:
            - e2e-az-North # 5
            - e2e-az-South # 6
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod
- 1: The stanza to configure node affinity.
- 2: Defines a required rule.
- 3, 5, 6: The key/value pair (label) that must be matched to apply the rule.
- 4: The operator represents the relationship between the label on the node and the set of values in the matchExpressions parameters in the pod specification. This value can be In, NotIn, Exists, DoesNotExist, Lt, or Gt.
The following example is a pod specification with a preferred rule that a node with a label whose key is e2e-az-EastWest and whose value is either e2e-az-East or e2e-az-West is preferred for the pod:
Sample pod configuration file with a node affinity preferred rule
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity: # 1
      preferredDuringSchedulingIgnoredDuringExecution: # 2
      - weight: 1 # 3
        preference:
          matchExpressions:
          - key: e2e-az-EastWest # 4
            operator: In # 5
            values:
            - e2e-az-East # 6
            - e2e-az-West # 7
  containers:
  - name: with-node-affinity
    image: docker.io/ocpqe/hello-pod
- 1: The stanza to configure node affinity.
- 2: Defines a preferred rule.
- 3: Specifies a weight for a preferred rule. The node with the highest weight is preferred.
- 4, 6, 7: The key/value pair (label) that must be matched to apply the rule.
- 5: The operator represents the relationship between the label on the node and the set of values in the matchExpressions parameters in the pod specification. This value can be In, NotIn, Exists, DoesNotExist, Lt, or Gt.
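How weights rank candidate nodes can be illustrated with a small Python sketch. It assumes the upstream Kubernetes behavior in which the weights of all satisfied preferred terms are summed per node; the function name, the simplified term format (In operator only), and the disktype label are hypothetical.

```python
def preference_score(node_labels, preferred_terms):
    """Sum the weights of the preferred terms that the node satisfies."""
    score = 0
    for term in preferred_terms:
        if all(node_labels.get(k) in allowed for k, allowed in term["match"].items()):
            score += term["weight"]
    return score


terms = [
    {"weight": 80, "match": {"e2e-az-EastWest": {"e2e-az-East", "e2e-az-West"}}},
    {"weight": 20, "match": {"disktype": {"ssd"}}},  # hypothetical second preference
]

print(preference_score({"e2e-az-EastWest": "e2e-az-East", "disktype": "ssd"}, terms))  # 100
print(preference_score({"e2e-az-EastWest": "e2e-az-West"}, terms))                     # 80
print(preference_score({"disktype": "hdd"}, terms))                                    # 0
```

A node that scores 0 is still eligible, because preferred rules only rank nodes and never exclude them.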
There is no explicit node anti-affinity concept, but using the NotIn or DoesNotExist operator replicates that behavior.
If you are using node affinity and node selectors in the same pod configuration, note the following:
- If you configure both nodeSelector and nodeAffinity, both conditions must be satisfied for the pod to be scheduled onto a candidate node.
- If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
- If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions are satisfied.
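The three combination rules above (nodeSelector AND nodeAffinity, OR across nodeSelectorTerms, AND across matchExpressions) can be sketched in Python. The function names and the cpus label are hypothetical, and the sketch assumes that Gt and Lt compare integer label values.

```python
def expr_ok(labels, expr):
    """Evaluate one node affinity matchExpressions entry against node labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op in ("Gt", "Lt"):
        if key not in labels:
            return False
        node_value, bound = int(labels[key]), int(values[0])
        return node_value > bound if op == "Gt" else node_value < bound
    raise ValueError(f"unsupported operator: {op}")


def node_eligible(labels, node_selector, selector_terms):
    """nodeSelector AND (any term, where every expression in that term holds)."""
    selector_ok = all(labels.get(k) == v for k, v in node_selector.items())
    affinity_ok = not selector_terms or any(
        all(expr_ok(labels, e) for e in term) for term in selector_terms
    )
    return selector_ok and affinity_ok


node = {"e2e-az-NorthSouth": "e2e-az-North", "cpus": "8"}
in_zone = {"key": "e2e-az-NorthSouth", "operator": "In",
           "values": ["e2e-az-North", "e2e-az-South"]}
many_cpus = {"key": "cpus", "operator": "Gt", "values": ["16"]}

print(node_eligible(node, {}, [[in_zone], [many_cpus]]))  # True: one term is enough
print(node_eligible(node, {}, [[in_zone, many_cpus]]))    # False: both expressions must hold
print(node_eligible(node, {"disktype": "ssd"}, [[in_zone]]))  # False: nodeSelector fails
```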
2.4.2. Configuring a required node affinity rule
Required rules must be met before a pod can be scheduled on a node.
Procedure
The following steps demonstrate a simple configuration that labels a node and creates a pod that the scheduler is required to place on that node.
Add a label to a node using the oc label node command:
$ oc label node node1 e2e-az-name=e2e-az1
In the pod specification, use the nodeAffinity stanza to configure the requiredDuringSchedulingIgnoredDuringExecution parameter:
- Specify the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same key and value parameters as the label in the node.
- Specify an operator. The operator can be In, NotIn, Exists, DoesNotExist, Lt, or Gt. For example, use the operator In to require that the node has the label with one of the listed values:
  spec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: e2e-az-name
              operator: In
              values:
              - e2e-az1
              - e2e-az2
Create the pod:
$ oc create -f e2e-az2.yaml
2.4.3. Configuring a Preferred Node Affinity Rule
Preferred rules specify that, if the rule is met, the scheduler tries to enforce the rules, but does not guarantee enforcement.
Procedure
The following steps demonstrate a simple configuration that labels a node and creates a pod that the scheduler tries to place on that node.
Add a label to a node using the oc label node command:
$ oc label node node1 e2e-az-name=e2e-az3
In the pod specification, use the nodeAffinity stanza to configure the preferredDuringSchedulingIgnoredDuringExecution parameter:
- Specify a weight for the node, as a number 1-100. The node with the highest weight is preferred.
- Specify the key and values that must be met. If you want the new pod to be scheduled on the node you edited, use the same key and value parameters as the label in the node:
  spec:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: e2e-az-name
              operator: In
              values:
              - e2e-az3
- Specify an operator. The operator can be In, NotIn, Exists, DoesNotExist, Lt, or Gt. For example, use the operator In to require that the node has the label with one of the listed values.
Create the pod.
$ oc create -f e2e-az3.yaml
2.4.4. Sample node affinity rules
The following examples demonstrate node affinity.
2.4.4.1. Node Affinity with Matching Labels
The following example demonstrates node affinity for a node and pod with matching labels:
The Node1 node has the label zone:us:
$ oc label node node1 zone=us
The pod pod-s1 has the zone and us key/value pair under a required node affinity rule:
$ cat pod-s1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-s1
spec:
  containers:
  - image: "docker.io/ocpqe/hello-pod"
    name: hello-pod
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "zone"
            operator: In
            values:
            - us
The pod pod-s1 can be scheduled on Node1:
$ oc get pod -o wide
NAME     READY     STATUS    RESTARTS   AGE   IP    NODE
pod-s1   1/1       Running   0          4m    IP1   node1
2.4.4.2. Node Affinity with No Matching Labels
The following example demonstrates node affinity for a node and pod without matching labels:
The Node1 node has the label zone:emea:
$ oc label node node1 zone=emea
The pod pod-s1 has the zone and us key/value pair under a required node affinity rule:
$ cat pod-s1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-s1
spec:
  containers:
  - image: "docker.io/ocpqe/hello-pod"
    name: hello-pod
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: "zone"
            operator: In
            values:
            - us
The pod pod-s1 cannot be scheduled on Node1:
$ oc describe pod pod-s1
<---snip--->
Events:
 FirstSeen LastSeen Count From              SubObjectPath  Type     Reason
 --------- -------- ----- ----              -------------  -------- ------
 1m        33s      8     default-scheduler                Warning  FailedScheduling  No nodes are available that match all of the following predicates:: MatchNodeSelector (1).
2.4.5. Additional resources
For information about changing node labels, see Understanding how to update labels on nodes.
2.5. Placing pods onto overcommitted nodes
In an overcommitted state, the sum of the container compute resource requests and limits exceeds the resources available on the system. Overcommitment might be desirable in development environments where a trade-off of guaranteed performance for capacity is acceptable.
Requests and limits enable administrators to allow and manage the overcommitment of resources on a node. The scheduler uses requests for scheduling your container and providing a minimum service guarantee. Limits constrain the amount of compute resource that may be consumed on your node.
2.5.1. Understanding overcommitment
OpenShift Container Platform administrators can control the level of overcommit and manage container density on nodes by configuring masters to override the ratio between request and limit set on developer containers. In conjunction with a per-project LimitRange specifying limits and defaults, this adjusts the container limit and request to achieve the desired level of overcommit.
Note that these overrides have no effect if no limits have been set on containers. Create a LimitRange object with default limits (per individual project, or in the project template) in order to ensure that the overrides apply.
After these overrides, the container limits and requests must still be validated by any LimitRange objects in the project. It is possible, for example, for developers to specify a limit close to the minimum limit, and have the request then be overridden below the minimum limit, causing the pod to be forbidden. This unfortunate user experience should be addressed with future work, but for now, configure this capability and LimitRanges with caution.
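The failure mode described above can be shown with simple arithmetic. The following Python sketch assumes a hypothetical override that rewrites the request to a fixed percentage of the limit, followed by a LimitRange minimum check; the function names, the 25 percent ratio, and the MiB figures are illustrative assumptions, not product defaults.

```python
def overridden_request(limit_mib, request_to_limit_percent):
    """Request after the override, computed as a percentage of the limit (integer MiB)."""
    return limit_mib * request_to_limit_percent // 100


def passes_limit_range(limit_mib, request_to_limit_percent, min_mib):
    """LimitRange validation runs after the override: request and limit must both be >= min."""
    request = overridden_request(limit_mib, request_to_limit_percent)
    return request >= min_mib and limit_mib >= min_mib


print(overridden_request(512, 25))       # 128
print(passes_limit_range(512, 25, 100))  # True
print(passes_limit_range(120, 25, 100))  # False: the request becomes 30 MiB, below the minimum
```

In the last case the developer's limit of 120 MiB is valid on its own, but the overridden request falls below the 100 MiB minimum and the pod is forbidden.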
2.5.2. Understanding nodes overcommitment
In an overcommitted environment, it is important to properly configure your node to provide best system behavior.
When the node starts, it ensures that the kernel tunable flags for memory management are set properly. The kernel should never fail memory allocations unless it runs out of physical memory.
To ensure this behavior, OpenShift Container Platform configures the kernel to always overcommit memory by setting the vm.overcommit_memory parameter to 1, overriding the default operating system setting.
OpenShift Container Platform also configures the kernel not to panic when it runs out of memory by setting the vm.panic_on_oom parameter to 0. A setting of 0 instructs the kernel to call oom_killer in an Out of Memory (OOM) condition, which kills processes based on priority.
You can view the current setting by running the following commands on your nodes:
$ sysctl -a |grep commit
vm.overcommit_memory = 1
$ sysctl -a |grep panic
vm.panic_on_oom = 0
The above flags should already be set on nodes, and no further action is required.
You can also perform the following configurations for each node:
- Disable or enforce CPU limits using CPU CFS quotas
- Reserve resources for system processes
- Reserve memory across quality of service tiers
2.6. Controlling pod placement using node taints
Taints and tolerations allow the node to control which pods should (or should not) be scheduled on it.
2.6.1. Understanding taints and tolerations
A taint allows a node to refuse to schedule a pod unless that pod has a matching toleration.
You apply taints to a node through the node specification (NodeSpec) and apply tolerations to a pod through the pod specification (PodSpec). A taint on a node instructs the node to repel all pods that do not tolerate the taint.
Taints and tolerations consist of a key, value, and effect. An operator allows you to leave one of these parameters empty.
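Parameter | Description
---|---
key | The key of the taint or toleration.
value | The value of the taint or toleration.
effect | The effect is one of the following: NoSchedule (pods that do not tolerate the taint are not scheduled onto the node), PreferNoSchedule (the scheduler tries not to schedule pods that do not tolerate the taint onto the node), or NoExecute (pods that do not tolerate the taint are not scheduled onto the node, and such pods already running on the node are evicted).
operator | Equal or Exists.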
A toleration matches a taint:
- If the operator parameter is set to Equal:
  - the key parameters are the same;
  - the value parameters are the same;
  - the effect parameters are the same.
- If the operator parameter is set to Exists:
  - the key parameters are the same;
  - the effect parameters are the same.
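The matching rules above translate directly into a small predicate. The following Python sketch is an illustration only: the function name is hypothetical, and it covers just the two cases listed above, with taints and tolerations represented as plain dictionaries.

```python
def toleration_matches(toleration, taint):
    """Return True if the toleration matches the taint under the rules above."""
    if toleration["key"] != taint["key"]:
        return False
    if toleration["effect"] != taint["effect"]:
        return False
    if toleration["operator"] == "Exists":
        return True  # Exists ignores the value
    if toleration["operator"] == "Equal":
        return toleration.get("value") == taint.get("value")
    raise ValueError(f"unsupported operator: {toleration['operator']}")


taint = {"key": "key1", "value": "value1", "effect": "NoExecute"}

equal = {"key": "key1", "operator": "Equal", "value": "value1", "effect": "NoExecute"}
exists = {"key": "key1", "operator": "Exists", "effect": "NoExecute"}
wrong_value = {"key": "key1", "operator": "Equal", "value": "other", "effect": "NoExecute"}

print(toleration_matches(equal, taint))        # True
print(toleration_matches(exists, taint))       # True
print(toleration_matches(wrong_value, taint))  # False
```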
The following taints are built into Kubernetes:
- node.kubernetes.io/not-ready: The node is not ready. This corresponds to the node condition Ready=False.
- node.kubernetes.io/unreachable: The node is unreachable from the node controller. This corresponds to the node condition Ready=Unknown.
- node.kubernetes.io/out-of-disk: The node has insufficient free space on the node for adding new pods. This corresponds to the node condition OutOfDisk=True.
- node.kubernetes.io/memory-pressure: The node has memory pressure issues. This corresponds to the node condition MemoryPressure=True.
- node.kubernetes.io/disk-pressure: The node has disk pressure issues. This corresponds to the node condition DiskPressure=True.
- node.kubernetes.io/network-unavailable: The node network is unavailable.
- node.kubernetes.io/unschedulable: The node is unschedulable.
- node.cloudprovider.kubernetes.io/uninitialized: When the node controller is started with an external cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint.
2.6.1.1. Understanding how to use toleration seconds to delay pod evictions
You can specify how long a pod can remain bound to a node before being evicted by specifying the tolerationSeconds parameter in the pod specification. If a taint with the NoExecute effect is added to a node, any pods that do not tolerate the taint are evicted immediately (pods that do tolerate the taint are not evicted). However, if a pod that is to be evicted has the tolerationSeconds parameter, the pod is not evicted until that time period expires.
For example:
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600
Here, if this pod is running and a matching taint is added to the node, the pod stays bound to the node for 3,600 seconds and is then evicted. If the taint is removed before that time, the pod is not evicted.
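The timing described above can be modeled with a short Python sketch. It is a simplified illustration: the function name is hypothetical, times are plain seconds on an arbitrary clock, and a tolerationSeconds value of None stands for a toleration that does not set the parameter.

```python
def eviction_time(taint_added_at, toleration_seconds, taint_removed_at=None):
    """Return the time at which the pod is evicted, or None if it is never evicted.

    Applies to a pod whose toleration matches a NoExecute taint.
    """
    if toleration_seconds is None:
        return None  # tolerated without a time limit
    deadline = taint_added_at + toleration_seconds
    if taint_removed_at is not None and taint_removed_at < deadline:
        return None  # the taint went away before the grace period expired
    return deadline


print(eviction_time(0, 3600))                         # 3600
print(eviction_time(0, 3600, taint_removed_at=1200))  # None
print(eviction_time(0, None))                         # None
```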
2.6.1.2. Understanding how to use multiple taints
You can put multiple taints on the same node and multiple tolerations on the same pod. OpenShift Container Platform processes multiple taints and tolerations as follows:
- Process the taints for which the pod has a matching toleration.
- The remaining unmatched taints have the indicated effects on the pod:
  - If there is at least one unmatched taint with effect NoSchedule, OpenShift Container Platform cannot schedule a pod onto that node.
  - If there is no unmatched taint with effect NoSchedule but there is at least one unmatched taint with effect PreferNoSchedule, OpenShift Container Platform tries to not schedule the pod onto the node.
  - If there is at least one unmatched taint with effect NoExecute, OpenShift Container Platform evicts the pod from the node (if it is already running on the node), or the pod is not scheduled onto the node (if it is not yet running on the node).
    - Pods that do not tolerate the taint are evicted immediately.
    - Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever.
    - Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time.
For example:
The node has the following taints:
$ oc adm taint nodes node1 key1=value1:NoSchedule
$ oc adm taint nodes node1 key1=value1:NoExecute
$ oc adm taint nodes node1 key2=value2:NoSchedule
The pod has the following tolerations:
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
In this case, the pod cannot be scheduled onto the node, because there is no toleration matching the third taint. The pod continues running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the pod.
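The filtering procedure and this example can be reproduced with a short Python sketch. It is a simplified model, not the scheduler's code: the function names and the result keys are hypothetical, and tolerationSeconds is ignored.

```python
def tolerates(toleration, taint):
    """Key and effect must agree; Equal additionally compares the value."""
    if toleration["key"] != taint["key"] or toleration["effect"] != taint["effect"]:
        return False
    if toleration["operator"] == "Exists":
        return True
    return toleration.get("value") == taint.get("value")


def unmatched_effects(taints, tolerations):
    """Effects of the taints that no toleration matches."""
    return {t["effect"] for t in taints if not any(tolerates(tol, t) for tol in tolerations)}


def placement(taints, tolerations):
    effects = unmatched_effects(taints, tolerations)
    return {
        "can_schedule": not ({"NoSchedule", "NoExecute"} & effects),
        "avoided_if_possible": "PreferNoSchedule" in effects,
        "evicted_if_running": "NoExecute" in effects,
    }


taints = [
    {"key": "key1", "value": "value1", "effect": "NoSchedule"},
    {"key": "key1", "value": "value1", "effect": "NoExecute"},
    {"key": "key2", "value": "value2", "effect": "NoSchedule"},
]
tolerations = [
    {"key": "key1", "operator": "Equal", "value": "value1", "effect": "NoSchedule"},
    {"key": "key1", "operator": "Equal", "value": "value1", "effect": "NoExecute"},
]

# Only the third taint (NoSchedule) is unmatched: no new scheduling, but no eviction.
print(placement(taints, tolerations))
```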
2.6.1.3. Preventing pod eviction for node problems
OpenShift Container Platform can be configured to represent node unreachable and node not ready conditions as taints. This allows per-pod specification of how long to remain bound to a node that becomes unreachable or not ready, rather than using the default of five minutes.
The Taint-Based Evictions feature is enabled by default. The taints are automatically added by the node controller and the normal logic for evicting pods from Ready nodes is disabled.
- If a node enters a not ready state, the node.kubernetes.io/not-ready:NoExecute taint is added and pods cannot be scheduled on the node. Existing pods remain for the toleration seconds period.
- If a node enters a not reachable state, the node.kubernetes.io/unreachable:NoExecute taint is added and pods cannot be scheduled on the node. Existing pods remain for the toleration seconds period.
This feature, in combination with tolerationSeconds, allows a pod to specify how long it should stay bound to a node that has one or both of these problems.
2.6.1.4. Understanding pod scheduling and node conditions (Taint Node by Condition)
OpenShift Container Platform automatically taints nodes that report conditions such as memory pressure and disk pressure. If a node reports a condition, a taint is added until the condition clears. The taints have the NoSchedule effect, which means no pod can be scheduled on the node, unless the pod has a matching toleration. This feature, Taint Nodes By Condition, is enabled by default.
The scheduler checks for these taints on nodes before scheduling pods. If the taint is present, the pod is scheduled on a different node. Because the scheduler checks for taints and not the actual node conditions, you can configure the scheduler to ignore some of these node conditions by adding appropriate pod tolerations.
The DaemonSet controller automatically adds the following tolerations to all daemons, to ensure backward compatibility:
- node.kubernetes.io/memory-pressure
- node.kubernetes.io/disk-pressure
- node.kubernetes.io/out-of-disk (only for critical pods)
- node.kubernetes.io/unschedulable (1.10 or later)
- node.kubernetes.io/network-unavailable (host network only)
You can also add arbitrary tolerations to DaemonSets.
2.6.1.5. Understanding evicting pods by condition (Taint-Based Evictions)
The Taint-Based Evictions feature, enabled by default, evicts pods from a node that experiences specific conditions, such as not-ready and unreachable. When a node experiences one of these conditions, OpenShift Container Platform automatically adds taints to the node, and starts evicting and rescheduling the pods on different nodes.
Taint-Based Evictions has a NoExecute effect, where any pod that does not tolerate the taint is evicted immediately and any pod that does tolerate the taint is never evicted.
OpenShift Container Platform evicts pods in a rate-limited way to prevent massive pod evictions in scenarios such as the master becoming partitioned from the nodes.
This feature, in combination with tolerationSeconds, allows you to specify how long a pod should stay bound to a node that has a node condition. If the condition still exists after the tolerationSeconds period, the taint remains on the node and the pods are evicted in a rate-limited manner. If the condition clears before the tolerationSeconds period expires, pods are not removed.
OpenShift Container Platform automatically adds a toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, unless the pod configuration specifies either toleration.
spec:
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
These tolerations ensure that the default pod behavior is to remain bound for 5 minutes after one of these node condition problems is detected.
You can configure these tolerations as needed. For example, if you have an application with a lot of local state you might want to keep the pods bound to node for a longer time in the event of network partition, allowing for the partition to recover and avoiding pod eviction.
DaemonSet pods are created with NoExecute tolerations for the following taints with no tolerationSeconds:
- node.kubernetes.io/unreachable
- node.kubernetes.io/not-ready
This ensures that DaemonSet pods are never evicted due to these node conditions, even if the DefaultTolerationSeconds admission controller is disabled.
2.6.2. Adding taints and tolerations
You add taints to nodes and tolerations to pods to allow the node to control which pods should (or should not) be scheduled on it.
Procedure
Use the following command using the parameters described in the taint and toleration components table:
$ oc adm taint nodes <node-name> <key>=<value>:<effect>
For example:
$ oc adm taint nodes node1 key1=value1:NoExecute
This example places a taint on node1 that has key key1, value value1, and taint effect NoExecute.
Add a toleration to a pod by editing the pod specification to include a tolerations section:
Sample pod configuration file with Equal operator
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600
For example:
Sample pod configuration file with Exists operator
tolerations:
- key: "key1"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 3600
Both of these tolerations match the taint created by the oc adm taint command above. A pod with either toleration would be able to schedule onto node1.
2.6.2.1. Dedicating a Node for a User using taints and tolerations
You can specify a set of nodes for exclusive use by a particular set of users.
Procedure
To specify dedicated nodes:
Add a taint to those nodes:
For example:
$ oc adm taint nodes node1 dedicated=groupName:NoSchedule
Add a corresponding toleration to the pods by writing a custom admission controller.
Only the pods with the tolerations are allowed to use the dedicated nodes.
2.6.2.2. Binding a user to a Node using taints and tolerations
You can configure a node so that particular users can use only the dedicated nodes.
Procedure
To configure a node so that users can use only that node:
Add a taint to those nodes:
For example:
$ oc adm taint nodes node1 dedicated=groupName:NoSchedule
Add a corresponding toleration to the pods by writing a custom admission controller.
The admission controller should add a node affinity to require that the pods can only schedule onto nodes labeled with the key:value label (dedicated=groupName).
Add a label similar to the taint (such as the key:value label) to the dedicated nodes.
2.6.2.3. Controlling Nodes with special hardware using taints and tolerations
In a cluster where a small subset of nodes have specialized hardware (for example GPUs), you can use taints and tolerations to keep pods that do not need the specialized hardware off of those nodes, leaving the nodes for pods that do need the specialized hardware. You can also require pods that need specialized hardware to use specific nodes.
Procedure
To ensure pods are blocked from the specialized hardware:
Taint the nodes that have the specialized hardware using one of the following commands:
$ oc adm taint nodes <node-name> disktype=ssd:NoSchedule
$ oc adm taint nodes <node-name> disktype=ssd:PreferNoSchedule
Add a corresponding toleration to pods that use the special hardware by using an admission controller.
For example, the admission controller could use some characteristic(s) of the pod to determine that the pod should be allowed to use the special nodes by adding a toleration.
To ensure pods can only use the specialized hardware, you need some additional mechanism. For example, you could label the nodes that have the special hardware and use node affinity on the pods that need the hardware.
2.6.3. Removing taints and tolerations
You can remove taints from nodes and tolerations from pods as needed.
Procedure
To remove taints and tolerations:
To remove a taint from a node:
$ oc adm taint nodes <node-name> <key>-
For example:
$ oc adm taint nodes ip-10-0-132-248.ec2.internal key1-
node/ip-10-0-132-248.ec2.internal untainted
To remove a toleration from a pod, edit the pod specification to remove the toleration:
tolerations:
- key: "key2"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 3600
2.7. Placing pods on specific nodes using node selectors
A node selector specifies a map of key-value pairs. The rules are defined using custom labels on nodes and selectors specified in pods. You can use node selectors to place specific pods on specific nodes, all pods in a project on specific nodes, or create a default node selector to schedule pods that do not have a defined node selector or project selector.
For the pod to be eligible to run on a node, the pod must have the indicated key-value pairs as the label on the node.
If you are using node affinity and node selectors in the same pod configuration, see the important considerations below.
2.7.1. Using node selectors to control pod placement
You can use node selector labels on pods to control where the pod is scheduled.
With node selectors, OpenShift Container Platform schedules the pods on nodes that contain matching labels.
You can add labels to a node or MachineConfig, but the labels will not persist if the node or machine goes down. Adding the label to the MachineSet ensures that new nodes or machines will have the label.
To add node selectors to an existing pod, add a node selector to the controlling object for that pod, such as a ReplicaSet, DaemonSet, or StatefulSet. Any existing pods under that controlling object are recreated on a node with a matching label. If you are creating a new pod, you can add the node selector directly to the pod spec.
You cannot add a node selector to an existing scheduled pod.
Prerequisites
If you want to add a node selector to existing pods, determine the controlling object for that pod. For example, the router-default-66d5cf9464-m2g75 pod is controlled by the router-default-66d5cf9464 ReplicaSet:
$ oc describe pod router-default-66d5cf9464-7pwkc
Name:               router-default-66d5cf9464-7pwkc
Namespace:          openshift-ingress
....
Controlled By:      ReplicaSet/router-default-66d5cf9464
The web console lists the controlling object under ownerReferences in the pod YAML:
ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: router-default-66d5cf9464
    uid: d81dd094-da26-11e9-a48a-128e7edf0312
    controller: true
    blockOwnerDeletion: true
Procedure
Add the desired label to your nodes:
$ oc label <resource> <name> <key>=<value>
For example, to label a node:
$ oc label nodes ip-10-0-142-25.ec2.internal type=user-node region=east
The label is applied to the node:
kind: Node
apiVersion: v1
metadata:
  name: ip-10-0-131-14.ec2.internal
  selfLink: /api/v1/nodes/ip-10-0-131-14.ec2.internal
  uid: 7bc2580a-8b8e-11e9-8e01-021ab4174c74
  resourceVersion: '478704'
  creationTimestamp: '2019-06-10T14:46:08Z'
  labels:
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/zone: us-east-1a
    node.openshift.io/os_version: '4.2'
    node-role.kubernetes.io/worker: ''
    failure-domain.beta.kubernetes.io/region: us-east-1
    node.openshift.io/os_id: rhcos
    beta.kubernetes.io/instance-type: m4.large
    kubernetes.io/hostname: ip-10-0-131-14
    region: east
    beta.kubernetes.io/arch: amd64
    type: user-node
....
Alternatively, you can add the label to a MachineSet:
$ oc edit MachineSet abc612-msrtw-worker-us-east-1c
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
....
spec:
  replicas: 2
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: ci-ln-89dz2y2-d5d6b-4995x
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: ci-ln-89dz2y2-d5d6b-4995x-worker-us-east-1a
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: ci-ln-89dz2y2-d5d6b-4995x
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: ci-ln-89dz2y2-d5d6b-4995x-worker-us-east-1a
    spec:
      metadata:
        creationTimestamp: null
        labels:
          region: east
          type: user-node
....
Add the desired node selector to a pod:
To add a node selector to existing and future pods, add a node selector to the controlling object for the pods:
For example:
kind: ReplicaSet
....
spec:
....
  template:
    metadata:
      creationTimestamp: null
      labels:
        ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default
        pod-template-hash: 66d5cf9464
    spec:
      nodeSelector:
        beta.kubernetes.io/os: linux
        node-role.kubernetes.io/worker: ''
        type: user-node  # Add the desired node selector.
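The same pattern applies to other controlling objects. The following is a minimal sketch, assuming the pods are managed by a Deployment rather than by a standalone ReplicaSet. In that case the node selector goes in the pod template of the Deployment, because the Deployment generates the ReplicaSet. The object name, app label, and image are hypothetical; the selector values reuse the labels applied to the node earlier in this procedure.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                 # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app              # hypothetical label
  template:
    metadata:
      labels:
        app: example-app
    spec:
      # Pods created from this template are scheduled only onto nodes
      # that carry both of these labels.
      nodeSelector:
        type: user-node
        region: east
      containers:
      - name: app                   # hypothetical container
        image: registry.example.com/app:latest   # placeholder image
```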
For a new pod, you can add the selector to the pod specification directly:
apiVersion: v1
kind: Pod
...
spec:
  nodeSelector:
    <key>: <value>
...
For example:
apiVersion: v1
kind: Pod
....
spec:
  nodeSelector:
    region: east
    type: user-node
If you are using node selectors and node affinity in the same pod configuration, note the following:
- If you configure both nodeSelector and nodeAffinity, both conditions must be satisfied for the pod to be scheduled onto a candidate node.
- If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms is satisfied.
- If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions are satisfied.
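The three rules above can be illustrated with a single pod specification. This is a sketch only: the pod name, container name, and image are hypothetical, and the zone label key is assumed to exist on some nodes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: selector-and-affinity-example   # hypothetical name
spec:
  # Rule 1: the node selector must match in addition to the affinity rule.
  nodeSelector:
    type: user-node
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Rule 3: within one term, every expression must match
        # (region is east AND a zone label is present).
        - matchExpressions:
          - key: region
            operator: In
            values:
            - east
          - key: zone                   # assumed label key
            operator: Exists
        # Rule 2: terms are alternatives; matching either term is enough.
        - matchExpressions:
          - key: region
            operator: In
            values:
            - west
  containers:
  - name: app                           # hypothetical container
    image: registry.example.com/app:latest   # placeholder image
```

With this specification, a node labeled type=user-node and region=west is a candidate because it satisfies the node selector and the second term. A node labeled type=user-node and region=east but with no zone label is not a candidate, because neither term is fully satisfied. A node labeled region=west without type=user-node is also excluded, because the node selector does not match.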
2.7.2. Creating default cluster-wide node selectors
You can use default node selectors on pods together with labels on nodes to constrain all pods created in a cluster to specific nodes.
With cluster node selectors, when you create a pod in that cluster, OpenShift Container Platform adds the appropriate <key>:<value> and schedules the pod on nodes with matching labels.
You can add additional <key>:<value> pairs for the pod. But you cannot add a different <value> for a default <key>.
For example, if the cluster node selector is region: east, the following pod spec adds a new pair and is allowed:
spec:
  nodeSelector:
    region: east
    type: user-node
The following pod spec uses a different value for region and is not allowed:
spec:
  nodeSelector:
    region: west
If the project where you are creating the pod has a project node selector, that selector takes precedence over a cluster node selector.
Procedure
To add a default cluster node selector:
Edit the Scheduler Operator Custom Resource to add the cluster node selectors:
$ oc edit scheduler cluster
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
...
spec:
  defaultNodeSelector: type=user-node,region=east  # Add a node selector with the appropriate <key>=<value> pairs.
  mastersSchedulable: false
  policy:
    name: ""
After making this change, wait for the pods in the openshift-kube-apiserver project to redeploy. This can take several minutes. The default cluster node selector does not take effect until the pods redeploy.
Edit a node or MachineSet to add labels:
$ oc label <resource> <name> <key>=<value>
For example, to label a node:
$ oc label nodes ip-10-0-142-25.ec2.internal type=user-node region=east
To label a MachineSet:
$ oc label MachineSet abc612-msrtw-worker-us-east-1c type=user-node region=east
When you create a pod, OpenShift Container Platform adds the appropriate <key>:<value> and schedules the pod on the labeled node.
For example:
spec:
  nodeSelector:
    region: east
    type: user-node
2.7.3. Creating project-wide node selectors
You can use node selectors on a project together with labels on nodes to constrain all pods created in a namespace to the labeled nodes.
With project node selectors, when you create a pod in the namespace, OpenShift Container Platform adds the appropriate <key>:<value> and schedules the pod on nodes with matching labels.
You can add labels to a node or MachineConfig, but the labels will not persist if the node or machine goes down. Adding the label to the MachineSet ensures that new nodes or machines will have the label.
You can add additional <key>:<value> pairs for the pod. But you cannot add a different <value> for a default <key>.
For example, if the project node selector is region: east, the following pod spec adds a new pair and is allowed:
spec:
  nodeSelector:
    region: east
    type: user-node
The following pod spec uses a different value for region and is not allowed:
spec:
  nodeSelector:
    region: west
If there is a cluster-wide default node selector, a project node selector takes precedence.
Procedure
To add a default project node selector:
Create a namespace or edit an existing namespace associated with the project to add the openshift.io/node-selector annotation:
$ oc edit namespace <name>
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/node-selector: "type=user-node,region=east"  # Add openshift.io/node-selector with the appropriate <key>=<value> pairs.
    openshift.io/sa.scc.mcs: s0:c17,c14
    openshift.io/sa.scc.supplemental-groups: 1000300000/10000
    openshift.io/sa.scc.uid-range: 1000300000/10000
  creationTimestamp: 2019-06-10T14:39:45Z
  labels:
    openshift.io/run-level: "0"
  name: demo
  resourceVersion: "401885"
  selfLink: /api/v1/namespaces/openshift-kube-apiserver
  uid: 96ecc54b-8b8d-11e9-9f54-0a9ae641edd0
spec:
  finalizers:
  - kubernetes
status:
  phase: Active
Edit a node or MachineSet to add labels:
$ oc label <resource> <name> <key>=<value>
For example, to label a node:
$ oc label nodes ip-10-0-142-25.ec2.internal type=user-node region=east
To label a MachineSet:
$ oc label MachineSet abc612-msrtw-worker-us-east-1c type=user-node region=east
When you create a pod in the namespace, OpenShift Container Platform adds the appropriate <key>:<value> and schedules the pod on the labeled node.
For example:
spec:
  nodeSelector:
    region: east
    type: user-node