2.2. Configuring the default scheduler to control pod placement
The default OpenShift Container Platform pod scheduler is responsible for determining placement of new pods onto nodes within the cluster. It reads data from the pod and tries to find a node that is a good fit based on configured policies. It is completely independent and exists as a standalone/pluggable solution. It does not modify the pod and just creates a binding for the pod that ties the pod to the particular node.
A selection of predicates and priorities defines the policy for the scheduler. See Modifying scheduler policy for a list of predicates and priorities.
Sample default scheduler object
apiVersion: config.openshift.io/v1 kind: Scheduler metadata: annotations: release.openshift.io/create-only: "true" creationTimestamp: 2019-05-20T15:39:01Z generation: 1 name: cluster resourceVersion: "1491" selfLink: /apis/config.openshift.io/v1/schedulers/cluster uid: 6435dd99-7b15-11e9-bd48-0aec821b8e34 spec: policy: 1 name: scheduler-policy defaultNodeSelector: type=user-node,region=east 2
- 1
- You can specify the name of a custom scheduler policy file.
- 2
- Optional: Specify a default node selector to restrict pod placement to specific nodes. The default node selector is applied to the pods created in all namespaces. Pods can be scheduled on nodes with labels that match the default node selector and any existing pod node selectors. Namespaces having project-wide node selectors are not impacted even if this field is set.
2.2.1. Understanding default scheduling
The existing generic scheduler is the default platform-provided scheduler engine that selects a node to host the pod in a three-step operation:
- Filters the Nodes
- The available nodes are filtered based on the constraints or requirements specified. This is done by running each node through the list of filter functions called predicates.
- Prioritize the Filtered List of Nodes
- This is achieved by passing each node through a series of priority_ functions that assign it a score between 0 - 10, with 0 indicating a bad fit and 10 indicating a good fit to host the pod. The scheduler configuration can also take in a simple weight (positive numeric value) for each priority function. The node score provided by each priority function is multiplied by the weight (default weight for most priorities is 1) and then combined by adding the scores for each node provided by all the priorities. This weight attribute can be used by administrators to give higher importance to some priorities.
- Select the Best Fit Node
- The nodes are sorted based on their scores and the node with the highest score is selected to host the pod. If multiple nodes have the same high score, then one of them is selected at random.
2.2.1.1. Understanding Scheduler Policy
The selection of the predicate and priorities defines the policy for the scheduler.
The scheduler configuration file is a JSON file, which must be named policy.cfg
, that specifies the predicates and priorities the scheduler will consider.
In the absence of the scheduler policy file, the default scheduler behavior is used.
The predicates and priorities defined in the scheduler configuration file completely override the default scheduler policy. If any of the default predicates and priorities are required, you must explicitly specify the functions in the policy configuration.
Sample scheduler config map
apiVersion: v1
data:
policy.cfg: |
{
"kind" : "Policy",
"apiVersion" : "v1",
"predicates" : [
{"name" : "MaxGCEPDVolumeCount"},
{"name" : "GeneralPredicates"}, 1
{"name" : "MaxAzureDiskVolumeCount"},
{"name" : "MaxCSIVolumeCountPred"},
{"name" : "CheckVolumeBinding"},
{"name" : "MaxEBSVolumeCount"},
{"name" : "MatchInterPodAffinity"},
{"name" : "CheckNodeUnschedulable"},
{"name" : "NoDiskConflict"},
{"name" : "NoVolumeZoneConflict"},
{"name" : "PodToleratesNodeTaints"}
],
"priorities" : [
{"name" : "LeastRequestedPriority", "weight" : 1},
{"name" : "BalancedResourceAllocation", "weight" : 1},
{"name" : "ServiceSpreadingPriority", "weight" : 1},
{"name" : "NodePreferAvoidPodsPriority", "weight" : 1},
{"name" : "NodeAffinityPriority", "weight" : 1},
{"name" : "TaintTolerationPriority", "weight" : 1},
{"name" : "ImageLocalityPriority", "weight" : 1},
{"name" : "SelectorSpreadPriority", "weight" : 1},
{"name" : "InterPodAffinityPriority", "weight" : 1},
{"name" : "EqualPriority", "weight" : 1}
]
}
kind: ConfigMap
metadata:
creationTimestamp: "2019-09-17T08:42:33Z"
name: scheduler-policy
namespace: openshift-config
resourceVersion: "59500"
selfLink: /api/v1/namespaces/openshift-config/configmaps/scheduler-policy
uid: 17ee8865-d927-11e9-b213-02d1e1709840`
- 1
- The
GeneralPredicates
predicate represents thePodFitsResources
,HostName
,PodFitsHostPorts
, andMatchNodeSelector
predicates. Because you are not allowed to configure the same predicate multiple times, theGeneralPredicates
predicate cannot be used alongside any of the four represented predicates.
2.2.2. Creating a scheduler policy file
You can change the default scheduling behavior by creating a JSON file with the desired predicates and priorities. You then generate a config map from the JSON file and point the cluster
Scheduler object to use the config map.
Procedure
To configure the scheduler policy:
Create a JSON file named
policy.cfg
with the desired predicates and priorities.Sample scheduler JSON file
{ "kind" : "Policy", "apiVersion" : "v1", "predicates" : [ 1 {"name" : "MaxGCEPDVolumeCount"}, {"name" : "GeneralPredicates"}, {"name" : "MaxAzureDiskVolumeCount"}, {"name" : "MaxCSIVolumeCountPred"}, {"name" : "CheckVolumeBinding"}, {"name" : "MaxEBSVolumeCount"}, {"name" : "MatchInterPodAffinity"}, {"name" : "CheckNodeUnschedulable"}, {"name" : "NoDiskConflict"}, {"name" : "NoVolumeZoneConflict"}, {"name" : "PodToleratesNodeTaints"} ], "priorities" : [ 2 {"name" : "LeastRequestedPriority", "weight" : 1}, {"name" : "BalancedResourceAllocation", "weight" : 1}, {"name" : "ServiceSpreadingPriority", "weight" : 1}, {"name" : "NodePreferAvoidPodsPriority", "weight" : 1}, {"name" : "NodeAffinityPriority", "weight" : 1}, {"name" : "TaintTolerationPriority", "weight" : 1}, {"name" : "ImageLocalityPriority", "weight" : 1}, {"name" : "SelectorSpreadPriority", "weight" : 1}, {"name" : "InterPodAffinityPriority", "weight" : 1}, {"name" : "EqualPriority", "weight" : 1} ] }
Create a config map based on the scheduler JSON file:
$ oc create configmap -n openshift-config --from-file=policy.cfg <configmap-name> 1
- 1
- Enter a name for the config map.
For example:
$ oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy
Example output
configmap/scheduler-policy created
Edit the Scheduler Operator custom resource to add the config map:
$ oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"<configmap-name>"}}}' --type=merge 1
- 1
- Specify the name of the config map.
For example:
$ oc patch Scheduler cluster --type='merge' -p '{"spec":{"policy":{"name":"scheduler-policy"}}}' --type=merge
After making the change to the
Scheduler
config resource, wait for theopenshift-kube-apiserver
pods to redeploy. This can take several minutes. Until the pods redeploy, new scheduler does not take effect.Verify the scheduler policy is configured by viewing the log of a scheduler pod in the
openshift-kube-scheduler
namespace. The following command checks for the predicates and priorities that are being registered by the scheduler:$ oc logs <scheduler-pod> | grep predicates
For example:
$ oc logs openshift-kube-scheduler-ip-10-0-141-29.ec2.internal | grep predicates
Example output
Creating scheduler with fit predicates 'map[MaxGCEPDVolumeCount:{} MaxAzureDiskVolumeCount:{} CheckNodeUnschedulable:{} NoDiskConflict:{} NoVolumeZoneConflict:{} GeneralPredicates:{} MaxCSIVolumeCountPred:{} CheckVolumeBinding:{} MaxEBSVolumeCount:{} MatchInterPodAffinity:{} PodToleratesNodeTaints:{}]' and priority functions 'map[InterPodAffinityPriority:{} LeastRequestedPriority:{} ServiceSpreadingPriority:{} ImageLocalityPriority:{} SelectorSpreadPriority:{} EqualPriority:{} BalancedResourceAllocation:{} NodePreferAvoidPodsPriority:{} NodeAffinityPriority:{} TaintTolerationPriority:{}]'
2.2.3. Modifying scheduler policies
You change scheduling behavior by creating or editing your scheduler policy config map in the openshift-config
project. Add and remove predicates and priorities to the config map to create a scheduler policy.
Procedure
To modify the current custom scheduling, use one of the following methods:
Edit the scheduler policy config map:
$ oc edit configmap <configmap-name> -n openshift-config
For example:
$ oc edit configmap scheduler-policy -n openshift-config
Example output
apiVersion: v1 data: policy.cfg: | { "kind" : "Policy", "apiVersion" : "v1", "predicates" : [ 1 {"name" : "MaxGCEPDVolumeCount"}, {"name" : "GeneralPredicates"}, {"name" : "MaxAzureDiskVolumeCount"}, {"name" : "MaxCSIVolumeCountPred"}, {"name" : "CheckVolumeBinding"}, {"name" : "MaxEBSVolumeCount"}, {"name" : "MatchInterPodAffinity"}, {"name" : "CheckNodeUnschedulable"}, {"name" : "NoDiskConflict"}, {"name" : "NoVolumeZoneConflict"}, {"name" : "PodToleratesNodeTaints"} ], "priorities" : [ 2 {"name" : "LeastRequestedPriority", "weight" : 1}, {"name" : "BalancedResourceAllocation", "weight" : 1}, {"name" : "ServiceSpreadingPriority", "weight" : 1}, {"name" : "NodePreferAvoidPodsPriority", "weight" : 1}, {"name" : "NodeAffinityPriority", "weight" : 1}, {"name" : "TaintTolerationPriority", "weight" : 1}, {"name" : "ImageLocalityPriority", "weight" : 1}, {"name" : "SelectorSpreadPriority", "weight" : 1}, {"name" : "InterPodAffinityPriority", "weight" : 1}, {"name" : "EqualPriority", "weight" : 1} ] } kind: ConfigMap metadata: creationTimestamp: "2019-09-17T17:44:19Z" name: scheduler-policy namespace: openshift-config resourceVersion: "15370" selfLink: /api/v1/namespaces/openshift-config/configmaps/scheduler-policy
It can take a few minutes for the scheduler to restart the pods with the updated policy.
Change the policies and predicates being used:
Remove the scheduler policy config map:
$ oc delete configmap -n openshift-config <name>
For example:
$ oc delete configmap -n openshift-config scheduler-policy
Edit the
policy.cfg
file to add and remove policies and predicates as needed.For example:
$ vi policy.cfg
Example output
apiVersion: v1 data: policy.cfg: | { "kind" : "Policy", "apiVersion" : "v1", "predicates" : [ {"name" : "MaxGCEPDVolumeCount"}, {"name" : "GeneralPredicates"}, {"name" : "MaxAzureDiskVolumeCount"}, {"name" : "MaxCSIVolumeCountPred"}, {"name" : "CheckVolumeBinding"}, {"name" : "MaxEBSVolumeCount"}, {"name" : "MatchInterPodAffinity"}, {"name" : "CheckNodeUnschedulable"}, {"name" : "NoDiskConflict"}, {"name" : "NoVolumeZoneConflict"}, {"name" : "PodToleratesNodeTaints"} ], "priorities" : [ {"name" : "LeastRequestedPriority", "weight" : 1}, {"name" : "BalancedResourceAllocation", "weight" : 1}, {"name" : "ServiceSpreadingPriority", "weight" : 1}, {"name" : "NodePreferAvoidPodsPriority", "weight" : 1}, {"name" : "NodeAffinityPriority", "weight" : 1}, {"name" : "TaintTolerationPriority", "weight" : 1}, {"name" : "ImageLocalityPriority", "weight" : 1}, {"name" : "SelectorSpreadPriority", "weight" : 1}, {"name" : "InterPodAffinityPriority", "weight" : 1}, {"name" : "EqualPriority", "weight" : 1} ] }
Re-create the scheduler policy config map based on the scheduler JSON file:
$ oc create configmap -n openshift-config --from-file=policy.cfg <configmap-name> 1
- 1
- Enter a name for the config map.
For example:
$ oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy
Example output
configmap/scheduler-policy created
2.2.3.1. Understanding the scheduler predicates
Predicates are rules that filter out unqualified nodes.
There are several predicates provided by default in OpenShift Container Platform. Some of these predicates can be customized by providing certain parameters. Multiple predicates can be combined to provide additional filtering of nodes.
2.2.3.1.1. Static Predicates
These predicates do not take any configuration parameters or inputs from the user. These are specified in the scheduler configuration using their exact name.
2.2.3.1.1.1. Default Predicates
The default scheduler policy includes the following predicates:
The NoVolumeZoneConflict
predicate checks that the volumes a pod requests are available in the zone.
{"name" : "NoVolumeZoneConflict"}
The MaxEBSVolumeCount
predicate checks the maximum number of volumes that can be attached to an AWS instance.
{"name" : "MaxEBSVolumeCount"}
The MaxAzureDiskVolumeCount
predicate checks the maximum number of Azure Disk Volumes.
{"name" : "MaxAzureDiskVolumeCount"}
The PodToleratesNodeTaints
predicate checks if a pod can tolerate the node taints.
{"name" : "PodToleratesNodeTaints"}
The CheckNodeUnschedulable
predicate checks if a pod can be scheduled on a node with Unschedulable
spec.
{"name" : "CheckNodeUnschedulable"}
The CheckVolumeBinding
predicate evaluates if a pod can fit based on the volumes, it requests, for both bound and unbound PVCs.
- For PVCs that are bound, the predicate checks that the corresponding PV’s node affinity is satisfied by the given node.
- For PVCs that are unbound, the predicate searched for available PVs that can satisfy the PVC requirements and that the PV node affinity is satisfied by the given node.
The predicate returns true if all bound PVCs have compatible PVs with the node, and if all unbound PVCs can be matched with an available and node-compatible PV.
{"name" : "CheckVolumeBinding"}
The NoDiskConflict
predicate checks if the volume requested by a pod is available.
{"name" : "NoDiskConflict"}
The MaxGCEPDVolumeCount
predicate checks the maximum number of Google Compute Engine (GCE) Persistent Disks (PD).
{"name" : "MaxGCEPDVolumeCount"}
The MaxCSIVolumeCount
predicate determines how many Container Storage Interface (CSI) volumes should be attached to a node and whether that number exceeds a configured limit.
{"name" : "MaxCSIVolumeCount"}
The MatchInterPodAffinity
predicate checks if the pod affinity/anti-affinity rules permit the pod.
{"name" : "MatchInterPodAffinity"}
2.2.3.1.1.2. Other Static Predicates
OpenShift Container Platform also supports the following predicates:
The CheckNode-*
predicates cannot be used if the Taint Nodes By Condition feature is enabled. The Taint Nodes By Condition feature is enabled by default.
The CheckNodeCondition
predicate checks if a pod can be scheduled on a node reporting out of disk, network unavailable, or not ready conditions.
{"name" : "CheckNodeCondition"}
The CheckNodeLabelPresence
predicate checks if all of the specified labels exist on a node, regardless of their value.
{"name" : "CheckNodeLabelPresence"}
The checkServiceAffinity
predicate checks that ServiceAffinity labels are homogeneous for pods that are scheduled on a node.
{"name" : "checkServiceAffinity"}
The PodToleratesNodeNoExecuteTaints
predicate checks if a pod tolerations can tolerate a node NoExecute
taints.
{"name" : "PodToleratesNodeNoExecuteTaints"}
2.2.3.1.2. General Predicates
The following general predicates check whether non-critical predicates and essential predicates pass. Non-critical predicates are the predicates that only non-critical pods must pass and essential predicates are the predicates that all pods must pass.
The default scheduler policy includes the general predicates.
Non-critical general predicates
The PodFitsResources
predicate determines a fit based on resource availability (CPU, memory, GPU, and so forth). The nodes can declare their resource capacities and then pods can specify what resources they require. Fit is based on requested, rather than used resources.
{"name" : "PodFitsResources"}
Essential general predicates
The PodFitsHostPorts
predicate determines if a node has free ports for the requested pod ports (absence of port conflicts).
{"name" : "PodFitsHostPorts"}
The HostName
predicate determines fit based on the presence of the Host parameter and a string match with the name of the host.
{"name" : "HostName"}
The MatchNodeSelector
predicate determines fit based on node selector (nodeSelector) queries defined in the pod.
{"name" : "MatchNodeSelector"}
2.2.3.2. Understanding the scheduler priorities
Priorities are rules that rank nodes according to preferences.
A custom set of priorities can be specified to configure the scheduler. There are several priorities provided by default in OpenShift Container Platform. Other priorities can be customized by providing certain parameters. Multiple priorities can be combined and different weights can be given to each in order to impact the prioritization.
2.2.3.2.1. Static Priorities
Static priorities do not take any configuration parameters from the user, except weight. A weight is required to be specified and cannot be 0 or negative.
These are specified in the scheduler policy config map in the openshift-config
project.
2.2.3.2.1.1. Default Priorities
The default scheduler policy includes the following priorities. Each of the priority function has a weight of 1
except NodePreferAvoidPodsPriority
, which has a weight of 10000
.
The NodeAffinityPriority
priority prioritizes nodes according to node affinity scheduling preferences
{"name" : "NodeAffinityPriority", "weight" : 1}
The TaintTolerationPriority
priority prioritizes nodes that have a fewer number of intolerable taints on them for a pod. An intolerable taint is one which has key PreferNoSchedule
.
{"name" : "TaintTolerationPriority", "weight" : 1}
The ImageLocalityPriority
priority prioritizes nodes that already have requested pod container’s images.
{"name" : "ImageLocalityPriority", "weight" : 1}
The SelectorSpreadPriority
priority looks for services, replication controllers (RC), replication sets (RS), and stateful sets that match the pod, then finds existing pods that match those selectors. The scheduler favors nodes that have fewer existing matching pods. Then, it schedules the pod on a node with the smallest number of pods that match those selectors as the pod being scheduled.
{"name" : "SelectorSpreadPriority", "weight" : 1}
The InterPodAffinityPriority
priority computes a sum by iterating through the elements of weightedPodAffinityTerm
and adding weight to the sum if the corresponding PodAffinityTerm is satisfied for that node. The node(s) with the highest sum are the most preferred.
{"name" : "InterPodAffinityPriority", "weight" : 1}
The LeastRequestedPriority
priority favors nodes with fewer requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes nodes that have the highest available/remaining capacity.
{"name" : "LeastRequestedPriority", "weight" : 1}
The BalancedResourceAllocation
priority favors nodes with balanced resource usage rate. It calculates the difference between the consumed CPU and memory as a fraction of capacity, and prioritizes the nodes based on how close the two metrics are to each other. This should always be used together with LeastRequestedPriority
.
{"name" : "BalancedResourceAllocation", "weight" : 1}
The NodePreferAvoidPodsPriority
priority ignores pods that are owned by a controller other than a replication controller.
{"name" : "NodePreferAvoidPodsPriority", "weight" : 10000}
2.2.3.2.1.2. Other Static Priorities
OpenShift Container Platform also supports the following priorities:
The EqualPriority
priority gives an equal weight of 1
to all nodes, if no priority configurations are provided. We recommend using this priority only for testing environments.
{"name" : "EqualPriority", "weight" : 1}
The MostRequestedPriority
priority prioritizes nodes with most requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the maximum of the average of the fraction of requested to capacity.
{"name" : "MostRequestedPriority", "weight" : 1}
The ServiceSpreadingPriority
priority spreads pods by minimizing the number of pods belonging to the same service onto the same machine.
{"name" : "ServiceSpreadingPriority", "weight" : 1}
2.2.3.2.2. Configurable Priorities
You can configure these priorities in the scheduler policy config map, in the openshift-config
namespace, to add labels to affect how the priorities work.
The type of the priority function is identified by the argument that they take. Since these are configurable, multiple priorities of the same type (but different configuration parameters) can be combined as long as their user-defined names are different.
For information on using these priorities, see Modifying Scheduler Policy.
The ServiceAntiAffinity
priority takes a label and ensures a good spread of the pods belonging to the same service across the group of nodes based on the label values. It gives the same score to all nodes that have the same value for the specified label. It gives a higher score to nodes within a group with the least concentration of pods.
{ "kind": "Policy", "apiVersion": "v1", "priorities":[ { "name":"<name>", 1 "weight" : 1 2 "argument":{ "serviceAntiAffinity":{ "label": "<label>" 3 } } } ] }
For example:
{ "kind": "Policy", "apiVersion": "v1", "priorities": [ { "name":"RackSpread", "weight" : 1, "argument": { "serviceAntiAffinity": { "label": "rack" } } } ] }
In some situations using the ServiceAntiAffinity
parameter based on custom labels does not spread pod as expected. See this Red Hat Solution.
The labelPreference
parameter gives priority based on the specified label. If the label is present on a node, that node is given priority. If no label is specified, priority is given to nodes that do not have a label. If multiple priorities with the labelPreference
parameter are set, all of the priorities must have the same weight.
{ "kind": "Policy", "apiVersion": "v1", "priorities":[ { "name":"<name>", 1 "weight" : 1 2 "argument":{ "labelPreference":{ "label": "<label>", 3 "presence": true 4 } } } ] }
2.2.4. Sample Policy Configurations
The configuration below specifies the default scheduler configuration, if it were to be specified using the scheduler policy file.
{ "kind": "Policy", "apiVersion": "v1", "predicates": [ { "name": "RegionZoneAffinity", 1 "argument": { "serviceAffinity": { 2 "labels": ["region, zone"] 3 } } } ], "priorities": [ { "name":"RackSpread", 4 "weight" : 1, "argument": { "serviceAntiAffinity": { 5 "label": "rack" 6 } } } ] }
In all of the sample configurations below, the list of predicates and priority functions is truncated to include only the ones that pertain to the use case specified. In practice, a complete/meaningful scheduler policy should include most, if not all, of the default predicates and priorities listed above.
The following example defines three topological levels, region (affinity)
{ "kind": "Policy", "apiVersion": "v1", "predicates": [ { "name": "RegionZoneAffinity", "argument": { "serviceAffinity": { "labels": ["region, zone"] } } } ], "priorities": [ { "name":"RackSpread", "weight" : 1, "argument": { "serviceAntiAffinity": { "label": "rack" } } } ] }
The following example defines three topological levels, city
(affinity) building
(anti-affinity) room
(anti-affinity):
{ "kind": "Policy", "apiVersion": "v1", "predicates": [ { "name": "CityAffinity", "argument": { "serviceAffinity": { "label": "city" } } } ], "priorities": [ { "name":"BuildingSpread", "weight" : 1, "argument": { "serviceAntiAffinity": { "label": "building" } } }, { "name":"RoomSpread", "weight" : 1, "argument": { "serviceAntiAffinity": { "label": "room" } } } ] }
The following example defines a policy to only use nodes with the 'region' label defined and prefer nodes with the 'zone' label defined:
{ "kind": "Policy", "apiVersion": "v1", "predicates": [ { "name": "RequireRegion", "argument": { "labelPreference": { "labels": ["region"], "presence": true } } } ], "priorities": [ { "name":"ZonePreferred", "weight" : 1, "argument": { "labelPreference": { "label": "zone", "presence": true } } } ] }
The following example combines both static and configurable predicates and also priorities:
{ "kind": "Policy", "apiVersion": "v1", "predicates": [ { "name": "RegionAffinity", "argument": { "serviceAffinity": { "labels": ["region"] } } }, { "name": "RequireRegion", "argument": { "labelsPresence": { "labels": ["region"], "presence": true } } }, { "name": "BuildingNodesAvoid", "argument": { "labelsPresence": { "label": "building", "presence": false } } }, {"name" : "PodFitsPorts"}, {"name" : "MatchNodeSelector"} ], "priorities": [ { "name": "ZoneSpread", "weight" : 2, "argument": { "serviceAntiAffinity":{ "label": "zone" } } }, { "name":"ZonePreferred", "weight" : 1, "argument": { "labelPreference":{ "label": "zone", "presence": true } } }, {"name" : "ServiceSpreadingPriority", "weight" : 1} ] }