1.8. Including pod priority in pod scheduling decisions
You can enable pod priority and preemption in your cluster. Pod priority indicates the importance of a pod relative to other pods and queues the pods based on that priority. Pod preemption allows the cluster to evict, or preempt, lower-priority pods so that higher-priority pods can be scheduled if there is no available space on a suitable node. Pod priority also affects the scheduling order of pods and out-of-resource eviction ordering on the node.
To use priority and preemption, you create priority classes that define the relative weight of your pods. Then, reference a priority class in the pod specification to apply that weight for scheduling.
1.8.1. Understanding pod priority
When you use the Pod Priority and Preemption feature, the scheduler orders pending pods by their priority, and a pending pod is placed ahead of other pending pods with lower priority in the scheduling queue. As a result, the higher-priority pod might be scheduled sooner than pods with lower priority if its scheduling requirements are met. If a pod cannot be scheduled, the scheduler continues to schedule other lower-priority pods.
1.8.1.1. Pod priority classes
You can assign pods a priority class, which is a non-namespaced object that defines a mapping from a name to the integer value of the priority. The higher the value, the higher the priority.
A priority class object can take any 32-bit integer value smaller than or equal to 1000000000 (one billion). Reserve numbers larger than one billion for critical pods that should not be preempted or evicted. By default, OpenShift Container Platform has two reserved priority classes for critical system pods to have guaranteed scheduling.
$ oc get priorityclasses
Example output
NAME                      VALUE        GLOBAL-DEFAULT   AGE
cluster-logging           1000000      false            29s
system-cluster-critical   2000000000   false            72m
system-node-critical      2000001000   false            72m
system-node-critical - This priority class has a value of 2000001000 and is used for all pods that should never be evicted from a node. Examples of pods that have this priority class are sdn-ovs, sdn, and so forth. A number of critical components include the system-node-critical priority class by default, for example:
- master-api
- master-controller
- master-etcd
- sdn
- sdn-ovs
- sync
system-cluster-critical - This priority class has a value of 2000000000 (two billion) and is used with pods that are important for the cluster. Pods with this priority class can be evicted from a node in certain circumstances. For example, pods configured with the system-node-critical priority class can take priority. However, this priority class does ensure guaranteed scheduling. Examples of pods that can have this priority class are fluentd, add-on components like descheduler, and so forth. A number of critical components include the system-cluster-critical priority class by default, for example:
- fluentd
- metrics-server
- descheduler
cluster-logging - This priority is used by Fluentd to make sure Fluentd pods are scheduled to nodes over other apps.
If you upgrade your existing cluster, the priority of your existing pods is effectively zero. However, existing pods with the scheduler.alpha.kubernetes.io/critical-pod annotation are automatically converted to the system-cluster-critical class. Fluentd cluster logging pods with the annotation are converted to the cluster-logging priority class.
1.8.1.2. Pod priority names
After you have one or more priority classes, you can create pods that specify a priority class name in a Pod spec. The priority admission controller uses the priority class name field to populate the integer value of the priority. If the named priority class is not found, the pod is rejected.
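As a rough illustration of the admission step, the fragment below sketches how a pod that names a priority class might look after it is admitted. The class name high-priority and the value 1000000 come from the sample priority class in this section; the pod and container names are hypothetical. You set only priorityClassName, and the priority field is filled in for you.

```yaml
# Sketch of a Pod after admission (names are illustrative, not from a real cluster).
apiVersion: v1
kind: Pod
metadata:
  name: example-app                  # hypothetical pod name
spec:
  priorityClassName: high-priority   # set by you; must match an existing priority class
  priority: 1000000                  # populated by the priority admission controller
  containers:
  - name: app                        # hypothetical container
    image: nginx
```

If no priority class named high-priority exists when the pod is created, the pod is rejected rather than admitted with a priority of zero.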
1.8.2. Understanding pod preemption
When a developer creates a pod, the pod goes into a queue. If the developer configured the pod for pod priority or preemption, the scheduler picks a pod from the queue and tries to schedule the pod on a node. If the scheduler cannot find space on an appropriate node that satisfies all the specified requirements of the pod, preemption logic is triggered for the pending pod.
When the scheduler preempts one or more pods on a node, the nominatedNodeName field in the status of the higher-priority pod is set to the name of the node, along with the nodeName field. The scheduler uses the nominatedNodeName field to keep track of the resources reserved for pods, and the field also gives users information about preemptions in the cluster.
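For orientation, a pending pod that has triggered preemption might report its nominated node roughly as in the fragment below. This is a hand-written sketch, not captured output: the pod and node names are hypothetical, and the layout assumes the Kubernetes Pod API, where nominatedNodeName is reported under status.

```yaml
# Hypothetical excerpt of a pending pod, such as `oc get pod <pod-name> -o yaml` might show.
metadata:
  name: example-app                # hypothetical pod name
spec:
  priorityClassName: high-priority
  priority: 1000000
status:
  phase: Pending
  nominatedNodeName: worker-1      # hypothetical node where lower-priority pods were preempted
```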
After the scheduler preempts a lower-priority pod, the scheduler honors the graceful termination period of the pod. If another node becomes available while the scheduler is waiting for the lower-priority pod to terminate, the scheduler can schedule the higher-priority pod on that node. As a result, the nominatedNodeName field and the nodeName field of the pod might be different.
Also, if the scheduler preempts pods on a node and is waiting for termination, and a pod with a higher priority than the pending pod needs to be scheduled, the scheduler can schedule the higher-priority pod instead. In such a case, the scheduler clears the nominatedNodeName field of the pending pod, making the pod eligible for another node.
Preemption does not necessarily remove all lower-priority pods from a node. The scheduler can schedule a pending pod by removing a portion of the lower-priority pods.
The scheduler considers a node for pod preemption only if the pending pod can be scheduled on the node.
1.8.2.1. Pod preemption and other scheduler settings
If you enable pod priority and preemption, consider your other scheduler settings:
- Pod priority and pod disruption budget
- A pod disruption budget specifies the minimum number or percentage of replicas that must be up at a time. If you specify pod disruption budgets, OpenShift Container Platform respects them when preempting pods at a best effort level. The scheduler attempts to preempt pods without violating the pod disruption budget. If no such pods are found, lower-priority pods might be preempted despite their pod disruption budget requirements.
- Pod priority and pod affinity
- Pod affinity requires a new pod to be scheduled on the same node as other pods with the same label.
If a pending pod has inter-pod affinity with one or more of the lower-priority pods on a node, the scheduler cannot preempt the lower-priority pods without violating the affinity requirements. In this case, the scheduler looks for another node to schedule the pending pod. However, there is no guarantee that the scheduler can find an appropriate node, and the pending pod might not be scheduled.
To prevent this situation, carefully configure pod affinity with equal-priority pods.
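The two settings above can be sketched as follows. This is a minimal example under stated assumptions: every name and label is hypothetical, the policy/v1 API is assumed to be served by your cluster (older clusters use policy/v1beta1), and the pods labeled app=cache are assumed to use the same high-priority class as the pod that declares affinity to them.

```yaml
# Sketch: a pod disruption budget for lower-priority pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb           # hypothetical
spec:
  minAvailable: 2                  # respected during preemption only at a best-effort level
  selector:
    matchLabels:
      app: batch-worker            # hypothetical label on the lower-priority pods
---
# Sketch: required pod affinity that targets pods of equal priority.
apiVersion: v1
kind: Pod
metadata:
  name: cache-client               # hypothetical
spec:
  priorityClassName: high-priority
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cache             # assumed to run with the same priority class
        topologyKey: kubernetes.io/hostname   # co-locate on the same node
  containers:
  - name: client
    image: nginx
```

Because the affinity target in this sketch has the same priority as the pending pod, the scheduler does not need to preempt the pods that the affinity rule depends on.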
1.8.2.2. Graceful termination of preempted pods
When preempting a pod, the scheduler waits for the pod's graceful termination period to expire, allowing the pod to finish working and exit. If the pod does not exit after the period, the scheduler kills the pod. This graceful termination period creates a time gap between the point that the scheduler preempts the pod and the time when the pending pod can be scheduled on the node.
To minimize this gap, configure a small graceful termination period for lower-priority pods.
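One way to do this is to set terminationGracePeriodSeconds in the Pod spec of lower-priority workloads. The fragment below is a sketch only: the pod name and the low-priority class are hypothetical, and the value of 10 seconds is an arbitrary illustration, not a recommendation.

```yaml
# Sketch: a lower-priority pod with a short graceful termination period.
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker                   # hypothetical
spec:
  priorityClassName: low-priority      # hypothetical class; it must already exist
  terminationGracePeriodSeconds: 10    # the Kubernetes default is 30 seconds when unset
  containers:
  - name: worker
    image: nginx
```

Choose a period that still gives the application enough time to shut down cleanly.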
1.8.3. Configuring priority and preemption
You apply pod priority and preemption by creating a priority class object and associating pods to the priority using the priorityClassName field in your Pod specs.
Sample priority class object
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority # 1
value: 1000000 # 2
globalDefault: false # 3
description: "This priority class should be used for XYZ service pods only." # 4
- 1
- The name of the priority class object.
- 2
- The priority value of the object.
- 3
- Optional field that indicates whether this priority class should be used for pods without a priority class name specified. This field is false by default. Only one priority class with globalDefault set to true can exist in the cluster. If there is no priority class with globalDefault: true, the priority of pods with no priority class name is zero. Adding a priority class with globalDefault: true affects only pods created after the priority class is added and does not change the priorities of existing pods.
- 4
- Optional arbitrary text string that describes which pods developers should use with this priority class.
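For comparison, a priority class intended as the cluster-wide default might look like the following sketch. The name, value, and description are hypothetical; only one such class with globalDefault set to true can exist in the cluster.

```yaml
# Sketch: a default priority class for pods that do not name a class.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority       # hypothetical
value: 1000                    # hypothetical; must not exceed 1000000000
globalDefault: true            # applies only to pods created after this class is added
description: "Default priority for pods that do not set priorityClassName."
```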
Procedure
To configure your cluster to use priority and preemption:
Create one or more priority classes:
- Specify a name and value for the priority.
- Optionally specify the globalDefault field in the priority class and a description.
Create a Pod spec or edit existing pods to include the name of a priority class, similar to the following:
Sample Pod spec with priority class name
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority # 1
- 1
- Specify the priority class to use with this pod.
Create the pod:
$ oc create -f <file-name>.yaml
You can add the priority name directly to the pod configuration or to a pod template.
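As a sketch of the pod template case, the priority class name goes under spec.template.spec of the controller object. The Deployment below is illustrative: its name, labels, and replica count are hypothetical, and it assumes the high-priority class from the sample above already exists.

```yaml
# Sketch: setting the priority class in a Deployment pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment         # hypothetical
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      priorityClassName: high-priority   # applied to every pod created from this template
      containers:
      - name: nginx
        image: nginx
        imagePullPolicy: IfNotPresent
```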