Chapter 21. Handling Out of Resource Errors
21.1. Overview
The node must preserve node stability when available compute resources are low. This is especially important when dealing with incompressible resources such as memory or disk. If either resource is exhausted, the node becomes unstable.
If swap memory is enabled, the node cannot recognize that it is under MemoryPressure. To take advantage of memory-based evictions, operators must disable swap.
21.2. Eviction Policy
Using eviction policies, a node can proactively monitor for and prevent total starvation of a compute resource.
In cases where a node is running low on available resources, it can proactively fail one or more pods in order to reclaim the starved resource using an eviction policy. When the node fails a pod, it terminates all containers in the pod, and the PodPhase is transitioned to Failed.
Platform administrators can configure eviction settings within the node-config.yaml file.
21.2.1. Eviction Signals
The node can be configured to trigger eviction decisions on the signals described in the table below. The value of each signal is described in the description column based on the node summary API.
To view the signals:
curl <certificate details> \
  https://<master>/api/v1/nodes/<node>/proxy/stats/summary
Eviction Signal | Description |
---|---|
memory.available | memory.available = node.status.capacity[memory] - node.stats.memory.workingSet |
nodefs.available | nodefs.available = node.stats.fs.available |
nodefs.inodesFree | nodefs.inodesFree = node.stats.fs.inodesFree |
imagefs.available | imagefs.available = node.stats.runtime.imagefs.available |
imagefs.inodesFree | imagefs.inodesFree = node.stats.runtime.imagefs.inodesFree |
The node supports two file system partitions when detecting disk pressure.

- The nodefs file system that the node uses for local disk volumes, daemon logs, and so on (for example, the file system that provides /).
- The imagefs file system that the container runtime uses for storing images and individual container writable layers.

The node auto-discovers these file systems using cAdvisor.

If you store volumes and logs in a dedicated file system, the node will not monitor that file system at this time.
As of OpenShift Container Platform 3.4, the node supports the ability to trigger eviction decisions based on disk pressure. Operators must opt in to enable disk-based evictions. Prior to evicting pods due to disk pressure, the node will also perform container and image garbage collection. In future releases, garbage collection will be deprecated in favor of a pure disk eviction based configuration.
21.2.2. Eviction Thresholds
You can configure a node to specify eviction thresholds, which trigger the node to reclaim resources.
Eviction thresholds can be soft, meaning a grace period is allowed before resources are reclaimed, or hard, meaning the node takes immediate action when a threshold is met.
Thresholds are configured in the following form:
<eviction_signal><operator><quantity>
- Valid eviction-signal tokens are as defined by eviction signals.
- Valid operator tokens are <.
- Valid quantity tokens must match the quantity representation used by Kubernetes.
- An eviction threshold can be expressed as a percentage if it ends with the % token.
For example, if an operator has a node with 10Gi of memory, and that operator wants to induce eviction if available memory falls below 1Gi, an eviction threshold for memory can be specified as either of the following:
memory.available<1Gi
memory.available<10%
21.2.2.1. Soft Eviction Thresholds
A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The node does not reclaim resources associated with the eviction signal until that grace period is exceeded. If no grace period is provided, the node errors on startup.
In addition, if a soft eviction threshold is met, an operator can specify a maximum allowed pod termination grace period to use when evicting pods from the node. If specified, the node uses the lesser value among the pod.Spec.TerminationGracePeriodSeconds and the maximum-allowed grace period. If not specified, the node kills pods immediately with no graceful termination.
To configure soft eviction thresholds, the following flags are supported:
- eviction-soft: a set of eviction thresholds (for example, memory.available<1.5Gi) that, if met over a corresponding grace period, triggers a pod eviction.
- eviction-soft-grace-period: a set of eviction grace periods (for example, memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction.
- eviction-max-pod-grace-period: the maximum-allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
21.2.2.2. Hard Eviction Thresholds
A hard eviction threshold has no grace period and, if observed, the node takes immediate action to reclaim the associated starved resource. If a hard eviction threshold is met, the node kills the pod immediately with no graceful termination.
To configure hard eviction thresholds, the following flag is supported:
- eviction-hard: a set of eviction thresholds (for example, memory.available<1Gi) that, if met, triggers a pod eviction.
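As a sketch, a hard memory threshold could be set in the node-config.yaml file as follows. The 1Gi value is illustrative only:

```yaml
kubeletArguments:
  eviction-hard:
  - "memory.available<1Gi"
```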
21.2.3. Oscillation of Node Conditions
If a node is oscillating above and below a soft eviction threshold, but not exceeding its associated grace period, the corresponding node condition oscillates between true and false, which can confuse the scheduler.
To protect against this, set the following flag to control how long the node must wait before transitioning out of a pressure condition:
- eviction-pressure-transition-period: the duration that the node must wait before transitioning out of an eviction pressure condition.
Before toggling the condition back to false, the node ensures that it has not observed a met eviction threshold for the specified pressure condition for the period specified.
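This flag could be set in the node-config.yaml file as in the following sketch. The 5m value is an illustrative assumption; choose a period appropriate for your environment:

```yaml
kubeletArguments:
  eviction-pressure-transition-period:
  - "5m"
```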
21.2.4. Eviction Monitoring Interval
The node evaluates and monitors eviction thresholds every 10 seconds; this interval cannot be modified. It is the housekeeping interval.
21.2.5. Mapping Eviction Signals to Node Conditions
The node can map one or more eviction signals to a corresponding node condition.
If an eviction threshold is met, independent of its associated grace period, the node reports a condition indicating that the node is under pressure.
The following node conditions are defined that correspond to the specified eviction signal.
Node Condition | Eviction Signal | Description |
---|---|---|
MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold. |
DiskPressure | nodefs.available, nodefs.inodesFree, imagefs.available, imagefs.inodesFree | Available disk space and inodes on either the node’s root file system or image file system has satisfied an eviction threshold. |
When the above is set, the node continues to report node status updates at the frequency specified by the node-status-update-frequency argument, which defaults to 10s.
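If needed, the update frequency can be set explicitly in the node-config.yaml file, as in this sketch showing the default value:

```yaml
kubeletArguments:
  node-status-update-frequency:
  - "10s"
```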
21.2.6. Reclaiming Node-level Resources
If an eviction criterion is satisfied, the node initiates the process of reclaiming the pressured resource until it observes that the signal has gone below its defined threshold. During this time, the node does not support scheduling any new pods.
The node attempts to reclaim node-level resources prior to evicting end-user pods. If disk pressure is observed, the node reclaims node-level resources differently if the machine has a dedicated imagefs configured for the container runtime.
21.2.6.1. With Imagefs
If the nodefs file system meets eviction thresholds, the node frees up disk space in the following order:

- Delete dead pods/containers

If the imagefs file system meets eviction thresholds, the node frees up disk space in the following order:

- Delete all unused images
21.2.6.2. Without Imagefs
If the nodefs file system meets eviction thresholds, the node frees up disk space in the following order:

- Delete dead pods/containers
- Delete all unused images
21.2.7. Eviction of Pods
If an eviction threshold is met and the grace period is passed, the node initiates the process of evicting pods until it observes the signal going below its defined threshold.
The node ranks pods for eviction by their quality of service, and, among those with the same quality of service, by the consumption of the starved compute resource relative to the pod’s scheduling request.
- BestEffort: pods that consume the most of the starved resource are failed first.
- Burstable: pods that consume the most of the starved resource relative to their request for that resource are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.
- Guaranteed: pods that consume the most of the starved resource relative to their request are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.
A Guaranteed pod will never be evicted because of another pod’s resource consumption unless a system daemon (node, docker, journald, and so on) is consuming more resources than were reserved via system-reserved or kube-reserved allocations, or unless the node has only Guaranteed pods remaining. In the latter case, the node evicts the Guaranteed pod that least impacts node stability, limiting the impact of the unexpected consumption on other Guaranteed pods.
Local disk is a BestEffort resource. If necessary, the node evicts pods one at a time to reclaim disk when DiskPressure is encountered. The node ranks pods by quality of service. If the node is responding to inode starvation, it reclaims inodes by evicting pods with the lowest quality of service first. If the node is responding to a lack of available disk, it ranks pods within a quality of service class by the amount of local disk consumed, and evicts the largest consumers first.
At this time, volumes that are backed by local disk are only deleted when a pod is deleted from the API server instead of when the pod is terminated.
As a result, if a pod is evicted as a consequence of consuming too much disk in an EmptyDir volume, the pod will be evicted, but the local volume usage will not be reclaimed by the node. The node will keep evicting pods on the node to prevent total exhaustion of disk. Operators can reclaim the disk by manually deleting the evicted pods from the node once terminated.
This will be remedied in a future release.
21.2.8. Scheduler
The scheduler views node conditions when placing additional pods on the node. For example, if the node has an eviction threshold like the following:
eviction-hard is "memory.available<500Mi"
and available memory falls below 500Mi, the node reports a condition in Node.Status.Conditions indicating that MemoryPressure is true.
Node Condition | Scheduler Behavior |
---|---|
MemoryPressure | If a node reports this condition, the scheduler will not place BestEffort pods on that node. |
DiskPressure | If a node reports this condition, the scheduler will not place any additional pods on that node. |
21.2.9. Example Scenario
Consider the following scenario:
- Node memory capacity of 10Gi.
- The operator wants to reserve 10% of memory capacity for system daemons (kernel, node, etc.).
- The operator wants to evict pods at 95% memory utilization to reduce thrashing and incidence of system OOM.

A node reports two values:

- Capacity: How much resource is on the machine.
- Allocatable: How much resource is made available for scheduling.
The goal is to allow the scheduler to fully allocate a node and to not have evictions occur.
Evictions should only occur if pods use more than their requested amount of resource.
To facilitate this scenario, the node configuration file (the node-config.yaml file) is modified as follows:
kubeletArguments:
  eviction-hard: [1]
  - "memory.available<500Mi"
  system-reserved:
  - "memory=1.5Gi"

[1] This threshold can either be eviction-hard or eviction-soft.
Soft eviction usage is more common when you are targeting a certain level of utilization, but can tolerate temporary spikes. It is recommended that the soft eviction threshold is always less than the hard eviction threshold, but the time period is operator specific. The system reservation should also cover the soft eviction threshold.
Implicit in this configuration is the understanding that system-reserved should include the amount of memory covered by the eviction threshold.

To reach that capacity, either some pod is using more than its request, or the system is using more than 1Gi.
If a node has 10 Gi of capacity, and you want to reserve 10% of that capacity for the system daemons, do the following:
capacity = 10 Gi
system-reserved = 10 Gi * .1 = 1 Gi
The node allocatable value in this setting becomes:
allocatable = capacity - system-reserved = 9 Gi
This means by default, the scheduler will schedule pods that request 9 Gi of memory to that node.
If you want to turn on eviction so that eviction is triggered when the node observes that available memory falls below 10% of capacity for 30 seconds, or immediately when it falls below 5% of capacity, you need the scheduler to see allocatable as 8Gi. Therefore, ensure your system reservation covers the greater of your eviction thresholds.
capacity = 10 Gi
eviction-threshold = 10 Gi * .1 = 1 Gi
system-reserved = (10 Gi * .1) + eviction-threshold = 2 Gi
allocatable = capacity - system-reserved = 8 Gi
You must set system-reserved equal to the amount of resource you want to reserve for system daemons, plus the amount of resource you want to reserve before triggering evictions.
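Continuing the 10 Gi scenario, a node-config.yaml that triggers eviction when available memory falls below 10% of capacity for 30 seconds, or immediately when it falls below 5%, with a matching system reservation, might look like the following sketch (values are illustrative):

```yaml
kubeletArguments:
  eviction-soft:
  - "memory.available<1Gi"
  eviction-soft-grace-period:
  - "memory.available=30s"
  eviction-hard:
  - "memory.available<0.5Gi"
  system-reserved:
  - "memory=2Gi"
```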
This configuration ensures that the scheduler does not place pods on a node that would immediately induce memory pressure and trigger eviction, assuming those pods use less than their configured request.
21.3. Out of Resource and Out of Memory
If the node experiences a system out of memory (OOM) event before it is able to reclaim memory, the node depends on the OOM killer to respond.
The node sets an oom_score_adj value for each container based on the quality of service for the pod.
Quality of Service | oom_score_adj Value |
---|---|
Guaranteed | -998 |
BestEffort | 1000 |
Burstable | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
If the node is unable to reclaim memory prior to experiencing a system OOM event, the oom_killer calculates an oom_score:
% of node memory a container is using + `oom_score_adj` = `oom_score`
The node then kills the container with the highest score.
Containers with the lowest quality of service that are consuming the largest amount of memory relative to the scheduling request are failed first.
Unlike pod eviction, if a pod container is OOM failed, it can be restarted by the node based on its RestartPolicy.
21.4. Recommended Practices
21.4.1. DaemonSets and Out of Resource Handling
If a node evicts a pod that was created by a DaemonSet, the pod will immediately be recreated and rescheduled back to the same node, because the node has no ability to distinguish a pod created from a DaemonSet versus any other object.
In general, DaemonSets should not create BestEffort pods, to avoid being identified as candidate pods for eviction. Instead, DaemonSets should ideally launch Guaranteed pods.
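A DaemonSet pod is classified as Guaranteed when every container sets identical resource requests and limits, as in this illustrative sketch (the name, image, and resource values are hypothetical):

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: example-daemon
spec:
  template:
    metadata:
      labels:
        app: example-daemon
    spec:
      containers:
      - name: example
        image: example/daemon:latest
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "100m"
```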