2.11. Node-level overcommit


You can use various ways to control overcommit on specific nodes, such as quality of service (QoS) guarantees, CPU limits, or reserving resources. You can also disable overcommit for specific nodes and specific projects.

2.11.1. Understanding compute resources and containers

The node-enforced behavior for compute resources is specific to the resource type.

2.11.1.1. Understanding container CPU requests

A container is guaranteed the amount of CPU it requests and is additionally able to consume excess CPU available on the node, up to any limit specified by the container. If multiple containers are attempting to use excess CPU, CPU time is distributed based on the amount of CPU requested by each container.

For example, if one container requested 500m of CPU time and another container requested 250m of CPU time, then any extra CPU time available on the node is distributed among the containers in a 2:1 ratio. If a container specified a limit, it will be throttled not to use more CPU than the specified limit. CPU requests are enforced using the CFS shares support in the Linux kernel. By default, CPU limits are enforced using the CFS quota support in the Linux kernel over a 100ms measuring interval, though this can be disabled.
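
For example, the following pod specification is a minimal sketch of this behavior; the pod, container, and image names are placeholders. The first container receives twice as much of any excess CPU as the second, and is throttled if it tries to exceed its 1 CPU limit:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-share-example
spec:
  containers:
  - name: app-large
    image: registry.example.com/app:latest
    resources:
      requests:
        cpu: 500m      # receives excess CPU in a 2:1 ratio relative to app-small
      limits:
        cpu: "1"       # CFS quota throttles usage above 1 CPU
  - name: app-small
    image: registry.example.com/app:latest
    resources:
      requests:
        cpu: 250m      # no limit, so it can consume any remaining excess CPU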

2.11.1.2. Understanding container memory requests

A container is guaranteed the amount of memory it requests. A container can use more memory than requested, but once it exceeds its requested amount, it could be terminated in a low memory situation on the node. If a container uses less memory than requested, it will not be terminated unless system tasks or daemons need more memory than was accounted for in the node’s resource reservation. If a container specifies a limit on memory, it is immediately terminated if it exceeds the limit amount.
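
As an illustration, the following sketch (names and values are placeholders) requests 256Mi of memory and sets a 512Mi limit; the container can use more than 256Mi when memory is available, but is terminated if it exceeds 512Mi:

apiVersion: v1
kind: Pod
metadata:
  name: memory-example
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    resources:
      requests:
        memory: 256Mi   # guaranteed; usage above this is at risk during node memory pressure
      limits:
        memory: 512Mi   # exceeding this amount terminates the container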

2.11.2. Understanding overcommitment and quality of service classes

A node is overcommitted when it has a pod scheduled that makes no request, or when the sum of limits across all pods on that node exceeds available machine capacity.

In an overcommitted environment, it is possible that the pods on the node will attempt to use more compute resource than is available at any given point in time. When this occurs, the node must give priority to one pod over another. The facility used to make this decision is referred to as a Quality of Service (QoS) Class.

For each compute resource, a container falls into one of three QoS classes, listed in decreasing order of priority:

Table 2.2. Quality of Service Classes

Priority: 1 (highest)
Class Name: Guaranteed
Description: If limits and optionally requests are set (not equal to 0) for all resources and they are equal, then the container is classified as Guaranteed.

Priority: 2
Class Name: Burstable
Description: If requests and optionally limits are set (not equal to 0) for all resources, and they are not equal, then the container is classified as Burstable.

Priority: 3 (lowest)
Class Name: BestEffort
Description: If requests and limits are not set for any of the resources, then the container is classified as BestEffort.
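
For illustration, the following resources stanzas are a minimal sketch (the values are arbitrary) of how each class arises:

# Guaranteed: requests and limits are set for every resource and are equal
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi

# Burstable: requests are set, and limits are either unset or higher than the requests
resources:
  requests:
    cpu: 250m
    memory: 128Mi
  limits:
    memory: 512Mi

# BestEffort: no requests or limits are set for any resource
resources: {}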

Memory is an incompressible resource, so in low memory situations, containers that have the lowest priority are terminated first:

  • Guaranteed containers are considered top priority, and are guaranteed to only be terminated if they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
  • Burstable containers under system memory pressure are more likely to be terminated once they exceed their requests and no other BestEffort containers exist.
  • BestEffort containers are treated with the lowest priority. Processes in these containers are first to be terminated if the system runs out of memory.

2.11.2.1. Understanding how to reserve memory across quality of service tiers

You can use the qos-reserved parameter to specify a percentage of memory to be reserved by a pod in a particular QoS level. This feature attempts to reserve requested resources by excluding pods in lower QoS classes from using resources requested by pods in higher QoS classes.

OpenShift Container Platform uses the qos-reserved parameter as follows:

  • A value of qos-reserved=memory=100% prevents the Burstable and BestEffort QoS classes from consuming memory that was requested by a higher QoS class. This increases the risk of inducing OOM on BestEffort and Burstable workloads in favor of increasing memory resource guarantees for Guaranteed and Burstable workloads.
  • A value of qos-reserved=memory=50% allows the Burstable and BestEffort QoS classes to consume half of the memory requested by a higher QoS class.
  • A value of qos-reserved=memory=0% allows the Burstable and BestEffort QoS classes to consume up to the full node allocatable amount if available, but increases the risk that a Guaranteed workload will not have access to requested memory. This condition effectively disables this feature.
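
The qos-reserved parameter is applied through the kubelet configuration. The following KubeletConfig is a sketch only, assuming the qosReserved kubelet field (which may also require the corresponding feature gate) and the custom-kubelet: small-pods pool label used elsewhere in this section:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: qos-reserved-memory            # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods
  kubeletConfig:
    qosReserved:
      memory: 50%                      # reserves 50% of the memory requested by higher QoS classes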

2.11.3. Understanding swap memory and QoS

You can disable swap by default on your nodes to preserve quality of service (QoS) guarantees. Otherwise, physical resources on a node can be oversubscribed, affecting the resource guarantees the Kubernetes scheduler makes during pod placement.

For example, if two guaranteed pods have reached their memory limit, each container could start using swap memory. Eventually, if there is not enough swap space, processes in the pods can be terminated due to the system being oversubscribed.

Failing to disable swap results in nodes not recognizing that they are experiencing MemoryPressure, which means pods do not receive the memory they requested at scheduling time. As a result, additional pods are placed on the node, further increasing memory pressure and ultimately increasing your risk of experiencing a system out of memory (OOM) event.

Important

If swap is enabled, any out-of-resource handling eviction thresholds for available memory will not work as expected. Take advantage of out-of-resource handling to allow pods to be evicted from a node when it is under memory pressure, and rescheduled on an alternative node that has no such pressure.

2.11.4. Understanding node overcommitment

In an overcommitted environment, it is important to properly configure your node to provide the best system behavior.

When the node starts, it ensures that the kernel tunable flags for memory management are set properly. The kernel should never fail memory allocations unless it runs out of physical memory.

To ensure this behavior, OpenShift Container Platform configures the kernel to always overcommit memory by setting the vm.overcommit_memory parameter to 1, overriding the default operating system setting.

OpenShift Container Platform also configures the kernel not to panic when it runs out of memory by setting the vm.panic_on_oom parameter to 0. A setting of 0 instructs the kernel to call oom_killer in an Out of Memory (OOM) condition, which kills processes based on priority.

You can view the current setting by running the following commands on your nodes:

$ sysctl -a |grep commit

Example output

vm.overcommit_memory = 1

$ sysctl -a |grep panic

Example output

vm.panic_on_oom = 0

Note

The above flags should already be set on nodes, and no further action is required.

You can also perform the following configurations for each node:

  • Disable or enforce CPU limits using CPU CFS quotas
  • Reserve resources for system processes
  • Reserve memory across quality of service tiers

2.11.5. Disabling or enforcing CPU limits using CPU CFS quotas

Nodes by default enforce specified CPU limits using the Completely Fair Scheduler (CFS) quota support in the Linux kernel.

If you disable CPU limit enforcement, it is important to understand the impact on your node:

  • If a container has a CPU request, the request continues to be enforced by CFS shares in the Linux kernel.
  • If a container does not have a CPU request, but does have a CPU limit, the CPU request defaults to the specified CPU limit, and is enforced by CFS shares in the Linux kernel.
  • If a container has both a CPU request and limit, the CPU request is enforced by CFS shares in the Linux kernel, and the CPU limit has no impact on the node.
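
For example, the following sketch (names are placeholders) sets only a CPU limit; per the second case above, its CPU request defaults to 500m and continues to be enforced through CFS shares even when limit enforcement is disabled:

apiVersion: v1
kind: Pod
metadata:
  name: limit-only-example
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    resources:
      limits:
        cpu: 500m   # with no explicit request, the request defaults to this value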

Prerequisites

  1. Obtain the label associated with the static MachineConfigPool CRD for the type of node you want to configure. Perform one of the following steps:

    1. View the machine config pool:

      $ oc describe machineconfigpool <name>

      For example:

      $ oc describe machineconfigpool worker

      Example output

      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        creationTimestamp: 2019-02-08T14:52:39Z
        generation: 1
        labels:
          custom-kubelet: small-pods 1

      1
If a label has been added, it appears under labels.
    2. If the label is not present, add a key/value pair:

      $ oc label machineconfigpool worker custom-kubelet=small-pods

Procedure

  1. Create a custom resource (CR) for your configuration change.

    Sample configuration for disabling CPU limits

    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: disable-cpu-units 1
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: small-pods 2
      kubeletConfig:
        cpuCfsQuota: false 3

    1
    Assign a name to the CR.
    2
    Specify the label to apply the configuration change.
    3
    Set the cpuCfsQuota parameter to false.

2.11.6. Reserving resources for system processes

To provide more reliable scheduling and minimize node resource overcommitment, each node can reserve a portion of its resources for use by system daemons that are required to run on your node for your cluster to function. In particular, it is recommended that you reserve resources for incompressible resources such as memory.

Procedure

To explicitly reserve resources for non-pod processes, allocate node resources by specifying resources available for scheduling. For more details, see Allocating Resources for Nodes.
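
For example, a KubeletConfig similar to the one shown earlier in this section can carry system reservations. This is a sketch only, assuming the systemReserved kubelet fields and the custom-kubelet: small-pods pool label:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: reserve-system-resources       # illustrative name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods
  kubeletConfig:
    systemReserved:
      cpu: 500m                        # removed from node allocatable and kept for system daemons
      memory: 1Gi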

2.11.7. Disabling overcommitment for a node

If overcommitment is enabled, you can disable it on individual nodes.

Procedure

To disable overcommitment on a node, run the following command on that node:

$ sysctl -w vm.overcommit_memory=0
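
This setting does not persist across node reboots. If you need it to persist, one possible approach (a sketch only, not part of the documented procedure; the name, role, and Ignition version are illustrative and must match your cluster) is a MachineConfig that writes a sysctl drop-in file:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-disable-overcommit   # illustrative name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0                   # use the Ignition spec version supported by your cluster
    storage:
      files:
      - path: /etc/sysctl.d/99-vm-overcommit.conf
        mode: 0644
        contents:
          source: data:,vm.overcommit_memory%3D0   # URL-encoded "vm.overcommit_memory=0"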