Chapter 26. Overcommitting
26.1. Overview
Containers can specify compute resource requests and limits. Requests are used for scheduling your container and provide a minimum service guarantee. Limits constrain the amount of compute resource that may be consumed on your node.
The scheduler attempts to optimize the compute resource use across all nodes in your cluster. It places pods onto specific nodes, taking the pods' compute resource requests and nodes' available capacity into consideration.
Requests and limits enable administrators to allow and manage the overcommitment of resources on a node, which may be desirable in development environments where a tradeoff of guaranteed performance for capacity is acceptable.
26.2. Requests and Limits
For each compute resource, a container may specify a resource request and limit. Scheduling decisions are made based on the request to ensure that a node has enough capacity available to meet the requested value. If a container specifies limits, but omits requests, the requests are defaulted to the limits. A container is not able to exceed the specified limit on the node.
The enforcement of limits is dependent upon the compute resource type. If a container makes no request or limit, the container is scheduled to a node with no resource guarantees. In practice, the container is able to consume as much of the specified resource as is available with the lowest local priority. In low resource situations, containers that specify no resource requests are given the lowest quality of service.
26.2.1. Tune Buffer Chunk Limit
If Fluentd logger is unable to keep up with a high number of logs, it will need to switch to file buffering to reduce memory usage and prevent data loss.
The Fluentd buffer_chunk_limit
is determined by the environment variable BUFFER_SIZE_LIMIT
, which has the default value 8m
. The file buffer size per output is determined by the environment variable FILE_BUFFER_LIMIT
, which has the default value 256Mi
. The permanent volume size must be larger than FILE_BUFFER_LIMIT
multiplied by the output.
On the Fluentd and Mux pods, permanent volume /var/lib/fluentd should be prepared by the PVC or hostmount, for example. That area is then used for the file buffers.
The buffer_type
and buffer_path
are configured in the Fluentd configuration files as follows:
$ egrep "buffer_type|buffer_path" *.conf output-es-config.conf: buffer_type file buffer_path `/var/lib/fluentd/buffer-output-es-config` output-es-ops-config.conf: buffer_type file buffer_path `/var/lib/fluentd/buffer-output-es-ops-config` filter-pre-mux-client.conf: buffer_type file buffer_path `/var/lib/fluentd/buffer-mux-client`
The Fluentd buffer_queue_limit
is the value of the variable BUFFER_QUEUE_LIMIT
. This value is 32
by default.
The environment variable BUFFER_QUEUE_LIMIT
is calculated as (FILE_BUFFER_LIMIT / (number_of_outputs * BUFFER_SIZE_LIMIT))
.
If the BUFFER_QUEUE_LIMIT
variable has the default set of values:
-
FILE_BUFFER_LIMIT = 256Mi
-
number_of_outputs = 1
-
BUFFER_SIZE_LIMIT = 8Mi
The value of buffer_queue_limit
will be 32
. To change the buffer_queue_limit
, you need to change the value of FILE_BUFFER_LIMIT
.
In this formula, number_of_outputs
is 1
if all the logs are sent to a single resource, and it is incremented by 1
for each additional resource. For example, the value of number_of_outputs
is:
-
1
- if all logs are sent to a single ElasticSearch pod -
2
- if application logs are sent to an ElasticSearch pod and ops logs are sent to another ElasticSearch pod -
4
- if application logs are sent to an ElasticSearch pod, ops logs are sent to another ElasticSearch pod, and both of them are forwarded to other Fluentd instances
26.3. Compute Resources
The node-enforced behavior for compute resources is specific to the resource type.
26.3.1. CPU
A container is guaranteed the amount of CPU it requests and is additionally able to consume excess CPU available on the node, up to any limit specified by the container. If multiple containers are attempting to use excess CPU, CPU time is distributed based on the amount of CPU requested by each container.
For example, if one container requested 500m of CPU time and another container requested 250m of CPU time, then any extra CPU time available on the node is distributed among the containers in a 2:1 ratio. If a container specified a limit, it will be throttled not to use more CPU than the specified limit.
CPU requests are enforced using the CFS shares support in the Linux kernel. By default, CPU limits are enforced using the CFS quota support in the Linux kernel over a 100ms measuring interval, though this can be disabled.
26.3.2. Memory
A container is guaranteed the amount of memory it requests. A container can use more memory than requested, but once it exceeds its requested amount, it could be terminated in a low memory situation on the node.
If a container uses less memory than requested, it will not be terminated unless system tasks or daemons need more memory than was accounted for in the node’s resource reservation. If a container specifies a limit on memory, it is immediately terminated if it exceeds the limit amount.
26.3.3. Ephemeral storage
This topic applies only if you enabled the ephemeral storage technology preview in OpenShift Container Platform 3.10. This feature is disabled by default. If enabled, the OpenShift Container Platform cluster uses ephemeral storage to store information that does not need to persist after the cluster is destroyed. To enable this feature, see configuring for ephemeral storage.
A container is guaranteed the amount of ephemeral storage it requests. A container can use more ephemeral storage than requested, but once it exceeds its requested amount, it can be terminated if the available ephemeral disk space gets too low.
If a container uses less ephemeral storage than requested, it will not be terminated unless system tasks or daemons need more local ephemeral storage than was accounted for in the node’s resource reservation. If a container specifies a limit on ephemeral storage, it is immediately terminated if it exceeds the limit amount.
26.4. Quality of Service Classes
A node is overcommitted when it has a pod scheduled that makes no request, or when the sum of limits across all pods on that node exceeds available machine capacity.
In an overcommitted environment, it is possible that the pods on the node will attempt to use more compute resource than is available at any given point in time. When this occurs, the node must give priority to one pod over another. The facility used to make this decision is referred to as a Quality of Service (QoS) Class.
For each compute resource, a container is divided into one of three QoS classes with decreasing order of priority:
Priority | Class Name | Description |
---|---|---|
1 (highest) | Guaranteed | If limits and optionally requests are set (not equal to 0) for all resources and they are equal, then the container is classified as Guaranteed. |
2 | Burstable | If requests and optionally limits are set (not equal to 0) for all resources, and they are not equal, then the container is classified as Burstable. |
3 (lowest) | BestEffort | If requests and limits are not set for any of the resources, then the container is classified as BestEffort. |
Memory is an incompressible resource, so in low memory situations, containers that have the lowest priority are terminated first:
- Guaranteed containers are considered top priority, and are guaranteed to only be terminated if they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
- Burstable containers under system memory pressure are more likely to be terminated once they exceed their requests and no other BestEffort containers exist.
- BestEffort containers are treated with the lowest priority. Processes in these containers are first to be terminated if the system runs out of memory.
26.5. Configuring Masters for Overcommitment
Scheduling is based on resources requested, while quota and hard limits refer to resource limits, which can be set higher than requested resources. The difference between request and limit determines the level of overcommit; for instance, if a container is given a memory request of 1Gi and a memory limit of 2Gi, it is scheduled based on the 1Gi request being available on the node, but could use up to 2Gi; so it is 200% overcommitted.
If OpenShift Container Platform administrators would like to control the level of overcommit and manage container density on nodes, masters can be configured to override the ratio between request and limit set on developer containers. In conjunction with a per-project LimitRange specifying limits and defaults, this adjusts the container limit and request to achieve the desired level of overcommit.
This requires configuring the ClusterResourceOverride
admission controller in the master-config.yaml as in the following example (reuse the existing configuration tree if it exists, or introduce absent elements as needed):
admissionConfig: pluginConfig: ClusterResourceOverride: 1 configuration: apiVersion: v1 kind: ClusterResourceOverrideConfig memoryRequestToLimitPercent: 25 2 cpuRequestToLimitPercent: 25 3 limitCPUToMemoryPercent: 200 4
- 1
- This is the plug-in name; case matters and anything but an exact match for a plug-in name is ignored.
- 2
- (optional, 1-100) If a container memory limit has been specified or defaulted, the memory request is overridden to this percentage of the limit.
- 3
- (optional, 1-100) If a container CPU limit has been specified or defaulted, the CPU request is overridden to this percentage of the limit.
- 4
- (optional, positive integer) If a container memory limit has been specified or defaulted, the CPU limit is overridden to a percentage of the memory limit, with a 100 percentage scaling 1Gi of RAM to equal 1 CPU core. This is processed prior to overriding CPU request (if configured).
After changing the master configuration, a master restart is required.
Note that these overrides have no effect if no limits have been set on containers. Create a LimitRange object with default limits (per individual project, or in the project template) in order to ensure that the overrides apply.
Note also that after overrides, the container limits and requests must still be validated by any LimitRange objects in the project. It is possible, for example, for developers to specify a limit close to the minimum limit, and have the request then be overridden below the minimum limit, causing the pod to be forbidden. This unfortunate user experience should be addressed with future work, but for now, configure this capability and LimitRanges with caution.
When configured, overrides can be disabled per-project (for example, to allow infrastructure components to be configured independently of overrides) by editing the project and adding the following annotation:
quota.openshift.io/cluster-resource-override-enabled: "false"
26.6. Configuring Nodes for Overcommitment
In an overcommitted environment, it is important to properly configure your node to provide best system behavior.
26.6.1. Reserving Memory Across Quality of Service Tiers
You can use the experimental-qos-reserved
parameter to specify a percentage of memory to be reserved by a pod in a particular QoS level. This feature attempts to reserve requested resources to exclude pods from lower OoS classes from using resources requested by pods in higher QoS classes.
By reserving resources for higher QOS levels, pods that don’t have resource limits are prevented from encroaching on the resources requested by pods at higher QoS levels.
To configure the experimental-qos-reserved
parameter, edit the appropriate node configuration map.
kubeletArguments:
cgroups-per-qos:
- true
cgroup-driver:
- 'systemd'
cgroup-root:
- '/'
experimental-qos-reserved: 1
- 'memory=50%'
- 1
- Specifies how pod resource requests are reserved at the QoS level.
OpenShift Container Platform uses the experimental-qos-reserved
parameter as follows:
-
A value of
experimental-qos-reserved=memory=100%
will prevent theBurstable
andBestEffort
QOS classes from consuming memory that was requested by a higher QoS class. This increases the risk of inducing OOM onBestEffort
andBurstable
workloads in favor of increasing memory resource guarantees forGuaranteed
andBurstable
workloads. -
A value of
experimental-qos-reserved=memory=50%
will allow theBurstable
andBestEffort
QOS classes to consume half of the memory requested by a higher QoS class. -
A value of
experimental-qos-reserved=memory=0%
will allow aBurstable
andBestEffort
QoS classes to consume up to the full node allocatable amount if available, but increases the risk that aGuaranteed
workload will not have access to requested memory. This condition effectively disables this feature.
26.6.2. Enforcing CPU Limits
Nodes by default enforce specified CPU limits using the CPU CFS quota support in the Linux kernel. If you do not want to enforce CPU limits on the node, you can disable its enforcement by modifying the appropriate node configuration map to include the following parameters:
kubeletArguments: cpu-cfs-quota: - "false"
If CPU limit enforcement is disabled, it is important to understand the impact that will have on your node:
- If a container makes a request for CPU, it will continue to be enforced by CFS shares in the Linux kernel.
- If a container makes no explicit request for CPU, but it does specify a limit, the request will default to the specified limit, and be enforced by CFS shares in the Linux kernel.
- If a container specifies both a request and a limit for CPU, the request will be enforced by CFS shares in the Linux kernel, and the limit will have no impact on the node.
26.6.3. Reserving Resources for System Processes
The scheduler ensures that there are enough resources for all pods on a node based on the pod requests. It verifies that the sum of requests of containers on the node is no greater than the node capacity. It includes all containers started by the node, but not containers or processes started outside the knowledge of the cluster.
It is recommended that you reserve some portion of the node capacity to allow for the system daemons that are required to run on your node for your cluster to function (sshd, docker, etc.). In particular, it is recommended that you reserve resources for incompressible resources such as memory.
If you want to explicitly reserve resources for non-pod processes, there are two ways to do so:
- The preferred method is to allocate node resources by specifying resources available for scheduling. See Allocating Node Resources for more details.
Alternatively, you can create a resource-reserver pod that does nothing but reserve capacity from being scheduled on the node by the cluster. For example:
Example 26.1. resource-reserver Pod Definition
apiVersion: v1 kind: Pod metadata: name: resource-reserver spec: containers: - name: sleep-forever image: gcr.io/google_containers/pause:0.8.0 resources: limits: cpu: 100m 1 memory: 150Mi 2
You can save your definition to a file, for example resource-reserver.yaml, then place the file in the node configuration directory, for example /etc/origin/node/ or the
--config=<dir>
location if otherwise specified.Additionally, configure the node server to read the definition from the node configuration directory by specifying the directory in the
kubeletArguments.config
parameter in the appropriate node configuration map:kubeletArguments: config: - "/etc/origin/node" 1
- 1
- If
--config=<dir>
is specified, use<dir>
here.
With the resource-reserver.yaml file in place, starting the node server also launches the sleep-forever container. The scheduler takes into account the remaining capacity of the node, adjusting where to place cluster pods accordingly.
To remove the resource-reserver pod, you can delete or move the resource-reserver.yaml file from the node configuration directory.
26.6.4. Kernel Tunable Flags
When the node starts, it ensures that the kernel tunable flags for memory management are set properly. The kernel should never fail memory allocations unless it runs out of physical memory.
To ensure this behavior, the node instructs the kernel to always overcommit memory:
$ sysctl -w vm.overcommit_memory=1
The node also instructs the kernel not to panic when it runs out of memory. Instead, the kernel OOM killer should kill processes based on priority:
$ sysctl -w vm.panic_on_oom=0
The above flags should already be set on nodes, and no further action is required.
26.6.5. Disabling Swap Memory
You can disable swap by default on your nodes in order to preserve quality of service guarantees. Otherwise, physical resources on a node can oversubscribe, affecting the resource guarantees the Kubernetes scheduler makes during pod placement.
For example, if two guaranteed pods have reached their memory limit, each container could start using swap memory. Eventually, if there is not enough swap space, processes in the pods can be terminated due to the system being oversubscribed.
To disable swap:
$ swapoff -a
Failing to disable swap results in nodes not recognizing that they are experiencing MemoryPressure, resulting in pods not receiving the memory they made in their scheduling request. As a result, additional pods are placed on the node to further increase memory pressure, ultimately increasing your risk of experiencing a system out of memory (OOM) event.
If swap is enabled, any out of resource handling eviction thresholds for available memory will not work as expected. Take advantage of out of resource handling to allow pods to be evicted from a node when it is under memory pressure, and rescheduled on an alternative node that has no such pressure.