Chapter 22. Allocating Node Resources
22.1. Purpose for Allocating Node Resources
To provide more reliable scheduling and minimize node resource overcommitment, reserve a portion of the CPU and memory resources for use by the underlying node components such as kubelet, kube-proxy, and the container engine. The resources that you reserve are also used by the remaining system components such as sshd, NetworkManager, and so on. Specifying the resources to reserve provides the scheduler with more information about the remaining memory and CPU resources that a node has available for use by pods.
22.2. Configuring Nodes for Allocated Resources
Resources are reserved for node components and system components in OpenShift Container Platform by configuring the system-reserved node setting.
OpenShift Container Platform does not use the kube-reserved setting. Documentation for Kubernetes and some cloud vendors that provide a Kubernetes environment might suggest configuring kube-reserved. That information does not apply to an OpenShift Container Platform cluster.
Use caution when you tune your cluster with resource limits and enforce those limits with evictions. Enforcing system-reserved limits can prevent critical system services from receiving CPU time, or can end critical system services when memory resources run low.
In most cases, tuning resource allocation is performed by making an adjustment and then monitoring the cluster performance with a production-like workload. That process is repeated until the cluster is stable and meets service-level agreements.
For more information on the effects of these settings, see Computing Allocated Resources.
Setting | Description
---|---
kube-reserved | This setting is not used with OpenShift Container Platform. Add the CPU and memory resources that you planned to reserve to system-reserved.
system-reserved | Resources that are reserved for the node components and system components. Default is none.
View the services that are controlled by system-reserved with a tool such as lscgroup by running the following commands:
# yum install libcgroup-tools
$ lscgroup memory:/system.slice
Reserve resources in the kubeletArguments section of the node configuration map by adding a set of <resource_type>=<resource_quantity> pairs. For example, cpu=500m,memory=1Gi reserves 500 millicores of CPU and one gigabyte of memory.
Example 22.1. Node-Allocatable Resources Settings
kubeletArguments:
  system-reserved:
    - "cpu=500m,memory=1Gi"
Add the system-reserved field if it does not exist.
Do not edit the node-config.yaml file directly.
To determine appropriate values for these settings, view the resource usage of a node by using the node summary API. For more information, see System Resources Reported by Node.
After you set system-reserved:

- Monitor the memory usage of a node for high-water marks:

  $ ps aux | grep <service-name>

  For example:

  $ ps aux | grep atomic-openshift-node
  USER  PID   %CPU %MEM    VSZ  RSS TTY   STAT START TIME COMMAND
  root  11089 11.5  0.3 112712  996 pts/1 R+   16:23 0:00 grep --color=auto atomic-openshift-node

  If this value is close to your system-reserved mark, you can increase the system-reserved value.

- Monitor the memory usage of system services with a tool such as cgget by running the following commands:

  # yum install libcgroup-tools
  $ cgget -g memory /system.slice | grep memory.usage_in_bytes

  If this value is close to your system-reserved mark, you can increase the system-reserved value.

- Use the OpenShift Container Platform cluster loader to measure performance metrics of your deployment at various cluster states.
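The memory.usage_in_bytes value that cgget reports is a raw byte count, while system-reserved is usually expressed in Mi or Gi, so a small conversion helps with the comparison. A minimal sketch, using an illustrative byte count rather than a reading from a live system:

```shell
# Convert a memory.usage_in_bytes reading to MiB so that it can be
# compared against the memory portion of system-reserved.
# The byte count is illustrative, not taken from a live system.
usage_bytes=1397895168
usage_mib=$(( usage_bytes / 1024 / 1024 ))
echo "system.slice usage: ${usage_mib}Mi"    # system.slice usage: 1333Mi
```

In this example, 1333Mi already exceeds a memory=1Gi reservation, which would be a signal to increase system-reserved.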
22.3. Computing Allocated Resources
An allocated amount of a resource is computed based on the following formula:
[Allocatable] = [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds]
The withholding of Hard-Eviction-Thresholds from allocatable improves system reliability because the value for allocatable is enforced for pods at the node level. The experimental-allocatable-ignore-eviction setting is available to preserve legacy behavior, but it will be deprecated in a future release.
If [Allocatable] is negative, it is set to 0.
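The arithmetic can be checked directly in the shell. The figures below are illustrative examples, not defaults; in particular, the hard-eviction threshold depends on your eviction configuration:

```shell
# [Allocatable] = [Node Capacity] - [system-reserved] - [Hard-Eviction-Thresholds]
# All values are in Ki; the figures are illustrative examples.
capacity_ki=8388608          # 8Gi of node memory capacity
system_reserved_ki=1048576   # memory=1Gi in system-reserved
hard_eviction_ki=102400      # an assumed 100Mi hard-eviction threshold
allocatable_ki=$(( capacity_ki - system_reserved_ki - hard_eviction_ki ))
# A negative result is clamped to 0, as described above.
if [ "$allocatable_ki" -lt 0 ]; then allocatable_ki=0; fi
echo "allocatable memory: ${allocatable_ki}Ki"    # allocatable memory: 7237632Ki
```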
22.4. Viewing Node-Allocatable Resources and Capacity
To view the current capacity and allocatable resources for a node, run the following command:
$ oc get node/<node_name> -o yaml
In the following partial output, the allocatable values are less than the capacity. The difference is expected and matches a cpu=500m,memory=1Gi resource allocation for system-reserved.
status:
...
  allocatable:
    cpu: "3500m"
    memory: 6857952Ki
    pods: "110"
  capacity:
    cpu: "4"
    memory: 8010948Ki
    pods: "110"
...
The scheduler uses the values for allocatable to decide if a node is a candidate for pod scheduling.
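The gap between capacity and allocatable can also be computed from a saved copy of the node output. A sketch using the sample values above; the awk one-liners are rough helpers for this specific layout, not general YAML parsers:

```shell
# node.yaml holds the sample status stanza shown above.
cat > node.yaml <<'EOF'
status:
  allocatable:
    cpu: "3500m"
    memory: 6857952Ki
    pods: "110"
  capacity:
    cpu: "4"
    memory: 8010948Ki
    pods: "110"
EOF
# Pull the memory figures (in Ki) out of each stanza.
alloc=$(awk '/allocatable:/{f=1} f && /memory:/{sub("Ki","",$2); print $2; exit}' node.yaml)
cap=$(awk '/capacity:/{f=1} f && /memory:/{sub("Ki","",$2); print $2; exit}' node.yaml)
# The difference covers system-reserved plus the hard-eviction threshold.
echo "reserved + eviction gap: $(( cap - alloc ))Ki"
```

Here the 1152996Ki gap is the memory=1Gi reservation (1048576Ki) plus the node's hard-eviction threshold.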
22.5. System Resources Reported by Node
Each node reports the system resources that are used by the container runtime and kubelet. To simplify configuring system-reserved, view the resource usage for the node by using the node summary API. The node summary is available at <master>/api/v1/nodes/<node>/proxy/stats/summary.
For instance, to access the resources from the cluster.node22 node, run the following command:
$ curl <certificate details> https://<master>/api/v1/nodes/cluster.node22/proxy/stats/summary
The response includes information that is similar to the following:
{
    "node": {
        "nodeName": "cluster.node22",
        "systemContainers": [
            {
                "cpu": {
                    "usageCoreNanoSeconds": 929684480915,
                    "usageNanoCores": 190998084
                },
                "memory": {
                    "rssBytes": 176726016,
                    "usageBytes": 1397895168,
                    "workingSetBytes": 1050509312
                },
                "name": "kubelet"
            },
            {
                "cpu": {
                    "usageCoreNanoSeconds": 128521955903,
                    "usageNanoCores": 5928600
                },
                "memory": {
                    "rssBytes": 35958784,
                    "usageBytes": 129671168,
                    "workingSetBytes": 102416384
                },
                "name": "runtime"
            }
        ]
    }
}
See REST API Overview for more information about the certificate details.
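One way to derive a starting value for the memory portion of system-reserved is to sum the workingSetBytes figures reported for the system containers. A minimal sketch against a trimmed copy of the response above; the grep pattern is a rough extraction for this layout, not a JSON parser:

```shell
# summary.json holds a trimmed copy of the node summary response.
cat > summary.json <<'EOF'
{ "node": { "systemContainers": [
  { "memory": { "workingSetBytes": 1050509312 }, "name": "kubelet" },
  { "memory": { "workingSetBytes": 102416384 }, "name": "runtime" }
] } }
EOF
# Sum every workingSetBytes field and report the total in MiB.
total=0
for v in $(grep -o '"workingSetBytes": [0-9]*' summary.json | awk '{print $2}'); do
  total=$(( total + v ))
done
echo "system containers working set: $(( total / 1024 / 1024 ))Mi"    # 1099Mi
```

A reservation with some headroom above this figure, such as memory=1536Mi, would be a reasonable starting point for further tuning.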
22.6. Node Enforcement
The node can limit the total amount of resources that pods consume based on the configured allocatable value. This feature significantly improves the reliability of the node by preventing pods from using CPU and memory resources that are needed by system services such as the container runtime and node agent. To improve node reliability, administrators should reserve resources based on a target for resource use.
The node enforces resource constraints using a new cgroup hierarchy that enforces quality of service. All pods are launched in a dedicated cgroup hierarchy that is separate from system daemons.
To configure node enforcement, use the following parameters in the appropriate node configuration map.
Example 22.2. Node Cgroup Settings
kubeletArguments:
  cgroups-per-qos:
    - "true" 1
  cgroup-driver:
    - "systemd" 2
  enforce-node-allocatable:
    - "pods" 3
1. Enable or disable a cgroup hierarchy for each quality of service. The cgroups are managed by the node, and any change of this setting requires a full drain of the node. This flag must be true to enable the node to enforce the node-allocatable resource constraints. The default value is true, and Red Hat does not recommend that customers change this value.
2. The cgroup driver that the node uses to manage the cgroup hierarchies. This value must match the driver that is associated with the container runtime. Valid values are systemd and cgroupfs, but Red Hat supports systemd only.
3. A comma-delimited list of scopes where the node should enforce node resource constraints. The default value is pods, and Red Hat supports pods only.
Administrators should treat system daemons similarly to pods that have a guaranteed quality of service. System daemons can burst within their bounding control groups, and this behavior must be managed as part of cluster deployments. Reserve CPU and memory resources for system daemons by specifying the resources in system-reserved as shown in section Configuring Nodes for Allocated Resources.
To view the cgroup driver that is set, run the following command:
$ systemctl status atomic-openshift-node -l | grep cgroup-driver=
The output includes a response that is similar to the following:
--cgroup-driver=systemd
For more information on managing and troubleshooting cgroup drivers, see Introduction to Control Groups (Cgroups).
22.7. Eviction Thresholds
Memory pressure on a node can impact the entire node and all pods running on it. If a system daemon uses more than its reserved amount of memory, an out-of-memory event can occur that impacts the entire node and all pods running on the node. To avoid or reduce the probability of system out-of-memory events, the node provides out-of-resource handling.