2.4. Recommended node host practices
The OpenShift Container Platform node configuration file contains important options. For example, two parameters control the maximum number of pods that can be scheduled to a node: podsPerCore and maxPods.
When both options are in use, the lower of the two values limits the number of pods on a node. Exceeding these values can result in:
- Increased CPU utilization.
- Slow pod scheduling.
- Potential out-of-memory scenarios, depending on the amount of memory in the node.
- Exhausting the pool of IP addresses.
- Resource overcommitting, leading to poor user application performance.
In Kubernetes, a pod that holds a single container actually uses two containers. The second container sets up networking before the actual container starts. Therefore, a system running 10 pods actually has 20 containers running.
podsPerCore sets the number of pods the node can run based on the number of processor cores on the node. For example, if podsPerCore is set to 10 on a node with 4 processor cores, the maximum number of pods allowed on the node is 40.
kubeletConfig:
  podsPerCore: 10
Setting podsPerCore to 0 disables this limit. The default is 0. podsPerCore cannot exceed maxPods.
maxPods sets the number of pods the node can run to a fixed value, regardless of the properties of the node.
kubeletConfig:
  maxPods: 250
The kubelet configuration is currently serialized as an Ignition configuration, so it can be directly edited. However, there is also a new kubelet-config-controller added to the Machine Config Controller (MCC). This allows you to create a KubeletConfig custom resource (CR) to edit the kubelet parameters.
Procedure
Run:
$ oc get machineconfig
This provides a list of the available machine configuration objects you can select. By default, the two kubelet-related configs are 01-master-kubelet and 01-worker-kubelet.
To check the current value of max pods per node, run:
# oc describe node <node-ip> | grep Allocatable -A6
Look for value: pods: <value>.
For example:
# oc describe node ip-172-31-128-158.us-east-2.compute.internal | grep Allocatable -A6
Example output
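The output resembles the following; the resource quantities are illustrative and vary by node size:

Allocatable:
 attachable-volumes-aws-ebs:  25
 cpu:                         3500m
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      15341844Ki
 pods:                        250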
To set the max pods per node on the worker nodes, create a custom resource file that contains the kubelet configuration. For example, change-maxPods-cr.yaml, sketched after this paragraph.
The rate at which the kubelet talks to the API server depends on queries per second (QPS) and burst values. The default values, 50 for kubeAPIQPS and 100 for kubeAPIBurst, are sufficient if there are limited pods running on each node. Updating the kubelet QPS and burst rates is recommended if there are enough CPU and memory resources on the node.
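A minimal sketch of such a CR, assuming the set-max-pods name and the custom-kubelet: large-pods label that the following steps reference; the maxPods, kubeAPIQPS, and kubeAPIBurst values are illustrative:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: large-pods
  kubeletConfig:
    maxPods: 500
    kubeAPIQPS: 50
    kubeAPIBurst: 100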
Run:
$ oc label machineconfigpool worker custom-kubelet=large-pods
Run:
$ oc create -f change-maxPods-cr.yaml
Run:
$ oc get kubeletconfig
This should return set-max-pods.
Depending on the number of worker nodes in the cluster, wait for the worker nodes to be rebooted one by one. For a cluster with 3 worker nodes, this could take about 10 to 15 minutes.
Check for maxPods changing for the worker nodes:
$ oc describe node
Verify the change by running:
$ oc get kubeletconfigs set-max-pods -o yaml
This should show a status of True and type: Success.
By default, only one machine is allowed to be unavailable when applying the kubelet-related configuration to the available worker nodes. For a large cluster, it can take a long time for the configuration change to be reflected. At any time, you can adjust the number of machines that are updating to speed up the process.
Procedure
Run:
$ oc edit machineconfigpool worker
Set maxUnavailable to the desired value:
spec:
  maxUnavailable: <node_count>
Important: When setting the value, consider the number of worker nodes that can be unavailable without affecting the applications running on the cluster.
2.4.2. Control plane node sizing
The control plane node resource requirements depend on the number of nodes in the cluster. The following control plane node size recommendations are based on the results of control plane density focused testing. The control plane tests create the following objects across the cluster in each of the namespaces depending on the node counts:
- 12 image streams
- 3 build configurations
- 6 builds
- 1 deployment with 2 pod replicas mounting two secrets each
- 2 deployments with 1 pod replica mounting two secrets
- 3 services pointing to the previous deployments
- 3 routes pointing to the previous deployments
- 10 secrets, 2 of which are mounted by the previous deployments
- 10 config maps, 2 of which are mounted by the previous deployments
| Number of worker nodes | Cluster load (namespaces) | CPU cores | Memory (GB) |
| --- | --- | --- | --- |
| 25 | 500 | 4 | 16 |
| 100 | 1000 | 8 | 32 |
| 250 | 4000 | 16 | 96 |
On a cluster with three masters or control plane nodes, the CPU and memory usage will spike up when one of the nodes is stopped, rebooted, or fails, because the remaining two nodes must handle the load in order to be highly available. This is also expected during upgrades because the masters are cordoned, drained, and rebooted serially to apply the operating system updates and the control plane Operator updates. To avoid cascading failures on large and dense clusters, keep the overall resource usage on the master nodes to at most half of all available capacity to handle the resource usage spikes. Increase the CPU and memory on the master nodes accordingly.
The node sizing varies depending on the number of nodes and object counts in the cluster. It also depends on whether the objects are actively being created on the cluster. During object creation, the control plane is more active in terms of resource usage compared to when the objects are in the running phase.
If you used an installer-provisioned infrastructure installation method, you cannot modify the control plane node size in a running OpenShift Container Platform 4.5 cluster. Instead, you must estimate your total node count and use the suggested control plane node size during installation.
The recommendations are based on the data points captured on OpenShift Container Platform clusters with OpenShiftSDN as the network plug-in.
In OpenShift Container Platform 4.5, half of a CPU core (500 millicore) is now reserved by the system by default compared to OpenShift Container Platform 3.11 and previous versions. The sizes are determined taking that into consideration.
2.4.3. Setting up CPU Manager
Procedure
Optional: Label a node:
# oc label node perf-node.example.com cpumanager=true
Edit the MachineConfigPool of the nodes where CPU Manager should be enabled. In this example, all workers have CPU Manager enabled:
# oc edit machineconfigpool worker
Add a label to the worker machine config pool:
metadata:
  creationTimestamp: 2020-xx-xxx
  generation: 3
  labels:
    custom-kubelet: cpumanager-enabled
Create a KubeletConfig, cpumanager-kubeletconfig.yaml, custom resource (CR). Refer to the label created in the previous step to have the correct nodes updated with the new kubelet config. See the machineConfigPoolSelector section:
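A minimal sketch of such a CR, reusing the custom-kubelet: cpumanager-enabled label from the previous step; the CR name is illustrative, and the callout numbers correspond to the notes that follow:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static        # 1
    cpuManagerReconcilePeriod: 5s   # 2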
1. Specify a policy:
- none. This policy explicitly enables the existing default CPU affinity scheme, providing no affinity beyond what the scheduler does automatically.
- static. This policy allows pods with certain resource characteristics to be granted increased CPU affinity and exclusivity on the node.
2. Optional. Specify the CPU Manager reconcile frequency. The default is 5s.
Create the dynamic kubelet config:
# oc create -f cpumanager-kubeletconfig.yaml
This adds the CPU Manager feature to the kubelet config and, if needed, the Machine Config Operator (MCO) reboots the node. To enable CPU Manager, a reboot is not needed.
Check for the merged kubelet config:
# oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
Example output
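The output should show the merged kubelet machine config owned by the KubeletConfig CR, along the lines of the following; the uid is a placeholder:

"ownerReferences": [
    {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "KubeletConfig",
        "name": "cpumanager-enabled",
        "uid": "<uid>"
    }
]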
Check the worker for the updated kubelet.conf:
# oc debug node/perf-node.example.com
sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager
Example output
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 5s
Create a pod that requests a core or multiple cores. Both limits and requests must have their CPU value set to a whole integer. That is the number of cores that will be dedicated to this pod:
# cat cpumanager-pod.yaml
Example output
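A minimal sketch of such a pod spec, assuming the cpumanager=true node label from the first step; the pause image and the cpumanager- name prefix are illustrative choices:

apiVersion: v1
kind: Pod
metadata:
  generateName: cpumanager-
spec:
  containers:
  - name: cpumanager
    image: gcr.io/google_containers/pause-amd64:3.0
    resources:
      requests:
        cpu: 1
        memory: "1G"
      limits:
        cpu: 1
        memory: "1G"
  nodeSelector:
    cpumanager: "true"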
Create the pod:
# oc create -f cpumanager-pod.yaml
Verify that the pod is scheduled to the node that you labeled:
# oc describe pod cpumanager
Example output
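The output resembles the following, trimmed to the relevant fields; the pod name and node address are illustrative:

Name:            cpumanager-6cqz7
Namespace:       default
Node:            perf-node.example.com/xxx.xx.xx.xxx
...
QoS Class:       Guaranteed
Node-Selectors:  cpumanager=true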
Verify that the cgroups are set up correctly. Get the process ID (PID) of the pause process:
Pods of quality of service (QoS) tier Guaranteed are placed within the kubepods.slice. Pods of other QoS tiers end up in child cgroups of kubepods:
# cd /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice/crio-b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope
# for i in `ls cpuset.cpus tasks` ; do echo -n "$i "; cat $i ; done
Example output
cpuset.cpus 1
tasks 32706
Check the allowed CPU list for the task:
# grep ^Cpus_allowed_list /proc/32706/status
Example output
Cpus_allowed_list:    1
Verify that another pod (in this case, the pod in the burstable QoS tier) on the system cannot run on the core allocated for the Guaranteed pod:
# cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc494a073_6b77_11e9_98c0_06bba5c387ea.slice/crio-c56982f57b75a2420947f0afc6cafe7534c5734efc34157525fa9abbf99e3849.scope/cpuset.cpus
0
# oc describe node perf-node.example.com
Example output
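An illustrative excerpt of the relevant fields, consistent with the discussion that follows (the full output lists additional resources):

Capacity:
 cpu:     2
Allocatable:
 cpu:     1500m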
This VM has two CPU cores. The system-reserved setting reserves 500 millicores, meaning that half of one core is subtracted from the total capacity of the node to arrive at the Node Allocatable amount. You can see that Allocatable CPU is 1500 millicores. This means you can run one of the CPU Manager pods since each will take one whole core. A whole core is equivalent to 1000 millicores. If you try to schedule a second pod, the system will accept the pod, but it will never be scheduled:
NAME               READY   STATUS    RESTARTS   AGE
cpumanager-6cqz7   1/1     Running   0          33m
cpumanager-7qc2t   0/1     Pending   0          11s