2.4. Recommended node host practices
The OpenShift Container Platform node configuration file contains important options. For example, two parameters control the maximum number of pods that can be scheduled to a node: podsPerCore and maxPods. When both options are in use, the lower of the two values limits the number of pods on a node, as shown in the sketch after the following list. Exceeding these values can result in:
- Increased CPU utilization.
- Slow pod scheduling.
- Potential out-of-memory scenarios, depending on the amount of memory in the node.
- Exhausting the pool of IP addresses.
- Resource overcommitting, leading to poor user application performance.
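For example, a minimal kubeletConfig sketch with both options set (the values are illustrative): on a node with 4 processor cores, podsPerCore: 10 allows 40 pods, which is lower than maxPods: 250, so 40 becomes the effective limit.
kubeletConfig:
  podsPerCore: 10
  maxPods: 250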
In Kubernetes, a pod that holds a single container actually uses two containers. The second container sets up networking prior to the actual container starting. Therefore, a system running 10 pods actually has 20 containers running.
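As a rough way to observe this on a node (a minimal sketch, assuming CRI-O and an example node named perf-node.example.com), compare the pod sandboxes with the application containers:
# oc debug node/perf-node.example.com
sh-4.2# chroot /host
sh-4.2# crictl pods   # one pod sandbox (pause container) per pod
sh-4.2# crictl ps     # application containers only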
podsPerCore sets the number of pods the node can run based on the number of processor cores on the node. For example, if podsPerCore is set to 10 on a node with 4 processor cores, the maximum number of pods allowed on the node is 40.
kubeletConfig:
  podsPerCore: 10
Setting podsPerCore to 0 disables this limit. The default is 0. podsPerCore cannot exceed maxPods.
maxPods sets the number of pods the node can run to a fixed value, regardless of the properties of the node.
kubeletConfig:
  maxPods: 250
2.4.1. Creating a KubeletConfig CRD to edit kubelet parameters
The kubelet configuration is currently serialized as an Ignition configuration, so it can be directly edited. However, there is also a new kubelet-config-controller added to the Machine Config Controller (MCC). This allows you to create a KubeletConfig custom resource (CR) to edit the kubelet parameters.
Procedure
Run:
$ oc get machineconfig
This provides a list of the available machine configuration objects you can select. By default, the two kubelet-related configs are 01-master-kubelet and 01-worker-kubelet.

To check the current value of max pods per node, run:
# oc describe node <node-ip> | grep Allocatable -A6
Look for value: pods: <value>.

For example:
# oc describe node ip-172-31-128-158.us-east-2.compute.internal | grep Allocatable -A6
Example output
Allocatable:
 attachable-volumes-aws-ebs:  25
 cpu:                         3500m
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      15341844Ki
 pods:                        250
To set the max pods per node on the worker nodes, create a custom resource file that contains the kubelet configuration. For example, change-maxPods-cr.yaml:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: large-pods
  kubeletConfig:
    maxPods: 500
The rate at which the kubelet talks to the API server depends on queries per second (QPS) and burst values. The default values, 50 for kubeAPIQPS and 100 for kubeAPIBurst, are good enough if there are limited pods running on each node. Updating the kubelet QPS and burst rates is recommended if there are enough CPU and memory resources on the node:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: large-pods
  kubeletConfig:
    maxPods: <pod_count>
    kubeAPIBurst: <burst_rate>
    kubeAPIQPS: <QPS>
Label the worker machine config pool:
$ oc label machineconfigpool worker custom-kubelet=large-pods
Create the KubeletConfig object:
$ oc create -f change-maxPods-cr.yaml
Verify that the KubeletConfig object was created:
$ oc get kubeletconfig
This should return set-max-pods.

Depending on the number of worker nodes in the cluster, wait for the worker nodes to be rebooted one by one. For a cluster with 3 worker nodes, this could take about 10 to 15 minutes.
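To follow the rollout, a minimal sketch is to watch the worker machine config pool until its UPDATED column returns to True:

$ oc get machineconfigpool worker --watch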
Check that the maxPods value changed for the worker nodes:

$ oc describe node
Verify the change by running:
$ oc get kubeletconfigs set-max-pods -o yaml
This should show a status of True and type: Success.
By default, only one machine is allowed to be unavailable when applying the kubelet-related configuration to the available worker nodes. For a large cluster, it can take a long time for the configuration change to be reflected. At any time, you can adjust the number of machines that are updating to speed up the process.

Procedure
Edit the worker machine config pool:
$ oc edit machineconfigpool worker
Set maxUnavailable to the desired value:

spec:
  maxUnavailable: <node_count>
Important: When setting the value, consider the number of worker nodes that can be unavailable without affecting the applications running on the cluster.
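As an alternative to editing the pool interactively, a minimal sketch that patches the same field directly (the value 3 is only an example) is:

$ oc patch machineconfigpool worker --type merge -p '{"spec":{"maxUnavailable":3}}'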
2.4.2. Control plane node sizing
The control plane node resource requirements depend on the number of nodes in the cluster. The following control plane node size recommendations are based on the results of control plane density focused testing. The control plane tests create the following objects across the cluster in each of the namespaces depending on the node counts:
- 12 image streams
- 3 build configurations
- 6 builds
- 1 deployment with 2 pod replicas mounting two secrets each
- 2 deployments with 1 pod replica mounting two secrets
- 3 services pointing to the previous deployments
- 3 routes pointing to the previous deployments
- 10 secrets, 2 of which are mounted by the previous deployments
- 10 config maps, 2 of which are mounted by the previous deployments
| Number of worker nodes | Cluster load (namespaces) | CPU cores | Memory (GB) |
| --- | --- | --- | --- |
| 25 | 500 | 4 | 16 |
| 100 | 1000 | 8 | 32 |
| 250 | 4000 | 16 | 96 |
On a cluster with three control plane (master) nodes, CPU and memory usage spikes when one of the nodes is stopped, rebooted, or fails, because the remaining two nodes must handle the load in order to be highly available. This is also expected during upgrades because the masters are cordoned, drained, and rebooted serially to apply the operating system updates, as well as the control plane Operator updates. To avoid cascading failures on large and dense clusters, keep the overall resource usage on the master nodes to at most half of all available capacity so that they can handle the resource usage spikes. Increase the CPU and memory on the master nodes accordingly.
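As a quick check of the current control plane headroom, a minimal sketch (assuming cluster metrics are available; the label selector matches control plane nodes) is:

$ oc adm top nodes -l node-role.kubernetes.io/master=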
The node sizing varies depending on the number of nodes and object counts in the cluster. It also depends on whether the objects are actively being created on the cluster. During object creation, the control plane is more active in terms of resource usage compared to when the objects are in the running phase.
If you used an installer-provisioned infrastructure installation method, you cannot modify the control plane node size in a running OpenShift Container Platform 4.5 cluster. Instead, you must estimate your total node count and use the suggested control plane node size during installation.
The recommendations are based on the data points captured on OpenShift Container Platform clusters with OpenShiftSDN as the network plug-in.
In OpenShift Container Platform 4.5, half of a CPU core (500 millicore) is now reserved by the system by default compared to OpenShift Container Platform 3.11 and previous versions. The sizes are determined taking that into consideration.
2.4.3. Setting up CPU Manager
Procedure
Optional: Label a node:
# oc label node perf-node.example.com cpumanager=true
Edit the MachineConfigPool of the nodes where CPU Manager should be enabled. In this example, all workers have CPU Manager enabled:

# oc edit machineconfigpool worker
Add a label to the worker machine config pool:
metadata:
  creationTimestamp: 2020-xx-xxx
  generation: 3
  labels:
    custom-kubelet: cpumanager-enabled
Create a KubeletConfig, cpumanager-kubeletconfig.yaml, custom resource (CR). Refer to the label created in the previous step to have the correct nodes updated with the new kubelet config. See the machineConfigPoolSelector section:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static 1
    cpuManagerReconcilePeriod: 5s 2
- 1 Specify a policy:
  - none. This policy explicitly enables the existing default CPU affinity scheme, providing no affinity beyond what the scheduler does automatically.
  - static. This policy allows pods with certain resource characteristics to be granted increased CPU affinity and exclusivity on the node.
- 2 Optional. Specify the CPU Manager reconcile frequency. The default is 5s.
Create the dynamic kubelet config:
# oc create -f cpumanager-kubeletconfig.yaml
This adds the CPU Manager feature to the kubelet config and, if needed, the Machine Config Operator (MCO) reboots the node. To enable CPU Manager, a reboot is not needed.
Check for the merged kubelet config:
# oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
Example output
"ownerReferences": [ { "apiVersion": "machineconfiguration.openshift.io/v1", "kind": "KubeletConfig", "name": "cpumanager-enabled", "uid": "7ed5616d-6b72-11e9-aae1-021e1ce18878" } ]
Check the worker for the updated kubelet.conf:

# oc debug node/perf-node.example.com
sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager
Example output
cpuManagerPolicy: static 1
cpuManagerReconcilePeriod: 5s 2
Create a pod that requests a core or multiple cores. Both limits and requests must have their CPU value set to a whole integer. That is the number of cores that will be dedicated to this pod:
# cat cpumanager-pod.yaml
Example output
apiVersion: v1
kind: Pod
metadata:
  generateName: cpumanager-
spec:
  containers:
  - name: cpumanager
    image: gcr.io/google_containers/pause-amd64:3.0
    resources:
      requests:
        cpu: 1
        memory: "1G"
      limits:
        cpu: 1
        memory: "1G"
  nodeSelector:
    cpumanager: "true"
Create the pod:
# oc create -f cpumanager-pod.yaml
Verify that the pod is scheduled to the node that you labeled:
# oc describe pod cpumanager
Example output
Name:               cpumanager-6cqz7
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               perf-node.example.com/xxx.xx.xx.xxx
...
Limits:
  cpu:     1
  memory:  1G
Requests:
  cpu:     1
  memory:  1G
...
QoS Class:       Guaranteed
Node-Selectors:  cpumanager=true
Verify that the cgroups are set up correctly. Get the process ID (PID) of the pause process:

# ├─init.scope
│ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 17
└─kubepods.slice
  ├─kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice
  │ ├─crio-b5437308f1a574c542bdf08563b865c0345c8f8c0b0a655612c.scope
  │ └─32706 /pause
Pods of quality of service (QoS) tier Guaranteed are placed within the kubepods.slice. Pods of other QoS tiers end up in child cgroups of kubepods:

# cd /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice/crio-b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope
# for i in `ls cpuset.cpus tasks` ; do echo -n "$i "; cat $i ; done
Example output
cpuset.cpus 1
tasks 32706
Check the allowed CPU list for the task:
# grep ^Cpus_allowed_list /proc/32706/status
Example output
Cpus_allowed_list: 1
Verify that another pod (in this case, the pod in the burstable QoS tier) on the system cannot run on the core allocated for the Guaranteed pod:

# cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc494a073_6b77_11e9_98c0_06bba5c387ea.slice/crio-c56982f57b75a2420947f0afc6cafe7534c5734efc34157525fa9abbf99e3849.scope/cpuset.cpus
0

# oc describe node perf-node.example.com
Example output
...
Capacity:
 attachable-volumes-aws-ebs:  39
 cpu:                         2
 ephemeral-storage:           124768236Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      8162900Ki
 pods:                        250
Allocatable:
 attachable-volumes-aws-ebs:  39
 cpu:                         1500m
 ephemeral-storage:           124768236Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      7548500Ki
 pods:                        250
-------            ----                ------------  ----------  ---------------  -------------  ---
default            cpumanager-6cqz7    1 (66%)       1 (66%)     1G (12%)         1G (12%)       29m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource   Requests     Limits
  --------   --------     ------
  cpu        1440m (96%)  1 (66%)
This VM has two CPU cores. The system-reserved setting reserves 500 millicores, meaning that half of one core is subtracted from the total capacity of the node to arrive at the Node Allocatable amount. You can see that Allocatable CPU is 1500 millicores. This means you can run one of the CPU Manager pods since each will take one whole core. A whole core is equivalent to 1000 millicores. If you try to schedule a second pod, the system will accept the pod, but it will never be scheduled:

NAME               READY   STATUS    RESTARTS   AGE
cpumanager-6cqz7   1/1     Running   0          33m
cpumanager-7qc2t   0/1     Pending   0          11s
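To see why the second pod remains Pending, a minimal sketch is to inspect its events; you would expect a FailedScheduling event reporting insufficient CPU:

# oc describe pod cpumanager-7qc2t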