Chapter 4. Working with nodes
4.1. Viewing and listing the nodes in your OpenShift Container Platform cluster
You can list all the nodes in your cluster to obtain information such as status, age, memory usage, and details about the nodes.
When you perform node management operations, the CLI interacts with node objects that are representations of actual node hosts. The master uses the information from node objects to validate nodes with health checks.
4.1.1. About listing all the nodes in a cluster
You can get detailed information on the nodes in the cluster.
The following command lists all nodes:
$ oc get nodes

The -o wide option provides additional information on all nodes:
$ oc get nodes -o wide
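A trimmed, illustrative example of the output (node names, ages, and versions will differ in your cluster):

NAME       STATUS   ROLES    AGE   VERSION   INTERNAL-IP    OS-IMAGE
master-0   Ready    master   26h   v1.17.1   10.0.136.107   Red Hat Enterprise Linux CoreOS
worker-0   Ready    worker   25h   v1.17.1   10.0.140.16    Red Hat Enterprise Linux CoreOS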
The following command lists information about a single node:
$ oc get node <node>
The STATUS column in the output of these commands can show nodes with the following conditions:

Table 4.1. Node conditions

Condition | Description |
---|---|
Ready | The node reports its own readiness to the API server by returning True. |
NotReady | One of the underlying components, such as the container runtime or network, is experiencing issues or is not yet configured. |
SchedulingDisabled | Pods cannot be scheduled for placement on the node. |
The following command provides more detailed information about a specific node, including the reason for the current condition:
$ oc describe node <node>

For example:
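The following is a heavily truncated, illustrative sketch of the output (a real node produces far more detail); the numbered markers key to the callouts below:

Name:               node1.example.com                              # 1
Roles:              worker                                         # 2
Labels:             kubernetes.io/hostname=node1.example.com       # 3
Annotations:        machineconfiguration.openshift.io/state: Done  # 4
Taints:             <none>                                         # 5
Conditions:                                                        # 6
  Type             Status
  Ready            True
Addresses:                                                         # 7
  InternalIP:   10.0.140.16
  Hostname:     node1.example.com
Capacity:                                                          # 8
  cpu:     4
  memory:  16417036Ki
Allocatable:
  cpu:     3500m
  memory:  15265932Ki
System Info:                                                       # 9
  Operating System:  linux
Non-terminated Pods:  (one line per pod on the node)               # 10
Events:               (recent events reported by the node)         # 11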
1. The name of the node.
2. The role of the node, either master or worker.
3. The labels applied to the node.
4. The annotations applied to the node.
5. The taints applied to the node.
6. Node conditions.
7. The IP address and host name of the node.
8. The pod resources and allocatable resources.
9. Information about the node host.
10. The pods on the node.
11. The events reported by the node.
4.1.2. Listing pods on a node in your cluster
You can list all the pods on a specific node.
Procedure
To list all or selected pods on one or more nodes:
$ oc describe node <node1> <node2>
For example:

$ oc describe node ip-10-0-128-218.ec2.internal
To list all or selected pods on selected nodes:

$ oc describe node --selector=<node_selector>
$ oc describe node -l=<pod_selector>
For example:

$ oc describe node --selector=beta.kubernetes.io/os
$ oc describe node -l node-role.kubernetes.io/worker
4.1.3. Viewing memory and CPU usage statistics on your nodes
You can display usage statistics about nodes, which provide the runtime environments for containers. These usage statistics include CPU, memory, and storage consumption.
Prerequisites
- You must have cluster-reader permission to view the usage statistics.
- Metrics must be installed to view the usage statistics.
Procedure
To view the usage statistics:

$ oc adm top nodes

To view the usage statistics for nodes with labels:
$ oc adm top node --selector=''
You must choose the selector (label query) to filter on. Supports =, ==, and !=.
4.2. Working with nodes
As an administrator, you can perform a number of tasks to make your clusters more efficient.
4.2.1. Understanding how to evacuate pods on nodes
Evacuating pods allows you to migrate all or selected pods from a given node or nodes.
You can only evacuate pods backed by a replication controller. The replication controller creates new pods on other nodes and removes the existing pods from the specified nodes.
Bare pods, meaning those not backed by a replication controller, are unaffected by default. You can evacuate a subset of pods by specifying a pod-selector. Pod selectors are based on labels, so all the pods with the specified label will be evacuated.
Nodes must first be marked unschedulable to perform pod evacuation.
Use oc adm uncordon to mark the node as schedulable when done:

$ oc adm uncordon <node1>
The following command evacuates all or selected pods on one or more nodes:
$ oc adm drain <node1> <node2> [--pod-selector=<pod_selector>]
The following command forces deletion of bare pods using the --force option. When set to true, deletion continues even if there are pods not managed by a replication controller, replica set, job, daemon set, or stateful set:

$ oc adm drain <node1> <node2> --force=true
The following command uses the --grace-period option to set a period of time in seconds for each pod to terminate gracefully. If negative, the default value specified in the pod will be used:

$ oc adm drain <node1> <node2> --grace-period=-1
The following command ignores pods managed by a daemon set by using the --ignore-daemonsets flag set to true:

$ oc adm drain <node1> <node2> --ignore-daemonsets=true
The following command sets the length of time to wait before giving up by using the --timeout flag. A value of 0 sets an infinite length of time:

$ oc adm drain <node1> <node2> --timeout=5s
The following command deletes pods even if there are pods using emptyDir by using the --delete-local-data flag set to true. Local data is deleted when the node is drained:

$ oc adm drain <node1> <node2> --delete-local-data=true
The following command lists objects that will be migrated without actually performing the evacuation, using the --dry-run option set to true:

$ oc adm drain <node1> <node2> --dry-run=true
Instead of specifying specific node names (for example, <node1> <node2>), you can use the --selector=<node_selector> option to evacuate pods on selected nodes.
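For example, a typical maintenance drain might combine several of these flags; this is an illustrative invocation, not a command from the original document:

$ oc adm drain <node1> --ignore-daemonsets=true --delete-local-data=true --force=true --grace-period=30

When maintenance is finished, mark the node as schedulable again with oc adm uncordon <node1>.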
4.2.2. Understanding how to update labels on nodes
You can update any label on a node.
Node labels are not persisted after a node is deleted even if the node is backed up by a Machine.
Any change to a MachineSet object is not applied to existing machines owned by the machine set. For example, labels edited or added to an existing MachineSet object are not propagated to existing machines and nodes associated with the machine set.
The following command adds or updates labels on a node:
$ oc label node <node> <key_1>=<value_1> ... <key_n>=<value_n>
For example:

$ oc label nodes webconsole-7f7f6 unhealthy=true
The following command updates all pods in the namespace:

$ oc label pods --all <key_1>=<value_1>
For example:

$ oc label pods --all status=unhealthy
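To remove a label, append a dash to the key. This is standard oc label syntax, shown here as an illustrative addition:

$ oc label node <node> unhealthy-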
4.2.3. Understanding how to mark nodes as unschedulable or schedulable
By default, healthy nodes with a Ready status are marked as schedulable, meaning that new pods are allowed for placement on the node. Manually marking a node as unschedulable blocks any new pods from being scheduled on the node. Existing pods on the node are not affected.
The following command marks a node or nodes as unschedulable:
$ oc adm cordon <node>
For example:

$ oc adm cordon node1.example.com
node/node1.example.com cordoned

NAME                LABELS                                      STATUS
node1.example.com   kubernetes.io/hostname=node1.example.com   Ready,SchedulingDisabled
The following command marks a currently unschedulable node or nodes as schedulable:

$ oc adm uncordon <node1>
Alternatively, instead of specifying specific node names (for example, <node>), you can use the --selector=<node_selector> option to mark selected nodes as schedulable or unschedulable.
4.2.4. Configuring master nodes as schedulable
As of OpenShift Container Platform 4.2, you can configure master nodes to be schedulable, meaning that new pods are allowed for placement on the master nodes. By default, master nodes are not schedulable. However, if your cluster does not contain any worker nodes, then master nodes are marked schedulable by default.
In version 4.4, the ability to create a cluster that does not have worker nodes is available only for clusters that are deployed on bare metal, as a technology preview. For all other cluster types, you can set the masters to be schedulable, but you must retain worker nodes.
You can allow or disallow master nodes to be schedulable by configuring the mastersSchedulable field.
Procedure
Edit the schedulers.config.openshift.io resource:

$ oc edit schedulers.config.openshift.io cluster
Configure the mastersSchedulable field.
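A sketch of the relevant portion of the resource (other spec fields omitted); the numbered marker keys to the callout below:

apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  mastersSchedulable: false   # 1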
1. Set to true to allow master nodes to be schedulable, or false to disallow master nodes to be schedulable.
- Save the file to apply the changes.
4.2.5. Deleting nodes
4.2.5.1. Deleting nodes from a cluster
When you delete a node using the CLI, the node object is deleted in Kubernetes, but the pods that exist on the node are not deleted. Any bare pods not backed by a replication controller become inaccessible to OpenShift Container Platform. Pods backed by replication controllers are rescheduled to other available nodes. You must delete local manifest pods.
Procedure
To delete a node from the OpenShift Container Platform cluster, edit the appropriate MachineSet object:

If you are running a cluster on bare metal, you cannot delete a node by editing MachineSet objects. Machine sets are only available when a cluster is integrated with a cloud provider. Instead, you must unschedule and drain the node before manually deleting it.
View the machine sets that are in the cluster:
$ oc get machinesets -n openshift-machine-api
The machine sets are listed in the form of <clusterid>-worker-<aws-region-az>.
Scale the machine set:
$ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api
For more information on scaling your cluster using a machine set, see Manually scaling a machine set.
4.2.5.2. Deleting nodes from a bare metal cluster
When you delete a node using the CLI, the node object is deleted in Kubernetes, but the pods that exist on the node are not deleted. Any bare pods not backed by a replication controller become inaccessible to OpenShift Container Platform. Pods backed by replication controllers are rescheduled to other available nodes. You must delete local manifest pods.
Procedure
Delete a node from an OpenShift Container Platform cluster running on bare metal by completing the following steps:
Mark the node as unschedulable:
$ oc adm cordon <node_name>
Drain all pods on your node:

$ oc adm drain <node_name> --force=true
Delete your node from the cluster:

$ oc delete node <node_name>
Although the node object is now deleted from the cluster, it can still rejoin the cluster after reboot or if the kubelet service is restarted. To permanently delete the node and all its data, you must decommission the node.
4.2.6. Adding kernel arguments to Nodes
In some special cases, you might want to add kernel arguments to a set of nodes in your cluster. This should only be done with caution and clear understanding of the implications of the arguments you set.
Improper use of kernel arguments can result in your systems becoming unbootable.
Examples of kernel arguments you could set include:
- selinux=0: Disables Security Enhanced Linux (SELinux). While not recommended for production, disabling SELinux can improve performance by 2-3%.
- nosmt: Disables symmetric multithreading (SMT) in the kernel. Multithreading allows multiple logical threads for each CPU. You could consider nosmt in multi-tenant environments to reduce risks from potential cross-thread attacks. By disabling SMT, you essentially choose security over performance.
See Kernel.org kernel parameters for a list and descriptions of kernel arguments.
In the following procedure, you create a MachineConfig object that identifies:
- A set of machines to which you want to add the kernel argument. In this case, machines with a worker role.
- Kernel arguments that are appended to the end of the existing kernel arguments.
- A label that indicates where in the list of machine configs the change is applied.
Prerequisites
- Have administrative privilege to a working OpenShift Container Platform cluster.
Procedure
List existing MachineConfig objects for your OpenShift Container Platform cluster to determine how to label your machine config:

$ oc get MachineConfig

Create a MachineConfig object file that identifies the kernel argument (for example, 05-worker-kernelarg-selinuxoff.yaml):
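A sketch of such a file, based on the selinux=0 example above (the Ignition config version shown is an assumption and varies by release):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker   # applies to the worker pool
  name: 05-worker-kernelarg-selinuxoff               # position in the machine config list
spec:
  config:
    ignition:
      version: 2.2.0
  kernelArguments:
    - selinux=0                                      # appended to the existing kernel arguments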
Create the new machine config:
$ oc create -f 05-worker-kernelarg-selinuxoff.yaml
Check the machine configs to see that the new one was added:

$ oc get MachineConfig

Check the nodes:

$ oc get nodes

You can see that scheduling on each worker node is disabled as the change is being applied.
Check that the kernel argument worked by going to one of the worker nodes and listing the kernel command line arguments (in /proc/cmdline on the host). You should see the selinux=0 argument added to the other kernel arguments.
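One way to do this (a sketch; oc debug opens a debug pod on the node with the host filesystem mounted at /host):

$ oc debug node/<node_name>
sh# cat /host/proc/cmdline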
4.2.7. Additional resources
For more information on scaling your cluster using a machine set, see Manually scaling a machine set.
4.3. Managing nodes
OpenShift Container Platform uses a KubeletConfig custom resource (CR) to manage the configuration of nodes. By creating an instance of a KubeletConfig object, a managed machine config is created to override settings on the node.
Logging in to remote machines for the purpose of changing their configuration is not supported.
4.3.1. Modifying Nodes
To make configuration changes to a cluster, or machine pool, you must create a custom resource definition (CRD), or KubeletConfig object. OpenShift Container Platform uses the Machine Config Controller to watch for changes introduced through the CRD and applies the changes to the cluster.
Procedure
Obtain the label associated with the static CRD, Machine Config Pool, for the type of node you want to configure. Perform one of the following steps:
Check current labels of the desired machine config pool.
For example:
$ oc get machineconfigpool --show-labels

NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   LABELS
master   rendered-master-e05b81f5ca4db1d249a1bf32f9ec24fd   True      False      False      operator.machineconfiguration.openshift.io/required-for-upgrade=
worker   rendered-worker-f50e78e1bc06d8e82327763145bfcf62   True      False      False
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Add a custom label to the desired machine config pool.
For example:
$ oc label machineconfigpool worker custom-kubelet=enabled
Create a kubeletconfig custom resource (CR) for your configuration change. For example:

Sample configuration for a custom-config CR
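The following is a sketch of what such a CR can look like (the kubeletConfig field values are illustrative assumptions, not values from the original document):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-config
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: enabled    # the label added to the machine config pool above
  kubeletConfig:                 # kubelet settings to override on matching nodes
    podsPerCore: 10
    maxPods: 250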
Create the CR object:
$ oc create -f <file-name>
For example:

$ oc create -f master-kube-config.yaml
Most KubeletConfig options can be set by the user. The following options are not allowed to be overwritten:
- CgroupDriver
- ClusterDNS
- ClusterDomain
- RuntimeRequestTimeout
- StaticPodPath
4.4. Managing the maximum number of pods per node
In OpenShift Container Platform, you can configure the number of pods that can run on a node based on the number of processor cores on the node, a hard limit, or both. If you use both options, the lower of the two limits the number of pods on a node.
Exceeding these values can result in:
- Increased CPU utilization by OpenShift Container Platform.
- Slow pod scheduling.
- Potential out-of-memory scenarios, depending on the amount of memory in the node.
- Exhausting the IP address pool.
- Resource overcommitting, leading to poor user application performance.
A pod that is holding a single container actually uses two containers. The second container sets up networking prior to the actual container starting. As a result, a node running 10 pods actually has 20 containers running.
The podsPerCore parameter limits the number of pods the node can run based on the number of processor cores on the node. For example, if podsPerCore is set to 10 on a node with 4 processor cores, the maximum number of pods allowed on the node is 40.
The maxPods parameter limits the number of pods the node can run to a fixed value, regardless of the properties of the node.
4.4.1. Configuring the maximum number of pods per Node
Two parameters control the maximum number of pods that can be scheduled to a node: podsPerCore and maxPods. If you use both options, the lower of the two limits the number of pods on a node.

For example, if podsPerCore is set to 10 on a node with 4 processor cores, the maximum number of pods allowed on the node will be 40.
Prerequisite
Obtain the label associated with the static Machine Config Pool CRD for the type of node you want to configure. Perform one of the following steps:
View the machine config pool:

$ oc describe machineconfigpool <name>

For example:
If a label has been added, it appears under labels in the output.
If the label is not present, add a key/value pair:

$ oc label machineconfigpool worker custom-kubelet=small-pods
Procedure
Create a custom resource (CR) for your configuration change.
Sample configuration for a max-pods CR
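A sketch of such a CR (values are illustrative; the custom-kubelet: small-pods label matches the pool label added in the prerequisite step):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods   # pool label from the prerequisite step
  kubeletConfig:
    podsPerCore: 10                # per-core limit
    maxPods: 250                   # fixed upper limit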
Note: Setting podsPerCore to 0 disables this limit.

In the above example, the default value for podsPerCore is 10 and the default value for maxPods is 250. This means that unless the node has 25 cores or more, by default, podsPerCore will be the limiting factor.

List the machine config pool CRDs to see if the change is applied. The UPDATING column reports True if the change is picked up by the Machine Config Controller:

$ oc get machineconfigpools

NAME     CONFIG                        UPDATED   UPDATING   DEGRADED
master   master-9cc2c72f205e103bb534   False     False      False
worker   worker-8cecd1236b33ee3f8a5e   False     True       False
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Once the change is complete, the
UPDATED
column reportsTrue
.oc get machineconfigpools
$ oc get machineconfigpools NAME CONFIG UPDATED UPDATING DEGRADED master master-9cc2c72f205e103bb534 False True False worker worker-8cecd1236b33ee3f8a5e True False False
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
4.5. Using the Node Tuning Operator
Learn about the Node Tuning Operator and how you can use it to manage node-level tuning by orchestrating the tuned daemon.
The Node Tuning Operator helps you manage node-level tuning by orchestrating the Tuned daemon. The majority of high-performance applications require some level of kernel tuning. The Node Tuning Operator provides a unified management interface to users of node-level sysctls and more flexibility to add custom tuning specified by user needs. The Operator manages the containerized Tuned daemon for OpenShift Container Platform as a Kubernetes daemon set. It ensures the custom tuning specification is passed to all containerized Tuned daemons running in the cluster in the format that the daemons understand. The daemons run on all nodes in the cluster, one per node.
Node-level settings applied by the containerized Tuned daemon are rolled back on an event that triggers a profile change or when the containerized Tuned daemon is terminated gracefully by receiving and handling a termination signal.
The Node Tuning Operator is part of a standard OpenShift Container Platform installation in version 4.1 and later.
4.5.1. Accessing an example Node Tuning Operator specification
Use this process to access an example Node Tuning Operator specification.
Procedure
Run:

$ oc get Tuned/default -o yaml -n openshift-cluster-node-tuning-operator
The default CR is meant for delivering standard node-level tuning for the OpenShift Container Platform and it can only be modified to set the Operator Management state. Any other custom changes to the default CR will be overwritten by the Operator. For custom tuning, create your own Tuned CRs. Newly created CRs will be combined with the default CR and custom tuning applied to OpenShift Container Platform nodes based on node or pod labels and profile priorities.
While in certain situations the support for pod labels can be a convenient way of automatically delivering required tuning, this practice is strongly discouraged, especially in large-scale clusters. The default Tuned CR ships without pod label matching. If a custom profile is created with pod label matching, then the functionality will be enabled at that time. The pod label functionality might be deprecated in future versions of the Node Tuning Operator.
4.5.2. Custom tuning specification
The custom resource (CR) for the Operator has two major sections. The first section, profile:, is a list of Tuned profiles and their names. The second, recommend:, defines the profile selection logic.
Multiple custom tuning specifications can co-exist as multiple CRs in the Operator’s namespace. The existence of new CRs or the deletion of old CRs is detected by the Operator. All existing custom tuning specifications are merged and appropriate objects for the containerized Tuned daemons are updated.
Profile data

The profile: section lists Tuned profiles and their names.
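A sketch of the shape of this section (the profile name and contents are illustrative placeholders):

profile:
- name: tuned_profile_1
  data: |
    # Tuned profile specification
    [main]
    summary=Description of tuned_profile_1 profile

    [sysctl]
    net.ipv4.ip_forward=1
    # ... other sysctl's or other Tuned daemon plug-ins supported by the containerized Tuned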
Recommended profiles

The profile: selection logic is defined by the recommend: section of the CR. The recommend: section is a list of items to recommend the profiles based on a selection criteria.
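A sketch of the shape of this section (angle-bracketed items are placeholders):

recommend:
- match:                           # optional; if omitted, profile match is assumed
  <match>                          # an optional <match> array, defined below
  priority: <priority>             # profile ordering priority; lower numbers mean higher priority
  profile: <tuned_profile_name>    # name of the Tuned profile to apply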
If <match> is omitted, a profile match (for example, true) is assumed.
<match> is an optional array recursively defined as follows:
- label: <label_name> # node or Pod label name
value: <label_value> # optional node or Pod label value; if omitted, the presence of <label_name> is enough to match
type: <label_type> # optional node or Pod type (`node` or `pod`); if omitted, `node` is assumed
<match> # an optional <match> array
If <match> is not omitted, all nested <match> sections must also evaluate to true. Otherwise, false is assumed and the profile with the respective <match> section will not be applied or recommended. Therefore, the nesting (child <match> sections) works as a logical AND operator. Conversely, if any item of the <match> array matches, the entire <match> array evaluates to true. Therefore, the array acts as a logical OR operator.
Example
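The original example CR is not preserved on this page; the following is a reconstruction of the kind of recommend: list that the discussion below describes, with profile names and priorities taken from that discussion:

recommend:
- match:
  - label: tuned.openshift.io/elasticsearch
    match:
    - label: node-role.kubernetes.io/master
    - label: node-role.kubernetes.io/infra
    type: pod
  priority: 10
  profile: openshift-control-plane-es
- match:
  - label: node-role.kubernetes.io/master
  - label: node-role.kubernetes.io/infra
  priority: 20
  profile: openshift-control-plane
- priority: 30
  profile: openshift-node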
The CR above is translated for the containerized Tuned daemon into its recommend.conf file based on the profile priorities. The profile with the highest priority (10) is openshift-control-plane-es and, therefore, it is considered first. The containerized Tuned daemon running on a given node looks to see if there is a pod running on the same node with the tuned.openshift.io/elasticsearch label set. If not, the entire <match> section evaluates as false. If there is such a pod with the label, in order for the <match> section to evaluate to true, the node label also needs to be node-role.kubernetes.io/master or node-role.kubernetes.io/infra.
If the labels for the profile with priority 10 matched, the openshift-control-plane-es profile is applied and no other profile is considered. If the node/pod label combination did not match, the second highest priority profile (openshift-control-plane) is considered. This profile is applied if the containerized Tuned pod runs on a node with the labels node-role.kubernetes.io/master or node-role.kubernetes.io/infra.
Finally, the profile openshift-node has the lowest priority of 30. It lacks the <match> section and, therefore, will always match. It acts as a profile catch-all to set the openshift-node profile, if no other profile with higher priority matches on a given node.
4.5.3. Default profiles set on a cluster
The following are the default profiles set on a cluster.
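The default Tuned CR resembles the following abbreviated sketch; profile contents and priorities vary by release and are assumptions here, so view the actual CR with oc get Tuned/default -o yaml -n openshift-cluster-node-tuning-operator:

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: default
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: openshift                 # parent profile with common tuning
    data: |
      [main]
      summary=Optimize systems running OpenShift (parent profile)
  - name: openshift-control-plane   # includes openshift, adds control-plane tuning
    data: |
      [main]
      summary=Optimize systems running OpenShift control plane
      include=openshift
  - name: openshift-node            # includes openshift, adds worker tuning
    data: |
      [main]
      summary=Optimize systems running OpenShift nodes
      include=openshift
  recommend:
  - profile: openshift-control-plane
    priority: 30
    match:
    - label: node-role.kubernetes.io/master
    - label: node-role.kubernetes.io/infra
  - profile: openshift-node
    priority: 40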
4.5.4. Supported Tuned daemon plug-ins
Excluding the [main] section, the following Tuned plug-ins are supported when using custom profiles defined in the profile: section of the Tuned CR:
- audio
- cpu
- disk
- eeepc_she
- modules
- mounts
- net
- scheduler
- scsi_host
- selinux
- sysctl
- sysfs
- usb
- video
- vm
Some of these plug-ins provide dynamic tuning functionality that is not supported. The following Tuned plug-ins are currently not supported:
- bootloader
- script
- systemd
See Available Tuned Plug-ins and Getting Started with Tuned for more information.
4.6. Understanding node rebooting
To reboot a node without causing an outage for applications running on the platform, it is important to first evacuate the pods. For pods that are made highly available by the routing tier, nothing else needs to be done. For other pods needing storage, typically databases, it is critical to ensure that they can remain in operation with one pod temporarily going offline. While implementing resiliency for stateful pods is different for each application, in all cases it is important to configure the scheduler to use node anti-affinity to ensure that the pods are properly spread across available nodes.
Another challenge is how to handle nodes that are running critical infrastructure such as the router or the registry. The same node evacuation process applies, though it is important to understand certain edge cases.
4.6.1. Understanding infrastructure node rebooting
Infrastructure nodes are nodes that are labeled to run pieces of the OpenShift Container Platform environment. Currently, the easiest way to manage node reboots is to ensure that there are at least three nodes available to run infrastructure. The nodes to run the infrastructure are called master nodes.
The scenario below demonstrates a common mistake that can lead to service interruptions for the applications running on OpenShift Container Platform when only two nodes are available.
- Node A is marked unschedulable and all pods are evacuated.
- The registry pod running on that node is now redeployed on node B. This means node B is now running both registry pods.
- Node B is now marked unschedulable and is evacuated.
- The service exposing the two pod endpoints on node B, for a brief period of time, loses all endpoints until they are redeployed to node A.
The same process using three master nodes for infrastructure does not result in a service disruption. However, due to pod scheduling, the last node that is evacuated and brought back in to rotation is left running zero registries. The other two nodes will run two and one registries respectively. The best solution is to rely on pod anti-affinity.
4.6.2. Rebooting a node using pod anti-affinity
Pod anti-affinity is slightly different than node anti-affinity. Node anti-affinity can be violated if there are no other suitable locations to deploy a pod. Pod anti-affinity can be set to either required or preferred.
With this in place, if only two infrastructure nodes are available and one is rebooted, the container image registry pod is prevented from running on the other node. oc get pods reports the pod as unready until a suitable node is available. Once a node is available and all pods are back in ready state, the next node can be restarted.
Procedure
To reboot a node using pod anti-affinity:
Edit the node specification to configure pod anti-affinity:
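A sketch of the relevant pod specification (the pod name and label values are illustrative); the numbered markers key to the callouts below:

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-antiaffinity
spec:
  affinity:
    podAntiAffinity:                                     # 1
      preferredDuringSchedulingIgnoredDuringExecution:   # 2
      - weight: 100                                      # 3
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: registry                              # 4
              operator: In                               # 5
              values:
              - default
          topologyKey: kubernetes.io/hostname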
1. Stanza to configure pod anti-affinity.
2. Defines a preferred rule.
3. Specifies a weight for a preferred rule. The node with the highest weight is preferred.
4. Description of the pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.
5. The operator represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.
This example assumes the container image registry pod has a label of registry=default. Pod anti-affinity can use any Kubernetes match expression.

- Enable the MatchInterPodAffinity scheduler predicate in the scheduling policy file.
4.6.3. Understanding how to reboot nodes running routers
In most cases, a pod running an OpenShift Container Platform router exposes a host port.
The PodFitsPorts scheduler predicate ensures that no router pods using the same port can run on the same node, and pod anti-affinity is achieved. If the routers are relying on IP failover for high availability, there is nothing else that is needed.
For router pods relying on an external service such as AWS Elastic Load Balancing for high availability, it is that service’s responsibility to react to router pod restarts.
In rare cases, a router pod may not have a host port configured. In those cases, it is important to follow the recommended restart process for infrastructure nodes.
4.7. Freeing node resources using garbage collection
As an administrator, you can use OpenShift Container Platform to ensure that your nodes are running efficiently by freeing up resources through garbage collection.
The OpenShift Container Platform node performs two types of garbage collection:
- Container garbage collection: Removes terminated containers.
- Image garbage collection: Removes images not referenced by any running pods.
4.7.1. Understanding how terminated containers are removed through garbage collection
Container garbage collection can be performed using eviction thresholds.
When eviction thresholds are set for garbage collection, the node tries to keep any container for any pod accessible from the API. If the pod has been deleted, the containers will be as well. Containers are preserved as long as the pod is not deleted and the eviction threshold is not reached. If the node is under disk pressure, it will remove containers and their logs will no longer be accessible using oc logs.
- eviction-soft - A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period.
- eviction-hard - A hard eviction threshold has no grace period, and if observed, OpenShift Container Platform takes immediate action.
If a node is oscillating above and below a soft eviction threshold, but not exceeding its associated grace period, the corresponding node condition would constantly oscillate between true and false. As a consequence, the scheduler could make poor scheduling decisions.
To protect against this oscillation, use the eviction-pressure-transition-period flag to control how long OpenShift Container Platform must wait before transitioning out of a pressure condition. OpenShift Container Platform will not set an eviction threshold as being met for the specified pressure condition for the period specified before toggling the condition back to false.
4.7.2. Understanding how images are removed through garbage collection
Image garbage collection relies on disk usage as reported by cAdvisor on the node to decide which images to remove from the node.
The policy for image garbage collection is based on two conditions:
- The percent of disk usage (expressed as an integer) which triggers image garbage collection. The default is 85.
- The percent of disk usage (expressed as an integer) to which image garbage collection attempts to free. Default is 80.
For image garbage collection, you can modify any of the following variables using a custom resource.

Setting | Description |
---|---|
imageMinimumGCAge | The minimum age for an unused image before the image is removed by garbage collection. The default is 2m. |
imageGCHighThresholdPercent | The percent of disk usage, expressed as an integer, which triggers image garbage collection. The default is 85. |
imageGCLowThresholdPercent | The percent of disk usage, expressed as an integer, to which image garbage collection attempts to free. The default is 80. |
Two lists of images are retrieved in each garbage collector run:
- A list of images currently running in at least one pod.
- A list of images available on a host.
As new containers are run, new images appear. All images are marked with a time stamp. If the image is running (the first list above) or is newly detected (the second list above), it is marked with the current time. The remaining images are already marked from the previous spins. All images are then sorted by the time stamp.
Once the collection starts, the oldest images get deleted first until the stopping criterion is met.
4.7.3. Configuring garbage collection for containers and images
As an administrator, you can configure how OpenShift Container Platform performs garbage collection by creating a kubeletConfig object for each Machine Config Pool.

OpenShift Container Platform supports only one kubeletConfig object for each Machine Config Pool.
You can configure any combination of the following:
- soft eviction for containers
- hard eviction for containers
- eviction for images
For soft container eviction you can also configure a grace period before eviction.
Prerequisites
Obtain the label associated with the static MachineConfigPool CRD for the type of node you want to configure. Perform one of the following steps:

View the Machine Config Pool:

$ oc describe machineconfigpool <name>

For example:
If a label has been added, it appears under labels in the output.
If the label is not present, add a key/value pair:

$ oc label machineconfigpool worker custom-kubelet=small-pods
Procedure
Create a custom resource (CR) for your configuration change.
Sample configuration for a container garbage collection CR:
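A sketch of such a CR (threshold values are illustrative assumptions); the numbered markers key to the callouts below:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-kubeconfig                   # 1
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods            # 2
  kubeletConfig:
    evictionSoft:                           # 3
      memory.available: "500Mi"             # 4
      nodefs.available: "10%"
      nodefs.inodesFree: "5%"
      imagefs.available: "15%"
      imagefs.inodesFree: "10%"
    evictionSoftGracePeriod:                # 5
      memory.available: "1m30s"
      nodefs.available: "1m30s"
      nodefs.inodesFree: "1m30s"
      imagefs.available: "1m30s"
      imagefs.inodesFree: "1m30s"
    evictionHard:
      memory.available: "200Mi"
      nodefs.available: "5%"
      nodefs.inodesFree: "4%"
      imagefs.available: "10%"
      imagefs.inodesFree: "5%"
    evictionPressureTransitionPeriod: 0s    # 6
    imageMinimumGCAge: 5m                   # 7
    imageGCHighThresholdPercent: 80         # 8
    imageGCLowThresholdPercent: 75          # 9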
1. Name for the object.
2. Selector label.
3. Type of eviction: EvictionSoft and EvictionHard.
4. Eviction thresholds based on a specific eviction trigger signal.
5. Grace periods for the soft eviction. This parameter does not apply to eviction-hard.
6. The duration to wait before transitioning out of an eviction pressure condition.
7. The minimum age for an unused image before the image is removed by garbage collection.
8. The percent of disk usage (expressed as an integer) which triggers image garbage collection.
9. The percent of disk usage (expressed as an integer) to which image garbage collection attempts to free.
Create the object:

$ oc create -f <file-name>.yaml
Copy to Clipboard Copied! Toggle word wrap Toggle overflow For example:
oc create -f gc-container.yaml kubeletconfig.machineconfiguration.openshift.io/gc-container created
oc create -f gc-container.yaml kubeletconfig.machineconfiguration.openshift.io/gc-container created
Verify that garbage collection is active. The Machine Config Pool you specified in the custom resource appears with UPDATING as true until the change is fully implemented:

$ oc get machineconfigpool

NAME     CONFIG                                   UPDATED   UPDATING
master   rendered-master-546383f80705bd5aeaba93   True      False
worker   rendered-worker-b4c51bb33ccaae6fc4a6a5   False     True
4.8. Allocating resources for nodes in an OpenShift Container Platform cluster
To provide more reliable scheduling and minimize node resource overcommitment, each node can reserve a portion of its resources for use by all underlying node components (such as kubelet, kube-proxy) and the remaining system components (such as sshd, NetworkManager) on the host. Once specified, the scheduler has more information about the resources (e.g., memory, CPU) a node has allocated for pods.
4.8.1. Understanding how to allocate resources for nodes
CPU and memory resources reserved for node components in OpenShift Container Platform are based on two node settings:
Setting | Description |
---|---|
kube-reserved | Resources reserved for node components. Default is none. |
system-reserved | Resources reserved for the remaining system components. Default settings depend on the OpenShift Container Platform and Machine Config Operator versions. Confirm the default systemReserved parameter in the machine-config-operator repository. |
If a flag is not set, the defaults are used. If none of the flags are set, the allocated resource is set to the node’s capacity as it was before the introduction of allocatable resources.
4.8.1.1. How OpenShift Container Platform computes allocated resources
An allocated amount of a resource is computed based on the following formula:
[Allocatable] = [Node Capacity] - [kube-reserved] - [system-reserved] - [Hard-Eviction-Thresholds]
The withholding of Hard-Eviction-Thresholds from allocatable is a change in behavior to improve system reliability now that allocatable is enforced for end-user pods at the node level. The experimental-allocatable-ignore-eviction setting is available to preserve legacy behavior, but it will be deprecated in a future release.
If [Allocatable] is negative, it is set to 0.
Each node reports system resources utilized by the container runtime and kubelet. To better aid your ability to configure --system-reserved and --kube-reserved, you can introspect the corresponding node's resource usage using the node summary API, which is accessible at /api/v1/nodes/<node>/proxy/stats/summary.
4.8.1.2. How nodes enforce resource constraints
The node is able to limit the total amount of resources that pods may consume based on the configured allocatable value. This feature significantly improves the reliability of the node by preventing pods from starving system services (for example: container runtime, node agent, etc.) for resources. It is strongly encouraged that administrators reserve resources based on the desired node utilization target in order to improve node reliability.
The node enforces resource constraints using a new cgroup hierarchy that enforces quality of service. All pods are launched in a dedicated cgroup hierarchy separate from system daemons.
Optionally, the node can be made to enforce kube-reserved and system-reserved by specifying those tokens in the enforce-node-allocatable flag. If specified, the corresponding --kube-reserved-cgroup or --system-reserved-cgroup needs to be provided. In future releases, the node and container runtime will be packaged in a common cgroup separate from system.slice. Until that time, we do not recommend users change the default value of the enforce-node-allocatable flag.
Administrators should treat system daemons similar to Guaranteed pods. System daemons can burst within their bounding control groups and this behavior needs to be managed as part of cluster deployments. Enforcing system-reserved limits can lead to critical system services being CPU starved or OOM killed on the node. The recommendation is to enforce system-reserved only if operators have profiled their nodes exhaustively to determine precise estimates and are confident in their ability to recover if any process in that group is OOM killed.
As a result, we strongly recommend that users only enforce node allocatable for pods by default, and set aside appropriate reservations for system daemons to maintain overall node reliability.
4.8.1.3. Understanding Eviction Thresholds
If a node is under memory pressure, it can impact the entire node and all pods running on it. If a system daemon is using more than its reserved amount of memory, an OOM event may occur that can impact the entire node and all pods running on it. To avoid (or reduce the probability of) system OOMs the node provides out-of-resource handling.
You can reserve some memory using the --eviction-hard flag. The node attempts to evict pods whenever memory availability on the node drops below the absolute value or percentage. If system daemons do not exist on a node, pods are limited to the memory capacity - eviction-hard. For this reason, resources set aside as a buffer for eviction before reaching out of memory conditions are not available for pods.
The following is an example to illustrate the impact of node allocatable for memory:
- Node capacity is 32Gi
- --kube-reserved is 2Gi
- --system-reserved is 1Gi
- --eviction-hard is set to 100Mi
For this node, the effective node allocatable value is 28.9Gi (32Gi - 2Gi - 1Gi - 100Mi). If the node and system components use up all their reservation, the memory available for pods is 28.9Gi, and kubelet will evict pods when overall usage exceeds this value.
If you enforce node allocatable (28.9Gi) via top-level cgroups, then pods can never exceed 28.9Gi. Evictions would not be performed unless system daemons are consuming more than 3.1Gi of memory.
If system daemons do not use up all their reservation, with the above example, pods would face memcg OOM kills from their bounding cgroup before node evictions kick in. To better enforce QoS under this situation, the node applies the hard eviction thresholds to the top-level cgroup for all pods to be Node Allocatable + Eviction Hard Thresholds.
If system daemons do not use up all their reservation, the node will evict pods whenever they consume more than 28.9Gi of memory. If eviction does not occur in time, a pod will be OOM killed if pods consume 29Gi of memory.
4.8.1.4. How the scheduler determines resource availability
The scheduler uses the value of node.Status.Allocatable instead of node.Status.Capacity to decide if a node will become a candidate for pod scheduling.
By default, the node will report its machine capacity as fully schedulable by the cluster.
4.8.2. Configuring allocated resources for nodes
OpenShift Container Platform supports the CPU and memory resource types for allocation. If your administrator enabled the ephemeral storage technology preview, the ephemeral-storage resource type is supported as well. For the cpu type, the resource quantity is specified in units of cores, such as 200m, 0.5, or 1. For memory and ephemeral-storage, it is specified in units of bytes, such as 200Ki, 50Mi, or 5Gi.
As an administrator, you can set these using a custom resource (CR) through a set of <resource_type>=<resource_quantity> pairs (e.g., cpu=200m,memory=512Mi).
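A sketch of such a CR (reservation values are illustrative assumptions; applying it mirrors the other KubeletConfig examples in this chapter):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-allocatable
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: small-pods    # pool label from the prerequisite step
  kubeletConfig:
    systemReserved:                 # reserved for system components such as sshd, NetworkManager
      cpu: 500m
      memory: 512Mi
    kubeReserved:                   # reserved for node components such as kubelet, kube-proxy
      cpu: 500m
      memory: 512Mi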
Prerequisites
To help you determine settings for --system-reserved and --kube-reserved, you can introspect the corresponding node's resource usage using the node summary API, which is accessible at /api/v1/nodes/<node>/proxy/stats/summary. Enter the following command for your node:

$ oc get --raw /api/v1/nodes/<node>/proxy/stats/summary
For example, to access the resources from the cluster.node22 node, you can enter:

$ oc get --raw /api/v1/nodes/cluster.node22/proxy/stats/summary

Obtain the label associated with the static MachineConfigPool CRD for the type of node you want to configure. Perform one of the following steps:

View the Machine Config Pool:

$ oc describe machineconfigpool <name>

For example:
If a label has been added, it appears under labels in the output.
If the label is not present, add a key/value pair:

$ oc label machineconfigpool worker custom-kubelet=small-pods
4.9. Machine Config Daemon metrics
The Machine Config Daemon is a part of the Machine Config Operator. It runs on every node in the cluster. The Machine Config Daemon manages configuration changes and updates on each of the nodes.
4.9.1. Machine Config Daemon metrics
Beginning with OpenShift Container Platform 4.3, the Machine Config Daemon provides a set of metrics. These metrics can be accessed using the Prometheus Cluster Monitoring stack.
The following table describes this set of metrics.
Metrics marked with * in the Name and Description columns represent serious errors that might cause performance problems. Such problems might prevent updates and upgrades from proceeding.

While some entries contain commands for getting specific logs, the most comprehensive set of logs is available using the oc adm must-gather command.
Name | Format | Description | Notes |
---|---|---|---|
mcd_host_os_and_version | []string{"os", "version"} | Shows the OS that MCD is running on, such as RHCOS or RHEL. In case of RHCOS, the version is provided. | |
ssh_accessed | counter | Shows the number of successful SSH authentications into the node. | The non-zero value shows that someone might have made manual changes to the node. Such changes might cause irreconcilable errors due to the differences between the state on the disk and the state defined in the machine configuration. |
mcd_drain* | {"drain_time", "err"} | Logs errors received during failed drain. * | While drains might need multiple tries to succeed, terminal failed drains prevent updates from proceeding. For further investigation, see the logs of the machine-config-daemon container on the affected node. |
mcd_pivot_err* | []string{"pivot_target", "err"} | Logs errors encountered during pivot. * | Pivot errors might prevent OS upgrades from proceeding. For further investigation, access the node and review all of its logs, or review only the logs from the machine-config-daemon container. |
mcd_state | []string{"state", "reason"} | State of Machine Config Daemon for the indicated node. Possible states are "Done", "Working", and "Degraded". In case of "Degraded", the reason is included. | For further investigation, see the logs of the machine-config-daemon container on the affected node. |
mcd_kubelet_state* | []string{"err"} | Logs kubelet health failures. * | This is expected to be empty, with a failure count of 0. If the failure count exceeds 2, the error indicating threshold is exceeded. This indicates a possible issue with the health of the kubelet. For further investigation, access the node and review all of its logs. |
mcd_reboot_err* | []string{"message", "err"} | Logs the failed reboots and the corresponding errors. * | This is expected to be empty, which indicates a successful reboot. For further investigation, see the logs of the machine-config-daemon container on the affected node. |
mcd_update_state | []string{"config", "err"} | Logs success or failure of configuration updates and the corresponding errors. | For further investigation, see the logs of the machine-config-daemon container on the affected node. |
Additional resources