Chapter 10. Node maintenance
10.1. About node maintenance
10.1.1. Understanding node maintenance mode
Nodes can be placed into maintenance mode using the oc adm utility, or using NodeMaintenance custom resources (CRs).
Placing a node into maintenance marks the node as unschedulable and drains all the virtual machines and pods from it. Virtual machine instances that have a LiveMigrate eviction strategy are live migrated to another node without loss of service. This eviction strategy is configured by default in virtual machines created from common templates but must be configured manually for custom virtual machines.
Virtual machine instances without an eviction strategy are shut down. Virtual machines with a RunStrategy of Running or RerunOnFailure are recreated on another node. Virtual machines with a RunStrategy of Manual are not automatically restarted.
Virtual machines must have a persistent volume claim (PVC) with a shared ReadWriteMany (RWX) access mode to be live migrated.
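The rules above can be summarized as a small decision function. This is only a sketch of the behavior as described in this section, not KubeVirt code: the `VM` class, its field names, and the outcome strings are hypothetical, and the case of a LiveMigrate VM without an RWX PVC is not specified by the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VM:
    """Hypothetical, simplified view of a virtual machine for this sketch."""
    name: str
    eviction_strategy: Optional[str]  # "LiveMigrate" or None
    run_strategy: str                 # "Running", "RerunOnFailure", "Manual", ...
    has_rwx_pvc: bool                 # True if the PVC uses ReadWriteMany


def maintenance_outcome(vm: VM) -> str:
    """Return the outcome this section describes for a VM on a drained node."""
    if vm.eviction_strategy == "LiveMigrate":
        if vm.has_rwx_pvc:
            return "live migrated to another node"
        # The text only states that RWX is required; it does not say what
        # happens otherwise, so this outcome is a placeholder.
        return "cannot be live migrated (no RWX PVC)"
    # No eviction strategy: the instance is shut down, then the
    # RunStrategy decides whether it comes back on another node.
    if vm.run_strategy in ("Running", "RerunOnFailure"):
        return "shut down and recreated on another node"
    if vm.run_strategy == "Manual":
        return "shut down and not restarted automatically"
    return "shut down; restart behavior not covered in this section"


if __name__ == "__main__":
    for vm in (
        VM("templated-vm", "LiveMigrate", "Running", True),
        VM("custom-vm", None, "RerunOnFailure", False),
        VM("manual-vm", None, "Manual", False),
    ):
        print(f"{vm.name}: {maintenance_outcome(vm)}")
```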
When installed as part of OpenShift Virtualization, Node Maintenance Operator watches for new or deleted NodeMaintenance CRs. When a new NodeMaintenance CR is detected, no new workloads are scheduled and the node is cordoned off from the rest of the cluster. All pods that can be evicted are evicted from the node. When a NodeMaintenance CR is deleted, the node that is referenced in the CR is made available for new workloads.
Using a NodeMaintenance CR for node maintenance tasks achieves the same results as the oc adm cordon and oc adm drain commands, using standard OpenShift Container Platform custom resource processing.
10.1.2. Maintaining bare metal nodes
When you deploy OpenShift Container Platform on bare metal infrastructure, there are additional considerations compared to deploying on cloud infrastructure. Unlike in cloud environments, where cluster nodes are considered ephemeral, re-provisioning a bare metal node requires significantly more time and effort.
When a bare metal node fails, for example because of a fatal kernel error or a NIC hardware failure, workloads on the failed node must be restarted elsewhere on the cluster while the problem node is repaired or replaced. Node maintenance mode allows cluster administrators to gracefully power down nodes, moving workloads to other parts of the cluster and ensuring that workloads are not interrupted. Detailed progress and node status details are provided during maintenance.
10.2. Setting a node to maintenance mode
Place a node into maintenance from the web console, CLI, or using a NodeMaintenance custom resource.
10.2.1. Setting a node to maintenance mode in the web console
Set a node to maintenance mode using the Options menu found on each node in the Compute → Nodes list, or from the Node Details screen.
Procedure
- In the OpenShift Virtualization console, click Compute → Nodes. You can set the node to maintenance from this screen, which makes it easier to perform actions on multiple nodes on one screen, or from the Node Details screen, where you can view comprehensive details of the selected node:
  - Click the Options menu at the end of the node and select Start Maintenance.
  - Click the node name to open the Node Details screen and click Actions → Start Maintenance.
- Click Start Maintenance in the confirmation window.
The node live migrates virtual machine instances that have the LiveMigrate eviction strategy, and the node is no longer schedulable. All other pods and virtual machines on the node are deleted and recreated on another node.
10.2.2. Setting a node to maintenance mode in the CLI
Set a node to maintenance mode by marking it as unschedulable and using the oc adm drain command to evict or delete pods from the node.
Procedure
Mark the node as unschedulable. The node status changes to Ready,SchedulingDisabled.
$ oc adm cordon <node1>
Drain the node in preparation for maintenance. The node live migrates virtual machine instances that have the LiveMigratable condition set to True and the spec:evictionStrategy field set to LiveMigrate. All other pods and virtual machines on the node are deleted and recreated on another node.
$ oc adm drain <node1> --delete-emptydir-data --ignore-daemonsets=true --force
- The --delete-emptydir-data flag removes any virtual machine instances on the node that use emptyDir volumes. Data in these volumes is ephemeral and is safe to be deleted after termination.
- The --ignore-daemonsets=true flag ensures that daemon sets are ignored and pod eviction can continue successfully.
- The --force flag is required to delete pods that are not managed by a replica set or daemon set controller.
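If you script this procedure, the two commands can be wrapped so that draining only starts after cordoning succeeds. The following is a minimal sketch, assuming oc is on the PATH and you are already logged in. The function names and the dry-run default are choices made for this example, and only the flags shown in this procedure are used.

```python
import shlex
import subprocess
from typing import List


def maintenance_commands(node: str) -> List[List[str]]:
    """Build the cordon and drain commands from this procedure for one node."""
    return [
        ["oc", "adm", "cordon", node],
        [
            "oc", "adm", "drain", node,
            "--delete-emptydir-data",
            "--ignore-daemonsets=true",
            "--force",
        ],
    ]


def run_maintenance(node: str, execute: bool = False) -> None:
    """Print each command; run it only when execute=True."""
    for argv in maintenance_commands(node):
        print("$ " + shlex.join(argv))
        if execute:
            # check=True raises CalledProcessError, so a failed cordon
            # prevents the drain from starting.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run: only prints the commands for a hypothetical node name.
    run_maintenance("node-1.example.com")
```

Calling run_maintenance with execute=True runs the commands against the cluster. The default only prints them so that the sketch is safe to try.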
10.2.3. Setting a node to maintenance mode with a NodeMaintenance custom resource
You can put a node into maintenance mode with a NodeMaintenance custom resource (CR). When you apply a NodeMaintenance CR, all allowed pods are evicted and the node is made unschedulable. Evicted pods are queued to be moved to another node in the cluster.
Prerequisites
- Install the OpenShift Container Platform CLI oc.
- Log in to the cluster as a user with cluster-admin privileges.
Procedure
Create the following node maintenance CR, and save the file as nodemaintenance-cr.yaml:
apiVersion: nodemaintenance.kubevirt.io/v1beta1
kind: NodeMaintenance
metadata:
  name: maintenance-example
spec:
  nodeName: node-1.example.com
  reason: "Node maintenance"
Apply the node maintenance schedule by running the following command:
$ oc apply -f nodemaintenance-cr.yaml
Check the progress of the maintenance task by running the following command, replacing <node-name> with the name of your node:
$ oc describe node <node-name>
Example output
Events:
  Type    Reason              Age   From     Message
  ----    ------              ----  ----     -------
  Normal  NodeNotSchedulable  61m   kubelet  Node node-1.example.com status is now: NodeNotSchedulable
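When several nodes need maintenance, it can be convenient to generate the CR instead of editing the file by hand. The sketch below builds the same manifest as the example in this procedure and prints it as JSON, which oc apply accepts as well as YAML. The function name is hypothetical, and only the fields shown in the example CR are set.

```python
import json


def node_maintenance_manifest(name: str, node_name: str, reason: str) -> dict:
    """Return a NodeMaintenance CR with the fields used in this procedure."""
    return {
        "apiVersion": "nodemaintenance.kubevirt.io/v1beta1",
        "kind": "NodeMaintenance",
        "metadata": {"name": name},
        "spec": {"nodeName": node_name, "reason": reason},
    }


if __name__ == "__main__":
    manifest = node_maintenance_manifest(
        name="maintenance-example",
        node_name="node-1.example.com",
        reason="Node maintenance",
    )
    # Print JSON to stdout so it can be saved to a file or piped to oc.
    print(json.dumps(manifest, indent=2))
```

Saved under a name of your choice, for example make_cr.py, the output can be piped directly to the cluster with `python make_cr.py | oc apply -f -`.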
10.2.3.1. Checking status of current NodeMaintenance CR tasks
You can check the status of current NodeMaintenance CR tasks.
Prerequisites
- Install the OpenShift Container Platform CLI oc.
- Log in as a user with cluster-admin privileges.
Procedure
Check the status of current node maintenance tasks by running the following command:
$ oc get NodeMaintenance -o yaml
Example output
apiVersion: v1
items:
- apiVersion: nodemaintenance.kubevirt.io/v1beta1
  kind: NodeMaintenance
  metadata:
    ...
  spec:
    nodeName: node-1.example.com
    reason: Node maintenance
  status:
    evictionPods: 3
    pendingPods:
    - pod-example-workload-0
    - httpd
    - httpd-manual
    phase: Running
    lastError: "Last failure message"
    totalpods: 5
...
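To follow progress across several nodes, you can reduce this output to one line per CR. The following sketch assumes the same list is retrieved as JSON (`oc get NodeMaintenance -o json`) and relies only on the field names visible in the example output above. The summarize function and the SAMPLE data are illustrative.

```python
from typing import List

# Trimmed copy of the example output above, as it would appear in JSON form.
SAMPLE = {
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "nodemaintenance.kubevirt.io/v1beta1",
            "kind": "NodeMaintenance",
            "spec": {"nodeName": "node-1.example.com", "reason": "Node maintenance"},
            "status": {
                "evictionPods": 3,
                "pendingPods": ["pod-example-workload-0", "httpd", "httpd-manual"],
                "phase": "Running",
                "lastError": "Last failure message",
                "totalpods": 5,
            },
        }
    ],
}


def summarize(node_maintenance_list: dict) -> List[str]:
    """Return one summary line per NodeMaintenance CR in the list."""
    lines = []
    for item in node_maintenance_list.get("items", []):
        spec = item.get("spec", {})
        status = item.get("status", {})
        pending = status.get("pendingPods") or []
        line = (
            f"{spec.get('nodeName', '?')}: "
            f"phase={status.get('phase', 'unknown')}, "
            f"pending pods={len(pending)}"
        )
        if status.get("lastError"):
            line += f", last error: {status['lastError']}"
        lines.append(line)
    return lines


if __name__ == "__main__":
    for line in summarize(SAMPLE):
        print(line)
```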
10.3. Resuming a node from maintenance mode
Resuming a node brings it out of maintenance mode and makes it schedulable again.
Resume a node from maintenance mode from the web console, CLI, or by deleting the NodeMaintenance custom resource.
10.3.1. Resuming a node from maintenance mode in the web console
Resume a node from maintenance mode using the Options menu found on each node in the Compute → Nodes list, or from the Node Details screen.
Procedure
- In the OpenShift Virtualization console, click Compute → Nodes. You can resume the node from this screen, which makes it easier to perform actions on multiple nodes on one screen, or from the Node Details screen, where you can view comprehensive details of the selected node:
  - Click the Options menu at the end of the node and select Stop Maintenance.
  - Click the node name to open the Node Details screen and click Actions → Stop Maintenance.
- Click Stop Maintenance in the confirmation window.
The node becomes schedulable, but virtual machine instances that were running on the node prior to maintenance will not automatically migrate back to this node.
10.3.2. Resuming a node from maintenance mode in the CLI
Resume a node from maintenance mode by making it schedulable again.
Procedure
Mark the node as schedulable. You can then resume scheduling new workloads on the node.
$ oc adm uncordon <node1>
10.3.3. Resuming a node from maintenance mode that was initiated with a NodeMaintenance CR
You can resume a node by deleting the NodeMaintenance CR.
Prerequisites
- Install the OpenShift Container Platform CLI oc.
- Log in to the cluster as a user with cluster-admin privileges.
Procedure
When your node maintenance task is complete, delete the active NodeMaintenance CR:
$ oc delete -f nodemaintenance-cr.yaml
Example output
nodemaintenance.nodemaintenance.kubevirt.io "maintenance-example" deleted
10.4. Automatic renewal of TLS certificates
All TLS certificates for OpenShift Virtualization components are renewed and rotated automatically. You are not required to refresh them manually.
10.4.1. TLS certificates automatic renewal schedules
TLS certificates are automatically deleted and replaced according to the following schedule:
- KubeVirt certificates are renewed daily.
- Containerized Data Importer controller (CDI) certificates are renewed every 15 days.
- MAC pool certificates are renewed every year.
Automatic TLS certificate rotation does not disrupt any operations. For example, the following operations continue to function without any disruption:
- Migrations
- Image uploads
- VNC and console connections
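As a rough illustration of the schedule above, the following sketch computes when each certificate type would next be renewed after a given renewal time. It is arithmetic on the stated intervals only. The names are hypothetical, "every year" is assumed to mean 365 days, and the actual rotation logic in the components may differ.

```python
from datetime import datetime, timedelta

# Renewal intervals as listed in the schedule above.
RENEWAL_INTERVALS = {
    "KubeVirt": timedelta(days=1),
    "CDI": timedelta(days=15),
    "MAC pool": timedelta(days=365),  # assumption: one year taken as 365 days
}


def next_renewals(last_renewed: datetime) -> dict:
    """Return the next renewal time per component after last_renewed."""
    return {
        name: last_renewed + interval
        for name, interval in RENEWAL_INTERVALS.items()
    }


if __name__ == "__main__":
    for name, when in next_renewals(datetime(2024, 1, 1)).items():
        print(f"{name}: {when:%Y-%m-%d}")
```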
10.5. Managing node labeling for obsolete CPU models
You can schedule a virtual machine (VM) on a node where the CPU model and policy attribute of the VM are compatible with the CPU models and policy attributes that the node supports. By specifying a list of obsolete CPU models in a config map, you can exclude them from the list of labels created for CPU models.
10.5.1. Understanding node labeling for obsolete CPU models
To ensure that a node supports only valid CPU models for scheduled VMs, create a config map with a list of obsolete CPU models. When the node-labeller obtains the list of obsolete CPU models, it eliminates those CPU models and creates labels for valid CPU models.
If you do not configure a config map with a list of obsolete CPU models, all CPU models are evaluated for labels, including obsolete CPU models that are not present in your environment.
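Conceptually, the obsolete list acts as a filter over the CPU models that are evaluated for labels. The sketch below shows that filtering step only. The function name and the sample list of known models are hypothetical, and the real node-labeller is not implemented this way as far as this section states.

```python
from typing import Iterable, List


def valid_cpu_models(known_models: Iterable[str], obsolete_models: Iterable[str]) -> List[str]:
    """Drop obsolete models, keeping the original order of the rest."""
    obsolete = set(obsolete_models)
    return [model for model in known_models if model not in obsolete]


if __name__ == "__main__":
    # Hypothetical list of models that would otherwise be evaluated.
    known = ["486", "pentium", "pentium2", "Penryn", "Haswell"]
    obsolete = ["486", "pentium", "pentium2", "pentium3", "pentiumpro"]
    print(valid_cpu_models(known, obsolete))
```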
Through the process of iteration, the base CPU features in the minimum CPU model are eliminated from the list of labels generated for the node. For example, an environment might have two supported CPU models: Penryn and Haswell.
If Penryn is specified as the CPU model for minCPU, the node-labeller evaluates each base CPU feature for Penryn and compares it with each CPU feature supported by Haswell. If the CPU feature is supported by both Penryn and Haswell, the node-labeller eliminates that feature from the list of CPU features for creating labels. If a CPU feature is supported only by Haswell and not by Penryn, that CPU feature is included in the list of generated labels. The node-labeller follows this iterative process to eliminate base CPU features that are present in the minimum CPU model and create labels.
The following example shows the complete list of CPU features for Penryn, which is specified as the CPU model for minCPU:
Example of CPU features for Penryn
apic clflush cmov cx16 cx8 de fpu fxsr lahf_lm lm mca mce mmx msr mtrr nx pae pat pge pni pse pse36 sep sse sse2 sse4.1 ssse3 syscall tsc
The following example shows the complete list of CPU features for Haswell:
Example of CPU features for Haswell
aes apic avx avx2 bmi1 bmi2 clflush cmov cx16 cx8 de erms fma fpu fsgsbase fxsr hle invpcid lahf_lm lm mca mce mmx movbe msr mtrr nx pae pat pcid pclmuldq pge pni popcnt pse pse36 rdtscp rtm sep smep sse sse2 sse4.1 sse4.2 ssse3 syscall tsc tsc-deadline x2apic xsave
The following example shows the list of node labels generated by the node-labeller after iterating and comparing the CPU features for Penryn with the CPU features for Haswell:
Example of node labels after iteration
aes avx avx2 bmi1 bmi2 erms fma fsgsbase hle invpcid movbe pcid pclmuldq popcnt rdtscp rtm sse4.2 tsc-deadline x2apic xsave
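The comparison described above amounts to removing the minCPU features from the features of the other model. The following sketch applies a plain difference to the two feature lists from the examples. The function name is hypothetical and this is not the node-labeller's code. On these inputs, the result contains every label in the example list and additionally smep, which the Haswell list includes and the Penryn list does not, so the real node-labeller may apply rules beyond this simple difference.

```python
from typing import Iterable, List

PENRYN = (
    "apic clflush cmov cx16 cx8 de fpu fxsr lahf_lm lm mca mce mmx msr mtrr "
    "nx pae pat pge pni pse pse36 sep sse sse2 sse4.1 ssse3 syscall tsc"
).split()

HASWELL = (
    "aes apic avx avx2 bmi1 bmi2 clflush cmov cx16 cx8 de erms fma fpu "
    "fsgsbase fxsr hle invpcid lahf_lm lm mca mce mmx movbe msr mtrr nx pae "
    "pat pcid pclmuldq pge pni popcnt pse pse36 rdtscp rtm sep smep sse sse2 "
    "sse4.1 sse4.2 ssse3 syscall tsc tsc-deadline x2apic xsave"
).split()


def feature_labels(min_cpu_features: Iterable[str], model_features: Iterable[str]) -> List[str]:
    """Keep the features of a model that are not base features of minCPU."""
    base = set(min_cpu_features)
    return [feature for feature in model_features if feature not in base]


if __name__ == "__main__":
    print(" ".join(feature_labels(PENRYN, HASWELL)))
```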
10.5.2. Configuring a config map for obsolete CPU models
Use this procedure to configure a config map for obsolete CPU models.
Procedure
Create a ConfigMap object, specifying the obsolete CPU models in the obsoleteCPUs array. For example:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cpu-plugin-configmap
data:
  cpu-plugin-configmap:
    obsoleteCPUs:
      - "486"
      - "pentium"
      - "pentium2"
      - "pentium3"
      - "pentiumpro"
    minCPU: "Penryn"