Chapter 7. Placing nodes in maintenance mode with Node Maintenance Operator

You can use the Node Maintenance Operator to place nodes in maintenance mode by using the oc adm utility or NodeMaintenance custom resources (CRs).

7.1. About the Node Maintenance Operator

The Node Maintenance Operator watches for new or deleted NodeMaintenance CRs. When a new NodeMaintenance CR is detected, no new workloads are scheduled and the node is cordoned off from the rest of the cluster. All pods that can be evicted are evicted from the node. When a NodeMaintenance CR is deleted, the node that is referenced in the CR is made available for new workloads.

Note

Using a NodeMaintenance CR for node maintenance tasks achieves the same results as the oc adm cordon and oc adm drain commands, but through standard Red Hat OpenShift CR processing.
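
As a rough illustration of that equivalence, the manual steps might look like the following sketch. The function names are hypothetical, and the drain flags are an assumption about a typical drain; the options that the Operator itself applies are not stated here.

```shell
# Sketch: manual equivalent of creating and deleting a NodeMaintenance CR.
# The drain flags below are an assumption; adjust them to your workloads.
manual_maintenance_start() {
  node="$1"
  # Mark the node unschedulable so that no new pods land on it,
  # then evict the pods that can be evicted.
  oc adm cordon "$node" &&
    oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data
}

manual_maintenance_stop() {
  # Make the node schedulable again.
  oc adm uncordon "$1"
}
```

Calling manual_maintenance_start node-1.example.com cordons and then drains the node. With the Operator installed, creating a NodeMaintenance CR replaces both calls.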

7.2. Installing the Node Maintenance Operator

You can install the Node Maintenance Operator using the web console or the OpenShift CLI (oc).

Note

If OpenShift Virtualization version 4.10 or earlier is installed in your cluster, it includes an outdated version of the Node Maintenance Operator.

7.2.1. Installing the Node Maintenance Operator by using the web console

You can use the Red Hat OpenShift web console to install the Node Maintenance Operator.

Prerequisites

Procedure

In the Red Hat OpenShift web console, navigate to Operators → OperatorHub.
Select the Node Maintenance Operator, then click Install.
Keep the default selection of Installation mode and namespace to ensure that the Operator will be installed to the openshift-workload-availability namespace.
Click Install.

Verification

To confirm that the installation is successful:

Navigate to the Operators → Installed Operators page.
Check that the Operator is installed in the openshift-workload-availability namespace and that its status is Succeeded.

If the Operator is not installed successfully:

Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
Navigate to the Operators → Installed Operators → Node Maintenance Operator → Details page, and inspect the Conditions section for errors before pod creation.
Navigate to the Workloads → Pods page, search for the Node Maintenance Operator pod in the installed namespace, and check the logs in the Logs tab.

7.2.2. Installing the Node Maintenance Operator by using the CLI

You can use the OpenShift CLI (oc) to install the Node Maintenance Operator.

You can install the Node Maintenance Operator in your own namespace or in the openshift-workload-availability namespace.

Prerequisites

Install the OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.

Procedure

Create a Namespace CR for the Node Maintenance Operator:
1. Define the Namespace CR and save the YAML file, for example, workload-availability-namespace.yaml:
```
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-workload-availability
```
2. To create the Namespace CR, run the following command:
```
$ oc create -f workload-availability-namespace.yaml
```

Create an OperatorGroup CR:
1. Define the OperatorGroup CR and save the YAML file, for example, workload-availability-operator-group.yaml:
```
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: workload-availability-operator-group
  namespace: openshift-workload-availability
```
2. To create the OperatorGroup CR, run the following command:
```
$ oc create -f workload-availability-operator-group.yaml
```

Create a Subscription CR:
1. Define the Subscription CR and save the YAML file, for example, node-maintenance-subscription.yaml:
```
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: node-maintenance-operator
  namespace: openshift-workload-availability # 1
spec:
  channel: stable
  installPlanApproval: Automatic
  name: node-maintenance-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  package: node-maintenance-operator
```
  1: Specify the namespace where you want to install the Node Maintenance Operator.
  Important
  To install the Node Maintenance Operator in the openshift-workload-availability namespace, specify openshift-workload-availability in the Subscription CR.
2. To create the Subscription CR, run the following command:
```
$ oc create -f node-maintenance-subscription.yaml
```

Verification

Verify that the installation succeeded by inspecting the CSV resource:

```
$ oc get csv -n openshift-workload-availability
```

Example output

```
NAME                               DISPLAY                     VERSION   REPLACES                           PHASE
node-maintenance-operator.v5.3.0   Node Maintenance Operator   5.3.0     node-maintenance-operator.v5.2.1   Succeeded
```

Verify that the Node Maintenance Operator is running:

```
$ oc get deployment -n openshift-workload-availability
```

Example output

```
NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
node-maintenance-operator-controller-manager   1/1     1            1           10d
```
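
In a script, the CSV check could be automated along the following lines. This is a minimal sketch: the function names, the retry count, and the 10-second polling interval are hypothetical choices, and the CSV is matched by the node-maintenance-operator name prefix shown in the example output.

```shell
# Print the phase of the Node Maintenance Operator CSV in a namespace.
nmo_csv_phase() {
  ns="${1:-openshift-workload-availability}"
  # Emit "<csv-name> <phase>" per CSV, then keep the operator's own line.
  oc get csv -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' |
    awk '$1 ~ /^node-maintenance-operator\./ { print $2; exit }'
}

# Poll until the CSV reports Succeeded, or give up after a number of tries.
wait_for_nmo() {
  ns="${1:-openshift-workload-availability}"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    [ "$(nmo_csv_phase "$ns")" = "Succeeded" ] && return 0
    tries=$((tries - 1))
    sleep 10
  done
  return 1
}
```

For example, wait_for_nmo openshift-workload-availability 30 returns 0 as soon as the phase is Succeeded and returns 1 if that does not happen within 30 polls.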

The Node Maintenance Operator is supported in a restricted network environment. For more information, see Using Operator Lifecycle Manager on restricted networks.

7.3. Setting a node to maintenance mode

You can place a node into maintenance mode from the web console or from the CLI by using a NodeMaintenance CR.

7.3.1. Setting a node to maintenance mode by using the web console

To set a node to maintenance mode, you can create a NodeMaintenance custom resource (CR) by using the web console.

Prerequisites

Log in as a user with cluster-admin privileges.
Install the Node Maintenance Operator from the OperatorHub.

Procedure

From the Administrator perspective in the web console, navigate to Operators → Installed Operators.
Select the Node Maintenance Operator from the list of Operators.
In the Node Maintenance tab, click Create NodeMaintenance.
In the Create NodeMaintenance page, select the Form view or the YAML view to configure the NodeMaintenance CR.
To apply the NodeMaintenance CR that you have configured, click Create.

Verification

In the Node Maintenance tab, inspect the Status column and verify that its status is Succeeded.

7.3.2. Setting a node to maintenance mode by using the CLI

You can put a node into maintenance mode with a NodeMaintenance custom resource (CR). When you apply a NodeMaintenance CR, all allowed pods are evicted and the node is rendered unschedulable. Evicted pods are queued to be moved to another node in the cluster.

Prerequisites

Install the Red Hat OpenShift CLI (oc).
Log in to the cluster as a user with cluster-admin privileges.

Procedure

Create the following NodeMaintenance CR, and save the file as nodemaintenance-cr.yaml:
```
apiVersion: nodemaintenance.medik8s.io/v1beta1
kind: NodeMaintenance
metadata:
  name: nodemaintenance-cr # 1
spec:
  nodeName: node-1.example.com # 2
  reason: "NIC replacement" # 3
```
1: The name of the node maintenance CR.
2: The name of the node to be put into maintenance mode.
3: A plain text description of the reason for maintenance.
Apply the node maintenance CR by running the following command:
```
$ oc apply -f nodemaintenance-cr.yaml
```
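
When several nodes need maintenance, the manifest can be generated per node instead of edited by hand. The following is a sketch under assumptions: the function name and the maintenance-<node> naming scheme are hypothetical, and the reason text must not contain double quotes.

```shell
# Print a NodeMaintenance manifest for one node.
# The "maintenance-<node>" CR name is a hypothetical naming scheme.
nodemaintenance_manifest() {
  node="$1"
  reason="${2:-Node maintenance}"
  cat <<EOF
apiVersion: nodemaintenance.medik8s.io/v1beta1
kind: NodeMaintenance
metadata:
  name: maintenance-${node}
spec:
  nodeName: ${node}
  reason: "${reason}"
EOF
}
```

The output can be piped straight to the cluster, for example: nodemaintenance_manifest node-1.example.com "NIC replacement" | oc apply -f -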

Verification

Check the progress of the maintenance task by running the following command:
```
$ oc describe node <node-name>
```
where <node-name> is the name of your node, for example, node-1.example.com.

Check the example output:

```
Events:
  Type     Reason                     Age                   From     Message
  ----     ------                     ----                  ----     -------
  Normal   NodeNotSchedulable         61m                   kubelet  Node node-1.example.com status is now: NodeNotSchedulable
```
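
To see which pods are still bound to the node while it drains, a field selector on spec.nodeName can help. The sketch below uses a hypothetical function name. Pods that cannot be evicted, such as those managed by daemon sets, are expected to remain in the list.

```shell
# List the pods still bound to a node, across all namespaces.
pods_on_node() {
  oc get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=$1"
}
```

For example, pods_on_node node-1.example.com prints one row per pod that is still assigned to that node.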

7.3.3. Checking status of current NodeMaintenance CR tasks

You can check the status of current NodeMaintenance CR tasks.

Prerequisites

Install the Red Hat OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.

Procedure

Check the status of current node maintenance tasks by querying the NodeMaintenance CRs, which have the short name nm:

```
$ oc get nm -o yaml
```

Example output

```
apiVersion: v1
items:
- apiVersion: nodemaintenance.medik8s.io/v1beta1
  kind: NodeMaintenance
  metadata:
...
  spec:
    nodeName: node-1.example.com
    reason: Node maintenance
  status:
    drainProgress: 100 # 1
    evictionPods: 3 # 2
    lastError: "Last failure message" # 3
    lastUpdate: "2022-06-23T11:43:18Z" # 4
    phase: Succeeded
    totalpods: 5 # 5
...
```

1: The percentage completion of draining the node.
2: The number of pods scheduled for eviction.
3: The latest eviction error, if any.
4: The last time the status was updated.
5: The total number of pods before the node entered maintenance mode.
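
For a quick overview, the same status fields can be condensed to one line per CR. This is a sketch with a hypothetical function name. It reads only the nodeName, phase, and drainProgress fields shown in the example output.

```shell
# Print "<node>: <phase> (<progress>% drained)" for every NodeMaintenance CR.
nm_summary() {
  # jsonpath emits tab-separated fields, one CR per line; awk formats them
  # and skips empty lines.
  oc get nm \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.status.phase}{"\t"}{.status.drainProgress}{"\n"}{end}' |
    awk -F '\t' 'NF { printf "%s: %s (%s%% drained)\n", $1, $2, $3 }'
}
```

If no NodeMaintenance CRs exist, the function prints nothing.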

7.4. Resuming a node from maintenance mode

You can resume a node from maintenance mode from the web console or from the CLI by using a NodeMaintenance CR. Resuming a node brings it out of maintenance mode and makes it schedulable again.

7.4.1. Resuming a node from maintenance mode by using the web console

To resume a node from maintenance mode, you can delete a NodeMaintenance custom resource (CR) by using the web console.

Prerequisites

Log in as a user with cluster-admin privileges.
Install the Node Maintenance Operator from the OperatorHub.

Procedure

From the Administrator perspective in the web console, navigate to Operators → Installed Operators.
Select the Node Maintenance Operator from the list of Operators.
In the Node Maintenance tab, select the NodeMaintenance CR that you want to delete.
Click the Options menu at the end of the node and select Delete NodeMaintenance.

Verification

In the Red Hat OpenShift console, click Compute → Nodes.
Inspect the Status column of the node for which you deleted the NodeMaintenance CR and verify that its status is Ready.

7.4.2. Resuming a node from maintenance mode by using the CLI

You can resume a node from maintenance mode that was initiated with a NodeMaintenance CR by deleting the NodeMaintenance CR.

Prerequisites

Install the Red Hat OpenShift CLI (oc).
Log in to the cluster as a user with cluster-admin privileges.

Procedure

When your node maintenance task is complete, delete the active NodeMaintenance CR:

```
$ oc delete -f nodemaintenance-cr.yaml
```

Example output

```
nodemaintenance.nodemaintenance.medik8s.io "nodemaintenance-cr" deleted
```

Verification

Check the progress of the maintenance task by running the following command:
```
$ oc describe node <node-name>
```
where <node-name> is the name of your node, for example, node-1.example.com.

Check the example output:

```
Events:
  Type     Reason                  Age                   From     Message
  ----     ------                  ----                  ----     -------
  Normal   NodeSchedulable         2m                    kubelet  Node node-1.example.com status is now: NodeSchedulable
```
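
For scripted checks, the cordon flag on the node object can be read directly instead of scanning the event list. The following sketch uses a hypothetical function name and relies on the spec.unschedulable field, which is true on a cordoned node and unset otherwise.

```shell
# Exit 0 when the node accepts new pods, 1 when it is cordoned,
# and 2 when the node could not be queried.
node_is_schedulable() {
  flag="$(oc get node "$1" -o jsonpath='{.spec.unschedulable}')" || return 2
  [ "$flag" != "true" ]
}
```

For example: node_is_schedulable node-1.example.com && echo "node-1.example.com accepts workloads again"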

7.5. Working with bare-metal nodes

For clusters with bare-metal nodes, you can place a node into maintenance mode, and resume a node from maintenance mode, by using the web console Actions control.

Note

On clusters with bare-metal nodes, you can also place a node into maintenance mode, and resume it from maintenance mode, by using the web console and CLI methods outlined in the previous sections. The methods in this section, which use the web console Actions control, apply to bare-metal clusters only.

7.5.1. Maintaining bare-metal nodes

When you deploy Red Hat OpenShift on bare-metal infrastructure, you must take additional considerations into account compared to deploying on cloud infrastructure. In cloud environments, cluster nodes are considered ephemeral. Reprovisioning a bare-metal node, by contrast, requires significantly more time and effort.

When a bare-metal node fails due to a kernel error or a NIC hardware failure, workloads on the failed node need to be restarted on another node in the cluster while the problem node is repaired or replaced. Node maintenance mode allows cluster administrators to gracefully power off nodes, move workloads to other parts of the cluster, and ensure that workloads are not interrupted. Detailed progress and node status information is provided during maintenance.

7.5.2. Setting a bare-metal node to maintenance mode

Set a bare-metal node to maintenance mode by using the Options menu found on each node in the Compute → Nodes list, or by using the Actions control of the Node Details screen.

Procedure

From the Administrator perspective of the web console, click Compute → Nodes.
You can set the node to maintenance from this screen, which makes it easier to perform actions on multiple nodes, or from the Node Details screen, where you can view comprehensive details of the selected node:
- Click the Options menu at the end of the node and select Start Maintenance.
- Click the node name to open the Node Details screen and click Actions → Start Maintenance.
Click Start Maintenance in the confirmation window.

The node is no longer schedulable. Virtual machines on the node that use the LiveMigration eviction strategy are live migrated. All other pods and virtual machines on the node are deleted and recreated on another node.

Verification

Navigate to the Compute → Nodes page and verify that the corresponding node has a status of Under maintenance.

7.5.3. Resuming a bare-metal node from maintenance mode

Resume a bare-metal node from maintenance mode by using the Options menu found on each node in the Compute → Nodes list, or by using the Actions control of the Node Details screen.

Procedure

From the Administrator perspective of the web console, click Compute → Nodes.
You can resume the node from this screen, which makes it easier to perform actions on multiple nodes, or from the Node Details screen, where you can view comprehensive details of the selected node:
- Click the Options menu at the end of the node and select Stop Maintenance.
- Click the node name to open the Node Details screen and click Actions → Stop Maintenance.
Click Stop Maintenance in the confirmation window.

The node becomes schedulable. Virtual machine instances that were running on the node before maintenance do not automatically migrate back to it.

Verification

Navigate to the Compute → Nodes page and verify that the corresponding node has a status of Ready.

7.6. Gathering data about the Node Maintenance Operator

To collect debugging information about the Node Maintenance Operator, use the must-gather tool. For information about the must-gather image for the Node Maintenance Operator, see Gathering data about specific features.
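
A must-gather run with a feature-specific image generally takes the form sketched below. The function name and the default output directory are hypothetical, and the image reference is deliberately left as a required argument because it is not given here; take it from the referenced documentation.

```shell
# Run must-gather with a feature-specific image.
# $1: must-gather image reference (required; not stated in this chapter)
# $2: output directory (hypothetical default)
gather_nmo_data() {
  image="${1:-}"
  dest="${2:-./must-gather-nmo}"
  if [ -z "$image" ]; then
    echo "usage: gather_nmo_data <must-gather-image> [dest-dir]" >&2
    return 2
  fi
  oc adm must-gather --image="$image" --dest-dir="$dest"
}
```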
