Replacing nodes

Red Hat OpenShift Data Foundation 4.17

Instructions for how to safely replace a node in an OpenShift Data Foundation cluster.

Red Hat Storage Documentation Team

Abstract

This document explains how to safely replace a node in a Red Hat OpenShift Data Foundation cluster.

Making open source more inclusive

Red Hat is committed to replacing problematic language in our code, documentation, and web properties. We are beginning with these four terms: master, slave, blacklist, and whitelist. Because of the enormity of this endeavor, these changes will be implemented gradually over several upcoming releases. For more details, see our CTO Chris Wright’s message.

Providing feedback on Red Hat documentation

We appreciate your input on our documentation. Do let us know how we can make it better.

To give feedback, create a Jira ticket:

Log in to Jira.
Click Create in the top navigation bar.
Enter a descriptive title in the Summary field.
Enter your suggestion for improvement in the Description field. Include links to the relevant parts of the documentation.
Select Documentation in the Components field.
Click Create at the bottom of the dialog.

Preface

For OpenShift Data Foundation, node replacement can be performed proactively for an operational node and reactively for a failed node for the following deployments:

For Amazon Web Services (AWS)
- User-provisioned infrastructure
- Installer-provisioned infrastructure
For VMware
- User-provisioned infrastructure
- Installer-provisioned infrastructure
For Microsoft Azure
- Installer-provisioned infrastructure
For local storage devices
- Bare metal
- VMware
- IBM Power
To replace storage nodes in external mode, see the Red Hat Ceph Storage documentation.

Chapter 1. OpenShift Data Foundation deployed using dynamic devices

1.1. OpenShift Data Foundation deployed on AWS

To replace an operational node, see:
- Section 1.1.1, “Replacing an operational AWS node on user-provisioned infrastructure”.
- Section 1.1.2, “Replacing an operational AWS node on installer-provisioned infrastructure”.
To replace a failed node, see:
- Section 1.1.3, “Replacing a failed AWS node on user-provisioned infrastructure”.
- Section 1.1.4, “Replacing a failed AWS node on installer-provisioned infrastructure”.

1.1.1. Replacing an operational AWS node on user-provisioned infrastructure

Prerequisites

Ensure that the replacement nodes are configured with similar infrastructure and resources to the node that you replace.
You must be logged in to the OpenShift Container Platform cluster.

Note

When replacing an AWS node on user-provisioned infrastructure, the new node needs to be created in the same AWS zone as the original node.
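One way to check the zone requirement is to compare the zone label of the original node with that of the replacement. The following is a minimal sketch that assumes the nodes carry the well-known Kubernetes label topology.kubernetes.io/zone. The helper names zone_from_labels and same_zone are hypothetical and are not part of oc.

```shell
# Extract the value of the topology.kubernetes.io/zone label from a
# comma-separated label string, as printed in the LABELS column of
# `oc get nodes --show-labels`. Prints nothing if the label is absent.
zone_from_labels() {
  printf '%s\n' "$1" | tr ',' '\n' | sed -n 's|^topology\.kubernetes\.io/zone=||p'
}

# Succeed only when both label strings name the same, non-empty zone.
same_zone() {
  old_zone=$(zone_from_labels "$1")
  new_zone=$(zone_from_labels "$2")
  [ -n "$old_zone" ] && [ "$old_zone" = "$new_zone" ]
}

# Usage (requires a logged-in oc session; record the old labels before
# deleting the original node):
#   old=$(oc get node <node_name> --show-labels --no-headers | awk '{print $NF}')
#   new=$(oc get node <new_node_name> --show-labels --no-headers | awk '{print $NF}')
#   same_zone "$old" "$new" && echo "zones match"
```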

Procedure

Identify the node that you need to replace.
Mark the node as unschedulable:
```
$ oc adm cordon <node_name>
```
<node_name>
Specify the name of the node that you need to replace.
Drain the node:
```
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
```
Important
This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and resolve automatically after you label the new node and it becomes functional.
Delete the node:
```
$ oc delete nodes <node_name>
```
Create a new Amazon Web Services (AWS) machine instance with the required infrastructure. See Platform requirements.
Create a new OpenShift Container Platform node using the new AWS machine instance.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all the required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <certificate_name>
```
<certificate_name>
Specify the name of the CSR.
Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Specify the name of the new node.
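The cordon, drain, and delete commands from this procedure can be collected into a small script. The following is a minimal sketch, not a supported tool: the function names run and remove_node and the DRY_RUN variable are hypothetical, and the oc commands are the ones listed in the steps above. With DRY_RUN=1, which is the default in this sketch, each command is printed instead of executed so that the sequence can be reviewed first.

```shell
# Print the command when DRY_RUN=1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Cordon, drain, and delete the node named in $1, stopping at the first failure.
remove_node() {
  node=$1
  [ -n "$node" ] || { echo "usage: remove_node <node_name>" >&2; return 2; }
  run oc adm cordon "$node" &&
  run oc adm drain "$node" --force --delete-emptydir-data=true --ignore-daemonsets &&
  run oc delete nodes "$node"
}

# Usage:
#   remove_node <node_name>             # print the three commands
#   DRY_RUN=0 remove_node <node_name>   # execute them (requires a logged-in oc session)
```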

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for each selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside each ocs-deviceset name.
If the verification steps fail, contact Red Hat Support.
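The first verification step can also be expressed as a yes-or-no check for one specific node. The following is a minimal sketch that reads the output of oc get nodes --show-labels on standard input and assumes that LABELS is the last column. The function name node_has_storage_label is hypothetical.

```shell
# Read `oc get nodes --show-labels` output on stdin and succeed if the node
# named in $1 carries the OpenShift Data Foundation storage label.
node_has_storage_label() {
  awk -v node="$1" '
    $1 == node && index($NF, "cluster.ocs.openshift.io/openshift-storage=") { found = 1 }
    END { exit !found }
  '
}

# Usage (requires a logged-in oc session):
#   oc get nodes --show-labels | node_has_storage_label <new_node_name> && echo "labeled"
```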

1.1.2. Replacing an operational AWS node on installer-provisioned infrastructure

Procedure

Log in to the OpenShift Web Console, and click Compute → Nodes.
Identify the node that you need to replace. Take a note of its Machine Name.
Mark the node as unschedulable:
```
$ oc adm cordon <node_name>
```
<node_name>
Specify the name of the node that you need to replace.
Drain the node:
```
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
```
Important
This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and resolve automatically after you label the new node and it becomes functional.
Click Compute → Machines. Search for the required machine.
Beside the required machine, click Action menu (⋮) → Delete Machine.
Click Delete to confirm the deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important
This activity might take 5 to 10 minutes or more.
Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Specify the name of the new node.
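The waiting steps in this procedure can be scripted with a generic polling helper instead of watching the console. The following is a minimal sketch: wait_until and node_is_ready are hypothetical names, and the readiness check assumes that STATUS is the second column of oc get node output.

```shell
# Retry a command every <interval> seconds until it succeeds or <timeout>
# seconds have elapsed. Usage: wait_until <timeout> <interval> <command...>
wait_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  until "$@"; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Succeed when the node named in $1 reports exactly Ready in the STATUS column.
# A missing node produces no output, so the check fails.
node_is_ready() {
  oc get node "$1" --no-headers 2>/dev/null | awk '$2 == "Ready" { ok = 1 } END { exit !ok }'
}

# Usage (requires a logged-in oc session): poll every 10 seconds for up to 10 minutes.
#   wait_until 600 10 node_is_ready <new_node_name>
```

The built-in oc wait command with --for=condition=Ready covers a similar need for a node that already exists. The helper above is only meant to show the polling pattern.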

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for each selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside each ocs-deviceset name.
If the verification steps fail, contact Red Hat Support.
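The OSD check above matches the node name as a pattern, so a name such as worker-1 also matches worker-10. The following is a minimal sketch of a stricter filter that uses fixed-string, whole-word matching. The function name osd_pods_on_node is hypothetical, and the input is assumed to be the output of oc get pods -o wide -n openshift-storage.

```shell
# Read `oc get pods -o wide -n openshift-storage` output on stdin and print
# the lines that contain both the node name in $1 (as a whole word, matched
# literally) and the string osd.
osd_pods_on_node() {
  grep -Fw -- "$1" | grep -- 'osd'
}

# Usage (requires a logged-in oc session):
#   oc get pods -o wide -n openshift-storage | osd_pods_on_node <new_node_name>
```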

1.1.3. Replacing a failed AWS node on user-provisioned infrastructure

Prerequisites

Ensure that the replacement nodes are configured with similar infrastructure and resources to the node that you replace.
You must be logged in to the OpenShift Container Platform cluster.

Procedure

Identify the Amazon Web Services (AWS) machine instance of the node that you need to replace.
Log in to AWS, and terminate the AWS machine instance that you identified.
Create a new AWS machine instance with the required infrastructure. See Platform requirements.
Create a new OpenShift Container Platform node using the new AWS machine instance.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all the required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <certificate_name>
```
<certificate_name>
Specify the name of the CSR.
Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Specify the name of the new node.
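When many CSRs are listed, it can help to reduce the oc get csr output to the names that are still pending before approving them. The following is a minimal sketch that assumes CONDITION is the last column of that output. The function name pending_csrs is hypothetical.

```shell
# Read `oc get csr` output on stdin and print the names of the requests whose
# CONDITION column (the last field) is exactly Pending.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage (requires a logged-in oc session). Review the list first, because it
# contains every pending CSR, not only those that belong to the new node:
#   oc get csr | pending_csrs
#   oc get csr | pending_csrs | while read -r name; do
#     oc adm certificate approve "$name"
#   done
```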

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for each selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside each ocs-deviceset name.
If the verification steps fail, contact Red Hat Support.
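The manual inspection of the lsblk output can be narrowed with a filter. The following is a minimal sketch that prints only the ocs-deviceset entries whose line also contains crypt as a separate word, which is assumed here to come from the TYPE column. The function name encrypted_devicesets is hypothetical.

```shell
# Read `lsblk` output on stdin and print the ocs-deviceset entries whose line
# also contains the standalone word crypt. A device name that merely ends in
# "dmcrypt" does not count. Exit status is non-zero when nothing matches.
encrypted_devicesets() {
  grep -- 'ocs-deviceset' | grep -w -- 'crypt'
}

# Usage (inside the chroot environment on the node):
#   lsblk | encrypted_devicesets
```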

1.1.4. Replacing a failed AWS node on installer-provisioned infrastructure

Procedure

Log in to the OpenShift Web Console, and click Compute → Nodes.
Identify the faulty node, and click on its Machine Name.
Click Actions → Edit Annotations, and click Add More.
Add machine.openshift.io/exclude-node-draining, and click Save.
Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important
This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and resolve automatically after you label the new node and it becomes functional.
Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Specify the name of the new node.
Optional: If the failed Amazon Web Services (AWS) instance is not removed automatically, terminate the instance from the AWS console.
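The web console steps for annotating and deleting the machine have command-line counterparts. The following is a minimal sketch that assumes Machine objects live in the openshift-machine-api namespace. The function names run and replace_failed_machine and the DRY_RUN variable are hypothetical. With DRY_RUN=1, which is the default in this sketch, the commands are printed rather than executed.

```shell
# Print the command when DRY_RUN=1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Annotate the machine named in $1 so that node draining is skipped, then
# delete the machine. The annotation value is an empty string, so the printed
# dry-run line ends with a bare "=".
replace_failed_machine() {
  machine=$1
  [ -n "$machine" ] || { echo "usage: replace_failed_machine <machine_name>" >&2; return 2; }
  run oc annotate machine "$machine" -n openshift-machine-api \
    machine.openshift.io/exclude-node-draining="" &&
  run oc delete machine "$machine" -n openshift-machine-api
}

# Usage:
#   replace_failed_machine <machine_name>             # print the commands
#   DRY_RUN=0 replace_failed_machine <machine_name>   # execute them
```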

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for each selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside each ocs-deviceset name.
If the verification steps fail, contact Red Hat Support.
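The check for the csi-cephfsplugin-* and csi-rbdplugin-* pods can also be done from the command line. The following is a minimal sketch that reads the output of oc get pods -o wide -n openshift-storage and assumes that STATUS is the third column. The function name csi_plugins_running is hypothetical, and the name patterns are taken from the list above.

```shell
# Read `oc get pods -o wide -n openshift-storage` output on stdin and succeed
# only if the node named in $1 has at least one csi-cephfsplugin-* pod and at
# least one csi-rbdplugin-* pod with Running in the STATUS column.
csi_plugins_running() {
  awk -v node="$1" '
    $3 == "Running" && index($0, " " node " ") {
      if ($1 ~ /^csi-cephfsplugin-/) cephfs = 1
      if ($1 ~ /^csi-rbdplugin-/) rbd = 1
    }
    END { exit !(cephfs && rbd) }
  '
}

# Usage (requires a logged-in oc session):
#   oc get pods -o wide -n openshift-storage | csi_plugins_running <new_node_name>
```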

1.2. OpenShift Data Foundation deployed on VMware

To replace an operational node, see:
- Section 1.2.1, “Replacing an operational VMware node on user-provisioned infrastructure”.
- Section 1.2.2, “Replacing an operational VMware node on installer-provisioned infrastructure”.
To replace a failed node, see:
- Section 1.2.3, “Replacing a failed VMware node on user-provisioned infrastructure”.
- Section 1.2.4, “Replacing a failed VMware node on installer-provisioned infrastructure”.

1.2.1. Replacing an operational VMware node on user-provisioned infrastructure

Prerequisites

Ensure that the replacement nodes are configured with similar infrastructure and resources to the node that you replace.
You must be logged in to the OpenShift Container Platform cluster.

Procedure

Identify the node and its Virtual Machine (VM) that you need to replace.
Mark the node as unschedulable:
```
$ oc adm cordon <node_name>
```
<node_name>
Specify the name of the node that you need to replace.
Drain the node:
```
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
```
Important
This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and resolve automatically after you label the new node and it becomes functional.
Delete the node:
```
$ oc delete nodes <node_name>
```
Log in to VMware vSphere, and terminate the VM that you identified.
Important
Delete the VM only from the inventory and not from the disk.
Create a new VM on VMware vSphere with the required infrastructure. See Platform requirements.
Create a new OpenShift Container Platform worker node using the new VM.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all the required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <certificate_name>
```
<certificate_name>
Specify the name of the CSR.
Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Specify the name of the new node.

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for each selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside each ocs-deviceset name.
If the verification steps fail, contact Red Hat Support.

1.2.2. Replacing an operational VMware node on installer-provisioned infrastructure

Procedure

Log in to the OpenShift Web Console, and click Compute → Nodes.
Identify the node that you need to replace. Take a note of its Machine Name.
Mark the node as unschedulable:
```
$ oc adm cordon <node_name>
```
<node_name>
Specify the name of the node that you need to replace.
Drain the node:
```
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
```
Important
This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and resolve automatically after you label the new node and it becomes functional.
Click Compute → Machines. Search for the required machine.
Beside the required machine, click Action menu (⋮) → Delete Machine.
Click Delete to confirm the deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important
This activity might take 5 to 10 minutes or more.
Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Specify the name of the new node.
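The Machine Name noted at the start of this procedure can also be read from the command line. The following is a minimal sketch that assumes the node carries the machine.openshift.io/machine annotation with a value of the form <namespace>/<machine_name>. The function name machine_name_from_annotation is hypothetical.

```shell
# Strip the "<namespace>/" prefix from a machine.openshift.io/machine
# annotation value, leaving only the machine name.
machine_name_from_annotation() {
  printf '%s\n' "${1##*/}"
}

# Usage (requires a logged-in oc session):
#   ann=$(oc get node <node_name> \
#     -o jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}')
#   machine_name_from_annotation "$ann"
```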

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for each selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside each ocs-deviceset name.
If the verification steps fail, contact Red Hat Support.

1.2.3. Replacing a failed VMware node on user-provisioned infrastructure

Prerequisites

Ensure that the replacement nodes are configured with similar infrastructure and resources to the node that you replace.
You must be logged in to the OpenShift Container Platform cluster.

Procedure

Identify the node and its Virtual Machine (VM) that you need to replace.
Delete the node:
```
$ oc delete nodes <node_name>
```
<node_name>
Specify the name of the node that you need to replace.
Log in to VMware vSphere and terminate the VM that you identified.
Important
Delete the VM only from the inventory and not from the disk.
Create a new VM on VMware vSphere with the required infrastructure. See Platform requirements.
Create a new OpenShift Container Platform worker node using the new VM.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all the required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <certificate_name>
```
<certificate_name>
Specify the name of the CSR.
Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Specify the name of the new node.

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for each selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside each ocs-deviceset name.
If the verification steps fail, contact Red Hat Support.

1.2.4. Replacing a failed VMware node on installer-provisioned infrastructure

Procedure

Log in to the OpenShift Web Console, and click Compute → Nodes.
Identify the faulty node, and click on its Machine Name.
Click Actions → Edit Annotations, and click Add More.
Add machine.openshift.io/exclude-node-draining, and click Save.
Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important
This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and resolve automatically after you label the new node and it becomes functional.
Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Specify the name of the new node.
Optional: If the failed Virtual Machine (VM) is not removed automatically, terminate the VM from VMware vSphere.

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for each selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside each ocs-deviceset name.
If the verification steps fail, contact Red Hat Support.

1.3. OpenShift Data Foundation deployed on Microsoft Azure

1.3.1. Replacing operational nodes on Azure installer-provisioned infrastructure

Procedure

Log in to the OpenShift Web Console, and click Compute → Nodes.
Identify the node that you need to replace. Take a note of its Machine Name.
Mark the node as unschedulable:
```
$ oc adm cordon <node_name>
```
<node_name>
Specify the name of the node that you need to replace.
Drain the node:
```
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
```
Important
This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and resolve automatically after you label the new node and it becomes functional.
Click Compute → Machines. Search for the required machine.
Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important
This activity might take 5 to 10 minutes or more.
Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Specify the name of the new node.

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the one or more selected hosts:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside the one or more ocs-deviceset names.
If the verification steps fail, contact Red Hat Support.
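
The pod check in the verification steps above can also be scripted. The following is a minimal sketch and not part of the official procedure: the function name pods_not_running_on_node is hypothetical, and it assumes the default column layout of `oc get pods -o wide --no-headers` (pod name first, STATUS third, node name in a later column).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the product): print the names of pods
# scheduled on a given node whose STATUS column is not "Running".
# Input: the output of `oc get pods -n openshift-storage -o wide --no-headers`
# on stdin. The node name is matched against every column after STATUS,
# because the RESTARTS column can span more than one word.
pods_not_running_on_node() {
  local node="$1"
  awk -v node="$node" '
    {
      on_node = 0
      for (i = 4; i <= NF; i++) if ($i == node) on_node = 1
      if (on_node && $3 != "Running") print $1
    }'
}

# Example usage (requires a logged-in oc session):
#   oc get pods -n openshift-storage -o wide --no-headers \
#     | pods_not_running_on_node <new_node_name>
```

An empty result suggests that every pod on the node is in Running state. Pods of finished jobs, which report Completed, are also listed, so review the output rather than treating any line as a failure.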

1.3.2. Replacing failed nodes on Azure installer-provisioned infrastructure

Procedure

Log in to the OpenShift Web Console, and click Compute → Nodes.
Identify the faulty node, and click on its Machine Name.
Click Actions → Edit Annotations, and click Add More.
Add machine.openshift.io/exclude-node-draining, and click Save.
Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important
This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are resolved automatically after you label the new node and it is functional.
Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Specify the name of the new node.
Optional: If the failed Azure instance is not removed automatically, terminate the instance from the Azure console.

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the one or more selected hosts:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside the one or more ocs-deviceset names.
If the verification steps fail, contact Red Hat Support.
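
The encryption check above relies on reading the lsblk output by eye. As a rough sketch only, the same check can be expressed as a filter. The function name report_deviceset_encryption is hypothetical, and the code assumes the default lsblk column order (NAME, MAJ:MIN, RM, SIZE, RO, TYPE), so that the device type is the sixth column.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the product): for every lsblk entry whose
# name contains "ocs-deviceset", report whether its TYPE column is "crypt".
# Input: default `lsblk` output on stdin.
report_deviceset_encryption() {
  awk '
    $1 ~ /ocs-deviceset/ {
      name = $1
      sub(/^[^A-Za-z0-9]+/, "", name)   # drop leading tree-drawing characters
      status = ($6 == "crypt") ? "encrypted" : "NOT-encrypted"
      print name, status
    }'
}

# Example usage from a workstation with a logged-in oc session:
#   oc debug node/<node_name> -- chroot /host lsblk | report_deviceset_encryption
```

The filter only reports entries that appear in the lsblk output. A device set that is missing from the output is not flagged, so compare the result with the device sets you expect on the node.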

1.4. OpenShift Data Foundation deployed on Google Cloud

1.4.1. Replacing operational nodes on Google Cloud installer-provisioned infrastructure

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the node that needs to be replaced. Take a note of its Machine Name.
Mark the node as unschedulable using the following command:
```
$ oc adm cordon <node_name>
```
Drain the node using the following command:
```
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
```
Important
This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are resolved automatically after the new node is labeled and functional.
Click Compute → Machines. Search for the required machine.
Beside the required machine, click the Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important
This activity might take 5 to 10 minutes or more.
Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Execute the following command to apply the OpenShift Data Foundation label to the new node:

$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""


Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the one or more selected hosts:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside the one or more ocs-deviceset names.
If the verification steps fail, contact Red Hat Support.
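
The label verification above prints every labeled node and leaves the comparison to the reader. The sketch below, offered as an illustration only, turns that into a yes/no check for one node. The function name node_has_storage_label is hypothetical, and the code assumes that the labels are the last, comma-separated column of `oc get nodes --show-labels --no-headers`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the product): exit 0 if the named node
# carries the OpenShift Data Foundation storage label, exit 1 otherwise.
# Input: `oc get nodes --show-labels --no-headers` output on stdin.
node_has_storage_label() {
  local node="$1"
  awk -v node="$node" -v label="cluster.ocs.openshift.io/openshift-storage=" '
    $1 == node && index($NF, label) > 0 { found = 1 }
    END { exit(found ? 0 : 1) }'
}

# Example usage (requires a logged-in oc session):
#   if oc get nodes --show-labels --no-headers \
#        | node_has_storage_label <new_node_name>; then
#     echo "label present"
#   else
#     echo "label missing"
#   fi
```

Matching the node name exactly against the first column avoids false positives when one node name is a prefix of another.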

1.4.2. Replacing failed nodes on Google Cloud installer-provisioned infrastructure

Procedure

Log in to OpenShift Web Console and click Compute → Nodes.
Identify the faulty node and click on its Machine Name.
Click Actions → Edit Annotations, and click Add More.
Add machine.openshift.io/exclude-node-draining and click Save.
Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important
This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are resolved automatically after the new node is labeled and functional.
Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Specify the name of the new node.
Optional: If the failed Google Cloud instance is not removed automatically, terminate the instance from the Google Cloud console.

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that the new Object Storage Device (OSD) pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the one or more selected hosts:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside the one or more ocs-deviceset names.
If the verification steps fail, contact Red Hat Support.
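
Because the replacement machine can take several minutes to start, it can help to poll for the Ready state instead of refreshing the console. The following is a minimal sketch under stated assumptions: the function names wait_until and node_is_ready, the 900-second timeout, and the 15-second interval are all illustrative choices and do not come from the product documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or a timeout expires.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if the deadline passes first.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical check: succeed when the node reports its Ready condition as True.
node_is_ready() {
  local status
  status=$(oc get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)
  [ "$status" = "True" ]
}

# Example usage: wait up to 15 minutes, polling every 15 seconds.
#   wait_until 900 15 node_is_ready <new_node_name>
```

Keeping the retry loop separate from the readiness check means the same loop can wrap other checks in this procedure, such as waiting for pods to reach Running state.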

Chapter 2. OpenShift Data Foundation deployed using local storage devices

The procedure for bare‑metal deployments in this chapter also applies to agnostic deployments.

2.1. Replacing storage nodes on bare metal infrastructure

To replace an operational node, see Section 2.1.1, “Replacing an operational node on bare metal user-provisioned infrastructure”.
To replace a failed node, see Section 2.1.2, “Replacing a failed node on bare metal user-provisioned infrastructure”.

2.1.1. Replacing an operational node on bare metal user-provisioned infrastructure

Prerequisites

Ensure that the replacement nodes are configured with similar infrastructure, resources, and disks to the node that you replace.
You must be logged into the OpenShift Container Platform cluster.

Procedure

Identify the node, and get the labels on the node that you need to replace:
```
$ oc get nodes --show-labels | grep <node_name>
```
<node_name>
Specify the name of the node that you need to replace.
Identify the monitor pod (if any) and the OSDs that are running on the node that you need to replace:
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step:

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage

Mark the node as unschedulable:
```
$ oc adm cordon <node_name>
```

Drain the node:

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Delete the node:
```
$ oc delete node <node_name>
```
Get a new bare-metal machine with the required infrastructure. See Installing on bare metal.
Important
For information about how to replace a master node when you have installed OpenShift Data Foundation on a three-node OpenShift compact bare-metal cluster, see the Backup and Restore guide in the OpenShift Container Platform documentation.
Create a new OpenShift Container Platform node using the new bare-metal machine.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all the required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <certificate_name>
```
<certificate_name>
Specify the name of the CSR.
Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
Navigate to the openshift-storage project:
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -
```
<failed_osd_id>
Is the integer in the pod name immediately after the rook-ceph-osd prefix.
You can add comma-separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important

If the ocs-osd-removal-job fails, and the pod is not in the expected Completed state, check the pod logs for further debugging:

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Identify the Persistent Volume (PV) associated with the Persistent Volume Claim (PVC):

# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released

Example output:

local-pv-d6bf175b  1490Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  compute-1

If there is a PV in Released state, delete it:

# oc delete pv <persistent_volume>

For example:

# oc delete pv local-pv-d6bf175b

Example output:

persistentvolume "local-pv-d6bf175b" deleted

Identify the crashcollector pod deployment:

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

If there is an existing crashcollector pod deployment, delete it:

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

Delete the ocs-osd-removal-job:

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created, and is in the Running state:

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running   0          38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running   0          38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running   0          4m8s

The OSD and monitor pods might take several minutes to reach the Running state.

Verify that new OSD pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the one or more selected hosts:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside the one or more ocs-deviceset names.
If the verification steps fail, contact Red Hat Support.
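
The OSD removal step in the procedure above takes the OSD ID from the pod name, as the integer that follows the rook-ceph-osd prefix. As an illustrative sketch, the value for FAILED_OSD_IDS can be derived from the pod listing captured before the node is deleted. The function name osd_ids_csv is hypothetical, and the code assumes pod names of the form rook-ceph-osd-<id>-<suffix>.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the product): build a comma-separated list
# of OSD IDs from pod names.
# Input: pod listing lines on stdin, pod name in the first column.
# Only names of the form rook-ceph-osd-<integer>-... are used, so the
# rook-ceph-osd-prepare-* pods are ignored.
osd_ids_csv() {
  sed -n -E 's/^rook-ceph-osd-([0-9]+)-.*/\1/p' | sort -n -u | paste -sd, -
}

# Example usage (requires a logged-in oc session):
#   ids=$(oc get pods -n openshift-storage -o wide --no-headers \
#           | grep -i <node_name> | osd_ids_csv)
#   echo "FAILED_OSD_IDS=${ids}"
```

Review the resulting list before passing it to the ocs-osd-removal template, because the removal job acts on every ID in the list.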

2.1.2. Replacing a failed node on bare metal user-provisioned infrastructure

Prerequisites

Ensure that the replacement nodes are configured with similar infrastructure, resources, and disks to the node that you replace.
You must be logged into the OpenShift Container Platform cluster.

Procedure

Identify the node, and get the labels on the node that you need to replace:
```
$ oc get nodes --show-labels | grep <node_name>
```
<node_name>
Specify the name of the node that you need to replace.
Identify the monitor pod (if any) and the OSDs that are running on the node that you need to replace:
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step:

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage

Mark the node as unschedulable:
```
$ oc adm cordon <node_name>
```

Remove the pods which are in Terminating state:

$ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'

Drain the node:

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Delete the node:
```
$ oc delete node <node_name>
```
Get a new bare-metal machine with the required infrastructure. See Installing on bare metal.
Important
For information about how to replace a master node when you have installed OpenShift Data Foundation on a three-node OpenShift compact bare-metal cluster, see the Backup and Restore guide in the OpenShift Container Platform documentation.
Create a new OpenShift Container Platform node using the new bare-metal machine.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all the required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <certificate_name>
```
<certificate_name>
Specify the name of the CSR.
Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
Navigate to the openshift-storage project:
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -
```
<failed_osd_id>
Is the integer in the pod name immediately after the rook-ceph-osd prefix.
You can add comma-separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important

If the ocs-osd-removal-job fails, and the pod is not in the expected Completed state, check the pod logs for further debugging:

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Verify that the new localblock Persistent Volume (PV) is available:

$ oc get pv | grep localblock | grep Available

Example output:

local-pv-551d950     512Gi    RWO    Delete  Available            localblock     26s

Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Specify the name of the new node.
Find and delete the rook-ceph-operator pod:
1. Find the rook-ceph-operator pod ID:
  $ oc get po -l app=rook-ceph-operator
  Example output:
  NAME                                  READY   STATUS    RESTARTS   AGE
  rook-ceph-operator-5b47ccd5b9-b6t8c   1/1     Running   0          33m
2. Delete the rook-ceph-operator pod. The following example uses the ID from the output in the previous step:
  $ oc delete pod rook-ceph-operator-5b47ccd5b9-b6t8c
  Replace the operator ID with the ID you found in the previous step.

Identify the crashcollector pod deployment:

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

If there is an existing crashcollector pod deployment, delete it:

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

Delete the ocs-osd-removal-job:

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created, and is in the Running state:

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running   0          38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running   0          38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running   0          4m8s

The OSD and monitor pods might take several minutes to reach the Running state.

Verify that new OSD pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the one or more selected hosts:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside the one or more ocs-deviceset names.
If the verification steps fail, contact Red Hat Support.
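
The step in the procedure above that removes pods in Terminating state runs the delete commands immediately from inside awk. As a cautious alternative, the sketch below only prints the commands so that they can be reviewed first. The function name print_terminating_pod_deletes is hypothetical, and the code assumes the column layout of `oc get pods -A -o wide --no-headers` (namespace, pod name, READY, STATUS, then the remaining columns including the node name).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the product): print, without running them,
# the force-delete commands for pods in Terminating state on one node.
# Input: `oc get pods -A -o wide --no-headers` output on stdin.
# The node name must match a column exactly, unlike the substring match
# performed by `grep -i <node_name>`.
print_terminating_pod_deletes() {
  local node="$1"
  awk -v node="$node" '
    $4 == "Terminating" {
      for (i = 5; i <= NF; i++) if ($i == node) {
        printf "oc -n %s delete pods %s --grace-period=0 --force\n", $1, $2
        break
      }
    }'
}

# Example usage (requires a logged-in oc session):
#   oc get pods -A -o wide --no-headers | print_terminating_pod_deletes <node_name>
```

After reviewing the printed commands, they can be run by piping the same output to a shell, for example by appending `| sh` to the usage line.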

2.2. Replacing storage nodes on IBM Z or IBM® LinuxONE infrastructure

You can choose one of the following procedures to replace storage nodes:

Section 2.2.1, “Replacing operational nodes on IBM Z or IBM® LinuxONE infrastructure”.
Section 2.2.2, “Replacing failed nodes on IBM Z or IBM® LinuxONE infrastructure”.

2.2.1. Replacing operational nodes on IBM Z or IBM® LinuxONE infrastructure

Use this procedure to replace an operational node on IBM Z or IBM® LinuxONE infrastructure.

Procedure

Identify the node and get labels on the node to be replaced. Make a note of the rack label.
```
$ oc get nodes --show-labels | grep <node_name>
```
Identify the mon (if any) and object storage device (OSD) pods that are running on the node to be replaced.
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step.

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage

Mark the node as unschedulable.
```
$ oc adm cordon <node_name>
```

Remove the pods which are in the Terminating state.

$ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'

Drain the node.

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Delete the node.
```
$ oc delete node <node_name>
```
Get a new IBM Z storage node as a replacement.
Check for certificate signing requests (CSRs) related to OpenShift Data Foundation that are in Pending state:
```
$ oc get csr
```
Approve all required OpenShift Data Foundation CSRs for the new node:
```
oc adm certificate approve <Certificate_Name>
```
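When several requests are listed, the pending ones can be filtered from the oc get csr output before you approve them. The following is a minimal sketch, assuming the default table output in which CONDITION is the last column; the function name pending_csr_names is hypothetical.

```shell
# Hypothetical helper: reads `oc get csr` output on stdin and prints the names
# of requests whose last column (CONDITION) is exactly "Pending".
pending_csr_names() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage: oc get csr | pending_csr_names
```

Confirm that each printed name belongs to the new node before passing it to oc adm certificate approve.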
Click Compute → Nodes in the OpenShift Web Console, and confirm that the new node is in Ready state.
Apply the openshift-storage label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

Add a new worker node to localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node and remove the failed node.

oc edit -n local-storage-project localvolumediscovery auto-discover-devices
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]


Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Determine which localVolumeSet to edit.
Replace local-storage-project in the following commands with the name of your local storage project. The default project name is openshift-local-storage in OpenShift Data Foundation 4.6 and later. Previous versions use local-storage by default.
```
oc get -n local-storage-project localvolumeset
NAME          AGE
localblock   25h
```

Update the localVolumeSet definition to include the new node and remove the failed node.

oc edit -n local-storage-project localvolumeset localblock
[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]


Remember to save before exiting the editor.

In the above example, server3.example.com was removed and newnode.example.com is the new node.

Verify that the new localblock PV is available.

oc get pv | grep localblock
NAME               CAPACITY  ACCESS MODES  RECLAIM POLICY  STATUS     CLAIM                                      STORAGECLASS  AGE
local-pv-3e8964d3  931Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-2-0-79j94  localblock    25h
local-pv-414755e0  931Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-1-0-959rp  localblock    25h
local-pv-b481410   931Gi     RWO           Delete          Available                                             localblock    3m24s
local-pv-d9c5cbd6  931Gi     RWO           Delete          Bound      openshift-storage/ocs-deviceset-0-0-nvs68  localblock    25h

Change to the openshift-storage project.
```
oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
1. Identify the PVC, because you need to delete the PV associated with that PVC later.
  $ osd_id_to_remove=1
  $ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
  where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-1.
  Example output:
  ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
  ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
  In this example, the PVC name is ocs-deviceset-localblock-0-data-0-g2mmc.
2. Remove the failed OSD from the cluster.
  $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} | oc create -f -
  You can remove more than one OSD by adding comma-separated OSD IDs in the command, for example, FAILED_OSD_IDS=0,1,2.
  Warning
  This step results in OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.
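When more than one OSD has to be removed, the comma-separated value for FAILED_OSD_IDS can be built from a list of IDs. The following is a minimal sketch; the function name join_osd_ids is hypothetical.

```shell
# Hypothetical helper: join the OSD IDs given as arguments with commas.
# The subshell keeps the IFS change from leaking into the calling shell.
join_osd_ids() {
  ( IFS=,; printf '%s\n' "$*" )
}

# Usage: FAILED_OSD_IDS="$(join_osd_ids 0 1 2)"
```

Because the removal is destructive, print the joined value and check it before passing it to oc process.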

Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal pod.

A status of Completed confirms that the OSD removal job succeeded.

oc get pod -l job-name=ocs-osd-removal-osd_id_to_remove -n openshift-storage


Note

If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

oc logs -l job-name=ocs-osd-removal-osd_id_to_remove -n openshift-storage --tail=-1


It may be necessary to manually clean up the removed OSD as follows:

ceph osd crush remove osd.osd_id_to_remove
ceph osd rm osd_id_to_remove
ceph auth del osd.osd_id_to_remove
ceph osd crush rm osd_id_to_remove


Delete the PV associated with the failed node.

Identify the PV associated with the PVC.

The PVC name must be identical to the name that is obtained while removing the failed OSD from the cluster.

oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
local-pv-5c9b8982  500Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-localblock-0-data-0-g2mmc  localblock  24h  worker-0
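
To match the Released PV against the PVC name recorded earlier, the CLAIM column can be compared directly instead of by eye. The following is a sketch, assuming the column order shown in the output above (name in field 1, status in field 5, claim in field 6); the function name released_pv_for_pvc is hypothetical.

```shell
# Hypothetical helper: reads `oc get pv -L kubernetes.io/hostname` output on
# stdin and prints the names of Released PVs whose claim is the given PVC.
released_pv_for_pvc() {
  awk -v pvc="$1" '$5 == "Released" {
    n = split($6, parts, "/")   # CLAIM has the form <namespace>/<pvc_name>
    if (parts[n] == pvc) print $1
  }'
}

# Usage: oc get pv -L kubernetes.io/hostname | released_pv_for_pvc <pvc_name>
```

An empty result means that no Released PV is bound to that PVC name, in which case recheck the PVC name before deleting any PV.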


If there is a PV in Released state, delete it.

oc delete pv <persistent-volume>


For example:

oc delete pv local-pv-5c9b8982
persistentvolume "local-pv-5c9b8982" deleted


Identify the crashcollector pod deployment.

oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage


If there is an existing crashcollector pod deployment, delete it.

oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage


Delete the ocs-osd-removal job.

oc delete job ocs-osd-removal-${osd_id_to_remove}


Example output:

job.batch "ocs-osd-removal-0" deleted


Verification steps

Verify that the new node is present in the output:

oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1


Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that new Object Storage Device (OSD) pods are running on the replacement node:

oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd


Optional: If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside the one or more ocs-deviceset names.
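On hosts with many block devices, the relevant lsblk lines can be filtered out. The following is a minimal sketch, assuming the default lsblk columns in which TYPE is the sixth field; the function name deviceset_types is hypothetical.

```shell
# Hypothetical helper: reads default `lsblk` output on stdin and prints the
# name and TYPE column of every entry whose name contains "ocs-deviceset".
# The name keeps any tree-drawing prefix that lsblk adds.
deviceset_types() {
  awk '$1 ~ /ocs-deviceset/ { print $1, $6 }'
}

# Usage: lsblk | deviceset_types
```

Each printed line is expected to end in crypt on an encrypted cluster. If nothing is printed, no device name on the host contains ocs-deviceset.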
If the verification steps fail, contact Red Hat Support.

2.2.2. Replacing failed nodes on IBM Z or IBM® LinuxONE infrastructure

Procedure

Log in to the OpenShift Web Console, and click Compute → Nodes.
Identify the faulty node, and click on its Machine Name.
Click Actions → Edit Annotations, and click Add More.
Add machine.openshift.io/exclude-node-draining, and click Save.
Click Actions → Delete Machine, and click Delete.
A new machine is automatically created. Wait for the new machine to start.
Important
This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when you label the new node and it is functional.
Click Compute → Nodes. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""

<new_node_name>
Specify the name of the new node.

Verification steps

Verify that the new node is present in the output:

oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1


Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all the other required OpenShift Data Foundation pods are in Running state.

Verify that new Object Storage Device (OSD) pods are running on the replacement node:

oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd


Optional: If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside the ocs-deviceset names.
If the verification steps fail, contact Red Hat Support.

2.3. Replacing storage nodes on IBM Power infrastructure

For OpenShift Data Foundation, you can perform node replacement proactively for an operational node, and reactively for a failed node, for the deployments related to IBM Power.

2.3.1. Replacing an operational or failed storage node on IBM Power

Prerequisites

Ensure that the replacement nodes are configured with similar infrastructure and resources to the node that you replace.
You must be logged into the OpenShift Container Platform cluster.

Procedure

Identify the node, and get the labels on the node that you need to replace:
```
oc get nodes --show-labels | grep <node_name>
```
<node_name>
Specify the name of the node that you need to replace.
Identify the mon (if any), and Object Storage Device (OSD) pods that are running on the node that you need to replace:
```
oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step:

For example:

oc scale deployment rook-ceph-mon-a --replicas=0 -n openshift-storage


oc scale deployment rook-ceph-osd-1 --replicas=0 -n openshift-storage


oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage


Mark the node as unschedulable:
```
oc adm cordon <node_name>
```

Remove the pods which are in Terminating state:

oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'


Drain the node:

oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets


Delete the node:
```
oc delete node <node_name>
```
Get a new IBM Power machine with the required infrastructure. See Installing a cluster on IBM Power.
Create a new OpenShift Container Platform node using the new IBM Power machine.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
oc get csr
```
Approve all the required OpenShift Container Platform CSRs for the new node:
```
oc adm certificate approve <certificate_name>
```
<certificate_name>
Specify the name of the CSR.
Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=''

<new_node_name>
Specify the name of the new node.

Identify the namespace where OpenShift local storage operator is installed, and assign it to the local_storage_project variable:

local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)


To confirm the value of the variable:

echo $local_storage_project

Example output:

openshift-local-storage

Add the new worker node to the localVolume.

Determine the localVolume you need to edit:
```
oc get -n $local_storage_project localvolume
```
Example output:
```
NAME           AGE
localblock    25h
```

Update the localVolume definition to include the new node, and remove the failed node:

oc edit -n $local_storage_project localvolume localblock


Example output:

[...]
    nodeSelector:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              #- worker-0
              - worker-1
              - worker-2
              - worker-3
[...]

Remember to save before exiting the editor.

In this example, worker-0 is removed and worker-3 is the new node.

Verify that the new localblock Persistent Volume (PV) is available:

oc get pv | grep localblock


Example output:

NAME               CAPACITY  ACCESSMODES  RECLAIMPOLICY  STATUS     CLAIM                                    STORAGECLASS  AGE
local-pv-3e8964d3  500Gi     RWO          Delete         Bound      ocs-deviceset-localblock-2-data-0-mdbg9  localblock    25h
local-pv-414755e0  500Gi     RWO          Delete         Bound      ocs-deviceset-localblock-1-data-0-4cslf  localblock    25h
local-pv-b481410   500Gi     RWO          Delete         Available                                           localblock    3m24s
local-pv-5c9b8982  500Gi     RWO          Delete         Bound      ocs-deviceset-localblock-0-data-0-g2mmc  localblock    25h

Navigate to the openshift-storage project:
```
oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.
1. Identify the Persistent Volume Claim (PVC):
  $ osd_id_to_remove=1
  $ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
  where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix.
  In this example, the deployment name is rook-ceph-osd-1.
  Example output:
  ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
  ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
2. Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
  $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -
  <failed_osd_id>
  Is the integer in the pod name immediately after the rook-ceph-osd prefix. You can add comma-separated OSD IDs in the command to remove more than one OSD, for example, FAILED_OSD_IDS=0,1,2.
  The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
  Warning
  This step results in the OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.
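The grep in step 1 prints the ceph.rook.io/pvc label once for every place it appears in the deployment, so the same PVC name can show up more than once. The following is a minimal sketch that reduces that output to distinct PVC names, which is convenient when you later look for the matching PV. The function name pvc_names is hypothetical.

```shell
# Hypothetical helper: reads the `grep ceph.rook.io/pvc` output from step 1 on
# stdin and prints each distinct PVC name once.
pvc_names() {
  awk '$1 == "ceph.rook.io/pvc:" { print $2 }' | sort -u
}

# Usage:
#   oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} \
#     | grep ceph.rook.io/pvc | pvc_names
```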
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job has succeeded.
```
oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'


Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
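
When several OSDs are removed in one job, it can be easier to compare IDs than to read the full log lines. The following is a minimal sketch that extracts the OSD ID from each line of the form shown above. The function name removed_osd_ids is hypothetical, and unlike the egrep -i filter it matches the message case-sensitively.

```shell
# Hypothetical helper: reads the removal job log on stdin and prints the ID
# from every "completed removal of OSD <id>" line.
removed_osd_ids() {
  sed -n 's/.*completed removal of OSD \([0-9][0-9]*\).*/\1/p'
}

# Usage:
#   oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | removed_osd_ids
```

Compare the printed IDs with the value that you passed in FAILED_OSD_IDS.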


Important

If the ocs-osd-removal-job fails, and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1


Delete the PV associated with the failed node.

Identify the PV associated with the PVC:

oc get pv -L kubernetes.io/hostname | grep localblock | grep Released


Example output:

local-pv-5c9b8982  500Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-localblock-0-data-0-g2mmc  localblock  24h  worker-0


The PVC name must be identical to the name that is obtained while removing the failed OSD from the cluster.

If there is a PV in Released state, delete it:
```
oc delete pv <persistent_volume>
```
For example:
```
oc delete pv local-pv-5c9b8982
```
Example output:
```
persistentvolume "local-pv-5c9b8982" deleted
```

Identify the crashcollector pod deployment:

oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage


If there is an existing crashcollector pod deployment, delete it:

oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage


Delete the ocs-osd-removal-job:

oc delete -n openshift-storage job ocs-osd-removal-job


Example output:

job.batch "ocs-osd-removal-job" deleted


Verification steps

Verify that the new node is present in the output:

oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1


Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created and is in the Running state:

oc get pod -n openshift-storage | grep mon


Example output:

rook-ceph-mon-b-74f6dc9dd6-4llzq                                   1/1     Running     0          6h14m
rook-ceph-mon-c-74948755c-h7wtx                                    1/1     Running     0          4h24m
rook-ceph-mon-d-598f69869b-4bv49                                   1/1     Running     0          162m


The OSD and monitor pod might take several minutes to get to the Running state.
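
While waiting, the number of monitor pods that have reached Running can be counted from the same command output. The following is a minimal sketch, assuming the default column order of oc get pod in which STATUS is the third field; the function name running_mon_count is hypothetical.

```shell
# Hypothetical helper: reads `oc get pod -n openshift-storage` output on stdin
# and prints how many rook-ceph-mon pods report STATUS Running.
running_mon_count() {
  awk '$1 ~ /^rook-ceph-mon-/ && $3 == "Running" { n++ } END { print n + 0 }'
}

# Usage: oc get pod -n openshift-storage | running_mon_count
```

For the example output above, which lists three monitor pods in Running state, the count is 3.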

Verify that the new OSD pods are running on the replacement node:

oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd


Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for the selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside the ocs-deviceset names.
If the verification steps fail, contact Red Hat Support.

2.4. Replacing storage nodes on VMware infrastructure

To replace an operational node, see:
- Section 2.4.1, “Replacing an operational node on VMware user-provisioned infrastructure”.
- Section 2.4.2, “Replacing an operational node on VMware installer-provisioned infrastructure”.
To replace a failed node, see:
- Section 2.4.3, “Replacing a failed node on VMware user-provisioned infrastructure”.
- Section 2.4.4, “Replacing a failed node on VMware installer-provisioned infrastructure”.

2.4.1. Replacing an operational node on VMware user-provisioned infrastructure

Prerequisites

Ensure that the replacement nodes are configured with similar infrastructure, resources, and disks to the node that you replace.
You must be logged into the OpenShift Container Platform cluster.

Procedure

Identify the node, and get the labels on the node that you need to replace:
```
oc get nodes --show-labels | grep <node_name>
```
<node_name>
Specify the name of the node that you need to replace.
Identify the monitor pod (if any), and OSDs that are running on the node that you need to replace:
```
oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step:

For example:

oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage


oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage


oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage


Mark the node as unschedulable:
```
oc adm cordon <node_name>
```

Drain the node:

oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets


Delete the node:
```
oc delete node <node_name>
```
Log in to VMware vSphere and terminate the Virtual Machine (VM) that you have identified.
Create a new VM on VMware vSphere with the required infrastructure. See Infrastructure requirements.
Create a new OpenShift Container Platform worker node using the new VM.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
oc get csr
```
Approve all the required OpenShift Container Platform CSRs for the new node:
```
oc adm certificate approve <certificate_name>
```
<certificate_name>
Specify the name of the CSR.
Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
Navigate to the openshift-storage project:
```
oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -
```
<failed_osd_id>
Is the integer in the pod name immediately after the rook-ceph-osd prefix.
To remove more than one OSD, pass the OSD IDs as a comma-separated list, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
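
To illustrate how the value of FAILED_OSD_IDS relates to pod names, the sketch below turns a list of pod names into a comma-separated list of OSD IDs. It assumes that OSD pod names follow the rook-ceph-osd-<id>-<suffix> pattern described above; the helper name `osd_ids_from_pods` is hypothetical, and the pod names in the usage comment are invented.

```shell
# Read pod names (or full `oc get pods` rows) on standard input and print the
# OSD IDs as one comma-separated line, suitable for FAILED_OSD_IDS.
# Only names of the form rook-ceph-osd-<digits>-... are considered, so
# rook-ceph-osd-prepare-* pods and mon pods are ignored.
osd_ids_from_pods() {
  sed -n 's/^rook-ceph-osd-\([0-9][0-9]*\)-.*/\1/p' | paste -sd, -
}

# Usage with invented pod names:
#   printf '%s\n' rook-ceph-osd-0-6d77d6c7c6-m8xj6 rook-ceph-osd-2-5f7b9c-abcde | osd_ids_from_pods
```

Check the resulting list against the pods that you identified on the node before passing it to the removal template.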
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important

If the ocs-osd-removal-job fails, and the pod is not in the expected Completed state, check the pod logs for further debugging:

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Identify the Persistent Volume (PV) associated with the Persistent Volume Claim (PVC):

# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released

Example output:

local-pv-d6bf175b  1490Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  compute-1

If there is a PV in Released state, delete it:

# oc delete pv <persistent_volume>

For example:

# oc delete pv local-pv-d6bf175b

Example output:

persistentvolume "local-pv-d6bf175b" deleted
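
The two chained `grep` filters match the words localblock and Released anywhere in a row. As a stricter alternative, the following minimal sketch compares whole columns. It assumes the column layout of the example output above (STATUS in column 5, STORAGECLASS in column 7); the helper name `released_localblock_pvs` is hypothetical.

```shell
# Read `oc get pv` rows on standard input and print the names of PVs whose
# STATUS column is exactly "Released" and whose STORAGECLASS column is
# exactly "localblock". Column positions are an assumption taken from the
# example output in this procedure.
released_localblock_pvs() {
  awk '$5 == "Released" && $7 == "localblock" { print $1 }'
}

# Usage:
#   oc get pv -L kubernetes.io/hostname | released_localblock_pvs
```

The function only prints names; deleting a PV remains a separate, deliberate step.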

Identify the crashcollector pod deployment:

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

If there is an existing crashcollector pod deployment, delete it:

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

Delete the ocs-osd-removal-job:

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created, and is in the Running state:

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running   0          38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running   0          38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running   0          4m8s

OSD and monitor pods might take several minutes to get to the Running state.

Verify that new OSD pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for each selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside each ocs-deviceset name.
If the verification steps fail, contact Red Hat Support.

2.4.2. Replacing an operational node on VMware installer-provisioned infrastructure

Prerequisites

Ensure that the replacement nodes are configured with similar infrastructure, resources, and disks to the node that you replace.
You must be logged in to the OpenShift Container Platform cluster.

Procedure

Log in to the OpenShift Web Console, and click Compute → Nodes.
Identify the node that you need to replace. Take a note of its Machine Name.
Get labels on the node:
```
$ oc get nodes --show-labels | grep <node_name>
```
<node_name>
Specify the name of the node that you need to replace.
Identify the mon (if any), and Object Storage Devices (OSDs) that are running on the node:
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods that you identified in the previous step:

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage

$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage

$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage

Mark the node as unschedulable:
```
$ oc adm cordon <node_name>
```

Drain the node:

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Click Compute → Machines. Search for the required machine.
Beside the required machine, click Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important
This activity might take 5 to 10 minutes or more.
Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
Physically add a new device to the node.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
Specify the name of the new node.

Identify the namespace where the OpenShift local storage operator is installed, and assign it to the local_storage_project variable:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

For example:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project

Example output:

openshift-local-storage
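
The `grep local` filter returns every namespace whose name contains the string local, so the variable can end up holding several lines if more than one namespace matches. The following is a minimal sketch of a stricter lookup that succeeds only when exactly one distinct namespace matches. The helper name `single_local_namespace` is hypothetical, and the sketch assumes a shell that supports `local`, such as bash.

```shell
# Read `oc get csv --all-namespaces` output on standard input and print the
# single namespace whose name contains "local".
# Returns a non-zero status if zero or several distinct namespaces match.
single_local_namespace() {
  local matches count
  matches=$(awk '{ print $1 }' | grep local | sort -u || true)
  count=$(printf '%s\n' "$matches" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    echo "expected exactly one namespace containing 'local', found $count" >&2
    return 1
  fi
  printf '%s\n' "$matches"
}

# Usage:
#   local_storage_project=$(oc get csv --all-namespaces | single_local_namespace)
```

If the function reports more than one match, set local_storage_project by hand to the namespace where the local storage operator is installed.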

Add a new worker node to the localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node and remove the failed node.

# oc edit -n $local_storage_project localvolumediscovery auto-discover-devices

Example output:

[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In this example, server3.example.com is removed, and newnode.example.com is the new node.

Determine the localVolumeSet you need to edit:
```
# oc get -n $local_storage_project localvolumeset
```
Example output:
```
NAME          AGE
localblock   25h
```

Update the localVolumeSet definition to include the new node and remove the failed node:

# oc edit -n $local_storage_project localvolumeset localblock

Example output:

[...]
   nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In this example, server3.example.com is removed, and newnode.example.com is the new node.

Verify that the new localblock Persistent Volume (PV) is available:

$ oc get pv | grep localblock | grep Available

Example output:

local-pv-551d950     512Gi    RWO    Delete  Available            localblock     26s

Navigate to the openshift-storage project:
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -
```
<failed_osd_id>
Is the integer in the pod name immediately after the rook-ceph-osd prefix.
To remove more than one OSD, pass the OSD IDs as a comma-separated list, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```
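
For scripted runs, the status check can be expressed as an exit code instead of being read from the table. The following is a minimal sketch, assuming the default `oc get pod` column layout (NAME, READY, STATUS, RESTARTS, AGE); the helper name `removal_job_completed` is hypothetical.

```shell
# Read `oc get pod -l job-name=ocs-osd-removal-job` output on standard input.
# Exit status 0 means at least one removal job pod is listed and every listed
# removal job pod has STATUS "Completed"; any other situation returns 1.
removal_job_completed() {
  awk '$1 ~ /^ocs-osd-removal-job/ { rows++; if ($3 != "Completed") bad++ }
       END { exit !(rows > 0 && bad == 0) }'
}

# Usage:
#   if oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage | removal_job_completed; then
#     echo "OSD removal job completed"
#   else
#     echo "OSD removal job not completed; check the pod logs" >&2
#   fi
```

An empty listing is treated as a failure so that a missing job is not mistaken for a successful one.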

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important

If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Identify the PV associated with the Persistent Volume Claim (PVC):

# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released

Example output:

local-pv-d6bf175b  1490Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  compute-1

If there is a PV in Released state, delete it:

# oc delete pv <persistent_volume>

For example:

# oc delete pv local-pv-d6bf175b

Example output:

persistentvolume "local-pv-d6bf175b" deleted

Identify the crashcollector pod deployment:

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

If there is an existing crashcollector pod deployment, delete it:

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

Delete the ocs-osd-removal-job:

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created and is in the Running state.

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running   0          38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running   0          38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running   0          4m8s

OSD and monitor pods might take several minutes to get to the Running state.

Verify that new OSD pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for each selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside each ocs-deviceset name.
If the verification steps fail, contact Red Hat Support.
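
The encryption check in the optional step can also be done without reading the `lsblk` tree by eye. The following is a minimal sketch, assuming that encrypted OSD devices appear with ocs-deviceset in their name and with crypt in the TYPE column, as the step describes. The helper name `encrypted_devicesets` is hypothetical, and the device name in the comment is invented.

```shell
# Read `lsblk -rno NAME,TYPE` output (raw format, no header, two columns) on
# standard input and print the ocs-deviceset devices whose type is "crypt".
encrypted_devicesets() {
  awk '$1 ~ /ocs-deviceset/ && $2 == "crypt" { print $1 }'
}

# Usage inside the chroot environment on the node:
#   lsblk -rno NAME,TYPE | encrypted_devicesets
# An invented example of a matching row:
#   ocs-deviceset-0-data-0-abcde-block-dmcrypt crypt
```

No output means that no ocs-deviceset device with the crypt type was found on that node.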

2.4.3. Replacing a failed node on VMware user-provisioned infrastructure

Prerequisites

Ensure that the replacement nodes are configured with similar infrastructure, resources, and disks to the node that you replace.
You must be logged in to the OpenShift Container Platform cluster.

Procedure

Identify the node, and get the labels on the node that you need to replace:
```
$ oc get nodes --show-labels | grep <node_name>
```
<node_name>
Specify the name of the node that you need to replace.
Identify the monitor pod (if any), and OSDs that are running on the node that you need to replace:
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step:

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage

$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage

$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage

Mark the node as unschedulable:
```
$ oc adm cordon <node_name>
```

Remove the pods which are in Terminating state:

$ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
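
The one-liner above force-deletes pods as soon as `awk` sees them. To preview what would be deleted first, the following minimal sketch prints the same `oc delete` commands instead of running them. It assumes the `oc get pods -A -o wide` column layout that the one-liner relies on (namespace in column 1, pod name in column 2, status in column 4); the helper name `force_delete_commands` is hypothetical.

```shell
# Read `oc get pods -A -o wide` output on standard input and print, without
# executing it, one force-delete command for every pod that is in Terminating
# state and whose row mentions the given node name.
# Like the grep in the original command, the node name is matched as a
# substring of the row.
force_delete_commands() {
  awk -v node="$1" '$4 == "Terminating" && index($0, node) {
    printf "oc -n %s delete pods %s --grace-period=0 --force\n", $1, $2
  }'
}

# Usage:
#   oc get pods -A -o wide | force_delete_commands <node_name>
```

After reviewing the printed commands, run them by hand or pipe the output to `sh`.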

Drain the node:

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Delete the node:
```
$ oc delete node <node_name>
```
Log in to VMware vSphere and terminate the Virtual Machine (VM) that you have identified.
Create a new VM on VMware vSphere with the required infrastructure. See Infrastructure requirements.
Create a new OpenShift Container Platform worker node using the new VM.
Check for the Certificate Signing Requests (CSRs) related to OpenShift Container Platform that are in Pending state:
```
$ oc get csr
```
Approve all the required OpenShift Container Platform CSRs for the new node:
```
$ oc adm certificate approve <certificate_name>
```
<certificate_name>
Specify the name of the CSR.
Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
Navigate to the openshift-storage project:
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -
```
<failed_osd_id>
Is the integer in the pod name immediately after the rook-ceph-osd prefix.
To remove more than one OSD, pass the OSD IDs as a comma-separated list, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
# oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important

If the ocs-osd-removal-job fails, and the pod is not in the expected Completed state, check the pod logs for further debugging:

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Identify the Persistent Volume (PV) associated with the Persistent Volume Claim (PVC):

# oc get pv -L kubernetes.io/hostname | grep localblock | grep Released

Example output:

local-pv-d6bf175b  1490Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  compute-1

If there is a PV in Released state, delete it:

# oc delete pv <persistent_volume>

For example:

# oc delete pv local-pv-d6bf175b

Example output:

persistentvolume "local-pv-d6bf175b" deleted

Identify the crashcollector pod deployment:

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

If there is an existing crashcollector pod deployment, delete it:

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

Delete the ocs-osd-removal-job:

# oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created, and is in the Running state:

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running   0          38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running   0          38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running   0          4m8s

OSD and monitor pods might take several minutes to get to the Running state.

Verify that new OSD pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage| egrep -i <new_node_name> | egrep osd
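
The `egrep osd` filter also matches rook-ceph-osd-prepare pods, which normally finish in Completed state and are not the OSD daemons themselves. The following minimal sketch narrows the listing to OSD daemon pods and prints their status. It assumes that OSD pod names follow the rook-ceph-osd-<id>-<suffix> pattern and that STATUS is the third column of the namespaced `oc get pods -o wide` output; the helper name `osd_pods_on_node` is hypothetical.

```shell
# Read `oc get pods -o wide -n openshift-storage` output on standard input and
# print "<pod name> <status>" for every OSD daemon pod whose row mentions the
# given node name. rook-ceph-osd-prepare-* pods are excluded because the
# pattern requires digits directly after "rook-ceph-osd-".
# The node name is matched as a substring of the row.
osd_pods_on_node() {
  awk -v node="$1" '$1 ~ /^rook-ceph-osd-[0-9]+-/ && index($0, node) { print $1, $3 }'
}

# Usage:
#   oc get pods -o wide -n openshift-storage | osd_pods_on_node <new_node_name>
```

Every printed pod is expected to show Running once the new OSD has started.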

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for each selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside each ocs-deviceset name.
If the verification steps fail, contact Red Hat Support.

2.4.4. Replacing a failed node on VMware installer-provisioned infrastructure

Prerequisites

Ensure that the replacement nodes are configured with similar infrastructure, resources, and disks to the node that you replace.
You must be logged in to the OpenShift Container Platform cluster.

Procedure

Log in to the OpenShift Web Console, and click Compute → Nodes.
Identify the node that you need to replace. Take a note of its Machine Name.
Get the labels on the node:
```
$ oc get nodes --show-labels | grep <node_name>
```
<node_name>
Specify the name of the node that you need to replace.
Identify the mon (if any) and Object Storage Devices (OSDs) that are running on the node:
```
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
```

Scale down the deployments of the pods identified in the previous step:

For example:

$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage

$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage

$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage

Mark the node as unschedulable:
```
$ oc adm cordon <node_name>
```

Remove the pods which are in Terminating state:

$ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'

Drain the node:

$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets

Click Compute → Machines. Search for the required machine.
Beside the required machine, click Action menu (⋮) → Delete Machine.
Click Delete to confirm the machine deletion. A new machine is automatically created.
Wait for the new machine to start and transition into Running state.
Important
This activity might take 5 to 10 minutes or more.
Click Compute → Nodes in the OpenShift Web Console. Confirm that the new node is in Ready state.
Physically add a new device to the node.
Apply the OpenShift Data Foundation label to the new node using any one of the following:
From the user interface
For the new node, click Action Menu (⋮) → Edit Labels.
Add cluster.ocs.openshift.io/openshift-storage, and click Save.
From the command-line interface
Apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
<new_node_name>
Specify the name of the new node.
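
To confirm that the label was applied, the node list can be filtered on the exact label rather than on a substring of the whole row. The following is a minimal sketch, assuming that `oc get nodes --show-labels` prints the comma-separated labels in the last column and shows an empty-valued label as `key=`; the helper name `storage_labeled_nodes` is hypothetical.

```shell
# Read `oc get nodes --show-labels` output on standard input and print the
# names of nodes that carry the OpenShift Data Foundation storage label with
# an empty value. The last column is split on commas so that the label is
# compared as a whole entry.
storage_labeled_nodes() {
  awk '{
    n = split($NF, labels, ",")
    for (i = 1; i <= n; i++)
      if (labels[i] == "cluster.ocs.openshift.io/openshift-storage=") { print $1; break }
  }'
}

# Usage:
#   oc get nodes --show-labels | storage_labeled_nodes
```

The new node name is expected to appear in the output after the label has been applied.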

Identify the namespace where the OpenShift local storage operator is installed, and assign it to the local_storage_project variable:

local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

Copy to Clipboard

Toggle word wrap

For example:

$ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
$ echo $local_storage_project

Example output:

openshift-local-storage
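
The `grep local` filter in the command above assumes that exactly one namespace matches. The pipeline can print the same namespace more than once when several CSVs are visible in it, and it would print several names if another namespace also contains "local". The following is a minimal sketch of a guard for that case; `pick_single_namespace` is a hypothetical helper, not part of the product.

```shell
#!/bin/sh
# Hypothetical guard: read candidate namespace names on stdin (one per line),
# de-duplicate them, and succeed only if exactly one distinct name remains.
pick_single_namespace() {
    names=$(sort -u | sed '/^$/d')
    count=$(printf '%s\n' "$names" | sed '/^$/d' | wc -l | tr -d ' ')
    if [ "$count" -ne 1 ]; then
        echo "expected exactly one namespace, found $count" >&2
        return 1
    fi
    printf '%s\n' "$names"
}

# Demo: the same namespace listed twice collapses to a single value.
printf 'openshift-local-storage\nopenshift-local-storage\n' | pick_single_namespace
```

In the procedure, the helper would sit at the end of the pipeline, for example `local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local | pick_single_namespace)`.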

Add a new worker node to the localVolumeDiscovery and localVolumeSet.

Update the localVolumeDiscovery definition to include the new node and remove the failed node:

$ oc edit -n $local_storage_project localvolumediscovery auto-discover-devices

Example output:

[...]
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In this example, server3.example.com is removed and newnode.example.com is the new node.

Determine the localVolumeSet you need to edit.
```
$ oc get -n $local_storage_project localvolumeset
```
Example output:
```
NAME         AGE
localblock   25h
```

Update the localVolumeSet definition to include the new node and remove the failed node:

$ oc edit -n $local_storage_project localvolumeset localblock

Example output:

[...]
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - server1.example.com
            - server2.example.com
            #- server3.example.com
            - newnode.example.com
[...]

Remember to save before exiting the editor.

In this example, server3.example.com is removed and newnode.example.com is the new node.
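
As a non-interactive alternative to `oc edit`, the hostname list could be replaced with a JSON Patch. The following is a minimal sketch, not a documented procedure: `values_patch` is a hypothetical helper, and the patch path assumes the selector sits at `spec.nodeSelector` with a single term and a single match expression, as in the example output above. Verify the path against your own resource before using it.

```shell
#!/bin/sh
# Hypothetical helper: build a JSON Patch that replaces the hostname list of
# the first matchExpressions entry with the hostnames given as arguments.
values_patch() {
    list=""
    for host in "$@"; do
        # Append each hostname as a quoted JSON string, comma-separated.
        list="${list:+$list,}\"$host\""
    done
    # ASSUMPTION: path layout taken from the example output, not verified
    # against every LocalVolumeDiscovery or LocalVolumeSet definition.
    printf '[{"op":"replace","path":"/spec/nodeSelector/nodeSelectorTerms/0/matchExpressions/0/values","value":[%s]}]\n' "$list"
}

# Print the patch for the example: server3 dropped, newnode added.
values_patch server1.example.com server2.example.com newnode.example.com
```

The printed patch could then be passed to `oc patch -n $local_storage_project localvolumeset localblock --type=json -p "$(values_patch ...)"`, and likewise for the `localvolumediscovery` resource.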

Verify that the new localblock PV is available:

$ oc get pv | grep localblock | grep Available

Example output:

local-pv-551d950     512Gi    RWO    Delete  Available            localblock     26s

Navigate to the openshift-storage project:
```
$ oc project openshift-storage
```
Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required:
```
$ oc process -n openshift-storage ocs-osd-removal \
-p FAILED_OSD_IDS=<failed_osd_id> | oc create -f -
```
<failed_osd_id>
Is the integer in the pod name immediately after the rook-ceph-osd prefix.
To remove more than one OSD, pass a comma-separated list of OSD IDs, for example, FAILED_OSD_IDS=0,1,2.
The FORCE_OSD_REMOVAL value must be changed to true in clusters that only have three OSDs, or clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
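
The OSD ID can be read directly from the pod name, as described above. The following is a minimal sketch of that extraction; `osd_id_from_pod` and `join_osd_ids` are hypothetical helpers, and the pod names are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: extract the OSD ID from a rook-ceph-osd pod name,
# for example rook-ceph-osd-0-6d77d6c7c6-m8xj6 -> 0.
osd_id_from_pod() {
    # Match only names where digits directly follow the rook-ceph-osd- prefix,
    # which skips pods such as rook-ceph-osd-prepare-*.
    printf '%s\n' "$1" | sed -n 's/^rook-ceph-osd-\([0-9][0-9]*\)-.*$/\1/p'
}

# Join several IDs with commas for the FAILED_OSD_IDS parameter.
join_osd_ids() {
    ids=""
    for pod in "$@"; do
        id=$(osd_id_from_pod "$pod")
        [ -n "$id" ] && ids="${ids:+$ids,}$id"
    done
    printf '%s\n' "$ids"
}

# Invented pod names standing in for the failed OSD pods.
join_osd_ids rook-ceph-osd-0-6d77d6c7c6-m8xj6 rook-ceph-osd-2-7f9d8c5b4-abcde
```

The resulting string has the same form as the `FAILED_OSD_IDS=0,1,2` example.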
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.
A status of Completed confirms that the OSD removal job succeeded.
```
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
```

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
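
When several OSDs are removed at once, it can be useful to confirm that every requested ID appears in the log. The following is a minimal sketch under the assumption that each removal is logged in the format shown in the example output; `removed_osd_ids` and `all_removed` are hypothetical helpers, and the match is case-sensitive.

```shell
#!/bin/sh
# Hypothetical helper: print the OSD IDs reported as removed in a job log
# read from stdin, one ID per line.
removed_osd_ids() {
    sed -n 's/.*completed removal of OSD \([0-9][0-9]*\).*/\1/p'
}

# Succeed only if every ID in a comma-separated list appears in the log.
all_removed() {
    removed=$(removed_osd_ids)
    for id in $(printf '%s' "$1" | tr ',' ' '); do
        # -x requires a whole-line match, so ID 1 does not match ID 10.
        printf '%s\n' "$removed" | grep -qx "$id" || return 1
    done
}

# Log line taken from the example output above.
log='2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0'
printf '%s\n' "$log" | removed_osd_ids
printf '%s\n' "$log" | all_removed 0 && echo "all requested OSDs removed"
```

On a live cluster the input would come from `oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1`.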

Important

If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging:

For example:

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

Identify the PV associated with the Persistent Volume Claim (PVC):

$ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released

Example output:

local-pv-d6bf175b  1490Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  compute-1
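
To collect only the PV names from output like the above, the status column can be tested explicitly rather than matched anywhere on the line. The following is a minimal sketch; `released_localblock_pvs` is a hypothetical helper, and it assumes the column layout of the example output, where the status is the fifth field.

```shell
#!/bin/sh
# Hypothetical helper: print the names of localblock PVs whose STATUS column
# (field 5 in the example layout) is Released. Reads `oc get pv` rows on stdin.
released_localblock_pvs() {
    awk '$5 == "Released" && /localblock/ { print $1 }'
}

# First row is taken from the example output; the second row is invented.
rows='local-pv-d6bf175b  1490Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  compute-1
local-pv-551d950   512Gi   RWO  Delete  Available  localblock  26s  compute-2'

printf '%s\n' "$rows" | released_localblock_pvs
```

Each printed name is a candidate for the `oc delete pv` step.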

If there is a PV in Released state, delete it:

$ oc delete pv <persistent_volume>

For example:

$ oc delete pv local-pv-d6bf175b

Example output:

persistentvolume "local-pv-d6bf175b" deleted

Identify the crashcollector pod deployment:

$ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

If there is an existing crashcollector pod deployment, delete it:

$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

Delete the ocs-osd-removal-job:

$ oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Verification steps

Verify that the new node is present in the output:

$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1

Click Workloads → Pods. Confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*

Verify that all other required OpenShift Data Foundation pods are in Running state.

Ensure that the new incremental mon is created, and is in the Running state:

$ oc get pod -n openshift-storage | grep mon

Example output:

rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running   0          38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running   0          38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running   0          4m8s

OSD and monitor pods might take several minutes to reach the Running state.
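
Since the pods can take a while to settle, the mon check can be repeated until nothing is left outside the Running state. The following is a minimal sketch of the filtering part only; `mons_not_running` is a hypothetical helper that assumes the column layout of `oc get pod -n openshift-storage`, where the status is the third field.

```shell
#!/bin/sh
# Hypothetical helper: print the names of rook-ceph-mon pods whose STATUS
# column (field 3) is not Running. Empty output means all mons are Running.
mons_not_running() {
    awk '/^rook-ceph-mon-/ && $3 != "Running" { print $1 }'
}

# Rows taken from the example output above: all three mons are Running,
# so this prints nothing.
rows='rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running   0          38m
rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running   0          38m
rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running   0          4m8s'

printf '%s\n' "$rows" | mons_not_running
```

On a live cluster the input would come from `oc get pod -n openshift-storage`.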

Verify that new OSD pods are running on the replacement node:

$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
For each of the new nodes identified in the previous step, do the following:
1. Create a debug pod and open a chroot environment for each selected host:
  $ oc debug node/<node_name>
  $ chroot /host
2. Display the list of available block devices:
  $ lsblk
  Check for the crypt keyword beside each ocs-deviceset name.
If the verification steps fail, contact Red Hat Support.
