Chapter 14. Replacing storage nodes
You can choose one of the following procedures to replace storage nodes:
14.1. Replacing operational nodes on Google Cloud installer-provisioned infrastructure
Use this procedure to replace an operational node on Google Cloud installer-provisioned infrastructure (IPI).
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
- Mark the node as unschedulable using the following command:
$ oc adm cordon <node_name>
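Optionally, you can confirm that the node is marked unschedulable by listing the nodes; the cordoned node is expected to report a SchedulingDisabled status:
$ oc get nodes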
- Drain the node using the following command:
$ oc adm drain <node_name> --force --delete-emptydir-data=true --ignore-daemonsets
Important: This activity may take at least 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
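If you want to monitor the drain, you can optionally list the openshift-storage pods that are still scheduled on the node being replaced; this is a convenience check, and <node_name> is the same placeholder used above:
$ oc get pods -n openshift-storage -o wide | grep <node_name>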
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
- Wait for the new machine to start and transition into Running state.
Important: This activity may take at least 5-10 minutes or more.
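If you prefer to watch this from the command line, machines in an IPI cluster can usually be listed from the openshift-machine-api namespace; this is an optional check rather than part of the documented flow:
$ oc get machines -n openshift-machine-api -w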
- Click Compute → Nodes and confirm that the new node is in Ready state.
- Apply the OpenShift Data Foundation label to the new node using any one of the following:
  - From the user interface:
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface:
    Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Data Foundation pods are in Running state.
- Verify that new OSD pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
- Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
  For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s):
$ oc debug node/<node_name>
$ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s):
$ lsblk
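As an optional shortcut, the same check can be run non-interactively with a single oc debug invocation; this is a sketch that assumes the host binaries are available under /host and that <node_name> is the node being verified:
$ oc debug node/<node_name> -- chroot /host lsblk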
- If verification steps fail, contact Red Hat Support.
14.2. Replacing failed nodes on Google Cloud installer-provisioned infrastructure
Perform this procedure to replace a failed node which is not operational on Google Cloud installer-provisioned infrastructure (IPI) for OpenShift Data Foundation.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the faulty node and click on its Machine Name.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining and click Save.
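Equivalently, the annotation can be added from the command line; this is an optional sketch that assumes <machine_name> is the Machine Name you identified earlier:
$ oc annotate machine <machine_name> -n openshift-machine-api machine.openshift.io/exclude-node-draining=""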
- Click Actions → Delete Machine, and click Delete. A new machine is automatically created. Wait for the new machine to start.
Important: This activity may take at least 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
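If you would rather delete the machine from the command line, the same result can usually be achieved with oc; this is an optional sketch that reuses the <machine_name> placeholder from the previous steps:
$ oc delete machine <machine_name> -n openshift-machine-api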
- Click Compute → Nodes and confirm that the new node is in Ready state.
- Apply the OpenShift Data Foundation label to the new node using any one of the following:
  - From the web user interface:
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From the command line interface:
    Execute the following command to apply the OpenShift Data Foundation label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- Optional: If the failed Google Cloud instance is not removed automatically, terminate the instance from the Google Cloud console.
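For example, the instance can also be deleted with the gcloud CLI; this is a sketch that assumes gcloud is installed and authenticated, and that <instance_name> and <zone> are replaced with the values of the failed instance:
$ gcloud compute instances delete <instance_name> --zone=<zone>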
Verification steps
- Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Data Foundation pods are in Running state.
- Verify that new OSD pods are running on the replacement node:
$ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
- Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
  For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s):
$ oc debug node/<node_name>
$ chroot /host
  - Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s):
$ lsblk
- If verification steps fail, contact Red Hat Support.