Chapter 13. Replacing storage nodes
- To replace an operational node, see Section 13.1, “Replacing an operational node on Red Hat Virtualization installer-provisioned infrastructure”
- To replace a failed node, see Section 13.2, “Replacing a failed node on Red Hat Virtualization installer-provisioned infrastructure”
13.1. Replacing an operational node on Red Hat Virtualization installer-provisioned infrastructure
Use this procedure to replace an operational node on Red Hat Virtualization installer-provisioned infrastructure (IPI).
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
Procedure
- Log in to the OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
- Get the labels on the node to be replaced.

  $ oc get nodes --show-labels | grep <node_name>

- Identify the mon (if any) and OSDs that are running on the node to be replaced.

  $ oc get pods -n openshift-storage -o wide | grep -i <node_name>

- Scale down the deployments of the pods identified in the previous step. For example:

  $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
  $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
  $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage

- Mark the node as unschedulable.

  $ oc adm cordon <node_name>

- Drain the node.

  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
- Wait for the new machine to start and transition into Running state.

  Important: This activity might take at least 5-10 minutes or more.
- Click Compute → Nodes in the OpenShift web console. Confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:

  From the user interface:
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.

  From the command line interface:
  - Execute the following command to apply the OpenShift Container Storage label to the new node:

    $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- Add the local storage devices available on the new worker node to the OpenShift Container Storage StorageCluster.
  - Determine which localVolumeSet to edit. Replace local-storage-project in the following commands with the name of your local storage project. The default project name is openshift-local-storage in OpenShift Container Storage 4.6 and later. Previous versions use local-storage by default.

    # oc get -n local-storage-project localvolumeset
    NAME         AGE
    localblock   25h

  - Add the new node to the localVolumeSet definition. Remember to save before exiting the editor.
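The edit opens the localVolumeSet in an editor, for example with oc edit localvolumeset localblock -n local-storage-project, and the change is typically adding the new node's hostname to the set's node selector. A minimal sketch of the relevant part of the definition, assuming the default localblock set; the hostnames are illustrative placeholders, not values from this procedure:

```yaml
# Sketch only: hostnames below are illustrative placeholders.
apiVersion: local.storage.openshift.io/v1alpha1
kind: LocalVolumeSet
metadata:
  name: localblock
  namespace: openshift-local-storage
spec:
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - server1.example.com
              - server2.example.com
              - newnode.example.com   # hostname of the replacement node, added here
```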
  - Verify that the new localblock PV is available.
- Change to the openshift-storage project.

  $ oc project openshift-storage

- Remove the failed OSD from the cluster.
  $ oc process -n openshift-storage ocs-osd-removal \
      -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -

- Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

  # oc get pod -l job-name=ocs-osd-removal-failed-osd-id -n openshift-storage

  Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

  # oc logs -l job-name=ocs-osd-removal-failed-osd-id -n openshift-storage --tail=-1

- Delete the PV associated with the failed node.
  - Identify the PV associated with the PVC.

    # oc get -n openshift-storage pvc claim-name

    For example:

    # oc get -n openshift-storage pvc ocs-deviceset-0-0-nvs68
    NAME                      STATUS     VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    ocs-deviceset-0-0-nvs68   Released   local-pv-d9c5cbd6   931Gi      RWO            localblock     24h

  - Delete the PV.

    # oc delete pv <persistent-volume>

    For example:

    # oc delete pv local-pv-d9c5cbd6
    persistentvolume "local-pv-d9c5cbd6" deleted
- Delete the crashcollector pod deployment.

  $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage

- Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
  - Identify the name of the rook-ceph-operator pod.

    # oc get -n openshift-storage pod -l app=rook-ceph-operator

    Example output:

    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h

  - Delete the rook-ceph-operator pod.

    # oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

    Example output:

    pod "rook-ceph-operator-6f74fb5bff-2d982" deleted

  - Verify that the rook-ceph-operator pod is restarted.

    # oc get -n openshift-storage pod -l app=rook-ceph-operator

    Example output:

    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

  Creation of the new OSD and mon might take several minutes after the operator restarts.

- Delete the ocs-osd-removal job.

  # oc delete job ocs-osd-removal-${osd_id_to_remove}

  Example output:

  job.batch "ocs-osd-removal-0" deleted
Verification steps
- Execute the following command and verify that the new node is present in the output:

  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
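The filtering in this pipeline can be checked offline against canned `oc get nodes --show-labels`-style output: grep keeps only nodes carrying the storage label, and cut takes the first column (the node name). The node names below are illustrative placeholders:

```shell
# Two fake node rows: only node-a carries the OpenShift Container Storage label,
# so only its name should survive the grep | cut pipeline.
printf '%s\n' \
  'node-a Ready worker 5d v1.19.0 cluster.ocs.openshift.io/openshift-storage=,kubernetes.io/os=linux' \
  'node-b Ready worker 5d v1.19.0 kubernetes.io/os=linux' |
grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
```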
- Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Ensure that the new incremental mon is created and is in the Running state.

  $ oc get pod -n openshift-storage | grep mon

  Example output:

  rook-ceph-mon-c-64556f7659-c2ngc   1/1   Running   0   6h14m
  rook-ceph-mon-d-7c8b74dc4d-tt6hd   1/1   Running   0   4h24m
  rook-ceph-mon-e-57fb8c657-wg5f2    1/1   Running   0   162m

  The OSD and mon might take several minutes to reach the Running state.

- Verify that new OSD pods are running on the replacement node.

  $ oc get pods -o wide -n openshift-storage | egrep -i new-node-name | egrep osd

- (Optional) If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted. For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s).

    $ oc debug node/<node name>
    $ chroot /host

  - Run "lsblk" and check for the "crypt" keyword beside the ocs-deviceset name(s).

    $ lsblk

- If verification steps fail, contact Red Hat Support.
13.2. Replacing a failed node on Red Hat Virtualization installer-provisioned infrastructure
The ephemeral storage of Red Hat Virtualization for OpenShift Container Storage might cause data loss when an instance is powered off. Use this procedure to recover from such an instance power off on the Red Hat Virtualization platform.
Prerequisites
- Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
- You must be logged into the OpenShift Container Platform (RHOCP) cluster.
Procedure
- Log in to the OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
- Get the labels on the node to be replaced.

  $ oc get nodes --show-labels | grep <node_name>

- Identify the mon (if any) and OSDs that are running on the node to be replaced.

  $ oc get pods -n openshift-storage -o wide | grep -i <node_name>

- Scale down the deployments of the pods identified in the previous step. For example:

  $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
  $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
  $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage

- Mark the node as unschedulable.

  $ oc adm cordon <node_name>
$ oc adm cordon <node_name>Copy to Clipboard Copied! Toggle word wrap Toggle overflow Remove the pods which are in Terminating state.
oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'Copy to Clipboard Copied! Toggle word wrap Toggle overflow Drain the node.
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
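The awk one-liner in the Terminating-pod step above builds and runs an oc delete command for every row whose fourth column (the STATUS column in `oc get pods -A -o wide` output) is Terminating. Its selection logic can be checked offline against canned output; the namespace and pod names below are illustrative placeholders:

```shell
# Feed two fake "oc get pods -A -o wide"-style rows (NAMESPACE NAME READY STATUS ...)
# through the same column-4 test; print the command instead of executing it.
printf '%s\n' \
  'openshift-storage rook-ceph-osd-0-abcde 1/1 Terminating 0 5h' \
  'openshift-storage rook-ceph-mon-c-fghij 1/1 Running 0 5h' |
awk '{if ($4 == "Terminating") print "oc -n " $1 " delete pods " $2 " --grace-period=0 --force"}'
```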
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
- Wait for the new machine to start and transition into Running state.

  Important: This activity might take at least 5-10 minutes or more.
- Click Compute → Nodes in the OpenShift web console. Confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:

  From the user interface:
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.

  From the command line interface:
  - Execute the following command to apply the OpenShift Container Storage label to the new node:

    $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- Add the local storage devices available on the new worker node to the OpenShift Container Storage StorageCluster.
  - Determine which localVolumeSet to edit. Replace local-storage-project in the following commands with the name of your local storage project. The default project name is openshift-local-storage in OpenShift Container Storage 4.6 and later. Previous versions use local-storage by default.

    # oc get -n local-storage-project localvolumeset
    NAME         AGE
    localblock   25h

  - Add the new node to the localVolumeSet definition. Remember to save before exiting the editor.
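As in the previous section, the edit opens the localVolumeSet in an editor, for example with oc edit localvolumeset localblock -n local-storage-project, and the change is typically adding the new node's hostname to the set's node selector. A minimal sketch of the relevant part of the definition, assuming the default localblock set; the hostnames are illustrative placeholders, not values from this procedure:

```yaml
# Sketch only: hostnames below are illustrative placeholders.
apiVersion: local.storage.openshift.io/v1alpha1
kind: LocalVolumeSet
metadata:
  name: localblock
  namespace: openshift-local-storage
spec:
  nodeSelector:
    nodeSelectorTerms:
      - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
              - server1.example.com
              - server2.example.com
              - newnode.example.com   # hostname of the replacement node, added here
```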
  - Verify that the new localblock PV is available.
- Change to the openshift-storage project.

  $ oc project openshift-storage

- Remove the failed OSD from the cluster.
  $ oc process -n openshift-storage ocs-osd-removal \
      -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -

- Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

  # oc get pod -l job-name=ocs-osd-removal-failed-osd-id -n openshift-storage

  Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

  # oc logs -l job-name=ocs-osd-removal-failed-osd-id -n openshift-storage --tail=-1

- Delete the PV associated with the failed node.
  - Identify the PV associated with the PVC.

    # oc get -n openshift-storage pvc claim-name

    For example:

    # oc get -n openshift-storage pvc ocs-deviceset-0-0-nvs68
    NAME                      STATUS     VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    ocs-deviceset-0-0-nvs68   Released   local-pv-d9c5cbd6   931Gi      RWO            localblock     24h

  - Delete the PV.

    # oc delete pv <persistent-volume>

    For example:

    # oc delete pv local-pv-d9c5cbd6
    persistentvolume "local-pv-d9c5cbd6" deleted
- Delete the crashcollector pod deployment.

  $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage

- Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
  - Identify the name of the rook-ceph-operator pod.

    # oc get -n openshift-storage pod -l app=rook-ceph-operator

    Example output:

    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h

  - Delete the rook-ceph-operator pod.

    # oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

    Example output:

    pod "rook-ceph-operator-6f74fb5bff-2d982" deleted

  - Verify that the rook-ceph-operator pod is restarted.

    # oc get -n openshift-storage pod -l app=rook-ceph-operator

    Example output:

    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

  Creation of the new OSD and mon might take several minutes after the operator restarts.

- Delete the ocs-osd-removal job.

  # oc delete job ocs-osd-removal-${osd_id_to_remove}

  Example output:

  job.batch "ocs-osd-removal-0" deleted
Verification steps
- Execute the following command and verify that the new node is present in the output:

  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Ensure that the new incremental mon is created and is in the Running state.

  $ oc get pod -n openshift-storage | grep mon

  Example output:

  rook-ceph-mon-c-64556f7659-c2ngc   1/1   Running   0   6h14m
  rook-ceph-mon-d-7c8b74dc4d-tt6hd   1/1   Running   0   4h24m
  rook-ceph-mon-e-57fb8c657-wg5f2    1/1   Running   0   162m

  The OSD and mon might take several minutes to reach the Running state.

- Verify that new OSD pods are running on the replacement node.

  $ oc get pods -o wide -n openshift-storage | egrep -i new-node-name | egrep osd

- (Optional) If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted. For each of the new nodes identified in the previous step, do the following:
  - Create a debug pod and open a chroot environment for the selected host(s).

    $ oc debug node/<node name>
    $ chroot /host

  - Run "lsblk" and check for the "crypt" keyword beside the ocs-deviceset name(s).

    $ lsblk

- If verification steps fail, contact Red Hat Support.