Chapter 9. Replacing storage nodes
Depending on the type of your deployment, you can choose one of the following procedures to replace storage nodes:
For dynamically provisioned storage nodes deployed on AWS, see:
- Section 9.1.1, “Replacing operational nodes on AWS user-provisioned infrastructures”
- Section 9.1.2, “Replacing failed nodes on AWS user-provisioned infrastructures”
- Section 9.1.3, “Replacing operational nodes on AWS installer-provisioned infrastructures”
- Section 9.1.4, “Replacing failed nodes on AWS installer-provisioned infrastructures”
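For dynamically provisioned storage nodes deployed on VMware, see:
- Section 9.2.1, “Replacing operational nodes on VMware user-provisioned infrastructures”
- Section 9.2.2, “Replacing failed nodes on VMware user-provisioned infrastructures”
For storage nodes deployed using local storage devices, see:
- Section 9.3.1, “Replacing failed storage nodes on Amazon EC2 infrastructure”
- Section 9.3.2, “Replacing failed storage nodes on VMware infrastructure”
- Section 9.3.3, “Replacing failed storage nodes on bare metal infrastructure”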
9.1. Dynamically provisioned OpenShift Container Storage deployed on AWS infrastructures
9.1.1. Replacing operational nodes on AWS user-provisioned infrastructures
Perform this procedure to replace an operational node on AWS user-provisioned infrastructure.
Procedure
- Identify the node that needs to be replaced.
- Mark the node as unschedulable using the following command:
  $ oc adm cordon <node_name>
- Drain the node using the following command:
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  Important: This activity may take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Delete the node using the following command:
  $ oc delete nodes <node_name>
- Create a new AWS machine instance with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform node using the new AWS machine instance.
- Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
  $ oc get csr
- Approve all required OpenShift Container Platform CSRs for the new node:
  $ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes, and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From User interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From Command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- If verification steps fail, contact Red Hat Support.
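The CSR step in the procedure above asks you to find requests in Pending state and approve them by name. The following is a minimal sketch of how the Pending names could be picked out of the `oc get csr` table. The function name `pending_csr_names` is hypothetical, and the sketch assumes that CONDITION is the last column of the output, which may not hold for every client version.

```shell
#!/bin/sh
# Hypothetical helper, not part of the oc CLI.
# Reads `oc get csr` style table output on stdin and prints the NAME of every
# row whose last column (assumed to be CONDITION) is exactly "Pending".
# The header row (first line) is skipped.
pending_csr_names() {
    awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Example use against a live cluster (review the names before approving):
#   oc get csr | pending_csr_names
#   for csr in $(oc get csr | pending_csr_names); do
#       oc adm certificate approve "$csr"
#   done
```

Printing the names first and approving in a separate step keeps a human review between listing and approval, which matters because not every Pending CSR necessarily belongs to the new node.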
9.1.2. Replacing failed nodes on AWS user-provisioned infrastructures
Perform this procedure to replace a failed node that is not operational on AWS user-provisioned infrastructure (UPI) for OpenShift Container Storage.
Procedure
- Identify the AWS machine instance of the node that needs to be replaced.
- Log in to AWS and terminate the identified AWS machine instance.
- Create a new AWS machine instance with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform node using the new AWS machine instance.
- Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
  $ oc get csr
- Approve all required OpenShift Container Platform CSRs for the new node:
  $ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes, and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From User interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From Command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- If verification steps fail, contact Red Hat Support.
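The first verification step above relies on reading the command output by eye. As a rough sketch, the same check can be wrapped in two small functions that report whether a specific node carries the storage label. The function names `storage_node_names` and `has_storage_node` are hypothetical. They only post-process the text produced by `oc get nodes --show-labels` and do not call the cluster themselves.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the oc CLI.

# Reads `oc get nodes --show-labels` output on stdin and prints the names of
# nodes that carry the OpenShift Container Storage label. This mirrors the
# grep/cut pipeline from the verification step, with -F so that the dots in
# the label are matched literally.
storage_node_names() {
    grep -F 'cluster.ocs.openshift.io/openshift-storage=' | cut -d' ' -f1
}

# Exits 0 if node "$1" is among the labeled nodes read from stdin, else non-zero.
has_storage_node() {
    storage_node_names | grep -Fxq -- "$1"
}

# Example use against a live cluster:
#   oc get nodes --show-labels | has_storage_node <new_node_name> && echo "labeled"
```

The exact-line match (`grep -Fx`) avoids a false positive when one node name is a prefix of another.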
9.1.3. Replacing operational nodes on AWS installer-provisioned infrastructures
Use this procedure to replace an operational node on AWS installer-provisioned infrastructure (IPI).
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the node that needs to be replaced. Take a note of its Machine Name.
- Mark the node as unschedulable using the following command:
  $ oc adm cordon <node_name>
- Drain the node using the following command:
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  Important: This activity may take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
- Wait for the new machine to start and transition into Running state.
  Important: This activity may take 5-10 minutes or more.
- Click Compute → Nodes, and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From User interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From Command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- If verification steps fail, contact Red Hat Support.
9.1.4. Replacing failed nodes on AWS installer-provisioned infrastructures
Perform this procedure to replace a failed node that is not operational on AWS installer-provisioned infrastructure (IPI) for OpenShift Container Storage.
Procedure
- Log in to OpenShift Web Console and click Compute → Nodes.
- Identify the faulty node and click on its Machine Name.
- Click Actions → Edit Annotations, and click Add More.
- Add machine.openshift.io/exclude-node-draining and click Save.
- Click Actions → Delete Machine, and click Delete.
- A new machine is automatically created. Wait for the new machine to start.
  Important: This activity may take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Click Compute → Nodes, and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From User interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From Command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- [Optional]: If the failed AWS instance is not removed automatically, terminate the instance from AWS console.
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- If verification steps fail, contact Red Hat Support.
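The pod check in the verification steps above is described for the web console. A command-line equivalent could look like the sketch below, which tests whether at least one csi-cephfsplugin pod and one csi-rbdplugin pod are Running on a given node. The function name `csi_plugins_running_on` is hypothetical. The sketch assumes the column layout of `oc get pods -n openshift-storage -o wide` in which STATUS is the third column and NODE is the seventh. Verify that layout against your client before relying on it.

```shell
#!/bin/sh
# Hypothetical helper, not part of the oc CLI.
# Reads `oc get pods -n openshift-storage -o wide` output on stdin.
# Assumed columns: NAME READY STATUS RESTARTS AGE IP NODE ...
# Exits 0 only if node "$1" has at least one Running csi-cephfsplugin-* pod
# and at least one Running csi-rbdplugin-* pod. Provisioner pods, whose names
# share the same prefix, are not counted.
csi_plugins_running_on() {
    awk -v node="$1" '
        NR > 1 && $7 == node && $3 == "Running" {
            if ($1 ~ /^csi-cephfsplugin-/ && $1 !~ /provisioner/) cephfs++
            if ($1 ~ /^csi-rbdplugin-/ && $1 !~ /provisioner/) rbd++
        }
        END { exit !(cephfs > 0 && rbd > 0) }
    '
}

# Example use against a live cluster:
#   oc get pods -n openshift-storage -o wide | csi_plugins_running_on <new_node_name> \
#       && echo "CSI plugin pods are Running"
```

This covers only the two pod types named in the verification step. The remaining OpenShift Container Storage pods still need to be checked separately.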
9.2. Dynamically provisioned OpenShift Container Storage deployed on VMware infrastructures
9.2.1. Replacing operational nodes on VMware user-provisioned infrastructures
Perform this procedure to replace an operational node on VMware user-provisioned infrastructure (UPI).
Procedure
- Identify the node and its VM that need to be replaced.
- Mark the node as unschedulable using the following command:
  $ oc adm cordon <node_name>
- Drain the node using the following command:
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  Important: This activity may take 5-10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.
- Delete the node using the following command:
  $ oc delete nodes <node_name>
- Log in to vSphere and terminate the identified VM.
  Important: The VM should be deleted only from the inventory and not from the disk.
- Create a new VM on vSphere with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform worker node using the new VM.
- Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
  $ oc get csr
- Approve all required OpenShift Container Platform CSRs for the new node:
  $ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes, and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From User interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From Command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- If verification steps fail, contact Red Hat Support.
9.2.2. Replacing failed nodes on VMware user-provisioned infrastructures
Perform this procedure to replace a failed node on VMware user-provisioned infrastructure (UPI).
Procedure
- Identify the node and its VM that need to be replaced.
- Delete the node using the following command:
  $ oc delete nodes <node_name>
- Log in to vSphere and terminate the identified VM.
  Important: The VM should be deleted only from the inventory and not from the disk.
- Create a new VM on vSphere with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform worker node using the new VM.
- Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
  $ oc get csr
- Approve all required OpenShift Container Platform CSRs for the new node:
  $ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes, and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From User interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From Command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- If verification steps fail, contact Red Hat Support.
9.3. OpenShift Container Storage deployed using local storage devices
Due to a known issue, when you replace a node, the hostname of the new OpenShift Container Storage node should not be the same as the hostname of any decommissioned OpenShift Container Storage node. As a workaround, use a new hostname when adding the replaced node back into the cluster.
9.3.1. Replacing failed storage nodes on Amazon EC2 infrastructure
The ephemeral storage of Amazon EC2 I3 for OpenShift Container Storage might cause data loss when an instance is powered off. Use this procedure to recover from such an instance power off on Amazon EC2 infrastructure.
Replacing storage nodes in Amazon EC2 I3 infrastructure is a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
Prerequisites
- You must be logged in to the OpenShift Container Platform (OCP) cluster.
Procedure
- Identify the node and get labels on the node to be replaced.
  $ oc get nodes --show-labels | grep <node_name>
- Identify the mon (if any) and OSDs that are running on the node to be replaced.
  $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
- Scale down the deployments of the pods identified in the previous step.
  For example:
  $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
  $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
  $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
- Mark the node as unschedulable.
  $ oc adm cordon <node_name>
- Drain the node.
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  Note: If the failed node is not connected to the network, remove the pods running on it by using the command:
  $ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Remove the failed node.
- For installer-provisioned infrastructure, delete the machine corresponding to the failed node. A new node is automatically added.
  - Click Compute → Machines. Search for the required machine.
  - Beside the required machine, click the Action menu (⋮) → Delete Machine.
  - Click Delete to confirm the machine deletion. A new machine is automatically created.
  - Wait for the new machine to start and transition into Running state.
    Important: This activity may take 5-10 minutes or more.
- For user-provisioned infrastructure, follow these steps:
  - Delete the node.
    $ oc delete node <node_name>
  - Create a new Amazon EC2 I3 machine instance with the required infrastructure. See Supported Infrastructure and Platforms.
  - Create a new OpenShift Container Platform node using the new Amazon EC2 I3 machine instance.
  - Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
    $ oc get csr
  - Approve all required OpenShift Container Platform CSRs for the new node:
    $ oc adm certificate approve <Certificate_Name>
- [Optional]: If the failed AWS instance is not removed automatically, terminate the instance from AWS console.
- Click Compute → Nodes in the OpenShift web console, and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From User interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From Command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.
  - Add the new disk entries to the LocalVolume CR.
    Edit the LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}.
    $ oc get -n local-storage localvolume
    NAME          AGE
    local-block   25h
    $ oc edit -n local-storage localvolume local-block
    Example output:
    [...]
      storageClassDevices:
        - devicePaths:
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441494EC
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84FE3E9
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE4
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441464EP
          # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84F43E7
          # - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE8
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
          storageClassName: localblock
          volumeMode: Block
    [...]
    Make sure to save the changes after editing the CR.
    In this CR, the following two new devices have been added using by-id:
    - nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
    - nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
  - Display PVs with localblock.
    $ oc get pv | grep localblock
    Example output:
    local-pv-3646185e   2328Gi   RWO   Delete   Available                                              localblock   9s
    local-pv-3933e86    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock   5h1m
    local-pv-8176b2bf   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock   5h1m
    local-pv-ab7cabb3   2328Gi   RWO   Delete   Available                                              localblock   9s
    local-pv-ac52e8a    2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock   5h1m
    local-pv-b7e6fd37   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock   5h1m
    local-pv-cb454338   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock   5h1m
    local-pv-da5e3175   2328Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock   5h
    ...
- Delete each PV and OSD associated with the failed node using the following steps.
  - Identify the DeviceSet associated with the OSD to be replaced.
    $ osd_id_to_remove=0
    $ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
    where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
    Example output:
    ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
    ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
  - Identify the PV associated with the PVC.
    $ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
    where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
    Example output:
    NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    ocs-deviceset-0-0-nvs68   Bound    local-pv-8176b2bf   2328Gi     RWO            localblock     4h49m
    In this example, the associated PV is local-pv-8176b2bf.
  - Delete the PVC that was identified in earlier steps. In this example, the PVC name is ocs-deviceset-0-0-nvs68.
    $ oc delete pvc ocs-deviceset-0-0-nvs68 -n openshift-storage
    Example output:
    persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
  - Delete the PV that was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.
    $ oc delete pv local-pv-8176b2bf
    Example output:
    persistentvolume "local-pv-8176b2bf" deleted
  - Remove the failed OSD from the cluster.
    $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
  - Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
    $ oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
    Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
    $ oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
  - Delete the OSD pod deployment.
    $ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
- Delete the crashcollector pod deployment identified in an earlier step.
  $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
- Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
  $ oc get -n openshift-storage pod -l app=rook-ceph-operator
  Example output:
  NAME                                  READY   STATUS    RESTARTS   AGE
  rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          5h3m
  - Delete the rook-ceph-operator.
    $ oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
    Example output:
    pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
  - Verify that the rook-ceph-operator pod is restarted.
    $ oc get -n openshift-storage pod -l app=rook-ceph-operator
    Example output:
    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
    Creation of the new OSD may take several minutes after the operator starts.
- Delete the ocs-osd-removal job(s).
  $ oc delete job ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
  Example output:
  job.batch "ocs-osd-removal-0" deleted
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Also, ensure that the new incremental mon is created and is in the Running state.
  $ oc get pod -n openshift-storage | grep mon
  Example output:
  rook-ceph-mon-a-64556f7659-c2ngc   1/1   Running   0   5h1m
  rook-ceph-mon-b-7c8b74dc4d-tt6hd   1/1   Running   0   5h1m
  rook-ceph-mon-d-57fb8c657-wg5f2    1/1   Running   0   27m
  OSDs and mons might take several minutes to get to the Running state.
- If verification steps fail, contact Red Hat Support.
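The procedure above walks from an OSD ID to its PVC and then to the backing PV by reading command output manually. The sketch below shows one way those two lookups could be reduced to text filters. The function names `pvc_from_osd_yaml` and `pv_from_pvc_table` are hypothetical. The second function assumes that VOLUME is the third column of the `oc get pvc` table, as in the example output shown in the procedure.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the oc CLI.

# Reads the YAML of a rook-ceph-osd deployment on stdin and prints the PVC
# name from its ceph.rook.io/pvc label. The label appears more than once in
# the deployment, so duplicates are collapsed with sort -u.
pvc_from_osd_yaml() {
    awk '$1 == "ceph.rook.io/pvc:" { print $2 }' | sort -u
}

# Reads `oc get pvc <name>` table output on stdin and prints the VOLUME
# column (assumed to be the third column), skipping the header row.
pv_from_pvc_table() {
    awk 'NR > 1 { print $3 }'
}

# Example use against a live cluster:
#   osd_id_to_remove=0
#   pvc=$(oc get -n openshift-storage -o yaml deployment "rook-ceph-osd-${osd_id_to_remove}" | pvc_from_osd_yaml)
#   pv=$(oc get -n openshift-storage pvc "$pvc" | pv_from_pvc_table)
#   echo "PVC: $pvc  PV: $pv"
```

Printing the resolved names before running any `oc delete` command gives a chance to compare them with the device that actually failed.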
9.3.2. Replacing failed storage nodes on VMware infrastructure
Prerequisites
- You must be logged in to the OpenShift Container Platform (OCP) cluster.
Procedure
- Identify the node and get labels on the node to be replaced.
  $ oc get nodes --show-labels | grep <node_name>
- Identify the mon (if any) and OSDs that are running on the node to be replaced.
  $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
- Scale down the deployments of the pods identified in the previous step.
  For example:
  $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
  $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
  $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
- Mark the node as unschedulable.
  $ oc adm cordon <node_name>
- Drain the node.
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  Note: If the failed node is not connected to the network, remove the pods running on it by using the command:
  $ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
  $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
- Delete the node.
  $ oc delete node <node_name>
- Create a new VM on VMware with the required infrastructure. See Supported Infrastructure and Platforms.
- Create a new OpenShift Container Platform worker node using the new VM.
- Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:
  $ oc get csr
- Approve all required OpenShift Container Platform CSRs for the new node:
  $ oc adm certificate approve <Certificate_Name>
- Click Compute → Nodes in OpenShift Web Console, and confirm that the new node is in Ready state.
- Apply the OpenShift Container Storage label to the new node using any one of the following:
  - From User interface
    - For the new node, click Action Menu (⋮) → Edit Labels.
    - Add cluster.ocs.openshift.io/openshift-storage and click Save.
  - From Command line interface
    - Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
- Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.
  - Add a new disk entry to the LocalVolume CR.
    Edit the LocalVolume CR and remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}. In this example, the new device is /dev/disk/by-id/scsi-36000c29f5c9638dec9f19b220fbe36b1.
    $ oc get -n local-storage localvolume
    NAME          AGE
    local-block   25h
    $ oc edit -n local-storage localvolume local-block
    Example output:
    [...]
      storageClassDevices:
        - devicePaths:
            - /dev/disk/by-id/scsi-36000c29346bca85f723c4c1f268b5630
            - /dev/disk/by-id/scsi-36000c29134dfcfaf2dfeeb9f98622786
          # - /dev/disk/by-id/scsi-36000c2962b2f613ba1f8f4c5cf952237
            - /dev/disk/by-id/scsi-36000c29f5c9638dec9f19b220fbe36b1
          storageClassName: localblock
          volumeMode: Block
    [...]
    Make sure to save the changes after editing the CR.
  - Display PVs with localblock.
    $ oc get pv | grep localblock
    Example output:
    local-pv-3e8964d3   100Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock   25h
    local-pv-414755e0   100Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock   25h
    local-pv-b481410    100Gi   RWO   Delete   Available                                              localblock   3m24s
    local-pv-d9c5cbd6   100Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock
- Delete the PV associated with the failed node.
  - Identify the DeviceSet associated with the OSD to be replaced.
    $ osd_id_to_remove=0
    $ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
    where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
    Example output:
    ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
    ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
    In this example, the PVC name is ocs-deviceset-0-0-nvs68.
  - Identify the PV associated with the PVC.
    $ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
    where x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.
    Example output:
    NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    ocs-deviceset-0-0-nvs68   Bound    local-pv-d9c5cbd6   100Gi      RWO            localblock     24h
    In this example, the associated PV is local-pv-d9c5cbd6.
  - Delete the PVC.
    $ oc delete pvc <pvc-name> -n openshift-storage
  - Delete the PV.
    $ oc delete pv local-pv-d9c5cbd6
    Example output:
    persistentvolume "local-pv-d9c5cbd6" deleted
- Remove the failed OSD from the cluster.
  $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
- Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
  $ oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
  Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
  $ oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
- Delete the OSD pod deployment and the crashcollector pod deployment.
  $ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
  $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
- Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
  $ oc get -n openshift-storage pod -l app=rook-ceph-operator
  Example output:
  NAME                                  READY   STATUS    RESTARTS   AGE
  rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
  - Delete the rook-ceph-operator.
    $ oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
    Example output:
    pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
  - Verify that the rook-ceph-operator pod is restarted.
    $ oc get -n openshift-storage pod -l app=rook-ceph-operator
    Example output:
    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
    Creation of the new OSD and mon might take several minutes after the operator restarts.
- Delete the ocs-osd-removal job.
  $ oc delete job ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
  Example output:
  job.batch "ocs-osd-removal-0" deleted
Verification steps
- Execute the following command and verify that the new node is present in the output:
  $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
- Click Workloads → Pods, and confirm that at least the following pods on the new node are in Running state:
  - csi-cephfsplugin-*
  - csi-rbdplugin-*
- Verify that all other required OpenShift Container Storage pods are in Running state.
- Make sure that the new incremental mon is created and is in the Running state.
  $ oc get pod -n openshift-storage | grep mon
  Example output:
  rook-ceph-mon-c-64556f7659-c2ngc   1/1   Running   0   6h14m
  rook-ceph-mon-d-7c8b74dc4d-tt6hd   1/1   Running   0   4h24m
  rook-ceph-mon-e-57fb8c657-wg5f2    1/1   Running   0   162m
  OSDs and mons might take several minutes to get to the Running state.
- If verification steps fail, contact Red Hat Support.
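The note in the drain step above force-deletes Terminating pods with a single awk one-liner that matches the node name anywhere in the line. A more cautious variant, sketched below, first lists the affected pods so they can be reviewed, and matches the node name only in the NODE column. The function name `terminating_pods_on` is hypothetical. The sketch assumes the column layout of `oc get pods -A -o wide` in which STATUS is the fourth column and NODE is the eighth.

```shell
#!/bin/sh
# Hypothetical helper, not part of the oc CLI.
# Reads `oc get pods -A -o wide` output on stdin.
# Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE ...
# Prints "<namespace> <pod>" for every pod on node "$1" whose STATUS is
# exactly "Terminating". Nothing is deleted by this function.
terminating_pods_on() {
    awk -v node="$1" 'NR > 1 && $4 == "Terminating" && $8 == node { print $1, $2 }'
}

# Example use against a live cluster. Review the list first, then delete:
#   oc get pods -A -o wide | terminating_pods_on <node_name>
#   oc get pods -A -o wide | terminating_pods_on <node_name> |
#       while read -r ns pod; do
#           oc -n "$ns" delete pod "$pod" --grace-period=0 --force
#       done
```

Separating the listing from the deletion makes it easier to confirm that only pods on the failed node are affected before any forced removal.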
9.3.3. Replacing failed storage nodes on bare metal infrastructure
Prerequisites
- You must be logged in to the OpenShift Container Platform (OCP) cluster.
Procedure
Identify the node and get labels on the node to be replaced. Make a note of the rack label.
$ oc get nodes --show-labels | grep <node_name>Identify the mon (if any) and object storage device (OSD) pods that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Note
If the failed node is not connected to the network, remove the pods running on it by using the command:
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Delete the node.
$ oc delete node <node_name>
- Get a new bare metal machine with required infrastructure. See Installing a cluster on bare metal.
- Create a new OpenShift Container Platform node using the new bare metal machine.
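The forced deletion in the drain note earlier in this procedure acts immediately, so it can help to preview which pods it would remove. The following is a minimal sketch and not part of the documented procedure: the helper name terminating_pod_deletes is hypothetical, and it assumes, as the original one-liner does, that STATUS is the fourth column of the oc get pods -A -o wide output. It prints the delete commands instead of running them.

```shell
# terminating_pod_deletes (hypothetical helper): read `oc get pods -A -o wide`
# rows on stdin and print, without executing anything, one forced-delete
# command for each pod whose STATUS column (field 4) is "Terminating".
terminating_pod_deletes() {
  awk '$4 == "Terminating" {
    print "oc -n " $1 " delete pods " $2 " --grace-period=0 --force"
  }'
}

# Example (requires cluster access; <node_name> is a placeholder):
#   oc get pods -A -o wide | grep -i <node_name> | terminating_pod_deletes
```

Review the printed commands before running any of them. Pods in any other state are left out of the list.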
Check for certificate signing requests (CSRs) related to OpenShift Container Storage that are in Pending state:
$ oc get csr
Approve all required OpenShift Container Storage CSRs for the new node:
$ oc adm certificate approve <Certificate_Name>
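When several requests are listed, the pending ones can be picked out of the oc get csr table before approving them one by one. This is a minimal sketch under stated assumptions: the helper name pending_csrs is hypothetical, and it assumes that CONDITION is the last column of the table and reads exactly Pending for unapproved requests.

```shell
# pending_csrs (hypothetical helper): read `oc get csr` table output on stdin
# and print the NAME of every row whose last column (CONDITION) is "Pending".
# The header row is skipped naturally because its last field is "CONDITION".
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}

# Example (requires cluster access):
#   oc get csr | pending_csrs
```

Each printed name can then be passed to oc adm certificate approve after confirming that the request belongs to the new node.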
Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
- From User interface
  - For the new node, click Action Menu (⋮) → Edit Labels.
  - Add cluster.ocs.openshift.io/openshift-storage and click Save.
- From Command line interface
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
Add the local storage devices available in these worker nodes to the OpenShift Container Storage StorageCluster.
Add a new disk entry to the LocalVolume CR.
Edit the LocalVolume CR: remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}. In this example, the new device is /dev/disk/by-id/scsi-36000c29f5c9638dec9f19b220fbe36b1.
# oc get -n local-storage localvolume
NAME          AGE
local-block   25h
# oc edit -n local-storage localvolume local-block
Example output:
[...]
  storageClassDevices:
  - devicePaths:
    - /dev/disk/by-id/scsi-36000c29346bca85f723c4c1f268b5630
    - /dev/disk/by-id/scsi-36000c29134dfcfaf2dfeeb9f98622786
    # - /dev/disk/by-id/scsi-36000c2962b2f613ba1f8f4c5cf952237
    - /dev/disk/by-id/scsi-36000c29f5c9638dec9f19b220fbe36b1
    storageClassName: localblock
    volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
Display PVs with localblock.
$ oc get pv | grep localblock
Example output:
local-pv-3e8964d3   100Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock   25h
local-pv-414755e0   100Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock   25h
local-pv-b481410    100Gi   RWO   Delete   Available                                               localblock   3m24s
local-pv-d9c5cbd6   100Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock
Delete the PV associated with the failed node.
Identify the DeviceSet associated with the OSD to be replaced.
# osd_id_to_remove=0
# oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
In this example, the PVC name is ocs-deviceset-0-0-nvs68.
Identify the PV associated with the PVC.
# oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-d9c5cbd6   100Gi      RWO            localblock     24h
In this example, the associated PV is local-pv-d9c5cbd6.
Delete the PVC.
# oc delete pvc <pvc-name> -n openshift-storage
Delete the PV.
# oc delete pv local-pv-d9c5cbd6
Example output:
persistentvolume "local-pv-d9c5cbd6" deleted
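The two lookups at the start of this step (OSD deployment to PVC, then PVC to PV) can be chained so that the names do not have to be copied by hand. This is a minimal sketch, not part of the documented procedure: the helper names pvc_from_labels and pv_from_pvc_table are hypothetical, and they assume the output layouts shown in the examples above.

```shell
# pvc_from_labels (hypothetical helper): read the `grep ceph.rook.io/pvc`
# output on stdin and print each distinct PVC name once. The label appears
# twice in the deployment YAML, so duplicates are collapsed.
pvc_from_labels() {
  awk '$1 == "ceph.rook.io/pvc:" { print $2 }' | sort -u
}

# pv_from_pvc_table (hypothetical helper): read `oc get pvc <name>` table
# output on stdin and print the VOLUME column (field 3), skipping the header.
pv_from_pvc_table() {
  awk '$1 != "NAME" { print $3 }'
}

# Example (requires cluster access):
#   pvc=$(oc get -n openshift-storage -o yaml deployment "rook-ceph-osd-${osd_id_to_remove}" \
#         | grep ceph.rook.io/pvc | pvc_from_labels)
#   pv=$(oc get -n openshift-storage pvc "$pvc" | pv_from_pvc_table)
#   echo "PVC: $pvc  PV: $pv"
```

Printing both names before deleting anything gives a chance to confirm that they match the failed device.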
Remove the failed OSD from the cluster.
# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note
If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
Delete OSD pod deployment and crashcollector pod deployment.
$ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
Delete the rook-ceph-operator.
# oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
Example output:
pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
Verify that the rook-ceph-operator pod is restarted.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
Creation of the new OSD and mon might take several minutes after the operator restarts.
Delete the ocs-osd-removal job.
# oc delete job ocs-osd-removal-${osd_id_to_remove}
Example output:
job.batch "ocs-osd-removal-0" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Container Storage pods are in Running state.
Make sure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-c-64556f7659-c2ngc   1/1   Running   0   6h14m
rook-ceph-mon-d-7c8b74dc4d-tt6hd   1/1   Running   0   4h24m
rook-ceph-mon-e-57fb8c657-wg5f2    1/1   Running   0   162m
OSD and Mon might take several minutes to get to the Running state.
- If verification steps fail, contact Red Hat Support.
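The mon check in the verification steps above can also be turned into a filter that lists only the pods that are not yet Running, which is easier to scan while waiting for the new OSD and mon. This is a minimal sketch: the helper name not_running is hypothetical, and it assumes that STATUS is the third column of the oc get pod output, as in the example output above.

```shell
# not_running (hypothetical helper): read `oc get pod` table output on stdin
# and print the name and status of every pod whose STATUS column (field 3)
# is not "Running". The header row, if present, is skipped.
not_running() {
  awk '$1 != "NAME" && $3 != "Running" { print $1, $3 }'
}

# Example (requires cluster access):
#   oc get pod -n openshift-storage | grep mon | not_running
```

Empty output means every listed pod reports Running. Applied to the whole namespace without the grep, the filter would also list finished job pods, which normally report Completed.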