Chapter 10. Replacing a storage device
Depending on the type of your deployment, you can choose one of the following procedures to replace a storage device:
- For dynamically created storage clusters deployed on AWS, see Section 10.1, “Dynamically provisioned OpenShift Container Storage deployed on AWS”.
- For dynamically created storage clusters deployed on VMware, see Section 10.2.1, “Replacing operational or failed storage devices on VMware user-provisioned infrastructure”.
- For storage clusters deployed using local storage devices, see Section 10.3, “OpenShift Container Storage deployed using local storage devices”.
10.1. Dynamically provisioned OpenShift Container Storage deployed on AWS
10.1.1. Replacing operational or failed storage devices on AWS user-provisioned infrastructure
When you need to replace a device in a dynamically created storage cluster on an AWS user-provisioned infrastructure, you must replace the storage node instead. For information about how to replace nodes, see the procedures for replacing operational or failed storage nodes on AWS user-provisioned infrastructure.
10.1.2. Replacing operational or failed storage devices on AWS installer-provisioned infrastructure
When you need to replace a device in a dynamically created storage cluster on an AWS installer-provisioned infrastructure, you must replace the storage node instead. For information about how to replace nodes, see the procedures for replacing operational or failed storage nodes on AWS installer-provisioned infrastructure.
10.2. Dynamically provisioned OpenShift Container Storage deployed on VMware
10.2.1. Replacing operational or failed storage devices on VMware user-provisioned infrastructure
Use this procedure when a virtual machine disk (VMDK) needs to be replaced in OpenShift Container Storage that is deployed dynamically on VMware infrastructure. This procedure creates a new Persistent Volume Claim (PVC) on a new volume and removes the old object storage device (OSD).
Procedure
Identify the OSD that needs to be replaced.
# oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide

Example output:

rook-ceph-osd-0-6d77d6c7c6-m8xj6   0/1   CrashLoopBackOff   0   24h   10.129.0.16   compute-2   <none>   <none>
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running            0   24h   10.128.2.24   compute-0   <none>   <none>
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running            0   24h   10.131.2.32   compute-1   <none>   <none>

In this example, rook-ceph-osd-0-6d77d6c7c6-m8xj6 needs to be replaced.

Note: If the OSD to be replaced is healthy, the status of the pod will be Running.
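If you prefer not to scan the list by eye, the following minimal shell sketch (not part of the official procedure) prints only the OSD pods that are not fully ready:

# Filter on the READY and STATUS columns of the pod listing.
oc get -n openshift-storage pods -l app=rook-ceph-osd --no-headers \
  | awk '$2 != "1/1" || $3 != "Running" {print $1}'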
Scale down the OSD deployment for the OSD to be replaced.

# osd_id_to_remove=0
# oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0

where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

Example output:

deployment.extensions/rook-ceph-osd-0 scaled

Verify that the rook-ceph-osd pod is terminated.

# oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}

Example output:

No resources found.

Note: If the rook-ceph-osd pod is in a terminating state, use the force option to delete the pod.

# oc delete pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --force --grace-period=0

Example output:

warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-6d77d6c7c6-m8xj6" force deleted

Remove the old OSD from the cluster so that a new OSD can be added.
# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -

Warning: This step results in the OSD being completely removed from the cluster. Make sure that the correct value of osd_id_to_remove is provided.

Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage

Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
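Rather than polling the job status manually, you can block until the removal job finishes. A minimal sketch, assuming your oc client supports oc wait:

# Wait up to five minutes for the removal job to reach the Complete condition.
oc wait --for=condition=complete job/ocs-osd-removal-${osd_id_to_remove} \
  -n openshift-storage --timeout=300s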
Delete the PVC resources associated with the OSD to be replaced.

Identify the DeviceSet associated with the OSD to be replaced.

# oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

Example output:

ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68

In this example, the PVC name is ocs-deviceset-0-0-nvs68.
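To avoid copying the name by hand, the following sketch captures the PVC name in a shell variable. It assumes Rook exposes the PVC name in the ceph.rook.io/pvc label on the OSD deployment, as the grep output above suggests:

# Bracket notation escapes the dots in the label key.
pvc_name=$(oc get -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} \
  -o jsonpath="{.metadata.labels['ceph\.rook\.io/pvc']}")
echo "${pvc_name}"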
Identify the PV associated with the PVC.

# oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

where x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.

Example output:

NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    pvc-0e621d45-7d18-4d35-a282-9700c3cc8524   512Gi      RWO            thin           24h

In this example, the PVC is ocs-deviceset-0-0-nvs68, identified in the previous step, and the associated PV is pvc-0e621d45-7d18-4d35-a282-9700c3cc8524.

Identify the prepare-pod associated with the OSD to be replaced. Use the PVC name obtained in an earlier step.

# oc describe -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix> | grep Mounted

where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

Example output:

Mounted By: rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7

Delete the osd-prepare pod before removing the associated PVC.

# oc delete -n openshift-storage pod rook-ceph-osd-prepare-ocs-deviceset-<x>-<y>-<pvc-suffix>-<pod-suffix>

where x, y, pvc-suffix, and pod-suffix are the values in the osd-prepare pod name identified in the previous step.

Example output:

pod "rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7" deleted

Delete the PVC associated with the device.

# oc delete -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

Example output:

persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
Create a new OSD for the new device.
Delete the deployment for the OSD to be replaced.
# oc delete -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove}

Example output:

deployment.extensions/rook-ceph-osd-0 deleted

Verify that the PV for the device identified in an earlier step is deleted. In this example, the PV name is pvc-0e621d45-7d18-4d35-a282-9700c3cc8524.

# oc get -n openshift-storage pv pvc-0e621d45-7d18-4d35-a282-9700c3cc8524

Example output:

Error from server (NotFound): persistentvolumes "pvc-0e621d45-7d18-4d35-a282-9700c3cc8524" not found

If the PV still exists, delete the PV associated with the device.

# oc delete pv pvc-0e621d45-7d18-4d35-a282-9700c3cc8524

Example output:

persistentvolume "pvc-0e621d45-7d18-4d35-a282-9700c3cc8524" deleted
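To make this step safe to script, a minimal sketch that deletes the PV only when it still exists (the PV name is the example value from above):

pv_name=pvc-0e621d45-7d18-4d35-a282-9700c3cc8524
# Delete only if the PV is still present; a NotFound error is expected otherwise.
oc get pv "${pv_name}" >/dev/null 2>&1 && oc delete pv "${pv_name}"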
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

Identify the name of the rook-ceph-operator pod.

# oc get -n openshift-storage pod -l app=rook-ceph-operator

Example output:

NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h

Delete the rook-ceph-operator pod. In this example, the rook-ceph-operator pod name is rook-ceph-operator-6f74fb5bff-2d982.

# oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

Example output:

pod "rook-ceph-operator-6f74fb5bff-2d982" deleted

Verify that the rook-ceph-operator pod is restarted.

# oc get -n openshift-storage pod -l app=rook-ceph-operator

Example output:

NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

Creation of the new OSD may take several minutes after the operator restarts.
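Instead of looking up the operator pod name first, the restart can also be done in one step by deleting the pod through its label selector. A minimal sketch, assuming a single operator pod:

# The deployment recreates the pod automatically, forcing reconciliation.
oc delete -n openshift-storage pod -l app=rook-ceph-operator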
Delete the ocs-osd-removal job.

# oc delete job ocs-osd-removal-${osd_id_to_remove}

Example output:

job.batch "ocs-osd-removal-0" deleted
Verification steps
Verify that there is a new OSD running and a new PVC created.
# oc get -n openshift-storage pods -l app=rook-ceph-osd

Example output:

rook-ceph-osd-0-5f7f4747d4-snshw   1/1   Running   0   4m47s
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running   0   1d20h
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running   0   1d20h

# oc get -n openshift-storage pvc

Example output:

NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-2s6w4   Bound    pvc-7c9bcaf7-de68-40e1-95f9-0b0d7c0ae2fc   512Gi      RWO            thin           5m
ocs-deviceset-1-0-q8fwh   Bound    pvc-9e7e00cb-6b33-402e-9dc5-b8df4fd9010f   512Gi      RWO            thin           1d20h
ocs-deviceset-2-0-9v8lq   Bound    pvc-38cdfcee-ea7e-42a5-a6e1-aaa6d4924291   512Gi      RWO            thin           1d20h

Log in to OpenShift Web Console and view the storage dashboard.
Figure 10.1. OSD status in OpenShift Container Platform storage dashboard after device replacement
10.3. OpenShift Container Storage deployed using local storage devices
10.3.1. Replacing failed storage devices on Amazon EC2 infrastructure
When you need to replace a storage device on an Amazon EC2 (storage-optimized I3) infrastructure, you must replace the storage node. For information about how to replace nodes, see Replacing failed storage nodes on Amazon EC2 infrastructure.
10.3.2. Replacing operational or failed storage devices on VMware and bare metal infrastructures
You can replace an object storage device (OSD) in OpenShift Container Storage deployed using local storage devices on bare metal and VMware infrastructures. Use this procedure when an underlying storage device needs to be replaced.
Procedure
Identify the OSD that needs to be replaced and the OpenShift Container Platform node that has the OSD scheduled on it.
# oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide

Example output:

rook-ceph-osd-0-6d77d6c7c6-m8xj6   0/1   CrashLoopBackOff   0   24h   10.129.0.16   compute-2   <none>   <none>
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running            0   24h   10.128.2.24   compute-0   <none>   <none>
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running            0   24h   10.130.0.18   compute-1   <none>   <none>

In this example, rook-ceph-osd-0-6d77d6c7c6-m8xj6 needs to be replaced and compute-2 is the OpenShift Container Platform node on which the OSD is scheduled.

Note: If the OSD to be replaced is healthy, the status of the pod will be Running.

Scale down the OSD deployment for the OSD to be replaced.
# osd_id_to_remove=0
# oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0

where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

Example output:

deployment.extensions/rook-ceph-osd-0 scaled

Verify that the rook-ceph-osd pod is terminated.

# oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}

Example output:

No resources found in openshift-storage namespace.

Note: If the rook-ceph-osd pod is in a terminating state, use the force option to delete the pod.

# oc delete pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force

Example output:

warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-6d77d6c7c6-m8xj6" force deleted

Remove the old OSD from the cluster so that a new OSD can be added.
Delete any old ocs-osd-removal jobs.

# oc delete job ocs-osd-removal-${osd_id_to_remove}

Example output:

job.batch "ocs-osd-removal-0" deleted

Remove the old OSD from the cluster.

# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -

Warning: This step results in the OSD being completely removed from the cluster. Make sure that the correct value of osd_id_to_remove is provided.
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage

Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
Delete the Persistent Volume Claim (PVC) resources associated with the OSD to be replaced.

Identify the DeviceSet associated with the OSD to be replaced.

# oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

Example output:

ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68

In this example, the PVC name is ocs-deviceset-0-0-nvs68.

Identify the PV associated with the PVC.

# oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

Example output:

NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-d9c5cbd6   1490Gi     RWO            localblock     24h

In this example, the associated PV is local-pv-d9c5cbd6.
Identify the name of the device to be replaced.

# oc get pv local-pv-<pv-suffix> -o yaml | grep path

where pv-suffix is the value in the PV name identified in an earlier step.

Example output:

path: /mnt/local-storage/localblock/nvme0n1

In this example, the device name is nvme0n1.
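As an alternative to grepping the YAML, the device path can be read directly from the PV spec. A minimal sketch, assuming the Local Storage Operator records the path under .spec.local.path:

# Print the backing device path of the local PV.
oc get pv local-pv-d9c5cbd6 -o jsonpath='{.spec.local.path}{"\n"}'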
Identify the prepare-pod associated with the OSD to be replaced.

# oc describe -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix> | grep Mounted

where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

Example output:

Mounted By: rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7

In this example, the prepare-pod name is rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7.

Delete the osd-prepare pod before removing the associated PVC.

# oc delete -n openshift-storage pod rook-ceph-osd-prepare-ocs-deviceset-<x>-<y>-<pvc-suffix>-<pod-suffix>

where x, y, pvc-suffix, and pod-suffix are the values in the osd-prepare pod name identified in an earlier step.

Example output:

pod "rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7" deleted

Delete the PVC associated with the OSD to be replaced.

# oc delete -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

Example output:

persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
Replace the old device and use the new device to create a new OpenShift Container Platform PV.
Log in to the OpenShift Container Platform node with the device to be replaced. In this example, the OpenShift Container Platform node is compute-2.

# oc debug node/compute-2

Example output:

Starting pod/compute-2-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.70.56.66
If you don't see a command prompt, try pressing enter.

# chroot /host

Record the /dev/disk/by-id/{id} that is to be replaced, using the device name, nvme0n1, identified earlier.

# ls -alh /mnt/local-storage/localblock

Example output:

total 0
drwxr-xr-x. 2 root root 51 Aug 18 19:05 .
drwxr-xr-x. 3 root root 24 Aug 18 19:05 ..
lrwxrwxrwx. 1 root root 57 Aug 18 19:05 nvme0n1 -> /dev/disk/by-id/nvme-eui.01000000010000005cd2e4de2f0f5251

Find the name of the LocalVolume CR, and remove or comment out the device /dev/disk/by-id/{id} that is to be replaced.

# oc get -n local-storage localvolume

Example output:

NAME          AGE
local-block   25h

# oc edit -n local-storage localvolume local-block

Example output:

[...]
    storageClassDevices:
    - devicePaths:
      - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4895e0e5251
      - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4ea2f0f5251
      # - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4de2f0f5251
      storageClassName: localblock
      volumeMode: Block
[...]

Make sure to save the changes after editing the CR.
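If you want to double-check which physical device the removed by-id entry pointed at, a minimal sketch to run on the node (inside chroot /host; the id is the example value from above):

# Resolve the by-id symlink to the kernel device name, for example /dev/nvme0n1.
readlink -f /dev/disk/by-id/nvme-eui.01000000010000005cd2e4de2f0f5251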
Log in to the OpenShift Container Platform node with the device to be replaced and remove the old symlink.

# oc debug node/compute-2

Example output:

Starting pod/compute-2-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.70.56.66
If you don't see a command prompt, try pressing enter.

# chroot /host

Identify the old symlink for the device name to be replaced. In this example, the device name is nvme0n1.

# ls -alh /mnt/local-storage/localblock

Example output:

total 0
drwxr-xr-x. 2 root root 51 Aug 18 19:05 .
drwxr-xr-x. 3 root root 24 Aug 18 19:05 ..
lrwxrwxrwx. 1 root root 57 Aug 18 19:05 nvme0n1 -> /dev/disk/by-id/nvme-eui.01000000010000005cd2e4de2f0f5251

Remove the symlink.

# rm /mnt/local-storage/localblock/nvme0n1

Verify that the symlink is removed.

# ls -alh /mnt/local-storage/localblock

Example output:

total 0
drwxr-xr-x. 2 root root 17 Apr 10 00:56 .
drwxr-xr-x. 3 root root 24 Apr  8 23:03 ..

Important: For new deployments of OpenShift Container Storage 4.5 or later, LVM is not in use; ceph-volume raw mode is used instead. Therefore, additional validation is not needed and you can proceed to the next step.

For OpenShift Container Storage 4.4, or if OpenShift Container Storage has been upgraded to version 4.5 from a prior version, check both /dev/mapper and /dev/ for orphans related to ceph before moving on. Use the results of vgdisplay to find these orphans. If there is anything in /dev/mapper or /dev/ceph-* with ceph in the name that is not from the list of VG Names, use dmsetup to remove it, as shown in the sketch after this note.
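The following is a minimal sketch of that orphan check, to run on the node; the device name in the removal command is a placeholder:

# Volume groups that LVM still knows about; ceph-* device-mapper entries
# should correspond to one of these VG Names.
vgdisplay | grep "VG Name"
ls /dev/mapper/ | grep ceph
# Remove a stale mapping that matches no VG Name (placeholder name):
# dmsetup remove <orphaned-ceph-dm-name>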
Delete the PV associated with the device to be replaced, which was identified in earlier steps. In this example, the PV name is local-pv-d9c5cbd6.

# oc delete pv local-pv-d9c5cbd6

Example output:

persistentvolume "local-pv-d9c5cbd6" deleted

Replace the device with the new device.
Log back in to the correct OpenShift Container Platform node and identify the device name for the new drive. The device name can be the same as the old device, but the by-id must change unless you are reseating the same device.

# lsblk

Example output:

NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   120G  0 disk
|-sda1                         8:1    0   384M  0 part /boot
|-sda2                         8:2    0   127M  0 part /boot/efi
|-sda3                         8:3    0     1M  0 part
`-sda4                         8:4    0 119.5G  0 part
  `-coreos-luks-root-nocrypt 253:0    0 119.5G  0 dm   /sysroot
nvme0n1                      259:0    0   1.5T  0 disk

In this example, the new device name is nvme0n1.

Identify the /dev/disk/by-id/{id} for the new device and record it.

# ls -alh /dev/disk/by-id | grep nvme0n1

Example output:

lrwxrwxrwx. 1 root root 13 Aug 18 19:05 nvme-eui.01000000010000005cd2e4ce090e5251 -> ../../nvme0n1
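If several by-id aliases exist for the drive, a minimal sketch that lists every alias resolving to the new device (the device name is the example value from above):

# Print each by-id entry whose target is /dev/nvme0n1.
for id in /dev/disk/by-id/*; do
  [ "$(readlink -f "$id")" = "/dev/nvme0n1" ] && echo "$id"
done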
After the new /dev/disk/by-id/{id} is available, a new disk entry can be added to the LocalVolume CR.

Find the name of the LocalVolume CR.

# oc get -n local-storage localvolume

Example output:

NAME          AGE
local-block   25h

Edit the LocalVolume CR and add the new /dev/disk/by-id/{id}. In this example, the new device is /dev/disk/by-id/nvme-eui.01000000010000005cd2e4ce090e5251.

# oc edit -n local-storage localvolume local-block

Example output:

[...]
    storageClassDevices:
    - devicePaths:
      - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4895e0e5251
      - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4ea2f0f5251
      # - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4de2f0f5251
      - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4ce090e5251
      storageClassName: localblock
      volumeMode: Block
[...]

Make sure to save the changes after editing the CR.
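For automation, the same change can be applied non-interactively with a JSON patch. A minimal sketch; indexing the first storageClassDevices entry is an assumption that matches the CR shown above:

# Append the new device path to the first storageClassDevices entry.
oc patch -n local-storage localvolume local-block --type=json -p '[
  {"op": "add",
   "path": "/spec/storageClassDevices/0/devicePaths/-",
   "value": "/dev/disk/by-id/nvme-eui.01000000010000005cd2e4ce090e5251"}
]'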
Verify that there is a new PV in Available state and of the correct size.

# oc get pv | grep 1490Gi

Example output:

local-pv-3e8964d3   1490Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock   25h
local-pv-414755e0   1490Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock   25h
local-pv-b481410    1490Gi   RWO   Delete   Available

Create a new OSD for the new device.
Delete the deployment for the OSD to be replaced.
# osd_id_to_remove=0
# oc delete -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove}

Example output:

deployment.extensions/rook-ceph-osd-0 deleted

Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

Identify the name of the rook-ceph-operator pod.

# oc get -n openshift-storage pod -l app=rook-ceph-operator

Example output:

NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h

Delete the rook-ceph-operator pod. In this example, the rook-ceph-operator pod name is rook-ceph-operator-6f74fb5bff-2d982.

# oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

Example output:

pod "rook-ceph-operator-6f74fb5bff-2d982" deleted

Verify that the rook-ceph-operator pod is restarted.

# oc get -n openshift-storage pod -l app=rook-ceph-operator

Example output:

NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

Creation of the new OSD may take several minutes after the operator restarts.
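Because the new OSD can take several minutes to appear, a minimal sketch to follow its creation live (press Ctrl+C to stop watching):

# Watch OSD pods until the replacement appears and reaches Running.
oc get -n openshift-storage pods -l app=rook-ceph-osd --watch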
Verification steps
Verify that there is a new OSD running and a new PVC created.
# oc get -n openshift-storage pods -l app=rook-ceph-osd

Example output:

rook-ceph-osd-0-5f7f4747d4-snshw   1/1   Running   0   4m47s
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running   0   1d20h
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running   0   1d20h

# oc get -n openshift-storage pvc | grep localblock

Example output:

ocs-deviceset-0-0-c2mqb   Bound   local-pv-b481410    1490Gi   RWO   localblock   5m
ocs-deviceset-1-0-959rp   Bound   local-pv-414755e0   1490Gi   RWO   localblock   1d20h
ocs-deviceset-2-0-79j94   Bound   local-pv-3e8964d3   1490Gi   RWO   localblock   1d20h

Log in to OpenShift Web Console and view the storage dashboard.
Figure 10.2. OSD status in OpenShift Container Platform storage dashboard after device replacement