Chapter 10. Replacing storage devices
Depending on the type of your deployment, you can choose one of the following procedures to replace a storage device:
- For dynamically created storage clusters deployed on AWS, see Section 10.1, “Dynamically provisioned OpenShift Container Storage deployed on AWS”
- For dynamically created storage clusters deployed on VMware, see Section 10.2.1, “Replacing operational or failed storage devices on VMware user-provisioned infrastructure”
- For storage clusters deployed using local storage devices, see Section 10.3, “OpenShift Container Storage deployed using local storage devices”
10.1. Dynamically provisioned OpenShift Container Storage deployed on AWS
10.1.1. Replacing operational or failed storage devices on AWS user-provisioned infrastructure
When you need to replace a device in a dynamically created storage cluster on an AWS user-provisioned infrastructure, you must replace the storage node. For information about how to replace nodes, see the procedures for replacing storage nodes on AWS user-provisioned infrastructure.
10.1.2. Replacing operational or failed storage devices on AWS installer-provisioned infrastructure
When you need to replace a device in a dynamically created storage cluster on an AWS installer-provisioned infrastructure, you must replace the storage node. For information about how to replace nodes, see the procedures for replacing storage nodes on AWS installer-provisioned infrastructure.
10.2. Dynamically provisioned OpenShift Container Storage deployed on VMware
10.2.1. Replacing operational or failed storage devices on VMware user-provisioned infrastructure
Use this procedure to replace a virtual machine disk (VMDK) in OpenShift Container Storage that is deployed dynamically on VMware infrastructure. The procedure creates a new persistent volume claim (PVC) on a new volume and removes the old object storage device (OSD).
Procedure
Identify the OSD that needs to be replaced.
# oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
Example output:
rook-ceph-osd-0-6d77d6c7c6-m8xj6   0/1   CrashLoopBackOff   0   24h   10.129.0.16   compute-2   <none>   <none>
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running            0   24h   10.128.2.24   compute-0   <none>   <none>
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running            0   24h   10.131.2.32   compute-1   <none>   <none>
In this example, rook-ceph-osd-0-6d77d6c7c6-m8xj6 needs to be replaced.
Note: If the OSD to be replaced is healthy, the status of the pod will be Running.
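If many OSD pods are listed, you can optionally narrow the output to pods that are not healthy. This is a convenience sketch, not part of the documented procedure; it assumes the STATUS column is the third field of the default output:
# oc get -n openshift-storage pods -l app=rook-ceph-osd --no-headers | awk '$3 != "Running"'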
Scale down the OSD deployment for the OSD to be replaced.
# osd_id_to_remove=0
# oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
deployment.extensions/rook-ceph-osd-0 scaled
Verify that the rook-ceph-osd pod is terminated.
# oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
Example output:
No resources found.
Note: If the rook-ceph-osd pod is in terminating state, use the force option to delete the pod.
# oc delete -n openshift-storage pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --force --grace-period=0
Example output:
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-6d77d6c7c6-m8xj6" force deleted
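Optionally, instead of re-running the oc get command, you can block until the pod is gone. A sketch; the timeout value is an arbitrary choice:
# oc wait -n openshift-storage --for=delete pod -l ceph-osd-id=${osd_id_to_remove} --timeout=120s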
Remove the old OSD from the cluster so that a new OSD can be added.
# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
Warning: This step results in the OSD being completely removed from the cluster. Make sure that the correct value of osd_id_to_remove is provided.
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
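Optionally, you can wait for the removal job to complete instead of polling its pod status. A sketch, assuming the job name follows the ocs-osd-removal-<id> pattern used above; the timeout is arbitrary:
# oc wait -n openshift-storage --for=condition=complete job/ocs-osd-removal-${osd_id_to_remove} --timeout=300s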
Delete the PVC resources associated with the OSD to be replaced.
Identify the DeviceSet associated with the OSD to be replaced.
# oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
In this example, the PVC name is ocs-deviceset-0-0-nvs68.
Identify the PV associated with the PVC.
# oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.
Example output:
NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    pvc-0e621d45-7d18-4d35-a282-9700c3cc8524   512Gi      RWO            thin           24h
In this example, the PVC is ocs-deviceset-0-0-nvs68, identified in the previous step, and the associated PV is pvc-0e621d45-7d18-4d35-a282-9700c3cc8524.
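If you are scripting this lookup, the PV name can also be read directly from the PVC spec with jsonpath. A sketch, not part of the documented procedure:
# oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix> -o jsonpath='{.spec.volumeName}'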
Identify the prepare-pod associated with the OSD to be replaced. Use the PVC name obtained in an earlier step.
# oc describe -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix> | grep Mounted
where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
Mounted By: rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7
Delete the osd-prepare pod before removing the associated PVC.
# oc delete -n openshift-storage pod rook-ceph-osd-prepare-ocs-deviceset-<x>-<y>-<pvc-suffix>-<pod-suffix>
where x, y, pvc-suffix, and pod-suffix are the values in the osd-prepare pod name identified in the previous step.
Example output:
pod "rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7" deleted
Delete the PVC associated with the device.
# oc delete -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
Create a new OSD for the new device.
Delete the deployment for the OSD to be replaced.
# oc delete -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove}
Example output:
deployment.extensions/rook-ceph-osd-0 deleted
Verify that the PV for the device identified in an earlier step is deleted.
# oc get -n openshift-storage pv pvc-0e621d45-7d18-4d35-a282-9700c3cc8524
Example output:
Error from server (NotFound): persistentvolumes "pvc-0e621d45-7d18-4d35-a282-9700c3cc8524" not found
In this example, the PV name is pvc-0e621d45-7d18-4d35-a282-9700c3cc8524.
If the PV still exists, delete the PV associated with the device.
# oc delete pv pvc-0e621d45-7d18-4d35-a282-9700c3cc8524
Example output:
persistentvolume "pvc-0e621d45-7d18-4d35-a282-9700c3cc8524" deleted
In this example, the PV name is pvc-0e621d45-7d18-4d35-a282-9700c3cc8524.
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
Identify the name of the rook-ceph-operator.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
Delete the rook-ceph-operator.
# oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
Example output:
pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
In this example, the rook-ceph-operator pod name is rook-ceph-operator-6f74fb5bff-2d982.
Verify that the rook-ceph-operator pod is restarted.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
Creation of the new OSD may take several minutes after the operator restarts.
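Optionally, you can watch for the new OSD pod to appear instead of polling. A convenience sketch:
# oc get -n openshift-storage pods -l app=rook-ceph-osd -w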
Delete the ocs-osd-removal job.
# oc delete -n openshift-storage job ocs-osd-removal-${osd_id_to_remove}
Example output:
job.batch "ocs-osd-removal-0" deleted
Verification steps
Verify that there is a new OSD running and a new PVC created.
# oc get -n openshift-storage pods -l app=rook-ceph-osd
Example output:
rook-ceph-osd-0-5f7f4747d4-snshw   1/1   Running   0   4m47s
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running   0   1d20h
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running   0   1d20h
# oc get -n openshift-storage pvc
Example output:
NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-2s6w4   Bound    pvc-7c9bcaf7-de68-40e1-95f9-0b0d7c0ae2fc   512Gi      RWO            thin           5m
ocs-deviceset-1-0-q8fwh   Bound    pvc-9e7e00cb-6b33-402e-9dc5-b8df4fd9010f   512Gi      RWO            thin           1d20h
ocs-deviceset-2-0-9v8lq   Bound    pvc-38cdfcee-ea7e-42a5-a6e1-aaa6d4924291   512Gi      RWO            thin           1d20h
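If the Rook-Ceph toolbox pod is deployed in your cluster (it is optional, so treat this as an assumption), you can also check that Ceph reports a healthy state after the replacement. A sketch:
# TOOLS_POD=$(oc get -n openshift-storage pod -l app=rook-ceph-tools -o name)
# oc rsh -n openshift-storage ${TOOLS_POD} ceph status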
Log in to the OpenShift Web Console and view the storage dashboard.
Figure 10.1. OSD status in OpenShift Container Platform storage dashboard after device replacement
10.3. OpenShift Container Storage deployed using local storage devices
10.3.1. Replacing failed storage devices on Amazon EC2 infrastructure
When you need to replace a storage device on an Amazon EC2 (storage-optimized I3) infrastructure, you must replace the storage node. For information about how to replace nodes, see Replacing failed storage nodes on Amazon EC2 infrastructure.
10.3.2. Replacing operational or failed storage devices on VMware and bare metal infrastructures
You can replace an object storage device (OSD) in OpenShift Container Storage deployed using local storage devices on bare metal and VMware infrastructures. Use this procedure when an underlying storage device needs to be replaced.
Procedure
Identify the OSD that needs to be replaced and the OpenShift Container Platform node that has the OSD scheduled on it.
# oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
Example output:
rook-ceph-osd-0-6d77d6c7c6-m8xj6   0/1   CrashLoopBackOff   0   24h   10.129.0.16   compute-2   <none>   <none>
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running            0   24h   10.128.2.24   compute-0   <none>   <none>
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running            0   24h   10.130.0.18   compute-1   <none>   <none>
In this example, rook-ceph-osd-0-6d77d6c7c6-m8xj6 needs to be replaced and compute-2 is the OpenShift Container Platform node on which the OSD is scheduled.
Note: If the OSD to be replaced is healthy, the status of the pod will be Running.
Scale down the OSD deployment for the OSD to be replaced.
# osd_id_to_remove=0
# oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
deployment.extensions/rook-ceph-osd-0 scaled
Verify that the rook-ceph-osd pod is terminated.
# oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
Example output:
No resources found in openshift-storage namespace.
Note: If the rook-ceph-osd pod is in terminating state, use the force option to delete the pod.
# oc delete -n openshift-storage pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force
Example output:
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-6d77d6c7c6-m8xj6" force deleted
Remove the old OSD from the cluster so that a new OSD can be added.
Delete any old ocs-osd-removal jobs.
# oc delete -n openshift-storage job ocs-osd-removal-${osd_id_to_remove}
Example output:
job.batch "ocs-osd-removal-0" deleted
Remove the old OSD from the cluster.
# oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
Warning: This step results in the OSD being completely removed from the cluster. Make sure that the correct value of osd_id_to_remove is provided.
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
Delete the persistent volume claim (PVC) resources associated with the OSD to be replaced.
Identify the DeviceSet associated with the OSD to be replaced.
# oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
In this example, the PVC name is ocs-deviceset-0-0-nvs68.
Identify the PV associated with the PVC.
# oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-d9c5cbd6   100Gi      RWO            localblock     24h
In this example, the associated PV is local-pv-d9c5cbd6.
Identify the name of the device to be replaced.
# oc get pv local-pv-<pv-suffix> -o yaml | grep path
where pv-suffix is the value in the PV name identified in an earlier step.
Example output:
path: /mnt/local-storage/localblock/sdb
In this example, the device name is sdb.
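As an alternative to grepping the YAML, the device path can be read with jsonpath. A sketch, assuming the PV was created by the local storage operator and therefore has a spec.local.path field:
# oc get pv local-pv-<pv-suffix> -o jsonpath='{.spec.local.path}'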
Identify the prepare-pod associated with the OSD to be replaced.
# oc describe -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix> | grep Mounted
where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
Mounted By: rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7
In this example, the prepare-pod name is rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7.
Delete the osd-prepare pod before removing the associated PVC.
# oc delete -n openshift-storage pod rook-ceph-osd-prepare-ocs-deviceset-<x>-<y>-<pvc-suffix>-<pod-suffix>
where x, y, pvc-suffix, and pod-suffix are the values in the osd-prepare pod name identified in an earlier step.
Example output:
pod "rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7" deleted
Delete the PVC associated with the OSD to be replaced.
# oc delete -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
Replace the old device and use the new device to create a new OpenShift Container Platform PV.
Log in to the OpenShift Container Platform node with the device to be replaced. In this example, the OpenShift Container Platform node is compute-2.
# oc debug node/compute-2
Example output:
Starting pod/compute-2-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.70.56.66
If you don't see a command prompt, try pressing enter.
# chroot /host
Record the /dev/disk/by-id/{id} that is to be replaced using the device name, sdb, identified earlier.
# ls -alh /mnt/local-storage/localblock
Example output:
total 0
drwxr-xr-x. 2 root root 17 Apr 8 23:03 .
drwxr-xr-x. 3 root root 24 Apr 8 23:03 ..
lrwxrwxrwx. 1 root root 54 Apr 8 23:03 sdb -> /dev/disk/by-id/scsi-36000c2962b2f613ba1f8f4c5cf952237
Find the name of the LocalVolume CR, and remove or comment out the device /dev/disk/by-id/{id} that is to be replaced.
# oc get -n local-storage localvolume
Example output:
NAME          AGE
local-block   25h
# oc edit -n local-storage localvolume local-block
Example output:
[...]
  storageClassDevices:
  - devicePaths:
    - /dev/disk/by-id/scsi-36000c29346bca85f723c4c1f268b5630
    - /dev/disk/by-id/scsi-36000c29134dfcfaf2dfeeb9f98622786
    # - /dev/disk/by-id/scsi-36000c2962b2f613ba1f8f4c5cf952237
    storageClassName: localblock
    volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
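If you prefer a non-interactive edit, a JSON patch can remove (rather than comment out) the device path. This is a sketch only; the index 2 assumes the device to be replaced is the third entry in devicePaths, so verify the index in your own CR first:
# oc patch -n local-storage localvolume local-block --type=json -p '[{"op": "remove", "path": "/spec/storageClassDevices/0/devicePaths/2"}]'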
Log in to the OpenShift Container Platform node with the device to be replaced and remove the old symlink.
# oc debug node/compute-2
Example output:
Starting pod/compute-2-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.70.56.66
If you don't see a command prompt, try pressing enter.
# chroot /host
Identify the old symlink for the device name to be replaced. In this example, the device name is sdb.
# ls -alh /mnt/local-storage/localblock
Example output:
total 0
drwxr-xr-x. 2 root root 28 Apr 10 00:42 .
drwxr-xr-x. 3 root root 24 Apr 8 23:03 ..
lrwxrwxrwx. 1 root root 54 Apr 8 23:03 sdb -> /dev/disk/by-id/scsi-36000c2962b2f613ba1f8f4c5cf952237
Remove the symlink.
# rm /mnt/local-storage/localblock/sdb
Verify that the symlink is removed.
# ls -alh /mnt/local-storage/localblock
Example output:
total 0
drwxr-xr-x. 2 root root 17 Apr 10 00:56 .
drwxr-xr-x. 3 root root 24 Apr 8 23:03 ..
Important: Check both /dev/mapper and /dev/ for orphans related to ceph before moving on. Use the results of vgdisplay to find these orphans. If there is anything in /dev/mapper or /dev/ceph-* with ceph in the name that is not in the list of VG Names, use dmsetup to remove it.
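A minimal sketch of that check, run from the chroot on the node. The dmsetup target below is a hypothetical placeholder; substitute the actual orphaned mapper name, and remove only entries that do not match a VG Name reported by vgdisplay:
# vgdisplay | grep "VG Name"
# ls /dev/mapper /dev/ | grep ceph
# dmsetup remove <orphaned-ceph-dm-name>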
Delete the PV associated with the device to be replaced, which was identified in earlier steps. In this example, the PV name is local-pv-d9c5cbd6.
# oc delete pv local-pv-d9c5cbd6
Example output:
persistentvolume "local-pv-d9c5cbd6" deleted
Replace the old device with the new device.
Log back in to the correct OpenShift Container Platform node and identify the device name for the new drive. The device name can be the same as the old device, but the by-id must change unless you are reseating the same device.
# lsblk
Example output:
NAME                         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   60G  0 disk
|-sda1                         8:1    0  384M  0 part /boot
|-sda2                         8:2    0  127M  0 part /boot/efi
|-sda3                         8:3    0    1M  0 part
`-sda4                         8:4    0 59.5G  0 part
  `-coreos-luks-root-nocrypt 253:0    0 59.5G  0 dm   /sysroot
sdb                            8:16   0  100G  0 disk
In this example, the new device name is sdb.
Identify the /dev/disk/by-id/{id} for the new device and record it.
# ls -alh /dev/disk/by-id | grep sdb
Example output:
lrwxrwxrwx. 1 root root 9 Apr 9 20:45 scsi-36000c29f5c9638dec9f19b220fbe36b1 -> ../../sdb
After the new /dev/disk/by-id/{id} is available, a new disk entry can be added to the LocalVolume CR.
Find the name of the LocalVolume CR.
# oc get -n local-storage localvolume
Example output:
NAME          AGE
local-block   25h
Edit the LocalVolume CR and add the new /dev/disk/by-id/{id}. In this example, the new device is /dev/disk/by-id/scsi-36000c29f5c9638dec9f19b220fbe36b1.
# oc edit -n local-storage localvolume local-block
Example output:
[...]
  storageClassDevices:
  - devicePaths:
    - /dev/disk/by-id/scsi-36000c29346bca85f723c4c1f268b5630
    - /dev/disk/by-id/scsi-36000c29134dfcfaf2dfeeb9f98622786
    # - /dev/disk/by-id/scsi-36000c2962b2f613ba1f8f4c5cf952237
    - /dev/disk/by-id/scsi-36000c29f5c9638dec9f19b220fbe36b1
    storageClassName: localblock
    volumeMode: Block
[...]
Make sure to save the changes after editing the CR.
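The same change can be made non-interactively with a JSON patch. A sketch, assuming a single storageClassDevices entry as in the example CR above:
# oc patch -n local-storage localvolume local-block --type=json -p '[{"op": "add", "path": "/spec/storageClassDevices/0/devicePaths/-", "value": "/dev/disk/by-id/scsi-36000c29f5c9638dec9f19b220fbe36b1"}]'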
Verify that there is a new PV in Available state and of the correct size.
# oc get pv | grep 100Gi
Example output:
local-pv-3e8964d3   100Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock   25h
local-pv-414755e0   100Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock   25h
local-pv-b481410    100Gi   RWO   Delete   Available
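To scope this check to unbound PVs only, you can filter on the STATUS column instead. A convenience sketch, assuming STATUS is the fifth field of the default oc get pv output:
# oc get pv --no-headers | awk '$5 == "Available"'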
Create a new OSD for the new device.
Delete the deployment for the OSD to be replaced.
# osd_id_to_remove=0
# oc delete -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove}
Example output:
deployment.extensions/rook-ceph-osd-0 deleted
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
Identify the name of the rook-ceph-operator.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
Delete the rook-ceph-operator.
# oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
Example output:
pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
In this example, the rook-ceph-operator pod name is rook-ceph-operator-6f74fb5bff-2d982.
Verify that the rook-ceph-operator pod is restarted.
# oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
Creation of the new OSD may take several minutes after the operator restarts.
Verification steps
Verify that there is a new OSD running and a new PVC created.
# oc get -n openshift-storage pods -l app=rook-ceph-osd
Example output:
rook-ceph-osd-0-5f7f4747d4-snshw   1/1   Running   0   4m47s
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running   0   1d20h
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running   0   1d20h
# oc get -n openshift-storage pvc | grep localblock
Example output:
ocs-deviceset-0-0-c2mqb   Bound   local-pv-b481410    100Gi   RWO   localblock   5m
ocs-deviceset-1-0-959rp   Bound   local-pv-414755e0   100Gi   RWO   localblock   1d20h
ocs-deviceset-2-0-79j94   Bound   local-pv-3e8964d3   100Gi   RWO   localblock   1d20h
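Optionally, confirm that the local storage operator recreated the symlink for the new device on the node. A sketch, assuming compute-2 is still the node hosting the replaced device:
# oc debug node/compute-2 -- chroot /host ls -alh /mnt/local-storage/localblock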
Log in to the OpenShift Web Console and view the storage dashboard.
Figure 10.2. OSD status in OpenShift Container Platform storage dashboard after device replacement