
Chapter 10. Replacing storage devices

Depending on the type of your deployment, you can choose one of the following procedures to replace a storage device:

10.1. Dynamically provisioned OpenShift Container Storage deployed on AWS

10.1.1. Replacing operational or failed storage devices on AWS user-provisioned infrastructure

When you need to replace a device in a dynamically created storage cluster on an AWS user-provisioned infrastructure, you must replace the storage node. For information about how to replace nodes, see:

10.1.2. Replacing operational or failed storage devices on AWS installer-provisioned infrastructure

When you need to replace a device in a dynamically created storage cluster on an AWS installer-provisioned infrastructure, you must replace the storage node. For information about how to replace nodes, see:

10.2. Dynamically provisioned OpenShift Container Storage deployed on VMware

10.2.1. Replacing operational or failed storage devices on VMware user-provisioned infrastructure

Use this procedure when a virtual machine disk (VMDK) needs to be replaced in an OpenShift Container Storage cluster that is deployed dynamically on VMware infrastructure. The procedure removes the old object storage device (OSD) and creates a new persistent volume claim (PVC) on a new volume.


  1. Identify the OSD that needs to be replaced.

    # oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide

    Example output:

    rook-ceph-osd-0-6d77d6c7c6-m8xj6    0/1    CrashLoopBackOff    0    24h   compute-2   <none>           <none>
    rook-ceph-osd-1-85d99fb95f-2svc7    1/1    Running    0    24h   compute-0   <none>           <none>
    rook-ceph-osd-2-6c66cdb977-jp542    1/1    Running    0    24h   compute-1   <none>           <none>

    In this example, rook-ceph-osd-0-6d77d6c7c6-m8xj6 needs to be replaced.


    If the OSD to be replaced is healthy, the status of the pod will be Running.

  2. Scale down the OSD deployment for the OSD to be replaced.

    # osd_id_to_remove=0
    # oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0

    where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

    Example output:

    deployment.extensions/rook-ceph-osd-0 scaled
  3. Verify that the rook-ceph-osd pod is terminated.

    # oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}

    Example output:

    No resources found.

    If the rook-ceph-osd pod remains in the Terminating state, use the force option to delete the pod.

    # oc delete -n openshift-storage pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --force --grace-period=0

    Example output:

    warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
      pod "rook-ceph-osd-0-6d77d6c7c6-m8xj6" force deleted
  4. Remove the old OSD from the cluster so that a new OSD can be added.

    # oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -

    This step completely removes the OSD from the cluster. Make sure that the value of osd_id_to_remove is correct before you run the command.

  5. Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

    # oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage

    If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    # oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
  6. Delete the PVC resources associated with the OSD to be replaced.

    1. Identify the DeviceSet associated with the OSD to be replaced.

      # oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep

      Example output: ocs-deviceset-0-0-nvs68

      In this example, the PVC name is ocs-deviceset-0-0-nvs68.

    2. Identify the PV associated with the PVC.

      # oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

      where, x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.

      Example output:

      NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      ocs-deviceset-0-0-nvs68   Bound    pvc-0e621d45-7d18-4d35-a282-9700c3cc8524   512Gi      RWO            thin           24h

      In this example, the PVC is ocs-deviceset-0-0-nvs68, which was identified in the previous step, and the associated PV is pvc-0e621d45-7d18-4d35-a282-9700c3cc8524.

    3. Identify the prepare-pod associated with the OSD to be replaced. Use the PVC name obtained in an earlier step.

      # oc describe -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix> | grep Mounted

      where, x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

      Example output:

      Mounted By:    rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7
    4. Delete the osd-prepare pod before removing the associated PVC.

      # oc delete -n openshift-storage pod rook-ceph-osd-prepare-ocs-deviceset-<x>-<y>-<pvc-suffix>-<pod-suffix>

      where, x, y, pvc-suffix, and pod-suffix are the values in the osd-prepare pod name identified in the previous step.

      Example output:

      pod "rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7" deleted
    5. Delete the PVC associated with the device.

      # oc delete -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

      where, x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

      Example output:

      persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
  7. Create a new OSD for the new device.

    1. Delete the deployment for the OSD to be replaced.

      # oc delete -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove}

      Example output:

      deployment.extensions/rook-ceph-osd-0 deleted
    2. Verify that the PV for the device identified in an earlier step is deleted.

      # oc get -n openshift-storage pv pvc-0e621d45-7d18-4d35-a282-9700c3cc8524

      Example output:

      Error from server (NotFound): persistentvolumes "pvc-0e621d45-7d18-4d35-a282-9700c3cc8524" not found

      In this example, the PV name is pvc-0e621d45-7d18-4d35-a282-9700c3cc8524.

      • If the PV still exists, delete the PV associated with the device.

        # oc delete pv pvc-0e621d45-7d18-4d35-a282-9700c3cc8524

        Example output:

        persistentvolume "pvc-0e621d45-7d18-4d35-a282-9700c3cc8524" deleted

        In this example, the PV name is pvc-0e621d45-7d18-4d35-a282-9700c3cc8524.

    3. Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

      1. Identify the name of the rook-ceph-operator.

        # oc get -n openshift-storage pod -l app=rook-ceph-operator

        Example output:

        NAME                                  READY   STATUS    RESTARTS   AGE
        rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
      2. Delete the rook-ceph-operator.

        # oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

        Example output:

        pod "rook-ceph-operator-6f74fb5bff-2d982" deleted

        In this example, the rook-ceph-operator pod name is rook-ceph-operator-6f74fb5bff-2d982.

      3. Verify that the rook-ceph-operator pod is restarted.

        # oc get -n openshift-storage pod -l app=rook-ceph-operator

        Example output:

        NAME                                  READY   STATUS    RESTARTS   AGE
        rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

        Creation of the new OSD may take several minutes after the operator restarts.

  8. Delete the ocs-osd-removal job.

    # oc delete -n openshift-storage job ocs-osd-removal-${osd_id_to_remove}

    Example output:

    job.batch "ocs-osd-removal-0" deleted
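
The deployment, removal job, and PVC names used in this procedure all follow from the pod names shown in the example output. The following is a minimal sketch, not part of OpenShift Container Storage, of how those names could be derived in a shell script instead of being typed by hand. The function names osd_id_from_pod and pvc_from_prepare_pod are hypothetical, and the sketch assumes the naming patterns rook-ceph-osd-<id>-<suffix> and rook-ceph-osd-prepare-<pvc-name>-<pod-suffix> seen in the examples above.

```shell
#!/bin/sh
# Hypothetical helpers (assumptions for illustration, not product tooling).

# Print the OSD ID embedded in a rook-ceph-osd pod name,
# for example rook-ceph-osd-0-6d77d6c7c6-m8xj6 -> 0.
osd_id_from_pod() {
    pod_name=$1
    rest=${pod_name#rook-ceph-osd-}        # strip the fixed prefix
    if [ "$rest" = "$pod_name" ]; then     # prefix was not present
        echo "not an OSD pod name: $pod_name" >&2
        return 1
    fi
    osd_id=${rest%%-*}                     # keep the text before the next hyphen
    case $osd_id in
        ''|*[!0-9]*)
            echo "no numeric OSD ID in: $pod_name" >&2
            return 1
            ;;
    esac
    printf '%s\n' "$osd_id"
}

# Print the PVC name embedded in an osd-prepare pod name by removing the
# rook-ceph-osd-prepare- prefix and the trailing -<pod-suffix>.
pvc_from_prepare_pod() {
    pod_name=$1
    rest=${pod_name#rook-ceph-osd-prepare-}
    if [ "$rest" = "$pod_name" ]; then
        echo "not an osd-prepare pod name: $pod_name" >&2
        return 1
    fi
    printf '%s\n' "${rest%-*}"
}

# Pod names copied from the example output in this procedure.
osd_id_to_remove=$(osd_id_from_pod rook-ceph-osd-0-6d77d6c7c6-m8xj6)
pvc_name=$(pvc_from_prepare_pod rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7)

echo "deployment:  rook-ceph-osd-${osd_id_to_remove}"
echo "removal job: ocs-osd-removal-${osd_id_to_remove}"
echo "PVC:         ${pvc_name}"
```

Because the removal step deletes the OSD completely, a script like this should reject anything that does not yield a numeric ID, which is why the first helper returns an error for names such as the osd-prepare pod.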

Verification steps

  • Verify that there is a new OSD running and a new PVC created.

    # oc get -n openshift-storage pods -l app=rook-ceph-osd

    Example output:

    rook-ceph-osd-0-5f7f4747d4-snshw                                  1/1     Running     0          4m47s
    rook-ceph-osd-1-85d99fb95f-2svc7                                  1/1     Running     0          1d20h
    rook-ceph-osd-2-6c66cdb977-jp542                                  1/1     Running     0          1d20h

    # oc get -n openshift-storage pvc

    Example output:

    NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
    ocs-deviceset-0-0-2s6w4   Bound    pvc-7c9bcaf7-de68-40e1-95f9-0b0d7c0ae2fc   512Gi      RWO            thin            5m
    ocs-deviceset-1-0-q8fwh   Bound    pvc-9e7e00cb-6b33-402e-9dc5-b8df4fd9010f   512Gi      RWO            thin            1d20h
    ocs-deviceset-2-0-9v8lq   Bound    pvc-38cdfcee-ea7e-42a5-a6e1-aaa6d4924291   512Gi      RWO            thin            1d20h
  • Log in to the OpenShift Web Console and view the storage dashboard.

    Figure 10.1. OSD status in OpenShift Container Platform storage dashboard after device replacement

    OCP storage dashboard showing the healthy OSD.
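
The pod check in the verification steps can also be scripted. The sketch below is an illustration only and not part of the product: it filters oc get pods table output for pods whose STATUS column is not Running. The function name not_running_pods is hypothetical, the sample text is copied from the example output in this section, and the sketch assumes the default table layout in which STATUS is the third column.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration, not product tooling).

# Read `oc get pods` table output on stdin and print the name of every pod
# whose STATUS column (third field) is not Running. A header row is skipped.
not_running_pods() {
    awk '$1 != "NAME" && NF >= 3 && $3 != "Running" { print $1 }'
}

# Sample rows copied from the example output at the start of the procedure.
before='rook-ceph-osd-0-6d77d6c7c6-m8xj6    0/1    CrashLoopBackOff    0    24h
rook-ceph-osd-1-85d99fb95f-2svc7    1/1    Running    0    24h
rook-ceph-osd-2-6c66cdb977-jp542    1/1    Running    0    24h'

# Sample rows copied from the verification step.
after='rook-ceph-osd-0-5f7f4747d4-snshw    1/1     Running     0          4m47s
rook-ceph-osd-1-85d99fb95f-2svc7    1/1     Running     0          1d20h
rook-ceph-osd-2-6c66cdb977-jp542    1/1     Running     0          1d20h'

echo "before replacement:"
printf '%s\n' "$before" | not_running_pods

echo "after replacement:"
printf '%s\n' "$after" | not_running_pods
```

On a live cluster, the same function could read from the command used in the verification step, for example by piping the output of oc get -n openshift-storage pods -l app=rook-ceph-osd into not_running_pods. An empty result suggests that every OSD pod reports Running.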

10.3. OpenShift Container Storage deployed using local storage devices

10.3.1. Replacing failed storage devices on Amazon EC2 infrastructure

When you need to replace a storage device on an Amazon EC2 (storage-optimized I3) infrastructure, you must replace the storage node. For information about how to replace nodes, see Replacing failed storage nodes on Amazon EC2 infrastructure.

10.3.2. Replacing operational or failed storage devices on VMware and bare metal infrastructures

You can replace an object storage device (OSD) in OpenShift Container Storage deployed using local storage devices on bare metal and VMware infrastructures. Use this procedure when an underlying storage device needs to be replaced.


  1. Identify the OSD that needs to be replaced and the OpenShift Container Platform node that has the OSD scheduled on it.

    # oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide

    Example output:

    rook-ceph-osd-0-6d77d6c7c6-m8xj6    0/1    CrashLoopBackOff    0    24h   compute-2   <none>           <none>
    rook-ceph-osd-1-85d99fb95f-2svc7    1/1    Running    0    24h   compute-0   <none>           <none>
    rook-ceph-osd-2-6c66cdb977-jp542    1/1    Running    0    24h   compute-1   <none>           <none>

    In this example, rook-ceph-osd-0-6d77d6c7c6-m8xj6 needs to be replaced and compute-2 is the OCP node on which the OSD is scheduled.


    If the OSD to be replaced is healthy, the status of the pod will be Running.

  2. Scale down the OSD deployment for the OSD to be replaced.

    # osd_id_to_remove=0
    # oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0

    where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

    Example output:

    deployment.extensions/rook-ceph-osd-0 scaled
  3. Verify that the rook-ceph-osd pod is terminated.

    # oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}

    Example output:

    No resources found in openshift-storage namespace.

    If the rook-ceph-osd pod remains in the Terminating state, use the force option to delete the pod.

    # oc delete -n openshift-storage pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force

    Example output:

    warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
      pod "rook-ceph-osd-0-6d77d6c7c6-m8xj6" force deleted
  4. Remove the old OSD from the cluster so that a new OSD can be added.

    1. Delete any old ocs-osd-removal jobs.

      # oc delete -n openshift-storage job ocs-osd-removal-${osd_id_to_remove}

      Example output:

      job.batch "ocs-osd-removal-0" deleted
    2. Remove the old OSD from the cluster.

      # oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -

      This step completely removes the OSD from the cluster. Make sure that the value of osd_id_to_remove is correct before you run the command.

  5. Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

    # oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage

    If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    # oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
  6. Delete the persistent volume claim (PVC) resources associated with the OSD to be replaced.

    1. Identify the DeviceSet associated with the OSD to be replaced.

      # oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep

      Example output: ocs-deviceset-0-0-nvs68

      In this example, the PVC name is ocs-deviceset-0-0-nvs68.

    2. Identify the PV associated with the PVC.

      # oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

      where, x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

      Example output:

      NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      ocs-deviceset-0-0-nvs68   Bound    local-pv-d9c5cbd6   100Gi      RWO            localblock     24h

      In this example, the associated PV is local-pv-d9c5cbd6.

    3. Identify the name of the device to be replaced.

      # oc get pv local-pv-<pv-suffix> -o yaml | grep path

      where, pv-suffix is the value in the PV name identified in an earlier step.

      Example output:

      path: /mnt/local-storage/localblock/sdb

      In this example, the device name is sdb.

    4. Identify the prepare-pod associated with the OSD to be replaced.

      # oc describe -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix> | grep Mounted

      where, x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

      Example output:

      Mounted By:    rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7

      In this example, the prepare-pod name is rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7.

    5. Delete the osd-prepare pod before removing the associated PVC.

      # oc delete -n openshift-storage pod rook-ceph-osd-prepare-ocs-deviceset-<x>-<y>-<pvc-suffix>-<pod-suffix>

      where, x, y, pvc-suffix, and pod-suffix are the values in the osd-prepare pod name identified in an earlier step.

      Example output:

      pod "rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-zblp7" deleted
    6. Delete the PVC associated with the OSD to be replaced.

      # oc delete -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

      where, x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

      Example output:

      persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
  7. Replace the old device and use the new device to create a new OpenShift Container Platform PV.

    1. Log in to the OpenShift Container Platform node that has the device to be replaced. In this example, the OpenShift Container Platform node is compute-2.

      # oc debug node/compute-2

      Example output:

      Starting pod/compute-2-debug ...
      To use host binaries, run `chroot /host`
      Pod IP:
      If you don't see a command prompt, try pressing enter.
      # chroot /host
    2. Using the device name identified earlier (sdb in this example), record the /dev/disk/by-id/{id} of the device that is to be replaced.

      # ls -alh /mnt/local-storage/localblock

      Example output:

      total 0
      drwxr-xr-x. 2 root root 17 Apr  8 23:03 .
      drwxr-xr-x. 3 root root 24 Apr  8 23:03 ..
      lrwxrwxrwx. 1 root root 54 Apr  8 23:03 sdb -> /dev/disk/by-id/scsi-36000c2962b2f613ba1f8f4c5cf952237
    3. Find the name of the LocalVolume CR, and remove or comment out the device /dev/disk/by-id/{id} that is to be replaced.

      # oc get -n local-storage localvolume
      NAME          AGE
      local-block   25h
      # oc edit -n local-storage localvolume local-block

      Example output:

        - devicePaths:
          - /dev/disk/by-id/scsi-36000c29346bca85f723c4c1f268b5630
          - /dev/disk/by-id/scsi-36000c29134dfcfaf2dfeeb9f98622786
      #   - /dev/disk/by-id/scsi-36000c2962b2f613ba1f8f4c5cf952237
          storageClassName: localblock
          volumeMode: Block

      Make sure to save the changes after editing the CR.

  8. Log in to the OpenShift Container Platform node that has the device to be replaced and remove the old symlink.

    # oc debug node/compute-2

    Example output:

    Starting pod/compute-2-debug ...
    To use host binaries, run `chroot /host`
    Pod IP:
    If you don't see a command prompt, try pressing enter.
    # chroot /host
    1. Identify the old symlink for the device name to be replaced. In this example, the device name is sdb.

      # ls -alh /mnt/local-storage/localblock

      Example output:

      total 0
      drwxr-xr-x. 2 root root 28 Apr 10 00:42 .
      drwxr-xr-x. 3 root root 24 Apr  8 23:03 ..
      lrwxrwxrwx. 1 root root 54 Apr  8 23:03 sdb -> /dev/disk/by-id/scsi-36000c2962b2f613ba1f8f4c5cf952237
    2. Remove the symlink.

      # rm /mnt/local-storage/localblock/sdb
    3. Verify that the symlink is removed.

      # ls -alh /mnt/local-storage/localblock

      Example output:

      total 0
      drwxr-xr-x. 2 root root 17 Apr 10 00:56 .
      drwxr-xr-x. 3 root root 24 Apr  8 23:03 ..

      Before moving on, check both /dev/mapper and /dev/ for orphaned entries related to Ceph. Use the output of vgdisplay to find them: if anything in /dev/mapper or /dev/ceph-* has ceph in its name and does not appear in the list of VG names, use dmsetup to remove it.

  9. Delete the PV associated with the device to be replaced, which was identified in earlier steps. In this example, the PV name is local-pv-d9c5cbd6.

    # oc delete pv local-pv-d9c5cbd6

    Example output:

    persistentvolume "local-pv-d9c5cbd6" deleted
  10. Replace the device with the new device.
  11. Log back in to the correct OpenShift Container Platform node and identify the device name for the new drive. The device name can be the same as the old device, but the by-id must change unless you are reseating the same device.

    # lsblk

    Example output:

    NAME                         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    sda                            8:0    0   60G  0 disk
    |-sda1                         8:1    0  384M  0 part /boot
    |-sda2                         8:2    0  127M  0 part /boot/efi
    |-sda3                         8:3    0    1M  0 part
    `-sda4                         8:4    0 59.5G  0 part
      `-coreos-luks-root-nocrypt 253:0    0 59.5G  0 dm   /sysroot
    sdb                            8:16   0  100G  0 disk

    In this example, the new device name is sdb.

    1. Identify the /dev/disk/by-id/{id} for the new device and record it.

      # ls -alh /dev/disk/by-id | grep sdb

      Example output:

      lrwxrwxrwx. 1 root root   9 Apr  9 20:45 scsi-36000c29f5c9638dec9f19b220fbe36b1 -> ../../sdb
  12. After the new /dev/disk/by-id/{id} is available, add a new disk entry to the LocalVolume CR.

    1. Find the name of the LocalVolume CR.

      # oc get -n local-storage localvolume
      NAME          AGE
      local-block   25h
    2. Edit the LocalVolume CR and add the new /dev/disk/by-id/{id}. In this example, the new device is /dev/disk/by-id/scsi-36000c29f5c9638dec9f19b220fbe36b1.

      # oc edit -n local-storage localvolume local-block

      Example output:

        - devicePaths:
          - /dev/disk/by-id/scsi-36000c29346bca85f723c4c1f268b5630
          - /dev/disk/by-id/scsi-36000c29134dfcfaf2dfeeb9f98622786
      #   - /dev/disk/by-id/scsi-36000c2962b2f613ba1f8f4c5cf952237
          - /dev/disk/by-id/scsi-36000c29f5c9638dec9f19b220fbe36b1
          storageClassName: localblock
          volumeMode: Block

      Make sure to save the changes after editing the CR.

  13. Verify that there is a new PV in Available state and of the correct size.

    # oc get pv | grep 100Gi

    Example output:

    local-pv-3e8964d3                          100Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock                             25h
    local-pv-414755e0                          100Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock                             25h
    local-pv-b481410                           100Gi      RWO            Delete           Available
  14. Create a new OSD for the new device.

    1. Delete the deployment for the OSD to be replaced.

      # osd_id_to_remove=0
      # oc delete -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove}

      Example output:

      deployment.extensions/rook-ceph-osd-0 deleted
    2. Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

      1. Identify the name of the rook-ceph-operator.

        # oc get -n openshift-storage pod -l app=rook-ceph-operator

        Example output:

        NAME                                  READY   STATUS    RESTARTS   AGE
        rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
      2. Delete the rook-ceph-operator.

        # oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

        Example output:

        pod "rook-ceph-operator-6f74fb5bff-2d982" deleted

        In this example, the rook-ceph-operator pod name is rook-ceph-operator-6f74fb5bff-2d982.

      3. Verify that the rook-ceph-operator pod is restarted.

        # oc get -n openshift-storage pod -l app=rook-ceph-operator

        Example output:

        NAME                                  READY   STATUS    RESTARTS   AGE
        rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

        Creation of the new OSD may take several minutes after the operator restarts.
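
The by-id lookups in this procedure (recording the old /dev/disk/by-id/{id} before removal and the new one after the device is swapped) can be cross-checked with a short script. The following is a sketch under stated assumptions rather than a supported tool: the function names target_for_symlink and by_id_for_device are hypothetical, the listings are copied from the example output above, and the parsing assumes the ls -alh layout shown there, where each symlink line ends with "name -> target".

```shell
#!/bin/sh
# Hypothetical helpers (assumptions for illustration, not product tooling).

# Read `ls -alh /mnt/local-storage/localblock` output on stdin and print the
# symlink target recorded for the given device name (for example sdb).
target_for_symlink() {
    awk -v dev="$1" 'NF >= 3 && $(NF-1) == "->" && $(NF-2) == dev { print $NF }'
}

# Read `ls -alh /dev/disk/by-id` output on stdin and print the full by-id
# path of each link that points at the given kernel device name.
by_id_for_device() {
    awk -v dev="$1" 'NF >= 3 && $(NF-1) == "->" && $NF == ("../../" dev) {
        print "/dev/disk/by-id/" $(NF-2)
    }'
}

# Sample listings copied from the example output in this procedure.
old_listing='total 0
drwxr-xr-x. 2 root root 17 Apr  8 23:03 .
drwxr-xr-x. 3 root root 24 Apr  8 23:03 ..
lrwxrwxrwx. 1 root root 54 Apr  8 23:03 sdb -> /dev/disk/by-id/scsi-36000c2962b2f613ba1f8f4c5cf952237'
new_listing='lrwxrwxrwx. 1 root root   9 Apr  9 20:45 scsi-36000c29f5c9638dec9f19b220fbe36b1 -> ../../sdb'

old_id=$(printf '%s\n' "$old_listing" | target_for_symlink sdb)
new_id=$(printf '%s\n' "$new_listing" | by_id_for_device sdb)

echo "old: $old_id"
echo "new: $new_id"
if [ "$old_id" = "$new_id" ]; then
    echo "by-id is unchanged: the same physical device appears to be in place"
else
    echo "by-id changed: add the new path to the LocalVolume CR"
fi
```

A real /dev/disk/by-id directory can contain more than one link to the same device, in which case by_id_for_device prints one line per link and the comparison needs a manual choice of which identifier to use.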

Verification steps

  • Verify that there is a new OSD running and a new PVC created.

    # oc get -n openshift-storage pods -l app=rook-ceph-osd

    Example output:

    rook-ceph-osd-0-5f7f4747d4-snshw                                  1/1     Running     0          4m47s
    rook-ceph-osd-1-85d99fb95f-2svc7                                  1/1     Running     0          1d20h
    rook-ceph-osd-2-6c66cdb977-jp542                                  1/1     Running     0          1d20h

    # oc get -n openshift-storage pvc | grep localblock

    Example output:

    ocs-deviceset-0-0-c2mqb   Bound    local-pv-b481410                           100Gi     RWO            localblock                    5m
    ocs-deviceset-1-0-959rp   Bound    local-pv-414755e0                          100Gi     RWO            localblock                    1d20h
    ocs-deviceset-2-0-79j94   Bound    local-pv-3e8964d3                          100Gi     RWO            localblock                    1d20h
  • Log in to the OpenShift Web Console and view the storage dashboard.

    Figure 10.2. OSD status in OpenShift Container Platform storage dashboard after device replacement

    OCP storage dashboard showing the healthy OSD.
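
The PV and PVC checks in this procedure can be tied together: the PV that appeared as Available after the LocalVolume CR was edited should be the one that the new PVC is bound to. The sketch below illustrates that cross-check on the example output above. It is not product tooling. The function names available_pvs and claims_bound_to are hypothetical, and the parsing assumes the default oc get pv and oc get pvc table layouts, where access modes appear as a single field such as RWO.

```shell
#!/bin/sh
# Hypothetical helpers (assumptions for illustration, not product tooling).

# Read `oc get pv` table output on stdin and print the names of PVs that
# have the given capacity and are in the Available state.
available_pvs() {
    awk -v size="$1" '$2 == size && $5 == "Available" { print $1 }'
}

# Read `oc get pvc` table output on stdin and print the names of PVCs that
# are Bound to the given PV.
claims_bound_to() {
    awk -v pv="$1" '$2 == "Bound" && $3 == pv { print $1 }'
}

# Sample tables copied from the example output in this procedure.
pv_table='local-pv-3e8964d3   100Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock   25h
local-pv-414755e0   100Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock   25h
local-pv-b481410    100Gi   RWO   Delete   Available'
pvc_table='ocs-deviceset-0-0-c2mqb   Bound   local-pv-b481410    100Gi   RWO   localblock   5m
ocs-deviceset-1-0-959rp   Bound   local-pv-414755e0   100Gi   RWO   localblock   1d20h
ocs-deviceset-2-0-79j94   Bound   local-pv-3e8964d3   100Gi   RWO   localblock   1d20h'

new_pv=$(printf '%s\n' "$pv_table" | available_pvs 100Gi)
new_pvc=$(printf '%s\n' "$pvc_table" | claims_bound_to "$new_pv")

echo "available PV: $new_pv"
echo "claimed by:   $new_pvc"
```

The two tables are captured at different points in the procedure: the PV is Available before the new OSD is created and shows as Bound afterwards, so on a live cluster the first function is only useful before the operator is restarted.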
