Chapter 3. Dynamically provisioned OpenShift Data Foundation deployed on Red Hat Virtualization


3.1. Replacing operational or failed storage devices on Red Hat Virtualization installer-provisioned infrastructure

Create a new Persistent Volume Claim (PVC) on a new volume, and remove the old object storage device (OSD).

Prerequisites

  • Ensure that the data is resilient.

    • In the OpenShift Web Console, click Storage → Data Foundation.
    • Click the Storage Systems tab, and then click ocs-storagecluster-storagesystem.
    • In the Status card of the Block and File dashboard, under the Overview tab, verify that Data Resiliency has a green tick mark.

Procedure

  1. Identify the OSD that needs to be replaced and the OpenShift Container Platform node that has the OSD scheduled on it.

    $ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide

    Example output:

    rook-ceph-osd-0-6d77d6c7c6-m8xj6    0/1    CrashLoopBackOff    0    24h   10.129.0.16   compute-2   <none>           <none>
    rook-ceph-osd-1-85d99fb95f-2svc7    1/1    Running             0    24h   10.128.2.24   compute-0   <none>           <none>
    rook-ceph-osd-2-6c66cdb977-jp542    1/1    Running             0    24h   10.130.0.18   compute-1   <none>           <none>

    In this example, rook-ceph-osd-0-6d77d6c7c6-m8xj6 needs to be replaced and compute-2 is the OpenShift Container Platform node on which the OSD is scheduled.

    Note

    If the OSD to be replaced is healthy, the status of the pod will be Running.

  2. Scale down the OSD deployment for the OSD to be replaced.

    Each time you replace an OSD, set the osd_id_to_remove parameter to the ID of that OSD and repeat this step.

    $ osd_id_to_remove=0
    $ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0

    Here, osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd- prefix. In this example, the deployment name is rook-ceph-osd-0.

    Example output:

    deployment.extensions/rook-ceph-osd-0 scaled
  3. Verify that the rook-ceph-osd pod is terminated.

    $ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}

    Example output:

    No resources found.
    Important

    If the rook-ceph-osd pod remains in the Terminating state, use the --force option to delete the pod.

    $ oc delete -n openshift-storage pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --force --grace-period=0

    Example output:

    warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
      pod "rook-ceph-osd-0-6d77d6c7c6-m8xj6" force deleted
  4. Remove the old OSD from the cluster so that you can add a new OSD.

    1. Delete any old ocs-osd-removal jobs.

      $ oc delete -n openshift-storage job ocs-osd-removal-job

      Example output:

      job.batch "ocs-osd-removal-job" deleted
    2. Navigate to the openshift-storage project.

      $ oc project openshift-storage
    3. Remove the old OSD from the cluster.

      $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -

      Set FORCE_OSD_REMOVAL to true in clusters that have only three OSDs, or in clusters that do not have enough space to restore all three replicas of the data after the OSD is removed.

      Warning

      This step removes the OSD from the cluster completely. Ensure that osd_id_to_remove is set to the correct value.

  5. Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.

    A status of Completed confirms that the OSD removal job succeeded.

    $ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
  6. Ensure that the OSD removal is completed.

    $ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | grep -i 'completed removal'

    Example output:

    2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
    Important

    If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

    For example:

    $ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
  7. If encryption was enabled at the time of installation, remove the dm-crypt managed device-mapper mappings from the OSD devices that were removed from the respective OpenShift Data Foundation nodes.

    1. Get the PVC name(s) of the replaced OSD(s) from the logs of the ocs-osd-removal-job pod.

      $ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | grep -Ei 'pvc|deviceset'

      Example output:

      2021-05-12 14:31:34.666000 I | cephosd: removing the OSD PVC "ocs-deviceset-xxxx-xxx-xxx-xxx"
    2. For each of the previously identified nodes, do the following:

      1. Create a debug pod and chroot to the host on the storage node.

        $ oc debug node/<node name>

        where <node name> is the name of the node.

        $ chroot /host
      2. Find a relevant device name based on the PVC names identified in the previous step.

        $ dmsetup ls | grep <pvc name>

        where <pvc name> is the name of the PVC.

        Example output:

        ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt (253:0)
      3. Remove the mapped device.

        $ cryptsetup luksClose --debug --verbose ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt
        Important

        If the above command gets stuck due to insufficient privileges, run the following commands:

        • Press CTRL+Z to suspend the stuck command and return to the shell prompt.
        • Find the PID of the process which was stuck.

          $ ps -ef | grep crypt
        • Terminate the process using the kill command.

          $ kill -9 <PID>

          where <PID> is the process ID.
        • Verify that the device name is removed.

          $ dmsetup ls
  8. Delete the ocs-osd-removal job.

    $ oc delete -n openshift-storage job ocs-osd-removal-job

    Example output:

    job.batch "ocs-osd-removal-job" deleted
Note

When using an external key management system (KMS) with data encryption, the old OSD encryption key can be removed from the Vault server as it is now an orphan key.
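
The manual lookups in steps 2 and 6 can be scripted. The following is a minimal sketch and not part of the documented procedure: osd_id_from_pod and removal_completed are hypothetical helper names, and the sketch assumes pod names of the form rook-ceph-osd-<ID>-<hash>-<suffix> and the log wording shown in the example output of step 6.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of oc or Rook.

# Print the OSD ID embedded in a rook-ceph-osd pod name.
# Assumes names of the form rook-ceph-osd-<ID>-<hash>-<suffix>.
osd_id_from_pod() {
  rest=${1#rook-ceph-osd-}             # strip the fixed prefix
  [ "$rest" != "$1" ] || return 1      # the prefix was missing
  id=${rest%%-*}                       # keep the text before the next hyphen
  case $id in
    ''|*[!0-9]*) return 1 ;;           # reject anything that is not an integer
  esac
  printf '%s\n' "$id"
}

# Read removal-job logs on stdin and succeed only if they report that
# the given OSD ID was removed (wording taken from the example output).
removal_completed() {
  grep -Eqi "completed removal of OSD $1([^0-9]|\$)"
}
```

For example, osd_id_to_remove=$(osd_id_from_pod rook-ceph-osd-0-6d77d6c7c6-m8xj6) sets the variable to 0, and piping the oc logs command from step 6 into removal_completed "$osd_id_to_remove" returns success once the removal has been logged.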

Verification steps

  1. Verify that there is a new OSD running.

    $ oc get -n openshift-storage pods -l app=rook-ceph-osd

    Example output:

    rook-ceph-osd-0-5f7f4747d4-snshw                                  1/1     Running     0          4m47s
    rook-ceph-osd-1-85d99fb95f-2svc7                                  1/1     Running     0          1d20h
    rook-ceph-osd-2-6c66cdb977-jp542                                  1/1     Running     0          1d20h
  2. Verify that a new PVC has been created and is in the Bound state.

    $ oc get -n openshift-storage pvc
  3. Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    1. Identify the nodes where the new OSD pods are running.

      $ oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/<OSD-pod-name>

      where <OSD-pod-name> is the name of the OSD pod.

      For example:

      $ oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/rook-ceph-osd-0-544db49d7f-qrgqm

      Example output:

      NODE
      compute-1
    2. For each of the previously identified nodes, do the following:

      1. Create a debug pod and open a chroot environment for the selected host(s).

        $ oc debug node/<node name>

        where <node name> is the name of the node.

        $ chroot /host
      2. Check for the crypt keyword beside the ocs-deviceset name(s).

        $ lsblk
  4. Log in to the OpenShift Web Console and view the storage dashboard.
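
The check in verification step 1 can also be done mechanically. The sketch below is an illustration under stated assumptions, not a documented tool: not_running_pods is a hypothetical helper that reads the output of the oc get command from step 1 on standard input and assumes the default column layout, with STATUS in the third column.

```shell
#!/bin/sh
# Hypothetical helper; the name is illustrative and not part of oc.
# Reads `oc get pods` output on stdin and prints the name of every pod
# whose STATUS column (assumed to be the third field) is not "Running".
# Exits non-zero if at least one such pod is found.
not_running_pods() {
  awk '
    $1 == "NAME" { next }                          # skip the header row, if present
    NF >= 3 && $3 != "Running" { print $1; bad = 1 }
    END { exit bad + 0 }
  '
}

# Intended use against a live cluster:
#   oc get -n openshift-storage pods -l app=rook-ceph-osd | not_running_pods
```

An empty result with a zero exit status means that every listed OSD pod reports Running. It does not replace the PVC and encryption checks in steps 2 and 3.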