
Chapter 9. Replacing storage nodes for OpenShift Container Storage


For OpenShift Container Storage, node replacement can be performed proactively for an operational node and reactively for a failed node for the following deployments:

  • For Amazon Web Services (AWS)

    • User-provisioned infrastructure
    • Installer-provisioned infrastructure
  • For VMware

    • User-provisioned infrastructure
  • For local storage devices

    • Bare metal
    • Amazon EC2 I3
    • VMware
  • To replace your storage nodes in external mode, see the Red Hat Ceph Storage documentation.

9.1. OpenShift Container Storage deployed on AWS

9.1.1. Replacing an operational AWS node on user-provisioned infrastructure

Perform this procedure to replace an operational node on AWS user-provisioned infrastructure.

Procedure

  1. Identify the node that needs to be replaced.
  2. Mark the node as unschedulable using the following command:

    $ oc adm cordon <node_name>
  3. Drain the node using the following command:

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
    Important

    This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.

  4. Delete the node using the following command:

    $ oc delete nodes <node_name>
  5. Create a new AWS machine instance with the required infrastructure. See Platform requirements. (An example aws CLI invocation is sketched after this procedure.)
  6. Create a new OpenShift Container Platform node using the new AWS machine instance.
  7. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    $ oc get csr
  8. Approve all required OpenShift Container Platform CSRs for the new node:

    $ oc adm certificate approve <Certificate_Name>
  9. Click Compute → Nodes and confirm that the new node is in Ready state.
  10. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
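Note

How you create the replacement machine in step 5 depends on how your user-provisioned infrastructure was originally set up. As a rough sketch only, assuming the aws CLI is configured and that the placeholder AMI, instance type, subnet, security group, and worker user data match your existing worker configuration, the instance could be launched with:

    $ aws ec2 run-instances \
        --image-id <rhcos_ami_id> \
        --instance-type <instance_type> \
        --subnet-id <subnet_id> \
        --security-group-ids <worker_security_group_id> \
        --user-data file://<worker_user_data_file> \
        --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=<new_node_name>}]'

Adjust these parameters to match the workers created during the original installation before relying on this command.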

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.
  4. If verification steps fail, contact Red Hat Support.

9.1.2. Replacing an operational AWS node on installer-provisioned infrastructure

Use this procedure to replace an operational node on AWS installer-provisioned infrastructure (IPI).

Procedure

  1. Log in to OpenShift Web Console and click Compute → Nodes.
  2. Identify the node that needs to be replaced. Take a note of its Machine Name.
  3. Mark the node as unschedulable using the following command:

    $ oc adm cordon <node_name>
  4. Drain the node using the following command:

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
    Important

    This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.

  5. Click Compute → Machines and search for the required machine.
  6. Beside the required machine, click the Action menu (⋮) → Delete Machine.
  7. Click Delete to confirm the machine deletion. A new machine is automatically created. (An equivalent CLI approach is sketched after this procedure.)
  8. Wait for the new machine to start and transition into the Running state.

    Important

    This activity might take 5 to 10 minutes or more.

  9. Click Compute → Nodes and confirm that the new node is in Ready state.
  10. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
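Note

If you prefer the command line over the console for steps 5 to 7, the machine can also be deleted with oc. This is a sketch that assumes the worker machines are managed by a machine set in the openshift-machine-api namespace, which is the default for installer-provisioned clusters:

    $ oc get machines -n openshift-machine-api | grep <node_name>
    $ oc delete machine <machine_name> -n openshift-machine-api

The owning machine set then provisions a replacement machine, matching the console behavior described above.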

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.
  4. If verification steps fail, contact Red Hat Support.

9.1.3. Replacing a failed AWS node on user-provisioned infrastructure

Perform this procedure to replace a failed node which is not operational on AWS user-provisioned infrastructure (UPI) for OpenShift Container Storage.

Procedure

  1. Identify the AWS machine instance of the node that needs to be replaced.
  2. Log in to AWS and terminate the identified AWS machine instance. (An example aws CLI lookup and terminate sequence is sketched after this procedure.)
  3. Create a new AWS machine instance with the required infrastructure. See Platform requirements.
  4. Create a new OpenShift Container Platform node using the new AWS machine instance.
  5. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    $ oc get csr
  6. Approve all required OpenShift Container Platform CSRs for the new node:

    $ oc adm certificate approve <Certificate_Name>
  7. Click Compute → Nodes and confirm that the new node is in Ready state.
  8. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
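Note

For step 2, if the aws CLI is configured, one possible way to look up and terminate the instance is sketched below. The private-dns-name filter is an assumption that only holds when the node name matches the instance's private DNS name:

    $ aws ec2 describe-instances \
        --filters "Name=private-dns-name,Values=<node_name>" \
        --query "Reservations[].Instances[].InstanceId" --output text
    $ aws ec2 terminate-instances --instance-ids <instance_id>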

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.
  4. If verification steps fail, contact Red Hat Support.

9.1.4. Replacing a failed AWS node on installer-provisioned infrastructure

Perform this procedure to replace a failed node which is not operational on AWS installer-provisioned infrastructure (IPI) for OpenShift Container Storage.

Procedure

  1. Log in to OpenShift Web Console and click Compute → Nodes.
  2. Identify the faulty node and click its Machine Name.
  3. Click Actions → Edit Annotations, and click Add More.
  4. Add machine.openshift.io/exclude-node-draining and click Save. (A CLI alternative for steps 3 to 5 is sketched after this procedure.)
  5. Click Actions → Delete Machine, and click Delete.
  6. A new machine is automatically created. Wait for the new machine to start.

    Important

    This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.

  7. Click Compute → Nodes and confirm that the new node is in Ready state.
  8. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  9. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  10. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  11. Verify that all other required OpenShift Container Storage pods are in Running state.
  12. If verification steps fail, contact Red Hat Support.
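Note

Steps 3 to 5 can also be performed from the command line. This sketch assumes the machine object lives in the openshift-machine-api namespace:

    $ oc annotate machine <machine_name> machine.openshift.io/exclude-node-draining="" -n openshift-machine-api
    $ oc delete machine <machine_name> -n openshift-machine-api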

9.2. OpenShift Container Storage deployed on VMware

9.2.1. Replacing an operational VMware node on user-provisioned infrastructure

Perform this procedure to replace an operational node on VMware user-provisioned infrastructure (UPI).

Procedure

  1. Identify the node and its VM that need to be replaced.
  2. Mark the node as unschedulable using the following command:

    $ oc adm cordon <node_name>
  3. Drain the node using the following command:

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
    Important

    This activity might take 5 to 10 minutes or more. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.

  4. Delete the node using the following command:

    $ oc delete nodes <node_name>
  5. Log in to vSphere and terminate the identified VM. (A govc command-line alternative is sketched after this procedure.)

    Important

    The VM must be deleted only from the inventory and not from the disk.

  6. Create a new VM on vSphere with the required infrastructure. See Platform requirements.
  7. Create a new OpenShift Container Platform worker node using the new VM.
  8. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    $ oc get csr
  9. Approve all required OpenShift Container Platform CSRs for the new node:

    $ oc adm certificate approve <Certificate_Name>
  10. Click Compute → Nodes and confirm that the new node is in Ready state.
  11. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
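Note

For step 5, if you manage vSphere with the govc CLI, one way to power off the VM and remove it from the inventory without touching its disks is sketched below. This assumes govc is installed and the GOVC_URL and credential environment variables already point at your vCenter:

    $ govc vm.power -off <vm_name>
    $ govc vm.unregister <vm_name>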

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.
  4. If verification steps fail, contact Red Hat Support.

9.2.2. Replacing a failed VMware node on user-provisioned infrastructure

Perform this procedure to replace a failed node on VMware user-provisioned infrastructure (UPI).

Procedure

  1. Identify the node and its VM that need to be replaced.
  2. Delete the node using the following command:

    $ oc delete nodes <node_name>
  3. Log in to vSphere and terminate the identified VM.

    Important

    The VM must be deleted only from the inventory and not from the disk.

  4. Create a new VM on vSphere with the required infrastructure. See Platform requirements.
  5. Create a new OpenShift Container Platform worker node using the new VM.
  6. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    $ oc get csr
  7. Approve all required OpenShift Container Platform CSRs for the new node (a command for approving all pending CSRs at once is sketched after this procedure):

    $ oc adm certificate approve <Certificate_Name>
  8. Click Compute → Nodes and confirm that the new node is in Ready state.
  9. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
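Note

For step 7, all currently pending CSRs can be approved in a single pass instead of one at a time. This is a convenience sketch; review the output of oc get csr first so that only the expected requests are approved:

    $ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve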

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.
  4. If verification steps fail, contact Red Hat Support.

9.3. OpenShift Container Storage deployed using local storage devices

9.3.1. Replacing storage nodes on bare metal infrastructure

9.3.1.1. Replacing an operational node on bare metal user-provisioned infrastructure

Prerequisites

  • You must be logged into the OpenShift Container Platform (OCP) cluster.

Procedure

  1. Identify the node to be replaced and get its labels. Make a note of the rack label.

    $ oc get nodes --show-labels | grep <node_name>
  2. Identify the mon (if any) and object storage device (OSD) pods that are running on the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  3. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  4. Mark the node as unschedulable.

    $ oc adm cordon <node_name>
  5. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  6. Delete the node.

    $ oc delete node <node_name>
  7. Get a new bare metal machine with the required infrastructure. See Installing a cluster on bare metal.
  8. Create a new OpenShift Container Platform node using the new bare metal machine.
  9. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    $ oc get csr
  10. Approve all required OpenShift Container Platform CSRs for the new node:

    $ oc adm certificate approve <Certificate_Name>
  11. Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
  12. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
    $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  13. Add the local storage devices available on the new worker node to the OpenShift Container Storage StorageCluster.

    1. Add a new disk entry to the LocalVolume CR.

      Edit the LocalVolume CR, remove or comment out the failed device /dev/disk/by-id/{id}, and add the new /dev/disk/by-id/{id}. In this example, the new device is /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPYB89THF49128A.

      # oc get -n local-storage localvolume
      NAME          AGE
      local-block   25h
      # oc edit -n local-storage localvolume local-block

      Example output:

      [...]
          storageClassDevices:
          - devicePaths:
            - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPY81260978128A
            - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPY80440W5U128A
            - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPYB85AABDE128A
            - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPYB89THF49128A
            storageClassName: localblock
            volumeMode: Block
      [...]

      Make sure to save the changes after editing the CR.

    2. Display PVs with localblock.

      $ oc get pv | grep localblock

      Example output:

      local-pv-3e8964d3                          931Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock                             25h
      local-pv-414755e0                          931Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock                             25h
      local-pv-b481410                           931Gi      RWO            Delete           Available                                               localblock                             3m24s
      local-pv-d9c5cbd6                          931Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-0-0-nvs68   localblock
  14. Delete the PV associated with the failed node.

    1. Identify the DeviceSet associated with the OSD to be replaced.

      # osd_id_to_remove=0
      # oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

      where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

      Example output:

      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68

      In this example, the PVC name is ocs-deviceset-0-0-nvs68.

    2. Identify the PV associated with the PVC.

      # oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

      where x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.

      Example output:

      NAME                      STATUS        VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      ocs-deviceset-0-0-nvs68   Bound   local-pv-d9c5cbd6   931Gi      RWO            localblock     24h

      In this example, the associated PV is local-pv-d9c5cbd6.

    3. Delete the PVC.

      # oc delete pvc <pvc-name> -n openshift-storage
    4. Delete the PV.

      # oc delete pv local-pv-d9c5cbd6

      Example output:

      persistentvolume "local-pv-d9c5cbd6" deleted
  15. Remove the failed OSD from the cluster.

    # oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
  16. Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

    # oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
    Note

    If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    # oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
  17. Delete OSD pod deployment and crashcollector pod deployment.

    $ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
  18. Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

    # oc get -n openshift-storage pod -l app=rook-ceph-operator

    Example output:

    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
    1. Delete the rook-ceph-operator pod.

      # oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

      Example output:

      pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
    2. Verify that the rook-ceph-operator pod is restarted.

      # oc get -n openshift-storage pod -l app=rook-ceph-operator

      Example output:

      NAME                                  READY   STATUS    RESTARTS   AGE
      rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

      Creation of the new OSD and mon might take several minutes after the operator restarts. (A watch command for tracking them is sketched after this procedure.)

  19. Delete the ocs-osd-removal job.

    # oc delete job ocs-osd-removal-${osd_id_to_remove}

    Example output:

    job.batch "ocs-osd-removal-0" deleted
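Note

After the operator restart in step 18, you can watch the new OSD and mon pods come up instead of polling manually. The label selectors below are the app labels that Rook applies to these pods:

    $ oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide -w
    $ oc get pods -n openshift-storage -l app=rook-ceph-mon -o wide -w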

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Make sure that the new incremental mon is created and is in the Running state.

    $ oc get pod -n openshift-storage | grep mon

    Example output:

    rook-ceph-mon-c-64556f7659-c2ngc                                  1/1     Running     0          6h14m
    rook-ceph-mon-d-7c8b74dc4d-tt6hd                                  1/1     Running     0          4h24m
    rook-ceph-mon-e-57fb8c657-wg5f2                                   1/1     Running     0          162m

    The OSD and mon might take several minutes to reach the Running state.

  4. If verification steps fail, contact Red Hat Support.

9.3.1.2. Replacing a failed node on bare metal user-provisioned infrastructure

Prerequisites

  • You must be logged into the OpenShift Container Platform (OCP) cluster.

Procedure

  1. Identify the node to be replaced and get its labels. Make a note of the rack label.

    $ oc get nodes --show-labels | grep <node_name>
  2. Identify the mon (if any) and object storage device (OSD) pods that are running on the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  3. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  4. Mark the node as unschedulable.

    $ oc adm cordon <node_name>
  5. Remove the pods which are in Terminating state. (A command to first list all pods still on the node is sketched after this procedure.)
    $ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
  6. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  7. Delete the node.

    $ oc delete node <node_name>
  8. Get a new bare metal machine with the required infrastructure. See Installing a cluster on bare metal.
  9. Create a new OpenShift Container Platform node using the new bare metal machine.
  10. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    $ oc get csr
  11. Approve all required OpenShift Container Platform CSRs for the new node:

    $ oc adm certificate approve <Certificate_Name>
  12. Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
  13. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
    $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  14. Add the local storage devices available on the new worker node to the OpenShift Container Storage StorageCluster.

    1. Add a new disk entry to the LocalVolume CR.

      Edit the LocalVolume CR, remove or comment out the failed device /dev/disk/by-id/{id}, and add the new /dev/disk/by-id/{id}. In this example, the new device is /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPYB89THF49128A.

      # oc get -n local-storage localvolume
      NAME          AGE
      local-block   25h
      # oc edit -n local-storage localvolume local-block

      Example output:

      [...]
          storageClassDevices:
          - devicePaths:
            - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPY81260978128A
            - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPY80440W5U128A
            - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPYB85AABDE128A
            - /dev/disk/by-id/nvme-INTEL_SSDPEKKA128G7_BTPYB89THF49128A
            storageClassName: localblock
            volumeMode: Block
      [...]

      Make sure to save the changes after editing the CR.

    2. Display PVs with localblock.

      $ oc get pv | grep localblock

      Example output:

      local-pv-3e8964d3                          931Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock                             25h
      local-pv-414755e0                          931Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock                             25h
      local-pv-b481410                           931Gi      RWO            Delete           Available                                               localblock                             3m24s
      local-pv-d9c5cbd6                          931Gi      RWO            Delete           Bound    openshift-storage/ocs-deviceset-0-0-nvs68   localblock
  15. Delete the PV associated with the failed node.

    1. Identify the DeviceSet associated with the OSD to be replaced.

      # osd_id_to_remove=0
      # oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

      where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

      Example output:

      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68

      In this example, the PVC name is ocs-deviceset-0-0-nvs68.

    2. Identify the PV associated with the PVC.

      # oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

      where x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.

      Example output:

      NAME                      STATUS        VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      ocs-deviceset-0-0-nvs68   Bound   local-pv-d9c5cbd6   931Gi      RWO            localblock     24h

      In this example, the associated PV is local-pv-d9c5cbd6.

    3. Delete the PVC.

      # oc delete pvc <pvc-name> -n openshift-storage
    4. Delete the PV.

      # oc delete pv local-pv-d9c5cbd6

      Example output:

      persistentvolume "local-pv-d9c5cbd6" deleted
  16. Remove the failed OSD from the cluster.

    # oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
  17. Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

    # oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
    Note

    If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    # oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
  18. Delete OSD pod deployment and crashcollector pod deployment.

    $ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
  19. Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

    # oc get -n openshift-storage pod -l app=rook-ceph-operator

    Example output:

    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
    1. Delete the rook-ceph-operator pod.

      # oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

      Example output:

      pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
    2. Verify that the rook-ceph-operator pod is restarted.

      # oc get -n openshift-storage pod -l app=rook-ceph-operator

      Example output:

      NAME                                  READY   STATUS    RESTARTS   AGE
      rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

      Creation of the new OSD and mon might take several minutes after the operator restarts.

  20. Delete the ocs-osd-removal job.

    # oc delete job ocs-osd-removal-${osd_id_to_remove}

    Example output:

    job.batch "ocs-osd-removal-0" deleted
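Note

Before force deleting the Terminating pods in step 5, it can help to list everything still scheduled on the failed node so that you know exactly what the one-liner will remove. A field selector limits the output to that node:

    $ oc get pods -A -o wide --field-selector spec.nodeName=<node_name>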

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Make sure that the new incremental mon is created and is in the Running state.

    $ oc get pod -n openshift-storage | grep mon

    Example output:

    rook-ceph-mon-c-64556f7659-c2ngc                                  1/1     Running     0          6h14m
    rook-ceph-mon-d-7c8b74dc4d-tt6hd                                  1/1     Running     0          4h24m
    rook-ceph-mon-e-57fb8c657-wg5f2                                   1/1     Running     0          162m

    The OSD and mon might take several minutes to reach the Running state.

  4. If verification steps fail, contact Red Hat Support.

9.3.2. Replacing storage nodes on Amazon EC2 infrastructure

9.3.2.1. Replacing an operational Amazon EC2 node on user-provisioned infrastructure

Perform this procedure to replace an operational node on Amazon EC2 I3 user-provisioned infrastructure (UPI).

Important

Replacing storage nodes in Amazon EC2 I3 infrastructure is a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

Prerequisites

  • You must be logged into the OpenShift Container Platform (OCP) cluster.

Procedure

  1. Identify the node to be replaced and get its labels.

    $ oc get nodes --show-labels | grep <node_name>
  2. Identify the mon (if any) and OSDs that are running on the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  3. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  4. Mark the node as unschedulable.

    $ oc adm cordon <node_name>
  5. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  6. Delete the node.

    $ oc delete node <node_name>
  7. Create a new Amazon EC2 I3 machine instance with the required infrastructure. See Supported Infrastructure and Platforms.
  8. Create a new OpenShift Container Platform node using the new Amazon EC2 I3 machine instance.
  9. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    $ oc get csr
  10. Approve all required OpenShift Container Platform CSRs for the new node:

    $ oc adm certificate approve <Certificate_Name>
  11. Click Compute → Nodes in the OpenShift web console and confirm that the new node is in Ready state.
  12. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
    $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  13. Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.

    1. Add the new disk entries to the LocalVolume CR.

      Edit the LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}. (A command for listing the by-id paths on the new node is sketched after this procedure.)

      $ oc get -n local-storage localvolume

      Example output:

      NAME          AGE
      local-block   25h
      $ oc edit -n local-storage localvolume local-block

      Example output:

      [...]
          storageClassDevices:
          - devicePaths:
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441494EC
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84FE3E9
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE4
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441464EP
        #   - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84F43E7
        #   - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE8
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
            storageClassName: localblock
            volumeMode: Block
      [...]

      Make sure to save the changes after editing the CR.

      In this CR, you can see that the following two new devices have been added using their by-id paths.

      • nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
      • nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
    2. Display PVs with localblock.

      $ oc get pv | grep localblock

      Example output:

      local-pv-3646185e   2328Gi  RWO     Delete      Available                                               localblock  9s
      local-pv-3933e86    2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock  5h1m
      local-pv-8176b2bf   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock  5h1m
      local-pv-ab7cabb3   2328Gi  RWO     Delete      Available                                               localblock  9s
      local-pv-ac52e8a    2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock  5h1m
      local-pv-b7e6fd37   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock  5h1m
      local-pv-cb454338   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock  5h1m
      local-pv-da5e3175   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock  5h
      ...
  14. Delete each PV and OSD associated with the failed node using the following steps.

    1. Identify the DeviceSet associated with the OSD to be replaced.

      $ osd_id_to_remove=0
      $ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

      where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

      Example output:

      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
    2. Identify the PV associated with the PVC.

      $ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

      where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

      Example output:

      NAME                      STATUS        VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      ocs-deviceset-0-0-nvs68   Bound   local-pv-8176b2bf   2328Gi      RWO            localblock     4h49m

      In this example, the associated PV is local-pv-8176b2bf.

    3. Delete the PVC which was identified in earlier steps. In this example, the PVC name is ocs-deviceset-0-0-nvs68.

      $ oc delete pvc ocs-deviceset-0-0-nvs68 -n openshift-storage

      Example output:

      persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
    4. Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.

      $ oc delete pv local-pv-8176b2bf

      Example output:

      persistentvolume "local-pv-8176b2bf" deleted
    5. Remove the failed OSD from the cluster.

      $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
    6. Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

      # oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
      Note

      If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

      # oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
    7. Delete the OSD pod deployment.

      $ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
  15. Delete the crashcollector pod deployment identified in an earlier step.

    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
  16. Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

    $ oc get -n openshift-storage pod -l app=rook-ceph-operator

    Example output:

    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          5h3m
    1. Delete the rook-ceph-operator pod.

      $ oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

      Example output:

      pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
    2. Verify that the rook-ceph-operator pod is restarted.

      $ oc get -n openshift-storage pod -l app=rook-ceph-operator

      Example output:

      NAME                                  READY   STATUS    RESTARTS   AGE
      rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

      Creation of the new OSD may take several minutes after the operator starts.

  17. Delete the ocs-osd-removal job(s).

    $ oc delete job ocs-osd-removal-${osd_id_to_remove}

    Example output:

    job.batch "ocs-osd-removal-0" deleted
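Note

To find the by-id paths of the new instance store devices referenced in step 13, you can inspect the new node directly. This sketch uses oc debug with a chroot into the host filesystem:

    $ oc debug node/<new_node_name> -- chroot /host ls -l /dev/disk/by-id/ | grep nvme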

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Also, ensure that the new incremental mon is created and is in the Running state.

    $ oc get pod -n openshift-storage | grep mon

    Example output:

    rook-ceph-mon-a-64556f7659-c2ngc    1/1     Running     0   5h1m
    rook-ceph-mon-b-7c8b74dc4d-tt6hd    1/1     Running     0   5h1m
    rook-ceph-mon-d-57fb8c657-wg5f2     1/1     Running     0   27m

    OSDs and mons might take several minutes to reach the Running state.

  4. If verification steps fail, contact Red Hat Support.

9.3.2.2. Replacing an operational Amazon EC2 node on installer-provisioned infrastructure

Use this procedure to replace an operational node on Amazon EC2 I3 installer-provisioned infrastructure (IPI).

Important

Replacing storage nodes in Amazon EC2 I3 infrastructure is a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

Prerequisites

  • You must be logged into the OpenShift Container Platform (OCP) cluster.

Procedure

  1. Log in to OpenShift Web Console and click Compute → Nodes.
  2. Identify the node that needs to be replaced. Take a note of its Machine Name.
  3. Get labels on the node to be replaced.

    $ oc get nodes --show-labels | grep <node_name>
  4. Identify the mon (if any) and OSDs that are running on the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  5. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  6. Mark the node as unschedulable.

    $ oc adm cordon <node_name>
  7. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  8. Click Compute → Machines and search for the required machine.
  9. Beside the required machine, click the Action menu (⋮) → Delete Machine.
  10. Click Delete to confirm the machine deletion. A new machine is automatically created.
  11. Wait for the new machine to start and transition into the Running state. (A CLI watch command is sketched after this procedure.)

    Important

    This activity might take 5 to 10 minutes or more.

  12. Click Compute → Nodes in the OpenShift web console and confirm that the new node is in Ready state.
  13. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
    $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  14. Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.

    1. Add the new disk entries to the LocalVolume CR.

      Edit the LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}.

      $ oc get -n local-storage localvolume

      Example output:

      NAME          AGE
      local-block   25h
      $ oc edit -n local-storage localvolume local-block

      Example output:

      [...]
          storageClassDevices:
          - devicePaths:
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441494EC
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84FE3E9
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE4
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441464EP
        #   - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84F43E7
        #   - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE8
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
            storageClassName: localblock
            volumeMode: Block
      [...]

      Make sure to save the changes after editing the CR.

      In this CR, you can see that the following two new devices have been added using their by-id paths.

      • nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
      • nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
    2. Display PVs with localblock.

      $ oc get pv | grep localblock

      Example output:

      local-pv-3646185e   2328Gi  RWO     Delete      Available                                               localblock  9s
      local-pv-3933e86    2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock  5h1m
      local-pv-8176b2bf   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock  5h1m
      local-pv-ab7cabb3   2328Gi  RWO     Delete      Available                                               localblock  9s
      local-pv-ac52e8a    2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock  5h1m
      local-pv-b7e6fd37   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock  5h1m
      local-pv-cb454338   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock  5h1m
      local-pv-da5e3175   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock  5h
      ...
  15. Delete each PV and OSD associated with the failed node using the following steps.

    1. Identify the DeviceSet associated with the OSD to be replaced.

      $ osd_id_to_remove=0
      $ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

      where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

      Example output:

      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
    2. Identify the PV associated with the PVC.

      $ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

      where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

      Example output:

      NAME                      STATUS        VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      ocs-deviceset-0-0-nvs68   Bound   local-pv-8176b2bf   2328Gi      RWO            localblock     4h49m

      In this example, the associated PV is local-pv-8176b2bf.

    3. Delete the PVC which was identified in earlier steps. In this example, the PVC name is ocs-deviceset-0-0-nvs68.

      $ oc delete pvc ocs-deviceset-0-0-nvs68 -n openshift-storage

      Example output:

      persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
    4. Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.

      $ oc delete pv local-pv-8176b2bf

      Example output:

      persistentvolume "local-pv-8176b2bf" deleted
    5. Remove the failed OSD from the cluster.

      $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
    6. Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

      # oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
      Note

      If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

      # oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
    7. Delete the OSD pod deployment.

      $ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
  16. Delete the crashcollector pod deployment identified in an earlier step.

    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
  17. Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

    $ oc get -n openshift-storage pod -l app=rook-ceph-operator

    Example output:

    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          5h3m
    1. Delete the rook-ceph-operator pod.

      $ oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

      Example output:

      pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
    2. Verify that the rook-ceph-operator pod is restarted.

      $ oc get -n openshift-storage pod -l app=rook-ceph-operator

      Example output:

      NAME                                  READY   STATUS    RESTARTS   AGE
      rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

      Creation of the new OSD may take several minutes after the operator starts.

  18. Delete the ocs-osd-removal job(s).

    $ oc delete job ocs-osd-removal-${osd_id_to_remove}

    Example output:

    job.batch "ocs-osd-removal-0" deleted
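Note

For step 11, the machine phase can also be watched from the command line instead of refreshing the console. This assumes the machine objects are in the openshift-machine-api namespace:

    $ oc get machines -n openshift-machine-api -w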

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Also, ensure that the new incremental mon is created and is in the Running state.

    Copy to Clipboard Toggle word wrap
    $ oc get pod -n openshift-storage | grep mon

    Example output:

    Copy to Clipboard Toggle word wrap
    rook-ceph-mon-a-64556f7659-c2ngc    1/1     Running     0   5h1m
    rook-ceph-mon-b-7c8b74dc4d-tt6hd    1/1     Running     0   5h1m
    rook-ceph-mon-d-57fb8c657-wg5f2     1/1     Running     0   27m

    OSDs and mons might take several minutes to get to the Running state.
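
    Optionally, if the rook-ceph-tools (toolbox) pod is deployed in the openshift-storage namespace, you can also check the overall Ceph health. This is a minimal sketch; it assumes the toolbox pod carries the standard app=rook-ceph-tools label:

    $ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
    $ oc rsh -n openshift-storage $TOOLS_POD ceph status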

  4. If verification steps fail, contact Red Hat Support.

9.3.2.3. Replacing a failed Amazon EC2 node on user-provisioned infrastructure

Amazon EC2 I3 instances use ephemeral storage for OpenShift Container Storage, so data loss can occur when an instance is powered off. Use this procedure to recover from such an instance power-off on Amazon EC2 infrastructure.

Important

Replacing storage nodes in Amazon EC2 I3 infrastructure is a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

Prerequisites

  • You must be logged into the OpenShift Container Platform (OCP) cluster.

Procedure

  1. Identify the node to be replaced and get its labels.

    Copy to Clipboard Toggle word wrap
    $ oc get nodes --show-labels | grep <node_name>
  2. Identify the mon (if any) and OSDs that are running in the node to be replaced.

    Copy to Clipboard Toggle word wrap
    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  3. Scale down the deployments of the pods identified in the previous step.

    For example:

    Copy to Clipboard Toggle word wrap
    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  4. Mark the node as unschedulable.

    Copy to Clipboard Toggle word wrap
    $ oc adm cordon <node_name>
  5. Remove the pods which are in Terminating state.

    Copy to Clipboard Toggle word wrap
    $ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
  6. Drain the node.

    Copy to Clipboard Toggle word wrap
    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  7. Delete the node.

    Copy to Clipboard Toggle word wrap
    $ oc delete node <node_name>
  8. Create a new Amazon EC2 I3 machine instance with the required infrastructure. See Supported Infrastructure and Platforms.
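
    If you provision the instance with the AWS CLI, the following is a minimal sketch only; the AMI ID, subnet, security group, key pair, and the worker Ignition file passed as user data are placeholders that depend on your cluster installation:

    $ aws ec2 run-instances \
        --image-id <rhcos_ami_id> \
        --instance-type i3.4xlarge \
        --subnet-id <subnet_id> \
        --security-group-ids <security_group_id> \
        --key-name <key_pair_name> \
        --user-data file://<worker_ignition_file> \
        --count 1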
  9. Create a new OpenShift Container Platform node using the new Amazon EC2 I3 machine instance.
  10. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    Copy to Clipboard Toggle word wrap
    $ oc get csr
  11. Approve all required OpenShift Container Platform CSRs for the new node:

    Copy to Clipboard Toggle word wrap
    $ oc adm certificate approve <Certificate_Name>
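
    If several CSRs are pending, you can optionally approve them in bulk. This one-liner is a sketch; review the pending requests before approving them:

    $ oc get csr -o name | xargs oc adm certificate approve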
  12. Click Compute Nodes in the OpenShift web console. Confirm that the new node is in Ready state.
  13. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
    Copy to Clipboard Toggle word wrap
    $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  14. Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.

    1. Add the new disk entries to LocalVolume CR.

      Edit LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}.

      Copy to Clipboard Toggle word wrap
      $ oc get -n local-storage localvolume

      Example output:

      Copy to Clipboard Toggle word wrap
      NAME          AGE
      local-block   25h
      Copy to Clipboard Toggle word wrap
      $ oc edit -n local-storage localvolume local-block

      Example output:

      Copy to Clipboard Toggle word wrap
      [...]
          storageClassDevices:
          - devicePaths:
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441494EC
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84FE3E9
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE4
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441464EP
        #   - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84F43E7
        #   - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE8
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
            storageClassName: localblock
            volumeMode: Block
      [...]

      Make sure to save the changes after editing the CR.

      In this CR, the following two new devices have been added using their by-id paths:

      • nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
      • nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
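
      If you need to look up the by-id paths of the NVMe devices on the new node before editing the CR, you can list them from a debug shell. This is a minimal sketch; <new_node_name> is a placeholder for the node you just added:

      $ oc debug node/<new_node_name> -- chroot /host ls -l /dev/disk/by-id/ | grep nvme-Amazon_EC2_NVMe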
    2. Display PVs with localblock.

      Copy to Clipboard Toggle word wrap
      $ oc get pv | grep localblock

      Example output:

      Copy to Clipboard Toggle word wrap
      local-pv-3646185e   2328Gi  RWO     Delete      Available                                               localblock  9s
      local-pv-3933e86    2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock  5h1m
      local-pv-8176b2bf   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock  5h1m
      local-pv-ab7cabb3   2328Gi  RWO     Delete      Available                                               localblock  9s
      local-pv-ac52e8a    2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock  5h1m
      local-pv-b7e6fd37   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock  5h1m
      local-pv-cb454338   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock  5h1m
      local-pv-da5e3175   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock  5h
      ...
  15. Delete each PV and OSD associated with the failed node using the following steps.

    1. Identify the DeviceSet associated with the OSD to be replaced.

      Copy to Clipboard Toggle word wrap
      $ osd_id_to_remove=0
      $ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

      where, osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

      Example output:

      Copy to Clipboard Toggle word wrap
      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
    2. Identify the PV associated with the PVC.

      Copy to Clipboard Toggle word wrap
      $ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

      where, x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

      Example output:

      Copy to Clipboard Toggle word wrap
      NAME                      STATUS        VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      ocs-deviceset-0-0-nvs68   Bound   local-pv-8176b2bf   2328Gi      RWO            localblock     4h49m

      In this example, the associated PV is local-pv-8176b2bf.

    3. Delete the PVC which was identified in earlier steps. In this example, the PVC name is ocs-deviceset-0-0-nvs68.

      Copy to Clipboard Toggle word wrap
      $ oc delete pvc ocs-deviceset-0-0-nvs68 -n openshift-storage

      Example output:

      Copy to Clipboard Toggle word wrap
      persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
    4. Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.

      Copy to Clipboard Toggle word wrap
      $ oc delete pv local-pv-8176b2bf

      Example output:

      Copy to Clipboard Toggle word wrap
      persistentvolume "local-pv-8176b2bf" deleted
    5. Remove the failed OSD from the cluster.

      Copy to Clipboard Toggle word wrap
      $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
    6. Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

      Copy to Clipboard Toggle word wrap
      # oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
      Note

      If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

      Copy to Clipboard Toggle word wrap
      # oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
    7. Delete the OSD pod deployment.

      Copy to Clipboard Toggle word wrap
      $ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
  16. Delete the crashcollector pod deployment identified in an earlier step.

    Copy to Clipboard Toggle word wrap
    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
  17. Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

    Copy to Clipboard Toggle word wrap
    $ oc get -n openshift-storage pod -l app=rook-ceph-operator

    Example output:

    Copy to Clipboard Toggle word wrap
    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          5h3m
    1. Delete the rook-ceph-operator pod.

      Copy to Clipboard Toggle word wrap
      $ oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

      Example output:

      Copy to Clipboard Toggle word wrap
      pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
    2. Verify that the rook-ceph-operator pod is restarted.

      Copy to Clipboard Toggle word wrap
      $ oc get -n openshift-storage pod -l app=rook-ceph-operator

      Example output:

      Copy to Clipboard Toggle word wrap
      NAME                                  READY   STATUS    RESTARTS   AGE
      rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

      Creation of the new OSD may take several minutes after the operator starts.

  18. Delete the ocs-osd-removal job(s).

    Copy to Clipboard Toggle word wrap
    $ oc delete job ocs-osd-removal-${osd_id_to_remove}

    Example output:

    Copy to Clipboard Toggle word wrap
    job.batch "ocs-osd-removal-0" deleted

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    Copy to Clipboard Toggle word wrap
    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads Pods, confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Also, ensure that the new incremental mon is created and is in the Running state.

    Copy to Clipboard Toggle word wrap
    $ oc get pod -n openshift-storage | grep mon

    Example output:

    Copy to Clipboard Toggle word wrap
    rook-ceph-mon-a-64556f7659-c2ngc    1/1     Running     0   5h1m
    rook-ceph-mon-b-7c8b74dc4d-tt6hd    1/1     Running     0   5h1m
    rook-ceph-mon-d-57fb8c657-wg5f2     1/1     Running     0   27m

    OSDs and mons might take several minutes to get to the Running state.

  4. If verification steps fail, contact Red Hat Support.

9.3.2.4. Replacing a failed Amazon EC2 node on installer-provisioned infrastructure

Amazon EC2 I3 instances use ephemeral storage for OpenShift Container Storage, so data loss can occur when an instance is powered off. Use this procedure to recover from such an instance power-off on Amazon EC2 infrastructure.

Important

Replacing storage nodes in Amazon EC2 I3 infrastructure is a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

Prerequisites

  • You must be logged into the OpenShift Container Platform (OCP) cluster.

Procedure

  1. Log in to OpenShift Web Console and click Compute Nodes.
  2. Identify the node that needs to be replaced. Take a note of its Machine Name.
  3. Get the labels on the node to be replaced.

    Copy to Clipboard Toggle word wrap
    $ oc get nodes --show-labels | grep <node_name>
  4. Identify the mon (if any) and OSDs that are running in the node to be replaced.

    Copy to Clipboard Toggle word wrap
    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  5. Scale down the deployments of the pods identified in the previous step.

    For example:

    Copy to Clipboard Toggle word wrap
    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  6. Mark the node as unschedulable.

    Copy to Clipboard Toggle word wrap
    $ oc adm cordon <node_name>
  7. Remove the pods which are in Terminating state.

    Copy to Clipboard Toggle word wrap
    $ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
  8. Drain the node.

    Copy to Clipboard Toggle word wrap
    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  9. Click Compute Machines. Search for the required machine.
  10. Beside the required machine, click the Action menu (⋮) Delete Machine.
  11. Click Delete to confirm the machine deletion. A new machine is automatically created.
  12. Wait for the new machine to start and transition into Running state.

    Important

    This activity may take at least 5-10 minutes or more.
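
    You can optionally watch the machine status from the command line while you wait. This is a sketch; for installer-provisioned clusters, the machine objects are created in the openshift-machine-api namespace:

    $ oc get machines -n openshift-machine-api -w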

  13. Click Compute Nodes in the OpenShift web console. Confirm that the new node is in Ready state.
  14. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
    Copy to Clipboard Toggle word wrap
    $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  15. Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.

    1. Add the new disk entries to LocalVolume CR.

      Edit LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}.

      Copy to Clipboard Toggle word wrap
      $ oc get -n local-storage localvolume

      Example output:

      Copy to Clipboard Toggle word wrap
      NAME          AGE
      local-block   25h
      Copy to Clipboard Toggle word wrap
      $ oc edit -n local-storage localvolume local-block

      Example output:

      Copy to Clipboard Toggle word wrap
      [...]
          storageClassDevices:
          - devicePaths:
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441494EC
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84FE3E9
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE4
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS10382E5D7441464EP
        #   - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1F45C01D7E84F43E7
        #   - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS136BC945B4ECB9AE8
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
            - /dev/disk/by-id/nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
            storageClassName: localblock
            volumeMode: Block
      [...]

      Make sure to save the changes after editing the CR.

      In this CR, the following two new devices have been added using their by-id paths:

      • nvme-Amazon_EC2_NVMe_Instance_Storage_AWS6F45C01D7E84FE3E9
      • nvme-Amazon_EC2_NVMe_Instance_Storage_AWS636BC945B4ECB9AE4
    2. Display PVs with localblock.

      Copy to Clipboard Toggle word wrap
      $ oc get pv | grep localblock

      Example output:

      Copy to Clipboard Toggle word wrap
      local-pv-3646185e   2328Gi  RWO     Delete      Available                                               localblock  9s
      local-pv-3933e86    2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock  5h1m
      local-pv-8176b2bf   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock  5h1m
      local-pv-ab7cabb3   2328Gi  RWO     Delete      Available                                               localblock  9s
      local-pv-ac52e8a    2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock  5h1m
      local-pv-b7e6fd37   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock  5h1m
      local-pv-cb454338   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock  5h1m
      local-pv-da5e3175   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock  5h
      ...
  16. Delete each PV and OSD associated with the failed node using the following steps.

    1. Identify the DeviceSet associated with the OSD to be replaced.

      Copy to Clipboard Toggle word wrap
      $ osd_id_to_remove=0
      $ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

      where, osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

      Example output:

      Copy to Clipboard Toggle word wrap
      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
    2. Identify the PV associated with the PVC.

      Copy to Clipboard Toggle word wrap
      $ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

      where, x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

      Example output:

      Copy to Clipboard Toggle word wrap
      NAME                      STATUS        VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      ocs-deviceset-0-0-nvs68   Bound   local-pv-8176b2bf   2328Gi      RWO            localblock     4h49m

      In this example, the associated PV is local-pv-8176b2bf.

    3. Delete the PVC which was identified in earlier steps. In this example, the PVC name is ocs-deviceset-0-0-nvs68.

      Copy to Clipboard Toggle word wrap
      $ oc delete pvc ocs-deviceset-0-0-nvs68 -n openshift-storage

      Example output:

      Copy to Clipboard Toggle word wrap
      persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
    4. Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.

      Copy to Clipboard Toggle word wrap
      $ oc delete pv local-pv-8176b2bf

      Example output:

      Copy to Clipboard Toggle word wrap
      persistentvolume "local-pv-8176b2bf" deleted
    5. Remove the failed OSD from the cluster.

      Copy to Clipboard Toggle word wrap
      $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
    6. Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

      Copy to Clipboard Toggle word wrap
      # oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
      Note

      If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

      Copy to Clipboard Toggle word wrap
      # oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
    7. Delete the OSD pod deployment.

      Copy to Clipboard Toggle word wrap
      $ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
  17. Delete the crashcollector pod deployment identified in an earlier step.

    Copy to Clipboard Toggle word wrap
    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
  18. Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

    Copy to Clipboard Toggle word wrap
    $ oc get -n openshift-storage pod -l app=rook-ceph-operator

    Example output:

    Copy to Clipboard Toggle word wrap
    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          5h3m
    1. Delete the rook-ceph-operator pod.

      Copy to Clipboard Toggle word wrap
      $ oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

      Example output:

      Copy to Clipboard Toggle word wrap
      pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
    2. Verify that the rook-ceph-operator pod is restarted.

      Copy to Clipboard Toggle word wrap
      $ oc get -n openshift-storage pod -l app=rook-ceph-operator

      Example output:

      Copy to Clipboard Toggle word wrap
      NAME                                  READY   STATUS    RESTARTS   AGE
      rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

      Creation of the new OSD may take several minutes after the operator starts.

  19. Delete the ocs-osd-removal job(s).

    Copy to Clipboard Toggle word wrap
    $ oc delete job ocs-osd-removal-${osd_id_to_remove}

    Example output:

    Copy to Clipboard Toggle word wrap
    job.batch "ocs-osd-removal-0" deleted

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    Copy to Clipboard Toggle word wrap
    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads Pods, confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Also, ensure that the new incremental mon is created and is in the Running state.

    Copy to Clipboard Toggle word wrap
    $ oc get pod -n openshift-storage | grep mon

    Example output:

    Copy to Clipboard Toggle word wrap
    rook-ceph-mon-a-64556f7659-c2ngc    1/1     Running     0   5h1m
    rook-ceph-mon-b-7c8b74dc4d-tt6hd    1/1     Running     0   5h1m
    rook-ceph-mon-d-57fb8c657-wg5f2     1/1     Running     0   27m

    OSDs and mons might take several minutes to get to the Running state.

  4. If verification steps fail, contact Red Hat Support.

9.3.3. Replacing storage nodes on VMware infrastructure

9.3.3.1. Replacing an operational node on VMware user-provisioned infrastructure

Prerequisites

  • You must be logged into the OpenShift Container Platform (OCP) cluster.

Procedure

  1. Identify the node to be replaced and get its labels.

    Copy to Clipboard Toggle word wrap
    $ oc get nodes --show-labels | grep <node_name>
  2. Identify the mon (if any) and OSDs that are running in the node to be replaced.

    Copy to Clipboard Toggle word wrap
    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  3. Scale down the deployments of the pods identified in the previous step.

    For example:

    Copy to Clipboard Toggle word wrap
    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  4. Mark the node as unschedulable.

    Copy to Clipboard Toggle word wrap
    $ oc adm cordon <node_name>
  5. Drain the node.

    Copy to Clipboard Toggle word wrap
    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  6. Delete the node.

    Copy to Clipboard Toggle word wrap
    $ oc delete node <node_name>
  7. Log in to vSphere and terminate the identified VM.
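
    If you prefer the command line over the vSphere web client, the govc CLI can power off and delete the VM. This is a minimal sketch; it assumes govc is installed, that the GOVC_URL, GOVC_USERNAME, and GOVC_PASSWORD environment variables point to your vCenter, and that <vm_name> is a placeholder for the identified VM:

    $ govc vm.power -off -force <vm_name>
    $ govc vm.destroy <vm_name>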
  8. Create a new VM on VMware with the required infrastructure. See Supported Infrastructure and Platforms.
  9. Create a new OpenShift Container Platform worker node using the new VM.
  10. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    Copy to Clipboard Toggle word wrap
    $ oc get csr
  11. Approve all required OpenShift Container Platform CSRs for the new node:

    Copy to Clipboard Toggle word wrap
    $ oc adm certificate approve <Certificate_Name>
  12. Click Compute Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
  13. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:

      Copy to Clipboard Toggle word wrap
      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  14. Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.

    1. Add a new disk entry to LocalVolume CR.

      Edit the LocalVolume CR: remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}. In this example, the new device is /dev/disk/by-id/nvme-eui.01000000010000005cd2e490020e5251.

      Copy to Clipboard Toggle word wrap
      # oc get -n local-storage localvolume

      Example output:

      Copy to Clipboard Toggle word wrap
      NAME          AGE
      local-block   25h
      Copy to Clipboard Toggle word wrap
      # oc edit -n local-storage localvolume local-block

      Example output:

      Copy to Clipboard Toggle word wrap
      [...]
          storageClassDevices:
          - devicePaths:
            - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4895e0e5251
            - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4ea2f0f5251
        #   - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4de2f0f5251
            - /dev/disk/by-id/nvme-eui.01000000010000005cd2e490020e5251
            storageClassName: localblock
            volumeMode: Block
      [...]

      Make sure to save the changes after editing the CR.

    2. Display PVs with localblock.

      Copy to Clipboard Toggle word wrap
      $ oc get pv | grep localblock

      Example output:

      Copy to Clipboard Toggle word wrap
      local-pv-3e8964d3                          1490Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock                             25h
      local-pv-414755e0                          1490Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock                             25h
      local-pv-b481410                           1490Gi      RWO            Delete           Available                                               localblock                             3m24s
      local-pv-d9c5cbd6                          1490Gi      RWO            Delete           Bound     openshift-storage/ocs-deviceset-0-0-nvs68   localblock
  15. Delete the PV associated with the failed node.

    1. Identify the DeviceSet associated with the OSD to be replaced.

      Copy to Clipboard Toggle word wrap
      # osd_id_to_remove=0
      # oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

      where, osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

      Example output:

      Copy to Clipboard Toggle word wrap
      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68

      In this example, the PVC name is ocs-deviceset-0-0-nvs68.

    2. Identify the PV associated with the PVC.

      Copy to Clipboard Toggle word wrap
      # oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

      where, x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.

      Example output:

      Copy to Clipboard Toggle word wrap
      NAME                      STATUS        VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      ocs-deviceset-0-0-nvs68   Bound   local-pv-d9c5cbd6   1490Gi     RWO            localblock     24h

      In this example, the associated PV is local-pv-d9c5cbd6.

    3. Delete the PVC.

      Copy to Clipboard Toggle word wrap
      # oc delete pvc <pvc-name> -n openshift-storage
    4. Delete the PV.

      Copy to Clipboard Toggle word wrap
      # oc delete pv local-pv-d9c5cbd6

      Example output:

      Copy to Clipboard Toggle word wrap
      persistentvolume "local-pv-d9c5cbd6" deleted
  16. Remove the failed OSD from the cluster.

    Copy to Clipboard Toggle word wrap
    # oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
  17. Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

    Copy to Clipboard Toggle word wrap
    # oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
    Note

    If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    Copy to Clipboard Toggle word wrap
    # oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
  18. Delete the OSD pod deployment and the crashcollector pod deployment.

    Copy to Clipboard Toggle word wrap
    $ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
  19. Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

    Copy to Clipboard Toggle word wrap
    # oc get -n openshift-storage pod -l app=rook-ceph-operator

    Example output:

    Copy to Clipboard Toggle word wrap
    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
    1. Delete the rook-ceph-operator pod.

      Copy to Clipboard Toggle word wrap
      # oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

      Example output:

      Copy to Clipboard Toggle word wrap
      pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
    2. Verify that the rook-ceph-operator pod is restarted.

      Copy to Clipboard Toggle word wrap
      # oc get -n openshift-storage pod -l app=rook-ceph-operator

      Example output:

      Copy to Clipboard Toggle word wrap
      NAME                                  READY   STATUS    RESTARTS   AGE
      rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

      Creation of the new OSD and mon might take several minutes after the operator restarts.

  20. Delete the ocs-osd-removal job.

    Copy to Clipboard Toggle word wrap
    # oc delete job ocs-osd-removal-${osd_id_to_remove}

    Example output:

    Copy to Clipboard Toggle word wrap
    job.batch "ocs-osd-removal-0" deleted

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    Copy to Clipboard Toggle word wrap
    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads Pods, confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Ensure that the new incremental mon is created and is in the Running state.

    Copy to Clipboard Toggle word wrap
    $ oc get pod -n openshift-storage | grep mon

    Example output:

    Copy to Clipboard Toggle word wrap
    rook-ceph-mon-c-64556f7659-c2ngc                                  1/1     Running     0          6h14m
    rook-ceph-mon-d-7c8b74dc4d-tt6hd                                  1/1     Running     0          4h24m
    rook-ceph-mon-e-57fb8c657-wg5f2                                   1/1     Running     0          162m

    OSDs and mons might take several minutes to get to the Running state.

  4. If verification steps fail, contact Red Hat Support.

9.3.3.2. Replacing a failed node on VMware user-provisioned infrastructure

Prerequisites

  • You must be logged into the OpenShift Container Platform (OCP) cluster.

Procedure

  1. Identify the node to be replaced and get its labels.

    Copy to Clipboard Toggle word wrap
    $ oc get nodes --show-labels | grep <node_name>
  2. Identify the mon (if any) and OSDs that are running in the node to be replaced.

    Copy to Clipboard Toggle word wrap
    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  3. Scale down the deployments of the pods identified in the previous step.

    For example:

    Copy to Clipboard Toggle word wrap
    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  4. Mark the node as unschedulable.

    Copy to Clipboard Toggle word wrap
    $ oc adm cordon <node_name>
  5. Remove the pods which are in Terminating state.

    Copy to Clipboard Toggle word wrap
    $ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
  6. Drain the node.

    Copy to Clipboard Toggle word wrap
    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  7. Delete the node.

    Copy to Clipboard Toggle word wrap
    $ oc delete node <node_name>
  8. Log in to vSphere and terminate the identified VM.
  9. Create a new VM on VMware with the required infrastructure. See Supported Infrastructure and Platforms.
  10. Create a new OpenShift Container Platform worker node using the new VM.
  11. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    Copy to Clipboard Toggle word wrap
    $ oc get csr
  12. Approve all required OpenShift Container Platform CSRs for the new node:

    Copy to Clipboard Toggle word wrap
    $ oc adm certificate approve <Certificate_Name>
  13. Click Compute Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
  14. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
    Copy to Clipboard Toggle word wrap
    $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  15. Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.

    1. Add a new disk entry to LocalVolume CR.

      Edit the LocalVolume CR: remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}. In this example, the new device is /dev/disk/by-id/nvme-eui.01000000010000005cd2e490020e5251.

      Copy to Clipboard Toggle word wrap
      # oc get -n local-storage localvolume

      Example output:

      Copy to Clipboard Toggle word wrap
      NAME          AGE
      local-block   25h
      Copy to Clipboard Toggle word wrap
      # oc edit -n local-storage localvolume local-block

      Example output:

      Copy to Clipboard Toggle word wrap
      [...]
          storageClassDevices:
          - devicePaths:
            - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4895e0e5251
            - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4ea2f0f5251
        #   - /dev/disk/by-id/nvme-eui.01000000010000005cd2e4de2f0f5251
            - /dev/disk/by-id/nvme-eui.01000000010000005cd2e490020e5251
            storageClassName: localblock
            volumeMode: Block
      [...]

      Make sure to save the changes after editing the CR.

    2. Display PVs with localblock.

      Copy to Clipboard Toggle word wrap
      $ oc get pv | grep localblock

      Example output:

      Copy to Clipboard Toggle word wrap
      local-pv-3e8964d3                          1490Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-2-0-79j94   localblock                             25h
      local-pv-414755e0                          1490Gi      RWO            Delete           Bound       openshift-storage/ocs-deviceset-1-0-959rp   localblock                             25h
      local-pv-b481410                           1490Gi      RWO            Delete           Available                                               localblock                             3m24s
      local-pv-d9c5cbd6                          1490Gi      RWO            Delete           Bound     openshift-storage/ocs-deviceset-0-0-nvs68   localblock
  16. Delete the PV associated with the failed node.

    1. Identify the DeviceSet associated with the OSD to be replaced.

      Copy to Clipboard Toggle word wrap
      # osd_id_to_remove=0
      # oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

      where, osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

      Example output:

      Copy to Clipboard Toggle word wrap
      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68

      In this example, the PVC name is ocs-deviceset-0-0-nvs68.

    2. Identify the PV associated with the PVC.

      Copy to Clipboard Toggle word wrap
      # oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

      where, x, y, and pvc-suffix are the values in the DeviceSet identified in the previous step.

      Example output:

      Copy to Clipboard Toggle word wrap
      NAME                      STATUS        VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      ocs-deviceset-0-0-nvs68   Bound   local-pv-d9c5cbd6   1490Gi     RWO            localblock     24h

      In this example, the associated PV is local-pv-d9c5cbd6.

    3. Delete the PVC.

      Copy to Clipboard Toggle word wrap
      # oc delete pvc <pvc-name> -n openshift-storage
    4. Delete the PV.

      Copy to Clipboard Toggle word wrap
      # oc delete pv local-pv-d9c5cbd6

      Example output:

      Copy to Clipboard Toggle word wrap
      persistentvolume "local-pv-d9c5cbd6" deleted
  17. Remove the failed OSD from the cluster.

    Copy to Clipboard Toggle word wrap
    # oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
  18. Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

    Copy to Clipboard Toggle word wrap
    # oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
    Note

    If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    Copy to Clipboard Toggle word wrap
    # oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
  19. Delete the OSD pod deployment and the crashcollector pod deployment.

    Copy to Clipboard Toggle word wrap
    $ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
  20. Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

    Copy to Clipboard Toggle word wrap
    # oc get -n openshift-storage pod -l app=rook-ceph-operator

    Example output:

    Copy to Clipboard Toggle word wrap
    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          1d20h
    1. Delete the rook-ceph-operator pod.

      Copy to Clipboard Toggle word wrap
      # oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

      Example output:

      Copy to Clipboard Toggle word wrap
      pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
    2. Verify that the rook-ceph-operator pod is restarted.

      Copy to Clipboard Toggle word wrap
      # oc get -n openshift-storage pod -l app=rook-ceph-operator

      Example output:

      Copy to Clipboard Toggle word wrap
      NAME                                  READY   STATUS    RESTARTS   AGE
      rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

      Creation of the new OSD and mon might take several minutes after the operator restarts.

  21. Delete the ocs-osd-removal job.

    Copy to Clipboard Toggle word wrap
    # oc delete job ocs-osd-removal-${osd_id_to_remove}

    Example output:

    Copy to Clipboard Toggle word wrap
    job.batch "ocs-osd-removal-0" deleted

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    Copy to Clipboard Toggle word wrap
    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads Pods, confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Ensure that the new incremental mon is created and is in the Running state.

    Copy to Clipboard Toggle word wrap
    $ oc get pod -n openshift-storage | grep mon

    Example output:

    Copy to Clipboard Toggle word wrap
    rook-ceph-mon-c-64556f7659-c2ngc                                  1/1     Running     0          6h14m
    rook-ceph-mon-d-7c8b74dc4d-tt6hd                                  1/1     Running     0          4h24m
    rook-ceph-mon-e-57fb8c657-wg5f2                                   1/1     Running     0          162m

    OSDs and mons might take several minutes to get to the Running state.

  4. If verification steps fail, contact Red Hat Support.