
Chapter 2. OpenShift Container Storage deployed using local storage devices


2.1. Replacing storage nodes on bare metal infrastructure

Use this procedure to replace an operational node on bare metal infrastructure.

Prerequisites

  • Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
  • You must be logged into the OpenShift Container Platform (RHOCP) cluster.
  • If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.

Procedure

  1. Identify the NODE and get labels on the node to be replaced.

    $ oc get nodes --show-labels | grep <node_name>
  2. Identify the mon (if any) and OSDs that are running in the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  3. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  4. Mark the node as unschedulable.

    $ oc adm cordon <node_name>
  5. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  6. Delete the node.

    $ oc delete node <node_name>
  7. Get a new bare metal machine with required infrastructure. See Installing a cluster on bare metal.

    Important

    For information about how to replace a master node when you have installed OpenShift Container Storage on a three-node OpenShift compact bare-metal cluster, see the Backup and Restore guide in the OpenShift Container Platform documentation.

  8. Create a new OpenShift Container Platform node using the new bare metal machine.
  9. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    $ oc get csr
  10. Approve all required OpenShift Container Platform CSRs for the new node (a sketch for approving all pending CSRs at once follows this procedure):

    $ oc adm certificate approve <Certificate_Name>
  11. Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
  12. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:

      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  13. Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

    For example:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    echo $local_storage_project
    openshift-local-storage
  14. Add a new worker node to localVolumeDiscovery and localVolumeSet.

    1. Update the localVolumeDiscovery definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.

    2. Determine which localVolumeSet to edit.

      # oc get -n $local_storage_project localvolumeset
      NAME          AGE
      localblock   25h
    3. Update the localVolumeSet definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumeset localblock
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.

  15. Verify that the new localblock PV is available.

    $ oc get pv | grep localblock | grep Available
    local-pv-551d950   512Gi   RWO    Delete   Available   localblock   26s
  16. Change to the openshift-storage project.

    $ oc project openshift-storage
  17. Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.

    $ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
  18. Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.

    A status of Completed confirms that the OSD removal job succeeded.

    # oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
    Note

    If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    # oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
  19. Delete the ocs-osd-removal-job.

    # oc delete -n openshift-storage job ocs-osd-removal-job

    Example output:

    job.batch "ocs-osd-removal-job" deleted
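If many CSRs are pending for the new node, approving each certificate by name in step 10 can be tedious. The following is a minimal sketch for approving every pending CSR in one pass; it assumes that all currently pending CSRs belong to the node you just added, so review the output of oc get csr before running it:

    $ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve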

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Ensure that the new incremental mon is created and is in the Running state.

    $ oc get pod -n openshift-storage | grep mon

    Example output:

    rook-ceph-mon-a-cd575c89b-b6k66    2/2   Running   0   38m
    rook-ceph-mon-b-6776bc469b-tzzt8   2/2   Running   0   38m
    rook-ceph-mon-d-5ff5d488b5-7v8xh   2/2   Running   0   4m8s

    OSD and Mon might take several minutes to get to the Running state.

  4. Verify that new OSD pods are running on the replacement node (a label-selector variant of this check follows these verification steps).

    $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
  5. (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each of the new nodes identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host(s).

      $ oc debug node/<node name>
      $ chroot /host
    2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).

      $ lsblk
  6. If verification steps fail, contact Red Hat Support.
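As an alternative to grepping the full pod listing in verification step 4, the OSD pods can be selected by their Rook label and then filtered by node. A sketch, assuming the standard app=rook-ceph-osd label that the Rook operator applies to OSD pods:

    $ oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide | grep <new_node_name>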

Use this procedure to replace a failed node on bare metal infrastructure.

Prerequisites

  • Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
  • You must be logged into the OpenShift Container Platform (RHOCP) cluster.
  • If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.

Procedure

  1. Identify the NODE and get labels on the node to be replaced.

    $ oc get nodes --show-labels | grep <node_name>
  2. Identify the mon (if any) and OSDs that are running in the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  3. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  4. Mark the node as unschedulable.

    $ oc adm cordon <node_name>
  5. Remove the pods which are in Terminating state.

    $ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
  6. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  7. Delete the node.

    $ oc delete node <node_name>
  8. Get a new bare metal machine with required infrastructure. See Installing a cluster on bare metal.

    Important

    For information about how to replace a master node when you have installed OpenShift Container Storage on a three-node OpenShift compact bare-metal cluster, see the Backup and Restore guide in the OpenShift Container Platform documentation.

  9. Create a new OpenShift Container Platform node using the new bare metal machine.
  10. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    $ oc get csr
  11. Approve all required OpenShift Container Platform CSRs for the new node:

    $ oc adm certificate approve <Certificate_Name>
  12. Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
  13. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:

      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  14. Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

    For example:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    echo $local_storage_project
    openshift-local-storage
  15. Add a new worker node to localVolumeDiscovery and localVolumeSet.

    1. Update the localVolumeDiscovery definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.

    2. Determine which localVolumeSet to edit.

      # oc get -n $local_storage_project localvolumeset
      NAME          AGE
      localblock   25h
    3. Update the localVolumeSet definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumeset localblock
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.

  16. Verify that the new localblock PV is available.

    $ oc get pv | grep localblock | grep Available
    local-pv-551d950   512Gi   RWO    Delete   Available   localblock   26s
  17. Change to the openshift-storage project.

    $ oc project openshift-storage
  18. Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required. (A sketch for listing the failed OSD IDs follows this procedure.)

    $ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
  19. Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.

    A status of Completed confirms that the OSD removal job succeeded.

    # oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
    Note

    If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    # oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
  20. Delete the ocs-osd-removal-job.

    # oc delete -n openshift-storage job ocs-osd-removal-job

    Example output:

    job.batch "ocs-osd-removal-job" deleted
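The FAILED_OSD_IDS value used in step 18 is the numeric suffix of the rook-ceph-osd-<id> deployments that were scaled down earlier in this procedure. A minimal sketch for listing the OSD deployments so you can read off those IDs, assuming the standard app=rook-ceph-osd label set by the Rook operator:

    $ oc get deployment -n openshift-storage -l app=rook-ceph-osd -o name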

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Ensure that the new incremental mon is created and is in the Running state.

    $ oc get pod -n openshift-storage | grep mon

    Example output:

    rook-ceph-mon-a-cd575c89b-b6k66    2/2   Running   0   38m
    rook-ceph-mon-b-6776bc469b-tzzt8   2/2   Running   0   38m
    rook-ceph-mon-d-5ff5d488b5-7v8xh   2/2   Running   0   4m8s

    OSD and Mon might take several minutes to get to the Running state.

  4. Verify that new OSD pods are running on the replacement node.

    $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
  5. (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each of the new nodes identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host(s).

      $ oc debug node/<node name>
      $ chroot /host
    2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s). (A filtered variant of this check is sketched after these verification steps.)

      $ lsblk
  6. If verification steps fail, contact Red Hat Support.
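If the node hosts many devices, the full lsblk listing can be noisy. A sketch for narrowing the output to device-mapper entries of type crypt, run from inside the chroot environment opened in the previous verification step:

    $ lsblk -o NAME,TYPE | grep crypt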

2.2. Replacing storage nodes on IBM Z or LinuxONE infrastructure

You can choose one of the following procedures to replace storage nodes:

Use this procedure to replace an operational node on IBM Z or LinuxONE infrastructure.

Procedure

  1. Log in to OpenShift Web Console.
  2. Click Compute → Nodes.
  3. Identify the node that needs to be replaced. Take a note of its Machine Name.
  4. Mark the node as unschedulable using the following command:

    $ oc adm cordon <node_name>
  5. Drain the node using the following command:

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
    Important

    This activity may take at least 5-10 minutes. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.

  6. Click Compute → Machines. Search for the required machine.
  7. Beside the required machine, click the Action menu (⋮) → Delete Machine.
  8. Click Delete to confirm the machine deletion. A new machine is automatically created.
  9. Wait for the new machine to start and transition into Running state.

    Important

    This activity may take at least 5-10 minutes.

  10. Click Compute → Nodes and confirm that the new node is in Ready state. (A command-line wait is sketched at the end of this procedure.)
  11. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:

      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
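If you prefer to wait for the node from the command line instead of watching the console in step 10, the following sketch blocks until the node reports Ready; the node name and timeout are placeholders to adjust:

    $ oc wait --for=condition=Ready node/<new_node_name> --timeout=15m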

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.
  4. Verify that new OSD pods are running on the replacement node.

    $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
  5. (Optional) If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each of the new nodes identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host(s).

      $ oc debug node/<node name>
      $ chroot /host
    2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).

      $ lsblk
  6. If verification steps fail, contact Red Hat Support.

Perform this procedure to replace a failed node which is not operational on IBM Z or LinuxONE infrastructure for OpenShift Container Storage.

Procedure

  1. Log in to OpenShift Web Console and click Compute → Nodes.
  2. Identify the faulty node and click on its Machine Name.
  3. Click Actions → Edit Annotations, and click Add More.
  4. Add machine.openshift.io/exclude-node-draining and click Save. (A command-line alternative for adding this annotation is sketched at the end of this procedure.)
  5. Click Actions → Delete Machine, and click Delete.
  6. A new machine is automatically created. Wait for the new machine to start.

    Important

    This activity may take at least 5-10 minutes. Ceph errors generated during this period are temporary and are automatically resolved when the new node is labeled and functional.

  7. Click Compute → Nodes and confirm that the new node is in Ready state.
  8. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From the web user interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From the command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:

      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  9. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
  10. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  11. Verify that all other required OpenShift Container Storage pods are in Running state.
  12. Verify that new OSD pods are running on the replacement node.

    $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
  13. (Optional) If data encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each of the new nodes identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host(s).

      $ oc debug node/<node name>
      $ chroot /host
    2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).

      $ lsblk
  14. If verification steps fail, contact Red Hat Support.
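The annotation added in steps 3 and 4 can also be applied from the command line. This sketch assumes that the Machine objects live in the openshift-machine-api namespace, which is the default on installer-provisioned clusters:

    $ oc annotate machine <machine_name> machine.openshift.io/exclude-node-draining="" -n openshift-machine-api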

2.3. Replacing storage nodes on VMware infrastructure

Use this procedure to replace an operational node on VMware user-provisioned infrastructure.

Prerequisites

  • Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
  • You must be logged into the OpenShift Container Platform (RHOCP) cluster.
  • If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.

Procedure

  1. Identify the NODE and get labels on the node to be replaced.

    $ oc get nodes --show-labels | grep <node_name>
  2. Identify the mon (if any) and OSDs that are running in the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  3. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  4. Mark the node as unschedulable.

    $ oc adm cordon <node_name>
  5. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  6. Delete the node.

    $ oc delete node <node_name>
  7. Log in to vSphere and terminate the identified VM.
  8. Create a new VM on VMware with the required infrastructure. See Supported Infrastructure and Platforms.
  9. Create a new OpenShift Container Platform worker node using the new VM.
  10. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    $ oc get csr
  11. Approve all required OpenShift Container Platform CSRs for the new node:

    $ oc adm certificate approve <Certificate_Name>
  12. Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
  13. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:

      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  14. Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

    For example:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    echo $local_storage_project
    openshift-local-storage
  15. Add a new worker node to localVolumeDiscovery and localVolumeSet.

    1. Update the localVolumeDiscovery definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.

    2. Determine which localVolumeSet to edit.

      # oc get -n $local_storage_project localvolumeset
      NAME          AGE
      localblock   25h
    3. Update the localVolumeSet definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumeset localblock
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node. A patch-based alternative to the interactive edit is sketched after this procedure.

  16. Verify that the new localblock PV is available.

    $ oc get pv | grep localblock | grep Available
    local-pv-551d950   512Gi   RWO    Delete   Available   localblock   26s
  17. Change to the openshift-storage project.

    $ oc project openshift-storage
  18. Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.

    $ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
  19. Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.

    A status of Completed confirms that the OSD removal job succeeded.

    # oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
    Note

    If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    # oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
  20. Delete the ocs-osd-removal-job.

    # oc delete -n openshift-storage job ocs-osd-removal-job

    Example output:

    job.batch "ocs-osd-removal-job" deleted
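If you script node replacement, the interactive oc edit in step 15 can be replaced with a JSON patch. The sketch below assumes the nodeSelector layout shown above (a single nodeSelectorTerms entry with a single matchExpressions entry) and only appends the new hostname; removing the failed hostname still needs oc edit or a patch by index:

    $ oc patch -n $local_storage_project localvolumeset localblock --type json \
      -p '[{"op": "add", "path": "/spec/nodeSelector/nodeSelectorTerms/0/matchExpressions/0/values/-", "value": "newnode.example.com"}]'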

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Ensure that the new incremental mon is created and is in the Running state. (A label-selector variant of this check is sketched after these verification steps.)

    $ oc get pod -n openshift-storage | grep mon

    Example output:

    rook-ceph-mon-a-cd575c89b-b6k66    2/2   Running   0   38m
    rook-ceph-mon-b-6776bc469b-tzzt8   2/2   Running   0   38m
    rook-ceph-mon-d-5ff5d488b5-7v8xh   2/2   Running   0   4m8s

    OSD and Mon might take several minutes to get to the Running state.

  4. Verify that new OSD pods are running on the replacement node.

    $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
  5. (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each of the new nodes identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host(s).

      $ oc debug node/<node name>
      $ chroot /host
    2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).

      $ lsblk
  6. If verification steps fail, contact Red Hat Support.
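The mon pods referenced in verification step 3 can also be listed by their Rook label instead of grepping pod names. A sketch, assuming the standard app=rook-ceph-mon label; a healthy cluster in this guide runs three mons:

    $ oc get pods -n openshift-storage -l app=rook-ceph-mon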

Use this procedure to replace an operational node on VMware installer-provisioned infrastructure.

Prerequisites

  • Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
  • You must be logged into the OpenShift Container Platform (RHOCP) cluster.
  • If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.

Procedure

  1. Log in to OpenShift Web Console and click Compute → Nodes.
  2. Identify the node that needs to be replaced. Take a note of its Machine Name.
  3. Get labels on the node to be replaced.

    $ oc get nodes --show-labels | grep <node_name>
  4. Identify the mon (if any) and OSDs that are running in the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  5. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  6. Mark the node as unschedulable.

    $ oc adm cordon <node_name>
  7. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  8. Click Compute → Machines. Search for the required machine.
  9. Beside the required machine, click the Action menu (⋮) → Delete Machine.
  10. Click Delete to confirm the machine deletion. A new machine is automatically created.
  11. Wait for the new machine to start and transition into Running state.

    Important

    This activity may take 5-10 minutes or more.

  12. Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
  13. Physically add a new device to the node.
  14. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:

      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  15. Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

    For example:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    echo $local_storage_project
    openshift-local-storage
  16. Add a new worker node to localVolumeDiscovery and localVolumeSet.

    1. Update the localVolumeDiscovery definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.

    2. Determine which localVolumeSet to edit.

      # oc get -n $local_storage_project localvolumeset
      NAME          AGE
      localblock   25h
    3. Update the localVolumeSet definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumeset localblock
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.

  17. Verify that the new localblock PV is available.

    $ oc get pv | grep localblock | grep Available
    local-pv-551d950   512Gi   RWO    Delete   Available   localblock   26s
  18. Change to the openshift-storage project.

    $ oc project openshift-storage
  19. Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.

    $ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
  20. Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.

    A status of Completed confirms that the OSD removal job succeeded.

    # oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
    Note

    If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    # oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
  21. Identify the PV associated with the PVC.

    # oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
    local-pv-d6bf175b  1490Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  compute-1

    If there is a PV in Released state, delete it. (A one-liner that removes all Released localblock PVs in one pass is sketched after this procedure.)

    # oc delete pv <persistent-volume>

    For example:

    # oc delete pv local-pv-d6bf175b
    persistentvolume "local-pv-d6bf175b" deleted
  22. Identify the crashcollector pod deployment.

    $ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

    If there is an existing crashcollector pod deployment, delete it.

    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage
  23. Delete the ocs-osd-removal-job.

    # oc delete -n openshift-storage job ocs-osd-removal-job

    Example output:

    job.batch "ocs-osd-removal-job" deleted
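When several device sets were hosted on the failed node, more than one localblock PV can be left in Released state after the OSD removal in step 21. A sketch for deleting them in one pass; it assumes the default oc get pv column layout where the fifth column is STATUS, so check the listing from that step before piping into delete:

    $ oc get pv | grep localblock | awk '$5 == "Released" {print $1}' | xargs -r oc delete pv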

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Ensure that the new incremental mon is created and is in the Running state.

    $ oc get pod -n openshift-storage | grep mon

    Example output:

    rook-ceph-mon-a-cd575c89b-b6k66    2/2   Running   0   38m
    rook-ceph-mon-b-6776bc469b-tzzt8   2/2   Running   0   38m
    rook-ceph-mon-d-5ff5d488b5-7v8xh   2/2   Running   0   4m8s

    OSD and Mon might take several minutes to get to the Running state.

  4. Verify that new OSD pods are running on the replacement node.

    $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
  5. (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each of the new nodes identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host(s).

      $ oc debug node/<node name>
      $ chroot /host
    2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).

      $ lsblk
  6. If verification steps fail, contact Red Hat Support.

Use this procedure to replace a failed node on VMware user-provisioned infrastructure.

Prerequisites

  • Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
  • You must be logged into the OpenShift Container Platform (RHOCP) cluster.
  • If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.

Procedure

  1. Identify the NODE and get labels on the node to be replaced.

    $ oc get nodes --show-labels | grep <node_name>
  2. Identify the mon (if any) and OSDs that are running in the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  3. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  4. Mark the node as unschedulable.

    $ oc adm cordon <node_name>
  5. Remove the pods which are in Terminating state.

    $ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
  6. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  7. Delete the node.

    $ oc delete node <node_name>
  8. Log in to vSphere and terminate the identified VM.
  9. Create a new VM on VMware with the required infrastructure. See Supported Infrastructure and Platforms.
  10. Create a new OpenShift Container Platform worker node using the new VM.
  11. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    $ oc get csr
  12. Approve all required OpenShift Container Platform CSRs for the new node:

    $ oc adm certificate approve <Certificate_Name>
  13. Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
  14. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:

      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  15. Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

    For example:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    echo $local_storage_project
    openshift-local-storage
  16. Add a new worker node to localVolumeDiscovery and localVolumeSet.

    1. Update the localVolumeDiscovery definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.

    2. Determine which localVolumeSet to edit.

      # oc get -n $local_storage_project localvolumeset
      NAME          AGE
      localblock   25h
    3. Update the localVolumeSet definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumeset localblock
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.

  17. Verify that the new localblock PV is available.

    $ oc get pv | grep localblock | grep Available
    local-pv-551d950   512Gi   RWO    Delete   Available   localblock   26s
  18. Change to the openshift-storage project.

    $ oc project openshift-storage
  19. Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.

    $ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
  20. Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. (A command that waits for the job to complete is sketched after this procedure.)

    A status of Completed confirms that the OSD removal job succeeded.

    # oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
    Note

    If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    # oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
  21. Delete the ocs-osd-removal-job.

    # oc delete -n openshift-storage job ocs-osd-removal-job

    Example output:

    job.batch "ocs-osd-removal-job" deleted
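Instead of polling the pod status in step 20 manually, you can block until the removal job reports completion. A sketch using oc wait; the timeout is an arbitrary value to adjust for your environment:

    $ oc wait --for=condition=complete job/ocs-osd-removal-job -n openshift-storage --timeout=600s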

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Ensure that the new incremental mon is created and is in the Running state.

    $ oc get pod -n openshift-storage | grep mon

    Example output:

    rook-ceph-mon-a-cd575c89b-b6k66    2/2   Running   0   38m
    rook-ceph-mon-b-6776bc469b-tzzt8   2/2   Running   0   38m
    rook-ceph-mon-d-5ff5d488b5-7v8xh   2/2   Running   0   4m8s

    OSD and Mon might take several minutes to get to the Running state.

  4. Verify that new OSD pods are running on the replacement node.

    $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
  5. (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each of the new nodes identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host(s).

      $ oc debug node/<node name>
      $ chroot /host
    2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).

      $ lsblk
  6. If verification steps fail, contact Red Hat Support.

Use this procedure to replace a failed node on VMware installer-provisioned infrastructure.

Prerequisites

  • Red Hat recommends that replacement nodes are configured with similar infrastructure, resources, and disks to the node being replaced.
  • You must be logged into the OpenShift Container Platform (RHOCP) cluster.
  • If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.

Procedure

  1. Log in to OpenShift Web Console and click Compute → Nodes.
  2. Identify the node that needs to be replaced. Take a note of its Machine Name.
  3. Get labels on the node to be replaced.

    $ oc get nodes --show-labels | grep <node_name>
  4. Identify the mon (if any) and OSDs that are running in the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  5. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
  6. Mark the node as unschedulable.

    $ oc adm cordon <node_name>
  7. Remove the pods which are in Terminating state.

    $ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
  8. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
  9. Click Compute → Machines. Search for the required machine.
  10. Beside the required machine, click the Action menu (⋮) → Delete Machine.
  11. Click Delete to confirm the machine deletion. A new machine is automatically created.
  12. Wait for the new machine to start and transition into Running state.

    Important

    This activity may take 5-10 minutes or more.

  13. Click Compute → Nodes in the OpenShift Web Console and confirm that the new node is in Ready state.
  14. Physically add a new device to the node.
  15. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:

      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  16. Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)

    For example:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    echo $local_storage_project
    openshift-local-storage
  17. Add a new worker node to localVolumeDiscovery and localVolumeSet.

    1. Update the localVolumeDiscovery definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node. A way to confirm that device discovery ran on the new node is sketched after this procedure.

    2. Determine which localVolumeSet to edit.

      # oc get -n $local_storage_project localvolumeset
      NAME          AGE
      localblock   25h
    3. Update the localVolumeSet definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumeset localblock
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.

  18. Verify that the new localblock PV is available.

    $ oc get pv | grep localblock | grep Available
    local-pv-551d950   512Gi   RWO    Delete   Available   localblock   26s
  19. Change to the openshift-storage project.

    $ oc project openshift-storage
  20. Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.

    $ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
  21. Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.

    A status of Completed confirms that the OSD removal job succeeded.

    # oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
    Note

    If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    # oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
  22. Identify the PV associated with the PVC.

    # oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
    local-pv-d6bf175b  1490Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  compute-1

    If there is a PV in Released state, delete it.

    # oc delete pv <persistent-volume>

    For example:

    # oc delete pv local-pv-d6bf175b
    persistentvolume "local-pv-d6bf175b" deleted
  23. Identify the crashcollector pod deployment.

    $ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage

    If there is an existing crashcollector pod deployment, delete it.

    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<failed_node_name> -n openshift-storage
  24. Delete the ocs-osd-removal-job.

    # oc delete -n openshift-storage job ocs-osd-removal-job

    Example output:

    job.batch "ocs-osd-removal-job" deleted
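To confirm that device discovery ran on the new node added in step 17 before expecting the localblock PV, you can list the discovery results. This assumes the local storage operator creates one LocalVolumeDiscoveryResult object per selected node:

    $ oc get localvolumediscoveryresults -n $local_storage_project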

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
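    If you prefer the command line over the web console for this check, a minimal sketch that lists the CSI plugin pods scheduled on the new node (replace <new_node_name> with the name of the node you added) is:

    $ oc get pods -n openshift-storage -o wide --field-selector=spec.nodeName=<new_node_name> | egrep 'csi-cephfsplugin|csi-rbdplugin'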
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Ensure that the new incremental mon is created and is in the Running state.

    $ oc get pod -n openshift-storage | grep mon
    Copy to Clipboard Toggle word wrap

    Example output:

    rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running   0          38m
    rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running   0          38m
    rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running   0          4m8s
    Copy to Clipboard Toggle word wrap

    OSD and Mon might take several minutes to get to the Running state.

  4. Verify that new OSD pods are running on the replacement node.

    $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
    Copy to Clipboard Toggle word wrap
  5. (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each of the new nodes identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host(s).

      $ oc debug node/<node name>
      $ chroot /host
      Copy to Clipboard Toggle word wrap
    2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).

      $ lsblk
      Copy to Clipboard Toggle word wrap
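      To narrow down the output, a hedged variant that prints only the device name and type and keeps the encrypted mappings is:

      $ lsblk --output NAME,TYPE | grep crypt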
  6. If verification steps fail, contact Red Hat Support.

Use this procedure to replace an operational node on Red Hat Virtualization installer-provisioned infrastructure (IPI).

Prerequisites

  • Red Hat recommends that replacement nodes are configured with similar infrastructure, resources and disks to the node being replaced.
  • You must be logged into the OpenShift Container Platform (RHOCP) cluster.
  • If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.

Procedure

  1. Log in to OpenShift Web Console and click Compute Nodes.
  2. Identify the node that needs to be replaced. Take a note of its Machine Name.
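    If you prefer the command line, the machine-to-node mapping can also be read from the openshift-machine-api namespace; a minimal sketch:

    $ oc get machines -n openshift-machine-api -o wide | grep <node_name>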
  3. Get labels on the node to be replaced.

    $ oc get nodes --show-labels | grep <node_name>
    Copy to Clipboard Toggle word wrap
  4. Identify the mon (if any) and OSDs that are running in the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
    Copy to Clipboard Toggle word wrap
  5. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
    Copy to Clipboard Toggle word wrap
  6. Mark the node as unschedulable.

    $ oc adm cordon <node_name>
    Copy to Clipboard Toggle word wrap
  7. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
    Copy to Clipboard Toggle word wrap
  8. Click Compute Machines. Search for the required machine.
  9. Beside the required machine, click the Action menu (⋮) Delete Machine.
  10. Click Delete to confirm the machine deletion. A new machine is automatically created. Wait for the new machine to start and transition into Running state.

    Important

    This activity may take 5-10 minutes or more.
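    You can also follow the progress from the command line; a minimal sketch that watches the machines until the replacement reaches the Running phase:

    $ oc get machines -n openshift-machine-api -w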

  11. Click Compute Nodes in the OpenShift web console. Confirm if the new node is in Ready state.
  12. Physically add the new device(s) to the node.
  13. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
    $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
    Copy to Clipboard Toggle word wrap
  14. Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    Copy to Clipboard Toggle word wrap

    For example:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    echo $local_storage_project
    openshift-local-storage
    Copy to Clipboard Toggle word wrap
  15. Add a new worker node to localVolumeDiscovery and localVolumeSet.

    1. Update the localVolumeDiscovery definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]
      Copy to Clipboard Toggle word wrap

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.

    2. Determine which localVolumeSet to edit.

      # oc get -n $local_storage_project localvolumeset
      NAME          AGE
      localblock   25h
      Copy to Clipboard Toggle word wrap
    3. Update the localVolumeSet definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumeset localblock
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]
      Copy to Clipboard Toggle word wrap

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.
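      If you script these changes instead of using the interactive editor, the new host name can be appended with a JSON patch. The following is a minimal sketch; it assumes a single nodeSelectorTerms/matchExpressions entry as in the example above, and removing the failed host still requires editing the object or knowing the index of its entry:

      # oc patch -n $local_storage_project localvolumeset localblock --type json \
        -p '[{"op": "add", "path": "/spec/nodeSelector/nodeSelectorTerms/0/matchExpressions/0/values/-", "value": "newnode.example.com"}]'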

  16. Verify that the new localblock PV is available.

    $ oc get pv | grep localblock | grep Available
    local-pv-551d950   512Gi   RWO   Delete   Available   localblock   26s
    Copy to Clipboard Toggle word wrap
  17. Change to the openshift-storage project.

    $ oc project openshift-storage
    Copy to Clipboard Toggle word wrap
  18. Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.

    $ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
    Copy to Clipboard Toggle word wrap
  19. Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.

    A status of Completed confirms that the OSD removal job succeeded.

    # oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
    Copy to Clipboard Toggle word wrap
    Note

    If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    # oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
    Copy to Clipboard Toggle word wrap
  20. Identify the PV associated with the PVC.

    # oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
    local-pv-d6bf175b  512Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  server3.example.com
    Copy to Clipboard Toggle word wrap

    If there is a PV in Released state, delete it.

    # oc delete pv <persistent-volume>
    Copy to Clipboard Toggle word wrap

    For example:

    # oc delete pv local-pv-d6bf175b
    persistentvolume "local-pv-d6bf175b" deleted
    Copy to Clipboard Toggle word wrap
  21. Identify the crashcollector pod deployment.

    $ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
    Copy to Clipboard Toggle word wrap

    If there is an existing crashcollector pod deployment, delete it.

    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
    Copy to Clipboard Toggle word wrap
  22. Delete the ocs-osd-removal job.

    # oc delete -n openshift-storage job ocs-osd-removal-job
    Copy to Clipboard Toggle word wrap

    Example output:

    job.batch "ocs-osd-removal-job" deleted
    Copy to Clipboard Toggle word wrap

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
    Copy to Clipboard Toggle word wrap
  2. Click Workloads Pods, confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Ensure that the new incremental mon is created and is in the Running state.

    $ oc get pod -n openshift-storage | grep mon
    Copy to Clipboard Toggle word wrap

    Example output:

    rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running  0  38m
    rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running  0  38m
    rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running  0  4m8s
    Copy to Clipboard Toggle word wrap

    OSD and Mon might take several minutes to get to the Running state.

  4. Verify that new OSD pods are running on the replacement node.

    $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
    Copy to Clipboard Toggle word wrap
  5. (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each of the new nodes identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host(s).

      $ oc debug node/<node name>
      $ chroot /host
      Copy to Clipboard Toggle word wrap
    2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).

      $ lsblk
      Copy to Clipboard Toggle word wrap
  6. If verification steps fail, contact Red Hat Support.

Perform this procedure to replace a failed node which is not operational on Red Hat Virtualization installer-provisioned infrastructure (IPI) for OpenShift Container Storage.

Prerequisites

  • Red Hat recommends that replacement nodes are configured with similar infrastructure, resources and disks to the node being replaced.
  • You must be logged into the OpenShift Container Platform (RHOCP) cluster.
  • If you upgraded to OpenShift Container Storage version 4.8 from a previous version, and have not already created the LocalVolumeDiscovery and LocalVolumeSet objects, do so now by following the procedure described in Post-update configuration changes for clusters backed by local storage.

Procedure

  1. Log in to OpenShift Web Console and click Compute Nodes.
  2. Identify the node that needs to be replaced. Take a note of its Machine Name.
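    Because the node is not operational, it can usually also be spotted from the command line; a minimal sketch:

    $ oc get nodes | grep NotReady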
  3. Get the labels on the node to be replaced.

    $ oc get nodes --show-labels | grep <node_name>
    Copy to Clipboard Toggle word wrap
  4. Identify the mon (if any) and OSDs that are running in the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
    Copy to Clipboard Toggle word wrap
  5. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
    Copy to Clipboard Toggle word wrap
  6. Mark the node as unschedulable.

    $ oc adm cordon <node_name>
    Copy to Clipboard Toggle word wrap
  7. Remove the pods which are in the Terminating state.

    $ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
    Copy to Clipboard Toggle word wrap
  8. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
    Copy to Clipboard Toggle word wrap
  9. Click Compute Machines. Search for the required machine.
  10. Beside the required machine, click the Action menu (⋮) Delete Machine.
  11. Click Delete to confirm the machine deletion. A new machine is automatically created. Wait for the new machine to start and transition into Running state.

    Important

    This activity may take 5-10 minutes or more.

  12. Click Compute Nodes in the OpenShift web console. Confirm if the new node is in Ready state.
  13. Physically add the new device(s) to the node.
  14. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click Action Menu (⋮) Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:
    $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
    Copy to Clipboard Toggle word wrap
  15. Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    Copy to Clipboard Toggle word wrap

    For example:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    echo $local_storage_project
    openshift-local-storage
    Copy to Clipboard Toggle word wrap
  16. Add a new worker node to localVolumeDiscovery and localVolumeSet.

    1. Update the localVolumeDiscovery definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]
      Copy to Clipboard Toggle word wrap

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.

    2. Determine which localVolumeSet to edit.

      # oc get -n $local_storage_project localvolumeset
      NAME          AGE
      localblock   25h
      Copy to Clipboard Toggle word wrap
    3. Update the localVolumeSet definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumeset localblock
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  - server1.example.com
                  - server2.example.com
                  #- server3.example.com
                  - newnode.example.com
      [...]
      Copy to Clipboard Toggle word wrap

      Remember to save before exiting the editor.

      In the above example, server3.example.com was removed and newnode.example.com is the new node.

  17. Verify that the new localblock PV is available.

    $ oc get pv | grep localblock | grep Available
    local-pv-551d950   512Gi   RWO   Delete   Available   localblock   26s
    Copy to Clipboard Toggle word wrap
  18. Change to the openshift-storage project.

    $ oc project openshift-storage
    Copy to Clipboard Toggle word wrap
  19. Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.

    $ oc process -n openshift-storage ocs-osd-removal \
    -p FAILED_OSD_IDS=failed-osd-id1,failed-osd-id2 | oc create -f -
    Copy to Clipboard Toggle word wrap
  20. Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.

    A status of Completed confirms that the OSD removal job succeeded.

    # oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
    Copy to Clipboard Toggle word wrap
    Note

    If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    # oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
    Copy to Clipboard Toggle word wrap
  21. Identify the PV associated with the PVC.

    # oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
    local-pv-d6bf175b  512Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-0-data-0-6c5pw  localblock  2d22h  server3.example.com
    Copy to Clipboard Toggle word wrap

    If there is a PV in Released state, delete it.

    # oc delete pv <persistent-volume>
    Copy to Clipboard Toggle word wrap

    For example:

    # oc delete pv local-pv-d6bf175b
    persistentvolume "local-pv-d6bf175b" deleted
    Copy to Clipboard Toggle word wrap
  22. Identify the crashcollector pod deployment.

    $ oc get deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
    Copy to Clipboard Toggle word wrap

    If there is an existing crashcollector pod deployment, delete it.

    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=failed-node-name -n openshift-storage
    Copy to Clipboard Toggle word wrap
  23. Delete the ocs-osd-removal job.

    # oc delete -n openshift-storage job ocs-osd-removal-job
    Copy to Clipboard Toggle word wrap

    Example output:

    job.batch "ocs-osd-removal-job" deleted
    Copy to Clipboard Toggle word wrap

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
    Copy to Clipboard Toggle word wrap
  2. Click Workloads Pods, confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Ensure that the new incremental mon is created and is in the Running state.

    $ oc get pod -n openshift-storage | grep mon
    Copy to Clipboard Toggle word wrap

    Example output:

    rook-ceph-mon-a-cd575c89b-b6k66         2/2     Running  0   38m
    rook-ceph-mon-b-6776bc469b-tzzt8        2/2     Running  0   38m
    rook-ceph-mon-d-5ff5d488b5-7v8xh        2/2     Running  0   4m8s
    Copy to Clipboard Toggle word wrap

    OSD and Mon might take several minutes to get to the Running state.

  4. Verify that new OSD pods are running on the replacement node.

    $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
    Copy to Clipboard Toggle word wrap
  5. (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each of the new nodes identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host(s).

      $ oc debug node/<node name>
      $ chroot /host
      Copy to Clipboard Toggle word wrap
    2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).

      $ lsblk
      Copy to Clipboard Toggle word wrap
  6. If verification steps fail, contact Red Hat Support.

For OpenShift Container Storage deployments on IBM Power Systems, node replacement can be performed proactively for an operational node and reactively for a failed node.

Prerequisites

  • Red Hat recommends that replacement nodes are configured with similar infrastructure and resources to the node being replaced.
  • You must be logged into the OpenShift Container Platform (RHOCP) cluster.
  • If you upgraded to OpenShift Container Storage 4.8 from a previous version and have not already created the LocalVolumeDiscovery object, do so now following the procedure described in Post-update configuration changes for clusters backed by local storage.

Procedure

  1. Identify the node and get labels on the node to be replaced.

    $ oc get nodes --show-labels | grep <node_name>
    Copy to Clipboard Toggle word wrap
  2. Identify the mon (if any) and object storage device (OSD) pods that are running in the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
    Copy to Clipboard Toggle word wrap
  3. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-a --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-1 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
    Copy to Clipboard Toggle word wrap
  4. Mark the node as unschedulable.

    $ oc adm cordon <node_name>
    Copy to Clipboard Toggle word wrap
  5. Remove the pods which are in the Terminating state.

    $ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
    Copy to Clipboard Toggle word wrap
  6. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
    Copy to Clipboard Toggle word wrap
  7. Delete the node.

    $ oc delete node <node_name>
    Copy to Clipboard Toggle word wrap
  8. Get a new IBM Power machine with required infrastructure. See Installing a cluster on IBM Power Systems.
  9. Create a new OpenShift Container Platform node using the new IBM Power Systems machine.
  10. Check for certificate signing requests (CSRs) related to OpenShift Container Platform that are in Pending state:

    $ oc get csr
    Copy to Clipboard Toggle word wrap
  11. Approve all required OpenShift Container Platform CSRs for the new node:

    $ oc adm certificate approve <Certificate_Name>
    Copy to Clipboard Toggle word wrap
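    If several CSRs are pending, you can approve them in one pass; a hedged sketch that approves every CSR that does not yet have a status:

    $ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve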
  12. Click Compute Nodes in OpenShift Web Console, confirm if the new node is in Ready state.
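    From the command line, the same check can be expressed with oc wait; a minimal sketch (the timeout is an assumption):

    $ oc wait node/<new_node_name> --for=condition=Ready --timeout=600s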
  13. Apply the OpenShift Container Storage label to the new node using your preferred interface:

    From User interface
    1. For the new node, click Action Menu (⋮) Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    1. Execute the following command to apply the OpenShift Container Storage label to the new node:
    $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=''
    Copy to Clipboard Toggle word wrap
  14. Identify the namespace where OpenShift local storage operator is installed and assign it to local_storage_project variable:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    Copy to Clipboard Toggle word wrap

    For example:

    $ local_storage_project=$(oc get csv --all-namespaces | awk '{print $1}' | grep local)
    echo $local_storage_project
    openshift-local-storage
    Copy to Clipboard Toggle word wrap
  15. Add a new worker node to localVolumeDiscovery.

    1. Update the localVolumeDiscovery definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolumediscovery auto-discover-devices
      [...]
         nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  #- worker-0
                  - worker-1
                  - worker-2
                  - worker-3
      [...]
      Copy to Clipboard Toggle word wrap

      Remember to save before exiting the editor.

      In the above example, worker-0 was removed and worker-3 is the new node.
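      To confirm that device discovery ran on the new worker, you can list the per-node discovery results; a hedged check, assuming your Local Storage Operator version creates LocalVolumeDiscoveryResult objects:

      # oc get localvolumediscoveryresults -n $local_storage_project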

  16. Add the newly added worker node to localVolume.

    1. Determine which localVolume to edit.

      # oc get -n $local_storage_project localvolume
      NAME           AGE
      localblock    25h
      Copy to Clipboard Toggle word wrap
    2. Update the localVolume definition to include the new node and remove the failed node.

      # oc edit -n $local_storage_project localvolume localblock
      [...]
          nodeSelector:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                  #- worker-0
                  - worker-1
                  - worker-2
                  - worker-3
      [...]
      Copy to Clipboard Toggle word wrap

      Remember to save before exiting the editor.

      In the above example, worker-0 was removed and worker-3 is the new node.

  17. Verify that the new localblock PV is available.

    $ oc get pv | grep localblock
    NAME                 CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS      CLAIM                                     STORAGECLASS   AGE
    local-pv-3e8964d3    500Gi      RWO           Delete          Bound       ocs-deviceset-localblock-2-data-0-mdbg9   localblock     25h
    local-pv-414755e0    500Gi      RWO           Delete          Bound       ocs-deviceset-localblock-1-data-0-4cslf   localblock     25h
    local-pv-b481410     500Gi      RWO           Delete          Available                                             localblock     3m24s
    local-pv-5c9b8982    500Gi      RWO           Delete          Bound       ocs-deviceset-localblock-0-data-0-g2mmc   localblock     25h
    Copy to Clipboard Toggle word wrap
  18. Change to the openshift-storage project.

    $ oc project openshift-storage
    Copy to Clipboard Toggle word wrap
  19. Remove the failed OSD from the cluster. You can specify multiple failed OSDs if required.

    1. Identify the PVC, because the PV associated with that specific PVC must be deleted later.

      $ osd_id_to_remove=1
      $ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
      Copy to Clipboard Toggle word wrap

      where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-1.

      Example output:

      ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
          ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-g2mmc
      Copy to Clipboard Toggle word wrap

      In this example, the PVC name is ocs-deviceset-localblock-0-data-0-g2mmc.
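      If you want to reuse the PVC name later in the procedure, a minimal shell sketch that captures it from the same grep output shown above is:

      $ pvc_name=$(oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep "ceph.rook.io/pvc:" | head -n1 | awk '{print $2}')
      $ echo ${pvc_name}
      ocs-deviceset-localblock-0-data-0-g2mmc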

    2. Remove the failed OSD from the cluster.

      $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} |oc create -f -
      Copy to Clipboard Toggle word wrap

      You can remove more than one OSD by adding comma-separated OSD IDs in the command. (For example: FAILED_OSD_IDS=0,1,2)

      Warning

      This step results in the OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.

  20. Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal-job pod.

    A status of Completed confirms that the OSD removal job succeeded.

    # oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
    Copy to Clipboard Toggle word wrap
    Note

    If ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

    # oc logs -l job-name=ocs-osd-removal-job -n openshift-storage
    Copy to Clipboard Toggle word wrap
  21. Delete the PV associated with the failed node.

    1. Identify the PV associated with the PVC. The PVC name must be identical to the name identified earlier when you removed the failed OSD from the cluster.

      # oc get pv -L kubernetes.io/hostname | grep localblock | grep Released
      local-pv-5c9b8982  500Gi  RWO  Delete  Released  openshift-storage/ocs-deviceset-localblock-0-data-0-g2mmc  localblock  24h  worker-0
      Copy to Clipboard Toggle word wrap
    2. Delete the PV.

      # oc delete pv <persistent-volume>
      Copy to Clipboard Toggle word wrap

      For example:

      # oc delete pv local-pv-5c9b8982
      persistentvolume "local-pv-5c9b8982" deleted
      Copy to Clipboard Toggle word wrap
  22. Delete the crashcollector pod deployment.

    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> -n openshift-storage
    Copy to Clipboard Toggle word wrap
  23. Delete the ocs-osd-removal-job.

    # oc delete -n openshift-storage job ocs-osd-removal-job
    Copy to Clipboard Toggle word wrap

    Example output:

    job.batch "ocs-osd-removal-job" deleted
    Copy to Clipboard Toggle word wrap

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
    Copy to Clipboard Toggle word wrap
  2. Click Workloads Pods, confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Ensure that the new incremental mon is created and is in the Running state.

    $ oc get pod -n openshift-storage | grep mon
    Copy to Clipboard Toggle word wrap

    Example output:

    rook-ceph-mon-b-74f6dc9dd6-4llzq        1/1     Running     0          6h14m
    rook-ceph-mon-c-74948755c-h7wtx         1/1     Running     0          4h24m
    rook-ceph-mon-d-598f69869b-4bv49        1/1     Running     0          162m
    Copy to Clipboard Toggle word wrap

    OSD and Mon might take several minutes to get to the Running state.

  4. Verify that new OSD pods are running on the replacement node.

    $ oc get pods -o wide -n openshift-storage | egrep -i <new_node_name> | egrep osd
    Copy to Clipboard Toggle word wrap
  5. (Optional) If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

    For each of the new nodes identified in the previous step, do the following:

    1. Create a debug pod and open a chroot environment for the selected host(s).

      $ oc debug node/<node name>
      $ chroot /host
      Copy to Clipboard Toggle word wrap
    2. Run “lsblk” and check for the “crypt” keyword beside the ocs-deviceset name(s).

      $ lsblk
      Copy to Clipboard Toggle word wrap
  6. If verification steps fail, contact Red Hat Support.