
Chapter 9. Replacing failed storage nodes on Red Hat Virtualization platform


The ephemeral storage that the Red Hat Virtualization platform provides for OpenShift Container Storage might cause data loss when an instance is powered off. Use this procedure to recover from such a power off on Red Hat Virtualization.

Prerequisites

  • You must be logged into the OpenShift Container Platform (OCP) cluster.

Procedure

  1. Identify the node to be replaced and get its labels.

    $ oc get nodes --show-labels | grep <node_name>
  2. Identify the mon (if any) and OSDs that are running in the node to be replaced.

    $ oc get pods -n openshift-storage -o wide | grep -i <node_name>
  3. Scale down the deployments of the pods identified in the previous step.

    For example:

    $ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
    $ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
    $ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name>  --replicas=0 -n openshift-storage
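    The deployment names used in this step come from the pod names listed in step 2: a pod name carries a replica-set hash and a pod suffix that must be stripped. A minimal sketch, using hypothetical pod names modeled on the outputs elsewhere in this chapter:

```shell
# Sketch: derive a deployment name from a pod name by stripping the last two
# dash-separated fields (the replica-set hash and the pod suffix).
# The sample pod names below are hypothetical.
to_deployment() {
  echo "$1" | sed -E 's/-[a-z0-9]+-[a-z0-9]+$//'
}

dep_mon=$(to_deployment rook-ceph-mon-c-7c8b74dc4d-tt6hd)
dep_osd=$(to_deployment rook-ceph-osd-0-6f74fb5bff-2d982)
echo "$dep_mon"   # rook-ceph-mon-c
echo "$dep_osd"   # rook-ceph-osd-0
```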
  4. Mark the nodes as unschedulable.

    $ oc adm cordon <node_name>
  5. Drain the node.

    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
    Note

    If the failed node is not connected to the network, remove the pods running on it by using the command:

    $ oc get pods -A -o wide | grep -i <node_name> |  awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2  " --grace-period=0 " " --force ")}'
    $ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
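    To preview what the one-liner in the note would delete, a dry-run variant can print the delete commands instead of executing them. A sketch, using hypothetical sample lines shaped like `oc get pods -A -o wide` output:

```shell
# Dry-run sketch of the one-liner in the note: `print` replaces `system`, so
# the delete commands are shown rather than executed.
# Sample lines mimic `oc get pods -A -o wide` output (hypothetical pod names).
sample='openshift-storage   rook-ceph-osd-0-abcd-xyz12   0/1   Terminating   0   5h   10.1.2.3   rhvworker01
openshift-storage   rook-ceph-mon-c-efgh-uvw34   1/1   Running       0   5h   10.1.2.4   rhvworker01'

cmds=$(echo "$sample" | awk '{if ($4 == "Terminating") print "oc -n " $1 " delete pods " $2 " --grace-period=0 --force"}')
echo "$cmds"
```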
  6. Remove the failed node after making a note of the device by-id.

    1. List the nodes.

      $ oc get nodes

      Example output:

      NAME          STATUS   ROLES    AGE     VERSION
      rhvworker01    Ready    worker   6h45m   v1.16.2
      rhvworker02    Ready    worker   6h45m   v1.16.2
      rhvworker03    Ready    worker   6h45m   v1.16.2
    2. Find the unique by-id device name for each of the existing nodes.

      $ oc debug node/<node_name>

      Example output:

      sh-4.4# lsblk
      NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
      sda                            8:0    0   120G  0 disk
      |-sda1                         8:1    0   384M  0 part /boot
      |-sda2                         8:2    0   127M  0 part /boot/efi
      |-sda3                         8:3    0     1M  0 part
      `-sda4                         8:4    0 119.5G  0 part
        `-coreos-luks-root-nocrypt 253:0    0 119.5G  0 dm   /sysroot
      sdb                            8:16   0   500G  0 disk
      sr0                           11:0    1   374K  0 rom
      sr1                           11:1    1  1024M  0 rom
      rbd0                         252:0    0    30G  0 disk /var/lib/kubelet/pods/1fe52fb8-ca70-40e1-ac74-28802baf3b80/volumes/kubernetes.io~csi/pvc-705982e5-0a7d-42f5-916f-69340541d52d/mount
      rbd1                         252:16   0    30G  0 disk /var/lib/kubelet/pods/b48b22ba-532f-4d4f-9f58-debfd03c3221/volumes/kubernetes.io~csi/pvc-bb7cdb1c-03fe-4b31-97e3-3c8fa9758f16/mount
      sh-4.4# ls -l /dev/disk/by-id/ | grep sdb
      lrwxrwxrwx. 1 root root  9 Oct  7 14:58 scsi-0QEMU_QEMU_HARDDISK_a01cab90-d3cd-4822-a7eb-a5553d4b619a -> ../../sdb
      lrwxrwxrwx. 1 root root  9 Oct  7 14:58 scsi-SQEMU_QEMU_HARDDISK_a01cab90-d3cd-4822-a7eb-a5553d4b619a -> ../../sdb
      sh-4.4# exit
      exit
      sh-4.2# exit
      exit
      Removing debug pod ...

      Repeat this to identify the device ID for all the existing nodes.
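      The repetition can be sketched as a loop that prints one debug command per node (dry run; node names are taken from the example `oc get nodes` output above, so substitute your own):

```shell
# Dry-run sketch: echo the per-node command rather than running it against a
# live cluster. Node names come from the example output above.
nodes='rhvworker01 rhvworker02 rhvworker03'
cmds=$(for n in $nodes; do
  echo "oc debug node/$n -- chroot /host ls -l /dev/disk/by-id/"
done)
echo "$cmds"
```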

    3. Delete the machine corresponding to the failed node. A new node is automatically added.

      1. Click Compute → Machines. Search for the required machine.
      2. Beside the required machine, click the Action menu (⋮) → Delete Machine.
      3. Click Delete to confirm the machine deletion. A new machine is automatically created.
      4. Wait for the new machine to start and transition into Running state.

        Important

        This activity might take 5-10 minutes or more.

    4. [Optional]: If the failed Red Hat Virtualization instance is not removed automatically, terminate the instance from the console.
  7. Click Compute → Nodes in the OpenShift web console. Confirm if the new node is in Ready state.
  8. Apply the OpenShift Container Storage label to the new node using any one of the following:

    From User interface
    1. For the new node, click the Action Menu (⋮) → Edit Labels.
    2. Add cluster.ocs.openshift.io/openshift-storage and click Save.
    From Command line interface
    • Execute the following command to apply the OpenShift Container Storage label to the new node:

      $ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
  9. Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.

    1. Add the new disk entries to the LocalVolume CR.

      Edit the LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}. For information about identifying the device by-id, see Finding available storage devices.

      $ oc get -n local-storage localvolume

      Example output:

      NAME          AGE
      local-block   25h
      $ oc edit -n local-storage localvolume local-block

      Example output:

      [...]
          storageClassDevices:
          - devicePaths:
            - /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_901abab1-b164-4bb4-b9a8-6bdbce2de23b
            - /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_a01cab90-d3cd-4822-a7eb-a5553d4b619a
            # SQEMU_QEMU_HARDDISK_a01cab90-d3cd-4822-a7be-a5553d4b619a
            - /dev/disk/by-id/scsi-SQEMU_QEMU_HARDDISK_c7388957-3e09-4135-bac1-af7551467d0b
      
            storageClassName: localblock
            volumeMode: Block
      [...]

      Make sure to save the changes after editing the CR.

      You can see that the following new device has been added to this CR using its by-id.

      SQEMU_QEMU_HARDDISK_a01cab90-d3cd-4822-a7eb-a5553d4b619a

    2. Display PVs with localblock.

      $ oc get pv | grep localblock

      Example output:

      local-pv-3646185e   2328Gi  RWO     Delete      Available                                               localblock  9s
      local-pv-3933e86    2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock  5h1m
      local-pv-8176b2bf   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-0-0-nvs68   localblock  5h1m
      local-pv-ab7cabb3   2328Gi  RWO     Delete      Available                                               localblock  9s
      local-pv-ac52e8a    2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-1-0-knrgr   localblock  5h1m
      local-pv-b7e6fd37   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-2-0-rdm7m   localblock  5h1m
      local-pv-cb454338   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-0-1-h9hfm   localblock  5h1m
      local-pv-da5e3175   2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-1-1-g97lq   localblock  5h
      ...
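    The PVs in the Available state are the ones backed by the newly added disks. They can be isolated from the rest of the listing, for example (a sketch using sample lines from the output above):

```shell
# Sketch: keep only localblock PVs still in the Available state (the new
# disks). Sample lines are taken from the example output above.
sample='local-pv-3646185e   2328Gi  RWO     Delete      Available                                               localblock  9s
local-pv-3933e86    2328Gi  RWO     Delete      Bound       openshift-storage/ocs-deviceset-2-1-v9jp4   localblock  5h1m
local-pv-ab7cabb3   2328Gi  RWO     Delete      Available                                               localblock  9s'

available=$(echo "$sample" | awk '$5 == "Available" {print $1}')
echo "$available"
```

In practice, pipe the live command instead: `oc get pv | awk '$5 == "Available" && $6 == "localblock" {print $1}'`.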
  10. Delete each PV and OSD associated with the failed node using the following steps.

    1. Identify the DeviceSet associated with the OSD to be replaced.

      $ osd_id_to_remove=0
      $ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

      where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

      Example output:

      ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
          ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
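      The grep output repeats the annotation twice; a small pipeline can reduce it to the single PVC name, for example:

```shell
# Sketch: collapse the duplicated `ceph.rook.io/pvc` annotation lines (as in
# the example output above) down to the unique PVC name.
grep_out='ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
    ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68'

pvc=$(echo "$grep_out" | awk '{print $2}' | sort -u)
echo "$pvc"
```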
    2. Identify the PV associated with the PVC.

      $ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

      where x, y, and pvc-suffix are the values in the DeviceSet name identified in an earlier step.

      Example output:

      NAME                      STATUS        VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      ocs-deviceset-0-0-nvs68   Bound   local-pv-8176b2bf   2328Gi      RWO            localblock     4h49m

      In this example, the associated PV is local-pv-8176b2bf.

    3. Delete the PVC which was identified in earlier steps. In this example, the PVC name is ocs-deviceset-0-0-nvs68.

      $ oc delete pvc ocs-deviceset-0-0-nvs68 -n openshift-storage

      Example output:

      persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
    4. Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.

      $ oc delete pv local-pv-8176b2bf

      Example output:

      persistentvolume "local-pv-8176b2bf" deleted
    5. Remove the failed OSD from the cluster.

      $ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
    6. Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.

      # oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
      Note

      If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:

      # oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
    7. Delete the OSD pod deployment.

      $ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
  11. Delete the crashcollector pod deployment identified in an earlier step.

    $ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
  12. Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.

    $ oc get -n openshift-storage pod -l app=rook-ceph-operator

    Example output:

    NAME                                  READY   STATUS    RESTARTS   AGE
    rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          5h3m
    1. Delete the rook-ceph-operator pod.

      $ oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982

      Example output:

      pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
    2. Verify that the rook-ceph-operator pod is restarted.

      $ oc get -n openshift-storage pod -l app=rook-ceph-operator

      Example output:

      NAME                                  READY   STATUS    RESTARTS   AGE
      rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s

      Creation of the new OSD may take several minutes after the operator starts.

  13. Delete the ocs-osd-removal job(s).

    $ oc delete job ocs-osd-removal-${osd_id_to_remove}

    Example output:

    job.batch "ocs-osd-removal-0" deleted

Verification steps

  1. Execute the following command and verify that the new node is present in the output:

    $ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= |cut -d' ' -f1
  2. Click Workloads → Pods and confirm that at least the following pods on the new node are in Running state:

    • csi-cephfsplugin-*
    • csi-rbdplugin-*
  3. Verify that all other required OpenShift Container Storage pods are in Running state.

    Also, ensure that the new incremental mon is created and is in the Running state.

    $ oc get pod -n openshift-storage | grep mon

    Example output:

    rook-ceph-mon-a-64556f7659-c2ngc    1/1     Running     0   5h1m
    rook-ceph-mon-b-7c8b74dc4d-tt6hd    1/1     Running     0   5h1m
    rook-ceph-mon-d-57fb8c657-wg5f2     1/1     Running     0   27m

    OSDs and mons might take several minutes to get to the Running state.
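    The mon check above can be scripted as a count of mon pods in the Running state (a sketch using sample lines from the example output; a healthy cluster has three mons):

```shell
# Sketch: count mon pods in the Running state; expect 3 on a healthy cluster.
# Sample lines are taken from the example output above.
mons='rook-ceph-mon-a-64556f7659-c2ngc    1/1     Running     0   5h1m
rook-ceph-mon-b-7c8b74dc4d-tt6hd    1/1     Running     0   5h1m
rook-ceph-mon-d-57fb8c657-wg5f2     1/1     Running     0   27m'

running=$(echo "$mons" | awk '$3 == "Running"' | wc -l | tr -d ' ')
echo "$running"
```

In practice, pipe the live command instead: `oc get pod -n openshift-storage | grep mon | awk '$3 == "Running"' | wc -l`.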

  4. If the verification steps fail, contact Red Hat Support.