OpenShift Container Storage is now OpenShift Data Foundation starting with version 4.9.
Chapter 9. Replacing failed storage nodes on Red Hat Virtualization platform
The ephemeral storage that the Red Hat Virtualization platform provides for OpenShift Container Storage might cause data loss when an instance is powered off. Use this procedure to recover from such an instance power-off on Red Hat Virtualization.
Prerequisites
- You must be logged in to the OpenShift Container Platform (OCP) cluster.
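For example, a quick way to confirm that the current session is logged in to the intended cluster is to print the logged-in user and the API server URL (both are standard oc subcommands):
$ oc whoami
$ oc whoami --show-server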
Procedure
Identify the node and get labels on the node to be replaced.
$ oc get nodes --show-labels | grep <node_name>
Identify the mon (if any) and OSDs that are running in the node to be replaced.
$ oc get pods -n openshift-storage -o wide | grep -i <node_name>
Scale down the deployments of the pods identified in the previous step.
For example:
$ oc scale deployment rook-ceph-mon-c --replicas=0 -n openshift-storage
$ oc scale deployment rook-ceph-osd-0 --replicas=0 -n openshift-storage
$ oc scale deployment --selector=app=rook-ceph-crashcollector,node_name=<node_name> --replicas=0 -n openshift-storage
Mark the node as unschedulable.
$ oc adm cordon <node_name>
Drain the node.
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Note: If the failed node is not connected to the network, remove the pods running on it by using the following commands:
$ oc get pods -A -o wide | grep -i <node_name> | awk '{if ($4 == "Terminating") system ("oc -n " $1 " delete pods " $2 " --grace-period=0 " " --force ")}'
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Remove the failed node after making a note of the device by-id.
List the nodes.
$ oc get nodes
Example output:
NAME          STATUS   ROLES    AGE     VERSION
rhvworker01   Ready    worker   6h45m   v1.16.2
rhvworker02   Ready    worker   6h45m   v1.16.2
rhvworker03   Ready    worker   6h45m   v1.16.2
Find the unique by-id device name for each of the existing nodes.
$ oc debug node/<Nodename>
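A minimal sketch of what you might run inside the node's debug shell to list the by-id names; the device names in your output depend on your environment:
# chroot /host
# ls -l /dev/disk/by-id/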
Repeat this to identify the device ID for all the existing nodes.
Delete the machine corresponding to the failed node. A new node is automatically added.
- Click Compute → Machines. Search for the required machine.
- Beside the required machine, click the Action menu (⋮) → Delete Machine.
- Click Delete to confirm the machine deletion. A new machine is automatically created.
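If you prefer the CLI to the web console, a minimal sketch of the equivalent deletion is shown below; the machine name is a placeholder that you look up first from the machines list:
$ oc get machines -n openshift-machine-api -o wide | grep <node_name>
$ oc delete machine <machine_name> -n openshift-machine-api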
Wait for the new machine to start and transition into Running state.
Important: This activity may take at least 5-10 minutes or more.
- [Optional]: If the failed Red Hat Virtualization instance is not removed automatically, terminate the instance from the console.
Click Compute → Nodes in the OpenShift web console. Confirm that the new node is in Ready state.
Apply the OpenShift Container Storage label to the new node using any one of the following:
From the user interface:
- For the new node, click Action Menu (⋮) → Edit Labels.
- Add cluster.ocs.openshift.io/openshift-storage and click Save.
From the command line interface:
Execute the following command to apply the OpenShift Container Storage label to the new node:
$ oc label node <new_node_name> cluster.ocs.openshift.io/openshift-storage=""
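As a quick check, you can confirm that the label is now present on the new node (the node name is a placeholder):
$ oc get node <new_node_name> --show-labels | grep cluster.ocs.openshift.io/openshift-storage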
Add the local storage devices available in the new worker node to the OpenShift Container Storage StorageCluster.
Add the new disk entries to the LocalVolume CR.
Edit the LocalVolume CR. You can either remove or comment out the failed device /dev/disk/by-id/{id} and add the new /dev/disk/by-id/{id}. For information about identifying the device by-id, see Finding available storage devices.
$ oc get -n local-storage localvolume
Example output:
NAME          AGE
local-block   25h
$ oc edit -n local-storage localvolume local-block
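A minimal sketch of how the edited LocalVolume CR might look; the hostnames match the example nodes above, the devicePaths entries are placeholders for the by-id values you noted earlier, and any other values already present in your CR should be kept as they are:
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: local-block
  namespace: local-storage
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - rhvworker01
        - rhvworker02
        - rhvworker03
  storageClassDevices:
  - storageClassName: localblock
    volumeMode: Block
    devicePaths:
    - /dev/disk/by-id/<existing_device_id>
    # - /dev/disk/by-id/<failed_device_id>   # failed device, removed or commented out
    - /dev/disk/by-id/<new_device_id>        # new device on the replacement node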
Make sure to save the changes after editing the CR.
You can see that in this CR the following new device using by-id has been added.
QEMU_QEMU_HARDDISK_a01cab90-d3cd-4822-a7eb-a5553d4b619a
Display the PVs with localblock.
$ oc get pv | grep localblock
Delete each PV and OSD associated with the failed node using the following steps.
Identify the DeviceSet associated with the OSD to be replaced.
$ osd_id_to_remove=0
$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc
where, osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.
Example output:
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
ceph.rook.io/pvc: ocs-deviceset-0-0-nvs68
Identify the PV associated with the PVC.
$ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>
where, x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.
Example output:
NAME                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-nvs68   Bound    local-pv-8176b2bf   2328Gi     RWO            localblock     4h49m
In this example, the associated PV is local-pv-8176b2bf.
Delete the PVC which was identified in earlier steps. In this example, the PVC name is ocs-deviceset-0-0-nvs68.
$ oc delete pvc ocs-deviceset-0-0-nvs68 -n openshift-storage
Example output:
persistentvolumeclaim "ocs-deviceset-0-0-nvs68" deleted
Delete the PV which was identified in earlier steps. In this example, the PV name is local-pv-8176b2bf.
$ oc delete pv local-pv-8176b2bf
Example output:
persistentvolume "local-pv-8176b2bf" deleted
Remove the failed OSD from the cluster.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_ID=${osd_id_to_remove} | oc create -f -
Verify that the OSD is removed successfully by checking the status of the ocs-osd-removal pod. A status of Completed confirms that the OSD removal job succeeded.
# oc get pod -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage
Note: If ocs-osd-removal fails and the pod is not in the expected Completed state, check the pod logs for further debugging. For example:
# oc logs -l job-name=ocs-osd-removal-${osd_id_to_remove} -n openshift-storage --tail=-1
Delete the OSD pod deployment.
$ oc delete deployment rook-ceph-osd-${osd_id_to_remove} -n openshift-storage
Delete the crashcollector pod deployment identified in an earlier step.
$ oc delete deployment --selector=app=rook-ceph-crashcollector,node_name=<old_node_name> -n openshift-storage
Deploy the new OSD by restarting the rook-ceph-operator to force operator reconciliation.
$ oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-2d982   1/1     Running   0          5h3m
Delete the rook-ceph-operator.
$ oc delete -n openshift-storage pod rook-ceph-operator-6f74fb5bff-2d982
Example output:
pod "rook-ceph-operator-6f74fb5bff-2d982" deleted
Verify that the rook-ceph-operator pod is restarted.
$ oc get -n openshift-storage pod -l app=rook-ceph-operator
Example output:
NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-6f74fb5bff-7mvrq   1/1     Running   0          66s
Creation of the new OSD may take several minutes after the operator starts.
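While you wait, a simple way to watch for the new OSD pod to appear is to follow the OSD pods by label selector (this assumes the usual app=rook-ceph-osd label that Rook applies to OSD pods):
$ oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide -w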
Delete the ocs-osd-removal job(s).
$ oc delete job ocs-osd-removal-${osd_id_to_remove}
Example output:
job.batch "ocs-osd-removal-0" deleted
Verification steps
Execute the following command and verify that the new node is present in the output:
$ oc get nodes --show-labels | grep cluster.ocs.openshift.io/openshift-storage= | cut -d' ' -f1
Click Workloads → Pods, confirm that at least the following pods on the new node are in Running state:
- csi-cephfsplugin-*
- csi-rbdplugin-*
Verify that all other required OpenShift Container Storage pods are in Running state.
Also, ensure that the new incremental mon is created and is in the Running state.
$ oc get pod -n openshift-storage | grep mon
Example output:
rook-ceph-mon-a-64556f7659-c2ngc   1/1   Running   0   5h1m
rook-ceph-mon-b-7c8b74dc4d-tt6hd   1/1   Running   0   5h1m
rook-ceph-mon-d-57fb8c657-wg5f2    1/1   Running   0   27m
OSDs and mons might take several minutes to get to the Running state.
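Optionally, if the rook-ceph-tools pod is deployed in your cluster (it is not always present), you can also confirm overall Ceph health from it; the selector shown assumes the standard app=rook-ceph-tools label:
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage ${TOOLS_POD} ceph status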
- If the verification steps fail, contact Red Hat Support.