Chapter 2. Dynamically provisioned OpenShift Data Foundation deployed on VMware
2.1. Replacing operational or failed storage devices on VMware infrastructure
Create a new Persistent Volume Claim (PVC) on a new volume, and remove the old object storage device (OSD) when one or more virtual machine disks (VMDK) need to be replaced in OpenShift Data Foundation that is deployed dynamically on VMware infrastructure.
Prerequisites
Ensure that the data is resilient.
- In the OpenShift Web Console, click Storage → Data Foundation.
- Click the Storage Systems tab, and then click ocs-storagecluster.
- In the Status card of the Block and File dashboard, under the Overview tab, verify that Data Resiliency has a green tick mark.
Procedure
Identify the OSD that needs to be replaced and the OpenShift Container Platform node that has the OSD scheduled on it.
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide

Example output:

rook-ceph-osd-0-6d77d6c7c6-m8xj6   0/1   CrashLoopBackOff   0   24h   10.129.0.16   compute-2   <none>   <none>
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running            0   24h   10.128.2.24   compute-0   <none>   <none>
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running            0   24h   10.130.0.18   compute-1   <none>   <none>

In this example, rook-ceph-osd-0-6d77d6c7c6-m8xj6 needs to be replaced and compute-2 is the OpenShift Container Platform node on which the OSD is scheduled.

Note: If the OSD you want to replace is healthy, the status of the pod is Running.

Scale down the OSD deployment for the OSD to be replaced.
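The OSD ID used in the next step can be read straight off the failing pod's name. A minimal sketch using shell parameter expansion, with the pod name taken from the example output above:

```shell
# Derive the OSD ID from the failing pod's name (example pod from above).
# The ID is the integer immediately after the "rook-ceph-osd-" prefix.
pod_name="rook-ceph-osd-0-6d77d6c7c6-m8xj6"
osd_id_to_remove="${pod_name#rook-ceph-osd-}"   # strip the "rook-ceph-osd-" prefix
osd_id_to_remove="${osd_id_to_remove%%-*}"      # keep only the integer before the next dash
echo "$osd_id_to_remove"
```

Plain POSIX parameter expansion is enough here, so the sketch avoids an external sed or awk call.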
Each time you want to replace the OSD, update the osd_id_to_remove parameter with the OSD ID, and repeat this step.

$ osd_id_to_remove=0
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0

where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

Example output:

deployment.extensions/rook-ceph-osd-0 scaled

Verify that the rook-ceph-osd pod is terminated.

$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}

Example output:

No resources found.

Important: If the rook-ceph-osd pod is in terminating state, use the force option to delete the pod.

$ oc delete pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --force --grace-period=0

Example output:

warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-6d77d6c7c6-m8xj6" force deleted

Remove the old OSD from the cluster so that you can add a new OSD.
Delete any old ocs-osd-removal jobs.

$ oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Note: If the above job does not reach the Completed state after 10 minutes, delete the job and rerun it with FORCE_OSD_REMOVAL=true.

Navigate to the openshift-storage project.

$ oc project openshift-storage

Remove the old OSD from the cluster.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -

The FORCE_OSD_REMOVAL value must be changed to "true" in clusters that only have three OSDs, or in clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
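The rule above can be sketched as a small helper. The function name and the count-based check are illustrative only; the "insufficient space" condition still needs operator judgment and is not captured here.

```shell
# Hypothetical helper: pick the FORCE_OSD_REMOVAL value from the OSD count.
# Clusters with only three OSDs need "true"; larger clusters can usually keep
# "false" (unless there is not enough free space to restore all replicas).
force_osd_removal() {
  if [ "$1" -le 3 ]; then
    echo "true"
  else
    echo "false"
  fi
}

force_osd_removal 3   # three-OSD cluster
force_osd_removal 6   # larger cluster
```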
Warning: This step results in the OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod. A status of Completed confirms that the OSD removal job succeeded.

$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage

Ensure that the OSD removal is completed.
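As an illustration, the Completed check can be scripted against sample `oc get pod` output; the pod name suffix and ages here are made up, not taken from a real cluster:

```shell
# Parse the STATUS column (third field) of the second line of sample output.
out='NAME                        READY   STATUS      RESTARTS   AGE
ocs-osd-removal-job-xxxxx   0/1     Completed   0          2m'
status="$(printf '%s\n' "$out" | awk 'NR==2 {print $3}')"
echo "$status"
```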
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important: If the ocs-osd-removal-job pod fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1

If encryption was enabled at the time of install, remove the dm-crypt managed device-mapper mapping from the OSD devices that are removed from the respective OpenShift Data Foundation nodes.

Get the PVC name(s) of the replaced OSD(s) from the logs of the ocs-osd-removal-job pod.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'pvc|deviceset'

Example output:

2021-05-12 14:31:34.666000 I | cephosd: removing the OSD PVC "ocs-deviceset-xxxx-xxx-xxx-xxx"

For each of the previously identified nodes, do the following:
Create a debug pod and chroot to the host on the storage node.

$ oc debug node/<node name>

<node name> is the name of the node.
$ chroot /host
Find a relevant device name based on the PVC names identified in the previous step.
$ dmsetup ls | grep <pvc name>

<pvc name> is the name of the PVC.
Example output:
ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt (253:0)
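Taken together, the two lookups (PVC name from the removal-job log, then the matching device-mapper entry) can be sketched on sample outputs; all names here are illustrative placeholders, not real device set names:

```shell
# Extract the PVC name from a sample removal-job log line, then find the
# matching dm-crypt mapping in sample `dmsetup ls` output.
log='2021-05-12 14:31:34.666000 I | cephosd: removing the OSD PVC "ocs-deviceset-xxxx-xxx-xxx-xxx"'
pvc_name="$(printf '%s\n' "$log" | sed -n 's/.*PVC "\([^"]*\)".*/\1/p')"

dm_out='ocs-deviceset-xxxx-xxx-xxx-xxx-block-dmcrypt (253:0)'
device="$(printf '%s\n' "$dm_out" | awk -v pvc="$pvc_name" 'index($0, pvc) {print $1}')"
echo "$device"
```

The resulting device name is what the cryptsetup luksClose step below operates on.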
Remove the mapped device.
$ cryptsetup luksClose --debug --verbose ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt

Important: If the above command gets stuck due to insufficient privileges, run the following commands:

- Press CTRL+Z to exit the above command.
- Find the PID of the process which was stuck.

  $ ps -ef | grep crypt

- Terminate the process using the kill command.

  $ kill -9 <PID>

  <PID> is the process ID.

- Verify that the device name is removed.

  $ dmsetup ls
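The PID lookup can be sketched on a canned `ps -ef` line; the PID values here are illustrative:

```shell
# The PID is the second column of `ps -ef` output.
ps_line='root  24361  24336  0 06:50 pts/0  00:00:00 cryptsetup luksClose --debug --verbose ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt'
pid="$(printf '%s\n' "$ps_line" | awk '{print $2}')"
echo "$pid"
# On a real node, the stuck process would then be terminated with: kill -9 "$pid"
```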
Delete the ocs-osd-removal job.

$ oc delete -n openshift-storage job ocs-osd-removal-job

Example output:
job.batch "ocs-osd-removal-job" deleted
When using an external key management system (KMS) with data encryption, the old OSD encryption key can be removed from the Vault server as it is now an orphan key.
Verification steps
Verify that there is a new OSD running.
$ oc get -n openshift-storage pods -l app=rook-ceph-osd

Example output:

rook-ceph-osd-0-5f7f4747d4-snshw   1/1   Running   0   4m47s
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running   0   1d20h
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running   0   1d20h

Verify that there is a new PVC created which is in Bound state.

$ oc get -n openshift-storage pvc

Example output:

NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-2s6w4   Bound    pvc-7c9bcaf7-de68-40e1-95f9-0b0d7c0ae2fc   512Gi      RWO            thin           5m
ocs-deviceset-1-0-q8fwh   Bound    pvc-9e7e00cb-6b33-402e-9dc5-b8df4fd9010f   512Gi      RWO            thin           1d20h
ocs-deviceset-2-0-9v8lq   Bound    pvc-38cdfcee-ea7e-42a5-a6e1-aaa6d4924291   512Gi      RWO            thin           1d20h

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
Identify the nodes where the new OSD pods are running.
$ oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/<OSD-pod-name>

<OSD-pod-name> is the name of the OSD pod.
For example:
$ oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/rook-ceph-osd-0-544db49d7f-qrgqm

Example output:
NODE
compute-1
For each of the nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name>

<node name> is the name of the node.
$ chroot /host
Check for the crypt keyword beside the ocs-deviceset name(s).

$ lsblk
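For illustration, checking the TYPE column for the crypt keyword can be scripted against sample lsblk output; the device and device set names here are illustrative:

```shell
# Confirm the device set's device-mapper entry has TYPE "crypt".
lsblk_out='NAME                                    MAJ:MIN  RM  SIZE  RO  TYPE
sdb                                     8:16      0  512G   0  disk
ocs-deviceset-0-0-2s6w4-block-dmcrypt   253:0     0  512G   0  crypt'
type_col="$(printf '%s\n' "$lsblk_out" | awk '/ocs-deviceset/ {print $6}')"
echo "$type_col"
```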
Log in to the OpenShift Web Console and view the storage dashboard.