OpenShift Container Storage is now OpenShift Data Foundation starting with version 4.9.
Chapter 3. Dynamically provisioned OpenShift Data Foundation deployed on Red Hat Virtualization
3.1. Replacing operational or failed storage devices on Red Hat Virtualization installer-provisioned infrastructure
Create a new Persistent Volume Claim (PVC) on a new volume, and remove the old object storage device (OSD).
Prerequisites
Ensure that the data is resilient.
- In the OpenShift Web Console, click Storage → Data Foundation.
- Click the Storage Systems tab, and then click ocs-storagecluster-storagesystem.
- In the Status card of the Block and File dashboard, under the Overview tab, verify that Data Resiliency has a green tick mark.
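If you prefer to check data resiliency from the CLI, the Ceph status reported by the Rook toolbox gives the same signal. This is a sketch that assumes the rook-ceph-tools deployment has been enabled in the openshift-storage namespace:

$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status

Look for HEALTH_OK with no recovery or backfill activity in progress before you proceed.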
Procedure
Identify the OSD that needs to be replaced and the OpenShift Container Platform node that has the OSD scheduled on it.
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide
Example output:
rook-ceph-osd-0-6d77d6c7c6-m8xj6   0/1   CrashLoopBackOff   0   24h   10.129.0.16   compute-2   <none>   <none>
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running            0   24h   10.128.2.24   compute-0   <none>   <none>
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running            0   24h   10.130.0.18   compute-1   <none>   <none>
In this example, rook-ceph-osd-0-6d77d6c7c6-m8xj6 needs to be replaced and compute-2 is the OpenShift Container Platform node on which the OSD is scheduled.

Note: If the OSD to be replaced is healthy, the status of the pod will be Running.

Scale down the OSD deployment for the OSD to be replaced.
Each time you want to replace the OSD, update the osd_id_to_remove parameter with the OSD ID, and repeat this step.
$ osd_id_to_remove=0
$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0
where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

Example output:
deployment.extensions/rook-ceph-osd-0 scaled
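If more than one OSD failed, the scale-down in this step can be scripted instead of repeated by hand. A sketch, where the OSD IDs 0 and 3 are placeholders for your failed OSDs:

$ for id in 0 3; do oc scale -n openshift-storage deployment rook-ceph-osd-${id} --replicas=0; done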
Verify that the rook-ceph-osd pod is terminated.
$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}
Example output:
No resources found.
Important: If the rook-ceph-osd pod is in Terminating state, use the force option to delete the pod.
$ oc delete pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --force --grace-period=0
Example output:
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-6d77d6c7c6-m8xj6" force deleted
Remove the old OSD from the cluster so that you can add a new OSD.
Delete any old ocs-osd-removal jobs.
$ oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
Navigate to the openshift-storage project.
$ oc project openshift-storage
Remove the old OSD from the cluster.
$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -
The FORCE_OSD_REMOVAL value must be changed to "true" in clusters that only have three OSDs, or in clusters with insufficient space to restore all three replicas of the data after the OSD is removed.
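For example, on a cluster with only three OSDs, the same removal command is run with the flag flipped:

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} -p FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -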
Warning: This step results in the OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.

A status of Completed confirms that the OSD removal job succeeded.
$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage
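Instead of polling with oc get, you can block until the job finishes. A convenience sketch using oc wait; the 5-minute timeout is an arbitrary choice:

$ oc wait --for=condition=complete job/ocs-osd-removal-job -n openshift-storage --timeout=300s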
Ensure that the OSD removal is completed.
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'
Example output:
2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0
Important: If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
If encryption was enabled at the time of install, remove the dm-crypt managed device-mapper mapping from the OSD devices that are removed from the respective OpenShift Data Foundation nodes.

Get the PVC name(s) of the replaced OSD(s) from the logs of the ocs-osd-removal-job pod.
$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'pvc|deviceset'
Example output:
2021-05-12 14:31:34.666000 I | cephosd: removing the OSD PVC "ocs-deviceset-xxxx-xxx-xxx-xxx"
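If you want to capture the PVC name for the steps that follow rather than copying it by hand, the log line can be parsed in the shell. A sketch that assumes GNU grep with PCRE support (-P); the pattern simply extracts the quoted name from the log message shown above:

$ pvc_name=$(oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | grep -oP 'removing the OSD PVC "\K[^"]+')
$ echo ${pvc_name}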
For each of the previously identified nodes, do the following:
Create a debug pod and chroot to the host on the storage node.
$ oc debug node/<node name>
where <node name> is the name of the node.
$ chroot /host
Find a relevant device name based on the PVC names identified in the previous step.
$ dmsetup ls | grep <pvc name>
where <pvc name> is the name of the PVC.
Example output:
ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt (253:0)
Remove the mapped device.
$ cryptsetup luksClose --debug --verbose ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt
Important: If the above command gets stuck due to insufficient privileges, run the following commands:
Press CTRL+Z to exit the above command.

Find the PID of the process which was stuck.
$ ps -ef | grep crypt
Terminate the process using the kill command.
$ kill -9 <PID>
where <PID> is the process ID.
Verify that the device name is removed.
$ dmsetup ls
Delete the ocs-osd-removal job.
$ oc delete -n openshift-storage job ocs-osd-removal-job
Example output:
job.batch "ocs-osd-removal-job" deleted
When using an external key management system (KMS) with data encryption, the old OSD encryption key can be removed from the Vault server as it is now an orphan key.
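A hedged sketch of that cleanup on the Vault side, assuming a KV secrets engine and the backend path that was configured for cluster-wide encryption at deployment time (both the path and the key name below are placeholders):

$ vault kv list <backend_path>
$ vault kv delete <backend_path>/<old_osd_key_name>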
Verification steps
Verify that there is a new OSD running.
$ oc get -n openshift-storage pods -l app=rook-ceph-osd
Example output:
rook-ceph-osd-0-5f7f4747d4-snshw   1/1   Running   0   4m47s
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running   0   1d20h
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running   0   1d20h
Verify that there is a new PVC created which is in Bound state.
$ oc get -n openshift-storage pvc
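To narrow the output to the OSD device sets, a simple filter helps (a convenience sketch, not part of the documented procedure):

$ oc get -n openshift-storage pvc | grep ocs-deviceset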
Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
Identify the nodes where the new OSD pods are running.
$ oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/<OSD-pod-name>
where <OSD-pod-name> is the name of the OSD pod.
For example:
$ oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/rook-ceph-osd-0-544db49d7f-qrgqm
Example output:
NODE
compute-1
For each of the previously identified nodes, do the following:
Create a debug pod and open a chroot environment for the selected host(s).
$ oc debug node/<node name>
where <node name> is the name of the node.
$ chroot /host
Check for the crypt keyword beside the ocs-deviceset name(s).
$ lsblk
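As an illustration of what to look for, an encrypted OSD device appears with crypt in the TYPE column of the lsblk output. The device name and size below are placeholders:

sdb                                               8:16   0  512G  0 disk
└─ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt   253:0    0  512G  0 crypt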
Log in to the OpenShift Web Console and view the storage dashboard.