Replacing devices
Abstract
Instructions for safely replacing operational or failed devices
Making open source more inclusive
Red Hat is committed to replacing problematic language in our code, documentation, and web properties. We are beginning with these four terms: master, slave, blacklist, and whitelist. Because of the enormity of this endeavor, these changes will be implemented gradually over several upcoming releases. For more details, see our CTO Chris Wright’s message.
Providing feedback on Red Hat documentation
We appreciate your input on our documentation. Do let us know how we can make it better.
To give feedback, create a Jira ticket:
- Log in to Jira.
- Click Create in the top navigation bar.
- Enter a descriptive title in the Summary field.
- Enter your suggestion for improvement in the Description field. Include links to the relevant parts of the documentation.
- Select Documentation in the Components field.
- Click Create at the bottom of the dialogue.
Preface
Depending on the type of your deployment, you can choose one of the following procedures to replace a storage device:
- For dynamically created storage clusters deployed on AWS, see:
  - Section 1.1, “Replacing operational or failed storage devices on AWS user-provisioned infrastructure”
  - Section 1.2, “Replacing operational or failed storage devices on AWS installer-provisioned infrastructure”
- For dynamically created storage clusters deployed on VMware, see Section 2.1, “Replacing operational or failed storage devices on VMware infrastructure”.
- For dynamically created storage clusters deployed on Microsoft Azure, see Section 3.1, “Replacing operational or failed storage devices on Azure installer-provisioned infrastructure”.
- For dynamically created storage clusters deployed on Google Cloud, see Section 4.1, “Replacing operational or failed storage devices on Google Cloud installer-provisioned infrastructure”.
- For storage clusters deployed using local storage devices, see:
  - Section 5.1, “Replacing operational or failed storage devices on clusters backed by local storage devices”
  - Section 5.2, “Replacing operational or failed storage devices on IBM Power”
  - Section 5.3, “Replacing operational or failed storage devices on IBM Z or IBM LinuxONE infrastructure”

Note: OpenShift Data Foundation does not support heterogeneous OSD sizes.
Chapter 1. Dynamically provisioned OpenShift Data Foundation deployed on AWS
1.1. Replacing operational or failed storage devices on AWS user-provisioned infrastructure
When you need to replace a device in a dynamically created storage cluster on an AWS user-provisioned infrastructure, you must replace the storage node. For information about how to replace nodes, see the node replacement procedures for AWS user-provisioned infrastructure.
1.2. Replacing operational or failed storage devices on AWS installer-provisioned infrastructure
When you need to replace a device in a dynamically created storage cluster on an AWS installer-provisioned infrastructure, you must replace the storage node. For information about how to replace nodes, see the node replacement procedures for AWS installer-provisioned infrastructure.
Chapter 2. Dynamically provisioned OpenShift Data Foundation deployed on VMware
2.1. Replacing operational or failed storage devices on VMware infrastructure
When one or more virtual machine disks (VMDKs) need to be replaced in OpenShift Data Foundation that is deployed dynamically on VMware infrastructure, create a new Persistent Volume Claim (PVC) on a new volume and remove the old object storage device (OSD).
Prerequisites
Ensure that the data is resilient.
- In the OpenShift Web Console, click Storage → Data Foundation.
- Click the Storage Systems tab, and then click ocs-storagecluster-storagesystem.
- In the Status card of the Block and File dashboard, under the Overview tab, verify that Data Resiliency has a green tick mark.
Procedure
Identify the OSD that needs to be replaced and the OpenShift Container Platform node that has the OSD scheduled on it.
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide

Example output:

rook-ceph-osd-0-6d77d6c7c6-m8xj6   0/1   CrashLoopBackOff   0   24h   10.129.0.16   compute-2   <none>   <none>
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running            0   24h   10.128.2.24   compute-0   <none>   <none>
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running            0   24h   10.130.0.18   compute-1   <none>   <none>

In this example, rook-ceph-osd-0-6d77d6c7c6-m8xj6 needs to be replaced and compute-2 is the OpenShift Container Platform node on which the OSD is scheduled.

Note: The status of the pod is Running if the OSD you want to replace is healthy.

Scale down the OSD deployment for the OSD to be replaced.
Each time you want to replace the OSD, update the osd_id_to_remove parameter with the OSD ID, and repeat this step.

$ osd_id_to_remove=0

$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0

where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

Example output:

deployment.extensions/rook-ceph-osd-0 scaled

Verify that the rook-ceph-osd pod is terminated.

$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}

Example output:

No resources found.

Important: If the rook-ceph-osd pod is in terminating state, use the force option to delete the pod.

$ oc delete pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --force --grace-period=0

Example output:

warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-6d77d6c7c6-m8xj6" force deleted

Remove the old OSD from the cluster so that you can add a new OSD.
Delete any old ocs-osd-removal jobs.

$ oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Navigate to the openshift-storage project.

$ oc project openshift-storage

Remove the old OSD from the cluster.

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -

The FORCE_OSD_REMOVAL value must be changed to "true" in clusters that have only three OSDs, or in clusters with insufficient space to restore all three replicas of the data after the OSD is removed.

Warning: This step results in the OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.
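For example, in a cluster with only three OSDs where forced removal is required, you would process the same template with only the FORCE_OSD_REMOVAL value changed:

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=true | oc create -n openshift-storage -f -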
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.

A status of Completed confirms that the OSD removal job succeeded.

$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important: If the ocs-osd-removal-job pod fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
If encryption was enabled at the time of install, remove the dm-crypt managed device-mapper mapping from the OSD devices that are removed from the respective OpenShift Data Foundation nodes.

Get the PVC name(s) of the replaced OSD(s) from the logs of the ocs-osd-removal-job pod.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'pvc|deviceset'

Example output:

2021-05-12 14:31:34.666000 I | cephosd: removing the OSD PVC "ocs-deviceset-xxxx-xxx-xxx-xxx"

For each of the previously identified nodes, do the following:

Create a debug pod and chroot to the host on the storage node.

$ oc debug node/<node name>

<node name> is the name of the node.

$ chroot /host

Find the relevant device name based on the PVC names identified in the previous step.

$ dmsetup ls | grep <pvc name>

<pvc name> is the name of the PVC.

Example output:

ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt (253:0)
Remove the mapped device.

$ cryptsetup luksClose --debug --verbose ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt

Important: If the above command gets stuck due to insufficient privileges, run the following commands:

- Press CTRL+Z to exit the above command.
- Find the PID of the process which was stuck.

$ ps -ef | grep crypt

- Terminate the process using the kill command.

$ kill -9 <PID>

<PID> is the process ID.

- Verify that the device name is removed.

$ dmsetup ls
Delete the ocs-osd-removal job.

$ oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted
When using an external key management system (KMS) with data encryption, the old OSD encryption key can be removed from the Vault server as it is now an orphan key.
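For example, if the keys are kept in a Vault key-value (KV) secrets engine, a minimal sketch of the cleanup looks like the following; the backend path and key name are assumptions and must be replaced with the backend configured for your cluster and the key that corresponds to the removed OSD:

$ vault kv list <backend_path>                 # assumed KV backend path; locate the orphan key
$ vault kv delete <backend_path>/<orphan_key>  # key name corresponding to the removed OSD (assumption)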
Verification steps
Verify that there is a new OSD running.
$ oc get -n openshift-storage pods -l app=rook-ceph-osd

Example output:

rook-ceph-osd-0-5f7f4747d4-snshw   1/1   Running   0   4m47s
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running   0   1d20h
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running   0   1d20h

Verify that there is a new PVC created which is in Bound state.

$ oc get -n openshift-storage pvc

Example output:

NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-0-0-2s6w4   Bound    pvc-7c9bcaf7-de68-40e1-95f9-0b0d7c0ae2fc   512Gi      RWO            thin           5m
ocs-deviceset-1-0-q8fwh   Bound    pvc-9e7e00cb-6b33-402e-9dc5-b8df4fd9010f   512Gi      RWO            thin           1d20h
ocs-deviceset-2-0-9v8lq   Bound    pvc-38cdfcee-ea7e-42a5-a6e1-aaa6d4924291   512Gi      RWO            thin           1d20h

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

Identify the nodes where the new OSD pods are running.

$ oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/<OSD-pod-name>

<OSD-pod-name> is the name of the OSD pod.

For example:

$ oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/rook-ceph-osd-0-544db49d7f-qrgqm

Example output:

NODE
compute-1
For each of the nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).

$ oc debug node/<node name>

<node name> is the name of the node.

$ chroot /host

Check for the crypt keyword beside the ocs-deviceset name(s).

$ lsblk
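In the lsblk output, an encrypted OSD appears as a crypt device mapped on top of the backing disk, similar to this illustrative snippet (device names and sizes are placeholders):

sdb                                             8:16   0   512G  0 disk
└─ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt 253:0    0   512G  0 crypt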
- Log in to OpenShift Web Console and view the storage dashboard.
Chapter 3. Dynamically provisioned OpenShift Data Foundation deployed on Microsoft Azure
3.1. Replacing operational or failed storage devices on Azure installer-provisioned infrastructure
When you need to replace a device in a dynamically created storage cluster on an Azure installer-provisioned infrastructure, you must replace the storage node. For information about how to replace nodes, see the node replacement procedures for Azure installer-provisioned infrastructure.
Chapter 4. Dynamically provisioned OpenShift Data Foundation deployed on Google Cloud
4.1. Replacing operational or failed storage devices on Google Cloud installer-provisioned infrastructure
When you need to replace a device in a dynamically created storage cluster on a Google Cloud installer-provisioned infrastructure, you must replace the storage node. For information about how to replace nodes, see the node replacement procedures for Google Cloud installer-provisioned infrastructure.
Chapter 5. OpenShift Data Foundation deployed using local storage devices
5.1. Replacing operational or failed storage devices on clusters backed by local storage devices
You can replace an object storage device (OSD) in OpenShift Data Foundation deployed using local storage devices on the following infrastructures:
- Bare metal
- VMware
There might be a need to replace one or more underlying storage devices.
Prerequisites
- Red Hat recommends that replacement devices are configured with similar infrastructure and resources to the device being replaced.
Ensure that the data is resilient.
- In the OpenShift Web Console, click Storage → Data Foundation.
- Click the Storage Systems tab, and then click ocs-storagecluster-storagesystem.
- In the Status card of the Block and File dashboard, under the Overview tab, verify that Data Resiliency has a green tick mark.
Procedure
- Remove the underlying storage device from the relevant worker node.
Verify that the relevant OSD pod has moved to the CrashLoopBackOff state.
Identify the OSD that needs to be replaced and the OpenShift Container Platform node that has the OSD scheduled on it.
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide

Example output:

rook-ceph-osd-0-6d77d6c7c6-m8xj6   0/1   CrashLoopBackOff   0   24h   10.129.0.16   compute-2   <none>   <none>
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running            0   24h   10.128.2.24   compute-0   <none>   <none>
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running            0   24h   10.130.0.18   compute-1   <none>   <none>

In this example, rook-ceph-osd-0-6d77d6c7c6-m8xj6 needs to be replaced and compute-2 is the OpenShift Container Platform node on which the OSD is scheduled.

Scale down the rook-ceph-operator deployment.

$ oc -n openshift-storage scale deployment rook-ceph-operator --replicas=0

Scale down the OSD deployment for the OSD to be replaced.
$ osd_id_to_remove=0

$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0

where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

Example output:

deployment.extensions/rook-ceph-osd-0 scaled

Verify that the rook-ceph-osd pod is terminated.

$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}

Example output:

No resources found in openshift-storage namespace.

Important: If the rook-ceph-osd pod is in terminating state for more than a few minutes, use the force option to delete the pod.

$ oc delete -n openshift-storage pod rook-ceph-osd-0-6d77d6c7c6-m8xj6 --grace-period=0 --force

Example output:

warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-6d77d6c7c6-m8xj6" force deleted

Remove the old OSD from the cluster so that you can add a new OSD.
Delete any old ocs-osd-removal jobs.

$ oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Navigate to the openshift-storage project.

$ oc project openshift-storage

Remove the old OSD from the cluster.

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -

The FORCE_OSD_REMOVAL value must be changed to "true" in clusters that have only three OSDs, or in clusters with insufficient space to restore all three replicas of the data after the OSD is removed.

Warning: This step results in the OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.

A status of Completed confirms that the OSD removal job succeeded.

$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important: If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
If encryption was enabled at the time of install, remove the dm-crypt managed device-mapper mapping from the OSD devices that are removed from the respective OpenShift Data Foundation nodes.

Get the Persistent Volume Claim (PVC) name(s) of the replaced OSD(s) from the logs of the ocs-osd-removal-job pod.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'pvc|deviceset'

Example output:

2021-05-12 14:31:34.666000 I | cephosd: removing the OSD PVC "ocs-deviceset-xxxx-xxx-xxx-xxx"

For each of the previously identified nodes, do the following:

Create a debug pod and chroot to the host on the storage node.

$ oc debug node/<node name>

<node name> is the name of the node.

$ chroot /host

Find the relevant device name based on the PVC names identified in the previous step.

$ dmsetup ls | grep <pvc name>

<pvc name> is the name of the PVC.

Example output:

ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt (253:0)
Remove the mapped device.

$ cryptsetup luksClose --debug --verbose <ocs-deviceset-name>

<ocs-deviceset-name> is the name of the relevant device based on the PVC names identified in the previous step.

Important: If the above command gets stuck due to insufficient privileges, run the following commands:

- Press CTRL+Z to exit the above command.
- Find the PID of the process which was stuck.

$ ps -ef | grep crypt

- Terminate the process using the kill command.

$ kill -9 <PID>

<PID> is the process ID.

- Verify that the device name is removed.

$ dmsetup ls
Find the persistent volume (PV) that needs to be deleted.

$ oc get pv -L kubernetes.io/hostname | grep <storageclass-name> | grep Released

Example output:

local-pv-d6bf175b   1490Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-0-data-0-6c5pw   localblock   2d22h   compute-1

Delete the PV.

$ oc delete pv <pv_name>
- Physically add a new device to the node.

Track the provisioning of PVs for the devices that match the deviceInclusionSpec. It can take a few minutes to provision the PVs.

$ oc -n openshift-local-storage describe localvolumeset <lvs-name>

Once the PV is provisioned, a new OSD pod is automatically created for the PV.
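As a simple way to track this, you can repeat the following listing until a new PV appears for the local storage class (the storage class name is a placeholder):

$ oc get pv | grep <storageclass-name>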
Delete the ocs-osd-removal job(s).

$ oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Note: When using an external key management system (KMS) with data encryption, the old OSD encryption key can be removed from the Vault server as it is now an orphan key.
Scale up the rook-ceph-operator deployment.

$ oc -n openshift-storage scale deployment rook-ceph-operator --replicas=1

Optional: Verify the Ceph cluster health and check if there are any crash reports associated with the previously failed OSD which has now been replaced.

$ oc exec -it $(oc get pod -n openshift-storage -l app=rook-ceph-operator -o name) -n openshift-storage -- ceph health detail -c /var/lib/rook/openshift-storage/openshift-storage.config

To clear all crash reports from Ceph health:

$ oc exec -it $(oc get pod -n openshift-storage -l app=rook-ceph-operator -o name) -n openshift-storage -- ceph crash archive-all -c /var/lib/rook/openshift-storage/openshift-storage.config
Verification steps
Verify that there is a new OSD running.
$ oc get -n openshift-storage pods -l app=rook-ceph-osd

Example output:

rook-ceph-osd-0-5f7f4747d4-snshw   1/1   Running   0   4m47s
rook-ceph-osd-1-85d99fb95f-2svc7   1/1   Running   0   1d20h
rook-ceph-osd-2-6c66cdb977-jp542   1/1   Running   0   1d20h

Important: If the new OSD does not show as Running after a few minutes, restart the rook-ceph-operator pod to force a reconciliation.

$ oc delete pod -n openshift-storage -l app=rook-ceph-operator

Example output:

pod "rook-ceph-operator-6f74fb5bff-2d982" deleted

Verify that a new PVC is created.

$ oc get -n openshift-storage pvc | grep <lvs-name>

Example output:

ocs-deviceset-0-0-c2mqb   Bound   local-pv-b481410    1490Gi   RWO   localblock   5m
ocs-deviceset-1-0-959rp   Bound   local-pv-414755e0   1490Gi   RWO   localblock   1d20h
ocs-deviceset-2-0-79j94   Bound   local-pv-3e8964d3   1490Gi   RWO   localblock   1d20h

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.
Identify the nodes where the new OSD pods are running.
$ oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/<OSD-pod-name>

<OSD-pod-name> is the name of the OSD pod.

For example:

$ oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/rook-ceph-osd-0-544db49d7f-qrgqm

Example output:

NODE
compute-1
For each of the nodes identified in the previous step, do the following:
Create a debug pod and open a chroot environment for the selected host(s).

$ oc debug node/<node name>

<node name> is the name of the node.

$ chroot /host

Check for the crypt keyword beside the ocs-deviceset name(s).

$ lsblk
- Log in to OpenShift Web Console and check the OSD status on the storage dashboard.
A full data recovery may take longer depending on the volume of data being recovered.
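To follow the recovery progress, you can check the overall Ceph status with the same exec pattern used earlier in this procedure; this is a sketch and assumes the rook-ceph-operator pod and configuration path shown above:

$ oc exec -it $(oc get pod -n openshift-storage -l app=rook-ceph-operator -o name) -n openshift-storage -- ceph status -c /var/lib/rook/openshift-storage/openshift-storage.config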
5.2. Replacing operational or failed storage devices on IBM Power
You can replace an object storage device (OSD) in OpenShift Data Foundation deployed using local storage devices on IBM Power.
There might be a need to replace one or more underlying storage devices.
Prerequisites
- Red Hat recommends that replacement devices are configured with similar infrastructure and resources to the device being replaced.
Ensure that the data is resilient.
- In the OpenShift Web Console, click Storage → Data Foundation.
- Click the Storage Systems tab, and then click ocs-storagecluster-storagesystem.
- In the Status card of the Block and File dashboard, under the Overview tab, verify that Data Resiliency has a green tick mark.
Procedure
Identify the OSD that needs to be replaced and the OpenShift Container Platform node that has the OSD scheduled on it.
$ oc get -n openshift-storage pods -l app=rook-ceph-osd -o wide

Example output:

rook-ceph-osd-0-86bf8cdc8-4nb5t    0/1   CrashLoopBackOff   0   24h   10.129.2.26   worker-0   <none>   <none>
rook-ceph-osd-1-7c99657cfb-jdzvz   1/1   Running            0   24h   10.128.2.46   worker-1   <none>   <none>
rook-ceph-osd-2-5f9f6dfb5b-2mnw9   1/1   Running            0   24h   10.131.0.33   worker-2   <none>   <none>

In this example, rook-ceph-osd-0-86bf8cdc8-4nb5t needs to be replaced and worker-0 is the RHOCP node on which the OSD is scheduled.

Note: The status of the pod is Running if the OSD you want to replace is healthy.

Scale down the OSD deployment for the OSD to be replaced.
$ osd_id_to_remove=0

$ oc scale -n openshift-storage deployment rook-ceph-osd-${osd_id_to_remove} --replicas=0

where osd_id_to_remove is the integer in the pod name immediately after the rook-ceph-osd prefix. In this example, the deployment name is rook-ceph-osd-0.

Example output:

deployment.extensions/rook-ceph-osd-0 scaled

Verify that the rook-ceph-osd pod is terminated.

$ oc get -n openshift-storage pods -l ceph-osd-id=${osd_id_to_remove}

Example output:

No resources found in openshift-storage namespace.

Important: If the rook-ceph-osd pod is in terminating state for more than a few minutes, use the force option to delete the pod.

$ oc delete -n openshift-storage pod rook-ceph-osd-0-86bf8cdc8-4nb5t --grace-period=0 --force

Example output:

warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "rook-ceph-osd-0-86bf8cdc8-4nb5t" force deleted

Remove the old OSD from the cluster so that you can add a new OSD.
Identify the DeviceSet associated with the OSD to be replaced.

$ oc get -n openshift-storage -o yaml deployment rook-ceph-osd-${osd_id_to_remove} | grep ceph.rook.io/pvc

Example output:

ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-64xjl
ceph.rook.io/pvc: ocs-deviceset-localblock-0-data-0-64xjl

In this example, the Persistent Volume Claim (PVC) name is ocs-deviceset-localblock-0-data-0-64xjl.

Identify the Persistent Volume (PV) associated with the PVC.

$ oc get -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix>

where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

Example output:

NAME                                      STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ocs-deviceset-localblock-0-data-0-64xjl   Bound    local-pv-8137c873   256Gi      RWO            localblock     24h

In this example, the associated PV is local-pv-8137c873.
Identify the name of the device to be replaced.

$ oc get pv local-pv-<pv-suffix> -o yaml | grep path

where pv-suffix is the value in the PV name identified in an earlier step.

Example output:

path: /mnt/local-storage/localblock/vdc

In this example, the device name is vdc.

Identify the prepare-pod associated with the OSD to be replaced.

$ oc describe -n openshift-storage pvc ocs-deviceset-<x>-<y>-<pvc-suffix> | grep Used

where x, y, and pvc-suffix are the values in the DeviceSet identified in an earlier step.

Example output:

Used By: rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-0-64knzkc

In this example, the prepare-pod name is rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data-0-64knzkc.
Delete any old ocs-osd-removal jobs.

$ oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted

Change to the openshift-storage project.

$ oc project openshift-storage

Remove the old OSD from the cluster.

$ oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id_to_remove} FORCE_OSD_REMOVAL=false | oc create -n openshift-storage -f -

The FORCE_OSD_REMOVAL value must be changed to "true" in clusters that have only three OSDs, or in clusters with insufficient space to restore all three replicas of the data after the OSD is removed.

Warning: This step results in the OSD being completely removed from the cluster. Ensure that the correct value of osd_id_to_remove is provided.
Verify that the OSD was removed successfully by checking the status of the ocs-osd-removal-job pod.

A status of Completed confirms that the OSD removal job succeeded.

$ oc get pod -l job-name=ocs-osd-removal-job -n openshift-storage

Ensure that the OSD removal is completed.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'completed removal'

Example output:

2022-05-10 06:50:04.501511 I | cephosd: completed removal of OSD 0

Important: If the ocs-osd-removal-job fails and the pod is not in the expected Completed state, check the pod logs for further debugging.

For example:

# oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1
If encryption was enabled at the time of install, remove the dm-crypt managed device-mapper mapping from the OSD devices that are removed from the respective OpenShift Data Foundation nodes.

Get the PVC name(s) of the replaced OSD(s) from the logs of the ocs-osd-removal-job pod.

$ oc logs -l job-name=ocs-osd-removal-job -n openshift-storage --tail=-1 | egrep -i 'pvc|deviceset'

Example output:

2021-05-12 14:31:34.666000 I | cephosd: removing the OSD PVC "ocs-deviceset-xxxx-xxx-xxx-xxx"

For each of the previously identified nodes, do the following:

Create a debug pod and chroot to the host on the storage node.

$ oc debug node/<node name>

<node name> is the name of the node.

$ chroot /host

Find the relevant device name based on the PVC names identified in the previous step.

$ dmsetup ls | grep <pvc name>

<pvc name> is the name of the PVC.

Example output:

ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt (253:0)
Remove the mapped device.

$ cryptsetup luksClose --debug --verbose ocs-deviceset-xxx-xxx-xxx-xxx-block-dmcrypt

Important: If the above command gets stuck due to insufficient privileges, run the following commands:

- Press CTRL+Z to exit the above command.
- Find the PID of the process which was stuck.

$ ps -ef | grep crypt

- Terminate the process using the kill command.

$ kill -9 <PID>

<PID> is the process ID.

- Verify that the device name is removed.

$ dmsetup ls
Find the PV that needs to be deleted.

$ oc get pv -L kubernetes.io/hostname | grep localblock | grep Released

Example output:

local-pv-d6bf175b   1490Gi   RWO   Delete   Released   openshift-storage/ocs-deviceset-0-data-0-6c5pw   localblock   2d22h   compute-1

Delete the PV.

$ oc delete pv <pv-name>

<pv-name> is the name of the PV.
Replace the old device and use the new device to create a new OpenShift Container Platform PV.
Log in to the OpenShift Container Platform node with the device to be replaced. In this example, the OpenShift Container Platform node is worker-0.

$ oc debug node/worker-0

Example output:

Starting pod/worker-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.88.21
If you don't see a command prompt, try pressing enter.
# chroot /host

Record the /dev/disk that is to be replaced, using the device name, vdc, identified earlier.

# ls -alh /mnt/local-storage/localblock

Example output:

total 0
drwxr-xr-x. 2 root root 17 Nov 18 15:23 .
drwxr-xr-x. 3 root root 24 Nov 18 15:23 ..
lrwxrwxrwx. 1 root root  8 Nov 18 15:23 vdc -> /dev/vdc

Find the name of the LocalVolume CR, and remove or comment out the device /dev/disk that is to be replaced.

$ oc get -n openshift-local-storage localvolume

Example output:

NAME         AGE
localblock   25h

# oc edit -n openshift-local-storage localvolume localblock
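As a rough, abbreviated sketch, assuming the localblock LocalVolume shown above (field values are illustrative and unrelated fields are omitted), the edited CR might look like the following, with the old device /dev/vdc commented out:

apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
  name: localblock
  namespace: openshift-local-storage
spec:
  storageClassDevices:
  - devicePaths:
    # - /dev/vdc   # device being replaced, commented out
    storageClassName: localblock
    volumeMode: Block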
Make sure to save the changes after editing the CR.
Log in to the OpenShift Container Platform node with the device to be replaced and remove the old symlink.

$ oc debug node/worker-0

Example output:

Starting pod/worker-0-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.88.21
If you don't see a command prompt, try pressing enter.
# chroot /host

Identify the old symlink for the device name to be replaced. In this example, the device name is vdc.

# ls -alh /mnt/local-storage/localblock

Example output:

total 0
drwxr-xr-x. 2 root root 17 Nov 18 15:23 .
drwxr-xr-x. 3 root root 24 Nov 18 15:23 ..
lrwxrwxrwx. 1 root root  8 Nov 18 15:23 vdc -> /dev/vdc

Remove the symlink.

# rm /mnt/local-storage/localblock/vdc

Verify that the symlink is removed.

# ls -alh /mnt/local-storage/localblock

Example output:

total 0
drwxr-xr-x. 2 root root  6 Nov 18 17:11 .
drwxr-xr-x. 3 root root 24 Nov 18 15:23 ..
- Replace the old device with the new device.
Log back in to the correct OpenShift Container Platform node and identify the device name for the new drive. The device name must change unless you are resetting the same device.

# lsblk
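An illustrative, abbreviated output might look like the following, where the empty new disk appears as vdd (other entries omitted):

NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
...
vdd    252:48   0  256G  0 disk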
In this example, the new device name is vdd.

After the new /dev/disk is available, you can add a new disk entry to the LocalVolume CR.

Edit the LocalVolume CR and add the new /dev/disk. In this example, the new device is /dev/vdd.

# oc edit -n openshift-local-storage localvolume localblock

Make sure to save the changes after editing the CR.
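For reference, the relevant portion of the edited CR would then look roughly like this sketch (paths are illustrative):

spec:
  storageClassDevices:
  - devicePaths:
    # - /dev/vdc   # replaced device, left commented out
    - /dev/vdd     # new device
    storageClassName: localblock
    volumeMode: Block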
Verify that there is a new PV in Available state and of the correct size.

$ oc get pv | grep 256Gi

Example output:

local-pv-1e31f771   256Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-localblock-2-data-0-6xhkf   localblock   24h
local-pv-ec7f2b80   256Gi   RWO   Delete   Bound       openshift-storage/ocs-deviceset-localblock-1-data-0-hr2fx   localblock   24h
local-pv-8137c873   256Gi   RWO   Delete   Available                                                                localblock   32m
Create a new OSD for the new device.

Deploy the new OSD. You need to restart the rook-ceph-operator to force operator reconciliation.

Identify the name of the rook-ceph-operator.

$ oc get -n openshift-storage pod -l app=rook-ceph-operator

Example output:

NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-85f6494db4-sg62v   1/1     Running   0          1d20h

Delete the rook-ceph-operator.

$ oc delete -n openshift-storage pod rook-ceph-operator-85f6494db4-sg62v

Example output:

pod "rook-ceph-operator-85f6494db4-sg62v" deleted

In this example, the rook-ceph-operator pod name is rook-ceph-operator-85f6494db4-sg62v.

Verify that the rook-ceph-operator pod is restarted.

$ oc get -n openshift-storage pod -l app=rook-ceph-operator

Example output:

NAME                                  READY   STATUS    RESTARTS   AGE
rook-ceph-operator-85f6494db4-wx9xx   1/1     Running   0          50s

Creation of the new OSD may take several minutes after the operator restarts.
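You can watch for the new OSD pod to appear with, for example:

$ oc get -n openshift-storage pods -l app=rook-ceph-osd -w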
Delete the ocs-osd-removal job(s).

$ oc delete -n openshift-storage job ocs-osd-removal-job

Example output:

job.batch "ocs-osd-removal-job" deleted
When using an external key management system (KMS) with data encryption, the old OSD encryption key can be removed from the Vault server as it is now an orphan key.
Verification steps
Verify that there is a new OSD running.
$ oc get -n openshift-storage pods -l app=rook-ceph-osd

Example output:

rook-ceph-osd-0-76d8fb97f9-mn8qz   1/1   Running   0   23m
rook-ceph-osd-1-7c99657cfb-jdzvz   1/1   Running   1   25h
rook-ceph-osd-2-5f9f6dfb5b-2mnw9   1/1   Running   0   25h

Verify that a new PVC is created.

$ oc get -n openshift-storage pvc | grep localblock

Example output:

ocs-deviceset-localblock-0-data-0-q4q6b   Bound   local-pv-8137c873   256Gi   RWO   localblock   10m
ocs-deviceset-localblock-1-data-0-hr2fx   Bound   local-pv-ec7f2b80   256Gi   RWO   localblock   1d20h
ocs-deviceset-localblock-2-data-0-6xhkf   Bound   local-pv-1e31f771   256Gi   RWO   localblock   1d20h

Optional: If cluster-wide encryption is enabled on the cluster, verify that the new OSD devices are encrypted.

Identify the nodes where the new OSD pods are running.

$ oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/<OSD-pod-name>

<OSD-pod-name> is the name of the OSD pod.

For example:

$ oc get -n openshift-storage -o=custom-columns=NODE:.spec.nodeName pod/rook-ceph-osd-0-544db49d7f-qrgqm

Example output:

NODE
compute-1
For each of the previously identified nodes, do the following:
Create a debug pod and open a chroot environment for the selected host(s).

$ oc debug node/<node name>

<node name> is the name of the node.

$ chroot /host

Check for the crypt keyword beside the ocs-deviceset name(s).

$ lsblk
- Log in to OpenShift Web Console and check the status card in the OpenShift Data Foundation dashboard under the Storage section.
A full data recovery may take longer depending on the volume of data being recovered.
5.3. Replacing operational or failed storage devices on IBM Z or IBM LinuxONE infrastructure
You can replace operational or failed storage devices on IBM Z or IBM® LinuxONE infrastructure with new Small Computer System Interface (SCSI) disks.
IBM Z or IBM® LinuxONE supports SCSI FCP disk logical units (SCSI disks) as persistent storage devices from external disk storage. You can identify a SCSI disk using its FCP device number, two target worldwide port names (WWPN1 and WWPN2), and the logical unit number (LUN). For more information, see https://www.ibm.com/support/knowledgecenter/SSB27U_6.4.0/com.ibm.zvm.v640.hcpa5/scsiover.html.
Prerequisites
Ensure that the data is resilient.
- In the OpenShift Web Console, click Storage → Data Foundation.
- Click the Storage Systems tab, and then click ocs-storagecluster-storagesystem.
- In the Status card of the Block and File dashboard, under the Overview tab, verify that Data Resiliency has a green tick mark.
Procedure
List all the disks.
$ lszdev
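As a rough illustration (only the zfcp-lun entries are shown and other device types are omitted), the listing resembles the following, where the first disk holds the operating system and the second is the disk replaced later in this example:

TYPE       ID                                              ON   PERS  NAMES
zfcp-lun   0.0.8204:0x102107630b1b5060:0x4001402900000000  yes  no    sda sg0
zfcp-lun   0.0.8204:0x500507630a0b50a4:0x4002403000000000  yes  yes   sdb sg1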
A SCSI disk is represented as a zfcp-lun with the structure <device-id>:<wwpn>:<lun-id> in the ID section. The first disk is used for the operating system. If one storage device fails, you can replace it with a new disk.

Remove the disk.
Run the following command on the disk, replacing scsi-id with the SCSI disk identifier of the disk to be replaced:

$ chzdev -d scsi-id

For example, the following command removes one disk with the device ID 0.0.8204, the WWPN 0x500507630a0b50a4, and the LUN 0x4002403000000000:

$ chzdev -d 0.0.8204:0x500507630a0b50a4:0x4002403000000000

Append a new SCSI disk.

$ chzdev -e 0.0.8204:0x500507630b1b50a4:0x4001302a00000000

Note: The device ID for the new disk must be the same as the disk to be replaced. The new disk is identified with its WWPN and LUN ID.
List all the FCP devices to verify the new disk is configured.
$ lszdev zfcp-lun

Example output:

TYPE       ID                                              ON   PERS  NAMES
zfcp-lun   0.0.8204:0x102107630b1b5060:0x4001402900000000  yes  no    sda sg0
zfcp-lun   0.0.8204:0x500507630b1b50a4:0x4001302a00000000  yes  yes   sdb sg1