Chapter 9. Removing failed or unwanted Ceph Object Storage devices
Failed or unwanted Ceph OSDs (Object Storage Devices) affect the performance of the storage infrastructure. Hence, to improve the reliability and resilience of the storage cluster, you must remove the failed or unwanted Ceph OSDs.
If you have any failed or unwanted Ceph OSDs to remove:
- Verify the Ceph health status. For more information, see Verifying Ceph cluster is healthy.
- Based on how the OSDs are provisioned, remove the failed or unwanted Ceph OSDs. See:
  - Removing failed or unwanted Ceph OSDs in dynamically provisioned Red Hat OpenShift Data Foundation
  - Removing failed or unwanted Ceph OSDs provisioned using local storage devices
If you are using local disks, you can reuse these disks after removing the old OSDs.
9.1. Verifying Ceph cluster is healthy
Storage health is visible on the Block and File and Object dashboards.
Procedure
- In the OpenShift Web Console, click Storage → Data Foundation.
- In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop-up that appears.
- In the Status card of the Block and File tab, verify that the Storage Cluster has a green tick.
- In the Details card, verify that the cluster information is displayed.
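You can also verify the Ceph health from the command line. The following is a minimal sketch that assumes the rook-ceph-tools pod is enabled in the openshift-storage namespace; the app=rook-ceph-tools label used to locate it is an assumption based on the standard Rook toolbox and might differ in your deployment:
    TOOLS_POD=$(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
    # HEALTH_OK indicates a healthy cluster; HEALTH_WARN or HEALTH_ERR needs investigation before removing OSDs
    oc rsh -n openshift-storage $TOOLS_POD ceph status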
9.2. Removing failed or unwanted Ceph OSDs in dynamically provisioned Red Hat OpenShift Data Foundation
Follow the steps in the procedure to remove the failed or unwanted Ceph Object Storage Devices (OSDs) in dynamically provisioned Red Hat OpenShift Data Foundation.
Scaling down of clusters is supported only with the help of the Red Hat support team.
- Removing an OSD when the Ceph component is not in a healthy state can result in data loss.
- Removing two or more OSDs at the same time results in data loss.
Prerequisites
- Check that Ceph is healthy. For more information, see Verifying Ceph cluster is healthy.
- Ensure that no alerts are firing and that no rebuilding process is in progress.
Procedure
- Scale down the OSD deployment.
    oc scale deployment rook-ceph-osd-<osd-id> --replicas=0
- Get the osd-prepare pod for the Ceph OSD to be removed.
    oc get deployment rook-ceph-osd-<osd-id> -oyaml | grep ceph.rook.io/pvc
- Delete the osd-prepare pod.
    oc delete -n openshift-storage pod rook-ceph-osd-prepare-<pvc-from-above-command>-<pod-suffix>
- Remove the failed OSD from the cluster.
    failed_osd_id=<osd-id>
    oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=$failed_osd_id | oc create -f -
  where FAILED_OSD_ID is the integer in the pod name immediately after the rook-ceph-osd prefix.
- Verify that the OSD is removed successfully by checking the logs.
    oc logs -n openshift-storage ocs-osd-removal-$<failed_osd_id>-<pod-suffix>
- Optional: If you get the error cephosd:osd.0 is NOT ok to destroy from the ocs-osd-removal-job pod in OpenShift Container Platform, see Troubleshooting the error cephosd:osd.0 is NOT ok to destroy while removing failed or unwanted Ceph OSDs.
- Delete the OSD deployment.
    oc delete deployment rook-ceph-osd-<osd-id>
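The following is a consolidated sketch of the procedure above, assuming the failed OSD ID is 0. The PVC name and pod suffixes are cluster specific and are shown only as placeholders; take them from the output of the preceding commands:
    osd_id=0
    oc scale deployment rook-ceph-osd-${osd_id} --replicas=0
    # Find the PVC label on the OSD deployment, then the matching osd-prepare pod
    oc get deployment rook-ceph-osd-${osd_id} -oyaml | grep ceph.rook.io/pvc
    oc get pod -n openshift-storage | grep rook-ceph-osd-prepare
    oc delete -n openshift-storage pod rook-ceph-osd-prepare-<pvc-from-above-command>-<pod-suffix>
    # Run the removal job, check its logs, and remove the deployment
    oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=${osd_id} | oc create -f -
    oc logs -n openshift-storage ocs-osd-removal-${osd_id}-<pod-suffix>
    oc delete deployment rook-ceph-osd-${osd_id}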
Verification step
To check if the OSD is deleted successfully, run:
oc get pod -n openshift-storage ocs-osd-removal-$<failed_osd_id>-<pod-suffix>
This command must return the status as Completed.
9.3. Removing failed or unwanted Ceph OSDs provisioned using local storage devices
You can remove failed or unwanted Ceph OSDs (Object Storage Devices) that were provisioned using local storage devices by following the steps in this procedure.
Scaling down of clusters is supported only with the help of the Red Hat support team.
- Removing an OSD when the Ceph component is not in a healthy state can result in data loss.
- Removing two or more OSDs at the same time results in data loss.
Prerequisites
- Check that Ceph is healthy. For more information, see Verifying Ceph cluster is healthy.
- Ensure that no alerts are firing and that no rebuilding process is in progress.
Procedure
- Forcibly mark the OSD down by scaling the replicas on the OSD deployment to 0. You can skip this step if the OSD is already down due to failure.
    oc scale deployment rook-ceph-osd-<osd-id> --replicas=0
- Remove the failed OSD from the cluster.
    failed_osd_id=<osd_id>
    oc process -n openshift-storage ocs-osd-removal -p FAILED_OSD_IDS=$failed_osd_id | oc create -f -
  where FAILED_OSD_ID is the integer in the pod name immediately after the rook-ceph-osd prefix.
- Verify that the OSD is removed successfully by checking the logs.
    oc logs -n openshift-storage ocs-osd-removal-$<failed_osd_id>-<pod-suffix>
- Optional: If you get the error cephosd:osd.0 is NOT ok to destroy from the ocs-osd-removal-job pod in OpenShift Container Platform, see Troubleshooting the error cephosd:osd.0 is NOT ok to destroy while removing failed or unwanted Ceph OSDs.
- Delete the persistent volume claim (PVC) resources associated with the failed OSD.
  - Get the PVC associated with the failed OSD.
      oc get -n openshift-storage -o yaml deployment rook-ceph-osd-<osd-id> | grep ceph.rook.io/pvc
  - Get the persistent volume (PV) associated with the PVC.
      oc get -n openshift-storage pvc <pvc-name>
  - Get the failed device name.
      oc get pv <pv-name-from-above-command> -oyaml | grep path
  - Get the prepare-pod associated with the failed OSD.
      oc describe -n openshift-storage pvc ocs-deviceset-0-0-nvs68 | grep Mounted
  - Delete the osd-prepare pod before removing the associated PVC.
      oc delete -n openshift-storage pod <osd-prepare-pod-from-above-command>
  - Delete the PVC associated with the failed OSD.
      oc delete -n openshift-storage pvc <pvc-name-from-step-a>
- Remove the failed device entry from the LocalVolume custom resource (CR).
  - Log in to the node with the failed device.
      oc debug node/<node_with_failed_osd>
  - Record the /dev/disk/by-id/<id> for the failed device name.
      ls -alh /mnt/local-storage/localblock/
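  - Remove the recorded /dev/disk/by-id/<id> entry from the devicePaths list of the LocalVolume CR. The command below is a sketch; the CR name (localblock) and namespace (openshift-local-storage) are assumptions based on a typical Local Storage Operator deployment and might differ in your environment.
      oc edit -n openshift-local-storage localvolume localblock
      # In the editor, delete the entry for the failed device under
      # spec.storageClassDevices[].devicePaths, for example:
      #   devicePaths:
      #   - /dev/disk/by-id/<failed-device-id>    <- remove this line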
- Optional: If the Local Storage Operator is used for provisioning the OSD, log in to the machine with the failed OSD and remove the device symlink.
    oc debug node/<node_with_failed_osd>
  - Get the OSD symlink for the failed device name.
      ls -alh /mnt/local-storage/localblock
  - Remove the symlink.
      rm /mnt/local-storage/localblock/<failed-device-name>
- Delete the PV associated with the OSD.
oc delete pv <pv-name>
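As a worked sketch of the PVC and PV cleanup chain above, assume the failed OSD is osd 0 and the associated PVC is ocs-deviceset-0-0-nvs68. The PV name (local-pv-d9c5cbd6) and the pod suffix are hypothetical; replace every value with the output returned by your cluster:
    oc get -n openshift-storage -o yaml deployment rook-ceph-osd-0 | grep ceph.rook.io/pvc
    oc get -n openshift-storage pvc ocs-deviceset-0-0-nvs68
    oc get pv local-pv-d9c5cbd6 -oyaml | grep path
    oc describe -n openshift-storage pvc ocs-deviceset-0-0-nvs68 | grep Mounted
    oc delete -n openshift-storage pod rook-ceph-osd-prepare-ocs-deviceset-0-0-nvs68-<pod-suffix>
    oc delete -n openshift-storage pvc ocs-deviceset-0-0-nvs68
    oc delete pv local-pv-d9c5cbd6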
Verification step
To check if the OSD is deleted successfully, run:
oc get pod -n openshift-storage ocs-osd-removal-$<failed_osd_id>-<pod-suffix>
This command must return the status as Completed.
9.4. Troubleshooting the error cephosd:osd.0 is NOT ok to destroy while removing failed or unwanted Ceph OSDs
If you get the error cephosd:osd.0 is NOT ok to destroy from the ocs-osd-removal-job pod in OpenShift Container Platform, run the Object Storage Device (OSD) removal job with the FORCE_OSD_REMOVAL option to move the OSD to a destroyed state.
oc process -n openshift-storage ocs-osd-removal -p FORCE_OSD_REMOVAL=true -p FAILED_OSD_IDS=$<failed_osd_id> | oc create -f -
You must use the FORCE_OSD_REMOVAL option only if all the PGs are in active state. If not, the PGs must either complete the backfilling or be investigated further to ensure that they are active.
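Before you use the FORCE_OSD_REMOVAL option, you can check the placement group (PG) states from the Ceph toolbox. The following is a minimal sketch that assumes the rook-ceph-tools pod is enabled; the label used to locate it is an assumption based on the standard Rook toolbox:
    TOOLS_POD=$(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
    # All PGs should report an active state (for example, active+clean) before you force the removal
    oc rsh -n openshift-storage $TOOLS_POD ceph pg stat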