Chapter 6. Troubleshooting alerts and errors in OpenShift Data Foundation
6.1. Resolving alerts and errors
Red Hat OpenShift Data Foundation can detect and automatically resolve a number of common failure scenarios. However, some problems require administrator intervention.
To see the errors that are currently firing, check one of the following locations:
- Observe → Alerting → Firing option
- Home → Overview → Cluster tab
- Storage → Data Foundation → Storage System → storage system link in the pop up → Overview → Block and File tab
- Storage → Data Foundation → Storage System → storage system link in the pop up → Overview → Object tab
Copy the displayed error and search for it in the following table to find its severity and resolution:
Description | Severity | Resolution |
---|---|---|
 | Warning | Fix. Procedure: Inspect the user interface and log, and verify if an update is in progress. |
 | Warning | Fix. Procedure: Inspect the user interface and log, and verify if an update is in progress. |
 | Critical | Fix. Procedure: Remove unnecessary data or expand the cluster. |
 | Warning | Fix. Procedure: Remove unnecessary data or expand the cluster. |
 | Warning | Workaround. |
 | Warning | Fix. Procedure: Finding the error code of an unhealthy namespace store resource. |
 | Warning | Fix. |
Minimum required replicas for storage metadata service not available. Might affect the working of storage cluster. | Warning | Contact Red Hat support. |
 | Critical | Contact Red Hat support. |
 | Critical | Contact Red Hat support. |
 | Critical | Contact Red Hat support. |
 | Warning | Contact Red Hat support. |
 | Warning | Contact Red Hat support. |
 | Critical | Contact Red Hat support. |
 | Critical | Contact Red Hat support. |
 | Warning | Contact Red Hat support. |
 | Warning | Contact Red Hat support. |
 | Critical | Contact Red Hat support. |
 | Critical | Contact Red Hat support. |
 | Critical | Contact Red Hat support. |
Disaster recovery is failing for one or a few applications. | Warning | Contact Red Hat support. |
Disaster recovery is failing for the entire cluster. Mirror daemon is in unhealthy status for more than 1m. Mirroring on this cluster is not working as expected. | Critical | Contact Red Hat support. |
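If you prefer the command line, one way to see the alert names and severities that OpenShift Data Foundation defines is to inspect its PrometheusRule resources. This is a sketch that assumes the default openshift-storage namespace:
$ oc get prometheusrules -n openshift-storage
$ oc get prometheusrules -n openshift-storage -o yaml | grep -E 'alert:|severity:'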
6.2. Resolving cluster health issues
There is a finite set of possible health messages that a Red Hat Ceph Storage cluster can raise that show in the OpenShift Data Foundation user interface. These are defined as health checks which have unique identifiers. The identifier is a terse pseudo-human-readable string that is intended to enable tools to make sense of health checks, and present them in a way that reflects their meaning. Click the health code below for more information and troubleshooting.
Health code | Description |
---|---|
MON_DISK_LOW | One or more Ceph Monitors are low on disk space. |
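You can also list the active health codes and their details from the Ceph toolbox pod, assuming the rook-ceph toolbox is deployed in your cluster (the label selector shown is the default Rook label):
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage $TOOLS_POD ceph health detail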
6.2.1. MON_DISK_LOW
This alert triggers if the available space on the file system that stores the monitor database, expressed as a percentage, drops below mon_data_avail_warn (default: 15%). This may indicate that some other process or user on the system is filling up the same file system used by the monitor. It may also indicate that the monitor's database is large.
The paths to the file system differ depending on the deployment of your mons. You can find the path to where the mon is deployed in storagecluster.yaml.
Example paths:
- Mon deployed over PVC path: /var/lib/ceph/mon
- Mon deployed over hostpath: /var/lib/rook/mon
To clear up space, view the high-usage files in the file system and choose which to delete. To view the files, run:
# du -a <path-in-the-mon-node> | sort -n -r | head -n 10
Replace <path-in-the-mon-node> with the path to the file system where the mons are deployed.
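For example, for a mon deployed over PVC, a minimal sketch of running this check from outside the pod looks like the following. The label selector and path are the Rook defaults; adjust them to your deployment:
# Pick one mon pod (illustrative; check every mon that reports low disk space)
$ MON_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-mon -o name | head -n 1)
# Run du inside the mon pod against the default PVC-backed path
$ oc rsh -n openshift-storage $MON_POD sh -c 'du -a /var/lib/ceph/mon | sort -n -r | head -n 10'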
6.3. Resolving cluster alerts
There is a finite set of possible health alerts that a Red Hat Ceph Storage cluster can raise that show in the OpenShift Data Foundation user interface. These are defined as health alerts which have unique identifiers. The identifier is a terse pseudo-human-readable string that is intended to enable tools to make sense of health checks, and present them in a way that reflects their meaning. Click the health alert for more information and troubleshooting.
Health alert | Overview |
---|---|
CephClusterCriticallyFull | Storage cluster utilization has crossed 80%. |
CephClusterErrorState | Storage cluster is in an error state for more than 10 minutes. |
CephClusterNearFull | Storage cluster is nearing full capacity. Data deletion or cluster expansion is required. |
CephClusterReadOnly | Storage cluster is read-only now and needs immediate data deletion or cluster expansion. |
CephClusterWarningState | Storage cluster is in a warning state for more than 10 minutes. |
CephDataRecoveryTakingTooLong | Data recovery has been active for too long. |
CephMdsMissingReplicas | Minimum required replicas for storage metadata service not available. Might affect the working of the storage cluster. |
CephMgrIsAbsent | Ceph Manager has disappeared from Prometheus target discovery. |
CephMgrIsMissingReplicas | Ceph manager is missing replicas. This impacts health status reporting and will cause some of the reported information to be missing or stale. |
CephMonHighNumberOfLeaderChanges | The Ceph monitor leader is being changed an unusual number of times. |
CephMonQuorumAtRisk | Storage cluster quorum is low. |
CephMonQuorumLost | The number of monitor pods in the storage cluster are not enough. |
CephMonVersionMismatch | There are different versions of Ceph Mon components running. |
CephNodeDown | A storage node went down. Check the node immediately. The alert should contain the node name. |
CephOSDCriticallyFull | Utilization of back-end Object Storage Device (OSD) has crossed 80%. Free up some space immediately, expand the storage cluster, or contact support. |
CephOSDDiskNotResponding | A disk device is not responding on one of the hosts. |
CephOSDDiskUnavailable | A disk device is not accessible on one of the hosts. |
CephOSDFlapping | Ceph storage OSD flapping. |
CephOSDNearFull | One of the OSD storage devices is nearing full. |
CephOSDSlowOps | OSD requests are taking too long to process. |
CephOSDVersionMismatch | There are different versions of Ceph OSD components running. |
CephPGRepairTakingTooLong | Self-healing operations are taking too long. |
CephPoolQuotaBytesCriticallyExhausted | Storage pool quota usage has crossed 90%. |
CephPoolQuotaBytesNearExhaustion | Storage pool quota usage has crossed 70%. |
PersistentVolumeUsageCritical | Persistent Volume Claim usage has exceeded more than 85% of its capacity. |
PersistentVolumeUsageNearFull | Persistent Volume Claim usage has exceeded more than 75% of its capacity. |
6.3.1. CephClusterCriticallyFull
Meaning | Storage cluster utilization has crossed 80% and the cluster will become read-only once utilization crosses 85%. Free up some space or expand the storage cluster immediately. It is common to see alerts related to Object Storage Devices (OSDs) being full or near full prior to this alert. |
Impact | High |
Diagnosis
- Scaling storage
- Depending on the type of cluster, you need to add storage devices, nodes, or both. For more information, see the Scaling storage guide.
Mitigation
- Deleting information
- If it is not possible to scale up the cluster, you need to delete information in order to free up some space.
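To confirm the utilization that triggered the alert, you can check the raw and per-pool usage from the toolbox pod, assuming the rook-ceph toolbox is deployed:
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage $TOOLS_POD ceph df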
6.3.2. CephClusterErrorState
Meaning | This alert reflects that the storage cluster is in ERROR state for an unacceptable amount of time and this impacts the storage availability. Check for other alerts that would have triggered prior to this one and troubleshoot those alerts first. |
Impact | Critical |
Diagnosis
- pod status: pending
Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:
$ oc project openshift-storage
$ oc get pod | grep rook-ceph
Set MYPOD as the variable for the pod that is identified as the problem pod:
# Examine the output for a rook-ceph pod that is in a pending, not running, or not ready state
MYPOD=<pod_name>
<pod_name>
- Specify the name of the pod that is identified as the problem pod.
Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:
$ oc get pod/${MYPOD} -o wide
- pod status: NOT pending, running, but NOT ready
Check the readiness probe:
$ oc describe pod/${MYPOD}
- pod status: NOT pending, but NOT running
Check for application or image issues:
$ oc logs pod/${MYPOD}
Important:
- If a node was assigned, check the kubelet on the node.
- If the basic health of the running pods, node affinity and resource availability on the nodes are verified, run the Ceph tools to get the status of the storage components.
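As an additional check, the events recorded for the problem pod often state why it cannot be scheduled or started. A minimal sketch, reusing the MYPOD variable set above:
# List events that reference the problem pod, most recent last
$ oc get events -n openshift-storage --field-selector involvedObject.name=${MYPOD} --sort-by=.lastTimestamp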
Mitigation
- Debugging log information
This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
6.3.3. CephClusterNearFull
Meaning | Storage cluster utilization has crossed 75% and will become read-only at 85%. Free up some space or expand the storage cluster. |
Impact | Critical |
Diagnosis
- Scaling storage
- Depending on the type of cluster, you need to add storage devices, nodes, or both. For more information, see the Scaling storage guide.
Mitigation
- Deleting information
- If it is not possible to scale up the cluster, you need to delete information in order to free up some space.
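For example, to free space by removing data that is no longer needed, you can delete an unused PVC. The claim name and namespace below are placeholders; verify that nothing still uses the claim, and note that the space is reclaimed only after the backing Persistent Volume is deleted:
$ oc get pvc -A
$ oc delete pvc <pvc_name> -n <application_namespace>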
6.3.4. CephClusterReadOnly
Meaning | Storage cluster utilization has crossed 85% and the cluster is now read-only. Free up some space or expand the storage cluster immediately. |
Impact | Critical |
Diagnosis
- Scaling storage
- Depending on the type of cluster, you need to add storage devices, nodes, or both. For more information, see the Scaling storage guide.
Mitigation
- Deleting information
- If it is not possible to scale up the cluster, you need to delete information in order to free up some space.
6.3.5. CephClusterWarningState
Meaning | This alert reflects that the storage cluster has been in a warning state for an unacceptable amount of time. While the storage operations will continue to function in this state, it is recommended to fix the errors so that the cluster does not get into an error state, which impacts operations. Check for other alerts that might have triggered prior to this one and troubleshoot those alerts first. |
Impact | High |
Diagnosis
- pod status: pending
Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:
$ oc project openshift-storage
$ oc get pod | grep rook-ceph
Set MYPOD as the variable for the pod that is identified as the problem pod:
# Examine the output for a rook-ceph pod that is in a pending, not running, or not ready state
MYPOD=<pod_name>
<pod_name>
- Specify the name of the pod that is identified as the problem pod.
Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:
$ oc get pod/${MYPOD} -o wide
- pod status: NOT pending, running, but NOT ready
Check the readiness probe:
$ oc describe pod/${MYPOD}
- pod status: NOT pending, but NOT running
Check for application or image issues:
$ oc logs pod/${MYPOD}
Important: If a node was assigned, check the kubelet on the node.
Mitigation
- Debugging log information
- This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
6.3.6. CephDataRecoveryTakingTooLong
Meaning | Data recovery is slow. Check whether all the Object Storage Devices (OSDs) are up and running. |
Impact | High |
Diagnosis
- pod status: pending
Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:
$ oc project openshift-storage
$ oc get pod | grep rook-ceph-osd
Set MYPOD as the variable for the pod that is identified as the problem pod:
# Examine the output for a rook-ceph-osd pod that is in a pending, not running, or not ready state
MYPOD=<pod_name>
<pod_name>
- Specify the name of the pod that is identified as the problem pod.
Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:
$ oc get pod/${MYPOD} -o wide
- pod status: NOT pending, running, but NOT ready
Check the readiness probe:
$ oc describe pod/${MYPOD}
- pod status: NOT pending, but NOT running
Check for application or image issues:
$ oc logs pod/${MYPOD}
Important: If a node was assigned, check the kubelet on the node.
Mitigation
- Debugging log information
- This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
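To verify that all OSDs are up and to watch the recovery progress, you can query Ceph from the toolbox pod, assuming the rook-ceph toolbox is deployed:
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
# Check that all OSDs are up and in
$ oc rsh -n openshift-storage $TOOLS_POD ceph osd stat
# Watch for "recovery" under the io section of the status output
$ oc rsh -n openshift-storage $TOOLS_POD ceph status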
6.3.7. CephMdsMissingReplicas
Meaning | Minimum required replicas for the storage metadata service (MDS) are not available. MDS is responsible for file metadata. Degradation of the MDS service can affect how the storage cluster works (related to the CephFS storage class) and should be fixed as soon as possible. |
Impact | High |
Diagnosis
- pod status: pending
Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:
$ oc project openshift-storage
$ oc get pod | grep rook-ceph-mds
Set MYPOD as the variable for the pod that is identified as the problem pod:
# Examine the output for a rook-ceph-mds pod that is in a pending, not running, or not ready state
MYPOD=<pod_name>
<pod_name>
- Specify the name of the pod that is identified as the problem pod.
Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:
$ oc get pod/${MYPOD} -o wide
- pod status: NOT pending, running, but NOT ready
Check the readiness probe:
$ oc describe pod/${MYPOD}
- pod status: NOT pending, but NOT running
Check for application or image issues:
$ oc logs pod/${MYPOD}
Important: If a node was assigned, check the kubelet on the node.
Mitigation
- Debugging log information
- This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
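To check the state of the metadata service directly, you can list the MDS pods and query the CephFS status from the toolbox pod. The label selector is the default Rook label; this is a sketch, not a required step:
$ oc get pods -n openshift-storage -l app=rook-ceph-mds
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage $TOOLS_POD ceph fs status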
6.3.8. CephMgrIsAbsent
Meaning | Not having a Ceph manager running impacts the monitoring of the cluster and Persistent Volume Claim (PVC) creation and deletion requests, and should be resolved as soon as possible. |
Impact | High |
Diagnosis
Verify that the rook-ceph-mgr pod is failing, and restart it if necessary. If the Ceph mgr pod restart fails, follow the general pod troubleshooting to resolve the issue.
Verify that the Ceph mgr pod is failing:
$ oc get pods | grep mgr
Describe the Ceph mgr pod for more details:
$ oc describe pods/<pod_name>
<pod_name>
- Specify the rook-ceph-mgr pod name from the previous step.
Analyze the errors related to resource issues.
Delete the pod, and wait for the pod to restart:
$ oc delete pod <pod_name>
$ oc get pods | grep mgr
Follow these steps for general pod troubleshooting:
- pod status: pending
Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:
$ oc project openshift-storage
$ oc get pod | grep rook-ceph-mgr
Set MYPOD as the variable for the pod that is identified as the problem pod:
# Examine the output for a rook-ceph-mgr pod that is in a pending, not running, or not ready state
MYPOD=<pod_name>
<pod_name>
- Specify the name of the pod that is identified as the problem pod.
Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:
$ oc get pod/${MYPOD} -o wide
- pod status: NOT pending, running, but NOT ready
Check the readiness probe:
$ oc describe pod/${MYPOD}
- pod status: NOT pending, but NOT running
Check for application or image issues:
$ oc logs pod/${MYPOD}
Important: If a node was assigned, check the kubelet on the node.
Mitigation
- Debugging log information
- This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
6.3.9. CephMgrIsMissingReplicas
Meaning | To resolve this alert, you need to determine the cause of the disappearance of the Ceph manager and restart if necessary. |
Impact | High |
Diagnosis
- pod status: pending
Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:
$ oc project openshift-storage
$ oc get pod | grep rook-ceph-mgr
Set MYPOD as the variable for the pod that is identified as the problem pod:
# Examine the output for a rook-ceph-mgr pod that is in a pending, not running, or not ready state
MYPOD=<pod_name>
<pod_name>
- Specify the name of the pod that is identified as the problem pod.
Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:
$ oc get pod/${MYPOD} -o wide
- pod status: NOT pending, running, but NOT ready
Check the readiness probe:
$ oc describe pod/${MYPOD}
- pod status: NOT pending, but NOT running
Check for application or image issues:
$ oc logs pod/${MYPOD}
Important: If a node was assigned, check the kubelet on the node.
Mitigation
- Debugging log information
- This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
6.3.10. CephMonHighNumberOfLeaderChanges
Meaning | In a Ceph cluster there is a redundant set of monitor pods that store critical information about the storage cluster. Monitor pods synchronize periodically to obtain information about the storage cluster. The first monitor pod to get the most updated information becomes the leader, and the other monitor pods will start their synchronization process after asking the leader. A problem in network connection or another kind of problem in one or more monitor pods produces an unusual change of the leader. This situation can negatively affect the storage cluster performance. |
Impact | Medium |
Check for any network issues. If there is a network issue, you need to escalate to the OpenShift Data Foundation team before you proceed with any of the following troubleshooting steps.
Diagnosis
Print the logs of the affected monitor pod to gather more information about the issue:
$ oc logs <rook-ceph-mon-X-yyyy> -n openshift-storage
<rook-ceph-mon-X-yyyy>
- Specify the name of the affected monitor pod.
- Alternatively, use the OpenShift Web Console to open the logs of the affected monitor pod. More information about possible causes is reflected in the log.
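To see the current quorum and which monitor is the leader, you can also query Ceph from the toolbox pod, assuming the rook-ceph toolbox is deployed:
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage $TOOLS_POD ceph mon stat
$ oc rsh -n openshift-storage $TOOLS_POD ceph quorum_status -f json-pretty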
Perform the general pod troubleshooting steps:
- pod status: pending
Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:
$ oc project openshift-storage
$ oc get pod | grep rook-ceph-mon
Set MYPOD as the variable for the pod that is identified as the problem pod:
# Examine the output for a rook-ceph-mon pod that is in a pending, not running, or not ready state
MYPOD=<pod_name>
<pod_name>
- Specify the name of the pod that is identified as the problem pod.
Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:
$ oc get pod/${MYPOD} -o wide
- pod status: NOT pending, running, but NOT ready
Check the readiness probe:
$ oc describe pod/${MYPOD}
- pod status: NOT pending, but NOT running
Check for application or image issues:
$ oc logs pod/${MYPOD}
Important: If a node was assigned, check the kubelet on the node.
Mitigation
- Debugging log information
- This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
6.3.11. CephMonQuorumAtRisk
Meaning | Multiple MONs work together to provide redundancy. Each of the MONs keeps a copy of the metadata. The cluster is deployed with 3 MONs, and requires 2 or more MONs to be up and running for quorum and for the storage operations to run. If quorum is lost, access to data is at risk. |
Impact | High |
Diagnosis
Restore the Ceph MON Quorum. For more information, see Restoring ceph-monitor quorum in OpenShift Data Foundation in the Troubleshooting guide. If the restoration of the Ceph MON Quorum fails, follow the general pod troubleshooting to resolve the issue.
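Before attempting the restore, it can help to confirm how many monitor pods are currently running and on which nodes (the label selector is the default Rook label):
$ oc get pods -n openshift-storage -l app=rook-ceph-mon -o wide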
Perform the following for general pod troubleshooting:
- pod status: pending
Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:
$ oc project openshift-storage
$ oc get pod | grep rook-ceph-mon
Set MYPOD as the variable for the pod that is identified as the problem pod:
# Examine the output for a rook-ceph-mon pod that is in a pending, not running, or not ready state
MYPOD=<pod_name>
<pod_name>
- Specify the name of the pod that is identified as the problem pod.
Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:
$ oc get pod/${MYPOD} -o wide
- pod status: NOT pending, running, but NOT ready
Check the readiness probe:
$ oc describe pod/${MYPOD}
- pod status: NOT pending, but NOT running
Check for application or image issues:
$ oc logs pod/${MYPOD}
Important: If a node was assigned, check the kubelet on the node.
Mitigation
- Debugging log information
- This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
6.3.12. CephMonQuorumLost
Meaning | In a Ceph cluster there is a redundant set of monitor pods that store critical information about the storage cluster. Monitor pods synchronize periodically to obtain information about the storage cluster. The first monitor pod to get the most updated information becomes the leader, and the other monitor pods will start their synchronization process after asking the leader. A problem in network connection or another kind of problem in one or more monitor pods produces an unusual change of the leader. This situation can negatively affect the storage cluster performance. |
Impact | High |
Check for any network issues. If there is a network issue, you need to escalate to the OpenShift Data Foundation team before you proceed with any of the following troubleshooting steps.
Diagnosis
Restore the Ceph MON Quorum. For more information, see Restoring ceph-monitor quorum in OpenShift Data Foundation in the Troubleshooting guide. If the restoration of the Ceph MON Quorum fails, follow the general pod troubleshooting to resolve the issue.
Alternatively, perform general pod troubleshooting:
- pod status: pending
Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:
$ oc project openshift-storage
$ oc get pod | grep rook-ceph-mon
Set MYPOD as the variable for the pod that is identified as the problem pod:
# Examine the output for a rook-ceph-mon pod that is in a pending, not running, or not ready state
MYPOD=<pod_name>
<pod_name>
- Specify the name of the pod that is identified as the problem pod.
Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:
$ oc get pod/${MYPOD} -o wide
- pod status: NOT pending, running, but NOT ready
Check the readiness probe:
$ oc describe pod/${MYPOD}
- pod status: NOT pending, but NOT running
Check for application or image issues:
$ oc logs pod/${MYPOD}
Important: If a node was assigned, check the kubelet on the node.
Mitigation
- Debugging log information
- This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
6.3.13. CephMonVersionMismatch
Meaning | Typically this alert triggers during an upgrade that is taking a long time. |
Impact | Medium |
Diagnosis
Check the ocs-operator subscription status and the operator pod health to determine whether an operator upgrade is in progress.
Check the ocs-operator subscription health:
$ oc get sub $(oc get pods -n openshift-storage | grep -v ocs-operator) -n openshift-storage -o json | jq .status.conditions
The status condition types are CatalogSourcesUnhealthy, InstallPlanMissing, InstallPlanPending, and InstallPlanFailed. The status for each type should be False.
Example output:
[
  {
    "lastTransitionTime": "2021-01-26T19:21:37Z",
    "message": "all available catalogsources are healthy",
    "reason": "AllCatalogSourcesHealthy",
    "status": "False",
    "type": "CatalogSourcesUnhealthy"
  }
]
The example output shows a False status for type CatalogSourcesUnhealthy, which means that the catalog sources are healthy.
Check the OCS operator pod status to see if an OCS operator upgrade is in progress:
$ oc get pod -n openshift-storage | grep ocs-operator
$ OCSOP=$(oc get pod -n openshift-storage -o custom-columns=POD:.metadata.name --no-headers | grep ocs-operator)
$ echo $OCSOP
$ oc get pod/${OCSOP} -n openshift-storage
$ oc describe pod/${OCSOP} -n openshift-storage
If you determine that an ocs-operator upgrade is in progress, wait for 5 minutes; this alert should resolve itself. If you have waited or see a different error status condition, continue troubleshooting.
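If no upgrade is in progress, you can confirm which Ceph versions the daemons are actually running from the toolbox pod, assuming the rook-ceph toolbox is deployed; the mon section of the output shows the monitor versions:
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage $TOOLS_POD ceph versions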
Mitigation
- Debugging log information
This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
6.3.14. CephNodeDown
Meaning | A node running Ceph pods is down. While storage operations will continue to function as Ceph is designed to deal with a node failure, it is recommended to resolve the issue to minimize the risk of another node going down and affecting storage functions. |
Impact | Medium |
Diagnosis
List all the pods that are running and failing:
$ oc -n openshift-storage get pods
Important: Ensure that you meet the OpenShift Data Foundation resource requirements so that the Object Storage Device (OSD) pods are scheduled on the new node. This may take a few minutes as the Ceph cluster recovers data for the failing but now recovering OSD. To watch this recovery in action, ensure that the OSD pods are correctly placed on the new worker node.
Check if the OSD pods that were previously failing are now running:
$ oc -n openshift-storage get pods
If the previously failing OSD pods have not been scheduled, use the describe command and check the events for reasons the pods were not rescheduled.
Find the one or more failing OSD pods:
$ oc -n openshift-storage get pods | grep osd
Describe the events for the failing OSD pod:
$ oc -n openshift-storage describe pods/<osd_podname_from_the_previous_step>
In the events section, look for the failure reasons, such as the resources not being met.
In addition, you may use the rook-ceph-toolbox to watch the recovery. This step is optional, but is helpful for large Ceph clusters. To access the toolbox, run the following commands:
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage $TOOLS_POD
From the rsh command prompt, run the following, and watch for "recovery" under the io section:
$ ceph status
Determine if there are failed nodes.
Get the list of worker nodes, and check for the node status:
$ oc get nodes --selector='node-role.kubernetes.io/worker','!node-role.kubernetes.io/infra'
Describe the node which is in the NotReady status to get more information about the failure:
$ oc describe node <node_name>
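To see which OSDs are affected by the failed node, you can also check the OSD tree from the toolbox pod, reusing the TOOLS_POD variable set above; OSDs hosted on the down node are shown as down:
$ oc rsh -n openshift-storage $TOOLS_POD ceph osd tree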
Mitigation
- Debugging log information
This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
6.3.15. CephOSDCriticallyFull
Meaning | One of the Object Storage Devices (OSDs) is critically full. Expand the cluster immediately. |
Impact | High |
Diagnosis
- Deleting data to free up storage space
- You can delete data, and the cluster will resolve the alert through self healing processes.
This is only applicable to OpenShift Data Foundation clusters that are near or full but not in read-only mode. Read-only mode prevents any changes that include deleting data, that is, deletion of Persistent Volume Claim (PVC), Persistent Volume (PV) or both.
- Expanding the storage capacity
- Current storage size is less than 1 TB
You must first assess the ability to expand. For every 1 TB of storage added, the cluster needs to have 3 nodes each with a minimum available 2 vCPUs and 8 GiB memory.
You can increase the storage capacity to 4 TB via the add-on and the cluster will resolve the alert through self healing processes. If the minimum vCPU and memory resource requirements are not met, you need to add 3 additional worker nodes to the cluster.
Mitigation
- If your current storage size is equal to 4 TB, contact Red Hat support.
Optional: Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
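To identify which OSD is critically full, you can review the per-OSD utilization from the toolbox pod, assuming the rook-ceph toolbox is deployed:
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage $TOOLS_POD ceph osd df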
6.3.16. CephOSDDiskNotResponding
Meaning | A disk device is not responding. Check whether all the Object Storage Devices (OSDs) are up and running. |
Impact | Medium |
Diagnosis
- pod status: pending
Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:
$ oc project openshift-storage
$ oc get pod | grep rook-ceph
Set MYPOD as the variable for the pod that is identified as the problem pod:
# Examine the output for a rook-ceph pod that is in a pending, not running, or not ready state
MYPOD=<pod_name>
<pod_name>
- Specify the name of the pod that is identified as the problem pod.
Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:
$ oc get pod/${MYPOD} -o wide
- pod status: NOT pending, running, but NOT ready
Check the readiness probe:
$ oc describe pod/${MYPOD}
- pod status: NOT pending, but NOT running
Check for application or image issues:
$ oc logs pod/${MYPOD}
Important:
- If a node was assigned, check the kubelet on the node.
- If the basic health of the running pods, node affinity and resource availability on the nodes are verified, run the Ceph tools to get the status of the storage components.
Mitigation
- Debugging log information
This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
6.3.18. CephOSDFlapping
Meaning | A storage daemon has restarted 5 times in the last 5 minutes. Check the pod events or Ceph status to find out the cause. |
Impact | High |
Diagnosis
Follow the steps in the Flapping OSDs section of the Red Hat Ceph Storage Troubleshooting Guide.
Alternatively, follow the steps for general pod troubleshooting:
- pod status: pending
Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:
$ oc project openshift-storage
$ oc get pod | grep rook-ceph
Set MYPOD as the variable for the pod that is identified as the problem pod:
# Examine the output for a rook-ceph pod that is in a pending, not running, or not ready state
MYPOD=<pod_name>
<pod_name>
- Specify the name of the pod that is identified as the problem pod.
Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:
$ oc get pod/${MYPOD} -o wide
- pod status: NOT pending, running, but NOT ready
Check the readiness probe:
$ oc describe pod/${MYPOD}
- pod status: NOT pending, but NOT running
Check for application or image issues:
$ oc logs pod/${MYPOD}
Important:
- If a node was assigned, check the kubelet on the node.
- If the basic health of the running pods, node affinity and resource availability on the nodes are verified, run the Ceph tools to get the status of the storage components.
Mitigation
- Debugging log information
This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.13
6.3.19. CephOSDNearFull
Meaning | Utilization of back-end storage device Object Storage Device (OSD) has crossed 75% on a host. |
Impact | High |
Mitigation
Free up some space in the cluster, expand the storage cluster, or contact Red Hat support. For more information on scaling storage, see the Scaling storage guide.
6.3.20. CephOSDSlowOps
Meaning |
An Object Storage Device (OSD) with slow requests is any OSD that is not able to service the I/O operations per second (IOPS) in the queue within the time defined by the |
Impact | Medium |
Diagnosis
More information about the slow requests can be obtained by using the OpenShift console.
Access the OSD pod terminal, and run the following commands:
$ ceph daemon osd.<id> ops
$ ceph daemon osd.<id> dump_historic_ops
Note: The number of the OSD is seen in the pod name. For example, in rook-ceph-osd-0-5d86d4d8d4-zlqkx, <0> is the OSD.
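For example, a minimal sketch of opening the OSD pod terminal before running the commands above (the pod name shown is illustrative):
$ oc get pods -n openshift-storage | grep rook-ceph-osd
$ oc rsh -n openshift-storage <rook-ceph-osd_pod_name>
# Then, inside the pod, query the in-flight operations for that OSD id
$ ceph daemon osd.<id> ops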
Mitigation
The main causes of the OSDs having slow requests are:
- Problems with the underlying hardware or infrastructure, such as disk drives, hosts, racks, or network switches. Use the OpenShift monitoring console to find the alerts or errors about cluster resources. This can give you an idea about the root cause of the slow operations in the OSD.
- Problems with the network. These problems are usually connected with flapping OSDs. See the Flapping OSDs section of the Red Hat Ceph Storage Troubleshooting Guide.
- If it is a network issue, escalate to the OpenShift Data Foundation team.
- System load. Use the OpenShift console to review the metrics of the OSD pod and the node which is running the OSD. Adding or assigning more resources can be a possible solution.
6.3.21. CephOSDVersionMismatch
Meaning | Typically this alert triggers during an upgrade that is taking a long time. |
Impact | Medium |
Diagnosis
Check the ocs-operator subscription status and the operator pod health to determine whether an operator upgrade is in progress.
Check the ocs-operator subscription health:
$ oc get sub $(oc get pods -n openshift-storage | grep -v ocs-operator) -n openshift-storage -o json | jq .status.conditions
The status condition types are CatalogSourcesUnhealthy, InstallPlanMissing, InstallPlanPending, and InstallPlanFailed. The status for each type should be False.
Example output:
[
  {
    "lastTransitionTime": "2021-01-26T19:21:37Z",
    "message": "all available catalogsources are healthy",
    "reason": "AllCatalogSourcesHealthy",
    "status": "False",
    "type": "CatalogSourcesUnhealthy"
  }
]
The example output shows a False status for type CatalogSourcesUnhealthy, which means that the catalog sources are healthy.
Check the OCS operator pod status to see if an OCS operator upgrade is in progress:
$ oc get pod -n openshift-storage | grep ocs-operator
$ OCSOP=$(oc get pod -n openshift-storage -o custom-columns=POD:.metadata.name --no-headers | grep ocs-operator)
$ echo $OCSOP
$ oc get pod/${OCSOP} -n openshift-storage
$ oc describe pod/${OCSOP} -n openshift-storage
If you determine that an ocs-operator upgrade is in progress, wait for 5 minutes; this alert should resolve itself. If you have waited or see a different error status condition, continue troubleshooting.
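If no upgrade is in progress, you can confirm which Ceph versions the OSD daemons are running from the toolbox pod, assuming the rook-ceph toolbox is deployed:
$ TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
$ oc rsh -n openshift-storage $TOOLS_POD ceph osd versions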
6.3.22. CephPGRepairTakingTooLong
Meaning | Self-healing operations are taking too long. |
Impact | High |
Diagnosis
Check for inconsistent Placement Groups (PGs), and repair them. For more information, see the Red Hat Knowledgebase solution Handle Inconsistent Placement Groups in Ceph.
6.3.23. CephPoolQuotaBytesCriticallyExhausted
Meaning |
One or more pools has reached, or is very close to reaching, its quota. The threshold to trigger this error condition is controlled by the |
Impact | High |
Mitigation
Adjust the pool quotas. Run the following commands to fully remove or adjust the pool quotas up or down:
ceph osd pool set-quota <pool> max_bytes <bytes>
ceph osd pool set-quota <pool> max_objects <objects>
Setting the quota value to 0 disables the quota.
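For example, to review the current quota and then raise the byte quota, run the following from the toolbox pod. The pool name and size are placeholders; pick values appropriate for your environment:
ceph osd pool get-quota <pool>
# Example: raise the byte quota to 100 GiB (107374182400 bytes)
ceph osd pool set-quota <pool> max_bytes 107374182400
# Setting a quota to 0 disables it
ceph osd pool set-quota <pool> max_bytes 0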
6.3.24. CephPoolQuotaBytesNearExhaustion
Meaning |
One or more pools is approaching a configured fullness threshold. One threshold that can trigger this warning condition is the |
Impact | High |
Mitigation
Adjust the pool quotas. Run the following commands to fully remove or adjust the pool quotas up or down:
ceph osd pool set-quota <pool> max_bytes <bytes>
ceph osd pool set-quota <pool> max_objects <objects>
Setting the quota value to 0 disables the quota.
6.3.25. PersistentVolumeUsageCritical
Meaning | A Persistent Volume Claim (PVC) is nearing its full capacity and may lead to data loss if not attended to timely. |
Impact | High |
Mitigation
Expand the PVC size to increase the capacity.
- Log in to the OpenShift Web Console.
- Click Storage → PersistentVolumeClaim.
- Select openshift-storage from the Project drop-down list.
- On the PVC you want to expand, click Action menu (⋮) → Expand PVC.
- Update the Total size to the desired size.
- Click Expand.
Alternatively, you can delete unnecessary data that may be taking up space.
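If the console is not available, and assuming the storage class of the PVC allows volume expansion, you can patch the claim from the CLI instead. The PVC name, namespace, and size below are placeholders:
$ oc patch pvc <pvc_name> -n <namespace> -p '{"spec":{"resources":{"requests":{"storage":"<new_size>"}}}}'
# Watch the claim until the new capacity is reflected
$ oc get pvc <pvc_name> -n <namespace> -w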
6.3.26. PersistentVolumeUsageNearFull
Meaning | A Persistent Volume Claim (PVC) is nearing its full capacity and may lead to data loss if not attended to timely. |
Impact | High |
Mitigation
Expand the PVC size to increase the capacity.
- Log in to the OpenShift Web Console.
- Click Storage → PersistentVolumeClaim.
- Select openshift-storage from the Project drop-down list.
- On the PVC you want to expand, click Action menu (⋮) → Expand PVC.
- Update the Total size to the desired size.
- Click Expand.
Alternatively, you can delete unnecessary data that may be taking up space.
6.4. Finding the error code of an unhealthy bucket
Procedure
- In the OpenShift Web Console, click Storage → Object Storage.
- Click the Object Bucket Claims tab.
- Look for the object bucket claims (OBCs) that are not in the Bound state and click on one of them.
- Click the Events tab and do one of the following:
  - Look for events that might hint at the current state of the bucket.
  - Click the YAML tab and look for related errors around the status and mode sections of the YAML.
- If the OBC is in the Pending state, the error might appear in the product logs. However, in this case, it is recommended to verify that all the variables provided are accurate.
6.5. Finding the error code of an unhealthy namespace store resource
Procedure
- In the OpenShift Web Console, click Storage → Object Storage.
- Click the Namespace Store tab.
- Look for the namespace store resources that are not in the Bound state and click on one of them.
- Click the Events tab and do one of the following:
  - Look for events that might hint at the current state of the resource.
  - Click the YAML tab and look for related errors around the status and mode sections of the YAML.
6.6. Recovering pods
When a node (say NODE1) goes to the NotReady state because of some issue, the hosted pods that use a PVC with ReadWriteOnce (RWO) access mode try to move to a second node (say NODE2) but get stuck due to a multi-attach error. In such a case, you can recover the MON, OSD, and application pods by using the following steps.
Procedure
- Power off NODE1 (from the AWS or vSphere side) and ensure that NODE1 is completely down.
- Force delete the pods on NODE1 by using the following command:
$ oc delete pod <pod-name> --grace-period=0 --force
6.7. Recovering from EBS volume detach
When an OSD or MON elastic block storage (EBS) volume where the OSD disk resides is detached from the worker Amazon EC2 instance, the volume gets reattached automatically within one or two minutes. However, the OSD pod gets into a CrashLoopBackOff state. To recover and bring the pod back to the Running state, you must restart the EC2 instance.
6.8. Enabling and disabling debug logs for rook-ceph-operator
Enable the debug logs for the rook-ceph-operator to obtain information about failures that help in troubleshooting issues.
Procedure
- Enabling the debug logs
Edit the configmap of the rook-ceph-operator:
$ oc edit configmap rook-ceph-operator-config
Add the ROOK_LOG_LEVEL: DEBUG parameter in the rook-ceph-operator-config yaml file to enable the debug logs for rook-ceph-operator:
…
data:
  # The logging level for the operator: INFO | DEBUG
  ROOK_LOG_LEVEL: DEBUG
Now, the rook-ceph-operator logs contain the debug information.
- Disabling the debug logs
Edit the configmap of the rook-ceph-operator:
$ oc edit configmap rook-ceph-operator-config
Add the ROOK_LOG_LEVEL: INFO parameter in the rook-ceph-operator-config yaml file to disable the debug logs for rook-ceph-operator:
…
data:
  # The logging level for the operator: INFO | DEBUG
  ROOK_LOG_LEVEL: INFO
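After changing the log level, you can follow the operator logs to confirm that the expected level of detail appears (the deployment name is the Rook default):
$ oc logs -f deployment/rook-ceph-operator -n openshift-storage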
6.9. Troubleshooting unhealthy blocklisted nodes
6.9.1. ODFRBDClientBlocked
Meaning |
This alert indicates that an RBD client might be blocked by Ceph on a specific node within your Kubernetes cluster. The blocklisting occurs when the |
Impact | High |
Diagnosis
The blocklisting of an RBD client can occur due to several factors, such as network or cluster slowness. In certain cases, the exclusive lock contention among three contending clients (workload, mirror daemon, and manager/scheduler) can lead to the blocklist.
Mitigation
- Taint the blocklisted node: In Kubernetes, consider tainting the node that is blocklisted to trigger the eviction of pods to another node. This approach relies on the assumption that the unmounting/unmapping process progresses gracefully. Once the pods have been successfully evicted, the blocklisted node can be untainted, allowing the blocklist to be cleared. The pods can then be moved back to the untainted node.
- Reboot the blocklisted node: If tainting the node and evicting the pods do not resolve the blocklisting issue, a reboot of the blocklisted node can be attempted. This step may help alleviate any underlying issues causing the blocklist and restore normal functionality.
Investigating and resolving the blocklist issue promptly is essential to avoid any further impact on the storage cluster.
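A minimal sketch of the taint-based mitigation follows. The taint key and effect are illustrative choices, not values required by OpenShift Data Foundation; verify that the pods unmount and unmap cleanly before removing the taint:
# Taint the blocklisted node so that its pods are evicted to other nodes
$ oc adm taint nodes <node_name> odf-blocklisted=true:NoExecute
# After the pods are evicted and the blocklist is cleared, remove the taint
$ oc adm taint nodes <node_name> odf-blocklisted=true:NoExecute-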