Chapter 6. Troubleshooting alerts and errors in OpenShift Data Foundation
6.1. Resolving alerts and errors
Red Hat OpenShift Data Foundation can detect and automatically resolve a number of common failure scenarios. However, some problems require administrator intervention.
To know the errors currently firing, check one of the following locations:
- Observe → Alerting → Firing option
- Home → Overview → Cluster tab
- Storage → Data Foundation → Storage System → storage system link in the pop up → Overview → Block and File tab
- Storage → Data Foundation → Storage System → storage system link in the pop up → Overview → Object tab
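You can also list the firing alerts from the command line by querying the cluster monitoring stack. A minimal sketch, assuming the default openshift-monitoring deployment, its thanos-querier route, and the built-in Prometheus `ALERTS` metric:

# Query all currently firing alerts through the Thanos Querier route.
$ TOKEN=$(oc whoami -t)
$ HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
$ curl -sk -H "Authorization: Bearer $TOKEN" \
    --data-urlencode 'query=ALERTS{alertstate="firing"}' \
    "https://$HOST/api/v1/query"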
Copy the displayed error and search for it in the following table to find its severity and resolution:
| Name | Message | Description | Severity | Resolution | Procedure |
|---|---|---|---|---|---|
| | | | Warning | Fix | Inspect the user interface and log, and verify if an update is in progress. |
| | | | Warning | Fix | Inspect the user interface and log, and verify if an update is in progress. |
| | | | Critical | Fix | Remove unnecessary data or expand the cluster. |
| | | | Warning | Fix | Remove unnecessary data or expand the cluster. |
| | | | Warning | Workaround | Resolving NooBaa Bucket Error State |
| | | | Warning | Fix | Resolving NooBaa Bucket Error State |
| | | | Warning | Fix | Resolving NooBaa Bucket Error State |
| | | | Warning | Fix | |
| | | | Warning | Fix | |
| | | | Warning | Fix | |
| | | | Warning | Fix | |
| | | | Warning | Workaround | Resolving NooBaa Bucket Error State |
| | | | Warning | Fix | |
| | | | Warning | Fix | |
| | | | Warning | Fix | |
| | | Minimum required replicas for storage metadata service not available. Might affect the working of storage cluster. | Warning | Contact Red Hat support | |
| | | | Critical | Contact Red Hat support | |
| | | | Critical | Contact Red Hat support | |
| | | | Critical | Contact Red Hat support | |
| | | | Warning | Contact Red Hat support | |
| | | | Warning | Contact Red Hat support | |
| | | | Critical | Contact Red Hat support | |
| | | | Critical | Contact Red Hat support | |
| | | | Warning | Contact Red Hat support | |
| | | | Warning | Contact Red Hat support | |
| | | | Critical | Contact Red Hat support | |
| | | | Critical | Contact Red Hat support | |
| | | | Critical | Contact Red Hat support | |
| | | Disaster recovery is failing for one or a few applications. | Warning | Contact Red Hat support | |
| | | Disaster recovery is failing for the entire cluster. Mirror daemon is in unhealthy status for more than 1m. Mirroring on this cluster is not working as expected. | Critical | Contact Red Hat support | |
6.2. Resolving cluster health issues
There is a finite set of possible health messages that a Red Hat Ceph Storage cluster can raise and that show in the OpenShift Data Foundation user interface. These are defined as health checks, each with a unique identifier. The identifier is a terse, pseudo-human-readable string that is intended to enable tools to make sense of health checks and present them in a way that reflects their meaning. Click the health code below for more information and troubleshooting.
| Health code | Description |
|---|---|
| MON_DISK_LOW | One or more Ceph Monitors are low on disk space. |
6.2.1. MON_DISK_LOW
This alert triggers if the available space on the file system that stores the monitor database, expressed as a percentage, drops below `mon_data_avail_warn` (default: 15%). This may indicate that some other process or user on the system is filling up the same file system used by the monitor. It may also indicate that the monitor's database is large.
The paths to the file system differ depending on the deployment of your mons. You can find the path to where the mon is deployed in `storagecluster.yaml`.
Example paths:
- Mon deployed over PVC path: `/var/lib/ceph/mon`
- Mon deployed over hostpath: `/var/lib/rook/mon`
To clear up space, view the high-usage files in the file system and choose which to delete. To view the files, run:

# du -a <path-in-the-mon-node> | sort -n -r | head -n 10

Replace `<path-in-the-mon-node>` with the path to the file system where the mons are deployed.
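You can confirm the warning and inspect the threshold from inside the cluster with the rook-ceph toolbox, if it is enabled. A sketch, assuming the toolbox pod (label `app=rook-ceph-tools`) runs in the `openshift-storage` namespace:

# Open a shell in the rook-ceph toolbox pod.
$ oc rsh -n openshift-storage $(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)

# Show health check details, including MON_DISK_LOW if it is firing.
ceph health detail

# Show the percentage threshold that triggers the warning (default: 15).
ceph config get mon mon_data_avail_warn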
6.3. Resolving NooBaa Bucket Error State
Procedure
- In the OpenShift Web Console, click Storage → Data Foundation.
- In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
- Click the Object tab.
- In the Details card, click the link under the System Name field.
- In the left pane, click the Buckets option and search for the bucket in the error state. If the bucket in the error state is a namespace bucket, be sure to click the Namespace Buckets pane.
- Click its Bucket Name. The error encountered in the bucket is displayed.
Depending on the specific error of the bucket, perform one or both of the following:

For space-related errors:
- In the left pane, click the Resources option.
- Click the resource in the error state.
- Scale the resource by adding more agents.

For resource health errors:
- In the left pane, click the Resources option.
- Click the resource in the error state.
- A connectivity error means the backing service is not available and needs to be restored.
- For access/permission errors, update the connection's Access Key and Secret Key.
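You can also check a bucket's state from the command line with the Multicloud Object Gateway (MCG) `noobaa` CLI, if it is installed. A sketch; `my-bucket` is a placeholder bucket name and `openshift-storage` is the default ODF namespace:

# Show the status of a specific bucket.
$ noobaa bucket status my-bucket -n openshift-storage

# List the backing store resources and their current states.
$ noobaa backingstore list -n openshift-storage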
6.4. Resolving NooBaa Bucket Exceeding Quota State
To resolve a NooBaa Bucket Is In Exceeding Quota State error, perform one of the following:

- Clean up some of the data in the bucket.
- Increase the bucket quota by performing the following steps:
  - In the OpenShift Web Console, click Storage → Data Foundation.
  - In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
  - Click the Object tab.
  - In the Details card, click the link under the System Name field.
  - In the left pane, click the Buckets option and search for the bucket in the error state.
  - Click its Bucket Name. The error encountered in the bucket is displayed.
  - Click Bucket Policies → Edit Quota and increase the quota.
6.5. Resolving NooBaa Bucket Capacity or Quota State
Procedure
- In the OpenShift Web Console, click Storage → Data Foundation.
- In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
- Click the Object tab.
- In the Details card, click the link under the System Name field.
- In the left pane, click the Resources option and search for the PV pool resource.
- For the PV pool resource with a low capacity status, click its Resource Name.
- Edit the pool configuration and increase the number of agents.
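If the PV pool is backed by a BackingStore custom resource, the agent count can also be raised from the command line by editing `spec.pvPool.numVolumes`. A minimal sketch, assuming a hypothetical backing store named `my-pv-pool` in the default `openshift-storage` namespace:

# Increase the number of PV pool agents (volumes) to 3.
# my-pv-pool is a placeholder; substitute your BackingStore's name.
$ oc patch backingstore my-pv-pool -n openshift-storage --type merge \
    -p '{"spec":{"pvPool":{"numVolumes":3}}}'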
6.6. Recovering pods
When a first node (say `NODE1`) goes to NotReady state because of some issue, the hosted pods that use a PVC with ReadWriteOnce (RWO) access mode try to move to a second node (say `NODE2`) but get stuck due to a multi-attach error. In such a case, you can recover the MON, OSD, and application pods by using the following steps.
Procedure
- Power off `NODE1` (from the AWS or vSphere side) and ensure that `NODE1` is completely down.
- Force delete the pods on `NODE1` by using the following command:

  $ oc delete pod <pod-name> --grace-period=0 --force
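To see which pods are still bound to the failed node before deleting them, you can filter pods by node name. A sketch; `<NODE1>` is a placeholder for the node's name:

# List all pods scheduled on the failed node, across namespaces.
$ oc get pods --all-namespaces -o wide --field-selector spec.nodeName=<NODE1>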
6.7. Recovering from EBS volume detach
When an OSD or MON elastic block storage (EBS) volume where the OSD disk resides is detached from the worker Amazon EC2 instance, the volume is reattached automatically within one or two minutes. However, the OSD pod then enters a `CrashLoopBackOff` state. To recover the pod and bring it back to the `Running` state, you must restart the EC2 instance.
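A sketch of the restart using the AWS CLI; the instance ID below is a placeholder for the affected worker's ID:

# Reboot the EC2 instance that hosts the crashing OSD pod.
# i-0123456789abcdef0 is a placeholder instance ID.
$ aws ec2 reboot-instances --instance-ids i-0123456789abcdef0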
6.8. Enabling and disabling debug logs for rook-ceph-operator
Enable the debug logs for the rook-ceph-operator to obtain information about failures, which helps in troubleshooting issues.
Procedure
- Enabling the debug logs

  Edit the configmap of the rook-ceph-operator:

  $ oc edit configmap rook-ceph-operator-config

  Add the `ROOK_LOG_LEVEL: DEBUG` parameter in the `rook-ceph-operator-config` yaml file to enable the debug logs for the rook-ceph-operator:

  …
  data:
    # The logging level for the operator: INFO | DEBUG
    ROOK_LOG_LEVEL: DEBUG

  Now, the rook-ceph-operator logs contain the debug information.
- Disabling the debug logs

  Edit the configmap of the rook-ceph-operator:

  $ oc edit configmap rook-ceph-operator-config

  Set the `ROOK_LOG_LEVEL: INFO` parameter in the `rook-ceph-operator-config` yaml file to disable the debug logs for the rook-ceph-operator:

  …
  data:
    # The logging level for the operator: INFO | DEBUG
    ROOK_LOG_LEVEL: INFO
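The same setting can be applied non-interactively with `oc patch`, assuming the operator config lives in the usual `openshift-storage` namespace:

# Enable debug logging without opening an editor.
$ oc patch configmap rook-ceph-operator-config -n openshift-storage \
    --type merge -p '{"data":{"ROOK_LOG_LEVEL":"DEBUG"}}'

# Revert to the default level.
$ oc patch configmap rook-ceph-operator-config -n openshift-storage \
    --type merge -p '{"data":{"ROOK_LOG_LEVEL":"INFO"}}'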