Chapter 6. Troubleshooting alerts and errors in OpenShift Data Foundation

6.1. Resolving alerts and errors

Red Hat OpenShift Data Foundation can detect and automatically resolve a number of common failure scenarios. However, some problems require administrator intervention.

To know the errors currently firing, check one of the following locations:

  • Observe → Alerting → Firing option
  • Home → Overview → Cluster tab
  • Storage → Data Foundation → Storage System → storage system link in the pop up → Overview → Block and File tab
  • Storage → Data Foundation → Storage System → storage system link in the pop up → Overview → Object tab
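
You can also list the currently firing alerts from the command line. The following is a minimal sketch, assuming the default alertmanager-main route in the openshift-monitoring namespace, a logged-in oc session, and that jq is installed:

# Sketch: list the names of currently firing alerts through the Alertmanager API.
ALERTMANAGER=$(oc -n openshift-monitoring get route alertmanager-main -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $(oc whoami -t)" "https://${ALERTMANAGER}/api/v2/alerts" | jq -r '.[].labels.alertname' | sort -u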

Copy the error displayed and search it in the following section to know its severity and resolution:

Name: CephMonVersionMismatch

Message: There are multiple versions of storage services running.

Description: There are {{ $value }} different versions of Ceph Mon components running.

Severity: Warning

Resolution: Fix

Procedure: Inspect the user interface and log, and verify if an update is in progress.

  • If an update is in progress, this alert is temporary.
  • If an update is not in progress, restart the upgrade process.
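
To confirm whether multiple Ceph Mon versions are actually running, a minimal sketch, assuming the rook-ceph toolbox pod (referenced later in this chapter) is available, is:

# Sketch: report the Ceph daemon versions currently running in the cluster.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD ceph versions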

Name: CephOSDVersionMismatch

Message: There are multiple versions of storage services running.

Description: There are {{ $value }} different versions of Ceph OSD components running.

Severity: Warning

Resolution: Fix

Procedure: Inspect the user interface and log, and verify if an update is in progress.

  • If an update is in progress, this alert is temporary.
  • If an update is not in progress, restart the upgrade process.

Name: CephClusterCriticallyFull

Message: Storage cluster is critically full and needs immediate expansion

Description: Storage cluster utilization has crossed 85%.

Severity: Critical

Resolution: Fix

Procedure: Remove unnecessary data or expand the cluster.
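
Before choosing between deleting data and expanding, you can check the current utilization. A minimal sketch, assuming the rook-ceph toolbox pod referenced later in this chapter:

# Sketch: report overall and per-pool utilization from the toolbox pod.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD ceph df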

Name: CephClusterNearFull

Message: Storage cluster is nearing full. Expansion is required.

Description: Storage cluster utilization has crossed 75%.

Severity: Warning

Resolution: Fix

Procedure: Remove unnecessary data or expand the cluster.

Name: NooBaaBucketErrorState

Message: A NooBaa Bucket Is In Error State

Description: A NooBaa bucket {{ $labels.bucket_name }} is in error state for more than 6m

Severity: Warning

Resolution: Workaround

Procedure: Resolving NooBaa Bucket Error State

Name: NooBaaNamespaceResourceErrorState

Message: A NooBaa Namespace Resource Is In Error State

Description: A NooBaa namespace resource {{ $labels.namespace_resource_name }} is in error state for more than 5m

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Error State

Name: NooBaaNamespaceBucketErrorState

Message: A NooBaa Namespace Bucket Is In Error State

Description: A NooBaa namespace bucket {{ $labels.bucket_name }} is in error state for more than 5m

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Error State

Name: NooBaaBucketExceedingQuotaState

Message: A NooBaa Bucket Is In Exceeding Quota State

Description: A NooBaa bucket {{ $labels.bucket_name }} is exceeding its quota - {{ printf "%0.0f" $value }}% used

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Exceeding Quota State

Name: NooBaaBucketLowCapacityState

Message: A NooBaa Bucket Is In Low Capacity State

Description: A NooBaa bucket {{ $labels.bucket_name }} is using {{ printf "%0.0f" $value }}% of its capacity

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaBucketNoCapacityState

Message: A NooBaa Bucket Is In No Capacity State

Description: A NooBaa bucket {{ $labels.bucket_name }} is using all of its capacity

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaBucketReachingQuotaState

Message: A NooBaa Bucket Is In Reaching Quota State

Description: A NooBaa bucket {{ $labels.bucket_name }} is using {{ printf "%0.0f" $value }}% of its quota

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaResourceErrorState

Message: A NooBaa Resource Is In Error State

Description: A NooBaa resource {{ $labels.resource_name }} is in error state for more than 6m

Severity: Warning

Resolution: Workaround

Procedure: Resolving NooBaa Bucket Error State

Name: NooBaaSystemCapacityWarning100

Message: A NooBaa System Approached Its Capacity

Description: A NooBaa system approached its capacity, usage is at 100%

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaSystemCapacityWarning85

Message: A NooBaa System Is Approaching Its Capacity

Description: A NooBaa system is approaching its capacity, usage is more than 85%

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: NooBaaSystemCapacityWarning95

Message: A NooBaa System Is Approaching Its Capacity

Description: A NooBaa system is approaching its capacity, usage is more than 95%

Severity: Warning

Resolution: Fix

Procedure: Resolving NooBaa Bucket Capacity or Quota State

Name: CephMdsMissingReplicas

Message: Insufficient replicas for storage metadata service.

Description: Minimum required replicas for storage metadata service not available. Might affect the working of the storage cluster.

Severity: Warning

Resolution: Contact Red Hat support

Procedure:

  1. Check for alerts and operator status.
  2. If the issue cannot be identified, contact Red Hat support.

Name: CephMgrIsAbsent

Message: Storage metrics collector service not available anymore.

Description: Ceph Manager has disappeared from Prometheus target discovery.

Severity: Critical

Resolution: Contact Red Hat support

Procedure:

  1. Inspect the user interface and log, and verify if an update is in progress.

    • If an update is in progress, this alert is temporary.
    • If an update is not in progress, restart the upgrade process.
  2. Once the upgrade is complete, check for alerts and operator status.
  3. If the issue persists or cannot be identified, contact Red Hat support.

Name: CephNodeDown

Message: Storage node {{ $labels.node }} went down

Description: Storage node {{ $labels.node }} went down. Check the node immediately.

Severity: Critical

Resolution: Contact Red Hat support

Procedure:

  1. Check which node stopped functioning and its cause.
  2. Take appropriate actions to recover the node. If the node cannot be recovered, contact Red Hat support.

Name: CephClusterErrorState

Message: Storage cluster is in error state

Description: Storage cluster is in error state for more than 10m.

Severity: Critical

Resolution: Contact Red Hat support

Procedure:

  1. Check for alerts and operator status.
  2. If the issue cannot be identified, download log files and diagnostic information using must-gather.
  3. Open a Support Ticket with Red Hat Support with an attachment of the output of must-gather.

Name: CephClusterWarningState

Message: Storage cluster is in degraded state

Description: Storage cluster is in warning state for more than 10m.

Severity: Warning

Resolution: Contact Red Hat support

Procedure:

  1. Check for alerts and operator status.
  2. If the issue cannot be identified, download log files and diagnostic information using must-gather.
  3. Open a Support Ticket with Red Hat Support with an attachment of the output of must-gather.

Name: CephDataRecoveryTakingTooLong

Message: Data recovery is slow

Description: Data recovery has been active for too long.

Severity: Warning

Resolution: Contact Red Hat support

Name: CephOSDDiskNotResponding

Message: Disk not responding

Description: Disk device {{ $labels.device }} not responding, on host {{ $labels.host }}.

Severity: Critical

Resolution: Contact Red Hat support

Name: CephOSDDiskUnavailable

Message: Disk not accessible

Description: Disk device {{ $labels.device }} not accessible on host {{ $labels.host }}.

Severity: Critical

Resolution: Contact Red Hat support

Name: CephPGRepairTakingTooLong

Message: Self heal problems detected

Description: Self heal operations taking too long.

Severity: Warning

Resolution: Contact Red Hat support

Name: CephMonHighNumberOfLeaderChanges

Message: Storage Cluster has seen many leader changes recently.

Description: Ceph Monitor "{{ $labels.job }}": instance {{ $labels.instance }} has seen {{ $value | printf "%.2f" }} leader changes per minute recently.

Severity: Warning

Resolution: Contact Red Hat support

Name: CephMonQuorumAtRisk

Message: Storage quorum at risk

Description: Storage cluster quorum is low.

Severity: Critical

Resolution: Contact Red Hat support

Name: ClusterObjectStoreState

Message: Cluster Object Store is in an unhealthy state. Check Ceph cluster health.

Description: Cluster Object Store is in an unhealthy state for more than 15s. Check Ceph cluster health.

Severity: Critical

Resolution: Contact Red Hat support

Name: CephOSDFlapping

Message: Storage daemon osd.x has restarted 5 times in the last 5 minutes. Check the pod events or Ceph status to find out the cause.

Description: Storage OSD restarts more than 5 times in 5 minutes.

Severity: Critical

Resolution: Contact Red Hat support

Name: OdfPoolMirroringImageHealth

Message: Mirroring image(s) (PV) in the pool <pool-name> are in Warning state for more than a 1m. Mirroring might not work as expected.

Description: Disaster recovery is failing for one or a few applications.

Severity: Warning

Resolution: Contact Red Hat support

Name: OdfMirrorDaemonStatus

Message: Mirror daemon is unhealthy.

Description: Disaster recovery is failing for the entire cluster. Mirror daemon is in an unhealthy status for more than 1m. Mirroring on this cluster is not working as expected.

Severity: Critical

Resolution: Contact Red Hat support

6.2. Resolving cluster health issues

There is a finite set of possible health messages that a Red Hat Ceph Storage cluster can raise that show in the OpenShift Data Foundation user interface. These are defined as health checks which have unique identifiers. The identifier is a terse pseudo-human-readable string that is intended to enable tools to make sense of health checks, and present them in a way that reflects their meaning. Click the health code below for more information and troubleshooting.

Health code | Description

MON_DISK_LOW

One or more Ceph Monitors are low on disk space.

6.2.1. MON_DISK_LOW

This alert triggers if the available space on the file system storing the monitor database, expressed as a percentage, drops below mon_data_avail_warn (default: 15%). This may indicate that some other process or user on the system is filling up the same file system used by the monitor. It may also indicate that the monitor’s database is large.

Note

The paths to the file system differ depending on the deployment of your mons. You can find the path to where the mon is deployed in storagecluster.yaml.

Example paths:

  • Mon deployed over PVC path: /var/lib/ceph/mon
  • Mon deployed over hostpath: /var/lib/rook/mon

In order to clear up space, view the high usage files in the file system and choose which to delete. To view the files, run:

# du -a <path-in-the-mon-node> | sort -n -r | head -n 10

Replace <path-in-the-mon-node> with the path to the file system where mons are deployed.
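
If you do not have direct access to the node file system, a minimal sketch that runs the same check from inside a monitor pod (the pod name is a placeholder, and the path shown assumes a mon deployed over PVC as in the example paths above):

# Sketch: inspect mon database disk usage from inside a monitor pod.
oc -n openshift-storage exec <rook-ceph-mon-pod-name> -- sh -c 'du -a /var/lib/ceph/mon | sort -n -r | head -n 10'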

6.3. Resolving cluster alerts

There is a finite set of possible health alerts that a Red Hat Ceph Storage cluster can raise that show in the OpenShift Data Foundation user interface. These are defined as health alerts which have unique identifiers. The identifier is a terse pseudo-human-readable string that is intended to enable tools to make sense of health checks, and present them in a way that reflects their meaning. Click the health alert for more information and troubleshooting.

Table 6.1. Types of cluster health alerts
Health alert | Overview

CephClusterCriticallyFull

Storage cluster utilization has crossed 80%.

CephClusterErrorState

Storage cluster is in an error state for more than 10 minutes.

CephClusterNearFull

Storage cluster is nearing full capacity. Data deletion or cluster expansion is required.

CephClusterReadOnly

Storage cluster is read-only now and needs immediate data deletion or cluster expansion.

CephClusterWarningState

Storage cluster is in a warning state for more than 10 mins.

CephDataRecoveryTakingTooLong

Data recovery has been active for too long.

CephMdsCacheUsageHigh

Ceph metadata service (MDS) cache usage for the MDS daemon has exceeded 95% of the mds_cache_memory_limit.

CephMdsCpuUsageHigh

Ceph MDS CPU usage for the MDS daemon has exceeded the threshold for adequate performance.

CephMdsMissingReplicas

Minimum required replicas for storage metadata service not available. Might affect the working of the storage cluster.

CephMgrIsAbsent

Ceph Manager has disappeared from Prometheus target discovery.

CephMgrIsMissingReplicas

Ceph manager is missing replicas. This impacts health status reporting and will cause some of the information reported by the ceph status command to be missing or stale. In addition, the Ceph manager is responsible for a manager framework aimed at expanding the existing capabilities of Ceph.

CephMonHighNumberOfLeaderChanges

The Ceph monitor leader is being changed an unusual number of times.

CephMonQuorumAtRisk

Storage cluster quorum is low.

CephMonQuorumLost

The number of monitor pods in the storage cluster is not enough.

CephMonVersionMismatch

There are different versions of Ceph Mon components running.

CephNodeDown

A storage node went down. Check the node immediately. The alert should contain the node name.

CephOSDCriticallyFull

Utilization of back-end Object Storage Device (OSD) has crossed 80%. Free up some space immediately, expand the storage cluster, or contact support.

CephOSDDiskNotResponding

A disk device is not responding on one of the hosts.

CephOSDDiskUnavailable

A disk device is not accessible on one of the hosts.

CephOSDFlapping

Ceph storage OSD flapping.

CephOSDNearFull

One of the OSD storage devices is nearing full.

CephOSDSlowOps

OSD requests are taking too long to process.

CephOSDVersionMismatch

There are different versions of Ceph OSD components running.

CephPGRepairTakingTooLong

Self-healing operations are taking too long.

CephPoolQuotaBytesCriticallyExhausted

Storage pool quota usage has crossed 90%.

CephPoolQuotaBytesNearExhaustion

Storage pool quota usage has crossed 70%.

OSDCPULoadHigh

CPU usage in the OSD container on a specific pod has exceeded 80%, potentially affecting the performance of the OSD.

PersistentVolumeUsageCritical

Persistent Volume Claim usage has exceeded more than 85% of its capacity.

PersistentVolumeUsageNearFull

Persistent Volume Claim usage has exceeded more than 75% of its capacity.

6.3.1. CephClusterCriticallyFull

Meaning

Storage cluster utilization has crossed 80% and the cluster will become read-only once utilization crosses 85%. Free up some space or expand the storage cluster immediately. It is common to see alerts related to Object Storage Device (OSD) full or near full prior to this alert.

Impact

High

Diagnosis

Scaling storage
Depending on the type of cluster, you need to add storage devices, nodes, or both. For more information, see the Scaling storage guide.

Mitigation

Deleting information
If it is not possible to scale up the cluster, you need to delete information to free up some space.

6.3.2. CephClusterErrorState

Meaning

This alert reflects that the storage cluster is in ERROR state for an unacceptable amount of time and this impacts the storage availability. Check for other alerts that would have triggered prior to this one and troubleshoot those alerts first.

Impact

Critical

Diagnosis

pod status: pending
  1. Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:

    $ oc project openshift-storage
    $ oc get pod | grep rook-ceph
  2. Set MYPOD as the variable for the pod that is identified as the problem pod:

    # Examine the output for a rook-ceph pod that is in a pending state, not running, or not ready
    MYPOD=<pod_name>
    <pod_name>
    Specify the name of the pod that is identified as the problem pod.
  3. Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:

    $ oc get pod/${MYPOD} -o wide
pod status: NOT pending, running, but NOT ready
  • Check the readiness probe:

    $ oc describe pod/${MYPOD}
pod status: NOT pending, but NOT running
  • Check for application or image issues:

    $ oc logs pod/${MYPOD}
    Important
    • If a node was assigned, check the kubelet on the node.
    • If the basic health of the running pods, node affinity and resource availability on the nodes are verified, run the Ceph tools to get the status of the storage components.
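
Where the last step above says to run the Ceph tools, a minimal sketch, assuming the rook-ceph toolbox pod described later in this chapter is available, is:

# Sketch: get the status of the storage components from the toolbox pod.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD ceph health detail
oc rsh -n openshift-storage $TOOLS_POD ceph osd tree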

Mitigation

Debugging log information
  • This step is optional. Run the following command to gather the debugging information for the Ceph cluster:

    $ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.3. CephClusterNearFull

Meaning

Storage cluster utilization has crossed 75% and will become read-only at 85%. Free up some space or expand the storage cluster.

Impact

Critical

Diagnosis

Scaling storage
Depending on the type of cluster, you need to add storage devices, nodes, or both. For more information, see the Scaling storage guide.

Mitigation

Deleting information
If it is not possible to scale up the cluster, you need to delete information in order to free up some space.

6.3.4. CephClusterReadOnly

Meaning

Storage cluster utilization has crossed 85% and will become read-only now. Free up some space or expand the storage cluster immediately.

Impact

Critical

Diagnosis

Scaling storage
Depending on the type of cluster, you need to add storage devices, nodes, or both. For more information, see the Scaling storage guide.

Mitigation

Deleting information
If it is not possible to scale up the cluster, you need to delete information in order to free up some space.

6.3.5. CephClusterWarningState

Meaning

This alert reflects that the storage cluster has been in a warning state for an unacceptable amount of time. While the storage operations will continue to function in this state, it is recommended to fix the errors so that the cluster does not get into an error state. Check for other alerts that might have triggered prior to this one and troubleshoot those alerts first.

Impact

High

Diagnosis

pod status: pending
  1. Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:

    $ oc project openshift-storage
    $ oc get pod | grep rook-ceph
  2. Set MYPOD as the variable for the pod that is identified as the problem pod:

    # Examine the output for a rook-ceph pod that is in a pending state, not running, or not ready
    MYPOD=<pod_name>
    <pod_name>
    Specify the name of the pod that is identified as the problem pod.
  3. Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:

    $ oc get pod/${MYPOD} -o wide
pod status: NOT pending, running, but NOT ready
  • Check the readiness probe:

    $ oc describe pod/${MYPOD}
pod status: NOT pending, but NOT running
  • Check for application or image issues:

    $ oc logs pod/${MYPOD}
    Important

    If a node was assigned, check the kubelet on the node.

Mitigation

Debugging log information
  • This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.6. CephDataRecoveryTakingTooLong

Meaning

Data recovery is slow. Check whether all the Object Storage Devices (OSDs) are up and running.

Impact

High

Diagnosis

pod status: pending
  1. Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:

    $ oc project openshift-storage
    $ oc get pod | grep rook-ceph-osd
  2. Set MYPOD as the variable for the pod that is identified as the problem pod:

    # Examine the output for a rook-ceph-osd pod that is in a pending state, not running, or not ready
    MYPOD=<pod_name>
    <pod_name>
    Specify the name of the pod that is identified as the problem pod.
  3. Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:

    $ oc get pod/${MYPOD} -o wide
pod status: NOT pending, running, but NOT ready
  • Check the readiness probe:

    $ oc describe pod/${MYPOD}
pod status: NOT pending, but NOT running
  • Check for application or image issues:

    $ oc logs pod/${MYPOD}
    Important

    If a node was assigned, check the kubelet on the node.

Mitigation

Debugging log information
  • This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.7. CephMdsCacheUsageHigh

Meaning

When the storage metadata service (MDS) cannot keep its cache usage under the target threshold specified by mds_health_cache_threshold, or 150% of the cache limit set by mds_cache_memory_limit, the MDS sends a health alert to the monitors indicating the cache is too large. As a result, the MDS related operations become slow.

Impact

High

Diagnosis

The MDS tries to stay under a reservation of the mds_cache_memory_limit by trimming unused metadata in its cache and recalling cached items in the client caches. It is possible for the MDS to exceed this limit due to slow recall from clients as a result of multiple clients accessing the files.

Mitigation

Make sure you have enough memory provisioned for MDS cache. Memory resources for the MDS pods need to be updated in the ocs-storageCluster in order to increase the mds_cache_memory_limit. Run the following command to set the memory of MDS pods, for example, 16GB:

$ oc patch -n openshift-storage storagecluster ocs-storagecluster \
    --type merge \
    --patch '{"spec": {"resources": {"mds": {"limits": {"memory": "16Gi"},"requests": {"memory": "16Gi"}}}}}'

OpenShift Data Foundation automatically sets mds_cache_memory_limit to half of the MDS pod memory limit. For example, if the memory is set to 16GB using the previous command, the operator sets the MDS cache memory limit to 8GB.
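
To verify the value that was applied, a minimal sketch using the rook-ceph toolbox pod (referenced later in this chapter):

# Sketch: confirm the effective MDS cache memory limit from the toolbox pod.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD ceph config dump | grep mds_cache_memory_limit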

6.3.8. CephMdsCpuUsageHigh

Meaning

The storage metadata service (MDS) serves filesystem metadata. The MDS is crucial for any file creation, rename, deletion, and update operations. MDS by default is allocated two or three CPUs. This does not cause issues as long as there are not too many metadata operations. When the metadata operation load increases enough to trigger this alert, it means the default CPU allocation is unable to cope with load. You need to increase the CPU allocation.

Impact

High

Diagnosis

Click Workloads → Pods. Select the corresponding MDS pod and click on the Metrics tab. There you will see the allocated and used CPU. By default, the alert is fired if the used CPU is 67% of allocated CPU for 6 hours. If this is the case, follow the steps in the mitigation section.

Mitigation

You need to increase the allocated CPU.

Use the following command to set the number of allocated CPU for MDS, for example, 8:

$ oc patch -n openshift-storage storagecluster ocs-storagecluster \
    --type merge \
    --patch '{"spec": {"resources": {"mds": {"limits": {"cpu": "8"},
    "requests": {"cpu": "8"}}}}}'

6.3.9. CephMdsMissingReplicas

Meaning

Minimum required replicas for the storage metadata service (MDS) are not available. The MDS is responsible for serving file system metadata. Degradation of the MDS service can affect how the storage cluster works (related to the CephFS storage class) and should be fixed as soon as possible.

Impact

High

Diagnosis

pod status: pending
  1. Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:

    $ oc project openshift-storage
    $ oc get pod | grep rook-ceph-mds
  2. Set MYPOD as the variable for the pod that is identified as the problem pod:

    # Examine the output for a rook-ceph-mds pod that is in a pending state, not running, or not ready
    MYPOD=<pod_name>
    <pod_name>
    Specify the name of the pod that is identified as the problem pod.
  3. Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:

    $ oc get pod/${MYPOD} -o wide
pod status: NOT pending, running, but NOT ready
  • Check the readiness probe:

    $ oc describe pod/${MYPOD}
pod status: NOT pending, but NOT running
  • Check for application or image issues:

    $ oc logs pod/${MYPOD}
    Important

    If a node was assigned, check the kubelet on the node.

Mitigation

Debugging log information
  • This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.10. CephMgrIsAbsent

Meaning

Not having a Ceph Manager running impacts the monitoring of the cluster and the handling of Persistent Volume Claim (PVC) creation and deletion requests. Resolve this issue as soon as possible.

Impact

High

Diagnosis

  • Verify that the rook-ceph-mgr pod is failing, and restart if necessary. If the Ceph mgr pod restart fails, follow the general pod troubleshooting to resolve the issue.

    • Verify that the Ceph mgr pod is failing:

      $ oc get pods | grep mgr
    • Describe the Ceph mgr pod for more details:

      $ oc describe pods/<pod_name>
      <pod_name>
      Specify the rook-ceph-mgr pod name from the previous step.

      Analyze the errors related to resource issues.

    • Delete the pod, and wait for the pod to restart:

      $ oc delete pod <pod_name>
      $ oc get pods | grep mgr

Follow these steps for general pod troubleshooting:

pod status: pending
  1. Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:

    $ oc project openshift-storage
    $ oc get pod | grep rook-ceph-mgr
  2. Set MYPOD as the variable for the pod that is identified as the problem pod:

    # Examine the output for a rook-ceph-mgr pod that is in a pending state, not running, or not ready
    MYPOD=<pod_name>
    <pod_name>
    Specify the name of the pod that is identified as the problem pod.
  3. Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:

    $ oc get pod/${MYPOD} -o wide
pod status: NOT pending, running, but NOT ready
  • Check the readiness probe:

    $ oc describe pod/${MYPOD}
pod status: NOT pending, but NOT running
  • Check for application or image issues:

    $ oc logs pod/${MYPOD}
    Important

    If a node was assigned, check the kubelet on the node.

Mitigation

Debugging log information
  • This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.11. CephMgrIsMissingReplicas

Meaning

The Ceph Manager is missing replicas, which impacts health status reporting and the information reported by the ceph status command. To resolve this alert, determine the cause of the disappearance of the Ceph Manager and restart it if necessary.

Impact

High

Diagnosis

pod status: pending
  1. Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:

    $ oc project openshift-storage
    $ oc get pod | grep rook-ceph-mgr
  2. Set MYPOD as the variable for the pod that is identified as the problem pod:

    # Examine the output for a rook-ceph-mgr pod that is in a pending state, not running, or not ready
    MYPOD=<pod_name>
    <pod_name>
    Specify the name of the pod that is identified as the problem pod.
  3. Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:

    $ oc get pod/${MYPOD} -o wide
pod status: NOT pending, running, but NOT ready
  • Check the readiness probe:

    $ oc describe pod/${MYPOD}
pod status: NOT pending, but NOT running
  • Check for application or image issues:

    $ oc logs pod/${MYPOD}
    Important

    If a node was assigned, check the kubelet on the node.

Mitigation

Debugging log information
  • This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.12. CephMonHighNumberOfLeaderChanges

Meaning

In a Ceph cluster there is a redundant set of monitor pods that store critical information about the storage cluster. Monitor pods synchronize periodically to obtain information about the storage cluster. The first monitor pod to get the most updated information becomes the leader, and the other monitor pods will start their synchronization process after asking the leader. A problem in network connection or another kind of problem in one or more monitor pods produces an unusual change of the leader. This situation can negatively affect the storage cluster performance.

Impact

Medium

Important

Check for any network issues. If there is a network issue, you need to escalate to the OpenShift Data Foundation team before you proceed with any of the following troubleshooting steps.

Diagnosis

  1. Print the logs of the affected monitor pod to gather more information about the issue:

    $ oc logs <rook-ceph-mon-X-yyyy> -n openshift-storage
    <rook-ceph-mon-X-yyyy>
    Specify the name of the affected monitor pod.
  2. Alternatively, use the Openshift Web console to open the logs of the affected monitor pod. More information about possible causes is reflected in the log.
  3. Perform the general pod troubleshooting steps:

    pod status: pending
  4. Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:

    $ oc project openshift-storage
    $ oc get pod | grep rook-ceph-mon
  5. Set MYPOD as the variable for the pod that is identified as the problem pod:

    # Examine the output for a rook-ceph-mon pod that is in a pending state, not running, or not ready
    MYPOD=<pod_name>
    <pod_name>
    Specify the name of the pod that is identified as the problem pod.
  6. Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:

    $ oc get pod/${MYPOD} -o wide
    pod status: NOT pending, running, but NOT ready
    • Check the readiness probe:
    $ oc describe pod/${MYPOD}
    pod status: NOT pending, but NOT running
    • Check for application or image issues:
    $ oc logs pod/${MYPOD}
    Important

    If a node was assigned, check the kubelet on the node.

Mitigation

Debugging log information
  • This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.13. CephMonQuorumAtRisk

Meaning

Multiple MONs work together to provide redundancy. Each of the MONs keeps a copy of the metadata. The cluster is deployed with 3 MONs, and requires 2 or more MONs to be up and running for quorum and for the storage operations to run. If quorum is lost, access to data is at risk.

Impact

High

Diagnosis

Restore the Ceph MON Quorum. For more information, see Restoring ceph-monitor quorum in OpenShift Data Foundation in the Troubleshooting guide. If the restoration of the Ceph MON Quorum fails, follow the general pod troubleshooting to resolve the issue.

Perform the following for general pod troubleshooting:

pod status: pending
  1. Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:

    $ oc project openshift-storage
    $ oc get pod | grep rook-ceph-mon
  2. Set MYPOD as the variable for the pod that is identified as the problem pod:

    # Examine the output for a rook-ceph-mon pod that is in a pending state, not running, or not ready
    MYPOD=<pod_name>
    <pod_name>
    Specify the name of the pod that is identified as the problem pod.
  3. Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:

    $ oc get pod/${MYPOD} -o wide
pod status: NOT pending, running, but NOT ready
  • Check the readiness probe:
$ oc describe pod/${MYPOD}
pod status: NOT pending, but NOT running
  • Check for application or image issues:
$ oc logs pod/${MYPOD}
Important

If a node was assigned, check the kubelet on the node.

Mitigation

Debugging log information
  • This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.14. CephMonQuorumLost

Meaning

In a Ceph cluster there is a redundant set of monitor pods that store critical information about the storage cluster. Monitor pods synchronize periodically to obtain information about the storage cluster. The first monitor pod to get the most updated information becomes the leader, and the other monitor pods will start their synchronization process after asking the leader. A problem in network connection or another kind of problem in one or more monitor pods produces an unusual change of the leader. This situation can negatively affect the storage cluster performance.

Impact

High

Important

Check for any network issues. If there is a network issue, you need to escalate to the OpenShift Data Foundation team before you proceed with any of the following troubleshooting steps.

Diagnosis

Restore the Ceph MON Quorum. For more information, see Restoring ceph-monitor quorum in OpenShift Data Foundation in the Troubleshooting guide. If the restoration of the Ceph MON Quorum fails, follow the general pod troubleshooting to resolve the issue.

Alternatively, perform general pod troubleshooting:

pod status: pending
  1. Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:

    $ oc project openshift-storage
    $ oc get pod | grep rook-ceph-mon
  2. Set MYPOD as the variable for the pod that is identified as the problem pod:

    # Examine the output for a rook-ceph-mon pod that is in a pending state, not running, or not ready
    MYPOD=<pod_name>
    <pod_name>
    Specify the name of the pod that is identified as the problem pod.
  3. Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:

    $ oc get pod/${MYPOD} -o wide
pod status: NOT pending, running, but NOT ready
  • Check the readiness probe:
$ oc describe pod/${MYPOD}
pod status: NOT pending, but NOT running
  • Check for application or image issues:
$ oc logs pod/${MYPOD}
Important

If a node was assigned, check the kubelet on the node.

Mitigation

Debugging log information
  • This step is optional. Run the following command to gather the debugging information for the Ceph cluster:
$ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.15. CephMonVersionMismatch

Meaning

Typically this alert triggers during an upgrade that is taking a long time.

Impact

Medium

Diagnosis

Check the ocs-operator subscription status and the operator pod health to verify whether an operator upgrade is in progress.

  1. Check the ocs-operator subscription health.

    $ oc get sub $(oc get pods -n openshift-storage | grep -v ocs-operator) -n openshift-storage -o json | jq .status.conditions

    The status condition types are CatalogSourcesUnhealthy, InstallPlanMissing, InstallPlanPending, and InstallPlanFailed. The status for each type should be False.

    Example output:

    [
      {
        "lastTransitionTime": "2021-01-26T19:21:37Z",
        "message": "all available catalogsources are healthy",
        "reason": "AllCatalogSourcesHealthy",
        "status": "False",
        "type": "CatalogSourcesUnhealthy"
      }
    ]

    The example output shows a False status for the CatalogSourcesUnhealthy type, which means that the catalog sources are healthy.

  2. Check the OCS operator pod status to see if there is an OCS operator upgrading in progress.

    $ oc get pod -n openshift-storage | grep ocs-operator
    $ OCSOP=$(oc get pod -n openshift-storage -o custom-columns=POD:.metadata.name --no-headers | grep ocs-operator)
    $ echo $OCSOP
    $ oc get pod/${OCSOP} -n openshift-storage
    $ oc describe pod/${OCSOP} -n openshift-storage

    If you determine that the ocs-operator upgrade is in progress, wait for 5 minutes; this alert should resolve itself. If you have waited or see a different error status condition, continue troubleshooting.

Mitigation

Debugging log information
  • This step is optional. Run the following command to gather the debugging information for the Ceph cluster:

    $ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.16. CephNodeDown

Meaning

A node running Ceph pods is down. While storage operations will continue to function as Ceph is designed to deal with a node failure, it is recommended to resolve the issue to minimize the risk of another node going down and affecting storage functions.

Impact

Medium

Diagnosis

  1. List all the pods that are running and failing:

    oc -n openshift-storage get pods
    Important

    Ensure that you meet the OpenShift Data Foundation resource requirements so that the Object Storage Device (OSD) pods are scheduled on the new node. This may take a few minutes as the Ceph cluster recovers data for the failing but now recovering OSD. To watch this recovery in action, ensure that the OSD pods are correctly placed on the new worker node.

  2. Check if the OSD pods that were previously failing are now running:

    oc -n openshift-storage get pods

    If the previously failing OSD pods have not been scheduled, use the describe command and check the events for reasons the pods were not rescheduled.

  3. Describe the events for the failing OSD pod:

    oc -n openshift-storage get pods | grep osd
  4. Find the one or more failing OSD pods:

    oc -n openshift-storage describe pods/<osd_podname_from_the_previous_step>

    In the events section look for the failure reasons, such as the resources are not being met.

    In addition, you may use the rook-ceph-toolbox to watch the recovery. This step is optional, but is helpful for large Ceph clusters. To access the toolbox, run the following command:

    TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
    oc rsh -n openshift-storage $TOOLS_POD

    From the rsh command prompt, run the following, and watch for "recovery" under the io section:

    ceph status
  5. Determine if there are failed nodes.

    1. Get the list of worker nodes, and check for the node status:

      oc get nodes --selector='node-role.kubernetes.io/worker','!node-role.kubernetes.io/infra'
    2. Describe the node which is of the NotReady status to get more information about the failure:

      oc describe node <node_name>

Mitigation

Debugging log information
  • This step is optional. Run the following command to gather the debugging information for the Ceph cluster:

    $ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.17. CephOSDCriticallyFull

Meaning

One of the Object Storage Devices (OSDs) is critically full. Expand the cluster immediately.

Impact

High

Diagnosis

Deleting data to free up storage space
You can delete data, and the cluster will resolve the alert through self healing processes.
Important

This is only applicable to OpenShift Data Foundation clusters that are near or full but not in read-only mode. Read-only mode prevents any changes that include deleting data, that is, deletion of Persistent Volume Claim (PVC), Persistent Volume (PV) or both.

Expanding the storage capacity
Current storage size is less than 1 TB

You must first assess the ability to expand. For every 1 TB of storage added, the cluster needs to have 3 nodes each with a minimum available 2 vCPUs and 8 GiB memory.

You can increase the storage capacity to 4 TB via the add-on and the cluster will resolve the alert through self healing processes. If the minimum vCPU and memory resource requirements are not met, you need to add 3 additional worker nodes to the cluster.

Mitigation

  • If your current storage size is equal to 4 TB, contact Red Hat support.
  • Optional: Run the following command to gather the debugging information for the Ceph cluster:

    $ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.18. CephOSDDiskNotResponding

Meaning

A disk device is not responding. Check whether all the Object Storage Devices (OSDs) are up and running.

Impact

Medium

Diagnosis

pod status: pending
  1. Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:

    $ oc project openshift-storage
    $ oc get pod | grep rook-ceph
  2. Set MYPOD as the variable for the pod that is identified as the problem pod:

    # Examine the output for a rook-ceph pod that is in a pending state, not running, or not ready
    MYPOD=<pod_name>
    <pod_name>
    Specify the name of the pod that is identified as the problem pod.
  3. Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:

    $ oc get pod/${MYPOD} -o wide
pod status: NOT pending, running, but NOT ready
  • Check the readiness probe:

    $ oc describe pod/${MYPOD}
pod status: NOT pending, but NOT running
  • Check for application or image issues:

    $ oc logs pod/${MYPOD}
    Important
    • If a node was assigned, check the kubelet on the node.
    • If the basic health of the running pods, node affinity and resource availability on the nodes are verified, run the Ceph tools to get the status of the storage components.

Mitigation

Debugging log information
  • This step is optional. Run the following command to gather the debugging information for the Ceph cluster:

    $ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.19. CephOSDDiskUnavailable

Meaning

A disk device is not accessible on one of the hosts and its corresponding Object Storage Device (OSD) is marked out by the Ceph cluster. This alert is raised when a Ceph node fails to recover within 10 minutes.

Impact

High

Diagnosis

Determine the failed node
  1. Get the list of worker nodes, and check for the node status:

    oc get nodes --selector='node-role.kubernetes.io/worker','!node-role.kubernetes.io/infra'
  2. Describe the node which is of NotReady status to get more information on the failure:

    oc describe node <node_name>

6.3.20. CephOSDFlapping

Meaning

A storage daemon has restarted 5 times in the last 5 minutes. Check the pod events or Ceph status to find out the cause.

Impact

High

Diagnosis

Follow the steps in the Flapping OSDs section of the Red Hat Ceph Storage Troubleshooting Guide.

Alternatively, follow the steps for general pod troubleshooting:

pod status: pending
  1. Check for resource issues, pending Persistent Volume Claims (PVCs), node assignment, and kubelet problems:

    $ oc project openshift-storage
    $ oc get pod | grep rook-ceph
  2. Set MYPOD as the variable for the pod that is identified as the problem pod:

    # Examine the output for a rook-ceph pod that is in a pending state, not running, or not ready
    MYPOD=<pod_name>
    <pod_name>
    Specify the name of the pod that is identified as the problem pod.
  3. Look for the resource limitations or pending PVCs. Otherwise, check for the node assignment:

    $ oc get pod/${MYPOD} -o wide
pod status: NOT pending, running, but NOT ready
  • Check the readiness probe:

    $ oc describe pod/${MYPOD}
pod status: NOT pending, but NOT running
  • Check for application or image issues:

    $ oc logs pod/${MYPOD}
    Important
    • If a node was assigned, check the kubelet on the node.
    • If the basic health of the running pods, node affinity and resource availability on the nodes are verified, run the Ceph tools to get the status of the storage components.

Mitigation

Debugging log information
  • This step is optional. Run the following command to gather the debugging information for the Ceph cluster:

    $ oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.15

6.3.21. CephOSDNearFull

Meaning

Utilization of a back-end storage device, that is, an Object Storage Device (OSD), has crossed 75% on a host.

Impact

High

Mitigation

Free up some space in the cluster, expand the storage cluster, or contact Red Hat support. For more information on scaling storage, see the Scaling storage guide.

6.3.22. CephOSDSlowOps

Meaning

An Object Storage Device (OSD) with slow requests is any OSD that is not able to service the I/O operations per second (IOPS) in its queue within the time defined by the osd_op_complaint_time parameter. By default, this parameter is set to 30 seconds.

Impact

Medium

Diagnosis

More information about the slow requests can be obtained using the Openshift console.

  1. Access the OSD pod terminal, and run the following commands:

    $ ceph daemon osd.<id> ops
    $ ceph daemon osd.<id> dump_historic_ops
    Note

    The number of the OSD is seen in the pod name. For example, in rook-ceph-osd-0-5d86d4d8d4-zlqkx, <0> is the OSD.
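
For step 1, a minimal sketch of opening a terminal in a specific OSD pod; the ceph-osd-id label is the usual Rook label and is an assumption here:

# Sketch: open a shell in the pod for OSD 0, then run the ceph daemon commands above inside it.
OSD_POD=$(oc get pods -n openshift-storage -l ceph-osd-id=0 -o name)
oc rsh -n openshift-storage $OSD_POD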

Mitigation

The main causes of the OSDs having slow requests are:

  • Problems with the underlying hardware or infrastructure, such as, disk drives, hosts, racks, or network switches. Use the Openshift monitoring console to find the alerts or errors about cluster resources. This can give you an idea about the root cause of the slow operations in the OSD.
  • Problems with the network. These problems are usually connected with flapping OSDs. See the Flapping OSDs section of the Red Hat Ceph Storage Troubleshooting Guide.
  • If it is a network issue, escalate to the OpenShift Data Foundation team.
  • System load. Use the Openshift console to review the metrics of the OSD pod and the node which is running the OSD. Adding or assigning more resources can be a possible solution.

6.3.23. CephOSDVersionMismatch

Meaning

Typically this alert triggers during an upgrade that is taking a long time.

Impact

Medium

Diagnosis

Check the ocs-operator subscription status and the operator pod health to verify whether an operator upgrade is in progress.

  1. Check the ocs-operator subscription health.

    $ oc get sub $(oc get pods -n openshift-storage | grep -v ocs-operator) -n openshift-storage -o json | jq .status.conditions

    The status condition types are CatalogSourcesUnhealthy, InstallPlanMissing, InstallPlanPending, and InstallPlanFailed. The status for each type should be False.

    Example output:

    [
      {
        "lastTransitionTime": "2021-01-26T19:21:37Z",
        "message": "all available catalogsources are healthy",
        "reason": "AllCatalogSourcesHealthy",
        "status": "False",
        "type": "CatalogSourcesUnhealthy"
      }
    ]

    The example output shows a False status for the CatalogSourcesUnhealthy type, which means that the catalog sources are healthy.

  2. Check the OCS operator pod status to see if there is an OCS operator upgrading in progress.

    $ oc get pod -n openshift-storage | grep ocs-operator
    $ OCSOP=$(oc get pod -n openshift-storage -o custom-columns=POD:.metadata.name --no-headers | grep ocs-operator)
    $ echo $OCSOP
    $ oc get pod/${OCSOP} -n openshift-storage
    $ oc describe pod/${OCSOP} -n openshift-storage

    If you determine that the ocs-operator upgrade is in progress, wait for 5 minutes; this alert should resolve itself. If you have waited or see a different error status condition, continue troubleshooting.

6.3.24. CephPGRepairTakingTooLong

Meaning

Self-healing operations are taking too long.

Impact

High

Diagnosis

Check for inconsistent Placement Groups (PGs), and repair them. For more information, see the Red Hat Knowledgebase solution Handle Inconsistent Placement Groups in Ceph.

6.3.25. CephPoolQuotaBytesCriticallyExhausted

Meaning

One or more pools has reached, or is very close to reaching, its quota. The threshold to trigger this error condition is controlled by the mon_pool_quota_crit_threshold configuration option.

Impact

High

Mitigation

Adjust the pool quotas. Run the following commands to fully remove or adjust the pool quotas up or down:

ceph osd pool set-quota <pool> max_bytes <bytes>
ceph osd pool set-quota <pool> max_objects <objects>

Setting the quota value to 0 will disable the quota.
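
As a worked example, the following minimal sketch raises the byte quota of a hypothetical pool named my-pool to 100 GiB and then verifies it, using the rook-ceph toolbox pod referenced earlier in this chapter (the pool name and size are placeholders):

# Sketch: adjust and verify a pool quota from the toolbox pod.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
oc rsh -n openshift-storage $TOOLS_POD ceph osd pool set-quota my-pool max_bytes 107374182400
oc rsh -n openshift-storage $TOOLS_POD ceph osd pool get-quota my-pool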

6.3.26. CephPoolQuotaBytesNearExhaustion

Meaning

One or more pools is approaching a configured fullness threshold. The threshold that triggers this warning condition is controlled by the mon_pool_quota_warn_threshold configuration option.

Impact

High

Mitigation

Adjust the pool quotas. Run the following commands to fully remove or adjust the pool quotas up or down:

ceph osd pool set-quota <pool> max_bytes <bytes>
ceph osd pool set-quota <pool> max_objects <objects>

Setting the quota value to 0 will disable the quota.

6.3.27. OSDCPULoadHigh

Meaning

OSD is a critical component in Ceph storage, responsible for managing data placement and recovery. High CPU usage in the OSD container suggests increased processing demands, potentially leading to degraded storage performance.

Impact

High

Diagnosis

  1. Navigate to the Kubernetes dashboard or equivalent.
  2. Access the Workloads section and select the relevant pod associated with the OSD alert.
  3. Click the Metrics tab to view CPU metrics for the OSD container.
  4. Verify that the CPU usage exceeds 80% over a significant period (as specified in the alert configuration).

Mitigation

If the OSD CPU usage is consistently high, consider taking the following steps:

  1. Evaluate the overall storage cluster performance and identify the OSDs contributing to high CPU usage.
  2. Increase the number of OSDs in the cluster by adding new storage devices to the existing nodes or by adding new nodes with new storage devices. Review the Scaling storage guide for instructions to help distribute the load and improve overall system performance.

6.3.28. PersistentVolumeUsageCritical

Meaning

A Persistent Volume Claim (PVC) is nearing its full capacity and may lead to data loss if not attended to in a timely manner.

Impact

High

Mitigation

Expand the PVC size to increase the capacity.

  1. Log in to the OpenShift Web Console.
  2. Click Storage → PersistentVolumeClaim.
  3. Select openshift-storage from the Project drop-down list.
  4. On the PVC you want to expand, click Action menu (⋮) → Expand PVC.
  5. Update the Total size to the desired size.
  6. Click Expand.

Alternatively, you can delete unnecessary data that may be taking up space.
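
If you prefer the command line over the console steps above, a minimal sketch of the same expansion (the PVC name and target size are placeholders, and the PVC's storage class must allow volume expansion):

$ oc -n openshift-storage patch pvc <pvc-name> --type merge -p '{"spec":{"resources":{"requests":{"storage":"<new-size>"}}}}'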

6.3.29. PersistentVolumeUsageNearFull

Meaning

A Persistent Volume Claim (PVC) is nearing its full capacity and may lead to data loss if not attended to in a timely manner.

Impact

High

Mitigation

Expand the PVC size to increase the capacity.

  1. Log in to the OpenShift Web Console.
  2. Click Storage → PersistentVolumeClaim.
  3. Select openshift-storage from the Project drop-down list.
  4. On the PVC you want to expand, click Action menu (⋮) → Expand PVC.
  5. Update the Total size to the desired size.
  6. Click Expand.

Alternatively, you can delete unnecessary data that may be taking up space.

6.4. Resolving NooBaa Bucket Error State

Procedure

  1. In the OpenShift Web Console, click Storage → Data Foundation.
  2. In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
  3. Click the Object tab.
  4. In the Details card, click the link under the System Name field.
  5. In the left pane, click the Buckets option and search for the bucket in error state. If the bucket in error state is a namespace bucket, be sure to click the Namespace Buckets pane.
  6. Click on its Bucket Name. The error encountered in the bucket is displayed.
  7. Depending on the specific error of the bucket, perform one or both of the following:

    1. For space related errors:

      1. In the left pane, click the Resources option.
      2. Click on the resource in error state.
      3. Scale the resource by adding more agents.
    2. For resource health errors:

      1. In the left pane, click the Resources option.
      2. Click on the resource in error state.
      3. A connectivity error means that the backing service is not available and needs to be restored.
      4. For access/permissions errors, update the connection’s Access Key and Secret Key.

6.5. Resolving NooBaa Bucket Exceeding Quota State

To resolve the A NooBaa Bucket Is In Exceeding Quota State error, perform one of the following:

  • Clean up some of the data in the bucket.
  • Increase the bucket quota by performing the following steps:

    1. In the OpenShift Web Console, click Storage → Data Foundation.
    2. In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
    3. Click the Object tab.
    4. In the Details card, click the link under the System Name field.
    5. In the left pane, click the Buckets option and search for the bucket in error state.
    6. Click on its Bucket Name. The error encountered in the bucket is displayed.
    7. Click Bucket Policies → Edit Quota and increase the quota.

6.6. Resolving NooBaa Bucket Capacity or Quota State

Procedure

  1. In the OpenShift Web Console, click Storage → Data Foundation.
  2. In the Status card of the Overview tab, click Storage System and then click the storage system link from the pop up that appears.
  3. Click the Object tab.
  4. In the Details card, click the link under the System Name field.
  5. In the left pane, click the Resources option and search for the PV pool resource.
  6. For the PV pool resource with low capacity status, click on its Resource Name.
  7. Edit the pool configuration and increase the number of agents.

6.7. Recovering pods

When a node (say NODE1) goes into NotReady state because of some issue, the hosted pods that use a PVC with ReadWriteOnce (RWO) access mode try to move to a second node (say NODE2) but get stuck due to a multi-attach error. In such a case, you can recover MON, OSD, and application pods by using the following steps.

Procedure

  1. Power off NODE1 (from AWS or vSphere side) and ensure that NODE1 is completely down.
  2. Force delete the pods on NODE1 by using the following command:

    $ oc delete pod <pod-name> --grace-period=0 --force
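
To identify which pods are stuck on NODE1 before force deleting them, a minimal sketch (the node name is a placeholder):

$ oc get pods --all-namespaces -o wide --field-selector spec.nodeName=<NODE1>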

6.8. Recovering from EBS volume detach

When an OSD or MON elastic block storage (EBS) volume where the OSD disk resides is detached from the worker Amazon EC2 instance, the volume gets reattached automatically within one or two minutes. However, the OSD pod gets into a CrashLoopBackOff state. To recover and bring back the pod to Running state, you must restart the EC2 instance.

6.9. Enabling and disabling debug logs for rook-ceph-operator

Enable the debug logs for the rook-ceph-operator to obtain information about failures that help in troubleshooting issues.

Procedure

Enabling the debug logs
  1. Edit the configmap of the rook-ceph-operator.

    $ oc edit configmap rook-ceph-operator-config
  2. Add the ROOK_LOG_LEVEL: DEBUG parameter in the rook-ceph-operator-config yaml file to enable the debug logs for rook-ceph-operator.

    …
    data:
      # The logging level for the operator: INFO | DEBUG
      ROOK_LOG_LEVEL: DEBUG

    Now, the rook-ceph-operator logs consist of the debug information.
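
As a non-interactive alternative to editing the ConfigMap, a minimal sketch, assuming the ConfigMap is in the openshift-storage namespace and the operator deployment uses the default rook-ceph-operator name:

# Sketch: enable debug logging without opening an editor, then follow the operator logs.
oc -n openshift-storage patch configmap rook-ceph-operator-config --type merge -p '{"data":{"ROOK_LOG_LEVEL":"DEBUG"}}'
oc -n openshift-storage logs deployment/rook-ceph-operator -f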

Disabling the debug logs
  1. Edit the configmap of the rook-ceph-operator.

    $ oc edit configmap rook-ceph-operator-config
  2. Add the ROOK_LOG_LEVEL: INFO parameter in the rook-ceph-operator-config yaml file to disable the debug logs for rook-ceph-operator.

    …
    data:
      # The logging level for the operator: INFO | DEBUG
      ROOK_LOG_LEVEL: INFO

6.10. Resolving low Ceph monitor count alert

The CephMonLowNumber alert is displayed in the notification panel or Alert Center of the OpenShift Web Console to indicate a low Ceph monitor count when your internal mode deployment has five or more nodes, racks, or rooms, and when there are five or more failure domains in the deployment. You can increase the Ceph monitor count to improve the availability of the cluster.

Procedure

  1. In the CephMonLowNumber alert of the notification panel or Alert Center of OpenShift Web Console, click Configure.
  2. In the Configure Ceph Monitor pop up, click Update count.

    In the pop up, the recommended monitor count depending on the number of failure zones is shown.

  3. In the Configure CephMon pop up, update the monitor count value based on the recommended value and click Save changes.
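
To check how many monitor pods are running before and after the change, a minimal sketch (the app=rook-ceph-mon label is the usual Rook label and is an assumption here):

$ oc get pods -n openshift-storage -l app=rook-ceph-mon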

6.11. Troubleshooting unhealthy blocklisted nodes

6.11.1. ODFRBDClientBlocked

Meaning

This alert indicates that a RADOS Block Device (RBD) client might be blocked by Ceph on a specific node within your Kubernetes cluster. The blocklisting occurs when the ocs_rbd_client_blocklisted metric reports a value of 1 for the node. Additionally, there are pods in a CreateContainerError state on the same node. The blocklisting can potentially result in the filesystem for the Persistent Volume Claims (PVCs) using RBD becoming read-only. It is crucial to investigate this alert to prevent any disruption to your storage cluster.

Impact

High

Diagnosis

The blocklisting of an RBD client can occur due to several factors, such as network or cluster slowness. In certain cases, the exclusive lock contention among three contending clients (workload, mirror daemon, and manager/scheduler) can lead to the blocklist.

Mitigation

  1. Taint the blocklisted node: In Kubernetes, consider tainting the node that is blocklisted to trigger the eviction of pods to another node. This approach relies on the assumption that the unmounting/unmapping process progresses gracefully. Once the pods have been successfully evicted, the blocklisted node can be untainted, allowing the blocklist to be cleared. The pods can then be moved back to the untainted node.
  2. Reboot the blocklisted node: If tainting the node and evicting the pods do not resolve the blocklisting issue, a reboot of the blocklisted node can be attempted. This step may help alleviate any underlying issues causing the blocklist and restore normal functionality.
Important

Investigating and resolving the blocklist issue promptly is essential to avoid any further impact on the storage cluster.
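
For the taint-based approach in step 1, a minimal sketch (the taint key and value are purely illustrative, and the node name is a placeholder):

# Sketch: taint the blocklisted node to evict its pods, then remove the taint once the blocklist clears.
$ oc adm taint nodes <node-name> odf-blocklisted=true:NoExecute
$ oc adm taint nodes <node-name> odf-blocklisted=true:NoExecute-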
