Appendix B. Health messages of a Ceph cluster
There is a finite set of possible health messages that a Red Hat Ceph Storage cluster can raise. These are defined as health checks, each of which has a unique identifier. The identifier is a terse, pseudo-human-readable string that is intended to enable tools to make sense of health checks and to present them in a way that reflects their meaning.
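For example, a health check identifier can be passed directly to the `ceph` CLI to inspect or temporarily silence a specific check. The following is a minimal sketch, assuming a running cluster and using `OSD_DOWN` only as a stand-in for any of the health codes listed in the tables below:

```
# Show the cluster health status and the detail of every raised health check
ceph health detail

# Temporarily mute a specific check by its identifier, for example for one hour
ceph health mute OSD_DOWN 1h

# Remove the mute once the underlying issue has been resolved
ceph health unmute OSD_DOWN
```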
| Health Code | Description |
| --- | --- |
| `DAEMON_OLD_VERSION` | Warn if old versions of Ceph are running on any daemons. It will generate a health error if multiple versions are detected. |
| `MON_DOWN` | One or more Ceph Monitor daemons are currently down. |
| `MON_CLOCK_SKEW` | The clocks on the nodes running the `ceph-mon` daemons are not well synchronized. |
| `MON_MSGR2_NOT_ENABLED` | The `ms_bind_msgr2` option is enabled, but one or more Ceph Monitors are not configured to bind to a v2 port in the cluster's monmap. |
| `MON_DISK_LOW` | One or more Ceph Monitors are low on disk space. |
| `MON_DISK_CRIT` | One or more Ceph Monitors are critically low on disk space. |
| `MON_DISK_BIG` | The database size for one or more Ceph Monitors is very large. |
| `AUTH_INSECURE_GLOBAL_ID_RECLAIM` | One or more clients or daemons connected to the storage cluster are not securely reclaiming their `global_id` when reconnecting to a Ceph Monitor. |
| `AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED` | Ceph is currently configured to allow clients to reconnect to monitors using an insecure process to reclaim their previous `global_id`, because the setting `auth_allow_insecure_global_id_reclaim` is set to `true`. |
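As a hedged example of how the two `AUTH_INSECURE_GLOBAL_ID_RECLAIM` warnings above are commonly cleared, assuming all clients and daemons have already been updated so that insecure `global_id` reclaim is no longer needed:

```
# Stop permitting the insecure global_id reclaim process
ceph config set mon auth_allow_insecure_global_id_reclaim false
```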
| Health Code | Description |
| --- | --- |
| `MGR_DOWN` | All Ceph Manager daemons are currently down. |
| `MGR_MODULE_DEPENDENCY` | An enabled Ceph Manager module is failing its dependency check. |
| `MGR_MODULE_ERROR` | A Ceph Manager module has experienced an unexpected error. Typically, this means an unhandled exception was raised from the module's `serve` function. |
| Health Code | Description |
| --- | --- |
| `OSD_DOWN` | One or more OSDs are marked down. |
| `OSD_CRUSH_TYPE_DOWN` | All the OSDs within a particular CRUSH subtree are marked down, for example, all OSDs on a host. Examples include `OSD_HOST_DOWN` and `OSD_ROOT_DOWN`. |
| `OSD_ORPHAN` | An OSD is referenced in the CRUSH map hierarchy but does not exist. Remove the OSD by running `ceph osd crush rm osd.OSD_ID`. |
| `OSD_OUT_OF_ORDER_FULL` | The utilization thresholds for nearfull, backfillfull, full, or failsafefull are not ascending. Adjust the thresholds by running `ceph osd set-nearfull-ratio RATIO`, `ceph osd set-backfillfull-ratio RATIO`, and `ceph osd set-full-ratio RATIO`. |
| `OSD_FULL` | One or more OSDs has exceeded the full threshold and is preventing the storage cluster from servicing writes. Restore write availability by raising the full threshold by a small margin with `ceph osd set-full-ratio RATIO`. |
| `OSD_BACKFILLFULL` | One or more OSDs has exceeded the backfillfull threshold, which prevents data from being rebalanced to this device. |
| `OSD_NEARFULL` | One or more OSDs has exceeded the nearfull threshold. |
| `OSDMAP_FLAGS` | One or more storage cluster flags of interest have been set. These flags include full, pauserd, pausewr, noup, nodown, noin, noout, nobackfill, norecover, norebalance, noscrub, nodeep_scrub, and notieragent. Except for full, the flags can be set and cleared with `ceph osd set FLAG` and `ceph osd unset FLAG`. |
| `OSD_FLAGS` | One or more OSDs or CRUSH nodes has a flag of interest set. These flags include noup, nodown, noin, and noout. |
| `OLD_CRUSH_TUNABLES` | The CRUSH map is using very old settings and should be updated. |
| `OLD_CRUSH_STRAW_CALC_VERSION` | The CRUSH map is using an older, non-optimal method for calculating intermediate weight values for `straw` buckets. |
| `CACHE_POOL_NO_HIT_SET` | One or more cache pools are not configured with a hit set to track utilization, which prevents the tiering agent from identifying cold objects to flush and evict from the cache. Configure the hit sets on the cache pool with `ceph osd pool set POOL_NAME hit_set_type TYPE`, `ceph osd pool set POOL_NAME hit_set_period PERIOD_IN_SECONDS`, `ceph osd pool set POOL_NAME hit_set_count NUMBER_OF_HIT_SETS`, and `ceph osd pool set POOL_NAME hit_set_fpp TARGET_FALSE_POSITIVE_RATE`. |
| `OSD_NO_SORTBITWISE` | The `sortbitwise` flag is not set. Set the flag with `ceph osd set sortbitwise`. |
| `POOL_FULL` | One or more pools has reached its quota and is no longer allowing writes. Increase the pool quota with `ceph osd pool set-quota POOL_NAME max_objects NUMBER_OF_OBJECTS` and `ceph osd pool set-quota POOL_NAME max_bytes BYTES`, or delete some existing data to reduce utilization. |
| `BLUEFS_SPILLOVER` | One or more OSDs that use the BlueStore backend have been allocated db partitions, but that space has filled, such that metadata has "spilled over" onto the normal slow device. Disable this warning with `ceph config set osd bluestore_warn_on_bluefs_spillover false`. |
| `BLUEFS_AVAILABLE_SPACE` | The check reports three values: `BDEV_DB free`, `BDEV_SLOW free`, and `available_from_bluestore`. |
| `BLUEFS_LOW_SPACE` | The BlueStore File System (BlueFS) is running low on available free space and there is little `available_from_bluestore`. |
| `BLUESTORE_FRAGMENTATION` | As BlueStore works, free space on the underlying storage will get fragmented. This is normal and unavoidable, but excessive fragmentation will cause slowdown. |
| `BLUESTORE_LEGACY_STATFS` | BlueStore tracks its internal usage statistics on a per-pool granular basis, and one or more OSDs have BlueStore volumes. Disable the warning with `ceph config set global bluestore_warn_on_legacy_statfs false`. |
| `BLUESTORE_NO_PER_POOL_OMAP` | BlueStore tracks omap space utilization by pool. Disable the warning with `ceph config set global bluestore_warn_on_no_per_pool_omap false`. |
| `BLUESTORE_NO_PER_PG_OMAP` | BlueStore tracks omap space utilization by PG. Disable the warning with `ceph config set global bluestore_warn_on_no_per_pg_omap false`. |
| `BLUESTORE_DISK_SIZE_MISMATCH` | One or more OSDs using BlueStore has an internal inconsistency between the size of the physical device and the metadata tracking its size. |
| `BLUESTORE_NO_COMPRESSION` | One or more OSDs is unable to load a BlueStore compression plugin. This can be caused by a broken installation, in which the `ceph-osd` binary does not match the compression plugins, or by a recent upgrade that did not include a restart of the `ceph-osd` daemon. |
| `BLUESTORE_SPURIOUS_READ_ERRORS` | One or more OSDs using BlueStore detects spurious read errors on the main device. BlueStore has recovered from these errors by retrying disk reads. |
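The following sketch illustrates the flag and threshold handling referenced in the OSD table; the `noout` flag and the `0.97` ratio are only examples and should be adapted to the actual situation:

```
# Inspect which cluster-wide flags are currently set (OSDMAP_FLAGS)
ceph osd dump | grep flags

# Clear a flag of interest, for example noout
ceph osd unset noout

# Raise the full threshold by a small margin to restore writes (OSD_FULL)
ceph osd set-full-ratio 0.97
```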
| Health Code | Description |
| --- | --- |
| `DEVICE_HEALTH` | One or more devices are expected to fail soon, where the warning threshold is controlled by the `mgr/devicehealth/warn_threshold` configuration option. |
| `DEVICE_HEALTH_IN_USE` | One or more devices are expected to fail soon and have been marked "out" of the storage cluster based on `mgr/devicehealth/mark_out_threshold`, but they are still participating in one or more PGs. |
| `DEVICE_HEALTH_TOOMANY` | Too many devices are expected to fail soon and the `mgr/devicehealth/self_heal` behavior is enabled, such that marking out all of the ailing devices would exceed the cluster's `mon_osd_min_in_ratio` ratio, which prevents too many OSDs from being automatically marked out. |
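To investigate the device health warnings above, the collected metrics can be queried per device. A short sketch, where `DEVICE_ID` is a placeholder for an identifier taken from the `ceph device ls` output:

```
# List devices known to the cluster, the daemons using them, and their life expectancy
ceph device ls

# Show the health metrics recorded for one device
ceph device get-health-metrics DEVICE_ID
```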
| Health Code | Description |
| --- | --- |
| `PG_AVAILABILITY` | Data availability is reduced, meaning that the storage cluster is unable to service potential read or write requests for some data in the cluster. |
| `PG_DEGRADED` | Data redundancy is reduced for some data, meaning the storage cluster does not have the desired number of replicas for replicated pools or erasure code fragments. |
| `PG_RECOVERY_FULL` | Data redundancy might be reduced or at risk for some data due to a lack of free space in the storage cluster, specifically, one or more PGs has the `recovery_toofull` flag set, meaning that the cluster is unable to migrate or recover data because one or more OSDs is above the full threshold. |
| `PG_BACKFILL_FULL` | Data redundancy might be reduced or at risk for some data due to a lack of free space in the storage cluster, specifically, one or more PGs has the `backfill_toofull` flag set, meaning that the cluster is unable to migrate or recover data because one or more OSDs is above the backfillfull threshold. |
| `PG_DAMAGED` | Data scrubbing has discovered some problems with data consistency in the storage cluster, specifically, one or more PGs has the `inconsistent` or `snaptrim_error` flag set, indicating an earlier scrub operation found a problem, or the `repair` flag is set, meaning a repair for such an inconsistency is currently in progress. |
| `OSD_SCRUB_ERRORS` | Recent OSD scrubs have uncovered inconsistencies. |
| `OSD_TOO_MANY_REPAIRS` | When a read error occurs and another replica is available, it is used to repair the error immediately, so that the client can get the object data. |
| `LARGE_OMAP_OBJECTS` | One or more pools contain large omap objects as determined by `osd_deep_scrub_large_omap_object_key_threshold`, the threshold for the number of keys, or `osd_deep_scrub_large_omap_object_value_sum_threshold`, the threshold for the summed size in bytes of all key values, or both. |
| `CACHE_POOL_NEAR_FULL` | A cache tier pool is nearly full. Adjust the cache pool target size with `ceph osd pool set CACHE_POOL_NAME target_max_bytes AMOUNT_OF_BYTES` and `ceph osd pool set CACHE_POOL_NAME target_max_objects NUMBER_OF_OBJECTS`. |
| `TOO_FEW_PGS` | The number of PGs in use in the storage cluster is below the configurable threshold of `mon_pg_warn_min_per_osd` PGs per OSD. |
| `POOL_PG_NUM_NOT_POWER_OF_TWO` | One or more pools has a `pg_num` value that is not a power of two. Disable the warning with `ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false`. |
| `POOL_TOO_FEW_PGS` | One or more pools should probably have more PGs, based on the amount of data that is currently stored in the pool. You can either disable auto-scaling of PGs with `ceph osd pool set POOL_NAME pg_autoscale_mode off`, automatically adjust the number of PGs with `ceph osd pool set POOL_NAME pg_autoscale_mode on`, or manually set the number of PGs with `ceph osd pool set POOL_NAME pg_num NEW_PG_NUMBER`. |
| `TOO_MANY_PGS` | The number of PGs in use in the storage cluster is above the configurable threshold of `mon_max_pg_per_osd` PGs per OSD. |
| `POOL_TOO_MANY_PGS` | One or more pools should probably have fewer PGs, based on the amount of data that is currently stored in the pool. You can either disable auto-scaling of PGs with `ceph osd pool set POOL_NAME pg_autoscale_mode off`, automatically adjust the number of PGs with `ceph osd pool set POOL_NAME pg_autoscale_mode on`, or manually set the number of PGs with `ceph osd pool set POOL_NAME pg_num NEW_PG_NUMBER`. |
| `POOL_TARGET_SIZE_BYTES_OVERCOMMITTED` | One or more pools have a `target_size_bytes` property set to estimate the expected size of the pool, but the values exceed the total available storage. |
| `POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO` | One or more pools have both `target_size_bytes` and `target_size_ratio` set to estimate the expected size of the pool. |
| `TOO_FEW_OSDS` | The number of OSDs in the storage cluster is below the configurable threshold of `osd_pool_default_size`. |
| `SMALLER_PGP_NUM` | One or more pools has a `pgp_num` value less than `pg_num`. This is normally an indication that the PG count was increased without also increasing the placement behavior. |
| `MANY_OBJECTS_PER_PG` | One or more pools has an average number of objects per PG that is significantly higher than the overall storage cluster average. The specific threshold is controlled by the `mon_pg_warn_max_object_skew` configuration value. |
| `POOL_APP_NOT_ENABLED` | A pool exists that contains one or more objects but has not been tagged for use by a particular application. Resolve this warning by labeling the pool for use by an application, for example with `rbd pool init POOL_NAME`. |
| `POOL_FULL` | One or more pools has reached its quota. The threshold to trigger this error condition is controlled by the `mon_pool_quota_crit_threshold` configuration option. |
| `POOL_NEAR_FULL` | One or more pools is approaching a configured fullness threshold. Adjust the pool quotas with `ceph osd pool set-quota POOL_NAME max_objects NUMBER_OF_OBJECTS` and `ceph osd pool set-quota POOL_NAME max_bytes BYTES`. |
| `OBJECT_MISPLACED` | One or more objects in the storage cluster are not stored on the node the storage cluster would like them to be stored on. This is an indication that data migration due to some recent storage cluster change has not yet completed. |
| `OBJECT_UNFOUND` | One or more objects in the storage cluster cannot be found, specifically, the OSDs know that a new or updated copy of an object should exist, but a copy of that version of the object has not been found on OSDs that are currently online. |
| `SLOW_OPS` | One or more OSD or monitor requests is taking a long time to process. This can be an indication of extreme load, a slow storage device, or a software bug. |
| `PG_NOT_SCRUBBED` | One or more PGs has not been scrubbed recently. PGs are normally scrubbed within the interval specified globally by `osd_scrub_max_interval`. Initiate the scrub with `ceph pg scrub PG_ID`. |
| `PG_NOT_DEEP_SCRUBBED` | One or more PGs has not been deep scrubbed recently. Initiate the scrub with `ceph pg deep-scrub PG_ID`. |
| `PG_SLOW_SNAP_TRIMMING` | The snapshot trim queue for one or more PGs has exceeded the configured warning threshold. This indicates that either an extremely large number of snapshots were recently deleted, or that the OSDs are unable to trim snapshots quickly enough to keep up with the rate of new snapshot deletions. |
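As an illustration of the scrub-related checks in the table above; PG `2.5` is only an example placeholder for a real PG ID from your cluster:

```
# List placement groups and their current state
ceph pg ls

# Manually initiate a scrub or a deep scrub of one PG (PG_NOT_SCRUBBED, PG_NOT_DEEP_SCRUBBED)
ceph pg scrub 2.5
ceph pg deep-scrub 2.5
```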
| Health Code | Description |
| --- | --- |
| `RECENT_CRASH` | One or more Ceph daemons has crashed recently, and the crash has not yet been acknowledged by the administrator. |
| `TELEMETRY_CHANGED` | Telemetry has been enabled, but the contents of the telemetry report have changed since that time, so telemetry reports will not be sent. |
| `AUTH_BAD_CAPS` | One or more auth users has capabilities that cannot be parsed by the monitor. Update the capabilities of the user with the `ceph auth caps` command. |
| `OSD_NO_DOWN_OUT_INTERVAL` | The `mon_osd_down_out_interval` option is set to zero, which means that the system does not automatically perform any repair or healing operations after an OSD fails. |
| `DASHBOARD_DEBUG` | The Dashboard debug mode is enabled. This means that if there is an error while processing a REST API request, the HTTP error response contains a Python traceback. Disable the debug mode with `ceph dashboard debug disable`. |
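As a sketch of how the `RECENT_CRASH` warning is typically acknowledged; `CRASH_ID` is a placeholder for an identifier taken from the `ceph crash ls-new` output:

```
# List crash reports that have not yet been acknowledged
ceph crash ls-new

# Inspect a single crash report, then archive it so it no longer raises RECENT_CRASH
ceph crash info CRASH_ID
ceph crash archive CRASH_ID

# Alternatively, acknowledge all new crash reports at once
ceph crash archive-all
```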