Appendix B. Health messages of a Ceph cluster
There is a finite set of possible health messages that a Red Hat Ceph Storage cluster can raise. These are defined as health checks, each of which has a unique identifier. The identifier is a terse, pseudo-human-readable string that is intended to enable tools to make sense of health checks and to present them in a way that reflects their meaning.
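For example, a health check identifier can be passed directly to the `ceph` CLI to inspect or temporarily silence a specific check. The following is a minimal sketch, assuming a running cluster and using `OSD_DOWN` only as a stand-in for any of the health codes listed in the tables below:

```
# Show the cluster health status and the detail of every raised health check
ceph health detail

# Temporarily mute a specific check by its identifier, for example for one hour
ceph health mute OSD_DOWN 1h

# Remove the mute once the underlying issue has been resolved
ceph health unmute OSD_DOWN
```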
| Health Code | Description |
| --- | --- |
| `DAEMON_OLD_VERSION` | Warn if old versions of Ceph are running on any daemons. It will generate a health error if multiple versions are detected. |
| `MON_DOWN` | One or more Ceph Monitor daemons are currently down. |
| `MON_CLOCK_SKEW` | The clocks on the nodes running the `ceph-mon` daemons are not well synchronized. |
| `MON_MSGR2_NOT_ENABLED` | The `ms_bind_msgr2` option is enabled, but one or more Ceph Monitors are not configured to bind to a v2 port in the cluster's monmap. |
| `MON_DISK_LOW` | One or more Ceph Monitors are low on disk space. |
| `MON_DISK_CRIT` | One or more Ceph Monitors are critically low on disk space. |
| `MON_DISK_BIG` | The database size for one or more Ceph Monitors is very large. |
| `AUTH_INSECURE_GLOBAL_ID_RECLAIM` | One or more clients or daemons connected to the storage cluster are not securely reclaiming their `global_id` when reconnecting to a Ceph Monitor. |
| `AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED` | Ceph is currently configured to allow clients to reconnect to monitors using an insecure process to reclaim their previous `global_id`, because the setting `auth_allow_insecure_global_id_reclaim` is set to `true`. |
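As a hedged example of how the two `AUTH_INSECURE_GLOBAL_ID_RECLAIM` warnings above are commonly cleared, assuming all clients and daemons have already been updated so that insecure `global_id` reclaim is no longer needed:

```
# Stop permitting the insecure global_id reclaim process
ceph config set mon auth_allow_insecure_global_id_reclaim false
```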
| Health Code | Description |
| --- | --- |
| `MGR_DOWN` | All Ceph Manager daemons are currently down. |
| `MGR_MODULE_DEPENDENCY` | An enabled Ceph Manager module is failing its dependency check. |
| `MGR_MODULE_ERROR` | A Ceph Manager module has experienced an unexpected error. Typically, this means an unhandled exception was raised from the module's `serve` function. |
| Health Code | Description |
| --- | --- |
| `OSD_DOWN` | One or more OSDs are marked down. |
| `OSD_CRUSH_TYPE_DOWN` | All the OSDs within a particular CRUSH subtree are marked down, for example, all OSDs on a host. Examples include `OSD_HOST_DOWN` and `OSD_ROOT_DOWN`. |
| `OSD_ORPHAN` | An OSD is referenced in the CRUSH map hierarchy but does not exist. Remove the OSD by running `ceph osd crush rm osd.OSD_ID`. |
| `OSD_OUT_OF_ORDER_FULL` | The utilization thresholds for nearfull, backfillfull, full, or failsafefull are not ascending. Adjust the thresholds by running `ceph osd set-nearfull-ratio RATIO`, `ceph osd set-backfillfull-ratio RATIO`, and `ceph osd set-full-ratio RATIO`. |
| `OSD_FULL` | One or more OSDs has exceeded the full threshold and is preventing the storage cluster from servicing writes. Restore write availability by raising the full threshold by a small margin with `ceph osd set-full-ratio RATIO`. |
| `OSD_BACKFILLFULL` | One or more OSDs has exceeded the backfillfull threshold, which prevents data from being rebalanced to this device. |
| `OSD_NEARFULL` | One or more OSDs has exceeded the nearfull threshold. |
| `OSDMAP_FLAGS` | One or more storage cluster flags of interest have been set. These flags include full, pauserd, pausewr, noup, nodown, noin, noout, nobackfill, norecover, norebalance, noscrub, nodeep_scrub, and notieragent. Except for full, the flags can be set and cleared with `ceph osd set FLAG` and `ceph osd unset FLAG`. |
| `OSD_FLAGS` | One or more OSDs or CRUSH nodes has a flag of interest set. These flags include noup, nodown, noin, and noout. |
| `OLD_CRUSH_TUNABLES` | The CRUSH map is using very old settings and should be updated. |
| `OLD_CRUSH_STRAW_CALC_VERSION` | The CRUSH map is using an older, non-optimal method for calculating intermediate weight values for `straw` buckets. |
| `CACHE_POOL_NO_HIT_SET` | One or more cache pools are not configured with a hit set to track utilization, which prevents the tiering agent from identifying cold objects to flush and evict from the cache. Configure the hit sets on the cache pool with `ceph osd pool set POOL_NAME hit_set_type TYPE`, `ceph osd pool set POOL_NAME hit_set_period PERIOD_IN_SECONDS`, `ceph osd pool set POOL_NAME hit_set_count NUMBER_OF_HIT_SETS`, and `ceph osd pool set POOL_NAME hit_set_fpp TARGET_FALSE_POSITIVE_RATE`. |
| `OSD_NO_SORTBITWISE` | The `sortbitwise` flag is not set. Set the flag with `ceph osd set sortbitwise`. |
| `POOL_FULL` | One or more pools has reached its quota and is no longer allowing writes. Increase the pool quota with `ceph osd pool set-quota POOL_NAME max_objects NUMBER_OF_OBJECTS` and `ceph osd pool set-quota POOL_NAME max_bytes BYTES`, or delete some existing data to reduce utilization. |
| `BLUEFS_SPILLOVER` | One or more OSDs that use the BlueStore backend have been allocated db partitions, but that space has filled, such that metadata has "spilled over" onto the normal slow device. Disable this warning with `ceph config set osd bluestore_warn_on_bluefs_spillover false`. |
| `BLUEFS_AVAILABLE_SPACE` | The check reports three values: `BDEV_DB free`, `BDEV_SLOW free`, and `available_from_bluestore`. |
| `BLUEFS_LOW_SPACE` | The BlueStore File System (BlueFS) is running low on available free space and there is little `available_from_bluestore`. |
| `BLUESTORE_FRAGMENTATION` | As BlueStore works, free space on the underlying storage will get fragmented. This is normal and unavoidable, but excessive fragmentation will cause slowdown. |
| `BLUESTORE_LEGACY_STATFS` | BlueStore tracks its internal usage statistics on a per-pool granular basis, and one or more OSDs have BlueStore volumes. Disable the warning with `ceph config set global bluestore_warn_on_legacy_statfs false`. |
| `BLUESTORE_NO_PER_POOL_OMAP` | BlueStore tracks omap space utilization by pool. Disable the warning with `ceph config set global bluestore_warn_on_no_per_pool_omap false`. |
| `BLUESTORE_NO_PER_PG_OMAP` | BlueStore tracks omap space utilization by PG. Disable the warning with `ceph config set global bluestore_warn_on_no_per_pg_omap false`. |
| `BLUESTORE_DISK_SIZE_MISMATCH` | One or more OSDs using BlueStore has an internal inconsistency between the size of the physical device and the metadata tracking its size. |
| `BLUESTORE_NO_COMPRESSION` | One or more OSDs is unable to load a BlueStore compression plugin. This can be caused by a broken installation, in which the `ceph-osd` binary does not match the compression plugins, or by a recent upgrade that did not include a restart of the `ceph-osd` daemon. |
| `BLUESTORE_SPURIOUS_READ_ERRORS` | One or more OSDs using BlueStore detects spurious read errors on the main device. BlueStore has recovered from these errors by retrying disk reads. |
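The following sketch illustrates the flag and threshold handling referenced in the OSD table; the `noout` flag and the `0.97` ratio are only examples and should be adapted to the actual situation:

```
# Inspect which cluster-wide flags are currently set (OSDMAP_FLAGS)
ceph osd dump | grep flags

# Clear a flag of interest, for example noout
ceph osd unset noout

# Raise the full threshold by a small margin to restore writes (OSD_FULL)
ceph osd set-full-ratio 0.97
```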
| Health Code | Description |
| --- | --- |
| `DEVICE_HEALTH` | One or more devices are expected to fail soon, where the warning threshold is controlled by the `mgr/devicehealth/warn_threshold` configuration option. |
| `DEVICE_HEALTH_IN_USE` | One or more devices are expected to fail soon and have been marked "out" of the storage cluster based on `mgr/devicehealth/mark_out_threshold`, but they are still participating in one or more PGs. |
| `DEVICE_HEALTH_TOOMANY` | Too many devices are expected to fail soon and the `mgr/devicehealth/self_heal` behavior is enabled, such that marking out all of the ailing devices would exceed the cluster's `mon_osd_min_in_ratio` ratio, which prevents too many OSDs from being automatically marked out. |
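To investigate the device health warnings above, the collected metrics can be queried per device. A short sketch, where `DEVICE_ID` is a placeholder for an identifier taken from the `ceph device ls` output:

```
# List devices known to the cluster, the daemons using them, and their life expectancy
ceph device ls

# Show the health metrics recorded for one device
ceph device get-health-metrics DEVICE_ID
```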
| Health Code | Description |
| --- | --- |
| `PG_AVAILABILITY` | Data availability is reduced, meaning that the storage cluster is unable to service potential read or write requests for some data in the cluster. |
| `PG_DEGRADED` | Data redundancy is reduced for some data, meaning the storage cluster does not have the desired number of replicas for replicated pools or erasure code fragments. |
| `PG_RECOVERY_FULL` | Data redundancy might be reduced or at risk for some data due to a lack of free space in the storage cluster, specifically, one or more PGs has the `recovery_toofull` flag set, meaning that the cluster is unable to migrate or recover data because one or more OSDs is above the full threshold. |
| `PG_BACKFILL_FULL` | Data redundancy might be reduced or at risk for some data due to a lack of free space in the storage cluster, specifically, one or more PGs has the `backfill_toofull` flag set, meaning that the cluster is unable to migrate or recover data because one or more OSDs is above the backfillfull threshold. |
| `PG_DAMAGED` | Data scrubbing has discovered some problems with data consistency in the storage cluster, specifically, one or more PGs has the `inconsistent` or `snaptrim_error` flag set, indicating an earlier scrub operation found a problem, or the `repair` flag is set, meaning a repair for such an inconsistency is currently in progress. |
| `OSD_SCRUB_ERRORS` | Recent OSD scrubs have uncovered inconsistencies. |
| `OSD_TOO_MANY_REPAIRS` | When a read error occurs and another replica is available, it is used to repair the error immediately, so that the client can get the object data. |
| `LARGE_OMAP_OBJECTS` | One or more pools contain large omap objects as determined by `osd_deep_scrub_large_omap_object_key_threshold`, the threshold for the number of keys, or `osd_deep_scrub_large_omap_object_value_sum_threshold`, the threshold for the summed size in bytes of all key values, or both. |
| `CACHE_POOL_NEAR_FULL` | A cache tier pool is nearly full. Adjust the cache pool target size with `ceph osd pool set CACHE_POOL_NAME target_max_bytes AMOUNT_OF_BYTES` and `ceph osd pool set CACHE_POOL_NAME target_max_objects NUMBER_OF_OBJECTS`. |
| `TOO_FEW_PGS` | The number of PGs in use in the storage cluster is below the configurable threshold of `mon_pg_warn_min_per_osd` PGs per OSD. |
| `POOL_PG_NUM_NOT_POWER_OF_TWO` | One or more pools has a `pg_num` value that is not a power of two. Disable the warning with `ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false`. |
| `POOL_TOO_FEW_PGS` | One or more pools should probably have more PGs, based on the amount of data that is currently stored in the pool. You can either disable auto-scaling of PGs with `ceph osd pool set POOL_NAME pg_autoscale_mode off`, automatically adjust the number of PGs with `ceph osd pool set POOL_NAME pg_autoscale_mode on`, or manually set the number of PGs with `ceph osd pool set POOL_NAME pg_num NEW_PG_NUMBER`. |
| `TOO_MANY_PGS` | The number of PGs in use in the storage cluster is above the configurable threshold of `mon_max_pg_per_osd` PGs per OSD. |
| `POOL_TOO_MANY_PGS` | One or more pools should probably have fewer PGs, based on the amount of data that is currently stored in the pool. You can either disable auto-scaling of PGs with `ceph osd pool set POOL_NAME pg_autoscale_mode off`, automatically adjust the number of PGs with `ceph osd pool set POOL_NAME pg_autoscale_mode on`, or manually set the number of PGs with `ceph osd pool set POOL_NAME pg_num NEW_PG_NUMBER`. |
| `POOL_TARGET_SIZE_BYTES_OVERCOMMITTED` | One or more pools have a `target_size_bytes` property set to estimate the expected size of the pool, but the values exceed the total available storage. |
| `POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO` | One or more pools have both `target_size_bytes` and `target_size_ratio` set to estimate the expected size of the pool. |
| `TOO_FEW_OSDS` | The number of OSDs in the storage cluster is below the configurable threshold of `osd_pool_default_size`. |
| `SMALLER_PGP_NUM` | One or more pools has a `pgp_num` value less than `pg_num`. This is normally an indication that the PG count was increased without also increasing the placement behavior. |
| `MANY_OBJECTS_PER_PG` | One or more pools has an average number of objects per PG that is significantly higher than the overall storage cluster average. The specific threshold is controlled by the `mon_pg_warn_max_object_skew` configuration value. |
| `POOL_APP_NOT_ENABLED` | A pool exists that contains one or more objects but has not been tagged for use by a particular application. Resolve this warning by labeling the pool for use by an application, for example with `rbd pool init POOL_NAME`. |
| `POOL_FULL` | One or more pools has reached its quota. The threshold to trigger this error condition is controlled by the `mon_pool_quota_crit_threshold` configuration option. |
| `POOL_NEAR_FULL` | One or more pools is approaching a configured fullness threshold. Adjust the pool quotas with `ceph osd pool set-quota POOL_NAME max_objects NUMBER_OF_OBJECTS` and `ceph osd pool set-quota POOL_NAME max_bytes BYTES`. |
| `OBJECT_MISPLACED` | One or more objects in the storage cluster are not stored on the node the storage cluster would like them to be stored on. This is an indication that data migration due to some recent storage cluster change has not yet completed. |
| `OBJECT_UNFOUND` | One or more objects in the storage cluster cannot be found, specifically, the OSDs know that a new or updated copy of an object should exist, but a copy of that version of the object has not been found on OSDs that are currently online. |
| `SLOW_OPS` | One or more OSD or monitor requests is taking a long time to process. This can be an indication of extreme load, a slow storage device, or a software bug. |
| `PG_NOT_SCRUBBED` | One or more PGs has not been scrubbed recently. PGs are normally scrubbed within the interval specified globally by `osd_scrub_max_interval`. Initiate the scrub with `ceph pg scrub PG_ID`. |
| `PG_NOT_DEEP_SCRUBBED` | One or more PGs has not been deep scrubbed recently. Initiate the scrub with `ceph pg deep-scrub PG_ID`. |
| `PG_SLOW_SNAP_TRIMMING` | The snapshot trim queue for one or more PGs has exceeded the configured warning threshold. This indicates that either an extremely large number of snapshots were recently deleted, or that the OSDs are unable to trim snapshots quickly enough to keep up with the rate of new snapshot deletions. |
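As an illustration of the scrub-related checks in the table above; PG `2.5` is only an example placeholder for a real PG ID from your cluster:

```
# List placement groups and their current state
ceph pg ls

# Manually initiate a scrub or a deep scrub of one PG (PG_NOT_SCRUBBED, PG_NOT_DEEP_SCRUBBED)
ceph pg scrub 2.5
ceph pg deep-scrub 2.5
```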
| Health Code | Description |
| --- | --- |
| `RECENT_CRASH` | One or more Ceph daemons has crashed recently, and the crash has not yet been acknowledged by the administrator. |
| `TELEMETRY_CHANGED` | Telemetry has been enabled, but the contents of the telemetry report have changed since that time, so telemetry reports will not be sent. |
| `AUTH_BAD_CAPS` | One or more auth users has capabilities that cannot be parsed by the monitor. Update the capabilities of the user with the `ceph auth caps` command. |
| `OSD_NO_DOWN_OUT_INTERVAL` | The `mon_osd_down_out_interval` option is set to zero, which means that the system does not automatically perform any repair or healing operations after an OSD fails. |
| `DASHBOARD_DEBUG` | The Dashboard debug mode is enabled. This means that if there is an error while processing a REST API request, the HTTP error response contains a Python traceback. Disable the debug mode with `ceph dashboard debug disable`. |
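As a sketch of how the `RECENT_CRASH` warning is typically acknowledged; `CRASH_ID` is a placeholder for an identifier taken from the `ceph crash ls-new` output:

```
# List crash reports that have not yet been acknowledged
ceph crash ls-new

# Inspect a single crash report, then archive it so it no longer raises RECENT_CRASH
ceph crash info CRASH_ID
ceph crash archive CRASH_ID

# Alternatively, acknowledge all new crash reports at once
ceph crash archive-all
```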