Chapter 1. Initial Troubleshooting


This chapter includes information on identifying problems (Section 1.1), understanding the output of the ceph health command (Section 1.2), and understanding Ceph logs (Section 1.3).

1.1. Identifying Problems

To determine possible causes of the error you encounter with Red Hat Ceph Storage, answer the following questions:

  1. Is your configuration supported? Certain problems can arise when using unsupported configurations. See the Red Hat Ceph Storage: Supported configurations article for details.
  2. Do you know which Ceph component causes the problem?

1.1.1. Diagnosing the Health of a Ceph Storage Cluster

This procedure lists basic steps to diagnose the health of a Ceph Storage Cluster.

  1. Check the overall status of the cluster:

    # ceph health detail

    If the command returns HEALTH_WARN or HEALTH_ERR, see Section 1.2, “Understanding the Output of the ceph health Command” for details.

  2. Check the Ceph logs for any error messages listed in Section 1.3, “Understanding Ceph Logs”. By default, the logs are located in the /var/log/ceph/ directory.
  3. If the logs do not include a sufficient amount of information, increase the debugging level and try to reproduce the action that failed. See Chapter 2, Configuring Logging for details.
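The log check in step 2 can be scripted. The following is a minimal sketch that scans a log file for some of the error strings covered later in this chapter; the pattern list, the sample log line, and the temporary file path are illustrative, not exhaustive or real:

```shell
#!/bin/sh
# Sketch: scan a Ceph log for common error strings from this chapter.
# The pattern list below is illustrative, not exhaustive.
scan_log() {
    grep -E 'clock skew|slow requests|wrongly marked me down|heartbeat_check|out of quorum' "$1"
}

# Demo on a fabricated log line; real logs live in /var/log/ceph/ by default.
printf '2024-01-01 12:00:00 mon.a message repeated: clock skew detected\n' \
    > /tmp/sample-ceph.log
scan_log /tmp/sample-ceph.log
```

In a real cluster, you would point scan_log at the main cluster log or an individual daemon log under /var/log/ceph/.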

1.2. Understanding the Output of the ceph health Command

The ceph health command returns information about the status of the Ceph Storage Cluster:

  • HEALTH_OK indicates that the cluster is healthy.
  • HEALTH_WARN indicates a warning. In some cases, the Ceph status returns to HEALTH_OK automatically, for example when Ceph finishes the rebalancing process. However, consider further troubleshooting if a cluster stays in the HEALTH_WARN state for a longer time.
  • HEALTH_ERR indicates a more serious problem that requires your immediate attention.

Use the ceph health detail and ceph -s commands to get a more detailed output.
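A monitoring script can branch on the first field of the ceph health output. The following is a minimal sketch; a hardcoded sample string stands in for real command output, which would come from something like status=$(ceph health | awk '{print $1}'):

```shell
#!/bin/sh
# Sketch: branch on the cluster health state in a monitoring script.
# The sample string below stands in for real `ceph health` output.
sample='HEALTH_WARN 1 osds down'
status=${sample%% *}   # first whitespace-separated field

case "$status" in
    HEALTH_OK)   echo "cluster healthy" ;;
    HEALTH_WARN) echo "warning: investigate if the state persists" ;;
    HEALTH_ERR)  echo "error: immediate attention required" ;;
esac
```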

The following tables list the most common HEALTH_ERR and HEALTH_WARN error messages related to Monitors, OSDs, and placement groups. The tables provide links to corresponding sections that explain the errors and point to specific procedures to fix problems.

Table 1.1. Error Messages Related to Monitors

  Error message                     See
  HEALTH_WARN
    mon.X is down (out of quorum)     Section 4.1.1, “A Monitor Is Out of Quorum”
    clock skew                        Section 4.1.2, “Clock Skew”
    store is getting too big!         Section 4.1.3, “The Monitor Store is Getting Too Big”

Table 1.2. Error Messages Related to OSDs

  Error message            See
  HEALTH_ERR
    full osds                Section 5.1.1, “Full OSDs”
  HEALTH_WARN
    nearfull osds            Section 5.1.2, “Nearfull OSDs”
    osds are down            Section 5.1.3, “One or More OSDs Are Down”
                             Section 5.1.4, “Flapping OSDs”
    requests are blocked     Section 5.1.5, “Slow Requests, and Requests are Blocked”
    slow requests            Section 5.1.5, “Slow Requests, and Requests are Blocked”

Table 1.3. Error Messages Related to Placement Groups

  Error message          See
  HEALTH_ERR
    pgs down               Section 6.1.5, “Placement Groups Are down”
    pgs inconsistent       Section 6.1.2, “Inconsistent Placement Groups”
    scrub errors           Section 6.1.2, “Inconsistent Placement Groups”
  HEALTH_WARN
    pgs stale              Section 6.1.1, “Stale Placement Groups”
    unfound                Section 6.1.6, “Unfound Objects”

1.3. Understanding Ceph Logs

By default, Ceph stores its logs in the /var/log/ceph/ directory.

The <cluster-name>.log file is the main cluster log that includes global cluster events. By default, this log is named ceph.log. Only the Monitor hosts include the main cluster log.

Each OSD and Monitor has its own log file, named <cluster-name>-osd.<number>.log and <cluster-name>-mon.<hostname>.log.

When you increase the debugging level for Ceph subsystems, Ceph generates new log files for those subsystems as well. For details about logging, see Chapter 2, Configuring Logging.
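Assuming the default cluster name of ceph, and a hypothetical OSD 0 and a Monitor on a host named host1, the naming scheme above works out to:

```shell
#!/bin/sh
# Sketch: log file names produced by the naming scheme described above.
# The cluster name "ceph" is the default; the OSD number and hostname are
# hypothetical examples.
cluster=ceph
echo "/var/log/ceph/${cluster}.log"            # main cluster log (Monitor hosts only)
echo "/var/log/ceph/${cluster}-osd.0.log"      # log of OSD 0
echo "/var/log/ceph/${cluster}-mon.host1.log"  # log of the Monitor on host1
```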

The following tables list the most common Ceph log error messages related to Monitors and OSDs. The tables provide links to corresponding sections that explain the errors and point to specific procedures to fix them.

Table 1.4. Common Error Messages in Ceph Logs Related to Monitors

  Error message                          Log file          See
  clock skew                             Main cluster log  Section 4.1.2, “Clock Skew”
  clocks not synchronized                Main cluster log  Section 4.1.2, “Clock Skew”
  Corruption: error in middle of record  Monitor log       Section 4.1.1, “A Monitor Is Out of Quorum”; Section 4.3, “Recovering the Monitor Store”
  Corruption: 1 missing files            Monitor log       Section 4.1.1, “A Monitor Is Out of Quorum”; Section 4.3, “Recovering the Monitor Store”
  Caught signal (Bus error)              Monitor log       Section 4.1.1, “A Monitor Is Out of Quorum”

Table 1.5. Common Error Messages in Ceph Logs Related to OSDs

  Error message                              Log file          See
  heartbeat_check: no reply from osd.X       Main cluster log  Section 5.1.4, “Flapping OSDs”
  wrongly marked me down                     Main cluster log  Section 5.1.4, “Flapping OSDs”
  osds have slow requests                    Main cluster log  Section 5.1.5, “Slow Requests, and Requests are Blocked”
  FAILED assert(!m_filestore_fail_eio)       OSD log           Section 5.1.3, “One or More OSDs Are Down”
  FAILED assert(0 == "hit suicide timeout")  OSD log           Section 5.1.3, “One or More OSDs Are Down”
