Chapter 1. Initial Troubleshooting

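As a storage administrator, you can do the initial troubleshooting of a Red Hat Ceph Storage cluster before contacting Red Hat support. This chapter covers identifying problems, diagnosing the health of a storage cluster, understanding and muting health alerts, understanding Ceph logs, and generating an sos report.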

Prerequisites

A running Red Hat Ceph Storage cluster.

1.1. Identifying problems

To determine possible causes of the error with the Red Hat Ceph Storage cluster, answer the questions in the Procedure section.

Prerequisites

A running Red Hat Ceph Storage cluster.

Procedure

Certain problems can arise when using unsupported configurations. Ensure that your configuration is supported.
Do you know which Ceph component causes the problem?
1. No. Follow the Diagnosing the health of a Ceph storage cluster procedure in the Red Hat Ceph Storage Troubleshooting Guide.
2. Ceph Monitors. See the Troubleshooting Ceph Monitors section in the Red Hat Ceph Storage Troubleshooting Guide.
3. Ceph OSDs. See the Troubleshooting Ceph OSDs section in the Red Hat Ceph Storage Troubleshooting Guide.
4. Ceph placement groups. See the Troubleshooting Ceph placement groups section in the Red Hat Ceph Storage Troubleshooting Guide.
5. Multi-site Ceph Object Gateway. See the Troubleshooting a multi-site Ceph Object Gateway section in the Red Hat Ceph Storage Troubleshooting Guide.

Additional Resources

See the Red Hat Ceph Storage: Supported configurations article for details.

1.2. Diagnosing the health of a storage cluster

This procedure lists basic steps to diagnose the health of a Red Hat Ceph Storage cluster.

Prerequisites

A running Red Hat Ceph Storage cluster.

Procedure

Log into the Cephadm shell:
Example
```
[root@host01 ~]# cephadm shell
```
Check the overall status of the storage cluster:
Example
```
[ceph: root@host01 /]# ceph health detail
```
If the command returns HEALTH_WARN or HEALTH_ERR, see Understanding Ceph health for details.
Monitor the logs of the storage cluster:
Example
```
[ceph: root@host01 /]# ceph -W cephadm
```
To capture the logs of the cluster to a file, run the following commands:
Example
```
[ceph: root@host01 /]# ceph config set global log_to_file true
[ceph: root@host01 /]# ceph config set global mon_cluster_log_to_file true
```
The logs are located by default in the /var/log/ceph/CLUSTER_FSID/ directory. Check the Ceph logs for any error messages listed in Understanding Ceph logs.
If the logs do not include a sufficient amount of information, increase the debugging level and try to reproduce the action that failed. See Configuring logging for details.

1.3. Understanding Ceph health

The ceph health command returns information about the status of the Red Hat Ceph Storage cluster:

HEALTH_OK indicates that the cluster is healthy.
HEALTH_WARN indicates a warning. In some cases, the Ceph status returns to HEALTH_OK automatically, for example when the Red Hat Ceph Storage cluster finishes the rebalancing process. However, consider further troubleshooting if a cluster stays in the HEALTH_WARN state for a longer time.
HEALTH_ERR indicates a more serious problem that requires your immediate attention.

Use the ceph health detail and ceph -s commands to get a more detailed output.
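The three states above can be turned into a simple triage rule. The following Python sketch is illustrative only: the function names and the one-hour grace period for HEALTH_WARN are assumptions made for this example, not values defined by Red Hat Ceph Storage.

```python
from enum import Enum


class HealthStatus(Enum):
    OK = "HEALTH_OK"
    WARN = "HEALTH_WARN"
    ERR = "HEALTH_ERR"


def parse_health_status(output: str) -> HealthStatus:
    """Read the status keyword that starts the `ceph health` output."""
    tokens = output.split()
    if not tokens:
        raise ValueError("empty ceph health output")
    return HealthStatus(tokens[0])


def needs_troubleshooting(status: HealthStatus,
                          seconds_in_state: float = 0.0,
                          warn_grace_seconds: float = 3600.0) -> bool:
    """Decide whether to troubleshoot now.

    HEALTH_ERR always needs attention. HEALTH_WARN needs attention only
    after it has lasted longer than the grace period, which is a
    hypothetical threshold chosen for this sketch.
    """
    if status is HealthStatus.ERR:
        return True
    if status is HealthStatus.WARN:
        return seconds_in_state > warn_grace_seconds
    return False


if __name__ == "__main__":
    status = parse_health_status("HEALTH_WARN 1 osds down")
    print(status.value, needs_troubleshooting(status, seconds_in_state=7200))
```

Pick a grace period that matches how long rebalancing normally takes in your environment.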

Note

A health warning is displayed if there is no mgr daemon running. If the last mgr daemon of a Red Hat Ceph Storage cluster was removed, you can manually deploy a mgr daemon on a random host of the Red Hat Ceph Storage cluster. See the Manually deploying a mgr daemon section in the Red Hat Ceph Storage 6 Administration Guide.

1.4. Muting health alerts of a Ceph cluster

In certain scenarios, users might want to temporarily mute some warnings, because they are already aware of the warning and cannot act on it right away. You can mute health checks so that they do not affect the overall reported status of the Ceph cluster.

Alerts are specified using health check codes. For example, when an OSD is brought down for maintenance, OSD_DOWN warnings are expected. You can mute the warning until the maintenance is over, because otherwise it keeps the cluster in HEALTH_WARN instead of HEALTH_OK for the entire duration of the maintenance.

Most health mutes also disappear if the extent of an alert gets worse. For example, if there is one OSD down, and the alert is muted, the mute disappears if one or more additional OSDs go down. This is true for any health alert that involves a count indicating how much or how many of something is triggering the warning or error.

Prerequisites

A running Red Hat Ceph Storage cluster.
Root-level access to the nodes.
A health warning message.

Procedure

Log into the Cephadm shell:
Example
```
[root@host01 ~]# cephadm shell
```

Check the health of the Red Hat Ceph Storage cluster by running the ceph health detail command:

Example
```
[ceph: root@host01 /]# ceph health detail

HEALTH_WARN 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
[WRN] OSD_DOWN: 1 osds down
    osd.1 (root=default,host=host01) is down
[WRN] OSD_FLAGS: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
    osd.1 has flags noup
```

You can see that the storage cluster is in HEALTH_WARN status as one of the OSDs is down.
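The health check codes that the mute command expects appear after the severity tag in this output. As a minimal sketch, the following Python function extracts them from the plain-text output shown above. The function name and the regular expression are assumptions for illustration, and the pattern is based only on the `[WRN] CODE: summary` layout of this example, so other Ceph versions might format the output differently.

```python
import re

# Matches lines such as "[WRN] OSD_DOWN: 1 osds down".
CHECK_LINE = re.compile(
    r"^\[(?P<severity>[A-Z]+)\]\s+(?P<code>[A-Z0-9_]+):\s+(?P<summary>.*)$"
)

SAMPLE = """\
HEALTH_WARN 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
[WRN] OSD_DOWN: 1 osds down
    osd.1 (root=default,host=host01) is down
[WRN] OSD_FLAGS: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
    osd.1 has flags noup
"""


def parse_health_checks(output: str) -> list:
    """Return one dict per health check, with its indented detail lines."""
    checks = []
    for line in output.splitlines():
        match = CHECK_LINE.match(line)
        if match:
            checks.append({**match.groupdict(), "details": []})
        elif checks and line[:1] in (" ", "\t") and line.strip():
            # Indented lines belong to the most recent check.
            checks[-1]["details"].append(line.strip())
    return checks


if __name__ == "__main__":
    for check in parse_health_checks(SAMPLE):
        print(check["code"], check["severity"], check["details"])
```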

Mute the alert:

Syntax
```
ceph health mute HEALTH_MESSAGE
```
Example
```
[ceph: root@host01 /]# ceph health mute OSD_DOWN
```

Optional: A health check mute can have a time to live (TTL) associated with it, such that the mute automatically expires after the specified period of time has elapsed. Specify the TTL as an optional duration argument in the command:
Syntax
```
ceph health mute HEALTH_MESSAGE DURATION
```
DURATION can be specified in s, sec, m, min, h, or hour.
Example
```
[ceph: root@host01 /]# ceph health mute OSD_DOWN 10m
```
In this example, the alert OSD_DOWN is muted for 10 minutes.
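If you generate mute commands from a script, it can help to validate the DURATION value before passing it to Ceph. The following Python sketch converts a duration string to seconds using only the unit suffixes listed above. It is a hypothetical helper for illustration and does not reproduce the parser that Ceph itself uses.

```python
import re

# Unit suffixes taken from the list in the text above.
UNIT_SECONDS = {"s": 1, "sec": 1, "m": 60, "min": 60, "h": 3600, "hour": 3600}
DURATION = re.compile(r"^(\d+)\s*([a-z]+)$")


def duration_to_seconds(text: str) -> int:
    """Convert a string such as '10m' or '1h' to a number of seconds."""
    match = DURATION.match(text.strip().lower())
    if not match:
        raise ValueError(f"not a duration: {text!r}")
    value, unit = match.groups()
    if unit not in UNIT_SECONDS:
        raise ValueError(f"unsupported unit in {text!r}")
    return int(value) * UNIT_SECONDS[unit]


if __name__ == "__main__":
    print(duration_to_seconds("10m"))
```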

Verify if the Red Hat Ceph Storage cluster status has changed to HEALTH_OK:

Example
```
[ceph: root@host01 /]# ceph -s
  cluster:
    id:     81a4597a-b711-11eb-8cb8-001a4a000740
    health: HEALTH_OK
            (muted: OSD_DOWN(9m) OSD_FLAGS(9m))

  services:
    mon: 3 daemons, quorum host01,host02,host03 (age 33h)
    mgr: host01.pzhfuh(active, since 33h), standbys: host02.wsnngf, host03.xwzphg
    osd: 11 osds: 10 up (since 4m), 11 in (since 5d)

  data:
    pools:   1 pools, 1 pgs
    objects: 13 objects, 0 B
    usage:   85 MiB used, 165 GiB / 165 GiB avail
    pgs:     1 active+clean
```

In this example, you can see that the alerts OSD_DOWN and OSD_FLAGS are muted and that the mute is active for nine minutes.
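Because a muted cluster reports HEALTH_OK, a monitoring script might also want to read the muted line so that suppressed alerts stay visible. The following Python sketch extracts the muted codes from the `ceph -s` output shown above. The function name is hypothetical, and the handling of a mute without a duration in parentheses is an assumption, because the example output only shows mutes with a duration.

```python
import re

# One entry looks like "OSD_DOWN(9m)"; the parenthesized part is optional here.
ENTRY = re.compile(r"([A-Z][A-Z0-9_]*)(?:\(([^)]+)\))?")

SAMPLE = """\
  cluster:
    id:     81a4597a-b711-11eb-8cb8-001a4a000740
    health: HEALTH_OK
            (muted: OSD_DOWN(9m) OSD_FLAGS(9m))
"""


def parse_muted(status_output: str) -> dict:
    """Map each muted health check code to its displayed duration, or None."""
    for line in status_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("(muted:") and stripped.endswith(")"):
            # Drop the "(muted:" prefix and the closing parenthesis.
            body = stripped[len("(muted:"):-1]
            return {code: (ttl or None) for code, ttl in ENTRY.findall(body)}
    return {}


if __name__ == "__main__":
    print(parse_muted(SAMPLE))
```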

Optional: You can retain the mute even after the alert is cleared by making it sticky.

Syntax
```
ceph health mute HEALTH_MESSAGE DURATION --sticky
```
Example
```
[ceph: root@host01 /]# ceph health mute OSD_DOWN 1h --sticky
```

You can remove the mute by running the following command:
Syntax
```
ceph health unmute HEALTH_MESSAGE
```
Example
```
[ceph: root@host01 /]# ceph health unmute OSD_DOWN
```

1.5. Understanding Ceph logs

Ceph stores its logs in the /var/log/ceph/CLUSTER_FSID/ directory after logging to files is enabled.

The CLUSTER_NAME.log is the main storage cluster log file that includes global events. By default, the log file name is ceph.log. Only the Ceph Monitor nodes include the main storage cluster log.

Each Ceph OSD and Monitor has its own log file, named CLUSTER_NAME-osd.NUMBER.log and CLUSTER_NAME-mon.HOSTNAME.log.

When you increase the debugging level for Ceph subsystems, Ceph generates new log files for those subsystems as well.
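The naming scheme above can be expressed as a few path helpers, which is convenient when a script needs to locate a specific daemon log. This Python sketch is illustrative only: the function names are hypothetical, and the default cluster name `ceph` follows the default log file name mentioned above.

```python
from pathlib import PurePosixPath

LOG_ROOT = PurePosixPath("/var/log/ceph")


def cluster_log(fsid: str, cluster_name: str = "ceph") -> PurePosixPath:
    """Main storage cluster log, present only on Ceph Monitor nodes."""
    return LOG_ROOT / fsid / f"{cluster_name}.log"


def osd_log(fsid: str, number: int, cluster_name: str = "ceph") -> PurePosixPath:
    """Log file of one OSD, named CLUSTER_NAME-osd.NUMBER.log."""
    return LOG_ROOT / fsid / f"{cluster_name}-osd.{number}.log"


def mon_log(fsid: str, hostname: str, cluster_name: str = "ceph") -> PurePosixPath:
    """Log file of one Monitor, named CLUSTER_NAME-mon.HOSTNAME.log."""
    return LOG_ROOT / fsid / f"{cluster_name}-mon.{hostname}.log"


if __name__ == "__main__":
    fsid = "81a4597a-b711-11eb-8cb8-001a4a000740"  # FSID from the ceph -s example
    print(cluster_log(fsid))
    print(osd_log(fsid, 1))
    print(mon_log(fsid, "host01"))
```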

1.6. Generating an sos report

You can run the sos report command to collect the configuration details, system information, and diagnostic information of a Red Hat Ceph Storage cluster from a Red Hat Enterprise Linux system. The Red Hat Support team uses this information for further troubleshooting of the storage cluster.

Prerequisites

A running Red Hat Ceph Storage cluster.
Root-level access to the nodes.

Procedure

Install the sos package:
Example
```
[root@host01 ~]# dnf install sos
```
Run the sos report command to get the system information of the storage cluster:
Example
```
[root@host01 ~]# sos report -a --all-logs
```
The report is saved in the /var/tmp directory.
Run the following command for specific Ceph daemon information:
Example
```
[root@host01 ~]# sos report --all-logs -e ceph_mgr,ceph_common,ceph_mon,ceph_osd,ceph_ansible,ceph_mds,ceph_rgw
```
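If you collect reports from several nodes with a script, you can assemble the command line from a plugin list instead of typing it by hand. The following Python sketch builds the two command forms shown in this procedure. The function name is hypothetical, and the sketch only uses the options that appear in the examples above. It does not run the command.

```python
import shlex

# Plugin names copied from the example above.
CEPH_PLUGINS = [
    "ceph_mgr", "ceph_common", "ceph_mon", "ceph_osd",
    "ceph_ansible", "ceph_mds", "ceph_rgw",
]


def build_sos_command(plugins=None) -> list:
    """Return the sos report argument list for a full or plugin-limited run."""
    command = ["sos", "report"]
    if plugins:
        # Enable only the listed plugins, as in the daemon-specific example.
        command += ["--all-logs", "-e", ",".join(plugins)]
    else:
        # Full system report, as in the first example.
        command += ["-a", "--all-logs"]
    return command


if __name__ == "__main__":
    print(shlex.join(build_sos_command()))
    print(shlex.join(build_sos_command(CEPH_PLUGINS)))
```

Returning a list rather than a single string lets you pass the result directly to a process runner without shell quoting concerns.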