Chapter 1. Initial Troubleshooting
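As a storage administrator, you can do the initial troubleshooting of a Red Hat Ceph Storage cluster before contacting Red Hat support. This chapter covers identifying problems, diagnosing the health of a storage cluster, understanding Ceph health states, muting health alerts, understanding Ceph logs, and generating an sos report.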
Prerequisites
- A running Red Hat Ceph Storage cluster.
1.1. Identifying problems
To determine possible causes of the error with the Red Hat Ceph Storage cluster, answer the questions in the Procedure section.
Prerequisites
- A running Red Hat Ceph Storage cluster.
Procedure
- Certain problems can arise when using unsupported configurations. Ensure that your configuration is supported.
Do you know what Ceph component causes the problem?
- No. Follow the Diagnosing the health of a Ceph storage cluster procedure in the Red Hat Ceph Storage Troubleshooting Guide.
- Ceph Monitors. See the Troubleshooting Ceph Monitors section in the Red Hat Ceph Storage Troubleshooting Guide.
- Ceph OSDs. See the Troubleshooting Ceph OSDs section in the Red Hat Ceph Storage Troubleshooting Guide.
- Ceph placement groups. See the Troubleshooting Ceph placement groups section in the Red Hat Ceph Storage Troubleshooting Guide.
- Multi-site Ceph Object Gateway. See the Troubleshooting a multi-site Ceph Object Gateway section in the Red Hat Ceph Storage Troubleshooting Guide.
Additional Resources
- See the Red Hat Ceph Storage: Supported configurations article for details.
1.2. Diagnosing the health of a storage cluster
This procedure lists basic steps to diagnose the health of a Red Hat Ceph Storage cluster.
Prerequisites
- A running Red Hat Ceph Storage cluster.
Procedure
Log in to the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Check the overall status of the storage cluster:
Example
[ceph: root@host01 /]# ceph health detail
If the command returns HEALTH_WARN or HEALTH_ERR, see Understanding Ceph health for details.
Monitor the logs of the storage cluster:
Example
[ceph: root@host01 /]# ceph -W cephadm
To capture the logs of the cluster to a file, run the following commands:
Example
[ceph: root@host01 /]# ceph config set global log_to_file true
[ceph: root@host01 /]# ceph config set global mon_cluster_log_to_file true
The logs are located by default in the /var/log/ceph/CLUSTER_FSID/ directory. Check the Ceph logs for any error messages listed in Understanding Ceph logs.
- If the logs do not include a sufficient amount of information, increase the debugging level and try to reproduce the action that failed. See Configuring logging for details.
1.3. Understanding Ceph health
The ceph health command returns information about the status of the Red Hat Ceph Storage cluster:
- HEALTH_OK indicates that the cluster is healthy.
- HEALTH_WARN indicates a warning. In some cases, the Ceph status returns to HEALTH_OK automatically, for example when the Red Hat Ceph Storage cluster finishes the rebalancing process. However, consider further troubleshooting if a cluster stays in the HEALTH_WARN state for a longer time.
- HEALTH_ERR indicates a more serious problem that requires your immediate attention.
Use the ceph health detail and ceph -s commands to get a more detailed output.
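As a rough illustration of how the three states might drive a first response, the following shell sketch maps the first word of the ceph health output to a suggested action. The function name health_next_step and the wording of the suggestions are illustrative assumptions and are not part of Ceph.

```shell
# Hypothetical helper: map `ceph health` output to a suggested next action.
# Only the three states described above are recognized.
health_next_step() {
    # Keep only the first word, for example HEALTH_WARN from
    # "HEALTH_WARN 1 osds down".
    health_word="${1-}"
    health_word="${health_word%% *}"
    case "$health_word" in
        HEALTH_OK)   echo "healthy: no action needed" ;;
        HEALTH_WARN) echo "warning: run 'ceph health detail' and check whether the state clears" ;;
        HEALTH_ERR)  echo "error: run 'ceph health detail' and investigate immediately" ;;
        *)           echo "unrecognized status: $health_word" >&2; return 1 ;;
    esac
}
```

Inside the Cephadm shell, the helper could be called as health_next_step "$(ceph health)". It only classifies the state, so the detailed output is still needed to find the cause.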
A health warning is displayed if there is no mgr daemon running. If the last mgr daemon of a Red Hat Ceph Storage cluster was removed, you can manually deploy a mgr daemon on a random host of the Red Hat Ceph Storage cluster. See the Manually deploying a mgr daemon section in the Red Hat Ceph Storage 6 Administration Guide.
Additional Resources
- See the Ceph Monitor error messages table in the Red Hat Ceph Storage Troubleshooting Guide.
- See the Ceph OSD error messages table in the Red Hat Ceph Storage Troubleshooting Guide.
- See the Placement group error messages table in the Red Hat Ceph Storage Troubleshooting Guide.
1.4. Muting health alerts of a Ceph cluster
In certain scenarios, users might want to temporarily mute some warnings, because they are already aware of the warning and cannot act on it right away. You can mute health checks so that they do not affect the overall reported status of the Ceph cluster.
Alerts are specified using the health check codes. For example, when an OSD is brought down for maintenance, OSD_DOWN warnings are expected. You can mute the warning until the maintenance is over, because otherwise the warning keeps the cluster in HEALTH_WARN instead of HEALTH_OK for the entire duration of the maintenance.
Most health mutes also disappear if the extent of an alert gets worse. For example, if there is one OSD down, and the alert is muted, the mute disappears if one or more additional OSDs go down. This is true for any health alert that involves a count indicating how much or how many of something is triggering the warning or error.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the nodes.
- A health warning message.
Procedure
Log in to the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Check the health of the Red Hat Ceph Storage cluster by running the ceph health detail command:
Example
[ceph: root@host01 /]# ceph health detail
HEALTH_WARN 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
[WRN] OSD_DOWN: 1 osds down
    osd.1 (root=default,host=host01) is down
[WRN] OSD_FLAGS: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
    osd.1 has flags noup
You can see that the storage cluster is in HEALTH_WARN status as one of the OSDs is down.
Mute the alert:
Syntax
ceph health mute HEALTH_MESSAGE
Example
[ceph: root@host01 /]# ceph health mute OSD_DOWN
Optional: A health check mute can have a time to live (TTL) associated with it, such that the mute automatically expires after the specified period of time has elapsed. Specify the TTL as an optional duration argument in the command:
Syntax
ceph health mute HEALTH_MESSAGE DURATION
DURATION can be specified in s, sec, m, min, h, or hour.
Example
[ceph: root@host01 /]# ceph health mute OSD_DOWN 10m
In this example, the alert OSD_DOWN is muted for 10 minutes.
Verify that the Red Hat Ceph Storage cluster status has changed to HEALTH_OK:
Example
[ceph: root@host01 /]# ceph -s
  cluster:
    id:     81a4597a-b711-11eb-8cb8-001a4a000740
    health: HEALTH_OK (muted: OSD_DOWN(9m) OSD_FLAGS(9m))
  services:
    mon: 3 daemons, quorum host01,host02,host03 (age 33h)
    mgr: host01.pzhfuh(active, since 33h), standbys: host02.wsnngf, host03.xwzphg
    osd: 11 osds: 10 up (since 4m), 11 in (since 5d)
  data:
    pools:   1 pools, 1 pgs
    objects: 13 objects, 0 B
    usage:   85 MiB used, 165 GiB / 165 GiB avail
    pgs:     1 active+clean
In this example, you can see that the OSD_DOWN and OSD_FLAGS alerts are muted and that each mute remains active for nine more minutes.
Optional: You can retain the mute even after the alert is cleared by making it sticky.
Syntax
ceph health mute HEALTH_MESSAGE DURATION --sticky
Example
[ceph: root@host01 /]# ceph health mute OSD_DOWN 1h --sticky
You can remove the mute by running the following command:
Syntax
ceph health unmute HEALTH_MESSAGE
Example
[ceph: root@host01 /]# ceph health unmute OSD_DOWN
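The health check codes that the mute command expects are the names printed after the severity tag in the ceph health detail output. The following shell sketch, under the assumption that each check line starts with [WRN] or [ERR] as in the example above, extracts those codes and prints one mute command per code. The function name mute_commands and the default duration of 10m are illustrative choices.

```shell
# Hypothetical helper: read `ceph health detail` output on stdin and print
# one `ceph health mute CODE DURATION` command per reported health check.
# The commands are printed, not executed, so they can be reviewed first.
mute_commands() {
    awk -v ttl="${1:-10m}" '
        # Check lines look like: [WRN] OSD_DOWN: 1 osds down
        $1 == "[WRN]" || $1 == "[ERR]" {
            code = $2
            sub(/:$/, "", code)    # drop the trailing colon
            print "ceph health mute " code " " ttl
        }
    '
}
```

For example, ceph health detail | mute_commands 30m would list candidate commands for every active check. Printing instead of running them is deliberate, because muting every alert without review can hide unrelated problems.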
Additional Resources
- See the Health messages of a Ceph cluster section in the Red Hat Ceph Storage Troubleshooting Guide for details.
1.5. Understanding Ceph logs
Ceph stores its logs in the /var/log/ceph/CLUSTER_FSID/ directory after logging to files is enabled.
The CLUSTER_NAME.log file is the main storage cluster log file that includes global events. By default, the log file name is ceph.log. Only the Ceph Monitor nodes include the main storage cluster log.
Each Ceph OSD and Monitor has its own log file, named CLUSTER_NAME-osd.NUMBER.log and CLUSTER_NAME-mon.HOSTNAME.log.
When you increase the debugging level for Ceph subsystems, Ceph generates new log files for those subsystems as well.
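The naming scheme above can be turned into a small path builder. This is a minimal shell sketch that assumes the default log directory and the default cluster name ceph. The function name ceph_log_path and its argument order are hypothetical.

```shell
# Hypothetical helper: print the expected log file path.
# Usage: ceph_log_path CLUSTER_FSID cluster
#        ceph_log_path CLUSTER_FSID osd NUMBER   [CLUSTER_NAME]
#        ceph_log_path CLUSTER_FSID mon HOSTNAME [CLUSTER_NAME]
ceph_log_path() {
    log_dir="/var/log/ceph/${1-}"
    log_cluster="${4:-ceph}"    # default cluster name
    case "${2-}" in
        cluster) echo "$log_dir/$log_cluster.log" ;;
        osd|mon) echo "$log_dir/$log_cluster-$2.${3-}.log" ;;
        *)       echo "unsupported log type: ${2-}" >&2; return 1 ;;
    esac
}
```

Inside the Cephadm shell, the FSID can be obtained with ceph fsid, so a call such as ceph_log_path "$(ceph fsid)" osd 1 would print the expected path of the osd.1 log. The file exists only on the host that runs the daemon and only after logging to files is enabled.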
Additional Resources
- For details about logging, see Configuring logging in the Red Hat Ceph Storage Troubleshooting Guide.
- See the Common Ceph Monitor error messages in the Ceph logs table in the Red Hat Ceph Storage Troubleshooting Guide.
- See the Common Ceph OSD error messages in the Ceph logs table in the Red Hat Ceph Storage Troubleshooting Guide.
- To enable logging to files, see Ceph daemon logs.
1.6. Generating an sos report
You can run the sos report command to collect the configuration details, system information, and diagnostic information of a Red Hat Ceph Storage cluster from a Red Hat Enterprise Linux host. The Red Hat Support team uses this information for further troubleshooting of the storage cluster.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the nodes.
Procedure
Install the sos package:
Example
[root@host01 ~]# dnf install sos
Run the sos report command to get the system information of the storage cluster:
Example
[root@host01 ~]# sos report -a --all-logs
The report is saved in the /var/tmp directory.
Run the following command for specific Ceph daemon information:
Example
[root@host01 ~]# sos report --all-logs -e ceph_mgr,ceph_common,ceph_mon,ceph_osd,ceph_ansible,ceph_mds,ceph_rgw
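When only some daemons are of interest, the plugin list passed to -e can be shortened. The following shell sketch builds the command line from daemon short names by adding the ceph_ prefix used in the example above. The function name sos_ceph_command is hypothetical, and the sketch does not check that a plugin with the resulting name exists on the host.

```shell
# Hypothetical helper: print an `sos report` command that enables only the
# Ceph plugins for the given daemon short names (for example: mon osd mgr).
sos_ceph_command() {
    if [ "$#" -eq 0 ]; then
        echo "usage: sos_ceph_command DAEMON..." >&2
        return 1
    fi
    sos_plugins=""
    for sos_daemon in "$@"; do
        # Join the names with commas, adding the ceph_ prefix to each.
        sos_plugins="${sos_plugins:+$sos_plugins,}ceph_$sos_daemon"
    done
    echo "sos report --all-logs -e $sos_plugins"
}
```

For example, sos_ceph_command mon osd prints sos report --all-logs -e ceph_mon,ceph_osd. The plugins available on a given host can be listed with sos report --list-plugins before running the printed command.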
Additional Resources
- See the What is an sosreport and how to create one in Red Hat Enterprise Linux? KnowledgeBase article for more information.