Chapter 1. Initial Troubleshooting
As a storage administrator, you can do the initial troubleshooting of a Red Hat Ceph Storage cluster before contacting Red Hat support. This chapter includes the following information:
Prerequisites
- A running Red Hat Ceph Storage cluster.
1.1. Identifying problems
To determine possible causes of the error with the Red Hat Ceph Storage cluster, answer the questions in the Procedure section.
Prerequisites
- A running Red Hat Ceph Storage cluster.
Procedure
- Certain problems can arise when using unsupported configurations. Ensure that your configuration is supported.
- Do you know which Ceph component causes the problem?
  - No. Follow the Diagnosing the health of a Ceph storage cluster procedure in the Red Hat Ceph Storage Troubleshooting Guide.
  - Ceph Monitors. See the Troubleshooting Ceph Monitors section in the Red Hat Ceph Storage Troubleshooting Guide.
  - Ceph OSDs. See the Troubleshooting Ceph OSDs section in the Red Hat Ceph Storage Troubleshooting Guide.
  - Ceph placement groups. See the Troubleshooting Ceph placement groups section in the Red Hat Ceph Storage Troubleshooting Guide.
  - Multi-site Ceph Object Gateway. See the Troubleshooting a multi-site Ceph Object Gateway section in the Red Hat Ceph Storage Troubleshooting Guide.
Additional Resources
- See the Red Hat Ceph Storage: Supported configurations article for details.
1.2. Diagnosing the health of a storage cluster
This procedure lists basic steps to diagnose the health of a Red Hat Ceph Storage cluster.
Prerequisites
- A running Red Hat Ceph Storage cluster.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Check the overall status of the storage cluster:
Example
[ceph: root@host01 /]# ceph health detail
If the command returns HEALTH_WARN or HEALTH_ERR, see Understanding Ceph health for details.
Monitor the logs of the storage cluster:
Example
[ceph: root@host01 /]# ceph -W cephadm
To capture the logs of the cluster to a file, run the following commands:
Example
[ceph: root@host01 /]# ceph config set global log_to_file true
[ceph: root@host01 /]# ceph config set global mon_cluster_log_to_file true
The logs are located by default in the /var/log/ceph/CLUSTER_FSID/ directory. Check the Ceph logs for any error messages listed in Understanding Ceph logs.
If the logs do not include sufficient information, increase the debugging level and try to reproduce the action that failed. See Configuring logging for details.
1.3. Understanding Ceph health
The ceph health command returns information about the status of the Red Hat Ceph Storage cluster:
- HEALTH_OK indicates that the cluster is healthy.
- HEALTH_WARN indicates a warning. In some cases, the Ceph status returns to HEALTH_OK automatically, for example when the Red Hat Ceph Storage cluster finishes the rebalancing process. However, consider further troubleshooting if a cluster remains in the HEALTH_WARN state for a long time.
- HEALTH_ERR indicates a more serious problem that requires your immediate attention.
Use the ceph health detail and ceph -s commands to get more detailed output.
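The three states above lend themselves to a simple triage step in a script. The following is a minimal sketch, not part of any Red Hat tooling: health_next_step is a hypothetical helper name, and the sketch assumes that the status is the first word of the text it receives.

```shell
#!/bin/sh
# Hypothetical helper: map a health status to a suggested next step.
# Assumption: the status is the first word of the text passed in.
health_next_step() {
    status=${1%% *}   # keep only the first word
    case "$status" in
        HEALTH_OK)   echo "OK: cluster is healthy" ;;
        HEALTH_WARN) echo "WARN: check 'ceph health detail' and investigate if the state persists" ;;
        HEALTH_ERR)  echo "ERR: immediate attention required" ;;
        *)           echo "UNKNOWN: unrecognized status '$status'"; return 1 ;;
    esac
}
```

Inside the Cephadm shell, the helper could be called as health_next_step "$(ceph health)".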
A health warning is displayed if there is no mgr daemon running. If the last mgr daemon of a Red Hat Ceph Storage cluster was removed, you can manually deploy a mgr daemon on a random host of the Red Hat Ceph Storage cluster. See Manually deploying a mgr daemon in the Red Hat Ceph Storage 6 Administration Guide.
1.4. Muting health alerts of a Ceph cluster
In certain scenarios, users might want to temporarily mute some warnings, because they are already aware of the warning and cannot act on it right away. You can mute health checks so that they do not affect the overall reported status of the Ceph cluster.
Alerts are specified using the health check codes. For example, when an OSD is brought down for maintenance, OSD_DOWN warnings are expected. You can choose to mute the warning until the maintenance is over, because those warnings put the cluster in HEALTH_WARN instead of HEALTH_OK for the entire duration of the maintenance.
Most health mutes also disappear if the extent of an alert gets worse. For example, if there is one OSD down, and the alert is muted, the mute disappears if one or more additional OSDs go down. This is true for any health alert that involves a count indicating how much or how many of something is triggering the warning or error.
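The count rule in the preceding paragraph can be pictured with a toy model. This is only an illustration of the behavior as described here, not Ceph's implementation; mute_survives is a hypothetical name, and the model ignores TTL and sticky mutes.

```shell
#!/bin/sh
# Toy model of the rule described above, not Ceph's implementation.
# mute_survives MUTED_COUNT CURRENT_COUNT
# Succeeds (exit 0) while the current count does not exceed the count
# recorded when the mute was set; fails once the alert has grown.
mute_survives() {
    [ "$2" -le "$1" ]
}

# One OSD was down when the alert was muted.
if mute_survives 1 1; then echo "1 OSD down: mute stays"; fi
if ! mute_survives 1 2; then echo "2 OSDs down: mute is dropped"; fi
```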
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the nodes.
- A health warning message.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Check the health of the Red Hat Ceph Storage cluster by running the ceph health detail command:
Example
[ceph: root@host01 /]# ceph health detail
In the example scenario, the output shows that the storage cluster is in HEALTH_WARN status because one of the OSDs is down.
Mute the alert:
Syntax
ceph health mute HEALTH_MESSAGE
Example
[ceph: root@host01 /]# ceph health mute OSD_DOWN
Optional: A health check mute can have a time to live (TTL) associated with it, so that the mute automatically expires after the specified period of time has elapsed. Specify the TTL as an optional duration argument in the command:
Syntax
ceph health mute HEALTH_MESSAGE DURATION
DURATION can be specified in s, sec, m, min, h, or hour.
Example
[ceph: root@host01 /]# ceph health mute OSD_DOWN 10m
In this example, the alert OSD_DOWN is muted for 10 minutes.
Verify whether the Red Hat Ceph Storage cluster status has changed to HEALTH_OK. In the example scenario, the status output shows that the OSD_DOWN and OSD_FLAG alerts are muted and that the mute remains active for nine minutes.
Optional: You can retain the mute even after the alert is cleared by making it sticky.
Syntax
ceph health mute HEALTH_MESSAGE DURATION --sticky
Example
[ceph: root@host01 /]# ceph health mute OSD_DOWN 1h --sticky
You can remove the mute by running the following command:
Syntax
ceph health unmute HEALTH_MESSAGE
Example
[ceph: root@host01 /]# ceph health unmute OSD_DOWN
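When a script needs to know when a mute set in this procedure expires, it can help to convert the DURATION argument into seconds. The following is a minimal sketch under stated assumptions: duration_to_seconds is a hypothetical helper, and it handles only the suffixes listed in this procedure (s, sec, m, min, h, hour). Anything else, including a bare number, is rejected.

```shell
#!/bin/sh
# Hypothetical helper: convert a DURATION such as 10m or 1h into seconds.
# Only the suffixes named in the procedure are handled.
duration_to_seconds() {
    value=${1%%[!0-9]*}   # leading digits
    unit=${1#"$value"}    # whatever follows the digits
    [ -n "$value" ] || return 1
    case "$unit" in
        s|sec)  echo "$value" ;;
        m|min)  echo $((value * 60)) ;;
        h|hour) echo $((value * 3600)) ;;
        *)      return 1 ;;   # unknown or missing suffix
    esac
}
```

For example, duration_to_seconds 10m prints 600, which matches the 10-minute mute used in the procedure.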
1.5. Understanding Ceph logs
Ceph stores its logs in the /var/log/ceph/CLUSTER_FSID/ directory after logging to files is enabled.
The CLUSTER_NAME.log file is the main storage cluster log file that includes global events. By default, the log file name is ceph.log. Only the Ceph Monitor nodes include the main storage cluster log.
Each Ceph OSD and Monitor has its own log file, named CLUSTER_NAME-osd.NUMBER.log and CLUSTER_NAME-mon.HOSTNAME.log, respectively.
When you increase the debugging level for Ceph subsystems, Ceph generates new log files for those subsystems as well.
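The naming scheme for per-daemon log files can be expressed as a small path builder. This is a sketch only: daemon_log_path is a hypothetical helper, and it covers just the OSD and Monitor patterns named in this section.

```shell
#!/bin/sh
# Hypothetical helper: build the default log path for an OSD or Monitor.
# daemon_log_path CLUSTER_FSID CLUSTER_NAME TYPE ID
#   TYPE is "osd" (ID = OSD number) or "mon" (ID = host name).
daemon_log_path() {
    case "$3" in
        osd|mon) echo "/var/log/ceph/$1/$2-$3.$4.log" ;;
        *)       return 1 ;;   # other daemon types are not covered here
    esac
}
```

With the default cluster name, daemon_log_path CLUSTER_FSID ceph osd 3 prints /var/log/ceph/CLUSTER_FSID/ceph-osd.3.log, where CLUSTER_FSID stands for the real cluster FSID.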
1.6. Generating an sos report
You can run the sos report command to collect the configuration details, system information, and diagnostic information of a Red Hat Ceph Storage cluster from a Red Hat Enterprise Linux system. The Red Hat Support team uses this information for further troubleshooting of the storage cluster.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the nodes.
Procedure
Install the sos package:
Example
[root@host01 ~]# dnf install sos
Run the sos report command to get the system information of the storage cluster:
Example
[root@host01 ~]# sos report -a --all-logs
The report is saved in the /var/tmp directory.
Run the following command for specific Ceph daemon information:
Example
[root@host01 ~]# sos report --all-logs -e ceph_mgr,ceph_common,ceph_mon,ceph_osd,ceph_ansible,ceph_mds,ceph_rgw
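If only some daemons are of interest, the comma-separated plugin list passed to -e can be assembled from individual names. The following is a minimal sketch: sos_ceph_command is a hypothetical helper that only prints the command rather than running it, and whether a given plugin name is available depends on the installed sos version.

```shell
#!/bin/sh
# Hypothetical helper: print an sos report command for the given plugins.
# It only prints the command; it does not run sos.
sos_ceph_command() {
    [ "$#" -gt 0 ] || return 1     # at least one plugin name is required
    plugins=$(IFS=,; echo "$*")    # join the arguments with commas
    echo "sos report --all-logs -e $plugins"
}

sos_ceph_command ceph_mon ceph_osd
```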