Troubleshooting Guide
Troubleshooting Red Hat Ceph Storage
Abstract
Chapter 1. Initial Troubleshooting
This chapter includes information on:
- How to start troubleshooting Ceph errors (Identifying problems)
- Most common ceph health error messages (Understanding Ceph health)
- Most common Ceph log error messages (Understanding Ceph logs)
1.1. Prerequisites
- A running Red Hat Ceph Storage cluster.
1.2. Identifying problems
To determine possible causes of the error with the Red Hat Ceph Storage cluster, answer the questions in the Procedure section.
Prerequisites
- A running Red Hat Ceph Storage cluster.
Procedure
- Certain problems can arise when using unsupported configurations. Ensure that your configuration is supported.
Do you know what Ceph component causes the problem?
- No. Follow Diagnosing the health of a Ceph storage cluster procedure in the Red Hat Ceph Storage Troubleshooting Guide.
- Ceph Monitors. See Troubleshooting Ceph Monitors section in the Red Hat Ceph Storage Troubleshooting Guide.
- Ceph OSDs. See Troubleshooting Ceph OSDs section in the Red Hat Ceph Storage Troubleshooting Guide.
- Ceph placement groups. See Troubleshooting Ceph placement groups section in the Red Hat Ceph Storage Troubleshooting Guide.
- Multi-site Ceph Object Gateway. See Troubleshooting a multi-site Ceph Object Gateway section in the Red Hat Ceph Storage Troubleshooting Guide.
Additional Resources
- See the Red Hat Ceph Storage: Supported configurations article for details.
1.2.1. Diagnosing the health of a storage cluster
This procedure lists basic steps to diagnose the health of a Red Hat Ceph Storage cluster.
Prerequisites
- A running Red Hat Ceph Storage cluster.
Procedure
Check the overall status of the storage cluster:
[root@mon ~]# ceph health detail
If the command returns HEALTH_WARN or HEALTH_ERR, see Understanding Ceph health for details.
- Check the Ceph logs for any error messages listed in Understanding Ceph logs. The logs are located by default in the /var/log/ceph/ directory.
- If the logs do not include a sufficient amount of information, increase the debugging level and try to reproduce the action that failed. See Configuring logging for details.
1.3. Understanding Ceph health
The ceph health command returns information about the status of the Red Hat Ceph Storage cluster:
- HEALTH_OK indicates that the cluster is healthy.
- HEALTH_WARN indicates a warning. In some cases, the Ceph status returns to HEALTH_OK automatically, for example when the Red Hat Ceph Storage cluster finishes the rebalancing process. However, consider further troubleshooting if the cluster stays in the HEALTH_WARN state for a longer time.
- HEALTH_ERR indicates a more serious problem that requires your immediate attention.
Use the ceph health detail and ceph -s commands to get a more detailed output.
Additional Resources
- See the Ceph Monitor error messages table in the Red Hat Ceph Storage Troubleshooting Guide.
- See the Ceph OSD error messages table in the Red Hat Ceph Storage Troubleshooting Guide.
- See the Placement group error messages table in the Red Hat Ceph Storage Troubleshooting Guide.
1.4. Understanding Ceph logs
1.4.1. Non containerized deployment
By default, Ceph stores its logs in the /var/log/ceph/ directory.
The CLUSTER_NAME.log is the main storage cluster log file that includes global events. By default, the log file name is ceph.log. Only the Ceph Monitor nodes include the main storage cluster log.
Each Ceph OSD and Monitor has its own log file, named CLUSTER_NAME-osd.NUMBER.log and CLUSTER_NAME-mon.HOSTNAME.log.
When you increase debugging level for Ceph subsystems, Ceph generates new log files for those subsystems as well.
1.4.2. Container-based deployment
For container-based deployments, Ceph logs to journald by default, accessible using the journalctl command. However, you can configure Ceph to log to files in /var/log/ceph in the configuration settings.
To enable logging for Ceph Monitors, Ceph Manager, Ceph Object Gateway, and any other daemons, set log_to_file to true under the [global] settings.
Example
[ceph: root@host01 ~]# ceph config set global log_to_file true
To enable logging for the Ceph Monitor cluster and audit logs, set mon_cluster_log_to_file to true.
Example
[ceph: root@host01 ~]# ceph config set mon mon_cluster_log_to_file true
If you choose to log to files, it is recommended to disable logging to journald or else everything is logged twice. Run the following commands to disable logging to journald:
# ceph config set global log_to_journald false
# ceph config set global mon_cluster_log_to_journald false
1.5. Gathering logs from multiple hosts in a Ceph cluster using Ansible
Starting with Red Hat Ceph Storage 4.2, you can use ceph-ansible to gather logs from multiple hosts in a Ceph cluster. It captures the /etc/ceph and /var/log/ceph directories from the Ceph nodes. This playbook can be used to collect logs for bare-metal and containerized storage clusters.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the nodes.
- The ceph-ansible package is installed on the node.
Procedure
Log into the Ansible administration node as an ansible user.
Note: Ensure the node has adequate space to collect the logs from the hosts.
Navigate to the /usr/share/ceph-ansible directory:
Example
[ansible@admin ~]# cd /usr/share/ceph-ansible
Run the Ansible playbook to gather the logs:
Example
[ansible@admin ceph-ansible]$ ansible-playbook infrastructure-playbooks/gather-ceph-logs.yml -i hosts
The logs are stored in the /tmp directory of the Ansible node.
Chapter 2. Configuring logging
This chapter describes how to configure logging for various Ceph subsystems.
Logging is resource intensive. Also, verbose logging can generate a huge amount of data in a relatively short time. If you are encountering problems in a specific subsystem of the cluster, enable logging only of that subsystem. See Section 2.2, “Ceph subsystems” for more information.
In addition, consider setting up a rotation of log files. See Section 2.5, “Accelerating log rotation” for details.
Once you fix any problems you encounter, change the subsystems log and memory levels to their default values. See Appendix A, Ceph subsystems default logging level values for list of all Ceph subsystems and their default values.
You can configure Ceph logging by:
- Using the ceph command at runtime. This is the most common approach. See Section 2.3, “Configuring logging at runtime” for details.
- Updating the Ceph configuration file. Use this approach if you are encountering problems when starting the cluster. See Section 2.4, “Configuring logging in configuration file” for details.
2.1. Prerequisites
- A running Red Hat Ceph Storage cluster.
2.2. Ceph subsystems
This section contains information about Ceph subsystems and their logging levels.
Understanding Ceph Subsystems and Their Logging Levels
Ceph consists of several subsystems.
Each subsystem has a logging level for:
- Output logs that are stored by default in the /var/log/ceph/ directory (log level)
- Logs that are stored in a memory cache (memory level)
In general, Ceph does not send logs stored in memory to the output logs unless:
- A fatal signal is raised
- An assert in source code is triggered
- You request it, for example with the administration socket command shown after this list
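If you need the in-memory logs on demand, you can ask a daemon to write them out through its administration socket. The daemon name osd.0 below is only an illustrative assumption:
Example
[root@osd ~]# ceph daemon osd.0 log dump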
You can set different values for each of these subsystems. Ceph logging levels operate on a scale of 1 to 20, where 1 is terse and 20 is verbose.
Use a single value for the log level and memory level to set them both to the same value. For example, debug_osd = 5 sets the debug level for the ceph-osd daemon to 5.
To use different values for the output log level and the memory level, separate the values with a forward slash (/). For example, debug_mon = 1/5 sets the debug log level for the ceph-mon daemon to 1 and its memory log level to 5.
For container-based deployments, Ceph generates logs to journald. You can enable logging to files in /var/log/ceph by setting the log_to_file parameter to true under [global] in the Ceph configuration file. See Understanding Ceph logs for more details.
| Subsystem | Log Level | Memory Level | Description |
|---|---|---|---|
| asok | 1 | 5 | The administration socket |
| auth | 1 | 5 | Authentication |
| client | 0 | 5 | Any application or library that uses librados |
| bluestore | 1 | 5 | The BlueStore OSD backend |
| journal | 1 | 5 | The OSD journal |
| mds | 1 | 5 | The Metadata Servers |
| monc | 0 | 5 | The Monitor client handles communication between most Ceph daemons and Monitors |
| mon | 1 | 5 | Monitors |
| ms | 0 | 5 | The messaging system between Ceph components |
| osd | 0 | 5 | The OSD Daemons |
| paxos | 0 | 5 | The algorithm that Monitors use to establish a consensus |
| rados | 0 | 5 | Reliable Autonomic Distributed Object Store, a core component of Ceph |
| rbd | 0 | 5 | The Ceph Block Devices |
| rgw | 1 | 5 | The Ceph Object Gateway |
Example Log Outputs
The following examples show the type of messages in the logs when you increase the verbosity for the Monitors and OSDs.
Monitor Debug Settings
debug_ms = 5
debug_mon = 20
debug_paxos = 20
debug_auth = 20
Example Log Output of Monitor Debug Settings
OSD Debug Settings
debug_ms = 5
debug_osd = 20
Example Log Output of OSD Debug Settings
2.3. Configuring logging at runtime
You can configure the logging of Ceph subsystems at system runtime to help troubleshoot any issues that might occur.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Access to Ceph debugger.
Procedure
To activate the Ceph debugging output, dout(), at runtime:
Syntax
ceph tell TYPE.ID injectargs --debug-SUBSYSTEM VALUE [--NAME VALUE]
Replace:
- TYPE with the type of Ceph daemon (osd, mon, or mds)
- ID with a specific ID of the Ceph daemon. Alternatively, use * to apply the runtime setting to all daemons of a particular type.
- SUBSYSTEM with a specific subsystem.
- VALUE with a number from 1 to 20, where 1 is terse and 20 is verbose.
For example, to set the log level for the OSD subsystem on the OSD named osd.0 to 0 and the memory level to 5:
# ceph tell osd.0 injectargs --debug-osd 0/5
To see the configuration settings at runtime:
- Log in to the host with a running Ceph daemon, for example ceph-osd or ceph-mon.
- Display the configuration:
Syntax
ceph daemon NAME config show | less
Example
# ceph daemon osd.0 config show | less
Additional Resources
- See Ceph subsystems for details.
- See Configuring logging in configuration file for details.
- The Ceph Debugging and Logging Configuration Reference chapter in the Configuration Guide for Red Hat Ceph Storage 4.
2.4. Configuring logging in configuration file
Configure Ceph subsystems to log informational, warning, and error messages to the log file. You can specify the debugging level in the Ceph configuration file, by default /etc/ceph/ceph.conf.
Prerequisites
- A running Red Hat Ceph Storage cluster.
Procedure
To activate Ceph debugging output, dout(), at boot time, add the debugging settings to the Ceph configuration file.
- For subsystems common to each daemon, add the settings under the [global] section.
- For subsystems for particular daemons, add the settings under a daemon section, such as [mon], [osd], or [mds].
Example
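A minimal sketch of what such settings can look like; the subsystems and values below are illustrative only, not recommended defaults:
[global]
debug_ms = 1/5
[mon]
debug_mon = 20
debug_paxos = 1/5
[osd]
debug_osd = 1/5
[mds]
debug_mds = 1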
Additional Resources
- Ceph subsystems
- Configuring logging at runtime
- The Ceph Debugging and Logging Configuration Reference chapter in the Configuration Guide for Red Hat Ceph Storage 4
2.5. Accelerating log rotation
Increasing the debugging level for Ceph components might generate a huge amount of data. If the disks are almost full, you can accelerate log rotation by modifying the Ceph log rotation file at /etc/logrotate.d/ceph. The Cron job scheduler uses this file to schedule log rotation.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the node.
Procedure
Add the size setting after the rotation frequency to the log rotation file:
rotate 7
weekly
size SIZE
compress
sharedscripts
For example, to rotate a log file when it reaches 500 MB:
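A sketch of the resulting file with SIZE set to 500M:
rotate 7
weekly
size 500M
compress
sharedscripts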
Open the crontab editor:
[root@mon ~]# crontab -e
Add an entry to check the /etc/logrotate.d/ceph file. For example, to instruct Cron to check /etc/logrotate.d/ceph every 30 minutes:
30 * * * * /usr/sbin/logrotate /etc/logrotate.d/ceph >/dev/null 2>&1
Additional Resources
- The Scheduling a Recurring Job Using Cron section in the System Administrator’s Guide for Red Hat Enterprise Linux 7.
Chapter 3. Troubleshooting networking issues
This chapter lists basic troubleshooting procedures connected with networking and Network Time Protocol (NTP).
3.1. Prerequisites
- A running Red Hat Ceph Storage cluster.
3.2. Basic networking troubleshooting
Red Hat Ceph Storage depends heavily on a reliable network connection. Red Hat Ceph Storage nodes use the network for communicating with each other. Networking issues can cause many problems with Ceph OSDs, such as them flapping, or being incorrectly reported as down. Networking issues can also cause the Ceph Monitor’s clock skew errors. In addition, packet loss, high latency, or limited bandwidth can impact the cluster performance and stability.
Prerequisites
- Root-level access to the node.
Procedure
Installing the net-tools and telnet packages can help when troubleshooting network issues that can occur in a Ceph storage cluster:
Red Hat Enterprise Linux 7
[root@mon ~]# yum install net-tools
[root@mon ~]# yum install telnet
Red Hat Enterprise Linux 8
[root@mon ~]# dnf install net-tools
[root@mon ~]# dnf install telnet
Verify that the cluster_network and public_network parameters in the Ceph configuration file include the correct values:
Example
[root@mon ~]# cat /etc/ceph/ceph.conf | grep net
cluster_network = 192.168.1.0/24
public_network = 192.168.0.0/24
Verify that the network interfaces are up:
Example
[root@mon ~]# ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp22s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000 link/ether 40:f2:e9:b8:a0:48 brd ff:ff:ff:ff:ff:ff
Verify that the Ceph nodes are able to reach each other using their short host names. Verify this on each node in the storage cluster:
Syntax
ping SHORT_HOST_NAME
Example
[root@mon ~]# ping osd01
If you use a firewall, ensure that Ceph nodes are able to reach each other on their appropriate ports. The firewall-cmd and telnet tools can validate the port status and whether the port is reachable, respectively:
Syntax
firewall-cmd --info-zone=ZONE
telnet IP_ADDRESS PORT
Example
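An illustrative run, assuming the public firewall zone and a Monitor reachable at 192.168.0.11 on port 6789 (both values are assumptions for this sketch):
[root@mon ~]# firewall-cmd --info-zone=public
[root@mon ~]# telnet 192.168.0.11 6789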
Verify that there are no errors on the interface counters. Verify that the network connectivity between nodes has the expected latency, and that there is no packet loss.
Using the ethtool command:
Syntax
ethtool -S INTERFACE
Example
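An illustrative run, assuming the enp22s0f0 interface shown in the ip link output above; look for non-zero error, drop, or collision counters:
[root@mon ~]# ethtool -S enp22s0f0 | grep -i -e error -e drop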
Using the ifconfig command:
Example
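An illustrative run, again assuming the enp22s0f0 interface; check the RX errors, TX errors, and dropped counters in the output:
[root@mon ~]# ifconfig enp22s0f0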
Using the netstat command:
Example
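An illustrative run showing per-interface statistics; check the RX-ERR, TX-ERR, RX-DRP, and TX-DRP columns:
[root@mon ~]# netstat -i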
For performance issues, in addition to the latency checks, use the iperf3 tool to verify the network bandwidth between all nodes of the storage cluster. The iperf3 tool does a simple point-to-point network bandwidth test between a server and a client.
Install the iperf3 package on the Red Hat Ceph Storage nodes whose bandwidth you want to check:
Red Hat Enterprise Linux 7
[root@mon ~]# yum install iperf3
Red Hat Enterprise Linux 8
[root@mon ~]# dnf install iperf3
On a Red Hat Ceph Storage node, start the iperf3 server:
Example
[root@mon ~]# iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Note: The default port is 5201, but can be set using the -p command argument.
On a different Red Hat Ceph Storage node, start the iperf3 client:
Example
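An illustrative client invocation, assuming the iperf3 server started above runs on a host named mon and listens on the default port:
[root@osd ~]# iperf3 -c mon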
This output shows a network bandwidth of 1.1 Gbits/second between the Red Hat Ceph Storage nodes, along with no retransmissions (Retr) during the test.
Red Hat recommends you validate the network bandwidth between all the nodes in the storage cluster.
Ensure that all nodes have the same network interconnect speed. Slower attached nodes might slow down the faster connected ones. Also, ensure that the inter-switch links can handle the aggregated bandwidth of the attached nodes:
Syntax
ethtool INTERFACE
Example
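An illustrative run, assuming the enp22s0f0 interface; compare the reported Speed value on each node in the storage cluster:
[root@mon ~]# ethtool enp22s0f0 | grep Speed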
Additional Resources
- See the Basic Network troubleshooting solution on the Customer Portal for details.
- See the Verifying and configuring the MTU value section in the Red Hat Ceph Storage Configuration Guide.
- See the Configuring Firewall section in the Red Hat Ceph Storage Installation Guide.
- See the What is the "ethtool" command and how can I use it to obtain information about my network devices and interfaces article for details.
- See the RHEL network interface dropping packets solutions on the Customer Portal for details.
- For details, see the What are the performance benchmarking tools available for Red Hat Ceph Storage? solution on the Customer Portal.
- The Networking Guide for Red Hat Enterprise Linux 7.
- For more information, see Knowledgebase articles and solutions related to troubleshooting networking issues on the Customer Portal.
3.3. Basic chrony NTP troubleshooting
This section includes basic chrony troubleshooting steps.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor node.
Procedure
Verify that the chronyd daemon is running on the Ceph Monitor hosts:
Example
[root@mon ~]# systemctl status chronyd
If chronyd is not running, enable and start it:
Example
[root@mon ~]# systemctl enable chronyd
[root@mon ~]# systemctl start chronyd
Ensure that chronyd is synchronizing the clocks correctly:
Example
[root@mon ~]# chronyc sources
[root@mon ~]# chronyc sourcestats
[root@mon ~]# chronyc tracking
3.4. Basic NTP troubleshooting
This section includes basic NTP troubleshooting steps.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor node.
Procedure
Verify that the ntpd daemon is running on the Ceph Monitor hosts:
Example
[root@mon ~]# systemctl status ntpd
If ntpd is not running, enable and start it:
Example
[root@mon ~]# systemctl enable ntpd
[root@mon ~]# systemctl start ntpd
Ensure that ntpd is synchronizing the clocks correctly:
Example
[root@mon ~]# ntpq -p
Chapter 4. Troubleshooting Ceph Monitors
This chapter contains information on how to fix the most common errors related to the Ceph Monitors.
4.1. Prerequisites
- Verify the network connection.
4.2. Most common Ceph Monitor errors
The following tables list the most common error messages that are returned by the ceph health detail command, or included in the Ceph logs. The tables provide links to corresponding sections that explain the errors and point to specific procedures to fix the problems.
4.2.1. Prerequisites
- A running Red Hat Ceph Storage cluster.
4.2.2. Ceph Monitor error messages
A table of common Ceph Monitor error messages, and a potential fix.
| Error message | See |
|---|---|
|
| |
|
| |
|
| |
|
| |
4.2.3. Common Ceph Monitor error messages in the Ceph logs
A table of common Ceph Monitor error messages found in the Ceph logs, and a link to a potential fix.
| Error message | Log file | See |
|---|---|---|
|
| Main cluster log | |
|
| Main cluster log | |
|
| Monitor log | |
|
| Monitor log | |
|
| Monitor log |
4.2.4. Ceph Monitor is out of quorum
One or more Ceph Monitors are marked as down but the other Ceph Monitors are still able to form a quorum. In addition, the ceph health detail command returns an error message similar to the following one:
HEALTH_WARN 1 mons down, quorum 1,2 mon.b,mon.c
mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)
What This Means
Ceph marks a Ceph Monitor as down due to various reasons.
If the ceph-mon daemon is not running, it might have a corrupted store or some other error is preventing the daemon from starting. Also, the /var/ partition might be full. As a consequence, ceph-mon is not able to perform any operations to the store located by default at /var/lib/ceph/mon-SHORT_HOST_NAME/store.db and terminates.
If the ceph-mon daemon is running but the Ceph Monitor is out of quorum and marked as down, the cause of the problem depends on the Ceph Monitor state:
- If the Ceph Monitor is in the probing state longer than expected, it cannot find the other Ceph Monitors. This problem can be caused by networking issues, or the Ceph Monitor can have an outdated Ceph Monitor map (monmap) and be trying to reach the other Ceph Monitors on incorrect IP addresses. Alternatively, if the monmap is up-to-date, the Ceph Monitor's clock might not be synchronized.
- If the Ceph Monitor is in the electing state longer than expected, the Ceph Monitor's clock might not be synchronized.
- If the Ceph Monitor changes its state from synchronizing to electing and back, the cluster state is advancing. This means that it is generating new maps faster than the synchronization process can handle.
- If the Ceph Monitor marks itself as the leader or a peon, it believes it is in a quorum, while the remaining cluster is sure that it is not. This problem can be caused by failed clock synchronization.
To Troubleshoot This Problem
Verify that the ceph-mon daemon is running. If not, start it:
[root@mon ~]# systemctl status ceph-mon@HOST_NAME
[root@mon ~]# systemctl start ceph-mon@HOST_NAME
Replace HOST_NAME with the short name of the host where the daemon is running. Use the hostname -s command when unsure.
- If you are not able to start ceph-mon, follow the steps in The ceph-mon daemon cannot start.
- If you are able to start the ceph-mon daemon but it is marked as down, follow the steps in The ceph-mon daemon is running, but marked as `down`.
The ceph-mon Daemon Cannot Start
- Check the corresponding Ceph Monitor log, by default located at /var/log/ceph/ceph-mon.HOST_NAME.log.
- If the log contains error messages similar to the following ones, the Ceph Monitor might have a corrupted store.
Corruption: error in middle of record
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.0/store.db/1234567.ldb
To fix this problem, replace the Ceph Monitor. See Replacing a failed monitor.
- If the log contains an error message similar to the following one, the /var/ partition might be full. Delete any unnecessary data from /var/.
Caught signal (Bus error)
Important: Do not delete any data from the Monitor directory manually. Instead, use the ceph-monstore-tool to compact it. See Compacting the Ceph Monitor store for details.
- If you see any other error messages, open a support ticket. See Contacting Red Hat Support for service for details.
The ceph-mon Daemon Is Running, but Still Marked as down
From the Ceph Monitor host that is out of the quorum, use the mon_status command to check its state:
[root@mon ~]# ceph daemon ID mon_status
Replace ID with the ID of the Ceph Monitor, for example:
[root@mon ~]# ceph daemon mon.a mon_status
If the status is probing, verify the locations of the other Ceph Monitors in the mon_status output.
- If the addresses are incorrect, the Ceph Monitor has an incorrect Ceph Monitor map (monmap). To fix this problem, see Injecting a Ceph Monitor map.
- If the addresses are correct, verify that the Ceph Monitor clocks are synchronized. See Clock skew for details. In addition, troubleshoot any networking issues, see Troubleshooting networking issues for details.
- If the status is electing, verify that the Ceph Monitor clocks are synchronized. See Clock skew for details.
- If the status changes from electing to synchronizing, open a support ticket. See Contacting Red Hat Support for service for details.
- If the Ceph Monitor is the leader or a peon, verify that the Ceph Monitor clocks are synchronized. See Clock skew for details. Open a support ticket if synchronizing the clocks does not solve the problem. See Contacting Red Hat Support for service for details.
Additional Resources
- See Understanding Ceph Monitor status
- The Starting, Stopping, Restarting the Ceph daemons by instance section in the Administration Guide for Red Hat Ceph Storage 4
- The Using the Ceph Administration Socket section in the Administration Guide for Red Hat Ceph Storage 4
4.2.5. Clock skew
A Ceph Monitor is out of quorum, and the ceph health detail command output contains error messages similar to these:
mon.a (rank 0) addr 127.0.0.1:6789/0 is down (out of quorum)
mon.a addr 127.0.0.1:6789/0 clock skew 0.08235s > max 0.05s (latency 0.0045s)
In addition, Ceph logs contain error messages similar to these:
2015-06-04 07:28:32.035795 7f806062e700 0 log [WRN] : mon.a 127.0.0.1:6789/0 clock skew 0.14s > max 0.05s
2015-06-04 04:31:25.773235 7f4997663700 0 log [WRN] : message from mon.1 was stamped 0.186257s in the future, clocks not synchronized
What This Means
The clock skew error message indicates that Ceph Monitors' clocks are not synchronized. Clock synchronization is important because Ceph Monitors depend on time precision and behave unpredictably if their clocks are not synchronized.
The mon_clock_drift_allowed parameter determines what disparity between the clocks is tolerated. By default, this parameter is set to 0.05 seconds.
Do not change the default value of mon_clock_drift_allowed without previous testing. Changing this value might affect the stability of the Ceph Monitors and the Ceph Storage Cluster in general.
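To check the value currently in effect on a Monitor, you can query the daemon through its administration socket; the Monitor name host01 below is only an example:
[root@mon ~]# ceph daemon mon.host01 config get mon_clock_drift_allowed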
Possible causes of the clock skew error include network problems or problems with Network Time Protocol (NTP) synchronization if that is configured. In addition, time synchronization does not work properly on Ceph Monitors deployed on virtual machines.
To Troubleshoot This Problem
Verify that your network works correctly. For details, see Troubleshooting networking issues. In particular, troubleshoot any problems with NTP clients if you use NTP.
- If you use chrony for NTP, see Basic chrony NTP troubleshooting section for more information.
- If you use ntpd, see Basic NTP troubleshooting.
If you use a remote NTP server, consider deploying your own NTP server on your network.
- For details, see the Using the Chrony suite to configure NTP chapter in the Configuring basic system settings for Red Hat Enterprise Linux 8.
- See the Configuring NTP Using ntpd chapter in the System Administrator’s Guide for Red Hat Enterprise Linux 7.
Ceph evaluates time synchronization only every five minutes, so there is a delay between fixing the problem and clearing the clock skew messages.
Additional Resources
4.2.6. The Ceph Monitor store is getting too big
The ceph health command returns an error message similar to the following one:
mon.ceph1 store is getting too big! 48031 MB >= 15360 MB -- 62% avail
What This Means
The Ceph Monitor store is in fact a LevelDB database that stores entries as key–value pairs. The database includes a cluster map and is located by default at /var/lib/ceph/mon/CLUSTER_NAME-SHORT_HOST_NAME/store.db.
Querying a large Monitor store can take time. As a consequence, the Ceph Monitor can be delayed in responding to client queries.
In addition, if the /var/ partition is full, the Ceph Monitor cannot perform any write operations to the store and terminates. See Ceph Monitor is out of quorum for details on troubleshooting this issue.
To Troubleshoot This Problem
Check the size of the database:
Syntax
du -sch /var/lib/ceph/mon/CLUSTER_NAME-SHORT_HOST_NAME/store.db
Specify the name of the cluster and the short host name of the host where the ceph-mon is running.
Example
# du -sch /var/lib/ceph/mon/ceph-host1/store.db
47G /var/lib/ceph/mon/ceph-ceph1/store.db/
47G total
- Compact the Ceph Monitor store. For details, see Compacting the Ceph Monitor store.
Additional Resources
4.2.7. Understanding Ceph Monitor status
The mon_status command returns information about a Ceph Monitor, such as:
- State
- Rank
- Elections epoch
- Monitor map (monmap)
If Ceph Monitors are able to form a quorum, use mon_status with the ceph command-line utility.
If Ceph Monitors are not able to form a quorum, but the ceph-mon daemon is running, use the administration socket to execute mon_status.
An example output of mon_status
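An abbreviated, illustrative sketch of what the command returns; all names, addresses, and IDs below are made up, with mon.1 holding rank 0 and the reporting Monitor being a peon:
{
    "name": "mon.3",
    "rank": 2,
    "state": "peon",
    "election_epoch": 96,
    "quorum": [0, 1, 2],
    "monmap": {
        "epoch": 1,
        "fsid": "d5552d32-9d1d-436c-8db1-ab5fc2c63cd0",
        "mons": [
            { "rank": 0, "name": "mon.1", "addr": "172.25.1.10:6789/0" },
            { "rank": 1, "name": "mon.2", "addr": "172.25.1.12:6789/0" },
            { "rank": 2, "name": "mon.3", "addr": "172.25.1.13:6789/0" }
        ]
    }
}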
Ceph Monitor States
- Leader
- During the electing phase, Ceph Monitors are electing a leader. The leader is the Ceph Monitor with the highest rank, that is, the rank with the lowest value. In the example above, the leader is mon.1.
- Peon
- Peons are the Ceph Monitors in the quorum that are not leaders. If the leader fails, the peon with the highest rank becomes a new leader.
- Probing
- A Ceph Monitor is in the probing state if it is looking for other Ceph Monitors. For example, after you start the Ceph Monitors, they are probing until they find enough Ceph Monitors specified in the Ceph Monitor map (monmap) to form a quorum.
- Electing
- A Ceph Monitor is in the electing state if it is in the process of electing the leader. Usually, this status changes quickly.
- Synchronizing
- A Ceph Monitor is in the synchronizing state if it is synchronizing with the other Ceph Monitors to join the quorum. The smaller the Ceph Monitor store is, the faster the synchronization process completes. Therefore, if you have a large store, synchronization takes a longer time.
Additional Resources
- For details, see the Using the Ceph Administration Socket section in the Administration Guide for Red Hat Ceph Storage 4.
4.2.8. Additional Resources
- See the Section 4.2.2, “Ceph Monitor error messages” in the Red Hat Ceph Storage Troubleshooting Guide.
- See the Section 4.2.3, “Common Ceph Monitor error messages in the Ceph logs” in the Red Hat Ceph Storage Troubleshooting Guide.
4.3. Injecting a monmap
If a Ceph Monitor has an outdated or corrupted Ceph Monitor map (monmap), it cannot join a quorum because it is trying to reach the other Ceph Monitors on incorrect IP addresses.
The safest way to fix this problem is to obtain and inject the actual Ceph Monitor map from other Ceph Monitors.
This action overwrites the existing Ceph Monitor map kept by the Ceph Monitor.
This procedure shows how to inject the Ceph Monitor map when the other Ceph Monitors are able to form a quorum, or when at least one Ceph Monitor has a correct Ceph Monitor map. If all Ceph Monitors have corrupted store and therefore also the Ceph Monitor map, see Recovering the Ceph Monitor store.
Prerequisites
- Access to the Ceph Monitor Map.
- Root-level access to the Ceph Monitor node.
Procedure
If the remaining Ceph Monitors are able to form a quorum, get the Ceph Monitor map by using the ceph mon getmap command:
[root@mon ~]# ceph mon getmap -o /tmp/monmap
If the remaining Ceph Monitors are not able to form the quorum and you have at least one Ceph Monitor with a correct Ceph Monitor map, copy it from that Ceph Monitor:
Stop the Ceph Monitor which you want to copy the Ceph Monitor map from:
[root@mon ~]# systemctl stop ceph-mon@<host-name>
For example, to stop the Ceph Monitor running on a host with the host1 short host name:
[root@mon ~]# systemctl stop ceph-mon@host1
Copy the Ceph Monitor map:
[root@mon ~]# ceph-mon -i ID --extract-monmap /tmp/monmap
Replace ID with the ID of the Ceph Monitor which you want to copy the Ceph Monitor map from:
[root@mon ~]# ceph-mon -i mon.a --extract-monmap /tmp/monmap
Stop the Ceph Monitor with the corrupted or outdated Ceph Monitor map:
[root@mon ~]# systemctl stop ceph-mon@HOST_NAME
For example, to stop a Ceph Monitor running on a host with the host2 short host name:
[root@mon ~]# systemctl stop ceph-mon@host2
You can inject the Ceph Monitor map as a ceph user in two different ways:
Run the command as a ceph user:
Syntax
su - ceph -c 'ceph-mon -i ID --inject-monmap /tmp/monmap'
Replace ID with the ID of the Ceph Monitor with the corrupted or outdated Ceph Monitor map:
Example
[root@mon ~]# su - ceph -c 'ceph-mon -i mon.c --inject-monmap /tmp/monmap'
Run the command as a root user and then run chown to change the permissions:
Run the command as a root user:
Syntax
ceph-mon -i ID --inject-monmap /tmp/monmap
Replace ID with the ID of the Ceph Monitor with the corrupted or outdated Ceph Monitor map:
Example
[root@mon ~]# ceph-mon -i mon.c --inject-monmap /tmp/monmap
Change the file permissions:
Example
[root@mon ~]# chown -R ceph:ceph /var/lib/ceph/mon/ceph-c/
Start the Ceph Monitor:
[root@mon ~]# systemctl start ceph-mon@host2
If you copied the Ceph Monitor map from another Ceph Monitor, start that Ceph Monitor, too:
[root@mon ~]# systemctl start ceph-mon@host1
4.4. Replacing a failed Monitor
When a Monitor has a corrupted store, the recommended way to fix this problem is to replace the Monitor by using the Ansible automation application.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- The remaining Ceph Monitors are able to form a quorum.
- Root-level access to Ceph Monitor node.
Procedure
From the Monitor host, remove the Monitor store, by default located at /var/lib/ceph/mon/CLUSTER_NAME-SHORT_HOST_NAME:
Syntax
rm -rf /var/lib/ceph/mon/CLUSTER_NAME-SHORT_HOST_NAME
Specify the short host name of the Monitor host and the cluster name. For example, to remove the Monitor store of a Monitor running on host1 from a cluster called remote:
[root@mon ~]# rm -rf /var/lib/ceph/mon/remote-host1
Remove the Monitor from the Monitor map (monmap):
Syntax
ceph mon remove SHORT_HOST_NAME --cluster CLUSTER_NAME
Specify the short host name of the Monitor host and the cluster name. For example, to remove the Monitor running on host1 from a cluster called remote:
[root@mon ~]# ceph mon remove host1 --cluster remote
- Troubleshoot and fix any problems related to the underlying file system or hardware of the Monitor host.
From the Ansible administration node, redeploy the Monitor by running the ceph-ansible playbook:
$ /usr/share/ceph-ansible/ansible-playbook site.yml
Additional Resources
- See the Ceph Monitor is out of quorum for details.
- The Managing the storage cluster size chapter in the Red Hat Ceph Storage Operations Guide.
- The Deploying Red Hat Ceph Storage chapter in the Red Hat Ceph Storage 4 Installation Guide.
4.5. Compacting the monitor store
When the Monitor store has grown big in size, you can compact it:
- Dynamically by using the ceph tell command.
- Upon the start of the ceph-mon daemon.
- By using the ceph-monstore-tool when the ceph-mon daemon is not running. Use this method when the previously mentioned methods fail to compact the Monitor store or when the Monitor is out of quorum and its log contains the Caught signal (Bus error) error message.
Monitor store size changes when the cluster is not in the active+clean state or during the rebalancing process. For this reason, compact the Monitor store when rebalancing is completed. Also, ensure that the placement groups are in the active+clean state.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor node.
Procedure
To compact the Monitor store when the ceph-mon daemon is running:
Syntax
ceph tell mon.HOST_NAME compact
Replace HOST_NAME with the short host name of the host where the ceph-mon is running. Use the hostname -s command when unsure.
# ceph tell mon.host1 compact
Add the following parameter to the Ceph configuration under the [mon] section:
[mon]
mon_compact_on_start = true
Restart the ceph-mon daemon:
[root@mon ~]# systemctl restart ceph-mon@HOST_NAME
Replace HOST_NAME with the short name of the host where the daemon is running. Use the hostname -s command when unsure.
[root@mon ~]# systemctl restart ceph-mon@host1
Ensure that Monitors have formed a quorum:
[root@mon ~]# ceph mon stat
Repeat these steps on other Monitors if needed.
Note: Before you start, ensure that you have the ceph-test package installed.
Verify that the ceph-mon daemon with the large store is not running. Stop the daemon if needed.
[root@mon ~]# systemctl status ceph-mon@HOST_NAME
[root@mon ~]# systemctl stop ceph-mon@HOST_NAME
Replace HOST_NAME with the short name of the host where the daemon is running. Use the hostname -s command when unsure.
[root@mon ~]# systemctl status ceph-mon@host1
[root@mon ~]# systemctl stop ceph-mon@host1
Compact the Monitor store as a ceph user in two different ways:
Run the command as a ceph user:
Syntax
su - ceph -c 'ceph-monstore-tool /var/lib/ceph/mon/mon.HOST_NAME compact'
Example
[root@mon ~]# su - ceph -c 'ceph-monstore-tool /var/lib/ceph/mon/mon.node1 compact'
Run the command as a root user and then run chown to change the permissions:
Run the command as a root user:
Syntax
ceph-monstore-tool /var/lib/ceph/mon/mon.HOST_NAME compact
Example
[root@mon ~]# ceph-monstore-tool /var/lib/ceph/mon/mon.node1 compact
Change the file permissions:
Example
[root@mon ~]# chown -R ceph:ceph /var/lib/ceph/mon/mon.node1
Start ceph-mon again:
Syntax
systemctl start ceph-mon@HOST_NAME
Example
[root@mon ~]# systemctl start ceph-mon@host1
4.6. Opening port for Ceph manager
The ceph-mgr daemons receive placement group information from OSDs on the same range of ports as the ceph-osd daemons. If these ports are not open, a cluster will devolve from HEALTH_OK to HEALTH_WARN and will report a percentage of the PGs as unknown.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to Ceph manager.
Procedure
To resolve this situation, for each host running ceph-mgr daemons, open ports 6800-7300.
Example
[root@ceph-mgr] # firewall-cmd --add-port 6800-7300/tcp
[root@ceph-mgr] # firewall-cmd --add-port 6800-7300/tcp --permanent
- Restart the ceph-mgr daemons.
4.7. Recovering the Ceph Monitor for bare-metal deployments
If all the monitors are down in your Red Hat Ceph Storage cluster, and the ceph -s command does not execute as expected, you can recover the monitors by using the monmaptool and ceph-monstore-tool commands. These commands rebuild the monitor map and the Ceph Monitor store from the cluster maps held by the OSDs and from the keyring files of the daemons.
This procedure is for bare-metal Red Hat Ceph Storage deployments only. For containerized Red Hat Ceph Storage deployments, see the Knowledgebase article MON recovery procedure for Red Hat Ceph Storage containerized deployment when all the three mon are down.
Prerequisites
- Bare-metal deployed Red Hat Ceph Storage cluster.
- Root-level access to all the nodes.
- All the Ceph monitors are down.
Procedure
- Log into the monitor node.
From the Monitor node, if you are unable to access the OSD nodes as the root user, copy the public key pair to the OSD nodes:
Generate the SSH key pair, accept the default file name, and leave the passphrase empty:
Example
[root@mons-1 ~]# ssh-keygen
Copy the public key to all the OSD nodes in the storage cluster:
Example
[root@mons-1 ~]# ssh-copy-id root@osds-1
[root@mons-1 ~]# ssh-copy-id root@osds-2
[root@mons-1 ~]# ssh-copy-id root@osds-3
Stop the OSD daemon service on all the OSD nodes:
Example
[root@osds-1 ~]# sudo systemctl stop ceph-osd\*.service ceph-osd.target
To collect the cluster map from all the OSD nodes, create the recovery file and execute the script:
Create the recovery file:
Example
[root@mons-1 ~]# touch recover.sh
Add the following content to the file and replace OSD_NODES with either the IP addresses of all the OSD nodes or the hostnames of all the OSD nodes in the Red Hat Ceph Storage cluster:
Syntax
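A sketch of what such a script typically does, assuming BlueStore OSDs with data paths under /var/lib/ceph/osd/ceph-* and passwordless root SSH access to the OSD nodes; treat it as an outline only, not the exact supported script:
#!/bin/bash
# Hypothetical outline: replace OSD_NODES with the OSD node hostnames or IP addresses.
ms=/tmp/monstore
mkdir -p $ms
for host in OSD_NODES; do
    echo "Collecting cluster map and OSD keyrings from $host"
    # Push the partially built store to the OSD node.
    rsync -avz $ms/ root@$host:$ms/
    rm -rf $ms
    # Update the store from every OSD on that node.
    ssh root@$host <<EOF
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --type bluestore --data-path \$osd --op update-mon-db --mon-store-path $ms
done
EOF
    # Pull the updated store back to this node.
    rsync -avz root@$host:$ms/ $ms/
done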
Provide executable permission on the file:
Example
[root@mons-1 ~]# chmod 755 recover.sh
Execute the file to gather the keyrings of all the OSDs from all the OSD nodes in the storage cluster:
Example
[root@mons-1 ~]# ./recover.sh
Fetch the keyrings of the other daemons from the respective nodes:
For the Ceph Monitor, the keyring is the same for all the Ceph Monitors:
Syntax
cat /var/lib/ceph/mon/ceph-MONITOR_NODE/keyring
Example
[root@mons-1 ~]# cat /var/lib/ceph/mon/ceph-mons-1/keyring
For the Ceph Manager, fetch the keyring from all the manager nodes:
Syntax
cat /var/lib/ceph/mgr/ceph-MANAGER_NODE/keyring
Example
[root@mons-1 ~]# cat /var/lib/ceph/mgr/ceph-mons-1/keyring
For the Ceph OSDs, the keyrings are generated by the above script and stored in a temporary path. In this example, the OSD keyrings are stored in the /tmp/monstore/keyring file.
For the client, fetch the keyring from all the client nodes:
Syntax
cat /etc/ceph/CLIENT_KEYRING
Example
[root@client ~]# cat /etc/ceph/ceph.client.admin.keyring
For the Metadata Server (MDS), fetch the keyring from all the Ceph MDS nodes:
Syntax
cat /var/lib/ceph/mds/ceph-MDS_NODE/keyring
Example
[root@mons-2 ~]# cat /var/lib/ceph/mds/ceph-mds-1/keyring
For this keyring, append the following caps if they do not exist:
caps mds = "allow"
caps mon = "allow profile mds"
caps osd = "allow *"
For the Ceph Object Gateway, fetch the keyring from all the Ceph Object Gateway nodes:
Syntax
cat /var/lib/ceph/radosgw/ceph-CEPH_OBJECT_GATEWAY_NODE/keyring
Example
[root@mons-3 ~]# cat /var/lib/ceph/radosgw/ceph-rgw-1/keyring
For this keyring, append the following caps if they do not exist:
caps mon = "allow rw"
caps osd = "allow *"
On the Ansible administration node, create a file with all the keyrings fetched from the previous step:
Example
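A sketch of what the combined keyring file can look like; the key values are redacted placeholders and the entity names are illustrative:
[mon.]
    key = AQD...==
    caps mon = "allow *"
[client.admin]
    key = AQD...==
    caps mds = "allow *"
    caps mgr = "allow *"
    caps mon = "allow *"
    caps osd = "allow *"
[mgr.mons-1]
    key = AQD...==
    caps mds = "allow *"
    caps mon = "allow profile mgr"
    caps osd = "allow *"
[osd.0]
    key = AQD...==
    caps mgr = "allow profile osd"
    caps mon = "allow profile osd"
    caps osd = "allow *"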
Optional: On each Ceph Monitor node, ensure that the monitor map is not available:
Example
From each Ceph Monitor node, fetch the MONITOR_ID, IP_ADDRESS_OF_MONITOR, and FSID from the /etc/ceph/ceph.conf file:
Example
[global]
cluster network = 10.0.208.0/22
fsid = 9877bde8-ccb2-4758-89c3-90ca9550ffea
mon host = [v2:10.0.211.00:3300,v1:10.0.211.00:6789],[v2:10.0.211.13:3300,v1:10.0.211.13:6789],[v2:10.0.210.13:3300,v1:10.0.210.13:6789]
mon initial members = ceph-mons-1, ceph-mons-2, ceph-mons-3
On the Ceph Monitor node, rebuild the monitor map:
Syntax
monmaptool --create --addv MONITOR_ID IP_ADDRESS_OF_MONITOR --enable-all-features --clobber PATH_OF_MONITOR_MAP --fsid FSID
Example
[root@mons-1 ~]# monmaptool --create --addv mons-1 [v2:10.74.177.30:3300,v1:10.74.177.30:6789] --addv mons-2 [v2:10.74.179.197:3300,v1:10.74.179.197:6789] --addv mons-3 [v2:10.74.182.123:3300,v1:10.74.182.123:6789] --enable-all-features --clobber /root/monmap.mons-1 --fsid 6c01cb34-33bf-44d0-9aec-3432276f6be8
monmaptool: monmap file /root/monmap.mons-1
monmaptool: set fsid to 6c01cb34-33bf-44d0-9aec-3432276f6be8
monmaptool: writing epoch 0 to /root/monmap.mons-1 (3 monitors)
On the Ceph Monitor node, check the generated monitor map:
Syntax
monmaptool PATH_OF_MONITOR_MAP --print
Example
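An abbreviated, illustrative check of the map created above; timestamps and other fields are omitted:
[root@mons-1 ~]# monmaptool /root/monmap.mons-1 --print
monmaptool: monmap file /root/monmap.mons-1
epoch 0
fsid 6c01cb34-33bf-44d0-9aec-3432276f6be8
0: [v2:10.74.177.30:3300/0,v1:10.74.177.30:6789/0] mon.mons-1
1: [v2:10.74.179.197:3300/0,v1:10.74.179.197:6789/0] mon.mons-2
2: [v2:10.74.182.123:3300/0,v1:10.74.182.123:6789/0] mon.mons-3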
On the Ceph Monitor node where you are recovering the monitors, rebuild the Ceph Monitor store from the collected map:
Syntax
ceph-monstore-tool /tmp/monstore rebuild -- --keyring KEYRING_PATH --monmap PATH_OF_MONITOR_MAP
In this example, the recovery is run on the mons-1 node.
Example
[root@mons-1 ~]# ceph-monstore-tool /tmp/monstore rebuild -- --keyring /tmp/monstore/keyring --monmap /root/monmap.mons-1
Change the ownership of the monstore directory to ceph:
Example
[root@mons-1 ~]# chown -R ceph:ceph /tmp/monstore
On all the Ceph Monitor nodes, take a backup of the corrupted store:
Example
[root@mons-1 ~]# mv /var/lib/ceph/mon/ceph-mons-1/store.db /var/lib/ceph/mon/ceph-mons-1/store.db.corrupted
On all the Ceph Monitor nodes, replace the corrupted store:
Example
[root@mons-1 ~]# scp -r /tmp/monstore/store.db mons-1:/var/lib/ceph/mon/ceph-mons-1/
On all the Ceph Monitor nodes, change the owner of the new store:
Example
[root@mons-1 ~]# chown -R ceph:ceph /var/lib/ceph/mon/ceph-HOSTNAME/store.db
On all the Ceph OSD nodes, start the OSDs:
Example
[root@osds-1 ~]# sudo systemctl start ceph-osd.target
On all the Ceph Monitor nodes, start the monitors:
Example
[root@mons-1 ~]# sudo systemctl start ceph-mon.target
4.8. Recovering the Ceph Monitor store
Ceph Monitors store the cluster map in a key–value store such as LevelDB. If the store is corrupted on a Monitor, the Monitor terminates unexpectedly and fails to start again. The Ceph logs might include the following errors:
Corruption: error in middle of record Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.0/store.db/1234567.ldb
Corruption: error in middle of record
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/mon.0/store.db/1234567.ldb
Production Red Hat Ceph Storage clusters use at least three Ceph Monitors so that if one fails, it can be replaced with another one. However, under certain circumstances, all Ceph Monitors can have corrupted stores. For example, when the Ceph Monitor nodes have incorrectly configured disk or file system settings, a power outage can corrupt the underlying file system.
If there is corruption on all Ceph Monitors, you can recover the stores with information stored on the OSD nodes by using the ceph-monstore-tool and ceph-objectstore-tool utilities.
These procedures cannot recover the following information:
- Metadata Server (MDS) keyrings and maps
Placement Group settings:
-
full_ratio set by using the ceph pg set_full_ratio command
-
nearfull_ratio set by using the ceph pg set_nearfull_ratio command
-
Never restore the Ceph Monitor store from an old backup. Rebuild the Ceph Monitor store from the current cluster state using the following steps and restore from that.
4.8.1. Recovering the Ceph Monitor store when using BlueStore
Follow this procedure if the Ceph Monitor store is corrupted on all Ceph Monitors and you use the BlueStore back end.
In containerized environments, this method requires attaching Ceph repositories and restoring to a non-containerized Ceph Monitor first.
This procedure can cause data loss. If you are unsure about any step in this procedure, contact Red Hat Technical Support for assistance with the recovery process.
Prerequisites
Bare-metal deployments
-
The
rsync and ceph-test packages are installed.
-
The
Container deployments
- All OSD containers are stopped.
- Enable Ceph repositories on the Ceph nodes based on their roles.
-
The
ceph-testandrsyncpackages are installed on the OSD and Monitor nodes. -
The
ceph-monpackage is installed on the Monitor nodes. -
The
ceph-osdpackage is installed on the OSD nodes.
Procedure
If you use Ceph in containers, mount all disks with Ceph data to a temporary location. Repeat this step for all OSD nodes.
List the data partitions. Use
ceph-volumeorceph-diskdepending on which utility you used to set up the devices:ceph-volume lvm list
[root@osd ~]# ceph-volume lvm listCopy to Clipboard Copied! Toggle word wrap Toggle overflow or
ceph-disk list
[root@osd ~]# ceph-disk listCopy to Clipboard Copied! Toggle word wrap Toggle overflow Mount the data partitions to temporary location:
mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-$i
mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-$iCopy to Clipboard Copied! Toggle word wrap Toggle overflow Restore the SELinux context:
for i in {OSD_ID}; do restorecon /var/lib/ceph/osd/ceph-$i; donefor i in {OSD_ID}; do restorecon /var/lib/ceph/osd/ceph-$i; doneCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace OSD_ID with a numeric, space-separated list of Ceph OSD IDs on the OSD node.
Change the owner and group to
ceph:ceph:for i in {OSD_ID}; do chown -R ceph:ceph /var/lib/ceph/osd/ceph-$i; donefor i in {OSD_ID}; do chown -R ceph:ceph /var/lib/ceph/osd/ceph-$i; doneCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace OSD_ID with a numeric, space-separated list of Ceph OSD IDs on the OSD node.
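For example, if the node hosts OSDs 0, 1, and 2 (hypothetical IDs), the two loops above could be expanded as follows; this is a minimal sketch only:
for i in 0 1 2; do restorecon /var/lib/ceph/osd/ceph-$i; done
for i in 0 1 2; do chown -R ceph:ceph /var/lib/ceph/osd/ceph-$i; done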
ImportantDue to a bug that causes the
update-mon-db command to use additional db and db.slow directories for the Monitor database, you must also copy these directories. To do so:
Prepare a temporary location outside the container to mount and access the OSD database and extract the OSD maps needed to restore the Ceph Monitor:
ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev OSD-DATA --path /var/lib/ceph/osd/ceph-OSD-ID
ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev OSD-DATA --path /var/lib/ceph/osd/ceph-OSD-ID
Replace OSD-DATA with the Volume Group (VG) or Logical Volume (LV) path to the OSD data and OSD-ID with the ID of the OSD.
Create a symbolic link between the BlueStore database and
block.db:
ln -snf BLUESTORE_DATABASE /var/lib/ceph/osd/ceph-OSD-ID/block.db
ln -snf BLUESTORE_DATABASE /var/lib/ceph/osd/ceph-OSD-ID/block.db
Replace BLUESTORE_DATABASE with the Volume Group (VG) or Logical Volume (LV) path to the BlueStore database and OSD-ID with the ID of the OSD.
Use the following commands from the Ceph Monitor node with the corrupted store. Repeat them for all OSDs on all nodes.
Collect the cluster map from all OSD nodes:
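The command block for this step is not reproduced above. A minimal sketch, assuming BlueStore OSDs mounted under /var/lib/ceph/osd/ceph-* and a temporary store in /tmp/monstore (both assumptions), uses the update-mon-db operation of ceph-objectstore-tool on every OSD:
# Run on each OSD node; every OSD adds its maps to the same temporary store
for osd in /var/lib/ceph/osd/ceph-*; do
  ceph-objectstore-tool --type bluestore --data-path "$osd" --op update-mon-db --mon-store-path /tmp/monstore
done
The /tmp/monstore directory is then copied between the OSD nodes, for example with rsync, so that the maps from all OSDs end up in a single store on the recovery node.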
Set the appropriate capabilities:
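The command block for this step is not reproduced above either. One common way to set the required capabilities is with ceph-authtool against the admin keyring; the keyring path below is an assumption and this is a sketch only:
ceph-authtool /etc/ceph/ceph.client.admin.keyring -n mon. --cap mon 'allow *' --gen-key
ceph-authtool /etc/ceph/ceph.client.admin.keyring -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'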
Move all sst files from the db and db.slow directories to the temporary location:
mv /root/db/*.sst /root/db.slow/*.sst /tmp/monstore/store.db
[root@mon ~]# mv /root/db/*.sst /root/db.slow/*.sst /tmp/monstore/store.dbCopy to Clipboard Copied! Toggle word wrap Toggle overflow Rebuild the Monitor store from the collected map:
ceph-monstore-tool /tmp/monstore rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring
[root@mon ~]# ceph-monstore-tool /tmp/monstore rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring
NoteAfter using this command, only keyrings extracted from the OSDs and the keyring specified on the
ceph-monstore-toolcommand line are present in Ceph’s authentication database. You have to recreate or import all other keyrings, such as clients, Ceph Manager, Ceph Object Gateway, and others, so those clients can access the cluster.Back up the corrupted store. Repeat this step for all Ceph Monitor nodes:
mv /var/lib/ceph/mon/ceph-HOSTNAME/store.db /var/lib/ceph/mon/ceph-HOSTNAME/store.db.corrupted
mv /var/lib/ceph/mon/ceph-HOSTNAME/store.db /var/lib/ceph/mon/ceph-HOSTNAME/store.db.corruptedCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace HOSTNAME with the host name of the Ceph Monitor node.
Replace the corrupted store. Repeat this step for all Ceph Monitor nodes:
scp -r /tmp/monstore/store.db HOSTNAME:/var/lib/ceph/mon/ceph-HOSTNAME/
scp -r /tmp/monstore/store.db HOSTNAME:/var/lib/ceph/mon/ceph-HOSTNAME/Copy to Clipboard Copied! Toggle word wrap Toggle overflow Replace HOSTNAME with the host name of the Monitor node.
Change the owner of the new store. Repeat this step for all Ceph Monitor nodes:
chown -R ceph:ceph /var/lib/ceph/mon/ceph-HOSTNAME/store.db
chown -R ceph:ceph /var/lib/ceph/mon/ceph-HOSTNAME/store.dbCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace HOSTNAME with the host name of the Ceph Monitor node.
If you use Ceph in containers, then unmount all the temporarily mounted OSDs on all nodes:
umount /var/lib/ceph/osd/ceph-*
[root@osd ~]# umount /var/lib/ceph/osd/ceph-*Copy to Clipboard Copied! Toggle word wrap Toggle overflow Start all the Ceph Monitor daemons:
systemctl start ceph-mon.target
[root@mon ~]# systemctl start ceph-mon.target
Ensure that the Monitors are able to form a quorum:
Bare-metal deployments
ceph -s
[root@mon ~]# ceph -sCopy to Clipboard Copied! Toggle word wrap Toggle overflow Containers
docker exec ceph-mon-_HOSTNAME_ ceph -s
[user@admin ~]$ docker exec ceph-mon-_HOSTNAME_ ceph -sCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace HOSTNAME with the host name of the Ceph Monitor node.
Import the Ceph Manager keyring and start all Ceph Manager processes:
ceph auth import -i /etc/ceph/ceph.mgr.HOSTNAME.keyring systemctl start ceph-mgr@HOSTNAME
ceph auth import -i /etc/ceph/ceph.mgr.HOSTNAME.keyring systemctl start ceph-mgr@HOSTNAMECopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace HOSTNAME with the host name of the Ceph Manager node.
Start all OSD processes across all OSD nodes:
systemctl start ceph-osd.target
[root@osd ~]# systemctl start ceph-osd.target
Ensure that the OSDs are returning to service:
Bare-metal deployments
ceph -s
[root@mon ~]# ceph -sCopy to Clipboard Copied! Toggle word wrap Toggle overflow Containers
docker exec ceph-mon-_HOSTNAME_ ceph -s
[user@admin ~]$ docker exec ceph-mon-_HOSTNAME_ ceph -sCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace HOSTNAME with the host name of the Ceph Monitor node.
Additional Resources
- For details on registering Ceph nodes to the Content Delivery Network (CDN), see Registering Red Hat Ceph Storage Nodes to the CDN and Attaching Subscriptions section in the Red Hat Ceph Storage Installation Guide.
- For details on enabling repositories, see the Enabling the Red Hat Ceph Storage Repositories section in the Red Hat Ceph Storage Installation Guide.
4.9. Additional Resources
- See Chapter 3, Troubleshooting networking issues in the Red Hat Ceph Storage Troubleshooting Guide for network-related problems.
Chapter 5. Troubleshooting Ceph OSDs
This chapter contains information on how to fix the most common errors related to Ceph OSDs.
5.1. Prerequisites
- Verify your network connection. See Troubleshooting networking issues for details.
-
Verify that Monitors have a quorum by using the
ceph healthcommand. If the command returns a health status (HEALTH_OK,HEALTH_WARN, orHEALTH_ERR), the Monitors are able to form a quorum. If not, address any Monitor problems first. See Troubleshooting Ceph Monitors for details. For details aboutceph healthsee Understanding Ceph health. - Optionally, stop the rebalancing process to save time and resources. See Stopping and starting rebalancing for details.
5.2. Most common Ceph OSD errors
The following tables list the most common error messages that are returned by the ceph health detail command, or included in the Ceph logs. The tables provide links to corresponding sections that explain the errors and point to specific procedures to fix the problems.
5.2.1. Prerequisites
- Root-level access to the Ceph OSD nodes.
5.2.2. Ceph OSD error messages
A table of common Ceph OSD error messages, and a potential fix.
| Error message | See |
|---|---|
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
5.2.3. Common Ceph OSD error messages in the Ceph logs
A table of common Ceph OSD error messages found in the Ceph logs, and a link to a potential fix.
| Error message | Log file | See |
|---|---|---|
|
| Main cluster log | |
|
| Main cluster log | |
|
| Main cluster log | |
|
| OSD log |
5.2.4. Full OSDs
The ceph health detail command returns an error message similar to the following one:
HEALTH_ERR 1 full osds osd.3 is full at 95%
HEALTH_ERR 1 full osds
osd.3 is full at 95%
What This Means
Ceph prevents clients from performing I/O operations on full OSD nodes to avoid losing data. It returns the HEALTH_ERR full osds message when the cluster reaches the capacity set by the mon_osd_full_ratio parameter. By default, this parameter is set to 0.95 which means 95% of the cluster capacity.
To Troubleshoot This Problem
Determine how many percent of raw storage (%RAW USED) is used:
ceph df
# ceph df
If %RAW USED is above 70-75%, you can:
- Delete unnecessary data. This is a short-term solution to avoid production downtime.
- Scale the cluster by adding a new OSD node. This is a long-term solution recommended by Red Hat.
Additional Resources
- Nearfull OSDS in the Red Hat Ceph Storage Troubleshooting Guide.
- See Deleting data from a full storage cluster for details.
5.2.5. Backfillfull OSDs
The ceph health detail command returns an error message similar to the following one:
health: HEALTH_WARN 3 backfillfull osd(s) Low space hindering backfill (add storage if this doesn't resolve itself): 32 pgs backfill_toofull
health: HEALTH_WARN
3 backfillfull osd(s)
Low space hindering backfill (add storage if this doesn't resolve itself): 32 pgs backfill_toofull
What this means
When one or more OSDs have exceeded the backfillfull threshold, Ceph prevents data from rebalancing to those devices. This is an early warning that rebalancing might not complete and that the cluster is approaching full. The default for the backfillfull threshold is 90%.
To troubleshoot this problem
Check utilization by pool:
ceph df
ceph df
If %RAW USED is above 70-75%, you can carry out one of the following actions:
- Delete unnecessary data. This is a short-term solution to avoid production downtime.
- Scale the cluster by adding a new OSD node. This is a long-term solution recommended by Red Hat.
Increase the
backfillfull ratio for the OSDs that contain the PGs stuck in backfill_toofull to allow the recovery process to continue. Add new storage to the cluster as soon as possible or remove data to prevent filling more OSDs.
Syntax
ceph osd set-backfillfull-ratio VALUE
ceph osd set-backfillfull-ratio VALUECopy to Clipboard Copied! Toggle word wrap Toggle overflow The range for VALUE is 0.0 to 1.0.
Example
[ceph: root@host01/]# ceph osd set-backfillfull-ratio 0.92
[ceph: root@host01/]# ceph osd set-backfillfull-ratio 0.92Copy to Clipboard Copied! Toggle word wrap Toggle overflow
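To confirm that the new ratio is active, you can inspect the OSD map; a quick verification sketch:
ceph osd dump | grep -i backfillfull_ratio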
5.2.6. Nearfull OSDs
The ceph health detail command returns an error message similar to the following one:
HEALTH_WARN 1 nearfull osds osd.2 is near full at 85%
HEALTH_WARN 1 nearfull osds
osd.2 is near full at 85%
What This Means
Ceph returns the nearfull osds message when the cluster reaches the capacity set by the mon_osd_nearfull_ratio parameter. By default, this parameter is set to 0.85, which means 85% of the cluster capacity.
Ceph distributes data based on the CRUSH hierarchy in the best possible way but it cannot guarantee equal distribution. The main causes of the uneven data distribution and the nearfull osds messages are:
- The OSDs are not balanced among the OSD nodes in the cluster. That is, some OSD nodes host significantly more OSDs than others, or the weight of some OSDs in the CRUSH map is not adequate to their capacity.
- The Placement Group (PG) count is not appropriate for the number of OSDs, the use case, the target PGs per OSD, and the OSD utilization.
- The cluster uses inappropriate CRUSH tunables.
- The back-end storage for OSDs is almost full.
To Troubleshoot This Problem:
- Verify that the PG count is sufficient and increase it if needed.
- Verify that you use CRUSH tunables optimal to the cluster version and adjust them if not.
- Change the weight of OSDs by utilization (see the sketch after this procedure).
Enable the Ceph Manager balancer module, which optimizes the placement of placement groups (PGs) across OSDs to achieve a balanced distribution.
Example
ceph mgr module enable balancer
[root@mon ~]# ceph mgr module enable balancerCopy to Clipboard Copied! Toggle word wrap Toggle overflow Determine how much space is left on the disks used by OSDs.
To view how much space OSDs use in general:
ceph osd df
[root@mon ~]# ceph osd df
To view how much space OSDs use on particular nodes, use the following command from the node containing nearfull OSDs:
df
$ dfCopy to Clipboard Copied! Toggle word wrap Toggle overflow - If needed, add a new OSD node.
Additional Resources
- Full OSDs
- See the Using the Ceph Manager balancer module section in the Red Hat Ceph Storage Operations Guide.
- See the Set an OSD’s Weight by Utilization section in the Storage Strategies guide for Red Hat Ceph Storage 4.
- For details, see the CRUSH Tunables section in the Storage Strategies guide for Red Hat Ceph Storage 4 and the How can I test the impact CRUSH map tunable modifications will have on my PG distribution across OSDs in Red Hat Ceph Storage? solution on the Red Hat Customer Portal.
- See Increasing the placement group for details.
5.2.7. Down OSDs
The ceph health command returns an error similar to the following one:
HEALTH_WARN 1/3 in osds are down
HEALTH_WARN 1/3 in osds are down
What This Means
One of the ceph-osd processes is unavailable due to a possible service failure or problems with communication with other OSDs. As a consequence, the surviving ceph-osd daemons reported this failure to the Monitors.
If the ceph-osd daemon is not running, the underlying OSD drive or file system is either corrupted, or some other error, such as a missing keyring, is preventing the daemon from starting.
In most cases, networking issues cause the situation when the ceph-osd daemon is running but still marked as down.
To Troubleshoot This Problem
Determine which OSD is
down:ceph health detail
[root@mon ~]# ceph health detail HEALTH_WARN 1/3 in osds are down osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080Copy to Clipboard Copied! Toggle word wrap Toggle overflow Try to restart the
ceph-osddaemon:systemctl restart ceph-osd@OSD_NUMBER
[root@mon ~]# systemctl restart ceph-osd@OSD_NUMBERCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace
OSD_NUMBERwith the ID of the OSD that isdown, for example:systemctl restart ceph-osd@0
[root@mon ~]# systemctl restart ceph-osd@0Copy to Clipboard Copied! Toggle word wrap Toggle overflow -
If you are not able to start ceph-osd, follow the steps in The ceph-osd daemon cannot start.
-
If you are able to start the ceph-osd daemon but it is marked as down, follow the steps in The ceph-osd daemon is running but still marked as `down`.
The ceph-osd daemon cannot start
- If you have a node containing a number of OSDs (generally, more than twelve), verify that the default maximum number of threads (PID count) is sufficient. See Increasing the PID count for details.
-
Verify that the OSD data and journal partitions are mounted properly. You can use the
ceph-volume lvm listcommand to list all devices and volumes associated with the Ceph Storage Cluster and then manually inspect if they are mounted properly. See themount(8)manual page for details. -
If you got the
ERROR: missing keyring, cannot use cephx for authentication error message, the OSD is missing a keyring.
ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-1error message, theceph-osddaemon cannot read the underlying file system. See the following steps for instructions on how to troubleshoot and fix this error.NoteIf this error message is returned during boot time of the OSD host, open a support ticket as this might indicate a known issue tracked in the Red Hat Bugzilla 1439210.
Check the corresponding log file to determine the cause of the failure. By default, Ceph stores log files in the
/var/log/ceph/directory for bare-metal deployments.NoteFor container-based deployment, Ceph generates logs to
journald. You can enable logging to files in /var/log/ceph by setting the log_to_file parameter to true under [global] in the Ceph configuration file. See Understanding ceph logs for more details.
An EIO error message similar to the following one indicates a failure of the underlying disk:
To fix this problem, replace the underlying OSD disk. See Replacing an OSD drive for details.
If the log includes any other
FAILED asserterrors, such as the following one, open a support ticket. See Contacting Red Hat Support for service for details.FAILED assert(0 == "hit suicide timeout")
FAILED assert(0 == "hit suicide timeout")Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Check the
dmesgoutput for the errors with the underlying file system or disk:dmesg
$ dmesgCopy to Clipboard Copied! Toggle word wrap Toggle overflow -
If the
dmesgoutput includes anySCSI errorerror messages, see the SCSI Error Codes Solution Finder solution on the Red Hat Customer Portal to determine the best way to fix the problem. - Alternatively, if you are unable to fix the underlying file system, replace the OSD drive. See Replacing an OSD drive for details.
-
If the
If the OSD failed with a segmentation fault, such as the following one, gather the required information and open a support ticket. See Contacting Red Hat Support for service for details.
Caught signal (Segmentation fault)
Caught signal (Segmentation fault)Copy to Clipboard Copied! Toggle word wrap Toggle overflow
The ceph-osd is running but still marked as down
Check the corresponding log file to determine the cause of the failure. By default, Ceph stores log files in the
/var/log/ceph/directory in bare-metal deployments.NoteFor container-based deployment, Ceph generates logs to
journald. You can enable logging to files in/var/log/cephby settinglog_to_fileparameter totrueunder [global] in the Ceph configuration file. See Understanding ceph logs for more details.If the log includes error messages similar to the following ones, see Flapping OSDs.
wrongly marked me down heartbeat_check: no reply from osd.2 since back
wrongly marked me down heartbeat_check: no reply from osd.2 since backCopy to Clipboard Copied! Toggle word wrap Toggle overflow - If you see any other errors, open a support ticket. See Contacting Red Hat Support for service for details.
Additional Resources
- Flapping OSDs
- Stale placement groups
- See the Starting, stopping, restarting the Ceph daemon by instances section in the Red Hat Ceph Storage Administration Guide.
- See the Managing Ceph keyrings section in the Red Hat Ceph Storage Administration Guide.
5.2.8. Flapping OSDs
The ceph -w | grep osds command shows OSDs repeatedly as down and then up again within a short period of time:
In addition the Ceph log contains error messages similar to the following ones:
2021-07-25 03:44:06.510583 osd.50 127.0.0.1:6801/149046 18992 : cluster [WRN] map e600547 wrongly marked me down
2021-07-25 03:44:06.510583 osd.50 127.0.0.1:6801/149046 18992 : cluster [WRN] map e600547 wrongly marked me down
2021-07-25 19:00:08.906864 7fa2a0033700 -1 osd.254 609110 heartbeat_check: no reply from osd.2 since back 2021-07-25 19:00:07.444113 front 2021-07-25 18:59:48.311935 (cutoff 2021-07-25 18:59:48.906862)
2021-07-25 19:00:08.906864 7fa2a0033700 -1 osd.254 609110 heartbeat_check: no reply from osd.2 since back 2021-07-25 19:00:07.444113 front 2021-07-25 18:59:48.311935 (cutoff 2021-07-25 18:59:48.906862)
What This Means
The main causes of flapping OSDs are:
- Certain storage cluster operations, such as scrubbing or recovery, take an abnormal amount of time, for example if you perform these operations on objects with a large index or large placement groups. Usually, after these operations finish, the flapping OSDs problem is solved.
-
Problems with the underlying physical hardware. In this case, the
ceph health detailcommand also returns theslow requestserror message. - Problems with network.
Ceph OSDs cannot manage situations where the private network for the storage cluster fails, or significant latency is on the public client-facing network.
Ceph OSDs use the private network for sending heartbeat packets to each other to indicate that they are up and in. If the private storage cluster network does not work properly, OSDs are unable to send and receive the heartbeat packets. As a consequence, they report each other as being down to the Ceph Monitors, while marking themselves as up.
The following parameters in the Ceph configuration file influence this behavior:
| Parameter | Description | Default value |
|---|---|---|
| osd_heartbeat_grace | How long OSDs wait for the heartbeat packets to return before reporting an OSD as down to the Ceph Monitors. | 20 seconds |
| mon_osd_min_down_reporters | How many OSDs must report another OSD as down for the Ceph Monitors to accept the report. | 2 |
This table shows that, with the default configuration, the Ceph Monitors mark an OSD as down after at least two other OSDs report it as down. In some cases, if one single host encounters network issues, the entire cluster can experience flapping OSDs. This is because the OSDs that reside on the host will report other OSDs in the cluster as down.
The flapping OSDs scenario does not include the situation when the OSD processes are started and then immediately killed.
To Troubleshoot This Problem
Check the output of the
ceph health detail command again. If it includes the slow requests error message, see Slow requests or requests are blocked for details on how to troubleshoot this issue.
Determine which OSDs are marked as
downand on what nodes they reside:ceph osd tree | grep down
# ceph osd tree | grep downCopy to Clipboard Copied! Toggle word wrap Toggle overflow - On the nodes containing the flapping OSDs, troubleshoot and fix any networking problems. For details, see Troubleshooting networking issues.
Alternatively, you can temporarily force Monitors to stop marking the OSDs as
downandupby setting thenoupandnodownflags:ceph osd set noup ceph osd set nodown
# ceph osd set noup # ceph osd set nodownCopy to Clipboard Copied! Toggle word wrap Toggle overflow ImportantUsing the
noupandnodownflags does not fix the root cause of the problem but only prevents OSDs from flapping. To open a support ticket, see the Contacting Red Hat Support for service section for details.
Flapping OSDs can be caused by MTU misconfiguration on Ceph OSD nodes, at the network switch level, or both. To resolve the issue, set MTU to a uniform size on all storage cluster nodes, including on the core and access network switches with a planned downtime. Do not tune osd heartbeat min size because changing this setting can hide issues within the network, and it will not solve actual network inconsistency.
Additional Resources
- See the Verifying the Network Configuration for Red Hat Ceph Storage section in the Red Hat Ceph Storage Installation Guide for details.
- See the Ceph heartbeat section in the Red Hat Ceph Storage Architecture Guide for details.
- See the Slow requests or requests are blocked section in the Red Hat Ceph Storage Troubleshooting Guide.
- See Red Hat’s Knowledgebase solution How to reduce scrub impact in a Red Hat Ceph Storage cluster? for tuning scrubbing process.
5.2.9. Slow requests or requests are blocked
The ceph-osd daemon is slow to respond to a request and the ceph health detail command returns an error message similar to the following one:
In addition, the Ceph logs include an error message similar to the following ones:
2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN] 6 slow requests, 6 included below; oldest blocked for > 61.758455 secs
2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN] 6 slow requests, 6 included below; oldest blocked for > 61.758455 secs
2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
What This Means
An OSD with slow requests is an OSD that is not able to service the I/O operations per second (IOPS) in its queue within the time defined by the osd_op_complaint_time parameter. By default, this parameter is set to 30 seconds.
The main causes of OSDs having slow requests are:
- Problems with the underlying hardware, such as disk drives, hosts, racks, or network switches
- Problems with network. These problems are usually connected with flapping OSDs. See Flapping OSDs for details.
- System load
The following table shows the types of slow requests. Use the dump_historic_ops administration socket command to determine the type of a slow request. For details about the administration socket, see the Using the Ceph Administration Socket section in the Administration Guide for Red Hat Ceph Storage 4.
| Slow request type | Description |
|---|---|
|
| The OSD is waiting to acquire a lock on a placement group for the operation. |
|
| The OSD is waiting for replica OSDs to apply the operation to the journal. |
|
| The OSD did not reach any major operation milestone. |
|
| The OSDs have not replicated an object the specified number of times yet. |
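To see which type of slow request an OSD is accumulating, you can query its administration socket from the node that hosts the OSD; osd.0 below is only an example:
ceph daemon osd.0 dump_historic_ops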
To Troubleshoot This Problem
- Determine if the OSDs with slow or block requests share a common piece of hardware, for example a disk drive, host, rack, or network switch.
If the OSDs share a disk:
Use the
smartmontools utility to check the health of the disk or the logs to determine any errors on the disk.
NoteThe
smartmontoolsutility is included in thesmartmontoolspackage.Use the
iostat utility to get the I/O wait report (%iowait) on the OSD disk to determine whether the disk is under heavy load (see the sketch after this procedure).
NoteThe
iostatutility is included in thesysstatpackage.
If the OSDs share the node with another service:
- Check the RAM and CPU utilization
-
Use the
netstatutility to see the network statistics on the Network Interface Controllers (NICs) and troubleshoot any networking issues. See also Troubleshooting networking issues for further information.
- If the OSDs share a rack, check the network switch for the rack. For example, if you use jumbo frames, verify that the NIC in the path has jumbo frames set.
- If you are unable to determine a common piece of hardware shared by OSDs with slow requests, or to troubleshoot and fix hardware and networking problems, open a support ticket. See Contacting Red Hat support for service for details.
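For the smartmontools and iostat checks mentioned in this procedure, a minimal sketch might look like the following; the device /dev/sdb and the 5-second interval are assumptions:
# SMART health summary for the disk backing the OSD
smartctl -a /dev/sdb
# Extended I/O statistics refreshed every 5 seconds; watch the %iowait and %util columns
iostat -x 5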
Additional Resources
- See the Using the Ceph Administration Socket section in the Red Hat Ceph Storage Administration Guide for details.
5.3. Stopping and starting rebalancing
When an OSD fails or you stop it, the CRUSH algorithm automatically starts the rebalancing process to redistribute data across the remaining OSDs.
Rebalancing can take time and resources; therefore, consider stopping rebalancing while troubleshooting or maintaining OSDs.
Placement groups within the stopped OSDs become degraded during troubleshooting and maintenance.
Prerequisites
- Root-level access to the Ceph Monitor node.
Procedure
To stop rebalancing, set the noout flag before stopping the OSD:
ceph osd set noout
[root@mon ~]# ceph osd set nooutCopy to Clipboard Copied! Toggle word wrap Toggle overflow When you finish troubleshooting or maintenance, unset the
nooutflag to start rebalancing:ceph osd unset noout
[root@mon ~]# ceph osd unset nooutCopy to Clipboard Copied! Toggle word wrap Toggle overflow
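To confirm whether the noout flag is currently set, you can check the flags recorded in the OSD map; for example:
ceph osd dump | grep flags
The flag is also reported by the ceph health and ceph -s commands while it is set.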
Additional Resources
- The Rebalancing and Recovery section in the Red Hat Ceph Storage Architecture Guide.
5.4. Mounting the OSD data partition
If the OSD data partition is not mounted correctly, the ceph-osd daemon cannot start. If you discover that the partition is not mounted as expected, follow the steps in this section to mount it.
This section is specific to bare-metal deployments only.
Prerequisites
-
Access to the
ceph-osddaemon. - Root-level access to the Ceph Monitor node.
Procedure
Mount the partition:
mount -o noatime PARTITION /var/lib/ceph/osd/CLUSTER_NAME-OSD_NUMBER
[root@ceph-mon]# mount -o noatime PARTITION /var/lib/ceph/osd/CLUSTER_NAME-OSD_NUMBERCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace
PARTITIONwith the path to the partition on the OSD drive dedicated to OSD data. Specify the cluster name and the OSD number.Example
mount -o noatime /dev/sdd1 /var/lib/ceph/osd/ceph-0
[root@ceph-mon]# mount -o noatime /dev/sdd1 /var/lib/ceph/osd/ceph-0Copy to Clipboard Copied! Toggle word wrap Toggle overflow Try to start the failed
ceph-osddaemon:systemctl start ceph-osd@OSD_NUMBER
[root@ceph-mon]# systemctl start ceph-osd@OSD_NUMBERCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace the
OSD_NUMBERwith the ID of the OSD.Example
systemctl start ceph-osd@0
[root@ceph-mon]# systemctl start ceph-osd@0Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Additional Resources
- See the Down OSDs in the Red Hat Ceph Storage Troubleshooting Guide for more details.
5.5. Replacing an OSD drive
Ceph is designed for fault tolerance, which means that it can operate in a degraded state without losing data. Consequently, Ceph can operate even if a data storage drive fails. In the context of a failed drive, the degraded state means that the extra copies of the data stored on other OSDs will backfill automatically to other OSDs in the cluster. However, if this occurs, replace the failed OSD drive and recreate the OSD manually.
When a drive fails, Ceph reports the OSD as down:
HEALTH_WARN 1/3 in osds are down osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
Ceph can mark an OSD as down also as a consequence of networking or permissions problems. See Down OSDs for details.
Modern servers typically deploy with hot-swappable drives so you can pull a failed drive and replace it with a new one without bringing down the node. The whole procedure includes these steps:
- Remove the OSD from the Ceph cluster. For details, see the Removing an OSD from the Ceph Cluster procedure.
- Replace the drive. For details, see the Replacing the physical drive section.
- Add the OSD to the cluster. For details, see the Adding an OSD to the Ceph Cluster procedure.
Prerequisites
- Root-level access to the Ceph Monitor node.
Determine which OSD is
down:ceph osd tree | grep -i down
[root@mon ~]# ceph osd tree | grep -i down ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY 0 0.00999 osd.0 down 1.00000 1.00000Copy to Clipboard Copied! Toggle word wrap Toggle overflow Ensure that the OSD process is stopped. Use the following command from the OSD node:
systemctl status ceph-osd@_OSD_NUMBER_
[root@mon ~]# systemctl status ceph-osd@_OSD_NUMBER_Copy to Clipboard Copied! Toggle word wrap Toggle overflow Replace
OSD_NUMBERwith the ID of the OSD marked asdown, for example:systemctl status ceph-osd@osd.0
[root@mon ~]# systemctl status ceph-osd@osd.0 ... Active: inactive (dead)Copy to Clipboard Copied! Toggle word wrap Toggle overflow If the
ceph-osd daemon is running, see Down OSDs for more details about troubleshooting OSDs that are marked as down but whose corresponding ceph-osd daemon is running.
Procedure: Removing an OSD from the Ceph Cluster
Mark the OSD as
out:ceph osd out osd.OSD_NUMBER
[root@mon ~]# ceph osd out osd.OSD_NUMBERCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace
OSD_NUMBERwith the ID of the OSD that is marked asdown, for example:ceph osd out osd.0
[root@mon ~]# ceph osd out osd.0 marked out osd.0.Copy to Clipboard Copied! Toggle word wrap Toggle overflow NoteIf the OSD is
down, Ceph marks it asoutautomatically after 600 seconds when it does not receive any heartbeat packet from the OSD. When this happens, other OSDs with copies of the failed OSD data begin backfilling to ensure that the required number of copies exists within the cluster. While the cluster is backfilling, the cluster will be in adegradedstate.Ensure that the failed OSD is backfilling. The output will include information similar to the following one:
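The example output is not reproduced above. As a sketch, one way to watch whether placement groups are backfilling is to filter the cluster status for backfill states:
ceph -s | grep -i backfill
Alternatively, ceph -w streams cluster events, including placement groups entering and leaving the backfilling state.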
Remove the OSD from the CRUSH map:
ceph osd crush remove osd.OSD_NUMBER
[root@mon ~]# ceph osd crush remove osd.OSD_NUMBERCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace
OSD_NUMBERwith the ID of the OSD that is marked asdown, for example:ceph osd crush remove osd.0
[root@mon ~]# ceph osd crush remove osd.0 removed item id 0 name 'osd.0' from crush mapCopy to Clipboard Copied! Toggle word wrap Toggle overflow Remove authentication keys related to the OSD:
ceph auth del osd.OSD_NUMBER
[root@mon ~]# ceph auth del osd.OSD_NUMBERCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace
OSD_NUMBERwith the ID of the OSD that is marked asdown, for example:ceph auth del osd.0
[root@mon ~]# ceph auth del osd.0 updatedCopy to Clipboard Copied! Toggle word wrap Toggle overflow Remove the OSD from the Ceph Storage Cluster:
ceph osd rm osd.OSD_NUMBER
[root@mon ~]# ceph osd rm osd.OSD_NUMBERCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace
OSD_NUMBERwith the ID of the OSD that is marked asdown, for example:ceph osd rm osd.0
[root@mon ~]# ceph osd rm osd.0 removed osd.0Copy to Clipboard Copied! Toggle word wrap Toggle overflow If you have removed the OSD successfully, it is not present in the output of the following command:
ceph osd tree
[root@mon ~]# ceph osd treeCopy to Clipboard Copied! Toggle word wrap Toggle overflow For bare-metal deployments, unmount the failed drive:
umount /var/lib/ceph/osd/CLUSTER_NAME-OSD_NUMBER
[root@mon ~]# umount /var/lib/ceph/osd/CLUSTER_NAME-OSD_NUMBERCopy to Clipboard Copied! Toggle word wrap Toggle overflow Specify the name of the cluster and the ID of the OSD, for example:
umount /var/lib/ceph/osd/ceph-0/
[root@mon ~]# umount /var/lib/ceph/osd/ceph-0/Copy to Clipboard Copied! Toggle word wrap Toggle overflow If you have unmounted the drive successfully, it is not present in the output of the following command:
df -h
[root@mon ~]# df -hCopy to Clipboard Copied! Toggle word wrap Toggle overflow
Procedure: Replacing the physical drive
See the documentation for the hardware node for details on replacing the physical drive.
- If the drive is hot-swappable, replace the failed drive with a new one.
- If the drive is not hot-swappable and the node contains multiple OSDs, you might have to shut down the whole node and replace the physical drive. Consider preventing the cluster from backfilling. See the Stopping and Starting Rebalancing chapter in the Red Hat Ceph Storage Troubleshooting Guide for details.
-
When the drive appears under the
/dev/directory, make a note of the drive path. - If you want to add the OSD manually, find the OSD drive and format the disk.
Procedure: Adding an OSD to the Ceph Cluster
Add the OSD again.
If you used Ansible to deploy the cluster, run the
ceph-ansibleplaybook again from the Ceph administration server:Bare-metal deployments:
Syntax
ansible-playbook site.yml -i hosts --limit NEW_OSD_NODE_NAME
ansible-playbook site.yml -i hosts --limit NEW_OSD_NODE_NAMECopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
ansible-playbook site.yml -i hosts --limit node03
[user@admin ceph-ansible]$ ansible-playbook site.yml -i hosts --limit node03Copy to Clipboard Copied! Toggle word wrap Toggle overflow Container deployments:
Syntax
ansible-playbook site-container.yml -i hosts --limit NEW_OSD_NODE_NAME
ansible-playbook site-container.yml -i hosts --limit NEW_OSD_NODE_NAMECopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
ansible-playbook site-container.yml -i hosts --limit node03
[user@admin ceph-ansible]$ ansible-playbook site-container.yml -i hosts --limit node03Copy to Clipboard Copied! Toggle word wrap Toggle overflow
- If you added the OSD manually, see the Adding a Ceph OSD with the Command-line Interface section in the Red Hat Ceph Storage 4 Operations Guide.
Ensure that the CRUSH hierarchy is accurate:
ceph osd tree
[root@mon ~]# ceph osd treeCopy to Clipboard Copied! Toggle word wrap Toggle overflow If you are not satisfied with the location of the OSD in the CRUSH hierarchy, move the OSD to a desired location:
ceph osd crush move BUCKET_TO_MOVE BUCKET_TYPE=PARENT_BUCKET
[root@mon ~]# ceph osd crush move BUCKET_TO_MOVE BUCKET_TYPE=PARENT_BUCKETCopy to Clipboard Copied! Toggle word wrap Toggle overflow For example, to move the bucket located at
ssd:row1 to the root bucket:
ceph osd crush move ssd:row1 root=ssd:root
[root@mon ~]# ceph osd crush move ssd:row1 root=ssd:rootCopy to Clipboard Copied! Toggle word wrap Toggle overflow
Additional Resources
- See the Down OSDs section in the Red Hat Ceph Storage Troubleshooting Guide.
- See the Managing the storage cluster size chapter in the Red Hat Ceph Storage Operations Guide.
- See the Red Hat Ceph Storage Installation Guide.
5.6. Increasing the PID count
If you have a node containing more than 12 Ceph OSDs, the default maximum number of threads (PID count) can be insufficient, especially during recovery. As a consequence, some ceph-osd daemons can terminate and fail to start again. If this happens, increase the maximum possible number of threads allowed.
Procedure
To temporarily increase the number:
sysctl -w kernel.pid_max=4194303
[root@mon ~]# sysctl -w kernel.pid_max=4194303
To permanently increase the number, update the /etc/sysctl.conf file as follows:
kernel.pid_max = 4194303
kernel.pid_max = 4194303
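After updating /etc/sysctl.conf, the new value can be loaded without a reboot:
sysctl -p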
5.7. Deleting data from a full storage cluster
Ceph automatically prevents any I/O operations on OSDs that reached the capacity specified by the mon_osd_full_ratio parameter and returns the full osds error message.
This procedure shows how to delete unnecessary data to fix this error.
The mon_osd_full_ratio parameter sets the value of the full_ratio parameter when the cluster is created. You cannot change the value of mon_osd_full_ratio afterwards. To temporarily increase the full_ratio value, use the ceph osd set-full-ratio command instead.
Prerequisites
- Root-level access to the Ceph Monitor node.
Procedure
Determine the current value of
full_ratio, by default it is set to0.95:ceph osd dump | grep -i full
[root@mon ~]# ceph osd dump | grep -i full full_ratio 0.95Copy to Clipboard Copied! Toggle word wrap Toggle overflow Temporarily increase the value of
set-full-ratioto0.97:ceph osd set-full-ratio 0.97
[root@mon ~]# ceph osd set-full-ratio 0.97
ImportantRed Hat strongly recommends not setting set-full-ratio to a value higher than 0.97. Setting this parameter to a higher value makes the recovery process harder. As a consequence, you might not be able to recover full OSDs at all.
Verify that you successfully set the parameter to
0.97:ceph osd dump | grep -i full
[root@mon ~]# ceph osd dump | grep -i full full_ratio 0.97Copy to Clipboard Copied! Toggle word wrap Toggle overflow Monitor the cluster state:
ceph -w
[root@mon ~]# ceph -wCopy to Clipboard Copied! Toggle word wrap Toggle overflow As soon as the cluster changes its state from
fulltonearfull, delete any unnecessary data.Set the value of
full_ratioback to0.95:ceph osd set-full-ratio 0.95
[root@mon ~]# ceph osd set-full-ratio 0.95Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that you successfully set the parameter to
0.95:ceph osd dump | grep -i full
[root@mon ~]# ceph osd dump | grep -i full full_ratio 0.95Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Additional Resources
- Full OSDs section in the Red Hat Ceph Storage Troubleshooting Guide.
- Nearfull OSDs section in the Red Hat Ceph Storage Troubleshooting Guide.
5.8. Redeploying OSDs after upgrading the storage cluster
This section describes how to redeploy OSDs after upgrading from Red Hat Ceph Storage 3 to Red Hat Ceph Storage 4 with non-collocated daemons for OSDs with block.db on dedicated devices, without upgrading the operating system.
This procedure applies to both bare-metal and container deployments, unless specified.
After the upgrade, the playbook for redeploying OSDs can fail with an error message:
GPT headers found, they must be removed on: /dev/vdb
GPT headers found, they must be removed on: /dev/vdb
You can redeploy the OSDs by creating a partition in the block.db device and running the Ansible playbook.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ansible Administration node.
- Ansible user account created.
Procedure
Create the partition on the
block.dbdevice. Thissgdiskcommand uses the next available partition number automatically:Syntax
sgdisk --new=0:0:_JOURNAL_SIZE_ -- NEW_DEVICE_PATH
sgdisk --new=0:0:_JOURNAL_SIZE_ -- NEW_DEVICE_PATHCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
sgdisk --new=0:0:+2G -- /dev/vdb
[root@admin ~]# sgdisk --new=0:0:+2G -- /dev/vdbCopy to Clipboard Copied! Toggle word wrap Toggle overflow Create the
host_varsdirectory:mkdir /usr/share/ceph-ansible/host_vars
[root@admin ~]# mkdir /usr/share/ceph-ansible/host_varsCopy to Clipboard Copied! Toggle word wrap Toggle overflow Navigate to the
host_varsdirectory:cd /usr/share/ceph-ansible/host_vars
[root@admin ~]# cd /usr/share/ceph-ansible/host_varsCopy to Clipboard Copied! Toggle word wrap Toggle overflow Create the hosts file on all the hosts of the storage cluster:
Syntax
touch NEW_OSD_HOST_NAME
touch NEW_OSD_HOST_NAMECopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
touch osd5
[root@admin host_vars]# touch osd5Copy to Clipboard Copied! Toggle word wrap Toggle overflow In the hosts file, define the data device:
Syntax
Example
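The Syntax and Example blocks for this step are not reproduced above. With ceph-ansible, the data and block.db devices for a host are typically defined through the lvm_volumes variable; the device paths below are hypothetical and the exact keys depend on your ceph-ansible version, so treat this as a sketch only:
lvm_volumes:
  - data: /dev/vdc      # hypothetical data device for the OSD
    db: /dev/vdb2       # hypothetical block.db partition created with sgdisk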
Switch to the Ansible user and verify that Ansible can reach all the Ceph nodes:
ansible all -m ping
[admin@admin ~]$ ansible all -m pingCopy to Clipboard Copied! Toggle word wrap Toggle overflow Change directory to the Ansible configuration directory:
cd /usr/share/ceph-ansible
[admin@admin ~]$ cd /usr/share/ceph-ansibleCopy to Clipboard Copied! Toggle word wrap Toggle overflow Run the following Ansible playbook with
--limitoption:Bare-metal deployments:
ansible-playbook site.yml --limit osds -i hosts
[admin@admin ceph-ansible]$ ansible-playbook site.yml --limit osds -i hostsCopy to Clipboard Copied! Toggle word wrap Toggle overflow Container deployments:
ansible-playbook site-container.yml --limit osds -i hosts
[admin@admin ceph-ansible]$ ansible-playbook site-container.yml --limit osds -i hostsCopy to Clipboard Copied! Toggle word wrap Toggle overflow
Chapter 6. Troubleshooting Ceph MDSs
As a storage administrator, you can troubleshoot the most common issues that can occur when using the Ceph Metadata Server (MDS). Some of the common errors that you might encounter:
- An MDS node failure requiring a new MDS deployment.
- An MDS node issue requiring redeployment of an MDS node.
6.1. Redeploying a Ceph MDS
Ceph Metadata Server (MDS) daemons are necessary for deploying a Ceph File System. If an MDS node in your cluster fails, you can redeploy a Ceph Metadata Server by removing an MDS server and adding a new or existing server. You can use the command-line interface or Ansible playbook to add or remove an MDS server.
6.1.1. Prerequisites
- A running Red Hat Ceph Storage cluster.
6.1.2. Removing a Ceph MDS using Ansible
To remove a Ceph Metadata Server (MDS) using Ansible, use the shrink-mds playbook.
If there is no replacement MDS to take over once the MDS is removed, the file system will become unavailable to clients. If that is not desirable, consider adding an additional MDS before removing the MDS you would like to take offline.
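Before running the playbook, it can help to confirm how many MDS daemons and standbys are currently serving the file system; a quick check:
ceph mds stat
ceph fs status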
Prerequisites
- At least one MDS node.
- A running Red Hat Ceph Storage cluster deployed by Ansible.
-
Rootorsudoaccess to an Ansible administration node.
Procedure
- Log in to the Ansible administration node.
Change to the
/usr/share/ceph-ansibledirectory:Example
cd /usr/share/ceph-ansible
[ansible@admin ~]$ cd /usr/share/ceph-ansibleCopy to Clipboard Copied! Toggle word wrap Toggle overflow Run the Ansible
shrink-mds.ymlplaybook, and when prompted, typeyesto confirm shrinking the cluster:Syntax
ansible-playbook infrastructure-playbooks/shrink-mds.yml -e mds_to_kill=ID -i hosts
ansible-playbook infrastructure-playbooks/shrink-mds.yml -e mds_to_kill=ID -i hostsCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace ID with the ID of the MDS node you want to remove. You can remove only one Ceph MDS each time the playbook runs.
Example
ansible-playbook infrastructure-playbooks/shrink-mds.yml -e mds_to_kill=node02 -i hosts
[ansible @admin ceph-ansible]$ ansible-playbook infrastructure-playbooks/shrink-mds.yml -e mds_to_kill=node02 -i hostsCopy to Clipboard Copied! Toggle word wrap Toggle overflow As
rootor withsudoaccess, open and edit the/usr/share/ceph-ansible/hostsinventory file and remove the MDS node under the[mdss]section:Syntax
[mdss] MDS_NODE_NAME MDS_NODE_NAME
[mdss] MDS_NODE_NAME MDS_NODE_NAMECopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
[mdss] node01 node03
[mdss] node01 node03Copy to Clipboard Copied! Toggle word wrap Toggle overflow In this example,
node02was removed from the[mdss]list.
Verification
Check the status of the MDS daemons:
Syntax
ceph fs dump
ceph fs dumpCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.1.3. Removing a Ceph MDS using the command-line interface
You can manually remove a Ceph Metadata Server (MDS) using the command-line interface.
If there is no replacement MDS to take over once the current MDS is removed, the file system will become unavailable to clients. If that is not desirable, consider adding an MDS before removing the existing MDS.
Prerequisites
-
The
ceph-commonpackage is installed. - A running Red Hat Ceph Storage cluster.
-
Rootorsudoaccess to the MDS nodes.
Procedure
- Log into the Ceph MDS node that you want to remove the MDS daemon from.
Stop the Ceph MDS service:
Syntax
sudo systemctl stop ceph-mds@HOST_NAME
sudo systemctl stop ceph-mds@HOST_NAMECopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace HOST_NAME with the short name of the host where the daemon is running.
Example
sudo systemctl stop ceph-mds@node02
[admin@node02 ~]$ sudo systemctl stop ceph-mds@node02Copy to Clipboard Copied! Toggle word wrap Toggle overflow Disable the MDS service if you are not redeploying MDS to this node:
Syntax
sudo systemctl disable ceph-mds@HOST_NAME
sudo systemctl disable ceph-mds@HOST_NAMECopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace HOST_NAME with the short name of the host to disable the daemon.
Example
sudo systemctl disable ceph-mds@node02
[admin@node02 ~]$ sudo systemctl disable ceph-mds@node02Copy to Clipboard Copied! Toggle word wrap Toggle overflow Remove the
/var/lib/ceph/mds/ceph-MDS_IDdirectory on the MDS node:Syntax
sudo rm -fr /var/lib/ceph/mds/ceph-MDS_ID
sudo rm -fr /var/lib/ceph/mds/ceph-MDS_IDCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace MDS_ID with the ID of the MDS node that you want to remove the MDS daemon from.
Example
sudo rm -fr /var/lib/ceph/mds/ceph-node02
[admin@node02 ~]$ sudo rm -fr /var/lib/ceph/mds/ceph-node02Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Verification
Check the status of the MDS daemons:
Syntax
ceph fs dump
ceph fs dumpCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
6.1.4. Adding a Ceph MDS using Ansible
Use the Ansible playbook to add a Ceph Metadata Server (MDS).
Prerequisites
- A running Red Hat Ceph Storage cluster deployed by Ansible.
-
Rootorsudoaccess to an Ansible administration node. - New or existing servers that can be provisioned as MDS nodes.
Procedure
- Log in to the Ansible administration node
Change to the
/usr/share/ceph-ansibledirectory:Example
cd /usr/share/ceph-ansible
[ansible@admin ~]$ cd /usr/share/ceph-ansibleCopy to Clipboard Copied! Toggle word wrap Toggle overflow As
rootor withsudoaccess, open and edit the/usr/share/ceph-ansible/hostsinventory file and add the MDS node under the[mdss]section:Syntax
[mdss] MDS_NODE_NAME NEW_MDS_NODE_NAME
[mdss] MDS_NODE_NAME NEW_MDS_NODE_NAMECopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace NEW_MDS_NODE_NAME with the host name of the node where you want to install the MDS server.
Alternatively, you can colocate the MDS daemon with the OSD daemon on one node by adding the same node under the
[osds]and[mdss]sections.Example
[mdss] node01 node03
[mdss] node01 node03Copy to Clipboard Copied! Toggle word wrap Toggle overflow As the
ansibleuser, run the Ansible playbook to provision the MDS node:Bare-metal deployments:
ansible-playbook site.yml --limit mdss -i hosts
[ansible@admin ceph-ansible]$ ansible-playbook site.yml --limit mdss -i hostsCopy to Clipboard Copied! Toggle word wrap Toggle overflow Container deployments:
ansible-playbook site-container.yml --limit mdss -i hosts
[ansible@admin ceph-ansible]$ ansible-playbook site-container.yml --limit mdss -i hostsCopy to Clipboard Copied! Toggle word wrap Toggle overflow After the Ansible playbook has finished running, the new Ceph MDS node appears in the storage cluster.
Verification
Check the status of the MDS daemons:
Syntax
ceph fs dump
ceph fs dumpCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Alternatively, you can use the
ceph mds statcommand to check if the MDS is in an active state:Syntax
ceph mds stat
ceph mds statCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
ceph mds stat
[ansible@admin ceph-ansible]$ ceph mds stat cephfs:1 {0=node01=up:active} 1 up:standbyCopy to Clipboard Copied! Toggle word wrap Toggle overflow
6.1.5. Adding a Ceph MDS using the command-line interface
You can manually add a Ceph Metadata Server (MDS) using the command-line interface.
Prerequisites
-
The
ceph-commonpackage is installed. - A running Red Hat Ceph Storage cluster.
-
Rootorsudoaccess to the MDS nodes. - New or existing servers that can be provisioned as MDS nodes.
Procedure
- Add a new MDS node by logging into the node and creating an MDS mount point:

  Syntax

  sudo mkdir /var/lib/ceph/mds/ceph-MDS_ID

  Replace MDS_ID with the ID of the MDS node that you want to add the MDS daemon to.

  Example

  [admin@node03 ~]$ sudo mkdir /var/lib/ceph/mds/ceph-node03

- If this is a new MDS node, create the authentication key if you are using Cephx authentication:

  Syntax

  sudo ceph auth get-or-create mds.MDS_ID mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/ceph-MDS_ID/keyring

  Replace MDS_ID with the ID of the MDS node to deploy the MDS daemon on.

  Example

  [admin@node03 ~]$ sudo ceph auth get-or-create mds.node03 mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' > /var/lib/ceph/mds/ceph-node03/keyring

  Note: Cephx authentication is enabled by default. See the Cephx authentication link in the Additional Resources section for more information about Cephx authentication.

- Start the MDS daemon:

  Syntax

  sudo systemctl start ceph-mds@HOST_NAME

  Replace HOST_NAME with the short name of the host to start the daemon.

  Example

  [admin@node03 ~]$ sudo systemctl start ceph-mds@node03

- Enable the MDS service:

  Syntax

  systemctl enable ceph-mds@HOST_NAME

  Replace HOST_NAME with the short name of the host to enable the service.

  Example

  [admin@node03 ~]$ sudo systemctl enable ceph-mds@node03
Verification

- Check the status of the MDS daemons:

  Syntax

  ceph fs dump

- Alternatively, you can use the ceph mds stat command to check if the MDS is in an active state:

  Syntax

  ceph mds stat

  Example

  [ansible@admin ceph-ansible]$ ceph mds stat
  cephfs:1 {0=node01=up:active} 1 up:standby
Chapter 7. Troubleshooting a multisite Ceph Object Gateway
This chapter contains information on how to fix the most common errors related to multisite Ceph Object Gateways configuration and operational conditions.
7.1. Prerequisites
- A running Red Hat Ceph Storage cluster.
- A running Ceph Object Gateway.
7.2. Error code definitions for the Ceph Object Gateway
The Ceph Object Gateway logs contain error and warning messages to assist in troubleshooting conditions in your environment. Some common ones are listed below with suggested resolutions.
Common error messages
data_sync: ERROR: a sync operation returned error
- This is the high-level data sync process complaining that a lower-level bucket sync process returned an error. This message is redundant; the bucket sync error appears above it in the log.

data sync: ERROR: failed to sync object: BUCKET_NAME:OBJECT_NAME
- Either the process failed to fetch the required object over HTTP from a remote gateway, or the process failed to write that object to RADOS, and it will be tried again.

data sync: ERROR: failure in sync, backing out (sync_status=-2)
- A low-level message reflecting one of the above conditions, specifically that the data was deleted before it could sync, and thus showing a -2 ENOENT status.

data sync: ERROR: failure in sync, backing out (sync_status=-5)
- A low-level message reflecting one of the above conditions, specifically that the process failed to write that object to RADOS, and thus showing a -5 EIO status.

ERROR: failed to fetch remote data log info: ret=11
- This is the EAGAIN generic error code from libcurl reflecting an error condition from another gateway. It will try again by default.

meta sync: ERROR: failed to read mdlog info with (2) No such file or directory
- The shard of the mdlog was never created, so there is nothing to sync.
Syncing error messages
failed to sync object- Either the process failed to fetch this object over HTTP from a remote gateway or it failed to write that object to RADOS and it will be tried again.
failed to sync bucket instance: (11) Resource temporarily unavailable- A connection issue between primary and secondary zones.
failed to sync bucket instance: (125) Operation canceled- A racing condition exists between writes to the same RADOS object.
Additional Resources
- Contact Red Hat Support for any additional assistance.
7.3. Syncing a multisite Ceph Object Gateway
A multisite sync reads the change log from other zones. To get a high-level view of the sync progress from the metadata and the data logs, you can use the following command:
radosgw-admin sync status
radosgw-admin sync status
This command lists which log shards, if any, are behind their source zone.

Sometimes you might observe recovering shards when running the radosgw-admin sync status command. For data sync, there are 128 shards of replication logs that are each processed independently. If any of the actions triggered by these replication log events result in an error from the network, storage, or elsewhere, those errors get tracked so the operation can retry later. While a given shard has errors that need a retry, the radosgw-admin sync status command reports that shard as recovering. This recovery happens automatically, so the operator does not need to intervene to resolve it.

If the sync status reported above shows that log shards are behind, run the following command, substituting the shard ID for X.
Syntax
radosgw-admin data sync status --shard-id=X --source-zone=ZONE_NAME
radosgw-admin data sync status --shard-id=X --source-zone=ZONE_NAME
Example
The output lists which buckets are next to sync and which buckets, if any, are going to be retried due to previous errors.
Inspect the status of individual buckets with the following command, substituting the bucket ID for X.

radosgw-admin bucket sync status --bucket=X

- Replace X with the ID of the bucket.

The result shows which bucket index log shards are behind their source zone.

A common error in sync is EBUSY, which means the sync is already in progress, often on another gateway. Errors are written to the sync error log, which can be read with the following command:
radosgw-admin sync error list
radosgw-admin sync error list
The syncing process will try again until it is successful. Errors can still occur that can require intervention.
7.3.1. Performance counters for multi-site Ceph Object Gateway data sync
The following performance counters are available for multi-site configurations of the Ceph Object Gateway to measure data sync:
- poll_latency measures the latency of requests for remote replication logs.
- fetch_bytes measures the number of objects and bytes fetched by data sync.
Use the ceph daemon command to view the current metric data for the performance counters:
Syntax
ceph daemon /var/run/ceph/ceph-client.rgw.RGW_ID.asok perf dump data-sync-from-ZONE_NAME
ceph daemon /var/run/ceph/ceph-client.rgw.RGW_ID.asok perf dump data-sync-from-ZONE_NAME
Example
You must run the ceph daemon command from the node running the daemon.
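For illustration only, the two counters can be pulled out of the perf dump output with a small filter. The gateway ID rgw1, the zone name us-west, and the use of jq are assumptions for this sketch, not part of the documented procedure; the filter also assumes that the output is keyed by the data-sync-from-ZONE_NAME counter group name:

# Hypothetical gateway "rgw1" pulling from zone "us-west"; adjust to your deployment.
[root@rgw ~]# ceph daemon /var/run/ceph/ceph-client.rgw.rgw1.asok perf dump data-sync-from-us-west | jq '."data-sync-from-us-west" | {poll_latency, fetch_bytes}'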
Additional Resources
- See the Ceph performance counters chapter in the Red Hat Ceph Storage Administration Guide for more information about performance counters.
Chapter 8. Troubleshooting the Ceph iSCSI gateway (Limited Availability)
As a storage administrator, you can troubleshoot most common errors that can occur when using the Ceph iSCSI gateway. These are some of the common errors that you might encounter:
- iSCSI login issues.
- VMware ESXi reporting various connection failures.
- Timeout errors.
This technology is Limited Availability. See the Deprecated functionality chapter for additional information.
8.1. Prerequisites
- A running Red Hat Ceph Storage cluster.
- A running Ceph iSCSI gateway.
- Verify the network connections.
8.2. Gathering information for lost connections causing storage failures on VMware ESXi
Collecting system and disk information helps determine which iSCSI target has lost a connection and is possibly causing storage failures. If needed, this information can also be provided to Red Hat’s Global Support Services to aid in troubleshooting any Ceph iSCSI gateway issues.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- A running Ceph iSCSI gateway, the iSCSI target.
- A running VMware ESXi environment, the iSCSI initiator.
- Root-level access to the VMware ESXi node.
Procedure
- On the VMware ESXi node, open the kernel log:

  [root@esx:~]# more /var/log/vmkernel.log

- Gather information from the following error messages in the VMware ESXi kernel log:

  Example

  2020-03-30T11:07:07.570Z cpu32:66506)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Sess [ISID: 00023d000005 TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw TPGT: 3 TSIH: 0]

  From this message, make a note of the ISID number, the TARGET name, and the Target Portal Group Tag (TPGT) number. For this example, we have the following:

  ISID: 00023d000005
  TARGET: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw
  TPGT: 3

  Example

  2020-03-30T11:07:07.570Z cpu32:66506)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: vmhba64:CH:4 T:0 CN:0: Connection rx notifying failure: Failed to Receive. State=Bound

  From this message, make a note of the adapter channel (CH) number. For this example, we have the following:

  vmhba64:CH:4 T:0

- To find the remote address of the Ceph iSCSI gateway node:

  [root@esx:~]# esxcli iscsi session connection list

  From the command output, match the ISID value and the TARGET name value gathered previously, then make a note of the RemoteAddress value. From this example, we have the following:

  Target: iqn.2017-12.com.redhat.iscsi-gw:ceph-igw
  ISID: 00023d000003
  RemoteAddress: 10.2.132.2

  Now, you can collect more information from the Ceph iSCSI gateway node to further troubleshoot the issue.
- On the Ceph iSCSI gateway node mentioned by the RemoteAddress value, run an sosreport to gather system information:

  [root@igw ~]# sosreport

- To find a disk that went into a dead state:

  [root@esx:~]# esxcli storage nmp device list

  From the command output, match the CH number and the TPGT number gathered previously, then make a note of the Device value. For this example, we have the following:

  vmhba64:C4:T0 Device: naa.60014054a5d46697f85498e9a257567c TPG_id=3

  With the device name, you can gather some additional information on each iSCSI disk in a dead state.

- Gather more information on the iSCSI disk:
Syntax
esxcli storage nmp path list -d ISCSI_DISK_DEVICE > /tmp/esxcli_storage_nmp_path_list.txt esxcli storage core device list -d ISCSI_DISK_DEVICE > /tmp/esxcli_storage_core_device_list.txt
esxcli storage nmp path list -d ISCSI_DISK_DEVICE > /tmp/esxcli_storage_nmp_path_list.txt esxcli storage core device list -d ISCSI_DISK_DEVICE > /tmp/esxcli_storage_core_device_list.txtCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
[root@esx:~]# esxcli storage nmp path list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_nmp_path_list.txt [root@esx:~]# esxcli storage core device list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_core_device_list.txt
[root@esx:~]# esxcli storage nmp path list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_nmp_path_list.txt [root@esx:~]# esxcli storage core device list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_core_device_list.txtCopy to Clipboard Copied! Toggle word wrap Toggle overflow
Gather additional information on the VMware ESXi environment:
[root@esx:~]# esxcli storage vmfs extent list > /tmp/esxcli_storage_vmfs_extent_list.txt [root@esx:~]# esxcli storage filesystem list > /tmp/esxcli_storage_filesystem_list.txt [root@esx:~]# esxcli iscsi session list > /tmp/esxcli_iscsi_session_list.txt [root@esx:~]# esxcli iscsi session connection list > /tmp/esxcli_iscsi_session_connection_list.txt
[root@esx:~]# esxcli storage vmfs extent list > /tmp/esxcli_storage_vmfs_extent_list.txt [root@esx:~]# esxcli storage filesystem list > /tmp/esxcli_storage_filesystem_list.txt [root@esx:~]# esxcli iscsi session list > /tmp/esxcli_iscsi_session_list.txt [root@esx:~]# esxcli iscsi session connection list > /tmp/esxcli_iscsi_session_connection_list.txtCopy to Clipboard Copied! Toggle word wrap Toggle overflow Check for potential iSCSI login issues:
Additional Resources
-
See Red Hat’s Knowledgebase solution on creating an
sosreportfor Red Hat Global Support Services. - See Red Hat’s Knowledgebase solution on uploading files for Red Hat Global Support Services.
- How to open a Red Hat support case on the Customer Portal?
8.3. Checking iSCSI login failures because data was not sent
On the iSCSI gateway node, you might see generic login negotiation failure messages in the system log, by default /var/log/messages.
Example
Apr 2 23:17:05 osd1 kernel: rx_data returned 0, expecting 48. Apr 2 23:17:05 osd1 kernel: iSCSI Login negotiation failed.
Apr 2 23:17:05 osd1 kernel: rx_data returned 0, expecting 48.
Apr 2 23:17:05 osd1 kernel: iSCSI Login negotiation failed.
While the system is in this state, start collecting system information as suggested in this procedure.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- A running Ceph iSCSI gateway, the iSCSI target.
- A running VMware ESXi environment, the iSCSI initiator.
- Root-level access to the Ceph iSCSI gateway node.
- Root-level access to the VMware ESXi node.
Procedure
Enable additional logging:
echo "module iscsi_target_mod +p" > /sys/kernel/debug/dynamic_debug/control echo "module target_core_mod +p" > /sys/kernel/debug/dynamic_debug/control
[root@igw ~]# echo "module iscsi_target_mod +p" > /sys/kernel/debug/dynamic_debug/control [root@igw ~]# echo "module target_core_mod +p" > /sys/kernel/debug/dynamic_debug/controlCopy to Clipboard Copied! Toggle word wrap Toggle overflow - Wait a couple of minutes for the extra debugging information to populate the system log.
Disable the additional logging:
echo "module iscsi_target_mod -p" > /sys/kernel/debug/dynamic_debug/control echo "module target_core_mod -p" > /sys/kernel/debug/dynamic_debug/control
[root@igw ~]# echo "module iscsi_target_mod -p" > /sys/kernel/debug/dynamic_debug/control [root@igw ~]# echo "module target_core_mod -p" > /sys/kernel/debug/dynamic_debug/controlCopy to Clipboard Copied! Toggle word wrap Toggle overflow Run an
sosreportto gather system information:sosreport
[root@igw ~]# sosreportCopy to Clipboard Copied! Toggle word wrap Toggle overflow Capture network traffic for the Ceph iSCSI gateway and the VMware ESXi nodes simultaneously:
Syntax
tcpdump -s0 -i NETWORK_INTERFACE -w OUTPUT_FILE_PATH
tcpdump -s0 -i NETWORK_INTERFACE -w OUTPUT_FILE_PATHCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
tcpdump -s 0 -i eth0 -w /tmp/igw-eth0-tcpdump.pcap
[root@igw ~]# tcpdump -s 0 -i eth0 -w /tmp/igw-eth0-tcpdump.pcap

Note: Look for traffic on port 3260.
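If you want to limit the capture to iSCSI traffic up front, a filtered capture can be used instead; the interface name and output path below are placeholders only:

[root@igw ~]# tcpdump -s 0 -i eth0 port 3260 -w /tmp/igw-iscsi-tcpdump.pcap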
Network packet capture files can be large, so compress the tcpdump output from the iSCSI target and initiators before uploading any files to Red Hat Global Support Services:

Syntax
gzip OUTPUT_FILE_PATH
gzip OUTPUT_FILE_PATHCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
gzip /tmp/igw-eth0-tcpdump.pcap
[root@igw ~]# gzip /tmp/igw-eth0-tcpdump.pcapCopy to Clipboard Copied! Toggle word wrap Toggle overflow
Gather additional information on the VMware ESXi environment:
[root@esx:~]# esxcli iscsi session list > /tmp/esxcli_iscsi_session_list.txt [root@esx:~]# esxcli iscsi session connection list > /tmp/esxcli_iscsi_session_connection_list.txt
[root@esx:~]# esxcli iscsi session list > /tmp/esxcli_iscsi_session_list.txt [root@esx:~]# esxcli iscsi session connection list > /tmp/esxcli_iscsi_session_connection_list.txtCopy to Clipboard Copied! Toggle word wrap Toggle overflow List and collect more information on each iSCSI disk:
Syntax
esxcli storage nmp path list -d ISCSI_DISK_DEVICE > /tmp/esxcli_storage_nmp_path_list.txt
esxcli storage nmp path list -d ISCSI_DISK_DEVICE > /tmp/esxcli_storage_nmp_path_list.txtCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
[root@esx:~]# esxcli storage nmp device list [root@esx:~]# esxcli storage nmp path list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_nmp_path_list.txt [root@esx:~]# esxcli storage core device list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_core_device_list.txt
[root@esx:~]# esxcli storage nmp device list [root@esx:~]# esxcli storage nmp path list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_nmp_path_list.txt [root@esx:~]# esxcli storage core device list -d naa.60014054a5d46697f85498e9a257567c > /tmp/esxcli_storage_core_device_list.txtCopy to Clipboard Copied! Toggle word wrap Toggle overflow
Additional Resources
-
See Red Hat’s Knowledgebase solution on creating an
sosreportfor Red Hat Global Support Services. - See Red Hat’s Knowledgebase solution on uploading files for Red Hat Global Support Services.
- See Red Hat’s Knowledgebase solution on How to capture network packets with tcpdump? for more information.
- How to open a Red Hat support case on the Customer Portal?
8.4. Checking iSCSI login failures because of a timeout or not able to find a portal group
On the iSCSI gateway node, you might see timeout or unable to locate a target portal group messages in the system log, by default /var/log/messages.
Example
Mar 28 00:29:01 osd2 kernel: iSCSI Login timeout on Network Portal 10.2.132.2:3260
Mar 28 00:29:01 osd2 kernel: iSCSI Login timeout on Network Portal 10.2.132.2:3260
or
Example
Mar 23 20:25:39 osd1 kernel: Unable to locate Target Portal Group on iqn.2017-12.com.redhat.iscsi-gw:ceph-igw
Mar 23 20:25:39 osd1 kernel: Unable to locate Target Portal Group on iqn.2017-12.com.redhat.iscsi-gw:ceph-igw
While the system is in this state, start collecting system information as suggested in this procedure.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- A running Ceph iSCSI gateway.
- Root-level access to the Ceph iSCSI gateway node.
Procedure
Enable the dumping of waiting tasks and write them to a file:
dmesg -c ; echo w > /proc/sysrq-trigger ; dmesg -c > /tmp/waiting-tasks.txt
[root@igw ~]# dmesg -c ; echo w > /proc/sysrq-trigger ; dmesg -c > /tmp/waiting-tasks.txtCopy to Clipboard Copied! Toggle word wrap Toggle overflow Review the list of waiting tasks for the following messages:
-
iscsit_tpg_disable_portal_group -
core_tmr_abort_task -
transport_generic_free_cmd
If any of these messages appear in the waiting task list, it indicates that something went wrong with the tcmu-runner service: either the tcmu-runner service was not restarted properly, or it has crashed.

- Verify whether the tcmu-runner service is running:

  [root@igw ~]# systemctl status tcmu-runner

- If the tcmu-runner service is not running, stop the rbd-target-gw service before restarting the tcmu-runner service.

  Important: Stopping the Ceph iSCSI gateway first prevents IOs from getting stuck while the tcmu-runner service is down.

- If the tcmu-runner service is running, this might be a new bug. Open a new Red Hat support case.
Additional Resources
-
See Red Hat’s Knowledgebase solution on creating an
sosreportfor Red Hat Global Support Services. - See Red Hat’s Knowledgebase solution on uploading files for Red Hat Global Support Services.
- How to open a Red Hat support case on the Customer Portal?
8.5. Timeout command errors
The Ceph iSCSI gateway might report command timeout errors when a SCSI command has failed in the system log.
Example
Mar 23 20:03:14 igw tcmu-runner: 2018-03-23 20:03:14.052 2513 [ERROR] tcmu_rbd_handle_timedout_cmd:669 rbd/rbd.gw1lun011: Timing out cmd.
Mar 23 20:03:14 igw tcmu-runner: 2018-03-23 20:03:14.052 2513 [ERROR] tcmu_rbd_handle_timedout_cmd:669 rbd/rbd.gw1lun011: Timing out cmd.
or
Example
Mar 23 20:03:14 igw tcmu-runner: tcmu_notify_conn_lost:176 rbd/rbd.gw1lun011: Handler connection lost (lock state 1)
Mar 23 20:03:14 igw tcmu-runner: tcmu_notify_conn_lost:176 rbd/rbd.gw1lun011: Handler connection lost (lock state 1)
What This Means
It is possible there are other stuck tasks waiting to be processed, causing the SCSI command to time out because a response was not received in a timely manner. Another reason for these error messages might be an unhealthy Red Hat Ceph Storage cluster.
To Troubleshoot This Problem
- Check to see if there are waiting tasks that might be holding things up.
- Check the health of the Red Hat Ceph Storage cluster.
- Collect system information from each device in the path from the Ceph iSCSI gateway node to the iSCSI initiator node.
Additional Resources
- See the Checking iSCSI login failures because of a timeout or not able to find a portal group section of the Red Hat Ceph Storage Troubleshooting Guide for more details on how to view waiting tasks.
- See the Diagnosing the health of a storage cluster section of the Red Hat Ceph Storage Troubleshooting Guide for more details on checking the storage cluster health.
- See the Gathering information for lost connections causing storage failures on VMware ESXi section of the Red Hat Ceph Storage Troubleshooting Guide for more details on collecting the necessary information.
8.6. Abort task errors
The Ceph iSCSI gateway might report abort task errors in the system log.
Example
Apr 1 14:23:58 igw kernel: ABORT_TASK: Found referenced iSCSI task_tag: 1085531
Apr 1 14:23:58 igw kernel: ABORT_TASK: Found referenced iSCSI task_tag: 1085531
What This Means
It is possible that a network disruption, such as a failed switch or a bad port, is causing this type of error message. Another possibility is an unhealthy Red Hat Ceph Storage cluster.
To Troubleshoot This Problem
- Check for any network disruptions in the environment.
- Check the health of the Red Hat Ceph Storage cluster.
- Collect system information from each device in the path from the Ceph iSCSI gateway node to the iSCSI initiator node.
Additional Resources
- See the Diagnosing the health of a storage cluster section of the Red Hat Ceph Storage Troubleshooting Guide for more details on checking the storage cluster health.
- See the Gathering information for lost connections causing storage failures on VMware ESXi section of the Red Hat Ceph Storage Troubleshooting Guide for more details on collecting the necessary information.
8.7. Additional Resources
- See the Red Hat Ceph Storage Block Device Guide for more details on the Ceph iSCSI gateway.
- See Chapter 3, Troubleshooting networking issues for details.
Chapter 9. Troubleshooting Ceph placement groups
This section contains information about fixing the most common errors related to the Ceph Placement Groups (PGs).
9.1. Prerequisites
- Verify your network connection.
- Ensure that Monitors are able to form a quorum.
- Ensure that all healthy OSDs are up and in, and the backfilling and recovery processes are finished.
9.2. Most common Ceph placement groups errors
The following table lists the most common error messages that are returned by the ceph health detail command. The table provides links to corresponding sections that explain the errors and point to specific procedures to fix the problems.
In addition, you can list placement groups that are stuck in a state that is not optimal. See Section 9.3, “Listing placement groups stuck in stale, inactive, or unclean state” for details.
9.2.1. Prerequisites
- A running Red Hat Ceph Storage cluster.
- A running Ceph Object Gateway.
9.2.2. Placement group error messages
A table of common placement group error messages, and a potential fix.
| Error message | See |
|---|---|
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
9.2.3. Stale placement groups
The ceph health command lists some Placement Groups (PGs) as stale:
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
What This Means
The Monitor marks a placement group as stale when it does not receive any status update from the primary OSD of the placement group’s acting set or when other OSDs reported that the primary OSD is down.
Usually, PGs enter the stale state after you start the storage cluster and until the peering process completes. However, when the PGs remain stale for longer than expected, it might indicate that the primary OSD for those PGs is down or not reporting PG statistics to the Monitor. When the primary OSD storing stale PGs is back up, Ceph starts to recover the PGs.
The mon_osd_report_timeout setting determines how often OSDs report PG statistics to Monitors. By default, this parameter is set to 0.5, which means that OSDs report the statistics every half a second.
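To confirm the value currently in effect, you can query a daemon's admin socket on the node where it runs; the Monitor name below is a placeholder, and this check is a suggestion rather than part of the documented procedure:

[root@mon ~]# ceph daemon mon.HOST_NAME config get mon_osd_report_timeout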
To Troubleshoot This Problem
- Identify which PGs are stale and on which OSDs they are stored. The error message will include information similar to the following example:
- Troubleshoot any problems with the OSDs that are marked as down. For details, see Down OSDs.
Additional Resources
- The Monitoring Placement Group sets section in the Administration Guide for Red Hat Ceph Storage 4
9.2.4. Inconsistent placement groups
Some placement groups are marked as active + clean + inconsistent and the ceph health detail command returns an error message similar to the following one:
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors pg 0.6 is active+clean+inconsistent, acting [0,1,2] 2 scrub errors
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors
What This Means
When Ceph detects inconsistencies in one or more replicas of an object in a placement group, it marks the placement group as inconsistent. The most common inconsistencies are:
- Objects have an incorrect size.
- Objects are missing from one replica after a recovery finished.
In most cases, errors during scrubbing cause inconsistency within placement groups.
To Troubleshoot This Problem
- Determine which placement group is in the inconsistent state:

  # ceph health detail
  HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
  pg 0.6 is active+clean+inconsistent, acting [0,1,2]
  2 scrub errors

- Determine why the placement group is inconsistent.

  Start the deep scrubbing process on the placement group:

  [root@mon ~]# ceph pg deep-scrub ID

  Replace ID with the ID of the inconsistent placement group, for example:

  [root@mon ~]# ceph pg deep-scrub 0.6
  instructing pg 0.6 on osd.0 to deep-scrub

  Search the output of ceph -w for any messages related to that placement group:

  ceph -w | grep ID

  Replace ID with the ID of the inconsistent placement group, for example:

  [root@mon ~]# ceph -w | grep 0.6
  2015-02-26 01:35:36.778215 osd.106 [ERR] 0.6 deep-scrub stat mismatch, got 636/635 objects, 0/0 clones, 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 1855455/1854371 bytes.
  2015-02-26 01:35:36.788334 osd.106 [ERR] 0.6 deep-scrub 1 errors

- If the output includes any error messages similar to the following ones, you can repair the inconsistent placement group. See Repairing inconsistent placement groups for details.

  PG.ID shard OSD: soid OBJECT missing attr , missing attr _ATTRIBUTE_TYPE
  PG.ID shard OSD: soid OBJECT digest 0 != known digest DIGEST, size 0 != known size SIZE
  PG.ID shard OSD: soid OBJECT size 0 != known size SIZE
  PG.ID deep-scrub stat mismatch, got MISMATCH
  PG.ID shard OSD: soid OBJECT candidate had a read error, digest 0 != known digest DIGEST

- If the output includes any error messages similar to the following ones, it is not safe to repair the inconsistent placement group because you can lose data. Open a support ticket in this situation. See Contacting Red Hat support for details.

  PG.ID shard OSD: soid OBJECT digest DIGEST != known digest DIGEST
  PG.ID shard OSD: soid OBJECT omap_digest DIGEST != known omap_digest DIGEST
Additional Resources
- Listing placement group inconsistencies in the Red Hat Ceph Storage Troubleshooting Guide.
- The Ceph Data integrity section in the Red Hat Ceph Storage Architecture Guide.
- The Scrubbing the OSD section in the Red Hat Ceph Storage Configuration Guide.
9.2.5. Unclean placement groups
The ceph health command returns an error message similar to the following one:
HEALTH_WARN 197 pgs stuck unclean
HEALTH_WARN 197 pgs stuck unclean
What This Means
Ceph marks a placement group as unclean if it has not achieved the active+clean state for the number of seconds specified in the mon_pg_stuck_threshold parameter in the Ceph configuration file. The default value of mon_pg_stuck_threshold is 300 seconds.
If a placement group is unclean, it contains objects that are not replicated the number of times specified in the osd_pool_default_size parameter. The default value of osd_pool_default_size is 3, which means that Ceph creates three replicas.
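To confirm the replica count that a specific pool expects, you can query the pool's size value; the pool name data below is only an example:

[root@mon ~]# ceph osd pool get data size

The returned size is the number of replicas each object in that pool must have before its placement groups can reach the active+clean state.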
Usually, unclean placement groups indicate that some OSDs might be down.
To Troubleshoot This Problem
- Determine which OSDs are down:

  # ceph osd tree

- Troubleshoot and fix any problems with the OSDs. See Down OSDs for details.
Additional Resources
9.2.6. Inactive placement groups
The ceph health command returns an error message similar to the following one:
HEALTH_WARN 197 pgs stuck inactive
HEALTH_WARN 197 pgs stuck inactive
What This Means
Ceph marks a placement group as inactive if it has not been active for the number of seconds specified in the mon_pg_stuck_threshold parameter in the Ceph configuration file. The default value of mon_pg_stuck_threshold is 300 seconds.
Usually, inactive placement groups indicate that some OSDs might be down.
To Troubleshoot This Problem
- Determine which OSDs are down:

  # ceph osd tree

- Troubleshoot and fix any problems with the OSDs.
Additional Resources
9.2.7. Placement groups are down
The ceph health detail command reports that some placement groups are down:
What This Means
In certain cases, the peering process can be blocked, which prevents a placement group from becoming active and usable. Usually, a failure of an OSD causes the peering failures.
To Troubleshoot This Problem
Determine what blocks the peering process:
ceph pg ID query
[root@mon ~]# ceph pg ID query
Replace ID with the ID of the placement group that is down, for example:
The recovery_state section includes information why the peering process is blocked.
- If the output includes the peering is blocked due to down osds error message, see Down OSDs.
- If you see any other error message, open a support ticket. See Contacting Red Hat Support service for details.
Additional Resources
- The Ceph OSD peering section in the Red Hat Ceph Storage Administration Guide.
9.2.8. Unfound objects
The ceph health command returns an error message similar to the following one, containing the unfound keyword:
HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
What This Means
Ceph marks objects as unfound when it knows these objects or their newer copies exist but it is unable to find them. As a consequence, Ceph cannot recover such objects and proceed with the recovery process.
An Example Situation
A placement group stores data on osd.1 and osd.2.
- osd.1 goes down.
- osd.2 handles some write operations.
- osd.1 comes up.
- A peering process between osd.1 and osd.2 starts, and the objects missing on osd.1 are queued for recovery.
- Before Ceph copies new objects, osd.2 goes down.
As a result, osd.1 knows that these objects exist, but there is no OSD that has a copy of the objects.
In this scenario, Ceph is waiting for the failed node to be accessible again, and the unfound objects block the recovery process.
To Troubleshoot This Problem
- Determine which placement groups contain unfound objects:

  ceph health detail
[root@mon ~]# ceph health detail HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; recovery 5/937611 objects degraded (0.001%); 1/312537 unfound (0.000%) pg 3.8a5 is stuck unclean for 803946.712780, current state active+recovering, last acting [320,248,0] pg 3.8a5 is active+recovering, acting [320,248,0], 1 unfound recovery 5/937611 objects degraded (0.001%); **1/312537 unfound (0.000%)**Copy to Clipboard Copied! Toggle word wrap Toggle overflow List more information about the placement group:
  ceph pg ID query

  [root@mon ~]# ceph pg ID query

  Replace ID with the ID of the placement group containing the unfound objects, for example:

  The might_have_unfound section includes OSDs where Ceph tried to locate the unfound objects:

  - The already probed status indicates that Ceph cannot locate the unfound objects in that OSD.
  - The osd is down status indicates that Ceph cannot contact that OSD.
- Troubleshoot the OSDs that are marked as down. See Down OSDs for details, and see the sketch after this procedure for listing the unfound objects of a specific placement group.
- If you are unable to fix the problem that causes the OSD to be down, open a support ticket. See Contacting Red Hat Support for service for details.
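If you need the exact list of objects that a placement group reports as unfound, you can query the placement group directly. The placement group ID 3.8a5 below is the one used in the example output above; this command is a general Ceph query and not a required step of this procedure:

[root@mon ~]# ceph pg 3.8a5 list_unfound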
9.3. Listing placement groups stuck in stale, inactive, or unclean state
After a failure, placement groups enter states like degraded or peering. These states indicate normal progression through the failure recovery process.

However, if a placement group stays in one of these states for a longer time than expected, it can be an indication of a larger problem. The Monitors report when placement groups get stuck in a state that is not optimal.
The mon_pg_stuck_threshold option in the Ceph configuration file determines the number of seconds after which placement groups are considered inactive, unclean, or stale.
The following table lists these states together with a short explanation.
| State | What it means | Most common causes | See |
|---|---|---|---|
|
| The PG has not been able to service read/write requests. |
| |
|
| The PG contains objects that are not replicated the desired number of times. Something is preventing the PG from recovering. |
| |
|
|
The status of the PG has not been updated by a |
|
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the node.
Procedure
List the stuck PGs:
ceph pg dump_stuck inactive ceph pg dump_stuck unclean ceph pg dump_stuck stale
[root@mon ~]# ceph pg dump_stuck inactive [root@mon ~]# ceph pg dump_stuck unclean [root@mon ~]# ceph pg dump_stuck staleCopy to Clipboard Copied! Toggle word wrap Toggle overflow
Additional Resources
- See the Placement group states section in the Red Hat Ceph Storage Administration Guide.
9.4. Listing placement group inconsistencies
Use the rados utility to list inconsistencies in various replicas of an object. Use the --format=json-pretty option to list a more detailed output.
This section covers the listing of:
- Inconsistent placement group in a pool
- Inconsistent objects in a placement group
- Inconsistent snapshot sets in a placement group
Prerequisites
- A running Red Hat Ceph Storage cluster in a healthy state.
- Root-level access to the node.
Procedure
rados list-inconsistent-pg POOL --format=json-pretty
rados list-inconsistent-pg POOL --format=json-pretty
For example, list all inconsistent placement groups in a pool named data:
rados list-inconsistent-pg data --format=json-pretty
# rados list-inconsistent-pg data --format=json-pretty
[0.6]
rados list-inconsistent-obj PLACEMENT_GROUP_ID
rados list-inconsistent-obj PLACEMENT_GROUP_ID
For example, list inconsistent objects in a placement group with ID 0.6:
The following fields are important to determine what causes the inconsistency:
- name: The name of the object with inconsistent replicas.
- nspace: The namespace that is a logical separation of a pool. It is empty by default.
- locator: The key that is used as the alternative of the object name for placement.
- snap: The snapshot ID of the object. The only writable version of the object is called head. If an object is a clone, this field includes its sequential ID.
- version: The version ID of the object with inconsistent replicas. Each write operation to an object increments it.
- errors: A list of errors that indicate inconsistencies between shards without determining which shard or shards are incorrect. See the shard array to further investigate the errors.
  - data_digest_mismatch: The digest of the replica read from one OSD is different from the other OSDs.
  - size_mismatch: The size of a clone or the head object does not match the expectation.
  - read_error: This error indicates inconsistencies most likely caused by disk errors.
- union_shard_error: The union of all errors specific to shards. These errors are connected to a faulty shard. The errors that end with oi indicate that you have to compare the information from a faulty object to information with selected objects. See the shard array to further investigate the errors.

In the above example, the object replica stored on osd.2 has a different digest than the replicas stored on osd.0 and osd.1. Specifically, the digest of the replica is not 0xffffffff as calculated from the shard read from osd.2, but 0xe978e67f. In addition, the size of the replica read from osd.2 is 0, while the size reported by osd.0 and osd.1 is 968.
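If you only need the error summary for each inconsistent object rather than the full report, the JSON output can be filtered. The jq filter below is a sketch that assumes the inconsistents array layout described above; placement group 0.6 is the example used in this section:

# rados list-inconsistent-obj 0.6 --format=json-pretty | jq '.inconsistents[] | {object: .object.name, errors, union_shard_errors}'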
rados list-inconsistent-snapset PLACEMENT_GROUP_ID
rados list-inconsistent-snapset PLACEMENT_GROUP_ID
For example, list inconsistent sets of snapshots (snapsets) in a placement group with ID 0.23:
The command returns the following errors:
- ss_attr_missing: One or more attributes are missing. Attributes are information about snapshots encoded into a snapshot set as a list of key-value pairs.
- ss_attr_corrupted: One or more attributes fail to decode.
- clone_missing: A clone is missing.
- snapset_mismatch: The snapshot set is inconsistent by itself.
- head_mismatch: The snapshot set indicates that head exists or not, but the scrub results report otherwise.
- headless: The head of the snapshot set is missing.
- size_mismatch: The size of a clone or the head object does not match the expectation.
Additional Resources
- Inconsistent placement groups section in the Red Hat Ceph Storage Troubleshooting Guide.
- Repairing inconsistent placement groups section in the Red Hat Ceph Storage Troubleshooting Guide.
9.5. Repairing inconsistent placement groups
Due to an error during deep scrubbing, some placement groups can include inconsistencies. Ceph reports such placement groups as inconsistent:
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors pg 0.6 is active+clean+inconsistent, acting [0,1,2] 2 scrub errors
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors
You can repair only certain inconsistencies.
Do not repair the placement groups if the Ceph logs include the following errors:
_PG_._ID_ shard _OSD_: soid _OBJECT_ digest _DIGEST_ != known digest _DIGEST_ _PG_._ID_ shard _OSD_: soid _OBJECT_ omap_digest _DIGEST_ != known omap_digest _DIGEST_
_PG_._ID_ shard _OSD_: soid _OBJECT_ digest _DIGEST_ != known digest _DIGEST_
_PG_._ID_ shard _OSD_: soid _OBJECT_ omap_digest _DIGEST_ != known omap_digest _DIGEST_
Open a support ticket instead. See Contacting Red Hat Support for service for details.
Prerequisites
- Root-level access to the Ceph Monitor node.
Procedure
- Repair the inconsistent placement groups:

  ceph pg repair ID

  [root@mon ~]# ceph pg repair ID

- Replace ID with the ID of the inconsistent placement group.
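Example

[root@mon ~]# ceph pg repair 0.6

This repairs the placement group 0.6 reported as inconsistent in the ceph health detail output shown above.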
Additional Resources
- Inconsistent placement groups section in the Red Hat Ceph Storage Troubleshooting Guide.
- Listing placement group inconsistencies Red Hat Ceph Storage Troubleshooting Guide.
9.6. Increasing the placement group
Insufficient Placement Group (PG) count impacts the performance of the Ceph cluster and data distribution. It is one of the main causes of the nearfull osds error messages.
The recommended ratio is between 100 and 300 PGs per OSD. This ratio can decrease when you add more OSDs to the cluster.
The pg_num and pgp_num parameters determine the PG count. These parameters are configured per each pool, and therefore, you must adjust each pool with low PG count separately.
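As a rough, commonly used rule of thumb (not a value mandated by this guide), the per-pool PG count can be estimated as (number of OSDs × 100) / replica count, rounded up to the nearest power of two. For example, a pool spread across 40 OSDs with a replica count of 3 gives (40 × 100) / 3 ≈ 1333, which rounds up to a pg_num of 2048. Use the Ceph Placement Groups (PGs) per Pool Calculator referenced in the procedure below for an authoritative value.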
Increasing the PG count is the most intensive process that you can perform on a Ceph cluster. This process might have serious performance impact if not done in a slow and methodical way. Once you increase pgp_num, you will not be able to stop or reverse the process and you must complete it. Consider increasing the PG count outside of business critical processing time allocation, and alert all clients about the potential performance impact. Do not change the PG count if the cluster is in the HEALTH_ERR state.
Prerequisites
- A running Red Hat Ceph Storage cluster in a healthy state.
- Root-level access to the node.
Procedure
Reduce the impact of data redistribution and recovery on individual OSDs and OSD hosts:
Lower the value of the osd_max_backfills, osd_recovery_max_active, and osd_recovery_op_priority parameters:

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'
[root@mon ~]# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'Copy to Clipboard Copied! Toggle word wrap Toggle overflow Disable the shallow and deep scrubbing:
ceph osd set noscrub ceph osd set nodeep-scrub
[root@mon ~]# ceph osd set noscrub [root@mon ~]# ceph osd set nodeep-scrubCopy to Clipboard Copied! Toggle word wrap Toggle overflow
- Use the Ceph Placement Groups (PGs) per Pool Calculator to calculate the optimal value of the pg_num and pgp_num parameters.
- Increase the pg_num value in small increments until you reach the desired value.
  - Determine the starting increment value. Use a very low value that is a power of two, and increase it when you determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
Increment the
pg_numvalue:ceph osd pool set POOL pg_num VALUE
ceph osd pool set POOL pg_num VALUECopy to Clipboard Copied! Toggle word wrap Toggle overflow Specify the pool name and the new value, for example:
ceph osd pool set data pg_num 4
# ceph osd pool set data pg_num 4Copy to Clipboard Copied! Toggle word wrap Toggle overflow Monitor the status of the cluster:
ceph -s
# ceph -sCopy to Clipboard Copied! Toggle word wrap Toggle overflow The PGs state will change from
creatingtoactive+clean. Wait until all PGs are in theactive+cleanstate.
Increase the
pgp_numvalue in small increments until you reach the desired value:- Determine the starting increment value. Use a very low value that is a power of two, and increase it when you determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
Increment the
pgp_numvalue:ceph osd pool set POOL pgp_num VALUE
ceph osd pool set POOL pgp_num VALUECopy to Clipboard Copied! Toggle word wrap Toggle overflow Specify the pool name and the new value, for example:
ceph osd pool set data pgp_num 4
# ceph osd pool set data pgp_num 4Copy to Clipboard Copied! Toggle word wrap Toggle overflow Monitor the status of the cluster:
ceph -s
# ceph -sCopy to Clipboard Copied! Toggle word wrap Toggle overflow The PGs state will change through
peering,wait_backfill,backfilling,recover, and others. Wait until all PGs are in theactive+cleanstate.
- Repeat the previous steps for all pools with insufficient PG count.
Set osd_max_backfills, osd_recovery_max_active, and osd_recovery_op_priority to their default values:

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3 --osd_recovery_op_priority 3'
# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3 --osd_recovery_op_priority 3'Copy to Clipboard Copied! Toggle word wrap Toggle overflow Enable the shallow and deep scrubbing:
ceph osd unset noscrub ceph osd unset nodeep-scrub
# ceph osd unset noscrub # ceph osd unset nodeep-scrubCopy to Clipboard Copied! Toggle word wrap Toggle overflow
Additional Resources
- Nearfull OSDs
- The Monitoring Placement Group Sets section in the Administration Guide for Red Hat Ceph Storage 4
9.7. Additional Resources
- See Chapter 3, Troubleshooting networking issues for details.
- See Chapter 4, Troubleshooting Ceph Monitors for details about troubleshooting the most common errors related to Ceph Monitors.
- See Chapter 5, Troubleshooting Ceph OSDs for details about troubleshooting the most common errors related to Ceph OSDs.
Chapter 10. Troubleshooting Ceph objects
As a storage administrator, you can use the ceph-objectstore-tool utility to perform high-level or low-level object operations. The ceph-objectstore-tool utility can help you troubleshoot problems related to objects within a particular OSD or placement group.
You can also start OSD containers in rescue/maintenance mode to repair OSDs without installing Ceph packages on the OSD node.
Manipulating objects can cause unrecoverable data loss. Contact Red Hat support before using the ceph-objectstore-tool utility.
10.1. Prerequisites
- Verify there are no network-related issues.
10.2. Troubleshooting Ceph objects in a containerized environment
The OSD container can be started in rescue/maintenance mode to repair OSDs in Red Hat Ceph Storage 4 without installing Ceph packages on the OSD node.
You can use the ceph-bluestore-tool to run a consistency check with the fsck command, or to run a consistency check and repair any errors with the repair command.

This procedure is specific to containerized deployments only. Skip this section for bare-metal deployments.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph OSD node.
- Stopping the ceph-osd daemon.
Procedure
- Set the noout flag on the cluster.

  Example

  [root@mon ~]# ceph osd set noout

- Log in to the node hosting the OSD container.
- Back up the /etc/systemd/system/ceph-osd@.service unit file to the /root directory.

  Example

  [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.backup

- Move the /run/ceph-osd@OSD_ID.service-cid file to /root.

  Example

  [root@osd ~]# mv /run/ceph-osd@0.service-cid /root

- Edit the /etc/systemd/system/ceph-osd@.service unit file and add the -it --entrypoint /bin/bash option to the podman command.

  Example
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Reload
systemdmanager configuration.Example
systemctl daemon-reload
[root@osd ~]# systemctl daemon-reloadCopy to Clipboard Copied! Toggle word wrap Toggle overflow Restart the OSD service associated with the
OSD_ID.Syntax
systemctl restart ceph-osd@OSD_ID.service
systemctl restart ceph-osd@OSD_ID.serviceCopy to Clipboard Copied! Toggle word wrap Toggle overflow Replace
OSD_IDwith the ID of the OSD.Example
systemctl restart ceph-osd@0.service
[root@osd ~]# systemctl restart ceph-osd@0.serviceCopy to Clipboard Copied! Toggle word wrap Toggle overflow Login to the container associated with the
OSD_ID.Syntax
podman exec -it ceph-osd-OSD_ID /bin/bash
podman exec -it ceph-osd-OSD_ID /bin/bashCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
podman exec -it ceph-osd-0 /bin/bash
[root@osd ~]# podman exec -it ceph-osd-0 /bin/bashCopy to Clipboard Copied! Toggle word wrap Toggle overflow Get
osd fsidand activate the OSD to mount OSD’s logical volume (LV).Syntax
ceph-volume lvm list |grep -A15 "osd\.OSD_ID"|grep "osd fsid" ceph-volume lvm activate --bluestore OSD_ID OSD_FSID
ceph-volume lvm list |grep -A15 "osd\.OSD_ID"|grep "osd fsid" ceph-volume lvm activate --bluestore OSD_ID OSD_FSIDCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Run
fsckandrepaircommands.Syntax
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-OSD_ID ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-OSD_ID
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-OSD_ID ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-OSD_IDCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0
[root@osd ~]# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0 fsck successCopy to Clipboard Copied! Toggle word wrap Toggle overflow ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0
[root@osd ~]# ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0 repair successCopy to Clipboard Copied! Toggle word wrap Toggle overflow After exiting the container, copy
/etc/systemd/system/ceph-osd@.serviceunit file from/rootdirectory.Example
cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.modified cp /root/ceph-osd@.service.backup /etc/systemd/system/ceph-osd@.service
[root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.modified [root@osd ~]# cp /root/ceph-osd@.service.backup /etc/systemd/system/ceph-osd@.serviceCopy to Clipboard Copied! Toggle word wrap Toggle overflow Reload
systemdmanager configuration.Example
systemctl daemon-reload
[root@osd ~]# systemctl daemon-reloadCopy to Clipboard Copied! Toggle word wrap Toggle overflow Move
/run/ceph-osd@OSD_ID.service-cidfile to/tmp.Example
mv /run/ceph-osd@0.service-cid /tmp
[root@osd ~]# mv /run/ceph-osd@0.service-cid /tmpCopy to Clipboard Copied! Toggle word wrap Toggle overflow Restart the OSD service associated with the
OSD_ID.Syntax
systemctl restart ceph-osd@OSD_ID.service
[root@osd ~]# systemctl restart ceph-osd@OSD_ID.serviceCopy to Clipboard Copied! Toggle word wrap Toggle overflow Example
systemctl restart ceph-osd@0.service
[root@osd ~]# systemctl restart ceph-osd@0.serviceCopy to Clipboard Copied! Toggle word wrap Toggle overflow
10.3. Troubleshooting high-level object operations
As a storage administrator, you can use the ceph-objectstore-tool utility to perform high-level object operations. The ceph-objectstore-tool utility supports the following high-level object operations:
- List objects
- List lost objects
- Fix lost objects
Manipulating objects can cause unrecoverable data loss. Contact Red Hat support before using the ceph-objectstore-tool utility.
10.3.1. Prerequisites
- Root-level access to the Ceph OSD nodes.
10.3.2. Listing objects
The OSD can contain zero to many placement groups, and zero to many objects within a placement group (PG). The ceph-objectstore-tool utility allows you to list objects stored within an OSD.
Prerequisites
- Root-level access to the Ceph OSD node.
-
Stopping the
ceph-osddaemon.
Procedure
- Verify the appropriate OSD is down:
  Syntax: systemctl status ceph-osd@OSD_NUMBER
  Example: [root@osd ~]# systemctl status ceph-osd@1
- For containerized deployments, to access the bluestore tool, follow the steps below:
  - Set the noout flag on the cluster.
    Example: [root@mon ~]# ceph osd set noout
  - Log in to the node hosting the OSD container.
  - Back up the /etc/systemd/system/ceph-osd@.service unit file to the /root directory.
    Example: [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.backup
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /root.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /root
  - Edit the /etc/systemd/system/ceph-osd@.service unit file and add the -it --entrypoint /bin/bash option to the podman command.
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Replace OSD_ID with the ID of the OSD.
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
  - Log in to the container associated with the OSD_ID.
    Syntax: podman exec -it ceph-osd-OSD_ID /bin/bash
    Example: [root@osd ~]# podman exec -it ceph-osd-0 /bin/bash
  - Get the osd fsid and activate the OSD to mount the OSD's logical volume (LV).
    Syntax:
    ceph-volume lvm list | grep -A15 "osd\.OSD_ID" | grep "osd fsid"
    ceph-volume lvm activate --bluestore OSD_ID OSD_FSID
- Identify all the objects within an OSD, regardless of their placement group:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --op list
  Example: [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list
- Identify all the objects within a placement group:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID --op list
  Example: [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c --op list
- Identify the PG an object belongs to:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --op list OBJECT_ID
  Example: [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list default.region
- For containerized deployments, to revert the changes, follow the steps below:
  - After exiting the container, copy the /etc/systemd/system/ceph-osd@.service unit file back from the /root directory.
    Example:
    [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.modified
    [root@osd ~]# cp /root/ceph-osd@.service.backup /etc/systemd/system/ceph-osd@.service
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /tmp.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /tmp
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
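The listing operation identifies each object with a JSON descriptor. The later operations in this chapter (list-omap, get-bytes, remove, and so on) take that descriptor as the OBJECT argument. A minimal usage sketch, reusing the object descriptor that appears in the examples later in this chapter:

[root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c \
  '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
  list-omap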
10.3.3. Listing lost objects
An OSD can mark objects as lost or unfound. You can use the ceph-objectstore-tool to list the lost and unfound objects stored within an OSD.
Prerequisites
- Root-level access to the Ceph OSD node.
- The ceph-osd daemon is stopped.
Procedure
- Verify the appropriate OSD is down:
  Syntax: systemctl status ceph-osd@OSD_NUMBER
  Example: [root@osd ~]# systemctl status ceph-osd@1
- For containerized deployments, to access the bluestore tool, follow the steps below:
  - Set the noout flag on the cluster.
    Example: [root@mon ~]# ceph osd set noout
  - Log in to the node hosting the OSD container.
  - Back up the /etc/systemd/system/ceph-osd@.service unit file to the /root directory.
    Example: [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.backup
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /root.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /root
  - Edit the /etc/systemd/system/ceph-osd@.service unit file and add the -it --entrypoint /bin/bash option to the podman command.
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Replace OSD_ID with the ID of the OSD.
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
  - Log in to the container associated with the OSD_ID.
    Syntax: podman exec -it ceph-osd-OSD_ID /bin/bash
    Example: [root@osd ~]# podman exec -it ceph-osd-0 /bin/bash
  - Get the osd fsid and activate the OSD to mount the OSD's logical volume (LV).
    Syntax:
    ceph-volume lvm list | grep -A15 "osd\.OSD_ID" | grep "osd fsid"
    ceph-volume lvm activate --bluestore OSD_ID OSD_FSID
- Use the ceph-objectstore-tool utility to list lost and unfound objects. Select the appropriate circumstance:
  - To list all the lost objects:
    Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --op list-lost
    Example: [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-lost
  - To list all the lost objects within a placement group:
    Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID --op list-lost
    Example: [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c --op list-lost
  - To list a lost object by its identifier:
    Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --op list-lost OBJECT_ID
    Example: [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list-lost default.region
- For containerized deployments, to revert the changes, follow the steps below:
  - After exiting the container, copy the /etc/systemd/system/ceph-osd@.service unit file back from the /root directory.
    Example:
    [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.modified
    [root@osd ~]# cp /root/ceph-osd@.service.backup /etc/systemd/system/ceph-osd@.service
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /tmp.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /tmp
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
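The containerized steps above set the noout flag but do not show clearing it. Once the OSD is running normally again, you may also want to unset it; this cleanup is not part of the original procedure, so treat it as a suggestion:

[root@mon ~]# ceph osd unset noout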
10.3.4. Fixing lost objects
You can use the ceph-objectstore-tool utility to list and fix lost and unfound objects stored within a Ceph OSD. This procedure applies only to legacy objects.
Prerequisites
- Root-level access to the Ceph OSD node.
- The ceph-osd daemon is stopped.
Procedure
- Verify the appropriate OSD is down:
  Syntax: systemctl status ceph-osd@OSD_NUMBER
  Example: [root@osd ~]# systemctl status ceph-osd@1
- For containerized deployments, to access the bluestore tool, follow the steps below:
  - Set the noout flag on the cluster.
    Example: [root@mon ~]# ceph osd set noout
  - Log in to the node hosting the OSD container.
  - Back up the /etc/systemd/system/ceph-osd@.service unit file to the /root directory.
    Example: [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.backup
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /root.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /root
  - Edit the /etc/systemd/system/ceph-osd@.service unit file and add the -it --entrypoint /bin/bash option to the podman command.
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Replace OSD_ID with the ID of the OSD.
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
  - Log in to the container associated with the OSD_ID.
    Syntax: podman exec -it ceph-osd-OSD_ID /bin/bash
    Example: [root@osd ~]# podman exec -it ceph-osd-0 /bin/bash
  - Get the osd fsid and activate the OSD to mount the OSD's logical volume (LV).
    Syntax:
    ceph-volume lvm list | grep -A15 "osd\.OSD_ID" | grep "osd fsid"
    ceph-volume lvm activate --bluestore OSD_ID OSD_FSID
- To list all the lost legacy objects:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --op fix-lost --dry-run
  Example: [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op fix-lost --dry-run
- Use the ceph-objectstore-tool utility to fix lost and unfound objects as the ceph user. Select the appropriate circumstance:
  - To fix all lost objects:
    Syntax: su - ceph -c 'ceph-objectstore-tool --data-path PATH_TO_OSD --op fix-lost'
    Example: [root@osd ~]# su - ceph -c 'ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op fix-lost'
  - To fix all the lost objects within a placement group:
    Syntax: su - ceph -c 'ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID --op fix-lost'
    Example: [root@osd ~]# su - ceph -c 'ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c --op fix-lost'
  - To fix a lost object by its identifier:
    Syntax: su - ceph -c 'ceph-objectstore-tool --data-path PATH_TO_OSD --op fix-lost OBJECT_ID'
    Example: [root@osd ~]# su - ceph -c 'ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op fix-lost default.region'
- For containerized deployments, to revert the changes, follow the steps below:
  - After exiting the container, copy the /etc/systemd/system/ceph-osd@.service unit file back from the /root directory.
    Example:
    [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.modified
    [root@osd ~]# cp /root/ceph-osd@.service.backup /etc/systemd/system/ceph-osd@.service
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /tmp.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /tmp
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
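Before running fix-lost, it can help to confirm which placement groups actually report unfound objects. A hedged sketch using standard cluster status commands (the PG ID shown is a placeholder taken from the examples above):

[root@mon ~]# ceph health detail | grep unfound
[root@mon ~]# ceph pg 0.1c list_unfound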
10.4. Troubleshooting low-level object operations
As a storage administrator, you can use the ceph-objectstore-tool utility to perform low-level object operations. The ceph-objectstore-tool utility supports the following low-level object operations:
- Manipulate the object’s content
- Remove an object
- List the object map (OMAP)
- Manipulate the OMAP header
- Manipulate the OMAP key
- List the object’s attributes
- Manipulate the object’s attribute key
Manipulating objects can cause unrecoverable data loss. Contact Red Hat support before using the ceph-objectstore-tool utility.
10.4.1. Prerequisites
- Root-level access to the Ceph OSD nodes.
10.4.2. Manipulating the object’s content
With the ceph-objectstore-tool utility, you can get or set bytes on an object.
Setting the bytes on an object can cause unrecoverable data loss. To prevent data loss, make a backup copy of the object.
Prerequisites
- Root-level access to the Ceph OSD node.
- The ceph-osd daemon is stopped.
Procedure
- Verify the appropriate OSD is down:
  Syntax: systemctl status ceph-osd@OSD_NUMBER
  Example: [root@osd ~]# systemctl status ceph-osd@1
- For containerized deployments, to access the bluestore tool, follow the steps below:
  - Set the noout flag on the cluster.
    Example: [root@mon ~]# ceph osd set noout
  - Log in to the node hosting the OSD container.
  - Back up the /etc/systemd/system/ceph-osd@.service unit file to the /root directory.
    Example: [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.backup
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /root.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /root
  - Edit the /etc/systemd/system/ceph-osd@.service unit file and add the -it --entrypoint /bin/bash option to the podman command.
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Replace OSD_ID with the ID of the OSD.
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
  - Log in to the container associated with the OSD_ID.
    Syntax: podman exec -it ceph-osd-OSD_ID /bin/bash
    Example: [root@osd ~]# podman exec -it ceph-osd-0 /bin/bash
  - Get the osd fsid and activate the OSD to mount the OSD's logical volume (LV).
    Syntax:
    ceph-volume lvm list | grep -A15 "osd\.OSD_ID" | grep "osd fsid"
    ceph-volume lvm activate --bluestore OSD_ID OSD_FSID
- Find the object by listing the objects of the OSD or placement group (PG).
- Before setting the bytes on an object, make a backup and a working copy of the object (one possible way to create these copies is sketched at the end of this procedure).
- Edit the working copy object file and modify the object contents accordingly.
- Set the bytes of the object:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID OBJECT set-bytes < OBJECT_FILE_NAME
  Example:
  [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c \
    '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
    set-bytes < zone_info.default.working-copy
- For containerized deployments, to revert the changes, follow the steps below:
  - After exiting the container, copy the /etc/systemd/system/ceph-osd@.service unit file back from the /root directory.
    Example:
    [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.modified
    [root@osd ~]# cp /root/ceph-osd@.service.backup /etc/systemd/system/ceph-osd@.service
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /tmp.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /tmp
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
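The backup and working copies referenced earlier in this procedure can be created with the get-bytes operation. A minimal sketch, assuming the object descriptor used in the examples of this section (the output file names are illustrative):

[root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c \
  '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
  get-bytes > zone_info.default.backup
[root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c \
  '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
  get-bytes > zone_info.default.working-copy

Running get-bytes again after set-bytes is one way to confirm that the new contents were written.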
10.4.3. Removing an object
Use the ceph-objectstore-tool utility to remove an object. By removing an object, its contents and references are removed from the placement group (PG).
You cannot recreate an object once it is removed.
Prerequisites
- Root-level access to the Ceph OSD node.
- The ceph-osd daemon is stopped.
Procedure
- Verify the appropriate OSD is down:
  Syntax: systemctl status ceph-osd@OSD_NUMBER
  Example: [root@osd ~]# systemctl status ceph-osd@1
- For containerized deployments, to access the bluestore tool, follow the steps below:
  - Set the noout flag on the cluster.
    Example: [root@mon ~]# ceph osd set noout
  - Log in to the node hosting the OSD container.
  - Back up the /etc/systemd/system/ceph-osd@.service unit file to the /root directory.
    Example: [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.backup
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /root.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /root
  - Edit the /etc/systemd/system/ceph-osd@.service unit file and add the -it --entrypoint /bin/bash option to the podman command.
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Replace OSD_ID with the ID of the OSD.
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
  - Log in to the container associated with the OSD_ID.
    Syntax: podman exec -it ceph-osd-OSD_ID /bin/bash
    Example: [root@osd ~]# podman exec -it ceph-osd-0 /bin/bash
  - Get the osd fsid and activate the OSD to mount the OSD's logical volume (LV).
    Syntax:
    ceph-volume lvm list | grep -A15 "osd\.OSD_ID" | grep "osd fsid"
    ceph-volume lvm activate --bluestore OSD_ID OSD_FSID
- Remove an object:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID OBJECT remove
  Example:
  [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c \
    '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
    remove
- For containerized deployments, to revert the changes, follow the steps below:
  - After exiting the container, copy the /etc/systemd/system/ceph-osd@.service unit file back from the /root directory.
    Example:
    [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.modified
    [root@osd ~]# cp /root/ceph-osd@.service.backup /etc/systemd/system/ceph-osd@.service
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /tmp.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /tmp
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
10.4.4. Listing the object map
Use the ceph-objectstore-tool utility to list the contents of the object map (OMAP). The output provides you with a list of keys.
Prerequisites
- Root-level access to the Ceph OSD node.
- The ceph-osd daemon is stopped.
Procedure
- Verify the appropriate OSD is down:
  Syntax: systemctl status ceph-osd@OSD_NUMBER
  Example: [root@osd ~]# systemctl status ceph-osd@1
- For containerized deployments, to access the bluestore tool, follow the steps below:
  - Set the noout flag on the cluster.
    Example: [root@mon ~]# ceph osd set noout
  - Log in to the node hosting the OSD container.
  - Back up the /etc/systemd/system/ceph-osd@.service unit file to the /root directory.
    Example: [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.backup
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /root.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /root
  - Edit the /etc/systemd/system/ceph-osd@.service unit file and add the -it --entrypoint /bin/bash option to the podman command.
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Replace OSD_ID with the ID of the OSD.
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
  - Log in to the container associated with the OSD_ID.
    Syntax: podman exec -it ceph-osd-OSD_ID /bin/bash
    Example: [root@osd ~]# podman exec -it ceph-osd-0 /bin/bash
  - Get the osd fsid and activate the OSD to mount the OSD's logical volume (LV).
    Syntax:
    ceph-volume lvm list | grep -A15 "osd\.OSD_ID" | grep "osd fsid"
    ceph-volume lvm activate --bluestore OSD_ID OSD_FSID
- List the object map:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID OBJECT list-omap
  Example:
  [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 0.1c \
    '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
    list-omap
- For containerized deployments, to revert the changes, follow the steps below:
  - After exiting the container, copy the /etc/systemd/system/ceph-osd@.service unit file back from the /root directory.
    Example:
    [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.modified
    [root@osd ~]# cp /root/ceph-osd@.service.backup /etc/systemd/system/ceph-osd@.service
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /tmp.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /tmp
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
10.4.5. Manipulating the object map header
The ceph-objectstore-tool utility will output the object map (OMAP) header with the values associated with the object’s keys.
Prerequisites
- Root-level access to the Ceph OSD node.
- The ceph-osd daemon is stopped.
Procedure
- For containerized deployments, to access the bluestore tool, follow the steps below:
  - Set the noout flag on the cluster.
    Example: [root@mon ~]# ceph osd set noout
  - Log in to the node hosting the OSD container.
  - Back up the /etc/systemd/system/ceph-osd@.service unit file to the /root directory.
    Example: [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.backup
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /root.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /root
  - Edit the /etc/systemd/system/ceph-osd@.service unit file and add the -it --entrypoint /bin/bash option to the podman command.
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Replace OSD_ID with the ID of the OSD.
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
  - Log in to the container associated with the OSD_ID.
    Syntax: podman exec -it ceph-osd-OSD_ID /bin/bash
    Example: [root@osd ~]# podman exec -it ceph-osd-0 /bin/bash
  - Get the osd fsid and activate the OSD to mount the OSD's logical volume (LV).
    Syntax:
    ceph-volume lvm list | grep -A15 "osd\.OSD_ID" | grep "osd fsid"
    ceph-volume lvm activate --bluestore OSD_ID OSD_FSID
- Verify the appropriate OSD is down:
  Syntax: systemctl status ceph-osd@OSD_NUMBER
  Example: [root@osd ~]# systemctl status ceph-osd@1
- Get the object map header:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID OBJECT get-omaphdr > OBJECT_MAP_FILE_NAME
  Example:
  [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --pgid 0.1c '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
    get-omaphdr > zone_info.default.omaphdr.txt
- Set the object map header:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID OBJECT set-omaphdr < OBJECT_MAP_FILE_NAME
  Example:
  [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --pgid 0.1c '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
    set-omaphdr < zone_info.default.omaphdr.txt
- For containerized deployments, to revert the changes, follow the steps below:
  - After exiting the container, copy the /etc/systemd/system/ceph-osd@.service unit file back from the /root directory.
    Example:
    [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.modified
    [root@osd ~]# cp /root/ceph-osd@.service.backup /etc/systemd/system/ceph-osd@.service
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /tmp.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /tmp
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
10.4.6. Manipulating the object map key
Use the ceph-objectstore-tool utility to change the object map (OMAP) key. You need to provide the data path, the placement group identifier (PG ID), the object, and the key in the OMAP.
Prerequisites
- Root-level access to the Ceph OSD node.
- The ceph-osd daemon is stopped.
Procedure
- For containerized deployments, to access the bluestore tool, follow the steps below:
  - Set the noout flag on the cluster.
    Example: [root@mon ~]# ceph osd set noout
  - Log in to the node hosting the OSD container.
  - Back up the /etc/systemd/system/ceph-osd@.service unit file to the /root directory.
    Example: [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.backup
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /root.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /root
  - Edit the /etc/systemd/system/ceph-osd@.service unit file and add the -it --entrypoint /bin/bash option to the podman command.
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Replace OSD_ID with the ID of the OSD.
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
  - Log in to the container associated with the OSD_ID.
    Syntax: podman exec -it ceph-osd-OSD_ID /bin/bash
    Example: [root@osd ~]# podman exec -it ceph-osd-0 /bin/bash
  - Get the osd fsid and activate the OSD to mount the OSD's logical volume (LV).
    Syntax:
    ceph-volume lvm list | grep -A15 "osd\.OSD_ID" | grep "osd fsid"
    ceph-volume lvm activate --bluestore OSD_ID OSD_FSID
- Get the object map key:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID OBJECT get-omap KEY > OBJECT_MAP_FILE_NAME
  Example:
  [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --pgid 0.1c '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
    get-omap "" > zone_info.default.omap.txt
- Set the object map key:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID OBJECT set-omap KEY < OBJECT_MAP_FILE_NAME
  Example:
  [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --pgid 0.1c '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
    set-omap "" < zone_info.default.omap.txt
- Remove the object map key:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID OBJECT rm-omap KEY
  Example:
  [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --pgid 0.1c '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
    rm-omap ""
- For containerized deployments, to revert the changes, follow the steps below:
  - After exiting the container, copy the /etc/systemd/system/ceph-osd@.service unit file back from the /root directory.
    Example:
    [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.modified
    [root@osd ~]# cp /root/ceph-osd@.service.backup /etc/systemd/system/ceph-osd@.service
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /tmp.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /tmp
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
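If you are not sure which KEY to pass to get-omap or set-omap, you can list the available keys first with the list-omap operation described earlier in this chapter; a short sketch reusing the same object descriptor:

[root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
  --pgid 0.1c '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
  list-omap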
10.4.7. Listing the object’s attributes
Use the ceph-objectstore-tool utility to list an object’s attributes. The output provides you with the object’s keys and values.
Prerequisites
- Root-level access to the Ceph OSD node.
- The ceph-osd daemon is stopped.
Procedure
- Verify the appropriate OSD is down:
  Syntax: systemctl status ceph-osd@OSD_NUMBER
  Example: [root@osd ~]# systemctl status ceph-osd@1
- For containerized deployments, to access the bluestore tool, follow the steps below:
  - Set the noout flag on the cluster.
    Example: [root@mon ~]# ceph osd set noout
  - Log in to the node hosting the OSD container.
  - Back up the /etc/systemd/system/ceph-osd@.service unit file to the /root directory.
    Example: [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.backup
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /root.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /root
  - Edit the /etc/systemd/system/ceph-osd@.service unit file and add the -it --entrypoint /bin/bash option to the podman command.
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Replace OSD_ID with the ID of the OSD.
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
  - Log in to the container associated with the OSD_ID.
    Syntax: podman exec -it ceph-osd-OSD_ID /bin/bash
    Example: [root@osd ~]# podman exec -it ceph-osd-0 /bin/bash
  - Get the osd fsid and activate the OSD to mount the OSD's logical volume (LV).
    Syntax:
    ceph-volume lvm list | grep -A15 "osd\.OSD_ID" | grep "osd fsid"
    ceph-volume lvm activate --bluestore OSD_ID OSD_FSID
- List the object’s attributes:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID OBJECT list-attrs
  Example:
  [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --pgid 0.1c '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
    list-attrs
- For containerized deployments, to revert the changes, follow the steps below:
  - After exiting the container, copy the /etc/systemd/system/ceph-osd@.service unit file back from the /root directory.
    Example:
    [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.modified
    [root@osd ~]# cp /root/ceph-osd@.service.backup /etc/systemd/system/ceph-osd@.service
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /tmp.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /tmp
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
10.4.8. Manipulating the object attribute key
Use the ceph-objectstore-tool utility to change an object’s attributes. To manipulate an object’s attributes, you need the data and journal paths, the placement group identifier (PG ID), the object, and the attribute key.
Prerequisites
- Root-level access to the Ceph OSD node.
- The ceph-osd daemon is stopped.
Procedure
- Verify the appropriate OSD is down:
  Syntax: systemctl status ceph-osd@OSD_NUMBER
  Example: [root@osd ~]# systemctl status ceph-osd@1
- For containerized deployments, to access the bluestore tool, follow the steps below:
  - Set the noout flag on the cluster.
    Example: [root@mon ~]# ceph osd set noout
  - Log in to the node hosting the OSD container.
  - Back up the /etc/systemd/system/ceph-osd@.service unit file to the /root directory.
    Example: [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.backup
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /root.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /root
  - Edit the /etc/systemd/system/ceph-osd@.service unit file and add the -it --entrypoint /bin/bash option to the podman command.
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Replace OSD_ID with the ID of the OSD.
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
  - Log in to the container associated with the OSD_ID.
    Syntax: podman exec -it ceph-osd-OSD_ID /bin/bash
    Example: [root@osd ~]# podman exec -it ceph-osd-0 /bin/bash
  - Get the osd fsid and activate the OSD to mount the OSD's logical volume (LV).
    Syntax:
    ceph-volume lvm list | grep -A15 "osd\.OSD_ID" | grep "osd fsid"
    ceph-volume lvm activate --bluestore OSD_ID OSD_FSID
- Get the object’s attributes:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID OBJECT get-attrs KEY > OBJECT_ATTRS_FILE_NAME
  Example:
  [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --pgid 0.1c '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
    get-attrs "oid" > zone_info.default.attr.txt
- Set an object’s attributes:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID OBJECT set-attrs KEY < OBJECT_ATTRS_FILE_NAME
  Example:
  [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --pgid 0.1c '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
    set-attrs "oid" < zone_info.default.attr.txt
- Remove an object’s attributes:
  Syntax: ceph-objectstore-tool --data-path PATH_TO_OSD --pgid PG_ID OBJECT rm-attrs KEY
  Example:
  [root@osd ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
    --pgid 0.1c '{"oid":"zone_info.default","key":"","snapid":-2,"hash":235010478,"max":0,"pool":11,"namespace":""}' \
    rm-attrs "oid"
- For containerized deployments, to revert the changes, follow the steps below:
  - After exiting the container, copy the /etc/systemd/system/ceph-osd@.service unit file back from the /root directory.
    Example:
    [root@osd ~]# cp /etc/systemd/system/ceph-osd@.service /root/ceph-osd@.service.modified
    [root@osd ~]# cp /root/ceph-osd@.service.backup /etc/systemd/system/ceph-osd@.service
  - Reload the systemd manager configuration.
    Example: [root@osd ~]# systemctl daemon-reload
  - Move the /run/ceph-osd@OSD_ID.service-cid file to /tmp.
    Example: [root@osd ~]# mv /run/ceph-osd@0.service-cid /tmp
  - Restart the OSD service associated with the OSD_ID.
    Syntax: systemctl restart ceph-osd@OSD_ID.service
    Example: [root@osd ~]# systemctl restart ceph-osd@0.service
10.5. Additional Resources
- For Red Hat Ceph Storage support, see the Red Hat Customer Portal.
Chapter 11. Contacting Red Hat support for service
If the information in this guide did not help you to solve the problem, this chapter explains how to contact the Red Hat support service.
11.1. Prerequisites
- Red Hat support account.
11.2. Providing information to Red Hat Support engineers
If you are unable to fix problems related to Red Hat Ceph Storage, contact the Red Hat Support Service and provide enough information to help the support engineers troubleshoot the problem you are encountering more quickly.
Prerequisites
- Root-level access to the node.
- Red Hat support account.
Procedure
- Open a support ticket on the Red Hat Customer Portal.
- Ideally, attach an sosreport to the ticket. See the "What is a sosreport and how to create one in Red Hat Enterprise Linux 4.6 and later?" solution for details.
- If the Ceph daemons fail with a segmentation fault, consider generating a human-readable core dump file. See Generating readable core dump files for details.
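A minimal sketch of generating an sosreport on the affected node, assuming the sos package is available in your enabled RHEL repositories (the exact prompts and collected plugins vary by release):

[root@mon ~]# yum install sos
[root@mon ~]# sosreport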
11.3. Generating readable core dump files
When a Ceph daemon terminates unexpectedly with a segmentation fault, gather the information about its failure and provide it to the Red Hat Support Engineers.
Such information speeds up the initial investigation. Also, the Support Engineers can compare the information from the core dump files with known Red Hat Ceph Storage cluster issues.
11.3.1. Prerequisites
Install the ceph-debuginfo package if it is not installed already.
Enable the repository containing the ceph-debuginfo package:
Red Hat Enterprise Linux 7:
subscription-manager repos --enable=rhel-7-server-rhceph-4-DAEMON-debug-rpms
Replace DAEMON with osd or mon, depending on the type of Ceph node.
Red Hat Enterprise Linux 8:
subscription-manager repos --enable=rhceph-4-tools-for-rhel-8-x86_64-debug-rpms
Install the ceph-debuginfo package:
[root@mon ~]# yum install ceph-debuginfo
Ensure that the gdb package is installed and, if it is not, install it:
[root@mon ~]# yum install gdb
Continue with the procedure based on the type of your deployment:
11.3.2. Generating readable core dump files on bare-metal deployments
Follow this procedure to generate a core dump file if you use Red Hat Ceph Storage on bare-metal.
Procedure
Enable generating core dump files for Ceph.
Set the proper ulimits for the core dump files by adding the following parameter to the /etc/systemd/system.conf file:
DefaultLimitCORE=infinity
Comment out the PrivateTmp=true parameter in the Ceph daemon service file, by default located at /lib/systemd/system/CLUSTER_NAME-DAEMON@.service:
PrivateTmp=true
Set the suid_dumpable flag to 2 to allow the Ceph daemons to generate core dump files:
[root@mon ~]# sysctl fs.suid_dumpable=2
Adjust the core dump files location:
[root@mon ~]# sysctl kernel.core_pattern=/tmp/core
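Note that sysctl settings applied this way last only until the next reboot. As a minimal sketch, assuming you also want the values to persist, you can place them in a drop-in file under /etc/sysctl.d/ (the file name below is a hypothetical example) and reload them:
# /etc/sysctl.d/90-ceph-coredump.conf (hypothetical file name)
fs.suid_dumpable = 2
kernel.core_pattern = /tmp/core
[root@mon ~]# sysctl --system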
Modify the /etc/systemd/coredump.conf file by adding the following lines under the [Coredump] section:
ProcessSizeMax=8G
ExternalSizeMax=8G
JournalSizeMax=8G
Reload the systemd service for the changes to take effect:
[root@mon ~]# systemctl daemon-reload
Restart the Ceph daemon for the changes to take effect:
systemctl restart ceph-DAEMON@ID
Specify the daemon type (osd or mon) and its ID (numbers for OSDs, or short host names for Monitors), for example:
[root@mon ~]# systemctl restart ceph-osd@1
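Optionally, you can verify that the kernel settings are active and that the restarted daemon is allowed to dump core. This is a minimal sketch, assuming a ceph-osd daemon is running on the node; the output shown is illustrative:
[root@mon ~]# sysctl fs.suid_dumpable kernel.core_pattern
fs.suid_dumpable = 2
kernel.core_pattern = /tmp/core
[root@mon ~]# grep "Max core file size" /proc/$(pgrep -f ceph-osd | head -1)/limits
Max core file size        unlimited            unlimited            bytes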
- Reproduce the failure, for example, try to start the daemon again.
Use the GNU Debugger (GDB) to generate a readable backtrace from an application core dump file:
gdb /usr/bin/ceph-DAEMON /tmp/core.PID
Specify the daemon type and the PID of the failed process, for example:
$ gdb /usr/bin/ceph-osd /tmp/core.123456
In the GDB command prompt, disable paging and enable logging to a file by entering the commands set pag off and set log on:
(gdb) set pag off
(gdb) set log on
Apply the backtrace command to all threads of the process by entering thr a a bt full:
(gdb) thr a a bt full
After the backtrace is generated, turn off logging by entering set log off:
(gdb) set log off
- Transfer the log file gdb.txt to the system you access the Red Hat Customer Portal from and attach it to a support ticket.
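If you prefer to capture the same backtrace non-interactively, the following is a minimal sketch using GDB batch mode; the daemon binary and core file paths are placeholders and must match your failed process:
$ gdb --batch -ex 'set pagination off' -ex 'thread apply all bt full' /usr/bin/ceph-osd /tmp/core.123456 > gdb.txt 2>&1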
11.3.3. Generating readable core dump files in containerized deployments
Follow this procedure to generate a core dump file if you use Red Hat Ceph Storage in containers. The procedure involves two scenarios of capturing the core dump file:
- When a Ceph process terminates unexpectedly due to the SIGILL, SIGTRAP, SIGABRT, or SIGSEGV error.
or
- Manually, for example to debug issues such as Ceph processes consuming high CPU cycles or not responding.
Prerequisites
- Root-level access to the container node running the Ceph containers.
- Installation of the appropriate debugging packages.
- Installation of the GNU Project Debugger (gdb) package.
Procedure
If a Ceph process terminates unexpectedly due to the SIGILL, SIGTRAP, SIGABRT, or SIGSEGV error:
Set the core pattern to the systemd-coredump service on the node where the container with the failed Ceph process is running, for example:
[root@mon]# echo "| /usr/lib/systemd/systemd-coredump %P %u %g %s %t %e" > /proc/sys/kernel/core_pattern
Watch for the next container failure due to a Ceph process and search for a core dump file in the /var/lib/systemd/coredump/ directory, for example:
[root@mon]# ls -ltr /var/lib/systemd/coredump
total 8232
-rw-r-----. 1 root root 8427548 Jan 22 19:24 core.ceph-osd.167.5ede29340b6c4fe4845147f847514c12.15622.1584573794000000.xz
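On nodes where the systemd-coredump tooling is available, you can also list and extract captured cores with coredumpctl instead of handling the compressed files directly; this is a minimal sketch, and the PID shown is a placeholder taken from the file name in the example above:
[root@mon]# coredumpctl list
[root@mon]# coredumpctl dump 15622 -o /tmp/core.ceph-osd.15622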
To manually capture a core dump file for the Ceph Monitors and Ceph Managers:
Get the ceph-mon package details of the Ceph daemon from the container:
Red Hat Enterprise Linux 7:
[root@mon]# docker exec -it NAME /bin/bash
[root@mon]# rpm -qa | grep ceph
Red Hat Enterprise Linux 8:
[root@mon]# podman exec -it NAME /bin/bash
[root@mon]# rpm -qa | grep ceph
Replace NAME with the name of the Ceph container.
Make a backup copy of the ceph-mon@.service file and open it for editing:
[root@mon]# cp /etc/systemd/system/ceph-mon@.service /etc/systemd/system/ceph-mon@.service.orig
In the ceph-mon@.service file, add these three options to the [Service] section, each on a separate line:
--pid=host \
--ipc=host \
--cap-add=SYS_PTRACE \
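The original example for this step is not reproduced here. As an illustrative sketch only, assuming the [Service] section starts the monitor through a podman run (or docker run) command in ExecStart, the three options become additional arguments of that command and everything else in the file stays unchanged:
ExecStart=/usr/bin/podman run --rm --net=host \
--pid=host \
--ipc=host \
--cap-add=SYS_PTRACE \
... (remaining original options and container image unchanged)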
Restart the Ceph Monitor daemon:
Syntax
systemctl restart ceph-mon@MONITOR_ID
Replace MONITOR_ID with the ID number of the Ceph Monitor.
Example
[root@mon]# systemctl restart ceph-mon@1
Install the gdb package inside the Ceph Monitor container:
Red Hat Enterprise Linux 7:
[root@mon]# docker exec -it ceph-mon-MONITOR_ID /bin/bash
sh $ yum install gdb
Red Hat Enterprise Linux 8:
[root@mon]# podman exec -it ceph-mon-MONITOR_ID /bin/bash
sh $ yum install gdb
Replace MONITOR_ID with the ID number of the Ceph Monitor.
Find the process ID:
Syntax
ps -aef | grep PROCESS | grep -v run
Replace PROCESS with the name of the failed process, for example ceph-mon.
Example
[root@mon]# ps -aef | grep ceph-mon | grep -v run
ceph 15390 15266 0 18:54 ? 00:00:29 /usr/bin/ceph-mon --cluster ceph --setroot ceph --setgroup ceph -d -i 5
ceph 18110 17985 1 19:40 ? 00:00:08 /usr/bin/ceph-mon --cluster ceph --setroot ceph --setgroup ceph -d -i 2
Generate the core dump file:
Syntax
gcore ID
Replace ID with the ID of the failed process that you got from the previous step, for example 18110:
Example
[root@mon]# gcore 18110
warning: target file /proc/18110/cmdline contained unexpected null characters
Saved corefile core.18110
Verify that the core dump file has been generated correctly.
Example
[root@mon]# ls -ltr
total 709772
-rw-r--r--. 1 root root 726799544 Mar 18 19:46 core.18110
Copy the core dump file outside of the Ceph Monitor container:
Red Hat Enterprise Linux 7:
[root@mon]# docker cp ceph-mon-MONITOR_ID:/tmp/mon.core.MONITOR_PID /tmp
Red Hat Enterprise Linux 8:
[root@mon]# podman cp ceph-mon-MONITOR_ID:/tmp/mon.core.MONITOR_PID /tmp
Replace MONITOR_ID with the ID number of the Ceph Monitor and replace MONITOR_PID with the process ID number.
Restore the backup copy of the ceph-mon@.service file:
[root@mon]# cp /etc/systemd/system/ceph-mon@.service.orig /etc/systemd/system/ceph-mon@.service
Restart the Ceph Monitor daemon:
Syntax
systemctl restart ceph-mon@MONITOR_ID
Replace MONITOR_ID with the ID number of the Ceph Monitor.
Example
[root@mon]# systemctl restart ceph-mon@1
- Upload the core dump file for analysis by Red Hat support, see step 4.
To manually capture a core dump file for Ceph OSDs:
Get the ceph-osd package details of the Ceph daemon from the container:
Red Hat Enterprise Linux 7:
[root@osd]# docker exec -it NAME /bin/bash
[root@osd]# rpm -qa | grep ceph
Red Hat Enterprise Linux 8:
[root@osd]# podman exec -it NAME /bin/bash
[root@osd]# rpm -qa | grep ceph
Replace NAME with the name of the Ceph container.
Install the Ceph package for the same version of the ceph-osd package on the node where the Ceph containers are running:
Red Hat Enterprise Linux 7:
[root@osd]# yum install ceph-osd
Red Hat Enterprise Linux 8:
[root@osd]# dnf install ceph-osd
If needed, enable the appropriate repository first. See the Enabling the Red Hat Ceph Storage repositories section in the Installation Guide for details.
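A minimal sketch of matching the versions, assuming a Red Hat Enterprise Linux 8 node and using NAME and VERSION as placeholders, is to query the package inside the container and then install the same version on the host:
[root@osd]# podman exec -it NAME rpm -q ceph-osd
[root@osd]# dnf install ceph-osd-VERSION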
Find the ID of the process that has failed:
ps -aef | grep PROCESS | grep -v run
Replace PROCESS with the name of the failed process, for example ceph-osd.
[root@osd]# ps -aef | grep ceph-osd | grep -v run
ceph 15390 15266 0 18:54 ? 00:00:29 /usr/bin/ceph-osd --cluster ceph --setroot ceph --setgroup ceph -d -i 5
ceph 18110 17985 1 19:40 ? 00:00:08 /usr/bin/ceph-osd --cluster ceph --setroot ceph --setgroup ceph -d -i 2
Generate the core dump file:
gcore ID
Replace ID with the ID of the failed process that you got from the previous step, for example 18110:
[root@osd]# gcore 18110
warning: target file /proc/18110/cmdline contained unexpected null characters
Saved corefile core.18110
Verify that the core dump file has been generated correctly.
[root@osd]# ls -ltr
total 709772
-rw-r--r--. 1 root root 726799544 Mar 18 19:46 core.18110
- Upload the core dump file for analysis by Red Hat support, see the next step.
- Upload the core dump file for analysis to a Red Hat support case. See Providing information to Red Hat Support engineers for details.
11.3.4. Additional Resources
- The How to use gdb to generate a readable backtrace from an application core solution on the Red Hat Customer Portal
- The How to enable core file dumps when an application crashes or segmentation faults solution on the Red Hat Customer Portal
Appendix A. Ceph subsystems default logging level values
A table of the default logging level values for the various Ceph subsystems.
| Subsystem | Log Level | Memory Level |
|---|---|---|
| asok | 1 | 5 |
| auth | 1 | 5 |
| buffer | 0 | 0 |
| client | 0 | 5 |
| context | 0 | 5 |
| crush | 1 | 5 |
| default | 0 | 5 |
| filer | 0 | 5 |
| filestore | 1 | 5 |
| finisher | 1 | 5 |
| heartbeatmap | 1 | 5 |
| javaclient | 1 | 5 |
| journaler | 0 | 5 |
| journal | 1 | 5 |
| lockdep | 0 | 5 |
| mds balancer | 1 | 5 |
| mds locker | 1 | 5 |
| mds log | 1 | 5 |
| mds log expire | 1 | 5 |
| mds migrator | 1 | 5 |
| mds | 1 | 5 |
| monc | 0 | 5 |
| mon | 1 | 5 |
| ms | 0 | 5 |
| objclass | 0 | 5 |
| objectcacher | 0 | 5 |
| objecter | 0 | 0 |
| optracker | 0 | 5 |
| osd | 0 | 5 |
| paxos | 0 | 5 |
| perfcounter | 1 | 5 |
| rados | 0 | 5 |
| rbd | 0 | 5 |
| rgw | 1 | 5 |
| throttle | 1 | 5 |
| timer | 0 | 5 |
| tp | 0 | 5 |
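These defaults are expressed as log level/memory level pairs when you configure debugging. As a minimal sketch, assuming the syntax described in the Configuring logging section, a subsystem level can be raised at runtime or in the Ceph configuration file (the OSD ID and values are placeholders):
[root@mon ~]# ceph tell osd.0 injectargs --debug-osd 5/5
Or, in the [osd] section of the Ceph configuration file:
debug_osd = 5/5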