Chapter 8. Troubleshooting Ceph placement groups
This section contains information about fixing the most common errors related to the Ceph Placement Groups (PGs).
8.1. Prerequisites
- Verify your network connection.
- Ensure that Monitors are able to form a quorum.
- Ensure that all healthy OSDs are up and in, and the backfilling and recovery processes are finished.
8.2. Most common Ceph placement groups errors
The following table lists the most common error messages that are returned by the ceph health detail command. The table provides links to corresponding sections that explain the errors and point to specific procedures to fix the problems.
In addition, you can list placement groups that are stuck in a state that is not optimal. See Section 8.3, “Listing placement groups stuck in stale, inactive, or unclean state” for details.
8.2.1. Prerequisites
- A running Red Hat Ceph Storage cluster.
- A running Ceph Object Gateway.
8.2.2. Placement group error messages
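The following table lists common placement group error messages and the sections that describe a potential fix.
Error message | See |
---|---|
pgs stale | Section 8.2.3, Stale placement groups |
pgs inconsistent | Section 8.2.4, Inconsistent placement groups |
pgs stuck unclean | Section 8.2.5, Unclean placement groups |
pgs stuck inactive | Section 8.2.6, Inactive placement groups |
pgs down | Section 8.2.7, Placement groups are down |
unfound | Section 8.2.8, Unfound objects |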
8.2.3. Stale placement groups
The ceph health command lists some Placement Groups (PGs) as stale:
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
What This Means
The Monitor marks a placement group as stale when it does not receive any status update from the primary OSD of the placement group’s acting set or when other OSDs reported that the primary OSD is down.
Usually, PGs enter the stale state after you start the storage cluster and until the peering process completes. However, when the PGs remain stale for longer than expected, it might indicate that the primary OSD for those PGs is down or not reporting PG statistics to the Monitor. When the primary OSD storing stale PGs is back up, Ceph starts to recover the PGs.
The mon_osd_report_timeout setting determines how long the Monitors wait for a report from an OSD before they mark the OSD as down. By default, this parameter is set to 900, which means that an OSD that has not reported to the Monitors for 900 seconds is declared down.
To Troubleshoot This Problem
- Identify which PGs are stale and on what OSDs they are stored. The error message includes information similar to the following example:
Example
[ceph: root@host01 /]# ceph health detail
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
...
pg 2.5 is stuck stale+active+remapped, last acting [2,0]
...
osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
- Troubleshoot any problems with the OSDs that are marked as down. For details, see Down OSDs.
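The first step above, reading the stale PGs and the down OSDs out of the health report, can be sketched as a small parser. This is an illustrative sketch only: it assumes the line layout shown in the example output, and the function name parse_health_detail is hypothetical. Other Ceph releases may format these lines differently.

```python
import re

# Sample text copied from the example above. On a live cluster this would be
# the captured output of `ceph health detail`.
SAMPLE = """\
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
pg 2.5 is stuck stale+active+remapped, last acting [2,0]
osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
"""

# Assumed line formats, based only on the example output.
PG_LINE = re.compile(r"^pg (\S+) is stuck (\S+), last acting \[([\d,]*)\]")
OSD_LINE = re.compile(r"^osd\.(\d+) is down since epoch (\d+)")


def parse_health_detail(text):
    """Return (stale_pgs, down_osds) found in health detail text.

    stale_pgs maps a PG ID to its last acting set; down_osds is a list of
    OSD numbers reported as down.
    """
    stale_pgs = {}
    down_osds = []
    for line in text.splitlines():
        line = line.strip()
        pg = PG_LINE.match(line)
        if pg and "stale" in pg.group(2).split("+"):
            stale_pgs[pg.group(1)] = [int(o) for o in pg.group(3).split(",") if o]
            continue
        osd = OSD_LINE.match(line)
        if osd:
            down_osds.append(int(osd.group(1)))
    return stale_pgs, down_osds


stale_pgs, down_osds = parse_health_detail(SAMPLE)
print("stale PGs and last acting sets:", stale_pgs)
print("down OSDs:", down_osds)
```

Comparing the acting sets of the stale PGs with the list of down OSDs shows which OSDs to troubleshoot first.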
Additional Resources
- The Monitoring Placement Group Sets section in the Administration Guide for Red Hat Ceph Storage 5
8.2.4. Inconsistent placement groups
Some placement groups are marked as active+clean+inconsistent and the ceph health detail command returns an error message similar to the following one:
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors
What This Means
When Ceph detects inconsistencies in one or more replicas of an object in a placement group, it marks the placement group as inconsistent. The most common inconsistencies are:
- Objects have an incorrect size.
- Objects are missing from one replica after a recovery finished.
In most cases, errors during scrubbing cause inconsistency within placement groups.
To Troubleshoot This Problem
- Log in to the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
- Determine which placement group is in the inconsistent state:
[ceph: root@host01 /]# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors
- Determine why the placement group is inconsistent.
  - Start the deep scrubbing process on the placement group:
Syntax
ceph pg deep-scrub ID
Replace ID with the ID of the inconsistent placement group, for example:
[ceph: root@host01 /]# ceph pg deep-scrub 0.6
instructing pg 0.6 on osd.0 to deep-scrub
  - Search the output of the ceph -w command for any messages related to that placement group:
Syntax
ceph -w | grep ID
Replace ID with the ID of the inconsistent placement group, for example:
[ceph: root@host01 /]# ceph -w | grep 0.6
2022-05-26 01:35:36.778215 osd.106 [ERR] 0.6 deep-scrub stat mismatch, got 636/635 objects, 0/0 clones, 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 1855455/1854371 bytes.
2022-05-26 01:35:36.788334 osd.106 [ERR] 0.6 deep-scrub 1 errors
- If the output includes any error messages similar to the following ones, you can repair the inconsistent placement group. See Repairing inconsistent placement groups for details.
Syntax
PG.ID shard OSD: soid OBJECT missing attr , missing attr _ATTRIBUTE_TYPE
PG.ID shard OSD: soid OBJECT digest 0 != known digest DIGEST, size 0 != known size SIZE
PG.ID shard OSD: soid OBJECT size 0 != known size SIZE
PG.ID deep-scrub stat mismatch, got MISMATCH
PG.ID shard OSD: soid OBJECT candidate had a read error, digest 0 != known digest DIGEST
- If the output includes any error messages similar to the following ones, it is not safe to repair the inconsistent placement group because you can lose data. Open a support ticket in this situation. See Contacting Red Hat support for details.
PG.ID shard OSD: soid OBJECT digest DIGEST != known digest DIGEST
PG.ID shard OSD: soid OBJECT omap_digest DIGEST != known omap_digest DIGEST
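The distinction between the two groups of messages above can be sketched as a small classifier. This is a rough illustration, not a Red Hat tool: the regular expressions are assumptions derived only from the message templates listed above, the function name classify is hypothetical, and the sample digest values are made up. Treat any "unknown" result as a reason to investigate manually rather than as permission to repair.

```python
import re

# Patterns derived from the message templates above. Unsafe patterns are
# checked first, because a single line can contain several fragments.
UNSAFE_PATTERNS = [
    re.compile(r"\bomap_digest \S+ != known omap_digest \S+"),
    # A data digest mismatch where the reported digest is not the literal 0.
    re.compile(r"\bdigest (?!0 )\S+ != known digest \S+"),
]
SAFE_PATTERNS = [
    re.compile(r"missing attr"),
    re.compile(r"candidate had a read error"),
    re.compile(r"\bdigest 0 != known digest \S+"),
    re.compile(r"\bsize 0 != known size \S+"),
    re.compile(r"deep-scrub stat mismatch"),
]


def classify(line):
    """Return 'open-ticket', 'repairable', or 'unknown' for one log line."""
    if any(p.search(line) for p in UNSAFE_PATTERNS):
        return "open-ticket"
    if any(p.search(line) for p in SAFE_PATTERNS):
        return "repairable"
    return "unknown"


# Hypothetical log lines that follow the templates; the values are invented.
for sample in [
    "0.6 deep-scrub stat mismatch, got 636/635 objects",
    "0.6 shard 2: soid foo digest 0 != known digest 0xe978e67f, size 0 != known size 968",
    "0.6 shard 1: soid foo digest 0x2d4a11c2 != known digest 0xe978e67f",
    "0.6 shard 1: soid foo omap_digest 0x1 != known omap_digest 0x2",
]:
    print(classify(sample), "<-", sample)
```

Checking the unsafe patterns first is a deliberate choice: a line that matches both groups is treated as unsafe.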
Additional Resources
- See the Listing placement group inconsistencies section in the Red Hat Ceph Storage Troubleshooting Guide.
- See the Ceph data integrity section in the Red Hat Ceph Storage Architecture Guide.
- See the Scrubbing the OSD section in the Red Hat Ceph Storage Configuration Guide.
8.2.5. Unclean placement groups
The ceph health command returns an error message similar to the following one:
HEALTH_WARN 197 pgs stuck unclean
What This Means
Ceph marks a placement group as unclean if it has not achieved the active+clean state for the number of seconds specified in the mon_pg_stuck_threshold parameter in the Ceph configuration file. The default value of mon_pg_stuck_threshold is 300 seconds.
If a placement group is unclean, it contains objects that are not replicated the number of times specified in the osd_pool_default_size parameter. The default value of osd_pool_default_size is 3, which means that Ceph creates three replicas.
Usually, unclean placement groups indicate that some OSDs might be down.
To Troubleshoot This Problem
- Determine which OSDs are down:
[ceph: root@host01 /]# ceph osd tree
- Troubleshoot and fix any problems with the OSDs. See Down OSDs for details.
8.2.6. Inactive placement groups
The ceph health command returns an error message similar to the following one:
HEALTH_WARN 197 pgs stuck inactive
What This Means
Ceph marks a placement group as inactive if it has not been active for the number of seconds specified in the mon_pg_stuck_threshold parameter in the Ceph configuration file. The default value of mon_pg_stuck_threshold is 300 seconds.
Usually, inactive placement groups indicate that some OSDs might be down.
To Troubleshoot This Problem
- Determine which OSDs are down:
[ceph: root@host01 /]# ceph osd tree
- Troubleshoot and fix any problems with the OSDs. See Down OSDs for details.
8.2.7. Placement groups are down
The ceph health detail command reports that some placement groups are down:
HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
...
pg 0.5 is down+peering
pg 1.4 is down+peering
...
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
What This Means
In certain cases, the peering process can be blocked, which prevents a placement group from becoming active and usable. Usually, a failure of an OSD causes the peering failures.
To Troubleshoot This Problem
- Determine what blocks the peering process:
Syntax
ceph pg ID query
Replace ID with the ID of the placement group that is down:
Example
[ceph: root@host01 /]# ceph pg 0.5 query
{ "state": "down+peering",
  ...
  "recovery_state": [
    { "name": "Started\/Primary\/Peering\/GetInfo",
      "enter_time": "2021-08-06 14:40:16.169679",
      "requested_info_from": []},
    { "name": "Started\/Primary\/Peering",
      "enter_time": "2021-08-06 14:40:16.169659",
      "probing_osds": [ 0, 1],
      "blocked": "peering is blocked due to down osds",
      "down_osds_we_would_probe": [ 1],
      "peering_blocked_by": [
        { "osd": 1,
          "current_lost_at": 0,
          "comment": "starting or marking this osd lost may let us proceed"}]},
    { "name": "Started",
      "enter_time": "2021-08-06 14:40:16.169513"}
  ]
}
The recovery_state section includes information on why the peering process is blocked.
  - If the output includes the peering is blocked due to down osds error message, see Down OSDs.
  - If you see any other error message, open a support ticket. See Contacting Red Hat Support for service for details.
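Reading the recovery_state section by eye is error-prone on large outputs. The following sketch pulls out the blocking reason and the OSDs named in peering_blocked_by. It assumes the JSON layout shown in the example above; the sample is a trimmed copy of that output, and the function name peering_blockers is hypothetical.

```python
import json

# Trimmed copy of the example `ceph pg 0.5 query` output above. A raw string
# is used because the JSON contains escaped slashes.
SAMPLE = r"""
{ "state": "down+peering",
  "recovery_state": [
    { "name": "Started\/Primary\/Peering\/GetInfo",
      "enter_time": "2021-08-06 14:40:16.169679",
      "requested_info_from": []},
    { "name": "Started\/Primary\/Peering",
      "enter_time": "2021-08-06 14:40:16.169659",
      "probing_osds": [0, 1],
      "blocked": "peering is blocked due to down osds",
      "down_osds_we_would_probe": [1],
      "peering_blocked_by": [
        { "osd": 1,
          "current_lost_at": 0,
          "comment": "starting or marking this osd lost may let us proceed"}]},
    { "name": "Started",
      "enter_time": "2021-08-06 14:40:16.169513"}]}
"""


def peering_blockers(query):
    """Return a list of (reason, [osd ids]) for blocked recovery_state entries."""
    blockers = []
    for entry in query.get("recovery_state", []):
        if "blocked" in entry:
            osds = [b["osd"] for b in entry.get("peering_blocked_by", [])]
            blockers.append((entry["blocked"], osds))
    return blockers


blockers = peering_blockers(json.loads(SAMPLE))
for reason, osds in blockers:
    print(f"{reason}: OSDs {osds}")
```

An empty result means that no recovery_state entry reported a blocked key, in which case the cause has to be found elsewhere in the query output.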
Additional Resources
- The Ceph OSD peering section in the Red Hat Ceph Storage Administration Guide.
8.2.8. Unfound objects
The ceph health command returns an error message similar to the following one, containing the unfound keyword:
HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
What This Means
Ceph marks objects as unfound when it knows these objects or their newer copies exist but it is unable to find them. As a consequence, Ceph cannot recover such objects and proceed with the recovery process.
An Example Situation
A placement group stores data on osd.1 and osd.2.
- osd.1 goes down.
- osd.2 handles some write operations.
- osd.1 comes up.
- A peering process between osd.1 and osd.2 starts, and the objects missing on osd.1 are queued for recovery.
- Before Ceph copies new objects, osd.2 goes down.
As a result, osd.1 knows that these objects exist, but there is no OSD that has a copy of the objects.
In this scenario, Ceph is waiting for the failed node to be accessible again, and the unfound objects block the recovery process.
To Troubleshoot This Problem
- Log in to the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
- Determine which placement group contains unfound objects:
[ceph: root@host01 /]# ceph health detail
HEALTH_WARN 1 pgs recovering; 1 pgs stuck unclean; recovery 5/937611 objects degraded (0.001%); 1/312537 unfound (0.000%)
pg 3.8a5 is stuck unclean for 803946.712780, current state active+recovering, last acting [320,248,0]
pg 3.8a5 is active+recovering, acting [320,248,0], 1 unfound
recovery 5/937611 objects degraded (0.001%); 1/312537 unfound (0.000%)
- List more information about the placement group:
Syntax
ceph pg ID query
Replace ID with the ID of the placement group containing the unfound objects:
Example
[ceph: root@host01 /]# ceph pg 3.8a5 query
{ "state": "active+recovering",
  "epoch": 10741,
  "up": [ 320, 248, 0],
  "acting": [ 320, 248, 0],
<snip>
  "recovery_state": [
    { "name": "Started\/Primary\/Active",
      "enter_time": "2021-08-28 19:30:12.058136",
      "might_have_unfound": [
        { "osd": "0", "status": "already probed"},
        { "osd": "248", "status": "already probed"},
        { "osd": "301", "status": "already probed"},
        { "osd": "362", "status": "already probed"},
        { "osd": "395", "status": "already probed"},
        { "osd": "429", "status": "osd is down"}],
      "recovery_progress": { "backfill_targets": [],
          "waiting_on_backfill": [],
          "last_backfill_started": "0\/\/0\/\/-1",
          "backfill_info": { "begin": "0\/\/0\/\/-1",
              "end": "0\/\/0\/\/-1",
              "objects": []},
          "peer_backfill_info": [],
          "backfills_in_flight": [],
          "recovering": [],
          "pg_backend": { "pull_from_peer": [],
              "pushing": []}},
      "scrub": { "scrubber.epoch_start": "0",
          "scrubber.active": 0,
          "scrubber.block_writes": 0,
          "scrubber.finalizing": 0,
          "scrubber.waiting_on": 0,
          "scrubber.waiting_on_whom": []}},
    { "name": "Started",
      "enter_time": "2021-08-28 19:30:11.044020"}],
The might_have_unfound section includes OSDs where Ceph tried to locate the unfound objects:
  - The already probed status indicates that Ceph cannot locate the unfound objects in that OSD.
  - The osd is down status indicates that Ceph cannot contact that OSD.
- Troubleshoot the OSDs that are marked as down. See Down OSDs for details.
- If you are unable to fix the problem that causes the OSD to be down, open a support ticket. See Contacting Red Hat Support for service for details.
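The might_have_unfound list described above can be grouped by status to see at a glance which OSDs were already searched and which could not be contacted. This is a minimal sketch that assumes the JSON layout from the example; the sample is a shortened copy of that output, and the function name unfound_candidates is hypothetical.

```python
import json
from collections import defaultdict

# Shortened copy of the example `ceph pg 3.8a5 query` output above.
SAMPLE = r"""
{ "state": "active+recovering",
  "recovery_state": [
    { "name": "Started\/Primary\/Active",
      "enter_time": "2021-08-28 19:30:12.058136",
      "might_have_unfound": [
        { "osd": "0", "status": "already probed"},
        { "osd": "248", "status": "already probed"},
        { "osd": "429", "status": "osd is down"}]},
    { "name": "Started",
      "enter_time": "2021-08-28 19:30:11.044020"}]}
"""


def unfound_candidates(query):
    """Group the OSD numbers in might_have_unfound by their status string."""
    by_status = defaultdict(list)
    for entry in query.get("recovery_state", []):
        for candidate in entry.get("might_have_unfound", []):
            # The example output reports OSD numbers as strings.
            by_status[candidate["status"]].append(int(candidate["osd"]))
    return dict(by_status)


candidates = unfound_candidates(json.loads(SAMPLE))
print("could not be contacted:", candidates.get("osd is down", []))
print("searched without success:", candidates.get("already probed", []))
```

The OSDs listed under osd is down are the ones to troubleshoot, because they are the only remaining places where the unfound objects might still exist.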
8.3. Listing placement groups stuck in stale, inactive, or unclean state
After a failure, placement groups enter states like degraded or peering. These states indicate normal progression through the failure recovery process.
However, if a placement group stays in one of these states for a longer time than expected, it can be an indication of a larger problem. The Monitors report when placement groups get stuck in a state that is not optimal.
The mon_pg_stuck_threshold option in the Ceph configuration file determines the number of seconds after which placement groups are considered inactive, unclean, or stale.
The following table lists these states together with a short explanation.
State | What it means | Most common causes | See |
---|---|---|---|
inactive | The PG has not been able to service read/write requests. | | |
unclean | The PG contains objects that are not replicated the desired number of times. Something is preventing the PG from recovering. | | |
stale | The status of the PG has not been updated by a ceph-osd daemon. | | |
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the node.
Procedure
- Log in to the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
- List the stuck PGs:
Example
[ceph: root@host01 /]# ceph pg dump_stuck inactive
[ceph: root@host01 /]# ceph pg dump_stuck unclean
[ceph: root@host01 /]# ceph pg dump_stuck stale
Additional Resources
- See the Placement Group States section in the Red Hat Ceph Storage Administration Guide.
8.4. Listing placement group inconsistencies
Use the rados utility to list inconsistencies in various replicas of objects. Use the --format=json-pretty option to produce more detailed output.
This section covers the listing of:
- Inconsistent placement groups in a pool
- Inconsistent objects in a placement group
- Inconsistent snapshot sets in a placement group
Prerequisites
- A running Red Hat Ceph Storage cluster in a healthy state.
- Root-level access to the node.
Procedure
- List all the inconsistent placement groups in a pool:
Syntax
rados list-inconsistent-pg POOL --format=json-pretty
Example
[ceph: root@host01 /]# rados list-inconsistent-pg data --format=json-pretty
[0.6]
- List inconsistent objects in a placement group with ID:
Syntax
rados list-inconsistent-obj PLACEMENT_GROUP_ID
Example
[ceph: root@host01 /]# rados list-inconsistent-obj 0.6
{
    "epoch": 14,
    "inconsistents": [
        {
            "object": { "name": "image1", "nspace": "", "locator": "", "snap": "head", "version": 1 },
            "errors": [ "data_digest_mismatch", "size_mismatch" ],
            "union_shard_errors": [ "data_digest_mismatch_oi", "size_mismatch_oi" ],
            "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
            "shards": [
                { "osd": 0, "errors": [], "size": 968, "omap_digest": "0xffffffff", "data_digest": "0xe978e67f" },
                { "osd": 1, "errors": [], "size": 968, "omap_digest": "0xffffffff", "data_digest": "0xe978e67f" },
                { "osd": 2, "errors": [ "data_digest_mismatch_oi", "size_mismatch_oi" ], "size": 0, "omap_digest": "0xffffffff", "data_digest": "0xffffffff" }
            ]
        }
    ]
}
The following fields are important to determine what causes the inconsistency:
- name: The name of the object with inconsistent replicas.
- nspace: The namespace that is a logical separation of a pool. It is empty by default.
- locator: The key that is used as the alternative of the object name for placement.
- snap: The snapshot ID of the object. The only writable version of the object is called head. If an object is a clone, this field includes its sequential ID.
- version: The version ID of the object with inconsistent replicas. Each write operation to an object increments it.
- errors: A list of errors that indicate inconsistencies between shards without determining which shard or shards are incorrect. See the shards array to further investigate the errors.
  - data_digest_mismatch: The digest of the replica read from one OSD is different from the other OSDs.
  - size_mismatch: The size of a clone or the head object does not match the expectation.
  - read_error: This error indicates inconsistencies caused most likely by disk errors.
- union_shard_errors: The union of all errors specific to shards. These errors are connected to a faulty shard. The errors that end with oi indicate that you have to compare the information from a faulty object with the selected object information. See the shards array to further investigate the errors.
In the above example, the object replica stored on osd.2 has a different digest than the replicas stored on osd.0 and osd.1. Specifically, the digest calculated from the shard read from osd.2 is 0xffffffff, not the expected 0xe978e67f. In addition, the size of the replica read from osd.2 is 0, while the size reported by osd.0 and osd.1 is 968.
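The comparison described above, finding which shard carries the errors and how it differs from the others, can be sketched in a few lines. This is an illustration only: it assumes the JSON layout shown in the example output, uses a trimmed copy of that output as sample data, and the function name faulty_shards is hypothetical.

```python
import json

# Trimmed copy of the example `rados list-inconsistent-obj 0.6` output above.
SAMPLE = """
{ "epoch": 14,
  "inconsistents": [
    { "object": { "name": "image1", "nspace": "", "locator": "",
                  "snap": "head", "version": 1 },
      "errors": [ "data_digest_mismatch", "size_mismatch" ],
      "union_shard_errors": [ "data_digest_mismatch_oi", "size_mismatch_oi" ],
      "shards": [
        { "osd": 0, "errors": [], "size": 968,
          "omap_digest": "0xffffffff", "data_digest": "0xe978e67f" },
        { "osd": 1, "errors": [], "size": 968,
          "omap_digest": "0xffffffff", "data_digest": "0xe978e67f" },
        { "osd": 2, "errors": [ "data_digest_mismatch_oi", "size_mismatch_oi" ],
          "size": 0,
          "omap_digest": "0xffffffff", "data_digest": "0xffffffff" } ] } ] }
"""


def faulty_shards(report):
    """Return (object name, osd, errors, size, data_digest) per shard with errors."""
    result = []
    for item in report["inconsistents"]:
        name = item["object"]["name"]
        for shard in item["shards"]:
            if shard["errors"]:
                result.append((name, shard["osd"], shard["errors"],
                               shard["size"], shard["data_digest"]))
    return result


faulty = faulty_shards(json.loads(SAMPLE))
for name, osd, errors, size, digest in faulty:
    print(f"{name}: osd.{osd} size={size} data_digest={digest} errors={errors}")
```

For the example data, only the shard on osd.2 is reported, which matches the explanation above.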
- List inconsistent sets of snapshots:
Syntax
rados list-inconsistent-snapset PLACEMENT_GROUP_ID
Example
[ceph: root@host01 /]# rados list-inconsistent-snapset 0.23 --format=json-pretty
{
    "epoch": 64,
    "inconsistents": [
        { "name": "obj5", "nspace": "", "locator": "", "snap": "0x00000001", "headless": true },
        { "name": "obj5", "nspace": "", "locator": "", "snap": "0x00000002", "headless": true },
        { "name": "obj5", "nspace": "", "locator": "", "snap": "head", "ss_attr_missing": true, "extra_clones": true, "extra clones": [ 2, 1 ] }
    ]
}
The command returns the following errors:
  - ss_attr_missing: One or more attributes are missing. Attributes are information about snapshots encoded into a snapshot set as a list of key-value pairs.
  - ss_attr_corrupted: One or more attributes fail to decode.
  - clone_missing: A clone is missing.
  - snapset_mismatch: The snapshot set is inconsistent by itself.
  - head_mismatch: The snapshot set indicates whether head exists, but the scrub results report otherwise.
  - headless: The head of the snapshot set is missing.
  - size_mismatch: The size of a clone or the head object does not match the expectation.
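The error flags listed above appear in the output as keys set to true on each entry. The following sketch collects them per object and snapshot. It assumes the JSON layout of the example output, checks only the error names documented above (other keys, such as extra_clones, are ignored), and the function name snapset_errors is hypothetical.

```python
import json

# Copy of the example `rados list-inconsistent-snapset 0.23` output above.
SAMPLE = """
{ "epoch": 64,
  "inconsistents": [
    { "name": "obj5", "nspace": "", "locator": "",
      "snap": "0x00000001", "headless": true },
    { "name": "obj5", "nspace": "", "locator": "",
      "snap": "0x00000002", "headless": true },
    { "name": "obj5", "nspace": "", "locator": "", "snap": "head",
      "ss_attr_missing": true, "extra_clones": true,
      "extra clones": [ 2, 1 ] } ] }
"""

# Error names taken from the list above.
KNOWN_ERRORS = (
    "ss_attr_missing", "ss_attr_corrupted", "clone_missing",
    "snapset_mismatch", "head_mismatch", "headless", "size_mismatch",
)


def snapset_errors(report):
    """Return (name, snap, [error flags set to true]) for each entry."""
    return [
        (item["name"], item["snap"],
         [err for err in KNOWN_ERRORS if item.get(err) is True])
        for item in report["inconsistents"]
    ]


errors = snapset_errors(json.loads(SAMPLE))
for name, snap, flags in errors:
    print(f"{name}@{snap}: {', '.join(flags) or 'no documented error flag'}")
```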
Additional Resources
- Inconsistent placement groups section in the Red Hat Ceph Storage Troubleshooting Guide.
- Repairing inconsistent placement groups section in the Red Hat Ceph Storage Troubleshooting Guide.
8.5. Repairing inconsistent placement groups
Due to an error during deep scrubbing, some placement groups can include inconsistencies. Ceph reports such placement groups as inconsistent:
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors
You can repair only certain inconsistencies.
Do not repair the placement groups if the Ceph logs include the following errors:
PG.ID shard OSD: soid OBJECT digest DIGEST != known digest DIGEST
PG.ID shard OSD: soid OBJECT omap_digest DIGEST != known omap_digest DIGEST
Open a support ticket instead. See Contacting Red Hat Support for service for details.
Prerequisites
- Root-level access to the Ceph Monitor node.
Procedure
- Repair the inconsistent placement groups:
Syntax
ceph pg repair ID
Replace ID with the ID of the inconsistent placement group.
Additional Resources
- See the Inconsistent placement groups section in the Red Hat Ceph Storage Troubleshooting Guide.
- See the Listing placement group inconsistencies section in the Red Hat Ceph Storage Troubleshooting Guide.
8.6. Increasing the placement group
An insufficient Placement Group (PG) count impacts the performance of the Ceph cluster and data distribution. It is one of the main causes of the nearfull osds error messages.
The recommended ratio is between 100 and 300 PGs per OSD. This ratio can decrease when you add more OSDs to the cluster.
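To check a cluster against the ratio above, you can estimate the current number of PGs per OSD from the pool settings. The following is a rough sketch under stated assumptions: it counts every replica of a PG as one PG instance on an OSD (so a replicated pool contributes pg_num multiplied by its size), it ignores erasure-coded pools and uneven data distribution, and the function names and pool values are hypothetical.

```python
def pgs_per_osd(pools, osd_count):
    """Estimate the average number of PG replicas hosted by each OSD.

    pools is a list of (pg_num, replica_size) tuples, one per pool.
    """
    if osd_count <= 0:
        raise ValueError("osd_count must be positive")
    total_pg_replicas = sum(pg_num * size for pg_num, size in pools)
    return total_pg_replicas / osd_count


def within_recommended_ratio(ratio, low=100, high=300):
    """Check the ratio against the 100 to 300 PGs per OSD range given above."""
    return low <= ratio <= high


# Hypothetical cluster: two replicated pools on 8 OSDs.
pools = [(256, 3), (128, 3)]
ratio = pgs_per_osd(pools, osd_count=8)
print(f"{ratio:.1f} PGs per OSD, within range: {within_recommended_ratio(ratio)}")
```

Because the ratio is divided by the OSD count, adding OSDs without raising pg_num lowers it, which is the effect described above.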
The pg_num and pgp_num parameters determine the PG count. These parameters are configured for each pool, and therefore, you must adjust each pool with a low PG count separately.
Increasing the PG count is the most intensive process that you can perform on a Ceph cluster. This process might have a serious performance impact if not done in a slow and methodical way. Once you increase pgp_num, you will not be able to stop or reverse the process and you must complete it. Consider increasing the PG count outside of business-critical processing hours, and alert all clients about the potential performance impact. Do not change the PG count if the cluster is in the HEALTH_ERR state.
Prerequisites
- A running Red Hat Ceph Storage cluster in a healthy state.
- Root-level access to the node.
Procedure
- Reduce the impact of data redistribution and recovery on individual OSDs and OSD hosts:
  - Lower the value of the osd_max_backfills, osd_recovery_max_active, and osd_recovery_op_priority parameters:
[ceph: root@host01 /]# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'
  - Disable the shallow and deep scrubbing:
[ceph: root@host01 /]# ceph osd set noscrub
[ceph: root@host01 /]# ceph osd set nodeep-scrub
- Use the Ceph Placement Groups (PGs) per Pool Calculator to calculate the optimal value of the pg_num and pgp_num parameters.
- Increase the pg_num value in small increments until you reach the desired value.
  - Determine the starting increment value. Use a very low value that is a power of two, and increase it when you determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
  - Increment the pg_num value:
Syntax
ceph osd pool set POOL pg_num VALUE
Specify the pool name and the new value, for example:
Example
[ceph: root@host01 /]# ceph osd pool set data pg_num 4
  - Monitor the status of the cluster:
Example
[ceph: root@host01 /]# ceph -s
The PGs state will change from creating to active+clean. Wait until all PGs are in the active+clean state.
- Increase the pgp_num value in small increments until you reach the desired value:
  - Determine the starting increment value. Use a very low value that is a power of two, and increase it when you determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
  - Increment the pgp_num value:
Syntax
ceph osd pool set POOL pgp_num VALUE
Specify the pool name and the new value, for example:
Example
[ceph: root@host01 /]# ceph osd pool set data pgp_num 4
  - Monitor the status of the cluster:
Example
[ceph: root@host01 /]# ceph -s
The PGs state will change through peering, wait_backfill, backfilling, recover, and others. Wait until all PGs are in the active+clean state.
- Repeat the previous steps for all pools with insufficient PG count.
- Set osd_max_backfills, osd_recovery_max_active, and osd_recovery_op_priority to their default values:
[ceph: root@host01 /]# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3 --osd_recovery_op_priority 3'
- Enable the shallow and deep scrubbing:
[ceph: root@host01 /]# ceph osd unset noscrub
[ceph: root@host01 /]# ceph osd unset nodeep-scrub
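The incremental part of the procedure above can be planned ahead of time. The following sketch only generates the sequence of ceph osd pool set commands for a fixed increment; it does not run them and does not wait for the cluster. The function names, the pool name, and the numbers are hypothetical, and a fixed increment is a simplifying assumption, because the procedure suggests raising the increment once you know its impact. In practice, wait for all PGs to reach active+clean between every two commands.

```python
def step_values(current, target, step):
    """Return the intermediate values from current up to target, step by step."""
    if step <= 0:
        raise ValueError("step must be positive")
    values = []
    value = current
    while value < target:
        value = min(value + step, target)
        values.append(value)
    return values


def plan_commands(pool, current, target, step):
    """Build the commands in procedure order: all pg_num steps, then pgp_num."""
    commands = []
    for parameter in ("pg_num", "pgp_num"):
        for value in step_values(current, target, step):
            commands.append(f"ceph osd pool set {pool} {parameter} {value}")
    return commands


# Hypothetical example: grow the "data" pool from 64 to 128 PGs in steps of 32.
commands = plan_commands("data", current=64, target=128, step=32)
for command in commands:
    print(command)
```

Printing the plan first makes it easier to review the number of steps and to schedule them outside of busy hours.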
Additional Resources
- See the Nearfull OSDs section in the Red Hat Ceph Storage Troubleshooting Guide.
- See the Monitoring Placement Group Sets section in the Red Hat Ceph Storage Administration Guide.
8.7. Additional Resources
- See Chapter 3, Troubleshooting networking issues for details.
- See Chapter 4, Troubleshooting Ceph Monitors for details about troubleshooting the most common errors related to Ceph Monitors.
- See Chapter 5, Troubleshooting Ceph OSDs for details about troubleshooting the most common errors related to Ceph OSDs.
- See the Auto-scaling placement groups section in the Red Hat Ceph Storage Storage Strategies Guide for more information on PG autoscaler.