Chapter 6. Troubleshooting Placement Groups
This section contains information about fixing the most common errors related to the Ceph Placement Groups (PGs).
6.1. Before You Start
- Verify your network connection. See Chapter 3, Troubleshooting Networking Issues for details.
- Ensure that Monitors are able to form a quorum. See Chapter 4, Troubleshooting Monitors for details about troubleshooting the most common errors related to Monitors.
- Ensure that all healthy OSDs are up and in, and the backfilling and recovery processes are finished. See Chapter 5, Troubleshooting OSDs for details about troubleshooting the most common errors related to OSDs.
6.2. Listing Placement Groups in stale, inactive, or unclean State
After a failure, placement groups enter states like degraded or peering. These states indicate normal progression through the failure recovery process.
However, if a placement group stays in one of these states longer than expected, it can be an indication of a larger problem. The Monitors report when placement groups get stuck in a state that is not optimal.
The following table lists these states together with a short explanation.
State | What it means | Most common causes
---|---|---
inactive | The PG has not been able to service read/write requests. | Peering problems
unclean | The PG contains objects that are not replicated the desired number of times. Something is preventing the PG from recovering. | unfound objects, OSDs are down, incorrect configuration
stale | The status of the PG has not been updated by a ceph-osd daemon. | OSDs are down
The mon_pg_stuck_threshold parameter in the Ceph configuration file determines the number of seconds after which placement groups are considered inactive, unclean, or stale.
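For example, a minimal sketch of overriding this threshold in the Ceph configuration file on the Monitor nodes (the value shown is illustrative, not a recommendation):

[global]
mon_pg_stuck_threshold = 300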
List the stuck PGs:
# ceph pg dump_stuck inactive
# ceph pg dump_stuck unclean
# ceph pg dump_stuck stale
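If you need machine-readable output for scripting, the ceph utility also accepts the --format option here; a brief sketch:

# ceph pg dump_stuck stale --format=json-pretty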
See Also
- The Monitoring Placement Group States section in the Administration Guide for Red Hat Ceph Storage 2
6.3. Listing Inconsistencies
Use the rados utility to list inconsistencies in various replicas of an object. Use the --format=json-pretty option to produce more detailed output.
You can list:
- inconsistent placement groups in a pool
- inconsistent objects in a placement group
- inconsistent snapshot sets in a placement group
Listing Inconsistent Placement Groups in a Pool
rados list-inconsistent-pg <pool> --format=json-pretty
For example, list all inconsistent placement groups in a pool named data:
# rados list-inconsistent-pg data --format=json-pretty
[0.6]
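To check every pool in one pass, a minimal shell sketch (assuming the rados utility can reach the cluster with the default admin keyring):

# for pool in $(rados lspools); do echo "${pool}:"; rados list-inconsistent-pg ${pool}; done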
Listing Inconsistent Objects in a Placement Group
rados list-inconsistent-obj <placement-group-id>
For example, list inconsistent objects in a placement group with ID 0.6:
# rados list-inconsistent-obj 0.6
{
    "epoch": 14,
    "inconsistents": [
        {
            "object": {
                "name": "image1",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 1
            },
            "errors": [
                "data_digest_mismatch",
                "size_mismatch"
            ],
            "union_shard_errors": [
                "data_digest_mismatch_oi",
                "size_mismatch_oi"
            ],
            "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
            "shards": [
                {
                    "osd": 0,
                    "errors": [],
                    "size": 968,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xe978e67f"
                },
                {
                    "osd": 1,
                    "errors": [],
                    "size": 968,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xe978e67f"
                },
                {
                    "osd": 2,
                    "errors": [
                        "data_digest_mismatch_oi",
                        "size_mismatch_oi"
                    ],
                    "size": 0,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xffffffff"
                }
            ]
        }
    ]
}
The following fields are important to determine what causes the inconsistency:
- name: The name of the object with inconsistent replicas.
- nspace: The namespace that is a logical separation of a pool. It is empty by default.
- locator: The key that is used as an alternative to the object name for placement.
- snap: The snapshot ID of the object. The only writable version of the object is called head. If an object is a clone, this field includes its sequential ID.
- version: The version ID of the object with inconsistent replicas. Each write operation to the object increments it.
- errors: A list of errors that indicate inconsistencies between shards, without determining which shard or shards are incorrect. See the shards array to further investigate the errors.
  - data_digest_mismatch: The digest of the replica read from one OSD is different from the other OSDs.
  - size_mismatch: The size of a clone or the head object does not match the expectation.
  - read_error: This error most likely indicates inconsistencies caused by disk errors.
- union_shard_errors: The union of all errors specific to shards. These errors are connected to a faulty shard. The errors that end with oi indicate that you have to compare the information from a faulty object to information with selected objects. See the shards array to further investigate the errors.

In the above example, the object replica stored on osd.2 has a different digest than the replicas stored on osd.0 and osd.1. Specifically, the digest calculated from the shard read from osd.2 is 0xffffffff instead of the expected 0xe978e67f. In addition, the size of the replica read from osd.2 is 0, while the size reported by osd.0 and osd.1 is 968.
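To isolate only the shards that report errors from this JSON output, you can filter it with an external tool such as jq (shown as a sketch; jq is not part of Ceph and must be installed separately):

# rados list-inconsistent-obj 0.6 --format=json-pretty | jq '.inconsistents[].shards[] | select(.errors | length > 0)'

In the example above, this prints only the osd.2 shard.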
Listing Inconsistent Snapshot Sets in a Placement Group
rados list-inconsistent-snapset <placement-group-id>
For example, list inconsistent sets of snapshots (snapsets) in a placement group with ID 0.23:
# rados list-inconsistent-snapset 0.23 --format=json-pretty
{
    "epoch": 64,
    "inconsistents": [
        {
            "name": "obj5",
            "nspace": "",
            "locator": "",
            "snap": "0x00000001",
            "headless": true
        },
        {
            "name": "obj5",
            "nspace": "",
            "locator": "",
            "snap": "0x00000002",
            "headless": true
        },
        {
            "name": "obj5",
            "nspace": "",
            "locator": "",
            "snap": "head",
            "ss_attr_missing": true,
            "extra_clones": true,
            "extra clones": [
                2,
                1
            ]
        }
    ]
}
The command returns the following errors:
- ss_attr_missing: One or more attributes are missing. Attributes are information about snapshots encoded into a snapshot set as a list of key-value pairs.
- ss_attr_corrupted: One or more attributes fail to decode.
- clone_missing: A clone is missing.
- snapset_mismatch: The snapshot set is inconsistent by itself.
- head_mismatch: The snapshot set indicates whether head exists, but the scrub results report otherwise.
- headless: The head of the snapshot set is missing.
- size_mismatch: The size of a clone or the head object does not match the expectation.
See Also
- Section 6.4, Repairing Inconsistent Placement Groups
6.4. Repairing Inconsistent Placement Groups
Due to an error during deep scrubbing, some placement groups can include inconsistencies. Ceph reports such placement groups as inconsistent:
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors
You can repair only certain inconsistencies. Do not repair the placement groups if the Ceph logs include the following errors:
<pg.id> shard <osd>: soid <object> digest <digest> != known digest <digest>
<pg.id> shard <osd>: soid <object> omap_digest <digest> != known omap_digest <digest>
Open a support ticket instead. See Chapter 7, Contacting Red Hat Support Service for details.
Repair the inconsistent placement groups:
ceph pg repair <id>
Replace <id> with the ID of the inconsistent placement group.
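For example, to repair the placement group reported in the health output shown earlier:

# ceph pg repair 0.6

If you do not know the ID, the ceph health detail command lists the inconsistent placement groups together with their acting OSD sets.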
See Also
- Section 6.3, Listing Inconsistencies
6.5. Increasing the PG Count
An insufficient Placement Group (PG) count impacts the performance of the Ceph cluster and data distribution. It is one of the main causes of the nearfull osds error messages.
The recommended ratio is between 100 and 300 PGs per OSD. This ratio can decrease when you add more OSDs to the cluster.
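As a rough illustration of this ratio, consider a hypothetical cluster with 40 OSDs and pools that use three replicas. A commonly used rule of thumb, which the PG calculator referenced below also applies, is:

total PGs ≈ (OSDs × 100) / pool size = (40 × 100) / 3 ≈ 1333

rounded to a nearby power of two, giving roughly 1024 or 2048 PGs in total depending on expected growth. The numbers are assumptions for the sake of the example; use the calculator for real sizing.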
The pg_num and pgp_num parameters determine the PG count. These parameters are configured per pool, and therefore you must adjust each pool with a low PG count separately.
Increasing the PG count is the most intensive process that you can perform on a Ceph cluster. This process might have a serious performance impact if not done in a slow and methodical way. Once you increase pgp_num, you will not be able to stop or reverse the process; you must complete it.
Consider increasing the PG count outside of business-critical processing times, and alert all clients about the potential performance impact.
Do not change the PG count if the cluster is in the HEALTH_ERR state.
Procedure: Increasing the PG Count
Reduce the impact of data redistribution and recovery on individual OSDs and OSD hosts:
Lower the value of the osd_max_backfills, osd_recovery_max_active, and osd_recovery_op_priority parameters:
# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'
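To confirm that the new values took effect, you can query a running OSD daemon through its admin socket on the OSD node (osd.0 is a placeholder for any local OSD ID):

# ceph daemon osd.0 config get osd_max_backfills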
Disable the shallow and deep scrubbing:
# ceph osd set noscrub
# ceph osd set nodeep-scrub
- Use the Ceph Placement Groups (PGs) per Pool Calculator to calculate the optimal value of the pg_num and pgp_num parameters.
Increase the pg_num value in small increments until you reach the desired value:
- Determine the starting increment value. Use a very low value that is a power of two, and increase it when you determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
Increment the pg_num value:
ceph osd pool set <pool> pg_num <value>
Specify the pool name and the new value, for example:
# ceph osd pool set data pg_num 4
Monitor the status of the cluster:
# ceph -s
The PG states will change from creating to active+clean. Wait until all PGs are in the active+clean state.
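A rough polling sketch for this wait, which checks the status output every 30 seconds while any PGs are still reported as creating or peering (a convenience sketch, not part of the documented procedure):

# while ceph -s | grep -Eq 'creating|peering'; do sleep 30; done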
Increase the pgp_num value in small increments until you reach the desired value:
- Determine the starting increment value. Use a very low value that is a power of two, and increase it when you determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
Increment the pgp_num value:
ceph osd pool set <pool> pgp_num <value>
Specify the pool name and the new value, for example:
# ceph osd pool set data pgp_num 4
Monitor the status of the cluster:
# ceph -s
The PG states will change through peering, wait_backfill, backfilling, recovering, and others. Wait until all PGs are in the active+clean state.
- Repeat the previous steps for all pools with insufficient PG count.
Set the osd_max_backfills, osd_recovery_max_active, and osd_recovery_op_priority parameters to their default values:
# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3 --osd_recovery_op_priority 3'
Enable the shallow and deep scrubbing:
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub
See Also
- Section 5.1.2, “Nearfull OSDs”
- The Monitoring Placement Group States section in the Administration Guide for Red Hat Ceph Storage 2