Chapter 6. Troubleshooting Placement Groups
This section contains information about fixing the most common errors related to the Ceph Placement Groups (PGs).
6.1. Before You Start
- Verify your network connection. See Chapter 3, Troubleshooting Networking Issues for details.
- Ensure that Monitors are able to form a quorum. See Chapter 4, Troubleshooting Monitors for details about troubleshooting the most common errors related to Monitors.
- Ensure that all healthy OSDs are up and in, and the backfilling and recovery processes are finished. See Chapter 5, Troubleshooting OSDs for details about troubleshooting the most common errors related to OSDs.
6.2. Listing Placement Groups in stale, inactive, or unclean State
After a failure, placement groups enter states like degraded or peering. These states indicate normal progression through the failure recovery process.
However, if a placement group stays in one of these states longer than expected, it can be an indication of a larger problem. The Monitors report when placement groups get stuck in a state that is not optimal.
The following table lists these states together with a short explanation.
State | What it means | Most common causes
---|---|---
inactive | The PG has not been able to service read/write requests. | Peering problems
unclean | The PG contains objects that are not replicated the desired number of times. Something is preventing the PG from recovering. | unfound objects, OSDs are down, incorrect configuration
stale | The status of the PG has not been updated by a ceph-osd daemon. | OSDs are down
The mon_pg_stuck_threshold parameter in the Ceph configuration file determines the number of seconds after which placement groups are considered inactive, unclean, or stale.
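For example, a minimal sketch of overriding this threshold in the Ceph configuration file on the Monitor nodes (the value shown is illustrative, not a recommendation):

[global]
mon_pg_stuck_threshold = 300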
List the stuck PGs:
# ceph pg dump_stuck inactive
# ceph pg dump_stuck unclean
# ceph pg dump_stuck stale
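If you need machine-readable output for scripting, the ceph utility also accepts the --format option here; a brief sketch:

# ceph pg dump_stuck stale --format=json-pretty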
See Also
- The Monitoring Placement Group States section in the Administration Guide for Red Hat Ceph Storage 2
6.3. Listing Inconsistencies
Use the rados utility to list inconsistencies in various replicas of an object. Use the --format=json-pretty option to produce more detailed output.
You can list:
- inconsistent placement groups in a pool
- inconsistent objects in a placement group
- inconsistent snapshot sets in a placement group
Listing Inconsistent Placement Groups in a Pool
rados list-inconsistent-pg <pool> --format=json-pretty
For example, list all inconsistent placement groups in a pool named data:
# rados list-inconsistent-pg data --format=json-pretty
[0.6]
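To check every pool in one pass, a minimal shell sketch (assuming the rados utility can reach the cluster with the default admin keyring):

# for pool in $(rados lspools); do echo "${pool}:"; rados list-inconsistent-pg ${pool}; done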
Listing Inconsistent Objects in a Placement Group
rados list-inconsistent-obj <placement-group-id>
For example, list inconsistent objects in a placement group with ID 0.6:
# rados list-inconsistent-obj 0.6
{
    "epoch": 14,
    "inconsistents": [
        {
            "object": {
                "name": "image1",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 1
            },
            "errors": [
                "data_digest_mismatch",
                "size_mismatch"
            ],
            "union_shard_errors": [
                "data_digest_mismatch_oi",
                "size_mismatch_oi"
            ],
            "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
            "shards": [
                {
                    "osd": 0,
                    "errors": [],
                    "size": 968,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xe978e67f"
                },
                {
                    "osd": 1,
                    "errors": [],
                    "size": 968,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xe978e67f"
                },
                {
                    "osd": 2,
                    "errors": [
                        "data_digest_mismatch_oi",
                        "size_mismatch_oi"
                    ],
                    "size": 0,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xffffffff"
                }
            ]
        }
    ]
}
The following fields are important to determine what causes the inconsistency:
- name: The name of the object with inconsistent replicas.
- nspace: The namespace that is a logical separation of a pool. It is empty by default.
- locator: The key that is used as an alternative to the object name for placement.
- snap: The snapshot ID of the object. The only writable version of the object is called head. If an object is a clone, this field includes its sequential ID.
- version: The version ID of the object with inconsistent replicas. Each write operation to the object increments it.
- errors: A list of errors that indicate inconsistencies between shards, without determining which shard or shards are incorrect. See the shards array to further investigate the errors.
  - data_digest_mismatch: The digest of the replica read from one OSD is different from the other OSDs.
  - size_mismatch: The size of a clone or the head object does not match the expectation.
  - read_error: This error most likely indicates inconsistencies caused by disk errors.
- union_shard_errors: The union of all errors specific to shards. These errors are connected to a faulty shard. The errors that end with oi indicate that you have to compare the information from a faulty object to information with selected objects. See the shards array to further investigate the errors.

In the above example, the object replica stored on osd.2 has a different digest than the replicas stored on osd.0 and osd.1. Specifically, the digest calculated from the shard read from osd.2 is 0xffffffff instead of the expected 0xe978e67f. In addition, the size of the replica read from osd.2 is 0, while the size reported by osd.0 and osd.1 is 968.
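To isolate only the shards that report errors from this JSON output, you can filter it with an external tool such as jq (shown as a sketch; jq is not part of Ceph and must be installed separately):

# rados list-inconsistent-obj 0.6 --format=json-pretty | jq '.inconsistents[].shards[] | select(.errors | length > 0)'

In the example above, this prints only the osd.2 shard.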
Listing Inconsistent Snapshot Sets in a Placement Group
rados list-inconsistent-snapset <placement-group-id>
For example, list inconsistent sets of snapshots (snapsets) in a placement group with ID 0.23:
# rados list-inconsistent-snapset 0.23 --format=json-pretty
{
    "epoch": 64,
    "inconsistents": [
        {
            "name": "obj5",
            "nspace": "",
            "locator": "",
            "snap": "0x00000001",
            "headless": true
        },
        {
            "name": "obj5",
            "nspace": "",
            "locator": "",
            "snap": "0x00000002",
            "headless": true
        },
        {
            "name": "obj5",
            "nspace": "",
            "locator": "",
            "snap": "head",
            "ss_attr_missing": true,
            "extra_clones": true,
            "extra clones": [
                2,
                1
            ]
        }
    ]
}
The command returns the following errors:
- ss_attr_missing: One or more attributes are missing. Attributes are information about snapshots encoded into a snapshot set as a list of key-value pairs.
- ss_attr_corrupted: One or more attributes fail to decode.
- clone_missing: A clone is missing.
- snapset_mismatch: The snapshot set is inconsistent by itself.
- head_mismatch: The snapshot set indicates whether head exists, but the scrub results report otherwise.
- headless: The head of the snapshot set is missing.
- size_mismatch: The size of a clone or the head object does not match the expectation.
See Also
- Section 6.4, Repairing Inconsistent Placement Groups
6.4. Repairing Inconsistent Placement Groups
Due to an error during deep scrubbing, some placement groups can include inconsistencies. Ceph reports such placement groups as inconsistent:
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors
You can repair only certain inconsistencies. Do not repair the placement groups if the Ceph logs include the following errors:
<pg.id> shard <osd>: soid <object> digest <digest> != known digest <digest>
<pg.id> shard <osd>: soid <object> omap_digest <digest> != known omap_digest <digest>
Open a support ticket instead. See Chapter 7, Contacting Red Hat Support Service for details.
Repair the inconsistent placement groups:
ceph pg repair <id>
Replace <id> with the ID of the inconsistent placement group.
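For example, to repair the placement group reported in the health output shown earlier:

# ceph pg repair 0.6

If you do not know the ID, the ceph health detail command lists the inconsistent placement groups together with their acting OSD sets.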
See Also
- Section 6.3, Listing Inconsistencies
6.5. Increasing the PG Count
An insufficient Placement Group (PG) count impacts the performance of the Ceph cluster and data distribution. It is one of the main causes of the nearfull osds error messages.
The recommended ratio is between 100 and 300 PGs per OSD. This ratio can decrease when you add more OSDs to the cluster.
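As a rough illustration of this ratio, consider a hypothetical cluster with 40 OSDs and pools that use three replicas. A commonly used rule of thumb, which the PG calculator referenced below also applies, is:

total PGs ≈ (OSDs × 100) / pool size = (40 × 100) / 3 ≈ 1333

rounded to a nearby power of two, giving roughly 1024 or 2048 PGs in total depending on expected growth. The numbers are assumptions for the sake of the example; use the calculator for real sizing.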
The pg_num and pgp_num parameters determine the PG count. These parameters are configured per pool, and therefore you must adjust each pool with a low PG count separately.
Increasing the PG count is the most intensive process that you can perform on a Ceph cluster. This process might have a serious performance impact if not done in a slow and methodical way. Once you increase pgp_num, you will not be able to stop or reverse the process; you must complete it.
Consider increasing the PG count outside of business-critical processing times, and alert all clients about the potential performance impact.
Do not change the PG count if the cluster is in the HEALTH_ERR state.
Procedure: Increasing the PG Count
Reduce the impact of data redistribution and recovery on individual OSDs and OSD hosts:
Lower the value of the osd_max_backfills, osd_recovery_max_active, and osd_recovery_op_priority parameters:
# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'
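To confirm that the new values took effect, you can query a running OSD daemon through its admin socket on the OSD node (osd.0 is a placeholder for any local OSD ID):

# ceph daemon osd.0 config get osd_max_backfills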
Disable the shallow and deep scrubbing:
# ceph osd set noscrub
# ceph osd set nodeep-scrub
- Use the Ceph Placement Groups (PGs) per Pool Calculator to calculate the optimal value of the pg_num and pgp_num parameters.
Increase the pg_num value in small increments until you reach the desired value:
- Determine the starting increment value. Use a very low value that is a power of two, and increase it when you determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
Increment the pg_num value:
ceph osd pool set <pool> pg_num <value>
Specify the pool name and the new value, for example:
# ceph osd pool set data pg_num 4
Monitor the status of the cluster:
# ceph -s
The PG states will change from creating to active+clean. Wait until all PGs are in the active+clean state.
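A rough polling sketch for this wait, which checks the status output every 30 seconds while any PGs are still reported as creating or peering (a convenience sketch, not part of the documented procedure):

# while ceph -s | grep -Eq 'creating|peering'; do sleep 30; done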
Increase the pgp_num value in small increments until you reach the desired value:
- Determine the starting increment value. Use a very low value that is a power of two, and increase it when you determine the impact on the cluster. The optimal value depends on the pool size, OSD count, and client I/O load.
Increment the pgp_num value:
ceph osd pool set <pool> pgp_num <value>
Specify the pool name and the new value, for example:
# ceph osd pool set data pgp_num 4
Monitor the status of the cluster:
# ceph -s
The PG states will change through peering, wait_backfill, backfilling, recovering, and others. Wait until all PGs are in the active+clean state.
- Repeat the previous steps for all pools with insufficient PG count.
Set the osd_max_backfills, osd_recovery_max_active, and osd_recovery_op_priority parameters to their default values:
# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 3 --osd_recovery_op_priority 3'
Enable the shallow and deep scrubbing:
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub
See Also
- Section 5.1.2, “Nearfull OSDs”
- The Monitoring Placement Group States section in the Administration Guide for Red Hat Ceph Storage 2