
Chapter 4. Auto-scaling placement groups


The number of placement groups (PGs) in a pool plays a significant role in how a cluster peers, distributes data, and rebalances.

Auto-scaling the number of PGs can make managing the cluster easier. The pg-autoscaling command provides recommendations for scaling PGs, or automatically scales PGs based on how the cluster is being used.

4.1. Placement group auto-scaling

How the auto-scaler works

The auto-scaler analyzes pools and adjusts on a per-subtree basis. Because each pool can map to a different CRUSH rule, and each rule can distribute data across different devices, Ceph considers utilization of each subtree of the hierarchy independently. For example, a pool that maps to OSDs of class ssd, and a pool that maps to OSDs of class hdd, will each have optimal PG counts that depend on the number of those respective device types.

4.2. Placement group splitting and merging

Splitting

Red Hat Ceph Storage can split existing placement groups (PGs) into smaller PGs, which increases the total number of PGs for a given pool. Splitting existing PGs allows a small Red Hat Ceph Storage cluster to scale over time as storage requirements increase. The PG auto-scaling feature can increase the pg_num value, which causes the existing PGs to split as the storage cluster expands. If the PG auto-scaling feature is disabled, you can manually increase the pg_num value, which triggers the PG split process. For example, increasing the pg_num value from 4 to 16 splits each of the four existing PGs into four pieces. Increasing the pg_num value also increases the pgp_num value, but the pgp_num value increases at a gradual rate. This gradual increase minimizes the impact on the storage cluster’s performance and on client workloads, because migrating object data adds a significant load to the system. By default, Ceph queues and moves no more than 5% of the object data that is in a "misplaced" state. This default percentage can be adjusted with the target_max_misplaced_ratio option.
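For example, to allow up to 7% of object data to be misplaced at any one time, you can adjust the option as shown below; the 7% value is only an illustration of raising the 5% default:

[ceph: root@host01 /]# ceph config set mgr target_max_misplaced_ratio .07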

Merging

Red Hat Ceph Storage can also merge two existing PGs into a larger PG, which decreases the total number of PGs. Merging two PGs can be useful when the relative number of objects in a pool decreases over time, or when the initial number of PGs chosen was too large. While merging PGs can be useful, it is also a complex and delicate process. During a merge, I/O to the PG is paused, and only one PG is merged at a time to minimize the impact on the storage cluster’s performance. Ceph merges object data slowly until the new pg_num value is reached.
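For example, you can trigger merging by reducing the PG count of a pool with a lower pg_num value; the pool name and target value here are illustrative:

[ceph: root@host01 /]# ceph osd pool set testpool pg_num 8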

4.3. Setting placement group auto-scaling modes

Each pool in the Red Hat Ceph Storage cluster has a pg_autoscale_mode property for PGs that you can set to off, on, or warn.

  • off: Disables auto-scaling for the pool. It is up to the administrator to choose an appropriate PG number for each pool. Refer to the Placement group count section for more information.
  • on: Enables automated adjustments of the PG count for the given pool.
  • warn: Raises health alerts when the PG count needs adjustment.
Note

In Red Hat Ceph Storage 5 and later releases, pg_autoscale_mode is on by default, and it is on for newly created pools. Upgraded storage clusters retain their existing pg_autoscale_mode setting. When the mode is on, the PG count is adjusted automatically, and ceph status might display a recovering state while the PG count is being adjusted.
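You can verify the current mode of an individual pool with the ceph osd pool get command; the pool name and the reported value here are illustrative:

[ceph: root@host01 /]# ceph osd pool get testpool pg_autoscale_mode
pg_autoscale_mode: on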

The autoscaler uses the bulk flag to determine which pools should start with a full complement of PGs; a pool with the bulk flag is scaled down only when the usage ratio across the pool is not even. A pool without the bulk flag starts with a minimal number of PGs and is scaled up only when usage in the pool increases.

Note

The autoscaler identifies any overlapping roots and prevents the pools with such roots from scaling because overlapping roots can cause problems with the scaling process.

Procedure

  • Enable auto-scaling on an existing pool:

    Syntax

    ceph osd pool set POOL_NAME pg_autoscale_mode on

    Example

    [ceph: root@host01 /]# ceph osd pool set testpool pg_autoscale_mode on

  • Enable auto-scaling on a newly created pool:

    Syntax

    ceph config set global osd_pool_default_pg_autoscale_mode MODE

    Example

    [ceph: root@host01 /]# ceph config set global osd_pool_default_pg_autoscale_mode on

  • Create a pool with the bulk flag:

    Syntax

    ceph osd pool create POOL_NAME --bulk

    Example

    [ceph: root@host01 /]#  ceph osd pool create testpool --bulk

  • Set or unset the bulk flag for an existing pool:

    Important

    The values must be written as true, false, 1, or 0. 1 is equivalent to true and 0 is equivalent to false. If written with different capitalization, or with other content, an error is emitted.

    The following is an example of the command written with the wrong syntax:

    [ceph: root@host01 /]# ceph osd pool set ec_pool_overwrite bulk True
    Error EINVAL: expecting value 'true', 'false', '0', or '1'

    Syntax

    ceph osd pool set POOL_NAME bulk true/false/1/0

    Example

    [ceph: root@host01 /]#  ceph osd pool set testpool bulk true

  • Get the bulk flag of an existing pool:

    Syntax

    ceph osd pool get POOL_NAME bulk

    Example

    [ceph: root@host01 /]# ceph osd pool get testpool bulk
    bulk: true

4.4. Viewing placement group scaling recommendations

You can view the pool, its relative utilization, and any suggested changes to the placement group (PG) count within the storage cluster.

Prerequisites

  • A running Red Hat Ceph Storage cluster
  • A cephadm shell or shell on a node with the cluster’s admin key.

Procedure

  • You can view each pool, its relative utilization, and any suggested changes to the PG count using:

    [ceph: root@host01 /]# ceph osd pool autoscale-status

    Output will look similar to the following:

    POOL                     SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
    device_health_metrics      0                 3.0        374.9G  0.0000                                  1.0       1              on         False
    cephfs.cephfs.meta     24632                 3.0        374.9G  0.0000                                  4.0      32              on         False
    cephfs.cephfs.data         0                 3.0        374.9G  0.0000                                  1.0      32              on         False
    .rgw.root               1323                 3.0        374.9G  0.0000                                  1.0      32              on         False
    default.rgw.log         3702                 3.0        374.9G  0.0000                                  1.0      32              on         False
    default.rgw.control        0                 3.0        374.9G  0.0000                                  1.0      32              on         False
    default.rgw.meta         382                 3.0        374.9G  0.0000                                  4.0       8              on         False

See the following output definitions to understand the scaling status.

SIZE
The amount of data stored in the pool.
TARGET SIZE
If present, the amount of data the administrator has specified they expect to eventually be stored in this pool. The system uses the larger of the two values for its calculation.
RATE
The multiplier for the pool that determines how much raw storage capacity the pool uses. For example, a 3 replica pool has a ratio of 3.0, while a k=4,m=2 erasure coded pool has a ratio of 1.5.
RAW CAPACITY
The total amount of raw storage capacity on the OSDs that are responsible for storing the pool’s data.
RATIO
The ratio of the total capacity that the pool is consuming, that is, ratio = size * rate / raw capacity.
TARGET RATIO
If present, the ratio of storage that the administrator has specified they expect the pool to consume relative to other pools with target ratios set. If both target size bytes and ratio are specified, the ratio takes precedence. The default value of TARGET RATIO is 0 unless it was specified while creating the pool. The larger the --target_ratio you set for a pool, the more PGs you expect the pool to have.
EFFECTIVE RATIO

The target ratio after adjusting in two ways:

  1. Subtracting any capacity expected to be used by pools with target size set.
  2. Normalizing the target ratios among pools with target ratio set so they collectively target the rest of the space. For example, 4 pools with target ratio 1.0 would have an effective ratio of 0.25. The system uses the larger of the actual ratio and the effective ratio for its calculation.
BIAS
Used as a multiplier to manually adjust a pool’s PG count based on prior information about how many PGs a specific pool is expected to have. By default, the value is 1.0 unless it was specified when creating the pool. The larger the --bias you set for a pool, the more PGs you expect the pool to have.
PG_NUM
The current number of PGs for the pool, or the current number of PGs that the pool is working towards, if a pg_num change is in progress.
NEW PG_NUM
If present, the suggested number of PGs (pg_num). It is always a power of 2, and is only present if the suggested value varies from the current value by more than a factor of 3.
AUTOSCALE
The pool pg_autoscale_mode, and is either on, off, or warn.
BULK

Used to determine which pools start out with a full complement of PGs. A pool with the bulk flag is scaled down only when the usage ratio across the pool is not even. A pool without this flag starts out with a minimal number of PGs and is scaled up only when usage in the pool increases.

The BULK values are true, false, 1, or 0, where 1 is equivalent to true and 0 is equivalent to false. The default value is false.

Set the BULK value either during or after pool creation.

For more information about using the bulk flag, see Creating a pool and Configuring placement group auto-scaling modes.

4.5. Configuring placement group auto-scaling

Allowing the cluster to automatically scale placement groups (PGs) based on cluster usage is the simplest approach to scaling PGs.

Ceph takes the total available storage and the target number of PGs for the whole system, compares how much data is stored in each pool, and apportions the PGs accordingly.

Configuring placement group autoscaling solves for the number of PG replicas per OSD.

Note

This is not the same as dividing the pool’s pg_num value by the number of OSDs.

A replicated size=3 pool has 3*pg_num PG replicas.

An erasure-coded pool has (k + m)*pg_num replicas.
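For example, with illustrative numbers, a replicated size=3 pool with pg_num set to 256 contributes 3 * 256 = 768 PG replicas; spread evenly across 10 OSDs, that is roughly 77 PG replicas per OSD.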

Each OSD’s current number of PG replicas, the PG ratio, can be seen by using the ceph osd df command.

In the following example, osd.217 holds 127 PG replicas:

[ceph: root@host01 /]# ceph osd df | head
ID  CLASS WEIGHT    REWEIGHT SIZE   RAW    USE    DATA   OMAP   META    AVAIL %USE VAR PGS STATUS
217 hdd   18.53969  1.00000  19 TiB 15 TiB 14 TiB 21 KiB 53 GiB 3.9 TiB 79.20 1.01 127 up

The command only makes changes to a pool whose current number of PGs (pg_num) is more than three times off from the calculated or suggested PG number. This threshold can be modified by adjusting the value at pool granularity:

ceph osd pool set POOL threshold THRESHOLD

The following example adjusts the threshold for a pool named pool01 to 2.5:

[ceph: root@host01 /]# ceph osd pool set pool01 threshold 2.5

The target number of PG replicas per OSD is determined by the mon_target_pg_per_osd central configuration option. The default value is 100, but most clusters benefit from setting it to 250:

ceph config set global mon_target_pg_per_osd VALUE

For example:

[ceph: root@host01 /]# ceph config set global mon_target_pg_per_osd 250
Important

When raising the target number of PG replicas per OSD, it is important to also raise the central configuration option mon_max_pg_per_osd. This value is a failsafe that guards against accidents. The default value is 250, but it is recommended to raise the value to 600.

ceph config set global mon_max_pg_per_osd 600

4.5.1. Updating noautoscale flag

If you want to enable or disable the autoscaler for all pools at the same time, you can use the noautoscale global flag. This global flag is useful during an upgrade of the storage cluster when some OSDs are bounced, or when the cluster is under maintenance. You can set the flag before any activity and unset it after the activity is complete.

By default, the noautoscale flag is set to off. When this flag is set, all pools have pg_autoscale_mode set to off and the autoscaler is disabled for all pools.

Prerequisites

  • A running Red Hat Ceph Storage cluster
  • Root-level access to all the nodes.

Procedure

  1. Get the value of the noautoscale flag:

    Example

    [ceph: root@host01 /]# ceph osd pool get noautoscale

  2. Set the noautoscale flag before any activity:

    Example

    [ceph: root@host01 /]# ceph osd pool set noautoscale

  3. Unset the noautoscale flag on completion of the activity:

    Example

    [ceph: root@host01 /]# ceph osd pool unset noautoscale

4.6. Specifying target pool size

A newly created pool consumes a small fraction of the total cluster capacity and therefore appears to the system to need only a small number of PGs. However, in most cases, cluster administrators know which pools are expected to consume most of the system capacity over time. If you provide this information, known as the target size, to Red Hat Ceph Storage, such pools can use a more appropriate number of PGs (pg_num) from the beginning. This approach prevents subsequent changes in pg_num and the overhead associated with moving data around when making those adjustments.

You can specify the target size of a pool in either of these ways:

4.6.1. Specifying target size using the absolute size of the pool

Procedure

  1. Set the target size using the absolute size of the pool in bytes:

    Syntax

    ceph osd pool set POOL_NAME target_size_bytes VALUE

    For example, to instruct the system that mypool is expected to consume 100T of space:

    [ceph: root@host01 /]# ceph osd pool set mypool target_size_bytes 100T

You can also set the target size of a pool at creation time by adding the optional --target-size-bytes <bytes> argument to the ceph osd pool create command.
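For example, to create a pool with the target size set at creation time; the pool name and size shown here are illustrative:

[ceph: root@host01 /]# ceph osd pool create mypool --target-size-bytes 100T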

4.6.2. Specifying target size using the total cluster capacity

Procedure

  1. Set the target size using the ratio of the total cluster capacity:

    Syntax

    ceph osd pool set POOL_NAME target_size_ratio RATIO

    Example

    [ceph: root@host01 /]# ceph osd pool set mypool target_size_ratio 1.0

    This tells the system that the pool mypool is expected to consume storage in a ratio of 1.0 relative to the other pools that have target_size_ratio set. If mypool is the only pool in the cluster, this means an expected use of 100% of the total capacity. If there is a second pool with a target_size_ratio of 1.0, both pools would expect to use 50% of the cluster capacity.

You can also set the target size of a pool at creation time by adding the optional --target-size-ratio <ratio> argument to the ceph osd pool create command.
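For example, to create a pool with a target ratio set at creation time; the pool name and ratio shown here are illustrative:

[ceph: root@host01 /]# ceph osd pool create mypool --target-size-ratio 1.0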

Note

If you specify impossible target size values, for example, a capacity larger than the total cluster, or ratios that sum to more than 1.0, the cluster raises a POOL_TARGET_SIZE_RATIO_OVERCOMMITTED or POOL_TARGET_SIZE_BYTES_OVERCOMMITTED health warning.

If you specify both target_size_ratio and target_size_bytes for a pool, the cluster considers only the ratio, and raises a POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO health warning.

4.7. Placement group command line interface

The ceph CLI allows you to set and get the number of placement groups for a pool, view the PG map and retrieve PG statistics.

4.7.1. Setting number of placement groups in a pool

To set the number of placement groups in a pool, you must specify the number of placement groups at the time you create the pool. See Creating a Pool for details. After you set the number of placement groups for a pool, you can increase the number of placement groups to split existing PGs, or decrease it to merge PGs, as described in Placement group splitting and merging. To increase the number of placement groups, execute the following:

Syntax

ceph osd pool set POOL_NAME pg_num PG_NUM
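For example, to increase the PG count of a pool to 128; the pool name and value are illustrative:

[ceph: root@host01 /]# ceph osd pool set testpool pg_num 128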

Once you increase the number of placement groups, you must also increase the number of placement groups for placement (pgp_num) before your cluster will rebalance. The pgp_num should be equal to the pg_num. To increase the number of placement groups for placement, execute the following:

Syntax

ceph osd pool set POOL_NAME pgp_num PGP_NUM
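For example, to match the illustrative pg_num value set above:

[ceph: root@host01 /]# ceph osd pool set testpool pgp_num 128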

4.7.2. Getting number of placement groups in a pool

To get the number of placement groups in a pool, execute the following:

Syntax

ceph osd pool get POOL_NAME pg_num
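For example; the pool name and the reported value are illustrative:

[ceph: root@host01 /]# ceph osd pool get testpool pg_num
pg_num: 128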

4.7.3. Getting statistics for placement groups

To get the statistics for the placement groups in your storage cluster, execute the following:

Syntax

ceph pg dump [--format FORMAT]

Valid formats are plain (default) and json.
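For example, to retrieve the statistics in JSON format, which is convenient for scripting:

[ceph: root@host01 /]# ceph pg dump --format json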

4.7.4. Getting statistics for stuck placement groups

To get the statistics for all placement groups stuck in a specified state, execute the following:

Syntax

ceph pg dump_stuck {inactive|unclean|stale|undersized|degraded [inactive|unclean|stale|undersized|degraded...]} INTERVAL

Inactive Placement groups cannot process reads or writes because they are waiting for an OSD with the most up-to-date data to come up and in.

Unclean Placement groups contain objects that are not replicated the desired number of times. They should be recovering.

Stale Placement groups are in an unknown state - the OSDs that host them have not reported to the monitor cluster in a while (configured by mon_osd_report_timeout).

Valid formats are plain (default) and json. The INTERVAL value defines the minimum number of seconds a placement group must be stuck before it is included in the returned statistics (default 300 seconds).
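For example, to list placement groups that are stuck in the stale state using the default 300 second threshold; the state chosen here is illustrative:

[ceph: root@host01 /]# ceph pg dump_stuck stale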

4.7.5. Getting placement group maps

To get the placement group map for a particular placement group, execute the following:

Syntax

ceph pg map PG_ID

Example

[ceph: root@host01 /]# ceph pg map 1.6c

Ceph returns the placement group map, the placement group, and the OSD status:

osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]

4.7.6. Scrubbing placement groups

To scrub a placement group, execute the following:

Syntax

ceph pg scrub PG_ID
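For example, reusing the illustrative placement group ID from the previous section:

[ceph: root@host01 /]# ceph pg scrub 1.6c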

Ceph checks the primary and any replica nodes, generates a catalog of all objects in the placement group and compares them to ensure that no objects are missing or mismatched, and their contents are consistent. Assuming the replicas all match, a final semantic sweep ensures that all of the snapshot-related object metadata is consistent. Errors are reported via logs.

4.7.7. Marking unfound objects

If the cluster has lost one or more objects, and you have decided to abandon the search for the lost data, you must mark the unfound objects as lost.

If all possible locations have been queried and objects are still lost, you might have to give up on the lost objects. This situation can arise from unusual combinations of failures in which the cluster learns about writes that were performed before the written data itself was recovered.

The revert option rolls back to a previous version of the object or, if it was a new object, forgets about it entirely. The delete option forgets about the unfound objects entirely. To mark the "unfound" objects as "lost", execute the following:

Syntax

ceph pg PG_ID mark_unfound_lost revert|delete
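For example, to revert the unfound objects of an illustrative placement group:

[ceph: root@host01 /]# ceph pg 1.6c mark_unfound_lost revert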

Important

Use this feature with caution, because it might confuse applications that expect the object(s) to exist.
