Chapter 4. Auto-scaling placement groups
The number of placement groups (PGs) in a pool plays a significant role in how a cluster peers, distributes data, and rebalances.
Auto-scaling the number of PGs can make managing the cluster easier. The pg-autoscaling command provides recommendations for scaling PGs, or automatically scales PGs based on how the cluster is being used.
- To learn more about how auto-scaling works, see Section 4.1, “Placement group auto-scaling”.
- To enable or disable auto-scaling, see Section 4.3, “Setting placement group auto-scaling modes”.
- To view placement group scaling recommendations, see Section 4.4, “Viewing placement group scaling recommendations”.
- To set placement group auto-scaling, see Section 4.5, “Configuring placement group auto-scaling”.
- To update the autoscaler globally, see Section 4.5.1, “Updating noautoscale flag”.
- To set the target pool size, see Section 4.6, “Specifying target pool size”.
4.1. Placement group auto-scaling
How the auto-scaler works
The auto-scaler analyzes pools and adjusts on a per-subtree basis. Because each pool can map to a different CRUSH rule, and each rule can distribute data across different devices, Ceph considers the utilization of each subtree of the hierarchy independently. For example, a pool that maps to OSDs of class ssd and a pool that maps to OSDs of class hdd each have an optimal PG count that depends on the number of OSDs of that device type.
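For example, the following sketch creates one CRUSH rule per device class and assigns a pool to each; the rule and pool names (fast_rule, slow_rule, fastpool, slowpool) are illustrative and not part of any procedure in this chapter:
[ceph: root@host01 /]# ceph osd crush rule create-replicated fast_rule default host ssd
[ceph: root@host01 /]# ceph osd crush rule create-replicated slow_rule default host hdd
[ceph: root@host01 /]# ceph osd pool create fastpool
[ceph: root@host01 /]# ceph osd pool set fastpool crush_rule fast_rule
[ceph: root@host01 /]# ceph osd pool create slowpool
[ceph: root@host01 /]# ceph osd pool set slowpool crush_rule slow_rule
With this layout, the autoscaler sizes fastpool against the raw capacity of the ssd OSDs only, and slowpool against the hdd OSDs only.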
4.2. Placement group splitting and merging
Splitting
Red Hat Ceph Storage can split existing placement groups (PGs) into smaller PGs, which increases the total number of PGs for a given pool. Splitting existing PGs allows a small Red Hat Ceph Storage cluster to scale over time as storage requirements increase. The PG auto-scaling feature can increase the pg_num value, which causes the existing PGs to split as the storage cluster expands. If the PG auto-scaling feature is disabled, you can manually increase the pg_num value, which triggers the PG split process. For example, increasing the pg_num value from 4 to 16 splits each of the four existing PGs into four pieces. Increasing the pg_num value also increases the pgp_num value, but the pgp_num value increases at a gradual rate. This gradual increase minimizes the impact on the storage cluster’s performance and on client workloads, because migrating object data adds a significant load to the system. By default, Ceph queues and moves no more than 5% of the object data that is in a "misplaced" state. This default percentage can be adjusted with the target_max_misplaced_ratio option.
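For example, to allow up to 7% of objects to be in a misplaced state at once, you could raise this option; the option is set in the mgr section and the 0.07 value is only an illustration:
[ceph: root@host01 /]# ceph config set mgr target_max_misplaced_ratio 0.07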
Merging
Red Hat Ceph Storage can also merge two existing PGs into a larger PG, which decreases the total number of PGs. Merging two PGs can be useful when the relative number of objects in a pool decreases over time, or when the initial number of PGs chosen was too large. While merging PGs can be useful, it is also a complex and delicate process. During a merge, I/O to the affected PG is paused, and only one PG is merged at a time to minimize the impact on the storage cluster’s performance. Ceph merges the object data slowly until the new pg_num value is reached.
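If the PG auto-scaling feature is disabled, you can trigger merging manually by lowering the pg_num value. For example, assuming a pool named testpool that currently uses 16 PGs (both the name and the values are illustrative), the following command merges its PGs down to 8:
[ceph: root@host01 /]# ceph osd pool set testpool pg_num 8
Ceph then merges pairs of PGs, one at a time, until the pool reaches the new pg_num value.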
4.3. Setting placement group auto-scaling modes
Each pool in the Red Hat Ceph Storage cluster has a pg_autoscale_mode property that you can set to off, on, or warn.
- off: Disables auto-scaling for the pool. It is up to the administrator to choose an appropriate PG number for each pool. Refer to the Placement group count section for more information.
- on: Enables automated adjustments of the PG count for the given pool.
- warn: Raises health alerts when the PG count needs adjustment.
In Red Hat Ceph Storage 5 and later releases, pg_autoscale_mode is on by default. Upgraded storage clusters retain their existing pg_autoscale_mode setting. The pg_autoscale_mode is on for newly created pools. The PG count is adjusted automatically, and ceph status might display a recovering state while the PG count is being adjusted.
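To verify which mode a pool is currently using, for example after an upgrade, query the pool property. The pool name testpool is only an example:
[ceph: root@host01 /]# ceph osd pool get testpool pg_autoscale_mode
pg_autoscale_mode: on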
The autoscaler uses the bulk flag to determine which pools should start with a full complement of PGs; such pools are scaled down only when the usage ratio across the pool is not even. If a pool does not have the bulk flag, the pool starts with a minimal number of PGs and is given more PGs only when there is more usage in the pool.
The autoscaler identifies any overlapping roots and prevents the pools with such roots from scaling because overlapping roots can cause problems with the scaling process.
Procedure
Enable auto-scaling on an existing pool:
Syntax
ceph osd pool set POOL_NAME pg_autoscale_mode on
Example
[ceph: root@host01 /]# ceph osd pool set testpool pg_autoscale_mode on
Enable auto-scaling on a newly created pool:
Syntax
ceph config set global osd_pool_default_pg_autoscale_mode MODE
Example
[ceph: root@host01 /]# ceph config set global osd_pool_default_pg_autoscale_mode on
Create a pool with the bulk flag:
Syntax
ceph osd pool create POOL_NAME --bulk
Example
[ceph: root@host01 /]# ceph osd pool create testpool --bulk
Set or unset the bulk flag for an existing pool:
Important: The values must be written as true, false, 1, or 0, where 1 is equivalent to true and 0 is equivalent to false. If written with different capitalization, or with other content, an error is emitted. The following is an example of the command written with the wrong syntax:
[ceph: root@host01 /]# ceph osd pool set ec_pool_overwrite bulk True
Error EINVAL: expecting value 'true', 'false', '0', or '1'
Syntax
ceph osd pool set POOL_NAME bulk true/false/1/0
Example
[ceph: root@host01 /]# ceph osd pool set testpool bulk true
Get the bulk flag of an existing pool:
Syntax
ceph osd pool get POOL_NAME bulk
Example
[ceph: root@host01 /]# ceph osd pool get testpool bulk
bulk: true
4.4. Viewing placement group scaling recommendations
You can view the pool, its relative utilization, and any suggested changes to the placement group (PG) count within the storage cluster.
Prerequisites
- A running Red Hat Ceph Storage cluster
- A cephadm shell or shell on a node with the cluster’s admin key.
Procedure
You can view each pool, its relative utilization, and any suggested changes to the PG count using:
[ceph: root@host01 /]# ceph osd pool autoscale-status
The output shows one row per pool. See the following column definitions to understand the scaling status.
- SIZE: The amount of data stored in the pool.
- TARGET SIZE: If present, the amount of data the administrator has specified they expect to eventually be stored in this pool. The system uses the larger of the two values for its calculation.
- RATE: The multiplier for the pool that determines how much raw storage capacity the pool uses. For example, a 3-replica pool has a rate of 3.0, while a k=4, m=2 erasure-coded pool has a rate of 1.5.
- RAW CAPACITY: The total amount of raw storage capacity on the OSDs that are responsible for storing the pool’s data.
- RATIO: The ratio of the total capacity that the pool is consuming, that is, ratio = size * rate / raw capacity. See the worked example after these definitions.
- TARGET RATIO: If present, the ratio of storage the administrator has specified that they expect the pool to consume relative to other pools with target ratios set. If both target size bytes and ratio are specified, the ratio takes precedence. The default value of TARGET RATIO is 0 unless it was specified while creating the pool. The higher the --target_ratio you give a pool, the more PGs you expect the pool to have.
- EFFECTIVE RATIO: The target ratio after adjusting in two ways: subtracting any capacity expected to be used by pools with target size set, and normalizing the target ratios among pools with target ratio set so they collectively target the rest of the space. For example, 4 pools with target ratio 1.0 would have an effective ratio of 0.25. The system uses the larger of the actual ratio and the effective ratio for its calculation.
- BIAS: Used as a multiplier to manually adjust a pool’s PG count based on prior information about how many PGs a specific pool is expected to have. By default, the value is 1.0 unless it was specified when creating a pool. The higher the --bias you give a pool, the more PGs you expect the pool to have.
- PG_NUM: The current number of PGs for the pool, or the current number of PGs that the pool is working towards, if a pg_num change is in progress.
- NEW PG_NUM: If present, the suggested number of PGs (pg_num). It is always a power of 2, and is only present if the suggested value varies from the current value by more than a factor of 3.
- AUTOSCALE: The pool pg_autoscale_mode, either on, off, or warn.
- BULK: Used to determine which pool should start out with a full complement of PGs. BULK pools are scaled down only when the usage ratio across the pool is not even. If the pool does not have this flag, the pool starts out with a minimal number of PGs and is given more PGs only when there is more usage in the pool. The BULK values are true, false, 1, or 0, where 1 is equivalent to true and 0 is equivalent to false. The default value is false. Set the BULK value either during or after pool creation.
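As a worked example of the RATIO calculation (the numbers are illustrative only), a replicated pool with a rate of 3.0 that stores 100 GiB of data on OSDs providing 10 TiB (10240 GiB) of raw capacity has a ratio of 100 * 3.0 / 10240 ≈ 0.03, that is, the pool consumes about 3% of the available raw capacity.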
For more information about using the bulk flag, see Creating a pool and Setting placement group auto-scaling modes.
4.5. Configuring placement group auto-scaling
Allowing the cluster to automatically scale placement groups (PGs) based on cluster usage is the simplest approach to scaling PGs.
Ceph takes the total available storage and the target number of PGs for the whole system, compares how much data is stored in each pool, and apportions the PGs accordingly.
Placement group auto-scaling solves for the number of PG replicas per OSD.
This is not the same as dividing the pool’s pg_num value by the number of OSDs.
A replicated size=3 pool has 3*pg_num PG replicas.
An erasure-coded pool has (k + m)*pg_num replicas.
Each OSD’s current number of PG replicas, the PG ratio, is shown in the PGS column of the ceph osd df command output.
In the following example, osd.217 holds 127 PG replicas:
[ceph: root@host01 /]# ceph osd df | head
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
217 hdd 18.53969 1.00000 19 TiB 15 TiB 14 TiB 21 KiB 53 GiB 3.9 TiB 79.20 1.01 127 up
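As an illustrative check of this math (the numbers are examples, not recommendations), a replicated pool with size=3 and pg_num=128 contributes 3 * 128 = 384 PG replicas to the cluster; spread evenly across 10 OSDs, that is roughly 38 PG replicas per OSD from that single pool, which is the kind of value that appears in the PGS column above.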
The autoscaler only makes changes to a pool whose current number of PGs (pg_num) is more than three times off from the calculated or suggested PG number. This threshold can be modified by adjusting the value at pool granularity:
ceph osd pool set POOL threshold THRESHOLD
The following example adjusts the threshold for a pool named pool01 to 2.5:
[ceph: root@host01 /]# ceph osd pool set pool01 threshold 2.5
The target number of PG replicas per OSD is determined by the mon_target_pg_per_osd central configuration option. The default value is 100, but most clusters benefit from setting it to 250:
ceph config set global mon_target_pg_per_osd VALUE
For example:
[ceph: root@host01 /]# ceph config set global mon_target_pg_per_osd 250
When raising the target number of PG replicas per OSD, it is important to also raise the central configuration option mon_max_pg_per_osd. This value is a failsafe that caps the number of PGs allowed per OSD. The default value is 250, but it is recommended to raise the value to 600.
ceph config set global mon_max_pg_per_osd 600
4.5.1. Updating noautoscale flag
If you want to enable or disable the autoscaler for all pools at the same time, you can use the noautoscale global flag. This global flag is useful during an upgrade of the storage cluster when some OSDs are bounced, or when the cluster is under maintenance. You can set the flag before any activity and unset it once the activity is complete.
By default, the noautoscale flag is off. When this flag is set, all pools have pg_autoscale_mode set to off and the autoscaler is disabled for all pools.
Prerequisites
- A running Red Hat Ceph Storage cluster
- Root-level access to all the nodes.
Procedure
Get the value of the noautoscale flag:
Example
[ceph: root@host01 /]# ceph osd pool get noautoscale
Set the noautoscale flag before any activity:
Example
[ceph: root@host01 /]# ceph osd pool set noautoscale
Unset the noautoscale flag on completion of the activity:
Example
[ceph: root@host01 /]# ceph osd pool unset noautoscale
4.6. Specifying target pool size
A newly created pool consumes a small fraction of the total cluster capacity and appears to the system as though it will need only a small number of PGs. However, in most cases, cluster administrators know which pools are expected to consume most of the system capacity over time. If you provide this information, known as the target size, to Red Hat Ceph Storage, such pools can use a more appropriate number of PGs (pg_num) from the beginning. This approach prevents subsequent changes in pg_num and the overhead associated with moving data around when making those adjustments.
You can specify the target size of a pool in either of the following ways:
4.6.1. Specifying target size using the absolute size of the pool
Procedure
Set the target size using the absolute size of the pool in bytes:
Syntax
ceph osd pool set POOL_NAME target_size_bytes VALUE
For example, to instruct the system that mypool is expected to consume 100T of space:
[ceph: root@host01 /]# ceph osd pool set mypool target_size_bytes 100T
You can also set the target size of a pool at creation time by adding the optional --target-size-bytes <bytes> argument to the ceph osd pool create command.
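For example, to create the pool with the target size already set (the pool name mypool and the 100T value are illustrative):
[ceph: root@host01 /]# ceph osd pool create mypool --target-size-bytes 100T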
4.6.2. Specifying target size using the total cluster capacity
Procedure
Set the target size using a ratio of the total cluster capacity:
Syntax
ceph osd pool set POOL_NAME target_size_ratio RATIO
Example
[ceph: root@host01 /]# ceph osd pool set mypool target_size_ratio 1.0
This tells the system that the pool mypool is expected to consume 1.0 relative to the other pools with target_size_ratio set. If mypool is the only pool in the cluster, this means an expected use of 100% of the total capacity. If there is a second pool with target_size_ratio set to 1.0, both pools expect to use 50% of the cluster capacity.
You can also set the target size of a pool at creation time by adding the optional --target-size-ratio <ratio> argument to the ceph osd pool create command.
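For example, to create the pool with the target ratio already set (the pool name and ratio are illustrative):
[ceph: root@host01 /]# ceph osd pool create mypool --target-size-ratio 1.0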
If you specify impossible target size values, for example, a capacity larger than the total cluster capacity, or ratios that sum to more than 1.0, the cluster raises a POOL_TARGET_SIZE_RATIO_OVERCOMMITTED or POOL_TARGET_SIZE_BYTES_OVERCOMMITTED health warning.
If you specify both target_size_ratio and target_size_bytes for a pool, the cluster considers only the ratio, and raises a POOL_HAS_TARGET_SIZE_BYTES_AND_RATIO health warning.
4.7. Placement group command line interface
The ceph CLI allows you to set and get the number of placement groups for a pool, view the PG map and retrieve PG statistics.
4.7.1. Setting number of placement groups in a pool
To set the number of placement groups in a pool, you must specify the number of placement groups at the time you create the pool. See Creating a Pool for details. After you set the placement groups for a pool, you can increase the number of placement groups, which splits the existing PGs; decreasing the number of placement groups triggers the merge process described in Section 4.2, “Placement group splitting and merging”. To increase the number of placement groups, execute the following:
Syntax
ceph osd pool set POOL_NAME pg_num PG_NUM
Once you increase the number of placement groups, you must also increase the number of placement groups for placement (pgp_num) before your cluster will rebalance. The pgp_num should be equal to the pg_num. To increase the number of placement groups for placement, execute the following:
Syntax
ceph osd pool set POOL_NAME pgp_num PGP_NUM
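For example, assuming a pool named testpool (the name and the PG counts are illustrative), the following commands increase pg_num and then pgp_num to 128:
[ceph: root@host01 /]# ceph osd pool set testpool pg_num 128
[ceph: root@host01 /]# ceph osd pool set testpool pgp_num 128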
4.7.2. Getting number of placement groups in a pool
To get the number of placement groups in a pool, execute the following:
Syntax
ceph osd pool get POOL_NAME pg_num
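For example, for a pool named testpool (the name and the returned value are illustrative), the output looks similar to the following:
[ceph: root@host01 /]# ceph osd pool get testpool pg_num
pg_num: 128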
4.7.3. Getting statistics for placement groups
To get the statistics for the placement groups in your storage cluster, execute the following:
Syntax
ceph pg dump [--format FORMAT]
Valid formats are plain (default) and json.
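For example, to dump the statistics in JSON format for further processing:
[ceph: root@host01 /]# ceph pg dump --format json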
4.7.4. Getting statistics for stuck placement groups
To get the statistics for all placement groups stuck in a specified state, execute the following:
Syntax
ceph pg dump_stuck {inactive|unclean|stale|undersized|degraded [inactive|unclean|stale|undersized|degraded...]} INTERVAL
- inactive: Placement groups cannot process reads or writes because they are waiting for an OSD with the most up-to-date data to come up and in.
- unclean: Placement groups contain objects that are not replicated the desired number of times. They should be recovering.
- stale: Placement groups are in an unknown state, because the OSDs that host them have not reported to the monitor cluster in a while (configured by mon_osd_report_timeout).
Valid formats are plain (default) and json. The INTERVAL value is a threshold that defines the minimum number of seconds a placement group must be stuck before it is included in the returned statistics (the default is 300 seconds).
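For example, to list placement groups that have been stuck in the inactive state for at least 300 seconds (both values are illustrative):
[ceph: root@host01 /]# ceph pg dump_stuck inactive 300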
4.7.5. Getting placement group maps
To get the placement group map for a particular placement group, execute the following:
Syntax
ceph pg map PG_ID
Example
[ceph: root@host01 /]# ceph pg map 1.6c
Ceph returns the placement group map, the placement group, and the OSD status:
osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]
4.7.6. Scrubbing placement groups
To scrub a placement group, execute the following:
Syntax
ceph pg scrub PG_ID
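For example, to scrub the placement group used in the ceph pg map example above (the PG ID 1.6c is illustrative):
[ceph: root@host01 /]# ceph pg scrub 1.6c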
Ceph checks the primary and any replica nodes, generates a catalog of all objects in the placement group and compares them to ensure that no objects are missing or mismatched, and their contents are consistent. Assuming the replicas all match, a final semantic sweep ensures that all of the snapshot-related object metadata is consistent. Errors are reported via logs.
4.7.7. Marking unfound objects
If the cluster has lost one or more objects, and you have decided to abandon the search for the lost data, you must mark the unfound objects as lost.
If all possible locations have been queried and objects are still lost, you might have to give up on the lost objects. This situation can result from unusual combinations of failures that allow the cluster to learn about writes that were performed before the writes themselves can be recovered.
Currently the only supported option is "revert", which will either roll back to a previous version of the object or (if it was a new object) forget about it entirely. To mark the "unfound" objects as "lost", execute the following:
Syntax
ceph pg PG_ID mark_unfound_lost revert|delete
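For example, to revert the unfound objects of the illustrative placement group 1.6c to their previous versions:
[ceph: root@host01 /]# ceph pg 1.6c mark_unfound_lost revert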
Use this feature with caution, because it might confuse applications that expect the object(s) to exist.