Chapter 11. The mClock OSD scheduler
As a storage administrator, you can implement Red Hat Ceph Storage’s quality of service (QoS) by using the mClock queueing scheduler. This is based on an adaptation of the mClock algorithm called dmClock.
The mClock OSD scheduler provides the desired QoS using configuration profiles to allocate proper reservation, weight, and limit tags to the service types.
The mClock OSD scheduler performs the QoS calculations for the different device types, that is SSD or HDD, by using the OSD’s IOPS capability (determined automatically) and maximum sequential bandwidth capability (see osd_mclock_max_sequential_bandwidth_hdd and osd_mclock_max_sequential_bandwidth_ssd in The mClock configuration options section).
11.1. Comparison of mClock OSD scheduler with WPQ OSD scheduler
The mClock OSD scheduler is the default scheduler. It replaces the Weighted Priority Queue (WPQ) OSD scheduler that was the default in older Red Hat Ceph Storage systems.
The mClock scheduler is supported for BlueStore OSDs.
The mClock OSD scheduler currently features an immediate queue, into which operations that require immediate response are queued. The immediate queue is not handled by mClock and functions as a simple first in, first out queue and is given the first priority.
Operations, such as OSD replication operations, OSD operation replies, peering, recoveries marked with the highest priority, and so forth, are queued into the immediate queue. All other operations are enqueued into the mClock queue that works according to the mClock algorithm.
The mClock queue, mclock_scheduler, prioritizes operations based on which bucket they belong to, that is pg recovery, pg scrub, snap trim, client op, and pg deletion.
With background operations in progress, the average client throughput, that is the input and output operations per second (IOPS), is significantly higher and latencies are lower with the mClock profiles when compared to the WPQ scheduler. That is because of mClock’s effective allocation of the QoS parameters.
Additional Resources
- See the mClock profiles section for more information.
11.2. The allocation of input and output resources
This section describes how the QoS controls work internally with reservation, limit, and weight allocation. The user is not expected to set these controls as the mClock profiles automatically set them. Tuning these controls can only be performed using the available mClock profiles.
The dmClock algorithm allocates the input and output (I/O) resources of the Ceph cluster in proportion to weights. It implements the constraints of minimum reservation and maximum limitation to ensure the services can compete for the resources fairly.
Currently, the mclock_scheduler operation queue divides Ceph services involving I/O resources into the following buckets:
- client op: the input and output operations per second (IOPS) issued by a client.
- pg deletion: the IOPS issued by the primary Ceph OSD.
- snap trim: the snapshot trimming-related requests.
- pg recovery: the recovery-related requests.
- pg scrub: the scrub-related requests.
The resources are partitioned using the following three sets of tags, meaning that the share of each type of service is controlled by these three tags:
- Reservation
- Limit
- Weight
Reservation
The minimum IOPS allocated for the service. The more reservation a service has, the more resources it is guaranteed to possess, as long as it needs them.
For example, a service with the reservation set to 0.1 (or 10%) always has 10% of the OSD’s IOPS capacity allocated for itself. Therefore, even if the clients start to issue large amounts of I/O requests, they do not exhaust all the I/O resources and the service’s operations are not starved even in a cluster with high load.
Limit
The maximum IOPS allocated for the service. The service does not get more than the set number of requests per second serviced, even if it requires so and no other services are competing with it. If a service crosses the enforced limit, the operation remains in the operation queue until the limit is restored.
If the value is set to 0 (disabled), the service is not restricted by the limit setting and it can use all the resources if there is no other competing operation. This is represented as "MAX" in the mClock profiles.
The reservation and limit parameter allocations are per-shard, based on the type of backing device, that is HDD or SSD, under the Ceph OSD. See OSD Object storage daemon configuration options for more details about the osd_op_num_shards_hdd and osd_op_num_shards_ssd parameters.
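As a rough illustration of what a per-shard allocation means, the following Python sketch converts a profile fraction into an IOPS figure for one shard. It assumes, purely for illustration, that the OSD's IOPS capacity is split evenly across its shards; the function name `per_shard_iops`, the capacity of 21500 IOPS, and the shard count of 8 are hypothetical example values, not values read from a cluster.

```python
def per_shard_iops(osd_capacity_iops, fraction, num_shards):
    """Return the IOPS share of one shard for a reservation or limit.

    Assumption (not taken from the Ceph source): the OSD capacity is
    divided evenly among the shards.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return osd_capacity_iops * fraction / num_shards


# Hypothetical SSD-backed OSD: 21500 IOPS, 50% reservation, 8 shards.
print(per_shard_iops(21500.0, 0.5, 8))
```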
Weight
The proportional share of capacity if there is extra capacity or if the system is oversubscribed. The service can use a larger portion of the I/O resource, if its weight is higher than its competitor’s.
The reservation and limit values for a service are specified in terms of a proportion of the total IOPS capacity of the OSD. The proportion is represented as a percentage in the mClock profiles. The weight does not have a unit. The weights are relative to one another, so if one class of requests has a weight of 9 and another a weight of 1, then the requests are performed at a 9 to 1 ratio. However, that only happens once the reservations are met and those values include the operations performed under the reservation phase.
If the weight is set to W, then for a given class of requests the next one that enters has a weight tag of 1/W plus the previous weight tag, or the current time, whichever is larger. That means, if W is too large and thus 1/W is too small, the calculated tag might never be assigned as it gets a value of the current time.
Therefore, values for weight should always be under the number of requests expected to be serviced each second.
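The effect of an oversized weight can be seen in a minimal Python sketch of the tag rule. It assumes the simplified rule stated in this section (the previous tag plus 1/W, or the current time, whichever is larger); the function names and the arrival pattern of one request every 0.1 seconds are hypothetical and are not part of Ceph.

```python
def next_weight_tag(prev_tag, weight, now):
    """Weight tag for the next request of a class (simplified rule).

    The tag is the previous tag plus 1/W, or the current time,
    whichever is larger.
    """
    if weight <= 0:
        raise ValueError("weight must be positive")
    return max(prev_tag + 1.0 / weight, now)


def tags_for_arrivals(weight, arrivals):
    """Assign tags to requests arriving at the given times (seconds)."""
    tags = []
    prev = 0.0
    for now in arrivals:
        prev = next_weight_tag(prev, weight, now)
        tags.append(prev)
    return tags


# Ten requests per second: one request every 0.1 s.
arrivals = [i * 0.1 for i in range(1, 6)]

# W = 5 is below the request rate, so 1/W = 0.2 s spaces the tags out.
print(tags_for_arrivals(5, arrivals))

# W = 1000 is far above the request rate: 1/W is so small that every
# tag collapses to the arrival time and the weight has no effect.
print(tags_for_arrivals(1000, arrivals))
```

With the larger weight, the tags are identical to the arrival times, which is the situation the recommendation above is meant to avoid.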
11.3. Factors that impact mClock operation queues
There are three factors that can reduce the impact of the mClock operation queues within Red Hat Ceph Storage:
- The number of shards for client operations.
- The number of operations in the operation sequencer.
- The usage of distributed system for Ceph OSDs
The number of shards for client operations
Requests to a Ceph OSD are sharded by their placement group identifier. Each shard has its own mClock queue and these queues neither interact, nor share information amongst them.
The number of shards can be controlled with these configuration options:
- osd_op_num_shards
- osd_op_num_shards_hdd
- osd_op_num_shards_ssd
A lower number of shards increases the impact of the mClock queues, but might have other damaging effects.
Use the default number of shards as defined by the configuration options osd_op_num_shards, osd_op_num_shards_hdd, and osd_op_num_shards_ssd.
The number of operations in the operation sequencer
Requests are transferred from the operation queue to the operation sequencer, in which they are processed. The mClock scheduler is located in the operation queue. It determines which operation to transfer to the operation sequencer.
The number of operations allowed in the operation sequencer is a complex issue. The aim is to keep enough operations in the operation sequencer so it always works on some, while it waits for disk and network access to complete other operations.
However, mClock no longer has control over an operation that is transferred to the operation sequencer. Therefore, to maximize the impact of mClock, the goal is also to keep as few operations in the operation sequencer as possible.
The configuration options that influence the number of operations in the operation sequencer are:
- bluestore_throttle_bytes
- bluestore_throttle_deferred_bytes
- bluestore_throttle_cost_per_io
- bluestore_throttle_cost_per_io_hdd
- bluestore_throttle_cost_per_io_ssd
Use the default values as defined by the bluestore_throttle_bytes and bluestore_throttle_deferred_bytes options. However, these options can be determined during the benchmarking phase.
The usage of distributed system for Ceph OSDs
The third factor that affects the impact of the mClock algorithm is the usage of a distributed system, where requests are made to multiple Ceph OSDs, and each Ceph OSD can have multiple shards. However, Red Hat Ceph Storage currently uses the mClock algorithm, which is not a distributed version of mClock.
dmClock is the distributed version of mClock.
Additional Resources
- See Object Storage Daemon (OSD) configuration options for more details about the osd_op_num_shards_hdd and osd_op_num_shards_ssd parameters.
- See BlueStore configuration options for more details about BlueStore throttle parameters.
- See Manually benchmarking OSDs for more information.
11.4. The mClock configuration
The mClock profiles hide the low-level details from users, making it easier to configure and use mClock.
The following input parameters are required for an mClock profile to configure the quality of service (QoS) related parameters:
- The total capacity of input and output operations per second (IOPS) for each Ceph OSD. This is determined automatically.
- The maximum sequential bandwidth capacity (MiB/s) of each OSD. See the osd_mclock_max_sequential_bandwidth_[hdd/ssd] option.
- An mClock profile type to be enabled. The default is balanced.
Using the settings in the specified profile, a Ceph OSD determines and applies the lower-level mClock and Ceph parameters. The parameters applied by the mClock profile make it possible to tune the QoS between the client I/O and background operations in the OSD.
Additional Resources
- See The Ceph OSD capacity determination for more information about the automated OSD capacity determination.
11.5. mClock clients
The mClock scheduler handles requests from different types of Ceph services. Each service is considered by mClock as a type of client. Depending on the type of requests handled, mClock clients are classified into the buckets:
- Client - Handles input and output (I/O) requests issued by external clients of Ceph.
- Background recovery - Handles internal recovery requests.
- Background best-effort - Handles internal backfill, scrub, snap trim, and placement group (PG) deletion requests.
The mClock scheduler derives the cost of an operation used in the QoS calculations from osd_mclock_max_capacity_iops_hdd | osd_mclock_max_capacity_iops_ssd, osd_mclock_max_sequential_bandwidth_hdd | osd_mclock_max_sequential_bandwidth_ssd and osd_op_num_shards_hdd | osd_op_num_shards_ssd parameters.
11.6. mClock profiles
An mClock profile is a configuration setting. When applied to a running Red Hat Ceph Storage cluster, it enables the throttling of the operations (IOPS) belonging to different client classes, such as background recovery, scrub, snap trim, client op, and pg deletion.
The mClock profile uses the capacity limits and the mClock profile type selected by the user to determine the low-level mClock resource control configuration parameters and applies them transparently. Other Red Hat Ceph Storage configuration parameters are also applied. The low-level mClock resource control parameters are the reservation, limit, and weight that provide control of the resource shares. The mClock profiles allocate these parameters differently for each client type.
11.6.1. mClock profile types
mClock profiles can be classified into built-in and custom profiles.
If any mClock profile is active, the following Red Hat Ceph Storage configuration sleep options are disabled, which means they are set to 0:
- osd_recovery_sleep
- osd_recovery_sleep_hdd
- osd_recovery_sleep_ssd
- osd_recovery_sleep_hybrid
- osd_scrub_sleep
- osd_delete_sleep
- osd_delete_sleep_hdd
- osd_delete_sleep_ssd
- osd_delete_sleep_hybrid
- osd_snap_trim_sleep
- osd_snap_trim_sleep_hdd
- osd_snap_trim_sleep_ssd
- osd_snap_trim_sleep_hybrid
This ensures that the mClock scheduler can determine when to pick the next operation from its operation queue and transfer it to the operation sequencer. This results in the desired QoS being provided across all its clients.
Custom profile
This profile gives users complete control over all the mClock configuration parameters. It should be used with caution and is meant for advanced users, who understand mClock and Red Hat Ceph Storage related configuration options.
Built-in profiles
When a built-in profile is enabled, the mClock scheduler calculates the low-level mClock parameters, that is, reservation, weight, and limit, based on the profile enabled for each client type.
The mClock parameters are calculated based on the maximum Ceph OSD capacity provided beforehand. Therefore, the following mClock configuration options cannot be modified when using any of the built-in profiles:
- osd_mclock_scheduler_client_res
- osd_mclock_scheduler_client_wgt
- osd_mclock_scheduler_client_lim
- osd_mclock_scheduler_background_recovery_res
- osd_mclock_scheduler_background_recovery_wgt
- osd_mclock_scheduler_background_recovery_lim
- osd_mclock_scheduler_background_best_effort_res
- osd_mclock_scheduler_background_best_effort_wgt
- osd_mclock_scheduler_background_best_effort_lim
Note: These defaults cannot be changed using any of the config subsystem commands, such as config set, config daemon, or config tell. Although these commands report success, the mClock QoS parameters are reverted to their respective built-in profile defaults.
The following recovery and backfill related Ceph options are overridden to mClock defaults:
- osd_max_backfills
- osd_recovery_max_active
- osd_recovery_max_active_hdd
- osd_recovery_max_active_ssd
Do not change these options as the built-in profiles are optimized based on them. Changing these defaults can result in unexpected performance outcomes.
The following list shows the mClock defaults, which are the same as the current defaults, to maximize the performance of the foreground client operations:
- osd_max_backfills: original default 1, mClock default 1
- osd_recovery_max_active: original default 0, mClock default 0
- osd_recovery_max_active_hdd: original default 3, mClock default 3
- osd_recovery_max_active_ssd: original default 10, mClock default 10
The above mClock defaults can be modified, only if necessary, by enabling osd_mclock_override_recovery_settings, which is set to false by default. See Modifying backfills and recovery options to modify these parameters.
Built-in profile types
Users can choose from the following built-in profile types:
- balanced (default)
- high_client_ops
- high_recovery_ops
The values mentioned in the list below represent the proportion of the total IOPS capacity of the Ceph OSD allocated for the service type.
balanced
The default mClock profile is set to balanced because it represents a compromise between prioritizing client IO or recovery IO. It allocates equal reservation or priority to client operations and background recovery operations. Background best-effort operations are given lower reservation and therefore take longer to complete when there are competing operations. This profile meets the normal or steady state requirements of the cluster, which is the case when external client performance requirements are not critical and there are other background operations that still need attention within the OSD.
There might be instances that necessitate giving higher priority to either client operations or recovery operations. To meet such requirements you can choose either the high_client_ops profile to prioritize client IO or the high_recovery_ops profile to prioritize recovery IO. These profiles are discussed further below.
- Service type: client
  - Reservation: 50%
  - Limit: MAX
  - Weight: 1
- Service type: background recovery
  - Reservation: 50%
  - Limit: MAX
  - Weight: 1
- Service type: background best-effort
  - Reservation: MIN
  - Limit: 90%
  - Weight: 1
high_client_ops
This profile optimizes client performance over background activities by allocating more reservation and limit to client operations as compared to background operations in the Ceph OSD. This profile, for example, can be enabled to provide the needed performance for I/O intensive applications for a sustained period of time at the cost of slower recoveries. The list below shows the resource control parameters set by the profile:
- Service type: client
  - Reservation: 60%
  - Limit: MAX
  - Weight: 2
- Service type: background recovery
  - Reservation: 40%
  - Limit: MAX
  - Weight: 1
- Service type: background best-effort
  - Reservation: MIN
  - Limit: 70%
  - Weight: 1
high_recovery_ops
This profile optimizes background recovery performance as compared to external clients and other background operations within the Ceph OSD.
For example, it could be temporarily enabled by an administrator to accelerate background recoveries during non-peak hours. The list below shows the resource control parameters set by the profile:
- Service type: client
  - Reservation: 30%
  - Limit: MAX
  - Weight: 1
- Service type: background recovery
  - Reservation: 70%
  - Limit: MAX
  - Weight: 2
- Service type: background best-effort
  - Reservation: MIN
  - Limit: MAX
  - Weight: 1
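The built-in profile allocations listed above can be captured in a small lookup table to see what they mean in absolute terms. The following Python sketch only restates the percentages from this section; the dictionary layout, the function names, and the example capacity of 21500 IOPS are hypothetical and do not reflect how Ceph stores or applies the profiles internally.

```python
# Reservation and limit are percentages of the OSD's IOPS capacity.
# None stands for MIN (no reservation) or MAX (no limit).
PROFILES = {
    "balanced": {
        "client": {"res": 50, "lim": None, "wgt": 1},
        "background_recovery": {"res": 50, "lim": None, "wgt": 1},
        "background_best_effort": {"res": None, "lim": 90, "wgt": 1},
    },
    "high_client_ops": {
        "client": {"res": 60, "lim": None, "wgt": 2},
        "background_recovery": {"res": 40, "lim": None, "wgt": 1},
        "background_best_effort": {"res": None, "lim": 70, "wgt": 1},
    },
    "high_recovery_ops": {
        "client": {"res": 30, "lim": None, "wgt": 1},
        "background_recovery": {"res": 70, "lim": None, "wgt": 2},
        "background_best_effort": {"res": None, "lim": None, "wgt": 1},
    },
}


def reserved_iops(profile, service, osd_capacity_iops):
    """Return the reserved IOPS for a service type under a profile."""
    res = PROFILES[profile][service]["res"]
    return 0.0 if res is None else osd_capacity_iops * res / 100


def total_reservation_pct(profile):
    """Sum the reservation percentages of all service types."""
    return sum(s["res"] or 0 for s in PROFILES[profile].values())


# Hypothetical OSD capacity of 21500 IOPS.
for name in PROFILES:
    print(name, reserved_iops(name, "client", 21500), total_reservation_pct(name))
```

In each built-in profile the reservations of the client and background recovery service types add up to the full capacity of the OSD.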
Additional Resources
- See The mClock configuration options for more information about mClock configuration options.
11.6.2. Changing an mClock profile
The default mClock profile is set to balanced. The other types of the built-in profile are high_client_ops and high_recovery_ops.
The custom profile is not recommended unless you are an advanced user.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Set the osd_mclock_profile option:
Syntax
ceph config set osd.OSD_ID osd_mclock_profile VALUE
Example
[ceph: root@host01 /]# ceph config set osd.0 osd_mclock_profile high_recovery_ops
This example changes the profile to allow faster recoveries on osd.0.
Note: For optimal performance, the profile must be set on all Ceph OSDs by using the following command:
Syntax
ceph config set osd osd_mclock_profile VALUE
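When the profile is changed from a script rather than interactively, it can help to validate the profile name before invoking the CLI. The following Python sketch only builds the command line shown above and does not run it; the helper name `profile_command` is hypothetical, and the accepted values are the profile names listed in this chapter.

```python
import shlex

# Profile names taken from this chapter.
KNOWN_PROFILES = ("balanced", "high_client_ops", "high_recovery_ops", "custom")


def profile_command(profile, osd_id=None):
    """Build the argument list for 'ceph config set ... osd_mclock_profile'.

    With osd_id=None the option targets all OSDs ("osd"); otherwise it
    targets a single OSD ("osd.ID").
    """
    if profile not in KNOWN_PROFILES:
        raise ValueError(f"unknown mClock profile: {profile!r}")
    target = "osd" if osd_id is None else f"osd.{osd_id}"
    return ["ceph", "config", "set", target, "osd_mclock_profile", profile]


print(shlex.join(profile_command("high_recovery_ops", osd_id=0)))
print(shlex.join(profile_command("high_recovery_ops")))
```

The returned list can be passed to a process runner inside the Cephadm shell; that step is left out here because it depends on the environment.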
11.6.3. Switching between built-in and custom profiles
The following steps describe switching from built-in profile to custom profile and vice-versa.
You might want to switch to the custom profile if you want complete control over all the mClock configuration options. However, it is recommended not to use the custom profile unless you are an advanced user.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Switch from built-in profile to custom profile
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Switch to the custom profile:
Syntax
ceph config set osd.OSD_ID osd_mclock_profile custom
Example
[ceph: root@host01 /]# ceph config set osd.0 osd_mclock_profile custom
Note: For optimal performance, the profile must be set on all Ceph OSDs by using the following command:
Example
[ceph: root@host01 /]# ceph config set osd osd_mclock_profile custom
Optional: After switching to the custom profile, modify the desired mClock configuration options:
Syntax
ceph config set osd.OSD_ID MCLOCK_CONFIGURATION_OPTION VALUE
Example
[ceph: root@host01 /]# ceph config set osd.0 osd_mclock_scheduler_client_res 0.5
This example changes the client reservation IOPS ratio for a specific OSD, osd.0, to 0.5 (50%).
Important: Change the reservations of other services, such as background recovery and background best-effort, accordingly to ensure that the sum of the reservations does not exceed the maximum proportion (1.0) of the IOPS capacity of the OSD.
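The constraint on the sum of the reservations can be checked before any custom values are applied. The following Python sketch is a minimal illustration, not a Ceph tool: the function name `reservations_fit` and the sample values are hypothetical, while the option names are the reservation options listed in mClock profile types.

```python
def reservations_fit(reservations, tolerance=1e-9):
    """Return True if the reservation ratios sum to at most 1.0.

    'reservations' maps an option name to a ratio between 0.0 and 1.0.
    A small tolerance absorbs floating-point rounding.
    """
    for name, value in reservations.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return sum(reservations.values()) <= 1.0 + tolerance


# Hypothetical custom values for one OSD.
planned = {
    "osd_mclock_scheduler_client_res": 0.5,
    "osd_mclock_scheduler_background_recovery_res": 0.4,
    "osd_mclock_scheduler_background_best_effort_res": 0.1,
}
print(reservations_fit(planned))
```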
Switch from custom profile to built-in profile
Log into the cephadm shell:
Example
[root@host01 ~]# cephadm shell
Set the desired built-in profile:
Syntax
ceph config set osd osd_mclock_profile MCLOCK_PROFILE
Example
[ceph: root@host01 /]# ceph config set osd osd_mclock_profile high_client_ops
This example sets the built-in profile to high_client_ops on all Ceph OSDs.
Determine the existing custom mClock configuration settings in the database:
Example
[ceph: root@host01 /]# ceph config dump
Remove the custom mClock configuration settings determined earlier:
Syntax
ceph config rm osd MCLOCK_CONFIGURATION_OPTION
Example
[ceph: root@host01 /]# ceph config rm osd osd_mclock_scheduler_client_res
This example removes the configuration option osd_mclock_scheduler_client_res that was set on all Ceph OSDs.
After all existing custom mClock configuration settings are removed from the central configuration database, the configuration settings related to high_client_ops are applied.
Verify the settings on Ceph OSDs:
Syntax
ceph config show osd.OSD_ID
Example
[ceph: root@host01 /]# ceph config show osd.0
Additional Resources
- See mClock profile types for the list of the mClock configuration options that cannot be modified with built-in profiles.
11.6.4. Switching temporarily between mClock profiles
This section contains steps to temporarily switch between mClock profiles.
This section is for advanced users or for experimental testing. Do not use the below commands on a running storage cluster as it could have unexpected outcomes.
The configuration changes on a Ceph OSD using the below commands are temporary and are lost when the Ceph OSD is restarted.
The configuration options that are overridden using the commands described in this section cannot be modified further using the ceph config set osd.OSD_ID command. The changes do not take effect until a given Ceph OSD is restarted. This is intentional, as per the configuration subsystem design. However, any further modifications can still be made temporarily using these commands.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Run the following command to override the mClock settings:
Syntax
ceph tell osd.OSD_ID injectargs '--MCLOCK_CONFIGURATION_OPTION=VALUE'
Example
[ceph: root@host01 /]# ceph tell osd.0 injectargs '--osd_mclock_profile=high_recovery_ops'
This example overrides the osd_mclock_profile option on osd.0.
Optional: You can use the following alternative to the previous ceph tell osd.OSD_ID injectargs command:
Syntax
ceph daemon osd.OSD_ID config set MCLOCK_CONFIGURATION_OPTION VALUE
Example
[ceph: root@host01 /]# ceph daemon osd.0 config set osd_mclock_profile high_recovery_ops
The individual QoS related configuration options for the custom profile can also be modified temporarily using the above commands.
11.6.5. Degraded and Misplaced Object Recovery Rate With mClock Profiles
Degraded object recovery is categorized into the background recovery bucket. Across all mClock profiles, degraded object recovery is given higher priority when compared to misplaced object recovery because degraded objects present a data safety issue not present with objects that are merely misplaced.
Backfill or the misplaced object recovery operation is categorized into the background best-effort bucket. According to the balanced and high_client_ops mClock profiles, the background best-effort client is not constrained by reservation (set to zero) but is limited to use a fraction of the participating OSD’s capacity if there are no other competing services.
Therefore, with the balanced or high_client_ops profile and with other background competing services active, backfilling rates are expected to be slower when compared to the previous WeightedPriorityQueue (WPQ) scheduler.
If higher backfill rates are desired, please follow the steps mentioned in the section below.
Improving backfilling rates
For a faster backfilling rate when using either the balanced or high_client_ops profile, follow these steps:
1. Switch to the high_recovery_ops mClock profile for the duration of the backfills. See Changing an mClock profile to achieve this. Once the backfilling phase is complete, switch the mClock profile to the previously active profile. If there is no significant improvement in the backfilling rate with the high_recovery_ops profile, continue to the next step.
2. Switch the mClock profile back to the previously active profile.
3. Modify osd_max_backfills to a higher value, for example, 3. See Modifying backfills and recovery options to achieve this.
4. Once the backfilling is complete, osd_max_backfills can be reset to the default value of 1 by following the same procedure mentioned in step 3.
Modifying osd_max_backfills might cause other operations, for example client operations, to experience higher latency during the backfilling phase. Therefore, increase osd_max_backfills in small increments to minimize the performance impact on other operations in the cluster.
11.6.6. Modifying backfills and recovery options
Modify the backfills and recovery options with the ceph config set command.
The backfill or recovery options that can be modified are listed in mClock profile types.
This section is for advanced users or for experimental testing. Do not use the below commands on a running storage cluster as it could have unexpected outcomes.
Modify the values only for experimental testing, or if the cluster is unable to handle the values or it shows poor performance with the default settings.
The modification of the mClock default backfill or recovery options is restricted by the osd_mclock_override_recovery_settings option, which is set to false by default.
If you attempt to modify any default backfill or recovery options without setting osd_mclock_override_recovery_settings to true, the options are reset to the mClock defaults and a warning message is logged in the cluster log.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Set the osd_mclock_override_recovery_settings configuration option to true on all Ceph OSDs:
Example
[ceph: root@host01 /]# ceph config set osd osd_mclock_override_recovery_settings true
Set the desired backfills or recovery option:
Syntax
ceph config set osd OPTION VALUE
Example
[ceph: root@host01 /]# ceph config set osd osd_max_backfills 5
Wait a few seconds and verify the configuration for the specific OSD:
Syntax
ceph config show osd.OSD_ID | grep OPTION
Example
[ceph: root@host01 /]# ceph config show osd.0 | grep osd_max_backfills
Reset the osd_mclock_override_recovery_settings configuration option to false on all OSDs:
Example
[ceph: root@host01 /]# ceph config set osd osd_mclock_override_recovery_settings false
11.7. The Ceph OSD capacity determination
The Ceph OSD capacity in terms of total IOPS is determined automatically during the Ceph OSD initialization. This is achieved by running the Ceph OSD bench tool and overriding the default value of the osd_mclock_max_capacity_iops_[hdd, ssd] option depending on the device type. No other action or input is expected from the user to set the Ceph OSD capacity.
Mitigation of unrealistic Ceph OSD capacity from the automated procedure
In certain conditions, the Ceph OSD bench tool might show unrealistic or inflated results depending on the drive configuration and other environment related conditions.
To mitigate the performance impact due to this unrealistic capacity, a couple of threshold configuration options depending on the OSD device type are defined and used:
- osd_mclock_iops_capacity_threshold_hdd = 500
- osd_mclock_iops_capacity_threshold_ssd = 80000
You can verify these parameters by running the following commands:
[ceph: root@host01 /]# ceph config show osd.0 osd_mclock_iops_capacity_threshold_hdd
500.000000
[ceph: root@host01 /]# ceph config show osd.0 osd_mclock_iops_capacity_threshold_ssd
80000.000000
If you want to manually benchmark OSD(s) or manually tune the BlueStore throttle parameters, see Manually benchmarking OSDs.
You can verify the capacity of an OSD after the cluster is up by running the following command:
Syntax
ceph config show osd.N osd_mclock_max_capacity_iops_[hdd, ssd]
Example
[ceph: root@host01 /]# ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
In the above example, you can view the maximum capacity for osd.0 on a Red Hat Ceph Storage node whose underlying device is an SSD.
The following automated step is performed:
Fallback to using default OSD capacity
If the Ceph OSD bench tool reports a measurement that exceeds the above threshold values, the fallback mechanism reverts to the default value of osd_mclock_max_capacity_iops_hdd or osd_mclock_max_capacity_iops_ssd. The threshold configuration options can be reconfigured based on the type of drive used.
A cluster warning is logged in case the measurement exceeds the threshold:
Example
3403 Sep 11 11:52:50 dell-r640-039.dsal.lab.eng.rdu2.redhat.com ceph-osd[70342]: log_channel(cluster) log [WRN] : OSD bench result of 49691.213005 IOPS exceeded the threshold limit of 500.000000 IOPS for osd.27. IOPS capacity is unchanged at 315.000000 IOPS. The recommendation is to establish the osd's IOPS capacity using other benchmark tools (e.g. Fio) and then override osd_mclock_max_capacity_iops_[hdd|ssd].
If the default capacity does not accurately represent the Ceph OSD capacity, it is highly recommended to run a custom benchmark using the preferred tool, for example Fio, on the drive and then override the osd_mclock_max_capacity_iops_[hdd, ssd] option as described in Specifying maximum OSD capacity.
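The fallback described above amounts to a single comparison. The following Python sketch restates it using the numbers from the example warning; the function name `effective_capacity_iops` is hypothetical, and the sketch describes the documented behavior rather than the Ceph implementation.

```python
def effective_capacity_iops(bench_iops, threshold_iops, default_iops):
    """Return the IOPS capacity that stays in effect after the OSD bench.

    If the measurement exceeds the threshold it is treated as
    unrealistic and the default capacity is kept instead.
    """
    if bench_iops > threshold_iops:
        return default_iops
    return bench_iops


# Values from the example warning: the bench result is far above the
# HDD threshold of 500 IOPS, so the capacity is unchanged at 315 IOPS.
print(effective_capacity_iops(49691.213005, 500.0, 315.0))
```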
Additional Resources
- See Manually benchmarking OSDs to manually benchmark Ceph OSDs or manually tune the BlueStore throttle parameters.
- See The mClock configuration options for more information about the osd_mclock_max_capacity_iops_[hdd, ssd] and osd_mclock_iops_capacity_threshold_[hdd, ssd] options.
11.7.1. Verifying the capacity of an OSD
You can verify the capacity of a Ceph OSD after setting up the storage cluster.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Verify the capacity of a Ceph OSD:
Syntax
ceph config show osd.OSD_ID osd_mclock_max_capacity_iops_[hdd, ssd]
Example
[ceph: root@host01 /]# ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
21500.000000
11.7.2. Manually benchmarking OSDs
To manually benchmark a Ceph OSD, any existing benchmarking tool, for example Fio, can be used. Regardless of the tool or command used, the steps below remain the same.
The number of shards and BlueStore throttle parameters have an impact on the mClock operation queues. Therefore, it is critical to set these values carefully in order to maximize the impact of the mclock scheduler. See Factors that impact mClock operation queues for more information about these values.
The steps in this section are only necessary if you want to override the Ceph OSD capacity determined automatically during the OSD initialization.
If you have already determined the benchmark data and wish to manually override the maximum OSD capacity for a Ceph OSD, skip to the Specifying maximum OSD capacity section.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Benchmark a Ceph OSD:
Syntax
ceph tell osd.OSD_ID bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]
where:
- TOTAL_BYTES: Total number of bytes to write.
- BYTES_PER_WRITE: Block size per write.
- OBJ_SIZE: Bytes per object.
- NUM_OBJS: Number of objects to write.
Example
[ceph: root@host01 /]# ceph tell osd.0 bench 12288000 4096 4194304 100
{
    "bytes_written": 12288000,
    "blocksize": 4096,
    "elapsed_sec": 1.3718913019999999,
    "bytes_per_sec": 8956977.8466311768,
    "iops": 2186.7621695876896
}
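The fields in the bench output are related by simple arithmetic, which is useful when comparing runs. The following Python sketch parses the example output shown above and recomputes the IOPS from the number of writes and the elapsed time; the function name `summarize_bench` is hypothetical.

```python
import json

# Example output copied from the bench command above.
BENCH_OUTPUT = """
{
    "bytes_written": 12288000,
    "blocksize": 4096,
    "elapsed_sec": 1.3718913019999999,
    "bytes_per_sec": 8956977.8466311768,
    "iops": 2186.7621695876896
}
"""


def summarize_bench(raw):
    """Parse the bench JSON and derive the IOPS from the other fields."""
    result = json.loads(raw)
    # Number of writes issued: total bytes divided by bytes per write.
    writes = result["bytes_written"] / result["blocksize"]
    return {
        "writes": writes,
        "reported_iops": result["iops"],
        "derived_iops": writes / result["elapsed_sec"],
    }


print(summarize_bench(BENCH_OUTPUT))
```

For this run, 12288000 bytes at 4096 bytes per write is 3000 writes, and dividing by the elapsed time gives a figure that agrees with the reported iops value.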
11.7.3. Determining the correct BlueStore throttle values
This optional section details the steps used to determine the correct BlueStore throttle values. The steps use the default shards.
Before running the test, clear the caches to get an accurate measurement. Clear the OSD caches between each benchmark run using the following command:
Syntax
ceph tell osd.OSD_ID cache drop
Example
[ceph: root@host01 /]# ceph tell osd.0 cache drop
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor node hosting the OSDs that you wish to benchmark.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Run a simple 4KiB random write workload on an OSD:
Syntax
ceph tell osd.OSD_ID bench 12288000 4096 4194304 100
Example
[ceph: root@host01 /]# ceph tell osd.0 bench 12288000 4096 4194304 100
{
    "bytes_written": 12288000,
    "blocksize": 4096,
    "elapsed_sec": 1.3718913019999999,
    "bytes_per_sec": 8956977.8466311768,
    "iops": 2186.7621695876896
}
The iops value is the overall throughput obtained from the output of the osd bench command. This value is the baseline throughput, when the default BlueStore throttle options are in effect.
Note the overall throughput, that is IOPS, obtained from the output of the previous command.
If the intent is to determine the BlueStore throttle values for your environment, set the bluestore_throttle_bytes and bluestore_throttle_deferred_bytes options to 32 KiB, that is, 32768 bytes:
Syntax
ceph config set osd.OSD_ID bluestore_throttle_bytes 32768
ceph config set osd.OSD_ID bluestore_throttle_deferred_bytes 32768
Example
[ceph: root@host01 /]# ceph config set osd.0 bluestore_throttle_bytes 32768
[ceph: root@host01 /]# ceph config set osd.0 bluestore_throttle_deferred_bytes 32768
Otherwise, you can skip to the next section, Specifying maximum OSD capacity.
Run the 4KiB random write test as before using an OSD bench command:
Example
[ceph: root@host01 /]# ceph tell osd.0 bench 12288000 4096 4194304 100
- Notice the overall throughput from the output and compare the value against the baseline throughput recorded earlier.
- If the throughput does not match with the baseline, increase the BlueStore throttle options by multiplying by 2.
- Repeat the steps by running the 4KiB random write test, comparing the value against the baseline throughput, and increasing the BlueStore throttle options by multiplying by 2, until the obtained throughput is very close to the baseline value.
For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB for both BlueStore throttle and deferred bytes was determined to maximize the impact of mClock. For HDDs, the corresponding value was 40 MiB, where the overall throughput was roughly equal to the baseline throughput.
In general for HDDs, the BlueStore throttle values are expected to be higher when compared to SSDs.
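The doubling procedure above can be sketched as a small search loop. The following Python example is illustrative only: `find_throttle_bytes`, the 5% tolerance used for "very close to the baseline", the upper bound, and the `fake_measure` throughput model are all assumptions. In practice the measurement step would set both throttle options, drop the OSD caches, and rerun the OSD bench command.

```python
def find_throttle_bytes(measure_iops, baseline_iops,
                        start_bytes=32768, tolerance=0.05,
                        max_bytes=1 << 30):
    """Double the throttle value until throughput is near the baseline.

    measure_iops(value) must return the IOPS observed with both
    BlueStore throttle options set to 'value' bytes.
    """
    value = start_bytes
    while value <= max_bytes:
        if measure_iops(value) >= baseline_iops * (1.0 - tolerance):
            return value
        value *= 2
    raise RuntimeError("baseline throughput was not reached")


BASELINE_IOPS = 2186.0


def fake_measure(throttle_bytes):
    """Hypothetical model: throughput grows with the throttle value
    until it saturates at the baseline. Not a real measurement."""
    return min(BASELINE_IOPS, throttle_bytes / 100.0)


print(find_throttle_bytes(fake_measure, BASELINE_IOPS))
```

Starting at 32 KiB and doubling keeps the number of benchmark runs small while still bracketing the point where the throttle stops limiting throughput.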
11.7.4. Specifying maximum OSD capacity
You can override the maximum Ceph OSD capacity automatically set during OSD initialization.
These steps are optional. Perform the following steps if the default capacity does not accurately represent the Ceph OSD capacity.
Ensure that you determine the benchmark data first, as described in Manually benchmarking OSDs.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Procedure
Log into the Cephadm shell:
Example
[root@host01 ~]# cephadm shell
Set the osd_mclock_max_capacity_iops_[hdd, ssd] option for an OSD:
Syntax
ceph config set osd.OSD_ID osd_mclock_max_capacity_iops_[hdd,ssd] VALUE
Example
[ceph: root@host01 /]# ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350
This example sets the maximum capacity for osd.0, whose underlying device type is HDD, to 350 IOPS.