Chapter 11. The mClock OSD scheduler
As a storage administrator, you can implement quality of service (QoS) in Red Hat Ceph Storage by using the mClock queueing scheduler. The scheduler is based on an adaptation of the mClock algorithm called dmClock.
The mClock OSD scheduler provides the desired QoS using configuration profiles to allocate proper reservation, weight, and limit tags to the service types.
The mClock OSD scheduler performs the QoS calculations for the different device types, that is, SSD or HDD, by using the OSD's IOPS capability, which is determined automatically, and its maximum sequential bandwidth capability. See osd_mclock_max_sequential_bandwidth_hdd and osd_mclock_max_sequential_bandwidth_ssd in the mClock configuration options section.
		
11.1. Comparison of mClock OSD scheduler with WPQ OSD scheduler
The mClock OSD scheduler is the default scheduler. It replaces the Weighted Priority Queue (WPQ) OSD scheduler that was the default in older Red Hat Ceph Storage systems.
The mClock scheduler is supported for BlueStore OSDs.
The mClock OSD scheduler currently features an immediate queue, into which operations that require immediate response are queued. The immediate queue is not handled by mClock and functions as a simple first in, first out queue and is given the first priority.
Operations, such as OSD replication operations, OSD operation replies, peering, recoveries marked with the highest priority, and so forth, are queued into the immediate queue. All other operations are enqueued into the mClock queue that works according to the mClock algorithm.
				The mClock queue, mclock_scheduler, prioritizes operations based on which bucket they belong to, that is pg recovery, pg scrub, snap trim, client op, and pg deletion.
			
With background operations in progress, the average client throughput, that is, the input and output operations per second (IOPS), is significantly higher and latencies are lower with the mClock profiles than with the WPQ scheduler. This is because mClock allocates the QoS parameters effectively.
11.2. The allocation of input and output resources
This section describes how the QoS controls work internally with reservation, limit, and weight allocation. The user is not expected to set these controls as the mClock profiles automatically set them. Tuning these controls can only be performed using the available mClock profiles.
The dmClock algorithm allocates the input and output (I/O) resources of the Ceph cluster in proportion to weights. It implements the constraints of minimum reservation and maximum limitation to ensure the services can compete for the resources fairly.
Currently, the mclock_scheduler operation queue divides Ceph services involving I/O resources into the following buckets:
- client op: the input and output operations per second (IOPS) issued by a client.
- pg deletion: the IOPS issued by the primary Ceph OSD.
- snap trim: the snapshot trimming-related requests.
- pg recovery: the recovery-related requests.
- pg scrub: the scrub-related requests.
The resources are partitioned using the following three sets of tags, meaning that the share of each type of service is controlled by these three tags:
- Reservation
- Limit
- Weight
Reservation
The minimum IOPS allocated for the service. The higher the reservation of a service, the more resources it is guaranteed to receive, as long as it needs them.
For example, a service with the reservation set to 0.1 (or 10%) always has 10% of the OSD's IOPS capacity allocated to it. Therefore, even if the clients start to issue large amounts of I/O requests, they do not exhaust all the I/O resources, and the service's operations are not starved even in a cluster with high load.
Limit
The maximum IOPS allocated for the service. The service does not get more than the set number of requests per second serviced, even if it needs more and no other services are competing with it. If a service crosses the enforced limit, the operation remains in the operation queue until the limit is restored.
					If the value is set to 0 (disabled), the service is not restricted by the limit setting and it can use all the resources if there is no other competing operation. This is represented as "MAX" in the mClock profiles.
				
					The reservation and limit parameter allocations are per-shard, based on the type of backing device, that is HDD or SSD, under the Ceph OSD. See OSD Object storage daemon configuration options for more details about osd_op_num_shards_hdd and osd_op_num_shards_ssd parameters.
				
Weight
The proportional share of capacity when there is extra capacity or when the system is oversubscribed. A service can use a larger portion of the I/O resource if its weight is higher than that of its competitors.
The reservation and limit values for a service are specified in terms of a proportion of the total IOPS capacity of the OSD. The proportion is represented as a percentage in the mClock profiles. The weight does not have a unit. The weights are relative to one another, so if one class of requests has a weight of 9 and another a weight of 1, then the requests are performed at a 9 to 1 ratio. However, that only happens once the reservations are met and those values include the operations performed under the reservation phase.
If the weight is set to W, then for a given class of requests the next request that enters has a weight tag of 1/W plus the previous weight tag, or the current time, whichever is larger. That means that if W is too large, and thus 1/W is too small, the calculated tag might never be assigned because the request gets a value of the current time instead.
				
Therefore, values for weight should always be lower than the number of requests expected to be serviced each second.
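The tagging rule described above can be illustrated with a minimal Python sketch. It is not the Ceph implementation: the function names and the evenly spaced arrival model are assumptions made for illustration only. The sketch counts how many requests receive a tag that is driven by the weight rather than clamped to the current time.

```python
def next_weight_tag(prev_tag: float, now: float, weight: float) -> float:
    """Rule from the text: previous tag plus 1/W, or the current time, whichever is larger."""
    return max(prev_tag + 1.0 / weight, now)


def weight_driven_requests(weight: float, requests_per_second: float, count: int) -> int:
    """Count requests whose tag comes from the weight term instead of the current time.

    Hypothetical model: `count` requests of one class arrive evenly spaced
    at `requests_per_second`.
    """
    interval = 1.0 / requests_per_second
    prev_tag = 0.0
    driven = 0
    for i in range(1, count + 1):
        now = i * interval
        # The weight matters only if the weight-based candidate is ahead of "now".
        if prev_tag + 1.0 / weight > now:
            driven += 1
        prev_tag = next_weight_tag(prev_tag, now, weight)
    return driven
```

With 100 requests per second, a weight of 10 keeps every tag weight-driven, while a weight of 1000 makes every tag collapse to the current time, which is why the weight should stay below the expected request rate.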
11.3. Factors that impact mClock operation queues
There are three factors that can reduce the impact of the mClock operation queues within Red Hat Ceph Storage:
- The number of shards for client operations.
- The number of operations in the operation sequencer.
- The usage of a distributed system for Ceph OSDs.
The number of shards for client operations
Requests to a Ceph OSD are sharded by their placement group identifier. Each shard has its own mClock queue, and these queues neither interact nor share information with one another.
The number of shards can be controlled with these configuration options:
- osd_op_num_shards
- osd_op_num_shards_hdd
- osd_op_num_shards_ssd
A lower number of shards increases the impact of the mClock queues, but might have other detrimental effects.
					Use the default number of shards as defined by the configuration options osd_op_num_shards, osd_op_num_shards_hdd, and osd_op_num_shards_ssd.
				
The number of operations in the operation sequencer
Requests are transferred from the operation queue to the operation sequencer, in which they are processed. The mClock scheduler is located in the operation queue. It determines which operation to transfer to the operation sequencer.
The number of operations allowed in the operation sequencer is a complex issue. The aim is to keep enough operations in the operation sequencer so it always works on some, while it waits for disk and network access to complete other operations.
However, mClock no longer has control over an operation that is transferred to the operation sequencer. Therefore, to maximize the impact of mClock, the goal is also to keep as few operations in the operation sequencer as possible.
The configuration options that influence the number of operations in the operation sequencer are:
- bluestore_throttle_bytes
- bluestore_throttle_deferred_bytes
- bluestore_throttle_cost_per_io
- bluestore_throttle_cost_per_io_hdd
- bluestore_throttle_cost_per_io_ssd
					Use the default values as defined by the bluestore_throttle_bytes and bluestore_throttle_deferred_bytes options. However, these options can be determined during the benchmarking phase.
				
The usage of a distributed system for Ceph OSDs
The third factor that affects the impact of the mClock algorithm is the usage of a distributed system, where requests are made to multiple Ceph OSDs, and each Ceph OSD can have multiple shards. However, Red Hat Ceph Storage currently uses the mClock algorithm, which is not a distributed version of mClock.
dmClock is the distributed version of mClock.
11.4. The mClock configuration
The mClock profiles hide the low-level details from users, making it easier to configure and use mClock.
The following input parameters are required for an mClock profile to configure the quality of service (QoS) related parameters:
- The total capacity of input and output operations per second (IOPS) for each Ceph OSD. This is determined automatically.
- The maximum sequential bandwidth capacity (MiB/s) of each OSD. See the osd_mclock_max_sequential_bandwidth_[hdd/ssd] option.
- An mClock profile type to be enabled. The default is balanced.
Using the settings in the specified profile, a Ceph OSD determines and applies the lower-level mClock and Ceph parameters. The parameters applied by the mClock profile make it possible to tune the QoS between the client I/O and background operations in the OSD.
11.5. mClock clients
The mClock scheduler handles requests from different types of Ceph services. mClock considers each service to be a type of client. Depending on the type of requests handled, mClock clients are classified into the following buckets:
- Client - Handles input and output (I/O) requests issued by external clients of Ceph.
- Background recovery - Handles internal recovery requests.
- Background best-effort - Handles internal backfill, scrub, snap trim, and placement group (PG) deletion requests.
The mClock scheduler derives the cost of an operation used in the QoS calculations from osd_mclock_max_capacity_iops_hdd | osd_mclock_max_capacity_iops_ssd, osd_mclock_max_sequential_bandwidth_hdd | osd_mclock_max_sequential_bandwidth_ssd and osd_op_num_shards_hdd | osd_op_num_shards_ssd parameters.
11.6. mClock profiles
				An mClock profile is a configuration setting. When applied to a running Red Hat Ceph Storage cluster, it enables the throttling of the IOPS operations belonging to different client classes, such as background recovery, scrub, snap trim, client op, and pg deletion.
			
The mClock profile uses the capacity limits and the mClock profile type selected by the user to determine the low-level mClock resource control configuration parameters and applies them transparently. Other Red Hat Ceph Storage configuration parameters are also applied. The low-level mClock resource control parameters are the reservation, limit, and weight that provide control of the resource shares. The mClock profiles allocate these parameters differently for each client type.
11.6.1. mClock profile types
mClock profiles can be classified into built-in and custom profiles.
					If any mClock profile is active, the following Red Hat Ceph Storage configuration sleep options get disabled, which means they are set to 0:
				
- osd_recovery_sleep
- osd_recovery_sleep_hdd
- osd_recovery_sleep_ssd
- osd_recovery_sleep_hybrid
- osd_scrub_sleep
- osd_delete_sleep
- osd_delete_sleep_hdd
- osd_delete_sleep_ssd
- osd_delete_sleep_hybrid
- osd_snap_trim_sleep
- osd_snap_trim_sleep_hdd
- osd_snap_trim_sleep_ssd
- osd_snap_trim_sleep_hybrid
Disabling the sleep options ensures that the mClock scheduler is able to determine when to pick the next operation from its operation queue and transfer it to the operation sequencer. This results in the desired QoS being provided across all its clients.
Custom profile
This profile gives users complete control over all the mClock configuration parameters. It should be used with caution and is meant for advanced users, who understand mClock and Red Hat Ceph Storage related configuration options.
Built-in profiles
When a built-in profile is enabled, the mClock scheduler calculates the low-level mClock parameters, that is, reservation, weight, and limit, based on the profile enabled for each client type.
The mClock parameters are calculated based on the maximum Ceph OSD capacity provided beforehand. Therefore, the following mClock configuration options cannot be modified when using any of the built-in profiles:
- osd_mclock_scheduler_client_res
- osd_mclock_scheduler_client_wgt
- osd_mclock_scheduler_client_lim
- osd_mclock_scheduler_background_recovery_res
- osd_mclock_scheduler_background_recovery_wgt
- osd_mclock_scheduler_background_recovery_lim
- osd_mclock_scheduler_background_best_effort_res
- osd_mclock_scheduler_background_best_effort_wgt
- osd_mclock_scheduler_background_best_effort_lim
Note: These defaults cannot be changed by using any of the config subsystem commands, such as config set, config daemon, or config tell. Although these commands report success, the mClock QoS parameters are reverted to their respective built-in profile defaults.
The following recovery and backfill related Ceph options are overridden to mClock defaults:
- osd_max_backfills
- osd_recovery_max_active
- osd_recovery_max_active_hdd
- osd_recovery_max_active_ssd
Do not change these options as the built-in profiles are optimized based on them. Changing these defaults can result in unexpected performance outcomes.
The following list shows the mClock defaults, which are the same as the current defaults, to maximize the performance of the foreground client operations:
- osd_max_backfills
  - Original default: 1
  - mClock default: 1
- osd_recovery_max_active
  - Original default: 0
  - mClock default: 0
- osd_recovery_max_active_hdd
  - Original default: 3
  - mClock default: 3
- osd_recovery_max_active_ssd
  - Original default: 10
  - mClock default: 10
 
						The above mClock defaults can be modified, only if necessary, by enabling osd_mclock_override_recovery_settings that is by default set as false. See Modifying backfill and recovery options to modify these parameters.
					
Built-in profile types
Users can choose from the following built-in profile types:
- balanced (default)
- high_client_ops
- high_recovery_ops
The values mentioned in the list below represent the proportion of the total IOPS capacity of the Ceph OSD allocated for the service type.
balanced
The default mClock profile is set to balanced because it represents a compromise between prioritizing client I/O and prioritizing recovery I/O. It allocates equal reservation, or priority, to client operations and background recovery operations. Background best-effort operations are given a lower reservation and therefore take longer to complete when there are competing operations. This profile meets the normal, or steady state, requirements of the cluster, which is the case when external client performance requirements are not critical and there are other background operations that still need attention within the OSD.
There might be instances that necessitate giving higher priority to either client operations or recovery operations. To meet such requirements, you can choose either the high_client_ops profile to prioritize client I/O or the high_recovery_ops profile to prioritize recovery I/O. These profiles are discussed further below.
				
- Service type: client
  - Reservation: 50%
  - Limit: MAX
  - Weight: 1
- Service type: background recovery
  - Reservation: 50%
  - Limit: MAX
  - Weight: 1
- Service type: background best-effort
  - Reservation: MIN
  - Limit: 90%
  - Weight: 1
high_client_ops
This profile optimizes client performance over background activities by allocating more reservation and limit to client operations as compared to background operations in the Ceph OSD. This profile, for example, can be enabled to provide the needed performance for I/O intensive applications for a sustained period of time at the cost of slower recoveries. The list below shows the resource control parameters set by the profile:
- Service type: client
  - Reservation: 60%
  - Limit: MAX
  - Weight: 2
- Service type: background recovery
  - Reservation: 40%
  - Limit: MAX
  - Weight: 1
- Service type: background best-effort
  - Reservation: MIN
  - Limit: 70%
  - Weight: 1
high_recovery_ops
This profile optimizes background recovery performance as compared to external clients and other background operations within the Ceph OSD.
For example, it could be temporarily enabled by an administrator to accelerate background recoveries during non-peak hours. The list below shows the resource control parameters set by the profile:
- Service type: client
  - Reservation: 30%
  - Limit: MAX
  - Weight: 1
- Service type: background recovery
  - Reservation: 70%
  - Limit: MAX
  - Weight: 2
- Service type: background best-effort
  - Reservation: MIN
  - Limit: MAX
  - Weight: 1
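As an illustration of how the percentages in the built-in profile lists relate to an OSD's IOPS capacity, the following Python sketch converts them to absolute IOPS figures. It is only a sketch under stated assumptions: the dictionary layout and function names are hypothetical, MIN is modelled as a reservation of 0, MAX is modelled as None (no limit), and the per-shard division that the scheduler applies is ignored.

```python
from typing import Optional

# (reservation %, limit %, weight) per service type, copied from the profile lists.
# Assumed encodings: MIN -> 0, MAX -> None (no limit).
PROFILES = {
    "balanced": {
        "client": (50, None, 1),
        "background_recovery": (50, None, 1),
        "background_best_effort": (0, 90, 1),
    },
    "high_client_ops": {
        "client": (60, None, 2),
        "background_recovery": (40, None, 1),
        "background_best_effort": (0, 70, 1),
    },
    "high_recovery_ops": {
        "client": (30, None, 1),
        "background_recovery": (70, None, 2),
        "background_best_effort": (0, None, 1),
    },
}


def to_iops(percent: Optional[int], capacity_iops: float) -> Optional[float]:
    """Convert a percentage of the OSD capacity to IOPS; None means unlimited."""
    if percent is None:
        return None
    return capacity_iops * percent / 100


def profile_in_iops(profile: str, capacity_iops: float) -> dict:
    """Return reservation and limit in IOPS, plus the weight, for each service type."""
    return {
        service: {
            "reservation_iops": to_iops(res, capacity_iops),
            "limit_iops": to_iops(lim, capacity_iops),
            "weight": wgt,
        }
        for service, (res, lim, wgt) in PROFILES[profile].items()
    }
```

For example, calling profile_in_iops("balanced", 315) with a hypothetical capacity of 315 IOPS gives the client and background recovery services a reservation of 157.5 IOPS each.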
 
11.6.2. Changing an mClock profile
					The default mClock profile is set to balanced. The other types of the built-in profile are high_client_ops and high_recovery_ops.
				
The custom profile is not recommended unless you are an advanced user.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Procedure
- Log into the Cephadm shell:
  Example
  [root@host01 ~]# cephadm shell
- Set the osd_mclock_profile option:
  Syntax
  ceph config set osd.OSD_ID osd_mclock_profile VALUE
  Example
  [ceph: root@host01 /]# ceph config set osd.0 osd_mclock_profile high_recovery_ops
  This example changes the profile to allow faster recoveries on osd.0.
  Note: For optimal performance the profile must be set on all Ceph OSDs by using the following command:
  Syntax
  ceph config set osd osd_mclock_profile VALUE
11.6.3. Switching between built-in and custom profiles
The following steps describe switching from built-in profile to custom profile and vice-versa.
You might want to switch to the custom profile if you want complete control over all the mClock configuration options. However, it is recommended not to use the custom profile unless you are an advanced user.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Switch from built-in profile to custom profile
- Log into the Cephadm shell:
  Example
  [root@host01 ~]# cephadm shell
- Switch to the custom profile:
  Syntax
  ceph config set osd.OSD_ID osd_mclock_profile custom
  Example
  [ceph: root@host01 /]# ceph config set osd.0 osd_mclock_profile custom
  Note: For optimal performance the profile must be set on all Ceph OSDs by using the following command:
  Example
  [ceph: root@host01 /]# ceph config set osd osd_mclock_profile custom
- Optional: After switching to the custom profile, modify the desired mClock configuration options:
  Syntax
  ceph config set osd.OSD_ID MCLOCK_CONFIGURATION_OPTION VALUE
  Example
  [ceph: root@host01 /]# ceph config set osd.0 osd_mclock_scheduler_client_res 0.5
  This example changes the client reservation IOPS ratio for a specific OSD, osd.0, to 0.5 (50%).
  Important: Change the reservations of other services, such as background recovery and background best-effort, accordingly to ensure that the sum of the reservations does not exceed the maximum proportion (1.0) of the IOPS capacity of the OSD.
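Before applying custom reservations, it can help to check the constraint that their sum does not exceed 1.0. The following Python sketch is a hypothetical planning helper, not part of Ceph: it takes the three reservation option names used in this chapter with the values you intend to set and reports whether the sum fits within the OSD's IOPS capacity.

```python
import math

# Reservation options named in this chapter; the helper itself is hypothetical.
RESERVATION_OPTIONS = (
    "osd_mclock_scheduler_client_res",
    "osd_mclock_scheduler_background_recovery_res",
    "osd_mclock_scheduler_background_best_effort_res",
)


def reservations_fit(settings: dict, tolerance: float = 1e-9) -> bool:
    """Return True if the planned reservations sum to at most 1.0 (100% of capacity).

    `settings` must contain a value for each of the three reservation options.
    A small tolerance absorbs floating-point rounding.
    """
    total = math.fsum(settings[name] for name in RESERVATION_OPTIONS)
    return total <= 1.0 + tolerance
```

For example, a plan of 0.5 for client, 0.3 for background recovery, and 0.1 for background best-effort fits, whereas 0.5, 0.4, and 0.2 does not.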
Switch from custom profile to built-in profile
- Log into the Cephadm shell:
  Example
  [root@host01 ~]# cephadm shell
- Set the desired built-in profile:
  Syntax
  ceph config set osd osd_mclock_profile MCLOCK_PROFILE
  Example
  [ceph: root@host01 /]# ceph config set osd osd_mclock_profile high_client_ops
  This example sets the built-in profile to high_client_ops on all Ceph OSDs.
- Determine the existing custom mClock configuration settings in the database:
  Example
  [ceph: root@host01 /]# ceph config dump
- Remove the custom mClock configuration settings determined earlier:
  Syntax
  ceph config rm osd MCLOCK_CONFIGURATION_OPTION
  Example
  [ceph: root@host01 /]# ceph config rm osd osd_mclock_scheduler_client_res
  This example removes the configuration option osd_mclock_scheduler_client_res that was set on all Ceph OSDs.
  After all existing custom mClock configuration settings are removed from the central configuration database, the configuration settings related to high_client_ops are applied.
- Verify the settings on Ceph OSDs:
  Syntax
  ceph config show osd.OSD_ID
  Example
  [ceph: root@host01 /]# ceph config show osd.0
11.6.4. Switching temporarily between mClock profiles
This section contains steps to temporarily switch between mClock profiles.
This section is for advanced users or for experimental testing. Do not use the below commands on a running storage cluster as it could have unexpected outcomes.
The configuration changes on a Ceph OSD using the below commands are temporary and are lost when the Ceph OSD is restarted.
						The configuration options that are overridden using the commands described in this section cannot be modified further using the ceph config set osd.OSD_ID command. The changes do not take effect until a given Ceph OSD is restarted. This is intentional, as per the configuration subsystem design. However, any further modifications can still be made temporarily using these commands.
					
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Procedure
- Log into the Cephadm shell:
  Example
  [root@host01 ~]# cephadm shell
- Run the following command to override the mClock settings:
  Syntax
  ceph tell osd.OSD_ID injectargs '--MCLOCK_CONFIGURATION_OPTION=VALUE'
  Example
  [ceph: root@host01 /]# ceph tell osd.0 injectargs '--osd_mclock_profile=high_recovery_ops'
  This example overrides the osd_mclock_profile option on osd.0.
- Optional: As an alternative to the previous ceph tell osd.OSD_ID injectargs command, you can use the following command:
  Syntax
  ceph daemon osd.OSD_ID config set MCLOCK_CONFIGURATION_OPTION VALUE
  Example
  [ceph: root@host01 /]# ceph daemon osd.0 config set osd_mclock_profile high_recovery_ops
The individual QoS related configuration options for the custom profile can also be modified temporarily using the above commands.
11.6.5. Degraded and Misplaced Object Recovery Rate With mClock Profiles
Degraded object recovery is categorized into the background recovery bucket. Across all mClock profiles, degraded object recovery is given higher priority when compared to misplaced object recovery because degraded objects present a data safety issue not present with objects that are merely misplaced.
					Backfill or the misplaced object recovery operation is categorized into the background best-effort bucket. According to the balanced and high_client_ops mClock profiles, background best-effort client is not constrained by reservation (set to zero) but is limited to use a fraction of the participating OSD’s capacity if there are no other competing services.
				
Therefore, with the balanced or high_client_ops profile and with other competing background services active, backfilling rates are expected to be slower than with the previous Weighted Priority Queue (WPQ) scheduler.
				
If higher backfill rates are desired, please follow the steps mentioned in the section below.
Improving backfilling rates
For a faster backfilling rate when using either the balanced or the high_client_ops profile, follow these steps:
					
- Switch to the high_recovery_ops mClock profile for the duration of the backfills. See Changing an mClock profile to achieve this. Once the backfilling phase is complete, switch the mClock profile to the previously active profile. If there is no significant improvement in the backfilling rate with the high_recovery_ops profile, continue to the next step.
- Switch the mClock profile back to the previously active profile.
- Modify osd_max_backfills to a higher value, for example, 3. See Modifying backfills and recovery options to achieve this.
- Once the backfilling is complete, reset osd_max_backfills to the default value of 1 by following the same procedure as in the previous step.
Note that modifying osd_max_backfills might affect other operations. For example, client operations might experience higher latency during the backfilling phase. Therefore, increase osd_max_backfills in small increments to minimize the performance impact on other operations in the cluster.
					
11.6.6. Modifying backfills and recovery options
					Modify the backfills and recovery options with the ceph config set command.
				
The backfill or recovery options that can be modified are listed in mClock profile types.
This section is for advanced users or for experimental testing. Do not use the below commands on a running storage cluster as it could have unexpected outcomes.
Modify the values only for experimental testing, or if the cluster is unable to handle the values or it shows poor performance with the default settings.
						The modification of the mClock default backfill or recovery options is restricted by the osd_mclock_override_recovery_settings option, which is set to false by default.
					
						If you attempt to modify any default backfill or recovery options without setting osd_mclock_override_recovery_settings to true, it resets the options back to the mClock defaults along with a warning message logged in the cluster log.
					
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Procedure
- Log into the Cephadm shell:
  Example
  [root@host01 ~]# cephadm shell
- Set the osd_mclock_override_recovery_settings configuration option to true on all Ceph OSDs:
  Example
  [ceph: root@host01 /]# ceph config set osd osd_mclock_override_recovery_settings true
- Set the desired backfills or recovery option:
  Syntax
  ceph config set osd OPTION VALUE
  Example
  [ceph: root@host01 /]# ceph config set osd osd_max_backfills 5
- Wait a few seconds and verify the configuration for the specific OSD:
  Syntax
  ceph config show osd.OSD_ID | grep OPTION
  Example
  [ceph: root@host01 /]# ceph config show osd.0 | grep osd_max_backfills
- Reset the osd_mclock_override_recovery_settings configuration option to false on all OSDs:
  Example
  [ceph: root@host01 /]# ceph config set osd osd_mclock_override_recovery_settings false
11.7. The Ceph OSD capacity determination
				The Ceph OSD capacity in terms of total IOPS is determined automatically during the Ceph OSD initialization. This is achieved by running the Ceph OSD bench tool and overriding the default value of osd_mclock_max_capacity_iops_[hdd, ssd] option depending on the device type. No other action or input is expected from the user to set the Ceph OSD capacity.
			
Mitigation of unrealistic Ceph OSD capacity from the automated procedure
In certain conditions, the Ceph OSD bench tool might show unrealistic or inflated results depending on the drive configuration and other environment related conditions.
To mitigate the performance impact due to this unrealistic capacity, a couple of threshold configuration options depending on the OSD device type are defined and used:
- osd_mclock_iops_capacity_threshold_hdd = 500
- osd_mclock_iops_capacity_threshold_ssd = 80000
You can verify these parameters by running the following commands:
[ceph: root@host01 /]# ceph config show osd.0 osd_mclock_iops_capacity_threshold_hdd
500.000000
[ceph: root@host01 /]# ceph config show osd.0 osd_mclock_iops_capacity_threshold_ssd
80000.000000
If you want to manually benchmark OSD(s) or manually tune the BlueStore throttle parameters, see Manually benchmarking OSDs.
You can verify the capacity of an OSD after the cluster is up by running the following command:
Syntax
ceph config show osd.N osd_mclock_max_capacity_iops_[hdd, ssd]
Example
[ceph: root@host01 /]# ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
				In the above example, you can view the maximum capacity for osd.0 on a Red Hat Ceph Storage node whose underlying device is an SSD.
			
The following automated step is performed:
Fallback to using default OSD capacity
				If the Ceph OSD bench tool reports a measurement that exceeds the above threshold values, the fallback mechanism reverts to the default value of osd_mclock_max_capacity_iops_hdd or osd_mclock_max_capacity_iops_ssd. The threshold configuration options can be reconfigured based on the type of drive used.
			
A cluster warning is logged in case the measurement exceeds the threshold:
Example
3403 Sep 11 11:52:50 dell-r640-039.dsal.lab.eng.rdu2.redhat.com ceph-osd[70342]: log_channel(cluster) log [WRN] : OSD bench result of 49691.213005 IOPS exceeded the threshold limit of 500.000000 IOPS for osd.27. IOPS capacity is unchanged at 315.000000 IOPS. The recommendation is to establish the osd's IOPS capacity using other benchmark tools (e.g. Fio) and then override osd_mclock_max_capacity_iops_[hdd|ssd].
					If the default capacity does not accurately represent the Ceph OSD capacity, it is highly recommended to run a custom benchmark using the preferred tool, for example Fio, on the drive and then override the osd_mclock_max_capacity_iops_[hdd, ssd] option as described in Specifying maximum OSD capacity.
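The fallback rule described above can be pictured with a small shell sketch. It illustrates the documented decision only and is not the OSD's actual implementation; the function name `effective_iops_capacity` and its argument order are assumptions made for this example.

```shell
# Illustrative sketch only: mimics the documented fallback rule.
# effective_iops_capacity is a hypothetical helper, not a Ceph command.
# Arguments: MEASURED_IOPS THRESHOLD_IOPS DEFAULT_IOPS
effective_iops_capacity() {
    local measured=$1 threshold=$2 default_iops=$3
    # awk performs the floating-point comparison that shell arithmetic lacks.
    awk -v m="$measured" -v t="$threshold" -v d="$default_iops" 'BEGIN {
        if (m > t) printf "%.6f\n", d;   # unrealistic result: keep the default
        else       printf "%.6f\n", m;   # plausible result: use the measurement
    }'
}
```

With the figures from the warning shown above, `effective_iops_capacity 49691.213005 500 315` prints `315.000000`, which corresponds to the "IOPS capacity is unchanged" part of the log message.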
				
11.7.1. Verifying the capacity of an OSD
You can verify the capacity of a Ceph OSD after setting up the storage cluster.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Procedure
- Log into the Cephadm shell:
  Example
  [root@host01 ~]# cephadm shell
- Verify the capacity of a Ceph OSD:
  Syntax
  ceph config show osd.OSD_ID osd_mclock_max_capacity_iops_[hdd, ssd]
  Example
  [ceph: root@host01 /]# ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
  21500.000000
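To check more than one OSD, the verification command can be wrapped in a loop. The following is a minimal sketch, assuming it runs inside the Cephadm shell and that `ceph osd ls` returns the OSD IDs one per line; the function name `show_all_osd_capacities` is hypothetical. It prints both the hdd and ssd options for each OSD, on the assumption that only the one matching the OSD's device type is relevant.

```shell
# Sketch: print the mClock IOPS capacity options for every OSD.
# show_all_osd_capacities is a hypothetical helper name.
show_all_osd_capacities() {
    local id type
    for id in $(ceph osd ls); do
        for type in hdd ssd; do
            # Same command as in the procedure above, once per device type.
            printf 'osd.%s %s %s\n' "$id" "$type" \
                "$(ceph config show "osd.$id" "osd_mclock_max_capacity_iops_$type")"
        done
    done
}
```

Each output line has the form `osd.ID TYPE VALUE`, so the result can be filtered with standard tools such as grep.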
11.7.2. Manually benchmarking OSDs
To manually benchmark a Ceph OSD, any existing benchmarking tool, for example Fio, can be used. Regardless of the tool or command used, the steps below remain the same.
The number of shards and the BlueStore throttle parameters have an impact on the mClock operation queues. Therefore, it is critical to set these values carefully in order to maximize the impact of the mClock scheduler. See Factors that impact mClock operation queues for more information about these values.
The steps in this section are only necessary if you want to override the Ceph OSD capacity determined automatically during the OSD initialization.
If you have already determined the benchmark data and wish to manually override the maximum OSD capacity for a Ceph OSD, skip to the Specifying maximum OSD capacity section.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Procedure
- Log into the Cephadm shell:
  Example
  [root@host01 ~]# cephadm shell
- Benchmark a Ceph OSD:
  Syntax
  ceph tell osd.OSD_ID bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]
  where:
- TOTAL_BYTES: Total number of bytes to write.
- BYTES_PER_WRITE: Block size per write.
- OBJ_SIZE: Bytes per object.
- NUM_OBJS: Number of objects to write.
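As a rough illustration of how the first two arguments relate, the sketch below computes how many writes a benchmark run issues, assuming the tool writes TOTAL_BYTES in chunks of BYTES_PER_WRITE. The helper name `bench_write_count` is hypothetical and is not part of Ceph.

```shell
# Hypothetical helper: number of writes implied by the first two bench arguments.
# Arguments: TOTAL_BYTES BYTES_PER_WRITE
bench_write_count() {
    local total_bytes=$1 bytes_per_write=$2
    # Integer division; assumes TOTAL_BYTES is a multiple of BYTES_PER_WRITE.
    echo $(( total_bytes / bytes_per_write ))
}
```

For the 4 KiB workload used later in this chapter, `bench_write_count 12288000 4096` prints `3000`.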
11.7.3. Determining the correct BlueStore throttle values
This optional section details the steps used to determine the correct BlueStore throttle values. The steps use the default shards.
Before running the test, clear the caches to get an accurate measurement. Clear the OSD caches between each benchmark run using the following command:
Syntax
ceph tell osd.OSD_ID cache drop
Example
[ceph: root@host01 /]# ceph tell osd.0 cache drop
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor node hosting the OSDs that you wish to benchmark.
Procedure
- Log into the Cephadm shell:
  Example
  [root@host01 ~]# cephadm shell
- Run a simple 4KiB random write workload on an OSD:
  Syntax
  ceph tell osd.OSD_ID bench 12288000 4096 4194304 100
  The overall throughput obtained from the output of the osd bench command is the baseline throughput, when the default BlueStore throttle options are in effect.
- Note the overall throughput, that is IOPS, obtained from the output of the previous command.
- If the intent is to determine the BlueStore throttle values for your environment, set the bluestore_throttle_bytes and bluestore_throttle_deferred_bytes options to 32 KiB, that is, 32768 bytes:
  Syntax
  ceph config set osd.OSD_ID bluestore_throttle_bytes 32768
  ceph config set osd.OSD_ID bluestore_throttle_deferred_bytes 32768
  Example
  [ceph: root@host01 /]# ceph config set osd.0 bluestore_throttle_bytes 32768
  [ceph: root@host01 /]# ceph config set osd.0 bluestore_throttle_deferred_bytes 32768
  Otherwise, you can skip to the next section, Specifying maximum OSD capacity.
- Run the 4KiB random write test as before using an OSD bench command:
  Example
  [ceph: root@host01 /]# ceph tell osd.0 bench 12288000 4096 4194304 100
- Note the overall throughput from the output and compare the value against the baseline throughput recorded earlier.
- If the throughput does not match the baseline, double the values of the BlueStore throttle options.
- Repeat the 4KiB random write test, compare the result against the baseline throughput, and double the BlueStore throttle options again, until the obtained throughput is very close to the baseline value.
For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB for both BlueStore throttle and deferred bytes was determined to maximize the impact of mClock. For HDDs, the corresponding value was 40 MiB, where the overall throughput was roughly equal to the baseline throughput.
In general for HDDs, the BlueStore throttle values are expected to be higher when compared to SSDs.
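The doubling search described in the procedure above can be sketched as a shell loop. This is an illustration under stated assumptions, not a supported tool: the function names, the 5 percent tolerance used in the usage note, and the upper bound are all hypothetical, and `measure_iops` is a simulated stand-in whose numbers are made up so that the sketch runs without a cluster.

```shell
# Stand-in for a real measurement. On a cluster, this function would instead run:
#   ceph config set osd.OSD_ID bluestore_throttle_bytes "$1"
#   ceph config set osd.OSD_ID bluestore_throttle_deferred_bytes "$1"
#   ceph tell osd.OSD_ID cache drop
#   ceph tell osd.OSD_ID bench 12288000 4096 4194304 100
# and print the IOPS figure from the bench output, rounded to an integer.
# The simulated curve below is invented purely for illustration.
measure_iops() {
    local throttle_bytes=$1
    local iops=$(( throttle_bytes / 128 ))
    if [ "$iops" -gt 2000 ]; then
        iops=2000
    fi
    echo "$iops"
}

# Arguments: BASELINE_IOPS START_BYTES TOLERANCE_PERCENT MAX_BYTES
find_throttle_bytes() {
    local baseline=$1 throttle=$2 tolerance=$3 max=$4 iops
    while [ "$throttle" -le "$max" ]; do
        iops=$(measure_iops "$throttle")
        # Stop once throughput is within TOLERANCE_PERCENT of the baseline.
        if [ $(( iops * 100 )) -ge $(( baseline * (100 - tolerance) )) ]; then
            echo "$throttle"
            return 0
        fi
        throttle=$(( throttle * 2 ))   # double the throttle and try again
    done
    return 1   # no value up to MAX_BYTES came close to the baseline
}
```

With the simulated measurements, `find_throttle_bytes 2000 32768 5 4194304` starts at 32 KiB and stops at 262144 bytes. How close "very close to the baseline" needs to be is a judgment call for your environment; the tolerance argument only makes that choice explicit.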
11.7.4. Specifying maximum OSD capacity
You can override the maximum Ceph OSD capacity automatically set during OSD initialization.
These steps are optional. Perform the following steps if the default capacity does not accurately represent the Ceph OSD capacity.
Ensure that you determine the benchmark data first, as described in Manually benchmarking OSDs.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor host.
Procedure
- Log into the Cephadm shell:
  Example
  [root@host01 ~]# cephadm shell
- Set the osd_mclock_max_capacity_iops_[hdd, ssd] option for an OSD:
  Syntax
  ceph config set osd.OSD_ID osd_mclock_max_capacity_iops_[hdd,ssd] VALUE
  Example
  [ceph: root@host01 /]# ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350
  This example sets the maximum capacity for osd.0, whose underlying device type is HDD, to 350 IOPS.
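The `[hdd,ssd]` notation in the syntax means that exactly one of the two suffixes is appended to the option name. The following sketch makes that expansion explicit by building the command for a given device type. The helper name `capacity_override_cmd` is hypothetical, and it only prints the command rather than running it.

```shell
# Hypothetical helper: prints (does not run) the override command for one OSD.
# Arguments: OSD_ID DEVICE_TYPE IOPS, where DEVICE_TYPE is hdd or ssd
capacity_override_cmd() {
    local osd_id=$1 device_type=$2 iops=$3
    case "$device_type" in
        hdd|ssd) ;;   # the only two suffixes the option name accepts
        *) echo "device type must be hdd or ssd" >&2; return 1 ;;
    esac
    echo "ceph config set osd.$osd_id osd_mclock_max_capacity_iops_$device_type $iops"
}
```

For example, `capacity_override_cmd 0 hdd 350` prints the same command as the example above, which you can review before running it in the Cephadm shell.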