Chapter 10. The mClock OSD scheduler


As a storage administrator, you can implement Red Hat Ceph Storage’s quality of service (QoS) by using the mClock queueing scheduler. The scheduler is based on an adaptation of the mClock algorithm called dmClock.

The mClock OSD scheduler provides the desired QoS using configuration profiles to allocate proper reservation, weight, and limit tags to the service types.

The mClock OSD scheduler performs the QoS calculations for the different device types, that is, SSD or HDD, by using the OSD’s IOPS capability (determined automatically) and maximum sequential bandwidth capability (see osd_mclock_max_sequential_bandwidth_hdd and osd_mclock_max_sequential_bandwidth_ssd in The mclock configuration options section).

10.1. Comparison of mClock OSD scheduler with WPQ OSD scheduler

The mClock OSD scheduler replaces the Weighted Priority Queue (WPQ) OSD scheduler as the default scheduler in Red Hat Ceph Storage 6.1.

Important

The mClock scheduler is supported for BlueStore OSDs.

The WPQ OSD scheduler features a strict sub-queue, which is de-queued before the normal queue. The WPQ removes operations from a queue in relation to their priorities to prevent depletion of any queue. This helps in cases where some Ceph OSDs are more overloaded than others.

The mClock OSD scheduler currently features an immediate queue, into which operations that require immediate response are queued. The immediate queue is not handled by mClock and functions as a simple first in, first out queue and is given the first priority.

Operations, such as OSD replication operations, OSD operation replies, peering, recoveries marked with the highest priority, and so forth, are queued into the immediate queue. All other operations are enqueued into the mClock queue that works according to the mClock algorithm.

The mClock queue, mclock_scheduler, prioritizes operations based on which bucket they belong to, that is pg recovery, pg scrub, snap trim, client op, and pg deletion.

With background operations in progress, the average client throughputs, that is the input and output operations per second (IOPS), are significantly higher and latencies are lower with the mClock profiles when compared to the WPQ scheduler. That is because of mClock’s effective allocation of the QoS parameters.

10.2. The allocation of input and output resources

This section describes how the QoS controls work internally with reservation, limit, and weight allocation. The user is not expected to set these controls as the mClock profiles automatically set them. Tuning these controls can only be performed using the available mClock profiles.

The dmClock algorithm allocates the input and output (I/O) resources of the Ceph cluster in proportion to weights. It implements the constraints of minimum reservation and maximum limitation to ensure the services can compete for the resources fairly.

Currently, the mclock_scheduler operation queue divides Ceph services involving I/O resources into the following buckets:

  • client op: the input and output operations per second (IOPS) issued by a client.
  • pg deletion: the IOPS issued by primary Ceph OSD.
  • snap trim: the snapshot trimming-related requests.
  • pg recovery: the recovery-related requests.
  • pg scrub: the scrub-related requests.

The resources are partitioned using the following three sets of tags, meaning that the share of each type of service is controlled by these three tags:

  • Reservation
  • Limit
  • Weight

Reservation

The minimum IOPS allocated for the service. The more reservation a service has, the more resources it is guaranteed to possess, as long as it needs them.

For example, a service with the reservation set to 0.1 (or 10%) always has 10% of the OSD’s IOPS capacity allocated for itself. Therefore, even if the clients start to issue large amounts of I/O requests, they do not exhaust all the I/O resources and the service’s operations are not depleted even in a cluster with high load.

Limit

The maximum IOPS allocated for the service. The service does not get more than the set number of requests per second serviced, even if it needs more and no other services are competing with it. If a service crosses the enforced limit, the operation remains in the operation queue until the limit is restored.

Note

If the value is set to 0 (disabled), the service is not restricted by the limit setting and it can use all the resources if there is no other competing operation. This is represented as "MAX" in the mClock profiles.

Note

The reservation and limit parameter allocations are per-shard, based on the type of backing device, that is HDD or SSD, under the Ceph OSD. See OSD Object storage daemon configuration options for more details about osd_op_num_shards_hdd and osd_op_num_shards_ssd parameters.

Weight

The proportional share of capacity when there is extra capacity or when the system is oversubscribed. A service can use a larger portion of the I/O resources if its weight is higher than that of its competitors.

Note

The reservation and limit values for a service are specified in terms of a proportion of the total IOPS capacity of the OSD. The proportion is represented as a percentage in the mClock profiles. The weight does not have a unit. The weights are relative to one another, so if one class of requests has a weight of 9 and another a weight of 1, then the requests are performed at a 9 to 1 ratio. However, that only happens once the reservations are met and those values include the operations performed under the reservation phase.
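
The 9 to 1 ratio described in this note can be illustrated with a small Python sketch. It is a simplified model and not Ceph code: it assumes that both request classes always have work queued, it ignores reservations and limits, and the names dispatch_counts, class_a, and class_b are hypothetical. Each class carries a tag that advances by 1/weight for every dispatched request, and the class with the smallest tag is served next.

```python
from collections import Counter
from fractions import Fraction


def dispatch_counts(weights, total):
    """Serve `total` requests, always picking the class with the smallest tag.

    Simplified weight-only model: every class is assumed to be permanently
    backlogged, and reservation and limit tags are ignored.
    """
    # The first request of each class gets a tag of 1/weight.
    tags = {name: Fraction(1, weight) for name, weight in weights.items()}
    served = Counter()
    for _ in range(total):
        # Smallest tag wins; ties are broken by class name for determinism.
        name = min(tags, key=lambda n: (tags[n], n))
        served[name] += 1
        # The next request of that class is spaced 1/weight further out.
        tags[name] += Fraction(1, weights[name])
    return dict(served)


counts = dispatch_counts({"class_a": 9, "class_b": 1}, total=100)
print(counts)  # {'class_a': 90, 'class_b': 10}
```

Exact fractions are used instead of floating-point numbers so that the tags of the two classes tie cleanly at whole numbers.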

Important

If the weight is set to W, then for a given class of requests the next one that enters has a weight tag of 1/W plus the previous weight tag, or the current time, whichever is larger. That means, if W is too large and thus 1/W is too small, the calculated tag might never be assigned as it gets a value of the current time.

Therefore, values for weight should always be lower than the number of requests expected to be serviced each second.
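
A minimal Python sketch of this rule follows. It reads the rule as tag = max(previous tag + 1/W, current time), which matches the proportional-share tag in the mClock algorithm. It is not taken from the Ceph source, and the function names and the request rate of 100 requests per second are assumptions chosen for illustration.

```python
def next_weight_tag(prev_tag, weight, now):
    """Return max(prev_tag + 1/W, now), following the rule described above."""
    return max(prev_tag + 1.0 / weight, now)


def tags_for_arrivals(weight, arrivals):
    """Compute the weight tag for each request arrival time (in seconds)."""
    tags = []
    prev_tag = 0.0
    for now in arrivals:
        prev_tag = next_weight_tag(prev_tag, weight, now)
        tags.append(prev_tag)
    return tags


# Hypothetical class receiving 100 requests per second (one every 0.01 s).
arrivals = [0.0, 0.01, 0.02, 0.03]

# W = 10 is below the request rate: the tags run ahead of the clock,
# so the weight influences the ordering of requests.
print(tags_for_arrivals(10, arrivals))

# W = 1000 is above the request rate: 1/W is so small that each tag
# collapses to the current time and the weight has no effect.
print(tags_for_arrivals(1000, arrivals))
```

With the larger weight, every tag after the first equals the arrival time, which is the situation the recommendation above is meant to avoid.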

10.3. Factors that impact mClock operation queues

There are three factors that can reduce the impact of the mClock operation queues within Red Hat Ceph Storage:

  • The number of shards for client operations.
  • The number of operations in the operation sequencer.
  • The usage of a distributed system for Ceph OSDs.

The number of shards for client operations

Requests to a Ceph OSD are sharded by their placement group identifier. Each shard has its own mClock queue and these queues neither interact, nor share information amongst them.

The number of shards can be controlled with these configuration options:

  • osd_op_num_shards
  • osd_op_num_shards_hdd
  • osd_op_num_shards_ssd

A lower number of shards increases the impact of the mClock queues, but might have other damaging effects.

Note

Use the default number of shards as defined by the configuration options osd_op_num_shards, osd_op_num_shards_hdd, and osd_op_num_shards_ssd.

The number of operations in the operation sequencer

Requests are transferred from the operation queue to the operation sequencer, in which they are processed. The mClock scheduler is located in the operation queue. It determines which operation to transfer to the operation sequencer.

The number of operations allowed in the operation sequencer is a complex issue. The aim is to keep enough operations in the operation sequencer so it always works on some, while it waits for disk and network access to complete other operations.

However, mClock no longer has control over an operation that is transferred to the operation sequencer. Therefore, to maximize the impact of mClock, the goal is also to keep as few operations in the operation sequencer as possible.

The configuration options that influence the number of operations in the operation sequencer are:

  • bluestore_throttle_bytes
  • bluestore_throttle_deferred_bytes
  • bluestore_throttle_cost_per_io
  • bluestore_throttle_cost_per_io_hdd
  • bluestore_throttle_cost_per_io_ssd

Note

Use the default values as defined by the bluestore_throttle_bytes and bluestore_throttle_deferred_bytes options. However, these options can be determined during the benchmarking phase.

The usage of a distributed system for Ceph OSDs

The third factor that affects the impact of the mClock algorithm is the usage of a distributed system, where requests are made to multiple Ceph OSDs, and each Ceph OSD can have multiple shards. However, Red Hat Ceph Storage currently uses the mClock algorithm, which is not a distributed version of mClock.

Note

dmClock is the distributed version of mClock.

10.4. The mClock configuration

To make mClock more user-friendly and intuitive, the mClock configuration profiles are introduced in Red Hat Ceph Storage 6. The mClock profiles hide the low-level details from users, making it easier to configure and use mClock.

The following input parameters are required for an mClock profile to configure the quality of service (QoS) related parameters:

  • The total capacity of input and output operations per second (IOPS) for each Ceph OSD. This is determined automatically.
  • The maximum sequential bandwidth capacity (MiB/s) of each OSD. See the osd_mclock_max_sequential_bandwidth_[hdd/ssd] option.
  • An mClock profile type to be enabled. The default is balanced.

Using the settings in the specified profile, a Ceph OSD determines and applies the lower-level mClock and Ceph parameters. The parameters applied by the mClock profile make it possible to tune the QoS between the client I/O and background operations in the OSD.

10.5. mClock clients

The mClock scheduler handles requests from different types of Ceph services. Each service is considered by mClock as a type of client. Depending on the type of requests handled, mClock clients are classified into the following buckets:

  • Client - Handles input and output (I/O) requests issued by external clients of Ceph.
  • Background recovery - Handles internal recovery requests.
  • Background best-effort - Handles internal backfill, scrub, snap trim, and placement group (PG) deletion requests.

The mClock scheduler derives the cost of an operation used in the QoS calculations from osd_mclock_max_capacity_iops_hdd | osd_mclock_max_capacity_iops_ssd, osd_mclock_max_sequential_bandwidth_hdd | osd_mclock_max_sequential_bandwidth_ssd and osd_op_num_shards_hdd | osd_op_num_shards_ssd parameters.

10.6. mClock profiles

An mClock profile is a configuration setting. When applied to a running Red Hat Ceph Storage cluster, it enables the throttling of the IOPS operations belonging to different client classes, such as background recovery, scrub, snap trim, client op, and pg deletion.

The mClock profile uses the capacity limits and the mClock profile type selected by the user to determine the low-level mClock resource control configuration parameters and applies them transparently. Other Red Hat Ceph Storage configuration parameters are also applied. The low-level mClock resource control parameters are the reservation, limit, and weight that provide control of the resource shares. The mClock profiles allocate these parameters differently for each client type.

10.6.1. mClock profile types

mClock profiles can be classified into built-in and custom profiles.

If any mClock profile is active, the following Red Hat Ceph Storage configuration sleep options get disabled, which means they are set to 0:

  • osd_recovery_sleep
  • osd_recovery_sleep_hdd
  • osd_recovery_sleep_ssd
  • osd_recovery_sleep_hybrid
  • osd_scrub_sleep
  • osd_delete_sleep
  • osd_delete_sleep_hdd
  • osd_delete_sleep_ssd
  • osd_delete_sleep_hybrid
  • osd_snap_trim_sleep
  • osd_snap_trim_sleep_hdd
  • osd_snap_trim_sleep_ssd
  • osd_snap_trim_sleep_hybrid

This ensures that the mClock scheduler can determine when to pick the next operation from its operation queue and transfer it to the operation sequencer. This results in the desired QoS being provided across all its clients.

Custom profile

This profile gives users complete control over all the mClock configuration parameters. It should be used with caution and is meant for advanced users, who understand mClock and Red Hat Ceph Storage related configuration options.

Built-in profiles

When a built-in profile is enabled, the mClock scheduler calculates the low-level mClock parameters, that is, reservation, weight, and limit, based on the profile enabled for each client type.

The mClock parameters are calculated based on the maximum Ceph OSD capacity provided beforehand. Therefore, the following mClock configuration options cannot be modified when using any of the built-in profiles:

  • osd_mclock_scheduler_client_res
  • osd_mclock_scheduler_client_wgt
  • osd_mclock_scheduler_client_lim
  • osd_mclock_scheduler_background_recovery_res
  • osd_mclock_scheduler_background_recovery_wgt
  • osd_mclock_scheduler_background_recovery_lim
  • osd_mclock_scheduler_background_best_effort_res
  • osd_mclock_scheduler_background_best_effort_wgt
  • osd_mclock_scheduler_background_best_effort_lim

    Note

    These defaults cannot be changed using any of the config subsystem commands like config set, config daemon or config tell commands. Although the above command(s) report success, the mclock QoS parameters are reverted to their respective built-in profile defaults.

The following recovery and backfill related Ceph options are overridden to mClock defaults:

Warning

Do not change these options as the built-in profiles are optimized based on them. Changing these defaults can result in unexpected performance outcomes.

  • osd_max_backfills
  • osd_recovery_max_active
  • osd_recovery_max_active_hdd
  • osd_recovery_max_active_ssd

The following table shows the mClock defaults, which are the same as the original defaults, to maximize the performance of the foreground client operations:

Option                       Original default  mClock default
osd_max_backfills            1                 1
osd_recovery_max_active      0                 0
osd_recovery_max_active_hdd  3                 3
osd_recovery_max_active_ssd  10                10

Note

The above mClock defaults can be modified, only if necessary, by enabling osd_mclock_override_recovery_settings, which is set to false by default. See Modifying backfills and recovery options to modify these parameters.

Built-in profile types

Users can choose from the following built-in profile types:

  • balanced (default)
  • high_client_ops
  • high_recovery_ops

Note

The values mentioned in the list below represent the proportion of the total IOPS capacity of the Ceph OSD allocated for the service type.

  • balanced

The default mClock profile is set to balanced because it represents a compromise between prioritizing client IO and recovery IO. It allocates equal reservation or priority to client operations and background recovery operations. Background best-effort operations are given lower reservation and therefore take longer to complete when there are competing operations. This profile meets the normal or steady state requirements of the cluster, which is the case when external client performance requirements are not critical and there are other background operations that still need attention within the OSD.

There might be instances that necessitate giving higher priority to either client operations or recovery operations. To meet such requirements you can choose either the high_client_ops profile to prioritize client IO or the high_recovery_ops profile to prioritize recovery IO. These profiles are discussed further below.

Service type            Reservation  Limit  Weight
client                  50%          MAX    1
background recovery     50%          MAX    1
background best-effort  MIN          90%    1

  • high_client_ops

This profile optimizes client performance over background activities by allocating more reservation and limit to client operations as compared to background operations in the Ceph OSD. This profile, for example, can be enabled to provide the needed performance for I/O intensive applications for a sustained period of time at the cost of slower recoveries. The list below shows the resource control parameters set by the profile:

Service type            Reservation  Limit  Weight
client                  60%          MAX    2
background recovery     40%          MAX    1
background best-effort  MIN          70%    1

  • high_recovery_ops

This profile optimizes background recovery performance as compared to external clients and other background operations within the Ceph OSD.

For example, it could be temporarily enabled by an administrator to accelerate background recoveries during non-peak hours. The list below shows the resource control parameters set by the profile:

Service type            Reservation  Limit  Weight
client                  30%          MAX    1
background recovery     70%          MAX    2
background best-effort  MIN          MAX    1
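
To make the percentages above concrete, the following Python sketch converts the built-in profile allocations into IOPS for a given OSD capacity. It is illustrative only and not how Ceph stores these values: the PROFILES dictionary and the function names are hypothetical, MIN is modelled as a reservation of 0, MAX is modelled as None (no limit), and the division across shards mentioned earlier is not modelled.

```python
# Built-in profile allocations, as integer percentages of OSD IOPS capacity.
# MIN reservation is modelled as 0 and MAX limit as None (no limit).
PROFILES = {
    "balanced": {
        "client": {"res": 50, "lim": None, "wgt": 1},
        "background_recovery": {"res": 50, "lim": None, "wgt": 1},
        "background_best_effort": {"res": 0, "lim": 90, "wgt": 1},
    },
    "high_client_ops": {
        "client": {"res": 60, "lim": None, "wgt": 2},
        "background_recovery": {"res": 40, "lim": None, "wgt": 1},
        "background_best_effort": {"res": 0, "lim": 70, "wgt": 1},
    },
    "high_recovery_ops": {
        "client": {"res": 30, "lim": None, "wgt": 1},
        "background_recovery": {"res": 70, "lim": None, "wgt": 2},
        "background_best_effort": {"res": 0, "lim": None, "wgt": 1},
    },
}


def to_iops(percent, osd_capacity_iops):
    """Convert a percentage of OSD capacity to IOPS; None means unlimited."""
    if percent is None:
        return None
    return osd_capacity_iops * percent / 100


def allocation(profile, service, osd_capacity_iops):
    """Return the reservation, limit, and weight for one service type."""
    entry = PROFILES[profile][service]
    return {
        "reservation_iops": to_iops(entry["res"], osd_capacity_iops),
        "limit_iops": to_iops(entry["lim"], osd_capacity_iops),
        "weight": entry["wgt"],
    }


# Example: an OSD with a capacity of 21500 IOPS under the balanced profile.
print(allocation("balanced", "client", 21500))
print(allocation("high_client_ops", "background_best_effort", 21500))
```

For an OSD with a capacity of 21500 IOPS, the balanced profile reserves 10750 IOPS for client operations, and the high_client_ops profile limits background best-effort operations to 15050 IOPS.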

10.6.2. Changing an mClock profile

The default mClock profile is set to balanced. The other types of the built-in profile are high_client_ops and high_recovery_ops.

Note

The custom profile is not recommended unless you are an advanced user.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the Ceph Monitor host.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Set the osd_mclock_profile option:

    Syntax

    ceph config set osd.OSD_ID osd_mclock_profile VALUE

    Example

    [ceph: root@host01 /]# ceph config set osd.0 osd_mclock_profile high_recovery_ops

    This example changes the profile to allow faster recoveries on osd.0.

    Note

    For optimal performance the profile must be set on all Ceph OSDs by using the following command:

    Syntax

    ceph config set osd osd_mclock_profile VALUE

10.6.3. Switching between built-in and custom profiles

The following steps describe switching from a built-in profile to the custom profile and vice versa.

You might want to switch to the custom profile if you want complete control over all the mClock configuration options. However, it is recommended not to use the custom profile unless you are an advanced user.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the Ceph Monitor host.

Switch from built-in profile to custom profile

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Switch to the custom profile:

    Syntax

    ceph config set osd.OSD_ID osd_mclock_profile custom

    Example

    [ceph: root@host01 /]# ceph config set osd.0 osd_mclock_profile custom

    Note

    For optimal performance the profile must be set on all Ceph OSDs by using the following command:

    Example

    [ceph: root@host01 /]# ceph config set osd osd_mclock_profile custom

  3. Optional: After switching to the custom profile, modify the desired mClock configuration options:

    Syntax

    ceph config set osd.OSD_ID MCLOCK_CONFIGURATION_OPTION VALUE

    Example

    [ceph: root@host01 /]# ceph config set osd.0 osd_mclock_scheduler_client_res 0.5

    This example changes the client reservation IOPS ratio for a specific OSD, osd.0, to 0.5 (50%).

    Important

    Change the reservations of other services, such as background recovery and background best-effort accordingly to ensure that the sum of the reservations does not exceed the maximum proportion (1.0) of the IOPS capacity of the OSD.
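
Before applying custom reservations, the constraint above can be checked with a short Python sketch. The function name and the dictionary keys are hypothetical; the values are the ratios that would be passed to the osd_mclock_scheduler_*_res options.

```python
def reservations_fit(reservations):
    """Return True if the reservation ratios sum to at most 1.0.

    A tiny tolerance absorbs floating-point rounding in the sum.
    """
    return sum(reservations.values()) <= 1.0 + 1e-9


planned = {
    "client": 0.5,
    "background_recovery": 0.3,
    "background_best_effort": 0.1,
}
print(reservations_fit(planned))  # True

# Raising background recovery to 0.6 would oversubscribe the OSD.
planned["background_recovery"] = 0.6
print(reservations_fit(planned))  # False
```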

Switch from custom profile to built-in profile

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Set the desired built-in profile:

    Syntax

    ceph config set osd osd_mclock_profile MCLOCK_PROFILE

    Example

    [ceph: root@host01 /]# ceph config set osd osd_mclock_profile high_client_ops

    This example sets the built-in profile to high_client_ops on all Ceph OSDs.

  3. Determine the existing custom mClock configuration settings in the database:

    Example

    [ceph: root@host01 /]# ceph config dump

  4. Remove the custom mClock configuration settings determined earlier:

    Syntax

    ceph config rm osd MCLOCK_CONFIGURATION_OPTION

    Example

    [ceph: root@host01 /]# ceph config rm osd osd_mclock_scheduler_client_res

    This example removes the configuration option osd_mclock_scheduler_client_res that was set on all Ceph OSDs.

    After all existing custom mClock configuration settings are removed from the central configuration database, the configuration settings related to high_client_ops are applied.

  5. Verify the settings on Ceph OSDs:

    Syntax

    ceph config show osd.OSD_ID

    Example

    [ceph: root@host01 /]# ceph config show osd.0

Additional Resources

  • See mClock profile types for the list of the mClock configuration options that cannot be modified with built-in profiles.

10.6.4. Switching temporarily between mClock profiles

This section contains steps to temporarily switch between mClock profiles.

Warning

This section is for advanced users or for experimental testing. Do not use the below commands on a running storage cluster as it could have unexpected outcomes.

Note

The configuration changes on a Ceph OSD using the below commands are temporary and are lost when the Ceph OSD is restarted.

Important

The configuration options that are overridden using the commands described in this section cannot be modified further using the ceph config set osd.OSD_ID command. The changes do not take effect until a given Ceph OSD is restarted. This is intentional, as per the configuration subsystem design. However, any further modifications can still be made temporarily using these commands.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the Ceph Monitor host.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Run the following command to override the mClock settings:

    Syntax

    ceph tell osd.OSD_ID injectargs '--MCLOCK_CONFIGURATION_OPTION=VALUE'

    Example

    [ceph: root@host01 /]# ceph tell osd.0 injectargs '--osd_mclock_profile=high_recovery_ops'

    This example overrides the osd_mclock_profile option on osd.0.

  3. Optional: As an alternative to the previous ceph tell osd.OSD_ID injectargs command, you can use the following command:

    Syntax

    ceph daemon osd.OSD_ID config set MCLOCK_CONFIGURATION_OPTION VALUE

    Example

    [ceph: root@host01 /]# ceph daemon osd.0 config set osd_mclock_profile high_recovery_ops

Note

The individual QoS related configuration options for the custom profile can also be modified temporarily using the above commands.

10.6.5. Degraded and Misplaced Object Recovery Rate With mClock Profiles

Degraded object recovery is categorized into the background recovery bucket. Across all mClock profiles, degraded object recovery is given higher priority when compared to misplaced object recovery because degraded objects present a data safety issue not present with objects that are merely misplaced.

Backfill or the misplaced object recovery operation is categorized into the background best-effort bucket. According to the balanced and high_client_ops mClock profiles, background best-effort client is not constrained by reservation (set to zero) but is limited to use a fraction of the participating OSD’s capacity if there are no other competing services.

Therefore, with the balanced or high_client_ops profile and with other background competing services active, backfilling rates are expected to be slower when compared to the previous WeightedPriorityQueue (WPQ) scheduler.

If higher backfill rates are desired, please follow the steps mentioned in the section below.

Improving backfilling rates

For a faster backfilling rate when using either the balanced or high_client_ops profile, follow these steps:

  1. Switch to the high_recovery_ops mClock profile for the duration of the backfills. See Changing an mClock profile to achieve this. Once the backfilling phase is complete, switch the mClock profile back to the previously active profile. If there is no significant improvement in the backfilling rate with the high_recovery_ops profile, continue to the next step.
  2. Switch the mClock profile back to the previously active profile.
  3. Modify osd_max_backfills to a higher value, for example, 3. See Modifying backfills and recovery options to achieve this.
  4. Once the backfilling is complete, reset osd_max_backfills to the default value of 1 by following the same procedure as in step 3.

Warning

Modifying osd_max_backfills might cause other operations, for example client operations, to experience higher latency during the backfilling phase. Therefore, increase osd_max_backfills in small increments to minimize the performance impact on other operations in the cluster.

10.6.6. Modifying backfills and recovery options

Modify the backfills and recovery options with the ceph config set command.

The backfill or recovery options that can be modified are listed in mClock profile types.

Warning

This section is for advanced users or for experimental testing. Do not use the below commands on a running storage cluster as it could have unexpected outcomes.

Modify the values only for experimental testing, or if the cluster is unable to handle the values or it shows poor performance with the default settings.

Important

The modification of the mClock default backfill or recovery options is restricted by the osd_mclock_override_recovery_settings option, which is set to false by default.

If you attempt to modify any default backfill or recovery options without setting osd_mclock_override_recovery_settings to true, it resets the options back to the mClock defaults along with a warning message logged in the cluster log.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the Ceph Monitor host.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Set the osd_mclock_override_recovery_settings configuration option to true on all Ceph OSDs:

    Example

    [ceph: root@host01 /]# ceph config set osd osd_mclock_override_recovery_settings true

  3. Set the desired backfills or recovery option:

    Syntax

    ceph config set osd OPTION VALUE

    Example

    [ceph: root@host01 /]# ceph config set osd osd_max_backfills 5

  4. Wait a few seconds and verify the configuration for the specific OSD:

    Syntax

    ceph config show osd.OSD_ID | grep OPTION

    Example

    [ceph: root@host01 /]# ceph config show osd.0 | grep osd_max_backfills

  5. Reset the osd_mclock_override_recovery_settings configuration option to false on all OSDs:

    Example

    [ceph: root@host01 /]# ceph config set osd osd_mclock_override_recovery_settings false

10.7. The Ceph OSD capacity determination

The Ceph OSD capacity in terms of total IOPS is determined automatically during the Ceph OSD initialization. This is achieved by running the Ceph OSD bench tool and overriding the default value of osd_mclock_max_capacity_iops_[hdd, ssd] option depending on the device type. No other action or input is expected from the user to set the Ceph OSD capacity.

Mitigation of unrealistic Ceph OSD capacity from the automated procedure

In certain conditions, the Ceph OSD bench tool might show unrealistic or inflated results depending on the drive configuration and other environment related conditions.

To mitigate the performance impact due to this unrealistic capacity, a couple of threshold configuration options depending on the OSD device type are defined and used:

  • osd_mclock_iops_capacity_threshold_hdd = 500
  • osd_mclock_iops_capacity_threshold_ssd = 80000

The following automated step is performed:

Fallback to using default OSD capacity

If the Ceph OSD bench tool reports a measurement that exceeds the above threshold values, the fallback mechanism reverts to the default value of osd_mclock_max_capacity_iops_hdd or osd_mclock_max_capacity_iops_ssd. The threshold configuration options can be reconfigured based on the type of drive used.

A cluster warning is logged in case the measurement exceeds the threshold:

Example

2022-10-27T15:30:23.270+0000 7f9b5dbe95c0  0 log_channel(cluster) log [WRN] : OSD bench result of 39546.479392 IOPS exceeded the threshold limit of 25000.000000 IOPS for osd.1. IOPS capacity is unchanged at 21500.000000 IOPS. The recommendation is to establish the osd's IOPS capacity using other benchmark tools (e.g. Fio) and then override osd_mclock_max_capacity_iops_[hdd|ssd].
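
The fallback decision can be summarized with the following Python sketch. It only mirrors the behavior described in this section and is not the Ceph implementation; the function and parameter names are hypothetical, and the sample values are the ones shown in the log message above.

```python
def effective_capacity_iops(bench_iops, threshold_iops, default_iops):
    """Pick the IOPS capacity an OSD ends up with after the bench run.

    If the measurement exceeds the threshold it is treated as unrealistic
    and the default capacity is kept; otherwise the measurement is used.
    """
    if bench_iops > threshold_iops:
        return default_iops
    return bench_iops


# Values taken from the log message above: the bench result exceeds the
# threshold, so the capacity stays at the default.
print(effective_capacity_iops(39546.479392, 25000.0, 21500.0))  # 21500.0

# A measurement below the threshold is accepted as the OSD capacity.
print(effective_capacity_iops(18000.0, 25000.0, 21500.0))  # 18000.0
```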

Important

If the default capacity does not accurately represent the Ceph OSD capacity, it is highly recommended to run a custom benchmark using the preferred tool, for example Fio, on the drive and then override the osd_mclock_max_capacity_iops_[hdd, ssd] option as described in Specifying maximum OSD capacity.

10.7.1. Verifying the capacity of an OSD

You can verify the capacity of a Ceph OSD after setting up the storage cluster.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the Ceph Monitor host.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Verify the capacity of a Ceph OSD:

    Syntax

    ceph config show osd.OSD_ID osd_mclock_max_capacity_iops_[hdd, ssd]

    Example

    [ceph: root@host01 /]# ceph config show osd.0 osd_mclock_max_capacity_iops_ssd
    
    21500.000000

10.7.2. Manually benchmarking OSDs

To manually benchmark a Ceph OSD, any existing benchmarking tool, for example Fio, can be used. Regardless of the tool or command used, the steps below remain the same.

Important

The number of shards and BlueStore throttle parameters have an impact on the mClock operation queues. Therefore, it is critical to set these values carefully in order to maximize the impact of the mClock scheduler. See Factors that impact mClock operation queues for more information about these values.

Note

The steps in this section are only necessary if you want to override the Ceph OSD capacity determined automatically during the OSD initialization.

Note

If you have already determined the benchmark data and wish to manually override the maximum OSD capacity for a Ceph OSD, skip to the Specifying maximum OSD capacity section.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the Ceph Monitor host.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Benchmark a Ceph OSD:

    Syntax

    ceph tell osd.OSD_ID bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]

    where:

    • TOTAL_BYTES: Total number of bytes to write.
    • BYTES_PER_WRITE: Block size per write.
    • OBJ_SIZE: Bytes per object.
    • NUM_OBJS: Number of objects to write.

    Example

    [ceph: root@host01 /]# ceph tell osd.0 bench 12288000 4096 4194304 100
    {
        "bytes_written": 12288000,
        "blocksize": 4096,
        "elapsed_sec": 1.3718913019999999,
        "bytes_per_sec": 8956977.8466311768,
        "iops": 2186.7621695876896
    }
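
The fields in this output are related by simple arithmetic, which the following Python sketch reproduces as a sanity check. The function name is hypothetical, and the relationships (bytes per second is bytes written divided by elapsed seconds, and IOPS is bytes per second divided by the block size) are inferred from the sample values rather than taken from the Ceph source.

```python
def bench_rates(bytes_written, blocksize, elapsed_sec):
    """Derive throughput figures from the raw osd bench measurements."""
    bytes_per_sec = bytes_written / elapsed_sec
    iops = bytes_per_sec / blocksize
    return bytes_per_sec, iops


# Values from the example output above.
bytes_per_sec, iops = bench_rates(12288000, 4096, 1.3718913019999999)
print(round(bytes_per_sec))  # 8956978
print(round(iops))           # 2187
```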

10.7.3. Determining the correct BlueStore throttle values

This optional section details the steps used to determine the correct BlueStore throttle values. The steps use the default shards.

Important

Before running the test, clear the caches to get an accurate measurement. Clear the OSD caches between each benchmark run using the following command:

Syntax

ceph tell osd.OSD_ID cache drop

Example

[ceph: root@host01 /]# ceph tell osd.0 cache drop

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the Ceph Monitor node hosting the OSDs that you wish to benchmark.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Run a simple 4KiB random write workload on an OSD:

    Syntax

    ceph tell osd.OSD_ID bench 12288000 4096 4194304 100

    Example

    [ceph: root@host01 /]# ceph tell osd.0 bench 12288000 4096 4194304 100
    {
        "bytes_written": 12288000,
        "blocksize": 4096,
        "elapsed_sec": 1.3718913019999999,
        "bytes_per_sec": 8956977.8466311768,
        "iops": 2186.7621695876896
    }

    The iops value is the overall throughput obtained from the output of the osd bench command. This value is the baseline throughput, when the default BlueStore throttle options are in effect.

  3. Note the overall throughput, that is IOPS, obtained from the output of the previous command.
  4. If the intent is to determine the BlueStore throttle values for your environment, set bluestore_throttle_bytes and bluestore_throttle_deferred_bytes options to 32 KiB, that is, 32768 Bytes:

    Syntax

    ceph config set osd.OSD_ID bluestore_throttle_bytes 32768
    ceph config set osd.OSD_ID bluestore_throttle_deferred_bytes 32768

    Example

    [ceph: root@host01 /]# ceph config set osd.0 bluestore_throttle_bytes 32768
    [ceph: root@host01 /]# ceph config set osd.0 bluestore_throttle_deferred_bytes 32768

    Otherwise, you can skip to the next section Specifying maximum OSD capacity.

  5. Run the 4KiB random write test as before using an OSD bench command:

    Example

    [ceph: root@host01 /]# ceph tell osd.0 bench 12288000 4096 4194304 100

  6. Note the overall throughput from the output and compare the value against the baseline throughput recorded earlier.
  7. If the throughput does not match the baseline, double the BlueStore throttle options.
  8. Repeat the steps by running the 4KiB random write test, comparing the value against the baseline throughput, and doubling the BlueStore throttle options, until the obtained throughput is very close to the baseline value.

Note

For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB for both BlueStore throttle and deferred bytes was determined to maximize the impact of mClock. For HDDs, the corresponding value was 40 MiB, where the overall throughput was roughly equal to the baseline throughput.

In general for HDDs, the BlueStore throttle values are expected to be higher when compared to SSDs.
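
The doubling procedure in this section can be sketched in Python as follows. This is a hedged outline, not a supported tool: run_bench is a hypothetical callable that would set both BlueStore throttle options to the given value, drop the OSD caches, run the bench command, and return the measured IOPS. The 5% tolerance and the 64 MiB upper bound are assumptions, because the procedure only asks for a throughput "very close" to the baseline.

```python
def find_throttle_bytes(run_bench, baseline_iops, start_bytes=32768,
                        tolerance=0.05, max_bytes=64 * 1024 * 1024):
    """Double the throttle value until throughput is close to the baseline.

    Returns the first throttle value (in bytes) whose measured IOPS is
    within `tolerance` of the baseline, or None if `max_bytes` is passed.
    """
    throttle = start_bytes
    while throttle <= max_bytes:
        iops = run_bench(throttle)
        if iops >= baseline_iops * (1.0 - tolerance):
            return throttle
        throttle *= 2
    return None


def fake_bench(throttle_bytes):
    """Invented throughput model, used only to demonstrate the loop."""
    return min(2000.0, throttle_bytes / 128.0)


print(find_throttle_bytes(fake_bench, baseline_iops=2000.0))  # 262144
```

In a real run, run_bench would wrap the ceph config set, ceph tell osd.OSD_ID cache drop, and ceph tell osd.OSD_ID bench commands shown in the procedure.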

10.7.4. Specifying maximum OSD capacity

You can override the maximum Ceph OSD capacity automatically set during OSD initialization.

These steps are optional. Perform the following steps if the default capacity does not accurately represent the Ceph OSD capacity.

Note

Ensure that you determine the benchmark data first, as described in Manually benchmarking OSDs.

Prerequisites

  • A running Red Hat Ceph Storage cluster.
  • Root-level access to the Ceph Monitor host.

Procedure

  1. Log into the Cephadm shell:

    Example

    [root@host01 ~]# cephadm shell

  2. Set osd_mclock_max_capacity_iops_[hdd, ssd] option for an OSD:

    Syntax

    ceph config set osd.OSD_ID osd_mclock_max_capacity_iops_[hdd,ssd] VALUE

    Example

    [ceph: root@host01 /]# ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350

    This example sets the maximum capacity for osd.0, where an underlying device type is HDD, to 350 IOPS.
