Chapter 11. CRUSH Tunables
The Ceph project has grown rapidly, with many changes and many new features. Beginning with the first commercially supported major release of Ceph, v0.48 (Argonaut), Ceph provides the ability to adjust certain parameters of the CRUSH algorithm (i.e., the settings aren’t frozen into the source code).
A few important points to consider:
- Adjusting CRUSH tunable values may cause some PGs to shift between storage nodes. If the Ceph cluster is already storing a lot of data, be prepared for some fraction of the data to move.
- The ceph-osd and ceph-mon daemons will start requiring the feature bits of new connections as soon as they receive an updated map. However, already-connected clients are effectively grandfathered in, and will misbehave if they do not support the new feature. Make sure when you upgrade your Ceph Storage Cluster daemons that you also update your Ceph clients.
- If the CRUSH tunables are set to non-legacy values and then later changed back to the legacy values, ceph-osd daemons will not be required to support the feature. However, the OSD peering process requires examining and understanding old maps. Therefore, you should not run old versions of the ceph-osd daemon if the cluster has previously used non-legacy CRUSH values, even if the latest version of the map has been switched back to using the legacy defaults.
11.1. The Evolution of CRUSH Tunables
Ceph clients and daemons prior to v0.48 do not detect tunables and are not compatible with v0.48 and later (you must upgrade). The ability to adjust tunable CRUSH values has also evolved with major Ceph releases.
Legacy Values
Legacy values deployed on newer clusters that support CRUSH tunables can cause misbehavior. Issues include:
- In hierarchies with a small number of devices in the leaf buckets, some PGs map to fewer than the desired number of replicas. This commonly happens for hierarchies with "host" nodes that have a small number (1-3) of OSDs nested beneath each one.
- For large clusters, a small percentage of PGs map to fewer than the desired number of OSDs. This is more prevalent when there are several layers in the hierarchy (e.g., row, rack, host, osd).
- When some OSDs are marked out, the data tends to get redistributed to nearby OSDs instead of across the entire hierarchy.
We strongly encourage upgrading both Ceph clients and Ceph daemons to major supported releases to take advantage of CRUSH tunables. We recommend that all cluster daemons and clients use the same release version.
CRUSH_TUNABLES
Beginning with v0.48 (Argonaut), the first commercially supported major release of Ceph, with v0.49 and later, and with Linux kernel version 3.6 or later (for the file system and RBD kernel clients), Ceph provides support for the following CRUSH tunables:
- choose_local_tries: Number of local retries. Legacy value is 2, optimal value is 0.
- choose_local_fallback_tries: Legacy value is 5, optimal value is 0.
- choose_total_tries: Total number of attempts to choose an item. The legacy value was 19; subsequent testing indicates that a value of 50 is more appropriate for typical clusters. For extremely large clusters, a larger value might be necessary.
CRUSH_TUNABLES2
Beginning with v0.55 or later, including the second major release of Ceph, v0.56.x (Bobtail), and Linux kernel version v3.9 or later (for the file system and RBD kernel clients), Ceph provides support for CRUSH_TUNABLES and the following setting for CRUSH_TUNABLES2:
- chooseleaf_descend_once: Whether a recursive chooseleaf attempt will retry, or only try once and allow the original placement to retry. Legacy default is 0, optimal value is 1.
CRUSH_TUNABLES3
Beginning with the sixth major release of Ceph, v0.78 (Firefly), and Linux kernel version v3.15 or later (for the file system and RBD kernel clients), Ceph provides support for CRUSH_TUNABLES, CRUSH_TUNABLES2, and the following setting for CRUSH_TUNABLES3:
- chooseleaf_vary_r: Whether a recursive chooseleaf attempt will start with a non-zero value of r, based on how many attempts the parent has already made. Legacy default is 0, but with this value CRUSH is sometimes unable to find a mapping. The optimal value (in terms of computational cost and correctness) is 1. However, for legacy clusters that have lots of existing data, changing from 0 to 1 will cause a lot of data to move; a value of 4 or 5 will allow CRUSH to find a valid mapping but will make less data move.
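To see which of these tunable values a running cluster currently carries, recent versions of the ceph CLI can report them directly. The following is only a sketch; availability of these subcommands and the exact output format vary by release:
# Print the tunables recorded in the current CRUSH map (recent releases).
ceph osd crush show-tunables
# Alternatively, dump the full CRUSH map as JSON and look for the tunables section.
ceph osd crush dump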
11.2. Tuning CRUSH
Before you tune CRUSH, you should ensure that all Ceph clients and all Ceph daemons use the same version. If you have recently upgraded, ensure that you have restarted daemons and reconnected clients.
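One way to spot-check this is to compare the versions reported by the daemons and by the client tools; the commands below are a sketch, and their availability can vary slightly by release. Kernel clients (CephFS, RBD) are governed by the running kernel version:
# Report the version each ceph-osd daemon is running.
ceph tell osd.* version
# Report the version of the locally installed client tools.
ceph --version
# Kernel clients depend on the kernel version.
uname -r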
The simplest way to adjust the CRUSH tunables is by changing to a known profile. Those are:
- legacy: the legacy behavior from v0.47 (pre-Argonaut) and earlier.
- argonaut: the legacy values supported by the v0.48 (Argonaut) release.
- bobtail: the values supported by the v0.56 (Bobtail) release.
- firefly: the values supported by the v0.80 (Firefly) release.
- optimal: the current best values.
- default: the current default values for a new cluster.
You can select a profile on a running cluster with the command:
ceph osd crush tunables {PROFILE}
This may result in some data movement.
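As an illustrative example (the profile chosen here is arbitrary and the amount of data movement depends on the cluster), you might pin a cluster that still serves older kernel clients to the bobtail profile and then watch the resulting recovery:
# Switch the cluster to the bobtail tunables profile.
ceph osd crush tunables bobtail
# Watch cluster status while any resulting PG remapping and recovery completes.
ceph -w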
Generally, you should set the CRUSH tunables after you upgrade, or if you receive a warning. Starting with v0.74, Ceph issues a health warning if the CRUSH tunables are not set to their optimal values (the optimal values are the default as of v0.73). To make this warning go away, you have two options:
The first option is to adjust the tunables on the existing cluster. Note that this will result in some data movement (possibly as much as 10%). This is the preferred route, but should be taken with care on a production cluster where the data movement may affect performance. You can enable optimal tunables with:
ceph osd crush tunables optimal
If things go poorly (e.g., too much load) and not very much progress has been made, or there is a client compatibility problem (old kernel cephfs or rbd clients, or pre-bobtail librados clients), you can switch back to an earlier profile:
ceph osd crush tunables {profile}
For example, to restore the pre-v0.48 (Argonaut) values, execute:
ceph osd crush tunables legacy
The second option is to make the warning go away without making any changes to CRUSH. Add the following option to the [mon] section of your ceph.conf file:
mon warn on legacy crush tunables = false
For the change to take effect, you will need to restart the monitors, or apply the option to running monitors with:
ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables
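With either option, you can confirm that the warning has cleared by checking the cluster health:
ceph health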
11.3. Tuning CRUSH, the hard way
If you can ensure that all clients are running recent code, you can adjust the tunables by extracting the CRUSH map, modifying the values, and reinjecting it into the cluster.
Extract the latest CRUSH map:
ceph osd getcrushmap -o /tmp/crush
Adjust tunables. These values appear to offer the best behavior for both large and small clusters we tested with. You will need to additionally specify the --enable-unsafe-tunables argument to crushtool for this to work. Please use this option with extreme care:
crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new
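Before reinjecting, it can be worth sanity-checking the modified map offline with crushtool. The rule number and replica count below are placeholders for your own pools, and option names can vary slightly between crushtool versions:
# Decompile the new map to plain text for inspection.
crushtool -d /tmp/crush.new -o /tmp/crush.new.txt
# Simulate placements and report any inputs that fail to map to the requested number of OSDs.
crushtool -i /tmp/crush.new --test --rule 0 --num-rep 3 --show-bad-mappings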
Reinject modified map:
ceph osd setcrushmap -i /tmp/crush.new
11.4. Legacy values
For reference, the legacy values for the CRUSH tunables can be set with:
crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy
Again, the special --enable-unsafe-tunables option is required. Further, as noted above, be careful running old versions of the ceph-osd daemon after reverting to legacy values as the feature bit is not perfectly enforced.
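For completeness, and reusing the commands from the previous section, the full revert round trip looks something like the following sketch (the temporary file paths are only examples):
# Extract the current map, rewrite it with the legacy tunable values, and reinject it.
ceph osd getcrushmap -o /tmp/crush
crushtool -i /tmp/crush --enable-unsafe-tunables --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy
ceph osd setcrushmap -i /tmp/crush.legacy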