Chapter 11. CRUSH Tunables
The Ceph project has grown rapidly, with many changes and many new features. Beginning with the first commercially supported major release of Ceph, v0.48 (Argonaut), Ceph provides the ability to adjust certain parameters of the CRUSH algorithm (i.e., the settings aren’t frozen into the source code).
A few important points to consider:
- Adjusting CRUSH tunable values may cause some PGs to shift between storage nodes. If the Ceph cluster is already storing a lot of data, be prepared for some fraction of the data to move.
- The ceph-osd and ceph-mon daemons will start requiring the feature bits of new connections as soon as they receive an updated map. However, already-connected clients are effectively grandfathered in, and will misbehave if they do not support the new feature. Make sure when you upgrade your Ceph Storage Cluster daemons that you also update your Ceph clients.
- If the CRUSH tunables are set to non-legacy values and then later changed back to the legacy values, ceph-osd daemons will not be required to support the feature. However, the OSD peering process requires examining and understanding old maps. Therefore, you should not run old versions of the ceph-osd daemon if the cluster has previously used non-legacy CRUSH values, even if the latest version of the map has been switched back to using the legacy defaults.
11.1. The Evolution of CRUSH Tunables
Ceph clients and daemons prior to v0.48 do not detect tunables and are not compatible with v0.48 and later (you must upgrade). The ability to adjust tunable CRUSH values has also evolved with major Ceph releases.
Legacy Values
Legacy values deployed on newer clusters that support CRUSH tunables can cause misbehavior. Issues include:
- In hierarchies with a small number of devices in the leaf buckets, some PGs map to fewer than the desired number of replicas. This commonly happens for hierarchies with "host" nodes that have a small number (1-3) of OSDs nested beneath each one.
- For large clusters, a small percentage of PGs map to fewer than the desired number of OSDs. This is more prevalent when there are several layers in the hierarchy (e.g., row, rack, host, osd).
- When some OSDs are marked out, the data tends to get redistributed to nearby OSDs instead of across the entire hierarchy.
We strongly encourage upgrading both Ceph clients and Ceph daemons to major supported releases to take advantage of CRUSH tunables. We recommend that all cluster daemons and clients use the same release version.
CRUSH_TUNABLES
Beginning with v0.48 (Argonaut), the first commercially supported major release of Ceph, with v0.49 and later, and with Linux kernel version 3.6 or later (for the file system and RBD kernel clients), Ceph provides support for the following CRUSH tunables:
- choose_local_tries: Number of local retries. Legacy value is 2, optimal value is 0.
- choose_local_fallback_tries: Legacy value is 5, optimal value is 0.
- choose_total_tries: Total number of attempts to choose an item. The legacy value was 19; subsequent testing indicates that a value of 50 is more appropriate for typical clusters. For extremely large clusters, a larger value might be necessary.
CRUSH_TUNABLES2
Beginning with v0.55 or later, including the second major release of Ceph, v0.56.x (Bobtail), and Linux kernel version v3.9 or later (for the file system and RBD kernel clients), Ceph provides support for CRUSH_TUNABLES and the following setting for CRUSH_TUNABLES2:
- chooseleaf_descend_once: Whether a recursive chooseleaf attempt will retry, or only try once and allow the original placement to retry. Legacy default is 0, optimal value is 1.
CRUSH_TUNABLES3
Beginning with the sixth major release of Ceph, v0.78 (Firefly), and Linux kernel version v3.15 or later (for the file system and RBD kernel clients), Ceph provides support for CRUSH_TUNABLES, CRUSH_TUNABLES2, and the following setting for CRUSH_TUNABLES3:
- chooseleaf_vary_r: Whether a recursive chooseleaf attempt will start with a non-zero value of r, based on how many attempts the parent has already made. Legacy default is 0, but with this value CRUSH is sometimes unable to find a mapping. The optimal value (in terms of computational cost and correctness) is 1. However, for legacy clusters that have lots of existing data, changing from 0 to 1 will cause a lot of data to move; a value of 4 or 5 will allow CRUSH to find a valid mapping but will make less data move.
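To see which of these tunable values a running cluster currently carries, recent versions of the ceph CLI can report them directly. The following is only a sketch; availability of these subcommands and the exact output format vary by release:
# Print the tunables recorded in the current CRUSH map (recent releases).
ceph osd crush show-tunables
# Alternatively, dump the full CRUSH map as JSON and look for the tunables section.
ceph osd crush dump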
11.2. Tuning CRUSH
Before you tune CRUSH, you should ensure that all Ceph clients and all Ceph daemons use the same version. If you have recently upgraded, ensure that you have restarted daemons and reconnected clients.
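One way to spot-check this is to compare the versions reported by the daemons and by the client tools; the commands below are a sketch, and their availability can vary slightly by release. Kernel clients (CephFS, RBD) are governed by the running kernel version:
# Report the version each ceph-osd daemon is running.
ceph tell osd.* version
# Report the version of the locally installed client tools.
ceph --version
# Kernel clients depend on the kernel version.
uname -r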
The simplest way to adjust the CRUSH tunables is by changing to a known profile. Those are:
- legacy: the legacy behavior from v0.47 (pre-Argonaut) and earlier.
- argonaut: the legacy values supported by the v0.48 (Argonaut) release.
- bobtail: the values supported by the v0.56 (Bobtail) release.
- firefly: the values supported by the v0.80 (Firefly) release.
- optimal: the current best values.
- default: the current default values for a new cluster.
You can select a profile on a running cluster with the command:
ceph osd crush tunables {PROFILE}
This may result in some data movement.
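As an illustrative example (the profile chosen here is arbitrary and the amount of data movement depends on the cluster), you might pin a cluster that still serves older kernel clients to the bobtail profile and then watch the resulting recovery:
# Switch the cluster to the bobtail tunables profile.
ceph osd crush tunables bobtail
# Watch cluster status while any resulting PG remapping and recovery completes.
ceph -w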
Generally, you should set the CRUSH tunables after you upgrade, or if you receive a warning. Starting with v0.74, Ceph issues a health warning if the CRUSH tunables are not set to their optimal values (the optimal values are the default as of v0.73). To make this warning go away, you have two options:
The first option is to adjust the tunables on the existing cluster. Note that this will result in some data movement (possibly as much as 10%). This is the preferred route, but should be taken with care on a production cluster where the data movement may affect performance. You can enable optimal tunables with:
ceph osd crush tunables optimal
If things go poorly (e.g., too much load) and not very much progress has been made, or there is a client compatibility problem (old kernel cephfs or rbd clients, or pre-bobtail librados clients), you can switch back to an earlier profile:
ceph osd crush tunables {profile}
For example, to restore the pre-v0.48 (Argonaut) values, execute:
ceph osd crush tunables legacy
The second option is to make the warning go away without making any changes to CRUSH. Add the following option to the [mon] section of your ceph.conf file:
mon warn on legacy crush tunables = false
For the change to take effect, you will need to restart the monitors, or apply the option to running monitors with:
ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables
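With either option, you can confirm that the warning has cleared by checking the cluster health:
ceph health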
11.3. Tuning CRUSH, the hard way
If you can ensure that all clients are running recent code, you can adjust the tunables by extracting the CRUSH map, modifying the values, and reinjecting it into the cluster.
Extract the latest CRUSH map:
ceph osd getcrushmap -o /tmp/crush
Adjust tunables. These values appear to offer the best behavior for both large and small clusters we tested with. You will need to additionally specify the --enable-unsafe-tunables argument to crushtool for this to work. Please use this option with extreme care:
crushtool -i /tmp/crush --set-choose-local-tries 0 --set-choose-local-fallback-tries 0 --set-choose-total-tries 50 -o /tmp/crush.new
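Before reinjecting, it can be worth sanity-checking the modified map offline with crushtool. The rule number and replica count below are placeholders for your own pools, and option names can vary slightly between crushtool versions:
# Decompile the new map to plain text for inspection.
crushtool -d /tmp/crush.new -o /tmp/crush.new.txt
# Simulate placements and report any inputs that fail to map to the requested number of OSDs.
crushtool -i /tmp/crush.new --test --rule 0 --num-rep 3 --show-bad-mappings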
Reinject modified map:
ceph osd setcrushmap -i /tmp/crush.new
11.4. Legacy values
For reference, the legacy values for the CRUSH tunables can be set with:
crushtool -i /tmp/crush --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy
Again, the special --enable-unsafe-tunables option is required. Further, as noted above, be careful running old versions of the ceph-osd daemon after reverting to legacy values as the feature bit is not perfectly enforced.
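For completeness, and reusing the commands from the previous section, the full revert round trip looks something like the following sketch (the temporary file paths are only examples):
# Extract the current map, rewrite it with the legacy tunable values, and reinject it.
ceph osd getcrushmap -o /tmp/crush
crushtool -i /tmp/crush --enable-unsafe-tunables --set-choose-local-tries 2 --set-choose-local-fallback-tries 5 --set-choose-total-tries 19 --set-chooseleaf-descend-once 0 --set-chooseleaf-vary-r 0 -o /tmp/crush.legacy
ceph osd setcrushmap -i /tmp/crush.legacy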