Chapter 6. CRUSH Hierarchies
The CRUSH map is a directed acyclic graph, so it may accommodate multiple hierarchies (e.g., performance domains). The easiest way to create and modify a CRUSH hierarchy is with the Ceph CLI; however, you can also decompile a CRUSH map, edit it, recompile it, and activate it.
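For example, a minimal decompile/edit/recompile cycle looks like the following; the file names are illustrative:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt, then recompile and activate it
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin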
When declaring a bucket instance with the Ceph CLI, you must specify its type and give it a unique name (string). Ceph will automatically assign a bucket ID, set the algorithm to straw, set the hash to 0 (reflecting rjenkins1) and set a weight. When modifying a decompiled CRUSH map, assign the bucket a unique ID expressed as a negative integer (optional), specify a weight relative to the total capacity/capability of its item(s), specify the bucket algorithm (usually straw), and the hash (usually 0, reflecting the rjenkins1 hash algorithm).
A bucket may have one or more items. The items may consist of node buckets (e.g., racks, rows, hosts) or leaves (e.g., an OSD disk). Items may have a weight that reflects the relative capacity or capability of the item.
When modifying a decompiled CRUSH map, you may declare a node bucket with the following syntax:
[bucket-type] [bucket-name] {
    id [a unique negative numeric ID]
    weight [the relative capacity/capability of the item(s)]
    alg [the bucket type: uniform | list | tree | straw ]
    hash [the hash type: 0 by default]
    item [item-name] weight [weight]
}
For example, consider a small hierarchy with two hosts in one rack: we would define two host buckets and one rack bucket. The OSDs are declared as items within the host buckets:
host node1 {
    id -1
    alg straw
    hash 0
    item osd.0 weight 1.00
    item osd.1 weight 1.00
}

host node2 {
    id -2
    alg straw
    hash 0
    item osd.2 weight 1.00
    item osd.3 weight 1.00
}

rack rack1 {
    id -3
    alg straw
    hash 0
    item node1 weight 2.00
    item node2 weight 2.00
}
In the foregoing example, note that the rack bucket does not contain any OSDs. Rather, it contains lower-level host buckets and includes the sum of their weights in its item entries.
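If you edit the map by hand as above, you can compile it and sanity-check the resulting mappings with crushtool before activating it. A minimal sketch, with illustrative file name, rule number and replica count:

crushtool -c crushmap.txt -o crushmap.bin
crushtool -i crushmap.bin --test --show-statistics --rule 0 --num-rep 2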
6.1. CRUSH Location
A CRUSH location is the position of an OSD in terms of the CRUSH map’s hierarchy. On the command line interface, a CRUSH location specifier takes the form of a list of name/value pairs describing the OSD’s position. For example, if an OSD is in a particular row, rack, chassis and host, and is part of the default CRUSH tree, its CRUSH location could be described as:
root=default row=a rack=a2 chassis=a2a host=a2a1
Note:
- The order of the keys does not matter.
- The key name (left of =) must be a valid CRUSH type. By default, these include root, datacenter, room, row, pod, pdu, rack, chassis and host. You may edit the CRUSH map to change the types to suit your needs.
- You do not need to specify all the buckets/keys. For example, by default, Ceph automatically sets a ceph-osd daemon’s location to be root=default host={HOSTNAME} (based on the output from hostname -s).
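A location specifier like the one above is consumed by commands that position OSDs in the hierarchy. For example, the following places osd.0 at that location; the weight value is illustrative:

ceph osd crush set osd.0 1.00 root=default row=a rack=a2 chassis=a2a host=a2a1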
6.1.1. ceph-crush-location hook
Upon startup, Ceph gets the CRUSH location of each daemon using the ceph-crush-location utility by default. The utility returns the CRUSH location of a given daemon. Its CLI usage is:
ceph-crush-location --cluster {cluster-name} --id {ID} --type {daemon-type}
For example, the following will return the location of osd.0:
ceph-crush-location --cluster ceph --id 0 --type osd
By default, the ceph-crush-location utility will return a CRUSH location string for a given daemon. The location returned is based, in order of precedence, on:
- A {TYPE}_crush_location option in the Ceph configuration file. For example, for OSD daemons, {TYPE} would be osd and the setting would look like osd_crush_location.
- A crush_location option for a particular daemon in the Ceph configuration file.
- A default of root=default host=HOSTNAME, where the hostname is returned by the hostname -s command.
In a typical deployment scenario, provisioning software (or the system administrator) can simply set the crush_location field in a host’s Ceph configuration file to describe that machine’s location within the datacenter or cluster. This provides location awareness to Ceph daemons and clients alike.
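For example, a host’s Ceph configuration file might carry a section like the following minimal sketch, which uses the {TYPE}_crush_location form from the precedence list above so that every OSD daemon on the host picks up the same location; the bucket names are illustrative:

[osd]
osd_crush_location = root=default row=a rack=a2 chassis=a2a host=a2a1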
6.1.2. Custom location hooks
A custom location hook can be used in place of the generic hook for OSD daemon placement in the hierarchy. (Upon startup, each OSD ensures that its position is correct.) Configure it in the Ceph configuration file:
osd_crush_location_hook = /path/to/script
This hook is passed several arguments (below) and should output a single line to stdout with the CRUSH location description:
ceph-crush-location --cluster {cluster-name} --id {ID} --type {daemon-type}
where the --cluster name is typically ceph, the --id is the daemon identifier (the OSD number), and the daemon --type is typically osd.
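As a rough sketch, a custom hook might derive the location from a site-specific file and fall back to the generic default. The file path and its expected contents below are assumptions for illustration, not a Ceph convention:

#!/bin/sh
# Hypothetical custom CRUSH location hook.
# Ceph invokes it with --cluster, --id and --type and expects a single
# CRUSH location line on stdout. This simple sketch ignores the arguments.
LOCATION_FILE="/etc/ceph/crush-location"   # assumed site-specific file, e.g. "root=ssd:root rack=ssd:row1-rack1"
if [ -r "$LOCATION_FILE" ]; then
    echo "$(cat "$LOCATION_FILE") host=$(hostname -s)"
else
    echo "root=default host=$(hostname -s)"
fi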
6.2. Add a Bucket
To add a bucket instance to your CRUSH hierarchy, specify the bucket name and its type. Bucket names must be unique in the CRUSH map.
ceph osd crush add-bucket {name} {type}
If you plan to use multiple hierarchies (e.g., for different hardware performance profiles), we recommend a colon-delimited naming convention of {type}:{name}, where {type} is the type of hardware or use case and {name} is the bucket name.
For example, you could create a hierarchy for solid-state drives (ssd), a hierarchy for SAS disks with SSD journals (hdd-journal), and another hierarchy for SATA drives (hdd):
ceph osd crush add-bucket ssd:root root
ceph osd crush add-bucket hdd-journal:root root
ceph osd crush add-bucket hdd:root root
The Ceph CLI will echo back:
added bucket ssd:root type root to crush map
added bucket hdd-journal:root type root to crush map
added bucket hdd:root type root to crush map
Add an instance of each bucket type you need for your hierarchy. In the following example, we will demonstrate adding buckets for a row with a rack of SSD hosts and a rack of hosts for object storage.
ceph osd crush add-bucket ssd:row1 row
ceph osd crush add-bucket ssd:row1-rack1 rack
ceph osd crush add-bucket ssd:row1-rack1-host1 host
ceph osd crush add-bucket ssd:row1-rack1-host2 host
ceph osd crush add-bucket hdd:row1 row
ceph osd crush add-bucket hdd:row1-rack2 rack
ceph osd crush add-bucket hdd:row1-rack2-host1 host
ceph osd crush add-bucket hdd:row1-rack2-host2 host
ceph osd crush add-bucket hdd:row1-rack2-host3 host
ceph osd crush add-bucket hdd:row1-rack2-host4 host
If you have already used ceph-deploy or another tool to add OSDs to your cluster, your host nodes may already be in your CRUSH map.
Once you have completed these steps, you can view your tree.
ceph osd tree
Notice that the hierarchy remains flat. You must move your buckets into hierarchical position after you add them to the CRUSH map.
6.3. Move a Bucket
When you create your initial cluster, Ceph will have a default CRUSH map with a root bucket named default, and your initial OSD hosts will appear under the default bucket. When you add a bucket instance to your CRUSH map, it appears in the CRUSH hierarchy, but it doesn’t necessarily appear under a particular bucket.
To move a bucket instance to a particular location in your CRUSH hierarchy, specify the bucket name and its new location as type=name pairs. For example:
ceph osd crush move ssd:row1 root=ssd:root
ceph osd crush move ssd:row1-rack1 row=ssd:row1
ceph osd crush move ssd:row1-rack1-host1 rack=ssd:row1-rack1
ceph osd crush move ssd:row1-rack1-host2 rack=ssd:row1-rack1
Once you have completed these steps, you can view your tree.
ceph osd tree
You can also use ceph osd crush create-or-move to create a location while moving an OSD.
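For example, the following places osd.0 under one of the SSD host buckets created above, creating any buckets named in the location if they do not already exist; the weight value is illustrative:

ceph osd crush create-or-move osd.0 1.00 root=ssd:root rack=ssd:row1-rack1 host=ssd:row1-rack1-host1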
6.4. Remove a Bucket
To remove a bucket instance from your CRUSH hierarchy, specify the bucket name. For example:
ceph osd crush remove {bucket-name}
Or:
ceph osd crush rm {bucket-name}
The bucket must be empty in order to remove it.
If you are removing higher-level buckets (e.g., a root like default), check to see if a pool uses a CRUSH rule that selects that bucket. If so, you will need to modify your CRUSH rules; otherwise, peering will fail.
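For example, you can inspect the rule a pool uses before removing the bucket; depending on your Ceph release, the pool property is named crush_ruleset or crush_rule:

ceph osd pool get {pool-name} crush_ruleset
ceph osd crush rule dump {rule-name}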
6.5. Bucket Algorithms (Advanced)
When you create buckets using the Ceph CLI, Ceph sets the algorithm to straw by default. Ceph supports four bucket algorithms, each representing a tradeoff between performance and reorganization efficiency. If you are unsure of which bucket type to use, we recommend using a straw bucket. The bucket algorithms are:
- Uniform: Uniform buckets aggregate devices with exactly the same weight. For example, when firms commission or decommission hardware, they typically do so with many machines that have exactly the same physical configuration (e.g., bulk purchases). When storage devices have exactly the same weight, you may use the uniform bucket type, which allows CRUSH to map replicas into uniform buckets in constant time (see the sketch after this list). With non-uniform weights, you should use another bucket algorithm.
- List: List buckets aggregate their content as linked lists. Based on the RUSH P (Replication Under Scalable Hashing) algorithm, a list is a natural and intuitive choice for an expanding cluster: either an object is relocated to the newest device with some appropriate probability, or it remains on the older devices as before. The result is optimal data migration when items are added to the bucket. Items removed from the middle or tail of the list, however, can result in a significant amount of unnecessary movement, making list buckets most suitable for circumstances in which they never (or very rarely) shrink.
- Tree: Tree buckets use a binary search tree. They are more efficient than list buckets when a bucket contains a larger set of items. Based on the RUSH R (Replication Under Scalable Hashing) algorithm, tree buckets reduce the placement time to O(log n), making them suitable for managing much larger sets of devices or nested buckets.
- Straw (default): List and tree buckets use a divide-and-conquer strategy in a way that either gives certain items precedence (e.g., those at the beginning of a list) or obviates the need to consider entire subtrees of items at all. That improves the performance of the replica placement process, but can also introduce suboptimal reorganization behavior when the contents of a bucket change due to an addition, removal, or re-weighting of an item. The straw bucket type allows all items to fairly “compete” against each other for replica placement through a process analogous to a draw of straws.
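For instance, in a decompiled CRUSH map, a host whose devices all have identical weights could be declared as a uniform bucket using the declaration syntax shown earlier in this chapter; the bucket ID and OSD names below are illustrative:

host node3 {
    id -4
    alg uniform
    hash 0
    item osd.4 weight 1.00
    item osd.5 weight 1.00
}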