Chapter 6. CRUSH Hierarchies


The CRUSH map is a directed acyclic graph, so it may accommodate multiple hierarchies (e.g., performance domains). The easiest way to create and modify a CRUSH hierarchy is with the Ceph CLI; however, you can also decompile a CRUSH map, edit it, recompile it, and activate it.
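
For reference, the decompile, edit, recompile, and activate workflow looks roughly like this (the file names are arbitrary):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt with a text editor
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin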

When declaring a bucket instance with the Ceph CLI, you must specify its type and give it a unique name (string). Ceph automatically assigns a bucket ID, sets the algorithm to straw, sets the hash to 0 (reflecting rjenkins1), and sets a weight. When modifying a decompiled CRUSH map, assign the bucket a unique ID expressed as a negative integer (optional), specify a weight relative to the total capacity/capability of its item(s), specify the bucket algorithm (usually straw), and specify the hash (usually 0, reflecting the rjenkins1 hash algorithm).

A bucket may have one or more items. The items may consist of node buckets (e.g., racks, rows, hosts) or leaves (e.g., an OSD disk). Each item may have a weight that reflects its relative capacity or capability.

When modifying a decompiled CRUSH map, you may declare a node bucket with the following syntax:

[bucket-type] [bucket-name] {
    id [a unique negative numeric ID]
    weight [the relative capacity/capability of the item(s)]
    alg [the bucket algorithm: uniform | list | tree | straw ]
    hash [the hash type: 0 by default]
    item [item-name] weight [weight]
}

For example, consider a small hierarchy of two hosts, each with two OSDs, in a single rack. We would define two host buckets and one rack bucket; the OSDs are declared as items within the host buckets:

host node1 {
    id -1
    alg straw
    hash 0
    item osd.0 weight 1.00
    item osd.1 weight 1.00
}

host node2 {
    id -2
    alg straw
    hash 0
    item osd.2 weight 1.00
    item osd.3 weight 1.00
}

rack rack1 {
    id -3
    alg straw
    hash 0
    item node1 weight 2.00
    item node2 weight 2.00
}
Note

In the foregoing example, note that the rack bucket does not contain any OSDs directly. Rather, it contains lower-level host buckets, and each item entry carries that host's total weight (the sum of its OSD weights).

6.1. CRUSH Location

A CRUSH location is the position of an OSD within the CRUSH map’s hierarchy. On the command line interface, a CRUSH location specifier takes the form of a list of name/value pairs describing the OSD’s position. For example, if an OSD is in a particular row, rack, chassis and host, and is part of the default CRUSH tree, its CRUSH location could be described as:

root=default row=a rack=a2 chassis=a2a host=a2a1

Note

  1. The order of the keys does not matter.
  2. The key name (left of = ) must be a valid CRUSH type. By default these include root, datacenter, room, row, pod, pdu, rack, chassis and host. You may edit the CRUSH map to change the types to suit your needs.
  3. You do not need to specify all the buckets/keys. For example, by default, Ceph automatically sets a ceph-osd daemon’s location to be root=default host={HOSTNAME} (based on the output from hostname -s).

6.1.1. ceph-crush-location hook

Upon startup, Ceph will get the CRUSH location of each daemon using the ceph-crush-location tool by default. The ceph-crush-location utility returns the CRUSH location of a given daemon. Its CLI usage is:

ceph-crush-location --cluster {cluster-name} --id {ID} --type {daemon-type}

For example, the following will return the location of osd.0:

ceph-crush-location --cluster ceph --id 0 --type osd

By default, the ceph-crush-location utility returns a CRUSH location string for a given daemon. The location is determined by the following, in order of precedence:

  1. A {TYPE}_crush_location option in the Ceph configuration file. For example, for OSD daemons, {TYPE} would be osd and the setting would look like osd_crush_location.
  2. A crush_location option for a particular daemon in the Ceph configuration file.
  3. A default of root=default host=HOSTNAME where the hostname is returned by the hostname -s command.

In a typical deployment scenario, provisioning software (or the system administrator) can simply set the crush_location field in a host’s Ceph configuration file to describe that machine’s location within the datacenter or cluster. This provides location awareness to Ceph daemons and clients alike.
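
For example, a minimal ceph.conf sketch; the option name follows the {TYPE}_crush_location form described above, and the bucket names are reused from the earlier location example, so substitute your own:

[osd]
osd_crush_location = root=default row=a rack=a2 chassis=a2a host=a2a1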

6.1.2. Custom location hooks

A custom location hook can be used in place of the generic hook for OSD daemon placement in the hierarchy (on startup, each OSD ensures its position is correct):

osd_crush_location_hook = /path/to/script

This hook is passed several arguments (the same ones ceph-crush-location accepts, shown below) and should output a single line to stdout with the CRUSH location description:

--cluster {cluster-name} --id {ID} --type {daemon-type}

where the --cluster name is typically ceph, the --id is the daemon identifier (the OSD number), and the daemon --type is typically osd.
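
For illustration only, a minimal hook sketch might ignore these arguments and report a fixed location for every OSD on the host; the rack name here is an assumption, so substitute buckets from your own hierarchy:

#!/bin/sh
# Hypothetical example hook. It receives --cluster, --id, and --type,
# but this simple version ignores them and reports one location for
# every OSD on this host.
echo "root=default rack=rack1 host=$(hostname -s)"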

6.2. Add a Bucket

To add a bucket instance to your CRUSH hierarchy, specify the bucket name and its type. Bucket names must be unique in the CRUSH map.

ceph osd crush add-bucket {name} {type}

If you plan to use multiple hierarchies (e.g., for different hardware performance profiles), we recommend a colon-delimited naming convention of {type}:{name}, where {type} is the type of hardware or use case and {name} is the bucket name.

For example, you could create a hierarchy for solid state drives (ssd), a hierarchy for SAS disks with SSD journals (hdd-journal), and another hierarchy for SATA drives (hdd):

ceph osd crush add-bucket ssd:root root
ceph osd crush add-bucket hdd-journal:root root
ceph osd crush add-bucket hdd:root root

The Ceph CLI will echo back:

added bucket ssd:root type root to crush map
added bucket hdd-journal:root type root to crush map
added bucket hdd:root type root to crush map

Add an instance of each bucket type you need for your hierarchy. In the following example, we will demonstrate adding buckets for a row with a rack of SSD hosts and a rack of hosts for object storage.

ceph osd crush add-bucket ssd:row1 row
ceph osd crush add-bucket ssd:row1-rack1 rack
ceph osd crush add-bucket ssd:row1-rack1-host1 host
ceph osd crush add-bucket ssd:row1-rack1-host2 host
ceph osd crush add-bucket hdd:row1 row
ceph osd crush add-bucket hdd:row1-rack2 rack
ceph osd crush add-bucket hdd:row1-rack2-host1 host
ceph osd crush add-bucket hdd:row1-rack2-host2 host
ceph osd crush add-bucket hdd:row1-rack2-host3 host
ceph osd crush add-bucket hdd:row1-rack2-host4 host
Note

If you have already used ceph-deploy or another tool to add OSDs to your cluster, your host nodes may already be in your CRUSH map.

Once you have completed these steps, you can view your tree.

ceph osd tree

Notice that the hierarchy remains flat. You must move your buckets into hierarchical position after you add them to the CRUSH map.

6.3. Move a Bucket

When you create your initial cluster, Ceph will have a default CRUSH map with a root bucket named default and your initial OSD hosts will appear under the default bucket. When you add a bucket instance to your CRUSH map, it appears in the CRUSH hierarchy, but it doesn’t necessarily appear under a particular bucket.

To move a bucket instance to a particular location in your CRUSH hierarchy, specify the bucket name and its new location as one or more type=name pairs identifying the parent bucket. For example:

ceph osd crush move ssd:row1 root=ssd:root
ceph osd crush move ssd:row1-rack1 row=ssd:row1
ceph osd crush move ssd:row1-rack1-host1 rack=ssd:row1-rack1
ceph osd crush move ssd:row1-rack1-host2 rack=ssd:row1-rack1

Once you have completed these steps, you can view your tree.

ceph osd tree
Note

You can also use ceph osd crush create-or-move to create a location while moving an OSD.
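
For example, a sketch that assumes osd.0 should live under one of the SSD hosts created above, with an assumed CRUSH weight of 1.0:

ceph osd crush create-or-move osd.0 1.0 root=ssd:root rack=ssd:row1-rack1 host=ssd:row1-rack1-host1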

6.4. Remove a Bucket

To remove a bucket instance from your CRUSH hierarchy, specify the bucket name. For example:

ceph osd crush remove {bucket-name}

Or:

ceph osd crush rm {bucket-name}
Note

The bucket must be empty in order to remove it.

If you are removing higher level buckets (e.g., a root like default), check to see if a pool uses a CRUSH rule that selects that bucket. If so, you will need to modify your CRUSH rules; otherwise, peering will fail.
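
For example, to remove one of the (empty) host buckets created earlier:

ceph osd crush remove ssd:row1-rack1-host2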

6.5. Bucket Algorithms (Advanced)

When you create buckets using the Ceph CLI, Ceph sets the algorithm to straw by default. Ceph supports four bucket algorithms, each representing a tradeoff between performance and reorganization efficiency. If you are unsure of which bucket type to use, we recommend using a straw bucket. The bucket algorithms are:

  1. Uniform: Uniform buckets aggregate devices with exactly the same weight. For example, when firms commission or decommission hardware, they typically do so with many machines that have exactly the same physical configuration (e.g., bulk purchases). When storage devices have exactly the same weight, you may use the uniform bucket type, which allows CRUSH to map replicas into uniform buckets in constant time (see the sketch after this list). With non-uniform weights, you should use another bucket algorithm.
  2. List: List buckets aggregate their content as linked lists. Based on the RUSH (Replication Under Scalable Hashing) P algorithm, a list is a natural and intuitive choice for an expanding cluster: either an object is relocated to the newest device with some appropriate probability, or it remains on the older devices as before. The result is optimal data migration when items are added to the bucket. Items removed from the middle or tail of the list, however, can result in a significant amount of unnecessary movement, making list buckets most suitable for circumstances in which they never (or very rarely) shrink.
  3. Tree: Tree buckets use a binary search tree. They are more efficient than list buckets when a bucket contains a larger set of items. Based on the RUSH (Replication Under Scalable Hashing) R algorithm, tree buckets reduce the placement time to O(log n), making them suitable for managing much larger sets of devices or nested buckets.
  4. Straw (default): List and Tree buckets use a divide and conquer strategy in a way that either gives certain items precedence (e.g., those at the beginning of a list) or obviates the need to consider entire subtrees of items at all. That improves the performance of the replica placement process, but can also introduce suboptimal reorganization behavior when the contents of a bucket change due to an addition, removal, or re-weighting of an item. The straw bucket type allows all items to fairly “compete” against each other for replica placement through a process analogous to a draw of straws.
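
When editing a decompiled CRUSH map, you select the algorithm per bucket with the alg line. A minimal sketch of a uniform bucket, assuming a hypothetical host node3 with two additional, identically weighted OSDs (osd.4 and osd.5):

host node3 {
    id -4
    alg uniform
    hash 0
    item osd.4 weight 1.00
    item osd.5 weight 1.00
}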