Chapter 35. Erasure Code Plugins (Advanced)


Ceph supports erasure coding with a plug-in architecture, which means you can create erasure coded pools using different types of algorithms. Ceph supports:

  • Jerasure (Default)
  • Locally repairable erasure code (LRC)
  • ISA (Intel only)

The following sections describe these plug-ins in greater detail.

35.1. Jerasure erasure code plugin

The jerasure plugin is the most generic and flexible plugin; it is also the default for Ceph erasure-coded pools.

The jerasure plugin encapsulates the Jerasure library. It is recommended to read the Jerasure documentation to get a better understanding of the parameters.

To create a new jerasure erasure code profile:

ceph osd erasure-code-profile set <name> \
     plugin=jerasure \
     k=<data-chunks> \
     m=<coding-chunks> \
     technique=<reed_sol_van|reed_sol_r6_op|cauchy_orig|cauchy_good|liberation|blaum_roth|liber8tion> \
     [ruleset-root=<root>] \
     [ruleset-failure-domain=<bucket-type>] \
     [directory=<directory>] \
     [--force]

Where:

k=<data chunks>

Description
Each object is split into data-chunks parts, each stored on a different OSD.
Type
Integer
Required
Yes.
Example
4

m=<coding-chunks>

Description
Compute coding chunks for each object and store them on different OSDs. The number of coding chunks is also the number of OSDs that can be down without losing data.
Type
Integer
Required
Yes.
Example
2

technique=<reed_sol_van or reed_sol_r6_op or cauchy_orig or cauchy_good or liberation or blaum_roth or liber8tion>

Description
The most flexible technique is reed_sol_van: it is enough to set k and m. The cauchy_good technique can be faster, but you need to choose the packetsize carefully. All of reed_sol_r6_op, liberation, blaum_roth and liber8tion are RAID6 equivalents in the sense that they can only be configured with m=2.
Type
String
Required
No.
Default
reed_sol_van

packetsize=<bytes>

Description
The encoding will be done on packets of bytes size at a time. Choosing the right packet size is difficult. The Jerasure documentation contains extensive information on this topic.
Type
Integer
Required
No.
Default
2048

ruleset-root=<root>

Description
The name of the crush bucket used for the first step of the ruleset. For instance, step take default.
Type
String
Required
No.
Default
default

ruleset-failure-domain=<bucket-type>

Description
Ensure that no two chunks are in a bucket with the same failure domain. For instance, if the failure domain is host, no two chunks will be stored on the same host. It is used to create a ruleset step such as step chooseleaf host.
Type
String
Required
No.
Default
host

directory=<directory>

Description
Set the directory name from which the erasure code plugin is loaded.
Type
String
Required
No.
Default
/usr/lib/ceph/erasure-code

--force

Description
Override an existing profile by the same name.
Type
String
Required
No.
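
For example, the following commands create a jerasure profile with k=4 and m=2 and a pool that uses it; the profile and pool names (jerasure42, ecpool) are illustrative. With k=4 and m=2, each object is split into four data chunks plus two coding chunks, giving a raw space overhead of (4+2)/4 = 1.5 while tolerating the loss of any two OSDs:

$ ceph osd erasure-code-profile set jerasure42 \
     plugin=jerasure \
     k=4 m=2 \
     technique=reed_sol_van \
     ruleset-failure-domain=host
$ ceph osd pool create ecpool 12 12 erasure jerasure42
$ ceph osd erasure-code-profile get jerasure42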

35.2. Locally Repairable Erasure Code (LRC) Plugin

With the jerasure plugin, when Ceph stores an erasure-coded object on multiple OSDs, recovering from the loss of one OSD requires reading from all the others. For instance if you configure jerasure with k=8 and m=4, losing one OSD requires reading from the eleven others to repair.

The lrc erasure code plugin creates local parity chunks to be able to recover using fewer OSDs. For instance if you configure lrc with k=8, m=4 and l=4, it will create an additional parity chunk for every four OSDs. When Ceph loses a single OSD, it can recover the object data with only four OSDs instead of eleven.
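
For reference, the k=8, m=4 and l=4 layout described above could be created as follows; the profile and pool names (LRCprofile12, lrcpool12) are illustrative. Note that the local parity chunks are added on top of k and m, so this layout stores k + m + (k+m)/l = 15 chunks per object, each on a different OSD:

$ ceph osd erasure-code-profile set LRCprofile12 \
     plugin=lrc \
     k=8 m=4 l=4 \
     ruleset-failure-domain=host
$ ceph osd pool create lrcpool12 12 12 erasure LRCprofile12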

Although it is probably not an interesting use case when all hosts are connected to the same switch, you can actually observe reduced bandwidth usage between racks when using the lrc erasure code plugin.

$ ceph osd erasure-code-profile set LRCprofile \
     plugin=lrc \
     k=4 m=2 l=3 \
     ruleset-failure-domain=host
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile

In v0.80.x, you will observe the reduced bandwidth only if the primary OSD is in the same rack as the lost chunk:

$ ceph osd erasure-code-profile set LRCprofile \
     plugin=lrc \
     k=4 m=2 l=3 \
     ruleset-locality=rack \
     ruleset-failure-domain=host
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile

35.2.1. Create an LRC Profile

To create a new LRC erasure code profile:

ceph osd erasure-code-profile set <name> \
     plugin=lrc \
     k=<data-chunks> \
     m=<coding-chunks> \
     l=<locality> \
     [ruleset-root=<root>] \
     [ruleset-locality=<bucket-type>] \
     [ruleset-failure-domain=<bucket-type>] \
     [directory=<directory>] \
     [--force]

Where:

k=<data chunks>

Description
Each object is split into data-chunks parts, each stored on a different OSD.
Type
Integer
Required
Yes.
Example
4

m=<coding-chunks>

Description
Compute coding chunks for each object and store them on different OSDs. The number of coding chunks is also the number of OSDs that can be down without losing data.
Type
Integer
Required
Yes.
Example
2

l=<locality>

Description
Group the coding and data chunks into sets of size locality. For instance, for k=4 and m=2, when locality=3 two groups of three are created. Each set can be recovered without reading chunks from another set.
Type
Integer
Required
Yes.
Example
3

ruleset-root=<root>

Description
The name of the crush bucket used for the first step of the ruleset. For instance, step take default.
Type
String
Required
No.
Default
default

ruleset-locality=<bucket-type>

Description
The type of the crush bucket in which each set of chunks defined by l will be stored. For instance, if it is set to rack, each group of l chunks will be placed in a different rack. It is used to create a ruleset step such as step choose rack. If it is not set, no such grouping is done.
Type
String
Required
No.

ruleset-failure-domain=<bucket-type>

Description
Ensure that no two chunks are in a bucket with the same failure domain. For instance, if the failure domain is host, no two chunks will be stored on the same host. It is used to create a ruleset step such as step chooseleaf host.
Type
String
Required
No.
Default
host

directory=<directory>

Description
Set the directory name from which the erasure code plugin is loaded.
Type
String
Required
No.
Default
/usr/lib/ceph/erasure-code

--force

Description
Override an existing profile by the same name.
Type
String
Required
No.
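
For example, to illustrate --force, re-running the command with an existing profile name overrides the previous definition. The change of ruleset-failure-domain shown here is purely illustrative; as a rule, a profile that is already referenced by a pool should not be modified:

$ ceph osd erasure-code-profile set LRCprofile \
     plugin=lrc \
     k=4 m=2 l=3 \
     ruleset-failure-domain=osd \
     --force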

35.2.2. Create an LRC Profile (low-level)

The high-level k, m and l parameters require that the sum of k and m be a multiple of l; for instance, k=4, m=2 and l=3 gives (4+2)/3 = 2 local groups. The low-level configuration parameters do not impose such a restriction, and it may be more convenient to use them for specific purposes. It is, for instance, possible to define two groups, one with 4 chunks and another with 3 chunks. It is also possible to define locality sets recursively, for instance datacenters and racks within datacenters. The high-level k/m/l parameters are implemented by generating a low-level configuration.

The lrc erasure code plugin recursively applies erasure code techniques so that recovering from the loss of some chunks only requires a subset of the available chunks, most of the time.

For instance, when three coding steps are described as:

chunk nr    01234567
step 1      _cDD_cDD
step 2      cDDD____
step 3      ____cDDD

where c are coding chunks calculated from the data chunks D, the loss of chunk 7 can be recovered with the last four chunks, and the loss of chunk 2 can be recovered with the first four chunks.

The minimal testing scenario is strictly equivalent to using the default erasure code profile. The DD implies k=2, the c implies m=1, and the jerasure plugin is used by default.

$ ceph osd erasure-code-profile set LRCprofile \
     plugin=lrc \
     mapping=DD_ \
     layers='[ [ "DDc", "" ] ]'
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile
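
To confirm the pool behaves as expected, you can store and retrieve a test object with rados; the object name and files used here are arbitrary:

$ rados --pool lrcpool put testobject /etc/hosts
$ rados --pool lrcpool get testobject /tmp/testobject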

The lrc plugin is particularly useful for reducing inter-rack bandwidth usage. Although it is probably not an interesting use case when all hosts are connected to the same switch, reduced bandwidth usage can actually be observed. The following configuration is equivalent to k=4, m=2 and l=3, although the layout of the chunks is different:

$ ceph osd erasure-code-profile set LRCprofile \
     plugin=lrc \
     mapping=__DD__DD \
     layers='[
               [ "_cDD_cDD", "" ],
               [ "cDDD____", "" ],
               [ "____cDDD", "" ],
             ]'
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile

In Firefly, the reduced bandwidth will be observed only if the primary OSD is in the same rack as the lost chunk:

$ ceph osd erasure-code-profile set LRCprofile \
     plugin=lrc \
     mapping=__DD__DD \
     layers='[
               [ "_cDD_cDD", "" ],
               [ "cDDD____", "" ],
               [ "____cDDD", "" ],
             ]' \
     ruleset-steps='[
                     [ "choose", "rack", 2 ],
                     [ "chooseleaf", "host", 4 ],
                    ]'
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile

LRC now uses jerasure as the default EC backend. It is possible to specify the EC backend and algorithm on a per-layer basis using the low-level configuration. The second argument in layers='[ [ "DDc", "" ] ]' is actually an erasure code profile to be used for this level. The example below specifies the ISA backend with the Cauchy technique to be used in the lrcpool:

$ ceph osd erasure-code-profile set LRCprofile \
     plugin=lrc \
     mapping=DD_ \
     layers='[ [ "DDc", "plugin=isa technique=cauchy" ] ]'
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile

You could also use a different erasure code profile for each layer:

$ ceph osd erasure-code-profile set LRCprofile \
     plugin=lrc \
     mapping=__DD__DD \
     layers='[
               [ "_cDD_cDD", "plugin=isa technique=cauchy" ],
               [ "cDDD____", "plugin=isa" ],
               [ "____cDDD", "plugin=jerasure" ],
             ]'
$ ceph osd pool create lrcpool 12 12 erasure LRCprofile

35.3. Controlling CRUSH Placement

The default CRUSH ruleset provides OSDs that are on different hosts. For instance:

chunk nr    01234567
step 1      _cDD_cDD
step 2      cDDD____
step 3      ____cDDD

needs exactly 8 OSDs, one for each chunk. If the hosts are in two adjacent racks, the first four chunks can be placed in the first rack and the last four in the second rack. Recovering from the loss of a single OSD does not require using bandwidth between the two racks.

For instance:

ruleset-steps='[ [ "choose", "rack", 2 ], [ "chooseleaf", "host", 4 ] ]'

will create a ruleset that selects two crush buckets of type rack and, for each of them, chooses four OSDs, each located in a different bucket of type host.
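
After a pool is created from such a profile, the generated rule can be inspected; the rule name normally matches the pool name (lrcpool in the earlier examples):

$ ceph osd crush rule ls
$ ceph osd crush rule dump lrcpool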

The ruleset can also be manually crafted for finer control.
