Chapter 5. Erasure code pools overview
Ceph uses replicated pools by default, meaning that Ceph copies every object from a primary OSD node to one or more secondary OSDs. Erasure-coded pools reduce the amount of disk space required to ensure data durability, but they are computationally somewhat more expensive than replication.
Ceph storage strategies involve defining data durability requirements. Data durability means the ability to sustain the loss of one or more OSDs without losing data.
Ceph stores data in pools, and there are two types of pools:
- replicated
- erasure-coded
Erasure coding is a method of storing an object durably in the Ceph storage cluster, where the erasure code algorithm breaks the object into data chunks (k) and coding chunks (m), and stores those chunks in different OSDs.
In the event of the failure of an OSD, Ceph retrieves the remaining data (k) and coding (m) chunks from the other OSDs, and the erasure code algorithm restores the object from those chunks.
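For illustration only (the profile name myprofile and the 4+2 values below are placeholders, not part of the procedure later in this chapter), a profile with k=4 and m=2 splits every object into six chunks, and any four of those chunks are enough to rebuild the object:
# Hypothetical profile: 4 data chunks and 2 coding chunks, one chunk per host.
ceph osd erasure-code-profile set myprofile k=4 m=2 crush-failure-domain=host
# Objects in a pool that uses this profile are stored as 6 chunks on 6 different OSDs
# and remain recoverable after the loss of any 2 of those OSDs.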
Red Hat recommends setting min_size for erasure-coded pools to k+1 or more to prevent the loss of writes and data.
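For example (the pool name ecpool here is a placeholder), for a profile with k=2 you would set min_size to at least k+1 = 3:
# Set min_size to k+1 = 3 for a profile with k=2.
ceph osd pool set ecpool min_size 3
# Confirm the setting.
ceph osd pool get ecpool min_size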
Erasure coding uses storage capacity more efficiently than replication. The n-replication approach maintains n copies of an object (3x by default in Ceph), whereas erasure coding maintains only k + m chunks. For example, 3 data and 2 coding chunks use approximately 1.67x ((k+m)/k = 5/3) the storage space of the original object.
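The raw-space multiplier of an erasure-coded pool is (k+m)/k. A quick sketch of the arithmetic for the 3+2 example, compared with 3x replication:
# Prints: EC 3+2 uses 1.67x raw space; 3x replication uses 3.00x
awk 'BEGIN { k=3; m=2; printf "EC %d+%d uses %.2fx raw space; 3x replication uses 3.00x\n", k, m, (k+m)/k }'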
While erasure coding uses less storage overhead than replication, the erasure code algorithm uses more RAM and CPU than replication when it accesses or recovers objects. Erasure coding is advantageous when data storage must be durable and fault tolerant but does not require fast read performance (for example, cold storage, historical records, and so on).
For a detailed mathematical explanation of how erasure coding works in Ceph, see the Ceph Erasure Coding section in the Architecture Guide for Red Hat Ceph Storage 8.
Ceph creates a default erasure code profile when initializing a cluster, with k=2 and m=2. This means that Ceph spreads the object data over four OSDs (k+m = 4) and can lose two of those OSDs without losing data. To learn more about erasure code profiles, see the Erasure Code Profiles section.
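You can inspect the default profile on a running cluster to confirm its k and m values; the exact output depends on the Ceph release:
ceph osd erasure-code-profile get default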
Configure only the .rgw.buckets pool as erasure-coded and all other Ceph Object Gateway pools as replicated; otherwise, an attempt to create a new bucket fails with the following error:
set_req_state_err err_no=95 resorting to 500
The reason for this is that erasure-coded pools do not support omap operations, and certain Ceph Object Gateway metadata pools require omap support.
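To check which existing pools are erasure-coded and which are replicated, list the pool details; Ceph Object Gateway pool names, such as the zone-specific .rgw.buckets pools, depend on your zone configuration:
# Each pool line reports either 'replicated' or 'erasure'.
ceph osd pool ls detail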
5.1. Creating a sample erasure-coded pool
Create an erasure-coded pool and specify the placement groups. The ceph osd pool create command creates an erasure-coded pool with the default profile, unless another profile is specified. Profiles define the redundancy of data by setting two parameters, k and m. These parameters define the number of chunks into which a piece of data is split and the number of coding chunks that are created.
The simplest erasure-coded pool is equivalent to RAID 5. The following procedure creates an erasure-coded pool with a 2+2 profile, which requires at least four hosts.
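As a brief sketch of the command forms used in the procedure (POOL_NAME and PROFILE_NAME are placeholders):
# Create an erasure-coded pool with the default profile.
ceph osd pool create POOL_NAME 32 32 erasure
# Create an erasure-coded pool with a previously defined profile.
ceph osd pool create POOL_NAME 32 32 erasure PROFILE_NAME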
Procedure
Set the following configuration for an erasure-coded pool on four nodes with 2+2 configuration.
Syntax
ceph config set mon mon_osd_down_out_subtree_limit host
ceph config set osd osd_async_recovery_min_cost 1099511627776
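Optionally, confirm the values that were set; the output is cluster-specific:
ceph config get mon mon_osd_down_out_subtree_limit
ceph config get osd osd_async_recovery_min_cost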
Important: This is not needed for an erasure-coded pool in general.
Important: The async recovery cost is the number of PG log entries behind on the replica and the number of missing objects. The osd_target_pg_log_entries_per_osd is 30000. Hence, an OSD with a single PG could have 30000 entries. Since osd_async_recovery_min_cost is a 64-bit integer, set the value of osd_async_recovery_min_cost to 1099511627776 for an EC pool with a 2+2 configuration.
Note: For an EC cluster with four nodes, the value of k+m is 2+2. If a node fails completely, the pool does not recover, because four chunks are required but only three nodes are available. When you set the value of mon_osd_down_out_subtree_limit to host, then during a host-down scenario it prevents the OSDs from being marked out, which prevents the data from rebalancing, and Ceph waits until the node is up again.
For an erasure-coded pool with a 2+2 configuration, set the profile.
Syntax
ceph osd erasure-code-profile set ec22 k=2 m=2 crush-failure-domain=host
Example
[ceph: root@host01 /]# ceph osd erasure-code-profile set ec22 k=2 m=2 crush-failure-domain=host
Pool : ceph osd pool create test-ec-22 erasure ec22
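Optionally, verify the new profile; the command prints the k, m, and crush-failure-domain values that were set:
[ceph: root@host01 /]# ceph osd erasure-code-profile get ec22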
Create an erasure-coded pool.
Example
[ceph: root@host01 /]# ceph osd pool create ecpool 32 32 erasure
pool 'ecpool' created
$ echo ABCDEFGHI | rados --pool ecpool put NYAN -
$ rados --pool ecpool get NYAN -
ABCDEFGHI
32 is the number of placement groups.
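Optionally, confirm that the new pool uses an erasure-code profile and that its min_size follows the k+1 recommendation from earlier in this chapter:
[ceph: root@host01 /]# ceph osd pool get ecpool erasure_code_profile
[ceph: root@host01 /]# ceph osd pool get ecpool min_size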