Chapter 4. Stretch clusters for Ceph storage
As a storage administrator, you can configure stretch clusters by entering stretch mode with 2-site clusters.
Red Hat Ceph Storage can withstand the loss of Ceph OSDs because it expects its network and cluster to be equally reliable, with failures randomly distributed across the CRUSH map. If a number of OSDs shut down, the remaining OSDs and monitors continue to operate.
However, this might not be the best solution for some stretched cluster configurations where a significant part of the Ceph cluster depends on a single network component. An example is a single cluster located in multiple data centers, for which the user wants to sustain the loss of a full data center.
The standard configuration uses two data centers. Other configurations are in clouds or availability zones. Each site holds two copies of the data, so the replication size is four. A third site should host a tiebreaker monitor, which can be a virtual machine and can have higher latency than the main sites. This monitor chooses one of the sites to restore data if the network connection fails and both data centers remain active.
The standard Ceph configuration survives many failures of the network or data centers and never compromises data consistency. If you restore enough Ceph servers following a failure, the cluster recovers. Ceph maintains availability if you lose a data center but can still form a quorum of monitors and have all the data available with enough copies to satisfy the pools' min_size, or CRUSH rules that replicate again to meet the size.
There are no additional steps to power down a stretch cluster. See Powering down and rebooting Red Hat Ceph Storage cluster for more information.
Stretch cluster failures
Red Hat Ceph Storage never compromises on data integrity and consistency. If there is a network failure or a loss of nodes and the services can still be restored, Ceph returns to normal functionality on its own.
However, there are situations where you lose data availability even if you have enough servers available to meet Ceph’s consistency and sizing constraints, or where you unexpectedly do not meet the constraints.
The first important type of failure is caused by inconsistent networks. If there is a network split, Ceph might be unable to mark an OSD as down to remove it from the acting placement group (PG) sets, despite the primary OSD being unable to replicate data. When this happens, I/O is not permitted because Ceph cannot meet its durability guarantees.
The second important category of failures is when it appears that you have data replicated across data centers, but the constraints are not sufficient to guarantee this. For example, you might have data centers A and B, and the CRUSH rule targets three copies and places a copy in each data center with a min_size of 2. The PG might go active with two copies in site A and no copies in site B, which means that if you lose site A, you lose the data and Ceph cannot operate on it. This situation is difficult to avoid with standard CRUSH rules.
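The second failure category can be illustrated with a small Python sketch. It models a placement group only as a list of site labels, one per available copy. The function names and the simplified activation check are assumptions made for illustration and do not reflect Ceph internals.

```python
def can_go_active(replica_sites, min_size):
    """Simplified standard check: enough copies exist, regardless of site."""
    return len(replica_sites) >= min_size


def survives_site_loss(replica_sites, lost_site):
    """True if at least one copy remains once a whole site is lost."""
    return any(site != lost_site for site in replica_sites)


# Hypothetical state from the example above: the rule targets three
# copies, but only the two copies in site A are currently available.
acting = ["A", "A"]

print(can_go_active(acting, min_size=2))   # min_size is satisfied
print(survives_site_loss(acting, "A"))     # no copy is left in site B
```

In this model the PG is allowed to go active even though losing site A would leave no copy, which is the gap that stretch mode is meant to close.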
4.1. Stretch mode for a storage cluster
To configure stretch clusters, you must enter the stretch mode. When stretch mode is enabled, the Ceph OSDs only take PGs as active when they peer across data centers, or whichever other CRUSH bucket type you specified, assuming both are active. Pools increase in size from the default three to four, with two copies on each site.
In stretch mode, Ceph OSDs are only allowed to connect to monitors within the same data center. New monitors are not allowed to join the cluster without a specified location.
If all the OSDs and monitors from a data center become inaccessible at once, the surviving data center enters a degraded stretch mode. This issues a warning, reduces the min_size to 1, and allows the cluster to reach an active state with the data from the remaining site.
The degraded state also triggers warnings that the pools are too small, because the pool size does not get changed. However, a special stretch mode flag prevents the OSDs from creating extra copies in the remaining data center, so it still keeps two copies.
When the missing data center becomes accessible again, the cluster enters recovery stretch mode. This changes the warning and allows peering, but still requires only the OSDs from the data center that was up the whole time.
When all PGs are in a known state and are not degraded or incomplete, the cluster goes back to the regular stretch mode, ends the warning, and restores min_size to its starting value of 2. The cluster again requires both sites to peer, not only the site that stayed up the whole time, so you can fail over to the other site, if necessary.
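The transitions described above can be summarized as a small state machine. The following Python sketch is a simplified model for illustration only: the enum, the function name, and the two boolean inputs are assumptions, and the min_size value shown for the recovery state is inferred from the statement that min_size is restored only when the cluster returns to regular stretch mode.

```python
from enum import Enum


class StretchState(Enum):
    HEALTHY = "stretch"
    DEGRADED = "degraded"
    RECOVERY = "recovery"


# min_size per state; the RECOVERY value is inferred, not stated outright.
MIN_SIZE = {
    StretchState.HEALTHY: 2,
    StretchState.DEGRADED: 1,
    StretchState.RECOVERY: 1,
}


def next_state(state, site_down, all_pgs_clean):
    """Return the next state in a simplified two-site model."""
    if state is StretchState.HEALTHY and site_down:
        return StretchState.DEGRADED      # a whole data center became inaccessible
    if state is StretchState.DEGRADED and not site_down:
        return StretchState.RECOVERY      # the missing data center is back
    if state is StretchState.RECOVERY and not site_down and all_pgs_clean:
        return StretchState.HEALTHY       # no degraded or incomplete PGs remain
    return state


state = StretchState.HEALTHY
for site_down, clean in [(True, False), (False, False), (False, True)]:
    state = next_state(state, site_down, clean)
    print(state.value, "min_size =", MIN_SIZE[state])
```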
Stretch mode limitations
- It is not possible to exit from stretch mode once it is entered.
- You cannot use erasure-coded pools with clusters in stretch mode. You can neither enter the stretch mode with erasure-coded pools, nor create an erasure-coded pool when the stretch mode is active.
- Stretch mode supports no more than two sites.
- The weights of the two sites should be the same. If they are not, you receive the following error:
Example
[ceph: root@host01 /]# ceph mon enable_stretch_mode host05 stretch_rule datacenter
Error EINVAL: the 2 datacenter instances in the cluster have differing weights 25947 and 15728 but stretch mode currently requires they be the same!
To achieve the same weights on both sites, the Ceph OSDs deployed in the two sites should be of equal size, that is, the storage capacity in the first site is equivalent to the storage capacity in the second site.
- While it is not enforced, you should run two Ceph monitors on each site and a tiebreaker, for a total of five. This is because OSDs can only connect to monitors in their own site when in stretch mode.
- You must create your own CRUSH rule that provides two copies on each site, for a total of four copies across both sites.
- You cannot enable stretch mode if you have existing pools with non-default size or min_size.
- Because the cluster runs with min_size 1 when degraded, you should only use stretch mode with all-flash OSDs. This minimizes the time needed to recover once connectivity is restored, and minimizes the potential for data loss.
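As a rough pre-flight aid for the equal-weight limitation, the following Python sketch compares the weights of the two data center buckets before you attempt to enable stretch mode. The function name and the input dictionary are hypothetical; in practice you would read the weights from your own CRUSH hierarchy. The bucket names DC1 and DC2 are paired with the weights from the error message above purely for illustration.

```python
def check_site_weights(weights):
    """Return a problem description, or None if the two sites are balanced.

    weights: dict mapping a datacenter bucket name to its CRUSH weight.
    """
    if len(weights) != 2:
        return f"expected 2 datacenter buckets, found {len(weights)}"
    first, second = weights.values()
    if first != second:
        return f"differing weights {first} and {second}"
    return None


print(check_site_weights({"DC1": 25947, "DC2": 15728}))
print(check_site_weights({"DC1": 25947, "DC2": 25947}))
```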
4.1.1. Setting the crush location for the daemons
Before you enter the stretch mode, you need to prepare the cluster by setting the CRUSH location for the daemons in the Red Hat Ceph Storage cluster. There are two ways to do this:
- Bootstrap the cluster through a service configuration file, where the locations are added to the hosts as part of deployment.
- Set the locations manually through the ceph osd crush add-bucket and ceph osd crush move commands after the cluster is deployed.
Method 1: Bootstrapping the cluster
Prerequisites
- Root-level access to the nodes.
Procedure
If you are bootstrapping your new storage cluster, you can create the service configuration .yaml file that adds the nodes to the Red Hat Ceph Storage cluster and also sets specific labels for where the services should run:
Example
Bootstrap the storage cluster with the --apply-spec option:
Syntax
cephadm bootstrap --apply-spec CONFIGURATION_FILE_NAME --mon-ip MONITOR_IP_ADDRESS --ssh-private-key PRIVATE_KEY --ssh-public-key PUBLIC_KEY --registry-url REGISTRY_URL --registry-username USER_NAME --registry-password PASSWORD
Example
[root@host01 ~]# cephadm bootstrap --apply-spec initial-config.yaml --mon-ip 10.10.128.68 --ssh-private-key /home/ceph/.ssh/id_rsa --ssh-public-key /home/ceph/.ssh/id_rsa.pub --registry-url registry.redhat.io --registry-username myuser1 --registry-password mypassword1
Important: You can use different command options with the cephadm bootstrap command. However, always include the --apply-spec option to use the service configuration file and configure the host locations.
Method 2: Setting the locations after the deployment
Prerequisites
- Root-level access to the nodes.
Procedure
Add two buckets to the CRUSH map, to which you plan to set the location of your non-tiebreaker monitors, specifying the bucket type as datacenter:
Syntax
ceph osd crush add-bucket BUCKET_NAME BUCKET_TYPE
Example
[ceph: root@host01 /]# ceph osd crush add-bucket DC1 datacenter
[ceph: root@host01 /]# ceph osd crush add-bucket DC2 datacenter
Move the buckets under root=default:
Syntax
ceph osd crush move BUCKET_NAME root=default
Example
[ceph: root@host01 /]# ceph osd crush move DC1 root=default
[ceph: root@host01 /]# ceph osd crush move DC2 root=default
Move the OSD hosts according to the required CRUSH placement:
Syntax
ceph osd crush move HOST datacenter=DATACENTER
Example
[ceph: root@host01 /]# ceph osd crush move host01 datacenter=DC1
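If you manage many hosts, the manual steps of Method 2 can be generated from a single mapping. The following Python sketch only builds the command strings shown in this procedure and prints them for review; it does not run anything. The function name and the host-to-data-center layout are hypothetical.

```python
def crush_location_commands(host_to_dc):
    """Build the ceph commands that place hosts under datacenter buckets."""
    datacenters = sorted(set(host_to_dc.values()))
    commands = []
    # Step 1: create one bucket of type datacenter per site.
    for dc in datacenters:
        commands.append(f"ceph osd crush add-bucket {dc} datacenter")
    # Step 2: move each bucket under root=default.
    for dc in datacenters:
        commands.append(f"ceph osd crush move {dc} root=default")
    # Step 3: move each OSD host under its datacenter bucket.
    for host, dc in host_to_dc.items():
        commands.append(f"ceph osd crush move {host} datacenter={dc}")
    return commands


# Hypothetical layout with one OSD host per site.
layout = {"host01": "DC1", "host06": "DC2"}
for command in crush_location_commands(layout):
    print(command)
```

Printing the commands first, instead of executing them, leaves room to check the placement before changing the CRUSH map.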
4.1.2. Entering the stretch mode
The new stretch mode is designed to handle two sites. There is a lower risk of component availability outages with 2-site clusters.
Prerequisites
- Root-level access to the nodes.
- The CRUSH location is set for the hosts.
Procedure
Set the location of each monitor, matching your CRUSH map:
Syntax
ceph mon set_location HOST datacenter=DATACENTER
Example
[ceph: root@host01 /]# ceph mon set_location host01 datacenter=DC1
[ceph: root@host01 /]# ceph mon set_location host02 datacenter=DC1
[ceph: root@host01 /]# ceph mon set_location host04 datacenter=DC2
[ceph: root@host01 /]# ceph mon set_location host05 datacenter=DC2
[ceph: root@host01 /]# ceph mon set_location host07 datacenter=DC3
Generate a CRUSH rule which places two copies in each data center:
Syntax
ceph osd getcrushmap > COMPILED_CRUSHMAP_FILENAME
crushtool -d COMPILED_CRUSHMAP_FILENAME -o DECOMPILED_CRUSHMAP_FILENAME
Example
[ceph: root@host01 /]# ceph osd getcrushmap > crush.map.bin
[ceph: root@host01 /]# crushtool -d crush.map.bin -o crush.map.txt
Edit the decompiled CRUSH map file to add a new rule:
Example
Note: This rule makes the cluster have read-affinity towards data center DC1. Therefore, all the reads or writes happen through Ceph OSDs placed in DC1.
If this is not desirable, and reads or writes are to be distributed evenly across the zones, the CRUSH rule is the following:
Example
In this rule, the data center is selected randomly and automatically.
See CRUSH rules for more information on the firstn and indep options.
Inject the CRUSH map to make the rule available to the cluster:
Syntax
crushtool -c DECOMPILED_CRUSHMAP_FILENAME -o COMPILED_CRUSHMAP_FILENAME
ceph osd setcrushmap -i COMPILED_CRUSHMAP_FILENAME
Example
[ceph: root@host01 /]# crushtool -c crush.map.txt -o crush2.map.bin
[ceph: root@host01 /]# ceph osd setcrushmap -i crush2.map.bin
If you do not run the monitors in connectivity mode, set the election strategy to connectivity:
Example
[ceph: root@host01 /]# ceph mon set election_strategy connectivity
Enter stretch mode by setting the location of the tiebreaker monitor to split across the data centers:
Syntax
ceph mon set_location HOST datacenter=DATACENTER
ceph mon enable_stretch_mode HOST stretch_rule datacenter
Example
[ceph: root@host01 /]# ceph mon set_location host07 datacenter=DC3
[ceph: root@host01 /]# ceph mon enable_stretch_mode host07 stretch_rule datacenter
In this example, the monitor mon.host07 is the tiebreaker.
Important: The location of the tiebreaker monitor should differ from the data centers to which you previously set the non-tiebreaker monitors. In the example above, it is data center DC3.
Important: Do not add this data center to the CRUSH map as it results in the following error when you try to enter stretch mode:
Error EINVAL: there are 3 datacenters in the cluster but stretch mode currently only works with 2!
Note: If you are writing your own tooling for deploying Ceph, you can use a new --set-crush-location option when booting monitors, instead of running the ceph mon set_location command. This option accepts only a single bucket=location pair, for example ceph-mon --set-crush-location 'datacenter=DC1', which must match the bucket type you specified when running the enable_stretch_mode command.
Verify that the stretch mode is enabled successfully:
Example
The stretch_mode_enabled should be set to true. You can also see the number of stretch buckets, stretch mode buckets, and if the stretch mode is degraded or recovering.
Verify that the monitors are in the appropriate locations:
Example
You can also see which monitor is the tiebreaker, and the monitor election strategy.
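Before entering stretch mode, it can help to sanity-check the planned monitor layout against the guidance in this chapter: two data sites, two monitors on each, and a tiebreaker in a separate location. The following Python sketch is a hypothetical helper that works on a plain dictionary; it does not query the cluster, and its name and messages are assumptions.

```python
from collections import Counter


def check_monitor_layout(mon_locations, tiebreaker):
    """Return a list of problems found in a planned monitor layout.

    mon_locations: dict mapping a monitor host to its datacenter.
    tiebreaker: host name of the tiebreaker monitor.
    """
    if tiebreaker not in mon_locations:
        return [f"tiebreaker {tiebreaker} has no location set"]
    problems = []
    tiebreaker_site = mon_locations[tiebreaker]
    data_sites = Counter(
        dc for host, dc in mon_locations.items() if host != tiebreaker
    )
    if tiebreaker_site in data_sites:
        problems.append(f"tiebreaker shares data center {tiebreaker_site}")
    if len(data_sites) != 2:
        problems.append(f"expected 2 data sites, found {len(data_sites)}")
    for dc, count in sorted(data_sites.items()):
        if count != 2:
            problems.append(f"{dc} runs {count} monitor(s), 2 are recommended")
    return problems


# Layout taken from the set_location example in this procedure.
layout = {
    "host01": "DC1", "host02": "DC1",
    "host04": "DC2", "host05": "DC2",
    "host07": "DC3",
}
print(check_monitor_layout(layout, tiebreaker="host07"))
```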
4.1.3. Adding OSD hosts in stretch mode
You can add Ceph OSDs in the stretch mode. The procedure is similar to the addition of the OSD hosts on a cluster where stretch mode is not enabled.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Stretch mode is enabled on the cluster.
- Root-level access to the nodes.
Procedure
List the available devices to deploy OSDs:
Syntax
ceph orch device ls [--hostname=HOST_1 HOST_2] [--wide] [--refresh]
Example
[ceph: root@host01 /]# ceph orch device ls
Deploy the OSDs on specific hosts or on all the available devices:
Create an OSD from a specific device on a specific host:
Syntax
ceph orch daemon add osd HOST:DEVICE_PATH
Example
[ceph: root@host01 /]# ceph orch daemon add osd host03:/dev/sdb
Deploy OSDs on any available and unused devices:
Important: This command creates collocated WAL and DB devices. If you want to create non-collocated devices, do not use this command.
Example
[ceph: root@host01 /]# ceph orch apply osd --all-available-devices
Move the OSD hosts under the CRUSH bucket:
Syntax
ceph osd crush move HOST datacenter=DATACENTER
Example
[ceph: root@host01 /]# ceph osd crush move host03 datacenter=DC1
[ceph: root@host01 /]# ceph osd crush move host06 datacenter=DC2
Note: Ensure you add the same topology nodes on both sites. Issues might arise if hosts are added only on one site.