Chapter 4. Handling a data center failure
Red Hat Ceph Storage can withstand catastrophic failures of the infrastructure, such as losing one of three data centers in a stretch cluster. For the standard object store use case, each of the three data centers can be configured independently, with replication set up between them. In this scenario, the cluster configuration in each data center might be different, reflecting the local capabilities and dependencies.
Consider the logical structure of the placement hierarchy. A proper CRUSH map reflects the hierarchical structure of the failure domains within the infrastructure. Using logical hierarchical definitions improves the reliability of the storage cluster compared with using the standard hierarchical definitions. Failure domains are defined in the CRUSH map. The default CRUSH map contains all nodes in a flat hierarchy.
In a three data center environment, such as a stretch cluster, manage the placement of nodes so that one data center can go down while the storage cluster stays up and running. Consider which failure domain each node resides in. With 3-way replication of the data, an outage of one data center can leave some data with only one copy. When this scenario happens, there are two options:
- Leave the data in read-only status with the standard settings.
- Live with only one copy for the duration of the outage.
With the standard settings, and because of the randomness of data placement across the nodes, not all the data will be affected, but some data can have only one copy and the storage cluster would revert to read-only mode.
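The effect of a flat hierarchy can be made concrete with a small counting sketch. It is a simplified model, not the CRUSH algorithm: it assumes six single-OSD hosts, two per data center (the layout used in the example later in this chapter), it treats every 3-host replica set as equally likely, and it ignores OSD weights. The names `HOST_DC` and `surviving_copies` are illustrative only.

```python
from itertools import combinations

# Host-to-data-center layout taken from the example in this chapter.
HOST_DC = {
    "ceph-node1": "DC1", "ceph-node2": "DC1",
    "ceph-node3": "DC2", "ceph-node5": "DC2",
    "ceph-node4": "DC3", "ceph-node6": "DC3",
}


def surviving_copies(replica_hosts, failed_dc):
    """Count the replicas that are not located in the failed data center."""
    return sum(1 for host in replica_hosts if HOST_DC[host] != failed_dc)


# Flat hierarchy: the failure domain is the host, so any 3 distinct hosts
# form a valid replica set (assumed equally likely in this model).
replica_sets = list(combinations(sorted(HOST_DC), 3))

# Replica sets that are left with a single copy if DC1 goes down.
at_risk = [s for s in replica_sets if surviving_copies(s, "DC1") == 1]

print(f"{len(at_risk)} of {len(replica_sets)} replica sets keep only one copy")
```

Under these assumptions, 4 of the 20 possible replica sets drop to a single copy when DC1 fails, which is consistent with the statement that only part of the data is affected.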
The following example shows the map that results from the initial setup of a cluster with 6 OSD nodes. Each node has only one disk and therefore one OSD. All of the nodes are arranged under the default root, which is the standard root of the hierarchy tree. Two of the OSDs, osd.4 and osd.5, have a REWEIGHT value below 1, so they receive fewer chunks of data than their CRUSH weight alone would assign to them. These nodes were introduced later with bigger disks than the initial OSD disks. This does not affect the ability of the data placement to withstand the failure of a group of nodes.
Standard CRUSH map
$ sudo ceph osd tree
ID WEIGHT  TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.33554 root default
-2 0.04779     host ceph-node3
 0 0.04779         osd.0            up  1.00000          1.00000
-3 0.04779     host ceph-node2
 1 0.04779         osd.1            up  1.00000          1.00000
-4 0.04779     host ceph-node1
 2 0.04779         osd.2            up  1.00000          1.00000
-5 0.04779     host ceph-node4
 3 0.04779         osd.3            up  1.00000          1.00000
-6 0.07219     host ceph-node6
 4 0.07219         osd.4            up  0.79999          1.00000
-7 0.07219     host ceph-node5
 5 0.07219         osd.5            up  0.79999          1.00000
Using logical hierarchical definitions to group the nodes by data center makes the data placement aware of the data centers. The definition types root, datacenter, rack, row, and host can reflect the failure domains of the three data center stretch cluster:
- Nodes ceph-node1 and ceph-node2 reside in data center 1 (DC1)
- Nodes ceph-node3 and ceph-node5 reside in data center 2 (DC2)
- Nodes ceph-node4 and ceph-node6 reside in data center 3 (DC3)
- All data centers belong to the same structure (allDC)
Because all OSDs in a host already belong to the host definition, no change is needed at that level. All the other assignments can be adjusted while the storage cluster is running, in two steps.
Define the bucket structure with the following commands:
ceph osd crush add-bucket allDC root
ceph osd crush add-bucket DC1 datacenter
ceph osd crush add-bucket DC2 datacenter
ceph osd crush add-bucket DC3 datacenter
Move the nodes into the appropriate place within this structure by modifying the CRUSH map:
ceph osd crush move DC1 root=allDC
ceph osd crush move DC2 root=allDC
ceph osd crush move DC3 root=allDC
ceph osd crush move ceph-node1 datacenter=DC1
ceph osd crush move ceph-node2 datacenter=DC1
ceph osd crush move ceph-node3 datacenter=DC2
ceph osd crush move ceph-node5 datacenter=DC2
ceph osd crush move ceph-node4 datacenter=DC3
ceph osd crush move ceph-node6 datacenter=DC3
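For larger layouts, the same two steps can be generated from a single description of the hierarchy. The following is a minimal sketch that only builds and prints the command strings shown above; it does not contact a cluster. The helper `crush_commands` and the `LAYOUT` mapping are hypothetical names introduced for illustration.

```python
def crush_commands(root, layout):
    """Return the ceph CLI commands that create `root` and place the hosts.

    `layout` maps each datacenter bucket name to the hosts it contains.
    """
    # Step 1: define the bucket structure.
    commands = [f"ceph osd crush add-bucket {root} root"]
    commands += [f"ceph osd crush add-bucket {dc} datacenter" for dc in layout]
    # Step 2: move the data centers under the root, then the hosts
    # under their data center.
    commands += [f"ceph osd crush move {dc} root={root}" for dc in layout]
    for dc, hosts in layout.items():
        commands += [f"ceph osd crush move {host} datacenter={dc}" for host in hosts]
    return commands


# Layout of the example in this chapter.
LAYOUT = {
    "DC1": ["ceph-node1", "ceph-node2"],
    "DC2": ["ceph-node3", "ceph-node5"],
    "DC3": ["ceph-node4", "ceph-node6"],
}

# Print only; nothing is executed against a storage cluster.
for command in crush_commands("allDC", LAYOUT):
    print(command)
```

Keeping the layout in one mapping makes it easier to review which host ends up in which failure domain before any command is run.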
New hosts, as well as new disks, can also be added within this structure. When the OSDs are placed at the right place in the hierarchy, the CRUSH algorithm can place redundant pieces into different failure domains within the structure.
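What placement into different failure domains buys can be sketched with a simplified model. It assumes that the CRUSH rule of the pool selects each of the 3 replicas from a different datacenter bucket under allDC; such a rule is not shown in this chapter, so treat this as an assumption rather than a description of the configured cluster. The names `LAYOUT` and `copies_left` are illustrative only.

```python
from itertools import product

# Layout of the example in this chapter.
LAYOUT = {
    "DC1": ["ceph-node1", "ceph-node2"],
    "DC2": ["ceph-node3", "ceph-node5"],
    "DC3": ["ceph-node4", "ceph-node6"],
}

# Assumed rule: one host per data center. With 3 replicas and 3 data
# centers, every replica set is one pick from each datacenter bucket.
replica_sets = list(product(*LAYOUT.values()))


def copies_left(replica_set, failed_dc):
    """Count the replicas that survive the loss of one data center."""
    failed_hosts = set(LAYOUT[failed_dc])
    return sum(1 for host in replica_set if host not in failed_hosts)


# Worst case over every replica set and every single data center outage.
worst_case = min(copies_left(s, dc) for s in replica_sets for dc in LAYOUT)

print(f"{len(replica_sets)} replica sets, worst case {worst_case} copies left")
```

In this model every replica set keeps two copies after the loss of any single data center, in contrast to a flat hierarchy where some data can be left with only one copy.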
The above example results in the following:
$ sudo ceph osd tree
ID  WEIGHT  TYPE NAME                   UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -8 6.00000 root allDC
 -9 2.00000     datacenter DC1
 -4 1.00000         host ceph-node1
  2 1.00000             osd.2                up  1.00000          1.00000
 -3 1.00000         host ceph-node2
  1 1.00000             osd.1                up  1.00000          1.00000
-10 2.00000     datacenter DC2
 -2 1.00000         host ceph-node3
  0 1.00000             osd.0                up  1.00000          1.00000
 -7 1.00000         host ceph-node5
  5 1.00000             osd.5                up  0.79999          1.00000
-11 2.00000     datacenter DC3
 -6 1.00000         host ceph-node6
  4 1.00000             osd.4                up  0.79999          1.00000
 -5 1.00000         host ceph-node4
  3 1.00000             osd.3                up  1.00000          1.00000
 -1       0 root default
The listing above shows the resulting CRUSH map by displaying the OSD tree. It is now easy to see that each host belongs to a data center and that all data centers belong to the same top-level structure, while the locations remain clearly distinguished.
Data is placed in the proper locations according to the map only in a healthy cluster. Misplacement can occur when some OSDs are not available. Those misplacements are corrected automatically once it is possible to do so.
Additional Resources
- See the CRUSH administration chapter in the Red Hat Ceph Storage Storage Strategies Guide for more information.