Chapter 15. Handling a data center failure
As a storage administrator, you can take preventive measures to avoid a data center failure. These preventive measures include:
- Configuring the data center infrastructure.
- Setting up failure domains within the CRUSH map hierarchy.
- Designating failure nodes within the domains.
15.1. Prerequisites
- A healthy running Red Hat Ceph Storage cluster.
- Root-level access to all nodes in the storage cluster.
15.2. Avoiding a data center failure
Configuring the data center infrastructure
Each data center within a stretch cluster can have a different storage cluster configuration to reflect local capabilities and dependencies. Set up replication between the data centers to help preserve the data. If one data center fails, the other data centers in the storage cluster contain copies of the data.
Setting up failure domains within the CRUSH map hierarchy
Failure, or failover, domains are redundant copies of domains within the storage cluster. If an active domain fails, the failure domain becomes the active domain.
By default, the CRUSH map lists all nodes in a storage cluster within a flat hierarchy. However, for best results, create a logical hierarchical structure within the CRUSH map. The hierarchy designates the domains to which each node belongs and the relationships among those domains within the storage cluster, including the failure domains. Defining the failure domains for each domain within the hierarchy improves the reliability of the storage cluster.
When planning a storage cluster that contains multiple data centers, place the nodes within the CRUSH map hierarchy so that if one data center goes down, the rest of the storage cluster stays up and running.
Designating failure nodes within the domains
If you plan to use three-way replication for data within the storage cluster, consider the location of the nodes within the failure domain. If an outage occurs within a data center, it is possible that some data might reside in only one copy. When this scenario happens, there are two options:
- Leave the data in read-only status with the standard settings.
- Live with only one copy for the duration of the outage.
With the standard settings, and because of the randomness of data placement across the nodes, not all the data will be affected, but some data can have only one copy and the storage cluster would revert to read-only mode. However, if some data exist in only one copy, the storage cluster reverts to read-only mode.
15.3. Handling a data center failure
Red Hat Ceph Storage can withstand catastrophic failures to the infrastructure, such as losing one of the data centers in a stretch cluster. For the standard object store use case, configuring all three data centers can be done independently with replication set up between them. In this scenario, the storage cluster configuration in each of the data centers might be different, reflecting the local capabilities and dependencies.
A logical structure of the placement hierarchy should be considered. A proper CRUSH map can be used, reflecting the hierarchical structure of the failure domains within the infrastructure. Using logical hierarchical definitions improves the reliability of the storage cluster, versus using the standard hierarchical definitions. Failure domains are defined in the CRUSH map. The default CRUSH map contains all nodes in a flat hierarchy. In a three data center environment, such as a stretch cluster, the placement of nodes should be managed in a way that one data center can go down, but the storage cluster stays up and running. Consider which failure domain a node resides in when using 3-way replication for the data.
In the example below, the resulting map is derived from the initial setup of the storage cluster with 6 OSD nodes. In this example, all nodes have only one disk and hence one OSD. All of the nodes are arranged under the default root, that is the standard root of the hierarchy tree. Because there is a weight assigned to two of the OSDs, these OSDs receive fewer chunks of data than the other OSDs. These nodes were introduced later with bigger disks than the initial OSD disks. This does not affect the data placement to withstand a failure of a group of nodes.
Example
[ceph: root@host01 /]# ceph osd tree ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY -1 0.33554 root default -2 0.04779 host host03 0 0.04779 osd.0 up 1.00000 1.00000 -3 0.04779 host host02 1 0.04779 osd.1 up 1.00000 1.00000 -4 0.04779 host host01 2 0.04779 osd.2 up 1.00000 1.00000 -5 0.04779 host host04 3 0.04779 osd.3 up 1.00000 1.00000 -6 0.07219 host host06 4 0.07219 osd.4 up 0.79999 1.00000 -7 0.07219 host host05 5 0.07219 osd.5 up 0.79999 1.00000
Using logical hierarchical definitions to group the nodes into same data center can achieve data placement maturity. Possible definition types of root, datacenter, rack, row and host allow the reflection of the failure domains for the three data center stretch cluster:
- Nodes host01 and host02 reside in data center 1 (DC1)
- Nodes host03 and host05 reside in data center 2 (DC2)
- Nodes host04 and host06 reside in data center 3 (DC3)
- All data centers belong to the same structure (allDC)
Since all OSDs in a host belong to the host definition there is no change needed. All the other assignments can be adjusted during runtime of the storage cluster by:
Defining the bucket structure with the following commands:
ceph osd crush add-bucket allDC root ceph osd crush add-bucket DC1 datacenter ceph osd crush add-bucket DC2 datacenter ceph osd crush add-bucket DC3 datacenter
Moving the nodes into the appropriate place within this structure by modifying the CRUSH map:
ceph osd crush move DC1 root=allDC ceph osd crush move DC2 root=allDC ceph osd crush move DC3 root=allDC ceph osd crush move host01 datacenter=DC1 ceph osd crush move host02 datacenter=DC1 ceph osd crush move host03 datacenter=DC2 ceph osd crush move host05 datacenter=DC2 ceph osd crush move host04 datacenter=DC3 ceph osd crush move host06 datacenter=DC3
Within this structure any new hosts can be added too, as well as new disks. By placing the OSDs at the right place in the hierarchy the CRUSH algorithm is changed to place redundant pieces into different failure domains within the structure.
The above example results in the following:
Example
[ceph: root@host01 /]# ceph osd tree ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY -8 6.00000 root allDC -9 2.00000 datacenter DC1 -4 1.00000 host host01 2 1.00000 osd.2 up 1.00000 1.00000 -3 1.00000 host host02 1 1.00000 osd.1 up 1.00000 1.00000 -10 2.00000 datacenter DC2 -2 1.00000 host host03 0 1.00000 osd.0 up 1.00000 1.00000 -7 1.00000 host host05 5 1.00000 osd.5 up 0.79999 1.00000 -11 2.00000 datacenter DC3 -6 1.00000 host host06 4 1.00000 osd.4 up 0.79999 1.00000 -5 1.00000 host host04 3 1.00000 osd.3 up 1.00000 1.00000 -1 0 root default
The listing from above shows the resulting CRUSH map by displaying the osd tree. Easy to see is now how the hosts belong to a data center and all data centers belong to the same top level structure but clearly distinguishing between locations.
Placing the data in the proper locations according to the map works only properly within the healthy cluster. Misplacement might happen under circumstances, when some OSDs are not available. Those misplacements will be corrected automatically once it is possible to do so.
Additional Resources
- See the CRUSH administration chapter in the Red Hat Ceph Storage Storage Strategies Guide for more information.