Chapter 16. Configuring a multisite, fault-tolerant messaging system by using Ceph


Large-scale enterprise messaging systems usually locate broker clusters in geographically distributed data centers. You can use specific broker topologies and Red Hat Ceph Storage, a software-defined storage platform, to ensure continuity of your messaging services during a data center outage.

This type of solution is called a multisite, fault-tolerant architecture.

The following sections explain how to protect your messaging system from data center outages by using Red Hat Ceph Storage.

Note

To use Red Hat Ceph Storage to ensure continuity of your messaging system, you must configure your brokers to use the shared store high-availability (HA) policy. You cannot configure your brokers to use the replication HA policy. For more information about these policies, see Implementing High Availability.

16.1. How Red Hat Ceph Storage clusters work

Red Hat Ceph Storage is a clustered object storage system that you can use to provide highly available and fault-tolerant shared storage for your AMQ Broker journals.

Red Hat Ceph Storage uses an algorithm called CRUSH (Controlled Replication Under Scalable Hashing) to determine how to store and retrieve data by automatically computing data storage locations. You configure CRUSH maps, which describe the cluster topology and specify how data is replicated across the storage cluster.

A CRUSH map contains a list of Object Storage Devices (OSDs), a list of buckets that aggregate the devices into a failure domain hierarchy, and rules that tell CRUSH how to replicate data in the cluster’s pools.

By reflecting the underlying physical organization of the installation, CRUSH maps can model, and thereby address, potential sources of correlated device failures, such as physical proximity, shared power sources, and shared networks. By encoding this information into the cluster map, CRUSH can separate object replicas across different failure domains (for example, data centers) while still maintaining a pseudorandom distribution of data across the storage cluster. This helps to prevent data loss and enables the cluster to operate in a degraded state.
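If you want to inspect the CRUSH map of an existing cluster, you can export and decompile it with the standard Ceph tools, as shown in the following minimal sketch. The file names are arbitrary placeholders.

    # Export the compiled CRUSH map from the cluster
    ceph osd getcrushmap -o crushmap.bin

    # Decompile the map into a human-readable file that lists devices, buckets, and rules
    crushtool -d crushmap.bin -o crushmap.txt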

Red Hat Ceph Storage clusters require several nodes (physical or virtual) to operate. Clusters must include the following types of nodes; a sketch of commands for inspecting each daemon type follows the list:

Monitor nodes
Each Monitor (MON) node runs the monitor daemon (ceph-mon), which maintains the main copy of the cluster map. The cluster map includes the cluster topology. A client connecting to the Ceph cluster retrieves the current copy of the cluster map from the Monitor node, which enables the client to read data from and write data to the cluster.
Important

A Red Hat Ceph Storage cluster can run with one Monitor node; however, to ensure high availability in a production cluster, Red Hat supports only deployments with at least three Monitor nodes. A minimum of three Monitor nodes means that if one monitor fails or is unavailable, a quorum exists for the remaining Monitor nodes in the cluster to elect a new leader.

Manager nodes
Each Manager (MGR) node runs the Ceph Manager daemon (ceph-mgr), which is responsible for keeping track of runtime metrics and the current state of the Ceph cluster, including storage utilization, current performance metrics, and system load. Usually, Manager nodes are colocated (that is, on the same host machine) with Monitor nodes.
Object Storage Device nodes
Each Object Storage Device (OSD) node runs the Ceph OSD daemon (ceph-osd), which interacts with logical disks attached to the node. Ceph stores data on OSD nodes. Ceph can run with very few OSD nodes (the default is three), but production clusters realize better performance at modest scales, for example, with 50 OSDs in a storage cluster. Having multiple OSDs in a storage cluster enables system administrators to define isolated failure domains within a CRUSH map.
Metadata Server nodes
Each Metadata Server (MDS) node runs the MDS daemon (ceph-mds), which manages metadata related to files stored on the Ceph File System (CephFS). The MDS daemon also coordinates access to the shared cluster.
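If you want to confirm which daemons of each type are running in an existing cluster, you can use the following commands. This is a minimal sketch that assumes a working cluster and an admin keyring on the node where you run the commands.

    ceph mon stat     # Lists the Monitor nodes and shows which of them are in quorum
    ceph mgr stat     # Shows the active Manager daemon and any standby Managers
    ceph osd tree     # Prints the OSD hierarchy, including the CRUSH buckets that contain each OSD
    ceph mds stat     # Shows the state of the Metadata Server daemons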

Additional resources

16.2. Installing Red Hat Ceph Storage

You must install a Red Hat Ceph Storage 3 cluster to replicate messaging data across data centers so that the data is available to brokers in separate data centers.

Red Hat Ceph Storage clusters intended for production use should have a minimum of:

  • Three Monitor (MON) nodes
  • Three Manager (MGR) nodes
  • Three Object Storage Device (OSD) nodes containing multiple OSD daemons
  • Three Metadata Server (MDS) nodes
Important

You can run the OSD, MON, MGR, and MDS nodes on either the same or separate physical or virtual machines. However, to ensure fault tolerance within your Red Hat Ceph Storage cluster, it is good practice to distribute each of these types of nodes across distinct data centers. In particular, you must ensure that in the event of a single data center outage, your storage cluster still has a minimum of two available MON nodes. Therefore, if you have three MON nodes in your cluster, each of these nodes must run on separate host machines in separate data centers. Do not run two MON nodes in a single data center, because failure of this data center will leave your storage cluster with only one remaining MON node. In this situation, the storage cluster can no longer operate.
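If you want to confirm where your Monitor nodes run, the following command is a minimal sketch; it lists each Monitor and its address, which you can compare against the hosts in each data center.

    ceph mon dump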

The procedures linked to from this section show you how to install a Red Hat Ceph Storage 3 cluster that includes MON, MGR, OSD, and MDS nodes.

Prerequisites

Procedure

16.3. Configuring a Red Hat Ceph Storage cluster

To configure a Red Hat Ceph Storage cluster, you must create CRUSH buckets to aggregate your Object Storage Device (OSD) nodes into data centers that reflect the physical installation. In addition, you must create a rule to replicate data across data centers for use by multiple brokers.

Prerequisites

  • You have already installed a Red Hat Ceph Storage cluster. For more information, see Installing Red Hat Ceph Storage.
  • You should understand how Red Hat Ceph Storage uses Placement Groups (PGs) to organize large numbers of data objects in a pool, and how to calculate the number of PGs to use in your pool. For more information, see Placement Groups (PGs). A worked sizing example follows this list.
  • You should understand how to set the number of object replicas in a pool. For more information, see Set the Number of Object Replicas.
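The following is a hypothetical sizing sketch based on the commonly cited rule of thumb from the Ceph documentation; the OSD and replica counts are placeholders, and you should confirm the values for your cluster with the PG guidance linked above.

    # Rule of thumb: pg_num is roughly (number of OSDs * 100) / replica count,
    # rounded up to the next power of two.
    osds=4
    size=2
    echo $(( osds * 100 / size ))   # Prints 200; round up to 256 for pg_num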

Procedure

  1. Create CRUSH buckets to organize your OSD nodes. Buckets are lists of OSDs, based on physical locations such as data centers. In Ceph, these physical locations are known as failure domains.

    ceph osd crush add-bucket dc1 datacenter
    ceph osd crush add-bucket dc2 datacenter
  2. Move the host machines for your OSD nodes to the data center CRUSH buckets that you created. Replace host names host1-host4 with the names of your host machines.

    ceph osd crush move host1 datacenter=dc1
    ceph osd crush move host2 datacenter=dc1
    ceph osd crush move host3 datacenter=dc2
    ceph osd crush move host4 datacenter=dc2
  3. Ensure that the CRUSH buckets you created are part of the default CRUSH tree.

    ceph osd crush move dc1 root=default
    ceph osd crush move dc2 root=default
  4. Create a rule to map storage object replicas across your data centers. This helps to prevent data loss and enables your cluster to stay running in the event of a single data center outage.

    The command to create a rule uses the following syntax: ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>. An example is shown below.

    ceph osd crush rule create-replicated multi-dc default datacenter hdd
    Note

    In the preceding command, if your storage cluster uses solid-state drives (SSD), specify ssd instead of hdd (hard disk drives).

  5. Configure your Ceph data and metadata pools to use the rule that you created. Initially, this might cause data to be backfilled to the storage destinations determined by the CRUSH algorithm.

    ceph osd pool set cephfs_data crush_rule multi-dc
    ceph osd pool set cephfs_metadata crush_rule multi-dc
  6. Specify the numbers of Placement Groups (PGs) and Placement Groups for Placement (PGPs) for your metadata and data pools. The PGP value should be equal to the PG value.

    ceph osd pool set cephfs_metadata pg_num 128
    ceph osd pool set cephfs_metadata pgp_num 128
    
    ceph osd pool set cephfs_data pg_num 128
    ceph osd pool set cephfs_data pgp_num 128
  7. Specify the numbers of replicas to be used by your data and metadata pools.

    ceph osd pool set cephfs_data min_size 1
    ceph osd pool set cephfs_metadata min_size 1
    
    ceph osd pool set cephfs_data size 2
    ceph osd pool set cephfs_metadata size 2

    The following figure shows the Red Hat Ceph Storage cluster created by the preceding example procedure. The storage cluster has OSDs organized into CRUSH buckets corresponding to data centers. A sketch of commands for verifying this configuration follows the figures.

    Figure 16.1. Red Hat Ceph Storage Cluster example

    The following figure shows a possible layout of the first data center, including your broker servers. Specifically, the data center hosts:

    • The servers for two primary-backup broker pairs
    • The OSD nodes that you assigned to the first data center in the preceding procedure
    • A single Metadata Server node, a single Monitor node, and a single Manager node. The Monitor and Manager nodes are usually colocated on the same machine.

    Figure 16.2. Single data center with Ceph Storage Cluster and two primary-backup broker pairs

    Important

    You can run the OSD, MON, MGR, and MDS nodes on either the same or separate physical or virtual machines. However, to ensure fault tolerance within your Red Hat Ceph Storage cluster, it is good practice to distribute each of these types of nodes across distinct data centers. In particular, you must ensure that in the event of a single data center outage, your storage cluster still has a minimum of two available MON nodes. Therefore, if you have three MON nodes in your cluster, each of these nodes must run on separate host machines in separate data centers.

    The following figure shows a complete example topology. To ensure fault tolerance in your storage cluster, the MON, MGR, and MDS nodes are distributed across three separate data centers.

    Figure 16.3. Multiple data centers with Ceph storage clusters and two primary-backup broker pairs

    Note

    Locating the host machines for certain OSD nodes in the same data center as your broker servers does not mean that you store messaging data on those specific OSD nodes. You configure the brokers to store messaging data in a specified directory in the Ceph File System. The Metadata Server nodes in your cluster then determine how to distribute the stored data across all available OSDs in your data centers and handle replication of this data across data centers. The sections that follow show how to configure brokers to store messaging data on the Ceph File System.

    The figure below illustrates replication of data between the two data centers that have broker servers.

    Figure 16.4. Ceph replication between two data centers that have broker servers
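    If you want to confirm that the rule and pools are configured as described in the preceding procedure, the following commands are a minimal sketch. They assume the rule and pool names used in the example procedure.

    ceph osd crush rule dump multi-dc           # Shows the replication rule and its datacenter failure domain
    ceph osd pool get cephfs_data crush_rule    # Confirms that the data pool uses the multi-dc rule
    ceph osd pool get cephfs_data size          # Confirms the number of replicas for the data pool
    ceph osd tree                               # Shows the OSDs grouped under the dc1 and dc2 buckets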

    Additional resources

16.4. Mounting the Ceph File System on your broker servers

You must mount the Ceph File System (CephFS) on your broker servers so that you can configure your brokers to store messaging data in a Red Hat Ceph Storage cluster.

The procedure linked to from this section illustrates how to mount the CephFS on your broker servers.

Prerequisites

Procedure

  1. For instructions on mounting the Ceph File System on your broker servers, see Mounting the Ceph File System as a kernel client.
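    A minimal sketch of a kernel client mount is shown below. The Monitor address, Ceph user name, secret file location, and mount point are placeholders; the linked procedure describes how to determine the correct values for your cluster.

    mkdir -p /mnt/cephfs
    mount -t ceph <mon_host>:6789:/ /mnt/cephfs -o name=<ceph_user>,secretfile=/etc/ceph/<ceph_user>.secret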

16.5. Configuring brokers in a multisite, fault-tolerant messaging system

To successfully deploy a multisite, fault-tolerant messaging system, you must add idle backup brokers that can take over from brokers in a primary-backup group if a data center outage occurs. For every broker, you must also assign the Ceph client role and configure a shared store HA policy.

16.5.1. Adding backup brokers

Within each of your data centers, you must add idle backup brokers that can take over from brokers in a primary-backup group if a data center outage occurs.

You should replicate the configuration of active primary brokers in your idle backup brokers. You must also configure your backup brokers to accept client connections in the same way as your existing brokers.

In a later procedure, you see how to configure an idle backup broker to join an existing primary-backup broker group. You must locate the idle backup broker in a separate data center from that of the active primary-backup broker group. It is also recommended that you manually start the idle backup broker only if a data center failure occurs.

The following figure shows an example topology.

Figure 16.5. Idle backup broker in a multisite, fault-tolerant messaging system

16.5.2. Configuring brokers as Ceph clients

When you add the backup brokers required for a fault-tolerant system, you must configure all of the brokers to have the Ceph client role. This role gives brokers the capability to store data within your Red Hat Ceph Storage cluster.

To learn how to configure Ceph clients, see Installing the Ceph Client Role.
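The linked procedure uses the ceph-ansible client role. For illustration only, the following is a minimal sketch of an alternative manual approach: it authorizes a hypothetical client.broker user for the Ceph File System and copies the resulting keyring to a broker server. The user name, file system name, and broker host are placeholders.

    # Create a CephFS-authorized user for the brokers and save its keyring
    ceph fs authorize cephfs client.broker / rw > /etc/ceph/ceph.client.broker.keyring

    # Copy the cluster configuration and keyring to the broker server
    scp /etc/ceph/ceph.conf /etc/ceph/ceph.client.broker.keyring <broker_host>:/etc/ceph/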

16.5.3. Configuring shared store high availability

You must configure each broker in your primary-backup group to use a shared store high availability (HA) policy and to use the same journal, paging, and large message directories on the Ceph File System. This configuration creates a shared store for brokers in different data centers.

The following procedure shows how to configure the shared store HA policy on the primary, backup, and idle backup brokers of your primary-backup group.

Procedure

  1. Edit the broker.xml configuration file of each broker in the primary-backup group. Configure each broker to use the same paging, bindings, journal, and large message directories in the Ceph File System. A sketch for verifying these shared directories follows the procedure.

    # Primary Broker - DC1
    <paging-directory>/mnt/cephfs/data/broker1/paging</paging-directory>
    <bindings-directory>/mnt/cephfs/data/broker1/bindings</bindings-directory>
    <journal-directory>/mnt/cephfs/data/broker1/journal</journal-directory>
    <large-messages-directory>/mnt/cephfs/data/broker1/large-messages</large-messages-directory>
    
    # Backup Broker - DC1
    <paging-directory>/mnt/cephfs/data/broker1/paging</paging-directory>
    <bindings-directory>/mnt/cephfs/data/broker1/bindings</bindings-directory>
    <journal-directory>/mnt/cephfs/data/broker1/journal</journal-directory>
    <large-messages-directory>/mnt/cephfs/data/broker1/large-messages</large-messages-directory>
    
    # Backup Broker (Idle) - DC2
    <paging-directory>/mnt/cephfs/data/broker1/paging</paging-directory>
    <bindings-directory>/mnt/cephfs/data/broker1/bindings</bindings-directory>
    <journal-directory>/mnt/cephfs/data/broker1/journal</journal-directory>
    <large-messages-directory>/mnt/cephfs/data/broker1/large-messages</large-messages-directory>
  2. Configure the backup broker as a primary within its HA policy, as shown below. This configuration setting ensures that the backup broker immediately becomes the primary when you manually start it. Because the broker is an idle backup, the failover-on-shutdown parameter that you can specify for an active primary broker does not apply in this case.

    <configuration>
        <core>
            ...
            <ha-policy>
                <shared-store>
                    <primary>
                    </primary>
                </shared-store>
            </ha-policy>
            ...
        </core>
    </configuration>
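After you complete the preceding procedure, you can optionally confirm on each broker server that the shared directories reside on the Ceph File System rather than on a local disk. The following commands are a minimal sketch that assumes the /mnt/cephfs mount point and broker1 directories used in the examples.

    findmnt -t ceph                            # Lists active CephFS mounts; /mnt/cephfs should appear
    ls -ld /mnt/cephfs/data/broker1/journal    # Confirms that the shared journal directory is reachable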

16.6. Configuring clients in a multisite, fault-tolerant messaging system

An internal client application runs on a machine located in the same data center as the broker, while an external client runs on a machine located outside the broker data center. The way you configure clients to mitigate data center outages depends on whether the clients are internal or external.

The following figure shows a topology for internal clients.

Figure 16.6. Internal clients in a multisite, fault-tolerant messaging system

The following figure shows a topology for external clients.

Figure 16.7. External clients in a multisite, fault-tolerant messaging system

The following sub-sections show examples of configuring your internal and external client applications to connect to a backup broker in another data center if a data center outage occurs.

16.6.1. Configuring internal clients

If a data center outage occurs, internal clients shut down along with the brokers. You can mitigate data center outages for internal clients by preparing a backup client instance in a separate location. If an outage occurs, you can manually start the backup client to connect to the backup broker.

To enable the backup client to connect to a backup broker, you must configure the client connection similarly to that of the client in your primary data center.

For example, a basic connection configuration for an AMQ Core Protocol JMS client to a primary-backup broker group is shown below. In this example, host1 and host2 are the host servers for the primary and backup brokers.

ConnectionFactory connectionFactory = new ActiveMQConnectionFactory("(tcp://host1:port,tcp://host2:port)?ha=true&retryInterval=100&retryIntervalMultiplier=1.0&reconnectAttempts=-1");

To configure a backup client to connect to a backup broker if a data center outage occurs, use a similar connection configuration, but specify only the host name of your backup broker server. In this example, the backup broker server is host3.

ConnectionFactory connectionFactory = new ActiveMQConnectionFactory("(tcp://host3:port)?ha=true&retryInterval=100&retryIntervalMultiplier=1.0&reconnectAttempts=-1");

16.6.2. Configuring external clients

To mitigate a data center outage, you can configure external clients to automatically fail over to a backup broker that is located in a separate data center.

Shown below are examples of configuring the AMQ Core Protocol JMS and Red Hat build of Apache Qpid JMS clients to fail over to a backup broker if the primary-backup group is unavailable. In these examples, host1 and host2 are the host servers for the primary and backup brokers, while host3 is the host server for the backup broker that you manually start if a data center outage occurs.

  • To configure an AMQ Core Protocol JMS client, include the backup broker in the ordered list of brokers that the client attempts to connect to.

    ConnectionFactory connectionFactory = new ActiveMQConnectionFactory("(tcp://host1:port,tcp://host2:port,tcp://host3:port)?ha=true&retryInterval=100&retryIntervalMultiplier=1.0&reconnectAttempts=-1");
  • To configure a Red Hat build of Apache Qpid JMS client, include the backup broker in the failover URI that you configure on the client.

    failover:(amqp://host1:port,amqp://host2:port,amqp://host3:port)?jms.clientID=myclient&failover.maxReconnectAttempts=20

16.7. Verifying storage cluster health during a data center outage

If one of your data centers fails, your Red Hat Ceph Storage cluster continues to run in a degraded state without losing any data. You can verify the status of the storage cluster while it runs in this degraded state.

Procedure

  1. To verify the status of your Ceph storage cluster, use the health or status commands:

    # ceph health
    # ceph status
  2. To watch the ongoing events of the cluster on the command line, open a new terminal. Then, enter:

    # ceph -w

    When you run any of the preceding commands, you see output indicating that the storage cluster is still running, but in a degraded state. Specifically, you should see a warning that resembles the following:

    health: HEALTH_WARN
            2 osds down
            Degraded data redundancy: 42/84 objects degraded (50.0%), 16 pgs unclean, 16 pgs degraded
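    If you want more detail about which OSDs and placement groups are affected, the following commands are a minimal sketch; both are part of the standard Ceph command-line interface.

    # Print a detailed breakdown of each health warning, including the affected OSDs and PGs
    ceph health detail

    # Show the OSD hierarchy; OSDs in the failed data center are reported as down
    ceph osd tree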

Additional resources

16.8. Maintaining messaging continuity during a data center outage

To maintain messaging services during a data center outage, you must manually start any idle backup brokers that you created to take over from brokers in your failed data center. You must also connect internal or external clients to the new active brokers.

Prerequisites

Procedure

  1. For each primary-backup broker pair in the failed data center, manually start the idle backup broker that you added. A sketch of the start command follows this procedure.

    Figure 16.8. Backup broker started after data center outage

  2. Reestablish client connections.

    1. If you were using an internal client in the failed data center, manually start the backup client that you created. As described in Configuring clients in a multisite, fault-tolerant messaging system, you must configure the client to connect to the backup broker that you manually started.

      The following figure shows the new topology.

      Figure 16.9. Internal client connected to backup broker after data center outage

    2. If you have an external client, manually connect the external client to the new active broker or observe that the client automatically fails over to the new active broker, based on its configuration. For more information, see Configuring external clients.

      The following figure shows the new topology.

      Figure 16.10. External client connected to backup broker after data center outage
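The exact start command depends on your platform. The following is a minimal sketch for starting an idle backup broker on Linux, where <broker_instance_dir> is the instance directory of that broker; the equivalent Windows command is shown in Restarting broker servers.

    <broker_instance_dir>/bin/artemis run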

16.9. Restarting a previously failed data center

When a failed data center returns to operation, you can restore the original state of your messaging system. To complete the restoration, you must restart the servers that host the nodes of your Red Hat Ceph Storage cluster, restart the brokers, and re-establish connections from client applications.

16.9.1. Restarting storage cluster servers

When you restart Monitor, Metadata Server, Manager, and Object Storage Device (OSD) nodes in a previously failed data center, your Red Hat Ceph Storage cluster self-heals to restore full data redundancy. You can verify the status of this restoration.

To verify that your storage cluster is automatically self-healing and restoring full data redundancy, use the commands previously shown in Verifying storage cluster health during a data center outage. When you re-execute these commands, you see that the percentage degradation indicated by the previous HEALTH_WARN message starts to improve until it returns to 100%.

16.9.2. Restarting broker servers

When your storage cluster is no longer operating in a degraded state, you can restart the brokers in the data center that failed previously.

Procedure

  1. Stop any client applications connected to backup brokers that you manually started when the data center outage occurred.
  2. Stop the backup brokers that you manually started.

    1. On Linux:

      <broker_instance_dir>/bin/artemis stop
    2. On Windows:

      <broker_instance_dir>\bin\artemis-service.exe stop
  3. In your previously failed data center, restart the original primary and backup brokers.

    1. On Linux:

      <broker_instance_dir>/bin/artemis run
    2. On Windows:

      <broker_instance_dir>\bin\artemis-service.exe start

      The original primary broker automatically resumes its role as primary when you restart it.

16.9.3. Reestablishing client connections

After you restart your brokers, you must reconnect your client applications to those brokers, thereby restoring your messaging system to its state prior to the data center failure.

16.9.3.1. Reconnecting internal clients

Internal clients are clients that run in the same, previously failed, data center as the restored brokers. To reconnect the clients to the brokers, restart them.

Each client application reconnects to the restored primary broker that is specified in its connection configuration.

For more information about configuring broker network connections, see Chapter 2, Configuring acceptors and connectors in network connections.

16.9.3.2. Reconnecting external clients

External clients are clients running outside the data center that previously failed. Depending on the client type and configuration, clients either failover and reconnect automatically to the original broker, or you must manually reconnect the clients.

  • If you configured your external client to automatically fail over to a backup broker, the client automatically fails back to the original primary broker when you shut down the backup broker and restart the original primary broker.
  • If you manually connected the external client to a backup broker when a data center outage occurred, you must manually reconnect the client to the original primary broker that you restarted.