Chapter 3. Master/Slave

Abstract

Persistent messages require an additional layer of fault tolerance: in case of a broker failure, the replacement broker must have a copy of all the undelivered messages. Master/slave groups address this requirement by having a standby broker that shares the active broker's data store.
A master/slave group consists of two or more brokers where one master broker is active and one or more slave brokers are on hot standby, ready to take over whenever the master fails or shuts down. All of the brokers store the message and event data processed by the master broker, so when one of the slaves takes over as the new master broker, the integrity of the messaging system is guaranteed.
Red Hat JBoss A-MQ supports two master/slave broker configurations:
  • Shared file system—the master and the slaves use a common persistence store that is located on a shared file system
  • Shared JDBC database—the master and the slaves use a common JDBC persistence store

3.1. Shared File System Master/Slave

Overview

A shared file system master/slave group works by sharing a common data store that is located on a shared file system. Brokers automatically configure themselves to operate in master mode or slave mode based on their ability to grab an exclusive lock on the underlying data store.
The disadvantage of this configuration is that the shared file system is a single point of failure. This disadvantage can be mitigated by using a storage area network (SAN) with built-in high availability (HA) functionality. The SAN handles replication and failover of the data store.

Supported network file systems

The following network file systems (and only these file systems) are supported by Red Hat JBoss A-MQ:
  • NFSv4
  • GFS2
  • CIFS/SMB (Windows only)

Recommended NFSv4 client mount options

The goal is to set mount options that provide optimal support for both broker failover and data persistence. For broker failover, you want errors to propagate up to the broker quickly. For data persistence, you want to resend failed requests many times. The trick is to find settings that optimally balance these two fault-tolerance requirements.
The following mount options were used in all NFS locking mechanism tests. The tests were run on Red Hat Enterprise Linux 7.x machines in Red Hat OpenStack Platform. The broker was configured to use the KahaDB store with lockKeepAlivePeriod=2000 (for details, see the section called “File locking requirements”). In these tests, the broker detected lost access to the data store and initiated shutdown within 12 seconds. You may need to adjust these settings depending on your particular setup. An example mount command that combines these options follows the list.
  • soft—Disables continuous retransmission attempts by the client when the NFS server does not respond to a request. Instead, an NFS request fails after retrans transmissions have been sent, causing the NFS client to return an error to the caller, and thus to the broker. This option is key for enabling the timeo and retrans options.
  • timeo=20—The time, in deciseconds, the NFS client waits for a response from the NFS server before it sends another request. The default is 600 (60 seconds).
  • retrans=2—Specifies the number of times the NFS client attempts to retransmit a failed request to the NFS server. The default is 3. The client waits one timeo timeout period between retransmission attempts.
    Note
    After each retransmission, the timeout period is incremented by timeo, up to the maximum allowed (600).
  • lookupcache=none—Specifies how the kernel manages its cache of directory entries for the mount point. none forces the client to revalidate all cache entries before they are used. This enables the master broker to immediately detect any change made to the lock file, and it prevents the lock checking mechanism from returning incorrect validity information.
    The default is all, which means the client assumes that all cached directory entries are valid until their parent directory's cached attributes expire.
  • sync—Any system call that writes data to files on the mount point causes the data to be flushed to the NFS server before the system call returns control to user space. This option provides greater data cache coherence.
  • intr—Allows signals to interrupt file operations on the mount point. System calls return EINTR when an in-progress NFS operation is interrupted by a signal.
  • proto=tcp—Specifies the protocol the NFS client uses to transmit requests to the NFS server.
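With timeo=20 (2 seconds) and retrans=2, a failed request is abandoned after roughly 2 + 4 + 6 = 12 seconds, because the timeout grows by one timeo after each retransmission; this matches the shutdown time observed in the tests described above. As a minimal sketch, assuming the share is exported by a host named nfsserver and mounted at /sharedFileSystem (both placeholder names), the options above could be combined in a single mount command like this:

mount -t nfs4 -o soft,timeo=20,retrans=2,lookupcache=none,sync,intr,proto=tcp \
    nfsserver:/exports/brokerData /sharedFileSystem

The same options can also be placed in an /etc/fstab entry so that the mount is restored after a reboot.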
For more information on NFS mount point options, see http://linux.die.net/man/5/nfs.

File locking requirements

The shared file system requires an efficient and reliable file locking mechanism to function correctly. Not all SAN file systems are compatible with the shared file system configuration's needs.
Warning
OCFS2 is incompatible with this master/slave configuration, because mutex file locking from Java is not supported.
Warning
NFSv3 is incompatible with this master/slave configuration. In the event of an abnormal termination of a master broker, which is an NFSv3 client, the NFSv3 server does not time out the lock held by the client. This renders the Red Hat JBoss A-MQ data directory inaccessible. Because of this, the slave broker cannot acquire the lock and therefore cannot start up. In this case, the only way to unblock the master/slave in NFSv3 is to reboot all broker instances.
NFSv4, on the other hand, is compatible with this master/slave configuration, as its design includes timeouts for locks. When an NFSv4 client holding a lock terminates abnormally, NFSv4 automatically releases the lock after the specified timeout (see http://tools.ietf.org/html/rfc5661 for details), allowing another NFSv4 client to grab it.
It is possible for a slave to grab the lock without the master's knowledge when the master loses its connection to the NFSv4 server. This is because the master broker does not automatically check whether it still holds the lock, giving a slave the chance to grab it once the NFSv4 specified timeout elapses.
You can avoid this scenario by using the persistence adapter's lockKeepAlivePeriod attribute. Setting the lockKeepAlivePeriod attribute instructs the master to check, at the specified interval in milliseconds, whether it still holds the lock (that is, whether the lock is still valid) and whether the lock file still exists. If the master discovers that the lock is invalid, it tries to regain it. If it fails to do so, or if the lock file no longer exists, the master enters Slave mode, allowing another slave to try to get the lock and become master.
While attempting to get the lock, a slave checks every lockAcquireSleepInterval milliseconds whether another broker holds the lock. If not, the slave locks the file and waits one lockKeepAlivePeriod before entering Master mode. If the lock file does not exist, the slave recreates it and then tries to lock it, following the same procedure it would follow if the lock file existed.
To enable this lock checking mechanism, add the lockKeepAlivePeriod attribute to the persistenceAdapter element in the broker configuration. For example:
<kahaDB directory="/sharedFileSystem/sharedBrokerData" lockKeepAlivePeriod="2000">
    <locker>
        <shared-file-locker lockAcquireSleepInterval="10000" />
    </locker>
</kahaDB>
This configuration instructs the master broker to check at two-second intervals whether the lock is still valid and the lock file exists. Example 3.1, “Shared File System Broker Configuration” shows how to set the lockAcquireSleepInterval attribute.

Initial state

Figure 3.1, “Shared File System Initial State” shows the initial state of a shared file system master/slave group. When all of the brokers are started, one of them grabs the exclusive lock on the broker data store and becomes the master. All of the other brokers remain slaves and pause while waiting for the exclusive lock to be freed up. Only the master starts its transport connectors, so all of the clients connect to it.

Figure 3.1. Shared File System Initial State

[Diagram: a master and two slaves using a shared file system]

State after failure of the master

Figure 3.2, “Shared File System after Master Failure” shows the state of the master/slave group after the original master has shut down or failed. As soon as the master gives up the lock (or after a suitable timeout, if the master crashes), the lock on the data store frees up and another broker grabs the lock and gets promoted to master.

Figure 3.2. Shared File System after Master Failure

[Diagram: a master with a single slave]
After the clients lose their connection to the original master, they automatically try all of the other brokers listed in the failover URL. This enables them to find and connect to the new master.

Configuring the brokers

In the shared file system master/slave configuration, there is nothing special to distinguish a master broker from the slave brokers. The membership of a particular master/slave group is defined by the fact that all of the brokers in the group use the same persistence layer and store their data in the same shared directory.
Example 3.1, “Shared File System Broker Configuration” shows the broker configuration for a shared file system master/slave group that shares a data store located at /sharedFileSystem/sharedBrokerData and uses the KahaDB persistence store.

Example 3.1. Shared File System Broker Configuration

<broker ... >
  ...
  <persistenceAdapter>
    <kahaDB directory="/sharedFileSystem/sharedBrokerData" lockKeepAlivePeriod="2000">
        <locker>
            <shared-file-locker lockAcquireSleepInterval="10000" />
        </locker>
    </kahaDB>
  </persistenceAdapter>
  ...
</broker>
All of the brokers in the group must share the same persistenceAdapter element.

Configuring the clients

Clients of a shared file system master/slave group must be configured with a failover URL that lists the URLs for all of the brokers in the group. Example 3.2, “Client URL for a Shared File System Master/Slave Group” shows the client failover URL for a group that consists of three brokers: broker1, broker2, and broker3.

Example 3.2. Client URL for a Shared File System Master/Slave Group

failover:(tcp://broker1:61616,tcp://broker2:61616,tcp://broker3:61616)
For more information about using the failover protocol, see Section 2.1.1, “Static Failover”.
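The following Java sketch shows one way a JMS client could use this failover URL; the destination name TEST.QUEUE and the message body are illustrative assumptions, not part of the example above. The failover transport transparently reconnects the client to whichever broker currently holds the master role.

import javax.jms.Connection;
import javax.jms.DeliveryMode;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public class FailoverClient {
    public static void main(String[] args) throws Exception {
        // List every broker in the master/slave group; only the current master accepts connections.
        String url = "failover:(tcp://broker1:61616,tcp://broker2:61616,tcp://broker3:61616)";
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(url);
        Connection connection = factory.createConnection();
        connection.start();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createQueue("TEST.QUEUE"));
            // Persistent delivery ensures the message is written to the shared data store.
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);
            TextMessage message = session.createTextMessage("test message");
            producer.send(message);
        } finally {
            connection.close();
        }
    }
}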

Reintroducing a failed node

You can restart the failed master at any time and it will rejoin the cluster. It will rejoin as a slave broker because one of the other brokers already owns the exclusive lock on the data store, as shown in Figure 3.3, “Shared File System after Master Restart”.

Figure 3.3. Shared File System after Master Restart

[Diagram: a master with two slaves; broker1 is now a slave]