Chapter 3. RGManager
RGManager manages and provides failover capabilities for collections of cluster resources called services, resource groups, or resource trees. These resource groups are tree-structured, and have parent-child dependency and inheritance relationships within each subtree.
RGManager allows administrators to define, configure, and monitor cluster services. In the event of a node failure, RGManager relocates the clustered service to another node with minimal service disruption. You can also restrict services to certain nodes, such as restricting httpd to one group of nodes while restricting mysql to a separate set of nodes.
There are various processes and agents that combine to make RGManager work. The following list summarizes those areas.
- Failover Domains - How the RGManager failover domain system works
- Service Policies - RGManager's service startup and recovery policies
- Resource Trees - How RGManager's resource trees work, including start/stop orders and inheritance
- Service Operational Behaviors - How RGManager's operations work and what states mean
- Virtual Machine Behaviors - Special things to remember when running VMs in an RGManager cluster
- Resource Actions - The agent actions RGManager uses and how to customize their behavior from the cluster.conf file.
- Event Scripting - If RGManager's failover and recovery policies do not fit in your environment, you can customize your own using this scripting subsystem.
3.1. Failover Domains
A failover domain is an ordered subset of members to which a service may be bound. Failover domains, while useful for cluster customization, are not required for operation.
The following is a list of semantics governing the configuration options that affect the behavior of a failover domain.
- preferred node or preferred member: The preferred node is the member designated to run a given service if the member is online. We can emulate this behavior by specifying an unordered, unrestricted failover domain of exactly one member.
- restricted domain: Services bound to the domain may only run on cluster members which are also members of the failover domain. If no members of the failover domain are available, the service is placed in the stopped state. In a cluster with several members, using a restricted failover domain can ease configuration of a cluster service (such as httpd), which requires identical configuration on all members that run the service. Instead of setting up the entire cluster to run the cluster service, you must set up only the members in the restricted failover domain that you associate with the cluster service.
- unrestricted domain: The default behavior. Services bound to this domain may run on any cluster member, but run on a member of the domain whenever one is available. This means that if a service is running outside of the domain and a member of the domain comes online, the service will migrate to that member, unless nofailback is set.
- ordered domain: The order specified in the configuration dictates the order of preference of members within the domain. The highest-ranking member of the domain will run the service whenever it is online. This means that if member A has a higher rank than member B and the service is running on B, the service will migrate to A when A transitions from offline to online.
- unordered domain: The default behavior. Members of the domain have no order of preference; any member may run the service. Even in an unordered domain, however, services always migrate to members of their failover domain whenever possible.
- failback: By default, a service in an ordered failover domain fails back to the node it was originally running on once that node comes back online. Setting nofailback disables this behavior, which is useful when a node fails frequently, because it prevents repeated service shifts between the failing node and the failover node.
Ordering, restriction, and nofailback are flags and may be combined in almost any way (for example, ordered+restricted or unordered+unrestricted). These combinations affect both where services start after initial quorum formation and which cluster members take over services when a service has failed.
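The way these flags combine at service start can be sketched in a few lines of Python. This is only an illustrative model of the semantics listed above, not RGManager's actual implementation: the FailoverDomain class and the start_candidates function are hypothetical names, members are assumed to be listed highest priority first, and where the text says a member is chosen at random the sketch simply returns every acceptable member.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverDomain:
    """Hypothetical model of a failover domain (not an RGManager type)."""
    members: tuple[str, ...]   # domain members, highest priority first
    ordered: bool = False      # default: unordered
    restricted: bool = False   # default: unrestricted
    nofailback: bool = False   # default: failback allowed


def start_candidates(domain: FailoverDomain, online: set[str]) -> list[str]:
    """Return the nodes on which a service may start.

    An empty list means the service stays in the stopped state.
    """
    in_domain = [node for node in domain.members if node in online]
    if in_domain:
        # Ordered: only the highest-priority online member qualifies.
        # Unordered: any online member of the domain is acceptable.
        return in_domain[:1] if domain.ordered else in_domain
    if domain.restricted:
        # Restricted: never run outside the domain.
        return []
    # Unrestricted: fall back to any online cluster member.
    return sorted(online)


if __name__ == "__main__":
    domain = FailoverDomain(("A", "B", "C"), ordered=True, restricted=True)
    print(start_candidates(domain, {"B", "C", "D"}))
    print(start_candidates(domain, {"D", "E"}))
```

The nofailback flag is carried in the model but does not influence initial placement; it matters only when a preferred member later rejoins the cluster.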
3.1.1. Behavior Examples
The following examples assume a cluster composed of this set of members: {A, B, C, D, E, F, G}.
- Ordered, restricted failover domain {A, B, C}
- With nofailback unset: A service 'S' will always run on member 'A' whenever member 'A' is online and there is a quorum. If all members of {A, B, C} are offline, the service will not run. If the service is running on 'C' and 'A' transitions online, the service will migrate to 'A'.
- With nofailback set: A service 'S' will run on the highest priority cluster member when a quorum is formed. If all members of {A, B, C} are offline, the service will not run. If the service is running on 'C' and 'A' transitions online, the service will remain on 'C' unless 'C' fails, at which point it will fail over to 'A'.
- Unordered, restricted failover domain {A, B, C}
- A service 'S' will only run if there is a quorum and at least one member of {A, B, C} is online. If another member of the domain transitions online, the service does not relocate.
- Ordered, unrestricted failover domain {A, B, C}
- With nofailback unset: A service 'S' will run whenever there is a quorum. If a member of the failover domain is online, the service will run on the highest-priority member; otherwise, a member of the cluster will be chosen at random to run the service. That is, the service will run on 'A' whenever 'A' is online, followed by 'B'.
- With nofailback set: A service 'S' will run whenever there is a quorum. If a member of the failover domain is online at quorum formation, the service will run on the highest-priority member of the failover domain. That is, if 'B' is online (but 'A' is not), the service will run on 'B'. If, at some later point, 'A' joins the cluster, the service will not relocate to 'A'.
- Unordered, unrestricted failover domain {A, B, C}
- This is also called a "Set of Preferred Members". When one or more members of the failover domain are online, the service will run on a nonspecific online member of the failover domain. If another member of the failover domain transitions online, the service does not relocate.
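The relocation decisions in these examples can be checked with a small Python sketch. It is one reading of the behavior described above, under stated assumptions, and not RGManager code: the relocates_on_join function is a hypothetical name, the domain is given as a list with the highest-priority member first, and the function only answers whether a service moves when a node transitions online.

```python
def relocates_on_join(members, current, joining, *, ordered=False, nofailback=False):
    """Return True if a service moves from `current` to `joining`.

    `members` lists the failover domain, highest priority first.
    `current` is the node running the service; `joining` is the node
    that has just transitioned online.
    """
    if nofailback or joining not in members:
        # nofailback pins the service; non-members never attract it.
        return False
    if current not in members:
        # Running outside the domain (possible only when unrestricted):
        # a domain member that comes online is preferred.
        return True
    if not ordered:
        # Unordered: all domain members are equally acceptable.
        return False
    # Ordered: move only to a strictly higher-priority member.
    return members.index(joining) < members.index(current)


if __name__ == "__main__":
    domain = ["A", "B", "C"]
    # Ordered domain, service on 'C', 'A' comes online.
    print(relocates_on_join(domain, "C", "A", ordered=True))
    print(relocates_on_join(domain, "C", "A", ordered=True, nofailback=True))
    # Unordered domain, service on 'B', 'A' comes online.
    print(relocates_on_join(domain, "B", "A"))
```

The restricted flag does not appear as a parameter because it affects only where a service may run at all, not whether it moves between two members of its domain.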