Este conteúdo não está disponível no idioma selecionado.
Chapter 31. Performing cluster maintenance
In order to perform maintenance on the nodes of your cluster, you may need to stop or move the resources and services running on that cluster. Or you may need to stop the cluster software while leaving the services untouched. Pacemaker provides a variety of methods for performing system maintenance.
- If you need to stop a node in a cluster while continuing to provide the services running on that cluster on another node, you can put the cluster node in standby mode. A node that is in standby mode is no longer able to host resources. Any resource currently active on the node will be moved to another node, or stopped if no other node is eligible to run the resource. For information about standby mode, see Putting a node into standby mode.
If you need to move an individual resource off the node on which it is currently running without stopping that resource, you can use the
pcs resource move
command to move the resource to a different node.When you execute the
pcs resource move
command, this adds a constraint to the resource to prevent it from running on the node on which it is currently running. When you are ready to move the resource back, you can execute thepcs resource clear
or thepcs constraint delete
command to remove the constraint. This does not necessarily move the resources back to the original node, however, since where the resources can run at that point depends on how you have configured your resources initially. You can relocate a resource to its preferred node with thepcs resource relocate run
command.-
If you need to stop a running resource entirely and prevent the cluster from starting it again, you can use the
pcs resource disable
command. For information on thepcs resource disable
command, see Disabling, enabling, and banning cluster resources. -
If you want to prevent Pacemaker from taking any action for a resource (for example, if you want to disable recovery actions while performing maintenance on the resource, or if you need to reload the
/etc/sysconfig/pacemaker
settings), use thepcs resource unmanage
command, as described in Setting a resource to unmanaged mode. Pacemaker Remote connection resources should never be unmanaged. -
If you need to put the cluster in a state where no services will be started or stopped, you can set the
maintenance-mode
cluster property. Putting the cluster into maintenance mode automatically unmanages all resources. For information about putting the cluster in maintenance mode, see Putting a cluster in maintenance mode. - If you need to update the packages that make up the RHEL High Availability and Resilient Storage Add-Ons, you can update the packages on one node at a time or on the entire cluster as a whole, as summarized in Updating a RHEL high availability cluster.
- If you need to perform maintenance on a Pacemaker remote node, you can remove that node from the cluster by disabling the remote node resource, as described in Upgrading remote nodes and guest nodes.
- If you need to migrate a VM in a RHEL cluster, you will first need to stop the cluster services on the VM to remove the node from the cluster and then start the cluster back up after performing the migration. as described in Migrating VMs in a RHEL cluster.
31.1. Putting a node into standby mode
When a cluster node is in standby mode, the node is no longer able to host resources. Any resources currently active on the node will be moved to another node.
The following command puts the specified node into standby mode. If you specify the --all
, this command puts all nodes into standby mode.
You can use this command when updating a resource’s packages. You can also use this command when testing a configuration, to simulate recovery without actually shutting down a node.
pcs node standby node | --all
The following command removes the specified node from standby mode. After running this command, the specified node is then able to host resources. If you specify the --all
, this command removes all nodes from standby mode.
pcs node unstandby node | --all
Note that when you execute the pcs node standby
command, this prevents resources from running on the indicated node. When you execute the pcs node unstandby
command, this allows resources to run on the indicated node. This does not necessarily move the resources back to the indicated node; where the resources can run at that point depends on how you have configured your resources initially.
31.2. Manually moving cluster resources
You can override the cluster and force resources to move from their current location. There are two occasions when you would want to do this:
- When a node is under maintenance, and you need to move all resources running on that node to a different node
- When individually specified resources needs to be moved
To move all resources running on a node to a different node, you put the node in standby mode.
You can move individually specified resources in either of the following ways.
-
You can use the
pcs resource move
command to move a resource off a node on which it is currently running. -
You can use the
pcs resource relocate run
command to move a resource to its preferred node, as determined by current cluster status, constraints, location of resources and other settings.
31.2.1. Moving a resource from its current node
To move a resource off the node on which it is currently running, use the following command, specifying the resource_id of the resource as defined. Specify the destination_node
if you want to indicate on which node to run the resource that you are moving.
pcs resource move resource_id [destination_node] [--promoted] [--strict] [--wait[=n]]
When you execute the pcs resource move
command, this adds a constraint to the resource to prevent it from running on the node on which it is currently running. By default, the location constraint that the command creates is automatically removed once the resource has been moved. If removing the constraint would cause the resource to move back to the original node, as might happen if the resource-stickiness
value for the resource is 0, the pcs resource move
command fails. If you would like to move a resource and leave the resulting constraint in place, use the pcs resource move-with-constraint
command.
If you specify the --promoted
parameter of the pcs resource move
command, the constraint applies only to promoted instances of the resource.
If you specify the --strict
parameter of the pcs resource move
command, the command will fail if other resources than the one specified in the command would be affected.
You can optionally configure a --wait[=n]
parameter for the pcs resource move
command to indicate the number of seconds to wait for the resource to start on the destination node before returning 0 if the resource is started or 1 if the resource has not yet started. If you do not specify n, it defaults to a value of 60 minutes.
31.2.2. Moving a resource to its preferred node
After a resource has moved, either due to a failover or to an administrator manually moving the node, it will not necessarily move back to its original node even after the circumstances that caused the failover have been corrected. To relocate resources to their preferred node, use the following command. A preferred node is determined by the current cluster status, constraints, resource location, and other settings and may change over time.
pcs resource relocate run [resource1] [resource2] ...
If you do not specify any resources, all resource are relocated to their preferred nodes.
This command calculates the preferred node for each resource while ignoring resource stickiness. After calculating the preferred node, it creates location constraints which will cause the resources to move to their preferred nodes. Once the resources have been moved, the constraints are deleted automatically. To remove all constraints created by the pcs resource relocate run
command, you can enter the pcs resource relocate clear
command. To display the current status of resources and their optimal node ignoring resource stickiness, enter the pcs resource relocate show
command.
31.3. Disabling, enabling, and banning cluster resources
In addition to the pcs resource move
and pcs resource relocate
commands, there are a variety of other commands you can use to control the behavior of cluster resources.
Disabling a cluster resource
You can manually stop a running resource and prevent the cluster from starting it again with the following command. Depending on the rest of the configuration (constraints, options, failures, and so on), the resource may remain started. If you specify the --wait
option, pcs will wait up to 'n' seconds for the resource to stop and then return 0 if the resource is stopped or 1 if the resource has not stopped. If 'n' is not specified it defaults to 60 minutes.
pcs resource disable resource_id [--wait[=n]]
You can specify that a resource be disabled only if disabling the resource would not have an effect on other resources. Ensuring that this would be the case can be impossible to do by hand when complex resource relations are set up.
-
The
pcs resource disable --simulate
command shows the effects of disabling a resource while not changing the cluster configuration. -
The
pcs resource disable --safe
command disables a resource only if no other resources would be affected in any way, such as being migrated from one node to another. Thepcs resource safe-disable
command is an alias for thepcs resource disable --safe
command. -
The
pcs resource disable --safe --no-strict
command disables a resource only if no other resources would be stopped or demoted
You can specify the --brief
option for the pcs resource disable --safe
command to print errors only. The error report that the pcs resource disable --safe
command generates if the safe disable operation fails contains the affected resource IDs. If you need to know only the resource IDs of resources that would be affected by disabling a resource, use the --brief
option, which does not provide the full simulation result.
Enabling a cluster resource
Use the following command to allow the cluster to start a resource. Depending on the rest of the configuration, the resource may remain stopped. If you specify the --wait
option, pcs will wait up to 'n' seconds for the resource to start and then return 0 if the resource is started or 1 if the resource has not started. If 'n' is not specified it defaults to 60 minutes.
pcs resource enable resource_id [--wait[=n]]
Preventing a resource from running on a particular node
Use the following command to prevent a resource from running on a specified node, or on the current node if no node is specified.
pcs resource ban resource_id [node] [--promoted] [lifetime=lifetime] [--wait[=n]]
Note that when you execute the pcs resource ban
command, this adds a -INFINITY location constraint to the resource to prevent it from running on the indicated node. You can execute the pcs resource clear
or the pcs constraint delete
command to remove the constraint. This does not necessarily move the resources back to the indicated node; where the resources can run at that point depends on how you have configured your resources initially.
If you specify the --promoted
parameter of the pcs resource ban
command, the scope of the constraint is limited to the promoted role and you must specify promotable_id rather than resource_id.
You can optionally configure a lifetime
parameter for the pcs resource ban
command to indicate a period of time the constraint should remain.
You can optionally configure a --wait[=n]
parameter for the pcs resource ban
command to indicate the number of seconds to wait for the resource to start on the destination node before returning 0 if the resource is started or 1 if the resource has not yet started. If you do not specify n, the default resource timeout will be used.
Forcing a resource to start on the current node
Use the debug-start
parameter of the pcs resource
command to force a specified resource to start on the current node, ignoring the cluster recommendations and printing the output from starting the resource. This is mainly used for debugging resources; starting resources on a cluster is (almost) always done by Pacemaker and not directly with a pcs
command. If your resource is not starting, it is usually due to either a misconfiguration of the resource (which you debug in the system log), constraints that prevent the resource from starting, or the resource being disabled. You can use this command to test resource configuration, but it should not normally be used to start resources in a cluster.
The format of the debug-start
command is as follows.
pcs resource debug-start resource_id
31.4. Setting a resource to unmanaged mode
When a resource is in unmanaged
mode, the resource is still in the configuration but Pacemaker does not manage the resource.
The following command sets the indicated resources to unmanaged
mode.
pcs resource unmanage resource1 [resource2] ...
The following command sets resources to managed
mode, which is the default state.
pcs resource manage resource1 [resource2] ...
You can specify the name of a resource group with the pcs resource manage
or pcs resource unmanage
command. The command will act on all of the resources in the group, so that you can set all of the resources in a group to managed
or unmanaged
mode with a single command and then manage the contained resources individually.
31.5. Putting a cluster in maintenance mode
When a cluster is in maintenance mode, the cluster does not start or stop any services until told otherwise. When maintenance mode is completed, the cluster does a sanity check of the current state of any services, and then stops or starts any that need it.
To put a cluster in maintenance mode, use the following command to set the maintenance-mode
cluster property to true
.
# pcs property set maintenance-mode=true
To remove a cluster from maintenance mode, use the following command to set the maintenance-mode
cluster property to false
.
# pcs property set maintenance-mode=false
You can remove a cluster property from the configuration with the following command.
pcs property unset property
Alternately, you can remove a cluster property from a configuration by leaving the value field of the pcs property set
command blank. This restores that property to its default value. For example, if you have previously set the symmetric-cluster
property to false
, the following command removes the value you have set from the configuration and restores the value of symmetric-cluster
to true
, which is its default value.
# pcs property set symmetric-cluster=
31.6. Updating a RHEL high availability cluster
Updating packages that make up the RHEL High Availability and Resilient Storage Add-Ons, either individually or as a whole, can be done in one of two general ways:
- Rolling Updates: Remove one node at a time from service, update its software, then integrate it back into the cluster. This allows the cluster to continue providing service and managing resources while each node is updated.
- Entire Cluster Update: Stop the entire cluster, apply updates to all nodes, then start the cluster back up.
It is critical that when performing software update procedures for Red Hat Enterprise Linux High Availability and Resilient Storage clusters, you ensure that any node that will undergo updates is not an active member of the cluster before those updates are initiated.
For a full description of each of these methods and the procedures to follow for the updates, see Recommended Practices for Applying Software Updates to a RHEL High Availability or Resilient Storage Cluster.
31.7. Upgrading remote nodes and guest nodes
If the pacemaker_remote
service is stopped on an active remote node or guest node, the cluster will gracefully migrate resources off the node before stopping the node. This allows you to perform software upgrades and other routine maintenance procedures without removing the node from the cluster. Once pacemaker_remote
is shut down, however, the cluster will immediately try to reconnect. If pacemaker_remote
is not restarted within the resource’s monitor timeout, the cluster will consider the monitor operation as failed.
If you wish to avoid monitor failures when the pacemaker_remote
service is stopped on an active Pacemaker Remote node, you can use the following procedure to take the node out of the cluster before performing any system administration that might stop pacemaker_remote
.
Procedure
Stop the node’s connection resource with the
pcs resource disable resourcename
command, which will move all services off the node. The connection resource would be theocf:pacemaker:remote
resource for a remote node or, commonly, theocf:heartbeat:VirtualDomain
resource for a guest node. For guest nodes, this command will also stop the VM, so the VM must be started outside the cluster (for example, usingvirsh
) to perform any maintenance.pcs resource disable resourcename
- Perform the required maintenance.
When ready to return the node to the cluster, re-enable the resource with the
pcs resource enable
command.pcs resource enable resourcename
31.8. Migrating VMs in a RHEL cluster
Red Hat does not support live migration of active cluster nodes across hypervisors or hosts, as noted in Support Policies for RHEL High Availability Clusters - General Conditions with Virtualized Cluster Members. If you need to perform a live migration, you will first need to stop the cluster services on the VM to remove the node from the cluster, and then start the cluster back up after performing the migration. The following steps outline the procedure for removing a VM from a cluster, migrating the VM, and restoring the VM to the cluster.
The following steps outline the procedure for removing a VM from a cluster, migrating the VM, and restoring the VM to the cluster.
This procedure applies to VMs that are used as full cluster nodes, not to VMs managed as cluster resources (including VMs used as guest nodes) which can be live-migrated without special precautions. For general information about the fuller procedure required for updating packages that make up the RHEL High Availability and Resilient Storage Add-Ons, either individually or as a whole, see Recommended Practices for Applying Software Updates to a RHEL High Availability or Resilient Storage Cluster.
Before performing this procedure, consider the effect on cluster quorum of removing a cluster node. For example, if you have a three-node cluster and you remove one node, your cluster can not withstand any node failure. This is because if one node of a three-node cluster is already down, removing a second node will lose quorum.
Procedure
- If any preparations need to be made before stopping or moving the resources or software running on the VM to migrate, perform those steps.
Run the following command on the VM to stop the cluster software on the VM.
# pcs cluster stop
- Perform the live migration of the VM.
Start cluster services on the VM.
# pcs cluster start
31.9. Identifying clusters by UUID
As of Red Hat Enterprise Linux 9.1, when you create a cluster it has an associated UUID. Since a cluster name is not a unique cluster identifier, a third-party tool such as a configuration management database that manages multiple clusters with the same name can uniquely identify a cluster by means of its UUID. You can display the current cluster UUID with the pcs cluster config [show]
command, which includes the cluster UUID in its output.
To add a UUID to an existing cluster, run the following command.
# pcs cluster config uuid generate
To regenerate a UUID for a cluster with an existing UUID, run the following command.
# pcs cluster config uuid generate --force