8.2. Moving Resources Due to Failure
When you create a resource, you can configure it to move to a new node after a defined number of failures by setting the migration-threshold option for that resource. Once the threshold has been reached, this node will no longer be allowed to run the failed resource until one of the following occurs:
- The administrator manually resets the resource's failcount using the pcs resource failcount command.
- The resource's failure-timeout value is reached.
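As a sketch of the two conditions above, the following commands reset the failure count by hand and set a failure timeout on the example resource dummy_resource. The 60s value is an arbitrary illustration, and the exact subcommand syntax can differ between pcs versions, so check pcs resource --help on your system before relying on it.

```shell
# Manually reset the failure count for the resource (run as root).
# pcs resource failcount reset dummy_resource

# Let failures expire automatically after 60 seconds (illustrative value).
# pcs resource meta dummy_resource failure-timeout=60s
```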
The value of migration-threshold is set to INFINITY by default. INFINITY is defined internally as a very large but finite number. A value of 0 disables the migration-threshold feature.
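The threshold rule can be modeled in a few lines of shell. This is only an illustrative sketch, not how the cluster implements it: the function name node_is_banned is hypothetical, and INFINITY is assumed here to be 1000000, the score Pacemaker uses internally.

```shell
# Illustrative model only; these names are hypothetical and not part of pcs.
INFINITY=1000000   # assumption: the large finite value used for INFINITY

# Succeeds (exit status 0) when the node may no longer run the resource.
node_is_banned() {
    failcount=$1
    threshold=$2
    # A threshold of 0 disables the feature, so the node is never banned.
    [ "$threshold" -ne 0 ] && [ "$failcount" -ge "$threshold" ]
}

# Example: 10 failures against a threshold of 10 bans the node.
if node_is_banned 10 10; then
    echo "resource must move to another node"
fi
```

With the default threshold of INFINITY, an ordinary failure count never reaches the limit, so in this model only a failure count that is itself set to INFINITY forces a move.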
Note
Setting a migration-threshold for a resource is not the same as configuring a resource for migration, in which the resource moves to another location without loss of state.
The following example adds a migration threshold of 10 to the resource named dummy_resource, which indicates that the resource will move to a new node after 10 failures.
# pcs resource meta dummy_resource migration-threshold=10
You can add a migration threshold to the defaults for the whole cluster with the following command.
# pcs resource defaults migration-threshold=10
To determine the resource's current failure status and limits, use the pcs resource failcount command.
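For instance, the following invocation would display the current failure count for the example resource. The show subcommand is assumed from the pcs command set of this release; its options and output format vary between pcs versions.

```shell
# Display the failure count recorded for dummy_resource (run as root).
# pcs resource failcount show dummy_resource
```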
There are two exceptions to the migration threshold concept: a resource that fails to start and a resource that fails to stop. If the cluster property start-failure-is-fatal is set to true (which is the default), a start failure sets the failcount to INFINITY and therefore always causes the resource to move immediately. For information on the start-failure-is-fatal option, see Table 12.1, “Cluster Properties”.
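If you would rather have start failures count toward migration-threshold like any other failure, the property can be changed cluster-wide. The command below is a sketch of that change; whether it is appropriate depends on your resources.

```shell
# Treat start failures as ordinary failures instead of fatal ones (run as root).
# pcs property set start-failure-is-fatal=false
```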
Stop failures are handled differently and are more critical. If a resource fails to stop and STONITH is enabled, the cluster fences the node so that it can start the resource elsewhere. If STONITH is not enabled, the cluster has no way to continue and will not try to start the resource elsewhere, but it will try to stop the resource again after the failure timeout.