5.12. Testing a Fence Device
Fencing is a fundamental part of the Red Hat Cluster infrastructure and it is therefore important to validate or test that fencing is working properly.
Use the following procedure to test a fence device.
- Use ssh, telnet, HTTP, or whatever remote protocol is used to connect to the device to manually log in and test the fence device or see what output is given. For example, if you will be configuring fencing for an IPMI-enabled device, then try to log in remotely with ipmitool. Take note of the options used when logging in manually because those options might be needed when using the fencing agent.
If you are unable to log in to the fence device, verify that the device is pingable, that nothing such as a firewall configuration is preventing access to the fence device, that remote access is enabled on the fence device, and that the credentials are correct.
- Run the fence agent manually, using the fence agent script. This does not require that the cluster services are running, so you can perform this step before the device is configured in the cluster. This can ensure that the fence device is responding properly before proceeding.
Note
The examples in this section use the fence_ilo fence agent script for an iLO device. The actual fence agent you will use and the command that calls that agent will depend on your server hardware. You should consult the man page for the fence agent you are using to determine which options to specify. You will usually need to know the login and password for the fence device and other information related to the fence device.
The following example shows the format you would use to run the fence_ilo fence agent script with the -o status parameter to check the status of the fence device interface on another node without actually fencing it. This allows you to test the device and get it working before attempting to reboot the node. When running this command, you specify the name and password of an iLO user that has power on and off permissions for the iLO device.
# fence_ilo -a ipaddress -l username -p password -o status
The following example shows the format you would use to run the fence_ilo fence agent script with the -o reboot parameter. Running this command on one node reboots another node on which you have configured the fence agent.
# fence_ilo -a ipaddress -l username -p password -o reboot
If the fence agent failed to properly do a status, off, on, or reboot action, you should check the hardware, the configuration of the fence device, and the syntax of your commands. In addition, you can run the fence agent script with the debug output enabled. The debug output is useful for some fencing agents to see where in the sequence of events the fencing agent script is failing when logging into the fence device.
# fence_ilo -a ipaddress -l username -p password -o status -D /tmp/$(hostname)-fence_agent.debug
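The status check with debug output can be wrapped in a small helper so that the exit code and the debug file location are reported together. The following is only a sketch: the function name fence_status_check and its argument order are hypothetical, and it assumes that your agent accepts the same -a, -l, -p, -o, and -D options used in the fence_ilo examples above. Check the man page for your agent before relying on it.

```shell
# Hypothetical helper (not part of any fence agent package): run a fence
# agent's status action with debug output enabled and report the result.
# Usage: fence_status_check AGENT ADDRESS LOGIN PASSWORD
fence_status_check() {
    local agent="$1"
    local address="$2"
    local login="$3"
    local password="$4"
    local debug_file="/tmp/$(hostname)-fence_agent.debug"
    local rc=0

    # Quote every argument so that special characters in the login or
    # password reach the agent unchanged.
    "$agent" -a "$address" -l "$login" -p "$password" \
        -o status -D "$debug_file" || rc=$?

    if [ "$rc" -ne 0 ]; then
        echo "agent exited with code $rc; debug output in $debug_file" >&2
    fi
    return "$rc"
}
```

For example, you could call it as fence_status_check fence_ilo ipaddress username 'password', substituting the agent and values for your own hardware.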
When diagnosing a failure that has occurred, you should ensure that the options you specified when manually logging in to the fence device are identical to what you passed on to the fence agent with the fence agent script.
For fence agents that support an encrypted connection, you may see an error due to certificate validation failing, requiring that you trust the host or that you use the fence agent's ssl-insecure parameter. Similarly, if SSL/TLS is disabled on the target device, you may need to account for this when setting the SSL parameters for the fence agent.
Note
If the fence agent that is being tested is fence_drac, fence_ilo, or some other fencing agent for a systems management device, and it continues to fail, then fall back to trying fence_ipmilan. Most systems management cards support IPMI remote login, and the only supported fencing agent for IPMI is fence_ipmilan.
- Once the fence device has been configured in the cluster with the same options that worked manually and the cluster has been started, test fencing with the pcs stonith fence command from any node (or even multiple times from different nodes), as in the following example. The pcs stonith fence command reads the cluster configuration from the CIB and calls the fence agent as configured to execute the fence action. This verifies that the cluster configuration is correct.
# pcs stonith fence node_name
If the pcs stonith fence command works properly, that means the fencing configuration for the cluster should work when a fence event occurs. If the command fails, it means that cluster management cannot invoke the fence device through the configuration it has retrieved. Check for the following issues and update your cluster configuration as needed.
- Check your fence configuration. For example, if you have used a host map you should ensure that the system can find the node using the host name you have provided.
- Check whether the password and user name for the device include any special characters that could be misinterpreted by the bash shell. Making sure that you enter passwords and user names surrounded by quotation marks could address this issue.
- Check whether you can connect to the device using the exact IP address or host name you specified in the pcs stonith command. For example, if you give the host name in the stonith command but test by using the IP address, that is not a valid test.
- If the protocol that your fence device uses is accessible to you, use that protocol to try to connect to the device. For example, many agents use ssh or telnet. You should try to connect to the device with the credentials you provided when configuring the device, to see if you get a valid prompt and can log in to the device.
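The quoting issue described in the check for special characters above can be seen without touching a real device. The following sketch uses a made-up password and a hypothetical variable named demo, both introduced only for this illustration. It shows that single quotes pass the value through literally, while double quotes let the shell expand $demo before any command sees the value.

```shell
# Made-up values for illustration only; not real credentials.
demo="EXPANDED"

# Single quotes: every character is passed through literally.
literal='s3cret$demo&x'

# Double quotes: $demo is expanded by the shell before the value is used.
expanded="s3cret$demo&x"

printf 'single-quoted: %s\n' "$literal"    # prints: single-quoted: s3cret$demo&x
printf 'double-quoted: %s\n' "$expanded"   # prints: double-quoted: s3cretEXPANDED&x
```

If the device password contains characters such as $ or &, single quotes are therefore the safer choice when entering it on the command line.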
If you determine that all your parameters are appropriate but you still have trouble connecting to your fence device, you can check the logging on the fence device itself, if the device provides that, which will show if the user has connected and what command the user issued. You can also search through the /var/log/messages file for instances of stonith and error, which could give some idea of what is transpiring, but some agents can provide additional information.
- Once the fence device tests are working and the cluster is up and running, test an actual failure. To do this, take an action in the cluster that should initiate a token loss.
- Take down a network. How you take down a network depends on your specific configuration. In many cases, you can physically pull the network or power cables out of the host.
Note
Disabling the network interface on the local host rather than physically disconnecting the network or power cables is not recommended as a test of fencing because it does not accurately simulate a typical real-world failure.
- Block corosync traffic both inbound and outbound using the local firewall.
The following example blocks corosync, assuming the default corosync port is used, firewalld is used as the local firewall, and the network interface used by corosync is in the default firewall zone:
# firewall-cmd --direct --add-rule ipv4 filter OUTPUT 2 -p udp --dport=5405 -j DROP
# firewall-cmd --add-rich-rule='rule family="ipv4" port port="5405" protocol="udp" drop'
- Simulate a crash and panic your machine with sysrq-trigger. Note, however, that triggering a kernel panic can cause data loss; it is recommended that you disable your cluster resources first.
# echo c > /proc/sysrq-trigger
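If you use the firewall method above to block corosync, the two DROP rules stay in the runtime firewall configuration until they are removed or the node restarts, because neither command uses --permanent. The following is a minimal sketch of a helper that adds or removes both rules together. The function name corosync_block and the FIREWALL_CMD variable are hypothetical and exist only for this illustration. The --remove-rule and --remove-rich-rule options are the firewall-cmd counterparts of the --add options shown earlier, and the sketch makes the same assumptions about the default corosync port and the default firewall zone.

```shell
# Hypothetical helper: add or remove the two corosync DROP rules.
# Usage: corosync_block add|remove
# FIREWALL_CMD is an assumption of this sketch; set it to "echo" to
# preview the commands without changing the firewall.
corosync_block() {
    local action="$1"
    local fw="${FIREWALL_CMD:-firewall-cmd}"

    case "$action" in
        add|remove) ;;
        *) echo "usage: corosync_block add|remove" >&2; return 2 ;;
    esac

    # Outbound corosync traffic (direct rule on the OUTPUT chain).
    "$fw" --direct "--${action}-rule" ipv4 filter OUTPUT 2 \
        -p udp --dport=5405 -j DROP || return

    # Inbound corosync traffic (rich rule in the default zone).
    "$fw" "--${action}-rich-rule=rule family=\"ipv4\" port port=\"5405\" protocol=\"udp\" drop"
}
```

Running corosync_block remove as root after the test restores corosync traffic on a node that was not rebooted. Setting FIREWALL_CMD=echo first prints the commands so that you can review them.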