Chapter 9. Troubleshooting
The following is a list of some problems you may see regarding the configuration of fence devices as well as some suggestions for how to address these problems.
- If your system does not fence a node automatically, you can try to fence the node from the command line using the fence_node command, as described at the end of each of the fencing configuration procedures. The fence_node command performs I/O fencing on a single node by reading the fencing settings from the cluster.conf file for the given node and then running the configured fencing agent against the node. For example, the following command fences node clusternode1.example.com:

      # /sbin/fence_node clusternode1.example.com

  If the fence_node command is unsuccessful, you may have made an error in defining the fence device configuration. To determine whether the fencing agent itself is able to talk to the fencing device, you can execute the I/O fencing command for your fence device directly from the command line. As a first step, you can execute the command with the -o status option specified. For example, if you are using an APC switch as a fence device, you can execute a command such as the following:

      # /sbin/fence_apc -a (ipaddress) -l (login) ... -o status -v

  You can also use the I/O fencing command for your device to fence the node. For example, for an HP iLO device, you can issue the following command:

      # /sbin/fence_ilo -a myilo -l login -p passwd -o off -v
- Check the version of firmware you are using in your fence device. You may want to consider upgrading your firmware. You may also want to search Bugzilla to see if there are any known issues with your firmware version.
- If a node in your cluster is repeatedly getting fenced, it means that one of the nodes in your cluster is not seeing enough "heartbeat" network messages from the node that is getting fenced. Most of the time, this is a result of flaky or faulty hardware, such as bad cables or bad ports on the network hub or switch. Test your communications paths thoroughly without the cluster software running to make sure your hardware is working correctly.
- If a node in your cluster is repeatedly getting fenced right at startup, it may be due to system activities that occur when a node joins a cluster. If your network is busy, your cluster may decide it is not getting enough heartbeat packets. To address this, you may have to increase the post_join_delay setting in your cluster.conf file. This delay is a grace period, specified in seconds, that gives the node more time to join the cluster.

  In the following example, the fence_daemon entry in the cluster configuration file shows a post_join_delay setting that has been increased to 600.

      <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="600">
- If a node fails while the fenced daemon is not running, it will not be fenced. It will cause problems if the fenced daemon is killed or exits while the node is using GFS. If the fenced daemon exits, it should be restarted.
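As a supplement to the post_join_delay item in the list above, the following minimal Python sketch shows one way to read the fence_daemon delay settings out of a cluster configuration. It is an illustration only: the function name fence_daemon_delays and the SAMPLE_CONF fragment (including its config_version value) are hypothetical, and the sketch assumes fence_daemon is a direct child of the top-level cluster element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed cluster.conf fragment used only for illustration.
SAMPLE_CONF = """<cluster name="nfsclust" config_version="1">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="600"/>
</cluster>
"""


def fence_daemon_delays(conf_xml):
    """Return (post_fail_delay, post_join_delay) as ints; None where unset."""
    root = ET.fromstring(conf_xml)
    # Assumption: fence_daemon sits directly under the root <cluster> element.
    node = root.find("fence_daemon")
    if node is None:
        return (None, None)

    def as_int(attr):
        value = node.get(attr)
        return int(value) if value is not None else None

    return (as_int("post_fail_delay"), as_int("post_join_delay"))


if __name__ == "__main__":
    print(fence_daemon_delays(SAMPLE_CONF))
```

To check a real node, the same function could be given the text of that node's cluster.conf file instead of the sample string.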
If you find that you are seeing error messages when you try to configure your system, or if after configuration your system does not behave as expected, you can perform the following checks and examine the following areas.
- Connect to one of the nodes in the cluster and execute the clustat(8) command. This command runs a utility that displays the status of the cluster. It shows membership information, quorum view, and the state of all configured user services.

  The following example shows the output of the clustat(8) command.

      [root@clusternode4 ~]# clustat
      Cluster Status for nfsclust @ Wed Dec 3 12:37:22 2008
      Member Status: Quorate

       Member Name                    ID   Status
       ------ ----                    ---- ------
       clusternode5.example.com       1    Online, rgmanager
       clusternode4.example.com       2    Online, Local, rgmanager
       clusternode3.example.com       3    Online, rgmanager
       clusternode2.example.com       4    Online, rgmanager
       clusternode1.example.com       5    Online, rgmanager

       Service Name       Owner (Last)                State
       ------- ---        ----- ------                -----
       service:nfssvc     clusternode2.example.com    starting

  In this example, clusternode4 is the local node since it is the host from which the command was run. If rgmanager did not appear in the Status column, it could indicate that cluster services are not running on the node.
- Connect to one of the nodes in the cluster and execute the group_tool(8) command. This command provides information that you may find helpful in debugging your system. The following example shows the output of the group_tool(8) command.

      [root@clusternode1 ~]# group_tool
      type             level name       id       state
      fence            0     default    00010005 none
      [1 2 3 4 5]
      dlm              1     clvmd      00020005 none
      [1 2 3 4 5]
      dlm              1     rgmanager  00030005 none
      [3 4 5]
      dlm              1     mygfs      007f0005 none
      [5]
      gfs              2     mygfs      007e0005 none
      [5]

  The state of the group should be none. The numbers in the brackets are the node ID numbers of the cluster nodes in the group. The clustat command shows which node IDs are associated with which nodes. If you do not see a node number in the group, that node is not a member of the group. For example, if a node ID is not in the dlm/rgmanager group, that node is not using the rgmanager dlm lock space (and probably is not running rgmanager).

  The level of a group indicates the recovery ordering. 0 is recovered first, 1 is recovered second, and so forth.
- Connect to one of the nodes in the cluster and execute the cman_tool nodes -f command. This command provides information about the cluster nodes that you may want to look at. The following example shows the output of the cman_tool nodes -f command.

      [root@clusternode1 ~]# cman_tool nodes -f
      Node  Sts   Inc   Joined               Name
         1   M    752   2008-10-27 11:17:15  clusternode5.example.com
         2   M    752   2008-10-27 11:17:15  clusternode4.example.com
         3   M    760   2008-12-03 11:28:44  clusternode3.example.com
         4   M    756   2008-12-03 11:28:26  clusternode2.example.com
         5   M    744   2008-10-27 11:17:15  clusternode1.example.com

  The Sts heading indicates the status of a node. A status of M indicates the node is a member of the cluster. A status of X indicates that the node is dead. The Inc heading indicates the incarnation number of a node, which is for debugging purposes only.
- Check whether the cluster.conf file is identical on each node of the cluster. If you configure your system with Conga, as in the example provided in this document, these files should be identical, but one of the files may have accidentally been deleted or altered.
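The group membership check described for group_tool above can be sketched in Python. This is a minimal illustration, not a supported tool: the names parse_groups and missing_members are hypothetical, the regular expression assumes the column layout shown in the example output, and the list of all node IDs is assumed to be supplied by the caller (for instance, read from the clustat output).

```python
import re

# Sample text modeled on the group_tool example output in this chapter.
SAMPLE = """\
type             level name       id       state
fence            0     default    00010005 none
[1 2 3 4 5]
dlm              1     clvmd      00020005 none
[1 2 3 4 5]
dlm              1     rgmanager  00030005 none
[3 4 5]
dlm              1     mygfs      007f0005 none
[5]
gfs              2     mygfs      007e0005 none
[5]
"""

# One group entry: type, level, name, 8-digit hex id, state, then [node IDs].
# Assumption: fields are whitespace-separated, as in the example output.
GROUP_RE = re.compile(
    r"(?P<type>\w+)\s+(?P<level>\d+)\s+(?P<name>\S+)\s+"
    r"(?P<id>[0-9a-fA-F]{8})\s+(?P<state>\S+)\s+\[(?P<members>[\d\s]*)\]"
)


def parse_groups(text):
    """Return a list of dicts, one per group found in group_tool output."""
    groups = []
    for m in GROUP_RE.finditer(text):
        groups.append({
            "type": m.group("type"),
            "level": int(m.group("level")),
            "name": m.group("name"),
            "state": m.group("state"),
            "members": [int(n) for n in m.group("members").split()],
        })
    return groups


def missing_members(groups, all_node_ids):
    """Map 'type/name' to the sorted node IDs that are absent from that group."""
    report = {}
    for g in groups:
        absent = sorted(set(all_node_ids) - set(g["members"]))
        if absent:
            report[f'{g["type"]}/{g["name"]}'] = absent
    return report


if __name__ == "__main__":
    groups = parse_groups(SAMPLE)
    print([g["name"] for g in groups if g["state"] != "none"])
    print(missing_members(groups, [1, 2, 3, 4, 5]))
```

A node that is absent from a group is not necessarily a fault. For example, a node that has not mounted the mygfs file system would not be expected to appear in the mygfs groups, so the report is a starting point for investigation rather than a list of errors.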