Chapter 9. Troubleshooting
The following is a list of some problems you may see regarding the configuration of fence devices as well as some suggestions for how to address these problems.
- If your system does not fence a node automatically, you can try to fence the node from the command line using the fence_node command, as described at the end of each of the fencing configuration procedures. The fence_node command performs I/O fencing on a single node by reading the fencing settings from the cluster.conf file for the given node and then running the configured fencing agent against the node. For example, the following command fences node clusternode1.example.com:
  # /sbin/fence_node clusternode1.example.com
  If the fence_node command is unsuccessful, you may have made an error in defining the fence device configuration. To determine whether the fencing agent itself is able to talk to the fencing device, you can execute the I/O fencing command for your fence device directly from the command line. As a first step, you can execute the fencing agent with the -o status option specified. For example, if you are using an APC switch as your fence device, you can execute a command such as the following:
  # /sbin/fence_apc -a (ipaddress) -l (login) ... -o status -v
  You can also use the I/O fencing command for your device to fence the node. For example, for an HP ILO device, you can issue the following command:
  # /sbin/fence_ilo -a myilo -l login -p passwd -o off -v
- Check the version of firmware you are using in your fence device. You may want to consider upgrading your firmware. You may also want to search Bugzilla to see if there are any known issues with your level of firmware.
- If a node in your cluster is repeatedly getting fenced, it means that one of the nodes in your cluster is not seeing enough "heartbeat" network messages from the node that is getting fenced. Most of the time, this is a result of flaky or faulty hardware, such as bad cables or bad ports on the network hub or switch. Test your communications paths thoroughly without the cluster software running to make sure your hardware is working correctly.
- If a node in your cluster is repeatedly getting fenced right at startup, it may be due to system activities that occur when a node joins a cluster. If your network is busy, your cluster may decide it is not getting enough heartbeat packets. To address this, you may have to increase the post_join_delay setting in your cluster.conf file. This delay is a grace period that gives the node more time to join the cluster.
  In the following example, the fence_daemon entry in the cluster configuration file shows a post_join_delay setting that has been increased to 600.
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="600">
- If a node fails while the fenced daemon is not running, it will not be fenced. Problems will result if the fenced daemon is killed or exits while the node is using GFS. If the fenced daemon exits, it should be restarted.
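The last two checks above can be scripted. The following is a minimal sketch, not part of any cluster tooling: the function names are hypothetical, the default path /etc/cluster/cluster.conf is an assumption about where the configuration file lives, and the sed pattern assumes the post_join_delay attribute is on the same line as the opening of the fence_daemon element.

```shell
#!/bin/sh
# Hypothetical helpers; the names and the default path are assumptions.

# Print the post_join_delay value found in a cluster configuration file.
# Prints nothing if the attribute is absent. Assumes the attribute is on
# the same line as the opening of the fence_daemon element.
get_post_join_delay() {
    conf_file=${1:-/etc/cluster/cluster.conf}
    sed -n 's/.*<fence_daemon[^>]*post_join_delay="\([0-9]*\)".*/\1/p' \
        "$conf_file" | head -n 1
}

# Return 0 if a process named exactly "fenced" is running, non-zero otherwise.
fenced_is_running() {
    pgrep -x fenced >/dev/null 2>&1
}
```

Run against a file containing the fence_daemon entry shown above, get_post_join_delay prints 600. A check such as `fenced_is_running || echo "fenced is not running"` could then be run on each node before investigating further.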
If you find that you are seeing error messages when you try to configure your system, or if after configuration your system does not behave as expected, you can perform the following checks and examine the following areas.
- Connect to one of the nodes in the cluster and execute the clustat(8) command. This command runs a utility that displays the status of the cluster. It shows membership information, quorum view, and the state of all configured user services.
  In the example output for this command (not reproduced here), clusternode4 is the local node since it is the host from which the command was run. If rgmanager does not appear in the Status category, it could indicate that cluster services are not running on the node.
- Connect to one of the nodes in the cluster and execute the group_tool(8) command. This command provides information that you may find helpful in debugging your system.
  In the output of the group_tool(8) command, the state of the group should be none. The numbers in the brackets are the node ID numbers of the cluster nodes in the group. The clustat command shows which node IDs are associated with which nodes. If you do not see a node number in the group, that node is not a member of that group. For example, if a node ID is not in the dlm/rgmanager group, it is not using the rgmanager dlm lock space (and probably is not running rgmanager).
  The level of a group indicates the recovery ordering. 0 is recovered first, 1 is recovered second, and so forth.
- Connect to one of the nodes in the cluster and execute the cman_tool nodes -f command. This command provides information about the cluster nodes that you may want to look at.
  In the output of the cman_tool nodes -f command, the Sts heading indicates the status of a node. A status of M indicates the node is a member of the cluster. A status of X indicates that the node is dead. The Inc heading indicates the incarnation number of a node, which is for debugging purposes only.
- Check whether the cluster.conf file is identical on each node of the cluster. If you configure your system with Conga, as in the example provided in this document, these files should be identical, but one of the files may have accidentally been deleted or altered.
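The final check in the list above, comparing cluster.conf across nodes, can be sketched as a small shell function. This is an illustration under stated assumptions: the function name is hypothetical, and it assumes you have already copied each node's cluster.conf to a local directory (for example with scp) under file names of your choosing.

```shell
#!/bin/sh
# Hypothetical helper: compare local copies of cluster.conf gathered from
# each node. The first file is the reference; every other file is compared
# with it byte for byte. Returns 0 only if all copies match.
conf_files_identical() {
    reference=$1
    shift
    status=0
    for candidate in "$@"; do
        if ! cmp -s "$reference" "$candidate"; then
            echo "DIFFERS: $candidate (compared with $reference)"
            status=1
        fi
    done
    if [ "$status" -eq 0 ]; then
        echo "all copies match $reference"
    fi
    return "$status"
}
```

For example, with copies saved as node1.conf, node2.conf, and node3.conf (hypothetical names), `conf_files_identical node1.conf node2.conf node3.conf` lists each copy that differs from the first and returns a non-zero status if any does. A missing copy is also reported as differing, since cmp fails on a file it cannot read.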