Chapter 9. Diagnosing and Correcting Problems in a Cluster

9.1. Cluster Does Not Form
9.2. Nodes Unable to Rejoin Cluster after Fence or Reboot
9.3. Cluster Services Hang
9.4. Cluster Service Will Not Start
9.5. Cluster-Controlled Service Fails to Migrate
9.6. Each Node in a Two-Node Cluster Reports Second Node Down
9.7. Nodes are Fenced on LUN Path Failure
9.8. Quorum Disk Does Not Appear as Cluster Member
9.9. Unusual Failover Behavior
9.10. Fencing Occurs at Random
Cluster problems, by nature, can be difficult to troubleshoot, because a cluster of systems is more complex to diagnose than a single system. However, there are common issues that system administrators are likely to encounter when deploying or administering a cluster. Understanding how to tackle those common issues can make deploying and administering a cluster much easier.
This chapter provides information about some common cluster issues and how to troubleshoot them. Additional help can be found in our knowledge base and by contacting an authorized Red Hat support representative. If your issue is related to the GFS2 file system specifically, you can find information about troubleshooting common GFS2 issues in the Global File System 2 document.

9.1. Cluster Does Not Form

If you find you are having trouble getting a new cluster to form, check for the following things:
  • Make sure you have name resolution set up correctly. The cluster node name in the cluster.conf file should correspond to the name used to resolve that node's address over the network that the cluster will be using to communicate. For example, if your cluster's node names are nodea and nodeb, make sure both nodes have entries in the /etc/cluster/cluster.conf file and the /etc/hosts file that match those names.
  • If the cluster uses multicast for communication between nodes, make sure that multicast traffic is not being blocked, delayed, or otherwise interfered with on the network that the cluster is using to communicate. Note that some Cisco switches have features that may cause delays in multicast traffic.
  • Use telnet or SSH to verify whether you can reach remote nodes.
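  • Execute the ethtool eth1 | grep -i link command to check whether the Ethernet link is up, replacing eth1 with the interface that carries cluster traffic. The -i option is needed because ethtool reports the state on a line beginning with "Link detected".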
  • Use the tcpdump command at each node to check the network traffic.
  • Ensure that you do not have firewall rules blocking communication between your nodes.
  • Ensure that the interfaces you are passing cluster traffic over are not using any bonding mode other than 1 and are not using VLAN tagging.
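
The name resolution check in the first item above can be partly automated. The following is a minimal sketch, not a supported tool: it assumes that node names appear as the name attribute of clusternode elements in cluster.conf, and it looks only at hosts-file text, so names that resolve through DNS alone would be reported as missing. The function names and the sample data are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET


def cluster_node_names(cluster_conf_xml):
    """Return the name attribute of every <clusternode> element."""
    root = ET.fromstring(cluster_conf_xml)
    return [node.get("name") for node in root.iter("clusternode")]


def hosts_names(hosts_text):
    """Return the set of host names and aliases found in hosts-file text."""
    names = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields.
        fields = line.split("#", 1)[0].split()
        # fields[0] is the IP address; the remaining fields are names.
        names.update(fields[1:])
    return names


def missing_from_hosts(cluster_conf_xml, hosts_text):
    """List cluster node names that have no entry in the hosts-file text."""
    known = hosts_names(hosts_text)
    return [n for n in cluster_node_names(cluster_conf_xml) if n not in known]


# Hypothetical sample data: nodeb is deliberately absent from the hosts text.
SAMPLE_CONF = """<cluster name="example" config_version="1">
  <clusternodes>
    <clusternode name="nodea" nodeid="1"/>
    <clusternode name="nodeb" nodeid="2"/>
  </clusternodes>
</cluster>"""

SAMPLE_HOSTS = """127.0.0.1   localhost
192.168.1.11  nodea   # first cluster node
"""

print(missing_from_hosts(SAMPLE_CONF, SAMPLE_HOSTS))  # prints ['nodeb']
```

To run the check on a real node, pass the contents of /etc/cluster/cluster.conf and /etc/hosts to missing_from_hosts in place of the sample strings. An empty list means every configured node name has a matching hosts entry on that node; repeat the check on each node, since the files may differ between them.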