17.7. Troubleshooting Nagios
17.7.1. Troubleshooting NSCA and NRPE Configuration Issues
- Check Firewall and Port Settings on Nagios Server
If port 5667 is not open on the server host's firewall, a timeout error is displayed. Ensure that port 5667 is open.
On Red Hat Gluster Storage based on Red Hat Enterprise Linux 6:
- Log in as root and run the following command on the Red Hat Gluster Storage node to get the list of current iptables rules:
# iptables -L
- If the port is open, the output includes a line similar to the following:
ACCEPT tcp -- anywhere anywhere tcp dpt:5667
On Red Hat Gluster Storage based on Red Hat Enterprise Linux 7:
- Run the following command on the Red Hat Gluster Storage node as root to get a listing of the current firewall rules:
# firewall-cmd --list-all-zones
- If the port is open, 5667/tcp is listed beside ports: under one or more zones in your output.
- If the port is not open, add a firewall rule for the port:
On Red Hat Gluster Storage based on Red Hat Enterprise Linux 6:
- Add an iptables rule by adding the following line to the /etc/sysconfig/iptables file:
-A INPUT -m state --state NEW -m tcp -p tcp --dport 5667 -j ACCEPT
- Restart the iptables service using the following command:
# service iptables restart
- Restart the NSCA service using the following command:
# service nsca restart
On Red Hat Gluster Storage based on Red Hat Enterprise Linux 7:
- Run the following commands to open the port:
# firewall-cmd --zone=public --add-port=5667/tcp
# firewall-cmd --zone=public --add-port=5667/tcp --permanent
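The Red Hat Enterprise Linux 7 check described above can be scripted. The following is a minimal sketch rather than a supported tool: the function name port_listed is hypothetical, and it assumes that open ports appear on a line beginning with ports: in the firewall-cmd output, as described in this section.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if PORT/tcp appears on a "ports:" line
# of firewall-cmd output read from standard input.
port_listed() {
    local port="$1"
    # Keep only lines that start with "ports:" (this skips "source-ports:"
    # and "forward-ports:"), then look for the exact PORT/tcp token.
    grep -E '^[[:space:]]*ports:' | grep -qw -- "${port}/tcp"
}

# Example usage on the Nagios server (run as root):
#   firewall-cmd --list-all-zones | port_listed 5667 && echo "5667/tcp is open"
```

The function reads from standard input, so it can also be run against saved output when the firewall cannot be queried directly.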
- Check the Configuration File on Red Hat Gluster Storage Node
Messages cannot be sent to the NSCA server if the Nagios server IP address or FQDN, the cluster name, and the host name (as configured on the Nagios server) are not configured correctly. Open the Nagios server configuration file /etc/nagios/nagios_server.conf and verify that the correct values are set, as shown below:
# NAGIOS SERVER
# The nagios server IP address or FQDN to which the NSCA command
# needs to be sent
[NAGIOS-SERVER]
nagios_server=NagiosServerIPAddress

# CLUSTER NAME
# The host name of the logical cluster configured in Nagios under which
# the gluster volume services reside
[NAGIOS-DEFINTIONS]
cluster_name=cluster_auto

# LOCAL HOST NAME
# Host name given in the nagios server
[HOST-NAME]
hostname_in_nagios=NagiosServerHostName
If the host name is updated, restart the NSCA service using the following command:
# service nsca restart
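To review the three settings quickly, the values can be extracted from the file with a small script. This is a minimal sketch: the function name conf_value is hypothetical, and it assumes the simple key=value layout shown above.

```shell
#!/bin/bash
# Hypothetical helper: print the value assigned to KEY in a key=value file.
conf_value() {
    local key="$1" file="$2"
    awk -F= -v k="$key" '
        /^[ \t]*#/ { next }              # skip comment lines
        {
            name = $1
            gsub(/[ \t]/, "", name)      # ignore spaces around the key
            if (name == k) {
                sub(/^[^=]*=[ \t]*/, "") # drop everything up to the first "="
                print
                exit
            }
        }' "$file"
}

# Example usage on a Red Hat Gluster Storage node:
#   for key in nagios_server cluster_name hostname_in_nagios; do
#       printf '%s=%s\n' "$key" "$(conf_value "$key" /etc/nagios/nagios_server.conf)"
#   done
```

Compare the printed values with the host and cluster names configured on the Nagios server. The script only reads the file and does not validate the values.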
- CHECK_NRPE: Error - Could Not Complete SSL Handshake
This error occurs if the IP address of the Nagios server is not defined in the nrpe.cfg file of the Red Hat Gluster Storage node. To fix this issue, follow the steps given below:
- Add the Nagios server IP address to the allowed_hosts line in the /etc/nagios/nrpe.cfg file, as shown below:
allowed_hosts=127.0.0.1, NagiosServerIP
The allowed_hosts setting is the list of IP addresses that can execute NRPE commands.
- Save the nrpe.cfg file and restart the NRPE service using the following command:
# service nrpe restart
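Whether a given address is already present in allowed_hosts can be checked before editing the file. The following is a minimal sketch: the function name host_allowed is hypothetical, and it compares entries literally, so it does not recognize network ranges or host names that resolve to the address.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if IP is a literal entry in the
# allowed_hosts line of FILE.
host_allowed() {
    local ip="$1" file="$2"
    grep -E '^[[:space:]]*allowed_hosts[[:space:]]*=' "$file" \
        | sed -e 's/^[^=]*=//' \
        | tr ',' '\n' \
        | tr -d ' \t' \
        | grep -Fxq -- "$ip"   # exact, whole-entry match
}

# Example usage on a Red Hat Gluster Storage node:
#   host_allowed NagiosServerIP /etc/nagios/nrpe.cfg || echo "not in allowed_hosts"
```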
- CHECK_NRPE: Socket Timeout After n Seconds
To resolve this issue, perform the steps given below:
On Nagios Server:
The default timeout value for NRPE calls is 10 seconds. If the server does not respond within 10 seconds, the Nagios Server GUI displays an error stating that the NRPE call timed out after 10 seconds. To fix this issue, change the timeout value for NRPE calls by modifying the command definition configuration files.
- Changing the NRPE timeout for services which directly invoke check_nrpe.
For the services which directly invoke check_nrpe (check_disk_and_inode, check_cpu_multicore, and check_memory), modify the command definition configuration file /etc/nagios/gluster/gluster-commands.cfg by adding -t TimeInSeconds as shown below:
define command {
    command_name check_disk_and_inode
    command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c check_disk_and_inode -t TimeInSeconds
}
- Changing the NRPE timeout for the services in the nagios-server-addons package which invoke NRPE calls through code.
The services which invoke /usr/lib64/nagios/plugins/gluster/check_vol_server.py (check_vol_utilization, check_vol_status, check_vol_quota_status, check_vol_heal_status, and check_vol_georep_status) make NRPE calls to the Red Hat Gluster Storage nodes for the details through code. To change the timeout for the NRPE calls, modify the command definition configuration file /etc/nagios/gluster/gluster-commands.cfg by adding -t TimeInSeconds as shown below:
define command {
    command_name check_vol_utilization
    command_line $USER1$/gluster/check_vol_server.py $ARG1$ $ARG2$ -w $ARG3$ -c $ARG4$ -o utilization -t TimeInSeconds
}
The auto configuration service gluster_auto_discovery makes NRPE calls for the configuration details from the Red Hat Gluster Storage nodes. To change the NRPE timeout value for the auto configuration service, modify the command definition configuration file /etc/nagios/gluster/gluster-commands.cfg by adding -t TimeInSeconds as shown below:
define command {
    command_name gluster_auto_discovery
    command_line sudo $USER1$/gluster/configure-gluster-nagios.py -H $ARG1$ -c $HOSTNAME$ -m auto -n $ARG2$ -t TimeInSeconds
}
- Restart the Nagios service using the following command:
# service nagios restart
On Red Hat Gluster Storage node:
- Add the Nagios server IP address as described in the CHECK_NRPE: Error - Could Not Complete SSL Handshake item of the Troubleshooting NSCA and NRPE Configuration Issues section.
- Edit the nrpe.cfg file using the following command:
# vi /etc/nagios/nrpe.cfg
- Search for the command_timeout and connection_timeout settings and change their values. The command_timeout value must be greater than or equal to the timeout value set on the Nagios server. For example, the timeouts can be set to connection_timeout=300 and command_timeout=60 (both in seconds).
- Restart the NRPE service using the following command:
# service nrpe restart
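The rule that command_timeout on the node must be greater than or equal to the timeout set on the Nagios server can be verified with a short script. This is a minimal sketch: the function name timeout_ok is hypothetical, and the server-side value is passed in by hand because it lives in a different file on a different host.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if command_timeout in FILE is greater than
# or equal to SERVER_TIMEOUT (the -t value used on the Nagios server).
# Returns 2 if command_timeout is missing or not numeric.
timeout_ok() {
    local server_timeout="$1" file="$2" node_timeout
    node_timeout=$(sed -n \
        's/^[[:space:]]*command_timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
        "$file" | head -n 1)
    [ -n "$node_timeout" ] || return 2
    [ "$node_timeout" -ge "$server_timeout" ]
}

# Example usage on a Red Hat Gluster Storage node, assuming the Nagios
# server commands were changed to use -t 60:
#   timeout_ok 60 /etc/nagios/nrpe.cfg || echo "command_timeout is too low or not set"
```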
- Check the NRPE Service Status
This error occurs if the NRPE service is not running. To resolve this issue, perform the steps given below:
- Log in as root to the Red Hat Gluster Storage node and run the following command to verify the status of NRPE service:
# service nrpe status
- If NRPE is not running, start the service using the following command:
# service nrpe start
- Check Firewall and Port Settings
This error is associated with firewalls and ports. The timeout error is displayed if NRPE traffic cannot traverse a firewall, or if port 5666 is not open on the Red Hat Gluster Storage node. Ensure that port 5666 is open on the Red Hat Gluster Storage node.
- Run the check_nrpe command from the Nagios server to verify that the port is open and that NRPE is running on the Red Hat Gluster Storage node.
- Log in to the Nagios server as root and run the following command:
# /usr/lib64/nagios/plugins/check_nrpe -H RedHatStorageNodeIP
- If the port is open and NRPE is running, the output is similar to the following:
NRPE v2.14
If not, ensure that port 5666 is open on the Red Hat Gluster Storage node.
On Red Hat Gluster Storage based on Red Hat Enterprise Linux 6:
- Run the following command on the Red Hat Gluster Storage node as root to get a listing of the current iptables rules:
# iptables -L
- If the port is open, the following appears in your output.
ACCEPT tcp -- anywhere anywhere tcp dpt:5666
On Red Hat Gluster Storage based on Red Hat Enterprise Linux 7:
- Run the following command on the Red Hat Gluster Storage node as root to get a listing of the current firewall rules:
# firewall-cmd --list-all-zones
- If the port is open, 5666/tcp is listed beside ports: under one or more zones in your output.
- If the port is not open, add a firewall rule for the port.
On Red Hat Gluster Storage based on Red Hat Enterprise Linux 6:
- To add an iptables rule, edit the iptables file as shown below:
# vi /etc/sysconfig/iptables
- Add the following line in the file:
-A INPUT -m state --state NEW -m tcp -p tcp --dport 5666 -j ACCEPT
- Save the file and restart the iptables service using the following command:
# service iptables restart
- Restart the NRPE service:
# service nrpe restart
On Red Hat Gluster Storage based on Red Hat Enterprise Linux 7:
- Run the following commands to open the port:
# firewall-cmd --zone=public --add-port=5666/tcp
# firewall-cmd --zone=public --add-port=5666/tcp --permanent
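For the Red Hat Enterprise Linux 6 check described above, the iptables listing can be scanned for an ACCEPT rule instead of being read by eye. The following is a minimal sketch: the function name iptables_accepts is hypothetical, and it assumes the listing format shown in this section, where the rule ends in tcp dpt:PORT. Using the -n option is suggested here because, without it, iptables may print a service name in place of the port number when the port has an entry in /etc/services.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the iptables listing on standard input
# contains an ACCEPT rule whose match is "tcp dpt:PORT".
iptables_accepts() {
    local port="$1"
    awk -v want="dpt:${port}" '
        $1 == "ACCEPT" {
            # Look for the "tcp dpt:PORT" pair anywhere in the rule.
            for (i = 2; i <= NF; i++)
                if ($i == want && $(i - 1) == "tcp") found = 1
        }
        END { exit (found ? 0 : 1) }'
}

# Example usage on a Red Hat Gluster Storage node (run as root):
#   iptables -L INPUT -n | iptables_accepts 5666 || echo "no ACCEPT rule for 5666/tcp"
```

The script only reports whether a matching ACCEPT rule exists. It does not account for rule order, so an earlier REJECT or DROP rule can still block the traffic.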
- Checking Port 5666 From the Nagios Server with Telnet
Use telnet to verify that port 5666 on the Red Hat Gluster Storage node is reachable. Perform the steps given below:
- Log in as root on Nagios server.
- Test the connection on port 5666 from the Nagios server to the Red Hat Gluster Storage node using the following command:
# telnet RedHatStorageNodeIP 5666
- The output displayed is similar to:
telnet 10.70.36.49 5666
Trying 10.70.36.49...
Connected to 10.70.36.49.
Escape character is '^]'.
- Connection Refused By Host
This error is due to port or firewall issues, or to incorrectly configured allowed_hosts directives. See the sections CHECK_NRPE: Error - Could Not Complete SSL Handshake and CHECK_NRPE: Socket Timeout After n Seconds for troubleshooting steps.