Chapter 3. Troubleshooting networking issues
This chapter lists basic troubleshooting procedures connected with networking and chrony for Network Time Protocol (NTP).
Prerequisites
- A running Red Hat Ceph Storage cluster.
3.1. Basic networking troubleshooting
Red Hat Ceph Storage depends heavily on a reliable network connection. Red Hat Ceph Storage nodes use the network to communicate with each other. Networking issues can cause many problems with Ceph OSDs, such as flapping or being incorrectly reported as down. Networking issues can also cause Ceph Monitor clock skew errors. In addition, packet loss, high latency, or limited bandwidth can degrade cluster performance and stability.
Prerequisites
- Root-level access to the node.
Procedure
Installing the net-tools and telnet packages can help when troubleshooting network issues that can occur in a Ceph storage cluster:
Example
[root@host01 ~]# dnf install net-tools
[root@host01 ~]# dnf install telnet
Log into the cephadm shell and verify that the public_network parameters in the Ceph configuration file include the correct values:
Example
[ceph: root@host01 /]# cat /etc/ceph/ceph.conf
# minimal ceph.conf for 57bddb48-ee04-11eb-9962-001a4a000672
[global]
fsid = 57bddb48-ee04-11eb-9962-001a4a000672
mon_host = [v2:10.74.249.26:3300/0,v1:10.74.249.26:6789/0] [v2:10.74.249.163:3300/0,v1:10.74.249.163:6789/0] [v2:10.74.254.129:3300/0,v1:10.74.254.129:6789/0]
[mon.host01]
public network = 10.74.248.0/21
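One way to confirm that the public network value is correct is to check that each node address falls inside it. The following is a minimal sketch for IPv4 only; the helper names ip_to_int and in_cidr are made up for this illustration and are not part of Ceph. The addresses in the usage line come from the sample configuration above.

```shell
# Hypothetical helpers (not part of Ceph): test whether an IPv4 address
# belongs to a CIDR network such as the public network in ceph.conf.

ip_to_int() {
    # Split the dotted quad on "." and pack the four octets into one integer.
    _old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$_old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

in_cidr() {
    # Usage: in_cidr IP NETWORK/PREFIX
    # Returns 0 when IP is inside the network, 1 otherwise.
    _ip=$(ip_to_int "$1")
    _net=$(ip_to_int "${2%/*}")
    _prefix=${2#*/}
    _mask=$(( (0xFFFFFFFF << (32 - _prefix)) & 0xFFFFFFFF ))
    [ $(( _ip & _mask )) -eq $(( _net & _mask )) ]
}
```

For example, in_cidr 10.74.249.26 10.74.248.0/21 && echo "inside public network" reports whether the first Monitor address belongs to the configured range.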
Exit the shell and verify that the network interfaces are up:
Example
[root@host01 ~]# ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:1a:4a:00:06:72 brd ff:ff:ff:ff:ff:ff
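On nodes with many interfaces, it can be easier to list only the interfaces that are not up. The following sketch assumes the one-line output format of ip -o link show; the function name links_not_up is hypothetical. The loopback interface is skipped because it normally reports the state UNKNOWN, as in the example above.

```shell
# Hypothetical helper: read `ip -o link show` output on stdin and print
# every non-loopback interface whose operational state is not UP.
links_not_up() {
    awk '{
        name = $2
        sub(/:$/, "", name)              # "ens3:" -> "ens3"
        state = "UNKNOWN"
        for (i = 1; i < NF; i++)
            if ($i == "state") state = $(i + 1)
        if (name != "lo" && state != "UP")
            print name, state
    }'
}
```

A possible invocation is ip -o link show | links_not_up, which prints nothing when every interface is up.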
Verify that the Ceph nodes are able to reach each other using their short host names. Verify this on each node in the storage cluster:
Syntax
ping SHORT_HOST_NAME
Example
[root@host01 ~]# ping host02
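Because the check has to be repeated for every node, a small loop can save time. This is a sketch only: the function name check_hosts and the PING_CMD variable are invented for this example, and the host names in the usage line are placeholders for your own short host names.

```shell
# Hypothetical helper: ping each host given as an argument and report
# which ones do not answer. PING_CMD is an illustrative override so the
# probe command can be swapped out; it is not a standard variable.
PING_CMD=${PING_CMD:-"ping -c 3 -W 2"}

check_hosts() {
    _failed=0
    for _host in "$@"; do
        if $PING_CMD "$_host" >/dev/null 2>&1; then
            echo "OK   $_host"
        else
            echo "FAIL $_host"
            _failed=$((_failed + 1))
        fi
    done
    # Succeed only when every host answered.
    [ "$_failed" -eq 0 ]
}
```

For example, run check_hosts host01 host02 host03 on each node in turn.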
If you use a firewall, ensure that Ceph nodes are able to reach each other on their appropriate ports. The firewall-cmd tool shows the port status in the firewall configuration, and the telnet tool tests whether a port is actually reachable:
Syntax
firewall-cmd --info-zone=ZONE
telnet IP_ADDRESS PORT
Example
[root@host01 ~]# firewall-cmd --info-zone=public
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: ens3
  sources:
  services: ceph ceph-mon cockpit dhcpv6-client ssh
  ports: 9283/tcp 8443/tcp 9093/tcp 9094/tcp 3000/tcp 9100/tcp 9095/tcp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:
[root@host01 ~]# telnet 192.168.0.22 9100
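Because telnet is interactive, a scripted port test can be more convenient when several ports or nodes need checking. The sketch below is one possible alternative, not a Red Hat procedure: it assumes bash with /dev/tcp support and the coreutils timeout command, and the function names port_open and check_ports are hypothetical.

```shell
# Hypothetical helper: non-interactive alternative to telnet for testing
# whether a TCP port accepts connections. Relies on bash's /dev/tcp
# redirection and the coreutils timeout command.
port_open() {
    # Usage: port_open HOST PORT ; returns 0 when the connection succeeds.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

check_ports() {
    # Usage: check_ports HOST PORT...
    _host=$1
    shift
    for _port in "$@"; do
        if port_open "$_host" "$_port"; then
            echo "$_host:$_port open"
        else
            echo "$_host:$_port closed or filtered"
        fi
    done
}
```

For example, check_ports 10.74.249.163 3300 6789 tests the two Monitor ports that appear in the mon_host line of the sample configuration file.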
Verify that there are no errors on the interface counters. Verify that the network connectivity between nodes has expected latency, and that there is no packet loss.
Using the ethtool command:
Syntax
ethtool -S INTERFACE
Example
[root@host01 ~]# ethtool -S ens3 | grep errors
NIC statistics:
     rx_fcs_errors: 0
     rx_align_errors: 0
     rx_frame_too_long_errors: 0
     rx_in_length_errors: 0
     rx_out_length_errors: 0
     tx_mac_errors: 0
     tx_carrier_sense_errors: 0
     tx_errors: 0
     rx_errors: 0
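Drivers can expose a long list of counters, most of them zero. As a sketch, the following filter keeps only the error or drop counters with a non-zero value. The function name nonzero_counters is hypothetical, and counter names vary by driver, so the err and drop match is an assumption rather than a complete list.

```shell
# Hypothetical helper: read `ethtool -S INTERFACE` output on stdin and
# print only the error or drop counters whose value is greater than zero.
nonzero_counters() {
    awk -F': *' '
        tolower($1) ~ /err|drop/ && $2 + 0 > 0 {
            name = $1
            sub(/^[ \t]+/, "", name)     # strip leading indentation
            print name ": " $2
        }'
}
```

A possible invocation is ethtool -S ens3 | nonzero_counters. No output means that none of the matched counters has incremented.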
Using the ifconfig command:
Example
[root@host01 ~]# ifconfig
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.74.249.26  netmask 255.255.248.0  broadcast 10.74.255.255
        inet6 fe80::21a:4aff:fe00:672  prefixlen 64  scopeid 0x20<link>
        inet6 2620:52:0:4af8:21a:4aff:fe00:672  prefixlen 64  scopeid 0x0<global>
        ether 00:1a:4a:00:06:72  txqueuelen 1000  (Ethernet)
        RX packets 150549316  bytes 56759897541 (52.8 GiB)
        RX errors 0  dropped 176924  overruns 0  frame 0
        TX packets 55584046  bytes 62111365424 (57.8 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 9373290  bytes 16044697815 (14.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 9373290  bytes 16044697815 (14.9 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
Using the netstat command:
Example
[root@host01 ~]# netstat -ai
Kernel Interface table
Iface   MTU    RX-OK      RX-ERR  RX-DRP  RX-OVR  TX-OK      TX-ERR  TX-DRP  TX-OVR  Flg
ens3    1500   311847720  0       364903  0       114341918  0       0       0       BMRU
lo      65536  19577001   0       0       0       19577001   0       0       0       LRU
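A raw drop counter is easier to judge relative to the traffic volume. In the netstat example above, ens3 dropped 364903 of 311847720 received packets, which is roughly 0.1 percent. The following sketch computes that ratio for each interface; the function name rx_drop_percent is hypothetical, and the code assumes the column headers shown above.

```shell
# Hypothetical helper: read `netstat -i` output on stdin and print, per
# interface, the share of received packets that were dropped.
# Column positions are looked up from the header line rather than assumed.
rx_drop_percent() {
    awk '
        $1 == "Iface" {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("RX-OK" in col) && $(col["RX-OK"]) + 0 > 0 {
            printf "%s %.4f%%\n", $1, 100 * $(col["RX-DRP"]) / $(col["RX-OK"])
        }'
}
```

A possible invocation is netstat -ai | rx_drop_percent. Interfaces that have received no packets are skipped to avoid dividing by zero.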
For performance issues, in addition to the latency checks, use the iperf3 tool to verify the network bandwidth between all nodes of the storage cluster. The iperf3 tool does a simple point-to-point network bandwidth test between a server and a client.
Install the iperf3 package on the Red Hat Ceph Storage nodes on which you want to check the bandwidth:
Example
[root@host01 ~]# dnf install iperf3
On a Red Hat Ceph Storage node, start the iperf3 server:
Example
[root@host01 ~]# iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Note
The default port is 5201, but it can be set using the -p command argument.
On a different Red Hat Ceph Storage node, start the iperf3 client:
Example
[root@host02 ~]# iperf3 -c mon
Connecting to host mon, port 5201
[  4] local xx.x.xxx.xx port 52270 connected to xx.x.xxx.xx port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   114 MBytes   954 Mbits/sec    0    409 KBytes
[  4]   1.00-2.00   sec   113 MBytes   945 Mbits/sec    0    409 KBytes
[  4]   2.00-3.00   sec   112 MBytes   943 Mbits/sec    0    454 KBytes
[  4]   3.00-4.00   sec   112 MBytes   941 Mbits/sec    0    471 KBytes
[  4]   4.00-5.00   sec   112 MBytes   940 Mbits/sec    0    471 KBytes
[  4]   5.00-6.00   sec   113 MBytes   945 Mbits/sec    0    471 KBytes
[  4]   6.00-7.00   sec   112 MBytes   937 Mbits/sec    0    488 KBytes
[  4]   7.00-8.00   sec   113 MBytes   947 Mbits/sec    0    520 KBytes
[  4]   8.00-9.00   sec   112 MBytes   939 Mbits/sec    0    520 KBytes
[  4]   9.00-10.00  sec   112 MBytes   939 Mbits/sec    0    520 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec                  receiver

iperf Done.
This output shows a network bandwidth of approximately 940 Mbits/second between the Red Hat Ceph Storage nodes (1.10 GBytes transferred in 10 seconds), with no retransmissions (Retr) during the test.
Red Hat recommends you validate the network bandwidth between all the nodes in the storage cluster.
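When the test is repeated between many pairs of nodes, only the summary lines are usually of interest. The following sketch extracts the sender and receiver rates from the plain-text client output; the function name iperf_summary is hypothetical, and the parsing assumes the output layout shown in the example above.

```shell
# Hypothetical helper: read iperf3 client text output on stdin and print
# the rate from the sender and receiver summary lines.
iperf_summary() {
    awk '
        $NF == "sender" || $NF == "receiver" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /bits\/sec$/) print $NF, $(i - 1), $i
        }'
}
```

A possible invocation is iperf3 -c mon | iperf_summary, where mon is the server host name used in the example above.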
Ensure that all nodes have the same network interconnect speed. Nodes attached at a slower speed might slow down the nodes with faster connections. Also, ensure that the inter-switch links can handle the aggregated bandwidth of the attached nodes:
Syntax
ethtool INTERFACE
Example
[root@host01 ~]# ethtool ens3
Settings for ens3:
    Supported ports: [ TP ]
    Supported link modes:   10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Half 1000baseT/Full
    Supported pause frame use: No
    Supports auto-negotiation: Yes
    Supported FEC modes: Not reported
    Advertised link modes:  10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Half 1000baseT/Full
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported
    Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                         100baseT/Half 100baseT/Full
                                         1000baseT/Full
    Link partner advertised pause frame use: Symmetric
    Link partner advertised auto-negotiation: Yes
    Link partner advertised FEC modes: Not reported
    Speed: 1000Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: on
    MDI-X: off
    Supports Wake-on: g
    Wake-on: d
    Current message level: 0x000000ff (255)
                           drv probe link timer ifdown ifup rx_err tx_err
    Link detected: yes
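The fields that matter for this comparison are Speed, Duplex, and Link detected. As a sketch, the following filter condenses the ethtool output to one line per interface so that values collected from several nodes can be compared at a glance. The function name link_summary is hypothetical.

```shell
# Hypothetical helper: read `ethtool INTERFACE` output on stdin and print
# the negotiated speed, duplex mode, and link state on one line, so the
# values from several nodes can be compared side by side.
link_summary() {
    awk -F': *' '
        { key = $1; sub(/^[ \t]+/, "", key) }
        key == "Speed"         { speed = $2 }
        key == "Duplex"        { duplex = $2 }
        key == "Link detected" { link = $2 }
        END { print "speed=" speed, "duplex=" duplex, "link=" link }'
}
```

A possible invocation is ethtool ens3 | link_summary, run on every node and compared for differences.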
Additional Resources
- See the Basic Network troubleshooting solution on the Customer Portal for details.
- See the What is the "ethtool" command and how can I use it to obtain information about my network devices and interfaces for details.
- See the RHEL network interface dropping packets solutions on the Customer Portal for details.
- For details, see the What are the performance benchmarking tools available for Red Hat Ceph Storage? solution on the Customer Portal.
- For more information, see Knowledgebase articles and solutions related to troubleshooting networking issues on the Customer Portal.
3.2. Basic chrony NTP troubleshooting
This section includes basic chrony NTP troubleshooting steps.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor node.
Procedure
Verify that the chronyd daemon is running on the Ceph Monitor hosts:
Example
[root@mon ~]# systemctl status chronyd
If chronyd is not running, enable and start it:
Example
[root@mon ~]# systemctl enable chronyd
[root@mon ~]# systemctl start chronyd
Ensure that chronyd is synchronizing the clocks correctly:
Example
[root@mon ~]# chronyc sources
[root@mon ~]# chronyc sourcestats
[root@mon ~]# chronyc tracking
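The chronyc tracking output can also be checked by a script, for example from a monitoring job. The sketch below is illustrative only: the function name clock_ok is hypothetical, the parsing assumes the usual System time and Leap status lines of chronyc tracking, and the 0.05 second default threshold is an assumption chosen to resemble the Ceph Monitor clock drift allowance, so adjust it to match your configuration.

```shell
# Hypothetical helper: read `chronyc tracking` output on stdin and succeed
# only when the clock is synchronized and the system time offset is within
# the given number of seconds. The 0.05 default is an assumption; adjust it
# to match the clock drift allowance configured for your Ceph Monitors.
clock_ok() {
    awk -v max="${1:-0.05}" '
        /^System time/ { offset = $4 + 0; seen = 1 }
        /^Leap status/ && /Not synchronised/ { unsynced = 1 }
        END { exit ((seen && !unsynced && offset <= max) ? 0 : 1) }'
}
```

A possible invocation is chronyc tracking | clock_ok || echo "clock offset too large or not synchronized".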
Additional Resources
- See the How to troubleshoot chrony issues solution on the Red Hat Customer Portal for advanced chrony NTP troubleshooting steps.
- See the Clock skew section in the Red Hat Ceph Storage Troubleshooting Guide for further details.
- See the Checking if chrony is synchronized section for further details.