Chapter 3. Troubleshooting networking issues
This chapter provides basic troubleshooting procedures for issues with networking and the Network Time Protocol (NTP).
3.1. Prerequisites
- A running Red Hat Ceph Storage cluster.
3.2. Basic networking troubleshooting
Red Hat Ceph Storage depends heavily on a reliable network connection. Red Hat Ceph Storage nodes use the network to communicate with each other. Networking issues can cause many problems with Ceph OSDs, such as OSDs flapping or being incorrectly reported as down. Networking issues can also cause the Ceph Monitor's clock skew errors. In addition, packet loss, high latency, or limited bandwidth can impact cluster performance and stability.
Prerequisites
- Root-level access to the node.
Procedure
Installing the net-tools and telnet packages can help when troubleshooting network issues that can occur in a Ceph storage cluster:
Red Hat Enterprise Linux 7
[root@mon ~]# yum install net-tools
[root@mon ~]# yum install telnet
Red Hat Enterprise Linux 8
[root@mon ~]# dnf install net-tools
[root@mon ~]# dnf install telnet
Verify that the cluster_network and public_network parameters in the Ceph configuration file include the correct values:
Example
[root@mon ~]# cat /etc/ceph/ceph.conf | grep net
cluster_network = 192.168.1.0/24
public_network = 192.168.0.0/24
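To confirm that the node actually holds an address in each configured subnet, you can list the IPv4 addresses per interface. This is a minimal sketch; the awk field positions assume the default one-line ip -o output format:
[root@mon ~]# ip -4 -o addr show | awk '{print $2, $4}'
Compare the printed addresses against the cluster_network and public_network values from the configuration file.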
Verify that the network interfaces are up:
Example
[root@mon ~]# ip link list
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp22s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 40:f2:e9:b8:a0:48 brd ff:ff:ff:ff:ff:ff
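If an interface is listed with state DOWN, you can try bringing it up manually. A minimal sketch, assuming enp22s0f0 is the affected interface; substitute your own device name:
[root@mon ~]# ip link set enp22s0f0 up
If the link does not come up, check the cabling and the switch port configuration before continuing.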
Verify that the Ceph nodes are able to reach each other using their short host names. Check this on each node in the storage cluster:
Syntax
ping SHORT_HOST_NAME
Example
[root@mon ~]# ping osd01
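Inconsistent name resolution can make a node unreachable by its short host name even when the network itself is healthy. To check how a short name resolves, a minimal sketch in which osd01 is a placeholder host name:
[root@mon ~]# getent hosts osd01
Verify that the name resolves to the same address on every node, whether it comes from /etc/hosts or from DNS.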
If you use a firewall, ensure that Ceph nodes are able to reach each other on the appropriate ports. The firewall-cmd and telnet tools can validate the port status and whether the port is open, respectively:
Syntax
firewall-cmd --info-zone=ZONE
telnet IP_ADDRESS PORT
Example
[root@mon ~]# firewall-cmd --info-zone=public
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: enp1s0
  sources: 192.168.0.0/24
  services: ceph ceph-mon cockpit dhcpv6-client ssh
  ports: 9100/tcp 8443/tcp 9283/tcp 3000/tcp 9092/tcp 9093/tcp 9094/tcp 9094/udp
  protocols:
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:
[root@mon ~]# telnet 192.168.0.22 9100
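If a required service is missing from the zone, you can add it with firewall-cmd. A minimal sketch, assuming the public zone is the one bound to the Ceph network interfaces:
[root@mon ~]# firewall-cmd --zone=public --add-service=ceph-mon --permanent
[root@mon ~]# firewall-cmd --zone=public --add-service=ceph --permanent
[root@mon ~]# firewall-cmd --reload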
Verify that there are no errors on the interface counters, that the network connectivity between nodes has the expected latency, and that there is no packet loss. See the sketch after the tool examples below for a quick combined check.
Using the ethtool command:
Syntax
ethtool -S INTERFACE
Example
[root@mon ~]# ethtool -S enp22s0f0 | grep errors
NIC statistics:
    rx_fcs_errors: 0
    rx_align_errors: 0
    rx_frame_too_long_errors: 0
    rx_in_length_errors: 0
    rx_out_length_errors: 0
    tx_mac_errors: 0
    tx_carrier_sense_errors: 0
    tx_errors: 0
    rx_errors: 0
Using the ifconfig command:
Example
[root@mon ~]# ifconfig
enp22s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.8.222.13  netmask 255.255.254.0  broadcast 10.8.223.255
        inet6 2620:52:0:8de:42f2:e9ff:feb8:a048  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::42f2:e9ff:feb8:a048  prefixlen 64  scopeid 0x20<link>
        ether 40:f2:e9:b8:a0:48  txqueuelen 1000  (Ethernet)
        RX packets 4219130  bytes 2704255777 (2.5 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1418329  bytes 738664259 (704.4 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
        device interrupt 16
Using the netstat command:
Example
[root@mon ~]# netstat -ai
Kernel Interface table
Iface            MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
docker0         1500        0      0      0 0             0      0      0      0 BMU
eno2            1500        0      0      0 0             0      0      0      0 BMU
eno3            1500        0      0      0 0             0      0      0      0 BMU
eno4            1500        0      0      0 0             0      0      0      0 BMU
enp0s20u13u5    1500   253277      0      0 0             0      0      0      0 BMRU
enp22s0f0       9000   234160      0      0 0        432326      0      0      0 BMRU
lo             65536    10366      0      0 0         10366      0      0      0 LRU
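A quick combined check for latency, packet loss, interface error counters, and MTU consistency; a minimal sketch in which the host name osd01 and the interface name enp22s0f0 are placeholders:
[root@mon ~]# ping -c 10 osd01
[root@mon ~]# ethtool -S enp22s0f0 | grep -E 'error|drop' | grep -v ': 0$'
[root@mon ~]# ip -o link | awk '{print $2, $4, $5}'
The ping summary reports the packet loss percentage and round-trip times, the ethtool pipeline prints only counters that are not zero, and the ip pipeline lists each interface with its MTU so that mismatches (for example, 1500 on one node and 9000 on another) stand out.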
For performance issues, in addition to the latency checks, use the iperf3 tool to verify the network bandwidth between all nodes of the storage cluster. The iperf3 tool does a simple point-to-point network bandwidth test between a server and a client.
Install the iperf3 package on the Red Hat Ceph Storage nodes whose bandwidth you want to check:
Red Hat Enterprise Linux 7
[root@mon ~]# yum install iperf3
Red Hat Enterprise Linux 8
[root@mon ~]# dnf install iperf3
On a Red Hat Ceph Storage node, start the iperf3 server:
Example
[root@mon ~]# iperf3 -s
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Note: The default port is 5201, but it can be set using the -p command argument.
On a different Red Hat Ceph Storage node, start the iperf3 client:
Example
[root@osd ~]# iperf3 -c mon
Connecting to host mon, port 5201
[  4] local xx.x.xxx.xx port 52270 connected to xx.x.xxx.xx port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   114 MBytes   954 Mbits/sec    0    409 KBytes
[  4]   1.00-2.00   sec   113 MBytes   945 Mbits/sec    0    409 KBytes
[  4]   2.00-3.00   sec   112 MBytes   943 Mbits/sec    0    454 KBytes
[  4]   3.00-4.00   sec   112 MBytes   941 Mbits/sec    0    471 KBytes
[  4]   4.00-5.00   sec   112 MBytes   940 Mbits/sec    0    471 KBytes
[  4]   5.00-6.00   sec   113 MBytes   945 Mbits/sec    0    471 KBytes
[  4]   6.00-7.00   sec   112 MBytes   937 Mbits/sec    0    488 KBytes
[  4]   7.00-8.00   sec   113 MBytes   947 Mbits/sec    0    520 KBytes
[  4]   8.00-9.00   sec   112 MBytes   939 Mbits/sec    0    520 KBytes
[  4]   9.00-10.00  sec   112 MBytes   939 Mbits/sec    0    520 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec                  receiver

iperf Done.
This output shows a network bandwidth of approximately 941 Mbits/second between the Red Hat Ceph Storage nodes, with no retransmissions (Retr) during the test.
Red Hat recommends that you validate the network bandwidth between all the nodes in the storage cluster.
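A minimal sketch for testing each peer in turn, assuming an iperf3 server is already running on every node and that mon01, osd01, and osd02 are placeholder host names:
[root@mon ~]# for host in mon01 osd01 osd02; do echo "== $host =="; iperf3 -c "$host" -t 5; done
The -t 5 argument shortens each test to five seconds so that iterating over a large cluster stays manageable.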
Ensure that all nodes have the same network interconnect speed. Slower attached nodes might slow down the faster connected ones. Also, ensure that the inter-switch links can handle the aggregated bandwidth of the attached nodes:
Syntax
ethtool INTERFACE
Example
[root@mon ~]# ethtool enp22s0f0
Settings for enp22s0f0:
    Supported ports: [ TP ]
    Supported link modes:   10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Half 1000baseT/Full
    Supported pause frame use: No
    Supports auto-negotiation: Yes
    Supported FEC modes: Not reported
    Advertised link modes:  10baseT/Half 10baseT/Full
                            100baseT/Half 100baseT/Full
                            1000baseT/Half 1000baseT/Full
    Advertised pause frame use: Symmetric
    Advertised auto-negotiation: Yes
    Advertised FEC modes: Not reported
    Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                         100baseT/Half 100baseT/Full
                                         1000baseT/Full
    Link partner advertised pause frame use: Symmetric
    Link partner advertised auto-negotiation: Yes
    Link partner advertised FEC modes: Not reported
    Speed: 1000Mb/s
    Duplex: Full
    Port: Twisted Pair
    PHYAD: 1
    Transceiver: internal
    Auto-negotiation: on
    MDI-X: off
    Supports Wake-on: g
    Wake-on: d
    Current message level: 0x000000ff (255)
                           drv probe link timer ifdown ifup rx_err tx_err
    Link detected: yes
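To compare link speed and duplex settings across the cluster quickly, a minimal sketch, assuming passwordless SSH access and that the host names and the interface name enp22s0f0 are placeholders:
[root@mon ~]# for host in mon01 osd01 osd02; do echo "== $host =="; ssh "$host" "ethtool enp22s0f0 | grep -E 'Speed|Duplex'"; done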
Additional Resources
- See the Basic Network troubleshooting solution on the Customer Portal for details.
- See the Verifying and configuring the MTU value section in the Red Hat Ceph Storage Configuration Guide.
- See the Configuring Firewall section in the Red Hat Ceph Storage Installation Guide.
- See the What is the "ethtool" command and how can I use it to obtain information about my network devices and interfaces solution on the Customer Portal for details.
- See the RHEL network interface dropping packets solutions on the Customer Portal for details.
- For details, see the What are the performance benchmarking tools available for Red Hat Ceph Storage? solution on the Customer Portal.
- See the Networking Guide for Red Hat Enterprise Linux 7.
- For more information, see Knowledgebase articles and solutions related to troubleshooting networking issues on the Customer Portal.
3.3. Basic chrony NTP troubleshooting
This section includes basic chrony troubleshooting steps.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor node.
Procedure
Verify that the chronyd daemon is running on the Ceph Monitor hosts:
Example
[root@mon ~]# systemctl status chronyd
If chronyd is not running, enable and start it:
Example
[root@mon ~]# systemctl enable chronyd
[root@mon ~]# systemctl start chronyd
Ensure that chronyd is synchronizing the clocks correctly:
Example
[root@mon ~]# chronyc sources
[root@mon ~]# chronyc sourcestats
[root@mon ~]# chronyc tracking
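Clock skew warnings arise when the Monitor clocks disagree, so it can help to compare the chrony state on every Monitor at once. A minimal sketch, assuming passwordless SSH access and that mon01, mon02, and mon03 are placeholder host names:
[root@mon ~]# for host in mon01 mon02 mon03; do echo "== $host =="; ssh "$host" "chronyc tracking | grep -E 'Reference ID|System time|Leap status'"; done
A Leap status of Normal and a small System time offset on every Monitor indicate healthy synchronization.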
Additional Resources
- See the How to troubleshoot chrony issues solution on the Red Hat Customer Portal for advanced chrony NTP troubleshooting steps.
- See the Clock skew section in the Red Hat Ceph Storage Troubleshooting Guide for further details.
- See the Checking if chrony is synchronized section for further details.
3.4. Basic NTP troubleshooting
This section includes basic NTP troubleshooting steps.
Prerequisites
- A running Red Hat Ceph Storage cluster.
- Root-level access to the Ceph Monitor node.
Procedure
Verify that the ntpd daemon is running on the Ceph Monitor hosts:
Example
[root@mon ~]# systemctl status ntpd
If ntpd is not running, enable and start it:
Example
[root@mon ~]# systemctl enable ntpd
[root@mon ~]# systemctl start ntpd
Ensure that ntpd is synchronizing the clocks correctly:
Example
[root@mon ~]# ntpq -p
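In the ntpq -p output, the peer currently selected for synchronization is prefixed with an asterisk (*). To confirm that every Monitor has a selected peer, a minimal sketch, assuming passwordless SSH access and that mon01, mon02, and mon03 are placeholder host names:
[root@mon ~]# for host in mon01 mon02 mon03; do echo "== $host =="; ssh "$host" "ntpq -p | grep '^\*'"; done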
Additional Resources
- See the How to troubleshoot NTP issues solution on the Red Hat Customer Portal for advanced NTP troubleshooting steps.
- See the Clock skew section in the Red Hat Ceph Storage Troubleshooting Guide for further details.