Chapter 5. Tuning TCP connections for high throughput
Tune TCP-related settings on Red Hat Enterprise Linux to increase the throughput, reduce the latency, or prevent problems, such as packet loss.
5.1. Testing the TCP throughput by using iperf3
The iperf3 utility provides a server and client mode to perform network throughput tests between two hosts.
The throughput of applications depends on many factors, such as the buffer sizes that the application uses. Therefore, the results measured with testing utilities, such as iperf3, can be significantly different from those of applications on a server under production workload.
Prerequisites
- The iperf3 package is installed on both the client and the server.
- No other services on either host cause network traffic that substantially affects the test result.
- For 40 Gbps and faster connections, the network card supports Accelerated Receive Flow Steering (ARFS) and the feature is enabled on the interface (see the sketch after this list).
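Checking and enabling ARFS is driver-specific. As a hedged sketch, ARFS builds on the NIC's ntuple filtering, which you can inspect and, if the driver supports it, enable with ethtool; enp1s0 is an example interface name:

# ethtool -k enp1s0 | grep ntuple-filters
# ethtool -K enp1s0 ntuple on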
Procedure
Optional: Display the maximum network speed of the network interface controller (NIC) on both the server and client:
# ethtool enp1s0 | grep "Speed"
Speed: 100000Mb/s

On the server:
Temporarily open the default iperf3 TCP port 5201 in the firewalld service:

# firewall-cmd --add-port=5201/tcp
Start iperf3 in server mode:

# iperf3 --server

The service is now waiting for incoming client connections.
On the client:
Start measuring the throughput:
# iperf3 --time 60 --zerocopy --client 192.0.2.1

- --time <seconds>: Defines the time in seconds when the client stops the transmission. Set this parameter to a value that you expect to work and increase it in later measurements. If the client sends packets at a faster rate than the devices on the transmit path or the server can process, packets can be dropped.
- --zerocopy: Enables a zero copy method instead of using the write() system call. You require this option only if you want to simulate a zero-copy-capable application or to reach 40 Gbps and more on a single stream.
- --client <server>: Enables the client mode and sets the IP address or name of the server that runs the iperf3 server.
Wait until iperf3 completes the test. Both the server and the client display statistics every second and a summary at the end. For example, the following is a summary displayed on a client:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec   101 GBytes  14.4 Gbits/sec    0          sender
[  5]   0.00-60.04  sec   101 GBytes  14.4 Gbits/sec               receiver

In this example, the average bitrate was 14.4 Gbps.
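If a single stream does not saturate a very fast link, you can optionally repeat the measurement with several parallel streams by using the iperf3 --parallel option. In the following sketch, the server address 192.0.2.1 comes from the example above, and the stream count of four is only an assumption; adapt both to your environment:

# iperf3 --time 60 --parallel 4 --client 192.0.2.1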
On the server:
- Press Ctrl+C to stop the iperf3 server.
- Close the TCP port 5201 in firewalld:

# firewall-cmd --remove-port=5201/tcp
5.2. The system-wide TCP socket buffer settings
Socket buffers temporarily store data that the kernel has received or should send:
- The read socket buffer holds packets that the kernel has received but which the application has not read yet.
- The write socket buffer holds packets that an application has written to the buffer but which the kernel has not passed to the IP stack and network driver yet.
If a TCP packet is too large and exceeds the buffer size, or if packets are sent or received at too fast a rate, the kernel drops new incoming TCP packets until the data is removed from the buffer. In this case, increasing the socket buffers can prevent packet loss.
Both the net.ipv4.tcp_rmem (read) and net.ipv4.tcp_wmem (write) socket buffer kernel settings contain three values:

net.ipv4.tcp_rmem = 4096 131072 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304
The displayed values are in bytes and Red Hat Enterprise Linux uses them in the following way:
- The first value is the minimum buffer size. New sockets cannot have a smaller size.
- The second value is the default buffer size. If an application sets no buffer size, this is the default value.
- The third value is the maximum size of automatically tuned buffers. Using the setsockopt() function with the SO_SNDBUF socket option in an application disables this maximum buffer size.
Note that the net.ipv4.tcp_rmem and net.ipv4.tcp_wmem parameters set the socket buffer sizes for both the IPv4 and IPv6 protocols.
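To display the currently active values, you can query both parameters with sysctl. The output below assumes the default values shown above; the values on your system can differ:

# sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
net.ipv4.tcp_rmem = 4096 131072 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304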
5.3. Increasing the system-wide TCP socket buffers
The system-wide TCP socket buffers temporarily store data that the kernel has received or should send. The net.ipv4.tcp_rmem (read) and net.ipv4.tcp_wmem (write) socket buffer kernel settings each contain three values: a minimum, default, and maximum value.
Setting too large buffer sizes wastes memory. Each socket can be set to the size that the application requests, and the kernel doubles this value. For example, if an application requests a 256 KiB socket buffer size and opens 1 million sockets, the system can use up to 512 GB RAM (512 KiB x 1 million) only for the potential socket buffer space.
Additionally, an excessively large maximum buffer size can increase the latency.
Prerequisites
- You encountered a significant rate of dropped TCP packets.
Procedure
Determine the latency of the connection. For example, ping from the client to the server to measure the average Round Trip Time (RTT):

# ping -c 10 server.example.com
...
--- server.example.com ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9014ms
rtt min/avg/max/mdev = 17.208/17.056/19.333/0.616 ms

In this example, the latency is 17 ms.
Use the following formula to calculate the Bandwidth Delay Product (BDP) for the traffic you want to tune:

connection speed in bytes per second * latency in seconds = BDP in bytes

For example, to calculate the BDP for a 10 Gbps connection that has a 17 ms latency:

(10 * 1000 * 1000 * 1000 / 8) * 0.017 = 21,250,000 bytes
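If you prefer, you can let the shell perform the same arithmetic. The following sketch assumes the 10 Gbps link speed and 17 ms RTT from the example above:

# speed_bits=10000000000; rtt_ms=17
# echo $(( speed_bits / 8 * rtt_ms / 1000 ))
21250000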
Create the /etc/sysctl.d/10-tcp-socket-buffers.conf file and either set the maximum read or write buffer size, or both, based on your requirements:

net.ipv4.tcp_rmem = 4096 262144 42500000
net.ipv4.tcp_wmem = 4096 262144 42500000

Specify the values in bytes. Use the following rule of thumb when you try to identify optimized values for your environment:
- Default buffer size (second value): Increase this value only slightly or set it to 524288 (512 KiB) at most. An excessively high default buffer size can cause buffer collapsing and, consequently, latency spikes.
- Maximum buffer size (third value): A value of double to triple the BDP is often sufficient.
Load the settings from the /etc/sysctl.d/10-tcp-socket-buffers.conf file:

# sysctl -p /etc/sysctl.d/10-tcp-socket-buffers.conf
If you have changed the second value in the net.ipv4.tcp_rmem or net.ipv4.tcp_wmem parameter, restart the applications to use the new TCP buffer sizes.

If you have changed only the third value, you do not need to restart the applications because auto-tuning applies these settings dynamically.
Verification
- Optional: Test the TCP throughput by using iperf3.
- Monitor the packet drop statistics by using the same method that you used when you encountered the packet drops, for example with the counter sketch after this list. If packet drops still occur but at a lower rate, increase the buffer sizes further.
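One hedged way to watch for receive-buffer pressure is to read the kernel's pruning and collapsing counters with nstat; which counters are available depends on the kernel version, and steadily rising values indicate that the receive buffers are still too small:

# nstat -az TcpExtPruneCalled TcpExtTCPRcvCollapsed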
5.4. TCP Window Scaling
The TCP Window Scaling feature, which is enabled by default in Red Hat Enterprise Linux, is an extension of the TCP protocol that significantly improves the throughput.
For example, on a 1 Gbps connection with 1.5 ms Round Trip Time (RTT):
- With TCP Window Scaling enabled, approximately 630 Mbps are realistic.
- With TCP Window Scaling disabled, the throughput goes down to 380 Mbps.
One of the features TCP provides is flow control. With flow control, a sender can send as much data as the receiver can receive, but no more. To achieve this, the receiver advertises a window value, which is the amount of data a sender can send.
TCP originally supported window sizes up to 64 KiB, but at high Bandwidth Delay Products (BDP), this value becomes a restriction because the sender cannot send more than 64 KiB at a time. High-speed connections can transfer much more than 64 KiB of data at a given time. For example, a 10 Gbps link with 1 ms of latency between systems can have more than 1 MiB of data in transit at a given time. It would be inefficient if a host sends only 64 KiB, then pauses until the other host receives that 64 KiB.
To remove this bottleneck, the TCP Window Scaling extension allows the TCP window value to be arithmetically shifted left to increase the window size beyond 64 KiB. For example, the largest window value of 65535 shifted 7 places to the left results in a window size of almost 8 MiB. This enables transferring much more data at a given time.
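You can reproduce this calculation in the shell; a left shift by 7 bits multiplies the advertised value by 128:

# echo $(( 65535 << 7 ))
8388480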
TCP Window Scaling is negotiated during the three-way TCP handshake that opens every TCP connection. Both sender and receiver must support TCP Window Scaling for the feature to work. If either or both participants do not advertise window scaling ability in their handshake, the connection reverts to using the original 16-bit TCP window size.
By default, TCP Window Scaling is enabled in Red Hat Enterprise Linux:
# sysctl net.ipv4.tcp_window_scaling
net.ipv4.tcp_window_scaling = 1
If TCP Window Scaling is disabled (0) on your server, re-enable it in the same way in which it was disabled.
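For example, if the feature was turned off through a sysctl drop-in file, a minimal sketch for re-enabling it could look as follows; the file name 10-tcp-window-scaling.conf is only an example, and you should preferably fix the file that disabled the setting:

# echo "net.ipv4.tcp_window_scaling = 1" > /etc/sysctl.d/10-tcp-window-scaling.conf
# sysctl -p /etc/sysctl.d/10-tcp-window-scaling.conf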
5.5. How TCP SACK reduces the packet drop rate
The TCP Selective Acknowledgment (TCP SACK) feature, which is enabled by default in Red Hat Enterprise Linux (RHEL), is an enhancement of the TCP protocol and increases the efficiency of TCP connections.
In TCP transmissions, the receiver sends an ACK packet to the sender for every packet it receives. For example, a client sends the TCP packets 1-10 to the server but packets number 5 and 6 get lost. Without TCP SACK, the server drops packets 7-10, and the client must retransmit all packets from the point of loss, which is inefficient. With TCP SACK enabled on both hosts, the client must retransmit only the lost packets 5 and 6.
Disabling TCP SACK decreases the performance and causes a higher packet drop rate on the receiver side in a TCP connection.
By default, TCP SACK is enabled in RHEL. To verify:
# sysctl net.ipv4.tcp_sack
net.ipv4.tcp_sack = 1
If TCP SACK is disabled (0) on your server, re-enable it in the same way in which it was disabled.
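To confirm that SACK was actually negotiated on established connections, you can inspect the per-socket TCP details with ss; sockets that negotiated the feature show the sack keyword in their options (the exact output varies by kernel and iproute2 version):

# ss -ti | grep sack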