Chapter 5. Tuning TCP connections for high throughput
Tune TCP-related settings on Red Hat Enterprise Linux to increase the throughput, reduce the latency, or prevent problems, such as packet loss.
5.1. Testing the TCP throughput by using iperf3
The iperf3 utility provides a server and client mode to perform network throughput tests between two hosts.
The throughput of applications depends on many factors, such as the buffer sizes that the application uses. Therefore, the results measured with testing utilities, such as iperf3, can be significantly different from those of applications on a server under production workload.
Prerequisites
- The iperf3 package is installed on both the client and the server.
- No other services on either host cause network traffic that substantially affects the test result.
- For 40 Gbps and faster connections, the network card supports Accelerated Receive Flow Steering (ARFS) and the feature is enabled on the interface (see the sketch after this list).
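Checking and enabling ARFS is driver-specific. As a hedged sketch, ARFS builds on the NIC's ntuple filtering, which you can inspect and, if the driver supports it, enable with ethtool; enp1s0 is an example interface name:

# ethtool -k enp1s0 | grep ntuple-filters
# ethtool -K enp1s0 ntuple on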
Procedure
Optional: Display the maximum network speed of the network interface controller (NIC) on both the server and client:
# ethtool enp1s0 | grep "Speed"
Speed: 100000Mb/s

On the server:
Temporarily open the default iperf3 TCP port 5201 in the firewalld service:

# firewall-cmd --add-port=5201/tcp
Start iperf3 in server mode:

# iperf3 --server

The service is now waiting for incoming client connections.
On the client:
Start measuring the throughput:
# iperf3 --time 60 --zerocopy --client 192.0.2.1

- --time <seconds>: Defines the time in seconds when the client stops the transmission. Set this parameter to a value that you expect to work and increase it in later measurements. If the client sends packets at a faster rate than the devices on the transmit path or the server can process, packets can be dropped.
- --zerocopy: Enables a zero copy method instead of using the write() system call. You require this option only if you want to simulate a zero-copy-capable application or to reach 40 Gbps and more on a single stream.
- --client <server>: Enables the client mode and sets the IP address or name of the server that runs the iperf3 server.
Wait until iperf3 completes the test. Both the server and the client display statistics every second and a summary at the end. For example, the following is a summary displayed on a client:

[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec   101 GBytes  14.4 Gbits/sec    0          sender
[  5]   0.00-60.04  sec   101 GBytes  14.4 Gbits/sec               receiver

In this example, the average bitrate was 14.4 Gbps.
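If a single stream does not saturate a very fast link, you can optionally repeat the measurement with several parallel streams by using the iperf3 --parallel option. In the following sketch, the server address 192.0.2.1 comes from the example above, and the stream count of four is only an assumption; adapt both to your environment:

# iperf3 --time 60 --parallel 4 --client 192.0.2.1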
On the server:
- Press Ctrl+C to stop the iperf3 server.
- Close the TCP port 5201 in firewalld:

# firewall-cmd --remove-port=5201/tcp
5.2. The system-wide TCP socket buffer settings
Socket buffers temporarily store data that the kernel has received or should send:
- The read socket buffer holds packets that the kernel has received but which the application has not read yet.
- The write socket buffer holds packets that an application has written to the buffer but which the kernel has not passed to the IP stack and network driver yet.
If a TCP packet is too large and exceeds the buffer size, or if packets are sent or received at too fast a rate, the kernel drops new incoming TCP packets until the data is removed from the buffer. In this case, increasing the socket buffers can prevent packet loss.
Both the net.ipv4.tcp_rmem (read) and net.ipv4.tcp_wmem (write) socket buffer kernel settings contain three values:

net.ipv4.tcp_rmem = 4096 131072 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304
The displayed values are in bytes and Red Hat Enterprise Linux uses them in the following way:
- The first value is the minimum buffer size. New sockets cannot have a smaller size.
- The second value is the default buffer size. If an application sets no buffer size, this is the default value.
- The third value is the maximum size of automatically tuned buffers. Using the setsockopt() function with the SO_SNDBUF socket option in an application disables this maximum buffer size.
Note that the net.ipv4.tcp_rmem and net.ipv4.tcp_wmem parameters set the socket buffer sizes for both the IPv4 and IPv6 protocols.
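To display the currently active values, you can query both parameters with sysctl. The output below assumes the default values shown above; the values on your system can differ:

# sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
net.ipv4.tcp_rmem = 4096 131072 6291456
net.ipv4.tcp_wmem = 4096 16384 4194304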
5.3. Increasing the system-wide TCP socket buffers
The system-wide TCP socket buffers temporarily store data that the kernel has received or should send. The net.ipv4.tcp_rmem (read) and net.ipv4.tcp_wmem (write) socket buffer kernel settings each contain three values: a minimum, default, and maximum value.
Setting too large buffer sizes wastes memory. Each socket can be set to the size that the application requests, and the kernel doubles this value. For example, if an application requests a 256 KiB socket buffer size and opens 1 million sockets, the system can use up to 512 GB RAM (512 KiB x 1 million) only for the potential socket buffer space.
Additionally, an excessively large maximum buffer size can increase the latency.
Prerequisites
- You encountered a significant rate of dropped TCP packets.
Procedure
Determine the latency of the connection. For example, ping from the client to the server to measure the average Round Trip Time (RTT):

# ping -c 10 server.example.com
...
--- server.example.com ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9014ms
rtt min/avg/max/mdev = 17.208/17.056/19.333/0.616 ms

In this example, the latency is 17 ms.
Use the following formula to calculate the Bandwidth Delay Product (BDP) for the traffic you want to tune:

connection speed in bytes per second * latency in seconds = BDP in bytes

For example, to calculate the BDP for a 10 Gbps connection that has a 17 ms latency:

(10 * 1000 * 1000 * 1000 / 8) * 0.017 = 21,250,000 bytes
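If you prefer, you can let the shell perform the same arithmetic. The following sketch assumes the 10 Gbps link speed and 17 ms RTT from the example above:

# speed_bits=10000000000; rtt_ms=17
# echo $(( speed_bits / 8 * rtt_ms / 1000 ))
21250000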
Create the /etc/sysctl.d/10-tcp-socket-buffers.conf file and either set the maximum read or write buffer size, or both, based on your requirements:

net.ipv4.tcp_rmem = 4096 262144 42500000
net.ipv4.tcp_wmem = 4096 262144 42500000

Specify the values in bytes. Use the following rule of thumb when you try to identify optimized values for your environment:
- Default buffer size (second value): Increase this value only slightly or set it to 524288 (512 KiB) at most. An excessively high default buffer size can cause buffer collapsing and, consequently, latency spikes.
- Maximum buffer size (third value): A value of double to triple the BDP is often sufficient.
Load the settings from the /etc/sysctl.d/10-tcp-socket-buffers.conf file:

# sysctl -p /etc/sysctl.d/10-tcp-socket-buffers.conf
If you have changed the second value in the net.ipv4.tcp_rmem or net.ipv4.tcp_wmem parameter, restart the applications to use the new TCP buffer sizes.

If you have changed only the third value, you do not need to restart the applications because auto-tuning applies these settings dynamically.
Verification
- Optional: Test the TCP throughput by using iperf3.
- Monitor the packet drop statistics by using the same method that you used when you encountered the packet drops, for example with the counter sketch after this list. If packet drops still occur but at a lower rate, increase the buffer sizes further.
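One hedged way to watch for receive-buffer pressure is to read the kernel's pruning and collapsing counters with nstat; which counters are available depends on the kernel version, and steadily rising values indicate that the receive buffers are still too small:

# nstat -az TcpExtPruneCalled TcpExtTCPRcvCollapsed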
5.4. TCP Window Scaling
The TCP Window Scaling feature, which is enabled by default in Red Hat Enterprise Linux, is an extension of the TCP protocol that significantly improves the throughput.
For example, on a 1 Gbps connection with 1.5 ms Round Trip Time (RTT):
- With TCP Window Scaling enabled, approximately 630 Mbps are realistic.
- With TCP Window Scaling disabled, the throughput goes down to 380 Mbps.
One of the features TCP provides is flow control. With flow control, a sender can send as much data as the receiver can receive, but no more. To achieve this, the receiver advertises a window value, which is the amount of data a sender can send.
TCP originally supported window sizes up to 64 KiB, but at high Bandwidth Delay Products (BDP), this value becomes a restriction because the sender cannot send more than 64 KiB at a time. High-speed connections can transfer much more than 64 KiB of data at a given time. For example, a 10 Gbps link with 1 ms of latency between systems can have more than 1 MiB of data in transit at a given time. It would be inefficient if a host sends only 64 KiB, then pauses until the other host receives that 64 KiB.
To remove this bottleneck, the TCP Window Scaling extension allows the TCP window value to be arithmetically shifted left to increase the window size beyond 64 KiB. For example, the largest window value of 65535 shifted 7 places to the left results in a window size of almost 8 MiB. This enables transferring much more data at a given time.
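You can reproduce this calculation in the shell; a left shift by 7 bits multiplies the advertised value by 128:

# echo $(( 65535 << 7 ))
8388480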
TCP Window Scaling is negotiated during the three-way TCP handshake that opens every TCP connection. Both sender and receiver must support TCP Window Scaling for the feature to work. If either or both participants do not advertise window scaling ability in their handshake, the connection reverts to using the original 16-bit TCP window size.
By default, TCP Window Scaling is enabled in Red Hat Enterprise Linux:
# sysctl net.ipv4.tcp_window_scaling
net.ipv4.tcp_window_scaling = 1
If TCP Window Scaling is disabled (0) on your server, re-enable it in the same way in which it was disabled.
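For example, if the feature was turned off through a sysctl drop-in file, a minimal sketch for re-enabling it could look as follows; the file name 10-tcp-window-scaling.conf is only an example, and you should preferably fix the file that disabled the setting:

# echo "net.ipv4.tcp_window_scaling = 1" > /etc/sysctl.d/10-tcp-window-scaling.conf
# sysctl -p /etc/sysctl.d/10-tcp-window-scaling.conf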
5.5. How TCP SACK reduces the packet drop rate
The TCP Selective Acknowledgment (TCP SACK) feature, which is enabled by default in Red Hat Enterprise Linux (RHEL), is an enhancement of the TCP protocol and increases the efficiency of TCP connections.
In TCP transmissions, the receiver sends an ACK packet to the sender for every packet it receives. For example, a client sends the TCP packets 1-10 to the server but packets number 5 and 6 get lost. Without TCP SACK, the server drops packets 7-10, and the client must retransmit all packets from the point of loss, which is inefficient. With TCP SACK enabled on both hosts, the client must retransmit only the lost packets 5 and 6.
Disabling TCP SACK decreases the performance and causes a higher packet drop rate on the receiver side in a TCP connection.
By default, TCP SACK is enabled in RHEL. To verify:
# sysctl net.ipv4.tcp_sack
net.ipv4.tcp_sack = 1
If TCP SACK is disabled (0) on your server, re-enable it in the same way in which it was disabled.
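To confirm that SACK was actually negotiated on established connections, you can inspect the per-socket TCP details with ss; sockets that negotiated the feature show the sack keyword in their options (the exact output varies by kernel and iproute2 version):

# ss -ti | grep sack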