Chapter 4. High Packet Loss in the TX Queue of the Instance’s Tap Interface
Use this section to troubleshoot packet loss in the TX queue for kernel networking, not OVS-DPDK.
4.1. Symptom
During a test of a virtual network function (VNF) using host-only networking, high packet loss can be observed in the TX queue of the instance’s tap interface. The test setup sends packets from one VM on a node to another VM on the same node. The packet loss appears in bursts.
The following example shows a high number of dropped packets in the tap’s TX queue.
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes      packets    errors  dropped  overrun  mcast
    5500034259301  132047795  0       0        0        0
    RX errors: length  crc  frame  fifo  missed
               0       0    0      0     0
    TX: bytes   packets   errors  dropped   carrier  collsns
    5481296464  81741449  0       11155280  0        0
    TX errors: aborted  fifo  window  heartbeat  transns
               0        0     0       0          0
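Because the loss occurs in bursts, a single counter snapshot can be misleading. The following loop is a minimal sketch that samples the dropped counter once per second and prints the per-second delta; the tap name is taken from the example above, and the awk field positions assume the ip -s -s output layout shown there.

# Sample the tap's TX dropped counter once per second and print the delta.
# Adjust TAP to match your instance's tap interface.
TAP=tapc18eb09e-01
prev=$(ip -s -s link ls dev $TAP | awk '/TX:/ {getline; print $4; exit}')
while sleep 1; do
    cur=$(ip -s -s link ls dev $TAP | awk '/TX:/ {getline; print $4; exit}')
    echo "$(date +%T) TX dropped in last second: $((cur - prev))"
    prev=$cur
done

A steadily zero delta with occasional large spikes confirms that the drops are bursty rather than continuous.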
4.2. Diagnosis
This section examines packet drops on tap (kernel path) interfaces. For packet drops on vhost-user interfaces in the user datapath, see https://access.redhat.com/solutions/3381011.
TX drops occur because of interference between the instance’s vCPU and other processes on the hypervisor. The TX queue of the tap interface is a buffer that can store packets for a short while in case the instance cannot pick them up immediately. Drops happen if the instance’s vCPU is prevented from running (or freezes) for long enough that the buffer fills.
A TUN/TAP device is a virtual device where one end is a kernel network interface, and the other end is a user space file descriptor.
A TUN/TAP interface can run in one of two modes; a short example of creating a tap device by hand follows the list:
- Tap mode feeds L2 Ethernet frames with the L2 header into the device, and expects to receive the same from user space. This mode is used for VMs.
- Tun mode feeds L3 IP packets with the L3 header into the device, and expects to receive the same from user space. This mode is mostly used for VPN clients.
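To see the kernel side of a tap device in isolation, you can create one manually. This is a minimal sketch using the standard ip tuntap subcommand; the device name tap-test is illustrative.

# Create a tap device (L2 mode), bring it up, inspect it, then remove it.
ip tuntap add dev tap-test mode tap
ip link set dev tap-test up
ip -s -s link ls dev tap-test
ip tuntap del dev tap-test mode tap

Because no user space process holds the other end of tap-test, any traffic sent into it is simply dropped; for an instance's tap, that other end is held by qemu-kvm, as described next.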
In KVM networking, the user space file descriptor is owned by the qemu-kvm process. Frames that are sent into the tap (TX from the hypervisor’s perspective) end up as L2 frames inside qemu-kvm, which can then feed those frames to the virtual network device in the VM as network packets received into the virtual network interface (RX from the VM’s perspective).
A key concept with TUN/TAP is that the transmit direction from the hypervisor is the receive direction for the virtual machine. The same is true of the opposite direction: receive for the hypervisor is transmit for the virtual machine.
There is no "ring buffer" of packets on a virtio-net device. This means that if the TUN/TAP device’s TX queue fills up because the VM is not receiving (either fast enough or at all), then there is nowhere for new packets to go, and the hypervisor sees TX loss on the tap.
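Before tuning the queue, you can check its current length. The following commands are a sketch; the tap name is the one from the symptom example.

# Two equivalent views of the tap's current TX queue length (qlen):
ip link show dev tapc18eb09e-01 | grep -o 'qlen [0-9]*'
cat /sys/class/net/tapc18eb09e-01/tx_queue_len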
If you notice TX loss on a TUN/TAP device, increase the tap txqueuelen to avoid it, similar to increasing the RX ring buffer to stop receive loss on a physical NIC.

However, this assumes the VM is only "slow" and "bursty" at receiving. If the VM is not executing fast enough all the time, or is otherwise not receiving at all, tuning the TX queue length will not help. You must find out why the VM is not running or not receiving.
4.2.1. Workaround
To alleviate small freezes, at the cost of higher latency and other disadvantages, increase the TX queue length.
To temporarily increase txqueuelen, use the following command:
/sbin/ip link set tap<uuid> txqueuelen <new queue length>
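A setting made with ip link does not survive a reboot or re-creation of the tap. One possible way to persist it for all tap devices is a udev rule; this is a sketch only, and the rule file name and the queue length of 10000 are illustrative values, not recommendations.

# /etc/udev/rules.d/71-tap-txqueuelen.rules (hypothetical file name)
# Set tx_queue_len on every tap device as it is created.
SUBSYSTEM=="net", ACTION=="add", KERNEL=="tap*", ATTR{tx_queue_len}="10000"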
4.2.2. Diagnostic Steps
Use the following script to observe the effect of CPU time being stolen from the instance’s vCPU by another process on the hypervisor.
[root@ibm-x3550m4-9 ~]# cat generate-tx-drops.sh
#!/bin/bash

trap 'cleanup' INT

cleanup() {
    echo "Cleanup ..."
    if [ "x$HPING_PID" != "x" ]; then
        echo "Killing hping3 with PID $HPING_PID"
        kill $HPING_PID
    fi
    if [ "x$DD_PID" != "x" ]; then
        echo "Killing dd with PID $DD_PID"
        kill $DD_PID
    fi
    exit 0
}

VM_IP=10.0.0.20
VM_TAP=tapc18eb09e-01
VM_INSTANCE_ID=instance-00000012
LAST_CPU=$( lscpu | awk '/^CPU\(s\):/ { print $NF - 1 }' )
# this is a 12 core system, we are sending everything to CPU 11,
# so the taskset mask is 800 to set dd affinity only for the last CPU
TASKSET_MASK=800

# pin the instance's vCPU to the last pCPU
echo "virsh vcpupin $VM_INSTANCE_ID 0 $LAST_CPU"
virsh vcpupin $VM_INSTANCE_ID 0 $LAST_CPU

# make sure that: nova secgroup-add-rule default udp 1 65535 0.0.0.0/0
# make sure that: nova secgroup-add-rule default tcp 1 65535 0.0.0.0/0
# make sure that: nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0

# --fast, --faster or --flood can also be used
echo "hping3 -u -p 5000 $VM_IP --faster > /dev/null"
hping3 -u -p 5000 $VM_IP --faster > /dev/null &
HPING_PID=$!

echo "hping is running, but dd not yet:"
for i in {1..5}; do    # five samples, matching the example output below
    date
    echo "ip -s -s link ls dev $VM_TAP"
    ip -s -s link ls dev $VM_TAP
    sleep 5
done

echo "Starting dd and pinning it to the same pCPU as the instance"
echo "dd if=/dev/zero of=/dev/null"
dd if=/dev/zero of=/dev/null &
DD_PID=$!
echo "taskset -p $TASKSET_MASK $DD_PID"
taskset -p $TASKSET_MASK $DD_PID

for i in {1..5}; do    # five more samples while dd competes for the pCPU
    date
    echo "ip -s -s link ls dev $VM_TAP"
    ip -s -s link ls dev $VM_TAP
    sleep 5
done

cleanup
Log in to the instance and start dd if=/dev/zero of=/dev/null to generate additional load on its only vCPU. Note that this is for demonstration purposes. You can repeat the same test with and without load from within the VM. TX drops only occur when another process on the hypervisor steals time from the instance’s vCPU.
The following example shows an instance before the test:
%Cpu(s): 22.3 us, 77.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1884108 total,  1445636 free,    90536 used,   347936 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  1618720 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
30172 root      20   0  107936    620    528 R  99.9  0.0   0:05.89 dd
Run the following script and observe the dropped packets in the TX queue. These occur only while the dd process consumes a significant amount of processing time from the instance’s CPU: notice in the output that the dropped counter remains at 11155280 while only hping3 runs, and starts increasing after dd is pinned to the instance’s pCPU.
[root@ibm-x3550m4-9 ~]# ./generate-tx-drops.sh
virsh vcpupin instance-00000012 0 11
hping3 -u -p 5000 10.0.0.20 --faster > /dev/null
hping is running, but dd not yet:
Tue Nov 29 12:28:22 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes      packets    errors  dropped  overrun  mcast
    5500034259301  132047795  0       0        0        0
    RX errors: length  crc  frame  fifo  missed
               0       0    0      0     0
    TX: bytes   packets   errors  dropped   carrier  collsns
    5481296464  81741449  0       11155280  0        0
    TX errors: aborted  fifo  window  heartbeat  transns
               0        0     0       0          0
Tue Nov 29 12:28:27 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes      packets    errors  dropped  overrun  mcast
    5500055729011  132445382  0       0        0        0
    RX errors: length  crc  frame  fifo  missed
               0       0    0      0     0
    TX: bytes   packets   errors  dropped   carrier  collsns
    5502766282  82139038  0       11155280  0        0
    TX errors: aborted  fifo  window  heartbeat  transns
               0        0     0       0          0
Tue Nov 29 12:28:32 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes      packets    errors  dropped  overrun  mcast
    5500077122125  132841551  0       0        0        0
    RX errors: length  crc  frame  fifo  missed
               0       0    0      0     0
    TX: bytes   packets   errors  dropped   carrier  collsns
    5524159396  82535207  0       11155280  0        0
    TX errors: aborted  fifo  window  heartbeat  transns
               0        0     0       0          0
Tue Nov 29 12:28:37 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes      packets    errors  dropped  overrun  mcast
    5500098181033  133231531  0       0        0        0
    RX errors: length  crc  frame  fifo  missed
               0       0    0      0     0
    TX: bytes   packets   errors  dropped   carrier  collsns
    5545218358  82925188  0       11155280  0        0
    TX errors: aborted  fifo  window  heartbeat  transns
               0        0     0       0          0
Tue Nov 29 12:28:42 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes      packets    errors  dropped  overrun  mcast
    5500119152685  133619793  0       0        0        0
    RX errors: length  crc  frame  fifo  missed
               0       0    0      0     0
    TX: bytes   packets   errors  dropped   carrier  collsns
    5566184804  83313451  0       11155280  0        0
    TX errors: aborted  fifo  window  heartbeat  transns
               0        0     0       0          0
Starting dd and pinning it to the same pCPU as the instance
dd if=/dev/zero of=/dev/null
taskset -p 800 8763
pid 8763's current affinity mask: fff
pid 8763's new affinity mask: 800
Tue Nov 29 12:28:47 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes      packets    errors  dropped  overrun  mcast
    5500140267091  134010698  0       0        0        0
    RX errors: length  crc  frame  fifo  missed
               0       0    0      0     0
    TX: bytes   packets   errors  dropped   carrier  collsns
    5587300452  83704477  0       11155280  0        0
    TX errors: aborted  fifo  window  heartbeat  transns
               0        0     0       0          0
Tue Nov 29 12:28:52 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes      packets    errors  dropped  overrun  mcast
    5500159822749  134372711  0       0        0        0
    RX errors: length  crc  frame  fifo  missed
               0       0    0      0     0
    TX: bytes   packets   errors  dropped   carrier  collsns
    5606853168  84066563  0       11188074  0        0
    TX errors: aborted  fifo  window  heartbeat  transns
               0        0     0       0          0
Tue Nov 29 12:28:57 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes      packets    errors  dropped  overrun  mcast
    5500179161241  134730729  0       0        0        0
    RX errors: length  crc  frame  fifo  missed
               0       0    0      0     0
    TX: bytes   packets   errors  dropped   carrier  collsns
    5626179144  84424451  0       11223096  0        0
    TX errors: aborted  fifo  window  heartbeat  transns
               0        0     0       0          0
Tue Nov 29 12:29:02 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes      packets    errors  dropped  overrun  mcast
    5500198344463  135085948  0       0        0        0
    RX errors: length  crc  frame  fifo  missed
               0       0    0      0     0
    TX: bytes   packets   errors  dropped   carrier  collsns
    5645365410  84779752  0       11260740  0        0
    TX errors: aborted  fifo  window  heartbeat  transns
               0        0     0       0          0
Tue Nov 29 12:29:07 EST 2016
ip -s -s link ls dev tapc18eb09e-01
69: tapc18eb09e-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbrc18eb09e-01 state UNKNOWN mode DEFAULT qlen 1000
    link/ether fe:16:3e:a5:17:c0 brd ff:ff:ff:ff:ff:ff
    RX: bytes      packets    errors  dropped  overrun  mcast
    5500217014275  135431570  0       0        0        0
    RX errors: length  crc  frame  fifo  missed
               0       0    0      0     0
    TX: bytes   packets   errors  dropped   carrier  collsns
    5664031398  85125418  0       11302179  0        0
    TX errors: aborted  fifo  window  heartbeat  transns
               0        0     0       0          0
Cleanup ...
Killing hping3 with PID 8722
Killing dd with PID 8763
[root@ibm-x3550m4-9 ~]#
--- 10.0.0.20 hping statistic ---
3919615 packets transmitted, 0 packets received, 100% packet loss
round-trip min/avg/max = 0.0/0.0/0.0 ms
The following example shows the effects of dd on the hypervisor, as observed from within the instance during the test. The st field identifies the percentage of CPU time stolen from the instance by the hypervisor.
%Cpu(s):  7.0 us, 27.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi, 20.2 si, 45.4 st
KiB Mem :  1884108 total,  1445484 free,    90676 used,   347948 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  1618568 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
30172 root      20   0  107936    620    528 R  54.3  0.0   1:00.50 dd
Note that ssh to the instance can become sluggish during the second half of the test, and may even time out if the test runs too long.
4.3. Solution
While increasing the TX queue length helps to mitigate these small freezes, complete isolation with CPU pinning and isolcpus in the kernel parameters is the best solution. For more information, see Configure CPU pinning with NUMA in OpenStack.
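As an illustration only, the following sketch shows the pieces involved on a compute node; the CPU range 2-11 and the flavor name m1.large are example values, not recommendations, and option names can differ between OpenStack releases.

# Kernel command line on the compute node: isolate pCPUs 2-11 from the
# general scheduler so that only pinned instance vCPUs run there.
#   isolcpus=2-11
#
# /etc/nova/nova.conf on the compute node: restrict Nova to placing
# instance vCPUs on the same isolated pCPUs.
#   vcpu_pin_set=2-11
#
# Request dedicated pCPUs for instances launched from a flavor:
nova flavor-key m1.large set hw:cpu_policy=dedicated

With this combination, an instance’s vCPU no longer shares its pCPU with other hypervisor processes, so the interference that causes the bursty TX drops cannot occur.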