Chapter 42. Understanding the eBPF networking features in RHEL 9
The extended Berkeley Packet Filter (eBPF) is an in-kernel virtual machine that allows code execution in the kernel space. This code runs in a restricted sandbox environment with access only to a limited set of functions.
In networking, you can use eBPF to complement or replace kernel packet processing. Depending on the hook you use, eBPF programs can, for example:
- Read and write packet data and metadata
- Look up sockets and routes
- Set socket options
- Redirect packets
42.1. Overview of networking eBPF features in RHEL 9
You can attach extended Berkeley Packet Filter (eBPF) networking programs to the following hooks in RHEL:
- eXpress Data Path (XDP): Provides early access to received packets before the kernel networking stack processes them.
- tc eBPF classifier with direct-action flag: Provides powerful packet processing on ingress and egress.
- Control Groups version 2 (cgroup v2): Enables filtering and overriding socket-based operations performed by programs in a control group.
- Socket filtering: Enables filtering of packets received from sockets. This feature was also available in the classic Berkeley Packet Filter (cBPF), but has been extended to support eBPF programs.
- Stream parser: Enables splitting up streams to individual messages, filtering, and redirecting them to sockets.
- SO_REUSEPORT socket selection: Provides a programmable selection of a receiving socket from a reuseport socket group.
- Flow dissector: Enables overriding the way the kernel parses packet headers in certain situations.
- TCP congestion control callbacks: Enables implementing a custom TCP congestion control algorithm.
- Routes with encapsulation: Enables creating custom tunnel encapsulation.
XDP
You can attach programs of the BPF_PROG_TYPE_XDP type to a network interface. The kernel then executes the program on received packets before the kernel network stack starts processing them. This allows fast packet forwarding in certain situations, such as fast packet dropping to prevent distributed denial of service (DDoS) attacks and fast packet redirects for load balancing scenarios.
You can also use XDP for different forms of packet monitoring and sampling. The kernel allows XDP programs to modify packets and to pass them for further processing to the kernel network stack.
The following XDP modes are available:
- Native (driver) XDP: The kernel executes the program at the earliest possible point during packet reception. At this point, the kernel has not yet parsed the packet and, therefore, no metadata provided by the kernel is available. This mode requires that the network interface driver supports XDP, but not all drivers support this native mode.
- Generic XDP: The kernel network stack executes the XDP program early in the processing. At that time, kernel data structures have already been allocated, and the packet has been pre-processed. Dropping or redirecting a packet therefore incurs significant overhead compared to the native mode. However, the generic mode does not require network interface driver support and works with all network interfaces.
- Offloaded XDP: The kernel executes the XDP program on the network interface instead of on the host CPU. Note that this requires specific hardware, and only certain eBPF features are available in this mode.
On RHEL, load all XDP programs using the libxdp library. This library enables system-controlled usage of XDP.
Currently, there are some system configuration limitations for XDP programs. For example, you must disable certain hardware offload features on the receiving interface. Additionally, not all features are available with all drivers that support the native mode.
In RHEL 9, Red Hat supports the XDP features only if you use the libxdp library to load the program into the kernel.
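The following minimal sketch is not taken from the RHEL documentation; the map and function names are illustrative. It shows the general shape of a BPF_PROG_TYPE_XDP program: it counts received packets in a per-CPU map and passes every packet on to the kernel network stack.

```c
/* A minimal, illustrative BPF_PROG_TYPE_XDP program: count received
 * packets in a per-CPU map and pass them to the kernel network stack. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} pkt_count SEC(".maps");

SEC("xdp")
int xdp_count(struct xdp_md *ctx)
{
    __u32 key = 0;
    __u64 *count = bpf_map_lookup_elem(&pkt_count, &key);

    if (count)
        (*count)++;

    /* Returning XDP_DROP here instead would discard the packet before
     * the kernel network stack allocates an skb for it. */
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

A program like this is typically compiled with clang for the BPF target and attached with the libxdp-based tooling, for example the xdp-loader utility shipped with libxdp.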
AF_XDP
Using an XDP program that filters and redirects packets to a given AF_XDP socket, you can use one or more sockets from the AF_XDP protocol family to quickly copy packets from the kernel to the user space.
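The kernel-side part of such a setup is typically an XDP program that redirects packets into an XSKMAP whose entries are the AF_XDP sockets created by the user-space application. A minimal sketch, with an illustrative map name and size:

```c
/* A sketch of the kernel-side XDP program in an AF_XDP setup. The map
 * name and size are illustrative assumptions. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_sock_redirect(struct xdp_md *ctx)
{
    /* Redirect to the AF_XDP socket registered for this RX queue; if no
     * socket is registered, XDP_PASS hands the packet to the kernel
     * network stack instead. */
    return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char LICENSE[] SEC("license") = "GPL";
```

The user-space application creates the AF_XDP sockets, registers them in the map, and then receives the redirected frames directly through shared memory rings.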
Traffic Control
The Traffic Control (tc) subsystem offers the following types of eBPF programs:
- BPF_PROG_TYPE_SCHED_CLS
- BPF_PROG_TYPE_SCHED_ACT
These types enable you to write custom tc classifiers and tc actions in eBPF. Together with the rest of the tc ecosystem, this enables powerful packet processing and forms a core part of several container networking orchestration solutions.
In most cases, only the classifier is used because, with the direct-action flag, the eBPF classifier can execute actions directly from the same eBPF program. The clsact Queueing Discipline (qdisc) has been designed to enable this on the ingress side.
Note that using a flow dissector eBPF program can influence the operation of some other qdiscs and tc classifiers, such as flower.
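The following sketch (illustrative only, not from the RHEL documentation) shows a BPF_PROG_TYPE_SCHED_CLS classifier written for direct-action mode: its return value is itself the tc action, so no separate tc action program is required.

```c
/* An illustrative BPF_PROG_TYPE_SCHED_CLS classifier for direct-action
 * mode: drop IPv4 packets on the attached hook, let everything else
 * continue. */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("tc")
int drop_ipv4(struct __sk_buff *skb)
{
    /* In direct-action mode, the return value is the tc action itself. */
    if (skb->protocol == bpf_htons(ETH_P_IP))
        return TC_ACT_SHOT;   /* drop the packet */
    return TC_ACT_OK;         /* continue processing */
}

char LICENSE[] SEC("license") = "GPL";
```

Such an object is typically attached with the tc utility, for example tc filter add dev <device> ingress bpf direct-action obj <object-file> sec tc.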
Socket filter
Several utilities use or have used the classic Berkeley Packet Filter (cBPF) for filtering packets received on a socket. For example, the tcpdump utility enables the user to specify expressions, which tcpdump then translates into cBPF code.
As an alternative to cBPF, the kernel allows eBPF programs of the BPF_PROG_TYPE_SOCKET_FILTER type for the same purpose.
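As an illustration of the return value semantics, the following sketch of a BPF_PROG_TYPE_SOCKET_FILTER program keeps IPv4 packets in full and drops everything else; the returned value is the number of bytes that are delivered to the socket.

```c
/* An illustrative BPF_PROG_TYPE_SOCKET_FILTER program. Returning 0
 * drops the packet for this socket; returning skb->len keeps it whole. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("socket")
int keep_only_ipv4(struct __sk_buff *skb)
{
    if (skb->protocol == bpf_htons(ETH_P_IP))
        return skb->len;   /* deliver the full packet */
    return 0;              /* drop everything else */
}

char LICENSE[] SEC("license") = "GPL";
```

User space attaches the loaded program to a socket by passing its file descriptor to setsockopt() with the SO_ATTACH_BPF option.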
Control Groups
In RHEL, you can attach multiple types of eBPF programs to a cgroup. The kernel executes these programs when a program in the given cgroup performs an operation. Note that you can use only cgroups version 2.
The following networking-related cgroup eBPF programs are available in RHEL:
- BPF_PROG_TYPE_SOCK_OPS: The kernel calls this program on various TCP events. The program can adjust the behavior of the kernel TCP stack, including custom TCP header options, and so on.
- BPF_PROG_TYPE_CGROUP_SOCK_ADDR: The kernel calls this program during connect, bind, sendto, recvmsg, getpeername, and getsockname operations. This program allows changing IP addresses and ports. This is useful when you implement socket-based network address translation (NAT) in eBPF.
- BPF_PROG_TYPE_CGROUP_SOCKOPT: The kernel calls this program during setsockopt and getsockopt operations and allows changing the options.
- BPF_PROG_TYPE_CGROUP_SOCK: The kernel calls this program during socket creation, socket releasing, and binding to addresses. You can use these programs to allow or deny the operation, or only to inspect socket creation for statistics.
- BPF_PROG_TYPE_CGROUP_SKB: This program filters individual packets on ingress and egress, and can accept or reject packets.
- BPF_PROG_TYPE_CGROUP_SYSCTL: This program allows filtering of access to system controls (sysctl).
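For example, a minimal BPF_PROG_TYPE_CGROUP_SKB sketch (map name illustrative, not from the RHEL documentation) that accounts egress traffic for the cgroup it is attached to and allows all packets could look like this:

```c
/* An illustrative BPF_PROG_TYPE_CGROUP_SKB program for the egress hook.
 * Returning 1 allows the packet; returning 0 would drop it. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} egress_bytes SEC(".maps");

SEC("cgroup_skb/egress")
int count_egress(struct __sk_buff *skb)
{
    __u32 key = 0;
    __u64 *bytes = bpf_map_lookup_elem(&egress_bytes, &key);

    /* Account the packet size for the attached cgroup. */
    if (bytes)
        *bytes += skb->len;

    return 1;   /* allow the packet */
}

char LICENSE[] SEC("license") = "GPL";
```

The program is then attached to a cgroup, for example with the bpf(BPF_PROG_ATTACH) system call or with bpftool cgroup attach.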
Stream Parser
A stream parser operates on a group of sockets that are added to a special eBPF map. The eBPF program then processes packets that the kernel receives or sends on those sockets.
The following stream parser eBPF programs are available in RHEL:
- BPF_PROG_TYPE_SK_SKB: An eBPF program parses packets received from the socket into individual messages, and instructs the kernel to drop those messages or send them to another socket in the group.
- BPF_PROG_TYPE_SK_MSG: This program filters egress messages. An eBPF program parses the packets into individual messages and either approves or rejects them.
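The following illustrative sketch of a BPF_PROG_TYPE_SK_SKB verdict program redirects every parsed message to the socket that user space stored at index 0 of a sockmap; the map name and size are assumptions for the example.

```c
/* An illustrative BPF_PROG_TYPE_SK_SKB verdict program. Sockets are
 * added to sock_map from user space; each parsed message is redirected
 * to the socket stored at index 0. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_SOCKMAP);
    __uint(max_entries, 2);
    __type(key, __u32);
    __type(value, __u64);
} sock_map SEC(".maps");

SEC("sk_skb/stream_verdict")
int redirect_msg(struct __sk_buff *skb)
{
    /* Returns SK_PASS on success and SK_DROP if the redirect fails. */
    return bpf_sk_redirect_map(skb, &sock_map, 0, BPF_F_INGRESS);
}

char LICENSE[] SEC("license") = "GPL";
```

A companion program in an sk_skb/stream_parser section can define how the byte stream is split into individual messages.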
SO_REUSEPORT socket selection
Using this socket option, you can bind multiple sockets to the same IP address and port. Without eBPF, the kernel selects the receiving socket based on a connection hash. With the BPF_PROG_TYPE_SK_REUSEPORT program, the selection of the receiving socket is fully programmable.
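A minimal BPF_PROG_TYPE_SK_REUSEPORT sketch follows; the map name, its size, and the "always pick index 0" policy are illustrative assumptions.

```c
/* An illustrative BPF_PROG_TYPE_SK_REUSEPORT program that tries to steer
 * packets to one particular socket of the reuseport group. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
    __uint(max_entries, 8);
    __type(key, __u32);
    __type(value, __u64);
} reuseport_map SEC(".maps");

SEC("sk_reuseport")
int select_socket(struct sk_reuseport_md *reuse_md)
{
    __u32 index = 0;

    /* Try to steer the packet to the socket stored at index 0. If the
     * lookup fails, returning SK_PASS without a selection lets the
     * kernel fall back to its default hash-based choice. */
    bpf_sk_select_reuseport(reuse_md, &reuseport_map, &index, 0);
    return SK_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

User space attaches the program to one socket of the group with setsockopt() and the SO_ATTACH_REUSEPORT_EBPF option.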
Flow dissector
When the kernel needs to process packet headers without going through the full protocol decode, they are dissected. For example, this happens in the tc subsystem, in multipath routing, in bonding, or when calculating a packet hash. In this situation, the kernel parses the packet headers and fills internal structures with the information from the packet headers. You can replace this internal parsing using the BPF_PROG_TYPE_FLOW_DISSECTOR program. Note that you can only dissect TCP and UDP over IPv4 and IPv6 in eBPF in RHEL.
TCP Congestion Control
You can write a custom TCP congestion control algorithm using a group of BPF_PROG_TYPE_STRUCT_OPS programs that implement struct tcp_congestion_ops callbacks. An algorithm that is implemented this way is available to the system alongside the built-in kernel algorithms.
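The following heavily simplified sketch shows the overall structure of such a group of programs. It assumes a vmlinux.h header generated with bpftool; the algorithm name and its logic are placeholders, not an existing or recommended congestion control implementation.

```c
/* A hypothetical sketch of an eBPF congestion control module built from
 * BPF_PROG_TYPE_STRUCT_OPS programs. The name and logic are placeholders. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("struct_ops/simple_ssthresh")
__u32 BPF_PROG(simple_ssthresh, struct sock *sk)
{
    struct tcp_sock *tp = (struct tcp_sock *)sk;

    /* On loss, halve the congestion window, but keep at least 2 segments. */
    return tp->snd_cwnd > 4 ? tp->snd_cwnd / 2 : 2;
}

SEC("struct_ops/simple_cong_avoid")
void BPF_PROG(simple_cong_avoid, struct sock *sk, __u32 ack, __u32 acked)
{
    struct tcp_sock *tp = (struct tcp_sock *)sk;

    /* Extremely naive growth: one segment per batch of acknowledged data. */
    if (acked)
        tp->snd_cwnd += 1;
}

SEC("struct_ops/simple_undo_cwnd")
__u32 BPF_PROG(simple_undo_cwnd, struct sock *sk)
{
    struct tcp_sock *tp = (struct tcp_sock *)sk;

    return tp->snd_cwnd;
}

/* Registering this map instantiates the algorithm under the given name. */
SEC(".struct_ops")
struct tcp_congestion_ops simple_cc = {
    .ssthresh   = (void *)simple_ssthresh,
    .cong_avoid = (void *)simple_cong_avoid,
    .undo_cwnd  = (void *)simple_undo_cwnd,
    .name       = "bpf_simple_cc",
};
```

Once loaded and registered, an algorithm implemented this way can be selected like any other, for example through the net.ipv4.tcp_congestion_control sysctl or the TCP_CONGESTION socket option.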
Routes with encapsulation
You can attach one of the following eBPF program types to routes in the routing table as a tunnel encapsulation attribute:
- BPF_PROG_TYPE_LWT_IN
- BPF_PROG_TYPE_LWT_OUT
- BPF_PROG_TYPE_LWT_XMIT
The functionality of such an eBPF program is limited to specific tunnel configurations and does not allow creating a generic encapsulation or decapsulation solution.
Socket lookup
To bypass limitations of the bind system call, use an eBPF program of the BPF_PROG_TYPE_SK_LOOKUP type. Such programs can select a listening socket for new incoming TCP connections or an unconnected socket for UDP packets.
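The following sketch of a BPF_PROG_TYPE_SK_LOOKUP program steers all lookups that reach it to a single socket that user space stored in a sockmap; the map name and the "steer everything to one socket" policy are assumptions for the example.

```c
/* An illustrative BPF_PROG_TYPE_SK_LOOKUP program. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_SOCKMAP);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} target_socket SEC(".maps");

SEC("sk_lookup")
int steer_to_socket(struct bpf_sk_lookup *ctx)
{
    __u32 key = 0;
    struct bpf_sock *sk;
    long err;

    /* Look up the one socket that user space stored in the map. */
    sk = bpf_map_lookup_elem(&target_socket, &key);
    if (!sk)
        return SK_PASS;             /* fall back to the normal lookup */

    err = bpf_sk_assign(ctx, sk, 0);
    bpf_sk_release(sk);
    return err ? SK_DROP : SK_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

Such a program is attached to a network namespace with the BPF_SK_LOOKUP attach type.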
42.2. Overview of XDP features in RHEL 9 by network cards
The following is an overview of XDP-enabled network cards and the XDP features you can use with them:
Network card | Driver | Basic | Redirect | Target | HW offload | Zero-copy | Large MTU |
---|---|---|---|---|---|---|---|
Amazon Elastic Network Adapter | | yes | yes | yes [a] | no | no | no |
aQuantia AQtion Ethernet card | | yes | yes | no | no | no | no |
Broadcom NetXtreme-C/E 10/25/40/50 gigabit Ethernet | | yes | yes | yes [a] | no | no | yes |
Cavium Thunder Virtual function | | yes | no | no | no | no | no |
Google Virtual NIC (gVNIC) support | | yes | yes | yes | no | yes | no |
Intel® 10GbE PCI Express Virtual Function Ethernet | | yes | no | no | no | no | no |
Intel® 10GbE PCI Express adapters | | yes | yes | yes [a] | no | yes | yes [b] |
Intel® Ethernet Connection E800 Series | | yes | yes | yes [a] | no | yes | yes |
Intel® Ethernet Controller I225-LM/I225-V family | | yes | yes | yes | no | yes | yes [b] |
Intel® PCI Express Gigabit adapters | | yes | yes | yes [a] | no | no | yes [b] |
Intel® Ethernet Controller XL710 Family | | yes | yes | no | yes | no | |
Marvell OcteonTX2 | | yes | yes | no | no | no | |
Mellanox 5th generation network adapters (ConnectX series) | | yes | yes | yes [c] | no | yes | yes |
Mellanox Technologies 1/10/40Gbit Ethernet | | yes | yes | no | no | no | no |
Microsoft Azure Network Adapter | | yes | yes | yes | no | no | no |
Microsoft Hyper-V virtual network | | yes | yes | yes | no | no | no |
Netronome® NFP4000/NFP6000 NIC [d] | | yes | no | no | yes | yes | no |
QEMU Virtio network | | yes | yes | yes [a] | no | no | yes |
QLogic QED 25/40/100Gb Ethernet NIC | | yes | yes | yes | no | no | no |
STMicroelectronics Multi-Gigabit Ethernet | | yes | yes | yes | no | yes | no |
Solarflare SFC9000/SFC9100/EF100-family | | yes | yes | yes [c] | no | no | no |
Universal TUN/TAP device | | yes | yes | yes | no | no | no |
Virtual Ethernet pair device | | yes | yes | yes | no | no | yes |
VMware VMXNET3 ethernet driver | | yes | yes | no | no | no | |
Xen paravirtual network device | | yes | yes | yes | no | no | no |
[a] Only if an XDP program is loaded on the interface.
[b] Transmitting side only. Cannot receive large packets through XDP.
[c] Requires that the number of allocated XDP TX queues is greater than or equal to the largest CPU index.
[d] Some of the listed features are not available for the Netronome® NFP3800 NIC.
Legend:
- Basic: Supports basic return codes: DROP, PASS, ABORTED, and TX.
- Redirect: Supports the XDP_REDIRECT return code.
- Target: Can be a target of an XDP_REDIRECT return code.
- HW offload: Supports XDP hardware offload.
- Zero-copy: Supports the zero-copy mode for the AF_XDP protocol family.
- Large MTU: Supports packets larger than page size.