Chapter 18. Hardware networks

18.1. About Single Root I/O Virtualization (SR-IOV) hardware networks

The Single Root I/O Virtualization (SR-IOV) specification is a standard for a type of PCI device assignment that can share a single device with multiple pods.

You can configure a Single Root I/O Virtualization (SR-IOV) device in your cluster by using the SR-IOV Operator.

SR-IOV can segment a compliant network device, recognized on the host node as a physical function (PF), into multiple virtual functions (VFs). The VF is used like any other network device. The SR-IOV network device driver for the device determines how the VF is exposed in the container:

netdevice driver: A regular kernel network device in the netns of the container
vfio-pci driver: A character device mounted in the container

You can use SR-IOV network devices with additional networks on your OpenShift Container Platform cluster installed on bare metal or Red Hat OpenStack Platform (RHOSP) infrastructure for applications that require high bandwidth or low latency.

You can configure multi-network policies for SR-IOV networks. The support for this is technology preview and SR-IOV additional networks are only supported with kernel NICs. They are not supported for Data Plane Development Kit (DPDK) applications.

Note

Creating multi-network policies on SR-IOV networks might not deliver the same performance to applications compared to SR-IOV networks without a multi-network policy configured.

Important

Multi-network policies for SR-IOV network is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

You can enable SR-IOV on a node by using the following command:

$ oc label node <node_name> feature.node.kubernetes.io/network-sriov.capable="true"

Additional resources

Installing the SR-IOV Network Operator

18.1.1. Components that manage SR-IOV network devices

The SR-IOV Network Operator creates and manages the components of the SR-IOV stack. The Operator performs the following functions:

Orchestrates discovery and management of SR-IOV network devices
Generates NetworkAttachmentDefinition custom resources for the SR-IOV Container Network Interface (CNI)
Creates and updates the configuration of the SR-IOV network device plugin
Creates node specific SriovNetworkNodeState custom resources
Updates the spec.interfaces field in each SriovNetworkNodeState custom resource

The Operator provisions the following components:

SR-IOV network configuration daemon: A daemon set that is deployed on worker nodes when the SR-IOV Network Operator starts. The daemon is responsible for discovering and initializing SR-IOV network devices in the cluster.
SR-IOV Network Operator webhook: A dynamic admission controller webhook that validates the Operator custom resource and sets appropriate default values for unset fields.
SR-IOV Network resources injector: A dynamic admission controller webhook that provides functionality for patching Kubernetes pod specifications with requests and limits for custom network resources such as SR-IOV VFs. The SR-IOV network resources injector adds the resource field to only the first container in a pod automatically.
SR-IOV network device plugin: A device plugin that discovers, advertises, and allocates SR-IOV network virtual function (VF) resources. Device plugins are used in Kubernetes to enable the use of limited resources, typically in physical devices. Device plugins give the Kubernetes scheduler awareness of resource availability, so that the scheduler can schedule pods on nodes with sufficient resources.
SR-IOV CNI plugin: A CNI plugin that attaches VF interfaces allocated from the SR-IOV network device plugin directly into a pod.
SR-IOV InfiniBand CNI plugin: A CNI plugin that attaches InfiniBand (IB) VF interfaces allocated from the SR-IOV network device plugin directly into a pod.

Note

The SR-IOV Network resources injector and SR-IOV Network Operator webhook are enabled by default and can be disabled by editing the default SriovOperatorConfig CR. Use caution when disabling the SR-IOV Network Operator Admission Controller webhook. You can disable the webhook under specific circumstances, such as troubleshooting, or if you want to use unsupported devices.

18.1.1.1. Supported platforms

The SR-IOV Network Operator is supported on the following platforms:

Bare metal
Red Hat OpenStack Platform (RHOSP)

18.1.1.2. Supported devices

OpenShift Container Platform supports the following network interface controllers:

Table 18.1. Supported network interface controllers
Manufacturer	Model	Vendor ID	Device ID
Broadcom	BCM57414	14e4	16d7
Broadcom	BCM57508	14e4	1750
Broadcom	BCM57504	14e4	1751
Intel	X710	8086	1572
Intel	X710 Backplane	8086	1581
Intel	X710 Base T	8086	15ff
Intel	XL710	8086	1583
Intel	XXV710	8086	158b
Intel	E810-CQDA2	8086	1592
Intel	E810-2CQDA2	8086	1592
Intel	E810-XXVDA2	8086	159b
Intel	E810-XXVDA4	8086	1593
Intel	E810-XXVDA4T	8086	1593
Intel	Ice E810-XXV Backplane	8086	1599
Intel	Ice E823L Backplane	8086	124c
Intel	Ice E823L SFP	8086	124d
Marvell	OCTEON Fusion CNF105XX	177d	ba00
Marvell	OCTEON10 CN10XXX	1177d	b900
Mellanox	MT27700 Family [ConnectX‑4]	15b3	1013
Mellanox	MT27710 Family [ConnectX‑4 Lx]	15b3	1015
Mellanox	MT27800 Family [ConnectX‑5]	15b3	1017
Mellanox	MT28880 Family [ConnectX‑5 Ex]	15b3	1019
Mellanox	MT28908 Family [ConnectX‑6]	15b3	101b
Mellanox	MT2892 Family [ConnectX‑6 Dx]	15b3	101d
Mellanox	MT2894 Family [ConnectX‑6 Lx]	15b3	101f
Mellanox	Mellanox MT2910 Family [ConnectX‑7]	15b3	1021
Mellanox	MT42822 BlueField‑2 in ConnectX‑6 NIC mode	15b3	a2d6
Pensando ^[1]	DSC-25 dual-port 25G distributed services card for ionic driver	0x1dd8	0x1002
Pensando ^[1]	DSC-100 dual-port 100G distributed services card for ionic driver	0x1dd8	0x1003
Silicom	STS Family	8086	1591

OpenShift SR-IOV is supported, but you must set a static, Virtual Function (VF) media access control (MAC) address using the SR-IOV CNI config file when using SR-IOV.

Note

For the most up-to-date list of supported cards and compatible OpenShift Container Platform versions available, see Openshift Single Root I/O Virtualization (SR-IOV) and PTP hardware networks Support Matrix.

18.1.2. Additional resources

Configuring multi-network policy

18.1.3. Next steps

18.2. Configuring an SR-IOV network device

You can configure a Single Root I/O Virtualization (SR-IOV) device in your cluster.

Before you perform any tasks in the following documentation, ensure that you installed the SR-IOV Network Operator.

18.2.1. SR-IOV network node configuration object

You specify the SR-IOV network device configuration for a node by creating an SR-IOV network node policy. The API object for the policy is part of the sriovnetwork.openshift.io API group.

The following YAML describes an SR-IOV network node policy:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: <name> 1
  namespace: openshift-sriov-network-operator 2
spec:
  resourceName: <sriov_resource_name> 3
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true" 4
  priority: <priority> 5
  mtu: <mtu> 6
  needVhostNet: false 7
  numVfs: <num> 8
  externallyManaged: false 9
  nicSelector: 10
    vendor: "<vendor_code>" 11
    deviceID: "<device_id>" 12
    pfNames: ["<pf_name>", ...] 13
    rootDevices: ["<pci_bus_id>", ...] 14
    netFilter: "<filter_string>" 15
  deviceType: <device_type> 16
  isRdma: false 17
  linkType: <link_type> 18
  eSwitchMode: "switchdev" 19
  excludeTopology: false 20

1

The name for the custom resource object.

2

The namespace where the SR-IOV Network Operator is installed.

3

The resource name of the SR-IOV network device plugin. You can create multiple SR-IOV network node policies for a resource name.

When specifying a name, be sure to use the accepted syntax expression ^[a-zA-Z0-9_]+$ in the resourceName.

4

The node selector specifies the nodes to configure. Only SR-IOV network devices on the selected nodes are configured. The SR-IOV Container Network Interface (CNI) plugin and device plugin are deployed on selected nodes only.

Important

The SR-IOV Network Operator applies node network configuration policies to nodes in sequence. Before applying node network configuration policies, the SR-IOV Network Operator checks if the machine config pool (MCP) for a node is in an unhealthy state such as Degraded or Updating. If a node is in an unhealthy MCP, the process of applying node network configuration policies to all targeted nodes in the cluster pauses until the MCP returns to a healthy state.

To avoid a node in an unhealthy MCP from blocking the application of node network configuration policies to other nodes, including nodes in other MCPs, you must create a separate node network configuration policy for each MCP.

5

Optional: The priority is an integer value between 0 and 99. A smaller value receives higher priority. For example, a priority of 10 is a higher priority than 99. The default value is 99.

6

Optional: The maximum transmission unit (MTU) of the physical function and all its virtual functions. The maximum MTU value can vary for different network interface controller (NIC) models.

Important

If you want to create virtual function on the default network interface, ensure that the MTU is set to a value that matches the cluster MTU.

If you want to modify the MTU of a single virtual function while the function is assigned to a pod, leave the MTU value blank in the SR-IOV network node policy. Otherwise, the SR-IOV Network Operator reverts the MTU of the virtual function to the MTU value defined in the SR-IOV network node policy, which might trigger a node drain.

7

Optional: Set needVhostNet to true to mount the /dev/vhost-net device in the pod. Use the mounted /dev/vhost-net device with Data Plane Development Kit (DPDK) to forward traffic to the kernel network stack.

8

The number of the virtual functions (VF) to create for the SR-IOV physical network device. For an Intel network interface controller (NIC), the number of VFs cannot be larger than the total VFs supported by the device. For a Mellanox NIC, the number of VFs cannot be larger than 127.

9

The externallyManaged field indicates whether the SR-IOV Network Operator manages all, or only a subset of virtual functions (VFs). With the value set to false the SR-IOV Network Operator manages and configures all VFs on the PF.

Note

When externallyManaged is set to true, you must manually create the Virtual Functions (VFs) on the physical function (PF) before applying the SriovNetworkNodePolicy resource. If the VFs are not pre-created, the SR-IOV Network Operator’s webhook will block the policy request.

When externallyManaged is set to false, the SR-IOV Network Operator automatically creates and manages the VFs, including resetting them if necessary.

To use VFs on the host system, you must create them through NMState, and set externallyManaged to true. In this mode, the SR-IOV Network Operator does not modify the PF or the manually managed VFs, except for those explicitly defined in the nicSelector field of your policy. However, the SR-IOV Network Operator continues to manage VFs that are used as pod secondary interfaces.

10

The NIC selector identifies the device to which this resource applies. You do not have to specify values for all the parameters. It is recommended to identify the network device with enough precision to avoid selecting a device unintentionally.

If you specify rootDevices, you must also specify a value for vendor, deviceID, or pfNames. If you specify both pfNames and rootDevices at the same time, ensure that they refer to the same device. If you specify a value for netFilter, then you do not need to specify any other parameter because a network ID is unique.

11

Optional: The vendor hexadecimal vendor identifier of the SR-IOV network device. The only allowed values are 8086 (Intel) and 15b3 (Mellanox).

12

Optional: The device hexadecimal device identifier of the SR-IOV network device. For example, 101b is the device ID for a Mellanox ConnectX-6 device.

13

Optional: An array of one or more physical function (PF) names the resource must apply to.

14

Optional: An array of one or more PCI bus addresses the resource must apply to. For example 0000:02:00.1.

15

Optional: The platform-specific network filter. The only supported platform is Red Hat OpenStack Platform (RHOSP). Acceptable values use the following format: openstack/NetworkID:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx. Replace xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx with the value from the /var/config/openstack/latest/network_data.json metadata file. This filter ensures that VFs are associated with a specific OpenStack network. The operator uses this filter to map the VFs to the appropriate network based on metadata provided by the OpenStack platform.

16

Optional: The driver to configure for the VFs created from this resource. The only allowed values are netdevice and vfio-pci. The default value is netdevice.

For a Mellanox NIC to work in DPDK mode on bare metal nodes, use the netdevice driver type and set isRdma to true.

17

Optional: Configures whether to enable remote direct memory access (RDMA) mode. The default value is false.

If the isRdma parameter is set to true, you can continue to use the RDMA-enabled VF as a normal network device. A device can be used in either mode.

Set isRdma to true and additionally set needVhostNet to true to configure a Mellanox NIC for use with Fast Datapath DPDK applications.

Note

You cannot set the isRdma parameter to true for intel NICs.

18

Optional: The link type for the VFs. The default value is eth for Ethernet. Change this value to 'ib' for InfiniBand.

When linkType is set to ib, isRdma is automatically set to true by the SR-IOV Network Operator webhook. When linkType is set to ib, deviceType should not be set to vfio-pci.

Do not set linkType to eth for SriovNetworkNodePolicy, because this can lead to an incorrect number of available devices reported by the device plugin.

19

Optional: To enable hardware offloading, you must set the eSwitchMode field to "switchdev". For more information about hardware offloading, see "Configuring hardware offloading".

20

Optional: To exclude advertising an SR-IOV network resource’s NUMA node to the Topology Manager, set the value to true. The default value is false.

18.2.1.1. SR-IOV network node configuration examples

The following example describes the configuration for an InfiniBand device:

Example configuration for an InfiniBand device

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: <name>
  namespace: openshift-sriov-network-operator
spec:
  resourceName: <sriov_resource_name>
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: <num>
  nicSelector:
    vendor: "<vendor_code>"
    deviceID: "<device_id>"
    rootDevices:
      - "<pci_bus_id>"
  linkType: <link_type>
  isRdma: true
# ...

The following example describes the configuration for an SR-IOV network device in a RHOSP virtual machine:

Example configuration for an SR-IOV device in a virtual machine

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: <name>
  namespace: openshift-sriov-network-operator
spec:
  resourceName: <sriov_resource_name>
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 1 1
  nicSelector:
    vendor: "<vendor_code>"
    deviceID: "<device_id>"
    netFilter: "openstack/NetworkID:ea24bd04-8674-4f69-b0ee-fa0b3bd20509" 2
# ...

1: When configuring the node network policy for a virtual machine, the numVfs parameter is always set to 1.
2: When the virtual machine is deployed on RHOSP, the netFilter parameter must refer to a network ID. Valid values for netFilter are available from an SriovNetworkNodeState object.

18.2.1.2. Automated discovery of SR-IOV network devices

The SR-IOV Network Operator searches your cluster for SR-IOV capable network devices on worker nodes. The Operator creates and updates a SriovNetworkNodeState custom resource (CR) for each worker node that provides a compatible SR-IOV network device.

The CR is assigned the same name as the worker node. The status.interfaces list provides information about the network devices on a node.

Important

Do not modify a SriovNetworkNodeState object. The Operator creates and manages these resources automatically.

18.2.1.2.1. Example SriovNetworkNodeState object

The following YAML is an example of a SriovNetworkNodeState object created by the SR-IOV Network Operator:

An SriovNetworkNodeState object

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
  name: node-25 1
  namespace: openshift-sriov-network-operator
  ownerReferences:
  - apiVersion: sriovnetwork.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: SriovNetworkNodePolicy
    name: default
spec:
  dpConfigVersion: "39824"
status:
  interfaces: 2
  - deviceID: "1017"
    driver: mlx5_core
    mtu: 1500
    name: ens785f0
    pciAddress: "0000:18:00.0"
    totalvfs: 8
    vendor: 15b3
  - deviceID: "1017"
    driver: mlx5_core
    mtu: 1500
    name: ens785f1
    pciAddress: "0000:18:00.1"
    totalvfs: 8
    vendor: 15b3
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens817f0
    pciAddress: 0000:81:00.0
    totalvfs: 64
    vendor: "8086"
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens817f1
    pciAddress: 0000:81:00.1
    totalvfs: 64
    vendor: "8086"
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens803f0
    pciAddress: 0000:86:00.0
    totalvfs: 64
    vendor: "8086"
  syncStatus: Succeeded

1: The value of the name field is the same as the name of the worker node.
2: The interfaces stanza includes a list of all of the SR-IOV devices discovered by the Operator on the worker node.

18.2.1.3. Virtual function (VF) partitioning for SR-IOV devices

In some cases, you might want to split virtual functions (VFs) from the same physical function (PF) into multiple resource pools. For example, you might want some of the VFs to load with the default driver and the remaining VFs load with the vfio-pci driver. In such a deployment, the pfNames selector in your SriovNetworkNodePolicy custom resource (CR) can be used to specify a range of VFs for a pool using the following format: <pfname>#<first_vf>-<last_vf>.

For example, the following YAML shows the selector for an interface named netpf0 with VF 2 through 7:

pfNames: ["netpf0#2-7"]

netpf0 is the PF interface name.
2 is the first VF index (0-based) that is included in the range.
7 is the last VF index (0-based) that is included in the range.

You can select VFs from the same PF by using different policy CRs if the following requirements are met:

The numVfs value must be identical for policies that select the same PF.
The VF index must be in the range of 0 to <numVfs>-1. For example, if you have a policy with numVfs set to 8, then the <first_vf> value must not be smaller than 0, and the <last_vf> must not be larger than 7.
The VFs ranges in different policies must not overlap.
The <first_vf> must not be larger than the <last_vf>.

The following example illustrates NIC partitioning for an SR-IOV device.

The policy policy-net-1 defines a resource pool net-1 that contains the VF 0 of PF netpf0 with the default VF driver. The policy policy-net-1-dpdk defines a resource pool net-1-dpdk that contains the VF 8 to 15 of PF netpf0 with the vfio VF driver.

Policy policy-net-1:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-net-1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 16
  nicSelector:
    pfNames: ["netpf0#0-0"]
  deviceType: netdevice

Policy policy-net-1-dpdk:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-net-1-dpdk
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1dpdk
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 16
  nicSelector:
    pfNames: ["netpf0#8-15"]
  deviceType: vfio-pci

Verifying that the interface is successfully partitioned

Confirm that the interface partitioned to virtual functions (VFs) for the SR-IOV device by running the following command.

$ ip link show <interface> 1

1: Replace <interface> with the interface that you specified when partitioning to VFs for the SR-IOV device, for example, ens3f1.

Example output

5: ens3f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 3c:fd:fe:d1:bc:01 brd ff:ff:ff:ff:ff:ff

vf 0     link/ether 5a:e7:88:25:ea:a0 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 1     link/ether 3e:1d:36:d7:3d:49 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 2     link/ether ce:09:56:97:df:f9 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 3     link/ether 5e:91:cf:88:d1:38 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 4     link/ether e6:06:a1:96:2f:de brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off

18.2.1.4. A test pod template for clusters that use SR-IOV on OpenStack

The following testpmd pod demonstrates container creation with huge pages, reserved CPUs, and the SR-IOV port.

An example testpmd pod

apiVersion: v1
kind: Pod
metadata:
  name: testpmd-sriov
  namespace: mynamespace
  annotations:
    cpu-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
# ...
spec:
  containers:
  - name: testpmd
    command: ["sleep", "99999"]
    image: registry.redhat.io/openshift4/dpdk-base-rhel8:v4.9
    securityContext:
      capabilities:
        add: ["IPC_LOCK","SYS_ADMIN"]
      privileged: true
      runAsUser: 0
    resources:
      requests:
        memory: 1000Mi
        hugepages-1Gi: 1Gi
        cpu: '2'
        openshift.io/sriov1: 1
      limits:
        hugepages-1Gi: 1Gi
        cpu: '2'
        memory: 1000Mi
        openshift.io/sriov1: 1
    volumeMounts:
      - mountPath: /dev/hugepages
        name: hugepage
        readOnly: False
  runtimeClassName: performance-cnf-performanceprofile 1
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages

1: This example assumes that the name of the performance profile is cnf-performance profile.

18.2.1.5. A test pod template for clusters that use OVS hardware offloading on OpenStack

The following testpmd pod demonstrates Open vSwitch (OVS) hardware offloading on Red Hat OpenStack Platform (RHOSP).

An example testpmd pod

apiVersion: v1
kind: Pod
metadata:
  name: testpmd-sriov
  namespace: mynamespace
  annotations:
    k8s.v1.cni.cncf.io/networks: hwoffload1
spec:
  runtimeClassName: performance-cnf-performanceprofile 1
  containers:
  - name: testpmd
    command: ["sleep", "99999"]
    image: registry.redhat.io/openshift4/dpdk-base-rhel8:v4.9
    securityContext:
      capabilities:
        add: ["IPC_LOCK","SYS_ADMIN"]
      privileged: true
      runAsUser: 0
    resources:
      requests:
        memory: 1000Mi
        hugepages-1Gi: 1Gi
        cpu: '2'
      limits:
        hugepages-1Gi: 1Gi
        cpu: '2'
        memory: 1000Mi
    volumeMounts:
      - mountPath: /mnt/huge
        name: hugepage
        readOnly: False
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages

1: If your performance profile is not named cnf-performance profile, replace that string with the correct performance profile name.

18.2.1.6. Huge pages resource injection for Downward API

When a pod specification includes a resource request or limit for huge pages, the Network Resources Injector automatically adds Downward API fields to the pod specification to provide the huge pages information to the container.

The Network Resources Injector adds a volume that is named podnetinfo and is mounted at /etc/podnetinfo for each container in the pod. The volume uses the Downward API and includes a file for huge pages requests and limits. The file naming convention is as follows:

/etc/podnetinfo/hugepages_1G_request_<container-name>
/etc/podnetinfo/hugepages_1G_limit_<container-name>
/etc/podnetinfo/hugepages_2M_request_<container-name>
/etc/podnetinfo/hugepages_2M_limit_<container-name>

The paths specified in the previous list are compatible with the app-netutil library. By default, the library is configured to search for resource information in the /etc/podnetinfo directory. If you choose to specify the Downward API path items yourself manually, the app-netutil library searches for the following paths in addition to the paths in the previous list.

/etc/podnetinfo/hugepages_request
/etc/podnetinfo/hugepages_limit
/etc/podnetinfo/hugepages_1G_request
/etc/podnetinfo/hugepages_1G_limit
/etc/podnetinfo/hugepages_2M_request
/etc/podnetinfo/hugepages_2M_limit

As with the paths that the Network Resources Injector can create, the paths in the preceding list can optionally end with a _<container-name> suffix.

18.2.2. Configuring SR-IOV network devices

The SR-IOV Network Operator adds the SriovNetworkNodePolicy.sriovnetwork.openshift.io CustomResourceDefinition to OpenShift Container Platform. You can configure an SR-IOV network device by creating a SriovNetworkNodePolicy custom resource (CR).

Note

When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator might drain the nodes, and in some cases, reboot nodes. Reboot only happens in the following cases:

With Mellanox NICs (mlx5 driver) a node reboot happens every time the number of virtual functions (VFs) increase on a physical function (PF).
With Intel NICs, a reboot only happens if the kernel parameters do not include intel_iommu=on and iommu=pt.

It might take several minutes for a configuration change to apply.

Prerequisites

You installed the OpenShift CLI (oc).
You have access to the cluster as a user with the cluster-admin role.
You have installed the SR-IOV Network Operator.
You have enough available nodes in your cluster to handle the evicted workload from drained nodes.
You have not selected any control plane nodes for SR-IOV network device configuration.

Procedure

Create an SriovNetworkNodePolicy object, and then save the YAML in the <name>-sriov-node-network.yaml file. Replace <name> with the name for this configuration.
Optional: Label the SR-IOV capable cluster nodes with SriovNetworkNodePolicy.Spec.NodeSelector if they are not already labeled. For more information about labeling nodes, see "Understanding how to update labels on nodes".
Create the SriovNetworkNodePolicy object:
```
$ oc create -f <name>-sriov-node-network.yaml
```
where <name> specifies the name for this configuration.
After applying the configuration update, all the pods in sriov-network-operator namespace transition to the Running status.
To verify that the SR-IOV network device is configured, enter the following command. Replace <node_name> with the name of a node with the SR-IOV network device that you just configured.
```
$ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name> -o jsonpath='{.status.syncStatus}'
```

Additional resources

Understanding how to update labels on nodes.

18.2.3. Creating a non-uniform memory access (NUMA) aligned SR-IOV pod

You can create a NUMA aligned SR-IOV pod by restricting SR-IOV and the CPU resources allocated from the same NUMA node with restricted or single-numa-node Topology Manager policies.

Prerequisites

You have installed the OpenShift CLI (oc).
You have configured the CPU Manager policy to static. For more information on CPU Manager, see the "Additional resources" section.
You have configured the Topology Manager policy to single-numa-node.
Note
When single-numa-node is unable to satisfy the request, you can configure the Topology Manager policy to restricted. For more flexible SR-IOV network resource scheduling, see Excluding SR-IOV network topology during NUMA-aware scheduling in the Additional resources section.

Procedure

Create the following SR-IOV pod spec, and then save the YAML in the <name>-sriov-pod.yaml file. Replace <name> with a name for this pod.
The following example shows an SR-IOV pod spec:
```
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: <name> 1
spec:
  containers:
  - name: sample-container
    image: <image> 2
    command: ["sleep", "infinity"]
    resources:
      limits:
        memory: "1Gi" 3
        cpu: "2" 4
      requests:
        memory: "1Gi"
        cpu: "2"
```
1
Replace <name> with the name of the SR-IOV network attachment definition CR.
2
Replace <image> with the name of the sample-pod image.
3
To create the SR-IOV pod with guaranteed QoS, set memory limits equal to memory requests.
4
To create the SR-IOV pod with guaranteed QoS, set cpu limits equals to cpu requests.
Create the sample SR-IOV pod by running the following command:
```
$ oc create -f <filename> 1
```
1
Replace <filename> with the name of the file you created in the previous step.
Confirm that the sample-pod is configured with guaranteed QoS.
```
$ oc describe pod sample-pod
```

Confirm that the sample-pod is allocated with exclusive CPUs.

$ oc exec sample-pod -- cat /sys/fs/cgroup/cpuset/cpuset.cpus

Confirm that the SR-IOV device and CPUs that are allocated for the sample-pod are on the same NUMA node.
```
$ oc exec sample-pod -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
```

18.2.4. Exclude the SR-IOV network topology for NUMA-aware scheduling

You can exclude advertising the Non-Uniform Memory Access (NUMA) node for the SR-IOV network to the Topology Manager for more flexible SR-IOV network deployments during NUMA-aware pod scheduling.

In some scenarios, it is a priority to maximize CPU and memory resources for a pod on a single NUMA node. By not providing a hint to the Topology Manager about the NUMA node for the pod’s SR-IOV network resource, the Topology Manager can deploy the SR-IOV network resource and the pod CPU and memory resources to different NUMA nodes. This can add to network latency because of the data transfer between NUMA nodes. However, it is acceptable in scenarios when workloads require optimal CPU and memory performance.

For example, consider a compute node, compute-1, that features two NUMA nodes: numa0 and numa1. The SR-IOV-enabled NIC is present on numa0. The CPUs available for pod scheduling are present on numa1 only. By setting the excludeTopology specification to true, the Topology Manager can assign CPU and memory resources for the pod to numa1 and can assign the SR-IOV network resource for the same pod to numa0. This is only possible when you set the excludeTopology specification to true. Otherwise, the Topology Manager attempts to place all resources on the same NUMA node.

18.2.5. Troubleshooting SR-IOV configuration

After following the procedure to configure an SR-IOV network device, the following sections address some error conditions.

To display the state of nodes, run the following command:

$ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name>

where: <node_name> specifies the name of a node with an SR-IOV network device.

Error output: Cannot allocate memory

"lastSyncError": "write /sys/bus/pci/devices/0000:3b:00.1/sriov_numvfs: cannot allocate memory"

When a node indicates that it cannot allocate memory, check the following items:

Confirm that global SR-IOV settings are enabled in the BIOS for the node.
Confirm that VT-d is enabled in the BIOS for the node.

Additional resources

Using CPU Manager

18.2.6. Next steps

Configuring an SR-IOV network attachment

18.3. Configuring an SR-IOV Ethernet network attachment

You can configure an Ethernet network attachment for an Single Root I/O Virtualization (SR-IOV) device in the cluster.

Before you perform any tasks in the following documentation, ensure that you installed the SR-IOV Network Operator.

18.3.1. Ethernet device configuration object

You can configure an Ethernet network device by defining an SriovNetwork object.

The following YAML describes an SriovNetwork object:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: <name> 1
  namespace: openshift-sriov-network-operator 2
spec:
  resourceName: <sriov_resource_name> 3
  networkNamespace: <target_namespace> 4
  vlan: <vlan> 5
  spoofChk: "<spoof_check>" 6
  ipam: |- 7
    {}
  linkState: <link_state> 8
  maxTxRate: <max_tx_rate> 9
  minTxRate: <min_tx_rate> 10
  vlanQoS: <vlan_qos> 11
  trust: "<trust_vf>" 12
  capabilities: <capabilities> 13

1: A name for the object. The SR-IOV Network Operator creates a NetworkAttachmentDefinition object with same name.
2: The namespace where the SR-IOV Network Operator is installed.
3: The value for the spec.resourceName parameter from the SriovNetworkNodePolicy object that defines the SR-IOV hardware for this additional network.
4: The target namespace for the SriovNetwork object. Only pods in the target namespace can attach to the additional network.
5: Optional: A Virtual LAN (VLAN) ID for the additional network. The integer value must be from 0 to 4095. The default value is 0.
6: Optional: The spoof check mode of the VF. The allowed values are the strings "on" and "off".
Important
You must enclose the value you specify in quotes or the object is rejected by the SR-IOV Network Operator.
7: A configuration object for the IPAM CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
8: Optional: The link state of virtual function (VF). Allowed value are enable, disable and auto.
9: Optional: A maximum transmission rate, in Mbps, for the VF.
10: Optional: A minimum transmission rate, in Mbps, for the VF. This value must be less than or equal to the maximum transmission rate.
Note
Intel NICs do not support the minTxRate parameter. For more information, see BZ#1772847.
11: Optional: An IEEE 802.1p priority level for the VF. The default value is 0.
12: Optional: The trust mode of the VF. The allowed values are the strings "on" and "off".
Important
You must enclose the value that you specify in quotes, or the SR-IOV Network Operator rejects the object.
13: Optional: The capabilities to configure for this additional network. You can specify '{ "ips": true }' to enable IP address support or '{ "mac": true }' to enable MAC address support.

18.3.1.1. Creating a configuration for assignment of dual-stack IP addresses dynamically

Dual-stack IP address assignment can be configured with the ipRanges parameter for:

IPv4 addresses
IPv6 addresses
multiple IP address assignment

Procedure

Set type to whereabouts.

Use ipRanges to allocate IP addresses as shown in the following example:

cniVersion: operator.openshift.io/v1
kind: Network
=metadata:
  name: cluster
spec:
  additionalNetworks:
  - name: whereabouts-shim
    namespace: default
    type: Raw
    rawCNIConfig: |-
      {
       "name": "whereabouts-dual-stack",
       "cniVersion": "0.3.1,
       "type": "bridge",
       "ipam": {
         "type": "whereabouts",
         "ipRanges": [
                  {"range": "192.168.10.0/24"},
                  {"range": "2001:db8::/64"}
              ]
       }
      }

Attach network to a pod. For more information, see "Adding a pod to an additional network".
Verify that all IP addresses are assigned.
Run the following command to ensure the IP addresses are assigned as metadata.
```
$ oc exec -it mypod -- ip a
```

18.3.1.2. Configuration of IP address assignment for a network attachment

For additional networks, IP addresses can be assigned using an IP Address Management (IPAM) CNI plugin, which supports various assignment methods, including Dynamic Host Configuration Protocol (DHCP) and static assignment.

The DHCP IPAM CNI plugin responsible for dynamic assignment of IP addresses operates with two distinct components:

CNI Plugin: Responsible for integrating with the Kubernetes networking stack to request and release IP addresses.
DHCP IPAM CNI Daemon: A listener for DHCP events that coordinates with existing DHCP servers in the environment to handle IP address assignment requests. This daemon is not a DHCP server itself.

For networks requiring type: dhcp in their IPAM configuration, ensure the following:

A DHCP server is available and running in the environment. The DHCP server is external to the cluster and is expected to be part of the customer’s existing network infrastructure.
The DHCP server is appropriately configured to serve IP addresses to the nodes.

In cases where a DHCP server is unavailable in the environment, it is recommended to use the Whereabouts IPAM CNI plugin instead. The Whereabouts CNI provides similar IP address management capabilities without the need for an external DHCP server.

Note

Use the Whereabouts CNI plugin when there is no external DHCP server or where static IP address management is preferred. The Whereabouts plugin includes a reconciler daemon to manage stale IP address allocations.

A DHCP lease must be periodically renewed throughout the container’s lifetime, so a separate daemon, the DHCP IPAM CNI Daemon, is required. To deploy the DHCP IPAM CNI daemon, modify the Cluster Network Operator (CNO) configuration to trigger the deployment of this daemon as part of the additional network setup.

18.3.1.2.1. Static IP address assignment configuration

The following table describes the configuration for static IP address assignment:

Table 18.2. ipam static configuration object
Field	Type	Description
`type`	`string`	The IPAM address type. The value `static` is required.
`addresses`	`array`	An array of objects specifying IP addresses to assign to the virtual interface. Both IPv4 and IPv6 IP addresses are supported.
`routes`	`array`	An array of objects specifying routes to configure inside the pod.
`dns`	`array`	Optional: An array of objects specifying the DNS configuration.

The addresses array requires objects with the following fields:

Table 18.3. ipam.addresses[] array
Field	Type	Description
`address`	`string`	An IP address and network prefix that you specify. For example, if you specify `10.10.21.10/24`, then the additional network is assigned an IP address of `10.10.21.10` and the netmask is `255.255.255.0`.
`gateway`	`string`	The default gateway to route egress network traffic to.

Table 18.4. ipam.routes[] array
Field	Type	Description
`dst`	`string`	The IP address range in CIDR format, such as `192.168.17.0/24` or `0.0.0.0/0` for the default route.
`gw`	`string`	The gateway where network traffic is routed.

Table 18.5. ipam.dns object
Field	Type	Description
`nameservers`	`array`	An array of one or more IP addresses for to send DNS queries to.
`domain`	`array`	The default domain to append to a hostname. For example, if the domain is set to `example.com`, a DNS lookup query for `example-host` is rewritten as `example-host.example.com`.
`search`	`array`	An array of domain names to append to an unqualified hostname, such as `example-host`, during a DNS lookup query.

Static IP address assignment configuration example

{
  "ipam": {
    "type": "static",
      "addresses": [
        {
          "address": "191.168.1.7/24"
        }
      ]
  }
}

18.3.1.2.2. Dynamic IP address (DHCP) assignment configuration

A pod obtains its original DHCP lease when it is created. The lease must be periodically renewed by a minimal DHCP server deployment running on the cluster.

Important

For an Ethernet network attachment, the SR-IOV Network Operator does not create a DHCP server deployment; the Cluster Network Operator is responsible for creating the minimal DHCP server deployment.

To trigger the deployment of the DHCP server, you must create a shim network attachment by editing the Cluster Network Operator configuration, as in the following example:

Example shim network attachment definition

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  additionalNetworks:
  - name: dhcp-shim
    namespace: default
    type: Raw
    rawCNIConfig: |-
      {
        "name": "dhcp-shim",
        "cniVersion": "0.3.1",
        "type": "bridge",
        "ipam": {
          "type": "dhcp"
        }
      }
  # ...

The following table describes the configuration parameters for dynamic IP address address assignment with DHCP.

Table 18.6. ipam DHCP configuration object
Field	Type	Description
`type`	`string`	The IPAM address type. The value `dhcp` is required.

The following JSON example describes the configuration p for dynamic IP address address assignment with DHCP.

Dynamic IP address (DHCP) assignment configuration example

{
  "ipam": {
    "type": "dhcp"
  }
}

18.3.1.2.3. Dynamic IP address assignment configuration with Whereabouts

The Whereabouts CNI plugin allows the dynamic assignment of an IP address to an additional network without the use of a DHCP server.

The Whereabouts CNI plugin also supports overlapping IP address ranges and configuration of the same CIDR range multiple times within separate NetworkAttachmentDefinition CRDs. This provides greater flexibility and management capabilities in multi-tenant environments.

18.3.1.2.3.1. Dynamic IP address configuration objects

The following table describes the configuration objects for dynamic IP address assignment with Whereabouts:

Table 18.7. ipam whereabouts configuration object
Field	Type	Description
`type`	`string`	The IPAM address type. The value `whereabouts` is required.
`range`	`string`	An IP address and range in CIDR notation. IP addresses are assigned from within this range of addresses.
`exclude`	`array`	Optional: A list of zero or more IP addresses and ranges in CIDR notation. IP addresses within an excluded address range are not assigned.
`network_name`	`string`	Optional: Helps ensure that each group or domain of pods gets its own set of IP addresses, even if they share the same range of IP addresses. Setting this field is important for keeping networks separate and organized, notably in multi-tenant environments.

18.3.1.2.3.2. Dynamic IP address assignment configuration that uses Whereabouts

The following example shows a dynamic address assignment configuration that uses Whereabouts:

Whereabouts dynamic IP address assignment

{
  "ipam": {
    "type": "whereabouts",
    "range": "192.0.2.192/27",
    "exclude": [
       "192.0.2.192/30",
       "192.0.2.196/32"
    ]
  }
}

18.3.1.2.3.3. Dynamic IP address assignment that uses Whereabouts with overlapping IP address ranges

The following example shows a dynamic IP address assignment that uses overlapping IP address ranges for multi-tenant networks.

NetworkAttachmentDefinition 1

{
  "ipam": {
    "type": "whereabouts",
    "range": "192.0.2.192/29",
    "network_name": "example_net_common", 1
  }
}

1: Optional. If set, must match the network_name of NetworkAttachmentDefinition 2.

NetworkAttachmentDefinition 2

{
  "ipam": {
    "type": "whereabouts",
    "range": "192.0.2.192/24",
    "network_name": "example_net_common", 1
  }
}

1: Optional. If set, must match the network_name of NetworkAttachmentDefinition 1.

18.3.2. Configuring SR-IOV additional network

You can configure an additional network that uses SR-IOV hardware by creating an SriovNetwork object. When you create an SriovNetwork object, the SR-IOV Network Operator automatically creates a NetworkAttachmentDefinition object.

Note

Do not modify or delete an SriovNetwork object if it is attached to any pods in a running state.

Prerequisites

Install the OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.

Procedure

Create a SriovNetwork object, and then save the YAML in the <name>.yaml file, where <name> is a name for this additional network. The object specification might resemble the following example:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: attach1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1
  networkNamespace: project2
  ipam: |-
    {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "rangeStart": "10.56.217.171",
      "rangeEnd": "10.56.217.181",
      "gateway": "10.56.217.1"
    }

To create the object, enter the following command:
```
$ oc create -f <name>.yaml
```
where <name> specifies the name of the additional network.
Optional: To confirm that the NetworkAttachmentDefinition object that is associated with the SriovNetwork object that you created in the previous step exists, enter the following command. Replace <namespace> with the networkNamespace you specified in the SriovNetwork object.
```
$ oc get net-attach-def -n <namespace>
```

18.3.3. Assigning an SR-IOV network to a VRF

As a cluster administrator, you can assign an SR-IOV network interface to your VRF domain by using the CNI VRF plugin.

To do this, add the VRF configuration to the optional metaPlugins parameter of the SriovNetwork resource.

Note

Applications that use VRFs need to bind to a specific device. The common usage is to use the SO_BINDTODEVICE option for a socket. SO_BINDTODEVICE binds the socket to a device that is specified in the passed interface name, for example, eth1. To use SO_BINDTODEVICE, the application must have CAP_NET_RAW capabilities.

Using a VRF through the ip vrf exec command is not supported in OpenShift Container Platform pods. To use VRF, bind applications directly to the VRF interface.

18.3.3.1. Creating an additional SR-IOV network attachment with the CNI VRF plugin

The SR-IOV Network Operator manages additional network definitions. When you specify an additional SR-IOV network to create, the SR-IOV Network Operator creates the NetworkAttachmentDefinition custom resource (CR) automatically.

Note

Do not edit NetworkAttachmentDefinition custom resources that the SR-IOV Network Operator manages. Doing so might disrupt network traffic on your additional network.

To create an additional SR-IOV network attachment with the CNI VRF plugin, perform the following procedure.

Prerequisites

Install the OpenShift Container Platform CLI (oc).
Log in to the OpenShift Container Platform cluster as a user with cluster-admin privileges.

Procedure

Create the SriovNetwork custom resource (CR) for the additional SR-IOV network attachment and insert the metaPlugins configuration, as in the following example CR. Save the YAML as the file sriov-network-attachment.yaml.

Example SriovNetwork custom resource (CR) example

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: example-network
  namespace: additional-sriov-network-1
spec:
  ipam: |
    {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "rangeStart": "10.56.217.171",
      "rangeEnd": "10.56.217.181",
      "routes": [{
        "dst": "0.0.0.0/0"
      }],
      "gateway": "10.56.217.1"
    }
  vlan: 0
  resourceName: intelnics
  metaPlugins : |
    {
      "type": "vrf", 1
      "vrfname": "example-vrf-name" 2
    }

1: type must be set to vrf.
2: vrfname is the name of the VRF that the interface is assigned to. If it does not exist in the pod, it is created.

Create the SriovNetwork resource:

$ oc create -f sriov-network-attachment.yaml

Verifying that the NetworkAttachmentDefinition CR is successfully created

Confirm that the SR-IOV Network Operator created the NetworkAttachmentDefinition CR by running the following command:
```
$ oc get network-attachment-definitions -n <namespace> 1
```
1
Replace <namespace> with the namespace that you specified when configuring the network attachment, for example, additional-sriov-network-1.
Example output
```
NAME                            AGE
additional-sriov-network-1      14m
```
Note
There might be a delay before the SR-IOV Network Operator creates the CR.

Verifying that the additional SR-IOV network attachment is successful

To verify that the VRF CNI is correctly configured and that the additional SR-IOV network attachment is attached, do the following:

Create an SR-IOV network that uses the VRF CNI.
Assign the network to a pod.
Verify that the pod network attachment is connected to the SR-IOV additional network. Remote shell into the pod and run the following command:
```
$ ip vrf show
```
Example output
```
Name              Table
-----------------------
red                 10
```
Confirm that the VRF interface is master of the secondary interface by running the following command:
```
$ ip link
```
Example output
```
...
5: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master red state UP mode
...
```

18.3.4. Runtime configuration for an Ethernet-based SR-IOV attachment

When attaching a pod to an additional network, you can specify a runtime configuration to make specific customizations for the pod. For example, you can request a specific MAC hardware address.

You specify the runtime configuration by setting an annotation in the pod specification. The annotation key is k8s.v1.cni.cncf.io/networks, and it accepts a JSON object that describes the runtime configuration.

The following JSON describes the runtime configuration options for an Ethernet-based SR-IOV network attachment.

[
  {
    "name": "<name>", 1
    "mac": "<mac_address>", 2
    "ips": ["<cidr_range>"] 3
  }
]

1: The name of the SR-IOV network attachment definition CR.
2: Optional: The MAC address for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. To use this feature, you also must specify { "mac": true } in the SriovNetwork object.
3: Optional: IP addresses for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. Both IPv4 and IPv6 addresses are supported. To use this feature, you also must specify { "ips": true } in the SriovNetwork object.

Example runtime configuration

apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "net1",
          "mac": "20:04:0f:f1:88:01",
          "ips": ["192.168.10.1/24", "2001::1/64"]
        }
      ]
spec:
  containers:
  - name: sample-container
    image: <image>
    imagePullPolicy: IfNotPresent
    command: ["sleep", "infinity"]

18.3.5. Adding a pod to an additional network

You can add a pod to an additional network. The pod continues to send normal cluster-related network traffic over the default network.

When a pod is created additional networks are attached to it. However, if a pod already exists, you cannot attach additional networks to it.

The pod must be in the same namespace as the additional network.

Prerequisites

Install the OpenShift CLI (oc).
Log in to the cluster.

Procedure

Add an annotation to the Pod object. Only one of the following annotation formats can be used:
1. To attach an additional network without any customization, add an annotation with the following format. Replace <network> with the name of the additional network to associate with the pod:
```
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: <network>[,<network>,...] 1
```
  1
  To specify more than one additional network, separate each network with a comma. Do not include whitespace between the comma. If you specify the same additional network multiple times, that pod will have multiple network interfaces attached to that network.
2. To attach an additional network with customizations, add an annotation with the following format:
```
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "<network>", 1
          "namespace": "<namespace>", 2
          "default-route": ["<default-route>"] 3
        }
      ]
```
  1
  Specify the name of the additional network defined by a NetworkAttachmentDefinition object.
  2
  Specify the namespace where the NetworkAttachmentDefinition object is defined.
  3
  Optional: Specify an override for the default route, such as 192.168.17.1.
To create the pod, enter the following command. Replace <name> with the name of the pod.
```
$ oc create -f <name>.yaml
```

Optional: To Confirm that the annotation exists in the Pod CR, enter the following command, replacing <name> with the name of the pod.

$ oc get pod <name> -o yaml

In the following example, the example-pod pod is attached to the net1 additional network:

$ oc get pod example-pod -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-bridge
    k8s.v1.cni.cncf.io/network-status: |- 1
      [{
          "name": "ovn-kubernetes",
          "interface": "eth0",
          "ips": [
              "10.128.2.14"
          ],
          "default": true,
          "dns": {}
      },{
          "name": "macvlan-bridge",
          "interface": "net1",
          "ips": [
              "20.2.2.100"
          ],
          "mac": "22:2f:60:a5:f8:00",
          "dns": {}
      }]
  name: example-pod
  namespace: default
spec:
  ...
status:
  ...

1: The k8s.v1.cni.cncf.io/network-status parameter is a JSON array of objects. Each object describes the status of an additional network attached to the pod. The annotation value is stored as a plain text value.

18.3.5.1. Exposing MTU for vfio-pci SR-IOV devices to pod

After adding a pod to an additional network, you can check that the MTU is available for the SR-IOV network.

Procedure

Check that the pod annotation includes MTU by running the following command:

$ oc describe pod example-pod

The following example shows the sample output:

"mac": "20:04:0f:f1:88:01",
       "mtu": 1500,
       "dns": {},
       "device-info": {
         "type": "pci",
         "version": "1.1.0",
         "pci": {
           "pci-address": "0000:86:01.3"
    }
  }

Verify that the MTU is available in /etc/podnetinfo/ inside the pod by running the following command:

$ oc exec example-pod -n sriov-tests -- cat /etc/podnetinfo/annotations | grep mtu

The following example shows the sample output:

k8s.v1.cni.cncf.io/network-status="[{
    \"name\": \"ovn-kubernetes\",
    \"interface\": \"eth0\",
    \"ips\": [
        \"10.131.0.67\"
    ],
    \"mac\": \"0a:58:0a:83:00:43\",
    \"default\": true,
    \"dns\": {}
    },{
    \"name\": \"sriov-tests/sriov-nic-1\",
    \"interface\": \"net1\",
    \"ips\": [
        \"192.168.10.1\"
    ],
    \"mac\": \"20:04:0f:f1:88:01\",
    \"mtu\": 1500,
    \"dns\": {},
    \"device-info\": {
        \"type\": \"pci\",
        \"version\": \"1.1.0\",
        \"pci\": {
            \"pci-address\": \"0000:86:01.3\"
        }
    }
    }]"

18.3.6. Configuring parallel node draining during SR-IOV network policy updates

By default, the SR-IOV Network Operator drains workloads from a node before every policy change. The Operator performs this action, one node at a time, to ensure that no workloads are affected by the reconfiguration.

In large clusters, draining nodes sequentially can be time-consuming, taking hours or even days. In time-sensitive environments, you can enable parallel node draining in an SriovNetworkPoolConfig custom resource (CR) for faster rollouts of SR-IOV network configurations.

To configure parallel draining, use the SriovNetworkPoolConfig CR to create a node pool. You can then add nodes to the pool and define the maximum number of nodes in the pool that the Operator can drain in parallel. With this approach, you can enable parallel draining for faster reconfiguration while ensuring you still have enough nodes remaining in the pool to handle any running workloads.

Note

A node can only belong to one SR-IOV network pool configuration. If a node is not part of a pool, it is added to a virtual, default, pool that is configured to drain one node at a time only.

The node might restart during the draining process.

Prerequisites

Install the OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.
Install the SR-IOV Network Operator.
Nodes have hardware that support SR-IOV.

Procedure

Create a SriovNetworkPoolConfig resource:
1. Create a YAML file that defines the SriovNetworkPoolConfig resource:
  Example sriov-nw-pool.yaml file
```
apiVersion: v1
kind: SriovNetworkPoolConfig
metadata:
  name: pool-1 1
  namespace: openshift-sriov-network-operator 2
spec:
  maxUnavailable: 2 3
  nodeSelector: 4
    matchLabels:
      node-role.kubernetes.io/worker: ""
```
  1
  Specify the name of the SriovNetworkPoolConfig object.
  2
  Specify namespace where the SR-IOV Network Operator is installed.
  3
  Specify an integer number, or percentage value, for nodes that can be unavailable in the pool during an update. For example, if you have 10 nodes and you set the maximum unavailable to 2, then only 2 nodes can be drained in parallel at any time, leaving 8 nodes for handling workloads.
  4
  Specify the nodes to add the pool by using the node selector. This example adds all nodes with the worker role to the pool.
2. Create the SriovNetworkPoolConfig resource by running the following command:
```
$ oc create -f sriov-nw-pool.yaml
```
Create the sriov-test namespace by running the following comand:
```
$ oc create namespace sriov-test
```

Create a SriovNetworkNodePolicy resource:

Create a YAML file that defines the SriovNetworkNodePolicy resource:

Example sriov-node-policy.yaml file

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-nic-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  nicSelector:
    pfNames: ["ens1"]
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  numVfs: 5
  priority: 99
  resourceName: sriov_nic_1

Create the SriovNetworkNodePolicy resource by running the following command:
```
$ oc create -f sriov-node-policy.yaml
```

Create a SriovNetwork resource:

Create a YAML file that defines the SriovNetwork resource:

Example sriov-network.yaml file

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-nic-1
  namespace: openshift-sriov-network-operator
spec:
  linkState: auto
  networkNamespace: sriov-test
  resourceName: sriov_nic_1
  capabilities: '{ "mac": true, "ips": true }'
  ipam: '{ "type": "static" }'

Create the SriovNetwork resource by running the following command:
```
$ oc create -f sriov-network.yaml
```

Verification

View the node pool you created by running the following command:
```
$ oc get sriovNetworkpoolConfig -n openshift-sriov-network-operator
```
Example output
```
NAME     AGE
pool-1   67s 1
```
1
In this example, pool-1 contains all the nodes with the worker role.

To demonstrate the node draining process using the example scenario from the above procedure, complete the following steps:

Update the number of virtual functions in the SriovNetworkNodePolicy resource to trigger workload draining in the cluster:

$ oc patch SriovNetworkNodePolicy sriov-nic-1 -n openshift-sriov-network-operator --type merge -p '{"spec": {"numVfs": 4}}'

Monitor the draining status on the target cluster by running the following command:

$ oc get sriovNetworkNodeState -n openshift-sriov-network-operator

Example output

NAMESPACE                          NAME       SYNC STATUS   DESIRED SYNC STATE   CURRENT SYNC STATE   AGE
openshift-sriov-network-operator   worker-0   InProgress    Drain_Required       DrainComplete        3d10h
openshift-sriov-network-operator   worker-1   InProgress    Drain_Required       DrainComplete        3d10h

When the draining process is complete, the SYNC STATUS changes to Succeeded, and the DESIRED SYNC STATE and CURRENT SYNC STATE values return to IDLE.

Example output

NAMESPACE                          NAME       SYNC STATUS   DESIRED SYNC STATE   CURRENT SYNC STATE   AGE
openshift-sriov-network-operator   worker-0   Succeeded     Idle                 Idle                 3d10h
openshift-sriov-network-operator   worker-1   Succeeded     Idle                 Idle                 3d10h

18.3.7. Excluding the SR-IOV network topology for NUMA-aware scheduling

To exclude advertising the SR-IOV network resource’s Non-Uniform Memory Access (NUMA) node to the Topology Manager, you can configure the excludeTopology specification in the SriovNetworkNodePolicy custom resource. Use this configuration for more flexible SR-IOV network deployments during NUMA-aware pod scheduling.

Prerequisites

You have installed the OpenShift CLI (oc).
You have configured the CPU Manager policy to static. For more information about CPU Manager, see the Additional resources section.
You have configured the Topology Manager policy to single-numa-node.
You have installed the SR-IOV Network Operator.

Procedure

Create the SriovNetworkNodePolicy CR:
1. Save the following YAML in the sriov-network-node-policy.yaml file, replacing values in the YAML to match your environment:
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: <policy_name>
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriovnuma0 1
  nodeSelector:
    kubernetes.io/hostname: <node_name>
  numVfs: <number_of_Vfs>
  nicSelector: 2
    vendor: "<vendor_ID>"
    deviceID: "<device_ID>"
  deviceType: netdevice
  excludeTopology: true 3
```
  1
  The resource name of the SR-IOV network device plugin. This YAML uses a sample resourceName value.
  2
  Identify the device for the Operator to configure by using the NIC selector.
  3
  To exclude advertising the NUMA node for the SR-IOV network resource to the Topology Manager, set the value to true. The default value is false.
  Note
  If multiple SriovNetworkNodePolicy resources target the same SR-IOV network resource, the SriovNetworkNodePolicy resources must have the same value as the excludeTopology specification. Otherwise, the conflicting policy is rejected.
2. Create the SriovNetworkNodePolicy resource by running the following command:
```
$ oc create -f sriov-network-node-policy.yaml
```
  Example output
```
sriovnetworknodepolicy.sriovnetwork.openshift.io/policy-for-numa-0 created
```
Create the SriovNetwork CR:
1. Save the following YAML in the sriov-network.yaml file, replacing values in the YAML to match your environment:
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-numa-0-network 1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: sriovnuma0 2
  networkNamespace: <namespace> 3
  ipam: |- 4
    {
      "type": "<ipam_type>",
    }
```
  1
  Replace sriov-numa-0-network with the name for the SR-IOV network resource.
  2
  Specify the resource name for the SriovNetworkNodePolicy CR from the previous step. This YAML uses a sample resourceName value.
  3
  Enter the namespace for your SR-IOV network resource.
  4
  Enter the IP address management configuration for the SR-IOV network.
2. Create the SriovNetwork resource by running the following command:
```
$ oc create -f sriov-network.yaml
```
  Example output
```
sriovnetwork.sriovnetwork.openshift.io/sriov-numa-0-network created
```

Create a pod and assign the SR-IOV network resource from the previous step:

Save the following YAML in the sriov-network-pod.yaml file, replacing values in the YAML to match your environment:

apiVersion: v1
kind: Pod
metadata:
  name: <pod_name>
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "sriov-numa-0-network", 1
        }
      ]
spec:
  containers:
  - name: <container_name>
    image: <image>
    imagePullPolicy: IfNotPresent
    command: ["sleep", "infinity"]

1: This is the name of the SriovNetwork resource that uses the SriovNetworkNodePolicy resource.

Create the Pod resource by running the following command:
```
$ oc create -f sriov-network-pod.yaml
```
Example output
```
pod/example-pod created
```

Verification

Verify the status of the pod by running the following command, replacing <pod_name> with the name of the pod:

$ oc get pod <pod_name>

Example output

NAME                                     READY   STATUS    RESTARTS   AGE
test-deployment-sriov-76cbbf4756-k9v72   1/1     Running   0          45h

Open a debug session with the target pod to verify that the SR-IOV network resources are deployed to a different node than the memory and CPU resources.
1. Open a debug session with the pod by running the following command, replacing <pod_name> with the target pod name.
```
$ oc debug pod/<pod_name>
```
2. Set /host as the root directory within the debug shell. The debug pod mounts the root file system from the host in /host within the pod. By changing the root directory to /host, you can run binaries from the host file system:
```
$ chroot /host
```
3. View information about the CPU allocation by running the following commands:
```
$ lscpu | grep NUMA
```
  Example output
```
NUMA node(s):                    2
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,...
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,...
```
```
$ cat /proc/self/status | grep Cpus
```
  Example output
```
Cpus_allowed:	aa
Cpus_allowed_list:	1,3,5,7
```
```
$ cat  /sys/class/net/net1/device/numa_node
```
  Example output
```
0
```
  In this example, CPUs 1,3,5, and 7 are allocated to NUMA node1 but the SR-IOV network resource can use the NIC in NUMA node0.

Note

If the excludeTopology specification is set to True, it is possible that the required resources exist in the same NUMA node.

18.3.8. Additional resources

18.4. Configuring an SR-IOV InfiniBand network attachment

You can configure an InfiniBand (IB) network attachment for an Single Root I/O Virtualization (SR-IOV) device in the cluster.

Before you perform any tasks in the following documentation, ensure that you installed the SR-IOV Network Operator.

18.4.1. InfiniBand device configuration object

You can configure an InfiniBand (IB) network device by defining an SriovIBNetwork object.

The following YAML describes an SriovIBNetwork object:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovIBNetwork
metadata:
  name: <name> 1
  namespace: openshift-sriov-network-operator 2
spec:
  resourceName: <sriov_resource_name> 3
  networkNamespace: <target_namespace> 4
  ipam: |- 5
    {}
  linkState: <link_state> 6
  capabilities: <capabilities> 7

1: A name for the object. The SR-IOV Network Operator creates a NetworkAttachmentDefinition object with same name.
2: The namespace where the SR-IOV Operator is installed.
3: The value for the spec.resourceName parameter from the SriovNetworkNodePolicy object that defines the SR-IOV hardware for this additional network.
4: The target namespace for the SriovIBNetwork object. Only pods in the target namespace can attach to the network device.
5: Optional: A configuration object for the IPAM CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
6: Optional: The link state of virtual function (VF). Allowed values are enable, disable and auto.
7: Optional: The capabilities to configure for this network. You can specify '{ "ips": true }' to enable IP address support or '{ "infinibandGUID": true }' to enable IB Global Unique Identifier (GUID) support.

18.4.1.1. Creating a configuration for assignment of dual-stack IP addresses dynamically

Dual-stack IP address assignment can be configured with the ipRanges parameter for:

IPv4 addresses
IPv6 addresses
multiple IP address assignment

Procedure

Set type to whereabouts.

Use ipRanges to allocate IP addresses as shown in the following example:

cniVersion: operator.openshift.io/v1
kind: Network
=metadata:
  name: cluster
spec:
  additionalNetworks:
  - name: whereabouts-shim
    namespace: default
    type: Raw
    rawCNIConfig: |-
      {
       "name": "whereabouts-dual-stack",
       "cniVersion": "0.3.1,
       "type": "bridge",
       "ipam": {
         "type": "whereabouts",
         "ipRanges": [
                  {"range": "192.168.10.0/24"},
                  {"range": "2001:db8::/64"}
              ]
       }
      }

Attach network to a pod. For more information, see "Adding a pod to an additional network".
Verify that all IP addresses are assigned.
Run the following command to ensure the IP addresses are assigned as metadata.
```
$ oc exec -it mypod -- ip a
```

18.4.1.2. Configuration of IP address assignment for a network attachment

For additional networks, IP addresses can be assigned using an IP Address Management (IPAM) CNI plugin, which supports various assignment methods, including Dynamic Host Configuration Protocol (DHCP) and static assignment.

The DHCP IPAM CNI plugin responsible for dynamic assignment of IP addresses operates with two distinct components:

CNI Plugin: Responsible for integrating with the Kubernetes networking stack to request and release IP addresses.
DHCP IPAM CNI Daemon: A listener for DHCP events that coordinates with existing DHCP servers in the environment to handle IP address assignment requests. This daemon is not a DHCP server itself.

For networks requiring type: dhcp in their IPAM configuration, ensure the following:

A DHCP server is available and running in the environment. The DHCP server is external to the cluster and is expected to be part of the customer’s existing network infrastructure.
The DHCP server is appropriately configured to serve IP addresses to the nodes.

In cases where a DHCP server is unavailable in the environment, it is recommended to use the Whereabouts IPAM CNI plugin instead. The Whereabouts CNI provides similar IP address management capabilities without the need for an external DHCP server.

Note

Use the Whereabouts CNI plugin when there is no external DHCP server or where static IP address management is preferred. The Whereabouts plugin includes a reconciler daemon to manage stale IP address allocations.

A DHCP lease must be periodically renewed throughout the container’s lifetime, so a separate daemon, the DHCP IPAM CNI Daemon, is required. To deploy the DHCP IPAM CNI daemon, modify the Cluster Network Operator (CNO) configuration to trigger the deployment of this daemon as part of the additional network setup.

18.4.1.2.1. Static IP address assignment configuration

The following table describes the configuration for static IP address assignment:

Table 18.8. ipam static configuration object
Field	Type	Description
`type`	`string`	The IPAM address type. The value `static` is required.
`addresses`	`array`	An array of objects specifying IP addresses to assign to the virtual interface. Both IPv4 and IPv6 IP addresses are supported.
`routes`	`array`	An array of objects specifying routes to configure inside the pod.
`dns`	`array`	Optional: An array of objects specifying the DNS configuration.

The addresses array requires objects with the following fields:

Table 18.9. ipam.addresses[] array
Field	Type	Description
`address`	`string`	An IP address and network prefix that you specify. For example, if you specify `10.10.21.10/24`, then the additional network is assigned an IP address of `10.10.21.10` and the netmask is `255.255.255.0`.
`gateway`	`string`	The default gateway to route egress network traffic to.

Table 18.10. ipam.routes[] array
Field	Type	Description
`dst`	`string`	The IP address range in CIDR format, such as `192.168.17.0/24` or `0.0.0.0/0` for the default route.
`gw`	`string`	The gateway where network traffic is routed.

Table 18.11. ipam.dns object
Field	Type	Description
`nameservers`	`array`	An array of one or more IP addresses for to send DNS queries to.
`domain`	`array`	The default domain to append to a hostname. For example, if the domain is set to `example.com`, a DNS lookup query for `example-host` is rewritten as `example-host.example.com`.
`search`	`array`	An array of domain names to append to an unqualified hostname, such as `example-host`, during a DNS lookup query.

Static IP address assignment configuration example

{
  "ipam": {
    "type": "static",
      "addresses": [
        {
          "address": "191.168.1.7/24"
        }
      ]
  }
}

18.4.1.2.2. Dynamic IP address (DHCP) assignment configuration

A pod obtains its original DHCP lease when it is created. The lease must be periodically renewed by a minimal DHCP server deployment running on the cluster.

Important

For an Ethernet network attachment, the SR-IOV Network Operator does not create a DHCP server deployment; the Cluster Network Operator is responsible for creating the minimal DHCP server deployment.

To trigger the deployment of the DHCP server, you must create a shim network attachment by editing the Cluster Network Operator configuration, as in the following example:

Example shim network attachment definition

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  additionalNetworks:
  - name: dhcp-shim
    namespace: default
    type: Raw
    rawCNIConfig: |-
      {
        "name": "dhcp-shim",
        "cniVersion": "0.3.1",
        "type": "bridge",
        "ipam": {
          "type": "dhcp"
        }
      }
  # ...

The following table describes the configuration parameters for dynamic IP address address assignment with DHCP.

Table 18.12. ipam DHCP configuration object
Field	Type	Description
`type`	`string`	The IPAM address type. The value `dhcp` is required.

The following JSON example describes the configuration p for dynamic IP address address assignment with DHCP.

Dynamic IP address (DHCP) assignment configuration example

{
  "ipam": {
    "type": "dhcp"
  }
}

18.4.1.2.3. Dynamic IP address assignment configuration with Whereabouts

The Whereabouts CNI plugin allows the dynamic assignment of an IP address to an additional network without the use of a DHCP server.

The Whereabouts CNI plugin also supports overlapping IP address ranges and configuration of the same CIDR range multiple times within separate NetworkAttachmentDefinition CRDs. This provides greater flexibility and management capabilities in multi-tenant environments.

18.4.1.2.3.1. Dynamic IP address configuration objects

The following table describes the configuration objects for dynamic IP address assignment with Whereabouts:

Table 18.13. ipam whereabouts configuration object
Field	Type	Description
`type`	`string`	The IPAM address type. The value `whereabouts` is required.
`range`	`string`	An IP address and range in CIDR notation. IP addresses are assigned from within this range of addresses.
`exclude`	`array`	Optional: A list of zero or more IP addresses and ranges in CIDR notation. IP addresses within an excluded address range are not assigned.
`network_name`	`string`	Optional: Helps ensure that each group or domain of pods gets its own set of IP addresses, even if they share the same range of IP addresses. Setting this field is important for keeping networks separate and organized, notably in multi-tenant environments.

18.4.1.2.3.2. Dynamic IP address assignment configuration that uses Whereabouts

The following example shows a dynamic address assignment configuration that uses Whereabouts:

Whereabouts dynamic IP address assignment

{
  "ipam": {
    "type": "whereabouts",
    "range": "192.0.2.192/27",
    "exclude": [
       "192.0.2.192/30",
       "192.0.2.196/32"
    ]
  }
}

18.4.1.2.3.3. Dynamic IP address assignment that uses Whereabouts with overlapping IP address ranges

The following example shows a dynamic IP address assignment that uses overlapping IP address ranges for multi-tenant networks.

NetworkAttachmentDefinition 1

{
  "ipam": {
    "type": "whereabouts",
    "range": "192.0.2.192/29",
    "network_name": "example_net_common", 1
  }
}

1: Optional. If set, must match the network_name of NetworkAttachmentDefinition 2.

NetworkAttachmentDefinition 2

{
  "ipam": {
    "type": "whereabouts",
    "range": "192.0.2.192/24",
    "network_name": "example_net_common", 1
  }
}

1: Optional. If set, must match the network_name of NetworkAttachmentDefinition 1.

18.4.2. Configuring SR-IOV additional network

You can configure an additional network that uses SR-IOV hardware by creating an SriovIBNetwork object. When you create an SriovIBNetwork object, the SR-IOV Network Operator automatically creates a NetworkAttachmentDefinition object.

Note

Do not modify or delete an SriovIBNetwork object if it is attached to any pods in a running state.

Prerequisites

Install the OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.

Procedure

Create a SriovIBNetwork object, and then save the YAML in the <name>.yaml file, where <name> is a name for this additional network. The object specification might resemble the following example:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovIBNetwork
metadata:
  name: attach1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1
  networkNamespace: project2
  ipam: |-
    {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "rangeStart": "10.56.217.171",
      "rangeEnd": "10.56.217.181",
      "gateway": "10.56.217.1"
    }

To create the object, enter the following command:
```
$ oc create -f <name>.yaml
```
where <name> specifies the name of the additional network.
Optional: To confirm that the NetworkAttachmentDefinition object that is associated with the SriovIBNetwork object that you created in the previous step exists, enter the following command. Replace <namespace> with the networkNamespace you specified in the SriovIBNetwork object.
```
$ oc get net-attach-def -n <namespace>
```

18.4.3. Runtime configuration for an InfiniBand-based SR-IOV attachment

When attaching a pod to an additional network, you can specify a runtime configuration to make specific customizations for the pod. For example, you can request a specific MAC hardware address.

You specify the runtime configuration by setting an annotation in the pod specification. The annotation key is k8s.v1.cni.cncf.io/networks, and it accepts a JSON object that describes the runtime configuration.

The following JSON describes the runtime configuration options for an InfiniBand-based SR-IOV network attachment.

[
  {
    "name": "<network_attachment>", 1
    "infiniband-guid": "<guid>", 2
    "ips": ["<cidr_range>"] 3
  }
]

1: The name of the SR-IOV network attachment definition CR.
2: The InfiniBand GUID for the SR-IOV device. To use this feature, you also must specify { "infinibandGUID": true } in the SriovIBNetwork object.
3: The IP addresses for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. Both IPv4 and IPv6 addresses are supported. To use this feature, you also must specify { "ips": true } in the SriovIBNetwork object.

Example runtime configuration

apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "ib1",
          "infiniband-guid": "c2:11:22:33:44:55:66:77",
          "ips": ["192.168.10.1/24", "2001::1/64"]
        }
      ]
spec:
  containers:
  - name: sample-container
    image: <image>
    imagePullPolicy: IfNotPresent
    command: ["sleep", "infinity"]

18.4.4. Adding a pod to an additional network

You can add a pod to an additional network. The pod continues to send normal cluster-related network traffic over the default network.

When a pod is created additional networks are attached to it. However, if a pod already exists, you cannot attach additional networks to it.

The pod must be in the same namespace as the additional network.

Prerequisites

Install the OpenShift CLI (oc).
Log in to the cluster.

Procedure

Add an annotation to the Pod object. Only one of the following annotation formats can be used:
1. To attach an additional network without any customization, add an annotation with the following format. Replace <network> with the name of the additional network to associate with the pod:
```
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: <network>[,<network>,...] 1
```
  1
  To specify more than one additional network, separate each network with a comma. Do not include whitespace between the comma. If you specify the same additional network multiple times, that pod will have multiple network interfaces attached to that network.
2. To attach an additional network with customizations, add an annotation with the following format:
```
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "<network>", 1
          "namespace": "<namespace>", 2
          "default-route": ["<default-route>"] 3
        }
      ]
```
  1
  Specify the name of the additional network defined by a NetworkAttachmentDefinition object.
  2
  Specify the namespace where the NetworkAttachmentDefinition object is defined.
  3
  Optional: Specify an override for the default route, such as 192.168.17.1.
To create the pod, enter the following command. Replace <name> with the name of the pod.
```
$ oc create -f <name>.yaml
```

Optional: To Confirm that the annotation exists in the Pod CR, enter the following command, replacing <name> with the name of the pod.

$ oc get pod <name> -o yaml

In the following example, the example-pod pod is attached to the net1 additional network:

$ oc get pod example-pod -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-bridge
    k8s.v1.cni.cncf.io/network-status: |- 1
      [{
          "name": "ovn-kubernetes",
          "interface": "eth0",
          "ips": [
              "10.128.2.14"
          ],
          "default": true,
          "dns": {}
      },{
          "name": "macvlan-bridge",
          "interface": "net1",
          "ips": [
              "20.2.2.100"
          ],
          "mac": "22:2f:60:a5:f8:00",
          "dns": {}
      }]
  name: example-pod
  namespace: default
spec:
  ...
status:
  ...

1: The k8s.v1.cni.cncf.io/network-status parameter is a JSON array of objects. Each object describes the status of an additional network attached to the pod. The annotation value is stored as a plain text value.

18.4.4.1. Exposing MTU for vfio-pci SR-IOV devices to pod

After adding a pod to an additional network, you can check that the MTU is available for the SR-IOV network.

Procedure

Check that the pod annotation includes MTU by running the following command:

$ oc describe pod example-pod

The following example shows the sample output:

"mac": "20:04:0f:f1:88:01",
       "mtu": 1500,
       "dns": {},
       "device-info": {
         "type": "pci",
         "version": "1.1.0",
         "pci": {
           "pci-address": "0000:86:01.3"
    }
  }

Verify that the MTU is available in /etc/podnetinfo/ inside the pod by running the following command:

$ oc exec example-pod -n sriov-tests -- cat /etc/podnetinfo/annotations | grep mtu

The following example shows the sample output:

k8s.v1.cni.cncf.io/network-status="[{
    \"name\": \"ovn-kubernetes\",
    \"interface\": \"eth0\",
    \"ips\": [
        \"10.131.0.67\"
    ],
    \"mac\": \"0a:58:0a:83:00:43\",
    \"default\": true,
    \"dns\": {}
    },{
    \"name\": \"sriov-tests/sriov-nic-1\",
    \"interface\": \"net1\",
    \"ips\": [
        \"192.168.10.1\"
    ],
    \"mac\": \"20:04:0f:f1:88:01\",
    \"mtu\": 1500,
    \"dns\": {},
    \"device-info\": {
        \"type\": \"pci\",
        \"version\": \"1.1.0\",
        \"pci\": {
            \"pci-address\": \"0000:86:01.3\"
        }
    }
    }]"

18.4.5. Additional resources

18.5. Configuring interface-level network sysctl settings and all-multicast mode for SR-IOV networks

As a cluster administrator, you can change interface-level network sysctls and several interface attributes such as promiscuous mode, all-multicast mode, MTU, and MAC address by using the tuning Container Network Interface (CNI) meta plugin for a pod connected to a SR-IOV network device.

Before you perform any tasks in the following documentation, ensure that you installed the SR-IOV Network Operator.

18.5.1. Labeling nodes with an SR-IOV enabled NIC

If you want to enable SR-IOV on only SR-IOV capable nodes there are a couple of ways to do this:

Install the Node Feature Discovery (NFD) Operator. NFD detects the presence of SR-IOV enabled NICs and labels the nodes with node.alpha.kubernetes-incubator.io/nfd-network-sriov.capable = true.
Examine the SriovNetworkNodeState CR for each node. The interfaces stanza includes a list of all of the SR-IOV devices discovered by the SR-IOV Network Operator on the worker node. Label each node with feature.node.kubernetes.io/network-sriov.capable: "true" by using the following command:
```
$ oc label node <node_name> feature.node.kubernetes.io/network-sriov.capable="true"
```
Note
You can label the nodes with whatever name you want.

18.5.2. Setting one sysctl flag

You can set interface-level network sysctl settings for a pod connected to a SR-IOV network device.

In this example, net.ipv4.conf.IFNAME.accept_redirects is set to 1 on the created virtual interfaces.

The sysctl-tuning-test is a namespace used in this example.

Use the following command to create the sysctl-tuning-test namespace:
```
$ oc create namespace sysctl-tuning-test
```

18.5.2.1. Setting one sysctl flag on nodes with SR-IOV network devices

The SR-IOV Network Operator adds the SriovNetworkNodePolicy.sriovnetwork.openshift.io custom resource definition (CRD) to OpenShift Container Platform. You can configure an SR-IOV network device by creating a SriovNetworkNodePolicy custom resource (CR).

Note

When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator might drain and reboot the nodes.

It can take several minutes for a configuration change to apply.

Follow this procedure to create a SriovNetworkNodePolicy custom resource (CR).

Procedure

Create an SriovNetworkNodePolicy custom resource (CR). For example, save the following YAML as the file policyoneflag-sriov-node-network.yaml:
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policyoneflag 1
  namespace: openshift-sriov-network-operator 2
spec:
  resourceName: policyoneflag 3
  nodeSelector: 4
    feature.node.kubernetes.io/network-sriov.capable="true"
  priority: 10 5
  numVfs: 5 6
  nicSelector: 7
    pfNames: ["ens5"] 8
  deviceType: "netdevice" 9
  isRdma: false 10
```
1
The name for the custom resource object.
2
The namespace where the SR-IOV Network Operator is installed.
3
The resource name of the SR-IOV network device plugin. You can create multiple SR-IOV network node policies for a resource name.
4
The node selector specifies the nodes to configure. Only SR-IOV network devices on the selected nodes are configured. The SR-IOV Container Network Interface (CNI) plugin and device plugin are deployed on selected nodes only.
5
Optional: The priority is an integer value between 0 and 99. A smaller value receives higher priority. For example, a priority of 10 is a higher priority than 99. The default value is 99.
6
The number of the virtual functions (VFs) to create for the SR-IOV physical network device. For an Intel network interface controller (NIC), the number of VFs cannot be larger than the total VFs supported by the device. For a Mellanox NIC, the number of VFs cannot be larger than 127.
7
The NIC selector identifies the device for the Operator to configure. You do not have to specify values for all the parameters. It is recommended to identify the network device with enough precision to avoid selecting a device unintentionally. If you specify rootDevices, you must also specify a value for vendor, deviceID, or pfNames. If you specify both pfNames and rootDevices at the same time, ensure that they refer to the same device. If you specify a value for netFilter, then you do not need to specify any other parameter because a network ID is unique.
8
Optional: An array of one or more physical function (PF) names for the device.
9
Optional: The driver type for the virtual functions. The only allowed value is netdevice. For a Mellanox NIC to work in DPDK mode on bare metal nodes, set isRdma to true.
10
Optional: Configures whether to enable remote direct memory access (RDMA) mode. The default value is false. If the isRdma parameter is set to true, you can continue to use the RDMA-enabled VF as a normal network device. A device can be used in either mode. Set isRdma to true and additionally set needVhostNet to true to configure a Mellanox NIC for use with Fast Datapath DPDK applications.
Note
The vfio-pci driver type is not supported.
Create the SriovNetworkNodePolicy object:
```
$ oc create -f policyoneflag-sriov-node-network.yaml
```
After applying the configuration update, all the pods in sriov-network-operator namespace change to the Running status.
To verify that the SR-IOV network device is configured, enter the following command. Replace <node_name> with the name of a node with the SR-IOV network device that you just configured.
```
$ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name> -o jsonpath='{.status.syncStatus}'
```
Example output
```
Succeeded
```

18.5.2.2. Configuring sysctl on a SR-IOV network

You can set interface specific sysctl settings on virtual interfaces created by SR-IOV by adding the tuning configuration to the optional metaPlugins parameter of the SriovNetwork resource.

The SR-IOV Network Operator manages additional network definitions. When you specify an additional SR-IOV network to create, the SR-IOV Network Operator creates the NetworkAttachmentDefinition custom resource (CR) automatically.

Note

Do not edit NetworkAttachmentDefinition custom resources that the SR-IOV Network Operator manages. Doing so might disrupt network traffic on your additional network.

To change the interface-level network net.ipv4.conf.IFNAME.accept_redirects sysctl settings, create an additional SR-IOV network with the Container Network Interface (CNI) tuning plugin.

Prerequisites

Install the OpenShift Container Platform CLI (oc).
Log in to the OpenShift Container Platform cluster as a user with cluster-admin privileges.

Procedure

Create the SriovNetwork custom resource (CR) for the additional SR-IOV network attachment and insert the metaPlugins configuration, as in the following example CR. Save the YAML as the file sriov-network-interface-sysctl.yaml.
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: onevalidflag 1
  namespace: openshift-sriov-network-operator 2
spec:
  resourceName: policyoneflag 3
  networkNamespace: sysctl-tuning-test 4
  ipam: '{ "type": "static" }' 5
  capabilities: '{ "mac": true, "ips": true }' 6
  metaPlugins : | 7
    {
      "type": "tuning",
      "capabilities":{
        "mac":true
      },
      "sysctl":{
         "net.ipv4.conf.IFNAME.accept_redirects": "1"
      }
    }
```
1
A name for the object. The SR-IOV Network Operator creates a NetworkAttachmentDefinition object with same name.
2
The namespace where the SR-IOV Network Operator is installed.
3
The value for the spec.resourceName parameter from the SriovNetworkNodePolicy object that defines the SR-IOV hardware for this additional network.
4
The target namespace for the SriovNetwork object. Only pods in the target namespace can attach to the additional network.
5
A configuration object for the IPAM CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
6
Optional: Set capabilities for the additional network. You can specify "{ "ips": true }" to enable IP address support or "{ "mac": true }" to enable MAC address support.
7
Optional: The metaPlugins parameter is used to add additional capabilities to the device. In this use case set the type field to tuning. Specify the interface-level network sysctl you want to set in the sysctl field.

Create the SriovNetwork resource:

$ oc create -f sriov-network-interface-sysctl.yaml

Verifying that the NetworkAttachmentDefinition CR is successfully created

Confirm that the SR-IOV Network Operator created the NetworkAttachmentDefinition CR by running the following command:
```
$ oc get network-attachment-definitions -n <namespace> 1
```
1
Replace <namespace> with the value for networkNamespace that you specified in the SriovNetwork object. For example, sysctl-tuning-test.
Example output
```
NAME                                  AGE
onevalidflag                          14m
```
Note
There might be a delay before the SR-IOV Network Operator creates the CR.

Verifying that the additional SR-IOV network attachment is successful

To verify that the tuning CNI is correctly configured and the additional SR-IOV network attachment is attached, do the following:

Create a Pod CR. Save the following YAML as the file examplepod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: tunepod
  namespace: sysctl-tuning-test
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "onevalidflag",  1
          "mac": "0a:56:0a:83:04:0c", 2
          "ips": ["10.100.100.200/24"] 3
       }
      ]
spec:
  containers:
  - name: podexample
    image: centos
    command: ["/bin/bash", "-c", "sleep INF"]
    securityContext:
      runAsUser: 2000
      runAsGroup: 3000
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

1: The name of the SR-IOV network attachment definition CR.
2: Optional: The MAC address for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. To use this feature, you also must specify { "mac": true } in the SriovNetwork object.
3: Optional: IP addresses for the SR-IOV device that are allocated from the resource type defined in the SR-IOV network attachment definition CR. Both IPv4 and IPv6 addresses are supported. To use this feature, you also must specify { "ips": true } in the SriovNetwork object.

Create the Pod CR:
```
$ oc apply -f examplepod.yaml
```

Verify that the pod is created by running the following command:

$ oc get pod -n sysctl-tuning-test

Example output

NAME      READY   STATUS    RESTARTS   AGE
tunepod   1/1     Running   0          47s

Log in to the pod by running the following command:
```
$ oc rsh -n sysctl-tuning-test tunepod
```
Verify the values of the configured sysctl flag. Find the value net.ipv4.conf.IFNAME.accept_redirects by running the following command::
```
$ sysctl net.ipv4.conf.net1.accept_redirects
```
Example output
```
net.ipv4.conf.net1.accept_redirects = 1
```

18.5.3. Configuring sysctl settings for pods associated with bonded SR-IOV interface flag

You can set interface-level network sysctl settings for a pod connected to a bonded SR-IOV network device.

In this example, the specific network interface-level sysctl settings that can be configured are set on the bonded interface.

The sysctl-tuning-test is a namespace used in this example.

Use the following command to create the sysctl-tuning-test namespace:
```
$ oc create namespace sysctl-tuning-test
```

18.5.3.1. Setting all sysctl flag on nodes with bonded SR-IOV network devices

The SR-IOV Network Operator adds the SriovNetworkNodePolicy.sriovnetwork.openshift.io custom resource definition (CRD) to OpenShift Container Platform. You can configure an SR-IOV network device by creating a SriovNetworkNodePolicy custom resource (CR).

Note

When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator might drain the nodes, and in some cases, reboot nodes.

It might take several minutes for a configuration change to apply.

Follow this procedure to create a SriovNetworkNodePolicy custom resource (CR).

Procedure

Create an SriovNetworkNodePolicy custom resource (CR). Save the following YAML as the file policyallflags-sriov-node-network.yaml. Replace policyallflags with the name for the configuration.
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policyallflags 1
  namespace: openshift-sriov-network-operator 2
spec:
  resourceName: policyallflags 3
  nodeSelector: 4
    node.alpha.kubernetes-incubator.io/nfd-network-sriov.capable = `true`
  priority: 10 5
  numVfs: 5 6
  nicSelector: 7
    pfNames: ["ens1f0"]  8
  deviceType: "netdevice" 9
  isRdma: false 10
```
1
The name for the custom resource object.
2
The namespace where the SR-IOV Network Operator is installed.
3
The resource name of the SR-IOV network device plugin. You can create multiple SR-IOV network node policies for a resource name.
4
The node selector specifies the nodes to configure. Only SR-IOV network devices on the selected nodes are configured. The SR-IOV Container Network Interface (CNI) plugin and device plugin are deployed on selected nodes only.
5
Optional: The priority is an integer value between 0 and 99. A smaller value receives higher priority. For example, a priority of 10 is a higher priority than 99. The default value is 99.
6
The number of virtual functions (VFs) to create for the SR-IOV physical network device. For an Intel network interface controller (NIC), the number of VFs cannot be larger than the total VFs supported by the device. For a Mellanox NIC, the number of VFs cannot be larger than 127.
7
The NIC selector identifies the device for the Operator to configure. You do not have to specify values for all the parameters. It is recommended to identify the network device with enough precision to avoid selecting a device unintentionally. If you specify rootDevices, you must also specify a value for vendor, deviceID, or pfNames. If you specify both pfNames and rootDevices at the same time, ensure that they refer to the same device. If you specify a value for netFilter, then you do not need to specify any other parameter because a network ID is unique.
8
Optional: An array of one or more physical function (PF) names for the device.
9
Optional: The driver type for the virtual functions. The only allowed value is netdevice. For a Mellanox NIC to work in DPDK mode on bare metal nodes, set isRdma to true.
10
Optional: Configures whether to enable remote direct memory access (RDMA) mode. The default value is false. If the isRdma parameter is set to true, you can continue to use the RDMA-enabled VF as a normal network device. A device can be used in either mode. Set isRdma to true and additionally set needVhostNet to true to configure a Mellanox NIC for use with Fast Datapath DPDK applications.
Note
The vfio-pci driver type is not supported.
Create the SriovNetworkNodePolicy object:
```
$ oc create -f policyallflags-sriov-node-network.yaml
```
After applying the configuration update, all the pods in sriov-network-operator namespace change to the Running status.
To verify that the SR-IOV network device is configured, enter the following command. Replace <node_name> with the name of a node with the SR-IOV network device that you just configured.
```
$ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name> -o jsonpath='{.status.syncStatus}'
```
Example output
```
Succeeded
```

18.5.3.2. Configuring sysctl on a bonded SR-IOV network

You can set interface specific sysctl settings on a bonded interface created from two SR-IOV interfaces. Do this by adding the tuning configuration to the optional Plugins parameter of the bond network attachment definition.

Note

Do not edit NetworkAttachmentDefinition custom resources that the SR-IOV Network Operator manages. Doing so might disrupt network traffic on your additional network.

To change specific interface-level network sysctl settings create the SriovNetwork custom resource (CR) with the Container Network Interface (CNI) tuning plugin by using the following procedure.

Prerequisites

Install the OpenShift Container Platform CLI (oc).
Log in to the OpenShift Container Platform cluster as a user with cluster-admin privileges.

Procedure

Create the SriovNetwork custom resource (CR) for the bonded interface as in the following example CR. Save the YAML as the file sriov-network-attachment.yaml.
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: allvalidflags 1
  namespace: openshift-sriov-network-operator 2
spec:
  resourceName: policyallflags 3
  networkNamespace: sysctl-tuning-test 4
  capabilities: '{ "mac": true, "ips": true }' 5
```
1
A name for the object. The SR-IOV Network Operator creates a NetworkAttachmentDefinition object with same name.
2
The namespace where the SR-IOV Network Operator is installed.
3
The value for the spec.resourceName parameter from the SriovNetworkNodePolicy object that defines the SR-IOV hardware for this additional network.
4
The target namespace for the SriovNetwork object. Only pods in the target namespace can attach to the additional network.
5
Optional: The capabilities to configure for this additional network. You can specify "{ "ips": true }" to enable IP address support or "{ "mac": true }" to enable MAC address support.

Create the SriovNetwork resource:

$ oc create -f sriov-network-attachment.yaml

Create a bond network attachment definition as in the following example CR. Save the YAML as the file sriov-bond-network-interface.yaml.
```
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: bond-sysctl-network
  namespace: sysctl-tuning-test
spec:
  config: '{
  "cniVersion":"0.4.0",
  "name":"bound-net",
  "plugins":[
    {
      "type":"bond", 1
      "mode": "active-backup", 2
      "failOverMac": 1, 3
      "linksInContainer": true, 4
      "miimon": "100",
      "links": [ 5
        {"name": "net1"},
        {"name": "net2"}
      ],
      "ipam":{ 6
        "type":"static"
      }
    },
    {
      "type":"tuning", 7
      "capabilities":{
        "mac":true
      },
      "sysctl":{
        "net.ipv4.conf.IFNAME.accept_redirects": "0",
        "net.ipv4.conf.IFNAME.accept_source_route": "0",
        "net.ipv4.conf.IFNAME.disable_policy": "1",
        "net.ipv4.conf.IFNAME.secure_redirects": "0",
        "net.ipv4.conf.IFNAME.send_redirects": "0",
        "net.ipv6.conf.IFNAME.accept_redirects": "0",
        "net.ipv6.conf.IFNAME.accept_source_route": "1",
        "net.ipv6.neigh.IFNAME.base_reachable_time_ms": "20000",
        "net.ipv6.neigh.IFNAME.retrans_time_ms": "2000"
      }
    }
  ]
}'
```
1
The type is bond.
2
The mode attribute specifies the bonding mode. The bonding modes supported are:
balance-rr - 0
active-backup - 1
balance-xor - 2
For balance-rr or balance-xor modes, you must set the trust mode to on for the SR-IOV virtual function.
3
The failover attribute is mandatory for active-backup mode.
4
The linksInContainer=true flag informs the Bond CNI that the required interfaces are to be found inside the container. By default, Bond CNI looks for these interfaces on the host which does not work for integration with SRIOV and Multus.
5
The links section defines which interfaces will be used to create the bond. By default, Multus names the attached interfaces as: "net", plus a consecutive number, starting with one.
6
A configuration object for the IPAM CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition. In this pod example IP addresses are configured manually, so in this case,ipam is set to static.
7
Add additional capabilities to the device. For example, set the type field to tuning. Specify the interface-level network sysctl you want to set in the sysctl field. This example sets all interface-level network sysctl settings that can be set.

Create the bond network attachment resource:

$ oc create -f sriov-bond-network-interface.yaml

Verifying that the NetworkAttachmentDefinition CR is successfully created

Confirm that the SR-IOV Network Operator created the NetworkAttachmentDefinition CR by running the following command:
```
$ oc get network-attachment-definitions -n <namespace> 1
```
1
Replace <namespace> with the networkNamespace that you specified when configuring the network attachment, for example, sysctl-tuning-test.
Example output
```
NAME                          AGE
bond-sysctl-network           22m
allvalidflags                 47m
```
Note
There might be a delay before the SR-IOV Network Operator creates the CR.

Verifying that the additional SR-IOV network resource is successful

To verify that the tuning CNI is correctly configured and the additional SR-IOV network attachment is attached, do the following:

Create a Pod CR. For example, save the following YAML as the file examplepod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: tunepod
  namespace: sysctl-tuning-test
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {"name": "allvalidflags"}, 1
        {"name": "allvalidflags"},
        {
          "name": "bond-sysctl-network",
          "interface": "bond0",
          "mac": "0a:56:0a:83:04:0c", 2
          "ips": ["10.100.100.200/24"] 3
       }
      ]
spec:
  containers:
  - name: podexample
    image: centos
    command: ["/bin/bash", "-c", "sleep INF"]
    securityContext:
      runAsUser: 2000
      runAsGroup: 3000
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

1: The name of the SR-IOV network attachment definition CR.
2: Optional: The MAC address for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. To use this feature, you also must specify { "mac": true } in the SriovNetwork object.
3: Optional: IP addresses for the SR-IOV device that are allocated from the resource type defined in the SR-IOV network attachment definition CR. Both IPv4 and IPv6 addresses are supported. To use this feature, you also must specify { "ips": true } in the SriovNetwork object.

Apply the YAML:
```
$ oc apply -f examplepod.yaml
```

Verify that the pod is created by running the following command:

$ oc get pod -n sysctl-tuning-test

Example output

NAME      READY   STATUS    RESTARTS   AGE
tunepod   1/1     Running   0          47s

Log in to the pod by running the following command:
```
$ oc rsh -n sysctl-tuning-test tunepod
```
Verify the values of the configured sysctl flag. Find the value net.ipv6.neigh.IFNAME.base_reachable_time_ms by running the following command::
```
$ sysctl net.ipv6.neigh.bond0.base_reachable_time_ms
```
Example output
```
net.ipv6.neigh.bond0.base_reachable_time_ms = 20000
```

18.5.4. About all-multicast mode

Enabling all-multicast mode, particularly in the context of rootless applications, is critical. If you do not enable this mode, you would be required to grant the NET_ADMIN capability to the pod’s Security Context Constraints (SCC). If you were to allow the NET_ADMIN capability to grant the pod privileges to make changes that extend beyond its specific requirements, you could potentially expose security vulnerabilities.

The tuning CNI plugin supports changing several interface attributes, including all-multicast mode. By enabling this mode, you can allow applications running on Virtual Functions (VFs) that are configured on a SR-IOV network device to receive multicast traffic from applications on other VFs, whether attached to the same or different physical functions.

18.5.4.1. Enabling the all-multicast mode on an SR-IOV network

You can enable the all-multicast mode on an SR-IOV interface by:

Adding the tuning configuration to the metaPlugins parameter of the SriovNetwork resource
Setting the allmulti field to true in the tuning configuration
Note
Ensure that you create the virtual function (VF) with trust enabled.

The SR-IOV Network Operator manages additional network definitions. When you specify an additional SR-IOV network to create, the SR-IOV Network Operator creates the NetworkAttachmentDefinition custom resource (CR) automatically.

Note

Do not edit NetworkAttachmentDefinition custom resources that the SR-IOV Network Operator manages. Doing so might disrupt network traffic on your additional network.

Enable the all-multicast mode on a SR-IOV network by following this guidance.

Prerequisites

You have installed the OpenShift Container Platform CLI (oc).
You are logged in to the OpenShift Container Platform cluster as a user with cluster-admin privileges.
You have installed the SR-IOV Network Operator.
You have configured an appropriate SriovNetworkNodePolicy object.

Procedure

Create a YAML file with the following settings that defines a SriovNetworkNodePolicy object for a Mellanox ConnectX-5 device. Save the YAML file as sriovnetpolicy-mlx.yaml.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriovnetpolicy-mlx
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  nicSelector:
    deviceID: "1017"
    pfNames:
      - ens8f0np0#0-9
    rootDevices:
      - 0000:d8:00.0
    vendor: "15b3"
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 10
  priority: 99
  resourceName: resourcemlx

Optional: If the SR-IOV capable cluster nodes are not already labeled, add the SriovNetworkNodePolicy.Spec.NodeSelector label. For more information about labeling nodes, see "Understanding how to update labels on nodes".
Create the SriovNetworkNodePolicy object by running the following command:
```
$ oc create -f sriovnetpolicy-mlx.yaml
```
After applying the configuration update, all the pods in the sriov-network-operator namespace automatically move to a Running status.
Create the enable-allmulti-test namespace by running the following command:
```
$ oc create namespace enable-allmulti-test
```
Create the SriovNetwork custom resource (CR) for the additional SR-IOV network attachment and insert the metaPlugins configuration, as in the following example CR YAML, and save the file as sriov-enable-all-multicast.yaml.
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: enableallmulti 1
  namespace: openshift-sriov-network-operator 2
spec:
  resourceName: enableallmulti 3
  networkNamespace: enable-allmulti-test 4
  ipam: '{ "type": "static" }' 5
  capabilities: '{ "mac": true, "ips": true }' 6
  trust: "on" 7
  metaPlugins : | 8
    {
      "type": "tuning",
      "capabilities":{
        "mac":true
      },
      "allmulti": true
      }
    }
```
1
Specify a name for the object. The SR-IOV Network Operator creates a NetworkAttachmentDefinition object with the same name.
2
Specify the namespace where the SR-IOV Network Operator is installed.
3
Specify a value for the spec.resourceName parameter from the SriovNetworkNodePolicy object that defines the SR-IOV hardware for this additional network.
4
Specify the target namespace for the SriovNetwork object. Only pods in the target namespace can attach to the additional network.
5
Specify a configuration object for the IPAM CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
6
Optional: Set capabilities for the additional network. You can specify "{ "ips": true }" to enable IP address support or "{ "mac": true }" to enable MAC address support.
7
Specify the trust mode of the virtual function. This must be set to "on".
8
Add more capabilities to the device by using the metaPlugins parameter. In this use case, set the type field to tuning, and add the allmulti field and set it to true.
Create the SriovNetwork resource by running the following command:
```
$ oc create -f sriov-enable-all-multicast.yaml
```

Verification of the NetworkAttachmentDefinition CR

Confirm that the SR-IOV Network Operator created the NetworkAttachmentDefinition CR by running the following command:
```
$ oc get network-attachment-definitions -n <namespace> 1
```
1
Replace <namespace> with the value for networkNamespace that you specified in the SriovNetwork object. For this example, that is enable-allmulti-test.
Example output
```
NAME                                  AGE
enableallmulti                        14m
```
Note
There might be a delay before the SR-IOV Network Operator creates the CR.
1. Display information about the SR-IOV network resources by running the following command:
```
$ oc get sriovnetwork -n openshift-sriov-network-operator
```

Verification of the additional SR-IOV network attachment

To verify that the tuning CNI is correctly configured and that the additional SR-IOV network attachment is attached, follow these steps:

Create a Pod CR. Save the following sample YAML in a file named examplepod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: samplepod
  namespace: enable-allmulti-test
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "enableallmulti",  1
          "mac": "0a:56:0a:83:04:0c", 2
          "ips": ["10.100.100.200/24"] 3
       }
      ]
spec:
  containers:
  - name: podexample
    image: centos
    command: ["/bin/bash", "-c", "sleep INF"]
    securityContext:
      runAsUser: 2000
      runAsGroup: 3000
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault

1: Specify the name of the SR-IOV network attachment definition CR.
2: Optional: Specify the MAC address for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. To use this feature, you also must specify {"mac": true} in the SriovNetwork object.
3: Optional: Specify the IP addresses for the SR-IOV device that are allocated from the resource type defined in the SR-IOV network attachment definition CR. Both IPv4 and IPv6 addresses are supported. To use this feature, you also must specify { "ips": true } in the SriovNetwork object.

Create the Pod CR by running the following command:
```
$ oc apply -f examplepod.yaml
```

Verify that the pod is created by running the following command:

$ oc get pod -n enable-allmulti-test

Example output

NAME       READY   STATUS    RESTARTS   AGE
samplepod  1/1     Running   0          47s

Log in to the pod by running the following command:
```
$ oc rsh -n enable-allmulti-test samplepod
```

List all the interfaces associated with the pod by running the following command:

sh-4.4# ip link

Example output

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0@if22: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8901 qdisc noqueue state UP mode DEFAULT group default
    link/ether 0a:58:0a:83:00:10 brd ff:ff:ff:ff:ff:ff link-netnsid 0 1
3: net1@if24: <BROADCAST,MULTICAST,ALLMULTI,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether ee:9b:66:a4:ec:1d brd ff:ff:ff:ff:ff:ff link-netnsid 0 2

1: eth0@if22 is the primary interface
2: net1@if24 is the secondary interface configured with the network-attachment-definition that supports the all-multicast mode (ALLMULTI flag)

18.6. Configuring QinQ support for SR-IOV enabled workloads

QinQ, formally known as 802.1Q-in-802.1Q, is a networking technique defined by IEEE 802.1ad. IEEE 802.1ad extends the IEEE 802.1Q-1998 standard and enriches VLAN capabilities by introducing an additional 802.1Q tag to packets already tagged with 802.1Q. This method is also referred to as VLAN stacking or double VLAN.

Before you perform any tasks in the following documentation, ensure that you installed the SR-IOV Network Operator.

18.6.1. About 802.1Q-in-802.1Q support

In traditional VLAN setups, frames typically contain a single VLAN tag, such as VLAN-100, as well as other metadata such as Quality of Service (QoS) bits and protocol information. QinQ introduces a second VLAN tag, where the service provider designates the outer tag for their use, offering them flexibility, while the inner tag remains dedicated to the customer’s VLAN.

QinQ facilitates the creation of nested VLANs by using double VLAN tagging, enabling finer segmentation and isolation of traffic within a network environment. This approach is particularly valuable in service provider networks where you need to deliver VLAN-based services to multiple customers over a common infrastructure, while ensuring separation and isolation of traffic.

The following diagram illustrates how OpenShift Container Platform can use SR-IOV and QinQ to achieve advanced network segmentation and isolation for containerized workloads.

The diagram shows how double VLAN tagging (QinQ) works in a worker node with SR-IOV support. The SR-IOV virtual function (VF) located in the pod namespace, ext0 is configured by the SR-IOV Container Network Interface (CNI) with a VLAN ID and VLAN protocol. This corresponds to the S-tag. Inside the pod, the VLAN CNI creates a subinterface using the primary interface ext0. This subinterface adds an internal VLAN ID using the 802.1Q protocol, which corresponds to the C-tag.

This demonstrates how QinQ enables finer traffic segmentation and isolation within the network. The Ethernet frame structure is detailed on the right, highlighting the inclusion of both VLAN tags, EtherType, IP, TCP, and Payload sections. QinQ facilitates the delivery of VLAN-based services to multiple customers over a shared infrastructure while ensuring traffic separation and isolation.

Diagram showing QinQ (double VLAN tagging)

The OpenShift Container Platform SR-IOV solution already supports setting the VLAN protocol on the SriovNetwork custom resource (CR). The virtual function (VF) can use this protocol to set the VLAN tag, also known as the outer tag. Pods can then use the VLAN CNI plugin to configure the inner tag.

Table 18.14. Supported network interface cards
NIC	802.1ad/802.1Q	802.1Q/802.1Q
Intel X710	No	Supported
Intel E810	Supported	Supported
Mellanox	No	Supported

Additional resources

Configuration for an VLAN additional network

18.6.2. Configuring QinQ support for SR-IOV enabled workloads

Prerequisites

You have installed the OpenShift CLI (oc).
You have access to the cluster as a user with the cluster-admin role.
You have installed the SR-IOV Network Operator.

Procedure

Create a file named sriovnetpolicy-810-sriov-node-network.yaml by using the following content:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriovnetpolicy-810
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  nicSelector:
    pfNames:
      - ens5f0#0-9
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 10
  priority: 99
  resourceName: resource810

Create the SriovNetworkNodePolicy object by running the following command:
```
$ oc create -f sriovnetpolicy-810-sriov-node-network.yaml
```
Open a separate terminal window and monitor the synchronization status of the SR-IOV network node state for the node specified in the openshift-sriov-network-operator namespace by running the following command:
```
$ watch -n 1 'oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name> -o jsonpath="{.status.syncStatus}"'
```
The synchronization status indicates a change from InProgress to Succeeded.
Create a SriovNetwork object, and set the outer VLAN called the S-tag, or Service Tag, as it belongs to the infrastructure.
Important
You must configure the VLAN on the trunk interface of the switch. In addition, you might need to further configure some switches to support QinQ tagging.
1. Create a file named nad-sriovnetwork-1ad-810.yaml by using the following content:
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriovnetwork-1ad-810
  namespace: openshift-sriov-network-operator
spec:
  ipam: '{}'
  vlan: 171 1
  vlanProto: "802.1ad" 2
  networkNamespace: default
  resourceName: resource810
```
  1
  Sets the S-tag VLAN tag to 171.
  2
  Specifies the VLAN protocol to assign to the virtual function (VF). Supported values are 802.1ad and 802.1q. The default value is 802.1q.
2. Create the object by running the following command:
```
$ oc create -f nad-sriovnetwork-1ad-810.yaml
```
Create a NetworkAttachmentDefinition object with an inner VLAN. The inner VLAN is often referred to as the C-tag, or Customer Tag, as it belongs to the Network Function:
1. Create a file named nad-cvlan100.yaml by using the following content:
```
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: nad-cvlan100
  namespace: default
spec:
  config: '{
    "name": "vlan-100",
    "cniVersion": "0.3.1",
    "type": "vlan",
    "linkInContainer": true,
    "master": "net1", 1
    "vlanId": 100,
    "ipam": {"type": "static"}
  }'
```
  1
  Specifies the VF interface inside the pod. The default name is net1 as the name is not set in the pod annotation.
2. Apply the YAML file by running the following command:
```
$ oc apply -f nad-cvlan100.yaml
```

Verification

Verify QinQ is active on the node by following this procedure:

Create a file named test-qinq-pod.yaml by using the following content:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: sriovnetwork-1ad-810, nad-cvlan100
spec:
  containers:
    - name: test-container
      image: quay.io/ocp-edge-qe/cnf-gotests-client:v4.10
      imagePullPolicy: Always
      securityContext:
        privileged: true

Create the test pod by running the following command:
```
$ oc create -f test-qinq-pod.yaml
```

Enter into a debug session on the target node where the pod is present and display information about the network interface ens5f0 by running the following command:

$ oc debug node/my-cluster-node -- bash -c "ip link show ens5f0"

Example output

6: ens5f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether b4:96:91:a5:22:10 brd ff:ff:ff:ff:ff:ff
vf 0 link/ether a2:81:ba:d0:6f:f3 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 1 link/ether 8a:bb:0a:36:f2:ed brd ff:ff:ff:ff:ff:ff, vlan 171, vlan protocol 802.1ad, spoof checking on, link-state auto, trust off
vf 2 link/ether ca:0e:e1:5b:0c:d2 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 3 link/ether ee:6c:e2:f5:2c:70 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 4 link/ether 0a:d6:b7:66:5e:e8 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 5 link/ether da:d5:e7:14:4f:aa brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 6 link/ether d6:8e:85:75:12:5c brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 7 link/ether d6:eb:ce:9c:ea:78 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 8 link/ether 5e:c5:cc:05:93:3c brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust on
vf 9 link/ether a6:5a:7c:1c:2a:16 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off

The vlan protocol 802.1ad ID in the output indicates that the interface supports VLAN tagging with protocol 802.1ad (QinQ). The VLAN ID is 171.

18.7. Using high performance multicast

You can use multicast on your Single Root I/O Virtualization (SR-IOV) hardware network.

Before you perform any tasks in the following documentation, ensure that you installed the SR-IOV Network Operator.

18.7.1. High performance multicast

The OVN-Kubernetes network plugin supports multicast between pods on the default network. This is best used for low-bandwidth coordination or service discovery, and not high-bandwidth applications. For applications such as streaming media, like Internet Protocol television (IPTV) and multipoint videoconferencing, you can utilize Single Root I/O Virtualization (SR-IOV) hardware to provide near-native performance.

When using additional SR-IOV interfaces for multicast:

Multicast packages must be sent or received by a pod through the additional SR-IOV interface.
The physical network which connects the SR-IOV interfaces decides the multicast routing and topology, which is not controlled by OpenShift Container Platform.

18.7.2. Configuring an SR-IOV interface for multicast

The follow procedure creates an example SR-IOV interface for multicast.

Prerequisites

Install the OpenShift CLI (oc).
You must log in to the cluster with a user that has the cluster-admin role.

Procedure

Create a SriovNetworkNodePolicy object:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-example
  namespace: openshift-sriov-network-operator
spec:
  resourceName: example
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  nicSelector:
    vendor: "8086"
    pfNames: ['ens803f0']
    rootDevices: ['0000:86:00.0']

Create a SriovNetwork object:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: net-example
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: default
  ipam: | 1
    {
      "type": "host-local", 2
      "subnet": "10.56.217.0/24",
      "rangeStart": "10.56.217.171",
      "rangeEnd": "10.56.217.181",
      "routes": [
        {"dst": "224.0.0.0/5"},
        {"dst": "232.0.0.0/5"}
      ],
      "gateway": "10.56.217.1"
    }
  resourceName: example

1 2: If you choose to configure DHCP as IPAM, ensure that you provision the following default routes through your DHCP server: 224.0.0.0/5 and 232.0.0.0/5. This is to override the static multicast route set by the default network provider.

Create a pod with multicast application:

apiVersion: v1
kind: Pod
metadata:
  name: testpmd
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: nic1
spec:
  containers:
  - name: example
    image: rhel7:latest
    securityContext:
      capabilities:
        add: ["NET_ADMIN"] 1
    command: [ "sleep", "infinity"]

1: The NET_ADMIN capability is required only if your application needs to assign the multicast IP address to the SR-IOV interface. Otherwise, it can be omitted.

18.8. Using DPDK and RDMA

The containerized Data Plane Development Kit (DPDK) application is supported on OpenShift Container Platform. You can use Single Root I/O Virtualization (SR-IOV) network hardware with the Data Plane Development Kit (DPDK) and with remote direct memory access (RDMA).

Before you perform any tasks in the following documentation, ensure that you installed the SR-IOV Network Operator.

18.8.1. Example use of a virtual function in a pod

You can run a remote direct memory access (RDMA) or a Data Plane Development Kit (DPDK) application in a pod with SR-IOV VF attached.

This example shows a pod using a virtual function (VF) in RDMA mode:

Pod spec that uses RDMA mode

apiVersion: v1
kind: Pod
metadata:
  name: rdma-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-rdma-mlnx
spec:
  containers:
  - name: testpmd
    image: <RDMA_image>
    imagePullPolicy: IfNotPresent
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
    command: ["sleep", "infinity"]

The following example shows a pod with a VF in DPDK mode:

Pod spec that uses DPDK mode

apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-dpdk-net
spec:
  containers:
  - name: testpmd
    image: <DPDK_image>
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"]
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "4Gi"
      requests:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages

18.8.2. Using a virtual function in DPDK mode with an Intel NIC

Prerequisites

Install the OpenShift CLI (oc).
Install the SR-IOV Network Operator.
Log in as a user with cluster-admin privileges.

Procedure

Create the following SriovNetworkNodePolicy object, and then save the YAML in the intel-dpdk-node-policy.yaml file.
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: intel-dpdk-node-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: intelnics
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  priority: <priority>
  numVfs: <num>
  nicSelector:
    vendor: "8086"
    deviceID: "158b"
    pfNames: ["<pf_name>", ...]
    rootDevices: ["<pci_bus_id>", "..."]
  deviceType: vfio-pci 1
```
1
Specify the driver type for the virtual functions to vfio-pci.
Note
See the Configuring SR-IOV network devices section for a detailed explanation on each option in SriovNetworkNodePolicy.
When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator may drain the nodes, and in some cases, reboot nodes. It may take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.
After the configuration update is applied, all the pods in openshift-sriov-network-operator namespace will change to a Running status.
Create the SriovNetworkNodePolicy object by running the following command:
```
$ oc create -f intel-dpdk-node-policy.yaml
```
Create the following SriovNetwork object, and then save the YAML in the intel-dpdk-network.yaml file.
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: intel-dpdk-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: <target_namespace>
  ipam: |-
# ... 1
  vlan: <vlan>
  resourceName: intelnics
```
1
Specify a configuration object for the ipam CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
Note
See the "Configuring SR-IOV additional network" section for a detailed explanation on each option in SriovNetwork.
An optional library, app-netutil, provides several API methods for gathering network information about a container’s parent pod.
Create the SriovNetwork object by running the following command:
```
$ oc create -f intel-dpdk-network.yaml
```
Create the following Pod spec, and then save the YAML in the intel-dpdk-pod.yaml file.
```
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  namespace: <target_namespace> 1
  annotations:
    k8s.v1.cni.cncf.io/networks: intel-dpdk-network
spec:
  containers:
  - name: testpmd
    image: <DPDK_image> 2
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] 3
    volumeMounts:
    - mountPath: /mnt/huge 4
      name: hugepage
    resources:
      limits:
        openshift.io/intelnics: "1" 5
        memory: "1Gi"
        cpu: "4" 6
        hugepages-1Gi: "4Gi" 7
      requests:
        openshift.io/intelnics: "1"
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```
1
Specify the same target_namespace where the SriovNetwork object intel-dpdk-network is created. If you would like to create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetwork object.
2
Specify the DPDK image which includes your application and the DPDK library used by application.
3
Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
4
Mount a hugepage volume to the DPDK pod under /mnt/huge. The hugepage volume is backed by the emptyDir volume type with the medium being Hugepages.
5
Optional: Specify the number of DPDK devices allocated to DPDK pod. This resource request and limit, if not explicitly specified, will be automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by the SR-IOV Operator. It is enabled by default and can be disabled by setting enableInjector option to false in the default SriovOperatorConfig CR.
6
Specify the number of CPUs. The DPDK pod usually requires exclusive CPUs to be allocated from the kubelet. This is achieved by setting CPU Manager policy to static and creating a pod with Guaranteed QoS.
7
Specify hugepage size hugepages-1Gi or hugepages-2Mi and the quantity of hugepages that will be allocated to the DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepage requires adding kernel arguments to Nodes. For example, adding kernel arguments default_hugepagesz=1GB, hugepagesz=1G and hugepages=16 will result in 16*1Gi hugepages be allocated during system boot.
Create the DPDK pod by running the following command:
```
$ oc create -f intel-dpdk-pod.yaml
```

18.8.3. Using a virtual function in DPDK mode with a Mellanox NIC

You can create a network node policy and create a Data Plane Development Kit (DPDK) pod using a virtual function in DPDK mode with a Mellanox NIC.

Prerequisites

You have installed the OpenShift CLI (oc).
You have installed the Single Root I/O Virtualization (SR-IOV) Network Operator.
You have logged in as a user with cluster-admin privileges.

Procedure

Save the following SriovNetworkNodePolicy YAML configuration to an mlx-dpdk-node-policy.yaml file:
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx-dpdk-node-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: mlxnics
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  priority: <priority>
  numVfs: <num>
  nicSelector:
    vendor: "15b3"
    deviceID: "1015" 1
    pfNames: ["<pf_name>", ...]
    rootDevices: ["<pci_bus_id>", "..."]
  deviceType: netdevice 2
  isRdma: true 3
```
1
Specify the device hex code of the SR-IOV network device.
2
Specify the driver type for the virtual functions to netdevice. A Mellanox SR-IOV Virtual Function (VF) can work in DPDK mode without using the vfio-pci device type. The VF device appears as a kernel network interface inside a container.
3
Enable Remote Direct Memory Access (RDMA) mode. This is required for Mellanox cards to work in DPDK mode.
Note
See Configuring an SR-IOV network device for a detailed explanation of each option in the SriovNetworkNodePolicy object.
When applying the configuration specified in an SriovNetworkNodePolicy object, the SR-IOV Operator might drain the nodes, and in some cases, reboot nodes. It might take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.
After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace will change to a Running status.
Create the SriovNetworkNodePolicy object by running the following command:
```
$ oc create -f mlx-dpdk-node-policy.yaml
```
Save the following SriovNetwork YAML configuration to an mlx-dpdk-network.yaml file:
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: mlx-dpdk-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: <target_namespace>
  ipam: |- 1
...
  vlan: <vlan>
  resourceName: mlxnics
```
1
Specify a configuration object for the IP Address Management (IPAM) Container Network Interface (CNI) plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
Note
See Configuring an SR-IOV network device for a detailed explanation on each option in the SriovNetwork object.
The app-netutil option library provides several API methods for gathering network information about the parent pod of a container.
Create the SriovNetwork object by running the following command:
```
$ oc create -f mlx-dpdk-network.yaml
```
Save the following Pod YAML configuration to an mlx-dpdk-pod.yaml file:
```
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  namespace: <target_namespace> 1
  annotations:
    k8s.v1.cni.cncf.io/networks: mlx-dpdk-network
spec:
  containers:
  - name: testpmd
    image: <DPDK_image> 2
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] 3
    volumeMounts:
    - mountPath: /mnt/huge 4
      name: hugepage
    resources:
      limits:
        openshift.io/mlxnics: "1" 5
        memory: "1Gi"
        cpu: "4" 6
        hugepages-1Gi: "4Gi" 7
      requests:
        openshift.io/mlxnics: "1"
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```
1
Specify the same target_namespace where SriovNetwork object mlx-dpdk-network is created. To create the pod in a different namespace, change target_namespace in both the Pod spec and SriovNetwork object.
2
Specify the DPDK image which includes your application and the DPDK library used by the application.
3
Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
4
Mount the hugepage volume to the DPDK pod under /mnt/huge. The hugepage volume is backed by the emptyDir volume type with the medium being Hugepages.
5
Optional: Specify the number of DPDK devices allocated for the DPDK pod. If not explicitly specified, this resource request and limit is automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by SR-IOV Operator. It is enabled by default and can be disabled by setting the enableInjector option to false in the default SriovOperatorConfig CR.
6
Specify the number of CPUs. The DPDK pod usually requires that exclusive CPUs be allocated from the kubelet. To do this, set the CPU Manager policy to static and create a pod with Guaranteed Quality of Service (QoS).
7
Specify hugepage size hugepages-1Gi or hugepages-2Mi and the quantity of hugepages that will be allocated to the DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepages requires adding kernel arguments to Nodes.
Create the DPDK pod by running the following command:
```
$ oc create -f mlx-dpdk-pod.yaml
```

18.8.4. Using the TAP CNI to run a rootless DPDK workload with kernel access

DPDK applications can use virtio-user as an exception path to inject certain types of packets, such as log messages, into the kernel for processing. For more information about this feature, see Virtio_user as Exception Path.

In OpenShift Container Platform version 4.14 and later, you can use non-privileged pods to run DPDK applications alongside the tap CNI plugin. To enable this functionality, you need to mount the vhost-net device by setting the needVhostNet parameter to true within the SriovNetworkNodePolicy object.

Figure 18.1. DPDK and TAP example configuration

Prerequisites

You have installed the OpenShift CLI (oc).
You have installed the SR-IOV Network Operator.
You are logged in as a user with cluster-admin privileges.
Ensure that setsebools container_use_devices=on is set as root on all nodes.
Note
Use the Machine Config Operator to set this SELinux boolean.

Procedure

Create a file, such as test-namespace.yaml, with content like the following example:

apiVersion: v1
kind: Namespace
metadata:
  name: test-namespace
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
    security.openshift.io/scc.podSecurityLabelSync: "false"

Create the new Namespace object by running the following command:
```
$ oc apply -f test-namespace.yaml
```
Create a file, such as sriov-node-network-policy.yaml, with content like the following example::
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
 name: sriovnic
 namespace: openshift-sriov-network-operator
spec:
 deviceType: netdevice 1
 isRdma: true 2
 needVhostNet: true 3
 nicSelector:
   vendor: "15b3" 4
   deviceID: "101b" 5
   rootDevices: ["00:05.0"]
 numVfs: 10
 priority: 99
 resourceName: sriovnic
 nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
```
1
This indicates that the profile is tailored specifically for Mellanox Network Interface Controllers (NICs).
2
Setting isRdma to true is only required for a Mellanox NIC.
3
This mounts the /dev/net/tun and /dev/vhost-net devices into the container so the application can create a tap device and connect the tap device to the DPDK workload.
4
The vendor hexadecimal code of the SR-IOV network device. The value 15b3 is associated with a Mellanox NIC.
5
The device hexadecimal code of the SR-IOV network device.
Create the SriovNetworkNodePolicy object by running the following command:
```
$ oc create -f sriov-node-network-policy.yaml
```
Create the following SriovNetwork object, and then save the YAML in the sriov-network-attachment.yaml file:
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
 name: sriov-network
 namespace: openshift-sriov-network-operator
spec:
 networkNamespace: test-namespace
 resourceName: sriovnic
 spoofChk: "off"
 trust: "on"
```
Note
See the "Configuring SR-IOV additional network" section for a detailed explanation on each option in SriovNetwork.
An optional library, app-netutil, provides several API methods for gathering network information about a container’s parent pod.
Create the SriovNetwork object by running the following command:
```
$ oc create -f sriov-network-attachment.yaml
```

Create a file, such as tap-example.yaml, that defines a network attachment definition, with content like the following example:

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
 name: tap-one
 namespace: test-namespace 1
spec:
 config: '{
   "cniVersion": "0.4.0",
   "name": "tap",
   "plugins": [
     {
        "type": "tap",
        "multiQueue": true,
        "selinuxcontext": "system_u:system_r:container_t:s0"
     },
     {
       "type":"tuning",
       "capabilities":{
         "mac":true
       }
     }
   ]
 }'

1: Specify the same target_namespace where the SriovNetwork object is created.

Create the NetworkAttachmentDefinition object by running the following command:
```
$ oc apply -f tap-example.yaml
```
Create a file, such as dpdk-pod-rootless.yaml, with content like the following example:
```
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  namespace: test-namespace 1
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {"name": "sriov-network", "namespace": "test-namespace"},
      {"name": "tap-one", "interface": "ext0", "namespace": "test-namespace"}]'
spec:
  nodeSelector:
    kubernetes.io/hostname: "worker-0"
  securityContext:
      fsGroup: 1001 2
      runAsGroup: 1001 3
      seccompProfile:
        type: RuntimeDefault
  containers:
  - name: testpmd
    image: <DPDK_image> 4
    securityContext:
      capabilities:
        drop: ["ALL"] 5
        add: 6
          - IPC_LOCK
          - NET_RAW #for mlx only 7
      runAsUser: 1001 8
      privileged: false 9
      allowPrivilegeEscalation: true 10
      runAsNonRoot: true 11
    volumeMounts:
    - mountPath: /mnt/huge 12
      name: hugepages
    resources:
      limits:
        openshift.io/sriovnic: "1" 13
        memory: "1Gi"
        cpu: "4" 14
        hugepages-1Gi: "4Gi" 15
      requests:
        openshift.io/sriovnic: "1"
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  runtimeClassName: performance-cnf-performanceprofile 16
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages
```
1
Specify the same target_namespace in which the SriovNetwork object is created. If you want to create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetwork object.
2
Sets the group ownership of volume-mounted directories and files created in those volumes.
3
Specify the primary group ID used for running the container.
4
Specify the DPDK image that contains your application and the DPDK library used by application.
5
Removing all capabilities (ALL) from the container’s securityContext means that the container has no special privileges beyond what is necessary for normal operation.
6
Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access. These capabilities must also be set in the binary file by using the setcap command.
7
Mellanox network interface controller (NIC) requires the NET_RAW capability.
8
Specify the user ID used for running the container.
9
This setting indicates that the container or containers within the pod should not be granted privileged access to the host system.
10
This setting allows a container to escalate its privileges beyond the initial non-root privileges it might have been assigned.
11
This setting ensures that the container runs with a non-root user. This helps enforce the principle of least privilege, limiting the potential impact of compromising the container and reducing the attack surface.
12
Mount a hugepage volume to the DPDK pod under /mnt/huge. The hugepage volume is backed by the emptyDir volume type with the medium being Hugepages.
13
Optional: Specify the number of DPDK devices allocated for the DPDK pod. If not explicitly specified, this resource request and limit is automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by SR-IOV Operator. It is enabled by default and can be disabled by setting the enableInjector option to false in the default SriovOperatorConfig CR.
14
Specify the number of CPUs. The DPDK pod usually requires exclusive CPUs to be allocated from the kubelet. This is achieved by setting CPU Manager policy to static and creating a pod with Guaranteed QoS.
15
Specify hugepage size hugepages-1Gi or hugepages-2Mi and the quantity of hugepages that will be allocated to the DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepage requires adding kernel arguments to Nodes. For example, adding kernel arguments default_hugepagesz=1GB, hugepagesz=1G and hugepages=16 will result in 16*1Gi hugepages be allocated during system boot.
16
If your performance profile is not named cnf-performance profile, replace that string with the correct performance profile name.
Create the DPDK pod by running the following command:
```
$ oc create -f dpdk-pod-rootless.yaml
```

Additional resources

18.8.5. Overview of achieving a specific DPDK line rate

To achieve a specific Data Plane Development Kit (DPDK) line rate, deploy a Node Tuning Operator and configure Single Root I/O Virtualization (SR-IOV). You must also tune the DPDK settings for the following resources:

Isolated CPUs
Hugepages
The topology scheduler

Note

In previous versions of OpenShift Container Platform, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift Container Platform applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator.

DPDK test environment

The following diagram shows the components of a traffic-testing environment:

Traffic generator: An application that can generate high-volume packet traffic.
SR-IOV-supporting NIC: A network interface card compatible with SR-IOV. The card runs a number of virtual functions on a physical interface.
Physical Function (PF): A PCI Express (PCIe) function of a network adapter that supports the SR-IOV interface.
Virtual Function (VF): A lightweight PCIe function on a network adapter that supports SR-IOV. The VF is associated with the PCIe PF on the network adapter. The VF represents a virtualized instance of the network adapter.
Switch: A network switch. Nodes can also be connected back-to-back.
testpmd: An example application included with DPDK. The testpmd application can be used to test the DPDK in a packet-forwarding mode. The testpmd application is also an example of how to build a fully-fledged application using the DPDK Software Development Kit (SDK).
worker 0 and worker 1: OpenShift Container Platform nodes.

18.8.6. Using SR-IOV and the Node Tuning Operator to achieve a DPDK line rate

You can use the Node Tuning Operator to configure isolated CPUs, hugepages, and a topology scheduler. You can then use the Node Tuning Operator with Single Root I/O Virtualization (SR-IOV) to achieve a specific Data Plane Development Kit (DPDK) line rate.

Prerequisites

You have installed the OpenShift CLI (oc).
You have installed the SR-IOV Network Operator.
You have logged in as a user with cluster-admin privileges.
You have deployed a standalone Node Tuning Operator.
Note
In previous versions of OpenShift Container Platform, the Performance Addon Operator was used to implement automatic tuning to achieve low latency performance for OpenShift applications. In OpenShift Container Platform 4.11 and later, this functionality is part of the Node Tuning Operator.

Procedure

Create a PerformanceProfile object based on the following example:
```
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance
spec:
  globallyDisableIrqLoadBalancing: true
  cpu:
    isolated: 21-51,73-103 1
    reserved: 0-20,52-72 2
  hugepages:
    defaultHugepagesSize: 1G 3
    pages:
      - count: 32
        size: 1G
  net:
    userLevelNetworking: true
  numa:
    topologyPolicy: "single-numa-node"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
```
1
If hyperthreading is enabled on the system, allocate the relevant symbolic links to the isolated and reserved CPU groups. If the system contains multiple non-uniform memory access nodes (NUMAs), allocate CPUs from both NUMAs to both groups. You can also use the Performance Profile Creator for this task. For more information, see Creating a performance profile.
2
You can also specify a list of devices that will have their queues set to the reserved CPU count. For more information, see Reducing NIC queues using the Node Tuning Operator.
3
Allocate the number and size of hugepages needed. You can specify the NUMA configuration for the hugepages. By default, the system allocates an even number to every NUMA node on the system. If needed, you can request the use of a realtime kernel for the nodes. See Provisioning a worker with real-time capabilities for more information.
Save the yaml file as mlx-dpdk-perfprofile-policy.yaml.
Apply the performance profile using the following command:
```
$ oc create -f mlx-dpdk-perfprofile-policy.yaml
```

18.8.6.1. DPDK library for use with container applications

An optional library, app-netutil, provides several API methods for gathering network information about a pod from within a container running within that pod.

This library can assist with integrating SR-IOV virtual functions (VFs) in Data Plane Development Kit (DPDK) mode into the container. The library provides both a Golang API and a C API.

Currently there are three API methods implemented:

GetCPUInfo(): This function determines which CPUs are available to the container and returns the list.
GetHugepages(): This function determines the amount of huge page memory requested in the Pod spec for each container and returns the values.
GetInterfaces(): This function determines the set of interfaces in the container and returns the list. The return value includes the interface type and type-specific data for each interface.

The repository for the library includes a sample Dockerfile to build a container image, dpdk-app-centos. The container image can run one of the following DPDK sample applications, depending on an environment variable in the pod specification: l2fwd, l3wd or testpmd. The container image provides an example of integrating the app-netutil library into the container image itself. The library can also integrate into an init container. The init container can collect the required data and pass the data to an existing DPDK workload.

18.8.6.2. Example SR-IOV Network Operator for virtual functions

You can use the Single Root I/O Virtualization (SR-IOV) Network Operator to allocate and configure Virtual Functions (VFs) from SR-IOV-supporting Physical Function NICs on the nodes.

For more information on deploying the Operator, see Installing the SR-IOV Network Operator. For more information on configuring an SR-IOV network device, see Configuring an SR-IOV network device.

There are some differences between running Data Plane Development Kit (DPDK) workloads on Intel VFs and Mellanox VFs. This section provides object configuration examples for both VF types. The following is an example of an sriovNetworkNodePolicy object used to run DPDK applications on Intel NICs:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-nic-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci 1
  needVhostNet: true 2
  nicSelector:
    pfNames: ["ens3f0"]
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 10
  priority: 99
  resourceName: dpdk_nic_1
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-nic-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: vfio-pci
  needVhostNet: true
  nicSelector:
    pfNames: ["ens3f1"]
  nodeSelector:
  node-role.kubernetes.io/worker-cnf: ""
  numVfs: 10
  priority: 99
  resourceName: dpdk_nic_2

1: For Intel NICs, deviceType must be vfio-pci.
2: If kernel communication with DPDK workloads is required, add needVhostNet: true. This mounts the /dev/net/tun and /dev/vhost-net devices into the container so the application can create a tap device and connect the tap device to the DPDK workload.

The following is an example of an sriovNetworkNodePolicy object for Mellanox NICs:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-nic-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice 1
  isRdma: true 2
  nicSelector:
    rootDevices:
      - "0000:5e:00.1"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 5
  priority: 99
  resourceName: dpdk_nic_1
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: dpdk-nic-2
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  isRdma: true
  nicSelector:
    rootDevices:
      - "0000:5e:00.0"
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numVfs: 5
  priority: 99
  resourceName: dpdk_nic_2

1: For Mellanox devices the deviceType must be netdevice.
2: For Mellanox devices isRdma must be true. Mellanox cards are connected to DPDK applications using Flow Bifurcation. This mechanism splits traffic between Linux user space and kernel space, and can enhance line rate processing capability.

18.8.6.3. Example SR-IOV network operator

The following is an example definition of an sriovNetwork object. In this case, Intel and Mellanox configurations are identical:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: dpdk-network-1
  namespace: openshift-sriov-network-operator
spec:
  ipam: '{"type": "host-local","ranges": [[{"subnet": "10.0.1.0/24"}]],"dataDir":
   "/run/my-orchestrator/container-ipam-state-1"}' 1
  networkNamespace: dpdk-test 2
  spoofChk: "off"
  trust: "on"
  resourceName: dpdk_nic_1 3
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: dpdk-network-2
  namespace: openshift-sriov-network-operator
spec:
  ipam: '{"type": "host-local","ranges": [[{"subnet": "10.0.2.0/24"}]],"dataDir":
   "/run/my-orchestrator/container-ipam-state-1"}'
  networkNamespace: dpdk-test
  spoofChk: "off"
  trust: "on"
  resourceName: dpdk_nic_2

1: You can use a different IP Address Management (IPAM) implementation, such as Whereabouts. For more information, see Dynamic IP address assignment configuration with Whereabouts.
2: You must request the networkNamespace where the network attachment definition will be created. You must create the sriovNetwork CR under the openshift-sriov-network-operator namespace.
3: The resourceName value must match that of the resourceName created under the sriovNetworkNodePolicy.

18.8.6.4. Example DPDK base workload

The following is an example of a Data Plane Development Kit (DPDK) container:

apiVersion: v1
kind: Namespace
metadata:
  name: dpdk-test
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: '[ 1
     {
      "name": "dpdk-network-1",
      "namespace": "dpdk-test"
     },
     {
      "name": "dpdk-network-2",
      "namespace": "dpdk-test"
     }
   ]'
    irq-load-balancing.crio.io: "disable" 2
    cpu-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
  labels:
    app: dpdk
  name: testpmd
  namespace: dpdk-test
spec:
  runtimeClassName: performance-performance 3
  containers:
    - command:
        - /bin/bash
        - -c
        - sleep INF
      image: registry.redhat.io/openshift4/dpdk-base-rhel8
      imagePullPolicy: Always
      name: dpdk
      resources: 4
        limits:
          cpu: "16"
          hugepages-1Gi: 8Gi
          memory: 2Gi
        requests:
          cpu: "16"
          hugepages-1Gi: 8Gi
          memory: 2Gi
      securityContext:
        capabilities:
          add:
            - IPC_LOCK
            - SYS_RESOURCE
            - NET_RAW
            - NET_ADMIN
        runAsUser: 0
      volumeMounts:
        - mountPath: /mnt/huge
          name: hugepages
  terminationGracePeriodSeconds: 5
  volumes:
    - emptyDir:
        medium: HugePages
      name: hugepages

1: Request the SR-IOV networks you need. Resources for the devices will be injected automatically.
2: Disable the CPU and IRQ load balancing base. See Disabling interrupt processing for individual pods for more information.
3: Set the runtimeClass to performance-performance. Do not set the runtimeClass to HostNetwork or privileged.
4: Request an equal number of resources for requests and limits to start the pod with Guaranteed Quality of Service (QoS).

Note

Do not start the pod with SLEEP and then exec into the pod to start the testpmd or the DPDK workload. This can add additional interrupts as the exec process is not pinned to any CPU.

18.8.6.5. Example testpmd script

The following is an example script for running testpmd:

#!/bin/bash
set -ex
export CPU=$(cat /sys/fs/cgroup/cpuset/cpuset.cpus)
echo ${CPU}

dpdk-testpmd -l ${CPU} -a ${PCIDEVICE_OPENSHIFT_IO_DPDK_NIC_1} -a ${PCIDEVICE_OPENSHIFT_IO_DPDK_NIC_2} -n 4 -- -i --nb-cores=15 --rxd=4096 --txd=4096 --rxq=7 --txq=7 --forward-mode=mac --eth-peer=0,50:00:00:00:00:01 --eth-peer=1,50:00:00:00:00:02

This example uses two different sriovNetwork CRs. The environment variable contains the Virtual Function (VF) PCI address that was allocated for the pod. If you use the same network in the pod definition, you must split the pciAddress. It is important to configure the correct MAC addresses of the traffic generator. This example uses custom MAC addresses.

18.8.7. Using a virtual function in RDMA mode with a Mellanox NIC

Important

RDMA over Converged Ethernet (RoCE) is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

RDMA over Converged Ethernet (RoCE) is the only supported mode when using RDMA on OpenShift Container Platform.

Prerequisites

Install the OpenShift CLI (oc).
Install the SR-IOV Network Operator.
Log in as a user with cluster-admin privileges.

Procedure

Create the following SriovNetworkNodePolicy object, and then save the YAML in the mlx-rdma-node-policy.yaml file.
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx-rdma-node-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: mlxnics
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  priority: <priority>
  numVfs: <num>
  nicSelector:
    vendor: "15b3"
    deviceID: "1015" 1
    pfNames: ["<pf_name>", ...]
    rootDevices: ["<pci_bus_id>", "..."]
  deviceType: netdevice 2
  isRdma: true 3
```
1
Specify the device hex code of the SR-IOV network device.
2
Specify the driver type for the virtual functions to netdevice.
3
Enable RDMA mode.
Note
See the Configuring SR-IOV network devices section for a detailed explanation on each option in SriovNetworkNodePolicy.
When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator may drain the nodes, and in some cases, reboot nodes. It may take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.
After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace will change to a Running status.
Create the SriovNetworkNodePolicy object by running the following command:
```
$ oc create -f mlx-rdma-node-policy.yaml
```
Create the following SriovNetwork object, and then save the YAML in the mlx-rdma-network.yaml file.
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: mlx-rdma-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: <target_namespace>
  ipam: |- 1
# ...
  vlan: <vlan>
  resourceName: mlxnics
```
1
Specify a configuration object for the ipam CNI plugin as a YAML block scalar. The plugin manages IP address assignment for the attachment definition.
Note
See the "Configuring SR-IOV additional network" section for a detailed explanation on each option in SriovNetwork.
An optional library, app-netutil, provides several API methods for gathering network information about a container’s parent pod.
Create the SriovNetworkNodePolicy object by running the following command:
```
$ oc create -f mlx-rdma-network.yaml
```
Create the following Pod spec, and then save the YAML in the mlx-rdma-pod.yaml file.
```
apiVersion: v1
kind: Pod
metadata:
  name: rdma-app
  namespace: <target_namespace> 1
  annotations:
    k8s.v1.cni.cncf.io/networks: mlx-rdma-network
spec:
  containers:
  - name: testpmd
    image: <RDMA_image> 2
    securityContext:
      runAsUser: 0
      capabilities:
        add: ["IPC_LOCK","SYS_RESOURCE","NET_RAW"] 3
    volumeMounts:
    - mountPath: /mnt/huge 4
      name: hugepage
    resources:
      limits:
        memory: "1Gi"
        cpu: "4" 5
        hugepages-1Gi: "4Gi" 6
      requests:
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```
1
Specify the same target_namespace where SriovNetwork object mlx-rdma-network is created. If you would like to create the pod in a different namespace, change target_namespace in both Pod spec and SriovNetwork object.
2
Specify the RDMA image which includes your application and RDMA library used by application.
3
Specify additional capabilities required by the application inside the container for hugepage allocation, system resource allocation, and network interface access.
4
Mount the hugepage volume to RDMA pod under /mnt/huge. The hugepage volume is backed by the emptyDir volume type with the medium being Hugepages.
5
Specify number of CPUs. The RDMA pod usually requires exclusive CPUs be allocated from the kubelet. This is achieved by setting CPU Manager policy to static and create pod with Guaranteed QoS.
6
Specify hugepage size hugepages-1Gi or hugepages-2Mi and the quantity of hugepages that will be allocated to the RDMA pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepage requires adding kernel arguments to Nodes.
Create the RDMA pod by running the following command:
```
$ oc create -f mlx-rdma-pod.yaml
```

18.8.8. A test pod template for clusters that use OVS-DPDK on OpenStack

The following testpmd pod demonstrates container creation with huge pages, reserved CPUs, and the SR-IOV port.

An example testpmd pod

apiVersion: v1
kind: Pod
metadata:
  name: testpmd-dpdk
  namespace: mynamespace
  annotations:
    cpu-load-balancing.crio.io: "disable"
    cpu-quota.crio.io: "disable"
# ...
spec:
  containers:
  - name: testpmd
    command: ["sleep", "99999"]
    image: registry.redhat.io/openshift4/dpdk-base-rhel8:v4.9
    securityContext:
      capabilities:
        add: ["IPC_LOCK","SYS_ADMIN"]
      privileged: true
      runAsUser: 0
    resources:
      requests:
        memory: 1000Mi
        hugepages-1Gi: 1Gi
        cpu: '2'
        openshift.io/dpdk1: 1 1
      limits:
        hugepages-1Gi: 1Gi
        cpu: '2'
        memory: 1000Mi
        openshift.io/dpdk1: 1
    volumeMounts:
      - mountPath: /mnt/huge
        name: hugepage
        readOnly: False
  runtimeClassName: performance-cnf-performanceprofile 2
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages

1: The name dpdk1 in this example is a user-created SriovNetworkNodePolicy resource. You can substitute this name for that of a resource that you create.
2: If your performance profile is not named cnf-performance profile, replace that string with the correct performance profile name.

18.8.9. Additional resources

18.9. Using pod-level bonding

Bonding at the pod level is vital to enable workloads inside pods that require high availability and more throughput. With pod-level bonding, you can create a bond interface from multiple single root I/O virtualization (SR-IOV) virtual function interfaces in a kernel mode interface. The SR-IOV virtual functions are passed into the pod and attached to a kernel driver.

One scenario where pod level bonding is required is creating a bond interface from multiple SR-IOV virtual functions on different physical functions. Creating a bond interface from two different physical functions on the host can be used to achieve high availability and throughput at pod level.

Before you perform any tasks in the following documentation, ensure that you installed the SR-IOV Network Operator.

For guidance on tasks such as creating a SR-IOV network, network policies, network attachment definitions and pods, see Configuring an SR-IOV network device.

18.9.1. Configuring a bond interface from two SR-IOV interfaces

Bonding enables multiple network interfaces to be aggregated into a single logical "bonded" interface. Bond Container Network Interface (Bond-CNI) brings bond capability into containers.

Bond-CNI can be created using Single Root I/O Virtualization (SR-IOV) virtual functions and placing them in the container network namespace.

OpenShift Container Platform only supports Bond-CNI using SR-IOV virtual functions. The SR-IOV Network Operator provides the SR-IOV CNI plugin needed to manage the virtual functions. Other CNIs or types of interfaces are not supported.

Prerequisites

The SR-IOV Network Operator must be installed and configured to obtain virtual functions in a container.
To configure SR-IOV interfaces, an SR-IOV network and policy must be created for each interface.
The SR-IOV Network Operator creates a network attachment definition for each SR-IOV interface, based on the SR-IOV network and policy defined.
The linkState is set to the default value auto for the SR-IOV virtual function.

18.9.1.1. Creating a bond network attachment definition

Now that the SR-IOV virtual functions are available, you can create a bond network attachment definition.

apiVersion: "k8s.cni.cncf.io/v1"
    kind: NetworkAttachmentDefinition
    metadata:
      name: bond-net1
      namespace: demo
    spec:
      config: '{
      "type": "bond", 1
      "cniVersion": "0.3.1",
      "name": "bond-net1",
      "mode": "active-backup", 2
      "failOverMac": 1, 3
      "linksInContainer": true, 4
      "miimon": "100",
      "mtu": 1500,
      "links": [ 5
            {"name": "net1"},
            {"name": "net2"}
        ],
      "ipam": {
            "type": "host-local",
            "subnet": "10.56.217.0/24",
            "routes": [{
            "dst": "0.0.0.0/0"
            }],
            "gateway": "10.56.217.1"
        }
      }'

1

The cni-type is always set to bond.

2

The mode attribute specifies the bonding mode.

Note

The bonding modes supported are:

balance-rr - 0
active-backup - 1
balance-xor - 2

For balance-rr or balance-xor modes, you must set the trust mode to on for the SR-IOV virtual function.

3

The failover attribute is mandatory for active-backup mode and must be set to 1.

4

The linksInContainer=true flag informs the Bond CNI that the required interfaces are to be found inside the container. By default, Bond CNI looks for these interfaces on the host which does not work for integration with SRIOV and Multus.

5

The links section defines which interfaces will be used to create the bond. By default, Multus names the attached interfaces as: "net", plus a consecutive number, starting with one.

18.9.1.2. Creating a pod using a bond interface

Test the setup by creating a pod with a YAML file named for example podbonding.yaml with content similar to the following:

apiVersion: v1
    kind: Pod
    metadata:
      name: bondpod1
      namespace: demo
      annotations:
        k8s.v1.cni.cncf.io/networks: demo/sriovnet1, demo/sriovnet2, demo/bond-net1 1
    spec:
      containers:
      - name: podexample
        image: quay.io/openshift/origin-network-interface-bond-cni:4.11.0
        command: ["/bin/bash", "-c", "sleep INF"]

1: Note the network annotation: it contains two SR-IOV network attachments, and one bond network attachment. The bond attachment uses the two SR-IOV interfaces as bonded port interfaces.

Apply the yaml by running the following command:
```
$ oc apply -f podbonding.yaml
```

Inspect the pod interfaces with the following command:

$ oc rsh -n demo bondpod1
sh-4.4#
sh-4.4# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
3: eth0@if150: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue state UP
link/ether 62:b1:b5:c8:fb:7a brd ff:ff:ff:ff:ff:ff
inet 10.244.1.122/24 brd 10.244.1.255 scope global eth0
valid_lft forever preferred_lft forever
4: net3: <BROADCAST,MULTICAST,UP,LOWER_UP400> mtu 1500 qdisc noqueue state UP qlen 1000
link/ether 9e:23:69:42:fb:8a brd ff:ff:ff:ff:ff:ff 1
inet 10.56.217.66/24 scope global bond0
valid_lft forever preferred_lft forever
43: net1: <BROADCAST,MULTICAST,UP,LOWER_UP800> mtu 1500 qdisc mq master bond0 state UP qlen 1000
link/ether 9e:23:69:42:fb:8a brd ff:ff:ff:ff:ff:ff 2
44: net2: <BROADCAST,MULTICAST,UP,LOWER_UP800> mtu 1500 qdisc mq master bond0 state UP qlen 1000
link/ether 9e:23:69:42:fb:8a brd ff:ff:ff:ff:ff:ff 3

1: The bond interface is automatically named net3. To set a specific interface name add @name suffix to the pod’s k8s.v1.cni.cncf.io/networks annotation.
2: The net1 interface is based on an SR-IOV virtual function.
3: The net2 interface is based on an SR-IOV virtual function.

Note

If no interface names are configured in the pod annotation, interface names are assigned automatically as net<n>, with <n> starting at 1.

Optional: If you want to set a specific interface name for example bond0, edit the k8s.v1.cni.cncf.io/networks annotation and set bond0 as the interface name as follows:
```
annotations:
        k8s.v1.cni.cncf.io/networks: demo/sriovnet1, demo/sriovnet2, demo/bond-net1@bond0
```

18.10. Configuring hardware offloading

As a cluster administrator, you can configure hardware offloading on compatible nodes to increase data processing performance and reduce load on host CPUs.

Before you perform any tasks in the following documentation, ensure that you installed the SR-IOV Network Operator.

18.10.1. About hardware offloading

Open vSwitch hardware offloading is a method of processing network tasks by diverting them away from the CPU and offloading them to a dedicated processor on a network interface controller. As a result, clusters can benefit from faster data transfer speeds, reduced CPU workloads, and lower computing costs.

The key element for this feature is a modern class of network interface controllers known as SmartNICs. A SmartNIC is a network interface controller that is able to handle computationally-heavy network processing tasks. In the same way that a dedicated graphics card can improve graphics performance, a SmartNIC can improve network performance. In each case, a dedicated processor improves performance for a specific type of processing task.

In OpenShift Container Platform, you can configure hardware offloading for bare metal nodes that have a compatible SmartNIC. Hardware offloading is configured and enabled by the SR-IOV Network Operator.

Hardware offloading is not compatible with all workloads or application types. Only the following two communication types are supported:

pod-to-pod
pod-to-service, where the service is a ClusterIP service backed by a regular pod

In all cases, hardware offloading takes place only when those pods and services are assigned to nodes that have a compatible SmartNIC. Suppose, for example, that a pod on a node with hardware offloading tries to communicate with a service on a regular node. On the regular node, all the processing takes place in the kernel, so the overall performance of the pod-to-service communication is limited to the maximum performance of that regular node. Hardware offloading is not compatible with DPDK applications.

Enabling hardware offloading on a node, but not configuring pods to use, it can result in decreased throughput performance for pod traffic. You cannot configure hardware offloading for pods that are managed by OpenShift Container Platform.

18.10.2. Supported devices

Hardware offloading is supported on the following network interface controllers:

Table 18.15. Supported network interface controllers
Manufacturer	Model	Vendor ID	Device ID
Mellanox	MT27800 Family [ConnectX‑5]	15b3	1017
Mellanox	MT28880 Family [ConnectX‑5 Ex]	15b3	1019
Mellanox	MT2892 Family [ConnectX‑6 Dx]	15b3	101d
Mellanox	MT2894 Family [ConnectX-6 Lx]	15b3	101f
Mellanox	MT42822 BlueField-2 in ConnectX-6 NIC mode	15b3	a2d6

18.10.3. Prerequisites

Your cluster has at least one bare metal machine with a network interface controller that is supported for hardware offloading.
You installed the SR-IOV Network Operator.
Your cluster uses the OVN-Kubernetes network plugin.
In your OVN-Kubernetes network plugin configuration, the gatewayConfig.routingViaHost field is set to false.

18.10.4. Setting the SR-IOV Network Operator into systemd mode

To support hardware offloading, you must first set the SR-IOV Network Operator into systemd mode.

Prerequisites

You installed the OpenShift CLI (oc).
You have access to the cluster as a user that has the cluster-admin role.

Procedure

Create a SriovOperatorConfig custom resource (CR) to deploy all the SR-IOV Operator components:
1. Create a file named sriovOperatorConfig.yaml that contains the following YAML:
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default 1
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: true
  enableOperatorWebhook: true
  configurationMode: "systemd" 2
  logLevel: 2
```
  1
  The only valid name for the SriovOperatorConfig resource is default and it must be in the namespace where the Operator is deployed.
  2
  Setting the SR-IOV Network Operator into systemd mode is only relevant for Open vSwitch hardware offloading.
2. Create the resource by running the following command:
```
$ oc apply -f sriovOperatorConfig.yaml
```

18.10.5. Configuring a machine config pool for hardware offloading

To enable hardware offloading, you now create a dedicated machine config pool and configure it to work with the SR-IOV Network Operator.

Prerequisites

SR-IOV Network Operator installed and set into systemd mode.

Procedure

Create a machine config pool for machines you want to use hardware offloading on.

Create a file, such as mcp-offloading.yaml, with content like the following example:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: mcp-offloading 1
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,mcp-offloading]} 2
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/mcp-offloading: "" 3

1 2: The name of your machine config pool for hardware offloading.
3: This node role label is used to add nodes to the machine config pool.

Apply the configuration for the machine config pool:
```
$ oc create -f mcp-offloading.yaml
```

Add nodes to the machine config pool. Label each node with the node role label of your pool:
```
$ oc label node worker-2 node-role.kubernetes.io/mcp-offloading=""
```

Optional: To verify that the new pool is created, run the following command:

$ oc get nodes

Example output

NAME       STATUS   ROLES                   AGE   VERSION
master-0   Ready    master                  2d    v1.30.3
master-1   Ready    master                  2d    v1.30.3
master-2   Ready    master                  2d    v1.30.3
worker-0   Ready    worker                  2d    v1.30.3
worker-1   Ready    worker                  2d    v1.30.3
worker-2   Ready    mcp-offloading,worker   47h   v1.30.3
worker-3   Ready    mcp-offloading,worker   47h   v1.30.3

Add this machine config pool to the SriovNetworkPoolConfig custom resource:
1. Create a file, such as sriov-pool-config.yaml, with content like the following example:
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: sriovnetworkpoolconfig-offload
  namespace: openshift-sriov-network-operator
spec:
  ovsHardwareOffloadConfig:
    name: mcp-offloading 1
```
  1
  The name of your machine config pool for hardware offloading.
2. Apply the configuration:
```
$ oc create -f <SriovNetworkPoolConfig_name>.yaml
```
  Note
  When you apply the configuration specified in a SriovNetworkPoolConfig object, the SR-IOV Operator drains and restarts the nodes in the machine config pool.
  It might take several minutes for a configuration changes to apply.

18.10.6. Configuring the SR-IOV network node policy

You can create an SR-IOV network device configuration for a node by creating an SR-IOV network node policy. To enable hardware offloading, you must define the .spec.eSwitchMode field with the value "switchdev".

The following procedure creates an SR-IOV interface for a network interface controller with hardware offloading.

Prerequisites

You installed the OpenShift CLI (oc).
You have access to the cluster as a user with the cluster-admin role.

Procedure

Create a file, such as sriov-node-policy.yaml, with content like the following example:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-node-policy 1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice 2
  eSwitchMode: "switchdev" 3
  nicSelector:
    deviceID: "1019"
    rootDevices:
    - 0000:d8:00.0
    vendor: "15b3"
    pfNames:
    - ens8f0
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 6
  priority: 5
  resourceName: mlxnics

1: The name for the custom resource object.
2: Required. Hardware offloading is not supported with vfio-pci.
3: Required.

Apply the configuration for the policy:
```
$ oc create -f sriov-node-policy.yaml
```
Note
When you apply the configuration specified in a SriovNetworkPoolConfig object, the SR-IOV Operator drains and restarts the nodes in the machine config pool.
It might take several minutes for a configuration change to apply.

18.10.6.1. An example SR-IOV network node policy for OpenStack

The following example describes an SR-IOV interface for a network interface controller (NIC) with hardware offloading on Red Hat OpenStack Platform (RHOSP).

An SR-IOV interface for a NIC with hardware offloading on RHOSP

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: ${name}
  namespace: openshift-sriov-network-operator
spec:
  deviceType: switchdev
  isRdma: true
  nicSelector:
    netFilter: openstack/NetworkID:${net_id}
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: 'true'
  numVfs: 1
  priority: 99
  resourceName: ${name}

18.10.7. Improving network traffic performance using a virtual function

Follow this procedure to assign a virtual function to the OVN-Kubernetes management port and increase its network traffic performance.

This procedure results in the creation of two pools: the first has a virtual function used by OVN-Kubernetes, and the second comprises the remaining virtual functions.

Prerequisites

You installed the OpenShift CLI (oc).
You have access to the cluster as a user with the cluster-admin role.

Procedure

Add the network.operator.openshift.io/smart-nic label to each worker node with a SmartNIC present by running the following command:
```
$ oc label node <node-name> network.operator.openshift.io/smart-nic=
```
Use the oc get nodes command to get a list of the available nodes.
Create a policy named sriov-node-mgmt-vf-policy.yaml for the management port with content such as the following example:
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-node-mgmt-vf-policy
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  eSwitchMode: "switchdev"
  nicSelector:
    deviceID: "1019"
    rootDevices:
    - 0000:d8:00.0
    vendor: "15b3"
    pfNames:
    - ens8f0#0-0 1
  nodeSelector:
    network.operator.openshift.io/smart-nic: ""
  numVfs: 6 2
  priority: 5
  resourceName: mgmtvf
```
1
Replace this device with the appropriate network device for your use case. The #0-0 part of the pfNames value reserves a single virtual function used by OVN-Kubernetes.
2
The value provided here is an example. Replace this value with one that meets your requirements. For more information, see SR-IOV network node configuration object in the Additional resources section.
Create a policy named sriov-node-policy.yaml with content such as the following example:
```
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-node-policy
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  eSwitchMode: "switchdev"
  nicSelector:
    deviceID: "1019"
    rootDevices:
    - 0000:d8:00.0
    vendor: "15b3"
    pfNames:
    - ens8f0#1-5 1
  nodeSelector:
    network.operator.openshift.io/smart-nic: ""
  numVfs: 6 2
  priority: 5
  resourceName: mlxnics
```
1
Replace this device with the appropriate network device for your use case.
2
The value provided here is an example. Replace this value with the value specified in the sriov-node-mgmt-vf-policy.yaml file. For more information, see SR-IOV network node configuration object in the Additional resources section.
Note
The sriov-node-mgmt-vf-policy.yaml file has different values for the pfNames and resourceName keys than the sriov-node-policy.yaml file.

Apply the configuration for both policies:

$ oc create -f sriov-node-policy.yaml

$ oc create -f sriov-node-mgmt-vf-policy.yaml

Create a Cluster Network Operator (CNO) ConfigMap in the cluster for the management configuration:
1. Create a ConfigMap named hardware-offload-config.yaml with the following contents:
```
apiVersion: v1
kind: ConfigMap
metadata:
    name: hardware-offload-config
    namespace: openshift-network-operator
data:
    mgmt-port-resource-name: openshift.io/mgmtvf
```
2. Apply the configuration for the ConfigMap:
```
$ oc create -f hardware-offload-config.yaml
```

Additional resources

SR-IOV network node configuration object

18.10.8. Creating a network attachment definition

After you define the machine config pool and the SR-IOV network node policy, you can create a network attachment definition for the network interface card you specified.

Prerequisites

You installed the OpenShift CLI (oc).
You have access to the cluster as a user with the cluster-admin role.

Procedure

Create a file, such as net-attach-def.yaml, with content like the following example:

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: net-attach-def 1
  namespace: net-attach-def 2
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/mlxnics 3
spec:
  config: '{"cniVersion":"0.3.1","name":"ovn-kubernetes","type":"ovn-k8s-cni-overlay","ipam":{},"dns":{}}'

1: The name for your network attachment definition.
2: The namespace for your network attachment definition.
3: This is the value of the spec.resourceName field you specified in the SriovNetworkNodePolicy object.

Apply the configuration for the network attachment definition:
```
$ oc create -f net-attach-def.yaml
```

Verification

Run the following command to see whether the new definition is present:

$ oc get net-attach-def -A

Example output

NAMESPACE         NAME             AGE
net-attach-def    net-attach-def   43h

18.10.9. Adding the network attachment definition to your pods

After you create the machine config pool, the SriovNetworkPoolConfig and SriovNetworkNodePolicy custom resources, and the network attachment definition, you can apply these configurations to your pods by adding the network attachment definition to your pod specifications.

Procedure

In the pod specification, add the .metadata.annotations.k8s.v1.cni.cncf.io/networks field and specify the network attachment definition you created for hardware offloading:
```
....
metadata:
  annotations:
    v1.multus-cni.io/default-network: net-attach-def/net-attach-def 1
```
1
The value must be the name and namespace of the network attachment definition you created for hardware offloading.

18.11. Switching Bluefield-2 from DPU to NIC

You can switch the Bluefield-2 network device from data processing unit (DPU) mode to network interface controller (NIC) mode.

Before you perform any tasks in the following documentation, ensure that you installed the SR-IOV Network Operator.

18.11.1. Switching Bluefield-2 from DPU mode to NIC mode

Use the following procedure to switch Bluefield-2 from data processing units (DPU) mode to network interface controller (NIC) mode.

Important

Currently, only switching Bluefield-2 from DPU to NIC mode is supported. Switching from NIC mode to DPU mode is unsupported.

Prerequisites

You have installed the SR-IOV Network Operator. For more information, see "Installing SR-IOV Network Operator".
You have updated Bluefield-2 to the latest firmware. For more information, see Firmware for NVIDIA BlueField-2.

Procedure

Add the following labels to each of your worker nodes by entering the following commands:

$ oc label node <example_node_name_one> node-role.kubernetes.io/sriov=

$ oc label node <example_node_name_two> node-role.kubernetes.io/sriov=

Create a machine config pool for the SR-IOV Network Operator, for example:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: sriov
spec:
  machineConfigSelector:
    matchExpressions:
    - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,sriov]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/sriov: ""

Apply the following machineconfig.yaml file to the worker nodes:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: sriov
  name: 99-bf2-dpu
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,ZmluZF9jb250YWluZXIoKSB7CiAgY3JpY3RsIHBzIC1vIGpzb24gfCBqcSAtciAnLmNvbnRhaW5lcnNbXSB8IHNlbGVjdCgubWV0YWRhdGEubmFtZT09InNyaW92LW5ldHdvcmstY29uZmlnLWRhZW1vbiIpIHwgLmlkJwp9CnVudGlsIG91dHB1dD0kKGZpbmRfY29udGFpbmVyKTsgW1sgLW4gIiRvdXRwdXQiIF1dOyBkbwogIGVjaG8gIndhaXRpbmcgZm9yIGNvbnRhaW5lciB0byBjb21lIHVwIgogIHNsZWVwIDE7CmRvbmUKISBzdWRvIGNyaWN0bCBleGVjICRvdXRwdXQgL2JpbmRhdGEvc2NyaXB0cy9iZjItc3dpdGNoLW1vZGUuc2ggIiRAIgo=
        mode: 0755
        overwrite: true
        path: /etc/default/switch_in_sriov_config_daemon.sh
    systemd:
      units:
      - name: dpu-switch.service
        enabled: true
        contents: |
          [Unit]
          Description=Switch BlueField2 card to NIC/DPU mode
          RequiresMountsFor=%t/containers
          Wants=network.target
          After=network-online.target kubelet.service
          [Service]
          SuccessExitStatus=0 120
          RemainAfterExit=True
          ExecStart=/bin/bash -c '/etc/default/switch_in_sriov_config_daemon.sh nic || shutdown -r now' 1
          Type=oneshot
          [Install]
          WantedBy=multi-user.target

1: Optional: The PCI address of a specific card can optionally be specified, for example ExecStart=/bin/bash -c '/etc/default/switch_in_sriov_config_daemon.sh nic 0000:5e:00.0 || echo done'. By default, the first device is selected. If there is more than one device, you must specify which PCI address to be used. The PCI address must be the same on all nodes that are switching Bluefield-2 from DPU mode to NIC mode.

Wait for the worker nodes to restart. After restarting, the Bluefield-2 network device on the worker nodes is switched into NIC mode.
Optional: You might need to restart the host hardware because most recent Bluefield-2 firmware releases require a hardware restart to switch into NIC mode.

Additional resources

Installing SR-IOV Network Operator

Ce contenu n'est pas disponible dans la langue sélectionnée.

18.1. About Single Root I/O Virtualization (SR-IOV) hardware networks

Additional resources

18.1.1. Components that manage SR-IOV network devices

18.1.1.1. Supported platforms

18.1.1.2. Supported devices

18.1.2. Additional resources

18.1.3. Next steps

18.2. Configuring an SR-IOV network device

18.2.1. SR-IOV network node configuration object

18.2.1.1. SR-IOV network node configuration examples

18.2.1.2. Automated discovery of SR-IOV network devices

18.2.1.2.1. Example SriovNetworkNodeState object

18.2.1.3. Virtual function (VF) partitioning for SR-IOV devices

18.2.1.4. A test pod template for clusters that use SR-IOV on OpenStack

18.2.1.5. A test pod template for clusters that use OVS hardware offloading on OpenStack

18.2.1.6. Huge pages resource injection for Downward API

18.2.2. Configuring SR-IOV network devices

18.2.3. Creating a non-uniform memory access (NUMA) aligned SR-IOV pod

18.2.4. Exclude the SR-IOV network topology for NUMA-aware scheduling

18.2.5. Troubleshooting SR-IOV configuration

18.2.6. Next steps

18.3. Configuring an SR-IOV Ethernet network attachment

18.3.1. Ethernet device configuration object

18.3.1.1. Creating a configuration for assignment of dual-stack IP addresses dynamically

18.3.1.2. Configuration of IP address assignment for a network attachment

18.3.1.2.1. Static IP address assignment configuration

18.3.1.2.2. Dynamic IP address (DHCP) assignment configuration

18.3.1.2.3. Dynamic IP address assignment configuration with Whereabouts

18.3.1.2.3.1. Dynamic IP address configuration objects

18.3.1.2.3.2. Dynamic IP address assignment configuration that uses Whereabouts

18.3.1.2.3.3. Dynamic IP address assignment that uses Whereabouts with overlapping IP address ranges

18.3.2. Configuring SR-IOV additional network

18.3.3. Assigning an SR-IOV network to a VRF

18.3.3.1. Creating an additional SR-IOV network attachment with the CNI VRF plugin

18.3.4. Runtime configuration for an Ethernet-based SR-IOV attachment

18.3.5. Adding a pod to an additional network

18.3.5.1. Exposing MTU for vfio-pci SR-IOV devices to pod

18.3.6. Configuring parallel node draining during SR-IOV network policy updates

18.3.7. Excluding the SR-IOV network topology for NUMA-aware scheduling

18.3.8. Additional resources

18.4. Configuring an SR-IOV InfiniBand network attachment

18.4.1. InfiniBand device configuration object

18.4.1.1. Creating a configuration for assignment of dual-stack IP addresses dynamically

18.4.1.2. Configuration of IP address assignment for a network attachment

18.4.1.2.1. Static IP address assignment configuration

18.4.1.2.2. Dynamic IP address (DHCP) assignment configuration

18.4.1.2.3. Dynamic IP address assignment configuration with Whereabouts

18.4.1.2.3.1. Dynamic IP address configuration objects

18.4.1.2.3.2. Dynamic IP address assignment configuration that uses Whereabouts

18.4.1.2.3.3. Dynamic IP address assignment that uses Whereabouts with overlapping IP address ranges

18.4.2. Configuring SR-IOV additional network

18.4.3. Runtime configuration for an InfiniBand-based SR-IOV attachment

18.4.4. Adding a pod to an additional network

18.4.4.1. Exposing MTU for vfio-pci SR-IOV devices to pod

18.4.5. Additional resources

18.5. Configuring interface-level network sysctl settings and all-multicast mode for SR-IOV networks

18.5.1. Labeling nodes with an SR-IOV enabled NIC

18.5.2. Setting one sysctl flag

18.5.2.1. Setting one sysctl flag on nodes with SR-IOV network devices

18.5.2.2. Configuring sysctl on a SR-IOV network

18.5.3. Configuring sysctl settings for pods associated with bonded SR-IOV interface flag

18.5.3.1. Setting all sysctl flag on nodes with bonded SR-IOV network devices

18.5.3.2. Configuring sysctl on a bonded SR-IOV network

18.5.4. About all-multicast mode

18.5.4.1. Enabling the all-multicast mode on an SR-IOV network

18.6. Configuring QinQ support for SR-IOV enabled workloads

18.6.1. About 802.1Q-in-802.1Q support

18.6.2. Configuring QinQ support for SR-IOV enabled workloads

18.7. Using high performance multicast

18.7.1. High performance multicast

18.7.2. Configuring an SR-IOV interface for multicast

18.8. Using DPDK and RDMA

18.8.1. Example use of a virtual function in a pod

18.8.2. Using a virtual function in DPDK mode with an Intel NIC

18.8.3. Using a virtual function in DPDK mode with a Mellanox NIC

18.8.4. Using the TAP CNI to run a rootless DPDK workload with kernel access

18.8.5. Overview of achieving a specific DPDK line rate

18.8.6. Using SR-IOV and the Node Tuning Operator to achieve a DPDK line rate