Chapter 8. Hardware networks
8.1. About Single Root I/O Virtualization (SR-IOV) hardware networks
The Single Root I/O Virtualization (SR-IOV) specification is a standard for a type of PCI device assignment that can share a single device with multiple pods.
SR-IOV enables you to segment a compliant network device, recognized on the host node as a physical function (PF), into multiple virtual functions (VFs). The VF is used like any other network device. The SR-IOV device driver for the device determines how the VF is exposed in the container:
- netdevice driver: A regular kernel network device in the netns of the container
- vfio-pci driver: A character device mounted in the container
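As a rough orientation, you can inspect a PF's SR-IOV capability on a node through standard Linux sysfs attributes; this is a minimal sketch, and the interface name ens785f0 is a placeholder taken from the node state example later in this section:
$ PF=ens785f0   # placeholder PF interface name

$ cat /sys/class/net/$PF/device/sriov_totalvfs   # maximum number of VFs the PF supports
$ cat /sys/class/net/$PF/device/sriov_numvfs     # number of VFs currently configured
$ ls -d /sys/class/net/$PF/device/virtfn*        # PCI entries for each configured VF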
You can use SR-IOV network devices with additional networks on your OpenShift Container Platform cluster for applications that require high bandwidth or low latency.
8.1.1. Components that manage SR-IOV network devices
The SR-IOV Network Operator creates and manages the components of the SR-IOV stack. It performs the following functions:
- Orchestrates discovery and management of SR-IOV network devices
- Generates NetworkAttachmentDefinition custom resources for the SR-IOV Container Network Interface (CNI)
- Creates and updates the configuration of the SR-IOV network device plug-in
- Creates node-specific SriovNetworkNodeState custom resources
- Updates the spec.interfaces field in each SriovNetworkNodeState custom resource
The Operator provisions the following components:
- SR-IOV network configuration daemon
- A DaemonSet that is deployed on worker nodes when the SR-IOV Operator starts. The daemon is responsible for discovering and initializing SR-IOV network devices in the cluster.
- SR-IOV Operator webhook
- A dynamic admission controller webhook that validates the Operator custom resource and sets appropriate default values for unset fields.
- SR-IOV Network resources injector
- A dynamic admission controller webhook that provides functionality for patching Kubernetes pod specifications with requests and limits for custom network resources such as SR-IOV VFs.
- SR-IOV network device plug-in
- A device plug-in that discovers, advertises, and allocates SR-IOV network virtual function (VF) resources. Device plug-ins are used in Kubernetes to enable the use of limited resources, typically in physical devices. Device plug-ins give the Kubernetes scheduler awareness of resource availability, so that the scheduler can schedule pods on nodes with sufficient resources.
- SR-IOV CNI plug-in
- A CNI plug-in that attaches VF interfaces allocated from the SR-IOV device plug-in directly into a pod.
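For orientation, after a node policy is applied the VF resources that the device plug-in advertises appear in the node's allocatable resources. This is only an illustration: the resource name openshift.io/intelnics is a placeholder that depends on the resourceName you configure, and the surrounding output varies by node:
$ oc get node <node_name> -o jsonpath='{.status.allocatable}'

{"cpu":"7500m","memory":"30970788Ki","openshift.io/intelnics":"8","pods":"250"}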
The SR-IOV Network resources injector and SR-IOV Network Operator webhook are enabled by default and can be disabled by editing the default SriovOperatorConfig CR.
8.1.1.1. Supported devices
OpenShift Container Platform supports the following Network Interface Card (NIC) models:
- Intel XXV710 25GbE SFP28 with vendor ID 0x8086 and device ID 0x158b
- Mellanox MT27710 Family [ConnectX-4 Lx] 25GbE dual-port SFP28 with vendor ID 0x15b3 and device ID 0x1015
- Mellanox MT27800 Family [ConnectX-5] 25GbE dual-port SFP28 with vendor ID 0x15b3 and device ID 0x1017
- Mellanox MT27800 Family [ConnectX-5] 100GbE with vendor ID 0x15b3 and device ID 0x1017
8.1.1.2. Automated discovery of SR-IOV network devices
The SR-IOV Network Operator searches your cluster for SR-IOV capable network devices on worker nodes. The Operator creates and updates a SriovNetworkNodeState custom resource (CR) for each worker node that provides a compatible SR-IOV network device.
The CR is assigned the same name as the worker node. The status.interfaces list provides information about the network devices on a node.
Do not modify a SriovNetworkNodeState object. The Operator creates and manages these resources automatically.
8.1.1.2.1. Example SriovNetworkNodeState object
The following YAML is an example of an SriovNetworkNodeState object created by the SR-IOV Network Operator:
An SriovNetworkNodeState object
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
  name: node-25 1
  namespace: openshift-sriov-network-operator
  ownerReferences:
  - apiVersion: sriovnetwork.openshift.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: SriovNetworkNodePolicy
    name: default
spec:
  dpConfigVersion: "39824"
status:
  interfaces: 2
  - deviceID: "1017"
    driver: mlx5_core
    mtu: 1500
    name: ens785f0
    pciAddress: "0000:18:00.0"
    totalvfs: 8
    vendor: 15b3
  - deviceID: "1017"
    driver: mlx5_core
    mtu: 1500
    name: ens785f1
    pciAddress: "0000:18:00.1"
    totalvfs: 8
    vendor: 15b3
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens817f0
    pciAddress: 0000:81:00.0
    totalvfs: 64
    vendor: "8086"
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens817f1
    pciAddress: 0000:81:00.1
    totalvfs: 64
    vendor: "8086"
  - deviceID: 158b
    driver: i40e
    mtu: 1500
    name: ens803f0
    pciAddress: 0000:86:00.0
    totalvfs: 64
    vendor: "8086"
  syncStatus: Succeeded
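To view the discovered devices on a specific node yourself, you can retrieve the node state object directly; the node name node-25 here matches the example above:
$ oc get sriovnetworknodestates -n openshift-sriov-network-operator node-25 -o yaml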
8.1.1.3. Example use of a virtual function in a pod
You can run a remote direct memory access (RDMA) or a Data Plane Development Kit (DPDK) application in a pod with an SR-IOV VF attached.
This example shows a pod using a virtual function (VF) in RDMA mode:
Pod spec that uses RDMA mode
apiVersion: v1
kind: Pod
metadata:
  name: rdma-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-rdma-mlnx
spec:
  containers:
  - name: testpmd
    image: <RDMA_image>
    imagePullPolicy: IfNotPresent
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    command: ["sleep", "infinity"]
The following example shows a pod with a VF in DPDK mode:
Pod spec that uses DPDK mode
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-dpdk-net
spec:
  containers:
  - name: testpmd
    image: <DPDK_image>
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "4Gi"
      requests:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
An optional library, app-netutil, is available to help an application running in a container gather network information associated with a pod. See the library's source code in the app-netutil GitHub repository.
This library is intended to ease the integration of SR-IOV VFs in DPDK mode into the container. The library provides both a Go API and a C API, as well as examples of using both languages.
There is also a sample Docker image, dpdk-app-centos, which can run one of the following DPDK sample applications based on an environment variable in the pod spec: l2fwd, l3fwd, or testpmd. This Docker image provides an example of integrating app-netutil into the container image itself. The library can also be integrated into an init container that collects the required data and passes the data to an existing DPDK workload.
8.1.2. Next steps
8.2. Installing the SR-IOV Network Operator
You can install the Single Root I/O Virtualization (SR-IOV) Network Operator on your cluster to manage SR-IOV network devices and network attachments.
8.2.1. Installing SR-IOV Network Operator
As a cluster administrator, you can install the SR-IOV Network Operator by using the OpenShift Container Platform CLI or the web console.
8.2.1.1. CLI: Installing the SR-IOV Network Operator
As a cluster administrator, you can install the Operator using the CLI.
Prerequisites
- A cluster installed on bare-metal hardware with nodes that have hardware that supports SR-IOV.
- Install the OpenShift CLI (oc).
- An account with cluster-admin privileges.
Procedure
To create the openshift-sriov-network-operator namespace, enter the following command:
$ cat << EOF | oc create -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-sriov-network-operator
  labels:
    openshift.io/run-level: "1"
EOF
To create an OperatorGroup CR, enter the following command:
$ cat << EOF | oc create -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
  - openshift-sriov-network-operator
EOF
Subscribe to the SR-IOV Network Operator.
Run the following command to get the OpenShift Container Platform major and minor version. It is required for the channel value in the next step.
$ OC_VERSION=$(oc version -o yaml | grep openshiftVersion | \
    grep -o '[0-9]*[.][0-9]*' | head -1)
To create a Subscription CR for the SR-IOV Network Operator, enter the following command:
$ cat << EOF | oc create -f -
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator-subscription
  namespace: openshift-sriov-network-operator
spec:
  channel: "${OC_VERSION}"
  name: sriov-network-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
To verify that the Operator is installed, enter the following command:
$ oc get csv -n openshift-sriov-network-operator \
    -o custom-columns=Name:.metadata.name,Phase:.status.phase
Example output
Name                                         Phase
sriov-network-operator.4.4.0-202006160135   Succeeded
8.2.1.2. Web console: Installing the SR-IOV Network Operator
As a cluster administrator, you can install the Operator using the web console.
You must create the operator group by using the CLI.
Prerequisites
- A cluster installed on bare-metal hardware with nodes that have hardware that supports SR-IOV.
- Install the OpenShift CLI (oc).
- An account with cluster-admin privileges.
Procedure
Create a namespace for the SR-IOV Network Operator:
- In the OpenShift Container Platform web console, click Administration → Namespaces.
- Click Create Namespace.
- In the Name field, enter openshift-sriov-network-operator, and then click Create.
- In the Filter by name field, enter openshift-sriov-network-operator.
- From the list of results, click openshift-sriov-network-operator, and then click YAML.
- Update the namespace by adding the following stanza to the namespace definition:
  labels:
    openshift.io/run-level: "1"
- Click Save.
Install the SR-IOV Network Operator:
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Select SR-IOV Network Operator from the list of available Operators, and then click Install.
- On the Create Operator Subscription page, under A specific namespace on the cluster, select openshift-sriov-network-operator.
- Click Subscribe.
Verify that the SR-IOV Network Operator is installed successfully:
- Navigate to the Operators → Installed Operators page.
- Ensure that SR-IOV Network Operator is listed in the openshift-sriov-network-operator project with a Status of InstallSucceeded.
Note: During installation an Operator might display a Failed status. If the installation later succeeds with an InstallSucceeded message, you can ignore the Failed message.
If the operator does not appear as installed, to troubleshoot further:
- Inspect the Operator Subscriptions and Install Plans tabs for any failure or errors under Status.
- Navigate to the Workloads → Pods page and check the logs for pods in the openshift-sriov-network-operator project.
8.2.2. Next steps
- Optional: Configuring the SR-IOV Network Operator
8.3. Configuring the SR-IOV Network Operator
The Single Root I/O Virtualization (SR-IOV) Network Operator manages the SR-IOV network devices and network attachments in your cluster.
8.3.1. Configuring the SR-IOV Network Operator
Modifying the SR-IOV Network Operator configuration is not normally necessary. The default configuration is recommended for most use cases. Complete the steps to modify the relevant configuration only if the default behavior of the Operator is not compatible with your use case.
The SR-IOV Network Operator adds the SriovOperatorConfig.sriovnetwork.openshift.io
CustomResourceDefinition resource. The operator automatically creates a SriovOperatorConfig custom resource (CR) named default
in the openshift-sriov-network-operator
namespace.
The default
CR contains the SR-IOV Network Operator configuration for your cluster. To change the operator configuration, you must modify this CR.
The SriovOperatorConfig
object provides several fields for configuring the operator:
- enableInjector allows project administrators to enable or disable the Network Resources Injector daemon set.
- enableOperatorWebhook allows project administrators to enable or disable the Operator Admission Controller webhook daemon set.
- configDaemonNodeSelector allows project administrators to schedule the SR-IOV Network Config daemon on selected nodes.
8.3.1.1. About the Network Resources Injector
The Network Resources Injector is a Kubernetes Dynamic Admission Controller application. It provides the following capabilities:
- Mutation of resource requests and limits in a Pod specification to add an SR-IOV resource name according to an SR-IOV network attachment definition annotation.
- Mutation of a Pod specification with a downward API volume to expose pod annotations and labels to the running container as files under the /etc/podnetinfo path.
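As a rough illustration, a container in a pod that the injector has mutated can list and read those files directly. The file names annotations and labels shown below are an assumption based on the conventional downward API layout, not a documented guarantee:
$ oc exec <pod_name> -- ls /etc/podnetinfo
annotations
labels

$ oc exec <pod_name> -- cat /etc/podnetinfo/annotations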
By default the Network Resources Injector is enabled by the SR-IOV operator and runs as a daemon set on all master nodes. The following is an example of Network Resources Injector pods running in a cluster with three master nodes:
$ oc get pods -n openshift-sriov-network-operator
Example output
NAME                               READY   STATUS    RESTARTS   AGE
network-resources-injector-5cz5p   1/1     Running   0          10m
network-resources-injector-dwqpx   1/1     Running   0          10m
network-resources-injector-lktz5   1/1     Running   0          10m
8.3.1.2. About the SR-IOV Operator admission controller webhook
The SR-IOV Operator Admission Controller webhook is a Kubernetes Dynamic Admission Controller application. It provides the following capabilities:
- Validation of the SriovNetworkNodePolicy CR when it is created or updated.
- Mutation of the SriovNetworkNodePolicy CR by setting the default value for the priority and deviceType fields when the CR is created or updated.
By default the SR-IOV Operator Admission Controller webhook is enabled by the operator and runs as a daemon set on all master nodes. The following is an example of the Operator Admission Controller webhook pods running in a cluster with three master nodes:
$ oc get pods -n openshift-sriov-network-operator
Example output
NAME                     READY   STATUS    RESTARTS   AGE
operator-webhook-9jkw6   1/1     Running   0          16m
operator-webhook-kbr5p   1/1     Running   0          16m
operator-webhook-rpfrl   1/1     Running   0          16m
8.3.1.3. About custom node selectors
The SR-IOV Network Config daemon discovers and configures the SR-IOV network devices on cluster nodes. By default, it is deployed to all the worker
nodes in the cluster. You can use node labels to specify on which nodes the SR-IOV Network Config daemon runs.
8.3.1.4. Disabling or enabling the Network Resources Injector
To disable or enable the Network Resources Injector, which is enabled by default, complete the following procedure.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
- You must have installed the SR-IOV Operator.
Procedure
Set the enableInjector field. Replace <value> with false to disable the feature or true to enable the feature.
$ oc patch sriovoperatorconfig default \
    --type=merge -n openshift-sriov-network-operator \
    --patch '{ "spec": { "enableInjector": <value> } }'
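To confirm the result, assuming the injector pods keep the network-resources-injector name prefix shown in the earlier example output, you can check whether the daemon set pods are present or gone after patching:
$ oc get pods -n openshift-sriov-network-operator | grep network-resources-injector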
8.3.1.5. Disabling or enabling the SR-IOV Operator admission controller webhook
To disable or enable the admission controller webhook, which is enabled by default, complete the following procedure.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
- You must have installed the SR-IOV Operator.
Procedure
Set the enableOperatorWebhook field. Replace <value> with false to disable the feature or true to enable it:
$ oc patch sriovoperatorconfig default --type=merge \
    -n openshift-sriov-network-operator \
    --patch '{ "spec": { "enableOperatorWebhook": <value> } }'
8.3.1.6. Configuring a custom NodeSelector for the SR-IOV Network Config daemon
The SR-IOV Network Config daemon discovers and configures the SR-IOV network devices on cluster nodes. By default, it is deployed to all the worker
nodes in the cluster. You can use node labels to specify on which nodes the SR-IOV Network Config daemon runs.
To specify the nodes where the SR-IOV Network Config daemon is deployed, complete the following procedure.
When you update the configDaemonNodeSelector
field, the SR-IOV Network Config daemon is recreated on each selected node. While the daemon is recreated, cluster users are unable to apply any new SR-IOV Network node policy or create new SR-IOV pods.
Procedure
To update the node selector for the operator, enter the following command:
$ oc patch sriovoperatorconfig default --type=json \
    -n openshift-sriov-network-operator \
    --patch '[{
        "op": "replace",
        "path": "/spec/configDaemonNodeSelector",
        "value": {<node-label>}
      }]'
Replace <node-label> with a label to apply, as in the following example: "node-role.kubernetes.io/worker": "".
8.3.2. Next steps
8.4. Configuring an SR-IOV network device
You can configure a Single Root I/O Virtualization (SR-IOV) device in your cluster.
8.4.1. SR-IOV network node configuration object
You specify the SR-IOV network device configuration for a node by defining an SriovNetworkNodePolicy
object. The object is part of the sriovnetwork.openshift.io
API group.
The following YAML describes an SriovNetworkNodePolicy
object:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: <name> 1
  namespace: openshift-sriov-network-operator 2
spec:
  resourceName: <sriov_resource_name> 3
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true" 4
  priority: <priority> 5
  mtu: <mtu> 6
  numVfs: <num> 7
  nicSelector: 8
    vendor: "<vendor_code>" 9
    deviceID: "<device_id>" 10
    pfNames: ["<pf_name>", ...] 11
    rootDevices: ["<pci_bus_id>", "..."] 12
  deviceType: <device_type> 13
  isRdma: false 14
- 1
- The name for the CR object.
- 2
- The namespace where the SR-IOV Operator is installed.
- 3
- The resource name of the SR-IOV device plug-in. You can create multiple SriovNetworkNodePolicy objects for a resource name.
- 4
- The node selector to select which nodes are configured. Only SR-IOV network devices on selected nodes are configured. The SR-IOV Container Network Interface (CNI) plug-in and device plug-in are deployed on only selected nodes.
- 5
- Optional: An integer value between 0 and 99. A smaller number gets higher priority, so a priority of 10 is higher than a priority of 99. The default value is 99.
- 6
- Optional: The maximum transmission unit (MTU) of the virtual function. The maximum MTU value can vary for different NIC models.
- 7
- The number of the virtual functions (VF) to create for the SR-IOV physical network device. For an Intel Network Interface Card (NIC), the number of VFs cannot be larger than the total VFs supported by the device. For a Mellanox NIC, the number of VFs cannot be larger than 128.
- 8
- The nicSelector mapping selects the device for the Operator to configure. You do not have to specify values for all the parameters. It is recommended to identify the network device with enough precision to avoid selecting a device unintentionally. If you specify rootDevices, you must also specify a value for vendor, deviceID, or pfNames. If you specify both pfNames and rootDevices at the same time, ensure that they point to the same device.
- 9
- Optional: The vendor hex code of the SR-IOV network device. The only allowed values are 8086 and 15b3.
- 10
- Optional: The device hex code of the SR-IOV network device. The only allowed values are 158b, 1015, and 1017.
- 11
- Optional: An array of one or more physical function (PF) names for the device.
- 12
- An array of one or more PCI bus addresses for the PF of the device. Provide the address in the following format: 0000:02:00.1.
- 13
- Optional: The driver type for the virtual functions. The only allowed values are netdevice and vfio-pci. The default value is netdevice.
Note: For a Mellanox card to work in Data Plane Development Kit (DPDK) mode on bare metal nodes, use the netdevice driver type and set isRdma to true.
- 14
- Optional: Whether to enable remote direct memory access (RDMA) mode. The default value is false.
Note: If the isRdma parameter is set to true, you can continue to use the RDMA enabled VF as a normal network device. A device can be used in either mode.
8.4.1.1. Virtual function (VF) partitioning for SR-IOV devices
In some cases, you might want to split virtual functions (VFs) from the same physical function (PF) into multiple resource pools. For example, you might want some of the VFs to load with the default driver and the remaining VFs to load with the vfio-pci driver. In such a deployment, the pfNames selector in your SriovNetworkNodePolicy custom resource (CR) can be used to specify a range of VFs for a pool using the following format: <pfname>#<first_vf>-<last_vf>.
For example, the following YAML shows the selector for an interface named netpf0 with VF 2 through 7:
pfNames: ["netpf0#2-7"]
- netpf0 is the PF interface name.
- 2 is the first VF index (0-based) that is included in the range.
- 7 is the last VF index (0-based) that is included in the range.
You can select VFs from the same PF by using different policy CRs if the following requirements are met:
- The numVfs value must be identical for policies that select the same PF.
- The VF index must be in the range of 0 to <numVfs>-1. For example, if you have a policy with numVfs set to 8, then the <first_vf> value must not be smaller than 0, and the <last_vf> must not be larger than 7.
- The VF ranges in different policies must not overlap.
- The <first_vf> must not be larger than the <last_vf>.
The following example illustrates NIC partitioning for an SR-IOV device.
The policy policy-net-1 defines a resource pool net-1 that contains the VF 0 of PF netpf0 with the default VF driver. The policy policy-net-1-dpdk defines a resource pool net-1-dpdk that contains VFs 8 to 15 of PF netpf0 with the vfio VF driver.
Policy policy-net-1:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-net-1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 16
  nicSelector:
    pfNames: ["netpf0#0-0"]
  deviceType: netdevice
Policy policy-net-1-dpdk:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-net-1-dpdk
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1dpdk
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 16
  nicSelector:
    pfNames: ["netpf0#8-15"]
  deviceType: vfio-pci
8.4.2. Configuring SR-IOV network devices
The SR-IOV Network Operator adds the SriovNetworkNodePolicy.sriovnetwork.openshift.io
CustomResourceDefinition to OpenShift Container Platform. You can configure an SR-IOV network device by creating a SriovNetworkNodePolicy custom resource (CR).
When applying the configuration specified in a SriovNetworkNodePolicy
object, the SR-IOV Operator might drain the nodes, and in some cases, reboot nodes.
It might take several minutes for a configuration change to apply.
Prerequisites
- You installed the OpenShift CLI (oc).
- You have access to the cluster as a user with the cluster-admin role.
- You have installed the SR-IOV Network Operator.
- You have enough available nodes in your cluster to handle the evicted workload from drained nodes.
- You have not selected any control plane nodes for SR-IOV network device configuration.
Procedure
- Create an SriovNetworkNodePolicy object, and then save the YAML in the <name>-sriov-node-network.yaml file. Replace <name> with the name for this configuration.
- Create the SriovNetworkNodePolicy CR:
  $ oc create -f <name>-sriov-node-network.yaml
  where <name> specifies the name for this configuration.
  After applying the configuration update, all the pods in the openshift-sriov-network-operator namespace transition to the Running status.
- To verify that the SR-IOV network device is configured, enter the following command. Replace <node_name> with the name of a node with the SR-IOV network device that you just configured.
  $ oc get sriovnetworknodestates -n openshift-sriov-network-operator <node_name> -o jsonpath='{.status.syncStatus}'
8.4.3. Next steps
8.5. Configuring an SR-IOV Ethernet network attachment
You can configure an Ethernet network attachment for a Single Root I/O Virtualization (SR-IOV) device in the cluster.
8.5.1. Ethernet device configuration object
You can configure an Ethernet network device by defining an SriovNetwork
object.
The following YAML describes an SriovNetwork
object:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: <name> 1
  namespace: openshift-sriov-network-operator 2
spec:
  resourceName: <sriov_resource_name> 3
  networkNamespace: <target_namespace> 4
  vlan: <vlan> 5
  spoofChk: "<spoof_check>" 6
  ipam: |- 7
    {}
  linkState: <link_state> 8
  maxTxRate: <max_tx_rate> 9
  minTxRate: <min_tx_rate> 10
  vlanQoS: <vlan_qos> 11
  trust: "<trust_vf>" 12
  capabilities: <capabilities> 13
- 1
- A name for the object. The SR-IOV Network Operator creates a NetworkAttachmentDefinition object with the same name.
- 2
- The namespace where the SR-IOV Network Operator is installed.
- 3
- The value for the spec.resourceName parameter from the SriovNetworkNodePolicy object that defines the SR-IOV hardware for this additional network.
- 4
- The target namespace for the SriovNetwork object. Only pods in the target namespace can attach to the additional network.
- 5
- Optional: A Virtual LAN (VLAN) ID for the additional network. The integer value must be from 0 to 4095. The default value is 0.
- 6
- Optional: The spoof check mode of the VF. The allowed values are the strings "on" and "off".
Important: You must enclose the value you specify in quotes or the object is rejected by the SR-IOV Network Operator.
- 7
- A configuration object for the IPAM CNI plug-in as a YAML block scalar. The plug-in manages IP address assignment for the attachment definition.
- 8
- Optional: The link state of the virtual function (VF). The allowed values are enable, disable, and auto.
- 9
- Optional: A maximum transmission rate, in Mbps, for the VF.
- 10
- Optional: A minimum transmission rate, in Mbps, for the VF. This value must be less than or equal to the maximum transmission rate.
Note: Intel NICs do not support the minTxRate parameter. For more information, see BZ#1772847.
- 11
- Optional: An IEEE 802.1p priority level for the VF. The default value is 0.
- 12
- Optional: The trust mode of the VF. The allowed values are the strings "on" and "off".
Important: You must enclose the value that you specify in quotes, or the SR-IOV Network Operator rejects the object.
- 13
- Optional: The capabilities to configure for this additional network. You can specify "{ "ips": true }" to enable IP address support or "{ "mac": true }" to enable MAC address support.
8.5.1.1. Configuration for ipam CNI plug-in
The ipam Container Network Interface (CNI) plug-in provides IP address management (IPAM) for other CNI plug-ins. You can configure ipam for either static IP address assignment or dynamic IP address assignment by using DHCP. The DHCP server you specify must be reachable from the additional network.
The following JSON configuration object describes the parameters that you can set.
8.5.1.1.1. Static IP address assignment configuration
The following JSON describes the configuration for static IP address assignment:
Static assignment configuration
{ "ipam": { "type": "static", "addresses": [ 1 { "address": "<address>", 2 "gateway": "<gateway>" 3 } ], "routes": [ 4 { "dst": "<dst>", 5 "gw": "<gw>" 6 } ], "dns": { 7 "nameservers": ["<nameserver>"], 8 "domain": "<domain>", 9 "search": ["<search_domain>"] 10 } } }
- 1
- An array describing IP addresses to assign to the virtual interface. Both IPv4 and IPv6 IP addresses are supported.
- 2
- An IP address and network prefix that you specify. For example, if you specify 10.10.21.10/24, then the additional network is assigned an IP address of 10.10.21.10 and the netmask is 255.255.255.0.
- 3
- The default gateway to route egress network traffic to.
- 4
- An array describing routes to configure inside the pod.
- 5
- The IP address range in CIDR format, such as 192.168.17.0/24, or 0.0.0.0/0 for the default route.
- 6
- The gateway where network traffic is routed.
- 7
- Optional: DNS configuration.
- 8
- An array of one or more IP addresses to send DNS queries to.
- 9
- The default domain to append to a host name. For example, if the domain is set to example.com, a DNS lookup query for example-host is rewritten as example-host.example.com.
- 10
- An array of domain names to append to an unqualified host name, such as example-host, during a DNS lookup query.
8.5.1.1.2. Dynamic IP address assignment configuration
The following JSON describes the configuration for dynamic IP address assignment with DHCP.
A pod obtains its original DHCP lease when it is created. The lease must be periodically renewed by a minimal DHCP server deployment running on the cluster.
The SR-IOV Network Operator does not create a DHCP server deployment; the Cluster Network Operator is responsible for creating the minimal DHCP server deployment.
To trigger the deployment of the DHCP server, you must create a shim network attachment by editing the Cluster Network Operator configuration, as in the following example:
Example shim network attachment definition
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  ...
  additionalNetworks:
  - name: dhcp-shim
    namespace: default
    rawCNIConfig: |-
      {
        "name": "dhcp-shim",
        "cniVersion": "0.3.1",
        "type": "bridge",
        "master": "ens5",
        "ipam": {
          "type": "dhcp"
        }
      }
DHCP assignment configuration
{ "ipam": { "type": "dhcp" } }
8.5.1.1.3. Static IP address assignment configuration example
You can configure ipam for static IP address assignment:
{ "ipam": { "type": "static", "addresses": [ { "address": "191.168.1.7" } ] } }
8.5.1.1.4. Dynamic IP address assignment configuration example using DHCP
You can configure ipam for DHCP:
{ "ipam": { "type": "dhcp" } }
8.5.2. Configuring SR-IOV additional network
You can configure an additional network that uses SR-IOV hardware by creating a SriovNetwork
object. When you create a SriovNetwork
object, the SR-IOV Operator automatically creates a NetworkAttachmentDefinition
object.
Do not modify or delete a SriovNetwork
object if it is attached to any pods in the running
state.
Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
Procedure
Create a SriovNetwork object, and then save the YAML in the <name>.yaml file, where <name> is a name for this additional network. The object specification might resemble the following example:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: attach1
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1
  networkNamespace: project2
  ipam: |-
    {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "rangeStart": "10.56.217.171",
      "rangeEnd": "10.56.217.181",
      "gateway": "10.56.217.1"
    }
To create the object, enter the following command:
$ oc create -f <name>.yaml
where <name> specifies the name of the additional network.
Optional: To confirm that the NetworkAttachmentDefinition object that is associated with the SriovNetwork object that you created in the previous step exists, enter the following command. Replace <namespace> with the networkNamespace you specified in the SriovNetwork object.
$ oc get net-attach-def -n <namespace>
8.5.3. Next steps
8.5.4. Additional resources
8.6. Adding a pod to an SR-IOV additional network
You can add a pod to an existing Single Root I/O Virtualization (SR-IOV) network.
8.6.1. Runtime configuration for a network attachment
When attaching a pod to an additional network, you can specify a runtime configuration to make specific customizations for the pod. For example, you can request a specific MAC hardware address.
You specify the runtime configuration by setting an annotation in the pod specification. The annotation key is k8s.v1.cni.cncf.io/networks
, and it accepts a JSON object that describes the runtime configuration.
8.6.1.1. Runtime configuration for an Ethernet-based SR-IOV attachment
The following JSON describes the runtime configuration options for an Ethernet-based SR-IOV network attachment.
[ { "name": "<name>", 1 "mac": "<mac_address>", 2 "ips": ["<cidr_range>"] 3 } ]
- 1
- The name of the SR-IOV network attachment definition CR.
- 2
- Optional: The MAC address for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. To use this feature, you also must specify { "mac": true } in the SriovNetwork object.
- 3
- Optional: IP addresses for the SR-IOV device that is allocated from the resource type defined in the SR-IOV network attachment definition CR. Both IPv4 and IPv6 addresses are supported. To use this feature, you also must specify { "ips": true } in the SriovNetwork object.
Example runtime configuration
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "net1",
          "mac": "20:04:0f:f1:88:01",
          "ips": ["192.168.10.1/24", "2001::1/64"]
        }
      ]
spec:
  containers:
  - name: sample-container
    image: <image>
    imagePullPolicy: IfNotPresent
    command: ["sleep", "infinity"]
8.6.2. Adding a pod to an additional network
You can add a pod to an additional network. The pod continues to send normal cluster-related network traffic over the default network.
When a pod is created, additional networks are attached to it. However, if a pod already exists, you cannot attach additional networks to it.
The pod must be in the same namespace as the additional network.
If a network attachment is managed by the SR-IOV Network Operator, the SR-IOV Network Resource Injector adds the resource
field to the Pod
object automatically.
When specifying an SR-IOV hardware network for a Deployment
object or a ReplicationController
object, you must specify the namespace of the NetworkAttachmentDefinition
object. For more information, see the following bugs: BZ#1846333 and BZ#1840962.
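For illustration only, with the injector enabled a pod that requests an SR-IOV attachment ends up with a device resource similar to the following; the resource name openshift.io/net1 is a placeholder that depends on the resourceName defined in your SriovNetworkNodePolicy:
spec:
  containers:
  - name: sample-container
    resources:
      requests:
        openshift.io/net1: "1"
      limits:
        openshift.io/net1: "1"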
Prerequisites
- Install the OpenShift CLI (oc).
- Log in to the cluster.
- Install the SR-IOV Operator.
- Create an SriovNetwork object to attach the pod to.
Procedure
Add an annotation to the Pod object. Only one of the following annotation formats can be used:
To attach an additional network without any customization, add an annotation with the following format. Replace <network> with the name of the additional network to associate with the pod:
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: <network>[,<network>,...] 1
- 1
- To specify more than one additional network, separate each network with a comma. Do not include whitespace around the commas. If you specify the same additional network multiple times, that pod will have multiple network interfaces attached to that network.
To attach an additional network with customizations, add an annotation with the following format:
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |-
      [
        {
          "name": "<network>", 1
          "namespace": "<namespace>", 2
          "default-route": ["<default-route>"] 3
        }
      ]
To create the pod, enter the following command. Replace <name> with the name of the pod.
$ oc create -f <name>.yaml
Optional: To confirm that the annotation exists in the Pod CR, enter the following command, replacing <name> with the name of the pod.
$ oc get pod <name> -o yaml
In the following example, the example-pod pod is attached to the net1 additional network:
$ oc get pod example-pod -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-bridge
    k8s.v1.cni.cncf.io/networks-status: |- 1
      [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
              "10.128.2.14"
          ],
          "default": true,
          "dns": {}
      },{
          "name": "macvlan-bridge",
          "interface": "net1",
          "ips": [
              "20.2.2.100"
          ],
          "mac": "22:2f:60:a5:f8:00",
          "dns": {}
      }]
  name: example-pod
  namespace: default
spec:
  ...
status:
  ...
- 1
- The k8s.v1.cni.cncf.io/networks-status parameter is a JSON array of objects. Each object describes the status of an additional network attached to the pod. The annotation value is stored as a plain text value.
8.6.3. Creating a non-uniform memory access (NUMA) aligned SR-IOV pod
You can create a NUMA aligned SR-IOV pod by restricting the SR-IOV and CPU resources allocated to the pod to the same NUMA node with the restricted or single-numa-node Topology Manager policies.
Prerequisites
-
Install the OpenShift CLI (
oc
). -
Enable a LatencySensitive profile and configure the CPU Manager policy to
static
.
Procedure
Create the following SR-IOV pod spec, and then save the YAML in the <name>-sriov-pod.yaml file. Replace <name> with a name for this pod.
The following example shows an SR-IOV pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: <name> 1
spec:
  containers:
  - name: sample-container
    image: <image> 2
    command: ["sleep", "infinity"]
    resources:
      limits:
        memory: "1Gi" 3
        cpu: "2" 4
      requests:
        memory: "1Gi"
        cpu: "2"
- 1
- Replace <name> with the name of the SR-IOV network attachment definition CR.
- 2
- Replace <image> with the name of the sample-pod image.
- 3
- To create the SR-IOV pod with guaranteed QoS, set memory limits equal to memory requests.
- 4
- To create the SR-IOV pod with guaranteed QoS, set cpu limits equal to cpu requests.
Create the sample SR-IOV pod by running the following command:
$ oc create -f <filename> 1
- 1
- Replace <filename> with the name of the file you created in the previous step.
Confirm that the sample-pod is configured with guaranteed QoS.
$ oc describe pod sample-pod
Confirm that the sample-pod is allocated with exclusive CPUs.
$ oc exec sample-pod -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
Confirm that the SR-IOV device and CPUs that are allocated for the sample-pod are on the same NUMA node.
$ oc exec sample-pod -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
8.6.4. Additional resources
8.7. Using high performance multicast
You can use multicast on your Single Root I/O Virtualization (SR-IOV) hardware network.
8.7.1. Configuring high performance multicast
The OpenShift SDN default Container Network Interface (CNI) network provider supports multicast between pods on the default network. This is best used for low-bandwidth coordination or service discovery, and not high-bandwidth applications. For applications such as streaming media, like Internet Protocol television (IPTV) and multipoint videoconferencing, you can utilize Single Root I/O Virtualization (SR-IOV) hardware to provide near-native performance.
When using additional SR-IOV interfaces for multicast:
- Multicast packets must be sent or received by a pod through the additional SR-IOV interface.
- The physical network that connects the SR-IOV interfaces determines the multicast routing and topology, which are not controlled by OpenShift Container Platform.
8.7.2. Using an SR-IOV interface for multicast
The following procedure creates an example SR-IOV interface for multicast.
Prerequisites
- Install the OpenShift CLI (oc).
- You must log in to the cluster with a user that has the cluster-admin role.
Procedure
Create a SriovNetworkNodePolicy object:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-example
  namespace: openshift-sriov-network-operator
spec:
  resourceName: example
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  nicSelector:
    vendor: "8086"
    pfNames: ['ens803f0']
    rootDevices: ['0000:86:00.0']
Create a SriovNetwork object:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: net-example
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: default
  ipam: | 1
    {
      "type": "host-local", 2
      "subnet": "10.56.217.0/24",
      "rangeStart": "10.56.217.171",
      "rangeEnd": "10.56.217.181",
      "routes": [
        {"dst": "224.0.0.0/5"},
        {"dst": "232.0.0.0/5"}
      ],
      "gateway": "10.56.217.1"
    }
  resourceName: example
Create a pod with a multicast application:
apiVersion: v1
kind: Pod
metadata:
  name: testpmd
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: nic1
spec:
  containers:
  - name: example
    image: rhel7:latest
    securityContext:
      capabilities:
        add: ["NET_ADMIN"] 1
    command: [ "sleep", "infinity"]
- 1
- The NET_ADMIN capability is required only if your application needs to assign the multicast IP address to the SR-IOV interface. Otherwise, it can be omitted.
8.8. Using virtual functions (VFs) with DPDK and RDMA modes
You can use Single Root I/O Virtualization (SR-IOV) network hardware with the Data Plane Development Kit (DPDK) and with remote direct memory access (RDMA).
8.8.1. Examples of using virtual functions in DPDK and RDMA modes
The Data Plane Development Kit (DPDK) is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see https://access.redhat.com/support/offerings/techpreview/.
Remote Direct Memory Access (RDMA) is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see https://access.redhat.com/support/offerings/techpreview/.
8.8.2. Prerequisites
- Install the OpenShift CLI (oc).
- Log in as a user with cluster-admin privileges.
- You must have installed the SR-IOV Network Operator.
8.8.3. Example use of virtual function (VF) in DPDK mode with Intel NICs
Procedure
Create the following SriovNetworkNodePolicy object, and then save the YAML in the intel-dpdk-node-policy.yaml file.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: intel-dpdk-node-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: intelnics
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  priority: <priority>
  numVfs: <num>
  nicSelector:
    vendor: "8086"
    deviceID: "158b"
    pfNames: ["<pf_name>", ...]
    rootDevices: ["<pci_bus_id>", "..."]
  deviceType: vfio-pci 1
- 1
- Set the driver type for the virtual functions to vfio-pci.
Note: See the Configuring SR-IOV network devices section for a detailed explanation of each option in SriovNetworkNodePolicy.
When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator might drain the nodes, and in some cases, reboot nodes. It might take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.
After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace change to a Running status.
Create the SriovNetworkNodePolicy object by running the following command:
$ oc create -f intel-dpdk-node-policy.yaml
Create the following SriovNetwork object, and then save the YAML in the intel-dpdk-network.yaml file.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: intel-dpdk-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: <target_namespace>
  ipam: "{}" 1
  vlan: <vlan>
  resourceName: intelnics
- 1
- Specify an empty object "{}" for the ipam CNI plug-in. DPDK works in userspace mode and does not require an IP address.
Note: See the Configuring SR-IOV additional network section for a detailed explanation of each option in SriovNetwork.
Create the SriovNetwork object by running the following command:
$ oc create -f intel-dpdk-network.yaml
Create the following Pod spec, and then save the YAML in the intel-dpdk-pod.yaml file.
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  namespace: <target_namespace> 1
  annotations:
    k8s.v1.cni.cncf.io/networks: intel-dpdk-network
spec:
  containers:
  - name: testpmd
    image: <DPDK_image> 2
    securityContext:
      capabilities:
        add: ["IPC_LOCK"] 3
    volumeMounts:
    - mountPath: /dev/hugepages 4
      name: hugepage
    resources:
      limits:
        openshift.io/intelnics: "1" 5
        memory: "1Gi"
        cpu: "4" 6
        hugepages-1Gi: "4Gi" 7
      requests:
        openshift.io/intelnics: "1"
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
- 1
- Specify the same target_namespace where the SriovNetwork object intel-dpdk-network is created. If you would like to create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetwork object.
- 2
- Specify the DPDK image which includes your application and the DPDK library used by the application.
- 3
- Specify the IPC_LOCK capability which is required by the application to allocate hugepage memory inside the container.
- 4
- Mount a hugepage volume to the DPDK pod under /dev/hugepages. The hugepage volume is backed by the emptyDir volume type with the medium being Hugepages.
- 5
- Optional: Specify the number of DPDK devices allocated to the DPDK pod. This resource request and limit, if not explicitly specified, is automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by the SR-IOV Operator. It is enabled by default and can be disabled by setting the enableInjector option to false in the default SriovOperatorConfig CR.
- 6
- Specify the number of CPUs. The DPDK pod usually requires exclusive CPUs to be allocated from the kubelet. This is achieved by setting the CPU Manager policy to static and creating a pod with Guaranteed QoS.
- 7
- Specify the hugepage size hugepages-1Gi or hugepages-2Mi and the quantity of hugepages that will be allocated to the DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepages requires adding kernel arguments to nodes. For example, adding the kernel arguments default_hugepagesz=1GB, hugepagesz=1G and hugepages=16 results in 16*1Gi hugepages being allocated during system boot.
Create the DPDK pod by running the following command:
$ oc create -f intel-dpdk-pod.yaml
8.8.4. Example use of a virtual function in DPDK mode with Mellanox NICs
Procedure
Create the following SriovNetworkNodePolicy object, and then save the YAML in the mlx-dpdk-node-policy.yaml file.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx-dpdk-node-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: mlxnics
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  priority: <priority>
  numVfs: <num>
  nicSelector:
    vendor: "15b3"
    deviceID: "1015" 1
    pfNames: ["<pf_name>", ...]
    rootDevices: ["<pci_bus_id>", "..."]
  deviceType: netdevice 2
  isRdma: true 3
- 1
- Specify the device hex code of the SR-IOV network device. The only allowed values for Mellanox cards are 1015 and 1017.
- 2
- Set the driver type for the virtual functions to netdevice. A Mellanox SR-IOV VF can work in DPDK mode without using the vfio-pci device type. The VF device appears as a kernel network interface inside a container.
- 3
- Enable RDMA mode. This is required by Mellanox cards to work in DPDK mode.
Note: See the Configuring SR-IOV network devices section for a detailed explanation of each option in SriovNetworkNodePolicy.
When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator might drain the nodes, and in some cases, reboot nodes. It might take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.
After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace change to a Running status.
Create the SriovNetworkNodePolicy object by running the following command:
$ oc create -f mlx-dpdk-node-policy.yaml
Create the following SriovNetwork object, and then save the YAML in the mlx-dpdk-network.yaml file.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: mlx-dpdk-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: <target_namespace>
  ipam: |- 1
    ...
  vlan: <vlan>
  resourceName: mlxnics
- 1
- Specify a configuration object for the ipam CNI plug-in as a YAML block scalar. The plug-in manages IP address assignment for the attachment definition.
Note: See the Configuring SR-IOV additional network section for a detailed explanation of each option in SriovNetwork.
Create the SriovNetwork object by running the following command:
$ oc create -f mlx-dpdk-network.yaml
Create the following Pod spec, and then save the YAML in the mlx-dpdk-pod.yaml file.
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  namespace: <target_namespace> 1
  annotations:
    k8s.v1.cni.cncf.io/networks: mlx-dpdk-network
spec:
  containers:
  - name: testpmd
    image: <DPDK_image> 2
    securityContext:
      capabilities:
        add: ["IPC_LOCK","NET_RAW"] 3
    volumeMounts:
    - mountPath: /dev/hugepages 4
      name: hugepage
    resources:
      limits:
        openshift.io/mlxnics: "1" 5
        memory: "1Gi"
        cpu: "4" 6
        hugepages-1Gi: "4Gi" 7
      requests:
        openshift.io/mlxnics: "1"
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
- 1
- Specify the same target_namespace where the SriovNetwork object mlx-dpdk-network is created. If you would like to create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetwork object.
- 2
- Specify the DPDK image which includes your application and the DPDK library used by the application.
- 3
- Specify the IPC_LOCK capability which is required by the application to allocate hugepage memory inside the container and NET_RAW for the application to access the network interface.
- 4
- Mount the hugepage volume to the DPDK pod under /dev/hugepages. The hugepage volume is backed by the emptyDir volume type with the medium being Hugepages.
- 5
- Optional: Specify the number of DPDK devices allocated to the DPDK pod. This resource request and limit, if not explicitly specified, is automatically added by the SR-IOV network resource injector. The SR-IOV network resource injector is an admission controller component managed by the SR-IOV Operator. It is enabled by default and can be disabled by setting the enableInjector option to false in the default SriovOperatorConfig CR.
- 6
- Specify the number of CPUs. The DPDK pod usually requires exclusive CPUs to be allocated from the kubelet. This is achieved by setting the CPU Manager policy to static and creating a pod with Guaranteed QoS.
- 7
- Specify the hugepage size hugepages-1Gi or hugepages-2Mi and the quantity of hugepages that will be allocated to the DPDK pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepages requires adding kernel arguments to nodes.
Create the DPDK pod by running the following command:
$ oc create -f mlx-dpdk-pod.yaml
8.8.5. Example of a virtual function in RDMA mode with Mellanox NICs
RDMA over Converged Ethernet (RoCE) is the only supported mode when using RDMA on OpenShift Container Platform.
Procedure
Create the following SriovNetworkNodePolicy object, and then save the YAML in the mlx-rdma-node-policy.yaml file.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx-rdma-node-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: mlxnics
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  priority: <priority>
  numVfs: <num>
  nicSelector:
    vendor: "15b3"
    deviceID: "1015" 1
    pfNames: ["<pf_name>", ...]
    rootDevices: ["<pci_bus_id>", "..."]
  deviceType: netdevice 2
  isRdma: true 3
Note: See the Configuring SR-IOV network devices section for a detailed explanation of each option in SriovNetworkNodePolicy.
When applying the configuration specified in a SriovNetworkNodePolicy object, the SR-IOV Operator might drain the nodes, and in some cases, reboot nodes. It might take several minutes for a configuration change to apply. Ensure that there are enough available nodes in your cluster to handle the evicted workload beforehand.
After the configuration update is applied, all the pods in the openshift-sriov-network-operator namespace change to a Running status.
Create the SriovNetworkNodePolicy object by running the following command:
$ oc create -f mlx-rdma-node-policy.yaml
Create the following SriovNetwork object, and then save the YAML in the mlx-rdma-network.yaml file.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: mlx-rdma-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: <target_namespace>
  ipam: |- 1
    ...
  vlan: <vlan>
  resourceName: mlxnics
- 1
- Specify a configuration object for the ipam CNI plug-in as a YAML block scalar. The plug-in manages IP address assignment for the attachment definition.
Note: See the Configuring SR-IOV additional network section for a detailed explanation of each option in SriovNetwork.
Create the SriovNetwork object by running the following command:
$ oc create -f mlx-rdma-network.yaml
Create the following Pod spec, and then save the YAML in the mlx-rdma-pod.yaml file.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-app
  namespace: <target_namespace> 1
  annotations:
    k8s.v1.cni.cncf.io/networks: mlx-rdma-network
spec:
  containers:
  - name: testpmd
    image: <RDMA_image> 2
    securityContext:
      capabilities:
        add: ["IPC_LOCK"] 3
    volumeMounts:
    - mountPath: /dev/hugepages 4
      name: hugepage
    resources:
      limits:
        memory: "1Gi"
        cpu: "4" 5
        hugepages-1Gi: "4Gi" 6
      requests:
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
- 1
- Specify the same target_namespace where the SriovNetwork object mlx-rdma-network is created. If you would like to create the pod in a different namespace, change target_namespace in both the Pod spec and the SriovNetwork object.
- 2
- Specify the RDMA image which includes your application and the RDMA library used by the application.
- 3
- Specify the IPC_LOCK capability which is required by the application to allocate hugepage memory inside the container.
- 4
- Mount the hugepage volume to the RDMA pod under /dev/hugepages. The hugepage volume is backed by the emptyDir volume type with the medium being Hugepages.
- 5
- Specify the number of CPUs. The RDMA pod usually requires exclusive CPUs to be allocated from the kubelet. This is achieved by setting the CPU Manager policy to static and creating a pod with Guaranteed QoS.
- 6
- Specify the hugepage size hugepages-1Gi or hugepages-2Mi and the quantity of hugepages that will be allocated to the RDMA pod. Configure 2Mi and 1Gi hugepages separately. Configuring 1Gi hugepages requires adding kernel arguments to nodes.
Create the RDMA pod by running the following command:
$ oc create -f mlx-rdma-pod.yaml