Chapter 5. NVIDIA GPUDirect Remote Direct Memory Access (RDMA)
NVIDIA GPUDirect Remote Direct Memory Access (RDMA) allows an application in one computer to directly access the memory of another computer without going through the operating system. Bypassing the kernel frees up resources and greatly reduces the CPU overhead normally needed to process network communications, which is useful for distributing GPU-accelerated workloads across clusters. Because RDMA is well suited to high-bandwidth, low-latency applications, it is ideal for big data and machine learning workloads.
There are currently three configuration methods for NVIDIA GPUDirect RDMA:
- Shared device
- This method allows for an NVIDIA GPUDirect RDMA device to be shared among multiple pods on the OpenShift Container Platform worker node where the device is exposed.
- Host device
- This method provides direct physical Ethernet access on the worker node by creating an additional host network on a pod. A plugin allows the network device to be moved from the host network namespace to the network namespace on the pod.
- SR-IOV legacy device
- The Single Root I/O Virtualization (SR-IOV) method can share a single network device, such as an Ethernet adapter, with multiple pods. SR-IOV segments the device, recognized on the host node as a physical function (PF), into multiple virtual functions (VFs). Each VF can be used like any other network device.
Each of these methods can be used over either RDMA over Converged Ethernet (RoCE) or InfiniBand infrastructure, providing a total of six configuration methods.
5.1. NVIDIA GPUDirect RDMA prerequisites
All methods of NVIDIA GPUDirect RDMA configuration require the installation of specific Operators. Use the following steps to install the Operators:
- Install the Node Feature Discovery Operator.
- Install the SR-IOV Operator.
- Install the NVIDIA Network Operator (NVIDIA documentation).
- Install the NVIDIA GPU Operator (NVIDIA documentation).
5.2. Disabling the IRDMA kernel module
On some systems, including the Dell R750xa, the IRDMA kernel module creates problems for the NVIDIA Network Operator when unloading and loading the DOCA drivers. Use the following procedure to disable the module.
Procedure
Generate the machine configuration file by running the following command:
$ cat <<EOF > 99-machine-config-blacklist-irdma.yaml
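# A minimal sketch of the heredoc body (assumed content; the original listing is
# not preserved here). It blacklists the irdma module on worker nodes through a
# kernel argument; the MachineConfig name matches the object created in the next step.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-blacklist-irdma
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - "modprobe.blacklist=irdma"
EOF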
Create the machine configuration on the cluster and wait for the nodes to reboot by running the following command:
$ oc create -f 99-machine-config-blacklist-irdma.yaml
Example output
machineconfig.machineconfiguration.openshift.io/99-worker-blacklist-irdma created
Validate in a debug pod on each node that the module has not loaded by running the following command:
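The original command listing is not preserved here; the following is one hedged way to perform this check from a node debug pod, substituting your own node name:
$ oc debug node/<node_name>
sh-5.1# chroot /host
sh-5.1# lsmod | grep irdma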
5.3. Creating persistent naming rules
In some cases, device names do not persist following a reboot. For example, on R760xa systems, Mellanox devices might be renamed after a reboot. You can avoid this problem by using a MachineConfig to set persistence.
Procedure
Gather the MAC addresses from the worker nodes into a file and provide names for the interfaces that need to persist. This example uses the file 70-persistent-net.rules and stashes the details in it; the rules themselves are shown after the base64 output below.
Convert that file into a base64 string without line breaks and set the output to the PERSIST variable:
$ PERSIST=`cat 70-persistent-net.rules | base64 -w 0`
$ echo $PERSIST
U1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIK
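For reference, the 70-persistent-net.rules file used in this example decodes from the base64 string above to the following udev rules:
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:28",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:29",ATTR{type}=="1",NAME="ens8f0np0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d0",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d1",ATTR{type}=="1",NAME="ens8f0np0"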
Create a machine configuration and set the base64 encoding in the custom resource file by running the following command:
$ cat <<EOF > 99-machine-config-udev-network.yaml
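# A sketch of the heredoc body (assumed structure; the original listing is not
# preserved here). It writes the udev rules file onto the worker nodes, embedding
# the base64 string from the PERSIST variable, which the shell expands inside the heredoc.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-machine-config-udev-network
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/udev/rules.d/70-persistent-net.rules
          mode: 420
          contents:
            source: data:text/plain;base64,$PERSIST
EOF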
Create the machine configuration on the cluster by running the following command:
$ oc create -f 99-machine-config-udev-network.yaml
Example output
machineconfig.machineconfiguration.openshift.io/99-machine-config-udev-network created
Use the oc get mcp command to view the machine configuration status:
$ oc get mcp
Example output
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-9adfe851c2c14d9598eea5ec3df6c187   True      False      False      1              1                   1                     0                      6h21m
worker   rendered-worker-4568f1b174066b4b1a4de794cf538fee   False     True       False      2              0                   0                     0                      6h21m
The nodes reboot. When the UPDATING field returns to False, you can validate the device names on the nodes by looking at the devices in a debug pod.
5.4. Configuring the NFD Operator
The Node Feature Discovery (NFD) Operator manages the detection of hardware features and configuration in an OpenShift Container Platform cluster by labeling the nodes with hardware-specific information. NFD labels the host with node-specific attributes, such as PCI cards, kernel, operating system version, and so on.
Prerequisites
- You have installed the NFD Operator.
Procedure
Validate that the Operator is installed and running by looking at the pods in the openshift-nfd namespace. Run the following command:
$ oc get pods -n openshift-nfd
Example output
NAME                                      READY   STATUS    RESTARTS   AGE
nfd-controller-manager-8698c88cdd-t8gbc   2/2     Running   0          2m
With the NFD controller running, generate the NodeFeatureDiscovery instance and add it to the cluster. The ClusterServiceVersion specification for the NFD Operator provides default values, including the NFD operand image that is part of the Operator payload. Retrieve its value by running the following command:
$ NFD_OPERAND_IMAGE=`echo $(oc get csv -n openshift-nfd -o json | jq -r '.items[0].metadata.annotations["alm-examples"]') | jq -r '.[] | select(.kind == "NodeFeatureDiscovery") | .spec.operand.image'`
Optional: Add entries to the default deviceClassWhiteList field to support more network adapters, such as the NVIDIA BlueField DPUs. A sketch of an nfd-instance.yaml file follows.
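The original NodeFeatureDiscovery listing is not preserved here; the following is a minimal sketch, assuming a PCI-based whitelist and the operand image retrieved into NFD_OPERAND_IMAGE in the previous step. The device class entries and label fields are assumptions to adjust for your adapters.
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  operand:
    image: <NFD_OPERAND_IMAGE>   # substitute the image value retrieved in the previous step
  workerConfig:
    configData: |
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "03"
            - "0200"
            - "0207"
          deviceLabelFields:
            - "vendor"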
Create the NodeFeatureDiscovery instance by running the following command:
$ oc create -f nfd-instance.yaml
Example output
nodefeaturediscovery.nfd.openshift.io/nfd-instance created
Validate that the instance is up and running by looking at the pods under the openshift-nfd namespace. Run the following command:
$ oc get pods -n openshift-nfd
Wait a short period of time and then verify that NFD has added labels to the node. The NFD labels are prefixed with feature.node.kubernetes.io, so you can easily filter them, as shown in the example after this step.
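The original command listing is not preserved here; as a sketch, one way to filter the NFD labels is:
$ oc describe node | grep feature.node.kubernetes.io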
Confirm there is a network device that is discovered:
$ oc describe node | grep -E 'Roles|pci' | grep pci-15b3
Example output
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
5.5. Configuring the SR-IOV Operator
Single root I/O virtualization (SR-IOV) enhances the performance of NVIDIA GPUDirect RDMA by enabling a single device to be shared across multiple pods.
Prerequisites
- You have installed the SR-IOV Operator.
Procedure
Validate that the Operator is installed and running by looking at the pods in the openshift-sriov-network-operator namespace. Run the following command:
$ oc get pods -n openshift-sriov-network-operator
Example output
NAME                                      READY   STATUS    RESTARTS   AGE
sriov-network-operator-7cb6c49868-89486   1/1     Running   0          22s
For the default SriovOperatorConfig CR to work with the MLNX_OFED container, update the following values in a sriov-operator-config.yaml file, as sketched below:
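The original listing is not preserved here; the following is a minimal sketch of the SriovOperatorConfig CR under those assumptions.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: true
  enableOperatorWebhook: true
  logLevel: 2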
Create the resource on the cluster by running the following command:
$ oc create -f sriov-operator-config.yaml
Example output
sriovoperatorconfig.sriovnetwork.openshift.io/default created
Patch the sriov-operator so the MOFED container can work with it by running the following command:
$ oc patch sriovoperatorconfig default --type=merge -n openshift-sriov-network-operator --patch '{ "spec": { "configDaemonNodeSelector": { "network.nvidia.com/operator.mofed.wait": "false", "node-role.kubernetes.io/worker": "", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true" } } }'
Example output
sriovoperatorconfig.sriovnetwork.openshift.io/default patched
5.6. Configuring the NVIDIA Network Operator
The NVIDIA Network Operator manages NVIDIA networking resources and networking-related components, such as drivers and device plugins, to enable NVIDIA GPUDirect RDMA workloads.
Prerequisites
- You have installed the NVIDIA Network Operator.
Procedure
Validate that the Network Operator is installed and running by confirming that the controller is running in the nvidia-network-operator namespace. Run the following command:
$ oc get pods -n nvidia-network-operator
Example output
NAME                                                           READY   STATUS    RESTARTS   AGE
nvidia-network-operator-controller-manager-6f7d6956cd-fw5wg   1/1     Running   0          5m
With the Operator running, create the NicClusterPolicy custom resource file. The device you choose depends on your system configuration. In this example, the InfiniBand interface ibs2f0 is hard coded and is used as the shared NVIDIA GPUDirect RDMA device. A sketch of the custom resource follows.
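The original network-sharedrdma-nic-cluster-policy.yaml listing is not preserved here; the following is a hedged sketch of a NicClusterPolicy that deploys the DOCA/MOFED driver and a shared RDMA device plugin for ibs2f0. The image versions, repositories, and resource name are assumptions; align them with the NVIDIA Network Operator release you installed.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.07-0.6.1.0-2          # assumed; match your Operator release
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: v1.5.1                    # assumed
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibs2f0"]
            }
          }
        ]
      }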
Create the NicClusterPolicy custom resource on the cluster by running the following command:
$ oc create -f network-sharedrdma-nic-cluster-policy.yaml
Example output
nicclusterpolicy.mellanox.com/nic-cluster-policy created
Validate the NicClusterPolicy by running the following command in the DOCA/MOFED container:
$ oc get pods -n nvidia-network-operator
rsh into the mofed container to check the status by running the following commands:
$ MOFED_POD=$(oc get pods -n nvidia-network-operator -o name | grep mofed)
$ oc rsh -n nvidia-network-operator -c mofed-container ${MOFED_POD}
sh-5.1# ofed_info -s
Example output
OFED-internal-24.07-0.6.1:
sh-5.1# ibdev2netdev -v
Example output
0000:0d:00.0 mlx5_0 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ibs2f0 (Up)
0000:a0:00.0 mlx5_1 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ens8f0np0 (Up)
Create an IPoIBNetwork custom resource file, as sketched below:
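The original ipoib-network.yaml listing is not preserved here; the following is a hedged sketch. The resource name matches the object created in the next step; the namespace, master interface, and IPAM range are assumptions to adapt to your environment.
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: example-ipoibnetwork
spec:
  networkNamespace: default
  master: ibs2f0
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.6.0/24"
    }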
Create the IPoIBNetwork resource on the cluster by running the following command:
$ oc create -f ipoib-network.yaml
Example output
ipoibnetwork.mellanox.com/example-ipoibnetwork created
Create a MacvlanNetwork custom resource file for your other interface, as sketched below:
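The original macvlan-network.yaml listing is not preserved here; the following is a hedged sketch. The resource name matches the object created in the next step; the master interface, mode, MTU, and IPAM range are assumptions.
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdmashared-net
spec:
  networkNamespace: default
  master: ens8f0np0
  mode: bridge
  mtu: 1500
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.2.0/24"
    }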
Create the resource on the cluster by running the following command:
$ oc create -f macvlan-network.yaml
Example output
macvlannetwork.mellanox.com/rdmashared-net created
5.7. Configuring the GPU Operator
The GPU Operator automates the management of the NVIDIA drivers, device plugins for GPUs, the NVIDIA Container Toolkit, and other components required for GPU provisioning.
Prerequisites
- You have installed the GPU Operator.
Procedure
Check that the Operator pod is running by looking at the pods in the nvidia-gpu-operator namespace. Run the following command:
$ oc get pods -n nvidia-gpu-operator
Example output
NAME                          READY   STATUS    RESTARTS   AGE
gpu-operator-b4cb7d74-zxpwq   1/1     Running   0          32s
Create a GPU cluster policy custom resource file similar to the following example:
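The original gpu-cluster-policy.yaml listing is not preserved here; the following is a heavily trimmed sketch of a ClusterPolicy for GPUDirect RDMA. The key setting is driver.rdma.enabled; the remaining component toggles are assumptions that mirror common defaults.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true
    rdma:
      enabled: true
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true
  dcgmExporter:
    enabled: true
  gfd:
    enabled: true
  nodeStatusExporter:
    enabled: true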
When the GPU ClusterPolicy custom resource has been generated, create the resource on the cluster by running the following command:
$ oc create -f gpu-cluster-policy.yaml
Example output
clusterpolicy.nvidia.com/gpu-cluster-policy created
Validate that the Operator is installed and running by running the following command:
$ oc get pods -n nvidia-gpu-operator
Optional: When you have verified the pods are running, remote shell into the NVIDIA driver daemonset pod and confirm that the NVIDIA modules are loaded. Specifically, ensure that the nvidia_peermem module is loaded.
$ oc rsh -n nvidia-gpu-operator $(oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver)
sh-4.4# lsmod | grep nvidia
Optional: Run the nvidia-smi utility to show the details about the driver and the hardware:
sh-4.4# nvidia-smi
While you are still in the driver pod, set the GPU clock to maximum using the nvidia-smi command:
$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202410172137-0-ndhzc
sh-4.4# nvidia-smi -i 0 -lgc $(nvidia-smi -i 0 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
Example output
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:61:00.0
All done.
Copy to Clipboard Copied! Toggle word wrap Toggle overflow nvidia-smi -i 1 -lgc $(nvidia-smi -i 1 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
sh-4.4# nvidia-smi -i 1 -lgc $(nvidia-smi -i 1 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example output
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:E1:00.0 All done.
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:E1:00.0 All done.
Validate that the resource is available from a node describe perspective by running the following command:
$ oc describe node -l node-role.kubernetes.io/worker= | grep -E 'Capacity:|Allocatable:' -A9
5.8. Creating the machine configuration
Before you create the resource pods, you need to create the machineconfig.yaml custom resource (CR) that provides access to the GPU and networking resources without the need for user privileges.
Procedure
Generate a MachineConfig CR:
5.9. Creating the workload pods
Use the procedures in this section to create the workload pods for the shared and host devices.
5.9.2. Creating a host device RDMA on RoCE
Create the workload pods for a host device Remote Direct Memory Access (RDMA) for the NVIDIA Network Operator and test the pod configuration.
Prerequisites
- Ensure that the Operator is running.
- Delete the NicClusterPolicy custom resource (CR), if it exists.
Procedure
Generate a new host device NicClusterPolicy (CR), as sketched below:
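The original network-hostdev-nic-cluster-policy.yaml listing is not preserved here; the following is a hedged sketch that deploys the DOCA/MOFED driver and an SR-IOV device plugin exposing the Mellanox devices as host devices. The image versions and the resource name are assumptions.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.07-0.6.1.0-2        # assumed; match your Operator release
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.7.0                  # assumed
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "isRdma": true
            }
          }
        ]
      }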
Create the NicClusterPolicy CR on the cluster by using the following command:
$ oc create -f network-hostdev-nic-cluster-policy.yaml
Example output
nicclusterpolicy.mellanox.com/nic-cluster-policy created
Verify the host device NicClusterPolicy CR by using the following command in the DOCA/MOFED container:
$ oc get pods -n nvidia-network-operator
Confirm that the resources appear in the cluster oc describe node section by using the following command:
$ oc describe node -l node-role.kubernetes.io/worker= | grep -E 'Capacity:|Allocatable:' -A7
Create a HostDeviceNetwork CR file, as sketched below:
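The original hostdev-network.yaml listing is not preserved here; the following is a hedged sketch. The resource name matches the object created in the next step; the namespace, resourceName, and IPAM range are assumptions that must match your NicClusterPolicy and environment.
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: default
  resourceName: hostdev
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.0/24"
    }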
Create the HostDeviceNetwork resource on the cluster by using the following command:
$ oc create -f hostdev-network.yaml
Example output
hostdevicenetwork.mellanox.com/hostdev-net created
Confirm that the resources appear in the cluster oc describe node section by using the following command:
$ oc describe node -l node-role.kubernetes.io/worker= | grep -E 'Capacity:|Allocatable:' -A8
5.9.3. Creating an SR-IOV legacy mode RDMA on RoCE
Configure a Single Root I/O Virtualization (SR-IOV) legacy mode host device RDMA on RoCE.
Procedure
Generate a new host device NicClusterPolicy custom resource (CR), as sketched below:
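The original network-sriovleg-nic-cluster-policy.yaml listing is not preserved here; the following is a hedged sketch. In SR-IOV legacy mode, the SR-IOV Network Operator handles the device plugin, so this policy only deploys the DOCA/MOFED driver; the image version is an assumption.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.07-0.6.1.0-2   # assumed; match your Operator release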
Create the policy on the cluster by using the following command:
$ oc create -f network-sriovleg-nic-cluster-policy.yaml
Example output
nicclusterpolicy.mellanox.com/nic-cluster-policy created
Verify the pods by using the following command in the DOCA/MOFED container:
$ oc get pods -n nvidia-network-operator
Example output
NAME                                                           READY   STATUS    RESTARTS      AGE
mofed-rhcos4.16-696886fcb4-ds-4mb42                            2/2     Running   0             40s
mofed-rhcos4.16-696886fcb4-ds-8knwq                            2/2     Running   0             40s
nvidia-network-operator-controller-manager-68d547dbbd-qsdkf   1/1     Running   13 (4d ago)   4d21h
Create an SriovNetworkNodePolicy CR that generates the Virtual Functions (VFs) for the device you want to operate in SR-IOV legacy mode, as sketched in the following example:
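The original sriov-network-node-policy.yaml listing is not preserved here; the following is a hedged sketch. The policy name matches the object created in the next step; the physical function name, VF count, and resource name are assumptions to adapt to your hardware.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  isRdma: true
  nicSelector:
    pfNames:
      - ens8f0np0#0-7     # assumed PF and VF range
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.sriov.capable: "true"
  numVfs: 8
  priority: 99
  resourceName: sriovlegacy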
Create the CR on the cluster by using the following command:
Note: Ensure that SR-IOV Global Enable is enabled. For more information, see Unable to enable SR-IOV and receiving the message "not enough MMIO resources for SR-IOV" in Red Hat Enterprise Linux.
$ oc create -f sriov-network-node-policy.yaml
Example output
sriovnetworknodepolicy.sriovnetwork.openshift.io/sriov-legacy-policy created
Each node has scheduling disabled. The nodes reboot to apply the configuration. You can view the nodes by using the following command:
$ oc get nodes
Example output
NAME                                       STATUS                        ROLES                         AGE     VERSION
edge-19.edge.lab.eng.rdu2.redhat.com       Ready                         control-plane,master,worker   5d      v1.29.8+632b078
nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com   Ready                         worker                        4d22h   v1.29.8+632b078
nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com   NotReady,SchedulingDisabled   worker                        4d22h   v1.29.8+632b078
After the nodes have rebooted, verify that the VF interfaces exist by opening a debug pod on each node. Run the following command:
$ oc debug node/nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
- Repeat the previous steps on the second node, if necessary.
Optional: Confirm that the resources appear in the cluster oc describe node section by using the following command:
$ oc describe node -l node-role.kubernetes.io/worker= | grep -E 'Capacity:|Allocatable:' -A8
After the VFs for SR-IOV legacy mode are in place, generate the SriovNetwork CR file, as sketched in the following example:
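The original sriov-network.yaml listing is not preserved here; the following is a hedged sketch. The name matches the object created in the next step, and the resourceName matches the node policy sketch above; the target namespace and the IPAM range (which would yield the 192.168.4.x addresses used in the verification section) are assumptions.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: default
  resourceName: sriovlegacy
  vlan: 0
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.4.0/24"
    }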
Create the custom resource on the cluster by using the following command:
$ oc create -f sriov-network.yaml
Example output
sriovnetwork.sriovnetwork.openshift.io/sriov-network created
5.10. Verifying RDMA connectivity
Confirm Remote Direct Memory Access (RDMA) connectivity is working between the systems, specifically for Legacy Single Root I/O Virtualization (SR-IOV) Ethernet.
Procedure
Connect to each rdma-workload-client pod by using the following command:
$ oc rsh -n default rdma-sriov-32-workload
Example output
sh-5.1#
Check the IP address assigned to the first workload pod by using the following command. In this example, the first workload pod is the RDMA test server.
sh-5.1# ip a
The IP address of the RDMA server assigned to this pod is on the net1 interface. In this example, the IP address is 192.168.4.225.
Run the ibstatus command to get the link_layer type, Ethernet or InfiniBand, associated with each RDMA device mlx5_x. The output also shows the status of all of the RDMA devices in the state field, which shows either ACTIVE or DOWN.
sh-5.1# ibstatus
To get the link_layer for each RDMA mlx5 device on your worker node, run the ibstat command:
sh-5.1# ibstat | egrep "Port|Base|Link"
For RDMA Shared Device or Host Device workload pods, the RDMA device named mlx5_x is already known and is typically mlx5_0 or mlx5_1. For RDMA Legacy SR-IOV workload pods, you need to determine which RDMA device is associated with which Virtual Function (VF) subinterface. Display this information by using the following command:
sh-5.1# rdma link show
In this example, the RDMA device name mlx5_7 is associated with the net1 interface. This output is used in the next command to perform the RDMA bandwidth test, which also verifies RDMA connectivity between worker nodes.
Run the following ib_write_bw RDMA bandwidth test command:
sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_7 -p 10000 --source_ip 192.168.4.225 --use_cuda=0 --use_cuda_dmabuf
where:
- The mlx5_7 RDMA device is passed in the -d switch.
- The source IP address is 192.168.4.225 to start the RDMA server.
- The --use_cuda=0 and --use_cuda_dmabuf switches indicate the use of GPUDirect RDMA.
Open another terminal window and run the oc rsh command on the second workload pod, which acts as the RDMA test client pod:
$ oc rsh -n default rdma-sriov-33-workload
Example output
sh-5.1#
Obtain the RDMA test client pod IP address from the net1 interface by using the following command:
sh-5.1# ip a
Obtain the link_layer type associated with each RDMA device mlx5_x by using the following command:
sh-5.1# ibstatus
Optional: Obtain the firmware version of the Mellanox cards by using the ibstat command:
sh-5.1# ibstat
To determine which RDMA device is associated with the Virtual Function subinterface that the client workload pod uses, run the following command. In this example, the net1 interface is using the RDMA device mlx5_2.
sh-5.1# rdma link show
Run the following ib_write_bw RDMA bandwidth test command:
sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_2 -p 10000 --source_ip 192.168.4.226 --use_cuda=0 --use_cuda_dmabuf 192.168.4.225
where:
- The mlx5_2 RDMA device is passed in the -d switch.
- The source IP address is 192.168.4.226 and the destination IP address of the RDMA server is 192.168.4.225.
- The --use_cuda=0 and --use_cuda_dmabuf switches indicate the use of GPUDirect RDMA.
A positive test result shows the expected BW average and MsgRate in Mpps.
Upon completion of the ib_write_bw command, the server-side output also appears on the server pod.