Chapter 5. NVIDIA GPUDirect Remote Direct Memory Access (RDMA)
NVIDIA GPUDirect Remote Direct Memory Access (RDMA) allows the memory of one computer to be accessed directly by another computer without involving either operating system. By bypassing kernel intervention, RDMA frees up resources and greatly reduces the CPU overhead that is normally needed to process network communications. This is useful for distributing GPU-accelerated workloads across clusters, and because RDMA is well suited to high-bandwidth, low-latency applications, it is ideal for big data and machine learning applications.
There are currently three configuration methods for NVIDIA GPUDirect RDMA:
- Shared device
- This method allows for an NVIDIA GPUDirect RDMA device to be shared among multiple pods on the OpenShift Container Platform worker node where the device is exposed.
- Host device
- This method provides direct physical Ethernet access on the worker node by creating an additional host network on a pod. A plugin allows the network device to be moved from the host network namespace to the network namespace on the pod.
- SR-IOV legacy device
- The Single Root I/O Virtualization (SR-IOV) method can share a single network device, such as an Ethernet adapter, with multiple pods. SR-IOV segments the device, recognized on the host node as a physical function (PF), into multiple virtual functions (VFs). The VF is used like any other network device.
Each of these methods can be used over either RDMA over Converged Ethernet (RoCE) or InfiniBand infrastructures, for a total of six configuration methods.
5.1. NVIDIA GPUDirect RDMA prerequisites
All methods of NVIDIA GPUDirect RDMA configuration require the installation of specific Operators. Use the following steps to install the Operators:
- Install the Node Feature Discovery Operator.
- Install the SR-IOV Operator.
- Install the NVIDIA Network Operator (NVIDIA documentation).
- Install the NVIDIA GPU Operator (NVIDIA documentation).
5.2. Disabling the IRDMA kernel module
On some systems, including the Dell R750xa, the IRDMA kernel module creates problems for the NVIDIA Network Operator when the DOCA drivers are unloaded and loaded. Use the following procedure to disable the module.
Procedure
Generate the machine configuration file by running the following command:
$ cat <<EOF > 99-machine-config-blacklist-irdma.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-blacklist-irdma
spec:
  kernelArguments:
    - "module_blacklist=irdma"
EOF
Create the machine configuration on the cluster and wait for the nodes to reboot by running the following command:
$ oc create -f 99-machine-config-blacklist-irdma.yaml
Example output
machineconfig.machineconfiguration.openshift.io/99-worker-blacklist-irdma created
Validate in a debug pod on each node that the module is not loaded by running the following commands:
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-32nvidiaengrdu2dcredhatcom-debug-btfj2 ...
To use host binaries, run `chroot /host`
Pod IP: 10.6.135.11
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# lsmod|grep irdma
sh-5.1#
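If the cluster has several worker nodes, you can run the same check across all of them in one pass. This is a quick sketch that assumes the standard node-role.kubernetes.io/worker label; an empty result for a node means the module is not loaded there:

$ for NODE in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do echo "== ${NODE} =="; oc debug ${NODE} -- chroot /host lsmod | grep irdma; done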
5.3. Creating persistent naming rules
In some cases, device names do not persist following a reboot. For example, on R760xa systems, Mellanox devices might be renamed after a reboot. You can avoid this problem by using a MachineConfig to set persistent naming rules.
Procedure
Gather the MAC addresses of the interfaces on the worker nodes into a file, and provide the names that the interfaces must persist as. This example uses the file 70-persistent-net.rules and stashes the details in it.
$ cat <<EOF > 70-persistent-net.rules
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:28",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:29",ATTR{type}=="1",NAME="ens8f0np0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d0",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d1",ATTR{type}=="1",NAME="ens8f0np0"
EOF
Convert that file into a base64 string without line breaks and set the output to the variable PERSIST:
$ PERSIST=`cat 70-persistent-net.rules | base64 -w 0`
$ echo $PERSIST
U1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIK
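Optionally, verify that the encoding round-trips cleanly by decoding the variable; the output should match the contents of 70-persistent-net.rules exactly:

$ echo $PERSIST | base64 -d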
Create a machine configuration and set the base64 encoding in the custom resource file by running the following command:
$ cat <<EOF > 99-machine-config-udev-network.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-machine-config-udev-network
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;base64,$PERSIST
        filesystem: root
        mode: 420
        path: /etc/udev/rules.d/70-persistent-net.rules
EOF
Create the machine configuration on the cluster by running the following command:
$ oc create -f 99-machine-config-udev-network.yaml
Example output
machineconfig.machineconfiguration.openshift.io/99-machine-config-udev-network created
Use the get mcp command to view the machine configuration status:
$ oc get mcp
Example output
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-9adfe851c2c14d9598eea5ec3df6c187   True      False      False      1              1                   1                     0                      6h21m
worker   rendered-worker-4568f1b174066b4b1a4de794cf538fee   False     True       False      2              0                   0                     0                      6h21m
The nodes reboot. When the UPDATING field returns to False, you can validate on the nodes by looking at the devices in a debug pod, as shown in the example below.
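For example, a check similar to the following can confirm that the expected names are present. The node name is a placeholder; substitute one of your worker nodes, and adjust the interface names to the ones you set in the udev rule:

$ oc debug node/<node_name> -- chroot /host ip link show | grep -E 'ibs2f0|ens8f0np0'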
5.4. Configuring the NFD Operator
The Node Feature Discovery (NFD) Operator manages the detection of hardware features and configuration in an OpenShift Container Platform cluster by labeling the nodes with hardware-specific information. NFD labels the host with node-specific attributes, such as PCI cards, kernel, operating system version, and so on.
Prerequisites
- You have installed the NFD Operator.
Procedure
Validate that the Operator is installed and running by checking the pods in the openshift-nfd namespace. Run the following command:
$ oc get pods -n openshift-nfd
Example output
NAME                                      READY   STATUS    RESTARTS   AGE
nfd-controller-manager-8698c88cdd-t8gbc   2/2     Running   0          2m
With the NFD controller running, generate the NodeFeatureDiscovery instance and add it to the cluster. The ClusterServiceVersion specification for the NFD Operator provides default values, including the NFD operand image that is part of the Operator payload. Retrieve its value by running the following command:
$ NFD_OPERAND_IMAGE=`echo $(oc get csv -n openshift-nfd -o json | jq -r '.items[0].metadata.annotations["alm-examples"]') | jq -r '.[] | select(.kind == "NodeFeatureDiscovery") | .spec.operand.image'`
Optional: Add entries to the default deviceClassWhitelist field to support more network adapters, such as the NVIDIA BlueField DPUs. Save the NodeFeatureDiscovery instance to a file, for example nfd-instance.yaml:
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  instance: ''
  operand:
    image: '${NFD_OPERAND_IMAGE}'
    servicePort: 12000
  prunerOnDelete: false
  topologyUpdater: false
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "03"
            - "0200"
            - "0207"
            - "12"
          deviceLabelFields:
            - "vendor"
Create the NodeFeatureDiscovery instance by running the following command:
$ oc create -f nfd-instance.yaml
Example output
nodefeaturediscovery.nfd.openshift.io/nfd-instance created
Validate that the instance is up and running by checking the pods in the openshift-nfd namespace. Run the following command:
$ oc get pods -n openshift-nfd
Example output
NAME                                    READY   STATUS    RESTARTS   AGE
nfd-controller-manager-7cb6d656-jcnqb   2/2     Running   0          4m
nfd-gc-7576d64889-s28k9                 1/1     Running   0          21s
nfd-master-b7bcf5cfd-qnrmz              1/1     Running   0          21s
nfd-worker-96pfh                        1/1     Running   0          21s
nfd-worker-b2gkg                        1/1     Running   0          21s
nfd-worker-bd9bk                        1/1     Running   0          21s
nfd-worker-cswf4                        1/1     Running   0          21s
nfd-worker-kp6gg                        1/1     Running   0          21s
Wait a short period of time and then verify that NFD has added labels to the node. The NFD labels are prefixed with feature.node.kubernetes.io, so you can easily filter them. Run the following command:
$ oc get node -o json | jq '.items[0].metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io")))'
{
  "feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
  "feature.node.kubernetes.io/cpu-cpuid.CETSS": "true",
  "feature.node.kubernetes.io/cpu-cpuid.CLZERO": "true",
  "feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8": "true",
  "feature.node.kubernetes.io/cpu-cpuid.CPBOOST": "true",
  "feature.node.kubernetes.io/cpu-cpuid.EFER_LMSLE_UNS": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FP256": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FSRM": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FXSR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FXSROPT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBRS": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBRS_PREFERRED": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBRS_PROVIDES_SMP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBS": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSFFV": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBS_FETCH_CTLX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBS_OPFUSE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST": "true",
  "feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.INVLPGB": "true",
  "feature.node.kubernetes.io/cpu-cpuid.LAHF": "true",
  "feature.node.kubernetes.io/cpu-cpuid.LBRVIRT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MCOMMIT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MOVBE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MOVU": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MSRIRC": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH": "true",
  "feature.node.kubernetes.io/cpu-cpuid.NRIPS": "true",
  "feature.node.kubernetes.io/cpu-cpuid.OSXSAVE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.PPIN": "true",
  "feature.node.kubernetes.io/cpu-cpuid.PSFD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.RDPRU": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV_ES": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SEV_SNP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SHA": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SME": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SSE4A": "true",
  "feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.STIBP_ALWAYSON": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SUCCOR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVM": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVMDA": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVMFBASID": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVML": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVMNP": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVMPF": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SVMPFT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SYSCALL": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SYSEE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.TLB_FLUSH_NESTED": "true",
  "feature.node.kubernetes.io/cpu-cpuid.TOPEXT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VAES": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VMPL": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ": "true",
  "feature.node.kubernetes.io/cpu-cpuid.VTE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true",
  "feature.node.kubernetes.io/cpu-cpuid.X87": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XGETBV1": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVE": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVEC": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT": "true",
  "feature.node.kubernetes.io/cpu-cpuid.XSAVES": "true",
  "feature.node.kubernetes.io/cpu-hardware_multithreading": "false",
  "feature.node.kubernetes.io/cpu-model.family": "25",
  "feature.node.kubernetes.io/cpu-model.id": "1",
  "feature.node.kubernetes.io/cpu-model.vendor_id": "AMD",
  "feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
  "feature.node.kubernetes.io/kernel-config.NO_HZ_FULL": "true",
  "feature.node.kubernetes.io/kernel-selinux.enabled": "true",
  "feature.node.kubernetes.io/kernel-version.full": "5.14.0-427.35.1.el9_4.x86_64",
  "feature.node.kubernetes.io/kernel-version.major": "5",
  "feature.node.kubernetes.io/kernel-version.minor": "14",
  "feature.node.kubernetes.io/kernel-version.revision": "0",
  "feature.node.kubernetes.io/memory-numa": "true",
  "feature.node.kubernetes.io/network-sriov.capable": "true",
  "feature.node.kubernetes.io/pci-102b.present": "true",
  "feature.node.kubernetes.io/pci-10de.present": "true",
  "feature.node.kubernetes.io/pci-10de.sriov.capable": "true",
  "feature.node.kubernetes.io/pci-15b3.present": "true",
  "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true",
  "feature.node.kubernetes.io/rdma.available": "true",
  "feature.node.kubernetes.io/rdma.capable": "true",
  "feature.node.kubernetes.io/storage-nonrotationaldisk": "true",
  "feature.node.kubernetes.io/system-os_release.ID": "rhcos",
  "feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION": "4.17",
  "feature.node.kubernetes.io/system-os_release.OSTREE_VERSION": "417.94.202409121747-0",
  "feature.node.kubernetes.io/system-os_release.RHEL_VERSION": "9.4",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID": "4.17",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "4",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "17"
}
Confirm that the network device is discovered by running the following command:
$ oc describe node | grep -E 'Roles|pci' | grep pci-15b3
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
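Because these are ordinary node labels, you can also use them as a selector. For example, on larger clusters where only some workers carry the Mellanox NIC, the following command lists just those nodes:

$ oc get nodes -l feature.node.kubernetes.io/pci-15b3.present=true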
5.5. Configuring the SR-IOV Operator
Single root I/O virtualization (SR-IOV) enhances the performance of NVIDIA GPUDirect RDMA by enabling a single network device to be shared across multiple pods.
Prerequisites
- You have installed the SR-IOV Operator.
Procedure
Validate that the Operator is installed and running by checking the pods in the openshift-sriov-network-operator namespace. Run the following command:
$ oc get pods -n openshift-sriov-network-operator
Example output
NAME                                      READY   STATUS    RESTARTS   AGE
sriov-network-operator-7cb6c49868-89486   1/1     Running   0          22s
For the default SriovOperatorConfig CR to work with the MLNX_OFED container, create a custom resource file, for example sriov-operator-config.yaml, that sets the following values:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: true
  enableOperatorWebhook: true
  logLevel: 2
Create the resource on the cluster by running the following command:
$ oc create -f sriov-operator-config.yaml
Example output
sriovoperatorconfig.sriovnetwork.openshift.io/default created
Patch the default SriovOperatorConfig so that the MOFED container can work with it by running the following command:
$ oc patch sriovoperatorconfig default --type=merge -n openshift-sriov-network-operator --patch '{ "spec": { "configDaemonNodeSelector": { "network.nvidia.com/operator.mofed.wait": "false", "node-role.kubernetes.io/worker": "", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true" } } }'
Example output
sriovoperatorconfig.sriovnetwork.openshift.io/default patched
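After the patch is applied, the SR-IOV config daemon is scheduled only onto the matching worker nodes. As a quick sanity check, list the pods in the namespace again and, if you want to see which NICs were detected on each node, query the SriovNetworkNodeState objects; this assumes the standard CRDs shipped with the SR-IOV Network Operator:

$ oc get pods -n openshift-sriov-network-operator
$ oc get sriovnetworknodestates -n openshift-sriov-network-operator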
5.6. Configuring the NVIDIA Network Operator
The NVIDIA Network Operator manages NVIDIA networking resources and networking-related components, such as drivers and device plugins, to enable NVIDIA GPUDirect RDMA workloads.
Prerequisites
- You have installed the NVIDIA Network Operator.
Procedure
Validate that the Network Operator is installed and running by confirming that the controller is running in the nvidia-network-operator namespace. Run the following command:
$ oc get pods -n nvidia-network-operator
Example output
NAME                                                           READY   STATUS    RESTARTS   AGE
nvidia-network-operator-controller-manager-6f7d6956cd-fw5wg   1/1     Running   0          5m
With the Operator running, create the NicClusterPolicy custom resource file. The device you choose depends on your system configuration. In this example, the InfiniBand interface ibs2f0 is hard-coded and is used as the shared NVIDIA GPUDirect RDMA device.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nicFeatureDiscovery:
    image: nic-feature-discovery
    repository: ghcr.io/mellanox
    version: v0.0.1
  docaTelemetryService:
    image: doca_telemetry
    repository: nvcr.io/nvidia/doca
    version: 1.16.5-doca2.6.0-host
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibs2f0"]
            }
          },
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens8f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: v1.5.1
  secondaryNetwork:
    ipoib:
      image: ipoib-cni
      repository: ghcr.io/mellanox
      version: v1.2.0
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    repository: ghcr.io/mellanox
    version: v0.2.0
  ofedDriver:
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
        podSelector: ''
      maxParallelUpgrades: 1
      safeLoad: false
      waitForCompletion:
        timeoutSeconds: 0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.7.0.0-0
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
Create the NicClusterPolicy custom resource on the cluster by running the following command:
$ oc create -f network-sharedrdma-nic-cluster-policy.yaml
Example output
nicclusterpolicy.mellanox.com/nic-cluster-policy created
Validate the NicClusterPolicy by confirming that the DOCA/MOFED driver and related pods are running in the nvidia-network-operator namespace. Run the following command:
$ oc get pods -n nvidia-network-operator
Example output
NAME                                                           READY   STATUS    RESTARTS   AGE
doca-telemetry-service-hwj65                                   1/1     Running   2          160m
kube-ipoib-cni-ds-fsn8g                                        1/1     Running   2          160m
mofed-rhcos4.16-9b5ddf4c6-ds-ct2h5                             2/2     Running   4          160m
nic-feature-discovery-ds-dtksz                                 1/1     Running   2          160m
nv-ipam-controller-854585f594-c5jpp                            1/1     Running   2          160m
nv-ipam-controller-854585f594-xrnp5                            1/1     Running   2          160m
nv-ipam-node-xqttl                                             1/1     Running   2          160m
nvidia-network-operator-controller-manager-5798b564cd-5cq99   1/1     Running   2          5d23h
rdma-shared-dp-ds-p9vvg                                        1/1     Running   0          85m
rsh into the mofed container to check the status by running the following commands:
$ MOFED_POD=$(oc get pods -n nvidia-network-operator -o name | grep mofed)
$ oc rsh -n nvidia-network-operator -c mofed-container ${MOFED_POD}
sh-5.1# ofed_info -s
Example output
OFED-internal-24.07-0.6.1:
sh-5.1# ibdev2netdev -v
Example output
0000:0d:00.0 mlx5_0 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ibs2f0 (Up)
0000:a0:00.0 mlx5_1 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ens8f0np0 (Up)
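Optionally, while still inside the mofed container, you can inspect a port in more detail. This assumes the ibstat tool from infiniband-diags is included in the DOCA/MOFED image:

sh-5.1# ibstat mlx5_0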
Create an IPoIBNetwork custom resource file:
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: example-ipoibnetwork
spec:
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.6.225/28",
      "exclude": [
        "192.168.6.229/30",
        "192.168.6.236/32"
      ]
    }
  master: ibs2f0
  networkNamespace: default
Create the IPoIBNetwork resource on the cluster by running the following command:
$ oc create -f ipoib-network.yaml
Example output
ipoibnetwork.mellanox.com/example-ipoibnetwork created
Create a MacvlanNetwork custom resource file for your other interface:
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdmashared-net
spec:
  networkNamespace: default
  master: ens8f0np0
  mode: bridge
  mtu: 1500
  ipam: '{"type": "whereabouts", "range": "192.168.2.0/24", "gateway": "192.168.2.1"}'
Create the resource on the cluster by running the following command:
$ oc create -f macvlan-network.yaml
Example output
macvlannetwork.mellanox.com/rdmashared-net created
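Both the IPoIBNetwork and the MacvlanNetwork resources are rendered by the Operator into NetworkAttachmentDefinition objects in the namespace given by networkNamespace, which is default in these examples. Before referencing the networks from pods, you can confirm that both attachments exist:

$ oc get network-attachment-definitions -n default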
5.7. Configuring the GPU Operator
The GPU Operator automates the management of the NVIDIA drivers, device plugins for GPUs, the NVIDIA Container Toolkit, and other components required for GPU provisioning.
Prerequisites
- You have installed the GPU Operator.
Procedure
Check that the Operator pod is running by looking at the pods in the nvidia-gpu-operator namespace. Run the following command:
$ oc get pods -n nvidia-gpu-operator
Example output
NAME                          READY   STATUS    RESTARTS   AGE
gpu-operator-b4cb7d74-zxpwq   1/1     Running   0          32s
Create a GPU cluster policy custom resource file similar to the following example:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    serviceMonitor:
      enabled: true
    enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
      configMapName: ''
    certConfig:
      name: ''
    rdma:
      enabled: false
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
    enabled: true
    useNvidiaDriverCRD: false
    useOpenKernelModules: true
  devicePlugin:
    config:
      name: ''
      default: ''
    mps:
      root: /run/nvidia/mps
    enabled: true
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: nvidia-fs
    version: 2.20.5
    repository: nvcr.io/nvidia/cloud-native
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    installDir: /usr/local/nvidia
    enabled: true
When you have generated the GPU ClusterPolicy custom resource, create the resource on the cluster by running the following command:
$ oc create -f gpu-cluster-policy.yaml
Example output
clusterpolicy.nvidia.com/gpu-cluster-policy created
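The ClusterPolicy takes several minutes to roll out. Assuming the status.state field reported by the GPU Operator, you can poll it and wait for it to report ready before continuing:

$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'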
Validate that the Operator is installed and running by running the following command:
$ oc get pods -n nvidia-gpu-operator
Example output
NAME                                                  READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-d5ngn                           1/1     Running     0          3m20s
gpu-feature-discovery-z42rx                           1/1     Running     0          3m23s
gpu-operator-6bb4d4b4c5-njh78                         1/1     Running     0          4m35s
nvidia-container-toolkit-daemonset-bkh8l              1/1     Running     0          3m20s
nvidia-container-toolkit-daemonset-c4hzm              1/1     Running     0          3m23s
nvidia-cuda-validator-4blvg                           0/1     Completed   0          106s
nvidia-cuda-validator-tw8sl                           0/1     Completed   0          112s
nvidia-dcgm-exporter-rrw4g                            1/1     Running     0          3m20s
nvidia-dcgm-exporter-xc78t                            1/1     Running     0          3m23s
nvidia-dcgm-nvxpf                                     1/1     Running     0          3m20s
nvidia-dcgm-snj4j                                     1/1     Running     0          3m23s
nvidia-device-plugin-daemonset-fk2xz                  1/1     Running     0          3m23s
nvidia-device-plugin-daemonset-wq87j                  1/1     Running     0          3m20s
nvidia-driver-daemonset-416.94.202410211619-0-ngrjg   4/4     Running     0          3m58s
nvidia-driver-daemonset-416.94.202410211619-0-tm4x6   4/4     Running     0          3m58s
nvidia-node-status-exporter-jlzxh                     1/1     Running     0          3m57s
nvidia-node-status-exporter-zjffs                     1/1     Running     0          3m57s
nvidia-operator-validator-l49hx                       1/1     Running     0          3m20s
nvidia-operator-validator-n44nn                       1/1     Running     0          3m23s
Optional: When you have verified that the pods are running, remote shell into the NVIDIA driver daemonset pod and confirm that the NVIDIA modules are loaded. Specifically, ensure that the nvidia_peermem module is loaded:
$ oc rsh -n nvidia-gpu-operator $(oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver)
sh-4.4# lsmod|grep nvidia
Example output
nvidia_fs             327680  0
nvidia_peermem         24576  0
nvidia_modeset       1507328  0
video                  73728  1 nvidia_modeset
nvidia_uvm           6889472  8
nvidia               8810496  43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
ib_uverbs             217088  3 nvidia_peermem,rdma_ucm,mlx5_ib
drm                   741376  5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200
Optional: Run the nvidia-smi utility to show the details about the driver and the hardware:
sh-4.4# nvidia-smi
Example output
Wed Nov 6 22:03:53 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 On | 00000000:61:00.0 Off | 0 |
| 0% 37C P0 88W / 300W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:E1:00.0 Off | 0 |
| 0% 28C P8 29W / 300W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
While you are still in the driver pod, set the GPU clock to maximum using the nvidia-smi command:
$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202410172137-0-ndhzc
sh-4.4# nvidia-smi -i 0 -lgc $(nvidia-smi -i 0 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
Example output
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:61:00.0
All done.
sh-4.4# nvidia-smi -i 1 -lgc $(nvidia-smi -i 1 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
Example output
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:E1:00.0
All done.
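Optionally, confirm the applied clock by querying the current graphics clock on each GPU; clocks.gr is a standard nvidia-smi query field:

sh-4.4# nvidia-smi --query-gpu=index,clocks.gr --format=csv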
Validate that the RDMA and GPU resources are visible from a node describe perspective by running the following command:
$ oc describe node -l node-role.kubernetes.io/worker= | grep -E 'Capacity:|Allocatable:' -A9
Example output
Capacity:
  cpu:                          128
  ephemeral-storage:            1561525616Ki
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       263596712Ki
  nvidia.com/gpu:               2
  pods:                         250
  rdma/rdma_shared_device_eth:  63
  rdma/rdma_shared_device_ib:   63
Allocatable:
  cpu:                          127500m
  ephemeral-storage:            1438028263499
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       262445736Ki
  nvidia.com/gpu:               2
  pods:                         250
  rdma/rdma_shared_device_eth:  63
  rdma/rdma_shared_device_ib:   63
--
Capacity:
  cpu:                          128
  ephemeral-storage:            1561525616Ki
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       263596672Ki
  nvidia.com/gpu:               2
  pods:                         250
  rdma/rdma_shared_device_eth:  63
  rdma/rdma_shared_device_ib:   63
Allocatable:
  cpu:                          127500m
  ephemeral-storage:            1438028263499
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       262445696Ki
  nvidia.com/gpu:               2
  pods:                         250
  rdma/rdma_shared_device_eth:  63
  rdma/rdma_shared_device_ib:   63
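With the GPU and shared RDMA devices advertised under Allocatable, workloads can request them like any other extended resource. The following pod is a minimal sketch only: the image and pod name are placeholders, and it assumes the example-ipoibnetwork attachment created earlier in the default namespace.

apiVersion: v1
kind: Pod
metadata:
  name: rdma-gpu-test
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
  containers:
  - name: rdma-gpu-test
    image: quay.io/example/rdma-workload:latest  # placeholder image
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]  # RDMA memory registration typically requires IPC_LOCK
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1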