Chapter 5. NVIDIA GPUDirect Remote Direct Memory Access (RDMA)


NVIDIA GPUDirect Remote Direct Memory Access (RDMA) allows for the memory in one computer to directly access the memory of another computer without needing access through the operating system. This provides the ability to bypass kernel intervention in the process, freeing up resources and greatly reducing the CPU overhead normally needed to process network communications. This is useful for distributing GPU-accelerated workloads across clusters. And because RDMA is so suited toward high bandwidth and low latency applications, this makes it ideal for big data and machine learning applications.

There are currently three configuration methods for NVIDIA GPUDirect RDMA:

Shared device
This method allows for an NVIDIA GPUDirect RDMA device to be shared among multiple pods on the OpenShift Container Platform worker node where the device is exposed.
Host device
This method provides direct physical Ethernet access on the worker node by creating an additional host network on a pod. A plugin allows the network device to be moved from the host network namespace to the network namespace on the pod.
SR-IOV legacy device
The Single Root IO Virtualization (SR-IOV) method can share a single network device, such as an Ethernet adapter, with multiple pods. SR-IOV segments the device, recognized on the host node as a physical function (PF), into multiple virtual functions (VFs). The VF is used like any other network device.

Each of these methods can be used across either the NVIDIA GPUDirect RDMA over Converged Ethernet (RoCE) or Infiniband infrastructures, providing an aggregate total of six methods of configuration.

5.1. NVIDIA GPUDirect RDMA prerequisites

All methods of NVIDIA GPUDirect RDMA configuration require the installation of specific Operators. Use the following steps to install the Operators:

5.2. Disabling the IRDMA kernel module

On some systems, including the DellR750xa, the IRDMA kernel module creates problems for the NVIDIA Network Operator when unloading and loading the DOCA drivers. Use the following procedure to disable the module.

Procedure

  1. Generate the following machine configuration file by running the following command:

    Copy to Clipboard Toggle word wrap
    $ cat <<EOF > 99-machine-config-blacklist-irdma.yaml

    Example output

    Copy to Clipboard Toggle word wrap
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: worker
      name: 99-worker-blacklist-irdma
    spec:
      kernelArguments:
        - "module_blacklist=irdma"

  2. Create the machine configuration on the cluster and wait for the nodes to reboot by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc create -f 99-machine-config-blacklist-irdma.yaml

    Example output

    Copy to Clipboard Toggle word wrap
    machineconfig.machineconfiguration.openshift.io/99-worker-blacklist-irdma created

  3. Validate in a debug pod on each node that the module has not loaded by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
    Starting pod/nvd-srv-32nvidiaengrdu2dcredhatcom-debug-btfj2 ...
    To use host binaries, run `chroot /host`
    Pod IP: 10.6.135.11
    If you don't see a command prompt, try pressing enter.
    sh-5.1# chroot /host
    sh-5.1# lsmod|grep irdma
    sh-5.1#

5.3. Creating persistent naming rules

In some cases, device names won’t persist following a reboot. For example, on R760xa systems Mellanox devices might be renamed after a reboot. You can avoid this problem by using a MachineConfig to set persistence.

Procedure

  1. Gather the MAC address names from the worker nodes for the node into a file and provide names for the interfaces that need to persist. This example uses the file 70-persistent-net.rules and stashes the details in it.

    Copy to Clipboard Toggle word wrap
    $ cat <<EOF > 70-persistent-net.rules
    SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:28",ATTR{type}=="1",NAME="ibs2f0"
    SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:29",ATTR{type}=="1",NAME="ens8f0np0"
    SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d0",ATTR{type}=="1",NAME="ibs2f0"
    SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d1",ATTR{type}=="1",NAME="ens8f0np0"
    EOF
  2. Convert that file into a base64 string without line breaks and set the output to the variable PERSIST:

    Copy to Clipboard Toggle word wrap
    $ PERSIST=`cat 70-persistent-net.rules| base64 -w 0`
    
    $ echo $PERSIST
    U1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIK
  3. Create a machine configuration and set the base64 encoding in the custom resource file by running the following command:

    Copy to Clipboard Toggle word wrap
    $ cat <<EOF > 99-machine-config-udev-network.yaml
    Copy to Clipboard Toggle word wrap
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
       labels:
         machineconfiguration.openshift.io/role: worker
       name: 99-machine-config-udev-network
    spec:
       config:
         ignition:
           version: 3.2.0
         storage:
           files:
           - contents:
               source: data:text/plain;base64,$PERSIST
             filesystem: root
             mode: 420
             path: /etc/udev/rules.d/70-persistent-net.rules
  4. Create the machine configuration on the cluster by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc create -f 99-machine-config-udev-network.yaml

    Example output

    Copy to Clipboard Toggle word wrap
    machineconfig.machineconfiguration.openshift.io/99-machine-config-udev-network created

  5. Use the get mcp command to view the machine configuration status:

    Copy to Clipboard Toggle word wrap
    $ oc get mcp

    Example output

    Copy to Clipboard Toggle word wrap
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    master   rendered-master-9adfe851c2c14d9598eea5ec3df6c187   True      False      False      1              1                   1                     0                      6h21m
    worker   rendered-worker-4568f1b174066b4b1a4de794cf538fee   False     True       False      2              0                   0                     0                      6h21m

The nodes will reboot and when the updating field returns to false, you can validate on the nodes by looking at the devices in a debug pod.

5.4. Configuring the NFD Operator

The Node Feature Discovery (NFD) Operator manages the detection of hardware features and configuration in an OpenShift Container Platform cluster by labeling the nodes with hardware-specific information. NFD labels the host with node-specific attributes, such as PCI cards, kernel, operating system version, and so on.

Prerequisites

  • You have installed the NFD Operator.

Procedure

  1. Validate that the Operator is installed and running by looking at the pods in the openshift-nfd namespace by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc get pods -n openshift-nfd

    Example output

    Copy to Clipboard Toggle word wrap
    NAME                                      READY   STATUS    RESTARTS   AGE
    nfd-controller-manager-8698c88cdd-t8gbc   2/2     Running   0          2m

  2. With the NFD controller running, generate the NodeFeatureDiscovery instance and add it to the cluster.

    The ClusterServiceVersion specification for NFD Operator provides default values, including the NFD operand image that is part of the Operator payload. Retrieve its value by running the following command:

    Copy to Clipboard Toggle word wrap
    $ NFD_OPERAND_IMAGE=`echo $(oc get csv -n openshift-nfd -o json | jq -r '.items[0].metadata.annotations["alm-examples"]') | jq -r '.[] | select(.kind == "NodeFeatureDiscovery") | .spec.operand.image'`
  3. Optional: Add entries to the default deviceClasseWhiteList field, to support more network adapters, such as the NVIDIA BlueField DPUs.

    Copy to Clipboard Toggle word wrap
    apiVersion: nfd.openshift.io/v1
    kind: NodeFeatureDiscovery
    metadata:
      name: nfd-instance
      namespace: openshift-nfd
    spec:
      instance: ''
      operand:
        image: '${NFD_OPERAND_IMAGE}'
        servicePort: 12000
      prunerOnDelete: false
      topologyUpdater: false
      workerConfig:
        configData: |
          core:
            sleepInterval: 60s
          sources:
            pci:
              deviceClassWhitelist:
                - "02"
                - "03"
                - "0200"
                - "0207"
                - "12"
              deviceLabelFields:
                - "vendor"
  4. Create the 'NodeFeatureDiscovery` instance by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc create -f nfd-instance.yaml

    Example output

    Copy to Clipboard Toggle word wrap
    nodefeaturediscovery.nfd.openshift.io/nfd-instance created

  5. Validate that the instance is up and running by looking at the pods under the openshift-nfd namespace by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc get pods -n openshift-nfd

    Example output

    Copy to Clipboard Toggle word wrap
    NAME                                    READY   STATUS    RESTARTS   AGE
    nfd-controller-manager-7cb6d656-jcnqb   2/2     Running   0          4m
    nfd-gc-7576d64889-s28k9                 1/1     Running   0          21s
    nfd-master-b7bcf5cfd-qnrmz              1/1     Running   0          21s
    nfd-worker-96pfh                        1/1     Running   0          21s
    nfd-worker-b2gkg                        1/1     Running   0          21s
    nfd-worker-bd9bk                        1/1     Running   0          21s
    nfd-worker-cswf4                        1/1     Running   0          21s
    nfd-worker-kp6gg                        1/1     Running   0          21s

  6. Wait a short period of time and then verify that NFD has added labels to the node. The NFD labels are prefixed with feature.node.kubernetes.io, so you can easily filter them.

    Copy to Clipboard Toggle word wrap
    $ oc get node -o json | jq '.items[0].metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io")))'
    {
      "feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
      "feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
      "feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
      "feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
      "feature.node.kubernetes.io/cpu-cpuid.CETSS": "true",
      "feature.node.kubernetes.io/cpu-cpuid.CLZERO": "true",
      "feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8": "true",
      "feature.node.kubernetes.io/cpu-cpuid.CPBOOST": "true",
      "feature.node.kubernetes.io/cpu-cpuid.EFER_LMSLE_UNS": "true",
      "feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
      "feature.node.kubernetes.io/cpu-cpuid.FP256": "true",
      "feature.node.kubernetes.io/cpu-cpuid.FSRM": "true",
      "feature.node.kubernetes.io/cpu-cpuid.FXSR": "true",
      "feature.node.kubernetes.io/cpu-cpuid.FXSROPT": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBRS": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBRS_PREFERRED": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBRS_PROVIDES_SMP": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBS": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBSFFV": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBS_FETCH_CTLX": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBS_OPFUSE": "true",
      "feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST": "true",
      "feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD": "true",
      "feature.node.kubernetes.io/cpu-cpuid.INVLPGB": "true",
      "feature.node.kubernetes.io/cpu-cpuid.LAHF": "true",
      "feature.node.kubernetes.io/cpu-cpuid.LBRVIRT": "true",
      "feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW": "true",
      "feature.node.kubernetes.io/cpu-cpuid.MCOMMIT": "true",
      "feature.node.kubernetes.io/cpu-cpuid.MOVBE": "true",
      "feature.node.kubernetes.io/cpu-cpuid.MOVU": "true",
      "feature.node.kubernetes.io/cpu-cpuid.MSRIRC": "true",
      "feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH": "true",
      "feature.node.kubernetes.io/cpu-cpuid.NRIPS": "true",
      "feature.node.kubernetes.io/cpu-cpuid.OSXSAVE": "true",
      "feature.node.kubernetes.io/cpu-cpuid.PPIN": "true",
      "feature.node.kubernetes.io/cpu-cpuid.PSFD": "true",
      "feature.node.kubernetes.io/cpu-cpuid.RDPRU": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SEV": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SEV_ES": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SEV_SNP": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SHA": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SME": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SSE4A": "true",
      "feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
      "feature.node.kubernetes.io/cpu-cpuid.STIBP_ALWAYSON": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SUCCOR": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SVM": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SVMDA": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SVMFBASID": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SVML": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SVMNP": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SVMPF": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SVMPFT": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SYSCALL": "true",
      "feature.node.kubernetes.io/cpu-cpuid.SYSEE": "true",
      "feature.node.kubernetes.io/cpu-cpuid.TLB_FLUSH_NESTED": "true",
      "feature.node.kubernetes.io/cpu-cpuid.TOPEXT": "true",
      "feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR": "true",
      "feature.node.kubernetes.io/cpu-cpuid.VAES": "true",
      "feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN": "true",
      "feature.node.kubernetes.io/cpu-cpuid.VMPL": "true",
      "feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT": "true",
      "feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ": "true",
      "feature.node.kubernetes.io/cpu-cpuid.VTE": "true",
      "feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true",
      "feature.node.kubernetes.io/cpu-cpuid.X87": "true",
      "feature.node.kubernetes.io/cpu-cpuid.XGETBV1": "true",
      "feature.node.kubernetes.io/cpu-cpuid.XSAVE": "true",
      "feature.node.kubernetes.io/cpu-cpuid.XSAVEC": "true",
      "feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT": "true",
      "feature.node.kubernetes.io/cpu-cpuid.XSAVES": "true",
      "feature.node.kubernetes.io/cpu-hardware_multithreading": "false",
      "feature.node.kubernetes.io/cpu-model.family": "25",
      "feature.node.kubernetes.io/cpu-model.id": "1",
      "feature.node.kubernetes.io/cpu-model.vendor_id": "AMD",
      "feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
      "feature.node.kubernetes.io/kernel-config.NO_HZ_FULL": "true",
      "feature.node.kubernetes.io/kernel-selinux.enabled": "true",
      "feature.node.kubernetes.io/kernel-version.full": "5.14.0-427.35.1.el9_4.x86_64",
      "feature.node.kubernetes.io/kernel-version.major": "5",
      "feature.node.kubernetes.io/kernel-version.minor": "14",
      "feature.node.kubernetes.io/kernel-version.revision": "0",
      "feature.node.kubernetes.io/memory-numa": "true",
      "feature.node.kubernetes.io/network-sriov.capable": "true",
      "feature.node.kubernetes.io/pci-102b.present": "true",
      "feature.node.kubernetes.io/pci-10de.present": "true",
      "feature.node.kubernetes.io/pci-10de.sriov.capable": "true",
      "feature.node.kubernetes.io/pci-15b3.present": "true",
      "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true",
      "feature.node.kubernetes.io/rdma.available": "true",
      "feature.node.kubernetes.io/rdma.capable": "true",
      "feature.node.kubernetes.io/storage-nonrotationaldisk": "true",
      "feature.node.kubernetes.io/system-os_release.ID": "rhcos",
      "feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION": "4.17",
      "feature.node.kubernetes.io/system-os_release.OSTREE_VERSION": "417.94.202409121747-0",
      "feature.node.kubernetes.io/system-os_release.RHEL_VERSION": "9.4",
      "feature.node.kubernetes.io/system-os_release.VERSION_ID": "4.17",
      "feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "4",
      "feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "17"
    }
  7. Confirm there is a network device that is discovered:

    Copy to Clipboard Toggle word wrap
    $ oc describe node | grep -E 'Roles|pci' | grep pci-15b3
                        feature.node.kubernetes.io/pci-15b3.present=true
                        feature.node.kubernetes.io/pci-15b3.sriov.capable=true
                        feature.node.kubernetes.io/pci-15b3.present=true
                        feature.node.kubernetes.io/pci-15b3.sriov.capable=true

5.5. Configuring the SR-IOV Operator

Single root I/O virtualization (SR-IOV) enhances the performance of NVIDIA GPUDirect RDMA by providing sharing across multiple pods from a single device.

Prerequisites

  • You have installed the SR-IOV Operator.

Procedure

  1. Validate that the Operator is installed and running by looking at the pods in the openshift-sriov-network-operator namespace by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc get pods -n openshift-sriov-network-operator

    Example output

    Copy to Clipboard Toggle word wrap
    NAME                                      READY   STATUS    RESTARTS   AGE
    sriov-network-operator-7cb6c49868-89486   1/1     Running   0          22s

  2. For the default SriovOperatorConfig CR to work with the MLNX_OFED container, run this command to update the following values:

    Copy to Clipboard Toggle word wrap
    apiVersion: sriovnetwork.openshift.io/v1
    kind: SriovOperatorConfig
    metadata:
      name: default
      namespace: openshift-sriov-network-operator
    spec:
      enableInjector: true
      enableOperatorWebhook: true
      logLevel: 2
  3. Create the resource on the cluster by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc create -f sriov-operator-config.yaml

    Example output

    Copy to Clipboard Toggle word wrap
    sriovoperatorconfig.sriovnetwork.openshift.io/default created

  4. Patch the sriov-operator so the MOFED container can work with it by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc patch sriovoperatorconfig default   --type=merge -n openshift-sriov-network-operator   --patch '{ "spec": { "configDaemonNodeSelector": { "network.nvidia.com/operator.mofed.wait": "false", "node-role.kubernetes.io/worker": "", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true" } } }'

    Example output

    Copy to Clipboard Toggle word wrap
    sriovoperatorconfig.sriovnetwork.openshift.io/default patched

5.6. Configuring the NVIDIA network Operator

The NVIDIA network Operator manages NVIDIA networking resources and networking related components such as drivers and device plugins to enable NVIDIA GPUDirect RDMA workloads.

Prerequisites

  • You have installed the NVIDIA network Operator.

Procedure

  1. Validate that the network Operator is installed and running by confirming the controller is running in the nvidia-network-operator namespace by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc get pods -n nvidia-network-operator

    Example output

    Copy to Clipboard Toggle word wrap
    NAME                                                          READY   STATUS             RESTARTS        AGE
    nvidia-network-operator-controller-manager-6f7d6956cd-fw5wg   1/1     Running            0                5m

  2. With the Operator running, create the NicClusterPolicy custom resource file. The device you choose depends on your system configuration. In this example, the Infiniband interface ibs2f0 is hard coded and is used as the shared NVIDIA GPUDirect RDMA device.

    Copy to Clipboard Toggle word wrap
    apiVersion: mellanox.com/v1alpha1
    kind: NicClusterPolicy
    metadata:
      name: nic-cluster-policy
    spec:
      nicFeatureDiscovery:
        image: nic-feature-discovery
        repository: ghcr.io/mellanox
        version: v0.0.1
      docaTelemetryService:
        image: doca_telemetry
        repository: nvcr.io/nvidia/doca
        version: 1.16.5-doca2.6.0-host
      rdmaSharedDevicePlugin:
        config: |
          {
            "configList": [
              {
                "resourceName": "rdma_shared_device_ib",
                "rdmaHcaMax": 63,
                "selectors": {
                  "ifNames": ["ibs2f0"]
                }
              },
              {
                "resourceName": "rdma_shared_device_eth",
                "rdmaHcaMax": 63,
                "selectors": {
                  "ifNames": ["ens8f0np0"]
                }
              }
            ]
          }
        image: k8s-rdma-shared-dev-plugin
        repository: ghcr.io/mellanox
        version: v1.5.1
      secondaryNetwork:
        ipoib:
          image: ipoib-cni
          repository: ghcr.io/mellanox
          version: v1.2.0
      nvIpam:
        enableWebhook: false
        image: nvidia-k8s-ipam
        repository: ghcr.io/mellanox
        version: v0.2.0
      ofedDriver:
        readinessProbe:
          initialDelaySeconds: 10
          periodSeconds: 30
        forcePrecompiled: false
        terminationGracePeriodSeconds: 300
        livenessProbe:
          initialDelaySeconds: 30
          periodSeconds: 30
        upgradePolicy:
          autoUpgrade: true
          drain:
            deleteEmptyDir: true
            enable: true
            force: true
            timeoutSeconds: 300
            podSelector: ''
          maxParallelUpgrades: 1
          safeLoad: false
          waitForCompletion:
            timeoutSeconds: 0
        startupProbe:
          initialDelaySeconds: 10
          periodSeconds: 20
        image: doca-driver
        repository: nvcr.io/nvidia/mellanox
        version: 24.10-0.7.0.0-0
        env:
        - name: UNLOAD_STORAGE_MODULES
          value: "true"
        - name: RESTORE_DRIVER_ON_POD_TERMINATION
          value: "true"
        - name: CREATE_IFNAMES_UDEV
          value: "true"
  3. Create the NicClusterPolicy custom resource on the cluster by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc create -f network-sharedrdma-nic-cluster-policy.yaml

    Example output

    Copy to Clipboard Toggle word wrap
    nicclusterpolicy.mellanox.com/nic-cluster-policy created

  4. Validate the NicClusterPolicy by running the following command in the DOCA/MOFED container:

    Copy to Clipboard Toggle word wrap
    $ oc get pods -n nvidia-network-operator

    Example output

    Copy to Clipboard Toggle word wrap
    NAME                                                          READY   STATUS    RESTARTS   AGE
    doca-telemetry-service-hwj65                                  1/1     Running   2          160m
    kube-ipoib-cni-ds-fsn8g                                       1/1     Running   2          160m
    mofed-rhcos4.16-9b5ddf4c6-ds-ct2h5                            2/2     Running   4          160m
    nic-feature-discovery-ds-dtksz                                1/1     Running   2          160m
    nv-ipam-controller-854585f594-c5jpp                           1/1     Running   2          160m
    nv-ipam-controller-854585f594-xrnp5                           1/1     Running   2          160m
    nv-ipam-node-xqttl                                            1/1     Running   2          160m
    nvidia-network-operator-controller-manager-5798b564cd-5cq99   1/1     Running   2         5d23h
    rdma-shared-dp-ds-p9vvg                                       1/1     Running   0          85m

  5. rsh into the mofed container to check the status by running the following command:

    Copy to Clipboard Toggle word wrap
    $ MOFED_POD=$(oc get pods -n nvidia-network-operator -o name | grep mofed)
    $ oc rsh -n nvidia-network-operator -c mofed-container ${MOFED_POD}
    sh-5.1# ofed_info -s

    Example output

    Copy to Clipboard Toggle word wrap
    OFED-internal-24.07-0.6.1:

    Copy to Clipboard Toggle word wrap
    sh-5.1# ibdev2netdev -v

    Example output

    Copy to Clipboard Toggle word wrap
    0000:0d:00.0 mlx5_0 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket                                                       fw 32.42.1000 port 1 (ACTIVE) ==> ibs2f0 (Up)
    0000:a0:00.0 mlx5_1 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket                                                       fw 32.42.1000 port 1 (ACTIVE) ==> ens8f0np0 (Up)

  6. Create a IPoIBNetwork custom resource file:

    Copy to Clipboard Toggle word wrap
    apiVersion: mellanox.com/v1alpha1
    kind: IPoIBNetwork
    metadata:
      name: example-ipoibnetwork
    spec:
      ipam: |
        {
          "type": "whereabouts",
          "range": "192.168.6.225/28",
          "exclude": [
           "192.168.6.229/30",
           "192.168.6.236/32"
          ]
        }
      master: ibs2f0
      networkNamespace: default
  7. Create the IPoIBNetwork resource on the cluster by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc create -f ipoib-network.yaml

    Example output

    Copy to Clipboard Toggle word wrap
    ipoibnetwork.mellanox.com/example-ipoibnetwork created

  8. Create a MacvlanNetwork custom resource file for your other interface:

    Copy to Clipboard Toggle word wrap
    apiVersion: mellanox.com/v1alpha1
    kind: MacvlanNetwork
    metadata:
      name: rdmashared-net
    spec:
      networkNamespace: default
      master: ens8f0np0
      mode: bridge
      mtu: 1500
      ipam: '{"type": "whereabouts", "range": "192.168.2.0/24", "gateway": "192.168.2.1"}'
  9. Create the resource on the cluster by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc create -f macvlan-network.yaml

    Example output

    Copy to Clipboard Toggle word wrap
    macvlannetwork.mellanox.com/rdmashared-net created

5.7. Configuring the GPU Operator

The GPU Operator automates the management of the NVIDIA drivers, device plugins for GPUs, the NVIDIA Container Toolkit, and other components required for GPU provisioning.

Prerequisites

  • You have installed the GPU Operator.

Procedure

  1. Check that the Operator pod is running to look at the pods under the namespace by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc get pods -n nvidia-gpu-operator

    Example output

    Copy to Clipboard Toggle word wrap
    NAME                          READY   STATUS    RESTARTS   AGE
    gpu-operator-b4cb7d74-zxpwq   1/1     Running   0          32s

  2. Create a GPU cluster policy custom resource file similar to the following example:

    Copy to Clipboard Toggle word wrap
    apiVersion: nvidia.com/v1
    kind: ClusterPolicy
    metadata:
      name: gpu-cluster-policy
    spec:
      vgpuDeviceManager:
        config:
          default: default
        enabled: true
      migManager:
        config:
          default: all-disabled
          name: default-mig-parted-config
        enabled: true
      operator:
        defaultRuntime: crio
        initContainer: {}
        runtimeClass: nvidia
        use_ocp_driver_toolkit: true
      dcgm:
        enabled: true
      gfd:
        enabled: true
      dcgmExporter:
        config:
          name: ''
        serviceMonitor:
          enabled: true
        enabled: true
      cdi:
        default: false
        enabled: false
      driver:
        licensingConfig:
          nlsEnabled: true
          configMapName: ''
        certConfig:
          name: ''
        rdma:
          enabled: false
        kernelModuleConfig:
          name: ''
        upgradePolicy:
          autoUpgrade: true
          drain:
            deleteEmptyDir: false
            enable: false
            force: false
            timeoutSeconds: 300
          maxParallelUpgrades: 1
          maxUnavailable: 25%
          podDeletion:
            deleteEmptyDir: false
            force: false
            timeoutSeconds: 300
          waitForCompletion:
            timeoutSeconds: 0
        repoConfig:
          configMapName: ''
        virtualTopology:
          config: ''
        enabled: true
        useNvidiaDriverCRD: false
        useOpenKernelModules: true
      devicePlugin:
        config:
          name: ''
          default: ''
        mps:
          root: /run/nvidia/mps
        enabled: true
      gdrcopy:
        enabled: true
      kataManager:
        config:
          artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
      mig:
        strategy: single
      sandboxDevicePlugin:
        enabled: true
      validator:
        plugin:
          env:
            - name: WITH_WORKLOAD
              value: 'false'
      nodeStatusExporter:
        enabled: true
      daemonsets:
        rollingUpdate:
          maxUnavailable: '1'
        updateStrategy: RollingUpdate
      sandboxWorkloads:
        defaultWorkload: container
        enabled: false
      gds:
        enabled: true
        image: nvidia-fs
        version: 2.20.5
        repository: nvcr.io/nvidia/cloud-native
      vgpuManager:
        enabled: false
      vfioManager:
        enabled: true
      toolkit:
        installDir: /usr/local/nvidia
        enabled: true
  3. When the GPU ClusterPolicy custom resource has generated, create the resource on the cluster by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc create -f gpu-cluster-policy.yaml

    Example output

    Copy to Clipboard Toggle word wrap
    clusterpolicy.nvidia.com/gpu-cluster-policy created

  4. Validate that the Operator is installed and running by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc get pods -n nvidia-gpu-operator

    Example output

    Copy to Clipboard Toggle word wrap
    NAME                                                  READY   STATUS      RESTARTS   AGE
    gpu-feature-discovery-d5ngn                           1/1     Running     0          3m20s
    gpu-feature-discovery-z42rx                           1/1     Running     0          3m23s
    gpu-operator-6bb4d4b4c5-njh78                         1/1     Running     0          4m35s
    nvidia-container-toolkit-daemonset-bkh8l              1/1     Running     0          3m20s
    nvidia-container-toolkit-daemonset-c4hzm              1/1     Running     0          3m23s
    nvidia-cuda-validator-4blvg                           0/1     Completed   0          106s
    nvidia-cuda-validator-tw8sl                           0/1     Completed   0          112s
    nvidia-dcgm-exporter-rrw4g                            1/1     Running     0          3m20s
    nvidia-dcgm-exporter-xc78t                            1/1     Running     0          3m23s
    nvidia-dcgm-nvxpf                                     1/1     Running     0          3m20s
    nvidia-dcgm-snj4j                                     1/1     Running     0          3m23s
    nvidia-device-plugin-daemonset-fk2xz                  1/1     Running     0          3m23s
    nvidia-device-plugin-daemonset-wq87j                  1/1     Running     0          3m20s
    nvidia-driver-daemonset-416.94.202410211619-0-ngrjg   4/4     Running     0          3m58s
    nvidia-driver-daemonset-416.94.202410211619-0-tm4x6   4/4     Running     0          3m58s
    nvidia-node-status-exporter-jlzxh                     1/1     Running     0          3m57s
    nvidia-node-status-exporter-zjffs                     1/1     Running     0          3m57s
    nvidia-operator-validator-l49hx                       1/1     Running     0          3m20s
    nvidia-operator-validator-n44nn                       1/1     Running     0          3m23s

  5. Optional: When you have verified the pods are running, remote shell into the NVIDIA driver daemonset pod and confirm that the NVIDIA modules are loaded. Specifically, ensure the nvidia_peermem is loaded.

    Copy to Clipboard Toggle word wrap
    $ oc rsh -n nvidia-gpu-operator $(oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver)
    sh-4.4# lsmod|grep nvidia

    Example output

    Copy to Clipboard Toggle word wrap
    nvidia_fs             327680  0
    nvidia_peermem         24576  0
    nvidia_modeset       1507328  0
    video                  73728  1 nvidia_modeset
    nvidia_uvm           6889472  8
    nvidia               8810496  43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
    ib_uverbs             217088  3 nvidia_peermem,rdma_ucm,mlx5_ib
    drm                   741376  5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200

  6. Optional: Run the nvidia-smi utility to show the details about the driver and the hardware:
Copy to Clipboard Toggle word wrap
sh-4.4# nvidia-smi

+ .Example output

Copy to Clipboard Toggle word wrap
Wed Nov  6 22:03:53 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     On  |   00000000:61:00.0 Off |                    0 |
|  0%   37C    P0             88W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A40                     On  |   00000000:E1:00.0 Off |                    0 |
|  0%   28C    P8             29W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
  1. While you are still in the driver pod, set the GPU clock to maximum using the nvidia-smi command:

    Copy to Clipboard Toggle word wrap
    $ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202410172137-0-ndhzc
    sh-4.4# nvidia-smi -i 0 -lgc $(nvidia-smi -i 0 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)

    Example output

    Copy to Clipboard Toggle word wrap
    GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:61:00.0
    All done.

    Copy to Clipboard Toggle word wrap
    sh-4.4# nvidia-smi -i 1 -lgc $(nvidia-smi -i 1 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)

    Example output

    Copy to Clipboard Toggle word wrap
    GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:E1:00.0
    All done.

  2. Validate the resource is available from a node describe perspective by running the following command:

    Copy to Clipboard Toggle word wrap
    $ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A9

    Example output

    Copy to Clipboard Toggle word wrap
    Capacity:
      cpu:                          128
      ephemeral-storage:            1561525616Ki
      hugepages-1Gi:                0
      hugepages-2Mi:                0
      memory:                       263596712Ki
      nvidia.com/gpu:               2
      pods:                         250
      rdma/rdma_shared_device_eth:  63
      rdma/rdma_shared_device_ib:   63
    Allocatable:
      cpu:                          127500m
      ephemeral-storage:            1438028263499
      hugepages-1Gi:                0
      hugepages-2Mi:                0
      memory:                       262445736Ki
      nvidia.com/gpu:               2
      pods:                         250
      rdma/rdma_shared_device_eth:  63
      rdma/rdma_shared_device_ib:   63
    --
    Capacity:
      cpu:                          128
      ephemeral-storage:            1561525616Ki
      hugepages-1Gi:                0
      hugepages-2Mi:                0
      memory:                       263596672Ki
      nvidia.com/gpu:               2
      pods:                         250
      rdma/rdma_shared_device_eth:  63
      rdma/rdma_shared_device_ib:   63
    Allocatable:
      cpu:                          127500m
      ephemeral-storage:            1438028263499
      hugepages-1Gi:                0
      hugepages-2Mi:                0
      memory:                       262445696Ki
      nvidia.com/gpu:               2
      pods:                         250
      rdma/rdma_shared_device_eth:  63
      rdma/rdma_shared_device_ib:   63

Back to top
Red Hat logoGithubredditYoutubeTwitter

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

We help Red Hat users innovate and achieve their goals with our products and services with content they can trust. Explore our recent updates.

Making open source more inclusive

Red Hat is committed to replacing problematic language in our code, documentation, and web properties. For more details, see the Red Hat Blog.

About Red Hat

We deliver hardened solutions that make it easier for enterprises to work across platforms and environments, from the core datacenter to the network edge.

Theme

© 2025 Red Hat, Inc.