Chapter 3. Configuring RoCE networking for distributed LLM deployments
High-performance distributed large language model (LLM) deployments require low-latency, high-bandwidth GPU-to-GPU communication across pods. RoCE with GPUDirect RDMA (GDR) enables direct memory access between GPUs without CPU involvement.
RoCE (RDMA over Converged Ethernet) is a network protocol that enables RDMA communication over Ethernet networks. When combined with NVIDIA GPUDirect RDMA, it provides:
- High bandwidth: Provides up to 400 Gbps on modern network infrastructure.
- Low latency: Delivers sub-microsecond latency for GPU-to-GPU transfers.
- CPU offload: Transfers data directly between GPUs without CPU involvement.
- Scalability: Supports efficient multi-node distributed inference and training.
RoCE networking is useful for:
- Disaggregated prefill and decode serving
- Separates initial token generation, or prefill, from subsequent generation, or decode, across different GPU pools. RoCE enables high-speed KV cache transfers between stages, improving throughput and resource use.
- Wide Expert Parallel (WideEP)
- Distributes expert layers of Mixture-of-Experts (MoE) models such as Mixtral across multiple GPUs and nodes. RoCE provides the low-latency communication required for expert routing and token processing.
- Multi-node tensor parallelism
- Splits large models across multiple nodes with tensor parallelism. RoCE reduces communication overhead for all-reduce operations during inference.
- Distributed training
- Enables efficient gradient synchronization across nodes for large-scale model training.
You can use the following components to configure an OpenShift cluster for RoCE workloads:
- NVIDIA GPU Operator: Manages GPU drivers, device plugins, and monitoring
- Node Feature Discovery (NFD) Operator: Detects hardware capabilities on cluster nodes
- SR-IOV Network Operator (bare metal) or network-attachment-definitions (IBM Cloud): Configures secondary high-speed networks
Distributed Inference with llm-d uses the following software libraries to enable high-performance distributed inference:
- NCCL (NVIDIA Collective Communications Library): Provides optimized multi-GPU communication primitives
- NVIDIA Inference Xfer Library (NIXL): Enables KV Cache transfers across vLLM Pods using RoCE
- NVSHMEM (NVIDIA OpenSHMEM Library): Used by DeepEP kernels for WideEP deployments of large MoE models
3.1. Enabling RoCE networking for distributed LLM deployments
Configure GPUDirect RDMA (GDR) over RDMA over Converged Ethernet (RoCE) to enable high-speed, low-latency GPU-to-GPU communication across pods for distributed large language model (LLM) deployments using Distributed Inference Server with llm-d.
This procedure guides you through configuring your OpenShift cluster to support RoCE networking for distributed LLM workloads.
RoCE networking for distributed LLM deployments is currently available in Red Hat OpenShift AI as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- You have access to an OpenShift cluster running version 4.12 or later.
- Your cluster nodes have NVIDIA GPUs with GPUDirect RDMA support (Pascal architecture or later).
- Your cluster has high-speed network interfaces that support RoCE (100 Gbps or higher is recommended).
- You have installed the OpenShift CLI (oc).
- You have installed OpenShift AI and enabled the single-model serving platform.
- Your network infrastructure supports RDMA (InfiniBand or Ethernet with RoCE/iWARP).
- You understand your deployment environment: IBM Cloud, bare metal, or other cloud providers.
Procedure
Install the Node Feature Discovery (NFD) Operator to detect hardware features on your cluster nodes:
- In the OpenShift web console, navigate to Operators → OperatorHub.
- Search for Node Feature Discovery Operator.
- Click Install and accept the default settings.
- Wait for the operator installation to complete.
- In the OpenShift web console, navigate to Operators → Installed Operators and select the Node Feature Discovery Operator.

Create an NFD instance to enable feature discovery:

apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  operand:
    image: quay.io/openshift/origin-node-feature-discovery:4.12
    imagePullPolicy: Always
  workerConfig:
    configData: |
      sources:
        pci:
          deviceClassWhitelist:
            - "03"
            - "0200"
          deviceLabelFields:
            - "vendor"
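After the NodeFeatureDiscovery instance is running, you can optionally confirm that NFD has labeled your GPU nodes. This spot check is not part of the formal procedure; the label below uses the NVIDIA PCI vendor ID (10de) that NFD applies for detected PCI devices when deviceLabelFields is set to vendor, as in the example above:

$ oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true

The output lists the nodes where NFD detected NVIDIA PCI devices.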
Install the NVIDIA GPU Operator:

- In the OpenShift web console, navigate to Operators → OperatorHub.
- Search for NVIDIA GPU Operator.
- Click Install and select the appropriate update channel.
- Choose the installation namespace, for example, nvidia-gpu-operator.
- Click Install and wait for the installation to complete.
- In the OpenShift web console, navigate to Operators → Installed Operators and select the NVIDIA GPU Operator.

Create a ClusterPolicy custom resource to configure the NVIDIA GPU Operator with RDMA support:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
  namespace: nvidia-gpu-operator
spec:
  # --- General Operator Settings ---
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  # --- Driver & Licensing ---
  driver:
    enabled: true
    licensingConfig:
      configMapName: 'nvidia-licensing-config'
      nlsEnabled: true
    kernelModuleType: auto
    certConfig:
      name: ''
    rdma:
      enabled: true        # Enables nvidia-peermem
      useHostMofed: false  # Requires Network Operator
    useNvidiaDriverCRD: false
    kernelModuleConfig:
      name: 'kernel-module-params'
    usePrecompiled: false
    repoConfig:
      configMapName: ''
    upgradePolicy:
      autoUpgrade: true
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
  # --- Monitoring ---
  dcgm:
    enabled: true
  dcgmExporter:
    enabled: true
    serviceMonitor:
      enabled: true
    config:
      name: ''
  nodeStatusExporter:
    enabled: true
  # --- Device Plugins & Topology ---
  devicePlugin:
    enabled: true
    config:
      default: ''
      name: ''
    mps:
      root: /run/nvidia/mps
  sandboxDevicePlugin:
    enabled: true
  virtualTopology:
    config: ''
  # --- Advanced Features (MIG, RDMA, GDR) ---
  mig:
    strategy: single
  migManager:
    enabled: true
  vgpuDeviceManager:
    enabled: true
  gdrcopy:
    enabled: true
  gfd:
    enabled: true
  vfioManager:
    enabled: true
  # --- Toolkit ---
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
  # --- Updates & Validation ---
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  validator:
    plugin:
      env: []
  # --- Disabled Components ---
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: false
  vgpuManager:
    enabled: false

where:
- driver.rdma.useHostMofed
- Set to false. The NVIDIA Network Operator manages the MOFED drivers (configured in later steps).
- gdrcopy.enabled
- Specifies whether to enable GDRCopy, which provides additional performance optimizations for GPU-to-GPU transfers and is required for low-latency memory copying. Set to true if your environment supports it.
  Note: The performance impact of enabling or disabling GDRCopy depends on your specific workload and hardware configuration. Testing is recommended to determine the optimal setting for your use case.
- driver.kernelModuleConfig.name
- Specifies the name of the ConfigMap that contains custom driver settings. The example above references kernel-module-params, which is required for DeepEP support.
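Before continuing, you can optionally confirm that the GPU Operator has finished reconciling the ClusterPolicy. This is a suggested check using the ClusterPolicy status field; the resource name matches the example above:

$ oc get clusterpolicy cluster-policy -o jsonpath='{.status.state}'

The command prints ready when all operands, including the driver with RDMA support, are deployed.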
Create a ConfigMap with custom driver settings required by DeepEP:
DeepEP requires specific NVIDIA driver settings to enable advanced peer-to-peer memory operations. For more information about customizing nvidia.conf values, see the NVIDIA GPU Operator documentation on GPUDirect RDMA configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kernel-module-params
  namespace: nvidia-gpu-operator
data:
  nvidia.conf: |
    NVreg_EnableStreamMemOPs=1
    NVreg_RegistryDwords="PeerMappingOverride=1;"

where:
- NVreg_EnableStreamMemOPs
- Enables stream memory operations for improved GPU-to-GPU communication performance.
- NVreg_RegistryDwords
- Configures additional driver registry settings. PeerMappingOverride=1 enables peer mapping for GPUDirect RDMA.
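After the driver pods restart with the new ConfigMap, you can optionally verify that the module parameters took effect on a node. This is a suggested spot check; the params file is exposed by the loaded NVIDIA kernel module:

$ oc debug node/<node-name>
sh-4.4# chroot /host
sh-4.4# grep -E 'EnableStreamMemOPs|RegistryDwords' /proc/driver/nvidia/params

The output shows the parameter values currently applied by the driver.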
Configure secondary networks for RoCE based on your deployment environment:
For IBM Cloud deployments:
IBM Cloud provides cluster network support for NVIDIA accelerated computing. Create a NetworkAttachmentDefinition for each secondary network interface using the host-device CNI plugin.
Important: Use the host-device CNI plugin to attach the full host interface to pods. Alternative CNI plugins such as ipvlan and macvlan might not work on cloud platforms without special configuration of routing rules, because cloud routing rules are likely to block their traffic by default.
Note: Only one pod per node can use each interface, similar to GPU allocation. You must create one NetworkAttachmentDefinition per secondary network interface.
apiVersion: "k8s.cni.cncf.io/v1" kind: NetworkAttachmentDefinition metadata: name: "dhcp-host-device-port-1" namespace: <your-namespace> spec: config: '{ "cniVersion": "0.3.1", "name": "dhcp-host-device-port-1", "plugins": [ { "type": "host-device", "device": "enp163s0", "isRdma": true, "ipam": { "type": "dhcp" } }, { "type": "tuning", "name": "mytuning", "mtu": 9000 } ] }'where:
- device
- Specifies the host network device name. Replace enp163s0 with your actual device name. You must create a separate NetworkAttachmentDefinition for each secondary network interface, each with the appropriate device name.
- isRdma
- Set to true to enable RDMA support for RoCE.
- mtu
- Specifies the maximum transmission unit size. A value of 9000 (jumbo frames) is recommended for high-performance workloads.
To attach the network interfaces to your pods, you must use pod annotations that reference the NetworkAttachmentDefinitions. The following example shows annotations for all 8 high-speed secondary network interfaces:
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name":"dhcp-host-device-port-1", "namespace": "<your-namespace>"},
        {"name":"dhcp-host-device-port-2", "namespace": "<your-namespace>"},
        {"name":"dhcp-host-device-port-3", "namespace": "<your-namespace>"},
        {"name":"dhcp-host-device-port-4", "namespace": "<your-namespace>"},
        {"name":"dhcp-host-device-port-5", "namespace": "<your-namespace>"},
        {"name":"dhcp-host-device-port-6", "namespace": "<your-namespace>"},
        {"name":"dhcp-host-device-port-7", "namespace": "<your-namespace>"},
        {"name":"dhcp-host-device-port-8", "namespace": "<your-namespace>"}
      ]

Replace <your-namespace> with the namespace where you created the NetworkAttachmentDefinitions.

For more information, see the IBM Cloud cluster network documentation.
For bare metal deployments:
Configure SR-IOV (Single Root I/O Virtualization) for high-performance network interfaces. Install the SR-IOV Network Operator from OperatorHub.
Create an SriovNetworkNodePolicy to configure the network interfaces:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: roce-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: rocenicresource
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  priority: 10
  numVfs: 8
  nicSelector:
    vendor: "15b3"
    deviceID: "1017"
  deviceType: netdevice
  isRdma: true

where:
- nicSelector.vendor
- Specifies the Mellanox/NVIDIA vendor ID. Adjust for your network card vendor.
- nicSelector.deviceID
- Specifies the device ID for your specific network card model.
- isRdma
- Set to true to enable RDMA support for RoCE.
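If you are unsure which vendor and device IDs apply to your NICs, you can look them up on a node. This is an illustrative example; it assumes lspci is available on the host and that your NICs are Mellanox/NVIDIA cards:

$ oc debug node/<node-name>
sh-4.4# chroot /host
sh-4.4# lspci -nn | grep -i mellanox

The bracketed value in the output, for example [15b3:1017], contains the vendor and device IDs to use in the nicSelector.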
Create an SriovNetwork to attach the RDMA-enabled network to pods:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: roce-network
  namespace: openshift-sriov-network-operator
spec:
  resourceName: rocenicresource
  networkNamespace: <your-namespace>
  ipam: |
    {
      "type": "host-local",
      "subnet": "192.168.100.0/24",
      "rangeStart": "192.168.100.10",
      "rangeEnd": "192.168.100.100",
      "gateway": "192.168.100.1"
    }
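You can optionally confirm that the SR-IOV Network Operator has finished configuring the virtual functions by checking the node state resources that the operator maintains:

$ oc get sriovnetworknodestates -n openshift-sriov-network-operator

Each node's syncStatus reports Succeeded when the requested VFs have been created.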
Verify that the RDMA devices are available on your nodes:
$ oc debug node/<node-name> sh-4.4# chroot /host sh-4.4# ls -l /dev/infiniband/The output lists available RDMA devices, for example uverbs0 or uverbs1. If no devices are listed, verify that the GPU Operator is running and that your nodes have RDMA-capable hardware.
Label the nodes that have RDMA capabilities:
$ oc label node <node-name> network.nvidia.com/roce=trueConfigure your pod to use the RoCE network by adding network annotations to your InferenceService or deployment:
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llm-with-roce
  annotations:
    k8s.v1.cni.cncf.io/networks: roce-network
spec:
  replicas: 2
  model:
    uri: hf://meta-llama/Meta-Llama-3-70B
    name: llama-3-70b
  router:
    template:
      spec:
        containers:
          - name: main
            resources:
              limits:
                cpu: '8'
                memory: 64Gi
                nvidia.com/gpu: "2"
                rdma/roce: "1"
            env:
              - name: NCCL_IB_DISABLE
                value: "0"
              - name: NCCL_NET_GDR_LEVEL
                value: "5"
              - name: NCCL_DEBUG
                value: "INFO"

where:
- k8s.v1.cni.cncf.io/networks
- Specifies the RoCE secondary network to attach to the pod.
- rdma/roce
- Specifies the RDMA resources to request. The resource name depends on your SR-IOV or network configuration.
- NCCL_IB_DISABLE
- Set to "0" to enable InfiniBand/RoCE for NCCL (NVIDIA Collective Communications Library).
- NCCL_NET_GDR_LEVEL
- Specifies the GPUDirect RDMA level (0-5). The value defines the maximum topological distance between the NIC and the GPU at which GDR is used; 5 permits the widest use of GDR.
- NCCL_DEBUG
- Sets the NCCL log verbosity level. Set to INFO for troubleshooting; use WARN or remove in production.
Verification
To verify that RoCE networking is properly configured and functioning:
Check that the GPU Operator pods are running:
$ oc get pods -n nvidia-gpu-operator

All pods are in the Running state when the GPU Operator is properly configured.

Verify that RDMA devices are detected:

$ oc get nodes -l network.nvidia.com/roce=true

The output lists all RDMA-capable nodes.
Test RDMA connectivity between pods using ib_write_bw or rping:

# On the first pod (server)
$ oc exec -it <pod-1> -- ib_write_bw -d <rdma-device>

# On the second pod (client)
$ oc exec -it <pod-2> -- ib_write_bw -d <rdma-device> <server-ip>

The output displays bandwidth measurements indicating successful RDMA communication.
Check NCCL communication in your LLM deployment logs:
$ oc logs <llm-pod-name> | grep NCCLLook for messages indicating successful NCCL initialization with RDMA transport:
NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE NCCL INFO Using network RoCERun a distributed inference request to verify end-to-end functionality:
$ curl -X POST http://<inference-endpoint>/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "llama-3-70b", "messages": [{"role": "user", "content": "Explain RoCE networking"}], "max_tokens": 100 }'Monitor the response time and check logs for RDMA activity.
3.2. Optimizing RoCE performance for LLM deployments
Optimize your RoCE deployment for maximum performance with network tuning and model serving best practices.
3.2.1. Network tuning
For optimal RoCE performance, apply the following network settings; a pod-level NCCL tuning sketch follows this list:
- Enable Priority Flow Control (PFC) on network switches to ensure lossless Ethernet traffic.
- Configure ECN (Explicit Congestion Notification) for RoCE v2.
- Use dedicated VLANs for RDMA traffic to isolate from other workloads.
- Set the MTU size to 9000 bytes to enable jumbo frames.
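At the pod level, you can complement the switch-side tuning with NCCL environment variables. The following values are illustrative assumptions for a typical RoCE v2 fabric, not universal defaults; verify the GID index, traffic class, and interface name against your environment:

env:
  - name: NCCL_IB_GID_INDEX   # GID index that maps to RoCE v2 on many Mellanox NICs (assumed value)
    value: "3"
  - name: NCCL_IB_TC          # traffic class matching the lossless, PFC-enabled queue (assumed value)
    value: "106"
  - name: NCCL_SOCKET_IFNAME  # interface NCCL uses for bootstrap traffic (assumed name)
    value: "eth0"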
3.2.2. Model serving optimization
To optimize model serving, consider the following techniques; a vLLM sketch follows this list:
- Use quantization. FP8 or INT8 quantization reduces memory usage and bandwidth requirements.
- Tune batch sizes. Larger batch sizes improve GPU utilization but increase latency.
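As a sketch of how these options map onto a vLLM-based server, you might pass quantization and batching settings as container arguments. Flag availability depends on your vLLM version and on model and hardware support, so treat the values below as assumptions to validate:

args:
  - --model=meta-llama/Meta-Llama-3-70B
  - --quantization=fp8    # FP8 quantization, if supported by the model and hardware (assumed)
  - --max-num-seqs=256    # upper bound on concurrently batched sequences; tune for latency vs. throughput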
3.3. Next steps
- Experiment with different parallelization strategies for your specific models
- Monitor performance metrics to optimize configuration
- Scale your deployment based on workload requirements