Chapter 7. Enabling accelerators


7.1. Enabling NVIDIA GPUs

Before you can use NVIDIA GPUs in OpenShift AI, you must install the NVIDIA GPU Operator.

Important

If you are using OpenShift AI in a disconnected self-managed environment, see Enabling accelerators instead.

Prerequisites

  • You have logged in to your OpenShift cluster.
  • You have the cluster-admin role in your OpenShift cluster.
  • You have installed an NVIDIA GPU and confirmed that it is detected in your environment.

Procedure

  1. To enable GPU support on an OpenShift cluster, follow the instructions in NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.

    Important

    After you install the Node Feature Discovery (NFD) Operator, you must create an instance of NodeFeatureDiscovery. In addition, after you install the NVIDIA GPU Operator, you must create a ClusterPolicy and populate it with default values.
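
    To confirm that both instances exist before you continue, you can query them from the CLI. A minimal check, assuming the NFD Operator is installed in the default openshift-nfd namespace:

      oc get nodefeaturediscovery -n openshift-nfd
      oc get clusterpolicy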

  2. Delete the migration-gpu-status ConfigMap.

    1. In the OpenShift web console, switch to the Administrator perspective.
    2. Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate ConfigMap.
    3. Search for the migration-gpu-status ConfigMap.
    4. Click the action menu (⋮) and select Delete ConfigMap from the list.

      The Delete ConfigMap dialog appears.

    5. Inspect the dialog and confirm that you are deleting the correct ConfigMap.
    6. Click Delete.
  3. Restart the rhods-dashboard deployment.

    1. In the OpenShift web console, switch to the Administrator perspective.
    2. Click Workloads → Deployments.
    3. Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate deployment.
    4. Search for the rhods-dashboard deployment.
    5. Click the action menu (⋮) and select Restart Rollout from the list.
    6. Wait until the Status column indicates that all pods in the rollout have fully restarted.
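
Alternatively, you can perform steps 2 and 3 from the OpenShift CLI. A minimal sketch, assuming the default redhat-ods-applications project:

    oc delete configmap migration-gpu-status -n redhat-ods-applications
    oc rollout restart deployment/rhods-dashboard -n redhat-ods-applications
    oc rollout status deployment/rhods-dashboard -n redhat-ods-applications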

Verification

  • The reset migration-gpu-status instance is present on the Instances tab on the AcceleratorProfile custom resource definition (CRD) details page.
  • From the Administrator perspective, go to the Operators → Installed Operators page. Confirm that the following Operators appear:

    • NVIDIA GPU
    • Node Feature Discovery (NFD)
    • Kernel Module Management (KMM)
  • The GPU is correctly detected a few minutes after full installation of the Node Feature Discovery (NFD) and NVIDIA GPU Operators. The OpenShift command line interface (CLI) displays the appropriate output for the GPU worker node. For example:

    # Expected output when the GPU is detected properly
    oc describe node <node_name>
    ...
    Capacity:
      cpu:                4
      ephemeral-storage:  313981932Ki
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             16076568Ki
      nvidia.com/gpu:     1
      pods:               250
    Allocatable:
      cpu:                3920m
      ephemeral-storage:  288292006229
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             12828440Ki
      nvidia.com/gpu:     1
      pods:               250
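  • Optional: You can also list nodes by the GPU label that Node Feature Discovery applies. A quick check; the exact labels depend on your NFD and GPU Operator configuration:

    oc get nodes -l nvidia.com/gpu.present=true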
Note

In OpenShift AI, Red Hat supports the use of accelerators within the same cluster only.

Starting with Red Hat OpenShift AI 2.19, NVIDIA GPUs can communicate directly with each other by using NVIDIA GPUDirect RDMA over either Ethernet or InfiniBand networks. In this OpenShift AI release, Red Hat does not support remote direct memory access (RDMA) or cross-network communication for other accelerator types.

After installing the NVIDIA GPU Operator, create a hardware profile as described in Working with hardware profiles.

7.2. Intel Gaudi AI Accelerator integration

To accelerate your high-performance deep learning models, you can integrate Intel Gaudi AI accelerators into OpenShift AI. This integration enables your data scientists to use Gaudi libraries and software associated with Intel Gaudi AI accelerators through custom-configured workbench instances.

Intel Gaudi AI accelerators offer optimized performance for deep learning workloads, with the latest Gaudi 3 devices providing significant improvements in training speed and energy efficiency. These accelerators are suitable for enterprises running machine learning and AI applications on OpenShift AI.

Before you can enable Intel Gaudi AI accelerators in OpenShift AI, you must complete the following steps:

  1. Install the latest version of the Intel Gaudi AI Accelerator Operator from OperatorHub.
  2. Create and configure a custom workbench image for Intel Gaudi AI accelerators. A prebuilt workbench image for Gaudi accelerators is not included in OpenShift AI.
  3. Manually define and configure a hardware profile for each Intel Gaudi AI device in your environment.

Red Hat supports Intel Gaudi devices up to Intel Gaudi 3. The Intel Gaudi 3 accelerators, in particular, offer the following benefits:

  • Improved training throughput: Reduce the time required to train large models by using advanced tensor processing cores and increased memory bandwidth.
  • Energy efficiency: Lower power consumption while maintaining high performance, reducing operational costs for large-scale deployments.
  • Scalable architecture: Scale across multiple nodes for distributed training configurations.

To use Intel Gaudi AI accelerators in an Amazon EC2 DL1 instance, your OpenShift platform must support EC2 DL1 instances. After you enable the accelerators, create a custom workbench image, and configure the hardware profile, you can use Intel Gaudi AI accelerators in workbench instances or for model serving.

To identify the Intel Gaudi AI accelerators present in your deployment, use the lspci utility. For more information, see lspci(8) - Linux man page.
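
For example, Intel Gaudi devices report the Habana Labs vendor string, so filtering the lspci output for that string can reveal them. A sketch; device names vary by Gaudi generation:

    lspci | grep -i habana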

Important

The presence of Intel Gaudi AI accelerators in your deployment, as indicated by the lspci utility, does not guarantee that the devices are ready to use. You must ensure that all installation and configuration steps are completed successfully.

7.2.1. Enabling Intel Gaudi AI accelerators

Before you can use Intel Gaudi AI accelerators in OpenShift AI, you must install the required dependencies, deploy the Intel Gaudi AI Accelerator Operator, and configure the environment.

Prerequisites

  • You have logged in to OpenShift.
  • You have the cluster-admin role in OpenShift.
  • You have installed your Intel Gaudi accelerator and confirmed that it is detected in your environment.
  • Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
  • You have installed the OpenShift command-line interface (CLI).

Procedure

  1. Install the latest version of the Intel Gaudi AI Accelerator Operator, as described in Intel Gaudi AI Operator OpenShift installation.
  2. By default, OpenShift sets a per-pod PID limit of 4096. If your workload spawns many processes, such as when you use multiple Gaudi accelerators or run vLLM with Ray, you must manually increase the per-pod PID limit to avoid Resource temporarily unavailable errors caused by PID exhaustion. Red Hat recommends setting this limit to 32768, although values over 20000 are usually sufficient.

    1. Label the MachineConfigPool that contains the node. The machineConfigPoolSelector in the KubeletConfig resource matches labels on the pool rather than on individual nodes. For example, for the worker pool:

      oc label machineconfigpool worker custom-kubelet=set-pod-pid-limit-kubelet
    2. Optional: To prevent workload distribution on the affected node, you can mark the node as unschedulable and then drain it in preparation for maintenance. For more information, see Understanding how to evacuate pods on nodes.
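
      For example (a sketch; adjust the drain options to suit your workloads):

      oc adm cordon <node_name>
      oc adm drain <node_name> --ignore-daemonsets --delete-emptydir-data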
    3. Create a custom-kubelet-pidslimit.yaml KubeletConfig resource file and populate it with the following YAML code. Set the podPidsLimit value to 32768:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: custom-kubelet-pidslimit
      spec:
        kubeletConfig:
          podPidsLimit: 32768
        machineConfigPoolSelector:
          matchLabels:
            custom-kubelet: set-pod-pid-limit-kubelet
    4. Apply the configuration:

      oc apply -f custom-kubelet-pidslimit.yaml

      This operation causes the node to reboot. For more information, see Understanding node rebooting.
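
      You can track the reboot by watching the MachineConfigPool status until the pool reports that it is updated. A quick check:

      oc get machineconfigpool --watch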

    5. Optional: If you previously marked the node as unschedulable, you can allow scheduling again after the node reboots.
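
      For example:

      oc adm uncordon <node_name>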
  3. Create a custom workbench image for Intel Gaudi AI accelerators, as described in Creating custom workbench images.
  4. After installing the Intel Gaudi AI Accelerator Operator, create a hardware profile, as described in Working with hardware profiles.

Verification

From the Administrator perspective, go to the Operators → Installed Operators page. Confirm that the following Operators appear:

  • Intel Gaudi AI Accelerator
  • Node Feature Discovery (NFD)
  • Kernel Module Management (KMM)
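
You can also verify the installations from the CLI by listing the ClusterServiceVersions. A quick check:

    oc get csv -A | grep -Ei 'gaudi|nfd|kernel'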

7.3. AMD GPU integration

You can use AMD GPUs with OpenShift AI to accelerate AI and machine learning (ML) workloads. AMD GPUs provide high-performance compute capabilities, allowing users to process large data sets, train deep neural networks, and perform complex inference tasks more efficiently.

Integrating AMD GPUs with OpenShift AI involves the following components:

  • ROCm workbench images: Use the ROCm workbench images to streamline AI/ML workflows on AMD GPUs. These images include libraries and frameworks optimized with the AMD ROCm platform, enabling high-performance workloads for PyTorch and TensorFlow. The pre-configured images reduce setup time and provide an optimized environment for GPU-accelerated development and experimentation.
  • AMD GPU Operator: The AMD GPU Operator simplifies GPU integration by automating driver installation, device plugin setup, and node labeling for GPU resource management. It ensures compatibility between OpenShift and AMD hardware while enabling scaling of GPU-enabled workloads.

7.3.1. Verifying AMD GPU availability on your cluster

Before you proceed with the AMD GPU Operator installation process, you can verify the presence of an AMD GPU device on a node within your OpenShift cluster. You can use commands such as lspci or oc to confirm hardware and resource availability.

Prerequisites

  • You have administrative access to the OpenShift cluster.
  • You have a running OpenShift cluster with a node equipped with an AMD GPU.
  • You have access to the OpenShift CLI (oc) and terminal access to the node.

Procedure

  1. Use the OpenShift CLI to verify if GPU resources are allocatable:

    1. List all nodes in the cluster to identify the node with an AMD GPU:

      oc get nodes
    2. Note the name of the node where you expect the AMD GPU to be present.
    3. Describe the node to check its resource allocation:

      oc describe node <node_name>
    4. In the output, locate the Capacity and Allocatable sections and confirm that amd.com/gpu is listed. For example:

      Capacity:
        amd.com/gpu:  1
      Allocatable:
        amd.com/gpu:  1
  2. Check for the AMD GPU device using the lspci command:

    1. Log in to the node:

      oc debug node/<node_name>
      chroot /host
    2. Run the lspci command and search for the supported AMD device in your deployment. For example:

      lspci | grep -E "MI210|MI250|MI300"
    3. Verify that the output includes one of the AMD GPU models. For example:

      03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD] Instinct MI210
  3. Optional: If the ROCm stack is installed on the node, run the rocminfo command:

    rocminfo

    Confirm that the output includes details about the AMD GPU, such as compute units, memory, and driver status.
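
    For example, to check the reported GPU ISA target (a sketch; Instinct MI210 and MI250 devices typically report gfx90a):

      rocminfo | grep -i gfx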

Verification

  • The oc describe node <node_name> command lists amd.com/gpu under Capacity and Allocatable.
  • The lspci command output identifies an AMD GPU as a PCI device matching one of the specified models (for example, MI210, MI250, MI300).
  • Optional: The rocminfo tool provides detailed GPU information, confirming driver and hardware configuration.
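  • For a quick scripted check, you can read the allocatable GPU count directly from the node object. A sketch:

    oc get node <node_name> -o jsonpath="{.status.allocatable['amd\.com/gpu']}"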

7.3.2. Enabling AMD GPUs

Before you can use AMD GPUs in OpenShift AI, you must install the required dependencies, deploy the AMD GPU Operator, and configure the environment.

Prerequisites

  • You have logged in to OpenShift.
  • You have the cluster-admin role in OpenShift.
  • You have installed your AMD GPU and confirmed that it is detected in your environment.
  • Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).

Procedure

  1. Install the latest version of the AMD GPU Operator, as described in Install AMD GPU Operator on OpenShift.
  2. After installing the AMD GPU Operator, configure the AMD drivers required by the Operator as described in the documentation: Configure AMD drivers for the GPU Operator.
Note

Alternatively, you can install the AMD GPU Operator from the Red Hat Catalog. For more information, see Install AMD GPU Operator from Red Hat Catalog.

  3. After installing the AMD GPU Operator, create a hardware profile, as described in Working with hardware profiles.

Verification

From the Administrator perspective, go to the Operators → Installed Operators page. Confirm that the following Operators appear:

  • AMD GPU Operator
  • Node Feature Discovery (NFD)
  • Kernel Module Management (KMM)
Note

Ensure that you follow all the steps for proper driver installation and configuration. Incorrect installation or configuration may prevent the AMD GPUs from being recognized or functioning properly.
