Chapter 5. AMD GPU Integration

You can use AMD GPUs with OpenShift AI to accelerate AI and machine learning (ML) workloads. AMD GPUs provide high-performance compute capabilities, allowing users to process large data sets, train deep neural networks, and perform complex inference tasks more efficiently.

Integrating AMD GPUs with OpenShift AI involves the following components:

ROCm workbench images: Use the ROCm workbench images to streamline AI/ML workflows on AMD GPUs. These images include libraries and frameworks optimized with the AMD ROCm platform, enabling high-performance workloads for PyTorch and TensorFlow. The pre-configured images reduce setup time and provide an optimized environment for GPU-accelerated development and experimentation.
AMD GPU Operator: The AMD GPU Operator simplifies GPU integration by automating driver installation, device plugin setup, and node labeling for GPU resource management. It ensures compatibility between OpenShift and AMD hardware while enabling scaling of GPU-enabled workloads.
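Once the Operator's device plugin is running, workloads request a GPU through the `amd.com/gpu` resource name. The following is a minimal sketch of a smoke-test Pod, not an official manifest: the helper name `gpu_test_pod`, the Pod name, and the container image argument are all hypothetical, and the image is assumed to ship the `rocminfo` tool.

```shell
#!/bin/sh
# Hypothetical helper: prints a Pod manifest that requests one AMD GPU.
# $1 is the container image (assumption: any image that includes rocminfo).
gpu_test_pod() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: amd-gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: rocminfo
    image: $1
    command: ["rocminfo"]
    resources:
      limits:
        amd.com/gpu: 1
EOF
}

# Usage on a live cluster (requires a logged-in oc session):
#   gpu_test_pod <rocm_image> | oc apply -f -
```

If the Pod stays in the Pending state, the scheduler most likely found no node advertising `amd.com/gpu`, which points back to the verification steps in this chapter.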

5.1. Verifying AMD GPU availability on your cluster

Before you proceed with the AMD GPU Operator installation process, you can verify the presence of an AMD GPU device on a node within your OpenShift cluster. You can use commands such as lspci or oc to confirm hardware and resource availability.

Prerequisites

You have administrative access to the OpenShift cluster.
You have a running OpenShift cluster with a node equipped with an AMD GPU.
You have access to the OpenShift CLI (oc) and terminal access to the node.

Procedure

Use the OpenShift CLI to verify that GPU resources are allocatable:
1. List all nodes in the cluster to identify the node with an AMD GPU:
  oc get nodes
2. Note the name of the node where you expect the AMD GPU to be present.
3. Describe the node to check its resource allocation:
  oc describe node <node_name>
4. In the output, locate the Capacity and Allocatable sections and confirm that amd.com/gpu is listed. For example:
  Capacity:
    amd.com/gpu: 1
  Allocatable:
    amd.com/gpu: 1
Check for the AMD GPU device using the lspci command:
1. Log in to the node:
  oc debug node/<node_name>
  chroot /host
2. Run the lspci command and search for the supported AMD device in your deployment. For example:
  lspci | grep -E "MI210|MI250|MI300"
3. Verify that the output includes one of the AMD GPU models. For example:
  03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD] Instinct MI210
Optional: Use the rocminfo command if the ROCm stack is installed on the node:
```
rocminfo
```
1. Confirm that the ROCm tool outputs details about the AMD GPU, such as compute units, memory, and driver status.
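The per-node `oc describe` check above can be condensed into one query across all nodes. This is a minimal sketch, assuming the device plugin advertises the `amd.com/gpu` resource as in the example output; the helper name `gpu_nodes` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads a two-column table (node name, GPU count) on
# stdin, skips the header row, and prints the names of nodes whose count
# is a positive integer. Rows showing <none> are ignored.
gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 + 0 > 0 { print $1 }'
}

# Usage on a live cluster (requires a logged-in oc session):
#   oc get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.amd\.com/gpu' | gpu_nodes
```

The backslash in `amd\.com/gpu` keeps the dot from being read as a path separator in the column expression. An empty result means that no node currently reports allocatable AMD GPUs.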

Verification

The oc describe node <node_name> command lists amd.com/gpu under Capacity and Allocatable.
The lspci command output identifies an AMD GPU as a PCI device matching one of the specified models (for example, MI210, MI250, MI300).
Optional: The rocminfo tool provides detailed GPU information, confirming driver and hardware configuration.
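To script the `lspci` check, the match can be turned into an exit status. The sketch below assumes the same model names as the procedure; the helper name `has_amd_gpu` and the `MODELS` variable are hypothetical, and you might need to adjust the pattern for your hardware.

```shell
#!/bin/sh
# Model names taken from the lspci example in the procedure.
MODELS='MI210|MI250|MI300'

# Hypothetical helper: reads lspci output on stdin and succeeds (exit 0)
# only if an AMD device line mentions one of the listed models.
has_amd_gpu() {
  grep -E 'Advanced Micro Devices' | grep -Eq "$MODELS"
}

# Usage from the node debug shell, after chroot /host:
#   lspci | has_amd_gpu && echo "AMD GPU found"
```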

5.2. Enabling AMD GPUs

Before you can use AMD GPUs in OpenShift AI, you must install the required dependencies, deploy the AMD GPU Operator, and configure the environment.

Prerequisites

You have logged in to OpenShift.
You have the cluster-admin role in OpenShift.
You have installed your AMD GPU and confirmed that it is detected in your environment.
Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).

Procedure

Install the latest version of the AMD GPU Operator, as described in Install AMD GPU Operator on OpenShift.
After installing the AMD GPU Operator, configure the AMD drivers required by the Operator as described in the documentation: Configure AMD drivers for the GPU Operator.

Note

Alternatively, you can install the AMD GPU Operator from the Red Hat Catalog. For more information, see Install AMD GPU Operator from Red Hat Catalog.

After installing the AMD GPU Operator, create a hardware profile, as described in Working with hardware profiles.

Verification

From the Administrator perspective, go to the Operators → Installed Operators page. Confirm that the following Operators appear:

AMD GPU Operator
Node Feature Discovery (NFD)
Kernel Module Management (KMM)
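The same check can be run from the CLI by listing ClusterServiceVersions. The following is a sketch under stated assumptions: the helper name `missing_operators` is hypothetical, and the name patterns in the usage line are guesses at how the three Operators are named in your catalog, so confirm them against the actual `oc get csv` output.

```shell
#!/bin/sh
# Hypothetical helper: reads `oc get csv` output on stdin and prints each
# name pattern (given as an argument) that has no row in the Succeeded phase.
missing_operators() {
  listing=$(cat)
  for pattern in "$@"; do
    printf '%s\n' "$listing" | grep -i -- "$pattern" | grep -q 'Succeeded' \
      || printf '%s\n' "$pattern"
  done
}

# Usage on a live cluster; the patterns are assumptions and may differ:
#   oc get csv --all-namespaces | missing_operators amd-gpu nfd kernel-module-management
```

No output means that every pattern matched an Operator whose installation reached the Succeeded phase.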

Note

Ensure that you follow all the steps for proper driver installation and configuration. Incorrect installation or configuration may prevent the AMD GPUs from being recognized or functioning properly.
