Chapter 3. Enabling NVIDIA GPUs
Before you can use NVIDIA GPUs in OpenShift AI, you must install the NVIDIA GPU Operator.
Prerequisites
- You have logged in to your OpenShift cluster.
- You have the cluster-admin role in your OpenShift cluster.
- You have installed an NVIDIA GPU and confirmed that it is detected in your environment.
Procedure
To enable GPU support on an OpenShift cluster in a disconnected or airgapped environment, follow the instructions here: Deploy GPU Operators in a disconnected or airgapped environment in the NVIDIA documentation.
Important: After you install the Node Feature Discovery (NFD) Operator, you must create an instance of NodeFeatureDiscovery. In addition, after you install the NVIDIA GPU Operator, you must create a ClusterPolicy and populate it with default values.
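The two custom resources named in the note above can also be checked from the CLI. The following is a minimal sketch, not an official procedure: the `check_gpu_prereqs` function name is hypothetical, and the `openshift-nfd` namespace is an assumption about where the NFD Operator was installed, so adjust it for your cluster.

```shell
# Sketch: confirm that a NodeFeatureDiscovery instance and a ClusterPolicy exist.
# Assumption: the NFD Operator runs in the openshift-nfd namespace.
NFD_NAMESPACE="${NFD_NAMESPACE:-openshift-nfd}"

# Hypothetical helper; not part of oc or of any Operator.
check_gpu_prereqs() {
  # NodeFeatureDiscovery is namespaced; ClusterPolicy is cluster-scoped.
  oc get nodefeaturediscovery -n "$NFD_NAMESPACE" &&
    oc get clusterpolicy
}

# Usage (requires a logged-in oc session with cluster-admin):
#   check_gpu_prereqs
```

If either command lists no resources, create the missing resource from the corresponding Operator page before continuing.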
Delete the migration-gpu-status ConfigMap.
- In the OpenShift web console, switch to the Administrator perspective.
- Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate ConfigMap.
- Search for the migration-gpu-status ConfigMap.
- Click the action menu (⋮) and select Delete ConfigMap from the list. The Delete ConfigMap dialog appears.
- Inspect the dialog and confirm that you are deleting the correct ConfigMap.
- Click Delete.
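The web console steps above could also be performed from the CLI. This is a sketch under stated assumptions: the `delete_migration_configmap` function name is hypothetical, and the ConfigMap is assumed to live in the `redhat-ods-applications` project mentioned in the steps.

```shell
# Assumption: the ConfigMap is in the redhat-ods-applications project.
APPS_NAMESPACE="${APPS_NAMESPACE:-redhat-ods-applications}"

# Hypothetical helper; not part of oc.
delete_migration_configmap() {
  # Show the ConfigMap first so the target can be confirmed, then delete it.
  oc get configmap migration-gpu-status -n "$APPS_NAMESPACE" &&
    oc delete configmap migration-gpu-status -n "$APPS_NAMESPACE"
}

# Usage (requires a logged-in oc session):
#   delete_migration_configmap
```

Because the delete runs only if the `oc get` succeeds, the function stops without deleting anything when the ConfigMap is not found in the assumed project.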
Restart the dashboard replicaset.
- In the OpenShift web console, switch to the Administrator perspective.
- Click Workloads → Deployments.
- Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate deployment.
- Search for the rhods-dashboard deployment.
- Click the action menu (⋮) and select Restart Rollout from the list.
- Wait until the Status column indicates that all pods in the rollout have fully restarted.
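A CLI equivalent of the restart steps above might look like the following sketch. The `restart_dashboard` function name is hypothetical, and the deployment is assumed to be in the `redhat-ods-applications` project.

```shell
# Assumption: the dashboard deployment is in the redhat-ods-applications project.
APPS_NAMESPACE="${APPS_NAMESPACE:-redhat-ods-applications}"

# Hypothetical helper; not part of oc.
restart_dashboard() {
  # Trigger a new rollout, then block until all pods have been replaced.
  oc rollout restart deployment/rhods-dashboard -n "$APPS_NAMESPACE" &&
    oc rollout status deployment/rhods-dashboard -n "$APPS_NAMESPACE"
}

# Usage (requires a logged-in oc session):
#   restart_dashboard
```

Here `oc rollout status` plays the role of watching the Status column in the console: it returns once the rollout has completed.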
Verification
- The reset migration-gpu-status instance is present on the Instances tab on the AcceleratorProfile custom resource definition (CRD) details page.
- From the Administrator perspective, go to the Operators → Installed Operators page. Confirm that the following Operators appear:
  - NVIDIA GPU
  - Node Feature Discovery (NFD)
  - Kernel Module Management (KMM)
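The installed Operators can also be listed from the CLI by filtering the cluster service versions (CSVs). This is only a sketch: the `list_gpu_operators` function name is hypothetical, and the name patterns in the filter are assumptions about how the three Operators' CSVs are named, so they may need adjusting for your catalog.

```shell
# Hypothetical helper; not part of oc.
list_gpu_operators() {
  # Assumption: the CSV names contain one of these substrings.
  oc get csv --all-namespaces |
    grep -i -E 'gpu-operator|nfd|kernel-module-management'
}

# Usage (requires a logged-in oc session):
#   list_gpu_operators
```

An empty result means either that an Operator is missing or that its CSV name does not match the assumed patterns, so cross-check against the Installed Operators page.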
The GPU is correctly detected a few minutes after full installation of the Node Feature Discovery (NFD) and NVIDIA GPU Operators. The OpenShift command line interface (CLI) displays the appropriate output for the GPU worker node. For example:
Expected output when the GPU is detected properly

    oc describe node <node name>
    ...
    Capacity:
      cpu:                4
      ephemeral-storage:  313981932Ki
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             16076568Ki
      nvidia.com/gpu:     1
      pods:               250
    Allocatable:
      cpu:                3920m
      ephemeral-storage:  288292006229
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             12828440Ki
      nvidia.com/gpu:     1
      pods:               250
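To check only the GPU lines rather than reading the full node description, the output can be filtered. The following sketch assumes the `oc describe node` layout shown in the example output; the `gpu_counts` function name is hypothetical.

```shell
# Hypothetical helper: read `oc describe node` output on standard input and
# print each nvidia.com/gpu value (Capacity first, then Allocatable).
gpu_counts() {
  # Match only lines whose first field is exactly "nvidia.com/gpu:".
  awk '$1 == "nvidia.com/gpu:" { print $2 }'
}

# Usage (requires a logged-in oc session; <node name> is a placeholder):
#   oc describe node <node name> | gpu_counts
```

If nothing is printed, the node does not advertise the `nvidia.com/gpu` resource, which suggests that the NFD or NVIDIA GPU Operator setup has not completed.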
In OpenShift AI, Red Hat supports the use of accelerators within the same cluster only.
Starting from Red Hat OpenShift AI 2.19, Red Hat supports remote direct memory access (RDMA) for NVIDIA GPUs only, enabling them to communicate directly with each other by using NVIDIA GPUDirect RDMA across either Ethernet or InfiniBand networks.
After installing the NVIDIA GPU Operator, create a hardware profile as described in Working with hardware profiles.