Chapter 2. NVIDIA GPU architecture

NVIDIA supports the use of graphics processing unit (GPU) resources on OpenShift Container Platform. OpenShift Container Platform is a security-focused and hardened Kubernetes platform developed and supported by Red Hat for deploying and managing Kubernetes clusters at scale. OpenShift Container Platform includes enhancements to Kubernetes so that users can easily configure and use NVIDIA GPU resources to accelerate workloads.

The NVIDIA GPU Operator uses the Operator framework within OpenShift Container Platform to manage the full lifecycle of NVIDIA software components required to run GPU-accelerated workloads.

These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, automatic node tagging using GPU feature discovery (GFD), DCGM-based monitoring, and others.
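
As a minimal sketch of what these components enable, the following Python snippet builds a pod manifest that requests a GPU through the `nvidia.com/gpu` extended resource that the device plugin advertises. The pod name and the image reference are hypothetical placeholders, not values from this document.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Return a Pod manifest (as a dict) that requests NVIDIA GPUs.

    `name` and `image` are supplied by the caller; the values used in the
    example below are hypothetical placeholders.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources such as GPUs are requested under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical image reference, for illustration only.
    manifest = gpu_pod_manifest("cuda-sample", "registry.example.com/cuda-sample:latest")
    print(json.dumps(manifest, indent=2))
```

You can save the printed JSON to a file and apply it with `oc apply -f <file>` on a cluster where the GPU Operator is already running.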

Note

The NVIDIA GPU Operator is only supported by NVIDIA. For more information about obtaining support from NVIDIA, see Obtaining Support from NVIDIA.

2.1. NVIDIA GPU prerequisites

A working OpenShift cluster with at least one GPU worker node.
Access to the OpenShift cluster as a user with the cluster-admin role to perform the required steps.
OpenShift CLI (oc) is installed.
The Node Feature Discovery (NFD) Operator is installed and a NodeFeatureDiscovery instance is created.
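
One way to confirm the first and last prerequisites is to look for the PCI label that NFD adds to nodes with NVIDIA hardware. The sketch below assumes that NFD uses its default PCI labeling, which produces the label `feature.node.kubernetes.io/pci-10de.present=true` (`10de` is the NVIDIA PCI vendor ID). The node names in the sample data are made up.

```python
# Assumption: NFD default PCI labeling; 10de is the NVIDIA PCI vendor ID.
NVIDIA_PCI_LABEL = "feature.node.kubernetes.io/pci-10de.present"


def nvidia_gpu_nodes(node_list):
    """Return the names of nodes that NFD has labeled as having NVIDIA hardware.

    `node_list` is the parsed output of `oc get nodes -o json`.
    """
    names = []
    for node in node_list.get("items", []):
        metadata = node.get("metadata", {})
        labels = metadata.get("labels", {})
        if labels.get(NVIDIA_PCI_LABEL) == "true":
            names.append(metadata.get("name"))
    return names


if __name__ == "__main__":
    # Made-up sample data shaped like `oc get nodes -o json` output.
    sample = {
        "items": [
            {"metadata": {"name": "worker-0", "labels": {NVIDIA_PCI_LABEL: "true"}}},
            {"metadata": {"name": "master-0", "labels": {}}},
        ]
    }
    print(nvidia_gpu_nodes(sample))
```

An empty result suggests that either no GPU worker node is present or the NodeFeatureDiscovery instance has not labeled the nodes yet.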

2.2. NVIDIA GPU enablement

The following diagram shows how the GPU architecture is enabled for OpenShift:

Figure 2.1. NVIDIA GPU enablement

Note

Multi-Instance GPU (MIG) is supported on GPUs starting with the NVIDIA Ampere generation. For a list of GPUs that support MIG, see the NVIDIA MIG User Guide.

2.2.1. GPUs and bare metal

You can deploy OpenShift Container Platform on an NVIDIA-certified bare-metal server, with the following limitations:

Control plane nodes can be CPU nodes.
Worker nodes that run AI/ML workloads must be GPU nodes.
A worker node can host one or more GPUs, but they must all be of the same type. For example, a node can have two NVIDIA A100 GPUs, but a node with one A100 GPU and one T4 GPU is not supported. The NVIDIA Device Plugin for Kubernetes does not support mixing different GPU models on the same node.
An OpenShift deployment requires either one server or three or more servers. Clusters with two servers are not supported. The single-server deployment is called single-node OpenShift (SNO), and this configuration results in an OpenShift environment that is not highly available.
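
The server-count and same-model limitations above can be expressed as a small planning check. This is an illustrative sketch only: the function name, the input shape, and the worker names are hypothetical, and the check covers nothing beyond the two rules stated in this section.

```python
def check_bare_metal_layout(server_count, gpus_by_worker):
    """Return a list of problems; an empty list means the layout passes.

    `gpus_by_worker` maps a worker node name to the list of GPU models
    installed in that node, for example {"worker-0": ["A100", "A100"]}.
    """
    problems = []
    # One server (SNO) or three or more servers; two is not supported.
    if server_count < 1 or server_count == 2:
        problems.append(
            f"{server_count} server(s): use one server (SNO) or three or more"
        )
    # All GPUs in a worker node must be the same model.
    for worker, models in sorted(gpus_by_worker.items()):
        distinct = sorted(set(models))
        if len(distinct) > 1:
            problems.append(f"{worker}: mixed GPU models {', '.join(distinct)}")
    return problems


if __name__ == "__main__":
    print(check_bare_metal_layout(3, {"worker-0": ["A100", "A100"]}))
    print(check_bare_metal_layout(2, {"worker-0": ["A100", "T4"]}))
```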

You can choose one of the following methods to access the containerized GPUs:

GPU passthrough
Multi-Instance GPU (MIG)

2.2.2. GPUs and virtualization

Many developers and enterprises are moving to containerized applications and serverless infrastructures, but there is still a lot of interest in developing and maintaining applications that run on virtual machines (VMs). Red Hat OpenShift Virtualization provides this capability, enabling enterprises to incorporate VMs into containerized workflows within clusters.

You can choose one of the following methods to connect the worker nodes to the GPUs:

GPU passthrough to access and use GPU hardware within a virtual machine (VM).
GPU (vGPU) time-slicing, when GPU compute capacity is not saturated by workloads.

2.2.3. GPUs and vSphere

You can deploy OpenShift Container Platform on an NVIDIA-certified VMware vSphere server that can host different GPU types.

If the VMs use vGPU instances, an NVIDIA GPU driver must be installed in the hypervisor. For VMware vSphere, this host driver is provided in the form of a VIB file.

The maximum number of vGPUs that can be allocated to worker node VMs depends on the version of vSphere:

vSphere 7.0: maximum of 4 vGPUs per VM
vSphere 8.0: maximum of 8 vGPUs per VM

Note

vSphere 8.0 introduced support for multiple full or fractional heterogeneous profiles associated with a VM.
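
The per-version limits listed above can be captured in a lookup table for use in sizing scripts. The sketch below encodes only the two versions named in this section; the function names are hypothetical, and any other version is rejected rather than guessed.

```python
# Limits taken from the list above; other versions are intentionally absent.
MAX_VGPUS_PER_VM = {"7.0": 4, "8.0": 8}


def max_vgpus_per_vm(vsphere_version):
    """Return the maximum number of vGPUs per VM for a vSphere version."""
    try:
        return MAX_VGPUS_PER_VM[vsphere_version]
    except KeyError:
        raise ValueError(f"no vGPU limit recorded for vSphere {vsphere_version}")


def vgpu_request_fits(vsphere_version, requested):
    """Return True if `requested` vGPUs can be allocated to a single VM."""
    return 0 <= requested <= max_vgpus_per_vm(vsphere_version)


if __name__ == "__main__":
    print(vgpu_request_fits("7.0", 4))
    print(vgpu_request_fits("7.0", 5))
```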

You can choose one of the following methods to attach the worker nodes to the GPUs:

GPU passthrough for accessing and using GPU hardware within a virtual machine (VM)
GPU (vGPU) time-slicing, when not all of the GPU is needed

Similar to bare-metal deployments, either one server or three or more servers are required. Clusters with two servers are not supported.

2.2.4. GPUs and Red Hat KVM

You can use OpenShift Container Platform on an NVIDIA-certified kernel-based virtual machine (KVM) server.

Similar to bare-metal deployments, either one server or three or more servers are required. Clusters with two servers are not supported.

However, unlike bare-metal deployments, you can use different types of GPUs in the server. This is because you can assign these GPUs to different VMs that act as Kubernetes nodes. The only limitation is that all GPUs assigned to a given Kubernetes node must be of the same type.
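
To illustrate how a server with mixed GPU types can still satisfy the per-node rule, the following sketch groups the GPUs of one host by model and assigns each group to its own VM. Using one VM per model is only one possible strategy, and the device identifiers and the `worker-<model>` naming scheme are hypothetical.

```python
from collections import defaultdict


def plan_vm_nodes(host_gpus):
    """Group a host's GPUs by model so that each VM node gets one model only.

    `host_gpus` is a list of (device_id, model) pairs. Returns a dict that
    maps a hypothetical VM node name to the device IDs assigned to it.
    """
    groups = defaultdict(list)
    for device_id, model in host_gpus:
        groups[model].append(device_id)
    # One VM per GPU model keeps every Kubernetes node homogeneous.
    return {f"worker-{model.lower()}": ids for model, ids in sorted(groups.items())}


if __name__ == "__main__":
    host = [("gpu0", "A100"), ("gpu1", "T4"), ("gpu2", "A100")]
    print(plan_vm_nodes(host))
```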

You can choose one of the following methods to access the containerized GPUs:

GPU passthrough for accessing and using GPU hardware within a virtual machine (VM)
GPU (vGPU) time-slicing when not all of the GPU is needed

To enable the vGPU capability, a special driver must be installed at the host level. This driver is delivered as an RPM package. The host driver is not required for GPU passthrough allocation.

2.2.5. GPUs and CSPs

You can deploy OpenShift Container Platform to one of the major cloud service providers (CSPs): Amazon Web Services (AWS), Google Cloud, or Microsoft Azure.

Two modes of operation are available: a fully managed deployment and a self-managed deployment.

In a fully managed deployment, Red Hat automates everything in collaboration with the CSP. You request an OpenShift instance through the CSP web console, and the cluster is automatically created and fully managed by Red Hat. You do not have to worry about node failures or errors in the environment because Red Hat is fully responsible for maintaining the uptime of the cluster. Fully managed services are available on AWS, Azure, and Google Cloud. For AWS, the service is called Red Hat OpenShift Service on AWS (ROSA). For Azure, the service is called Azure Red Hat OpenShift. For Google Cloud, the service is called OpenShift Dedicated on Google Cloud.
In a self-managed deployment, you are responsible for instantiating and maintaining the OpenShift cluster. Red Hat provides the openshift-install utility to support the deployment of the OpenShift cluster in this case. Self-managed deployments are available globally on all CSPs.

It is important that the compute instance you select is a GPU-accelerated compute instance and that its GPU type is on the list of GPUs supported by NVIDIA AI Enterprise. For example, T4, V100, and A100 are part of this list.

You can choose one of the following methods to access the containerized GPUs:

GPU passthrough to access and use GPU hardware within a virtual machine (VM).
GPU (vGPU) time-slicing when the entire GPU is not required.

2.2.6. GPUs and Red Hat Device Edge

Red Hat Device Edge provides access to MicroShift. MicroShift provides the simplicity of a single-node deployment with the functionality and services you need for resource-constrained (edge) computing. Red Hat Device Edge meets the needs of bare-metal, virtual, containerized, or Kubernetes workloads deployed in resource-constrained environments.

You can enable NVIDIA GPUs on containers in a Red Hat Device Edge environment.

You use GPU passthrough to access the containerized GPUs.

2.4. NVIDIA GPU features for OpenShift Container Platform

NVIDIA Container Toolkit

NVIDIA Container Toolkit enables you to create and run GPU-accelerated containers. The toolkit includes a container runtime library and utilities to automatically configure containers to use NVIDIA GPUs.

NVIDIA AI Enterprise

NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software optimized, certified, and supported with NVIDIA-Certified systems.

NVIDIA AI Enterprise includes support for Red Hat OpenShift Container Platform. The following installation methods are supported:

OpenShift Container Platform on bare metal or VMware vSphere with GPU Passthrough.
OpenShift Container Platform on VMware vSphere with NVIDIA vGPU.

GPU Feature Discovery

NVIDIA GPU Feature Discovery for Kubernetes is a software component that enables you to automatically generate labels for the GPUs available on a node. GPU Feature Discovery uses node feature discovery (NFD) to perform this labeling.

The Node Feature Discovery (NFD) Operator manages the discovery of hardware features and configurations in an OpenShift Container Platform cluster by labeling nodes with hardware-specific information. NFD labels the host with node-specific attributes, such as PCI cards, kernel, OS version, and so on.

You can find the NFD Operator in OperatorHub by searching for “Node Feature Discovery”.
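
The labels that GPU Feature Discovery generates can be read back from a node to summarize its GPUs. The sketch below assumes the label keys `nvidia.com/gpu.product`, `nvidia.com/gpu.count`, and `nvidia.com/gpu.memory`; verify them against the GFD version you run. The label values in the example are made up.

```python
def summarize_gpu_labels(labels):
    """Summarize GFD labels from a node's metadata.labels dict.

    Returns None when the node carries no GPU product label.
    The label keys are assumptions; check them against your GFD version.
    """
    product = labels.get("nvidia.com/gpu.product")
    if product is None:
        return None
    return {
        "product": product,
        "count": int(labels.get("nvidia.com/gpu.count", "0")),
        # Reported as an integer string; the unit is whatever GFD reports.
        "memory": int(labels.get("nvidia.com/gpu.memory", "0")),
    }


if __name__ == "__main__":
    # Made-up label values, for illustration only.
    node_labels = {
        "nvidia.com/gpu.product": "NVIDIA-A100-SXM4-40GB",
        "nvidia.com/gpu.count": "2",
        "nvidia.com/gpu.memory": "40960",
    }
    print(summarize_gpu_labels(node_labels))
```

A summary like this can feed a node selector so that a workload lands only on nodes with a particular GPU product.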

NVIDIA GPU Operator with OpenShift Virtualization

Previously, the GPU Operator provisioned worker nodes only to run GPU-accelerated containers. Now, the GPU Operator can also be used to provision worker nodes for running GPU-accelerated virtual machines (VMs).

You can configure the GPU Operator to deploy different software components to worker nodes depending on which GPU workload is configured to run on those nodes.
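
As a sketch of how a node might be marked for a particular GPU workload, the following snippet builds an `oc label node` command. The label key `nvidia.com/gpu.workload.config` and the values `container`, `vm-passthrough`, and `vm-vgpu` are assumptions based on the NVIDIA GPU Operator documentation; verify them against the Operator version you deploy. The node name is a placeholder.

```python
# Assumed label key and values; confirm against your GPU Operator version.
WORKLOAD_LABEL = "nvidia.com/gpu.workload.config"
VALID_WORKLOADS = ("container", "vm-passthrough", "vm-vgpu")


def workload_label_command(node, workload):
    """Return the `oc` argument list that labels `node` for a GPU workload."""
    if workload not in VALID_WORKLOADS:
        raise ValueError(f"unknown workload type: {workload!r}")
    return ["oc", "label", "node", node, f"{WORKLOAD_LABEL}={workload}", "--overwrite"]


if __name__ == "__main__":
    # "worker-0" is a placeholder node name.
    print(" ".join(workload_label_command("worker-0", "vm-passthrough")))
```

The function only builds the command; pass the returned list to `subprocess.run` if you want to execute it.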

GPU Monitoring dashboard

You can install a monitoring dashboard to display GPU usage information on the cluster Observe page in the OpenShift Container Platform web console. GPU utilization information includes the number of available GPUs, power consumption (in watts), temperature (in degrees Celsius), utilization (in percent), and other metrics for each GPU.
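
The metrics behind such a dashboard are exposed in Prometheus text format by the DCGM-based monitoring component. The sketch below parses a hand-written sample of that format and extracts utilization, temperature, and power per GPU. The metric names `DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_GPU_TEMP`, and `DCGM_FI_DEV_POWER_USAGE` are assumed to match the DCGM exporter defaults, and the sample values are made up.

```python
import re

# Matches lines such as: NAME{label="value",...} 42
METRIC_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')

# Assumed DCGM exporter metric names mapped to friendlier keys.
WANTED = {
    "DCGM_FI_DEV_GPU_UTIL": "utilization_percent",
    "DCGM_FI_DEV_GPU_TEMP": "temperature_celsius",
    "DCGM_FI_DEV_POWER_USAGE": "power_watts",
}


def gpu_metrics(text):
    """Return {gpu_index: {metric_key: value}} from Prometheus text output."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = METRIC_LINE.match(line)
        if not match or match.group("name") not in WANTED:
            continue
        labels = dict(LABEL_PAIR.findall(match.group("labels")))
        gpu = labels.get("gpu", "unknown")
        result.setdefault(gpu, {})[WANTED[match.group("name")]] = float(match.group("value"))
    return result


if __name__ == "__main__":
    # Hand-written sample; values are made up.
    sample = (
        '# TYPE DCGM_FI_DEV_GPU_TEMP gauge\n'
        'DCGM_FI_DEV_GPU_TEMP{gpu="0",modelName="NVIDIA A100"} 41\n'
        'DCGM_FI_DEV_POWER_USAGE{gpu="0",modelName="NVIDIA A100"} 62.5\n'
        'DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="NVIDIA A100"} 87\n'
    )
    print(gpu_metrics(sample))
```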
