Chapter 4. Intel Gaudi AI Accelerator integration
To accelerate your high-performance deep learning models, you can integrate Intel Gaudi AI accelerators into OpenShift AI. This integration enables your data scientists to use Gaudi libraries and software associated with Intel Gaudi AI accelerators through custom-configured workbench instances.
Intel Gaudi AI accelerators offer optimized performance for deep learning workloads, with the latest Gaudi 3 devices providing significant improvements in training speed and energy efficiency. These accelerators are suitable for enterprises running machine learning and AI applications on OpenShift AI.
Before you can enable Intel Gaudi AI accelerators in OpenShift AI, you must complete the following steps:
- Install the latest version of the Intel Gaudi Base Operator from OperatorHub.
- Create and configure a custom workbench image for Intel Gaudi AI accelerators. A prebuilt workbench image for Gaudi accelerators is not included in OpenShift AI.
- Manually define and configure an accelerator profile or a hardware profile for each Intel Gaudi AI device in your environment.
Important
By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift. For more information about setting dashboard configuration options, see Customizing the dashboard.
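If you prefer to make this change from the CLI, a patch like the following sketch can set the flag. The resource name (odh-dashboard-config), namespace (redhat-ods-applications), and field path are assumptions about a typical OpenShift AI deployment; verify them in your cluster before patching:

```shell
# Sketch: enable hardware profile UI components from the CLI.
# Resource name, namespace, and field path are assumptions; verify with:
#   oc get odhdashboardconfig -A
oc patch odhdashboardconfig odh-dashboard-config \
  -n redhat-ods-applications \
  --type merge \
  -p '{"spec":{"dashboardConfig":{"disableHardwareProfiles":false}}}'
```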
Red Hat supports Intel Gaudi devices up to Intel Gaudi 3. The Intel Gaudi 3 accelerators, in particular, offer the following benefits:
- Improved training throughput: Reduce the time required to train large models by using advanced tensor processing cores and increased memory bandwidth.
- Energy efficiency: Lower power consumption while maintaining high performance, reducing operational costs for large-scale deployments.
- Scalable architecture: Scale across multiple nodes for distributed training configurations.
To use Intel Gaudi AI accelerators on Amazon EC2 DL1 instances, your OpenShift platform must support EC2 DL1 instances. After you enable the accelerators, create a custom workbench image, and configure an accelerator profile or a hardware profile, you can use Intel Gaudi AI accelerators in workbench instances or model serving.
To identify the Intel Gaudi AI accelerators present in your deployment, use the lspci utility. For more information, see lspci(8) - Linux man page.
The presence of Intel Gaudi AI accelerators in your deployment, as indicated by the lspci utility, does not guarantee that the devices are ready to use. You must ensure that all installation and configuration steps are completed successfully.
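For example, one way to filter lspci output is by Habana's PCI vendor ID. The vendor ID shown (1da3) is an assumption here; confirm it against your own device listing:

```shell
# List PCI devices by Habana Labs vendor ID (1da3 is an assumption; verify locally).
lspci -d 1da3:

# Alternatively, search the full listing by name:
lspci | grep -iE 'habana|gaudi'
```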
4.1. Enabling Intel Gaudi AI accelerators
Before you can use Intel Gaudi AI accelerators in OpenShift AI, you must install the required dependencies, deploy the Intel Gaudi Base Operator, and configure the environment.
Prerequisites
- You have logged in to OpenShift.
- You have the cluster-admin role in OpenShift.
- You have installed your Intel Gaudi accelerator and confirmed that it is detected in your environment.
- Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
- You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
  - Installing the OpenShift CLI for OpenShift Dedicated
  - Installing the OpenShift CLI for Red Hat OpenShift Service on AWS (classic architecture)
Procedure
- Install the latest version of the Intel Gaudi Base Operator, as described in Intel Gaudi Base Operator OpenShift installation.
By default, OpenShift sets a per-pod PID limit of 4096. If your workload requires more processing power, such as when you use multiple Gaudi accelerators or when using vLLM with Ray, you must manually increase the per-pod PID limit to avoid Resource temporarily unavailable errors. These errors occur due to PID exhaustion. Red Hat recommends setting this limit to 32768, although values over 20000 are sufficient.

Run the following command to label the node:

oc label node <node_name> custom-kubelet=set-pod-pid-limit-kubelet

- Optional: To prevent workload distribution on the affected node, you can mark the node as unschedulable and then drain it in preparation for maintenance. For more information, see Understanding how to evacuate pods on nodes.
- Create a custom-kubelet-pidslimit.yaml KubeletConfig resource file. Populate the file with the following YAML code, setting the PodPidsLimit value to 32768:
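A KubeletConfig for this step might look like the following sketch. The metadata name and the machineConfigPoolSelector labels are assumptions based on the custom-kubelet=set-pod-pid-limit-kubelet label applied earlier in this procedure; ensure the selector matches a machine config pool in your environment:

```yaml
# Sketch of a KubeletConfig raising the per-pod PID limit.
# The name and selector labels are assumptions; verify against your cluster.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet-pidslimit
spec:
  kubeletConfig:
    podPidsLimit: 32768
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: set-pod-pid-limit-kubelet
```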
Apply the configuration:

oc apply -f custom-kubelet-pidslimit.yaml

This operation causes the node to reboot. For more information, see Understanding node rebooting.
- Optional: If you previously marked the node as unschedulable, you can allow scheduling again after the node reboots.
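The optional cordon, drain, and uncordon steps can be sketched with standard oc adm commands; the drain flags shown are common choices rather than requirements stated by this procedure:

```shell
# Mark the node unschedulable before applying the kubelet change:
oc adm cordon <node_name>

# Evacuate pods (flags are typical choices; adjust for your workloads):
oc adm drain <node_name> --ignore-daemonsets --delete-emptydir-data

# After the node reboots and the change is applied, allow scheduling again:
oc adm uncordon <node_name>
```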
- Create a custom workbench image for Intel Gaudi AI accelerators, as described in Creating custom workbench images.
- After installing the Intel Gaudi Base Operator, create an accelerator profile, as described in Working with accelerator profiles.
Important
By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift. For more information about setting dashboard configuration options, see Customizing the dashboard.
Verification
From the Administrator perspective, go to the Operators → Installed Operators page and confirm that the following Operators are installed:
- Intel Gaudi Base Operator
- Node Feature Discovery (NFD)
- Kernel Module Management (KMM)
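As a CLI alternative to the console check, you can list the installed Operators and confirm that each reports a Succeeded status. The exact ClusterServiceVersion names vary by installation, so the search terms below are assumptions:

```shell
# List ClusterServiceVersions across namespaces and check for Succeeded status.
# The Operator name patterns in the grep are assumptions; adjust as needed.
oc get csv -A | grep -Ei 'gaudi|nfd|node-feature|kernel-module|kmm'
```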