Chapter 4. Intel Gaudi AI Accelerator integration
To accelerate your high-performance deep learning models, you can integrate Intel Gaudi AI accelerators into OpenShift AI. This integration enables your data scientists to use Gaudi libraries and software associated with Intel Gaudi AI accelerators through custom-configured workbench instances.
Intel Gaudi AI accelerators offer optimized performance for deep learning workloads, with the latest Gaudi 3 devices providing significant improvements in training speed and energy efficiency. These accelerators are suitable for enterprises running machine learning and AI applications on OpenShift AI.
Before you can enable Intel Gaudi AI accelerators in OpenShift AI, you must complete the following steps:
- Install the latest version of the Intel Gaudi Base Operator from OperatorHub.
- Create and configure a custom workbench image for Intel Gaudi AI accelerators. A prebuilt workbench image for Gaudi accelerators is not included in OpenShift AI.
- Manually define and configure an accelerator profile or a hardware profile for each Intel Gaudi AI device in your environment.
Important
By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift. For more information about setting dashboard configuration options, see Customizing the dashboard.
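If you prefer to make this change from the CLI, a patch like the following sketch can set the flag. The resource name (odh-dashboard-config), namespace (redhat-ods-applications), and field path are assumptions about a typical OpenShift AI deployment; verify them in your cluster before patching:

```shell
# Sketch: enable hardware profile UI components from the CLI.
# Resource name, namespace, and field path are assumptions; verify with:
#   oc get odhdashboardconfig -A
oc patch odhdashboardconfig odh-dashboard-config \
  -n redhat-ods-applications \
  --type merge \
  -p '{"spec":{"dashboardConfig":{"disableHardwareProfiles":false}}}'
```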
Red Hat supports Intel Gaudi devices up to Intel Gaudi 3. The Intel Gaudi 3 accelerators, in particular, offer the following benefits:
- Improved training throughput: Reduce the time required to train large models by using advanced tensor processing cores and increased memory bandwidth.
- Energy efficiency: Lower power consumption while maintaining high performance, reducing operational costs for large-scale deployments.
- Scalable architecture: Scale across multiple nodes for distributed training configurations.
To use Intel Gaudi AI accelerators on Amazon EC2 DL1 instances, your OpenShift platform must support EC2 DL1 instances. After you enable the accelerators, create a custom workbench image, and configure an accelerator profile or a hardware profile, you can use Intel Gaudi AI accelerators in workbench instances or model serving.
To identify the Intel Gaudi AI accelerators present in your deployment, use the lspci utility. For more information, see lspci(8) - Linux man page.
The presence of Intel Gaudi AI accelerators in your deployment, as indicated by the lspci utility, does not guarantee that the devices are ready to use. You must ensure that all installation and configuration steps are completed successfully.
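For example, one way to filter lspci output is by Habana's PCI vendor ID. The vendor ID shown (1da3) is an assumption here; confirm it against your own device listing:

```shell
# List PCI devices by Habana Labs vendor ID (1da3 is an assumption; verify locally).
lspci -d 1da3:

# Alternatively, search the full listing by name:
lspci | grep -iE 'habana|gaudi'
```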
4.1. Enabling Intel Gaudi AI accelerators
Before you can use Intel Gaudi AI accelerators in OpenShift AI, you must install the required dependencies, deploy the Intel Gaudi Base Operator, and configure the environment.
Prerequisites
- You have logged in to OpenShift.
- You have the cluster-admin role in OpenShift.
- You have installed your Intel Gaudi accelerator and confirmed that it is detected in your environment.
- Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
- You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
  - Installing the OpenShift CLI for OpenShift Dedicated
  - Installing the OpenShift CLI for Red Hat OpenShift Service on AWS (classic architecture)
Procedure
- Install the latest version of the Intel Gaudi Base Operator, as described in Intel Gaudi Base Operator OpenShift installation.
By default, OpenShift sets a per-pod PID limit of 4096. If your workload requires more processing power, such as when you use multiple Gaudi accelerators or when using vLLM with Ray, you must manually increase the per-pod PID limit to avoid Resource temporarily unavailable errors. These errors occur due to PID exhaustion. Red Hat recommends setting this limit to 32768, although values over 20000 are sufficient.

Run the following command to label the node:

oc label node <node_name> custom-kubelet=set-pod-pid-limit-kubelet

- Optional: To prevent workload distribution on the affected node, you can mark the node as unschedulable and then drain it in preparation for maintenance. For more information, see Understanding how to evacuate pods on nodes.
- Create a custom-kubelet-pidslimit.yaml KubeletConfig resource file. Populate the file with the following YAML code, setting the PodPidsLimit value to 32768:
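A KubeletConfig for this step might look like the following sketch. The metadata name and the machineConfigPoolSelector labels are assumptions based on the custom-kubelet=set-pod-pid-limit-kubelet label applied earlier in this procedure; ensure the selector matches a machine config pool in your environment:

```yaml
# Sketch of a KubeletConfig raising the per-pod PID limit.
# The name and selector labels are assumptions; verify against your cluster.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet-pidslimit
spec:
  kubeletConfig:
    podPidsLimit: 32768
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: set-pod-pid-limit-kubelet
```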
Apply the configuration:

oc apply -f custom-kubelet-pidslimit.yaml

This operation causes the node to reboot. For more information, see Understanding node rebooting.
- Optional: If you previously marked the node as unschedulable, you can allow scheduling again after the node reboots.
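The optional cordon, drain, and uncordon steps can be sketched with standard oc adm commands; the drain flags shown are common choices rather than requirements stated by this procedure:

```shell
# Mark the node unschedulable before applying the kubelet change:
oc adm cordon <node_name>

# Evacuate pods (flags are typical choices; adjust for your workloads):
oc adm drain <node_name> --ignore-daemonsets --delete-emptydir-data

# After the node reboots and the change is applied, allow scheduling again:
oc adm uncordon <node_name>
```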
- Create a custom workbench image for Intel Gaudi AI accelerators, as described in Creating custom workbench images.
- After installing the Intel Gaudi Base Operator, create an accelerator profile, as described in Working with accelerator profiles.
Important
By default, hardware profiles are hidden in the dashboard navigation menu and user interface, while accelerator profiles remain visible. In addition, user interface components associated with the deprecated accelerator profiles functionality are still displayed. To show the Settings → Hardware profiles option in the dashboard navigation menu, and the user interface components associated with hardware profiles, set the disableHardwareProfiles value to false in the OdhDashboardConfig custom resource (CR) in OpenShift. For more information about setting dashboard configuration options, see Customizing the dashboard.
Verification
From the Administrator perspective, go to the Operators → Installed Operators page and confirm that the following Operators are installed:
- Intel Gaudi Base Operator
- Node Feature Discovery (NFD)
- Kernel Module Management (KMM)
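As a CLI alternative to the console check, you can list the installed Operators and confirm that each reports a Succeeded status. The exact ClusterServiceVersion names vary by installation, so the search terms below are assumptions:

```shell
# List ClusterServiceVersions across namespaces and check for Succeeded status.
# The Operator name patterns in the grep are assumptions; adjust as needed.
oc get csv -A | grep -Ei 'gaudi|nfd|node-feature|kernel-module|kmm'
```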