Chapter 4. Intel Gaudi AI Accelerator integration
To accelerate your high-performance deep learning models, you can integrate Intel Gaudi AI accelerators into OpenShift AI. This integration enables your data scientists to use Gaudi libraries and software associated with Intel Gaudi AI accelerators through custom-configured workbench instances.
Intel Gaudi AI accelerators offer optimized performance for deep learning workloads, with the latest Gaudi 3 devices providing significant improvements in training speed and energy efficiency. These accelerators are suitable for enterprises running machine learning and AI applications on OpenShift AI.
Before you can enable Intel Gaudi AI accelerators in OpenShift AI, you must complete the following steps:
- Install the latest version of the Intel Gaudi AI Accelerator Operator from OperatorHub.
- Create and configure a custom workbench image for Intel Gaudi AI accelerators. A prebuilt workbench image for Gaudi accelerators is not included in OpenShift AI.
- Manually define and configure a hardware profile for each Intel Gaudi AI device in your environment.
Red Hat supports Intel Gaudi devices up to Intel Gaudi 3. The Intel Gaudi 3 accelerators, in particular, offer the following benefits:
- Improved training throughput: Reduce the time required to train large models by using advanced tensor processing cores and increased memory bandwidth.
- Energy efficiency: Lower power consumption while maintaining high performance, reducing operational costs for large-scale deployments.
- Scalable architecture: Scale across multiple nodes for distributed training configurations.
To use Intel Gaudi AI accelerators in an Amazon EC2 DL1 instance, your OpenShift platform must support EC2 DL1 instances. After you enable the accelerators, create a custom workbench image, and configure the hardware profile, you can use Intel Gaudi AI accelerators in workbench instances or for model serving.
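For example, if you are running on AWS, you can confirm that DL1 worker nodes are present by listing the instance type reported by the standard node.kubernetes.io/instance-type label:
# List nodes with their cloud instance types; DL1 nodes report dl1.24xlarge
oc get nodes -L node.kubernetes.io/instance-type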
To identify the Intel Gaudi AI accelerators present in your deployment, use the lspci utility. For more information, see lspci(8) - Linux man page.
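For example, you can run lspci in a debug shell on the node. This sketch assumes that the lspci utility is available on the node image and that the accelerators report under the Habana Labs vendor name:
# Open a debug shell on the node and filter PCI devices for Habana Labs (Gaudi)
oc debug node/<node_name> -- chroot /host sh -c "lspci | grep -i habana"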
The presence of Intel Gaudi AI accelerators in your deployment, as indicated by the lspci utility, does not guarantee that the devices are ready to use. You must ensure that all installation and configuration steps are completed successfully.
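After the installation steps are complete, one way to confirm that the devices are schedulable, and not merely visible to lspci, is to check the extended resources that the node advertises. This sketch assumes that the Habana device plugin exposes the accelerators under the habana.ai/gaudi resource name:
# Show the Gaudi resources that the node advertises to the scheduler
oc describe node <node_name> | grep -i habana.ai/gaudi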
4.1. Enabling Intel Gaudi AI accelerators
Before you can use Intel Gaudi AI accelerators in OpenShift AI, you must install the required dependencies, deploy the Intel Gaudi AI Accelerator Operator, and configure the environment.
Prerequisites
- You have logged in to OpenShift.
- You have the cluster-admin role in OpenShift.
- You have installed your Intel Gaudi accelerator and confirmed that it is detected in your environment.
- Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
- You have installed the OpenShift command-line interface (CLI).
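You can confirm the login and role prerequisites from the CLI before you begin. A minimal check, assuming your current context points at the target cluster:
# Confirm that you are logged in and identify the current user
oc whoami
# Confirm that the user has cluster-admin-level access
oc auth can-i '*' '*' --all-namespaces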
Procedure
- Install the latest version of the Intel Gaudi AI Accelerator Operator, as described in Intel Gaudi AI Operator OpenShift installation.
By default, OpenShift sets a per-pod PID limit of 4096. If your workload requires more processing power, such as when you use multiple Gaudi accelerators or when you use vLLM with Ray, you must manually increase the per-pod PID limit to avoid "Resource temporarily unavailable" errors. These errors occur due to PID exhaustion. Red Hat recommends setting this limit to 32768, although values over 20000 are sufficient. Run the following command to label the node:
oc label node <node_name> custom-kubelet=set-pod-pid-limit-kubelet
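You can confirm that the label was applied before you continue. For example:
# Verify that the node now carries the custom-kubelet label
oc get node <node_name> --show-labels | grep custom-kubelet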
- Optional: To prevent workload distribution on the affected node, you can mark the node as unschedulable and then drain it in preparation for maintenance, as shown in the sketch after this step. For more information, see Understanding how to evacuate pods on nodes.
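A minimal sketch of that optional cordon-and-drain flow; adjust the drain flags to your workload requirements:
# Mark the node as unschedulable
oc adm cordon <node_name>
# Evict the running pods; daemon set pods cannot be evicted and must be ignored
oc adm drain <node_name> --ignore-daemonsets --delete-emptydir-data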
- Create a custom-kubelet-pidslimit.yaml KubeletConfig resource file and populate it with the following YAML code. Set the podPidsLimit value to 32768:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet-pidslimit
spec:
  kubeletConfig:
    podPidsLimit: 32768
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: set-pod-pid-limit-kubelet
Apply the configuration:
oc apply -f custom-kubelet-pidslimit.yaml
This operation causes the node to reboot. For more information, see Understanding node rebooting.
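You can watch the rollout and confirm the new limit after the node returns. A minimal sketch, assuming the node belongs to the worker machine config pool and that the rendered kubelet configuration is at its usual /etc/kubernetes/kubelet.conf location:
# Watch the machine config pool until UPDATED returns to True
oc get mcp worker -w
# Confirm the new PID limit in the rendered kubelet configuration on the node
oc debug node/<node_name> -- chroot /host grep -i podPidsLimit /etc/kubernetes/kubelet.conf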
- Optional: If you previously marked the node as unschedulable, you can allow scheduling again after the node reboots, as shown below.
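For example:
# Allow the scheduler to place new pods on the node again
oc adm uncordon <node_name>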
- Create a custom workbench image for Intel Gaudi AI accelerators, as described in Creating custom workbench images.
- After installing the Intel Gaudi AI Accelerator Operator, create a hardware profile, as described in Working with hardware profiles.
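After the profile exists, workloads consume the accelerators as a standard Kubernetes extended resource. The following pod spec is a hypothetical smoke test, assuming the device plugin advertises the habana.ai/gaudi resource name and that <gaudi_workbench_image> is your custom workbench image; it is not a manifest that OpenShift AI generates:
apiVersion: v1
kind: Pod
metadata:
  name: gaudi-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: test
    image: <gaudi_workbench_image>  # replace with your custom Gaudi workbench image
    command: ["hl-smi"]             # assumption: the image ships the hl-smi device query tool
    resources:
      limits:
        habana.ai/gaudi: 1          # request one Gaudi device from the device plugin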
Verification
From the Administrator perspective, go to the Operators → Installed Operators page and confirm that the following Operators are listed:
- Intel Gaudi AI Accelerator
- Node Feature Discovery (NFD)
- Kernel Module Management (KMM)
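You can also confirm the installation from the CLI. A sketch, assuming cluster-wide installation; the exact ClusterServiceVersion names vary by version:
# List installed Operators and check that the Gaudi, NFD, and KMM entries are in the Succeeded phase
oc get csv -A | grep -Ei 'gaudi|nfd|kernel-module'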