Chapter 4. Intel Gaudi AI Accelerator integration


To accelerate your high-performance deep learning models, you can integrate Intel Gaudi AI accelerators into OpenShift AI. This integration enables your data scientists to use Gaudi libraries and software associated with Intel Gaudi AI accelerators through custom-configured workbench instances.

Intel Gaudi AI accelerators offer optimized performance for deep learning workloads, with the latest Gaudi 3 devices providing significant improvements in training speed and energy efficiency. These accelerators are suitable for enterprises running machine learning and AI applications on OpenShift AI.

Before you can enable Intel Gaudi AI accelerators in OpenShift AI, you must complete the following steps:

  1. Install the latest version of the Intel Gaudi AI Accelerator Operator from OperatorHub.
  2. Create and configure a custom workbench image for Intel Gaudi AI accelerators. A prebuilt workbench image for Gaudi accelerators is not included in OpenShift AI.
  3. Manually define and configure a hardware profile for each Intel Gaudi AI device in your environment.

Red Hat supports Intel Gaudi devices up to Intel Gaudi 3. The Intel Gaudi 3 accelerators, in particular, offer the following benefits:

  • Improved training throughput: Reduce the time required to train large models by using advanced tensor processing cores and increased memory bandwidth.
  • Energy efficiency: Lower power consumption while maintaining high performance, reducing operational costs for large-scale deployments.
  • Scalable architecture: Scale across multiple nodes for distributed training configurations.

To use Intel Gaudi AI accelerators in an Amazon EC2 DL1 instance, your OpenShift platform must support EC2 DL1 instances. After you enable the accelerators, create a custom workbench image, and configure the hardware profile, you can use Intel Gaudi AI accelerators in workbench instances or for model serving.

To identify the Intel Gaudi AI accelerators present in your deployment, use the lspci utility. For more information, see lspci(8) - Linux man page.
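Intel Gaudi devices appear under the "Habana Labs" vendor name in lspci output, so a simple filter can isolate them. The following sketch demonstrates the filter on a hypothetical sample line (the device model shown is illustrative); on a real node you would pipe the actual lspci output instead:

```shell
# Gaudi accelerators are listed under the "Habana Labs" vendor in lspci
# output. On a real node you would run, for example:
#   lspci | grep -i habana
# The line below is a hypothetical sample used to illustrate the filter.
sample='33:00.0 Processing accelerators: Habana Labs Ltd. HL-225 AI Training Accelerator'
printf '%s\n' "$sample" | grep -ci 'habana'
```

The count printed by `grep -c` tells you how many Gaudi devices the host reports on the PCI bus.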

Important

The presence of Intel Gaudi AI accelerators in your deployment, as indicated by the lspci utility, does not guarantee that the devices are ready to use. You must ensure that all installation and configuration steps are completed successfully.

4.1. Enabling Intel Gaudi AI accelerators

Before you can use Intel Gaudi AI accelerators in OpenShift AI, you must install the required dependencies, deploy the Intel Gaudi AI Accelerator Operator, and configure the environment.

Prerequisites

  • You have logged in to OpenShift.
  • You have the cluster-admin role in OpenShift.
  • You have installed your Intel Gaudi accelerator and confirmed that it is detected in your environment.
  • Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
  • You have installed the OpenShift command-line interface (CLI).

Procedure

  1. Install the latest version of the Intel Gaudi AI Accelerator Operator, as described in Intel Gaudi AI Operator OpenShift installation.
  2. By default, OpenShift sets a per-pod PID limit of 4096. If your workload requires more processes, such as when you use multiple Gaudi accelerators or run vLLM with Ray, you must manually increase the per-pod PID limit to avoid Resource temporarily unavailable errors caused by PID exhaustion. Red Hat recommends setting this limit to 32768, although values above 20000 are typically sufficient.

    1. Run the following command to label the node:

      oc label node <node_name> custom-kubelet=set-pod-pid-limit-kubelet
    2. Optional: To prevent workload distribution on the affected node, you can mark the node as unschedulable and then drain it in preparation for maintenance. For more information, see Understanding how to evacuate pods on nodes.
    3. Create a custom-kubelet-pidslimit.yaml KubeletConfig resource file and populate it with the following YAML code. Set the PodPidsLimit value to 32768:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: custom-kubelet-pidslimit
      spec:
        kubeletConfig:
          PodPidsLimit: 32768
        machineConfigPoolSelector:
          matchLabels:
            custom-kubelet: set-pod-pid-limit-kubelet
    4. Apply the configuration:

      oc apply -f custom-kubelet-pidslimit.yaml

      This operation causes the node to reboot. For more information, see Understanding node rebooting.

    5. Optional: If you previously marked the node as unschedulable, you can allow scheduling again after the node reboots.
  3. Create a custom workbench image for Intel Gaudi AI accelerators, as described in Creating custom workbench images.
  4. After installing the Intel Gaudi AI Accelerator Operator, create a hardware profile, as described in Working with hardware profiles.
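The PID-limit configuration from the procedure can be sanity-checked offline before you apply it to a cluster. The following sketch recreates the KubeletConfig file using the file name and label from the procedure, then confirms that the limit and the pool-selector label are both present:

```shell
# Offline sanity check of the KubeletConfig used in the procedure.
# On a live cluster, you would query the applied object instead, for example:
#   oc get kubeletconfig custom-kubelet-pidslimit -o yaml
cat > custom-kubelet-pidslimit.yaml <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet-pidslimit
spec:
  kubeletConfig:
    PodPidsLimit: 32768
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: set-pod-pid-limit-kubelet
EOF
# Confirm the PID limit and the selector label are both present.
grep -E 'PodPidsLimit|custom-kubelet:' custom-kubelet-pidslimit.yaml
```

Note that the machineConfigPoolSelector label corresponds to the custom-kubelet label used earlier in the procedure; if the selector matches no machine config pool, the configuration is not applied.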

Verification

From the Administrator perspective, go to the Operators → Installed Operators page. Confirm that the following Operators appear:

  • Intel Gaudi AI Accelerator
  • Node Feature Discovery (NFD)
  • Kernel Module Management (KMM)
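As an alternative to the console check, you can confirm the same Operators from the CLI by listing ClusterServiceVersions and checking that each reports the Succeeded phase. The sample output below is hypothetical (CSV names and versions vary by release):

```shell
# CLI alternative to the console check. On a live cluster you would run:
#   oc get csv -A
# and confirm each operator's CSV reports the Succeeded phase.
# The sample output below is illustrative only.
sample='gaudi-operator.v1.0   Intel Gaudi AI Accelerator   Succeeded
nfd.v4.0              Node Feature Discovery       Succeeded
kmm.v2.0              Kernel Module Management     Succeeded'
printf '%s\n' "$sample" | grep -c 'Succeeded'
```

A count of 3 indicates that all three Operators reached the Succeeded phase.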