Search

Chapter 3. Habana Gaudi integration

download PDF

To accelerate your high-performance deep learning (DL) models, you can integrate Habana Gaudi devices in OpenShift AI. OpenShift AI also includes the HabanaAI workbench image, which is pre-built and ready for your data scientists to use after you install or upgrade OpenShift AI.

Before you can enable Habana Gaudi devices in OpenShift AI, you must install the necessary dependencies and the version of the HabanaAI Operator that matches the Habana version of the HabanaAI workbench image in your deployment. This allows your data scientists to use Habana libraries and software associated with Habana Gaudi devices from their workbench.

For more information about how to enable your OpenShift environment for Habana Gaudi devices, see HabanaAI Operator v1.10 for OpenShift and HabanaAI Operator v1.13 for OpenShift.

Important

Currently, Habana Gaudi integration is only supported in OpenShift 4.12.

You can use Habana Gaudi accelerators on OpenShift AI with versions 1.10.0 and 1.13.0 of the Habana Gaudi Operator. The version of the HabanaAI Operator that you install must match the Habana version of the HabanaAI workbench image in your deployment. This means that only one version of HabanaAI workbench image will work for you at a time.

For information about the supported configurations for versions 1.10 and 1.13 of the Habana Gaudi Operator, see Support Matrix v1.10.0 and Support Matrix v1.13.0.

You can use Habana Gaudi devices in an Amazon EC2 DL1 instance on OpenShift. Therefore, your OpenShift platform must support EC2 DL1 instances. Habana Gaudi accelerators are available to your data scientists when they create a workbench instance or serve a model.

To identify the Habana Gaudi devices present in your deployment, use the lspci utility. For more information, see lspci(8) - Linux man page.

Important

If the lspci utility indicates that Habana Gaudi devices are present in your deployment, it does not necessarily mean that the devices are ready to use.

Before you can use your Habana Gaudi devices, you must enable them in your OpenShift environment and configure an accelerator profile for each device. For more information about how to enable your OpenShift environment for Habana Gaudi devices, see HabanaAI Operator for OpenShift.

3.1. Enabling Habana Gaudi devices

Before you can use Habana Gaudi devices in OpenShift AI, you must install the necessary dependencies and deploy the HabanaAI Operator.

Prerequisites

  • You have logged in to OpenShift Container Platform.
  • You have the cluster-admin role in OpenShift Container Platform.

Procedure

  1. To enable Habana Gaudi devices in OpenShift AI, follow the instructions at HabanaAI Operator for OpenShift.
  2. From the OpenShift AI dashboard, click Settings Accelerator profiles.

    The Accelerator profiles page appears, displaying existing accelerator profiles. To enable or disable an existing accelerator profile, on the row containing the relevant accelerator profile, click the toggle in the Enable column.

  3. Click Create accelerator profile.

    The Create accelerator profile dialog opens.

  4. In the Name field, enter a name for the Habana Gaudi device.
  5. In the Identifier field, enter a unique string that identifies the Habana Gaudi device, for example, habana.ai/gaudi.
  6. Optional: In the Description field, enter a description for the Habana Gaudi device.
  7. To enable or disable the accelerator profile for the Habana Gaudi device immediately after creation, click the toggle in the Enable column.
  8. Optional: Add a toleration to schedule pods with matching taints.

    1. Click Add toleration.

      The Add toleration dialog opens.

    2. From the Operator list, select one of the following options:

      • Equal - The key/value/effect parameters must match. This is the default.
      • Exists - The key/effect parameters must match. You must leave a blank value parameter, which matches any.
    3. From the Effect list, select one of the following options:

      • None
      • NoSchedule - New pods that do not match the taint are not scheduled onto that node. Existing pods on the node remain.
      • PreferNoSchedule - New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to. Existing pods on the node remain.
      • NoExecute - New pods that do not match the taint cannot be scheduled onto that node. Existing pods on the node that do not have a matching toleration are removed.
    4. In the Key field, enter the toleration key habana.ai/gaudi. The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
    5. In the Value field, enter a toleration value. The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
    6. In the Toleration Seconds section, select one of the following options to specify how long a pod stays bound to a node that has a node condition.

      • Forever - Pods stays permanently bound to a node.
      • Custom value - Enter a value, in seconds, to define how long pods stay bound to a node that has a node condition.
    7. Click Add.
  9. Click Create accelerator profile.

Verification

  • From the Administrator perspective, the following Operators appear on the Operators Installed Operators page.

    • HabanaAI
    • Node Feature Discovery (NFD)
    • Kernel Module Management (KMM)
  • The Accelerator list displays the Habana Gaudi accelerator on the Start a notebook server page. After you select an accelerator, the Number of accelerators field appears, which you can use to choose the number of accelerators for your notebook server.
  • The accelerator profile appears on the Accelerator profiles page
  • The accelerator profile appears on the Instances tab on the details page for the AcceleratorProfile custom resource definition (CRD).
Red Hat logoGithubRedditYoutubeTwitter

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

We help Red Hat users innovate and achieve their goals with our products and services with content they can trust.

Making open source more inclusive

Red Hat is committed to replacing problematic language in our code, documentation, and web properties. For more details, see the Red Hat Blog.

About Red Hat

We deliver hardened solutions that make it easier for enterprises to work across platforms and environments, from the core datacenter to the network edge.

© 2024 Red Hat, Inc.