
Chapter 10. Serve language models and run inference with Podman using Intel Gaudi 3 AI accelerators


Serve a large language model and run inference by using Podman or Docker and Red Hat AI Inference Server on a server with Intel Gaudi 3 AI accelerators. Red Hat AI Inference Server integrates with Intel Gaudi 3 accelerators through the vllm-gaudi hardware plugin, which lets you deploy vLLM-based inference workloads on Gaudi hardware.

Note

Intel Gaudi 3 uses Habana Processing Unit (HPU) Graph compilation to optimize model execution. During the first inference request, the server compiles execution graphs for different input shapes, which requires additional time and memory compared to later requests. You can control this behavior by using Gaudi-specific environment variables described in the configuration reference.

Important

Intel Gaudi 3 accelerator support is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Prerequisites

  • You have access to a server with Intel Gaudi 3 accelerators and Gaudi Software Suite 1.23.0 or later installed.
  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to the registry.redhat.io image registry.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

  1. Open a terminal on your Gaudi 3 server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Pull the Red Hat AI Inference Server image by running the following command:

    $ podman pull registry.redhat.io/rhaii-early-access/vllm-gaudi-rhel9:3.4.0-ea.2
  3. If your system has SELinux enabled, configure SELinux to allow device access:

    $ sudo setsebool -P container_use_devices 1
  4. Optional: Verify that the Habana Processing Unit (HPU) devices are available on the host.

    1. Open a shell prompt in the Red Hat AI Inference Server container by running the following command:

      $ podman run -it --net=host --privileged --rm --entrypoint /bin/bash registry.redhat.io/rhaii-early-access/vllm-gaudi-rhel9:3.4.0-ea.2
    2. Verify HPU device access and basic operations by running the following Python code in the container shell prompt:

      $ python3 -c "
      import habana_frameworks.torch as ht
      import torch
      if torch.hpu.is_available():
          device_count = torch.hpu.device_count()
          print(f'HPU devices available: {device_count}')
          for i in range(device_count):
              print(f'  Device {i}: {torch.hpu.get_device_name(i)}')
          print('HPU is operational.')
      else:
          print('No HPU devices detected.')
          print('Check Gaudi SW installation and device paths.')
      "

      Example output:

      HPU devices available: 8
        Device 0: GAUDI3
        Device 1: GAUDI3
        Device 2: GAUDI3
        Device 3: GAUDI3
        Device 4: GAUDI3
        Device 5: GAUDI3
        Device 6: GAUDI3
        Device 7: GAUDI3
      HPU is operational.
    3. Optional: Verify the vllm-gaudi plugin version by running the following command in the container shell prompt:

      $ pip show vllm-gaudi | grep Version

      Example output:

      Version: 0.16.0
    4. Exit the shell prompt.

      $ exit
  5. Create a local cache directory for model storage and set the permissions:

    $ mkdir -p ./.cache/rhaii && chmod g+rwX ./.cache/rhaii
  6. Add your Hugging Face token to a private.env file as the HF_TOKEN variable, and source the file:

    $ echo "export HF_TOKEN=<huggingface_token>" > private.env
    $ source private.env
  7. Start the AI Inference Server container:

    Important

    The --privileged flag grants the container full access to host devices and capabilities. This flag is required because fine-grained device passthrough for Intel Gaudi 3 is not yet validated. For production deployments, evaluate the security implications of running privileged containers in your environment.

    $ podman run --rm -it \
      --name vllm-gaudi \
      --network=host \
      --privileged \
      --userns=keep-id:uid=1001 \
      -v /dev/shm:/dev/shm \
      -e HF_TOKEN=$HF_TOKEN \
      -e HF_HOME=/opt/app-root/src/.cache \
      -e HF_HUB_OFFLINE=0 \
      -v ./.cache/rhaii:/opt/app-root/src/.cache \
      registry.redhat.io/rhaii-early-access/vllm-gaudi-rhel9:3.4.0-ea.2 \
      --model RedHatAI/granite-4.0-h-small \
      --tensor-parallel-size 1 \
      --host 0.0.0.0 \
      --port 8000
    • --privileged: Specifies that the container has access to all HPU devices on the host. Fine-grained device passthrough for Intel Gaudi 3 is not yet validated.
    • --userns=keep-id:uid=1001: Specifies the host UID mapping to the effective UID of the vLLM process in the container. You can pass --user=0 instead, but this is less secure because it runs vLLM as root inside the container.
    • -e HF_HOME=/opt/app-root/src/.cache: Specifies the Hugging Face cache directory inside the container to match the volume mount target, ensuring that downloaded models persist across container restarts.
    • --tensor-parallel-size 1: Specifies the number of HPU devices to use for tensor parallelism. Set this value to match the number of available HPU devices for your workload, as shown in the example after this list.
    • --max-model-len: Optional. Specifies the maximum model context length. Defaults to the maximum context length defined by the model. If the model context length is too large for the available memory, you can reduce this value or lower the GPU memory utilization from the default of 0.9 by using the --gpu-memory-utilization argument, for example --gpu-memory-utilization=0.8.
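
    For example, a sketch of a multi-device run, assuming you want to shard the model across all eight HPUs reported by the earlier verification step and cap the context length to fit in memory. The values shown are illustrative, not tuned recommendations:

    $ podman run --rm -it --name vllm-gaudi --network=host --privileged \
      --userns=keep-id:uid=1001 -v /dev/shm:/dev/shm \
      -e HF_TOKEN=$HF_TOKEN -e HF_HOME=/opt/app-root/src/.cache \
      -v ./.cache/rhaii:/opt/app-root/src/.cache \
      registry.redhat.io/rhaii-early-access/vllm-gaudi-rhel9:3.4.0-ea.2 \
      --model RedHatAI/granite-4.0-h-small \
      --tensor-parallel-size 8 \
      --max-model-len 8192 \
      --gpu-memory-utilization 0.8 \
      --host 0.0.0.0 --port 8000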

Verification

  • Check that the AI Inference Server is serving the model. Open a separate tab in your terminal, and make a chat completions request with the API:

    $ curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "RedHatAI/granite-4.0-h-small",
        "messages": [
          {"role": "user", "content": "Briefly, what color is the wind?"}
        ],
        "max_tokens": 50
      }' | jq

    The model returns a valid JSON response answering your question.
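
  • Optional: Check the server health endpoint and the list of served models. Both endpoints are part of the vLLM OpenAI-compatible API; this assumes the server is listening on port 8000 as configured above:

    $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health

    The command prints 200 when the server is ready to accept requests.

    $ curl -s http://localhost:8000/v1/models | jq '.data[].id'

    The command lists the IDs of the models that the server is currently serving.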

10.1. Intel Gaudi 3 configuration and environment variables

You can configure Red Hat AI Inference Server on Intel Gaudi 3 accelerators by setting environment variables that control Habana Processing Unit (HPU) Graph compilation, memory management, and inference optimization. These variables are specific to the Gaudi software stack and supplement the general AI Inference Server environment variables.

Pass these variables to the container by using the -e flag when starting the AI Inference Server container with Podman or Docker.
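
For example, a sketch that skips the HPU graph warm-up phase and reserves a larger fraction of HPU memory for graph compilation during development. The values are illustrative, and the remaining arguments follow the serving command from the procedure above:

    $ podman run --rm -it --name vllm-gaudi --network=host --privileged \
      -e HF_TOKEN=$HF_TOKEN \
      -e VLLM_SKIP_WARMUP=true \
      -e VLLM_GRAPH_RESERVED_MEM=0.2 \
      registry.redhat.io/rhaii-early-access/vllm-gaudi-rhel9:3.4.0-ea.2 \
      --model RedHatAI/granite-4.0-h-small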

Note

Default values listed in this reference are based on the vllm-gaudi plugin version shipped with Red Hat AI Inference Server 3.4. For the latest upstream defaults, see vLLM Gaudi environment variables.

Table 10.1. HPU execution environment variables

Variable: VLLM_SKIP_WARMUP
Default: false
Description: Skips the warm-up phase during server startup. The warm-up phase pre-compiles HPU graphs for common input shapes, which improves latency for the first requests but increases startup time. Set to true to reduce startup time during development and testing. For production deployments, leave this set to false to benefit from pre-compiled graphs.

Note

Lazy mode (PT_HPU_LAZY_MODE) is not supported by the PyTorch version shipped with Red Hat AI Inference Server, and the server disables it by default. The server fails to start if you set the PT_HPU_LAZY_MODE=1 or PT_HPU_ENABLE_LAZY_COLLECTIVES=true environment variables.

Table 10.2. Memory management environment variables

Variable: VLLM_GRAPH_RESERVED_MEM
Default: 0.1
Description: Specifies the fraction of HPU memory reserved for HPU Graph compilation. The remaining memory is available for model weights and KV cache. Increase this value if graph compilation fails due to insufficient memory. Decrease this value if the model requires more memory for weights or KV cache.

Variable: VLLM_CONTIGUOUS_PA
Default: true
Description: Enables contiguous cache fetching to avoid costly gather operations on Gaudi 3. When enabled, the KV cache is fetched in contiguous memory blocks, improving memory access performance. Recommended for all Gaudi 3 deployments.

Table 10.3. Bucketing environment variables

Variable: VLLM_PROMPT_SEQ_BUCKET_MAX
Default: Dynamically set to max_num_batched_tokens
Description: Specifies the maximum prompt sequence bucket size. The Gaudi software stack groups input sequences into buckets of similar lengths to optimize graph compilation. Sequences longer than this value are processed in multiple chunks. Set this value to the maximum input token size that you expect to handle to reduce startup and warm-up time. Higher values require more memory for graph compilation.

Variable: VLLM_EXPONENTIAL_BUCKETING
Default: true
Description: Enables exponential bucketing for prompt processing. Exponential bucketing uses geometrically increasing bucket sizes instead of linear increments, reducing the number of compiled graphs needed for workloads with varied input lengths. Recommended for production workloads with varied input lengths.

Table 10.4. Runtime optimization environment variables

Variable: RUNTIME_SCALE_PATCHING
Default: false
Description: Enables runtime scale patching for FP8 quantized models. This optimization patches scaling factors at runtime to improve performance for FP8 inference. Set to true when serving FP8 quantized models on Gaudi 3.
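
For example, a sketch of enabling runtime scale patching when serving an FP8-quantized checkpoint. Replace <fp8_quantized_model> with an FP8-quantized model of your choice; the remaining arguments follow the serving command used earlier in this chapter:

    $ podman run --rm -it --network=host --privileged \
      -e HF_TOKEN=$HF_TOKEN \
      -e RUNTIME_SCALE_PATCHING=true \
      registry.redhat.io/rhaii-early-access/vllm-gaudi-rhel9:3.4.0-ea.2 \
      --model <fp8_quantized_model>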
