Chapter 6. Serving and inferencing language models with Podman using Google TPU AI accelerators


Serve and run inference on a large language model with Podman or Docker and Red Hat AI Inference Server in a Google Cloud VM that has Google TPU AI accelerators available.

Prerequisites

  • You have access to a Google Cloud TPU VM with Google TPU AI accelerators configured.
  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to the registry.redhat.io image registry and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

  1. Open a terminal on your TPU server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Pull the Red Hat AI Inference Server image by running the following command:

    $ podman pull registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1
  3. Optional: Verify that the TPUs are available in the host.

    1. Open a shell prompt in the Red Hat AI Inference Server container. Run the following command:

      $ podman run -it --net=host --privileged --rm --entrypoint /bin/bash registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1
    2. Verify system TPU access and basic operations by running the following Python code in the container shell prompt:

      $ python3 -c "
      import jax
      import importlib.metadata
      try:
          devices = jax.devices()
          print(f'JAX devices available: {devices}')
          print(f'Number of TPU devices: {len(devices)}')
          tpu_version = importlib.metadata.version('tpu_inference')
          print(f'tpu-inference version: {tpu_version}')
          print('TPU is operational.')
      except Exception as e:
          print(f'TPU test failed: {e}')
          print('Check container image version and TPU device availability.')
      "

      Example output:

      JAX devices available: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
      Number of TPU devices: 4
      tpu-inference version: 0.13.2
      TPU is operational.
    3. Exit the shell prompt.

      $ exit
  4. Create a cache directory on the host to mount into the container, and adjust its permissions so that the container can use it:

    $ mkdir -p ./.cache/rhaii
    $ chmod g+rwX ./.cache/rhaii
  5. Add your Hugging Face token to the private.env file as the HF_TOKEN variable:

    $ echo "export HF_TOKEN=<huggingface_token>" > private.env
  6. Append the HF_HOME variable to the private.env file.

    $ echo "export HF_HOME=./.cache/rhaii" >> private.env

    Source the private.env file.

    $ source private.env
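
    Before starting the container, you can confirm that the variables from private.env are visible in your current shell. The following is a minimal sketch; the check_env helper is illustrative and not part of the product:

    ```python
    import os

    # Variables that the "podman run" command in the next step relies on.
    REQUIRED_VARS = ("HF_TOKEN", "HF_HOME")

    def check_env(env=None):
        """Return the names of required variables that are missing or empty."""
        if env is None:
            env = os.environ
        return [name for name in REQUIRED_VARS if not env.get(name)]

    missing = check_env()
    if missing:
        print(f"Missing variables: {', '.join(missing)}. Re-run 'source private.env'.")
    else:
        print("HF_TOKEN and HF_HOME are set.")
    ```

    If any variable is reported missing, source the private.env file again before proceeding.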
  7. Start the AI Inference Server container image:

    $ podman run --rm -it \
      --name vllm-tpu \
      --network=host \
      --privileged \
      -v /dev/shm:/dev/shm \
      -e HF_TOKEN=$HF_TOKEN \
      -e HF_HUB_OFFLINE=0 \
      -v ./.cache/rhaii:/opt/app-root/src/.cache \
      registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1 \
      --model Qwen/Qwen2.5-1.5B-Instruct \
      --tensor-parallel-size 1 \
      --max-model-len=256 \
      --host=0.0.0.0 \
      --port=8000

    Where:

    --tensor-parallel-size 1
    Specifies the number of TPUs to use for tensor parallelism. Set this value to match the number of available TPUs.
    --max-model-len=256
    Specifies the maximum model context length. For optimal performance, set this value as low as your workload allows.

Verification

Check that the AI Inference Server is running. Open a separate tab in your terminal, and make a model request with the API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Briefly, what colour is the wind?"}
    ],
    "max_tokens": 50
  }' | jq

The model returns a valid JSON response answering your question.
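Because the server exposes an OpenAI-compatible HTTP API, you can make the same request from Python. The following is a minimal sketch using only the standard library; the build_chat_request helper is illustrative, and the endpoint, port, and model name are the ones used in the procedure above (adjust them if you changed the podman run command):

```python
import json
import urllib.request

def build_chat_request(prompt, model="Qwen/Qwen2.5-1.5B-Instruct", max_tokens=50):
    """Build an OpenAI-compatible chat completion request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the server running, send the request and print the model's reply:
# with urllib.request.urlopen(build_chat_request("Briefly, what colour is the wind?")) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

The reply text is found under choices[0].message.content in the JSON response, the same structure the curl example above pipes through jq.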
