Chapter 5. Serving and inferencing language models with Podman using Google TPU AI accelerators


Serve and run inference on a large language model with Podman or Docker and Red Hat AI Inference Server in a Google Cloud VM that has Google TPU AI accelerators available.

Prerequisites

  • You have access to a Google Cloud TPU VM with Google TPU AI accelerators configured. For more information, see the Google Cloud TPU documentation.
  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to the registry.redhat.io image registry and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

  1. Open a terminal on your TPU server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Pull the Red Hat AI Inference Server image by running the following command:

    $ podman pull registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.2
  3. Optional: Verify that the TPUs are available on the host.

    1. Open a shell prompt in the Red Hat AI Inference Server container. Run the following command:

      $ podman run -it --net=host --privileged -e PJRT_DEVICE=TPU --rm --entrypoint /bin/bash registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.2
    2. Verify system TPU access and basic operations by running the following Python code in the container shell prompt:

      $ python3 -c "
      import torch
      import torch_xla

      try:
          # Acquire the default XLA (TPU) device
          device = torch_xla.device()
          print(f'XLA device available: {device}')

          # Run a small matrix multiplication on the TPU
          x = torch.randn(3, 3).to(device)
          y = torch.randn(3, 3).to(device)
          z = torch.matmul(x, y)

          # Force execution of the queued XLA operations
          torch_xla.sync()

          print('Matrix multiplication successful')
          print(f'Result tensor shape: {z.shape}')
          print(f'Result tensor device: {z.device}')
          print(f'Result tensor: {z.data}')
          print('TPU is operational.')
      except Exception as e:
          print(f'TPU test failed: {e}')
          print('Try restarting the container to clear TPU locks')
      "

      Example output

      XLA device available: xla:0
      Matrix multiplication successful
      Result tensor shape: torch.Size([3, 3])
      Result tensor device: xla:0
      Result tensor: tensor([[-1.8161,  1.6359, -3.1301],
              [-1.2205,  0.8985, -1.4422],
              [ 0.0588,  0.7693, -1.5683]], device='xla:0')
      TPU is operational.

    3. Exit the shell prompt.

      $ exit
  4. Create a cache directory on the host to mount into the container, and adjust its permissions so that the container can use it.

    $ mkdir -p ./.cache/rhaiis
    $ chmod g+rwX ./.cache/rhaiis
  5. Add your Hugging Face access token to the private.env file as the HF_TOKEN variable.

    $ echo "export HF_TOKEN=<huggingface_token>" > private.env
  6. Append the HF_HOME variable to the private.env file.

    $ echo "export HF_HOME=./.cache/rhaiis" >> private.env

    Source the private.env file.

    $ source private.env
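
    Optional: confirm that the token works before starting the container. The following is a minimal sketch, assuming the huggingface_hub Python package is available on the host (it is not installed by this procedure):

    $ python3 -c "
    import os
    from huggingface_hub import HfApi

    # Look up the account that HF_TOKEN belongs to; an invalid or
    # expired token raises an authentication error here instead.
    info = HfApi().whoami(token=os.environ['HF_TOKEN'])
    print('Token is valid for Hugging Face user:', info['name'])
    "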
  7. Start the AI Inference Server container image:

    podman run --rm -it \
      --name vllm-tpu \
      --network=host \
      --privileged \
      --shm-size=4g \
      --device=/dev/vfio/vfio \
      --device=/dev/vfio/0 \
      -e PJRT_DEVICE=TPU \
      -e HF_HUB_OFFLINE=0 \
      -v ./.cache/rhaiis:/opt/app-root/src/.cache \
      registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.2 \
      --model Qwen/Qwen2.5-1.5B-Instruct \
      --tensor-parallel-size 1 \
      --max-model-len=256 \
      --host=0.0.0.0 \
      --port=8000

    Set --tensor-parallel-size to match the number of TPUs on the host.
    For the best performance, set the --max-model-len parameter to the lowest value that your workload allows.
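
    The container downloads the model the first time it starts, so the API can take several minutes to become available. The following readiness check is a minimal sketch, not part of the product: it assumes the requests Python package is installed on the host and that the server listens on port 8000 as configured above. Run it from a separate terminal before making the verification request below.

    $ python3 -c "
    import time
    import requests

    # Poll the vLLM /health endpoint until the server reports ready,
    # or give up after roughly 10 minutes.
    for _ in range(120):
        try:
            if requests.get('http://localhost:8000/health', timeout=5).status_code == 200:
                print('AI Inference Server is ready.')
                break
        except requests.ConnectionError:
            pass
        time.sleep(5)
    else:
        print('Server did not become ready in time; check the container logs.')
    "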

Verification

Check that the AI Inference Server is responding. Open a separate tab in your terminal, and make a model request with the API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Briefly, what colour is the wind?"}
    ],
    "max_tokens": 50
  }' | jq

Example output

{
  "id": "chatcmpl-13a9d6a04fd245409eb601688d6144c1",
  "object": "chat.completion",
  "created": 1755268559,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The wind is typically associated with the color white or grey, as it can carry dust, sand, or other particles. However, it is not a color in the traditional sense.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 38,
    "total_tokens": 75,
    "completion_tokens": 37,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
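
You can also query the model programmatically through the OpenAI-compatible API. The following is a minimal sketch, assuming the openai Python package is installed on the client machine; the base URL and model name match the server started in the procedure above.

from openai import OpenAI

# The local server does not check credentials, but the client requires an API key value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Briefly, what colour is the wind?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)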