Chapter 6. Serving and inferencing language models with Podman using Google TPU AI accelerators
Serve and inference a large language model with Podman or Docker and Red Hat AI Inference Server in a Google Cloud VM that has Google TPU AI accelerators available.
Prerequisites
- You have access to a Google Cloud TPU VM with Google TPU AI accelerators configured. For more information, see the Google Cloud TPU documentation.
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.
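Optionally, you can confirm that your Hugging Face access token is valid before you start. The following is a minimal sketch that uses the huggingface_hub Python package, assuming it is installed on the host; <huggingface_token> is a placeholder for your actual token:

from huggingface_hub import whoami

# Returns account details for a valid token and raises an error otherwise.
# <huggingface_token> is a placeholder for your actual token.
print(whoami(token="<huggingface_token>"))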
Procedure
Open a terminal on your TPU server host, and log in to registry.redhat.io:

$ podman login registry.redhat.io

Pull the Red Hat AI Inference Server image by running the following command:

$ podman pull registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0

Optional: Verify that the TPUs are available on the host.
Open a shell prompt in the Red Hat AI Inference Server container. Run the following command:
$ podman run -it --net=host --privileged -e PJRT_DEVICE=TPU --rm --entrypoint /bin/bash registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0

Verify system TPU access and basic operations by running the following Python code in the container shell prompt:

$ python3 -c "
import torch
import torch_xla

try:
    device = torch_xla.device()
    print(f'XLA device available: {device}')
    x = torch.randn(3, 3).to(device)
    y = torch.randn(3, 3).to(device)
    z = torch.matmul(x, y)
    torch_xla.sync()
    print('Matrix multiplication successful')
    print(f'Result tensor shape: {z.shape}')
    print(f'Result tensor device: {z.device}')
    print(f'Result tensor: {z.data}')
    print('TPU is operational.')
except Exception as e:
    print(f'TPU test failed: {e}')
    print('Try restarting the container to clear TPU locks')
"

Example output
XLA device available: xla:0
Matrix multiplication successful
Result tensor shape: torch.Size([3, 3])
Result tensor device: xla:0
Result tensor: tensor([[-1.8161,  1.6359, -3.1301],
        [-1.2205,  0.8985, -1.4422],
        [ 0.0588,  0.7693, -1.5683]], device='xla:0')
TPU is operational.

Exit the shell prompt:
$ exit
Create a cache directory to mount into the container, and adjust its permissions so that the container can use it:

$ mkdir -p ./.cache/rhaiis
$ chmod g+rwX ./.cache/rhaiis

Add the HF_TOKEN Hugging Face token to the private.env file:

$ echo "export HF_TOKEN=<huggingface_token>" > private.env

Append the HF_HOME variable to the private.env file:

$ echo "export HF_HOME=./.cache/rhaiis" >> private.env

Source the private.env file:

$ source private.env
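After these commands, the private.env file contains the two export lines, with your actual token in place of the placeholder:

export HF_TOKEN=<huggingface_token>
export HF_HOME=./.cache/rhaiis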
Start the AI Inference Server container image. The -e HF_TOKEN=$HF_TOKEN option passes the sourced Hugging Face token into the container:

$ podman run --rm -it \
  --name vllm-tpu \
  --network=host \
  --privileged \
  --shm-size=4g \
  --device=/dev/vfio/vfio \
  --device=/dev/vfio/0 \
  -e PJRT_DEVICE=TPU \
  -e HF_TOKEN=$HF_TOKEN \
  -e HF_HUB_OFFLINE=0 \
  -v ./.cache/rhaiis:/opt/app-root/src/.cache \
  registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len=256 \
  --host=0.0.0.0 \
  --port=8000

Where:
--tensor-parallel-size 1: Specifies the number of TPUs to use for tensor parallelism. Set this value to match the number of available TPUs; one way to count them is shown in the sketch after this list.
--max-model-len=256: Specifies the maximum model context length. For optimal performance, set this value as low as your workload allows.
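To choose a value for --tensor-parallel-size, you can count the TPU devices that the XLA runtime sees. The following is a minimal sketch to run from the container shell prompt opened in the optional verification step; it assumes the torch_xla.runtime module in the image exposes global_runtime_device_count(), as recent torch_xla releases do:

$ python3 -c "
import torch_xla.runtime as xr

# Number of TPU devices visible to the XLA runtime on this host
print(f'TPU devices: {xr.global_runtime_device_count()}')
"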
Verification
Check that the AI Inference Server API is up. Open a separate tab in your terminal, and make a model request with the API:
$ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [
{"role": "user", "content": "Briefly, what colour is the wind?"}
],
"max_tokens": 50
}' | jq
Example output
{
"id": "chatcmpl-13a9d6a04fd245409eb601688d6144c1",
"object": "chat.completion",
"created": 1755268559,
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The wind is typically associated with the color white or grey, as it can carry dust, sand, or other particles. However, it is not a color in the traditional sense.",
"refusal": null,
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": [],
"reasoning_content": null
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 38,
"total_tokens": 75,
"completion_tokens": 37,
"prompt_tokens_details": null
},
"prompt_logprobs": null,
"kv_transfer_params": null
}
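Because the server exposes an OpenAI-compatible API, you can make the same request from Python with the openai client package. The following is a minimal sketch, assuming the openai package is installed on the client machine and the server is reachable on localhost port 8000; the api_key value is a placeholder because the server does not check it by default:

from openai import OpenAI

# Point the client at the local AI Inference Server endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Briefly, what colour is the wind?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)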