Chapter 6. Serving and inferencing language models with Podman using Google TPU AI accelerators
Serve and inference a large language model with Podman or Docker and Red Hat AI Inference Server in a Google Cloud VM that has Google TPU AI accelerators available.
Prerequisites
- You have access to a Google Cloud TPU VM with Google TPU AI accelerators configured. For more information, see the Google Cloud TPU documentation.
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.
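Optionally, you can confirm that your Hugging Face access token is valid before you start. The following is a minimal sketch that uses the huggingface_hub Python package, assuming it is installed on the host; <huggingface_token> is a placeholder for your actual token:

from huggingface_hub import whoami

# Returns account details for a valid token and raises an error otherwise.
# <huggingface_token> is a placeholder for your actual token.
print(whoami(token="<huggingface_token>"))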
Procedure
Open a terminal on your TPU server host, and log in to registry.redhat.io:

$ podman login registry.redhat.io

Pull the Red Hat AI Inference Server image by running the following command:

$ podman pull registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0

Optional: Verify that the TPUs are available on the host.
Open a shell prompt in the Red Hat AI Inference Server container. Run the following command:
$ podman run -it --net=host --privileged -e PJRT_DEVICE=TPU --rm --entrypoint /bin/bash registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0

Verify system TPU access and basic operations by running the following Python code in the container shell prompt:

$ python3 -c "
import torch
import torch_xla

try:
    device = torch_xla.device()
    print(f'XLA device available: {device}')
    x = torch.randn(3, 3).to(device)
    y = torch.randn(3, 3).to(device)
    z = torch.matmul(x, y)
    torch_xla.sync()
    print('Matrix multiplication successful')
    print(f'Result tensor shape: {z.shape}')
    print(f'Result tensor device: {z.device}')
    print(f'Result tensor: {z.data}')
    print('TPU is operational.')
except Exception as e:
    print(f'TPU test failed: {e}')
    print('Try restarting the container to clear TPU locks')
"

Example output
XLA device available: xla:0
Matrix multiplication successful
Result tensor shape: torch.Size([3, 3])
Result tensor device: xla:0
Result tensor: tensor([[-1.8161,  1.6359, -3.1301],
        [-1.2205,  0.8985, -1.4422],
        [ 0.0588,  0.7693, -1.5683]], device='xla:0')
TPU is operational.

Exit the shell prompt:
$ exit
Create a cache directory to mount into the container, and adjust its permissions so that the container can use it:

$ mkdir -p ./.cache/rhaiis
$ chmod g+rwX ./.cache/rhaiis

Add the HF_TOKEN Hugging Face token to the private.env file:

$ echo "export HF_TOKEN=<huggingface_token>" > private.env

Append the HF_HOME variable to the private.env file:

$ echo "export HF_HOME=./.cache/rhaiis" >> private.env

Source the private.env file:

$ source private.env
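After these commands, the private.env file contains the two export lines, with your actual token in place of the placeholder:

export HF_TOKEN=<huggingface_token>
export HF_HOME=./.cache/rhaiis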
Start the AI Inference Server container image. The -e HF_TOKEN=$HF_TOKEN option passes the sourced Hugging Face token into the container:

$ podman run --rm -it \
  --name vllm-tpu \
  --network=host \
  --privileged \
  --shm-size=4g \
  --device=/dev/vfio/vfio \
  --device=/dev/vfio/0 \
  -e PJRT_DEVICE=TPU \
  -e HF_TOKEN=$HF_TOKEN \
  -e HF_HUB_OFFLINE=0 \
  -v ./.cache/rhaiis:/opt/app-root/src/.cache \
  registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len=256 \
  --host=0.0.0.0 \
  --port=8000

Where:
--tensor-parallel-size 1: Specifies the number of TPUs to use for tensor parallelism. Set this value to match the number of available TPUs; one way to count them is shown in the sketch after this list.
--max-model-len=256: Specifies the maximum model context length. For optimal performance, set this value as low as your workload allows.
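To choose a value for --tensor-parallel-size, you can count the TPU devices that the XLA runtime sees. The following is a minimal sketch to run from the container shell prompt opened in the optional verification step; it assumes the torch_xla.runtime module in the image exposes global_runtime_device_count(), as recent torch_xla releases do:

$ python3 -c "
import torch_xla.runtime as xr

# Number of TPU devices visible to the XLA runtime on this host
print(f'TPU devices: {xr.global_runtime_device_count()}')
"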
Verification
Check that the AI Inference Server API is up. Open a separate tab in your terminal, and make a model request with the API:
$ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [
{"role": "user", "content": "Briefly, what colour is the wind?"}
],
"max_tokens": 50
}' | jq
Example output
{
"id": "chatcmpl-13a9d6a04fd245409eb601688d6144c1",
"object": "chat.completion",
"created": 1755268559,
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The wind is typically associated with the color white or grey, as it can carry dust, sand, or other particles. However, it is not a color in the traditional sense.",
"refusal": null,
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": [],
"reasoning_content": null
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 38,
"total_tokens": 75,
"completion_tokens": 37,
"prompt_tokens_details": null
},
"prompt_logprobs": null,
"kv_transfer_params": null
}
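Because the server exposes an OpenAI-compatible API, you can make the same request from Python with the openai client package. The following is a minimal sketch, assuming the openai package is installed on the client machine and the server is reachable on localhost port 8000; the api_key value is a placeholder because the server does not check it by default:

from openai import OpenAI

# Point the client at the local AI Inference Server endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Briefly, what colour is the wind?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)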