Chapter 6. Serving and inferencing language models with Podman using Google TPU AI accelerators


Serve and run inference on a large language model with Podman or Docker and Red Hat AI Inference Server in a Google Cloud VM that has Google TPU AI accelerators available.

Prerequisites

  • You have access to a Google Cloud TPU VM with Google TPU AI accelerators configured.
  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to the registry.redhat.io image registry and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

  1. Open a terminal on your TPU server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Pull the Red Hat AI Inference Server image by running the following command:

    $ podman pull registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1
  3. Optional: Verify that the TPUs are available in the host.

    1. Open a shell prompt in the Red Hat AI Inference Server container. Run the following command:

      $ podman run -it --net=host --privileged --rm --entrypoint /bin/bash registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1
    2. Verify system TPU access and basic operations by running the following Python code in the container shell prompt:

      $ python3 -c "
      import jax
      import importlib.metadata
      try:
          devices = jax.devices()
          print(f'JAX devices available: {devices}')
          print(f'Number of TPU devices: {len(devices)}')
          tpu_version = importlib.metadata.version('tpu_inference')
          print(f'tpu-inference version: {tpu_version}')
          print('TPU is operational.')
      except Exception as e:
          print(f'TPU test failed: {e}')
          print('Check container image version and TPU device availability.')
      "

      Example output:

      JAX devices available: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
      Number of TPU devices: 4
      tpu-inference version: 0.13.2
      TPU is operational.
    3. Exit the shell prompt.

      $ exit
  4. Create a cache directory on the host to mount into the container, and adjust its permissions so that the container can use it:

    $ mkdir -p ./.cache/rhaii
    $ chmod g+rwX ./.cache/rhaii
  5. Add your Hugging Face token to the private.env file as the HF_TOKEN variable:

    $ echo "export HF_TOKEN=<huggingface_token>" > private.env
  6. Append the HF_HOME variable to the private.env file.

    $ echo "export HF_HOME=./.cache/rhaii" >> private.env

    Source the private.env file.

    $ source private.env
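
    Before starting the container, you can confirm that the variables from private.env are visible in your current shell. The following is a minimal sketch; the check_env helper is illustrative and not part of the product:

    ```python
    import os

    # Variables that the "podman run" command in the next step relies on.
    REQUIRED_VARS = ("HF_TOKEN", "HF_HOME")

    def check_env(env=None):
        """Return the names of required variables that are missing or empty."""
        if env is None:
            env = os.environ
        return [name for name in REQUIRED_VARS if not env.get(name)]

    missing = check_env()
    if missing:
        print(f"Missing variables: {', '.join(missing)}. Re-run 'source private.env'.")
    else:
        print("HF_TOKEN and HF_HOME are set.")
    ```

    If any variable is reported missing, source the private.env file again before proceeding.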
  7. Start the AI Inference Server container image:

    $ podman run --rm -it \
      --name vllm-tpu \
      --network=host \
      --privileged \
      -v /dev/shm:/dev/shm \
      -e HF_TOKEN=$HF_TOKEN \
      -e HF_HUB_OFFLINE=0 \
      -v ./.cache/rhaii:/opt/app-root/src/.cache \
      registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1 \
      --model Qwen/Qwen2.5-1.5B-Instruct \
      --tensor-parallel-size 1 \
      --max-model-len=256 \
      --host=0.0.0.0 \
      --port=8000

    Where:

    --tensor-parallel-size 1
    Specifies the number of TPUs to use for tensor parallelism. Set this value to match the number of available TPUs.
    --max-model-len=256
    Specifies the maximum model context length. For optimal performance, set this value as low as your workload allows.

Verification

Check that the AI Inference Server is running. Open a separate tab in your terminal, and make a model request with the API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Briefly, what colour is the wind?"}
    ],
    "max_tokens": 50
  }' | jq

The model returns a valid JSON response answering your question.
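Because the server exposes an OpenAI-compatible HTTP API, you can make the same request from Python. The following is a minimal sketch using only the standard library; the build_chat_request helper is illustrative, and the endpoint, port, and model name are the ones used in the procedure above (adjust them if you changed the podman run command):

```python
import json
import urllib.request

def build_chat_request(prompt, model="Qwen/Qwen2.5-1.5B-Instruct", max_tokens=50):
    """Build an OpenAI-compatible chat completion request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the server running, send the request and print the model's reply:
# with urllib.request.urlopen(build_chat_request("Briefly, what colour is the wind?")) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

The reply text is found under choices[0].message.content in the JSON response, the same structure the curl example above pipes through jq.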
