Chapter 6. Serving and inferencing language models with Podman using Google TPU AI accelerators
Serve and inference a large language model with Podman or Docker and Red Hat AI Inference Server in a Google Cloud VM that has Google TPU AI accelerators available.
Prerequisites
- You have access to a Google Cloud TPU VM with Google TPU AI accelerators configured.
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.
Procedure
Open a terminal on your TPU server host, and log in to registry.redhat.io:

$ podman login registry.redhat.io

Pull the Red Hat AI Inference Server image by running the following command:

$ podman pull registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1

Optional: Verify that the TPUs are available in the host.
Open a shell prompt in the Red Hat AI Inference Server container. Run the following command:

$ podman run -it --net=host --privileged --rm --entrypoint /bin/bash registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1

Verify system TPU access and basic operations by running the following Python code in the container shell prompt:

$ python3 -c "
import jax
import importlib.metadata
try:
    devices = jax.devices()
    print(f'JAX devices available: {devices}')
    print(f'Number of TPU devices: {len(devices)}')
    tpu_version = importlib.metadata.version('tpu_inference')
    print(f'tpu-inference version: {tpu_version}')
    print('TPU is operational.')
except Exception as e:
    print(f'TPU test failed: {e}')
    print('Check container image version and TPU device availability.')
"

Example output:
JAX devices available: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
Number of TPU devices: 4
tpu-inference version: 0.13.2
TPU is operational.

Exit the shell prompt.
$ exit
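The check above only enumerates devices. To additionally confirm that the accelerators execute work, you can run a small computation from the same container shell before exiting. This is a minimal sketch: the `jax` calls assume the container's bundled JAX install, and the `count_devices_by_platform` helper is illustrative, not part of any shipped API.

```python
from collections import Counter

def count_devices_by_platform(devices):
    """Group an accelerator device list by platform name (for example 'tpu').
    Works on any objects exposing a .platform attribute, as JAX devices do."""
    return Counter(d.platform for d in devices)

if __name__ == "__main__":
    # Requires the container's JAX install and visible TPU devices.
    import jax
    import jax.numpy as jnp

    print(count_devices_by_platform(jax.devices()))  # e.g. Counter({'tpu': 4})

    # Run a tiny matrix multiply on the default device to confirm compute.
    x = jnp.ones((128, 128))
    print(jnp.dot(x, x).sum())  # expect 2097152.0 (128 * 128 * 128)
```

If the matrix multiply completes without an error, the runtime can compile and dispatch work to the TPUs, which is a stronger signal than device enumeration alone.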
Create a cache directory to mount into the container, and adjust its permissions so that the container can use it:

$ mkdir -p ./.cache/rhaii
$ chmod g+rwX ./.cache/rhaii

Add the HF_TOKEN Hugging Face token to the private.env file:

$ echo "export HF_TOKEN=<huggingface_token>" > private.env

Append the HF_HOME variable to the private.env file:

$ echo "export HF_HOME=./.cache/rhaii" >> private.env

Source the private.env file:

$ source private.env

Start the AI Inference Server container image:
$ podman run --rm -it \
  --name vllm-tpu \
  --network=host \
  --privileged \
  -v /dev/shm:/dev/shm \
  -e HF_TOKEN=$HF_TOKEN \
  -e HF_HUB_OFFLINE=0 \
  -v ./.cache/rhaii:/opt/app-root/src/.cache \
  registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len=256 \
  --host=0.0.0.0 \
  --port=8000

Where:

--tensor-parallel-size 1
  Specifies the number of TPUs to use for tensor parallelism. Set this value to match the number of available TPUs.
--max-model-len=256
  Specifies the maximum model context length. For optimal performance, set this value as low as your workload allows.
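To see why a low --max-model-len helps, you can estimate the KV-cache footprint it implies per sequence. The sketch below uses the standard per-layer key/value accounting; the layer and head counts are illustrative assumptions, not values taken from the Qwen/Qwen2.5-1.5B-Instruct configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, max_model_len, dtype_bytes=2):
    """Approximate KV-cache size for one sequence: 2 tensors (K and V) per
    layer, each shaped [max_model_len, num_kv_heads, head_dim]."""
    return 2 * num_layers * max_model_len * num_kv_heads * head_dim * dtype_bytes

# Illustrative values (assumed): 28 layers, 2 KV heads, head_dim 128,
# bf16 (2 bytes per element), context length 256 tokens.
size = kv_cache_bytes(num_layers=28, num_kv_heads=2, head_dim=128, max_model_len=256)
print(f"{size / 1024**2:.1f} MiB per sequence")  # 7.0 MiB
```

The footprint scales linearly with --max-model-len, so doubling the context length doubles the KV-cache memory each concurrent sequence consumes.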
Verification
Check that the AI Inference Server is running. Open a separate tab in your terminal, and make a model request with the API:
$ curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"messages": [
{"role": "user", "content": "Briefly, what colour is the wind?"}
],
"max_tokens": 50
}' | jq
The model returns a valid JSON response answering your question.
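You can make the same request from Python. A minimal sketch using only the standard library; the URL and model name match the curl example above, so adjust them if you changed the host, port, or model.

```python
import json
import urllib.request

def build_chat_request(model, prompt, max_tokens=50,
                       base_url="http://localhost:8000"):
    """Build an OpenAI-compatible /v1/chat/completions request object."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Requires the AI Inference Server container started above to be running.
    req = build_chat_request("Qwen/Qwen2.5-1.5B-Instruct",
                             "Briefly, what colour is the wind?")
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```

A successful run prints the model's answer extracted from the first choice of the JSON response, the same field the curl output shows under choices[0].message.content.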