Chapter 6. Serving and inferencing language models with Podman using Google TPU AI accelerators

Serve and run inference on a large language model with Podman or Docker and Red Hat AI Inference Server in a Google Cloud VM that has Google TPU AI accelerators available.

Prerequisites

- You have access to a Google Cloud TPU VM with Google TPU AI accelerators configured. For more information, see:
  - Set up the Cloud TPU environment
  - vLLM inference on v6e TPUs
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.

Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

Open a terminal on your TPU server host, and log in to registry.redhat.io:
```
$ podman login registry.redhat.io
```

Pull the Red Hat AI Inference Server image by running the following command:

$ podman pull registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.2

Optional: Verify that the TPUs are available in the host.

Open a shell prompt in the Red Hat AI Inference Server container. Run the following command:

$ podman run -it --net=host --privileged --rm --entrypoint /bin/bash registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.2

Verify system TPU access and basic operations by running the following Python code in the container shell prompt:

$ python3 -c "
import importlib.metadata
import jax
try:
    devices = jax.devices()
    print(f'JAX devices available: {devices}')
    # jax.devices() can fall back to CPU, so count only TPU devices.
    tpus = [d for d in devices if d.platform == 'tpu']
    print(f'Number of TPU devices: {len(tpus)}')
    if not tpus:
        raise RuntimeError('no TPU devices found')
    tpu_version = importlib.metadata.version('tpu_inference')
    print(f'tpu-inference version: {tpu_version}')
    print('TPU is operational.')
except Exception as e:
    print(f'TPU test failed: {e}')
    print('Check container image version and TPU device availability.')
"

Example output:

JAX devices available: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
Number of TPU devices: 4
tpu-inference version: 0.13.2
TPU is operational.

Exit the shell prompt.
```
$ exit
```

Create a cache directory to mount into the container as a volume. Adjust the directory permissions so that the container can use it.
```
$ mkdir -p ./.cache/rhaii
```
```
$ chmod g+rwX ./.cache/rhaii
```

Add the HF_TOKEN Hugging Face token to the private.env file.

$ echo "export HF_TOKEN=<huggingface_token>" > private.env

Append the HF_HOME variable to the private.env file.

$ echo "export HF_HOME=./.cache/rhaii" >> private.env

Source the private.env file.

$ source private.env
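
To confirm that private.env contains both variables without echoing the token, a small check can help. The following is a minimal Python sketch, not part of the Red Hat tooling: `parse_env_text` and `missing_keys` are hypothetical helper names, and the parser only handles simple `export KEY=value` lines like the ones written in the previous steps.

```python
# Hypothetical helpers for checking a private.env file; not a Red Hat tool.
# Only simple `export KEY=value` or `KEY=value` lines are understood.

REQUIRED_KEYS = ("HF_TOKEN", "HF_HOME")


def parse_env_text(text):
    """Return a dict of KEY -> value parsed from env-file text."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Drop a leading `export ` if present.
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # Not an assignment; ignore it.
        # Strip one pair of matching surrounding quotes, if any.
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

For example, passing the contents of private.env through `parse_env_text` and then `missing_keys` yields an empty list when both HF_TOKEN and HF_HOME are set.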

Start the AI Inference Server container:

$ podman run --rm -it \
  --name vllm-tpu \
  --network=host \
  --privileged \
  -v /dev/shm:/dev/shm \
  -e HF_TOKEN=$HF_TOKEN \
  -e HF_HUB_OFFLINE=0 \
  -v ./.cache/rhaii:/opt/app-root/src/.cache \
  registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.2 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len=256 \
  --host=0.0.0.0 \
  --port=8000

Where:

--tensor-parallel-size 1: Specifies the number of TPUs to use for tensor parallelism. Set this value to match the number of available TPUs.
--max-model-len=256: Specifies the maximum model context length. For optimal performance, set this value as low as your workload allows.
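
Because the context length in vLLM covers both the prompt and the generated tokens, a low --max-model-len also limits how large a single request can be. A rough Python sketch of that arithmetic follows. `fits_context` and `output_budget` are hypothetical helpers, and the prompt token count is assumed to be known already (for example, from the model's tokenizer), since this sketch does not tokenize anything.

```python
# Hypothetical helpers: check whether a request fits the serving context window.
# Assumes the length set by --max-model-len covers prompt plus output tokens.


def fits_context(prompt_tokens, max_tokens, max_model_len=256):
    """Return True if prompt plus requested output tokens fit the context length."""
    if prompt_tokens < 0 or max_tokens < 0 or max_model_len <= 0:
        raise ValueError("token counts must be non-negative and max_model_len positive")
    return prompt_tokens + max_tokens <= max_model_len


def output_budget(prompt_tokens, max_model_len=256):
    """Return how many tokens remain for generation after the prompt."""
    return max(0, max_model_len - prompt_tokens)
```

With --max-model-len=256 and a request that sets max_tokens to 50, the prompt can use at most 206 tokens. If your prompts are longer than that, raise --max-model-len accordingly.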

Verification

Check that the AI Inference Server is running. Open a separate tab in your terminal, and make a model request with the API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Briefly, what colour is the wind?"}
    ],
    "max_tokens": 50
  }' | jq

The model returns a valid JSON response answering your question.
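
The same request can also be sent from Python using only the standard library. This is a minimal sketch, assuming the server listens on http://localhost:8000 as configured in the procedure and returns an OpenAI-style chat completion body. `build_request`, `extract_reply`, and `ask` are hypothetical helper names, not part of any Red Hat or vLLM client.

```python
import json
import urllib.request

# Assumed from the procedure: host port 8000 and the Qwen model name.
BASE_URL = "http://localhost:8000"
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"


def build_request(question, model=MODEL, max_tokens=50, base_url=BASE_URL):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(body):
    """Return the assistant text from an OpenAI-style chat completion body."""
    return body["choices"][0]["message"]["content"]


def ask(question, timeout=60):
    """Send the question to the running server and return the reply text."""
    with urllib.request.urlopen(build_request(question), timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

Calling `ask("Briefly, what colour is the wind?")` while the container is running should return the model's answer as a string. Keeping `build_request` and `extract_reply` separate from the network call makes them easy to check without a live server.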
