Este conteúdo não está disponível no idioma selecionado.

Chapter 6. Serving and inferencing language models with Podman using Google TPU AI accelerators


Serve and inference a large language model with Podman or Docker and Red Hat AI Inference Server in a Google cloud VM that has Google TPU AI accelerators available.

Prerequisites

  • You have access to a Google cloud TPU VM with Google TPU AI accelerators configured. For more information, see:

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to the registry.redhat.io image registry and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

  1. Open a terminal on your TPU server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
    Copy to Clipboard Toggle word wrap
  2. Pull the Red Hat AI Inference Server image by running the following command:

    $ podman pull registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1
    Copy to Clipboard Toggle word wrap
  3. Optional: Verify that the TPUs are available in the host.

    1. Open a shell prompt in the Red Hat AI Inference Server container. Run the following command:

      $ podman run -it --net=host --privileged --rm --entrypoint /bin/bash registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1
      Copy to Clipboard Toggle word wrap
    2. Verify system TPU access and basic operations by running the following Python code in the container shell prompt:

      $ python3 -c "
      import jax
      import importlib.metadata
      try:
          devices = jax.devices()
          print(f'JAX devices available: {devices}')
          print(f'Number of TPU devices: {len(devices)}')
          tpu_version = importlib.metadata.version('tpu_inference')
          print(f'tpu-inference version: {tpu_version}')
          print('TPU is operational.')
      except Exception as e:
          print(f'TPU test failed: {e}')
          print('Check container image version and TPU device availability.')
      "
      Copy to Clipboard Toggle word wrap

      Example output:

      JAX devices available: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
      Number of TPU devices: 4
      tpu-inference version: 0.13.2
      TPU is operational.
      Copy to Clipboard Toggle word wrap
    3. Exit the shell prompt.

      $ exit
      Copy to Clipboard Toggle word wrap
  4. Create a volume and mount it into the container. Adjust the container permissions so that the container can use it.

    $ mkdir ./.cache/rhaii
    Copy to Clipboard Toggle word wrap
    $ chmod g+rwX ./.cache/rhaii
    Copy to Clipboard Toggle word wrap
  5. Add the HF_TOKEN Hugging Face token to the private.env file.

    $ echo "export HF_TOKEN=<huggingface_token>" > private.env
    Copy to Clipboard Toggle word wrap
  6. Append the HF_HOME variable to the private.env file.

    $ echo "export HF_HOME=./.cache/rhaii" >> private.env
    Copy to Clipboard Toggle word wrap

    Source the private.env file.

    $ source private.env
    Copy to Clipboard Toggle word wrap
  7. Start the AI Inference Server container image:

    $ podman run --rm -it \
      --name vllm-tpu \
      --network=host \
      --privileged \
      -v /dev/shm:/dev/shm \
      -e HF_TOKEN=$HF_TOKEN \
      -e HF_HUB_OFFLINE=0 \
      -v ./.cache/rhaii:/opt/app-root/src/.cache \
      registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1 \
      --model Qwen/Qwen2.5-1.5B-Instruct \
      --tensor-parallel-size 1 \
      --max-model-len=256 \
      --host=0.0.0.0 \
      --port=8000
    Copy to Clipboard Toggle word wrap

    Where:

    --tensor-parallel-size 1
    Specifies the number of TPUs to use for tensor parallelism. Set this value to match the number of available TPUs.
    --max-model-len=256
    Specifies the maximum model context length. For optimal performance, set this value as low as your workload allows.

Verification

Check that the AI Inference Server server is up. Open a separate tab in your terminal, and make a model request with the API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Briefly, what colour is the wind?"}
    ],
    "max_tokens": 50
  }' | jq
Copy to Clipboard Toggle word wrap

+ The model returns a valid JSON response answering your question.

Red Hat logoGithubredditYoutubeTwitter

Aprender

Experimente, compre e venda

Comunidades

Sobre a documentação da Red Hat

Ajudamos os usuários da Red Hat a inovar e atingir seus objetivos com nossos produtos e serviços com conteúdo em que podem confiar. Explore nossas atualizações recentes.

Tornando o open source mais inclusivo

A Red Hat está comprometida em substituir a linguagem problemática em nosso código, documentação e propriedades da web. Para mais detalhes veja o Blog da Red Hat.

Sobre a Red Hat

Fornecemos soluções robustas que facilitam o trabalho das empresas em plataformas e ambientes, desde o data center principal até a borda da rede.

Theme

© 2026 Red Hat
Voltar ao topo