Chapter 6. Serving and inferencing language models with Podman using Google TPU AI accelerators


Serve a large language model and run inference on it with Podman or Docker and Red Hat AI Inference Server in a Google Cloud VM that has Google TPU AI accelerators available.

Prerequisites

  • You have access to a Google Cloud TPU VM with Google TPU AI accelerators configured. For more information, see:

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to the registry.redhat.io image registry and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.

Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

  1. Open a terminal on your TPU server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Pull the Red Hat AI Inference Server image by running the following command:

    $ podman pull registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.2
  3. Optional: Verify that the TPUs are available on the host.

    1. Open a shell prompt in the Red Hat AI Inference Server container by running the following command:

      $ podman run -it --net=host --privileged --rm --entrypoint /bin/bash registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.2
    2. Verify system TPU access and basic operations by running the following Python code in the container shell prompt:

      $ python3 -c "
      import jax
      import importlib.metadata
      try:
          devices = jax.devices()
          print(f'JAX devices available: {devices}')
          print(f'Number of TPU devices: {len(devices)}')
          tpu_version = importlib.metadata.version('tpu_inference')
          print(f'tpu-inference version: {tpu_version}')
          print('TPU is operational.')
      except Exception as e:
          print(f'TPU test failed: {e}')
          print('Check container image version and TPU device availability.')
      "

      Example output:

      JAX devices available: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
      Number of TPU devices: 4
      tpu-inference version: 0.13.2
      TPU is operational.
    3. Exit the shell prompt.

      $ exit
  4. Create a cache directory to mount into the container as a volume. Adjust the directory permissions so that the container can use it.

    $ mkdir -p ./.cache/rhaii
    $ chmod g+rwX ./.cache/rhaii
  5. Add the HF_TOKEN Hugging Face token to the private.env file.

    $ echo "export HF_TOKEN=<huggingface_token>" > private.env
  6. Append the HF_HOME variable to the private.env file.

    $ echo "export HF_HOME=./.cache/rhaii" >> private.env

    Source the private.env file.

    $ source private.env
  7. Start the AI Inference Server container image:

    $ podman run --rm -it \
      --name vllm-tpu \
      --network=host \
      --privileged \
      -v /dev/shm:/dev/shm \
      -e HF_TOKEN=$HF_TOKEN \
      -e HF_HUB_OFFLINE=0 \
      -v ./.cache/rhaii:/opt/app-root/src/.cache \
      registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.2 \
      --model Qwen/Qwen2.5-1.5B-Instruct \
      --tensor-parallel-size 1 \
      --max-model-len=256 \
      --host=0.0.0.0 \
      --port=8000

    Where:

    --tensor-parallel-size 1
    Specifies the number of TPUs to use for tensor parallelism. Set this value to match the number of available TPUs.
    --max-model-len=256
    Specifies the maximum model context length in tokens, counting both the prompt and the generated output. For optimal performance, set this value as low as your workload allows.
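
As a worked example of how these two values might be chosen, the following Python sketch derives them from a device count and a token budget. The helper name `serving_flags` and its parameters are hypothetical and are not part of vLLM or Red Hat AI Inference Server. The sketch assumes that the context length has to cover the prompt and the generated tokens together.

```python
def serving_flags(num_tpus, max_prompt_tokens, max_output_tokens):
    """Derive the two tunable flags from the hardware and the workload.

    Hypothetical helper for illustration only.
    """
    if num_tpus < 1:
        raise ValueError("at least one TPU device is required")
    if max_prompt_tokens < 1 or max_output_tokens < 1:
        raise ValueError("token counts must be positive")
    # The context window has to hold the prompt and the generated tokens together.
    max_model_len = max_prompt_tokens + max_output_tokens
    return [
        # The procedure advises matching this value to the number of available TPUs.
        "--tensor-parallel-size", str(num_tpus),
        f"--max-model-len={max_model_len}",
    ]


# One TPU, up to 206 prompt tokens and 50 generated tokens: 206 + 50 = 256.
print(serving_flags(1, 206, 50))
```

With one TPU and a budget of 206 prompt tokens plus 50 output tokens, the helper yields the values used in the command above. Chat templates add tokens to each prompt, so leave some headroom. The sketch does not check whether the model architecture supports the chosen tensor-parallel size.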

Verification

Check that the AI Inference Server is running. Open a separate terminal tab, and send a request to the model through the API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Briefly, what colour is the wind?"}
    ],
    "max_tokens": 50
  }' | jq

The model returns a valid JSON response answering your question.
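
The same request can also be sent from Python using only the standard library. The following is a minimal sketch, not an official client. The function names `build_chat_request`, `extract_reply`, and `ask` are hypothetical, and the sketch assumes that the server returns an OpenAI-style chat completion body with the reply under `choices[0].message.content`.

```python
import json
import urllib.request

# Values copied from the procedure; change them if you served a different model or port.
BASE_URL = "http://localhost:8000"
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"


def build_chat_request(prompt, max_tokens=50, base_url=BASE_URL, model=MODEL):
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style chat completion body (assumed shape)."""
    return response_body["choices"][0]["message"]["content"]


def ask(prompt, timeout=60):
    """Send the prompt to the running server and return the reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With the container from the procedure running, `print(ask("Briefly, what colour is the wind?"))` prints only the reply text. If the connection is refused, the server is probably still starting or loading the model; wait and retry.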
