Chapter 5. Serving and inferencing language models with Podman using Google TPU AI accelerators

Serve and run inference on a large language model with Podman or Docker and Red Hat AI Inference Server on a Google Cloud virtual machine that has Google TPU AI accelerators available.

Prerequisites

- You have access to a Google Cloud TPU virtual machine with Google TPU AI accelerators configured. For more information, see:
  - Set up the Cloud TPU environment
  - vLLM inference on v6e TPUs
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.

Note

For more information about the vLLM quantization schemes that are supported on accelerators, see Supported hardware.

Procedure

Open a terminal on your TPU server host, and log in to registry.redhat.io:
```
podman login registry.redhat.io
```
Pull the Red Hat AI Inference Server image by running the following command:
```
podman pull registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.2
```

Optional: Verify that the TPUs are available on the host.

Open a shell prompt in the Red Hat AI Inference Server container by running the following command:

podman run -it --net=host --privileged -e PJRT_DEVICE=TPU --rm --entrypoint /bin/bash registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.2

At the container shell prompt, run the following Python code to verify system TPU access and basic operations:

python3 -c "
import torch
import torch_xla
try:
    # Get the default XLA device (the TPU when PJRT_DEVICE=TPU)
    device = torch_xla.device()
    print(f'XLA device available: {device}')
    # Run a small matrix multiplication on the device
    x = torch.randn(3, 3).to(device)
    y = torch.randn(3, 3).to(device)
    z = torch.matmul(x, y)
    # Force execution of the pending XLA operations
    torch_xla.sync()
    print(f'Matrix multiplication successful')
    print(f'Result tensor shape: {z.shape}')
    print(f'Result tensor device: {z.device}')
    print(f'Result tensor: {z.data}')
    print('TPU is operational.')
except Exception as e:
    print(f'TPU test failed: {e}')
    print('Try restarting the container to clear TPU locks')
"

Example output

XLA device available: xla:0
Matrix multiplication successful
Result tensor shape: torch.Size([3, 3])
Result tensor device: xla:0
Result tensor: tensor([[-1.8161,  1.6359, -3.1301],
        [-1.2205,  0.8985, -1.4422],
        [ 0.0588,  0.7693, -1.5683]], device='xla:0')
TPU is operational.

Exit the shell prompt:
```
exit
```

Create a volume to mount into the container, and adjust its permissions so that the container can use it:
```
mkdir -p ./.cache/rhaiis
```
```
chmod g+rwX ./.cache/rhaiis
```
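The directory preparation above can also be sketched in Python, for example when the setup is scripted. This is a minimal sketch, not part of the product: the helper name `prepare_cache_dir` is hypothetical, and it assumes the target is a directory, so the `X` in `g+rwX` is treated as a plain group execute bit.

```python
import os
import stat


def prepare_cache_dir(path: str) -> int:
    """Create path (including parents) and add group read/write/execute bits.

    Returns the resulting permission bits of the directory.
    """
    os.makedirs(path, exist_ok=True)
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Equivalent of chmod g+rwX for a directory: add rwx for the group
    mode |= stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP
    os.chmod(path, mode)
    return stat.S_IMODE(os.stat(path).st_mode)
```

Calling `prepare_cache_dir("./.cache/rhaiis")` from the directory where you run the procedure prepares the same path that is later mounted into the container.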
Add your Hugging Face token to the private.env file as the HF_TOKEN variable:
```
echo "export HF_TOKEN=<huggingface_token>" > private.env
```
Append the HF_HOME variable to the private.env file:
```
echo "export HF_HOME=./.cache/rhaiis" >> private.env
```
Source the private.env file to load the variables into the current shell:
```
source private.env
```
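Because private.env uses `export`, programs started from the same shell inherit both variables. The following minimal sketch checks that they are set; the helper name `missing_vars` is hypothetical and only illustrates the check.

```python
import os

# The two variables that this procedure writes to private.env
REQUIRED_VARS = ("HF_TOKEN", "HF_HOME")


def missing_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_vars()
print("Missing variables:", ", ".join(missing) if missing else "none")
```

If the script reports a missing variable, source private.env again in the shell that you use to start the container.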

Start the AI Inference Server container image:

podman run --rm -it \
  --name vllm-tpu \
  --network=host \
  --privileged \
  --shm-size=4g \
  --device=/dev/vfio/vfio \
  --device=/dev/vfio/0 \
  -e PJRT_DEVICE=TPU \
  -e HF_HUB_OFFLINE=0 \
  -v ./.cache/rhaiis:/opt/app-root/src/.cache \
  registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.2 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len=256 \
  --host=0.0.0.0 \
  --port=8000

- Set --tensor-parallel-size to match the number of TPUs.
- For the best performance, set --max-model-len as low as your workload allows.
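When choosing a value for --max-model-len, keep in mind that in vLLM it is the total context length, so the prompt tokens plus the requested completion tokens must fit within it. The following is a minimal sketch of that arithmetic; the helper name `fits_context` is hypothetical and not part of vLLM or AI Inference Server.

```python
def fits_context(prompt_tokens: int, max_tokens: int, max_model_len: int) -> bool:
    """Return True if the prompt plus the requested completion fits the context."""
    return prompt_tokens + max_tokens <= max_model_len


# Values from this procedure: --max-model-len=256 and "max_tokens": 50,
# with the 38-token prompt reported in the verification example.
print(fits_context(prompt_tokens=38, max_tokens=50, max_model_len=256))   # True: 88 <= 256
# A hypothetical longer prompt that would not fit with the same settings
print(fits_context(prompt_tokens=230, max_tokens=50, max_model_len=256))  # False: 280 > 256
```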

Verification

Verify that AI Inference Server is running. Open a separate tab in your terminal, and make a model request with the API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Briefly, what colour is the wind?"}
    ],
    "max_tokens": 50
  }' | jq

Example output

{
  "id": "chatcmpl-13a9d6a04fd245409eb601688d6144c1",
  "object": "chat.completion",
  "created": 1755268559,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The wind is typically associated with the color white or grey, as it can carry dust, sand, or other particles. However, it is not a color in the traditional sense.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 38,
    "total_tokens": 75,
    "completion_tokens": 37,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
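The same request can be made from Python with only the standard library. This is a minimal sketch rather than an official client: the helper names `build_chat_request` and `summarize` are hypothetical, and the URL, model name, and field names are taken from the curl command and example output above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port from the podman run command


def build_chat_request(prompt: str,
                       model: str = "Qwen/Qwen2.5-1.5B-Instruct",
                       max_tokens: int = 50) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response: dict) -> tuple:
    """Return the assistant text and the total token count from a response body."""
    choice = response["choices"][0]
    return choice["message"]["content"], response["usage"]["total_tokens"]


# To send the request against a running server:
#   with urllib.request.urlopen(build_chat_request("Briefly, what colour is the wind?")) as resp:
#       text, total_tokens = summarize(json.load(resp))
```

In the example output, `total_tokens` (75) is the sum of `prompt_tokens` (38) and `completion_tokens` (37), which is the figure to compare against --max-model-len.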
