Chapter 6. Serving and inferencing language models with Podman using Google TPU AI accelerators

Serve and inference a large language model with Podman or Docker and Red Hat AI Inference Server in a Google Cloud VM that has Google TPU AI accelerators available.

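Prerequisites

You have access to a Google Cloud TPU VM with Google TPU AI accelerators configured. For more information, see:
- Set up the Cloud TPU environment
- vLLM inference on v6e TPUs
You have installed Podman or Docker.
You are logged in as a user with sudo access.
You have access to the registry.redhat.io image registry and have logged in.
You have a Hugging Face account and have generated a Hugging Face access token.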

Note

For more information about the vLLM quantization schemes that accelerators support, see Supported hardware.

Procedure

Open a terminal on your TPU server host, and log in to registry.redhat.io:
```
podman login registry.redhat.io
```

Pull the Red Hat AI Inference Server image by running the following command:

podman pull registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.5


Optional: Verify that the TPUs are available in the host.

Open a shell prompt in the Red Hat AI Inference Server container. Run the following command:

podman run -it --net=host --privileged -e PJRT_DEVICE=TPU --rm --entrypoint /bin/bash registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.5


Verify system TPU access and basic operations by running the following Python code in the container shell prompt:

python3 -c "
import torch
import torch_xla
try:
    device = torch_xla.device()
    print(f'')
    print(f'XLA device available: {device}')
    x = torch.randn(3, 3).to(device)
    y = torch.randn(3, 3).to(device)
    z = torch.matmul(x, y)
    # Flush the pending lazy XLA operations to the device
    torch_xla.sync()
    print(f'Matrix multiplication successful')
    print(f'Result tensor shape: {z.shape}')
    print(f'Result tensor device: {z.device}')
    print(f'Result tensor: {z.data}')
    print('TPU is operational.')
except Exception as e:
    print(f'TPU test failed: {e}')
    print('Try restarting the container to clear TPU locks')
"


Example output

XLA device available: xla:0
Matrix multiplication successful
Result tensor shape: torch.Size([3, 3])
Result tensor device: xla:0
Result tensor: tensor([[-1.8161,  1.6359, -3.1301],
        [-1.2205,  0.8985, -1.4422],
        [ 0.0588,  0.7693, -1.5683]], device='xla:0')
TPU is operational.


Exit the shell prompt.
```
exit
```

Create a directory to mount into the container as a volume. Adjust the directory permissions so that the container can use it.
```
mkdir -p ./.cache/rhaiis
```
```
chmod g+rwX ./.cache/rhaiis
```

Add the HF_TOKEN Hugging Face token to the private.env file.

echo "export HF_TOKEN=<huggingface_token>" > private.env


Append the HF_HOME variable to the private.env file.
```
echo "export HF_HOME=./.cache/rhaiis" >> private.env
```

Source the private.env file.
```
source private.env
```
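
Before starting the container, it can help to confirm that private.env holds both variables. The following is a minimal sketch rather than part of the product: the helper names `parse_env_file` and `missing_vars` are hypothetical, and the parser assumes the file contains only simple `export KEY=VALUE` lines like those written by the echo commands above.

```python
def parse_env_file(text):
    """Parse lines of the form 'export KEY=VALUE' into a dict."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that are not assignments
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(values, required=("HF_TOKEN", "HF_HOME")):
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not values.get(name)]


# Sample content with a placeholder token, mirroring the two echo commands.
sample = "export HF_TOKEN=<huggingface_token>\nexport HF_HOME=./.cache/rhaiis\n"
print(missing_vars(parse_env_file(sample)))
```

To check the real file, pass the result of reading private.env to `parse_env_file` instead of the sample string.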

Start the AI Inference Server container image:

podman run --rm -it \
  --name vllm-tpu \
  --network=host \
  --privileged \
  --shm-size=4g \
  --device=/dev/vfio/vfio \
  --device=/dev/vfio/0 \
  -e PJRT_DEVICE=TPU \
  -e HF_HUB_OFFLINE=0 \
  -v ./.cache/rhaiis:/opt/app-root/src/.cache \
  registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.5 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len=256 \
  --host=0.0.0.0 \
  --port=8000


- --tensor-parallel-size: Set this value to match the number of TPUs.
- --max-model-len: For the best performance, set this value as low as your workload allows.
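
How low --max-model-len can go depends on the requests you expect, because the prompt tokens and the generated tokens of a request together cannot exceed the configured length. The following is a minimal arithmetic sketch; the helper name `fits_context` is hypothetical, and the token counts are illustrative assumptions rather than measured values. A request that does not fit is expected to be rejected or cut short by the server.

```python
def fits_context(prompt_tokens, max_tokens, max_model_len):
    """Return True if the prompt plus the requested completion fits the context length."""
    return prompt_tokens + max_tokens <= max_model_len


MAX_MODEL_LEN = 256  # value passed as --max-model-len in the command above

# A short chat prompt (assumed 38 tokens) asking for at most 50 new tokens: 38 + 50 = 88
print(fits_context(38, 50, MAX_MODEL_LEN))

# The same prompt asking for 300 new tokens: 38 + 300 = 338
print(fits_context(38, 300, MAX_MODEL_LEN))
```

If typical requests come close to the limit, raise --max-model-len accordingly instead of keeping it at the minimal value used here.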

Verification

Check that the AI Inference Server server is up. Open a separate tab in your terminal, and make a model request with the API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Briefly, what colour is the wind?"}
    ],
    "max_tokens": 50
  }' | jq


Example output

{
  "id": "chatcmpl-13a9d6a04fd245409eb601688d6144c1",
  "object": "chat.completion",
  "created": 1755268559,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The wind is typically associated with the color white or grey, as it can carry dust, sand, or other particles. However, it is not a color in the traditional sense.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 38,
    "total_tokens": 75,
    "completion_tokens": 37,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
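
The same request can also be sent from Python, which is convenient when the reply needs to be processed further. The following is a minimal sketch that uses only the standard library; the function names `build_payload`, `extract_reply`, and `ask` are hypothetical, and the sketch assumes the server is reachable at http://localhost:8000 as in the curl command above.

```python
import json
import urllib.request


def build_payload(model, prompt, max_tokens=50):
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response):
    """Return the assistant text and the total token count from a response dict."""
    choice = response["choices"][0]
    return choice["message"]["content"], response["usage"]["total_tokens"]


def ask(prompt, model="Qwen/Qwen2.5-1.5B-Instruct",
        base_url="http://localhost:8000", timeout=60):
    """Send one chat request to the server and return (text, total_tokens)."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return extract_reply(json.load(reply))


# Example call, to be run only while the server is up:
# text, total_tokens = ask("Briefly, what colour is the wind?")
```

The payload and parsing helpers are kept separate from the network call so that they can be checked without a running server.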
