Chapter 7. Inference serving with Podman on IBM Power with IBM Spyre AI accelerators

Serve and run inference on a large language model with Podman and Red Hat AI Inference Server on IBM Power with IBM Spyre AI accelerators.

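Prerequisites

You have access to an IBM Power 11 server running RHEL 9.6 with IBM Spyre for Power AI accelerators installed.
You are logged in as a user with sudo access.
You have installed Podman.
You have access to registry.redhat.io and have logged in.
You have installed the Service Report tool. See IBM Power Systems service and productivity tools.
You have created a sentient security group and added your Spyre user to the group.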

Procedure

Open a terminal on your server host, and log in to registry.redhat.io:
```
podman login registry.redhat.io
```

Run the servicereport command to verify the IBM Spyre hardware:

servicereport -r -p spyre

Example output

servicereport 2.2.5

Spyre configuration checks                          PASS

  VFIO Driver configuration                         PASS
  User memlock configuration                        PASS
  sos config                                        PASS
  sos package                                       PASS
  VFIO udev rules configuration                     PASS
  User group configuration                          PASS
  VFIO device permission                            PASS
  VFIO kernel module loaded                         PASS
  VFIO module dep configuration                     PASS

Memlock limit is set for the sentient group.
Spyre user must be in the sentient group.
To add run below command:
        sudo usermod -aG sentient <user>
        Example:
        sudo usermod -aG sentient abc
        Re-login as <user>.
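
The group hint in the output above can be checked directly. The following is a minimal sketch, not part of the Service Report tool: `in_group_list` is a hypothetical helper that tests whether a space-separated group list, such as the one printed by `id -nG`, contains a given group name.

```shell
# Return success if the space-separated group list in $1 (as printed by
# `id -nG`) contains the group named in $2. Hypothetical helper.
in_group_list() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the user running this shell against the sentient group.
if in_group_list "$(id -nG)" sentient; then
    echo "current user is in the sentient group"
else
    echo "current user is NOT in the sentient group"
fi
```

Group changes made with usermod only take effect in a new login session, which is why the tool output asks you to log in again.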

Pull the Red Hat AI Inference Server image by running the following command:
```
podman pull registry.redhat.io/rhaiis/vllm-spyre:3.2.5
```
If your system has SELinux enabled, configure SELinux to allow device access:
```
sudo setsebool -P container_use_devices 1
```
Use lspci to verify that the container can access the host system IBM Spyre AI accelerators:

podman run -it --rm --pull=newer \
    --security-opt=label=disable \
    --device=/dev/vfio \
    --group-add keep-groups \
    --entrypoint="lspci" \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5

Example output

0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

Create a volume to mount into the container and adjust the permissions so that the container can use it:
```
mkdir -p ~/models && chmod g+rwX ~/models
```
Download the granite-3.3-8b-instruct model into the models/ folder. For more information, see Downloading models.

Collect the Spyre IDs for the VLLM_AIU_PCIE_IDS variable:

lspci

Example output

0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

Set the SPYRE_IDS variable:

SPYRE_IDS="0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0"
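
Instead of copying the IDs by hand, the list can be derived from the lspci output. This is a minimal sketch under stated assumptions: `spyre_ids_from_lspci` is a hypothetical helper, and it assumes each accelerator line contains the text "IBM Spyre Accelerator" with the PCI address in the first column, as in the example output. The sample text stands in for live lspci output so that the block runs anywhere.

```shell
# Print the PCI addresses of all IBM Spyre accelerators found in lspci
# output on stdin, joined by single spaces. Hypothetical helper.
spyre_ids_from_lspci() {
    awk '/IBM Spyre Accelerator/ {
             sep = (n++ ? " " : "")
             ids = ids sep $1
         }
         END { print ids }'
}

# Sample input copied from the example output in this procedure.
sample='0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)'

SPYRE_IDS="$(printf '%s\n' "$sample" | spyre_ids_from_lspci)"
echo "$SPYRE_IDS"
# prints: 0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0

# On the host, the same helper would read live output instead:
#   SPYRE_IDS="$(lspci | spyre_ids_from_lspci)"
```

Filtering on the device description keeps other PCI devices out of the list.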

Start the AI Inference Server container. For example, deploy the granite-3.3-8b-instruct model configured for entity extraction inference serving:

podman run \
    --device=/dev/vfio \
    -v $HOME/models:/models \
    -e AIU_PCIE_IDS="${SPYRE_IDS}" \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200G \
    --shm-size 64G \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
        --model /models/granite-3.3-8b-instruct \
        -tp 4 \
        --max-model-len 32768 \
        --max-num-seqs 32
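
The server typically needs some time to load the model before it answers API requests. The following is a minimal sketch of a retry loop that waits for it: `wait_until` is a hypothetical helper, and the `/health` path in the usage comment is an assumption based on the health endpoint that vLLM's OpenAI-compatible server normally exposes, so confirm it against your image.

```shell
# Retry a command until it succeeds. Hypothetical helper.
#   $1 = maximum number of attempts
#   $2 = seconds to sleep between attempts
#   remaining arguments = the command to run
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only if another attempt follows.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Hypothetical usage: poll every 10 seconds, for at most 60 attempts.
#   wait_until 60 10 curl -sf -o /dev/null http://<your_server_ip>:8000/health
```

The helper returns a non-zero status when every attempt fails, so it can gate later steps in a script.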

Verification

In a separate tab in your terminal, make a request to the model with the API.

curl -X POST -H "Content-Type: application/json" -d '{
    "model": "/models/granite-3.3-8b-instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

Example output

{
    "id": "cmpl-b94beda1d5a4485c9cb9ed4a13072fca",
    "object": "text_completion",
    "created": 1746555421,
    "choices": [
        {
            "index": 0,
            "text": " Paris.\nThe capital of France is Paris.",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 8,
        "total_tokens": 18,
        "completion_tokens": 10,
        "prompt_tokens_details": null
    }
}

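7.1. Recommended model inference settings for IBM Power with IBM Spyre AI accelerators

The following are the recommended model and AI Inference Server inference serving settings for IBM Power systems with IBM Spyre AI accelerators.

Table 7.1. Recommended model and inference settings for entity extraction
Model	Batch size	Max input context size	Max output context size	Number of cards per container
granite-3.3-8b-instruct	16	3K	3K	1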

Table 7.2. Recommended embedding models and inference settings for retrieval-augmented generation (RAG)
Model	Batch size	Max input context size	Max output context size	Number of cards per container
granite-embedding-125m-english, granite-embedding-278m-multilingual	Up to 256	512	Vector of size 768	1
granite-embedding-30m-english, granite-embedding-107m-multilingual	Up to 256	512	Vector of size 384	1

Table 7.3. Recommended RAG inference serving settings
Model	Batch size	Max input context size	Max output context size	Number of cards per container
granite-3.3-8b-instruct	32	4K	4K	4
	16	8K	8K	4
	8	16K	16K	4
	4	32K	32K	4
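
One way to read Table 7.3 is that each row trades batch size against context size while their product stays the same. The short sketch below checks that arithmetic for the four listed rows. It is an observation about these rows only, not a documented sizing rule, and `token_budget_k` is a hypothetical helper name.

```shell
# Rows of Table 7.3 as "batch_size context_size_in_K" pairs.
rows='32 4
16 8
8 16
4 32'

# Multiply batch size by context size (in K tokens) for one row.
token_budget_k() {
    echo $(( $1 * $2 ))
}

printf '%s\n' "$rows" | while read -r batch ctx_k; do
    echo "batch ${batch} x ${ctx_k}K = $(token_budget_k "$batch" "$ctx_k")K"
done
# prints:
#   batch 32 x 4K = 128K
#   batch 16 x 8K = 128K
#   batch 8 x 16K = 128K
#   batch 4 x 32K = 128K
```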
