Chapter 7. Inference serving with Podman on IBM Power with IBM Spyre AI accelerators

Serve and run inference on a large language model by using Podman and Red Hat AI Inference Server on IBM Power with IBM Spyre AI accelerators.

Prerequisites

You have access to an IBM Power 11 server that runs RHEL 9.6 and has IBM Spyre for Power AI accelerators installed.
You are logged in as a user with sudo access.
You have installed Podman.
You have access to registry.redhat.io and have logged in.
You have installed the Service Report tool. See IBM Power Systems service and productivity tools.
You have created a sentient security group and added your Spyre user to the group.
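
Some of these prerequisites can be checked from a shell before you start. The following is a minimal sketch, not an official check: the `has_group` helper is a hypothetical name introduced here, and the script only reports whether the current user is in the sentient group and whether the podman command is on the PATH.

```shell
# Succeed if group name $2 appears in the space-separated group list $1.
# has_group is a hypothetical helper, not part of any Red Hat or IBM tool.
has_group() {
  for hg_name in $1; do
    if [ "$hg_name" = "$2" ]; then
      return 0
    fi
  done
  return 1
}

# id -nG prints the names of all groups the current user belongs to.
if has_group "$(id -nG)" sentient; then
  echo "current user is in the sentient group"
else
  echo "current user is NOT in the sentient group"
fi

# command -v only confirms that podman is installed and on the PATH.
if command -v podman >/dev/null 2>&1; then
  echo "podman found"
else
  echo "podman not found"
fi
```

If the group check fails right after you add the user to the group, log in again so that the new membership takes effect.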

Procedure

Open a terminal on your server host and log in to registry.redhat.io:
```
podman login registry.redhat.io
```

Run the servicereport command to verify your IBM Spyre hardware:
```
servicereport -r -p spyre
```

Example output

```
servicereport 2.2.5

Spyre configuration checks                          PASS

  VFIO Driver configuration                         PASS
  User memlock configuration                        PASS
  sos config                                        PASS
  sos package                                       PASS
  VFIO udev rules configuration                     PASS
  User group configuration                          PASS
  VFIO device permission                            PASS
  VFIO kernel module loaded                         PASS
  VFIO module dep configuration                     PASS

Memlock limit is set for the sentient group.
Spyre user must be in the sentient group.
To add run below command:
        sudo usermod -aG sentient <user>
        Example:
        sudo usermod -aG sentient abc
        Re-login as <user>.
```
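
When this check runs from a script, it can help to fail early if any line reports something other than PASS. The sketch below assumes the report keeps the layout shown in the example output, with the status as the last upper-case word on each check line. The `failed_spyre_checks` helper is a hypothetical name, and the sample text is an excerpt of the output above rather than a live report.

```shell
# Print every check line whose status column is an upper-case word other
# than PASS, and exit non-zero if at least one such line was found.
failed_spyre_checks() {
  awk '$NF ~ /^[A-Z]+$/ && $NF != "PASS" { print; bad = 1 } END { exit bad }'
}

# Excerpt of the example output, used here in place of a live report.
sample_report() {
cat <<'EOF'
servicereport 2.2.5

Spyre configuration checks                          PASS

  VFIO Driver configuration                         PASS
  User memlock configuration                        PASS
  VFIO kernel module loaded                         PASS

Memlock limit is set for the sentient group.
EOF
}

if sample_report | failed_spyre_checks; then
  echo "all Spyre checks passed"
else
  echo "some Spyre checks did not pass" >&2
fi
```

On a real host, the same filter would read from the command directly, for example `servicereport -r -p spyre | failed_spyre_checks`.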

Pull the Red Hat AI Inference Server image by running the following command:
```
podman pull registry.redhat.io/rhaiis/vllm-spyre:3.2.5
```
If your system has SELinux enabled, configure SELinux to allow device access:
```
sudo setsebool -P container_use_devices 1
```
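
To confirm that the boolean took effect, you can read it back with getsebool, which prints a line of the form `name --> on` or `name --> off`. The following sketch parses such a line. The `sebool_is_on` helper is a hypothetical name, and the `state` value is a sample string rather than output captured from a host.

```shell
# Succeed if a getsebool-style line ("name --> on") reports the boolean as on.
sebool_is_on() {
  case "$1" in
    *"--> on") return 0 ;;
    *) return 1 ;;
  esac
}

# On the host you would capture the real value instead:
#   state="$(getsebool container_use_devices)"
state="container_use_devices --> on"   # sample value for illustration

if sebool_is_on "$state"; then
  echo "containers may use host devices"
else
  echo "container_use_devices is off"
fi
```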

Use lspci -v to verify that the container can access the IBM Spyre AI accelerators on the host system:
```
podman run -it --rm --pull=newer \
    --security-opt=label=disable \
    --device=/dev/vfio \
    --group-add keep-groups \
    --entrypoint="lspci" \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5
```

Example output

```
0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
```

Create a volume to mount into the container, and adjust its permissions so that the container can use it:
```
mkdir -p ~/models && chmod g+rwX ~/models
```
Download the granite-3.3-8b-instruct model into the models/ folder. For more information, see Downloading models.

Gather the Spyre IDs for the VLLM_AIU_PCIE_IDS variable:
```
lspci
```

Example output

```
0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
```

Set the SPYRE_IDS variable:
```
SPYRE_IDS="0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0"
```
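
Instead of copying the IDs by hand, the list can be derived from the lspci output. The following is a minimal sketch that assumes every accelerator line contains the text "IBM Spyre Accelerator", as in the example output. The helper names `collect_spyre_ids` and `count_words` are hypothetical, and `sample_lspci` replays the example output so that the snippet runs without Spyre hardware.

```shell
# Replays the example lspci output; on a real host, call lspci instead.
sample_lspci() {
cat <<'EOF'
0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
EOF
}

# Keep the first field (the PCI address) of each Spyre line and join
# the addresses with single spaces on one output line.
collect_spyre_ids() {
  awk '/IBM Spyre Accelerator/ { printf "%s%s", sep, $1; sep = " " } END { print "" }'
}

# Count the whitespace-separated words in $1.
count_words() {
  set -- $1
  echo "$#"
}

SPYRE_IDS="$(sample_lspci | collect_spyre_ids)"
echo "SPYRE_IDS=${SPYRE_IDS}"
echo "cards=$(count_words "$SPYRE_IDS")"
```

On a host with the accelerators installed, `SPYRE_IDS="$(lspci | collect_spyre_ids)"` would produce the same kind of list. The card count can then be compared with the number of cards that you plan to give the container.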

Start the AI Inference Server container. For example, deploy the granite-3.3-8b-instruct model configured for entity extraction inference serving:
```
podman run \
    --device=/dev/vfio \
    -v $HOME/models:/models \
    -e AIU_PCIE_IDS="${SPYRE_IDS}" \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200G \
    --shm-size 64G \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
        --model /models/granite-3.3-8b-instruct \
        -tp 4 \
        --max-model-len 32768 \
        --max-num-seqs 32
```
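
The server may need some time to load the model before it accepts requests, so a script that sends requests right after startup can benefit from a readiness loop. The following is a generic retry sketch, not something this procedure prescribes. The `wait_until` helper is a hypothetical name, and the probe URL in the comment assumes the OpenAI-compatible `/v1/models` route is served on the published port 8000.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
  wu_attempts=$1
  wu_delay=$2
  shift 2
  wu_try=0
  while [ "$wu_try" -lt "$wu_attempts" ]; do
    if "$@"; then
      return 0
    fi
    wu_try=$((wu_try + 1))
    sleep "$wu_delay"
  done
  return 1
}

# Against a running server this might look like the following (not executed
# here; the host, attempt count, and delay are illustrative assumptions):
#   wait_until 60 5 curl -sf http://localhost:8000/v1/models
# A trivial probe is used below so that the sketch runs anywhere.
if wait_until 3 0 true; then
  echo "probe succeeded"
fi
```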

Verification

In a separate tab in your terminal, make a request to the model with the API:
```
curl -X POST -H "Content-Type: application/json" -d '{
    "model": "/models/granite-3.3-8b-instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq
```

Example output

```
{
    "id": "cmpl-b94beda1d5a4485c9cb9ed4a13072fca",
    "object": "text_completion",
    "created": 1746555421,
    "choices": [
        {
            "index": 0,
            "text": " Paris.\nThe capital of France is Paris.",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 8,
        "total_tokens": 18,
        "completion_tokens": 10,
        "prompt_tokens_details": null
    }
}
```
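
For scripted verification, jq can pull out individual fields instead of pretty-printing the whole response. The sketch below works on a trimmed copy of the example output stored in a shell variable, so it runs without a server. The variable names are illustrative, and the block skips itself when jq is not installed.

```shell
# Trimmed copy of the example response; a real script would capture curl output.
response='{"choices":[{"index":0,"text":" Paris.\nThe capital of France is Paris.","finish_reason":"stop"}],"usage":{"prompt_tokens":8,"total_tokens":18,"completion_tokens":10}}'

if command -v jq >/dev/null 2>&1; then
  # -r prints the raw string value without JSON quoting.
  finish_reason="$(printf '%s' "$response" | jq -r '.choices[0].finish_reason')"
  # Check that prompt and completion tokens add up to the reported total.
  tokens_consistent="$(printf '%s' "$response" | jq '.usage.total_tokens == .usage.prompt_tokens + .usage.completion_tokens')"
  # Print the generated text itself.
  printf '%s' "$response" | jq -r '.choices[0].text'
  echo "finish_reason=${finish_reason} tokens_consistent=${tokens_consistent}"
else
  echo "jq is not installed; skipping response checks" >&2
fi
```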

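7.1. Recommended model inference settings for IBM Power with IBM Spyre AI accelerators

The following are the recommended models and AI Inference Server inference serving settings for IBM Power systems with IBM Spyre AI accelerators.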

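Table 7.1. Recommended model and inference settings for entity extraction

| Model | Batch size | Max input context size | Max output context size | Number of cards per container |
|---|---|---|---|---|
| granite-3.3-8b-instruct | 16 | 3K | 3K | 1 |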

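Table 7.2. Recommended model and inference settings for RAG (retrieval-augmented generation) embedding

| Model | Batch size | Max input context size | Max output context size | Number of cards per container |
|---|---|---|---|---|
| granite-embedding-125m-english, granite-embedding-278m-multilingual | Up to 256 | 512 | Vector of size 768 | 1 |
| granite-embedding-30m-english, granite-embedding-107m-multilingual | Up to 256 | 512 | Vector of size 384 | 1 |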

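Table 7.3. Recommended settings for RAG inference serving

| Model | Batch size | Max input context size | Max output context size | Number of cards per container |
|---|---|---|---|---|
| granite-3.3-8b-instruct | 32 | 4K | 4K | 4 |
| granite-3.3-8b-instruct | 16 | 8K | 8K | 4 |
| granite-3.3-8b-instruct | 8 | 16K | 16K | 4 |
| granite-3.3-8b-instruct | 4 | 32K | 32K | 4 |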
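
One way to read Table 7.3 is as a trade-off: each time the context size doubles, the recommended batch size halves, so the product of the two stays at 128K. The following sketch encodes the four rows as a lookup and prints that product. The `rag_batch_size` helper is a hypothetical name, and the sketch does not cover how these values map to server options.

```shell
# Look up the Table 7.3 batch size for a context size given in K tokens.
rag_batch_size() {
  case "$1" in
    4)  echo 32 ;;
    8)  echo 16 ;;
    16) echo 8 ;;
    32) echo 4 ;;
    *)  echo "no Table 7.3 row for context size ${1}K" >&2; return 1 ;;
  esac
}

for ctx in 4 8 16 32; do
  batch="$(rag_batch_size "$ctx")"
  echo "context=${ctx}K batch=${batch} batch_x_context=$((batch * ctx))K"
done
```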
