Chapter 8. Inference with Podman on IBM Z with IBM Spyre AI accelerators

Serve and run inference on a large language model with Podman and Red Hat AI Inference Server running on IBM Z with IBM Spyre AI accelerators.

Prerequisites

You have access to an IBM Z (s390x) server running RHEL 9.6 with IBM Spyre for Z AI accelerators installed.
You are logged in as a user with sudo access.
You have installed Podman.
You have access to registry.redhat.io and have logged in.
You have a Hugging Face account and have generated a Hugging Face access token.

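Note

IBM Spyre AI accelerator cards support model weights in FP16 format only. For compatible models, the Red Hat AI Inference Server inference engine automatically converts the weights to FP16 at startup. No additional configuration is needed.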
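
As a rough illustration of what FP16 weights imply for memory, the following sketch multiplies a parameter count by 2 bytes per weight. The figure of 8 billion parameters is an assumption read off the "8b" in the granite-3.3-8b-instruct model name used later in this chapter, not a measured value, and the result covers the weights only.

```shell
# Rough FP16 weight footprint: parameters x 2 bytes per weight.
params_billion=8      # assumption: approximate count implied by "8b" in the model name
bytes_per_weight=2    # FP16 stores each weight in 16 bits
weights_gb=$((params_billion * bytes_per_weight))
echo "Approximate FP16 weight size: ${weights_gb} GB"
```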

Procedure

Open a terminal on your server host and log in to registry.redhat.io:
```
podman login registry.redhat.io
```
Pull the Red Hat AI Inference Server image by running the following command:
```
podman pull registry.redhat.io/rhaiis/vllm-spyre:3.2.5
```
If your system has SELinux enabled, configure SELinux to allow device access:
```
sudo setsebool -P container_use_devices 1
```
Use lspci to verify that the container can access the IBM Spyre AI accelerators on the host system:

```
podman run -it --rm --pull=newer \
    --security-opt=label=disable \
    --device=/dev/vfio \
    --group-add keep-groups \
    --entrypoint="lspci" \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5
```

Example output

```
0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
```
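
To check the result without reading the list by eye, one option is to count the lines that name a Spyre accelerator. The following is a minimal sketch; the helper name count_spyre is hypothetical, and the sample text is copied from the example output above so that the snippet runs without the hardware.

```shell
# count_spyre: read lspci-style text on stdin and print how many lines
# mention an IBM Spyre accelerator. (Hypothetical helper name.)
count_spyre() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # function usable in scripts that run under "set -e".
  grep -c 'IBM Spyre Accelerator' || true
}

# Sample input copied from the example output in this procedure.
sample='0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)'

printf '%s\n' "$sample" | count_spyre   # prints 4
```

On the host, the output of the podman run command shown earlier could be piped into count_spyre in the same way.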

Create a volume to mount into the container, and adjust its permissions so that the container can use it:
```
mkdir -p ~/models && chmod g+rwX ~/models
```
Download the granite-3.3-8b-instruct model to the models/ folder. For more information, see Downloading models.

Gather the IOMMU group IDs for the available Spyre devices:

```
lspci
```

Example output

```
0000:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)
0001:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)
0002:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev ff)
0003:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)
```

Each line begins with the PCI device address, for example, 0000:00:00.0.

Use the PCI address to determine the IOMMU group ID for the required Spyre card. For example:
```
readlink /sys/bus/pci/devices/<PCI_ADDRESS>/iommu_group
```
Example output
```
../../../kernel/iommu_groups/0
```
The IOMMU group ID (0) is the trailing number in the readlink output.
Repeat for each required Spyre card.
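
Because the group ID is the last path component of the symlink target, it can be extracted with shell parameter expansion instead of being read off manually. The following is a minimal sketch; the function name iommu_group_from_link is hypothetical.

```shell
# iommu_group_from_link: print the trailing path component of an
# iommu_group symlink target. (Hypothetical helper name.)
iommu_group_from_link() {
  target="$1"
  # "${target##*/}" removes everything up to and including the last "/".
  printf '%s\n' "${target##*/}"
}

# Target string copied from the example readlink output.
iommu_group_from_link "../../../kernel/iommu_groups/0"   # prints 0

# On the host, the helper could wrap readlink directly, for example:
#   iommu_group_from_link "$(readlink /sys/bus/pci/devices/<PCI_ADDRESS>/iommu_group)"
```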
Use the readlink output to set an IOMMU_GROUP_ID variable for each required Spyre card. For example:
```
IOMMU_GROUP_ID0=0
IOMMU_GROUP_ID1=1
IOMMU_GROUP_ID2=2
IOMMU_GROUP_ID3=3
```
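
When the number of cards varies, the --device arguments can be generated from a list of group IDs rather than written out one variable at a time. The following is a minimal sketch, assuming the IDs are plain numbers as in the example; the function name vfio_device_args is hypothetical.

```shell
# vfio_device_args: print the --device options for the shared VFIO
# container device plus one option per IOMMU group ID given as an
# argument. (Hypothetical helper name.)
vfio_device_args() {
  args="--device /dev/vfio/vfio"
  for id in "$@"; do
    args="$args --device /dev/vfio/$id:/dev/vfio/$id"
  done
  printf '%s\n' "$args"
}

# Example with the four group IDs used in this procedure.
vfio_device_args 0 1 2 3
```

The output could then be spliced into a podman run command with an unquoted command substitution such as $(vfio_device_args 0 1 2 3), which relies on word splitting and is therefore only safe for simple numeric IDs.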
Start the AI Inference Server container, passing in the IOMMU group ID variables for the required Spyre devices. For example, deploy the granite-3.3-8b-instruct model configured for entity extraction across 4 Spyre devices:

```
podman run \
  --device /dev/vfio/vfio \
  --device /dev/vfio/${IOMMU_GROUP_ID0}:/dev/vfio/${IOMMU_GROUP_ID0} \
  --device /dev/vfio/${IOMMU_GROUP_ID1}:/dev/vfio/${IOMMU_GROUP_ID1} \
  --device /dev/vfio/${IOMMU_GROUP_ID2}:/dev/vfio/${IOMMU_GROUP_ID2} \
  --device /dev/vfio/${IOMMU_GROUP_ID3}:/dev/vfio/${IOMMU_GROUP_ID3} \
  -v $HOME/models:/models:Z \
  --pids-limit 0 \
  --userns=keep-id \
  --group-add=keep-groups \
  --memory 200G \
  --shm-size 64G \
  -p 8000:8000 \
  registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
    --model /models/granite-3.3-8b-instruct \
    -tp 4 \
    --max-model-len 32768 \
    --max-num-seqs 32
```

Verification

In a separate tab in your terminal, make a request to the model with the API:

```
curl -X POST -H "Content-Type: application/json" -d '{
    "model": "/models/granite-3.3-8b-instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq
```

Example output

```
{
  "id": "cmpl-7c81cd00ccd04237ac8b5119e86b32a5",
  "object": "text_completion",
  "created": 1764665204,
  "model": "/models/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\nThe answer is Paris. Paris is the capital and most populous city of France, located in the northern part of the country. It is renowned for its history, culture, fashion, and art, attracting",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 57,
    "completion_tokens": 50,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
```
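
In the example response, finish_reason is "length" because generation stopped at the requested max_tokens of 50, and total_tokens (57) is the sum of prompt_tokens (7) and completion_tokens (50). The following sketch pulls out both facts with jq, which the curl command above already uses. The function name summarize_completion is hypothetical, and the sample is a trimmed copy of the example response.

```shell
# summarize_completion: read a /v1/completions JSON response on stdin and
# print the finish reason and whether the token counts add up.
# (Hypothetical helper name.)
summarize_completion() {
  jq -r '"finish_reason=\(.choices[0].finish_reason) tokens_consistent=\(.usage.prompt_tokens + .usage.completion_tokens == .usage.total_tokens)"'
}

# Trimmed copy of the example response shown in this section.
sample='{"choices":[{"index":0,"text":"\nThe answer is Paris.","finish_reason":"length"}],"usage":{"prompt_tokens":7,"total_tokens":57,"completion_tokens":50}}'

# Run the demonstration only where jq is installed.
if command -v jq >/dev/null 2>&1; then
  printf '%s\n' "$sample" | summarize_completion
fi
```

Against a running server, the output of the curl command could be piped into summarize_completion in place of the final jq call.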
