Chapter 8. Inference serving with Podman on IBM Z with IBM Spyre AI accelerators


Serve and run inference on a large language model with Podman and Red Hat AI Inference Server running on IBM Z with IBM Spyre AI accelerators.

Prerequisites

  • You have access to an IBM Z (s390x) server running RHEL 9.6 with IBM Spyre for Z AI accelerators installed.
  • You are logged in as a user with sudo access.
  • You have installed Podman.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

IBM Spyre AI accelerator cards support FP16 format model weights only. For compatible models, the Red Hat AI Inference Server inference engine automatically converts weights to FP16 at startup. No additional configuration is needed.

Procedure

  1. Open a terminal on your server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Pull the Red Hat AI Inference Server image by running the following command:

    $ podman pull registry.redhat.io/rhaiis/vllm-spyre:3.2.5
  3. If your system has SELinux enabled, configure SELinux to allow device access:

    $ sudo setsebool -P container_use_devices 1
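
    You can optionally confirm that the boolean is enabled, for example:

    $ getsebool container_use_devices

    Example output

    container_use_devices --> on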
  4. Run lspci from inside the container to verify that the container can access the host system IBM Spyre AI accelerators:

    $ podman run -it --rm --pull=newer \
        --security-opt=label=disable \
        --device=/dev/vfio \
        --group-add keep-groups \
        --entrypoint="lspci" \
        registry.redhat.io/rhaiis/vllm-spyre:3.2.5

    Example output

    0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
    0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
    0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
    0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

  5. Create a directory to mount into the container as a volume, and adjust its permissions so that the container can use it:

    $ mkdir -p ~/models && chmod g+rwX ~/models
  6. Download the granite-3.3-8b-instruct model into the models/ folder. See Downloading models for more information.
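
    For example, one way to download the model is with the Hugging Face CLI. This is a sketch that assumes Python and pip are available on the host and that the model repository name is ibm-granite/granite-3.3-8b-instruct; adjust the repository name if your source differs:

    $ pip install -U "huggingface_hub[cli]"
    $ huggingface-cli login    # paste your Hugging Face access token when prompted
    $ huggingface-cli download ibm-granite/granite-3.3-8b-instruct \
        --local-dir ~/models/granite-3.3-8b-instruct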
  7. Gather the IOMMU group IDs for the available Spyre devices:

    $ lspci

    Example output

    0000:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)
    0001:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)
    0002:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev ff)
    0003:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)

    Each line begins with the PCI device address, for example, 0000:00:00.0.

  8. Use the PCI address to determine the IOMMU group ID for the required Spyre card, for example:

    $ readlink /sys/bus/pci/devices/<PCI_ADDRESS>/iommu_group

    Example output

    ../../../kernel/iommu_groups/0

    The IOMMU group ID (0) is the trailing number in the readlink output.

    Repeat for each required Spyre card.
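
    If you have several Spyre cards, a short shell loop can print the IOMMU group for each one. This is a sketch that assumes the device description strings shown in the lspci output above:

    $ for dev in $(lspci -D | grep -i "Spyre" | awk '{print $1}'); do
          echo "${dev} -> IOMMU group $(basename $(readlink /sys/bus/pci/devices/${dev}/iommu_group))"
      done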

  9. Set IOMMU_GROUP_ID variables for the required Spyre cards using the readlink output. For example:

    IOMMU_GROUP_ID0=0
    IOMMU_GROUP_ID1=1
    IOMMU_GROUP_ID2=2
    IOMMU_GROUP_ID3=3
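
    You can also set each variable directly from the readlink output rather than typing the group IDs by hand, for example, using the first PCI address from the earlier lspci output:

    $ IOMMU_GROUP_ID0=$(basename $(readlink /sys/bus/pci/devices/0000:00:00.0/iommu_group))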
  10. Start the AI Inference Server container, passing in the IOMMU group ID variables for the required Spyre devices. For example, deploy the granite-3.3-8b-instruct model configured for entity extraction across 4 Spyre devices:

    podman run \
      --device /dev/vfio/vfio \
      --device /dev/vfio/${IOMMU_GROUP_ID0}:/dev/vfio/${IOMMU_GROUP_ID0}  \
      --device /dev/vfio/${IOMMU_GROUP_ID1}:/dev/vfio/${IOMMU_GROUP_ID1}  \
      --device /dev/vfio/${IOMMU_GROUP_ID2}:/dev/vfio/${IOMMU_GROUP_ID2}  \
      --device /dev/vfio/${IOMMU_GROUP_ID3}:/dev/vfio/${IOMMU_GROUP_ID3}  \
      -v $HOME/models:/models:Z \
      --pids-limit 0 \
      --userns=keep-id \
      --group-add=keep-groups \
      --memory 200G \
      --shm-size 64G \
      -p 8000:8000 \
      registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
        --model /models/granite-3.3-8b-instruct \
        -tp 4 \
        --max-model-len 32768 \
        --max-num-seqs 32
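
    The container runs in the foreground and prints the server startup logs. If you prefer to run it in the background, you can add -d --name <container_name> to the podman run command and then follow the logs until the model has finished loading, for example:

    $ podman logs -f <container_name>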

Verification

  • In a separate terminal tab, make a request to the model by using the API:

    curl -X POST -H "Content-Type: application/json" -d '{
        "model": "/models/granite-3.3-8b-instruct",
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://<your_server_ip>:8000/v1/completions | jq

    Example output

    {
      "id": "cmpl-7c81cd00ccd04237ac8b5119e86b32a5",
      "object": "text_completion",
      "created": 1764665204,
      "model": "/models/granite-3.3-8b-instruct",
      "choices": [
        {
          "index": 0,
          "text": "\nThe answer is Paris. Paris is the capital and most populous city of France, located in the northern part of the country. It is renowned for its history, culture, fashion, and art, attracting",
          "logprobs": null,
          "finish_reason": "length",
          "stop_reason": null,
          "token_ids": null,
          "prompt_logprobs": null,
          "prompt_token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 7,
        "total_tokens": 57,
        "completion_tokens": 50,
        "prompt_tokens_details": null
      },
      "kv_transfer_params": null
    }
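
  • You can also confirm which models the server exposes by querying the models endpoint, for example:

    curl http://<your_server_ip>:8000/v1/models | jq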
