Chapter 7. Inference serving with Podman on IBM Power with IBM Spyre AI accelerators

Serve a large language model and run inference on it with Podman and Red Hat AI Inference Server running on IBM Power with IBM Spyre AI accelerators.

Prerequisites

You have access to an IBM Power 11 server running RHEL 9.6 with IBM Spyre for Power AI accelerators installed.
You are logged in as a user with sudo access.
You have installed Podman.
You have access to registry.redhat.io and have logged in.
You have installed the Service Report tool. See IBM Power Systems service and productivity tools.
You have created a sentient security group and added your Spyre user to the group.

Procedure

Open a terminal on your server host, and log in to registry.redhat.io:
```
podman login registry.redhat.io
```

Run the servicereport command to verify your IBM Spyre hardware:

servicereport -r -p spyre

Example output

servicereport 2.2.5

Spyre configuration checks                          PASS

  VFIO Driver configuration                         PASS
  User memlock configuration                        PASS
  sos config                                        PASS
  sos package                                       PASS
  VFIO udev rules configuration                     PASS
  User group configuration                          PASS
  VFIO device permission                            PASS
  VFIO kernel module loaded                         PASS
  VFIO module dep configuration                     PASS

Memlock limit is set for the sentient group.
Spyre user must be in the sentient group.
To add run below command:
        sudo usermod -aG sentient <user>
        Example:
        sudo usermod -aG sentient abc
        Re-login as <user>.
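
If you want to script this check, the following is a minimal sketch that lists any check that did not pass. It assumes that a failing check prints an upper-case status word such as FAIL in the position where PASS appears above; that status word, the function name `failed_spyre_checks`, and the sample text are illustrative assumptions, not taken from the tool's documentation.

```shell
# Print every line of servicereport output (read from stdin) whose last
# field is an upper-case status word other than PASS.
# Assumption: a failing check prints a word such as FAIL where PASS appears.
failed_spyre_checks() {
    LC_ALL=C awk '$NF ~ /^[A-Z]+$/ && $NF != "PASS" { print }'
}

# Hypothetical sample, not real output: one passing and one failing check.
SAMPLE_REPORT='servicereport 2.2.5

  VFIO Driver configuration                         PASS
  User memlock configuration                        FAIL'

FAILED_COUNT="$(printf '%s\n' "$SAMPLE_REPORT" | failed_spyre_checks | wc -l | tr -d ' ')"
echo "Checks not passing: ${FAILED_COUNT}"
```

On the server host, the same function could be fed directly, for example `servicereport -r -p spyre | failed_spyre_checks`. Empty output would then mean that every listed check passed.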

Pull the Red Hat AI Inference Server image by running the following command:
```
podman pull registry.redhat.io/rhaiis/vllm-spyre:3.2.5
```
If your system has SELinux enabled, configure SELinux to allow device access:
```
sudo setsebool -P container_use_devices 1
```
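
To confirm that the boolean is set, you can inspect the output of `getsebool container_use_devices`, which prints a line of the form `name --> on` or `name --> off`. The following sketch parses such a line. The helper name `sebool_line_is_on` is hypothetical, and a sample line is used here so that the snippet runs without SELinux tooling.

```shell
# Succeed when a getsebool output line reports the boolean as "on".
sebool_line_is_on() {
    case "$1" in
        *" --> on") return 0 ;;
        *)          return 1 ;;
    esac
}

# Sample line in the format printed by `getsebool container_use_devices`.
sample_line='container_use_devices --> on'

if sebool_line_is_on "$sample_line"; then
    SEBOOL_STATE="enabled"
else
    SEBOOL_STATE="disabled"
fi
echo "container_use_devices is ${SEBOOL_STATE}"
```

On the host, replace the sample with the live value: `sebool_line_is_on "$(getsebool container_use_devices)"`.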

Run lspci inside the container to verify that the container can access the IBM Spyre AI accelerators on the host system:

podman run -it --rm --pull=newer \
    --security-opt=label=disable \
    --device=/dev/vfio \
    --group-add keep-groups \
    --entrypoint="lspci" \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5

Example output

0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

Create a directory to mount into the container as a volume, and adjust the directory permissions so that the container can use it:
```
mkdir -p ~/models && chmod g+rwX ~/models
```
Download the granite-3.3-8b-instruct model into the ~/models directory. See Downloading models for more information.

Gather the Spyre IDs for the VLLM_AIU_PCIE_IDS variable:

lspci

Example output

0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

Set the SPYRE_IDS variable:

SPYRE_IDS="0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0"
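
Instead of copying the addresses by hand, you could derive the variable from the lspci output. The following is a minimal sketch, assuming the cards are listed with the "IBM Spyre Accelerator" description shown in the example output. The function name `spyre_ids_from_lspci` is hypothetical, and the sample text stands in for live lspci output so that the snippet runs anywhere.

```shell
# Print the PCI addresses of all Spyre cards found in lspci output (stdin),
# joined by single spaces on one line.
spyre_ids_from_lspci() {
    awk '/IBM Spyre Accelerator/ { printf "%s%s", sep, $1; sep = " " }
         END { if (sep) print "" }'
}

# Sample copied from the example output above.
sample_lspci='0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)'

SPYRE_IDS="$(printf '%s\n' "$sample_lspci" | spyre_ids_from_lspci)"
echo "SPYRE_IDS=${SPYRE_IDS}"
```

On the server host, the equivalent call would be `SPYRE_IDS="$(lspci | spyre_ids_from_lspci)"`. This selects every Spyre card on the system, so trim the result if a workload needs fewer cards.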

Start the AI Inference Server container. For example, deploy the granite-3.3-8b-instruct model on four Spyre cards, configured for RAG inference serving:

podman run \
    --device=/dev/vfio \
    -v $HOME/models:/models \
    -e VLLM_AIU_PCIE_IDS="${SPYRE_IDS}" \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200G \
    --shm-size 64G \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
        --model /models/granite-3.3-8b-instruct \
        -tp 4 \
        --max-model-len 32768 \
        --max-num-seqs 32
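
In the examples in this chapter, the -tp value equals the number of card IDs passed to the container (four IDs with -tp 4, one ID with -tp 1). The following sketch derives that number from SPYRE_IDS so the two stay in step. Treating this match as a requirement is an assumption drawn from the examples, and the function name `count_spyre_ids` is hypothetical.

```shell
# Count the whitespace-separated PCI addresses in a SPYRE_IDS-style string.
count_spyre_ids() {
    # Intentionally unquoted: word splitting separates the IDs.
    set -- $1
    echo "$#"
}

SPYRE_IDS="0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0"
TP_SIZE="$(count_spyre_ids "$SPYRE_IDS")"
echo "Pass -tp ${TP_SIZE} for ${TP_SIZE} Spyre card(s)"
```

The resulting value can then be used in place of the literal number, for example `-tp "${TP_SIZE}"`.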

Verification

In a separate terminal tab, send a request to the model through the API:

curl -X POST -H "Content-Type: application/json" -d '{
    "model": "/models/granite-3.3-8b-instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

Example output

{
    "id": "cmpl-b94beda1d5a4485c9cb9ed4a13072fca",
    "object": "text_completion",
    "created": 1746555421,
    "choices": [
        {
            "index": 0,
            "text": " Paris.\nThe capital of France is Paris.",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 8,
        "total_tokens": 18,
        "completion_tokens": 10,
        "prompt_tokens_details": null
    }
}
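
For repeated checks, the request body can be built by a small helper. The following is a minimal sketch; the function name `completion_payload` is hypothetical, and the values are inserted verbatim, so it only suits strings without double quotes or backslashes.

```shell
# Build the JSON body for a /v1/completions request.
# Arguments: model name, prompt text, maximum number of tokens.
# Values are inserted as-is; they must not contain double quotes or backslashes.
completion_payload() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' "$1" "$2" "$3"
}

PAYLOAD="$(completion_payload "/models/granite-3.3-8b-instruct" "What is the capital of France?" 50)"
echo "$PAYLOAD"

# To send the request, a running server is required. Replace the placeholder host:
# curl -s -X POST -H "Content-Type: application/json" -d "$PAYLOAD" \
#     http://<your_server_ip>:8000/v1/completions | jq -r '.choices[0].text'
```

The jq filter in the commented command prints only the generated text from the first entry of the choices array shown in the example output.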

7.1. Recommended model inference settings for IBM Power with IBM Spyre AI accelerators

The following tables list the recommended models and AI Inference Server inference serving settings for IBM Power systems with IBM Spyre AI accelerators.

Table 7.1. Recommended model and inference settings for entity extraction
Model	Batch size	Max input context size	Max output context size	Number of cards per container
granite-3.3-8b-instruct	16	3K	3K	1

Table 7.2. Recommended model and inference settings for RAG (Retrieval-Augmented Generation) embedding
Model	Batch size	Max input context size	Max output context size	Number of cards per container
granite-embedding-125m-english, granite-embedding-278m-multilingual	Up to 256	512	Vector of size 768	1
granite-embedding-30m-english, granite-embedding-107m-multilingual	Up to 256	512	Vector of size 384	1

Table 7.3. Recommended settings for RAG inference serving
Model	Batch size	Max input context size	Max output context size	Number of cards per container
granite-3.3-8b-instruct	32	4K	4K	4
granite-3.3-8b-instruct	16	8K	8K	4
granite-3.3-8b-instruct	8	16K	16K	4
granite-3.3-8b-instruct	4	32K	32K	4
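
The rows of Table 7.3 can be expressed as a lookup from context size to batch size. The following is a minimal sketch, assuming that "4K" means 4096 tokens, in the same way that the 3K entity extraction setting appears as 3072 in the example command. The function name `rag_batch_size` is hypothetical. The loop also prints the product of context size and batch size, which is 131072 for every row of the table.

```shell
# Look up the Table 7.3 batch size for a context size given in tokens.
# Assumption: "4K" in the table means 4096 tokens, and so on.
rag_batch_size() {
    case "$1" in
        4096)  echo 32 ;;
        8192)  echo 16 ;;
        16384) echo 8 ;;
        32768) echo 4 ;;
        *)     echo "no Table 7.3 row for context size $1" >&2; return 1 ;;
    esac
}

for ctx in 4096 8192 16384 32768; do
    batch="$(rag_batch_size "$ctx")"
    echo "context=${ctx} batch=${batch} product=$((ctx * batch))"
done
```

A lookup is used rather than a formula so that context sizes that are not in the table produce an error instead of an extrapolated value.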

7.2. Example inference serving configurations for IBM Spyre AI accelerators on IBM Power

The following examples describe common Red Hat AI Inference Server workloads on IBM Spyre AI accelerators and IBM Power.

Entity extraction

Select a single Spyre card ID from the output of the lspci command, for example:

SPYRE_IDS="0381:50:00.0"

Podman entity extraction example

podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 100GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
        --model /models/granite-3.3-8b-instruct \
        -tp 1 \
        --max-model-len 3072 \
        --max-num-seqs 16

RAG inference serving

Select four Spyre card IDs from the output of the lspci command, for example:

SPYRE_IDS="0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0"

Podman RAG inference serving example

podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/granite-3.3-8b-instruct \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
        --model /models/granite-3.3-8b-instruct \
        -tp 4 \
        --max-model-len 32768 \
        --max-num-seqs 32

RAG embedding

Select a single Spyre card ID from the output of the lspci command, for example:

SPYRE_IDS="0384:80:00.0"

Podman RAG embedding inference serving example

podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/granite-embedding-125m-english \
    -e VLLM_SPYRE_WARMUP_PROMPT_LENS=64 \
    -e VLLM_SPYRE_WARMUP_BATCH_SIZES=64 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
        --model /models/granite-embedding-125m-english \
        -tp 1
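
Once the embedding container is running, it can be queried in the same way as the completion model. The following sketch builds a request body for the OpenAI-compatible /v1/embeddings endpoint that vLLM provides. That this endpoint is enabled in this image is an assumption, and the function name `embedding_payload` is hypothetical. Values are inserted verbatim, so they must not contain double quotes or backslashes.

```shell
# Build the JSON body for a /v1/embeddings request with one input string.
# Arguments: model name, text to embed.
# Values are inserted as-is; they must not contain double quotes or backslashes.
embedding_payload() {
    printf '{"model": "%s", "input": ["%s"]}\n' "$1" "$2"
}

EMBED_PAYLOAD="$(embedding_payload "/models/granite-embedding-125m-english" "What is the capital of France?")"
echo "$EMBED_PAYLOAD"

# To send the request, a running server is required. Replace the placeholder host:
# curl -s -X POST -H "Content-Type: application/json" -d "$EMBED_PAYLOAD" \
#     http://<your_server_ip>:8000/v1/embeddings | jq '.data[0].embedding | length'
```

The jq filter in the commented command prints the length of the returned vector, which Table 7.2 lists as 768 for granite-embedding-125m-english.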

Re-ranker inference serving

Select a single Spyre AI accelerator card ID from the output of the lspci command, for example:

SPYRE_IDS="0384:80:00.0"

Podman re-ranker inference serving example

podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/bge-reranker-v2-m3 \
    -e VLLM_SPYRE_WARMUP_PROMPT_LENS=1024 \
    -e VLLM_SPYRE_WARMUP_BATCH_SIZES=4 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
        --model /models/bge-reranker-v2-m3 \
        -tp 1
