Chapter 7. Inference serving with Podman on IBM Power with IBM Spyre AI accelerators


Serve and run inference on a large language model with Podman and Red Hat AI Inference Server on IBM Power with IBM Spyre AI accelerators.

Prerequisites

  • You have access to an IBM Power 11 server running RHEL 9.6 with IBM Spyre for Power AI accelerators installed.
  • You are logged in as a user with sudo access.
  • You have installed Podman.
  • You have access to registry.redhat.io and have logged in.
  • You have installed the Service Report tool. See IBM Power Systems service and productivity tools.
  • You have created the sentient security group and added your Spyre user to the group.

Procedure

  1. Open a terminal on your server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Run the servicereport command to verify your IBM Spyre hardware:

    $ servicereport -r -p spyre

    Example output

    servicereport 2.2.5
    
    Spyre configuration checks                          PASS
    
      VFIO Driver configuration                         PASS
      User memlock configuration                        PASS
      sos config                                        PASS
      sos package                                       PASS
      VFIO udev rules configuration                     PASS
      User group configuration                          PASS
      VFIO device permission                            PASS
      VFIO kernel module loaded                         PASS
      VFIO module dep configuration                     PASS
    
    Memlock limit is set for the sentient group.
    Spyre user must be in the sentient group.
    To add run below command:
            sudo usermod -aG sentient <user>
            Example:
            sudo usermod -aG sentient abc
            Re-login as <user>.

  3. Pull the Red Hat AI Inference Server image by running the following command:

    $ podman pull registry.redhat.io/rhaiis/vllm-spyre:3.2.5
  4. If your system has SELinux enabled, configure SELinux to allow device access:

    $ sudo setsebool -P container_use_devices 1
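
    Optionally, confirm that the boolean is set by using getsebool:

    $ getsebool container_use_devices

    Example output

    container_use_devices --> on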
  5. Run the lspci command inside the container to verify that the container can access the host system IBM Spyre AI accelerators:

    $ podman run -it --rm --pull=newer \
        --security-opt=label=disable \
        --device=/dev/vfio \
        --group-add keep-groups \
        --entrypoint="lspci" \
        registry.redhat.io/rhaiis/vllm-spyre:3.2.5

    Example output

    0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
    0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
    0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
    0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

  6. Create a directory to mount into the container, and adjust its permissions so that the container can use it:

    $ mkdir -p ~/models && chmod g+rwX ~/models
  7. Download the granite-3.3-8b-instruct model into the ~/models folder. See Downloading models for more information.
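
    For example, if you use the Hugging Face CLI to download models (an assumption; use whatever download method your environment supports), a minimal sketch looks like this:

    $ pip install -U huggingface_hub
    $ huggingface-cli download ibm-granite/granite-3.3-8b-instruct \
        --local-dir ~/models/granite-3.3-8b-instruct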
  8. Gather the Spyre IDs for the VLLM_AIU_PCIE_IDS variable:

    $ lspci

    Example output

    0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
    0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
    0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
    0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

  9. Set the SPYRE_IDS variable:

    $ SPYRE_IDS="0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0"
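
    Alternatively, you can build the variable directly from the lspci output. This one-liner is a convenience sketch that assumes the accelerator lines contain the string "Spyre", as in the example output above:

    $ SPYRE_IDS=$(lspci | awk '/Spyre/ {print $1}' | paste -sd' ' -)
    $ echo "$SPYRE_IDS"
    0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0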
  10. Start the AI Inference Server container. For example, deploy the granite-3.3-8b-instruct model across four Spyre AI accelerators:

    $ podman run \
        --device=/dev/vfio \
        -v $HOME/models:/models \
        -e VLLM_AIU_PCIE_IDS="${SPYRE_IDS}" \
        -e VLLM_SPYRE_USE_CB=1 \
        --pids-limit 0 \
        --userns=keep-id \
        --group-add=keep-groups \
        --memory 200G \
        --shm-size 64G \
        -p 8000:8000 \
        registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
            --model /models/granite-3.3-8b-instruct \
            -tp 4 \
            --max-model-len 32768 \
            --max-num-seqs 32
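
    The server is ready when the container log reports that the API server is listening on port 8000. You can also poll the vLLM health endpoint from another terminal; it returns HTTP 200 when the server is up:

    $ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health

    Example output

    200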

Verification

  • In a separate terminal tab, make a request to the model by using the API:

    $ curl -X POST -H "Content-Type: application/json" -d '{
        "model": "/models/granite-3.3-8b-instruct",
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://<your_server_ip>:8000/v1/completions | jq

    Example output

    {
        "id": "cmpl-b94beda1d5a4485c9cb9ed4a13072fca",
        "object": "text_completion",
        "created": 1746555421,
        "choices": [
            {
                "index": 0,
                "text": " Paris.\nThe capital of France is Paris.",
                "logprobs": null,
                "finish_reason": "stop",
                "stop_reason": null,
                "prompt_logprobs": null
            }
        ],
        "usage": {
            "prompt_tokens": 8,
            "total_tokens": 18,
            "completion_tokens": 10,
            "prompt_tokens_details": null
        }
    }
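
    You can also list the models that the server is serving. The /v1/models endpoint is part of the OpenAI-compatible API that vLLM exposes:

    $ curl -s http://<your_server_ip>:8000/v1/models | jq '.data[].id'

    Example output

    "/models/granite-3.3-8b-instruct"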

The following examples describe common Red Hat AI Inference Server workloads on IBM Spyre AI accelerators and IBM Power.

Entity extraction

Select a single Spyre card ID from the output of the lspci command, for example:

$ SPYRE_IDS="0381:50:00.0"

Podman entity extraction example

$ podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 100GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
        --model /models/granite-3.3-8b-instruct \
        -tp 1 \
        --max-model-len 3072 \
        --max-num-seqs 16
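
A completion request for entity extraction might look like the following. The prompt is illustrative; adapt it to your extraction schema:

$ curl -X POST -H "Content-Type: application/json" -d '{
    "model": "/models/granite-3.3-8b-instruct",
    "prompt": "Extract the person and organization names from the following text:\nIBM announced that Arvind Krishna will deliver the opening keynote.\nEntities:",
    "max_tokens": 100,
    "temperature": 0
}' http://<your_server_ip>:8000/v1/completions | jq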

RAG inference serving

Select four Spyre card IDs from the output of the lspci command, for example:

$ SPYRE_IDS="0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0"

Podman RAG inference serving example

$ podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/granite-3.3-8b-instruct \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
        --model /models/granite-3.3-8b-instruct \
        -tp 4 \
        --max-model-len 32768 \
        --max-num-seqs 32
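
For RAG serving, the retrieved context is typically passed in the request itself. For example, a chat completion with inline context (the retrieval step is outside the scope of this example):

$ curl -X POST -H "Content-Type: application/json" -d '{
    "model": "/models/granite-3.3-8b-instruct",
    "messages": [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "Context: IBM Spyre is an AI accelerator for IBM Power servers.\n\nQuestion: What is IBM Spyre?"}
    ],
    "max_tokens": 100
}' http://<your_server_ip>:8000/v1/chat/completions | jq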

RAG embedding

Select a single Spyre card ID from the output of the lspci command, for example:

$ SPYRE_IDS="0384:80:00.0"

Podman RAG embedding inference serving example

$ podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/granite-embedding-125m-english \
    -e VLLM_SPYRE_WARMUP_PROMPT_LENS=64 \
    -e VLLM_SPYRE_WARMUP_BATCH_SIZES=64 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
        --model /models/granite-embedding-125m-english \
        -tp 1
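
With the embedding model running, request embeddings through the OpenAI-compatible /v1/embeddings endpoint. The granite-embedding-125m-english model returns 768-dimensional vectors, so counting the embedding length is a quick sanity check:

$ curl -X POST -H "Content-Type: application/json" -d '{
    "model": "/models/granite-embedding-125m-english",
    "input": "What is the capital of France?"
}' http://<your_server_ip>:8000/v1/embeddings | jq '.data[0].embedding | length'

Example output

768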

Re-ranker inference serving

Select a single Spyre AI accelerator card ID from the output of the lspci command, for example:

$ SPYRE_IDS="0384:80:00.0"

Podman re-ranker inference serving example

$ podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/bge-reranker-v2-m3 \
    -e VLLM_SPYRE_WARMUP_PROMPT_LENS=1024 \
    -e VLLM_SPYRE_WARMUP_BATCH_SIZES=4 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
        --model /models/bge-reranker-v2-m3 \
        -tp 1
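
With the re-ranker running, you can score candidate documents against a query by using the rerank endpoint that vLLM exposes for cross-encoder models:

$ curl -X POST -H "Content-Type: application/json" -d '{
    "model": "/models/bge-reranker-v2-m3",
    "query": "What is the capital of France?",
    "documents": [
        "Paris is the capital of France.",
        "The Alps are a mountain range in Europe."
    ]
}' http://<your_server_ip>:8000/v1/rerank | jq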
