Chapter 8. Inference serving with Podman on IBM Z with IBM Spyre AI accelerators
Serve and run inference on a large language model with Podman and Red Hat AI Inference Server running on IBM Z with IBM Spyre AI accelerators.
Prerequisites
- You have access to an IBM Z (s390x) server running RHEL 9.6 with IBM Spyre for Z AI accelerators installed.
- You are logged in as a user with sudo access.
- You have installed Podman.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
IBM Spyre AI accelerator cards support FP16 format model weights only. For compatible models, the Red Hat AI Inference Server inference engine automatically converts weights to FP16 at startup. No additional configuration is needed.
Procedure
Open a terminal on your server host, and log in to registry.redhat.io:

$ podman login registry.redhat.io

Pull the Red Hat AI Inference Server image by running the following command:

$ podman pull registry.redhat.io/rhaiis/vllm-spyre:3.2.5

If your system has SELinux enabled, configure SELinux to allow device access:

$ sudo setsebool -P container_use_devices 1

Use lspci -v to verify that the container can access the host system IBM Spyre AI accelerators:

$ lspci -v

Example output:

0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

Create a volume to mount into the container and adjust the container permissions so that the container can use it:

$ mkdir -p ~/models && chmod g+rwX ~/models
Download the granite-3.3-8b-instruct model into the models/ folder. See Downloading models for more information.

Gather the IOMMU group IDs for the available Spyre devices:

$ lspci

Example output:

0000:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)
0001:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)
0002:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev ff)
0003:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)

Each line begins with the PCI device address, for example, 0000:00:00.0.

Use the PCI address to determine the IOMMU group ID for the required Spyre card, for example:

$ readlink /sys/bus/pci/devices/<PCI_ADDRESS>/iommu_group

Example output:

../../../kernel/iommu_groups/0

The IOMMU group ID (0) is the trailing number in the readlink output. Repeat for each required Spyre card.
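Extracting the trailing number from the readlink output can also be scripted. The following is a minimal sketch using shell parameter expansion; the sample link value is copied from the example output above:

```shell
# Minimal sketch: derive the IOMMU group ID from readlink output.
# The sample value below matches the example output in this procedure.
link='../../../kernel/iommu_groups/0'
group="${link##*/}"   # strip everything up to and including the last '/'
echo "$group"         # prints: 0
```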
Set IOMMU_GROUP_ID variables for the required Spyre cards using the readlink output. For example:

$ IOMMU_GROUP_ID0=0 IOMMU_GROUP_ID1=1 IOMMU_GROUP_ID2=2 IOMMU_GROUP_ID3=3

Start the AI Inference Server container, passing in the IOMMU group ID variables for the required Spyre devices. For example, deploy the granite-3.3-8b-instruct model configured for entity extraction across 4 Spyre devices.
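The launch command for the step above might look like the following. This is a sketch only: the /dev/vfio/<group> device-passing convention, the container name, and the serving flags (--model, --tensor-parallel-size) are assumptions based on typical VFIO and vLLM usage, not the exact documented command for this image.

```shell
# Sketch only -- device paths, flags, and the container name are assumptions.
podman run -d --name rhaiis-spyre \
  -p 8000:8000 \
  -v ~/models:/models:Z \
  --device /dev/vfio/vfio \
  --device /dev/vfio/$IOMMU_GROUP_ID0 \
  --device /dev/vfio/$IOMMU_GROUP_ID1 \
  --device /dev/vfio/$IOMMU_GROUP_ID2 \
  --device /dev/vfio/$IOMMU_GROUP_ID3 \
  registry.redhat.io/rhaiis/vllm-spyre:3.2.5 \
  --model /models/granite-3.3-8b-instruct \
  --tensor-parallel-size 4
```

Passing one /dev/vfio/<group> node per IOMMU group is the standard way to grant a container access to VFIO-bound devices; adjust the list to match the groups you gathered earlier.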
Verification
In a separate tab in your terminal, make a request to the model with the API.
$ curl -X POST -H "Content-Type: application/json" \
    -d '{ "model": "/models/granite-3.3-8b-instruct", "prompt": "What is the capital of France?", "max_tokens": 50 }' \
    http://<your_server_ip>:8000/v1/completions | jq
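If you only want the generated text rather than the full JSON response, jq can extract it. The response body below is a hand-written sample in the OpenAI-compatible completions format, not captured server output:

```shell
# Illustrative only: a hand-written sample response, not real server output.
response='{"choices":[{"text":" The capital of France is Paris."}]}'
# Pull out just the generated text from the first choice.
printf '%s' "$response" | jq -r '.choices[0].text'
```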