Chapter 7. Inference serving with Podman on IBM Power with IBM Spyre AI accelerators
Serve and perform inference with a large language model using Podman and Red Hat AI Inference Server running on IBM Power with IBM Spyre AI accelerators.
Prerequisites
- You have access to an IBM Power 11 server running RHEL 9.6 with IBM Spyre for Power AI accelerators installed.
- You are logged in as a user with sudo access.
- You have installed Podman.
- You have access to `registry.redhat.io` and have logged in.
- You have installed the Service Report tool. See IBM Power Systems service and productivity tools.
- You have created a `sentient` security group and added your Spyre user to the group.
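Group membership can be confirmed up front. The following is an unofficial sketch, not part of the documented procedure, using standard `id` and `grep`:

```shell
# Sketch: verify that the current user is in the sentient group.
# Membership is required so the container can raise its memlock limit.
user=$(id -un)
if id -nG | grep -qw sentient; then
  echo "user $user is in the sentient group"
else
  echo "not yet a member; add with: sudo usermod -aG sentient $user"
fi
```

After adding the user with `usermod`, log out and back in so the new membership takes effect.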
Procedure
Open a terminal on your server host, and log in to `registry.redhat.io`:

```
$ podman login registry.redhat.io
```

Run the `servicereport` command to verify your IBM Spyre hardware:

```
$ servicereport -r -p spyre
```

Example output:

```
servicereport 2.2.5
Spyre configuration checks
PASS  VFIO Driver configuration
PASS  User memlock configuration
PASS  sos config
PASS  sos package
PASS  VFIO udev rules configuration
PASS  User group configuration
PASS  VFIO device permission
PASS  VFIO kernel module loaded
PASS  VFIO module dep configuration
PASS  Memlock limit is set for the sentient group.
Spyre user must be in the sentient group.
To add run below command:
    sudo usermod -aG sentient <user>
Example:
    sudo usermod -aG sentient abc
Re-login as <user>.
```

Pull the Red Hat AI Inference Server image by running the following command:
```
$ podman pull registry.redhat.io/rhaii-early-access/vllm-spyre:3.4.0-ea.2
```

If your system has SELinux enabled, configure SELinux to allow device access:

```
$ sudo setsebool -P container_use_devices 1
```

Use `lspci -v` to verify that the container can access the host system IBM Spyre AI accelerators:

```
$ podman run -it --rm --pull=newer \
    --security-opt=label=disable \
    --device=/dev/vfio \
    --group-add keep-groups \
    --entrypoint="lspci" \
    registry.redhat.io/rhaii-early-access/vllm-spyre:3.4.0-ea.2
```

Example output:
```
0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
```

Create a volume to mount into the container and adjust the permissions so that the container can use it:

```
$ mkdir -p ~/models && chmod g+rwX ~/models
```

Download the `granite-3.3-8b-instruct` model into the `models/` folder. See Downloading models for more information.

Note: As an alternative to downloading models from Hugging Face, you can use validated Red Hat AI modelcar container images with a `3.0` or later tag. For more information about using modelcar images, see Inference serving language models in OCI-compliant model containers.

Gather the Spyre IDs for the `VLLM_AIU_PCIE_IDS` variable:

```
$ lspci
```

Example output:
```
0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
```

Set the `SPYRE_IDS` variable:

```
$ SPYRE_IDS="0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0"
```

Start the AI Inference Server container. For example, deploy the `granite-3.3-8b-instruct` model for inference serving:

```
$ podman run \
    --device=/dev/vfio \
    -v $HOME/models:/models \
    -e VLLM_AIU_PCIE_IDS="${SPYRE_IDS}" \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200G \
    --shm-size 64G \
    -p 8000:8000 \
    registry.redhat.io/rhaii-early-access/vllm-spyre:3.4.0-ea.2 \
    --model /models/granite-3.3-8b-instruct \
    -tp 4 \
    --max-model-len 32768 \
    --max-num-seqs 32
```
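Typing the accelerator IDs by hand is error-prone. As an unofficial sketch, `SPYRE_IDS` can also be derived from `lspci` output with `awk`, assuming each accelerator line matches the `IBM Spyre Accelerator` description shown above:

```shell
# Sketch: derive SPYRE_IDS from lspci-style output rather than typing it.
# A captured sample is parsed here; on a live host you would pipe lspci
# itself: SPYRE_IDS=$(lspci | awk '/IBM Spyre Accelerator/ {print $1}' | xargs)
sample='0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)'
SPYRE_IDS=$(printf '%s\n' "$sample" | awk '/IBM Spyre Accelerator/ {print $1}' | xargs)
echo "$SPYRE_IDS"   # -> 0381:50:00.0 0382:60:00.0
```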
Verification
In a separate terminal tab, make a request to the model API:
```
$ curl -X POST -H "Content-Type: application/json" \
    -d '{
      "model": "/models/granite-3.3-8b-instruct",
      "prompt": "What is the capital of France?",
      "max_tokens": 50
    }' \
    http://<your_server_ip>:8000/v1/completions | jq
```

Example output:
```
{
  "id": "cmpl-b94beda1d5a4485c9cb9ed4a13072fca",
  "object": "text_completion",
  "created": 1746555421,
  "choices": [
    {
      "index": 0,
      "text": " Paris.\nThe capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 18,
    "completion_tokens": 10,
    "prompt_tokens_details": null
  }
}
```
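Hand-written JSON inside a `curl -d` string is easy to break; a missing comma or quote makes the server reject the request. As a sketch, the request body can be built with `jq` instead (the `<your_server_ip>` placeholder is unchanged):

```shell
# Sketch: build the completion request body with jq to avoid JSON
# quoting mistakes inside the curl -d string.
body=$(jq -n \
  --arg model "/models/granite-3.3-8b-instruct" \
  --arg prompt "What is the capital of France?" \
  '{model: $model, prompt: $prompt, max_tokens: 50}')
echo "$body"
# Send it with:
#   curl -X POST -H "Content-Type: application/json" \
#     -d "$body" http://<your_server_ip>:8000/v1/completions | jq
```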
7.1. Recommended model inference settings for IBM Power with IBM Spyre AI accelerators
The following are the recommended model and AI Inference Server inference serving settings for IBM Power systems with IBM Spyre AI accelerators.
| Model | Batch size | Max input context size | Max output context size | Number of cards per container |
|---|---|---|---|---|
| granite-3.3-8b-instruct | 16 | 3K | 3K | 1 |
| Model | Batch size | Max input context size | Max output context size | Number of cards per container |
|---|---|---|---|---|
| | Up to 256 | 512 | Vector of size 768 | 1 |
| | Up to 256 | 512 | Vector of size 384 | 1 |
| Model | Batch size | Max input context size | Max output context size | Number of cards per container |
|---|---|---|---|---|
| granite-3.3-8b-instruct | 32 | 4K | 4K | 4 |
| | 16 | 8K | 8K | 4 |
| | 8 | 16K | 16K | 4 |
| | 4 | 32K | 32K | 4 |
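Judging from the serving commands in this chapter, the table columns appear to map onto server flags as follows: batch size corresponds to `--max-num-seqs`, and context size to `--max-model-len` in tokens (3K = 3072). A small sketch of that conversion, with the mapping treated as an assumption rather than a documented rule:

```shell
# Sketch: turn a table row (batch size, context in "K" tokens) into
# serving flags. Mapping assumed from the examples in this chapter:
#   batch size       -> --max-num-seqs
#   max context (K)  -> --max-model-len, where 1K = 1024 tokens
BATCH=16
CONTEXT_K=3
echo "--max-num-seqs $BATCH --max-model-len $((CONTEXT_K * 1024))"
# -> --max-num-seqs 16 --max-model-len 3072
```

These values match the single-card entity extraction example in the next section.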
7.2. Example inference serving configurations for IBM Spyre AI accelerators on IBM Power
The following examples describe common Red Hat AI Inference Server workloads on IBM Spyre AI accelerators and IBM Power.
- Entity extraction
Select a single Spyre card ID with the output from the `lspci` command, for example:

```
$ SPYRE_IDS="0381:50:00.0"
```

Podman entity extraction example:

```
$ podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 100GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaii-early-access/vllm-spyre:3.4.0-ea.2 \
    --enable-prefix-caching \
    --model /models/granite-3.3-8b-instruct \
    -tp 1 \
    --max-model-len 3072 \
    --max-num-seqs 16
```

- RAG inference serving
Select 4 Spyre card IDs with the output from the `lspci` command, for example:

```
$ SPYRE_IDS="0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0"
```

Podman RAG inference serving example:

```
$ podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/granite-3.3-8b-instruct \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaii-early-access/vllm-spyre:3.4.0-ea.2 \
    --enable-prefix-caching \
    --model /models/granite-3.3-8b-instruct \
    -tp 4 \
    --max-model-len 32768 \
    --max-num-seqs 32
```

- RAG embedding
Select a single Spyre card ID with the output from the `lspci` command, for example:

```
$ SPYRE_IDS="0384:80:00.0"
```

Podman RAG embedding inference serving example:

```
$ podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/granite-embedding-125m-english \
    -e VLLM_SPYRE_USE_CHUNKED_PREFILL=0 \
    -e VLLM_SPYRE_WARMUP_PROMPT_LENS=64 \
    -e VLLM_SPYRE_WARMUP_BATCH_SIZES=64 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaii-early-access/vllm-spyre:3.4.0-ea.2 \
    --model /models/granite-embedding-125m-english \
    -tp 1
```

- Re-ranker inference serving
Select a single Spyre AI accelerator card ID with the output from the `lspci` command, for example:

```
$ SPYRE_IDS="0384:80:00.0"
```

Podman re-ranker inference serving example:

```
$ podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/bge-reranker-v2-m3 \
    -e VLLM_SPYRE_USE_CHUNKED_PREFILL=0 \
    -e VLLM_SPYRE_WARMUP_PROMPT_LENS=1024 \
    -e VLLM_SPYRE_WARMUP_BATCH_SIZES=4 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaii-early-access/vllm-spyre:3.4.0-ea.2 \
    --model /models/bge-reranker-v2-m3 \
    -tp 1
```
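Each example above pairs the number of IDs in `SPYRE_IDS` with the tensor parallel size passed as `-tp` (one ID for `-tp 1`, four for `-tp 4`). A pre-flight consistency check can be sketched; the `TP_SIZE` variable is hypothetical, standing in for the `-tp` value you plan to pass:

```shell
# Sketch: confirm the Spyre ID count matches the planned -tp value
# before starting a container. Uses the single-card re-ranker values.
SPYRE_IDS="0384:80:00.0"
TP_SIZE=1
set -- $SPYRE_IDS   # word-split the IDs so $# counts them
if [ "$#" -eq "$TP_SIZE" ]; then
  echo "ok: $# card(s) for -tp $TP_SIZE"
else
  echo "mismatch: $# card(s) but -tp $TP_SIZE" >&2
fi
```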