Getting started

Chapter 1. About AI Inference Server
Copy link

AI Inference Server provides enterprise-grade stability and security, building on the open source vLLM project, which provides state-of-the-art inferencing features.

AI Inference Server uses continuous batching to process requests as they arrive instead of waiting for a full batch to be accumulated. It also uses tensor parallelism to distribute LLM workloads across multiple GPUs. These features provide reduced latency and higher throughput.

To reduce the cost of inferencing models, AI Inference Server uses paged attention. LLMs use a mechanism called attention to understand conversations with users. Normally, attention uses a significant amount of memory, much of which is wasted. Paged attention addresses this memory waste by provisioning memory for LLMs similar to the way that virtual memory works for operating systems. This approach consumes less memory and lowers costs.

To verify cost savings and performance gains with AI Inference Server, complete the following procedures:

Serving and inferencing with AI Inference Server
Validating Red Hat AI Inference Server benefits using key metrics

Chapter 2. Product and version compatibility
Copy link

The following table lists the supported product versions for Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI.

Expand

Table 2.1. AI Inference Server product and version compatibility
Product version	vLLM core version	LLM Compressor version
3.3.0	v0.13.0	v0.9.0.1
3.2.5	v0.11.2	v0.8.1
3.2.4	v0.11.0	v0.8.1
3.2.3	v0.11.0	v0.8.1
3.2.2	v0.10.1.1	v0.7.1
3.2.1	v0.10.0	Not included in this release
3.2.0	v0.9.2	Not included in this release

Expand

Table 2.2. Red Hat OpenShift AI product and version compatibility
Product version	vLLM core version	LLM Compressor version
3.2	v0.11.2	v0.8.1
3.0	v0.11.0	v0.8.1

Expand

Table 2.3. Red Hat Enterprise Linux AI product and version compatibility
Product version	vLLM core version	LLM Compressor version
3.2	v0.11.2	v0.8.1
3.0	v0.11.0	v0.8.1

Chapter 3. Reviewing AI Inference Server Python packages
Copy link

You can review the Python packages installed in the Red Hat AI Inference Server container image by running the container with Podman and reviewing the pip list package output.

Prerequisites

You have installed Podman or Docker.
You are logged in as a user with sudo access.
You have access to registry.redhat.io and have logged in.

Procedure

Run the Red Hat AI Inference Server container image with the pip list package command to view all installed Python packages. For example:

podman run --rm --entrypoint=/bin/bash \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
  -c "pip list"

$ podman run --rm --entrypoint=/bin/bash \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
  -c "pip list"

Copy to Clipboard

Toggle word wrap

To view detailed information about a specific package, run the Podman command with pip show <package_name>. For example:

podman run --rm --entrypoint=/bin/bash \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
  -c "pip show vllm"

$ podman run --rm --entrypoint=/bin/bash \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
  -c "pip show vllm"

Copy to Clipboard

Toggle word wrap

Example output

Name: vllm
Version: v0.13.0

Name: vllm
Version: v0.13.0

Copy to Clipboard

Toggle word wrap

Chapter 4. Serving and inferencing with Podman using NVIDIA CUDA AI accelerators
Copy link

Serve and inference a large language model with Podman and Red Hat AI Inference Server running on NVIDIA CUDA AI accelerators.

Prerequisites

You have installed Podman or Docker.
You are logged in as a user with sudo access.
You have access to registry.redhat.io and have logged in.
You have a Hugging Face account and have generated a Hugging Face access token.
You have access to a Linux server with data center grade NVIDIA AI accelerators installed.
- For NVIDIA GPUs:
  - Install NVIDIA drivers
  - Install the NVIDIA Container Toolkit
  - If your system has multiple NVIDIA GPUs that use NVSwitch, you must have root access to start Fabric Manager

Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

Open a terminal on your server host, and log in to registry.redhat.io:
```
podman login registry.redhat.io
```
```
$ podman login registry.redhat.io
```
Copy to Clipboard Toggle word wrap
Pull the relevant the NVIDIA CUDA image by running the following command:
```
podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
```
```
$ podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
```
Copy to Clipboard Toggle word wrap
If your system has SELinux enabled, configure SELinux to allow device access:
```
sudo setsebool -P container_use_devices 1
```
```
$ sudo setsebool -P container_use_devices 1
```
Copy to Clipboard Toggle word wrap
Create a volume and mount it into the container. Adjust the container permissions so that the container can use it.
```
mkdir -p rhaiis-cache
```
```
$ mkdir -p rhaiis-cache
```
Copy to Clipboard Toggle word wrap
```
chmod g+rwX rhaiis-cache
```
```
$ chmod g+rwX rhaiis-cache
```
Copy to Clipboard Toggle word wrap
Create or append your HF_TOKEN Hugging Face token to the private.env file. Source the private.env file.
```
echo "export HF_TOKEN=<your_HF_token>" > private.env
```
```
$ echo "export HF_TOKEN=<your_HF_token>" > private.env
```
Copy to Clipboard Toggle word wrap
```
source private.env
```
```
$ source private.env
```
Copy to Clipboard Toggle word wrap

Start the AI Inference Server container image.

For NVIDIA CUDA accelerators, if the host system has multiple GPUs and uses NVSwitch, then start NVIDIA Fabric Manager. To detect if your system is using NVSwitch, first check if files are present in /proc/driver/nvidia-nvswitch/devices/, and then start NVIDIA Fabric Manager. Starting NVIDIA Fabric Manager requires root privileges.

ls /proc/driver/nvidia-nvswitch/devices/

$ ls /proc/driver/nvidia-nvswitch/devices/

Copy to Clipboard

Toggle word wrap

Example output

0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0

0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0

Copy to Clipboard

Toggle word wrap

systemctl start nvidia-fabricmanager

$ systemctl start nvidia-fabricmanager

Copy to Clipboard

Toggle word wrap

Important

NVIDIA Fabric Manager is only required on systems with multiple GPUs that use NVSwitch. For more information, see NVIDIA Server Architectures.

Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:

podman run --rm -it \
--security-opt=label=disable \
--device nvidia.com/gpu=all \
nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
nvidia-smi

$ podman run --rm -it \
--security-opt=label=disable \
--device nvidia.com/gpu=all \
nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
nvidia-smi

Copy to Clipboard

Toggle word wrap

Example output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
| N/A   32C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
| N/A   29C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
| N/A   32C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
| N/A   29C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Copy to Clipboard

Toggle word wrap

Start the container.
```
podman run --rm -it \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--shm-size=4g -p 8000:8000 \
--userns=keep-id:uid=1001 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
--tensor-parallel-size 2
```
```
$ podman run --rm -it \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--shm-size=4g -p 8000:8000 \
--userns=keep-id:uid=1001 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
--tensor-parallel-size 2
```
Copy to Clipboard Toggle word wrap
Where:
--security-opt=label=disable
Disables SELinux label relabeling for volume mounts. Required for systems where SELinux is enabled. Without this option, the container might fail to start.
--shm-size=4g -p 8000:8000
Specifies the shared memory size and port mapping. Increase --shm-size to 8GB if you experience shared memory issues.
--userns=keep-id:uid=1001
Maps the host UID to the effective UID of the vLLM process in the container. Alternatively, you can pass --user=0, but this is less secure because it runs vLLM as root inside the container.
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN"
Specifies the Hugging Face API access token. Set and export HF_TOKEN with your Hugging Face token.
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z
Mounts the cache directory with SELinux context. The :Z suffix is required for systems where SELinux is enabled. On Debian, Ubuntu, or Docker without SELinux, omit the :Z suffix.
--tensor-parallel-size 2
Specifies the number of GPUs to use for tensor parallelism. Set this value to match the number of available GPUs.

In a separate tab in your terminal, make a request to your model with the API.

curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

Copy to Clipboard

Toggle word wrap

Example output

{
    "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca",
    "object": "text_completion",
    "created": 1746555421,
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "choices": [
        {
            "index": 0,
            "text": " Paris.\nThe capital of France is Paris.",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 8,
        "total_tokens": 18,
        "completion_tokens": 10,
        "prompt_tokens_details": null
    }
}

{
    "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca",
    "object": "text_completion",
    "created": 1746555421,
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "choices": [
        {
            "index": 0,
            "text": " Paris.\nThe capital of France is Paris.",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 8,
        "total_tokens": 18,
        "completion_tokens": 10,
        "prompt_tokens_details": null
    }
}

Copy to Clipboard

Toggle word wrap

Chapter 5. Serving and inferencing with Podman using AMD ROCm AI accelerators
Copy link

Serve and inference a large language model with Podman and Red Hat AI Inference Server running on AMD ROCm AI accelerators.

Prerequisites

You have installed Podman or Docker.
You are logged in as a user with sudo access.
You have access to registry.redhat.io and have logged in.
You have a Hugging Face account and have generated a Hugging Face access token.
You have access to a Linux server with data center grade AMD ROCm AI accelerators installed.
- For AMD GPUs:
  - Install ROCm software
  - Verify that you can run ROCm containers

Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

Open a terminal on your server host, and log in to registry.redhat.io:
```
podman login registry.redhat.io
```
```
$ podman login registry.redhat.io
```
Copy to Clipboard Toggle word wrap

Pull the AMD ROCm image by running the following command:

podman pull registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0

$ podman pull registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0

Copy to Clipboard

Toggle word wrap

If your system has SELinux enabled, configure SELinux to allow device access:
```
sudo setsebool -P container_use_devices 1
```
```
$ sudo setsebool -P container_use_devices 1
```
Copy to Clipboard Toggle word wrap
Create a volume and mount it into the container. Adjust the container permissions so that the container can use it.
```
mkdir -p rhaiis-cache
```
```
$ mkdir -p rhaiis-cache
```
Copy to Clipboard Toggle word wrap
```
chmod g+rwX rhaiis-cache
```
```
$ chmod g+rwX rhaiis-cache
```
Copy to Clipboard Toggle word wrap
Create or append your HF_TOKEN Hugging Face token to the private.env file. Source the private.env file.
```
echo "export HF_TOKEN=<your_HF_token>" > private.env
```
```
$ echo "export HF_TOKEN=<your_HF_token>" > private.env
```
Copy to Clipboard Toggle word wrap
```
source private.env
```
```
$ source private.env
```
Copy to Clipboard Toggle word wrap

Start the AI Inference Server container image.

For AMD ROCm accelerators:

Use amd-smi static -a to verify that the container can access the host system GPUs:

podman run -ti --rm --pull=newer \
--security-opt=label=disable \
--device=/dev/kfd --device=/dev/dri \
--group-add keep-groups \
--entrypoint="" \
registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0 \
amd-smi static -a

$ podman run -ti --rm --pull=newer \
--security-opt=label=disable \
--device=/dev/kfd --device=/dev/dri \
--group-add keep-groups \
--entrypoint="" \
registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0 \
amd-smi static -a

Copy to Clipboard

Toggle word wrap

Where:

--group-add keep-groups: Preserves the supplementary groups from the host user. On AMD systems, you must belong to both the video and render groups to access GPUs.

Start the container:

podman run --rm -it \
--device /dev/kfd --device /dev/dri \
--security-opt=label=disable \
--group-add keep-groups \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
--tensor-parallel-size 2

podman run --rm -it \
--device /dev/kfd --device /dev/dri \
--security-opt=label=disable \
--group-add keep-groups \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
--tensor-parallel-size 2

Copy to Clipboard

Toggle word wrap

Where:

--security-opt=label=disable: Disables SELinux label relabeling for volume mounts. Without this option, the container might fail to start.
--shm-size=4GB -p 8000:8000: Specifies the shared memory size and port mapping. Increase --shm-size to 8GB if you experience shared memory issues.
--tensor-parallel-size 2: Specifies the number of GPUs to use for tensor parallelism. Set this value to match the number of available GPUs.

Verification

In a separate tab in your terminal, make a request to the model with the API.

curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

Copy to Clipboard

Toggle word wrap

Example output

{
    "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca",
    "object": "text_completion",
    "created": 1746555421,
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "choices": [
        {
            "index": 0,
            "text": " Paris.\nThe capital of France is Paris.",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 8,
        "total_tokens": 18,
        "completion_tokens": 10,
        "prompt_tokens_details": null
    }
}

{
    "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca",
    "object": "text_completion",
    "created": 1746555421,
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "choices": [
        {
            "index": 0,
            "text": " Paris.\nThe capital of France is Paris.",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 8,
        "total_tokens": 18,
        "completion_tokens": 10,
        "prompt_tokens_details": null
    }
}

Copy to Clipboard

Toggle word wrap

Chapter 6. Serving and inferencing language models with Podman using Google TPU AI accelerators
Copy link

Serve and inference a large language model with Podman or Docker and Red Hat AI Inference Server in a Google cloud VM that has Google TPU AI accelerators available.

Prerequisites

You have access to a Google cloud TPU VM with Google TPU AI accelerators configured. For more information, see:
- Set up the Cloud TPU environment
- vLLM inference on v6e TPUs
You have installed Podman or Docker.
You are logged in as a user with sudo access.
You have access to the registry.redhat.io image registry and have logged in.
You have a Hugging Face account and have generated a Hugging Face access token.

Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

Open a terminal on your TPU server host, and log in to registry.redhat.io:
```
podman login registry.redhat.io
```
```
$ podman login registry.redhat.io
```
Copy to Clipboard Toggle word wrap
Pull the Red Hat AI Inference Server image by running the following command:
```
podman pull registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0
```
```
$ podman pull registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0
```
Copy to Clipboard Toggle word wrap

Optional: Verify that the TPUs are available in the host.

Open a shell prompt in the Red Hat AI Inference Server container. Run the following command:

podman run -it --net=host --privileged -e PJRT_DEVICE=TPU --rm --entrypoint /bin/bash registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0

$ podman run -it --net=host --privileged -e PJRT_DEVICE=TPU --rm --entrypoint /bin/bash registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0

Copy to Clipboard

Toggle word wrap

Verify system TPU access and basic operations by running the following Python code in the container shell prompt:

python3 -c "
import torch
import torch_xla
try:
    device = torch_xla.device()
    print(f'')
    print(f'XLA device available: {device}')
    x = torch.randn(3, 3).to(device)
    y = torch.randn(3, 3).to(device)
    z = torch.matmul(x, y)
    import torch_xla.core.xla_model as xm
    torch_xla.sync()
    print(f'Matrix multiplication successful')
    print(f'Result tensor shape: {z.shape}')
    print(f'Result tensor device: {z.device}')
    print(f'Result tensor: {z.data}')
    print('TPU is operational.')
except Exception as e:
    print(f'TPU test failed: {e}')
    print('Try restarting the container to clear TPU locks')
"

$ python3 -c "
import torch
import torch_xla
try:
    device = torch_xla.device()
    print(f'')
    print(f'XLA device available: {device}')
    x = torch.randn(3, 3).to(device)
    y = torch.randn(3, 3).to(device)
    z = torch.matmul(x, y)
    import torch_xla.core.xla_model as xm
    torch_xla.sync()
    print(f'Matrix multiplication successful')
    print(f'Result tensor shape: {z.shape}')
    print(f'Result tensor device: {z.device}')
    print(f'Result tensor: {z.data}')
    print('TPU is operational.')
except Exception as e:
    print(f'TPU test failed: {e}')
    print('Try restarting the container to clear TPU locks')
"

Copy to Clipboard

Toggle word wrap

Example output

XLA device available: xla:0
Matrix multiplication successful
Result tensor shape: torch.Size([3, 3])
Result tensor device: xla:0
Result tensor: tensor([[-1.8161,  1.6359, -3.1301],
        [-1.2205,  0.8985, -1.4422],
        [ 0.0588,  0.7693, -1.5683]], device='xla:0')
TPU is operational.

XLA device available: xla:0
Matrix multiplication successful
Result tensor shape: torch.Size([3, 3])
Result tensor device: xla:0
Result tensor: tensor([[-1.8161,  1.6359, -3.1301],
        [-1.2205,  0.8985, -1.4422],
        [ 0.0588,  0.7693, -1.5683]], device='xla:0')
TPU is operational.

Copy to Clipboard

Toggle word wrap

Exit the shell prompt.
```
exit
```
```
$ exit
```
Copy to Clipboard Toggle word wrap

Create a volume and mount it into the container. Adjust the container permissions so that the container can use it.
```
mkdir ./.cache/rhaiis
```
```
$ mkdir ./.cache/rhaiis
```
Copy to Clipboard Toggle word wrap
```
chmod g+rwX ./.cache/rhaiis
```
```
$ chmod g+rwX ./.cache/rhaiis
```
Copy to Clipboard Toggle word wrap

Add the HF_TOKEN Hugging Face token to the private.env file.

echo "export HF_TOKEN=<huggingface_token>" > private.env

$ echo "export HF_TOKEN=<huggingface_token>" > private.env

Copy to Clipboard

Toggle word wrap

Append the HF_HOME variable to the private.env file.
```
echo "export HF_HOME=./.cache/rhaiis" >> private.env
```
```
$ echo "export HF_HOME=./.cache/rhaiis" >> private.env
```
Copy to Clipboard Toggle word wrap
Source the private.env file.
```
source private.env
```
```
$ source private.env
```
Copy to Clipboard Toggle word wrap

Start the AI Inference Server container image:

podman run --rm -it \
  --name vllm-tpu \
  --network=host \
  --privileged \
  --shm-size=4g \
  --device=/dev/vfio/vfio \
  --device=/dev/vfio/0 \
  -e PJRT_DEVICE=TPU \
  -e HF_HUB_OFFLINE=0 \
  -v ./.cache/rhaiis:/opt/app-root/src/.cache \
  registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len=256 \
  --host=0.0.0.0 \
  --port=8000

podman run --rm -it \
  --name vllm-tpu \
  --network=host \
  --privileged \
  --shm-size=4g \
  --device=/dev/vfio/vfio \
  --device=/dev/vfio/0 \
  -e PJRT_DEVICE=TPU \
  -e HF_HUB_OFFLINE=0 \
  -v ./.cache/rhaiis:/opt/app-root/src/.cache \
  registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len=256 \
  --host=0.0.0.0 \
  --port=8000

Copy to Clipboard

Toggle word wrap

Where:

--tensor-parallel-size 1: Specifies the number of TPUs to use for tensor parallelism. Set this value to match the number of available TPUs.
--max-model-len=256: Specifies the maximum model context length. For optimal performance, set this value as low as your workload allows.

Verification

Check that the AI Inference Server server is up. Open a separate tab in your terminal, and make a model request with the API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Briefly, what colour is the wind?"}
    ],
    "max_tokens": 50
  }' | jq

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Briefly, what colour is the wind?"}
    ],
    "max_tokens": 50
  }' | jq

Copy to Clipboard

Toggle word wrap

Example output

{
  "id": "chatcmpl-13a9d6a04fd245409eb601688d6144c1",
  "object": "chat.completion",
  "created": 1755268559,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The wind is typically associated with the color white or grey, as it can carry dust, sand, or other particles. However, it is not a color in the traditional sense.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 38,
    "total_tokens": 75,
    "completion_tokens": 37,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}

{
  "id": "chatcmpl-13a9d6a04fd245409eb601688d6144c1",
  "object": "chat.completion",
  "created": 1755268559,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The wind is typically associated with the color white or grey, as it can carry dust, sand, or other particles. However, it is not a color in the traditional sense.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 38,
    "total_tokens": 75,
    "completion_tokens": 37,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}

Copy to Clipboard

Toggle word wrap

Chapter 7. Inference serving with Podman on IBM Power with IBM Spyre AI accelerators
Copy link

Serve and inference a large language model with Podman and Red Hat AI Inference Server running on IBM Power with IBM Spyre AI accelerators.

Prerequisites

You have access to an IBM Power 11 server running RHEL 9.6 with IBM Spyre for Power AI accelerators installed.
You are logged in as a user with sudo access.
You have installed Podman.
You have access to registry.redhat.io and have logged in.
You have installed the Service Report tool. See IBM Power Systems service and productivity tools.
You have created a sentient security group and added your Spyre user to the group.

Procedure

Open a terminal on your server host, and log in to registry.redhat.io:
```
podman login registry.redhat.io
```
```
$ podman login registry.redhat.io
```
Copy to Clipboard Toggle word wrap

Run the servicereport command to verify your IBM Spyre hardware:

servicereport -r -p spyre

$ servicereport -r -p spyre

Copy to Clipboard

Toggle word wrap

Example output

servicereport 2.2.5

Spyre configuration checks                          PASS

  VFIO Driver configuration                         PASS
  User memlock configuration                        PASS
  sos config                                        PASS
  sos package                                       PASS
  VFIO udev rules configuration                     PASS
  User group configuration                          PASS
  VFIO device permission                            PASS
  VFIO kernel module loaded                         PASS
  VFIO module dep configuration                     PASS

Memlock limit is set for the sentient group.
Spyre user must be in the sentient group.
To add run below command:
        sudo usermod -aG sentient <user>
        Example:
        sudo usermod -aG sentient abc
        Re-login as <user>.

servicereport 2.2.5

Spyre configuration checks                          PASS

  VFIO Driver configuration                         PASS
  User memlock configuration                        PASS
  sos config                                        PASS
  sos package                                       PASS
  VFIO udev rules configuration                     PASS
  User group configuration                          PASS
  VFIO device permission                            PASS
  VFIO kernel module loaded                         PASS
  VFIO module dep configuration                     PASS

Memlock limit is set for the sentient group.
Spyre user must be in the sentient group.
To add run below command:
        sudo usermod -aG sentient <user>
        Example:
        sudo usermod -aG sentient abc
        Re-login as <user>.

Copy to Clipboard

Toggle word wrap

Pull the Red Hat AI Inference Server image by running the following command:
```
podman pull registry.redhat.io/rhaiis/vllm-spyre:3.3.0
```
```
$ podman pull registry.redhat.io/rhaiis/vllm-spyre:3.3.0
```
Copy to Clipboard Toggle word wrap
If your system has SELinux enabled, configure SELinux to allow device access:
```
sudo setsebool -P container_use_devices 1
```
```
$ sudo setsebool -P container_use_devices 1
```
Copy to Clipboard Toggle word wrap

Use lspci -v to verify that the container can access the host system IBM Spyre AI accelerators:

podman run -it --rm --pull=newer \
    --security-opt=label=disable \
    --device=/dev/vfio \
    --group-add keep-groups \
    --entrypoint="lspci" \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0

$ podman run -it --rm --pull=newer \
    --security-opt=label=disable \
    --device=/dev/vfio \
    --group-add keep-groups \
    --entrypoint="lspci" \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0

Copy to Clipboard

Toggle word wrap

Example output

0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

Copy to Clipboard

Toggle word wrap

Create a volume to mount into the container and adjust the container permissions so that the container can use it.
```
mkdir -p ~/models && chmod g+rwX ~/models
```
```
$ mkdir -p ~/models && chmod g+rwX ~/models
```
Copy to Clipboard Toggle word wrap
Download the granite-3.3-8b-instruct model into the models/ folder. See Downloading models for more information.

Gather the Spyre IDs for the VLLM_AIU_PCIE_IDS variable:

lspci

$ lspci

Copy to Clipboard

Toggle word wrap

Example output

0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

Copy to Clipboard

Toggle word wrap

Set the SPYRE_IDS variable:

SPYRE_IDS="0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0"

$ SPYRE_IDS="0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0"

Copy to Clipboard

Toggle word wrap

Start the AI Inference Server container. For example, deploy the granite-3.3-8b-instruct model configured for entity extraction inference serving:

podman run \
    --device=/dev/vfio \
    -v $HOME/models:/models \
    -e AIU_PCIE_IDS="${SPYRE_IDS}" \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200G \
    --shm-size 64G \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0 \
        --model /models/granite-3.3-8b-instruct \
        -tp 4 \
        --max-model-len 32768 \
        --max-num-seqs 32

podman run \
    --device=/dev/vfio \
    -v $HOME/models:/models \
    -e AIU_PCIE_IDS="${SPYRE_IDS}" \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200G \
    --shm-size 64G \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0 \
        --model /models/granite-3.3-8b-instruct \
        -tp 4 \
        --max-model-len 32768 \
        --max-num-seqs 32

Copy to Clipboard

Toggle word wrap

Verification

In a separate tab in your terminal, make a request to the model with the API.

curl -X POST -H "Content-Type: application/json" -d '{
    "model": "/models/granite-3.3-8b-instruct"
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

curl -X POST -H "Content-Type: application/json" -d '{
    "model": "/models/granite-3.3-8b-instruct"
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

Copy to Clipboard

Toggle word wrap

Example output

{
    "id": "cmpl-b94beda1d5a4485c9cb9ed4a13072fca",
    "object": "text_completion",
    "created": 1746555421,
    "choices": [
        {
            "index": 0,
            "text": " Paris.\nThe capital of France is Paris.",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 8,
        "total_tokens": 18,
        "completion_tokens": 10,
        "prompt_tokens_details": null
    }
}

{
    "id": "cmpl-b94beda1d5a4485c9cb9ed4a13072fca",
    "object": "text_completion",
    "created": 1746555421,
    "choices": [
        {
            "index": 0,
            "text": " Paris.\nThe capital of France is Paris.",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 8,
        "total_tokens": 18,
        "completion_tokens": 10,
        "prompt_tokens_details": null
    }
}

Copy to Clipboard

Toggle word wrap

7.1. Recommended model inference settings for IBM Power with IBM Spyre AI accelerators
Copy link

The following are the recommended model and AI Inference Server inference serving settings for IBM Power systems with IBM Spyre AI accelerators.

Expand

Table 7.1. Recommended model and inference settings for entity extraction
Model	Batch size	Max input context size	Max output context size	Number of cards per container
granite3.3-8b-instruct	16	3K	3K	1

Expand

Table 7.2. Recommended model and inference settings for RAG (Retrieval-Augmented Generation) embedding
Model	Batch size	Max input context size	Max output context size	Number of cards per container
granite-embedding-125m-english granite-embedding-278m-multilingual	Up to 256	512	Vector of size 768	1
granite-embedding-30m-english granite-embedding-107m-multilingual	Up to 256	512	Vector of size 384	1

Expand

Table 7.3. Recommended settings for RAG inference serving
Model	Batch size	Max input context size	Max output context size	Number of cards per container
granite3.3-8b-instruct	32	4K	4K	4
	16	8K	8K	4
	8	16K	16K	4
	4	32K	32K	4

7.2. Example inference serving configurations for IBM Spyre AI accelerators on IBM Power
Copy link

The following examples describe common Red Hat AI Inference Server workloads on IBM Spyre AI accelerators and IBM Power.

Entity extraction

Select a single Spyre card ID with the output from the lspci command, for example:

SPYRE_IDS="0381:50:00.0"

$ SPYRE_IDS="0381:50:00.0"

Copy to Clipboard

Toggle word wrap

Podman entity extraction example

podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 100GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0 \
        --model /models/granite-3.3-8b-instruct \
        -tp 1 \
        --max-model-len 3072 \
        --max-num-seqs 16

$ podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 100GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0 \
        --model /models/granite-3.3-8b-instruct \
        -tp 1 \
        --max-model-len 3072 \
        --max-num-seqs 16

Copy to Clipboard

Toggle word wrap

RAG inference serving

Select 4 Spyre card IDs with the output from the lspci command, for example:

SPYRE_IDS="0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0"

$ SPYRE_IDS="0381:50:00.0 0382:60:00.0 0383:70:00.0 0384:80:00.0"

Copy to Clipboard

Toggle word wrap

Podman RAG inference serving example

podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/granite-3.3-8b-instruct \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0 \
        --model /models/granite-3.3-8b-instruct \
        -tp 4 \
        --max-model-len 32768 \
        --max-num-seqs 32

$ podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/granite-3.3-8b-instruct \
    -e VLLM_SPYRE_USE_CB=1 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0 \
        --model /models/granite-3.3-8b-instruct \
        -tp 4 \
        --max-model-len 32768 \
        --max-num-seqs 32

Copy to Clipboard

Toggle word wrap

RAG embedding

Select a single Spyre card ID with the output from the lspci command, for example:

SPYRE_IDS="0384:80:00.0"

$ SPYRE_IDS="0384:80:00.0"

Copy to Clipboard

Toggle word wrap

Podman RAG embedding inference serving example

podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/granite-embedding-125m-english \
    -e VLLM_SPYRE_WARMUP_PROMPT_LENS=64 \
    -e VLLM_SPYRE_WARMUP_BATCH_SIZES=64 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0 \
    --model /models/granite-embedding-125m-english \
    -tp 1

$ podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/granite-embedding-125m-english \
    -e VLLM_SPYRE_WARMUP_PROMPT_LENS=64 \
    -e VLLM_SPYRE_WARMUP_BATCH_SIZES=64 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0 \
    --model /models/granite-embedding-125m-english \
    -tp 1

Copy to Clipboard

Toggle word wrap

Re-ranker inference serving

Select a single Spyre AI accelerator card ID with the output from the lspci command, for example:

SPYRE_IDS="0384:80:00.0"

$ SPYRE_IDS="0384:80:00.0"

Copy to Clipboard

Toggle word wrap

Podman re-ranker inference serving example

podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/bge-reranker-v2-m3 \
    -e VLLM_SPYRE_WARMUP_PROMPT_LENS=1024 \
    -e VLLM_SPYRE_WARMUP_BATCH_SIZES=4 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0 \
        --model /models/bge-reranker-v2-m3 \
        -tp 1

$ podman run -d \
    --device=/dev/vfio \
    --name vllm-api \
    -v $HOME/models:/models:Z \
    -e VLLM_AIU_PCIE_IDS="$SPYRE_IDS" \
    -e VLLM_MODEL_PATH=/models/bge-reranker-v2-m3 \
    -e VLLM_SPYRE_WARMUP_PROMPT_LENS=1024 \
    -e VLLM_SPYRE_WARMUP_BATCH_SIZES=4 \
    --pids-limit 0 \
    --userns=keep-id \
    --group-add=keep-groups \
    --memory 200GB \
    --shm-size 64GB \
    -p 8000:8000 \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0 \
        --model /models/bge-reranker-v2-m3 \
        -tp 1

Copy to Clipboard

Toggle word wrap

Chapter 8. Inference serving with Podman on IBM Z with IBM Spyre AI accelerators
Copy link

Serve and inference a large language model with Podman and Red Hat AI Inference Server running on IBM Z with IBM Spyre AI accelerators.

Prerequisites

You have access to an IBM Z (s390x) server running RHEL 9.6 with IBM Spyre for Z AI accelerators installed.
You are logged in as a user with sudo access.
You have installed Podman.
You have access to registry.redhat.io and have logged in.
You have a Hugging Face account and have generated a Hugging Face access token.

Note

IBM Spyre AI accelerator cards support FP16 format model weights only. For compatible models, the Red Hat AI Inference Server inference engine automatically converts weights to FP16 at startup. No additional configuration is needed.

Procedure

Open a terminal on your server host, and log in to registry.redhat.io:
```
podman login registry.redhat.io
```
```
$ podman login registry.redhat.io
```
Copy to Clipboard Toggle word wrap
Pull the Red Hat AI Inference Server image by running the following command:
```
podman pull registry.redhat.io/rhaiis/vllm-spyre:3.3.0
```
```
$ podman pull registry.redhat.io/rhaiis/vllm-spyre:3.3.0
```
Copy to Clipboard Toggle word wrap
If your system has SELinux enabled, configure SELinux to allow device access:
```
sudo setsebool -P container_use_devices 1
```
```
$ sudo setsebool -P container_use_devices 1
```
Copy to Clipboard Toggle word wrap

Use lspci -v to verify that the container can access the host system IBM Spyre AI accelerators:

podman run -it --rm --pull=newer \
    --security-opt=label=disable \
    --device=/dev/vfio \
    --group-add keep-groups \
    --entrypoint="lspci" \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0

$ podman run -it --rm --pull=newer \
    --security-opt=label=disable \
    --device=/dev/vfio \
    --group-add keep-groups \
    --entrypoint="lspci" \
    registry.redhat.io/rhaiis/vllm-spyre:3.3.0

Copy to Clipboard

Toggle word wrap

Example output

0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

0381:50:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0382:60:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0383:70:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)
0384:80:00.0 Processing accelerators: IBM Spyre Accelerator (rev 02)

Copy to Clipboard

Toggle word wrap

Create a volume to mount into the container and adjust the container permissions so that the container can use it.
```
mkdir -p ~/models && chmod g+rwX ~/models
```
```
$ mkdir -p ~/models && chmod g+rwX ~/models
```
Copy to Clipboard Toggle word wrap
Download the granite-3.3-8b-instruct model into the models/ folder. See Downloading models for more information.

Gather the IOMMU group IDs for the available Spyre devices:

lspci

$ lspci

Copy to Clipboard

Toggle word wrap

Example output

0000:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)
0001:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)
0002:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev ff)
0003:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)

0000:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)
0001:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)
0002:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev ff)
0003:00:00.0 Processing accelerators: IBM Spyre Accelerator Virtual Function (rev 02)

Copy to Clipboard

Toggle word wrap

Each line begins with the PCI device address, for example, 0000:00:00.0.

Use the PCI address to determine the IOMMU group ID for the required Spyre card, for example:
```
readlink /sys/bus/pci/devices/<PCI_ADDRESS>/iommu_group
```
```
$ readlink /sys/bus/pci/devices/<PCI_ADDRESS>/iommu_group
```
Copy to Clipboard Toggle word wrap
Example output
```
../../../kernel/iommu_groups/0
```
```
../../../kernel/iommu_groups/0
```
Copy to Clipboard Toggle word wrap
The IOMMU group ID (0) is the trailing number in the readlink output.
Repeat for each required Spyre card.
Set IOMMU_GROUP_ID variables for the required Spyre cards using the readlink output. For example:
```
IOMMU_GROUP_ID0=0
IOMMU_GROUP_ID1=1
IOMMU_GROUP_ID2=2
IOMMU_GROUP_ID3=3
```
```
IOMMU_GROUP_ID0=0
IOMMU_GROUP_ID1=1
IOMMU_GROUP_ID2=2
IOMMU_GROUP_ID3=3
```
Copy to Clipboard Toggle word wrap

Start the AI Inference Server container, passing in the IOMMU group ID variables for the required Spyre devices. For example, deploy the granite-3.3-8b-instruct model configured for entity extraction across 4 Spyre devices:

podman run \
  --device /dev/vfio/vfio \
  --device /dev/vfio/${IOMMU_GROUP_ID0}:/dev/vfio/${IOMMU_GROUP_ID0}  \
  --device /dev/vfio/${IOMMU_GROUP_ID1}:/dev/vfio/${IOMMU_GROUP_ID1}  \
  --device /dev/vfio/${IOMMU_GROUP_ID2}:/dev/vfio/${IOMMU_GROUP_ID2}  \
  --device /dev/vfio/${IOMMU_GROUP_ID3}:/dev/vfio/${IOMMU_GROUP_ID3}  \
  -v $HOME/models:/models:Z \
  --pids-limit 0 \
  --userns=keep-id \
  --group-add=keep-groups \
  --memory 200G \
  --shm-size 64G \
  -p 8000:8000 \
  registry.redhat.io/rhaiis/vllm-spyre:3.3.0 \
    --model /models/granite-3.3-8b-instruct \
    -tp 4 \
    --max-model-len 32768 \
    --max-num-seqs 32

podman run \
  --device /dev/vfio/vfio \
  --device /dev/vfio/${IOMMU_GROUP_ID0}:/dev/vfio/${IOMMU_GROUP_ID0}  \
  --device /dev/vfio/${IOMMU_GROUP_ID1}:/dev/vfio/${IOMMU_GROUP_ID1}  \
  --device /dev/vfio/${IOMMU_GROUP_ID2}:/dev/vfio/${IOMMU_GROUP_ID2}  \
  --device /dev/vfio/${IOMMU_GROUP_ID3}:/dev/vfio/${IOMMU_GROUP_ID3}  \
  -v $HOME/models:/models:Z \
  --pids-limit 0 \
  --userns=keep-id \
  --group-add=keep-groups \
  --memory 200G \
  --shm-size 64G \
  -p 8000:8000 \
  registry.redhat.io/rhaiis/vllm-spyre:3.3.0 \
    --model /models/granite-3.3-8b-instruct \
    -tp 4 \
    --max-model-len 32768 \
    --max-num-seqs 32

Copy to Clipboard

Toggle word wrap

Verification

In a separate tab in your terminal, make a request to the model with the API.

curl -X POST -H "Content-Type: application/json" -d '{
    "model": "/models/granite-3.3-8b-instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

curl -X POST -H "Content-Type: application/json" -d '{
    "model": "/models/granite-3.3-8b-instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

Copy to Clipboard

Toggle word wrap

Example output

{
  "id": "cmpl-7c81cd00ccd04237ac8b5119e86b32a5",
  "object": "text_completion",
  "created": 1764665204,
  "model": "/models/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\nThe answer is Paris. Paris is the capital and most populous city of France, located in the northern part of the country. It is renowned for its history, culture, fashion, and art, attracting",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 57,
    "completion_tokens": 50,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}

{
  "id": "cmpl-7c81cd00ccd04237ac8b5119e86b32a5",
  "object": "text_completion",
  "created": 1764665204,
  "model": "/models/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\nThe answer is Paris. Paris is the capital and most populous city of France, located in the northern part of the country. It is renowned for its history, culture, fashion, and art, attracting",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 57,
    "completion_tokens": 50,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}

Copy to Clipboard

Toggle word wrap

Chapter 9. Serving and inferencing language models with Podman using AWS Trainium and Inferentia AI accelerators
Copy link

Serve and inference a large language model with Podman or Docker and Red Hat AI Inference Server on an AWS cloud instance that has AWS Trainium or Inferentia AI accelerators configured.

AWS Inferentia and AWS Trainium are custom-designed machine learning chips from Amazon Web Services (AWS). Red Hat AI Inference Server integrates with these accelerators through the AWS Neuron SDK, providing a path to deploy vLLM-based inference workloads on AWS cloud infrastructure.

Important

AWS Trainium and Inferentia support is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Prerequisites

You have access to an AWS Inf2, Trn1, Trn1n, or Trn2 instance with AWS Neuron drivers configured. See Neuron setup guide.
You have installed Podman or Docker.
You are logged in as a user that has sudo access.
You have access to the registry.redhat.io image registry.
You have a Hugging Face account and have generated a Hugging Face access token.

Procedure

Open a terminal on your AWS host, and log in to registry.redhat.io:
```
podman login registry.redhat.io
```
```
$ podman login registry.redhat.io
```
Copy to Clipboard Toggle word wrap
Pull the Red Hat AI Inference Server image for Neuron by running the following command:
```
podman pull registry.redhat.io/rhaiis/vllm-neuron-rhel9:3.3.0
```
```
$ podman pull registry.redhat.io/rhaiis/vllm-neuron-rhel9:3.3.0
```
Copy to Clipboard Toggle word wrap

Optional: Verify that the Neuron drivers and devices are available on the host.

Run neuron-ls to verify that Neuron drivers are installed and to view detailed information about the Neuron hardware:

neuron-ls

$ neuron-ls

Copy to Clipboard

Toggle word wrap

Example output

instance-type: trn1.2xlarge
instance-id: i-0b29616c0f73dc323
+--------+--------+----------+--------+--------------+----------+------+
| NEURON | NEURON |  NEURON  | NEURON |     PCI      |   CPU    | NUMA |
| DEVICE | CORES  | CORE IDS | MEMORY |     BDF      | AFFINITY | NODE |
+--------+--------+----------+--------+--------------+----------+------+
| 0      | 2      | 0-1      | 32 GB  | 0000:00:1e.0 | 0-7      | -1   |
+--------+--------+----------+--------+--------------+----------+------+

instance-type: trn1.2xlarge
instance-id: i-0b29616c0f73dc323
+--------+--------+----------+--------+--------------+----------+------+
| NEURON | NEURON |  NEURON  | NEURON |     PCI      |   CPU    | NUMA |
| DEVICE | CORES  | CORE IDS | MEMORY |     BDF      | AFFINITY | NODE |
+--------+--------+----------+--------+--------------+----------+------+
| 0      | 2      | 0-1      | 32 GB  | 0000:00:1e.0 | 0-7      | -1   |
+--------+--------+----------+--------+--------------+----------+------+

Copy to Clipboard

Toggle word wrap

Note the number of Neuron cores available. Use this information to set --tensor-parallel-size argument when starting the container.

List the Neuron devices:
```
ls /dev/neuron*
```
```
$ ls /dev/neuron*
```
Copy to Clipboard Toggle word wrap
Example output
```
/dev/neuron0
```
```
/dev/neuron0
```
Copy to Clipboard Toggle word wrap

Create a volume for mounting into the container and adjust the permissions so that the container can use it:
```
mkdir -p ./.cache/rhaiis && chmod g+rwX ./.cache/rhaiis
```
```
$ mkdir -p ./.cache/rhaiis && chmod g+rwX ./.cache/rhaiis
```
Copy to Clipboard Toggle word wrap

Add the HF_TOKEN Hugging Face token to the private.env file.

echo "export HF_TOKEN=<huggingface_token>" > private.env

$ echo "export HF_TOKEN=<huggingface_token>" > private.env

Copy to Clipboard

Toggle word wrap

Append the HF_HOME variable to the private.env file.
```
echo "export HF_HOME=./.cache/rhaiis" >> private.env
```
```
$ echo "export HF_HOME=./.cache/rhaiis" >> private.env
```
Copy to Clipboard Toggle word wrap
Source the private.env file.
```
source private.env
```
```
$ source private.env
```
Copy to Clipboard Toggle word wrap

Start the AI Inference Server container image:

sudo podman run -it --rm \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -e HF_HUB_OFFLINE=0 \
  --network=host \
  --device=/dev/neuron0 \
  -p 8000:8000 \
  -v $HOME/.cache/rhaiis:/root/.cache/huggingface \
  -v ./.cache/rhaiis:/opt/app-root/src/.cache:Z \
  registry.redhat.io/rhaiis/vllm-neuron-rhel9:3.3.0 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --no-enable-prefix-caching \
  --port 8000 \
  --tensor-parallel-size 2 \
  --additional-config '{ "override_neuron_config": { "async_mode": false } }'

$ sudo podman run -it --rm \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  -e HF_HUB_OFFLINE=0 \
  --network=host \
  --device=/dev/neuron0 \
  -p 8000:8000 \
  -v $HOME/.cache/rhaiis:/root/.cache/huggingface \
  -v ./.cache/rhaiis:/opt/app-root/src/.cache:Z \
  registry.redhat.io/rhaiis/vllm-neuron-rhel9:3.3.0 \
  --model Qwen/Qwen2.5-1.5B-Instruct \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --no-enable-prefix-caching \
  --port 8000 \
  --tensor-parallel-size 2 \
  --additional-config '{ "override_neuron_config": { "async_mode": false } }'

Copy to Clipboard

Toggle word wrap

--device=/dev/neuron0: Map the required Neuron device. Adjust based on your model requirements and available Neuron memory.
--no-enable-prefix-caching: Disable prefix caching for Neuron hardware.
--tensor-parallel-size 2: Set --tensor-parallel-size to match the number of neuron cores being used.
--additional-config '{ "override_neuron_config": { "async_mode": false } }': The --additional-config parameter passes Neuron-specific configuration. Setting async_mode to false is recommended for stability.

Verification

Check that the AI Inference Server server is up. Open a separate tab in your terminal, and make a model request with the API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Briefly, what color is the wind?"}
    ],
    "max_tokens": 50
  }' | jq

$ curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "user", "content": "Briefly, what color is the wind?"}
    ],
    "max_tokens": 50
  }' | jq

Copy to Clipboard

Toggle word wrap

Example output

{
  "id": "chatcmpl-abc123def456",
  "object": "chat.completion",
  "created": 1755268559,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The wind is typically associated with the color white or grey, as it can carry dust, sand, or other particles. However, it is not a color in the traditional sense.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 38,
    "total_tokens": 75,
    "completion_tokens": 37,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}

{
  "id": "chatcmpl-abc123def456",
  "object": "chat.completion",
  "created": 1755268559,
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The wind is typically associated with the color white or grey, as it can carry dust, sand, or other particles. However, it is not a color in the traditional sense.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 38,
    "total_tokens": 75,
    "completion_tokens": 37,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}

Copy to Clipboard

Toggle word wrap

Chapter 10. Serving and inferencing with Podman using CPU (x86_64 AVX2)
Copy link

Serve and inference a large language model with Podman and Red Hat AI Inference Server running on x86_64 CPUs with AVX2 instruction set support.

With CPU-only inference, you can run Red Hat AI Inference Server workloads on x86_64 CPUs without requiring GPU hardware. This feature provides a cost-effective option for development, testing, and small-scale deployments using smaller language models.

Important

{feature-name} is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Note

AVX512 instruction set support is planned for a future release.

Prerequisites

You have installed Podman or Docker.
You are logged in as a user with sudo access.
You have access to registry.redhat.io and have logged in.
You have a Hugging Face account and have generated a Hugging Face access token.
You have access to a Linux server with an x86_64 CPU that supports the AVX2 instruction set:
- Intel Haswell (2013) or newer processors
- AMD Excavator (2015) or newer processors
You have a minimum of 16GB system RAM. 32GB or more is recommended for larger models.

Note

CPU inference is optimized for smaller models, typically under 3 billion parameters. For larger models or production workloads requiring higher throughput, consider using GPU acceleration.

Procedure

Open a terminal on your server host, and log in to registry.redhat.io:
```
podman login registry.redhat.io
```
```
$ podman login registry.redhat.io
```
Copy to Clipboard Toggle word wrap

Pull the CPU inference image by running the following command:

podman pull registry.redhat.io/rhaiis/vllm-cpu-rhel9:3.3.0

$ podman pull registry.redhat.io/rhaiis/vllm-cpu-rhel9:3.3.0

Copy to Clipboard

Toggle word wrap

Create a volume and mount it into the container. Adjust the container permissions so that the container can use it.
```
mkdir -p rhaiis-cache && chmod g+rwX rhaiis-cache
```
```
$ mkdir -p rhaiis-cache && chmod g+rwX rhaiis-cache
```
Copy to Clipboard Toggle word wrap
Create or append your HF_TOKEN Hugging Face token to the private.env file. Source the private.env file.
```
echo "export HF_TOKEN=<your_HF_token>" > private.env
```
```
$ echo "export HF_TOKEN=<your_HF_token>" > private.env
```
Copy to Clipboard Toggle word wrap
```
source private.env
```
```
$ source private.env
```
Copy to Clipboard Toggle word wrap
Verify that your CPU supports the AVX2 instruction set:
```
grep -q avx2 /proc/cpuinfo && echo "AVX2 supported" || echo "AVX2 not supported"
```
```
$ grep -q avx2 /proc/cpuinfo && echo "AVX2 supported" || echo "AVX2 not supported"
```
Copy to Clipboard Toggle word wrap
Important
If your CPU does not support AVX2, you cannot use CPU inference with Red Hat AI Inference Server.
Start the AI Inference Server container image.
```
podman run --rm -it \
--security-opt=label=disable \
--shm-size=4g -p 8000:8000 \
--userns=keep-id:uid=1001 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
--env "VLLM_CPU_KVCACHE_SPACE=4" \
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
registry.redhat.io/rhaiis/vllm-cpu-rhel9:3.3.0 \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
```
$ podman run --rm -it \
--security-opt=label=disable \
--shm-size=4g -p 8000:8000 \
--userns=keep-id:uid=1001 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
--env "VLLM_CPU_KVCACHE_SPACE=4" \
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
registry.redhat.io/rhaiis/vllm-cpu-rhel9:3.3.0 \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
Copy to Clipboard Toggle word wrap
- --security-opt=label=disable: Disables SELinux label relabeling for volume mounts. Required for systems where SELinux is enabled. Without this option, the container might fail to start.
- --shm-size=4g -p 8000:8000: Specifies the shared memory size and port mapping. Increase --shm-size to 8GB if you experience shared memory issues.
- --userns=keep-id:uid=1001: Maps the host UID to the effective UID of the vLLM process in the container. Alternatively, you can pass --user=0, but this is less secure because it runs vLLM as root inside the container.
- --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN": Specifies the Hugging Face API access token. Set and export HF_TOKEN with your Hugging Face token.
- --env "VLLM_CPU_KVCACHE_SPACE=4": Allocates 4GB for the CPU key-value cache. Increase this value for larger models or longer context lengths. The default is 4GB.
- -v ./rhaiis-cache:/opt/app-root/src/.cache:Z: Mounts the cache directory with SELinux context. The :Z suffix is required for systems where SELinux is enabled. On Debian, Ubuntu, or Docker without SELinux, omit the :Z suffix.
- --model TinyLlama/TinyLlama-1.1B-Chat-v1.0: Specifies the Hugging Face model to serve. For CPU inference, use smaller models (under 3B parameters) for optimal performance.

Verification

In a separate tab in your terminal, make a request to the model with the API.

curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

Copy to Clipboard

Toggle word wrap

The model returns a valid JSON response answering your question.

Chapter 11. Validating Red Hat AI Inference Server benefits using key metrics
Copy link

Use the following metrics to evaluate the performance of the LLM model being served with AI Inference Server:

Time to first token (TTFT): The time from when a request is sent to when the first token of the response is received.
Time per output token (TPOT): The average time it takes to generate each token after the first one.
Latency: The total time required to generate the full response.
Throughput: The total number of output tokens the model can produce at the same time across all users and requests.

Complete the procedure below to run a benchmark test that shows how AI Inference Server, and other inference servers, perform according to these metrics.

Prerequisites

AI Inference Server container image
GitHub account
Python 3.9 or higher

Procedure

On your host system, start an AI Inference Server container and serve a model.

podman run --rm -it --device nvidia.com/gpu=all \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
--security-opt=label=disable \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8

$ podman run --rm -it --device nvidia.com/gpu=all \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
--security-opt=label=disable \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8

Copy to Clipboard

Toggle word wrap

In a separate terminal tab, install the benchmark tool dependencies.
```
pip install vllm pandas datasets
```
```
$ pip install vllm pandas datasets
```
Copy to Clipboard Toggle word wrap

Clone the vLLM Git repository:

git clone https://github.com/vllm-project/vllm.git

$ git clone https://github.com/vllm-project/vllm.git

Copy to Clipboard

Toggle word wrap

Run the ./vllm/benchmarks/benchmark_serving.py script.

python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 100 --dataset-name random  --random-input 1024 --random-output 512 --port 8000

$ python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 100 --dataset-name random  --random-input 1024 --random-output 512 --port 8000

Copy to Clipboard

Toggle word wrap

Verification

The results show how AI Inference Server performs according to key server metrics:

============ Serving Benchmark Result ============
Successful requests:                    100
Benchmark duration (s):                 4.61
Total input tokens:                     102300
Total generated tokens:                 40493
Request throughput (req/s):             21.67
Output token throughput (tok/s):        8775.85
Total Token throughput (tok/s):         30946.83
---------------Time to First Token----------------
Mean TTFT (ms):                         193.61
Median TTFT (ms):                       193.82
P99 TTFT (ms):                          303.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                         9.06
Median TPOT (ms):                       8.57
P99 TPOT (ms):                          13.57
---------------Inter-token Latency----------------
Mean ITL (ms):                          8.54
Median ITL (ms):                        8.49
P99 ITL (ms):                           13.14
==================================================

============ Serving Benchmark Result ============
Successful requests:                    100
Benchmark duration (s):                 4.61
Total input tokens:                     102300
Total generated tokens:                 40493
Request throughput (req/s):             21.67
Output token throughput (tok/s):        8775.85
Total Token throughput (tok/s):         30946.83
---------------Time to First Token----------------
Mean TTFT (ms):                         193.61
Median TTFT (ms):                       193.82
P99 TTFT (ms):                          303.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                         9.06
Median TPOT (ms):                       8.57
P99 TPOT (ms):                          13.57
---------------Inter-token Latency----------------
Mean ITL (ms):                          8.54
Median ITL (ms):                        8.49
P99 ITL (ms):                           13.14
==================================================

Copy to Clipboard

Toggle word wrap

Try changing the parameters of this benchmark and running it again. Notice how vllm as a backend compares to other options. Throughput should be consistently higher, while latency should be lower.

Other options for --backend are: tgi, lmdeploy, deepspeed-mii, openai, and openai-chat
Other options for --dataset-name are: sharegpt, burstgpt, sonnet, random, hf

Additional resources

vLLM documentation
LLM Inference Performance Engineering: Best Practices, by Mosaic AI Research, which explains metrics such as throughput and latency

Chapter 12. Troubleshooting
Copy link

The following troubleshooting information for Red Hat AI Inference Server 3.3.0 describes common problems related to model loading, memory, model response quality, networking, and GPU drivers. Where available, workarounds for common issues are described.

Most common issues in vLLM relate to installation, model loading, memory management, and GPU communication. Most problems can be resolved by using a correctly configured environment, ensuring compatible hardware and software versions, and following the recommended configuration practices.

Important

For persistent issues, export VLLM_LOGGING_LEVEL=DEBUG to enable debug logging and then check the logs.

export VLLM_LOGGING_LEVEL=DEBUG

$ export VLLM_LOGGING_LEVEL=DEBUG

Copy to Clipboard

Toggle word wrap

12.1. Model loading errors
Copy link

When you run the Red Hat AI Inference Server container image without specifying a user namespace, an unrecognized model error is returned.

podman run --rm -it \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8

podman run --rm -it \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8

Copy to Clipboard

Toggle word wrap

Example output

ValueError: Unrecognized model in RedHatAI/Llama-3.2-1B-Instruct-FP8. Should have a model_type key in its config.json

ValueError: Unrecognized model in RedHatAI/Llama-3.2-1B-Instruct-FP8. Should have a model_type key in its config.json

Copy to Clipboard

Toggle word wrap

To resolve this error, pass --userns=keep-id:uid=1001 as the first Podman parameter to ensure that the container runs with the root user.

podman run --rm -it \
--userns=keep-id:uid=1001 \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:{rhaiis-version} \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8

podman run --rm -it \
--userns=keep-id:uid=1001 \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:{rhaiis-version} \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8

Copy to Clipboard

Toggle word wrap

Sometimes when Red Hat AI Inference Server downloads the model, the download fails or gets stuck. To prevent the model download from hanging, first download the model using the huggingface-cli. For example:
```
huggingface-cli download <MODEL_ID> --local-dir <DOWNLOAD_PATH>
```
```
$ huggingface-cli download <MODEL_ID> --local-dir <DOWNLOAD_PATH>
```
Copy to Clipboard Toggle word wrap
When serving the model, pass the local model path to vLLM to prevent the model from being downloaded again.
When Red Hat AI Inference Server loads a model from disk, the process sometimes hangs. Large models consume memory, and if memory runs low, the system slows down as it swaps data between RAM and disk. Slow network file system speeds or a lack of available memory can trigger excessive swapping. This can happen in clusters where file systems are shared between cluster nodes.
Where possible, store the model in a local disk to prevent slow down during model loading. Ensure that the system has sufficient CPU memory available.
Ensure that your system has enough CPU capacity to handle the model.

Sometimes, Red Hat AI Inference Server fails to inspect the model. Errors are reported in the log. For example:

#...
  File "vllm/model_executor/models/registry.py", line xxx, in \_raise_for_unsupported
    raise ValueError(
ValueError: Model architectures [''] failed to be inspected. Please check the logs for more details.

#...
  File "vllm/model_executor/models/registry.py", line xxx, in \_raise_for_unsupported
    raise ValueError(
ValueError: Model architectures [''] failed to be inspected. Please check the logs for more details.

Copy to Clipboard

Toggle word wrap

The error occurs when vLLM fails to import the model file, which is usually related to missing dependencies or outdated binaries in the vLLM build.

Some model architectures are not supported. Refer to the list of Red Hat AI validated models. For example, the following errors indicate that the model you are trying to use is not supported:

Traceback (most recent call last):
#...
  File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
    for arch in architectures:
TypeError: 'NoneType' object is not iterable

Traceback (most recent call last):
#...
  File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
    for arch in architectures:
TypeError: 'NoneType' object is not iterable

Copy to Clipboard

Toggle word wrap

#...
  File "vllm/model_executor/models/registry.py", line xxx, in \_raise_for_unsupported
    raise ValueError(
ValueError: Model architectures [''] are not supported for now. Supported architectures:
#...

#...
  File "vllm/model_executor/models/registry.py", line xxx, in \_raise_for_unsupported
    raise ValueError(
ValueError: Model architectures [''] are not supported for now. Supported architectures:
#...

Copy to Clipboard

Toggle word wrap

Note

Some architectures such as DeepSeekV2VL require the architecture to be explicitly specified using the --hf_overrides flag, for example:

--hf_overrides '{\"architectures\": [\"DeepseekVLV2ForCausalLM\"]}

--hf_overrides '{\"architectures\": [\"DeepseekVLV2ForCausalLM\"]}

Copy to Clipboard

Toggle word wrap

Sometimes a runtime error occurs for certain hardware when you load 8-bit floating point (FP8) models. FP8 requires GPU hardware acceleration. Errors occur when you load FP8 models like deepseek-r1 or models tagged with the F8_E4M3 tensor type. For example:

triton.compiler.errors.CompilationError: at 1:0:
def \_per_token_group_quant_fp8(
\^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[rank0]:[W502 11:12:56.323757996 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

triton.compiler.errors.CompilationError: at 1:0:
def \_per_token_group_quant_fp8(
\^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[rank0]:[W502 11:12:56.323757996 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

Copy to Clipboard

Toggle word wrap

Note

Review Getting started to ensure your specific accelerator is supported. Accelerators that are currently supported for FP8 models include:

Sometimes when serving a model a runtime error occurs that is related to the host system. For example, you might see errors in the log like this:

INFO 05-07 19:15:17 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048.
OMP: Error #179: Function Can't open SHM failed:
OMP: System error #0: Success
Traceback (most recent call last):
  File "/opt/app-root/bin/vllm", line 8, in <module>
    sys.exit(main())
..........................    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.

INFO 05-07 19:15:17 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048.
OMP: Error #179: Function Can't open SHM failed:
OMP: System error #0: Success
Traceback (most recent call last):
  File "/opt/app-root/bin/vllm", line 8, in <module>
    sys.exit(main())
..........................    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.

Copy to Clipboard

Toggle word wrap

You can work around this issue by passing the --shm-size=2g argument when starting vllm.

12.2. Memory optimization
Copy link

If the model is too large to run with a single GPU, you will get out-of-memory (OOM) errors. Use memory optimization options such as quantization, tensor parallelism, or reduced precision to reduce the memory consumption. For more information, see Conserving memory.

12.3. Generated model response quality
Copy link

In some scenarios, the quality of the generated model responses might deteriorate after an update.
Default sampling parameters source have been updated in newer versions. For vLLM version 0.8.4 and higher, the default sampling parameters come from the generation_config.json file that is provided by the model creator. In most cases, this should lead to higher quality responses, because the model creator is likely to know which sampling parameters are best for their model. However, in some cases the defaults provided by the model creator can lead to degraded performance.
If you experience this problem, try serving the model with the old defaults by using the --generation-config vllm server argument.
Important
If applying the --generation-config vllm server argument improves the model output, continue to use the vLLM defaults and petition the model creator on Hugging Face to update their default generation_config.json so that it produces better quality generations.

12.4. CUDA accelerator errors
Copy link

You might experience a self.graph.replay() error when running a model using CUDA accelerators.
If vLLM crashes and the error trace captures the error somewhere around the self.graph.replay() method in the vllm/worker/model_runner.py module, this is most likely a CUDA error that occurs inside the CUDAGraph class.
To identify the particular CUDA operation that causes the error, add the --enforce-eager server argument to the vllm command line to disable CUDAGraph optimization and isolate the problematic CUDA operation.
You might experience accelerator and CPU communication problems that are caused by incorrect hardware or driver settings.
NVIDIA Fabric Manager is required for multi-GPU systems for some types of NVIDIA GPUs. The nvidia-fabricmanager package and associated systemd service might not be installed or the package might not be running.
Run the diagnostic Python script to check whether the NVIDIA Collective Communications Library (NCCL) and Gloo library components are communicating correctly.
On an NVIDIA system, check the fabric manager status by running the following command:
```
systemctl status nvidia-fabricmanager
```
```
$ systemctl status nvidia-fabricmanager
```
Copy to Clipboard Toggle word wrap
On successfully configured systems, the service should be active and running with no errors.
Running vLLM with tensor parallelism enabled and setting --tensor-parallel-size to be greater than 1 on NVIDIA Multi-Instance GPU (MIG) hardware causes an AssertionError during the initial model loading or shape checking phase. This typically occurs as one of the first errors when starting vLLM.

12.5. Networking errors
Copy link

You might experience network errors with complicated network configurations.
To troubleshoot network issues, search the logs for DEBUG statements where an incorrect IP address is listed, for example:
```
DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://<incorrect_ip_address>:54641 backend=nccl
```
```
DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://<incorrect_ip_address>:54641 backend=nccl
```
Copy to Clipboard Toggle word wrap
To correct the issue, set the correct IP address with the VLLM_HOST_IP environment variable, for example:
```
export VLLM_HOST_IP=<correct_ip_address>
```
```
$ export VLLM_HOST_IP=<correct_ip_address>
```
Copy to Clipboard Toggle word wrap
Specify the network interface that is tied to the IP address for NCCL and Gloo:
```
export NCCL_SOCKET_IFNAME=<your_network_interface>
```
```
$ export NCCL_SOCKET_IFNAME=<your_network_interface>
```
Copy to Clipboard Toggle word wrap
```
export GLOO_SOCKET_IFNAME=<your_network_interface>
```
```
$ export GLOO_SOCKET_IFNAME=<your_network_interface>
```
Copy to Clipboard Toggle word wrap

12.6. Python multiprocessing errors
Copy link

You might experience Python multiprocessing warnings or runtime errors. This can be caused by code that is not properly structured for Python multiprocessing. The following is an example console warning:

WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
    initialized. We must use the `spawn` multiprocessing start method. Setting
    VLLM_WORKER_MULTIPROC_METHOD to 'spawn'.

WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
    initialized. We must use the `spawn` multiprocessing start method. Setting
    VLLM_WORKER_MULTIPROC_METHOD to 'spawn'.

Copy to Clipboard

Toggle word wrap

The following is an example Python runtime error:

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ = "__main__":
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ = "__main__":
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html

Copy to Clipboard

Toggle word wrap

To resolve the runtime error, update your Python code to guard the usage of vllm behind an if__name__ = "__main__": block, for example:

if __name__ = "__main__":
    import vllm

    llm = vllm.LLM(...)

if __name__ = "__main__":
    import vllm

    llm = vllm.LLM(...)

Copy to Clipboard

Toggle word wrap

12.7. GPU driver or device pass-through issues
Copy link

When you run the Red Hat AI Inference Server container image, sometimes it is unclear whether device pass-through errors are being caused by GPU drivers or tools such as the NVIDIA Container Toolkit.

Check that the NVIDIA Container toolkit that is installed on the host machine can see the host GPUs:

nvidia-ctk cdi list

$ nvidia-ctk cdi list

Copy to Clipboard

Toggle word wrap

Example output

#...
nvidia.com/gpu=GPU-0fe9bb20-207e-90bf-71a7-677e4627d9a1
nvidia.com/gpu=GPU-10eff114-f824-a804-e7b7-e07e3f8ebc26
nvidia.com/gpu=GPU-39af96b4-f115-9b6d-5be9-68af3abd0e52
nvidia.com/gpu=GPU-3a711e90-a1c5-3d32-a2cd-0abeaa3df073
nvidia.com/gpu=GPU-6f5f6d46-3fc1-8266-5baf-582a4de11937
nvidia.com/gpu=GPU-da30e69a-7ba3-dc81-8a8b-e9b3c30aa593
nvidia.com/gpu=GPU-dc3c1c36-841b-bb2e-4481-381f614e6667
nvidia.com/gpu=GPU-e85ffe36-1642-47c2-644e-76f8a0f02ba7
nvidia.com/gpu=all

#...
nvidia.com/gpu=GPU-0fe9bb20-207e-90bf-71a7-677e4627d9a1
nvidia.com/gpu=GPU-10eff114-f824-a804-e7b7-e07e3f8ebc26
nvidia.com/gpu=GPU-39af96b4-f115-9b6d-5be9-68af3abd0e52
nvidia.com/gpu=GPU-3a711e90-a1c5-3d32-a2cd-0abeaa3df073
nvidia.com/gpu=GPU-6f5f6d46-3fc1-8266-5baf-582a4de11937
nvidia.com/gpu=GPU-da30e69a-7ba3-dc81-8a8b-e9b3c30aa593
nvidia.com/gpu=GPU-dc3c1c36-841b-bb2e-4481-381f614e6667
nvidia.com/gpu=GPU-e85ffe36-1642-47c2-644e-76f8a0f02ba7
nvidia.com/gpu=all

Copy to Clipboard

Toggle word wrap

Ensure that the NVIDIA accelerator configuration has been created on the host machine:
```
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```
```
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```
Copy to Clipboard Toggle word wrap

Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:

podman run --rm -it --security-opt=label=disable --device nvidia.com/gpu=all nvcr.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi

$ podman run --rm -it --security-opt=label=disable --device nvidia.com/gpu=all nvcr.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi

Copy to Clipboard

Toggle word wrap

Example output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
| N/A   32C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
| N/A   29C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
| N/A   32C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
| N/A   29C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Copy to Clipboard

Toggle word wrap

12.8. Troubleshooting IBM Power issues
Copy link

If you are unable to access the model data from the AI Inference Server container, complete the following steps:

Verify that the /models folder mapping to the container is correct
Review the host SELinux settings
Ensure that you have applied appropriate permissions on the $HOME/models folder, for example:
```
chmod -R 755 $HOME/models
```
```
$ chmod -R 755 $HOME/models
```
Copy to Clipboard Toggle word wrap

Ensure that you are using the :Z option for the Podman volume mounts:

podman run -d --device=/dev/vfio \
     -v $HOME/models:/models:Z \
     # ...

$ podman run -d --device=/dev/vfio \
     -v $HOME/models:/models:Z \
     # ...

Copy to Clipboard

Toggle word wrap

Ensure that you set VLLM_SPYRE_USE_CB=1 for decoding models.

12.8.1. IBM Spyre for Power AI acclerator card problems
Copy link

Ensure that the IBM Spyre AI accelerator cards are visible on the host. Use lspci to verify that the cards are available.
Ensure your user is in the sentient group.
Use the Service Report tool to diagnose and correct card access issues. See IBM Power Systems service and productivity tools.

12.8.2. IBM Spyre for Power performance issues
Copy link

Ensure all Spyre cards are securely seated in the first four slots of the IBM Power server I/O drawer. The first four slots have the highest speed PCIe interfaces.
Ensure that cards assigned to an LPAR are all in the same drawer. Do not separate cards across drawers as this increases I/O latency. See IBM Power11 documentation for more information.

If you encounter errors with the IBM Spyre AI accelerator card, you can use the aiu-smi tool alongside the workload you want to profile. Perform the following steps:

Start the model.

From a second terminal, query the model. For example:

curl http://127.0.0.1:8000/v1/completions -H "Content-Type: application/json" \
    -d '{ "model": "/models/granite-3.3-8b-instruct",
          "prompt": "Write me a long story about surfing dogs in Malibu.",
          "max_tokens": 8128,
          "temperature": 1,
          "n": 10
        }'

$ curl http://127.0.0.1:8000/v1/completions -H "Content-Type: application/json" \
    -d '{ "model": "/models/granite-3.3-8b-instruct",
          "prompt": "Write me a long story about surfing dogs in Malibu.",
          "max_tokens": 8128,
          "temperature": 1,
          "n": 10
        }'

Copy to Clipboard

Toggle word wrap

From a third terminal, run the aiu-smi tool:
```
podman exec -it <CONTAINER_ID> -c aiu-smi
```
```
$ podman exec -it <CONTAINER_ID> -c aiu-smi
```
Copy to Clipboard Toggle word wrap

Alternatively, exec into the running container and run aiu-smi. For example:

podman exec -it <CONTAINER_ID> bash

$ podman exec -it <CONTAINER_ID> bash

Copy to Clipboard

Toggle word wrap

Run the aiu-smi tool inside the container:

aiu-smi

[senuser@689230aca2ba ~]$ aiu-smi

Copy to Clipboard

Toggle word wrap

Example aiu-smi output

#MetricFiles
0 /tmp/metrics.0181:50:00.0
1 /tmp/metrics.0182:60:00.0
2 /tmp/metrics.0183:70:00.0
3 /tmp/metrics.0184:80:00.0
#ID Date      Time      hostcpu hostmem    pwr  gtemp   busy    rdmem    wrmem    rxpci    txpci   rdrdma   wrrdma   rsvmem
#   YYYYMMDD  HH:MM:SS        %       %      W      C      %     GB/s     GB/s     GB/s     GB/s     GB/s     GB/s       MB
  0 20251103  20:18:36    951.6    11.5   33.8   34.1     96   41.221    5.480    0.967    0.964    0.000    0.000    0.000
  1 20251103  20:18:36    951.6    11.5   30.6   33.0     96   41.201    5.464    0.967    0.964    0.000    0.000    0.000
  2 20251103  20:18:36    951.6    11.5   40.5   34.7     96   41.266    5.473    0.969    0.966    0.000    0.000    0.000
  3 20251103  20:18:36    951.6    11.5   37.3   39.2     96   41.358    5.484    0.971    0.968    0.000    0.000    0.000

#MetricFiles
# 0 /tmp/metrics.0181:50:00.0
# 1 /tmp/metrics.0182:60:00.0
# 2 /tmp/metrics.0183:70:00.0
# 3 /tmp/metrics.0184:80:00.0
#ID Date      Time      hostcpu hostmem    pwr  gtemp   busy    rdmem    wrmem    rxpci    txpci   rdrdma   wrrdma   rsvmem
#   YYYYMMDD  HH:MM:SS        %       %      W      C      %     GB/s     GB/s     GB/s     GB/s     GB/s     GB/s       MB
  0 20251103  20:18:36    951.6    11.5   33.8   34.1     96   41.221    5.480    0.967    0.964    0.000    0.000    0.000
  1 20251103  20:18:36    951.6    11.5   30.6   33.0     96   41.201    5.464    0.967    0.964    0.000    0.000    0.000
  2 20251103  20:18:36    951.6    11.5   40.5   34.7     96   41.266    5.473    0.969    0.966    0.000    0.000    0.000
  3 20251103  20:18:36    951.6    11.5   37.3   39.2     96   41.358    5.484    0.971    0.968    0.000    0.000    0.000

Copy to Clipboard

Toggle word wrap

Chapter 13. Gathering system information with the vLLM collect environment script
Copy link

Use the vllm collect-env command that you run from the Red Hat AI Inference Server container to gather system information for troubleshooting AI Inference Server deployments. This script collects system details, hardware configurations, and dependency information that can help diagnose deployment problems and model inference serving issues.

Prerequisites

You have installed Podman or Docker.
You are logged in as a user with sudo access.
You have access to a Linux server with data center grade AI accelerators installed.
You have pulled and successfully deployed the Red Hat AI Inference Server container.

Procedure

Open a terminal and log in to registry.redhat.io:
```
podman login registry.redhat.io
```
```
$ podman login registry.redhat.io
```
Copy to Clipboard Toggle word wrap
Pull the specific Red Hat AI Inference Server container image for the AI accelerator that is installed. For example, to pull the Red Hat AI Inference Server container for Google cloud TPUs, run the following command:
```
podman pull registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0
```
```
$ podman pull registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0
```
Copy to Clipboard Toggle word wrap

Run the collect environment script in the container:

podman run --rm -it \
  --name vllm-tpu \
  --network=host \
  --privileged \
  --device=/dev/vfio/vfio \
  --device=/dev/vfio/0 \
  -e PJRT_DEVICE=TPU \
  -e HF_HUB_OFFLINE=0 \
  -v ./.cache/rhaiis:/opt/app-root/src/.cache:Z \
  --entrypoint vllm collect-env \
  registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0

$ podman run --rm -it \
  --name vllm-tpu \
  --network=host \
  --privileged \
  --device=/dev/vfio/vfio \
  --device=/dev/vfio/0 \
  -e PJRT_DEVICE=TPU \
  -e HF_HUB_OFFLINE=0 \
  -v ./.cache/rhaiis:/opt/app-root/src/.cache:Z \
  --entrypoint vllm collect-env \
  registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0

Copy to Clipboard

Toggle word wrap

Verification

The vllm collect-env command output details environment information including the following:

System hardware details
Operating system details
Python environment and dependencies
GPU/TPU accelerator information

Review the output for any warnings or errors that might indicate configuration issues. Include the collect-env output for your system when reporting problems to Red Hat Support.

An example Google Cloud TPU report is provided below:

==============================
        System Info
==============================
OS                           : Red Hat Enterprise Linux 9.6 (Plow) (x86_64)
GCC version                  : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version                : Could not collect
CMake version                : version 4.1.0
Libc version                 : glibc-2.34

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.0.dev20250716
Is debug build               : False
CUDA used to build PyTorch   : None
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.9 (main, Jun 20 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-5)] (64-bit runtime)
Python platform              : Linux-6.8.0-1015-gcp-x86_64-with-glibc2.34

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False
CUDA runtime version         : No CUDA
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No CUDA
Nvidia driver version        : No CUDA
cuDNN version                : No CUDA
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               44
On-line CPU(s) list:                  0-43
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 9B14
CPU family:                           25
Model:                                17
Thread(s) per core:                   2
Core(s) per socket:                   22
Socket(s):                            1
Stepping:                             1
BogoMIPS:                             5200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            704 KiB (22 instances)
L1i cache:                            704 KiB (22 instances)
L2 cache:                             22 MiB (22 instances)
L3 cache:                             96 MiB (3 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-43
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; Safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] pyzmq==27.0.1
[pip3] torch==2.9.0.dev20250716
[pip3] torch-xla==2.9.0.dev20250716
[pip3] torchvision==0.24.0.dev20250716
[pip3] transformers==4.55.2
[pip3] triton==3.3.1
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.10.0+rhai1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
  Could not collect

==============================
     Environment Variables
==============================
VLLM_USE_V1=1
VLLM_WORKER_MULTIPROC_METHOD=spawn
VLLM_NO_USAGE_STATS=1
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_default

==============================
        System Info
==============================
OS                           : Red Hat Enterprise Linux 9.6 (Plow) (x86_64)
GCC version                  : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version                : Could not collect
CMake version                : version 4.1.0
Libc version                 : glibc-2.34

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.0.dev20250716
Is debug build               : False
CUDA used to build PyTorch   : None
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.9 (main, Jun 20 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-5)] (64-bit runtime)
Python platform              : Linux-6.8.0-1015-gcp-x86_64-with-glibc2.34

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False
CUDA runtime version         : No CUDA
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No CUDA
Nvidia driver version        : No CUDA
cuDNN version                : No CUDA
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               44
On-line CPU(s) list:                  0-43
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 9B14
CPU family:                           25
Model:                                17
Thread(s) per core:                   2
Core(s) per socket:                   22
Socket(s):                            1
Stepping:                             1
BogoMIPS:                             5200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            704 KiB (22 instances)
L1i cache:                            704 KiB (22 instances)
L2 cache:                             22 MiB (22 instances)
L3 cache:                             96 MiB (3 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-43
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; Safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] pyzmq==27.0.1
[pip3] torch==2.9.0.dev20250716
[pip3] torch-xla==2.9.0.dev20250716
[pip3] torchvision==0.24.0.dev20250716
[pip3] transformers==4.55.2
[pip3] triton==3.3.1
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.10.0+rhai1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
  Could not collect

==============================
     Environment Variables
==============================
VLLM_USE_V1=1
VLLM_WORKER_MULTIPROC_METHOD=spawn
VLLM_NO_USAGE_STATS=1
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_default

Copy to Clipboard

Toggle word wrap

Getting started with Red Hat AI Inference Server

Chapter 1. About AI Inference Server
Copy link

Chapter 2. Product and version compatibility
Copy link

Chapter 3. Reviewing AI Inference Server Python packages
Copy link

Chapter 4. Serving and inferencing with Podman using NVIDIA CUDA AI accelerators
Copy link

Chapter 5. Serving and inferencing with Podman using AMD ROCm AI accelerators
Copy link

Chapter 6. Serving and inferencing language models with Podman using Google TPU AI accelerators
Copy link

Chapter 7. Inference serving with Podman on IBM Power with IBM Spyre AI accelerators
Copy link

7.1. Recommended model inference settings for IBM Power with IBM Spyre AI accelerators
Copy link

7.2. Example inference serving configurations for IBM Spyre AI accelerators on IBM Power
Copy link

Chapter 8. Inference serving with Podman on IBM Z with IBM Spyre AI accelerators
Copy link

Chapter 9. Serving and inferencing language models with Podman using AWS Trainium and Inferentia AI accelerators
Copy link

Chapter 10. Serving and inferencing with Podman using CPU (x86_64 AVX2)
Copy link

Chapter 11. Validating Red Hat AI Inference Server benefits using key metrics
Copy link

Chapter 12. Troubleshooting
Copy link

12.1. Model loading errors
Copy link

12.2. Memory optimization
Copy link

12.3. Generated model response quality
Copy link

12.4. CUDA accelerator errors
Copy link

12.5. Networking errors
Copy link

12.6. Python multiprocessing errors
Copy link

12.7. GPU driver or device pass-through issues
Copy link

12.8. Troubleshooting IBM Power issues
Copy link

12.8.1. IBM Spyre for Power AI acclerator card problems
Copy link

12.8.2. IBM Spyre for Power performance issues
Copy link

Chapter 13. Gathering system information with the vLLM collect environment script
Copy link

Legal Notice
Copy link

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

Making open source more inclusive

About Red Hat

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

Getting started

Getting started with Red Hat AI Inference Server

Chapter 1. About AI Inference ServerCopy linkLink copied to clipboard!

Chapter 2. Product and version compatibilityCopy linkLink copied to clipboard!

Chapter 3. Reviewing AI Inference Server Python packagesCopy linkLink copied to clipboard!

Chapter 4. Serving and inferencing with Podman using NVIDIA CUDA AI acceleratorsCopy linkLink copied to clipboard!

Chapter 5. Serving and inferencing with Podman using AMD ROCm AI acceleratorsCopy linkLink copied to clipboard!

Chapter 6. Serving and inferencing language models with Podman using Google TPU AI acceleratorsCopy linkLink copied to clipboard!

Chapter 7. Inference serving with Podman on IBM Power with IBM Spyre AI acceleratorsCopy linkLink copied to clipboard!

7.1. Recommended model inference settings for IBM Power with IBM Spyre AI acceleratorsCopy linkLink copied to clipboard!

7.2. Example inference serving configurations for IBM Spyre AI accelerators on IBM PowerCopy linkLink copied to clipboard!

Chapter 8. Inference serving with Podman on IBM Z with IBM Spyre AI acceleratorsCopy linkLink copied to clipboard!

Chapter 9. Serving and inferencing language models with Podman using AWS Trainium and Inferentia AI acceleratorsCopy linkLink copied to clipboard!

Chapter 10. Serving and inferencing with Podman using CPU (x86_64 AVX2)Copy linkLink copied to clipboard!

Chapter 11. Validating Red Hat AI Inference Server benefits using key metricsCopy linkLink copied to clipboard!

Chapter 12. TroubleshootingCopy linkLink copied to clipboard!

12.1. Model loading errorsCopy linkLink copied to clipboard!

12.2. Memory optimizationCopy linkLink copied to clipboard!

12.3. Generated model response qualityCopy linkLink copied to clipboard!

12.4. CUDA accelerator errorsCopy linkLink copied to clipboard!

12.5. Networking errorsCopy linkLink copied to clipboard!

12.6. Python multiprocessing errorsCopy linkLink copied to clipboard!

12.7. GPU driver or device pass-through issuesCopy linkLink copied to clipboard!

12.8. Troubleshooting IBM Power issuesCopy linkLink copied to clipboard!

12.8.1. IBM Spyre for Power AI acclerator card problemsCopy linkLink copied to clipboard!

12.8.2. IBM Spyre for Power performance issuesCopy linkLink copied to clipboard!

Chapter 13. Gathering system information with the vLLM collect environment scriptCopy linkLink copied to clipboard!

Legal NoticeCopy linkLink copied to clipboard!

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

Making open source more inclusive

About Red Hat

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

Chapter 1. About AI Inference Server
Copy link

Chapter 2. Product and version compatibility
Copy link

Chapter 3. Reviewing AI Inference Server Python packages
Copy link

Chapter 4. Serving and inferencing with Podman using NVIDIA CUDA AI accelerators
Copy link

Chapter 5. Serving and inferencing with Podman using AMD ROCm AI accelerators
Copy link

Chapter 6. Serving and inferencing language models with Podman using Google TPU AI accelerators
Copy link

Chapter 7. Inference serving with Podman on IBM Power with IBM Spyre AI accelerators
Copy link

7.1. Recommended model inference settings for IBM Power with IBM Spyre AI accelerators
Copy link

7.2. Example inference serving configurations for IBM Spyre AI accelerators on IBM Power
Copy link

Chapter 8. Inference serving with Podman on IBM Z with IBM Spyre AI accelerators
Copy link

Chapter 9. Serving and inferencing language models with Podman using AWS Trainium and Inferentia AI accelerators
Copy link

Chapter 10. Serving and inferencing with Podman using CPU (x86_64 AVX2)
Copy link

Chapter 11. Validating Red Hat AI Inference Server benefits using key metrics
Copy link

Chapter 12. Troubleshooting
Copy link

12.1. Model loading errors
Copy link

12.2. Memory optimization
Copy link

12.3. Generated model response quality
Copy link

12.4. CUDA accelerator errors
Copy link

12.5. Networking errors
Copy link

12.6. Python multiprocessing errors
Copy link

12.7. GPU driver or device pass-through issues
Copy link

12.8. Troubleshooting IBM Power issues
Copy link

12.8.1. IBM Spyre for Power AI acclerator card problems
Copy link

12.8.2. IBM Spyre for Power performance issues
Copy link

Chapter 13. Gathering system information with the vLLM collect environment script
Copy link

Legal Notice
Copy link