Chapter 2. Inference serving modelcar container images with AI Inference Server and Podman


Serve and run inference on a large language model that is stored in a modelcar container image by using Podman and Red Hat AI Inference Server running on NVIDIA CUDA AI accelerators. Modelcar containers provide an OCI-compliant method for packaging and distributing language models as container images.
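
At its simplest, a modelcar image is a minimal container image whose only content is the model files. As an illustration only, a hypothetical Containerfile that copies a locally downloaded model into the /models folder of a minimal base image might look like the following; the base image and the local model folder name are placeholder assumptions, not part of this procedure:

    FROM registry.access.redhat.com/ubi9/ubi-micro:latest
    # Copy the locally downloaded model files into /models in the image
    COPY ./<local-model-folder> /models

You can then build the image with podman build and push it with podman push to a container registry that you have access to.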

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have created a modelcar container image containing the language model you want to serve and pushed it to a container image registry that you have access to.
  • You have access to a Linux server with data center grade NVIDIA AI accelerators installed.

Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

  1. Open a terminal on your server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Optional: Log in to the container registry where your modelcar container image is stored. For example:

    $ podman login quay.io
  3. Pull the relevant NVIDIA CUDA image by running the following command:

    $ podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.2
  4. If your system has SELinux enabled, configure SELinux to allow device access:

    $ sudo setsebool -P container_use_devices 1
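    Optionally, verify the setting; the getsebool command reports container_use_devices --> on when the boolean is enabled:

    $ getsebool container_use_devices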
  5. Create a folder that you will later mount as a volume in the container, and adjust the folder permissions so that the container can use it.

    $ mkdir -p rhaiis-cache
    $ chmod g+rwX rhaiis-cache
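    Optionally, confirm the folder permissions before you start the container, for example:

    $ ls -ld rhaiis-cache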
  6. Start the AI Inference Server container image. Run the following commands:

    1. For NVIDIA CUDA accelerators, if the host system has multiple GPUs that are connected with NVSwitch, start NVIDIA Fabric Manager. To detect whether your system uses NVSwitch, check whether files are present in /proc/driver/nvidia-nvswitch/devices/, and then start the service. Starting NVIDIA Fabric Manager requires root privileges.

      $ ls /proc/driver/nvidia-nvswitch/devices/

      Example output

      0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0

      $ systemctl start nvidia-fabricmanager
      Important

      NVIDIA Fabric Manager is only required on systems with multiple GPUs that use NVSwitch. For more information, see NVIDIA Server Architectures.
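
      You can optionally confirm that the service is running, for example:

      $ systemctl is-active nvidia-fabricmanager

      The command prints active when the service is running.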

      2. Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:

        $ podman run --rm -it \
        --security-opt=label=disable \
        --device nvidia.com/gpu=all \
        nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
        nvidia-smi

        Example output

        +-----------------------------------------------------------------------------------------+
        | NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
        |-----------------------------------------+------------------------+----------------------+
        | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
        | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
        |                                         |                        |               MIG M. |
        |=========================================+========================+======================|
        |   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
        | N/A   32C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
        |                                         |                        |             Disabled |
        +-----------------------------------------+------------------------+----------------------+
        |   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
        | N/A   29C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
        |                                         |                        |             Disabled |
        +-----------------------------------------+------------------------+----------------------+
        
        +-----------------------------------------------------------------------------------------+
        | Processes:                                                                              |
        |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
        |        ID   ID                                                               Usage      |
        |=========================================================================================|
        |  No running processes found                                                             |
        +-----------------------------------------------------------------------------------------+
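
        If the command fails because Podman cannot resolve the nvidia.com/gpu=all device, the Container Device Interface (CDI) specification for your GPUs might be missing. Assuming the NVIDIA Container Toolkit is installed on the host, you can regenerate the specification, for example:

        $ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml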

      3. Start the AI Inference Server container with the modelcar container image mounted:

        $ podman run --rm -it \
          --device nvidia.com/gpu=all \
          --security-opt=label=disable \ 1
          --shm-size=4g \ 2
          --userns=keep-id:uid=1001 \ 3
          -p 8000:8000 \
          -e HF_HUB_OFFLINE=1 \ 4
          -e TRANSFORMERS_OFFLINE=1 \ 5
          --mount type=image,source=registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4-1739210683,destination=/model \ 6
          -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \ 7
          registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.2 \
          --model /model/models \ 8
          --port 8000 \
          --dtype auto \
          --max-model-len 4096 \
          --tensor-parallel-size 2 9

        1. Required for systems where SELinux is enabled. --security-opt=label=disable prevents SELinux from relabeling files in the volume mount. If you do not use this argument, the container might not run successfully.
        2. If you experience an issue with shared memory, increase --shm-size to 8GB.
        3. Maps the host UID to the effective UID of the vLLM process in the container. You can also pass --user=0, but this is less secure than the --userns option. Setting --user=0 runs vLLM as root inside the container.
        4. Prevents Hugging Face Hub from connecting to the internet.
        5. Forces the Transformers library to use only the locally mounted model.
        6. Mounts the modelcar container image directly inside the running rhaiis/vllm-cuda-rhel9 Red Hat AI Inference Server container.
        7. Required for systems where SELinux is enabled. On Debian or Ubuntu operating systems, or when using Docker without SELinux, the :Z suffix is not available.
        8. Points vLLM to the /models folder of the modelcar image that is mounted at /model inside the running AI Inference Server container.
        9. Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
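
        When the container starts and the model finishes loading, you can optionally confirm that the model is being served before you send completion requests; the OpenAI-compatible API exposes a /v1/models endpoint:

        $ curl -s http://localhost:8000/v1/models | jq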

Verification

  • In a separate terminal tab, make a request to the model by using the API:

    $ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of Ireland?",
        "max_tokens": 50
    }' http://localhost:8000/v1/completions | jq

    Example output

    {
      "id": "cmpl-c08ef5087caf4f8f98b1e6d384d131b2",
      "object": "text_completion",
      "created": 1760023279,
      "model": "/model/models",
      "choices": [
        {
          "index": 0,
          "text": "\n\nCork",
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": null,
          "token_ids": null,
          "prompt_logprobs": null,
          "prompt_token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 9,
        "total_tokens": 15,
        "completion_tokens": 6,
        "prompt_tokens_details": null
      },
      "kv_transfer_params": null
    }
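
    If the model provides a chat template, as instruct models typically do, you can also query the OpenAI-compatible chat completions endpoint. For example, a minimal request that names the served model explicitly:

    $ curl -X POST -H "Content-Type: application/json" -d '{
        "model": "/model/models",
        "messages": [{"role": "user", "content": "What is the capital of Ireland?"}],
        "max_tokens": 50
    }' http://localhost:8000/v1/chat/completions | jq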
