Chapter 4. Serving and inferencing with Podman using NVIDIA CUDA AI accelerators

Serve and inference a large language model with Podman and Red Hat AI Inference Server running on NVIDIA CUDA AI accelerators.

Prerequisites

You have installed Podman or Docker.
You are logged in as a user with sudo access.
You have access to registry.redhat.io and have logged in.
You have a Hugging Face account and have generated a Hugging Face access token.
You have access to a Linux server with data center grade NVIDIA AI accelerators installed.
- For NVIDIA GPUs:
  - Install NVIDIA drivers
  - Install the NVIDIA Container Toolkit
  - If your system has multiple NVIDIA GPUs that use NVSwitch, you must have root access to start Fabric Manager

Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

Open a terminal on your server host, and log in to registry.redhat.io:
```
$ podman login registry.redhat.io
```
Pull the relevant the NVIDIA CUDA image by running the following command:
```
$ podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.5
```
If your system has SELinux enabled, configure SELinux to allow device access:
```
$ sudo setsebool -P container_use_devices 1
```
Create a volume and mount it into the container. Adjust the container permissions so that the container can use it.
```
$ mkdir -p rhaiis-cache
```
```
$ chmod g+rwX rhaiis-cache
```
Create or append your HF_TOKEN Hugging Face token to the private.env file. Source the private.env file.
```
$ echo "export HF_TOKEN=<your_HF_token>" > private.env
```
```
$ source private.env
```

Start the AI Inference Server container image.

For NVIDIA CUDA accelerators, if the host system has multiple GPUs and uses NVSwitch, then start NVIDIA Fabric Manager. To detect if your system is using NVSwitch, first check if files are present in /proc/driver/nvidia-nvswitch/devices/, and then start NVIDIA Fabric Manager. Starting NVIDIA Fabric Manager requires root privileges.

$ ls /proc/driver/nvidia-nvswitch/devices/

Example output

0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0

$ systemctl start nvidia-fabricmanager

Important

NVIDIA Fabric Manager is only required on systems with multiple GPUs that use NVSwitch. For more information, see NVIDIA Server Architectures.

Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:

$ podman run --rm -it \
--security-opt=label=disable \
--device nvidia.com/gpu=all \
nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
nvidia-smi

Example output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
| N/A   32C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
| N/A   29C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Start the container.
```
$ podman run --rm -it \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--shm-size=4g -p 8000:8000 \
--userns=keep-id:uid=1001 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.5 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
--tensor-parallel-size 2
```
Where:
--security-opt=label=disable
Disables SELinux label relabeling for volume mounts. Required for systems where SELinux is enabled. Without this option, the container might fail to start.
--shm-size=4g -p 8000:8000
Specifies the shared memory size and port mapping. Increase --shm-size to 8GB if you experience shared memory issues.
--userns=keep-id:uid=1001
Maps the host UID to the effective UID of the vLLM process in the container. Alternatively, you can pass --user=0, but this is less secure because it runs vLLM as root inside the container.
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN"
Specifies the Hugging Face API access token. Set and export HF_TOKEN with your Hugging Face token.
-v ./rhaiis-cache:/opt/app-root/src/.cache:Z
Mounts the cache directory with SELinux context. The :Z suffix is required for systems where SELinux is enabled. On Debian, Ubuntu, or Docker without SELinux, omit the :Z suffix.
--tensor-parallel-size 2
Specifies the number of GPUs to use for tensor parallelism. Set this value to match the number of available GPUs.

In a separate tab in your terminal, make a request to your model with the API.

curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq

Example output

{
    "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca",
    "object": "text_completion",
    "created": 1746555421,
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "choices": [
        {
            "index": 0,
            "text": " Paris.\nThe capital of France is Paris.",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 8,
        "total_tokens": 18,
        "completion_tokens": 10,
        "prompt_tokens_details": null
    }
}

Chapter 4. Serving and inferencing with Podman using NVIDIA CUDA AI accelerators

Learn

Try, buy, & sell

Communities

About Red Hat

Making open source more inclusive

About Red Hat Documentation

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links