
Chapter 3. Serving and inferencing with AI Inference Server


Serve and run inference on a large language model with Red Hat AI Inference Server.

Prerequisites

  • You have installed Podman or Docker.
  • You have access to a Linux server with NVIDIA or AMD GPUs and are logged in as a user with root privileges.

    • For NVIDIA GPUs:

    • For AMD GPUs:

      Note

      AMD GPUs support FP8 (W8A8) and GGUF quantization schemes only. For more information, see Supported hardware.

Procedure

      1. Using the table below, identify the correct image for your infrastructure.

        GPU                                            AI Inference Server image
        NVIDIA CUDA (T4, A100, L4, L40S, H100, H200)   registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0
        AMD ROCm (MI210, MI300X)                       registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.0.0

      2. Open a terminal on your server host, and log in to registry.redhat.io:

        $ podman login registry.redhat.io
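
        Optionally, you can confirm that the login succeeded. For example, podman login --get-login prints the account that is currently logged in to the registry:

        $ podman login --get-login registry.redhat.io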
      3. Pull the relevant image for your GPUs:

        $ podman pull registry.redhat.io/rhaiis/vllm-<gpu_type>-rhel9:3.0.0
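
        Optionally, confirm that the image is now available locally, for example:

        $ podman images | grep rhaiis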
      4. If your system has SELinux enabled, configure SELinux to allow device access:

        $ sudo setsebool -P container_use_devices 1
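
        Optionally, check the current value of the boolean before or after setting it, for example:

        $ getsebool container_use_devices

        Example output

        container_use_devices --> on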
      5. Create a cache directory to mount into the container, and adjust its permissions so that the container can use it.

        $ mkdir -p rhaiis-cache
        $ chmod g+rwX rhaiis-cache
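
        Optionally, confirm that the directory has group read, write, and execute permissions, for example:

        $ ls -ld rhaiis-cache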
      6. Export your Hugging Face HF_TOKEN token in a private.env file, then source the private.env file.

        $ echo "export HF_TOKEN=<your_HF_token>" > private.env
        $ source private.env
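
        Optionally, confirm that the token is exported in your current shell without printing its value, for example:

        $ echo "${HF_TOKEN:+HF_TOKEN is set}"

        The command prints HF_TOKEN is set only when the variable is set to a non-empty value.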
      7. Start the AI Inference Server container image.

        1. For NVIDIA CUDA accelerators:

          1. If the host system has multiple GPUs and uses NVSwitch, start NVIDIA Fabric Manager. To detect whether your system uses NVSwitch, check whether files are present in /proc/driver/nvidia-nvswitch/devices/, and then start NVIDIA Fabric Manager. Starting NVIDIA Fabric Manager requires root privileges.

            $ ls /proc/driver/nvidia-nvswitch/devices/

            Example output

            0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0

            $ systemctl start nvidia-fabricmanager
            Important

            NVIDIA Fabric Manager is only required on systems with multiple GPUs that use NVSwitch. For more information, see NVIDIA Server Architectures.
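
            Optionally, confirm that the service is running, for example:

            $ systemctl is-active nvidia-fabricmanager

            Example output

            active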

          2. Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:

            $ podman run --rm -it \
            --security-opt=label=disable \
            --device nvidia.com/gpu=all \
            nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
            nvidia-smi

            Example output

            +-----------------------------------------------------------------------------------------+
            | NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
            |-----------------------------------------+------------------------+----------------------+
            | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
            | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
            |                                         |                        |               MIG M. |
            |=========================================+========================+======================|
            |   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
            | N/A   32C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
            |                                         |                        |             Disabled |
            +-----------------------------------------+------------------------+----------------------+
            |   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
            | N/A   29C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
            |                                         |                        |             Disabled |
            +-----------------------------------------+------------------------+----------------------+
            
            +-----------------------------------------------------------------------------------------+
            | Processes:                                                                              |
            |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
            |        ID   ID                                                               Usage      |
            |=========================================================================================|
            |  No running processes found                                                             |
            +-----------------------------------------------------------------------------------------+

          3. Start the container.

            $ podman run --rm -it \
            --device nvidia.com/gpu=all \
            --security-opt=label=disable \ 1
            --shm-size=4g -p 8000:8000 \ 2
            --userns=keep-id:uid=1001 \ 3
            --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \ 4
            --env "HF_HUB_OFFLINE=0" \
            --env=VLLM_NO_USAGE_STATS=1 \
            -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \ 5
            registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 \
            --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
            --tensor-parallel-size 2 6

            1 Required for systems where SELinux is enabled. --security-opt=label=disable prevents SELinux from relabeling files in the volume mount. If you choose not to use this argument, your container might not run successfully.
            2 If you experience an issue with shared memory, increase --shm-size to 8GB.
            3 Maps the host UID to the effective UID of the vLLM process in the container. You can also pass --user=0, but this is less secure than the --userns option. Setting --user=0 runs vLLM as root inside the container.
            4 Set and export HF_TOKEN with your Hugging Face API access token.
            5 Required for systems where SELinux is enabled. On Debian or Ubuntu operating systems, or when using Docker without SELinux, the :Z suffix is not available.
            6 Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
        2. For AMD ROCm accelerators:

          1. Use amd-smi static -a to verify that the container can access the host system GPUs:

            $ podman run -ti --rm --pull=newer \
            --security-opt=label=disable \
            --device=/dev/kfd --device=/dev/dri \
            --group-add keep-groups \ 1
            --entrypoint="" \
            registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.0.0 \
            amd-smi static -a

            1 You must belong to both the video and render groups on AMD systems to use the GPUs. To access GPUs, you must pass the --group-add=keep-groups supplementary groups option into the container.
          2. Start the container:

            $ podman run --rm -it \
            --device /dev/kfd --device /dev/dri \
            --security-opt=label=disable \ 1
            --group-add keep-groups \
            --shm-size=4GB -p 8000:8000 \ 2
            --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
            --env "HF_HUB_OFFLINE=0" \
            --env=VLLM_NO_USAGE_STATS=1 \
            -v ./rhaiis-cache:/opt/app-root/src/.cache \
            registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.0.0 \
            --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
            --tensor-parallel-size 2 3

            1 --security-opt=label=disable prevents SELinux from relabeling files in the volume mount. If you choose not to use this argument, your container might not run successfully.
            2 If you experience an issue with shared memory, increase --shm-size to 8GB.
            3 Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
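
        For either accelerator type, you can check that the server is up and serving the model before you send requests. The vLLM-based server exposes OpenAI-compatible endpoints; for example, list the served models from another terminal:

        $ curl -s http://<your_server_ip>:8000/v1/models | jq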
      8. In a separate tab in your terminal, make a request to your model with the API.

        $ curl -X POST -H "Content-Type: application/json" -d '{
            "prompt": "What is the capital of France?",
            "max_tokens": 50
        }' http://<your_server_ip>:8000/v1/completions | jq

        Example output

        {
            "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca",
            "object": "text_completion",
            "created": 1746555421,
            "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
            "choices": [
                {
                    "index": 0,
                    "text": " Paris.\nThe capital of France is Paris.",
                    "logprobs": null,
                    "finish_reason": "stop",
                    "stop_reason": null,
                    "prompt_logprobs": null
                }
            ],
            "usage": {
                "prompt_tokens": 8,
                "total_tokens": 18,
                "completion_tokens": 10,
                "prompt_tokens_details": null
            }
        }
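
        The server also exposes the OpenAI-compatible chat completions endpoint. As a variation on the request above, assuming the model name matches the --model value that you used to start the container, you can send a chat-style request:

        $ curl -X POST -H "Content-Type: application/json" -d '{
            "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
            "messages": [{"role": "user", "content": "What is the capital of France?"}],
            "max_tokens": 50
        }' http://<your_server_ip>:8000/v1/chat/completions | jq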
