Chapter 3. Serving and inferencing with AI Inference Server
Serve and inference a large language model with Red Hat AI Inference Server.
Prerequisites
- You have installed Podman or Docker.
- You have access to a Linux server with NVIDIA or AMD GPUs and are logged in as a user with root privileges.
  - For NVIDIA GPUs:
    - Install NVIDIA drivers
    - Install the NVIDIA Container Toolkit
    - If your system has multiple NVIDIA GPUs that use NVSwitch, you must have root access to start Fabric Manager
  - For AMD GPUs:
    - Install ROCm software
    - Verify that you can run ROCm containers
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face token.

Note: AMD GPUs support FP8 (W8A8) and GGUF quantization schemes only. For more information, see Supported hardware.
Procedure
Using the table below, identify the correct image for your infrastructure.
| GPU | AI Inference Server image |
| --- | --- |
| NVIDIA CUDA (T4, A100, L4, L40S, H100, H200) | registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 |
| AMD ROCm (MI210, MI300X) | registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.0.0 |
Open a terminal on your server host, and log in to registry.redhat.io:
$ podman login registry.redhat.io
Pull the relevant image for your GPUs:
$ podman pull registry.redhat.io/rhaiis/vllm-<gpu_type>-rhel9:3.0.0
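The `<gpu_type>` placeholder takes one of the two values from the image table, `cuda` or `rocm`. As a minimal sketch, the lookup can be wrapped in a small helper; `rhaiis_image` is a hypothetical function name, not part of the product.

```shell
# Hypothetical helper: map a GPU type ("cuda" or "rocm") to the image
# reference listed in the image table of this procedure.
rhaiis_image() {
  case "$1" in
    cuda|rocm)
      printf 'registry.redhat.io/rhaiis/vllm-%s-rhel9:3.0.0\n' "$1"
      ;;
    *)
      printf 'unsupported GPU type: %s\n' "$1" >&2
      return 1
      ;;
  esac
}

# Usage, after "podman login registry.redhat.io":
#   podman pull "$(rhaiis_image cuda)"
```

Rejecting unknown values early avoids a failed pull against an image name that does not exist.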
If your system has SELinux enabled, configure SELinux to allow device access:
$ sudo setsebool -P container_use_devices 1
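To confirm the boolean before or after changing it, `getsebool container_use_devices` prints a line such as `container_use_devices --> on`. The following sketch parses that line and sets the boolean only when needed; the function names `sebool_is_on` and `ensure_container_use_devices` are hypothetical.

```shell
# Returns 0 when a line of getsebool output reports the boolean as on,
# for example "container_use_devices --> on".
sebool_is_on() {
  case "$1" in
    *'--> on') return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: set the boolean only when it is not already on.
# Call it on a host where the SELinux tools are installed.
ensure_container_use_devices() {
  if sebool_is_on "$(getsebool container_use_devices)"; then
    echo "container_use_devices is already on"
  else
    sudo setsebool -P container_use_devices 1
  fi
}
```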
Create a directory to mount into the container as a volume. Adjust the directory permissions so that the container can use it.
$ mkdir -p rhaiis-cache
$ chmod g+rwX rhaiis-cache
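In `g+rwX`, the capital `X` grants execute permission only on directories and on files that are already executable, so the group can traverse the cache directory without regular files becoming executable. A minimal sketch that creates the directory and prints the resulting mode for inspection follows; `prepare_cache_dir` is a hypothetical name, and `stat -c` assumes GNU coreutils.

```shell
# Hypothetical helper: create the cache directory, open it to the group,
# and print the resulting mode string (for example drwxrwxr-x).
prepare_cache_dir() {
  local dir="${1:-rhaiis-cache}"
  mkdir -p "$dir" || return 1
  chmod g+rwX "$dir" || return 1
  stat -c '%A' "$dir"
}

# Usage:
#   prepare_cache_dir rhaiis-cache
```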
Create or append your HF_TOKEN Hugging Face token to the private.env file. Source the private.env file.
$ echo "export HF_TOKEN=<your_HF_token>" > private.env
$ source private.env
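Note that `>` overwrites private.env, while `>>` appends to it. Because the file holds a credential, it can also be restricted to the owner. The following is a sketch under those assumptions; `write_hf_env` is a hypothetical helper, and the `chmod 600` step is an addition that the procedure itself does not require.

```shell
# Hypothetical helper: append an HF_TOKEN export line to an env file and
# restrict the file to its owner.
#   $1: token value
#   $2: env file path (defaults to private.env)
write_hf_env() {
  local token="$1" file="${2:-private.env}"
  touch "$file" && chmod 600 "$file" || return 1
  printf 'export HF_TOKEN="%s"\n' "$token" >> "$file"
}

# Usage:
#   write_hf_env "<your_HF_token>"
#   source private.env
```

If the file already contains an older `HF_TOKEN` line, the appended line takes effect when the file is sourced, because later assignments override earlier ones.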
Start the AI Inference Server container image.
For NVIDIA CUDA accelerators:
If the host system has multiple GPUs and uses NVSwitch, then start NVIDIA Fabric Manager. To detect if your system is using NVSwitch, first check if files are present in /proc/driver/nvidia-nvswitch/devices/, and then start NVIDIA Fabric Manager. Starting NVIDIA Fabric Manager requires root privileges.
$ ls /proc/driver/nvidia-nvswitch/devices/
Example output
0000:0c:09.0 0000:0c:0a.0 0000:0c:0b.0 0000:0c:0c.0 0000:0c:0d.0 0000:0c:0e.0
$ systemctl start nvidia-fabricmanager
Important: NVIDIA Fabric Manager is only required on systems with multiple GPUs that use NVSwitch. For more information, see NVIDIA Server Architectures.
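The detection and start steps can be combined so that Fabric Manager is started only when NVSwitch device files are present. This is a minimal sketch; `has_nvswitch` is a hypothetical function name, and the optional directory argument exists only to make the check easy to try against a test directory.

```shell
# Hypothetical helper: succeed when the NVSwitch device directory exists
# and contains at least one entry.
#   $1: directory to inspect (defaults to the path used in this procedure)
has_nvswitch() {
  local dir="${1:-/proc/driver/nvidia-nvswitch/devices}"
  [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Usage (starting the service requires root privileges):
#   if has_nvswitch; then sudo systemctl start nvidia-fabricmanager; fi
```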
Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:
$ podman run --rm -it \
    --security-opt=label=disable \
    --device nvidia.com/gpu=all \
    nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
    nvidia-smi
Start the container. The command takes the following options:
1. Required for systems where SELinux is enabled. --security-opt=label=disable prevents SELinux from relabeling files in the volume mount. If you choose not to use this argument, your container might not run successfully.
2. If you experience an issue with shared memory, increase --shm-size to 8GB.
3. Maps the host UID to the effective UID of the vLLM process in the container. You can also pass --user=0, but this is less secure than the --userns option. Setting --user=0 runs vLLM as root inside the container.
4. Set and export HF_TOKEN with your Hugging Face API access token.
5. Required for systems where SELinux is enabled. On Debian or Ubuntu operating systems, or when using Docker without SELinux, the :Z suffix is not available.
6. Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
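As an illustration only, the following sketch assembles a start command that is consistent with the numbered notes above. The image reference, the GPU device flag, and port 8000 appear elsewhere in this procedure. The `4g` shared-memory size, the `uid=1001` value, the in-container cache path `/opt/app-root/src/.cache`, and the function and variable names are assumptions, and the model identifier is a placeholder that you supply.

```shell
# Hypothetical wrapper around "podman run". Setting PODMAN=echo prints the
# arguments instead of starting a container (dry run).
#   $1: model identifier (placeholder, for example a Hugging Face repo ID)
#   $2: number of GPUs, passed to --tensor-parallel-size (note 6)
# Notes 1 to 5 map to --security-opt, --shm-size, --userns, HF_TOKEN, and :Z.
# Assumed values: 4g, uid=1001, and the in-container cache path.
start_rhaiis_cuda() {
  local model="$1" gpus="${2:-1}"
  ${PODMAN:-podman} run --rm -it \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    --shm-size=4g \
    -p 8000:8000 \
    --userns=keep-id:uid=1001 \
    --env "HF_TOKEN=$HF_TOKEN" \
    -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 \
    --model "$model" \
    --tensor-parallel-size "$gpus"
}

# Usage:
#   source private.env
#   start_rhaiis_cuda <model_name> 2
```

Running the function with `PODMAN=echo` first is a cheap way to review the final argument list before a multi-gigabyte model download begins.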
For AMD ROCm accelerators:
Use amd-smi static -a to verify that the container can access the host system GPUs:
1. You must belong to both the video and render groups on AMD systems to use the GPUs. To access GPUs, you must pass the --group-add=keep-groups supplementary groups option into the container.
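Group membership can be checked before starting any container. The sketch below inspects the output of `id -nG` for both group names; `in_gpu_groups` is a hypothetical function name, and the optional argument exists only so the check can be tried with an arbitrary group list.

```shell
# Hypothetical helper: succeed only when both "video" and "render" appear
# in the group list.
#   $1: space-separated group names (defaults to the current user's groups)
in_gpu_groups() {
  local groups=" ${1:-$(id -nG)} "
  case "$groups" in *" video "*) ;; *) return 1 ;; esac
  case "$groups" in *" render "*) ;; *) return 1 ;; esac
}

# Usage:
#   in_gpu_groups || echo "add your user to the video and render groups"
```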
Start the container:
1. --security-opt=label=disable prevents SELinux from relabeling files in the volume mount. If you choose not to use this argument, your container might not run successfully.
2. If you experience an issue with shared memory, increase --shm-size to 8GB.
3. Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
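For comparison with the NVIDIA case, a ROCm start command consistent with these notes might look like the sketch below. The image reference, `--group-add=keep-groups`, and port 8000 come from this procedure. Passing `/dev/kfd` and `/dev/dri` is the usual way to expose AMD GPUs to a container but is an assumption here, as are the `4g` shared-memory size, the in-container cache path, and the function and variable names.

```shell
# Hypothetical wrapper around "podman run" for ROCm hosts. Setting
# PODMAN=echo prints the arguments instead of starting a container.
#   $1: model identifier (placeholder)
#   $2: number of GPUs, passed to --tensor-parallel-size
# Assumed values: the /dev/kfd and /dev/dri devices, 4g, and the
# in-container cache path.
start_rhaiis_rocm() {
  local model="$1" gpus="${2:-1}"
  ${PODMAN:-podman} run --rm -it \
    --device /dev/kfd --device /dev/dri \
    --security-opt=label=disable \
    --group-add=keep-groups \
    --shm-size=4g \
    -p 8000:8000 \
    --env "HF_TOKEN=$HF_TOKEN" \
    -v ./rhaiis-cache:/opt/app-root/src/.cache \
    registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.0.0 \
    --model "$model" \
    --tensor-parallel-size "$gpus"
}

# Usage:
#   source private.env
#   start_rhaiis_rocm <model_name> 2
```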
In a separate tab in your terminal, make a request to your model with the API.
$ curl -X POST -H "Content-Type: application/json" -d '{ "prompt": "What is the capital of France?", "max_tokens": 50 }' http://<your_server_ip>:8000/v1/completions | jq
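When scripting the request, the JSON body can be built by a helper, and the generated text extracted with `jq`. This sketch assumes the response follows the OpenAI-style completions format, in which the generated text is at `.choices[0].text`. `completion_payload` is a hypothetical function name, and it does not escape quotes or backslashes in the prompt.

```shell
# Hypothetical helper: build a completions request body.
#   $1: prompt (must not contain double quotes or backslashes in this sketch)
#   $2: maximum number of tokens to generate (defaults to 50)
completion_payload() {
  printf '{"prompt": "%s", "max_tokens": %d}' "$1" "${2:-50}"
}

# Usage:
#   curl -s -X POST -H "Content-Type: application/json" \
#     -d "$(completion_payload "What is the capital of France?" 50)" \
#     http://<your_server_ip>:8000/v1/completions | jq -r '.choices[0].text'
```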