Chapter 4. Serving and inferencing with Podman using AMD ROCm AI accelerators
Serve and inference a large language model with Podman and Red Hat AI Inference Server running on AMD ROCm AI accelerators.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with data center grade AMD ROCm AI accelerators installed.
- For AMD GPUs, see Supported hardware for more information about supported vLLM quantization schemes.
Procedure
1. Open a terminal on your server host, and log in to registry.redhat.io:

   $ podman login registry.redhat.io
2. Pull the AMD ROCm image by running the following command:

   $ podman pull registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.1
3. If your system has SELinux enabled, configure SELinux to allow device access:

   $ sudo setsebool -P container_use_devices 1
4. Create a volume and mount it into the container, adjusting the permissions so that the container can use it:

   $ mkdir -p rhaiis-cache
   $ chmod g+rwX rhaiis-cache
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create or append your
HF_TOKEN
Hugging Face token to theprivate.env
file. Source theprivate.env
file.echo "export HF_TOKEN=<your_HF_token>" > private.env
$ echo "export HF_TOKEN=<your_HF_token>" > private.env
Copy to Clipboard Copied! Toggle word wrap Toggle overflow source private.env
$ source private.env
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Start the AI Inference Server container image.
For AMD ROCm accelerators:
Use
amd-smi static -a
to verify that the container can access the host system GPUs:Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
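      A minimal sketch of the verification command, assuming the AMD devices are exposed on the host at /dev/kfd and /dev/dri and that the image entrypoint can be overridden to run amd-smi directly:

      $ podman run --rm -it \
          --security-opt=label=disable \
          --device=/dev/kfd \
          --device=/dev/dri \
          --group-add=keep-groups \
          --entrypoint="" \
          registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.1 \
          amd-smi static -a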
      You must belong to both the video and render groups on AMD systems to use the GPUs. To access the GPUs, you must pass the --group-add=keep-groups supplementary groups option into the container.
   b. Start the container, as in the sketch that follows:
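      A sketch of the start command under stated assumptions: RedHatAI/Llama-3.2-1B-Instruct-FP8 stands in for your model ID, /opt/app-root/src/.cache is assumed as the in-container cache path, two GPUs are assumed for --tensor-parallel-size, and the image entrypoint is assumed to forward the trailing arguments to vLLM. Substitute values that match your host and model:

      $ podman run --rm -it \
          --device=/dev/kfd \
          --device=/dev/dri \
          --group-add=keep-groups \
          --security-opt=label=disable \
          --shm-size=4GB \
          -p 8000:8000 \
          --env "HF_TOKEN=$HF_TOKEN" \
          -v ./rhaiis-cache:/opt/app-root/src/.cache \
          registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.1 \
          --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
          --tensor-parallel-size 2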
      1. --security-opt=label=disable prevents SELinux from relabeling files in the volume mount. If you choose not to use this argument, your container might not run successfully.
      2. If you experience an issue with shared memory, increase --shm-size to 8GB.
      3. Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
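      Optionally, before querying the model, verify that the server has finished starting up. vLLM-based servers expose a /health endpoint that returns HTTP 200 once the model is loaded:

      $ curl -s -o /dev/null -w "%{http_code}\n" http://<your_server_ip>:8000/health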
7. In a separate tab in your terminal, make a request to the model with the API. The model field must match the model ID that the server is serving:

   $ curl -X POST -H "Content-Type: application/json" \
       -d '{
         "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
         "prompt": "What is the capital of France?",
         "max_tokens": 50
       }' \
       http://<your_server_ip>:8000/v1/completions | jq
   Example output
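   The exact response depends on the model that you serve; a representative, abbreviated completions response, with illustrative id, timestamp, and token counts, looks like the following:

   {
     "id": "cmpl-0000000000000000",
     "object": "text_completion",
     "created": 1746555421,
     "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
     "choices": [
       {
         "index": 0,
         "text": " The capital of France is Paris.",
         "finish_reason": "stop"
       }
     ],
     "usage": {
       "prompt_tokens": 8,
       "completion_tokens": 8,
       "total_tokens": 16
     }
   }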