Chapter 5. Serving and inferencing with Podman using AMD ROCm AI accelerators
Serve and inference a large language model with Podman and Red Hat AI Inference Server running on AMD ROCm AI accelerators.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with data center grade AMD ROCm AI accelerators installed.
- For AMD GPUs, see Supported hardware for more information about supported vLLM quantization schemes for accelerators.
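Before you start the procedure, it can be useful to confirm that the host exposes the accelerators and that your user has the required group membership. The following is a minimal pre-flight sketch, assuming the amdgpu driver is loaded and the pciutils package is installed; the exact lspci output varies by accelerator model:

$ lspci | grep -iE 'amd|ati'          # AMD devices visible on the PCI bus
$ ls -l /dev/kfd /dev/dri             # ROCm compute and render device nodes
$ groups | grep -Ew 'video|render'    # groups required for GPU access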
Procedure
1. Open a terminal on your server host, and log in to registry.redhat.io:

   $ podman login registry.redhat.io

2. Pull the AMD ROCm image by running the following command:

   $ podman pull registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0

3. If your system has SELinux enabled, configure SELinux to allow device access:

   $ sudo setsebool -P container_use_devices 1
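To confirm that the SELinux boolean is set before you continue, you can query it with getsebool:

$ getsebool container_use_devices     # expected output: container_use_devices --> on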
4. Create a volume and mount it into the container. Adjust the container permissions so that the container can use it.

   $ mkdir -p rhaiis-cache
   $ chmod g+rwX rhaiis-cache

5. Add your Hugging Face token to the private.env file as HF_TOKEN, and then source the file.

   $ echo "export HF_TOKEN=<your_HF_token>" > private.env
   $ source private.env
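Because private.env now contains a credential, you might also want to restrict the file permissions so that only your user can read it. This is an optional hardening step, not required by the procedure:

$ chmod 600 private.env               # limit the token file to the owning user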
6. Start the AI Inference Server container image. For AMD ROCm accelerators:
   a. Use amd-smi static -a to verify that the container can access the host system GPUs:

      $ podman run -ti --rm --pull=newer \
          --security-opt=label=disable \
          --device=/dev/kfd --device=/dev/dri \
          --group-add keep-groups \
          --entrypoint="" \
          registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0 \
          amd-smi static -a

      Where:

      --group-add keep-groups
          Preserves the supplementary groups from the host user. On AMD systems, you must belong to both the video and render groups to access GPUs.
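If the command fails with a permission error, check the host-side device nodes before debugging the container. The render node count also gives a quick estimate for --tensor-parallel-size in the next step; treat it as a heuristic, because integrated GPUs create render nodes too:

$ ls -l /dev/kfd /dev/dri/renderD*    # the device nodes that the container maps in
$ ls /dev/dri/renderD* | wc -l        # render node count, usually one per GPU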
   b. Start the container:
      $ podman run --rm -it \
          --device /dev/kfd --device /dev/dri \
          --security-opt=label=disable \
          --group-add keep-groups \
          --shm-size=4GB -p 8000:8000 \
          --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
          --env "HF_HUB_OFFLINE=0" \
          -v ./rhaiis-cache:/opt/app-root/src/.cache \
          registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0 \
          --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
          --tensor-parallel-size 2

      Where:

      --security-opt=label=disable
          Disables SELinux label relabeling for volume mounts. Without this option, the container might fail to start.

      --shm-size=4GB -p 8000:8000
          Specifies the shared memory size and port mapping. Increase --shm-size to 8GB if you experience shared memory issues.

      --tensor-parallel-size 2
          Specifies the number of GPUs to use for tensor parallelism. Set this value to match the number of available GPUs.
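Downloading and loading the model weights can take several minutes on the first start. Before making requests, you can poll the server until it reports ready. This sketch assumes the image exposes the standard vLLM /health endpoint on the mapped port:

$ until curl -sf http://localhost:8000/health; do sleep 5; done   # exits once /health returns HTTP 200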
Verification
In a separate tab in your terminal, make a request to the model with the API:

$ curl -X POST -H "Content-Type: application/json" \
    -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' \
    http://<your_server_ip>:8000/v1/completions | jq

Example output
{ "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca", "object": "text_completion", "created": 1746555421, "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8", "choices": [ { "index": 0, "text": " Paris.\nThe capital of France is Paris.", "logprobs": null, "finish_reason": "stop", "stop_reason": null, "prompt_logprobs": null } ], "usage": { "prompt_tokens": 8, "total_tokens": 18, "completion_tokens": 10, "prompt_tokens_details": null } }