Chapter 5. Serving and inferencing with Podman using AMD ROCm AI accelerators
Serve and inference a large language model with Podman and Red Hat AI Inference Server running on AMD ROCm AI accelerators.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with data center grade AMD ROCm AI accelerators installed.
- For AMD GPUs, see Supported hardware for more information about supported vLLM quantization schemes for accelerators.
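Before you start the procedure, it can be useful to confirm that the host exposes the accelerators and that your user has the required group membership. The following is a minimal pre-flight sketch, assuming the amdgpu driver is loaded and the pciutils package is installed; the exact lspci output varies by accelerator model:

$ lspci | grep -iE 'amd|ati'          # AMD devices visible on the PCI bus
$ ls -l /dev/kfd /dev/dri             # ROCm compute and render device nodes
$ groups | grep -Ew 'video|render'    # groups required for GPU access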
Procedure
1. Open a terminal on your server host, and log in to registry.redhat.io:

   $ podman login registry.redhat.io

2. Pull the AMD ROCm image by running the following command:

   $ podman pull registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0

3. If your system has SELinux enabled, configure SELinux to allow device access:

   $ sudo setsebool -P container_use_devices 1
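To confirm that the SELinux boolean is set before you continue, you can query it with getsebool:

$ getsebool container_use_devices     # expected output: container_use_devices --> on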
4. Create a volume and mount it into the container. Adjust the container permissions so that the container can use it.

   $ mkdir -p rhaiis-cache
   $ chmod g+rwX rhaiis-cache

5. Add your Hugging Face token to the private.env file as HF_TOKEN, and then source the file.

   $ echo "export HF_TOKEN=<your_HF_token>" > private.env
   $ source private.env
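Because private.env now contains a credential, you might also want to restrict the file permissions so that only your user can read it. This is an optional hardening step, not required by the procedure:

$ chmod 600 private.env               # limit the token file to the owning user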
6. Start the AI Inference Server container image. For AMD ROCm accelerators:
   a. Use amd-smi static -a to verify that the container can access the host system GPUs:

      $ podman run -ti --rm --pull=newer \
          --security-opt=label=disable \
          --device=/dev/kfd --device=/dev/dri \
          --group-add keep-groups \
          --entrypoint="" \
          registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0 \
          amd-smi static -a

      Where:

      --group-add keep-groups
          Preserves the supplementary groups from the host user. On AMD systems, you must belong to both the video and render groups to access GPUs.
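If the command fails with a permission error, check the host-side device nodes before debugging the container. The render node count also gives a quick estimate for --tensor-parallel-size in the next step; treat it as a heuristic, because integrated GPUs create render nodes too:

$ ls -l /dev/kfd /dev/dri/renderD*    # the device nodes that the container maps in
$ ls /dev/dri/renderD* | wc -l        # render node count, usually one per GPU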
   b. Start the container:
      $ podman run --rm -it \
          --device /dev/kfd --device /dev/dri \
          --security-opt=label=disable \
          --group-add keep-groups \
          --shm-size=4GB -p 8000:8000 \
          --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
          --env "HF_HUB_OFFLINE=0" \
          -v ./rhaiis-cache:/opt/app-root/src/.cache \
          registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0 \
          --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
          --tensor-parallel-size 2

      Where:

      --security-opt=label=disable
          Disables SELinux label relabeling for volume mounts. Without this option, the container might fail to start.

      --shm-size=4GB -p 8000:8000
          Specifies the shared memory size and port mapping. Increase --shm-size to 8GB if you experience shared memory issues.

      --tensor-parallel-size 2
          Specifies the number of GPUs to use for tensor parallelism. Set this value to match the number of available GPUs.
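Downloading and loading the model weights can take several minutes on the first start. Before making requests, you can poll the server until it reports ready. This sketch assumes the image exposes the standard vLLM /health endpoint on the mapped port:

$ until curl -sf http://localhost:8000/health; do sleep 5; done   # exits once /health returns HTTP 200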
Verification
In a separate tab in your terminal, make a request to the model with the API:

$ curl -X POST -H "Content-Type: application/json" \
    -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' \
    http://<your_server_ip>:8000/v1/completions | jq

Example output
{ "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca", "object": "text_completion", "created": 1746555421, "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8", "choices": [ { "index": 0, "text": " Paris.\nThe capital of France is Paris.", "logprobs": null, "finish_reason": "stop", "stop_reason": null, "prompt_logprobs": null } ], "usage": { "prompt_tokens": 8, "total_tokens": 18, "completion_tokens": 10, "prompt_tokens_details": null } }