Chapter 5. Serving and inferencing with Podman using AMD ROCm AI accelerators

Serve a large language model and run inference on it with Podman and Red Hat AI Inference Server running on AMD ROCm AI accelerators.

Prerequisites

- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with data center grade AMD ROCm AI accelerators installed.
- For AMD GPUs:
  - You have installed the ROCm software.
  - You have verified that you can run ROCm containers.

Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

Open a terminal on your server host, and log in to registry.redhat.io:
```
podman login registry.redhat.io
```

Pull the AMD ROCm image by running the following command:

podman pull registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.4


If your system has SELinux enabled, configure SELinux to allow device access:
```
sudo setsebool -P container_use_devices 1
```
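To confirm that the boolean took effect, you can read it back with getsebool. The following is a minimal sketch: the helper name sebool_is_on is hypothetical, and it assumes the usual `name --> on` or `name --> off` output format of getsebool.

```shell
# Hypothetical helper: succeed if a line of getsebool output reports "on".
# $1: one line of getsebool output, e.g. "container_use_devices --> on"
sebool_is_on() {
    case "$1" in
        *"--> on") return 0 ;;
        *) return 1 ;;
    esac
}

# Only query SELinux when its tools are installed on this host.
if command -v getsebool >/dev/null 2>&1; then
    # getsebool fails when SELinux is disabled; treat that as an empty result.
    state=$(getsebool container_use_devices 2>/dev/null || true)
    if sebool_is_on "$state"; then
        echo "container_use_devices is enabled"
    else
        echo "container_use_devices is not enabled: ${state:-no output, SELinux may be disabled}"
    fi
fi
```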
Create a directory to mount into the container as a volume, and adjust its permissions so that the container can use it.
```
mkdir -p rhaiis-cache
```
```
chmod g+rwX rhaiis-cache
```
Write your Hugging Face token to the private.env file as HF_TOKEN, then source the file. The following command overwrites an existing private.env file; use >> instead of > to append to it.
```
echo "export HF_TOKEN=<your_HF_token>" > private.env
```
```
source private.env
```
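Because private.env holds a credential, it can be worth restricting its permissions and confirming that the token is actually set before you start the container. A minimal sketch, assuming two hypothetical helpers named write_token_env and require_token:

```shell
# Hypothetical helper: write HF_TOKEN to an env file that only the owner can read.
# Overwrites the target file.
# $1: token value, $2: target file (defaults to private.env)
write_token_env() {
    env_file="${2:-private.env}"
    # Set umask 077 in a subshell so that a newly created file starts as mode 600.
    ( umask 077 && printf "export HF_TOKEN='%s'\n" "$1" > "$env_file" )
    # Also tighten permissions in case the file already existed.
    chmod 600 "$env_file"
}

# Hypothetical helper: fail early when HF_TOKEN is missing from the environment.
require_token() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; source private.env first" >&2
        return 1
    fi
}
```

For example, `write_token_env "<your_HF_token>" && . ./private.env && require_token` writes the file, loads it into the current shell, and stops with an error if the variable is still empty.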

Start the AI Inference Server container image.

For AMD ROCm accelerators:

Use amd-smi static -a to verify that the container can access the host system GPUs:

podman run -ti --rm --pull=newer \
--security-opt=label=disable \
--device=/dev/kfd --device=/dev/dri \
--group-add keep-groups \
--entrypoint="" \
registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.4 \
amd-smi static -a


- You must belong to both the video and render groups on AMD systems to use the GPUs. To access GPUs, you must pass the --group-add=keep-groups supplementary groups option into the container.
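A quick way to check the group requirement on the host before starting the container is to compare your groups against video and render. This is only a sketch: missing_groups is a hypothetical helper, not part of ROCm or Podman.

```shell
# Hypothetical helper: print which of the required groups are absent.
# $1: space-separated list of group names, e.g. the output of `id -nG`
missing_groups() {
    missing=""
    for g in video render; do
        case " $1 " in
            *" $g "*) ;;                               # group is present
            *) missing="${missing:+$missing }$g" ;;    # record the missing group
        esac
    done
    printf '%s\n' "$missing"
}

# Report missing groups for the current user.
not_member_of=$(missing_groups "$(id -nG)")
if [ -n "$not_member_of" ]; then
    echo "Current user is not in: $not_member_of"
fi

# Report GPU device nodes that the podman commands expect but the host lacks.
for dev in /dev/kfd /dev/dri; do
    [ -e "$dev" ] || echo "Device node not found: $dev"
done
```

If a group is reported as missing, add your user to it and log in again so that the new membership applies.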

Start the container:

podman run --rm -it \
--device /dev/kfd --device /dev/dri \
--security-opt=label=disable \
--group-add keep-groups \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.4 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
--tensor-parallel-size 2


- The --security-opt=label=disable option prevents SELinux from relabeling files in the volume mount. If you omit this argument, the container might not run successfully.
- If you experience an issue with shared memory, increase --shm-size to 8GB.
- Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
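The server does not accept requests until the model has been downloaded and loaded, which can take a while on first start. One way to wait for it is to poll the server until it responds. The following sketch assumes that the vLLM OpenAI-compatible server in the image exposes a /health endpoint on the published port; retry and server_ready are hypothetical helper names.

```shell
# Hypothetical helper: retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts are used up.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        # Sleep only when another attempt follows.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Hypothetical helper: succeed when the server answers its health check.
# $1: host name or IP (defaults to localhost); port 8000 matches -p 8000:8000.
server_ready() {
    curl -sf -o /dev/null "http://${1:-localhost}:8000/health"
}
```

For example, `retry 60 5 server_ready <your_server_ip>` checks every 5 seconds for up to 60 attempts and returns a non-zero status if the server never becomes ready.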

In a separate tab in your terminal, make a request to the model with the API.

curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://<your_server_ip>:8000/v1/completions | jq


Example output

{
    "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca",
    "object": "text_completion",
    "created": 1746555421,
    "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "choices": [
        {
            "index": 0,
            "text": " Paris.\nThe capital of France is Paris.",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null,
            "prompt_logprobs": null
        }
    ],
    "usage": {
        "prompt_tokens": 8,
        "total_tokens": 18,
        "completion_tokens": 10,
        "prompt_tokens_details": null
    }
}
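In scripts you often want only the generated text rather than the full response. Because the request already pipes its output through jq, the same tool can extract individual fields. A sketch with hypothetical helper names that read a completion response on standard input; the sample response is a trimmed-down stand-in for the example output, not real server output:

```shell
# Hypothetical helpers: read a /v1/completions response on stdin.
completion_text() { jq -r '.choices[0].text'; }
completion_tokens() { jq -r '.usage.completion_tokens'; }

# Trimmed-down sample with the same structure as the example output.
response='{"choices":[{"index":0,"text":" Paris."}],"usage":{"completion_tokens":10}}'

# Run the helpers only when jq is installed.
if command -v jq >/dev/null 2>&1; then
    printf '%s\n' "$response" | completion_text
    printf '%s\n' "$response" | completion_tokens
fi
```

Against a running server, replace the sample with the curl request from this step and pipe its output into completion_text instead of plain jq.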

