
Chapter 5. Serving and inferencing with Podman using AMD ROCm AI accelerators


Serve and inference a large language model with Podman and Red Hat AI Inference Server running on AMD ROCm AI accelerators.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with data center grade AMD ROCm AI accelerators installed.

Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

  1. Open a terminal on your server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
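
    Optionally, confirm that your credentials were stored by querying the logged-in user for the registry:

    $ podman login --get-login registry.redhat.io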
  2. Pull the AMD ROCm image by running the following command:

    $ podman pull registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3
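
    Optionally, list the image to confirm that the pull succeeded:

    $ podman images registry.redhat.io/rhaiis/vllm-rocm-rhel9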
  3. If your system has SELinux enabled, configure SELinux to allow device access:

    $ sudo setsebool -P container_use_devices 1
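
    Optionally, verify that the boolean is now enabled:

    $ getsebool container_use_devices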
  4. Create a cache directory to mount into the container as a volume, and adjust its permissions so that the container can use it.

    $ mkdir -p rhaiis-cache
    $ chmod g+rwX rhaiis-cache
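
    Optionally, confirm the directory permissions:

    $ ls -ld rhaiis-cache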
  5. Add your Hugging Face access token to the private.env file as the HF_TOKEN variable, then source the private.env file.

    $ echo "export HF_TOKEN=<your_HF_token>" > private.env
    $ source private.env
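
    Optionally, confirm that the token variable is set in your shell without printing its value:

    $ [ -n "$HF_TOKEN" ] && echo "HF_TOKEN is set"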
  6. Start the AI Inference Server container image.

    1. For AMD ROCm accelerators:

      1. Use amd-smi static -a to verify that the container can access the host system GPUs:

        $ podman run -ti --rm --pull=newer \
        --security-opt=label=disable \
        --device=/dev/kfd --device=/dev/dri \
        --group-add keep-groups \ 1
        --entrypoint="" \
        registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3 \
        amd-smi static -a

        1 You must belong to both the video and render groups on AMD systems to use the GPUs. To access GPUs, you must pass the --group-add=keep-groups supplementary groups option into the container.
      2. Start the container:

        $ podman run --rm -it \
        --device /dev/kfd --device /dev/dri \
        --security-opt=label=disable \ 1
        --group-add keep-groups \
        --shm-size=4GB -p 8000:8000 \ 2
        --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
        --env "HF_HUB_OFFLINE=0" \
        --env=VLLM_NO_USAGE_STATS=1 \
        -v ./rhaiis-cache:/opt/app-root/src/.cache \
        registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3 \
        --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
        --tensor-parallel-size 2 3

        1 --security-opt=label=disable prevents SELinux from relabeling files in the volume mount. If you choose not to use this argument, your container might not successfully run.
        2 If you experience an issue with shared memory, increase --shm-size to 8GB.
        3 Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
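
        When the model finishes loading, the server accepts requests on port 8000. Optionally, confirm from another terminal on the host that the server is ready by listing the served models through the OpenAI-compatible API:

        $ curl -s http://<your_server_ip>:8000/v1/models | jq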
  7. In a separate terminal tab, make a request to the model by using the API:

    curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://<your_server_ip>:8000/v1/completions | jq

    Example output

    {
        "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca",
        "object": "text_completion",
        "created": 1746555421,
        "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
        "choices": [
            {
                "index": 0,
                "text": " Paris.\nThe capital of France is Paris.",
                "logprobs": null,
                "finish_reason": "stop",
                "stop_reason": null,
                "prompt_logprobs": null
            }
        ],
        "usage": {
            "prompt_tokens": 8,
            "total_tokens": 18,
            "completion_tokens": 10,
            "prompt_tokens_details": null
        }
    }
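
    Because the served model is instruct-tuned, you can also send a request to the OpenAI-compatible chat completions endpoint. The following example assumes the same server address and model name:

    curl -X POST -H "Content-Type: application/json" -d '{
        "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 50
    }' http://<your_server_ip>:8000/v1/chat/completions | jq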
