Chapter 4. Serving and inferencing with Podman using AMD ROCm AI accelerators


Serve a large language model and run inference against it with Podman and Red Hat AI Inference Server on AMD ROCm AI accelerators.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with data center grade AMD ROCm AI accelerators installed.
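
The first prerequisite can be checked from the shell before you begin. The following is a minimal sketch rather than part of the product procedure: the helper name first_available and the CONTAINER_TOOL variable are hypothetical, and the steps below are written for Podman.

```shell
#!/bin/sh
# Hypothetical helper: print the first command from the argument list
# that is found on PATH. Returns non-zero if none of them are found.
first_available() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$cmd"
            return 0
        fi
    done
    return 1
}

# Prefer Podman and fall back to Docker.
if CONTAINER_TOOL=$(first_available podman docker); then
    echo "Using container tool: $CONTAINER_TOOL"
else
    echo "Neither podman nor docker was found on PATH" >&2
fi
```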

Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

  1. Open a terminal on your server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Pull the AMD ROCm image by running the following command:

    $ podman pull registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.1
  3. If your system has SELinux enabled, configure SELinux to allow device access:

    $ sudo setsebool -P container_use_devices 1
  4. Create a local cache directory to mount into the container as a volume, and adjust its permissions so that the container can use it.

    $ mkdir -p rhaiis-cache
    $ chmod g+rwX rhaiis-cache
  5. Create the private.env file if it does not exist, append your Hugging Face token to it as HF_TOKEN, and then source the file.

    $ echo "export HF_TOKEN=<your_HF_token>" >> private.env
    $ source private.env
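
    The container start command in the next step passes $HF_TOKEN into the container, so it can help to confirm that the variable is set in the current shell first. A minimal sketch; require_var is a hypothetical helper, not a command provided by the product.

```shell
# Hypothetical helper: succeed only if the variable named in $1
# is set and non-empty in the current shell.
require_var() {
    eval "value=\${$1:-}"
    if [ -z "$value" ]; then
        echo "$1 is not set; source private.env first" >&2
        return 1
    fi
    return 0
}

if require_var HF_TOKEN; then
    echo "HF_TOKEN is set"
fi
```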
  6. Start the AI Inference Server container image.

    1. For AMD ROCm accelerators:

      1. Use amd-smi static -a to verify that the container can access the host system GPUs:

        $ podman run -ti --rm --pull=newer \
        --security-opt=label=disable \
        --device=/dev/kfd --device=/dev/dri \
        --group-add keep-groups \
        --entrypoint="" \
        registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.1 \
        amd-smi static -a

        You must belong to both the video and render groups on AMD systems to use the GPUs. To access GPUs, you must pass the --group-add=keep-groups supplementary groups option into the container.
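
        Membership in the video and render groups can be checked on the host before you start the container. The following is a sketch under stated assumptions: has_group is a hypothetical helper, and the check only reads the output of id -nG for the current user.

```shell
# Hypothetical helper: succeed if the space-separated group list in $1
# contains the group name in $2 as a whole word.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# id -nG prints the names of all groups that the current user belongs to.
groups_now=$(id -nG)
for g in video render; do
    if has_group "$groups_now" "$g"; then
        echo "OK: member of $g"
    else
        echo "Missing group: $g" >&2
    fi
done
```

        If a group is reported as missing, an administrator typically adds the user with usermod -aG, and the change takes effect at the next login.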
      2. Start the container:

        $ podman run --rm -it \
        --device /dev/kfd --device /dev/dri \
        --security-opt=label=disable \
        --group-add keep-groups \
        --shm-size=4GB -p 8000:8000 \
        --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
        --env "HF_HUB_OFFLINE=0" \
        --env=VLLM_NO_USAGE_STATS=1 \
        -v ./rhaiis-cache:/opt/app-root/src/.cache \
        registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.1 \
        --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
        --tensor-parallel-size 2

          • --security-opt=label=disable prevents SELinux from relabeling files in the volume mount. If you choose not to use this argument, your container might not successfully run.
          • If you experience an issue with shared memory, increase --shm-size to 8GB.
          • Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
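
        One way to arrive at a candidate value for --tensor-parallel-size is to count the render nodes that the host exposes. This is a rough sketch: count_render_nodes is a hypothetical helper, and it assumes that each GPU appears as exactly one renderD* entry under /dev/dri. Integrated or non-AMD GPUs are counted too, so treat the amd-smi output from the previous step as the authoritative list.

```shell
# Hypothetical helper: count DRM render nodes (renderD*) in the directory $1.
count_render_nodes() {
    count=0
    for node in "$1"/renderD*; do
        # When nothing matches, the glob stays literal, so test for existence.
        if [ -e "$node" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

gpu_count=$(count_render_nodes /dev/dri)
echo "Detected $gpu_count render node(s); candidate: --tensor-parallel-size $gpu_count"
```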
  7. In a separate tab in your terminal, make a request to the model with the API.

    $ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://<your_server_ip>:8000/v1/completions | jq

    Example output

    {
        "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca",
        "object": "text_completion",
        "created": 1746555421,
        "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
        "choices": [
            {
                "index": 0,
                "text": " Paris.\nThe capital of France is Paris.",
                "logprobs": null,
                "finish_reason": "stop",
                "stop_reason": null,
                "prompt_logprobs": null
            }
        ],
        "usage": {
            "prompt_tokens": 8,
            "total_tokens": 18,
            "completion_tokens": 10,
            "prompt_tokens_details": null
        }
    }
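
    The model can take some time to load after the container starts, so a request sent too early might be refused. A minimal sketch of a retry helper follows. The retry function is hypothetical, and the /health path is an assumption based on the upstream vLLM OpenAI-compatible server; verify it against the image version you are running.

```shell
# Hypothetical helper: run a command up to $1 times (at least once),
# sleeping $2 seconds between attempts. Succeeds as soon as the command does.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -ge "$attempts" ]; then
            return 1
        fi
        i=$((i + 1))
        sleep "$delay"
    done
}

# Example usage, commented out so that the sketch runs without a server:
# retry 30 10 curl -sf http://<your_server_ip>:8000/health && echo "server is ready"
```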