Chapter 3. Serving and inferencing with Podman using NVIDIA CUDA AI accelerators


Serve and inference a large language model with Podman and Red Hat AI Inference Server running on NVIDIA CUDA AI accelerators.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with data center grade NVIDIA AI accelerators installed.

Note

For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.

Procedure

  1. Open a terminal on your server host, and log in to registry.redhat.io:

    $ podman login registry.redhat.io
  2. Pull the relevant NVIDIA CUDA image by running the following command:

    $ podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.2
  3. If your system has SELinux enabled, configure SELinux to allow device access:

    $ sudo setsebool -P container_use_devices 1
  4. Create a directory to mount into the container as a volume. Adjust the directory permissions so that the container can use it.

    $ mkdir -p rhaiis-cache
    $ chmod g+rwX rhaiis-cache
  5. Add your Hugging Face token to the private.env file as HF_TOKEN. The following command creates the file if it does not exist and appends to it otherwise. Then source the private.env file.

    $ echo "export HF_TOKEN=<your_HF_token>" >> private.env
    $ source private.env
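Because private.env holds a credential, you might want to restrict who can read it. The following is a minimal sketch and not part of the documented procedure. The save_hf_token function name is hypothetical, and the sketch assumes a POSIX shell with chmod available.

```shell
# Hypothetical helper (not part of the product): append the token to an env
# file and make the file readable and writable only by its owner.
# Arguments: token, env file path (default: private.env).
save_hf_token() {
    token="$1"
    env_file="${2:-private.env}"
    # Append, so an existing file keeps its other entries.
    printf 'export HF_TOKEN=%s\n' "$token" >> "$env_file"
    chmod 600 "$env_file"
}
```

For example, you could run save_hf_token <your_HF_token> and then source private.env as in the step above.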
  6. Start the AI Inference Server container image.

    1. For NVIDIA CUDA accelerators, if the host system has multiple GPUs and uses NVSwitch, start NVIDIA Fabric Manager. To detect whether your system uses NVSwitch, check whether files are present in /proc/driver/nvidia-nvswitch/devices/. If files are present, start NVIDIA Fabric Manager. Starting NVIDIA Fabric Manager requires root privileges.

      $ ls /proc/driver/nvidia-nvswitch/devices/

      Example output

      0000:0c:09.0  0000:0c:0a.0  0000:0c:0b.0  0000:0c:0c.0  0000:0c:0d.0  0000:0c:0e.0

      $ sudo systemctl start nvidia-fabricmanager
      Important

      NVIDIA Fabric Manager is only required on systems with multiple GPUs that use NVSwitch. For more information, see NVIDIA Server Architectures.
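
The NVSwitch check in this step can be scripted so that Fabric Manager is started only when NVSwitch devices are listed. The following is a minimal sketch, assuming a POSIX shell. The has_nvswitch name is a hypothetical helper, and its optional directory argument exists only so that the check can be tried against another path.

```shell
# Hypothetical helper: succeed only if the directory exists and contains at
# least one entry. Defaults to the NVSwitch device directory named in this step.
has_nvswitch() {
    dir="${1:-/proc/driver/nvidia-nvswitch/devices}"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}
```

One possible use is has_nvswitch && sudo systemctl start nvidia-fabricmanager, which leaves systems without NVSwitch untouched.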

      1. Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:

        $ podman run --rm -it \
        --security-opt=label=disable \
        --device nvidia.com/gpu=all \
        nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
        nvidia-smi

        Example output

        +-----------------------------------------------------------------------------------------+
        | NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
        |-----------------------------------------+------------------------+----------------------+
        | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
        | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
        |                                         |                        |               MIG M. |
        |=========================================+========================+======================|
        |   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
        | N/A   32C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
        |                                         |                        |             Disabled |
        +-----------------------------------------+------------------------+----------------------+
        |   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
        | N/A   29C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
        |                                         |                        |             Disabled |
        +-----------------------------------------+------------------------+----------------------+
        
        +-----------------------------------------------------------------------------------------+
        | Processes:                                                                              |
        |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
        |        ID   ID                                                               Usage      |
        |=========================================================================================|
        |  No running processes found                                                             |
        +-----------------------------------------------------------------------------------------+
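
If you need the number of GPUs as a value, for example when choosing --tensor-parallel-size, you could count them with nvidia-smi instead of reading the table. The following is a minimal sketch. The gpu_count name is a hypothetical helper, and the sketch assumes that nvidia-smi is on the PATH where you run it.

```shell
# Hypothetical helper: count the GPUs that nvidia-smi lists.
# "nvidia-smi --list-gpus" prints one line per GPU that starts with "GPU ";
# indented lines such as MIG devices are not counted.
gpu_count() {
    nvidia-smi --list-gpus | grep -c '^GPU '
}
```

On the two-GPU host in the example output, gpu_count would be expected to print 2. Not every GPU count is a valid tensor parallel size for every model, so treat the result as a starting point.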

      2. Start the container.

        $ podman run --rm -it \
        --device nvidia.com/gpu=all \
        --security-opt=label=disable \
        --shm-size=4g -p 8000:8000 \
        --userns=keep-id:uid=1001 \
        --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
        --env "HF_HUB_OFFLINE=0" \
        --env=VLLM_NO_USAGE_STATS=1 \
        -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
        registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.2 \
        --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
        --tensor-parallel-size 2

        • --security-opt=label=disable: Required for systems where SELinux is enabled. This option prevents SELinux from relabeling files in the volume mount. If you omit this argument, your container might not run successfully.
        • --shm-size=4g: If you experience an issue with shared memory, increase --shm-size to 8GB.
        • --userns=keep-id:uid=1001: Maps the host UID to the effective UID of the vLLM process in the container. You can also pass --user=0, but this is less secure than the --userns option. Setting --user=0 runs vLLM as root inside the container.
        • --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN": Passes your Hugging Face API access token to the container. Set and export HF_TOKEN before you run this command.
        • -v ./rhaiis-cache:/opt/app-root/src/.cache:Z: The :Z suffix is required for systems where SELinux is enabled. On Debian or Ubuntu operating systems, or when using Docker without SELinux, the :Z suffix is not available.
        • --tensor-parallel-size 2: Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
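
The server can take some time to load the model before it accepts requests. The following is a minimal sketch of a readiness check, assuming that the server is reachable on port 8000 and answers on the vLLM /health endpoint. The wait_for_server name is a hypothetical helper, and the attempt count and delay are arbitrary defaults.

```shell
# Hypothetical helper: poll a URL until it answers with a success status.
# Arguments: URL, maximum attempts (default 60), seconds between attempts (default 5).
wait_for_server() {
    url="$1"
    attempts="${2:-60}"
    delay="${3:-5}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl exit non-zero on HTTP errors; -s silences progress output.
        if curl -sf -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, wait_for_server http://<your_server_ip>:8000/health returns 0 once the endpoint responds, and 1 if it does not respond within the allowed attempts.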
  7. In a separate tab in your terminal, make a request to your model with the API.

    $ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://<your_server_ip>:8000/v1/completions | jq

    Example output

    {
        "id": "cmpl-b84aeda1d5a4485c9cb9ed4a13072fca",
        "object": "text_completion",
        "created": 1746555421,
        "model": "RedHatAI/Llama-3.2-1B-Instruct-FP8",
        "choices": [
            {
                "index": 0,
                "text": " Paris.\nThe capital of France is Paris.",
                "logprobs": null,
                "finish_reason": "stop",
                "stop_reason": null,
                "prompt_logprobs": null
            }
        ],
        "usage": {
            "prompt_tokens": 8,
            "total_tokens": 18,
            "completion_tokens": 10,
            "prompt_tokens_details": null
        }
    }
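
If you only need the generated text rather than the full JSON response, you could filter the response with jq, which the request above already uses for formatting. The following is a minimal sketch. The completion_text name is a hypothetical helper, and it assumes the response has the shape shown in the example output.

```shell
# Hypothetical helper: read a completions response on stdin and print the
# text of the first choice. -r prints the raw string without JSON quoting.
completion_text() {
    jq -r '.choices[0].text'
}
```

For example, you could replace the trailing "| jq" in the request with "| completion_text", adding -s to curl to hide its progress output.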

© 2025 Red Hat