このコンテンツは選択した言語では利用できません。

Chapter 4. Enabling the Red Hat AI Inference Server systemd Quadlet service


You can enable the Red Hat AI Inference Server systemd Quadlet service to inference serve language models with NVIDIA CUDA AI accelerators on your RHEL AI instance. After you configure the service, the service automatically starts on system boot.

Prerequisites

  • You have deployed a Red Hat Enterprise Linux AI instance with NVIDIA CUDA AI accelerators installed.
  • You are logged in as a user with sudo access.
  • You have a Hugging Face Hub token. You can obtain a token from Hugging Face settings.
Note

You do not need to create cache or model folders for Red Hat AI Inference Server or Red Hat AI Model Optimization Toolkit. On first boot, the following folders are created with the correct permissions for model serving:

/var/lib/rhaiis/cache
/var/lib/rhaiis/models
Copy to Clipboard Toggle word wrap

Procedure

  1. Open a shell prompt on the RHEL AI server.
  2. Review the images that are shipped with Red Hat Enterprise Linux AI. Run the following command:

    [cloud-user@localhost ~]$ podman images
    Copy to Clipboard Toggle word wrap

    Example output

    REPOSITORY                                      TAG               IMAGE ID      CREATED      SIZE     R/O
    registry.redhat.io/rhaiis/vllm-cuda-rhel9       3.2.3  f45efe91fbac  3 weeks ago  14.8 GB  true
    registry.redhat.io/rhaiis/model-opt-cuda-rhel9  3.2.3  61c0d36dcfa3  3 weeks ago  10.1 GB  true
    Copy to Clipboard Toggle word wrap

  3. Make a copy of the example configuration file:

    [cloud-user@localhost ~]$ sudo cp /etc/containers/systemd/rhaiis.container.d/install.conf.example /etc/containers/systemd/rhaiis.container.d/install.conf
    Copy to Clipboard Toggle word wrap
  4. Edit the configuration file and update with the required parameters:

    [cloud-user@localhost ~]$ sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf
    Copy to Clipboard Toggle word wrap
    [Container]
    # Set to 1 to run in offline mode and disable model downloading at runtime.
    # Default value is 0.
    Environment=HF_HUB_OFFLINE=0
    
    # Update with the required authentication token for downloading models from Hugging Face.
    Environment=HUGGING_FACE_HUB_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN>
    
    # Set to 1 to disable vLLM usage statistics collection. Default value is 0.
    # Environment=VLLM_NO_USAGE_STATS=1
    
    # Configure the vLLM server arguments
    Exec=--model meta-llama/Llama-3.2-1B-Instruct \
         --tensor-parallel-size 1 \
         --max-model-len 4096
    
    PublishPort=8000:8000
    ShmSize=4G
    
    [Install]
    WantedBy=multi-user.target
    Copy to Clipboard Toggle word wrap

    Use the following table to understand the required parameters to set:

    Expand
    Table 4.1. Red Hat AI Inference Server configuration parameters
    ParameterDescription

    HF_HUB_OFFLINE

    Set to 1 to run in offline mode and disable model downloading at runtime. Default value is 0.

    HUGGING_FACE_HUB_TOKEN

    Required authentication token for downloading models from Hugging Face.

    VLLM_NO_USAGE_STATS

    Set to 1 to disable vLLM usage statistics collection. Default value is 0.

    --model

    vLLM server argument for the model identifier or local path to the model to serve, for example, meta-llama/Llama-3.2-1B-Instruct or /opt/app-root/src/models/<MODEL_NAME>.

    --tensor-parallel-size

    Number of AI accelerators to use for tensor parallelism when serving the model. Default value is 1.

    --max-model-len

    Maximum model length (context size). This depends on available AI accelerator memory. The default value is 131072, but lower values such as 4096 might be better for accelerators with less memory.

    Note

    See vLLM server arguments for the complete list of server arguments that you can configure.

  5. Reload the systemd configuration:

    [cloud-user@localhost ~]$ sudo systemctl daemon-reload
    Copy to Clipboard Toggle word wrap
  6. Enable and start the Red Hat AI Inference Server systemd service:

    [cloud-user@localhost ~]$ sudo systemctl start rhaiis
    Copy to Clipboard Toggle word wrap

Verification

  1. Check the service status:

    [cloud-user@localhost ~]$ sudo systemctl status rhaiis
    Copy to Clipboard Toggle word wrap

    Example output

    ● rhaiis.service - Red Hat AI Inference Server (vLLM)
         Loaded: loaded (/etc/containers/systemd/rhaiis.container; generated)
         Active: active (running) since Wed 2025-11-12 12:19:01 UTC; 1min 22s ago
           Docs: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_ai/
        Process: 2391 ExecStartPre=/usr/libexec/rhaiis/check-lib.sh (code=exited, status=0/SUCCESS)
    Copy to Clipboard Toggle word wrap

  2. Monitor the service logs to verify the model is loaded and vLLM server is running:

    [cloud-user@localhost ~]$ sudo podman logs -f rhaiis
    Copy to Clipboard Toggle word wrap

    Example output

    (APIServer pid=1) INFO:     Started server process [1]
    (APIServer pid=1) INFO:     Waiting for application startup.
    (APIServer pid=1) INFO:     Application startup complete.
    Copy to Clipboard Toggle word wrap

  3. Test the inference server API:

    [cloud-user@localhost ~]$ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://localhost:8000/v1/completions | jq
    Copy to Clipboard Toggle word wrap

    Example output

    {
      "id": "cmpl-81f99f3c28d34f99a4c2d154d6bac822",
      "object": "text_completion",
      "created": 1762952825,
      "model": "RedHatAI/granite-3.3-8b-instruct",
      "choices": [
        {
          "index": 0,
          "text": "\n\nThe capital of France is Paris.",
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": null,
          "token_ids": null,
          "prompt_logprobs": null,
          "prompt_token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 7,
        "total_tokens": 18,
        "completion_tokens": 11,
        "prompt_tokens_details": null
      },
      "kv_transfer_params": null
    }
    Copy to Clipboard Toggle word wrap

トップに戻る
Red Hat logoGithubredditYoutubeTwitter

詳細情報

試用、購入および販売

コミュニティー

Red Hat ドキュメントについて

Red Hat をお使いのお客様が、信頼できるコンテンツが含まれている製品やサービスを活用することで、イノベーションを行い、目標を達成できるようにします。 最新の更新を見る.

多様性を受け入れるオープンソースの強化

Red Hat では、コード、ドキュメント、Web プロパティーにおける配慮に欠ける用語の置き換えに取り組んでいます。このような変更は、段階的に実施される予定です。詳細情報: Red Hat ブログ.

会社概要

Red Hat は、企業がコアとなるデータセンターからネットワークエッジに至るまで、各種プラットフォームや環境全体で作業を簡素化できるように、強化されたソリューションを提供しています。

Theme

© 2025 Red Hat