Chapter 4. Enabling the Red Hat AI Inference Server systemd Quadlet service

You can enable the Red Hat AI Inference Server systemd Quadlet service to serve language models for inference with NVIDIA CUDA AI accelerators on your RHEL AI instance. After you configure the service, it starts automatically at system boot.

Prerequisites

You have deployed a Red Hat Enterprise Linux AI instance with NVIDIA CUDA AI accelerators installed.
You are logged in as a user with sudo access.
You have a Hugging Face Hub token. You can obtain a token from Hugging Face settings.

Note

You do not need to create cache or model folders for Red Hat AI Inference Server or Red Hat AI Model Optimization Toolkit. On first boot, the following folders are created with the correct permissions for model serving:

/var/lib/rhaiis/cache
/var/lib/rhaiis/models
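
To confirm after first boot that both folders are present, a minimal sketch such as the following can help. The function name `check_rhaiis_dirs` and its optional base-directory argument are illustrative and are not part of the product. The sketch only checks that the folders exist; it does not verify their permissions.

```shell
# Hypothetical helper: report whether the cache and model folders exist.
# $1 is the base directory; it defaults to /var/lib/rhaiis from the note above.
check_rhaiis_dirs() {
    base="${1:-/var/lib/rhaiis}"
    missing=0
    for dir in "$base/cache" "$base/models"; do
        if [ -d "$dir" ]; then
            echo "present: $dir"
        else
            echo "missing: $dir"
            missing=1
        fi
    done
    # Non-zero exit status when at least one folder is absent.
    return "$missing"
}
```

Calling `check_rhaiis_dirs` with no arguments checks the default location and returns a non-zero status if either folder is missing.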

Procedure

Open a shell prompt on the RHEL AI server.

Review the images that are shipped with Red Hat Enterprise Linux AI. Run the following command:

[cloud-user@localhost ~]$ podman images

Example output

REPOSITORY                                      TAG    IMAGE ID      CREATED      SIZE     R/O
registry.redhat.io/rhaiis/vllm-cuda-rhel9       3.2.3  f45efe91fbac  3 weeks ago  14.8 GB  true
registry.redhat.io/rhaiis/model-opt-cuda-rhel9  3.2.3  61c0d36dcfa3  3 weeks ago  10.1 GB  true
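
If the host holds other images as well, you might want to narrow the listing to the Red Hat AI Inference Server images. The following sketch assumes the column layout shown in the example output; the function name `list_rhaiis_images` is hypothetical.

```shell
# Hypothetical helper: read `podman images` output on stdin and print
# repository:tag for each image under registry.redhat.io/rhaiis/.
list_rhaiis_images() {
    # NR > 1 skips the header row; $1 is REPOSITORY and $2 is TAG.
    awk 'NR > 1 && $1 ~ /^registry\.redhat\.io\/rhaiis\// { print $1 ":" $2 }'
}
```

A typical call is `podman images | list_rhaiis_images`.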

Make a copy of the example configuration file:

[cloud-user@localhost ~]$ sudo cp /etc/containers/systemd/rhaiis.container.d/install.conf.example /etc/containers/systemd/rhaiis.container.d/install.conf
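
A plain `cp` overwrites an `install.conf` that you have already edited if you repeat this step. As a sketch of one way to guard against that, the following hypothetical `copy_if_absent` function copies the example file only when the target does not exist yet.

```shell
# Hypothetical helper: copy the example file only when the target is absent,
# so that an install.conf you already edited is not overwritten.
# $1 = source (example) file, $2 = destination file.
copy_if_absent() {
    src="$1"
    dst="$2"
    if [ -e "$dst" ]; then
        echo "kept existing $dst"
    else
        cp "$src" "$dst" && echo "created $dst"
    fi
}
```

Because the files under `/etc/containers/systemd` are owned by root, you would define and call the function from a root shell, passing the two paths from the command above.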

Edit the configuration file and set the required parameters:

[cloud-user@localhost ~]$ sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf

[Container]
# Set to 1 to run in offline mode and disable model downloading at runtime.
# Default value is 0.
Environment=HF_HUB_OFFLINE=0

# Update with the required authentication token for downloading models from Hugging Face.
Environment=HUGGING_FACE_HUB_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN>

# Set to 1 to disable vLLM usage statistics collection. Default value is 0.
# Environment=VLLM_NO_USAGE_STATS=1

# Configure the vLLM server arguments
Exec=--model meta-llama/Llama-3.2-1B-Instruct \
     --tensor-parallel-size 1 \
     --max-model-len 4096

PublishPort=8000:8000
ShmSize=4G

[Install]
WantedBy=multi-user.target
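
A common mistake is to leave the `<YOUR_HUGGING_FACE_HUB_TOKEN>` placeholder in place. The following sketch shows one way to catch that before you start the service. The function name `token_is_set` is hypothetical, and the check only looks at the form of the line; it does not validate the token with Hugging Face.

```shell
# Hypothetical helper: succeed only if the file sets HUGGING_FACE_HUB_TOKEN
# to something other than the <YOUR_HUGGING_FACE_HUB_TOKEN> placeholder.
# $1 = path to the configuration file.
token_is_set() {
    # The value must start with a character that is neither "<" nor whitespace;
    # commented-out lines do not match because of the ^ anchor.
    grep -Eq '^Environment=HUGGING_FACE_HUB_TOKEN=[^<[:space:]]' "$1"
}
```

For example, `token_is_set /etc/containers/systemd/rhaiis.container.d/install.conf || echo "token placeholder not replaced"` prints a warning when the placeholder or an empty value is still present.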

The following table describes the parameters that you can set:

Table 4.1. Red Hat AI Inference Server configuration parameters
Parameter	Description
`HF_HUB_OFFLINE`	Set to `1` to run in offline mode and disable model downloading at runtime. Default value is `0`.
`HUGGING_FACE_HUB_TOKEN`	Required authentication token for downloading models from Hugging Face.
`VLLM_NO_USAGE_STATS`	Set to `1` to disable vLLM usage statistics collection. Default value is `0`.
`--model`	vLLM server argument for the model identifier or local path to the model to serve, for example, `meta-llama/Llama-3.2-1B-Instruct` or `/opt/app-root/src/models/<MODEL_NAME>`.
`--tensor-parallel-size`	Number of AI accelerators to use for tensor parallelism when serving the model. Default value is `1`.
`--max-model-len`	Maximum model length (context size), in tokens. The usable value depends on available AI accelerator memory. The default value is `131072`, but lower values such as `4096` might be better for accelerators with less memory.

Note

See vLLM server arguments for the complete list of server arguments that you can configure.
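
To double-check which value the service will receive for one of the server arguments in the table, you can read it back from the configuration file. The following is a minimal sketch under a few assumptions: the hypothetical `get_vllm_arg` function handles only the `--name value` form used in the example file, follows the backslash line continuations of the `Exec=` entry, and does not handle the `--name=value` form or quoted values.

```shell
# Hypothetical helper: print the value that follows a vLLM server argument
# in the Exec= entry of a Quadlet configuration file.
# $1 = configuration file, $2 = argument name, for example --max-model-len.
get_vllm_arg() {
    awk -v arg="$2" '
        /^Exec=/ { inexec = 1; sub(/^Exec=/, "") }
        inexec {
            cont = ($0 ~ /\\$/)      # a trailing backslash continues the entry
            sub(/\\$/, "")
            for (i = 1; i <= NF; i++) {
                if (prev == arg) print $i
                prev = $i
            }
            if (!cont) { inexec = 0; prev = "" }
        }
    ' "$1"
}
```

With the example configuration shown earlier, `get_vllm_arg /etc/containers/systemd/rhaiis.container.d/install.conf --max-model-len` prints the configured context size.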

Reload the systemd configuration:
```
[cloud-user@localhost ~]$ sudo systemctl daemon-reload
```
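Start the Red Hat AI Inference Server systemd service. The `[Install]` section in the configuration file already arranges for the service to start at boot, so you only need to start it now: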
```
[cloud-user@localhost ~]$ sudo systemctl start rhaiis
```

Verification

Check the service status:

[cloud-user@localhost ~]$ sudo systemctl status rhaiis

Example output

● rhaiis.service - Red Hat AI Inference Server (vLLM)
     Loaded: loaded (/etc/containers/systemd/rhaiis.container; generated)
     Active: active (running) since Wed 2025-11-12 12:19:01 UTC; 1min 22s ago
       Docs: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_ai/
    Process: 2391 ExecStartPre=/usr/libexec/rhaiis/check-lib.sh (code=exited, status=0/SUCCESS)

Monitor the service logs to verify that the model is loaded and the vLLM server is running:

[cloud-user@localhost ~]$ sudo podman logs -f rhaiis

Example output

(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
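
Downloading and loading a model can take a while, so in a script you might prefer to poll for readiness instead of watching the logs. The following sketch defines a generic, hypothetical `wait_until` function that retries any command. The usage note assumes that the server answers on vLLM's `/health` endpoint on the published port `8000`; that endpoint is not described in this procedure.

```shell
# Hypothetical helper: retry a command until it succeeds.
# $1 = maximum attempts, $2 = seconds between attempts, rest = command to run.
wait_until() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while :; do
        # Run the command quietly; stop as soon as it succeeds.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Give up once the attempt budget is used.
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}
```

For example, `wait_until 60 10 curl -sf http://localhost:8000/health` tries up to 60 times at 10-second intervals and returns a non-zero status if the server never responds.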

Test the inference server API:

[cloud-user@localhost ~]$ curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://localhost:8000/v1/completions | jq

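Example output. The `model` field reports the model that the server is running; this sample was captured with `RedHatAI/granite-3.3-8b-instruct` rather than the model in the example configuration.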

{
  "id": "cmpl-81f99f3c28d34f99a4c2d154d6bac822",
  "object": "text_completion",
  "created": 1762952825,
  "model": "RedHatAI/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\n\nThe capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 18,
    "completion_tokens": 11,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
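
When you only need the generated text and the token counts, you can reduce the response with `jq`, which the test command already uses. The following is a minimal sketch; the function name `summarize_completion` is hypothetical, and it relies only on the `choices[0].text` and `usage` fields that appear in the example output.

```shell
# Hypothetical helper: read a /v1/completions response on stdin and print
# the generated text followed by a one-line token summary. Requires jq.
summarize_completion() {
    jq -r '.choices[0].text,
           (.usage | "tokens: \(.prompt_tokens) prompt + \(.completion_tokens) completion = \(.total_tokens) total")'
}
```

To use it, pipe the `curl` output into `summarize_completion` instead of into `jq`.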
