Chapter 4. Enabling the Red Hat AI Inference Server systemd Quadlet service
You can enable the Red Hat AI Inference Server systemd Quadlet service to serve language models for inference with NVIDIA CUDA AI accelerators on your RHEL AI instance. After you configure the service, it starts automatically on system boot.
Prerequisites
- You have deployed a Red Hat Enterprise Linux AI instance with NVIDIA CUDA AI accelerators installed.
- You are logged in as a user with sudo access.
- You have a Hugging Face Hub token. You can obtain a token from Hugging Face settings.
You do not need to create cache or model folders for Red Hat AI Inference Server or Red Hat AI Model Optimization Toolkit. On first boot, the following folders are created with the correct permissions for model serving:
/var/lib/rhaiis/cache
/var/lib/rhaiis/models
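Optionally, after the first boot you can confirm that the folders exist by listing them; the ownership and permissions shown depend on the release:
ls -ld /var/lib/rhaiis/cache /var/lib/rhaiis/models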
Procedure
- Open a shell prompt on the RHEL AI server.
- Review the images that are shipped with Red Hat Enterprise Linux AI. Run the following command:
podman images
Example output
REPOSITORY                                       TAG     IMAGE ID      CREATED      SIZE     R/O
registry.redhat.io/rhaiis/vllm-cuda-rhel9        3.2.3   f45efe91fbac  3 weeks ago  14.8 GB  true
registry.redhat.io/rhaiis/model-opt-cuda-rhel9   3.2.3   61c0d36dcfa3  3 weeks ago  10.1 GB  true
- Make a copy of the example configuration file:
sudo cp /etc/containers/systemd/rhaiis.container.d/install.conf.example /etc/containers/systemd/rhaiis.container.d/install.conf
- Edit the configuration file and update it with the required parameters:
sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf
Use the following table to understand the required parameters to set:
Table 4.1. Red Hat AI Inference Server configuration parameters

- HF_HUB_OFFLINE: Set to 1 to run in offline mode and disable model downloading at runtime. The default value is 0.
- HUGGING_FACE_HUB_TOKEN: Required authentication token for downloading models from Hugging Face.
- VLLM_NO_USAGE_STATS: Set to 1 to disable vLLM usage statistics collection. The default value is 0.
- --model: vLLM server argument for the model identifier or local path to the model to serve, for example, meta-llama/Llama-3.2-1B-Instruct or /opt/app-root/src/models/<MODEL_NAME>.
- --tensor-parallel-size: Number of AI accelerators to use for tensor parallelism when serving the model. The default value is 1.
- --max-model-len: Maximum model length (context size). This depends on available AI accelerator memory. The default value is 131072, but lower values such as 4096 might be better for accelerators with less memory.
Note: See vLLM server arguments for the complete list of server arguments that you can configure.
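The exact keys in install.conf.example can differ between releases, so treat the following as a minimal illustrative sketch only. It assumes the drop-in uses the standard Quadlet [Container] keys Environment= and Exec=; the token placeholder and the example model values are not from the shipped file, so follow the structure and comments of install.conf.example itself:
[Container]
# Environment variables described in Table 4.1
Environment=HF_HUB_OFFLINE=0
Environment=VLLM_NO_USAGE_STATS=1
Environment=HUGGING_FACE_HUB_TOKEN=<YOUR_HF_TOKEN>
# vLLM server arguments; passing them through Exec= is an assumption,
# follow the comments in the shipped install.conf.example
Exec=--model meta-llama/Llama-3.2-1B-Instruct --tensor-parallel-size 1 --max-model-len 4096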
- Reload the systemd configuration:
sudo systemctl daemon-reload
- Enable and start the Red Hat AI Inference Server systemd service:
sudo systemctl start rhaiis
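Because rhaiis.service is generated from the Quadlet .container file, there is no separate systemctl enable step; the generated unit is what starts the service on boot. Optionally, you can inspect the unit that Quadlet generates with a standard systemd command:
sudo systemctl cat rhaiis.service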
Verification
- Check the service status:
sudo systemctl status rhaiis
Example output
● rhaiis.service - Red Hat AI Inference Server (vLLM)
     Loaded: loaded (/etc/containers/systemd/rhaiis.container; generated)
     Active: active (running) since Wed 2025-11-12 12:19:01 UTC; 1min 22s ago
       Docs: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_ai/
    Process: 2391 ExecStartPre=/usr/libexec/rhaiis/check-lib.sh (code=exited, status=0/SUCCESS)
- Monitor the service logs to verify that the model is loaded and the vLLM server is running:
sudo podman logs -f rhaiis
Example output
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
- Test the inference server API:
curl -X POST -H "Content-Type: application/json" \
    -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' \
    http://localhost:8000/v1/completions | jq
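You can also query the OpenAI-compatible endpoint from code. The following Python sketch is illustrative only; it assumes that the openai package is installed, that the server listens on the default port 8000, and that <MODEL_NAME> is replaced with the model the service is configured to serve:
from openai import OpenAI

# The local server does not require a real API key unless it is configured
# with one; "EMPTY" is a placeholder value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="<MODEL_NAME>",  # replace with the model set in install.conf
    prompt="What is the capital of France?",
    max_tokens=50,
)
print(response.choices[0].text)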