Chapter 6. Downloading a model from Hugging Face before running Red Hat AI Inference Server
You can download a model from Hugging Face Hub to the local file system before starting the Red Hat AI Inference Server service. This approach is useful when you plan to run the model in offline mode, or when you run in environments with restricted internet access.
Prerequisites
- You are logged in as a user with sudo access.
- You have a Hugging Face Hub token. You can obtain a token from Hugging Face settings.
- You have enabled the Red Hat AI Inference Server systemd Quadlet service.
Procedure
- Open a shell prompt on the RHEL AI server.
Stop the Red Hat AI Inference Server service:
[cloud-user@localhost ~]$ sudo systemctl stop rhaiis

Open a command prompt inside the Red Hat AI Inference Server container:
[cloud-user@localhost ~]$ sudo podman run -it --rm \
  -e HF_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN> \
  -v /var/lib/rhaiis/cache:/opt/app-root/src/.cache:Z \
  -v /var/lib/rhaiis/models:/opt/app-root/src/models:Z \
  --entrypoint /bin/bash \
  registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.4.0-ea.1

Note: You use the sudo command because the download writes to directories owned by the root group.

Inside the container, set HF_HUB_OFFLINE to 0. Run the following command:

(app-root) /opt/app-root$ export HF_HUB_OFFLINE=0

Download the model to the default directory. For example:
(app-root) /opt/app-root$ hf download RedHatAI/granite-3.3-8b-instruct \
  --local-dir /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
  --token $HF_TOKEN

Note: The rhaiis/vllm-cuda-rhel9, rhaiis/model-opt-cuda-rhel9, and rhaiis/vllm-rocm-rhel9 containers have the same version of the Hugging Face CLI available.

Exit the container:
exit

Edit the Red Hat AI Inference Server configuration file to use the downloaded model in offline mode:
[cloud-user@localhost ~]$ sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf

Update the configuration to enable offline mode and use the local model path:
[Container]
# Set to 1 to run in offline mode and disable model downloading at runtime
Environment=HF_HUB_OFFLINE=1
# Token is not required when running in offline mode with a local model
# Environment=HUGGING_FACE_HUB_TOKEN=
# Configure vLLM to use the locally downloaded model
Exec=--model /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
  --tensor-parallel-size 1 \
  --served-model-name RedHatAI/granite-3.3-8b-instruct \
  --max-model-len 4096
PublishPort=8000:8000
ShmSize=4G

[Install]
WantedBy=multi-user.target

Note: When you set the model location, you must set the location to the folder that is mapped inside the Red Hat AI Inference Server container, /opt/app-root/src/models/.

Reload the systemd configuration:
[cloud-user@localhost ~]$ sudo systemctl daemon-reload

Start the Red Hat AI Inference Server service:
[cloud-user@localhost ~]$ sudo systemctl start rhaiis
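The Exec=--model value in the configuration above must use the container-side path under /opt/app-root/src/models/, not the host path under /var/lib/rhaiis/models where the files were written. The mapping between the two can be sketched as a small shell helper (host_to_container_path is a hypothetical name, not part of the shipped tooling):

```shell
# host_to_container_path: translate a model path under the host bind mount
# (/var/lib/rhaiis/models) into the path the container sees
# (/opt/app-root/src/models). Hypothetical helper for sanity-checking
# the Exec=--model value before restarting the service.
host_to_container_path() {
    case "$1" in
        /var/lib/rhaiis/models/*)
            echo "/opt/app-root/src/models/${1#/var/lib/rhaiis/models/}"
            ;;
        *)
            echo "error: $1 is not under /var/lib/rhaiis/models" >&2
            return 1
            ;;
    esac
}

host_to_container_path /var/lib/rhaiis/models/red-hat-ai-granite-3.3-8b-instruct
# → /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct
```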
Verification
Monitor the service logs to verify the vLLM server is using the local model:
[cloud-user@localhost ~]$ sudo podman logs -f rhaiis

Example output:
(APIServer pid=1) INFO 11-12 14:05:33 [utils.py:233] non-default args: {'model': '/opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct', 'max_model_len': 4096, 'served_model_name': ['RedHatAI/granite-3.3-8b-instruct']}

Test the inference server API:
[cloud-user@localhost ~]$ curl -X POST -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50
  }' \
  http://localhost:8000/v1/completions | jq

Example output:
{
  "id": "cmpl-f3e12cc62bee438c86af676332f8fe55",
  "object": "text_completion",
  "created": 1762956836,
  "model": "RedHatAI/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\n\nThe capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 18,
    "completion_tokens": 11,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
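In a script, the generated text can be pulled out of the /v1/completions response with jq, which is already used in the verification step above. A minimal sketch, using a saved copy of the sample response body (the /tmp/resp.json file name is an assumption for illustration):

```shell
# Hypothetical post-processing: save a completions response body and extract
# only the generated text of the first choice. The sample body mirrors the
# example output above, trimmed to the fields this extraction needs.
cat > /tmp/resp.json <<'EOF'
{
  "model": "RedHatAI/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\n\nThe capital of France is Paris.",
      "finish_reason": "stop"
    }
  ]
}
EOF

# -r prints the raw string, so the leading "\n\n" becomes real blank lines
jq -r '.choices[0].text' /tmp/resp.json
```

In a live setup, the same filter can be appended directly to the curl pipeline, for example `curl … | jq -r '.choices[0].text'`.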