Chapter 5. Downloading a model from Hugging Face before running Red Hat AI Inference Server

To run a model in offline mode, download it from Hugging Face Hub to the local file system before you start the Red Hat AI Inference Server service. This approach is useful in environments with restricted internet access, because the service then loads the model from disk instead of downloading it at runtime.

Prerequisites

You have deployed a Red Hat Enterprise Linux AI instance with NVIDIA CUDA AI accelerators installed.
You are logged in as a user with sudo access.
You have a Hugging Face Hub token. You can obtain a token from Hugging Face settings.
You have enabled the Red Hat AI Inference Server systemd Quadlet service.

Procedure

Open a shell prompt on the RHEL AI server.
Stop the Red Hat AI Inference Server service:
```
sudo systemctl stop rhaiis
```

Open a command prompt inside the Red Hat AI Inference Server container:

sudo podman run -it --rm \
  -e HF_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN> \
  -v /var/lib/rhaiis/cache:/opt/app-root/src/.cache:Z \
  -v /var/lib/rhaiis/models:/opt/app-root/src/models:Z \
  --entrypoint /bin/bash \
  registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3

Note

You use the sudo command because the download writes to directories owned by the root group.

Inside the container, set HF_HUB_OFFLINE to 0 so that the Hugging Face CLI is allowed to contact Hugging Face Hub:
```
(app-root) /opt/app-root$ export HF_HUB_OFFLINE=0
```

Download the model to the default directory. For example:

(app-root) /opt/app-root$ hf download RedHatAI/granite-3.3-8b-instruct \
  --local-dir /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
  --token $HF_TOKEN

Note

The rhaiis/vllm-cuda-rhel9 and rhaiis/model-opt-cuda-rhel9 containers both have the same version of the Hugging Face CLI available.
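
Before you exit the container, it can help to confirm that the download produced a usable model directory. The following is a minimal sketch, not part of the product: `check_model_dir` is a hypothetical helper, and it assumes the model repository ships a `config.json` file and weights in `.safetensors` format, which may not hold for every model.

```shell
# check_model_dir DIR
# Hypothetical helper: succeeds only if DIR contains config.json and at
# least one *.safetensors file. Both file names are assumptions about the
# layout of the downloaded model repository.
check_model_dir() {
  model_dir="$1"

  if [ ! -f "$model_dir/config.json" ]; then
    echo "missing config.json in $model_dir" >&2
    return 1
  fi

  # The glob stays unexpanded when nothing matches, so test each candidate.
  for weights_file in "$model_dir"/*.safetensors; do
    if [ -f "$weights_file" ]; then
      return 0
    fi
  done

  echo "no .safetensors weight files in $model_dir" >&2
  return 1
}
```

Inside the container, you could call it as `check_model_dir /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct` and inspect the exit status.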

Exit the container:
```
exit
```

Edit the Red Hat AI Inference Server configuration file to use the downloaded model in offline mode:

sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf

Update the configuration to enable offline mode and use the local model path:

[Container]
# Set to 1 to run in offline mode and disable model downloading at runtime
Environment=HF_HUB_OFFLINE=1

# Token is not required when running in offline mode with a local model
# Environment=HUGGING_FACE_HUB_TOKEN=

# Configure vLLM to use the locally downloaded model
Exec=--model /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
     --tensor-parallel-size 1 \
     --served-model-name RedHatAI/granite-3.3-8b-instruct \
     --max-model-len 4096

PublishPort=8000:8000
ShmSize=4G

[Install]
WantedBy=multi-user.target

Note

When you set the model location, use the path as seen from inside the Red Hat AI Inference Server container, under /opt/app-root/src/models/, and not the path on the host file system.
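
The relationship between the two paths can be sketched as a small translation helper. This is an illustration only: `to_container_path` is a hypothetical function, and it assumes the service mounts the host directory `/var/lib/rhaiis/models` at `/opt/app-root/src/models`, as the `podman run` command earlier in this procedure does.

```shell
# Assumed mount mapping, taken from the -v option of the podman run step.
HOST_MODELS_DIR=/var/lib/rhaiis/models
CONTAINER_MODELS_DIR=/opt/app-root/src/models

# to_container_path HOST_PATH
# Hypothetical helper: prints the container-side path for a model stored
# under HOST_MODELS_DIR, or fails if the path lies outside that directory.
to_container_path() {
  case "$1" in
    "$HOST_MODELS_DIR"/*)
      # Strip the host prefix and prepend the container prefix.
      printf '%s\n' "$CONTAINER_MODELS_DIR/${1#"$HOST_MODELS_DIR"/}"
      ;;
    *)
      echo "not under $HOST_MODELS_DIR: $1" >&2
      return 1
      ;;
  esac
}
```

The printed value is what belongs after `--model` in the configuration file.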

Reload the systemd configuration:
```
sudo systemctl daemon-reload
```
Start the Red Hat AI Inference Server service:
```
sudo systemctl start rhaiis
```

Verification

Monitor the service logs to verify the vLLM server is using the local model:

sudo podman logs -f rhaiis

Example output

(APIServer pid=1) INFO 11-12 14:05:33 [utils.py:233] non-default args: {'model': '/opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct', 'max_model_len': 4096, 'served_model_name': ['RedHatAI/granite-3.3-8b-instruct']}
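
Loading the model can take some time, so the API might refuse connections immediately after the service starts. A minimal sketch of a polling loop follows, assuming a one-second retry interval is acceptable. `wait_until` is a hypothetical helper, not part of Red Hat AI Inference Server, and the probe command is supplied by the caller.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: reruns COMMAND once per second until it succeeds,
# and gives up with a non-zero status after roughly TIMEOUT_SECONDS.
wait_until() {
  timeout_seconds="$1"
  shift
  waited=0

  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout_seconds" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

For example, `wait_until 300 curl -sf http://localhost:8000/v1/models` would poll the model listing endpoint of the OpenAI-compatible API for up to about five minutes. The 300-second limit is an arbitrary choice for illustration.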

Test the inference server API:

curl -X POST -H "Content-Type: application/json" -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50
}' http://localhost:8000/v1/completions | jq

Example output

{
  "id": "cmpl-f3e12cc62bee438c86af676332f8fe55",
  "object": "text_completion",
  "created": 1762956836,
  "model": "RedHatAI/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\n\nThe capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 18,
    "completion_tokens": 11,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
