Chapter 5. Downloading a model from Hugging Face before running Red Hat AI Inference Server
You can download a model from Hugging Face Hub before starting the Red Hat AI Inference Server service and then run the model in offline mode. This approach is useful when you want to stage models on the local file system in advance, or when you run the server in an environment with restricted internet access.
Prerequisites
- You have deployed a Red Hat Enterprise Linux AI instance with NVIDIA CUDA AI accelerators installed.
- You are logged in as a user with sudo access.
- You have a Hugging Face Hub token. You can obtain a token from Hugging Face settings.
- You have enabled the Red Hat AI Inference Server systemd Quadlet service.
Procedure
- Open a shell prompt on the RHEL AI server.
- Stop the Red Hat AI Inference Server service:

    [cloud-user@localhost ~]$ systemctl stop rhaiis

- Open a command prompt inside the Red Hat AI Inference Server container, for example as sketched below.
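  Because the service is stopped, one way to get a shell is to start a temporary container from the Red Hat AI Inference Server image. The following is a minimal sketch: the image reference and the host models directory are assumptions, so match them to the volume mapping in your install.conf:

    # <host_models_dir> is a placeholder: substitute the host path that install.conf maps to /opt/app-root/src/models
    [cloud-user@localhost ~]$ sudo podman run --rm -it --entrypoint /bin/bash \
        -v <host_models_dir>:/opt/app-root/src/models:Z \
        registry.redhat.io/rhaiis/vllm-cuda-rhel9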
  Note: You use the sudo command because the download writes to directories owned by the root group.

- Inside the container, set HF_HUB_OFFLINE to 0 so that the Hugging Face CLI can reach the network. Run the following command:

    (app-root) /opt/app-root$ export HF_HUB_OFFLINE=0

- Download the model to the default directory. For example:
    (app-root) /opt/app-root$ hf download RedHatAI/granite-3.3-8b-instruct \
        --local-dir /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
        --token $HF_TOKEN

  Note: The rhaiis/vllm-cuda-rhel9 and rhaiis/model-opt-cuda-rhel9 containers both have the same version of the Hugging Face CLI available.
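  Optionally, confirm that the model files are in place before you exit by listing the target directory:

    (app-root) /opt/app-root$ ls /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct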
- Exit the container:

    (app-root) /opt/app-root$ exit

- Edit the Red Hat AI Inference Server configuration file to use the downloaded model in offline mode:
    [cloud-user@localhost ~]$ sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf

- Update the configuration to enable offline mode and use the local model path:
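  The exact contents of install.conf depend on your deployment, so the following is only a minimal sketch, assuming the systemd Quadlet [Container] drop-in format; the Exec arguments mirror the vLLM settings shown in the verification output below:

    [Container]
    # Sketch only: verify these keys against your installed rhaiis.container unit.
    # Run the server fully offline; the model is read from the local path below.
    Environment=HF_HUB_OFFLINE=1
    # vLLM arguments matching the verification output in this chapter.
    Exec=--model /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct --served-model-name RedHatAI/granite-3.3-8b-instruct --max-model-len 4096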
  Note: When you set the model location, you must set the location to the folder that is mapped inside the Red Hat AI Inference Server container, /opt/app-root/src/models/.

- Reload the systemd configuration:
    [cloud-user@localhost ~]$ sudo systemctl daemon-reload

- Start the Red Hat AI Inference Server service:
    [cloud-user@localhost ~]$ sudo systemctl start rhaiis
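- Optionally, confirm that the service is active:

    [cloud-user@localhost ~]$ sudo systemctl status rhaiis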
Verification
- Monitor the service logs to verify that the vLLM server is using the local model:

    [cloud-user@localhost ~]$ sudo podman logs -f rhaiis

  Example output
    (APIServer pid=1) INFO 11-12 14:05:33 [utils.py:233] non-default args: {'model': '/opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct', 'max_model_len': 4096, 'served_model_name': ['RedHatAI/granite-3.3-8b-instruct']}

- Test the inference server API:
    [cloud-user@localhost ~]$ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://localhost:8000/v1/completions | jq
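- Optionally, you can also query the OpenAI-compatible models endpoint to confirm the served model name:

    [cloud-user@localhost ~]$ curl http://localhost:8000/v1/models | jq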