Chapter 6. Downloading a model from Hugging Face before running Red Hat AI Inference Server
You can download a model from Hugging Face Hub to the local file system before starting the Red Hat AI Inference Server service. This approach is useful when you plan to run the model in offline mode, or when you run in environments with restricted internet access.
Prerequisites
- You are logged in as a user with sudo access.
- You have a Hugging Face Hub token. You can obtain a token from Hugging Face settings.
- You have enabled the Red Hat AI Inference Server systemd Quadlet service.
Procedure
- Open a shell prompt on the RHEL AI server.
Stop the Red Hat AI Inference Server service:
[cloud-user@localhost ~]$ sudo systemctl stop rhaiis

Open a command prompt inside the Red Hat AI Inference Server container:
[cloud-user@localhost ~]$ sudo podman run -it --rm \
  -e HF_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN> \
  -v /var/lib/rhaiis/cache:/opt/app-root/src/.cache:Z \
  -v /var/lib/rhaiis/models:/opt/app-root/src/models:Z \
  --entrypoint /bin/bash \
  registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.4.0-ea.1

Note: You use the sudo command because the download writes to directories owned by the root group.

Inside the container, set HF_HUB_OFFLINE to 0. Run the following command:

(app-root) /opt/app-root$ export HF_HUB_OFFLINE=0

Download the model to the default directory. For example:
(app-root) /opt/app-root$ hf download RedHatAI/granite-3.3-8b-instruct \
  --local-dir /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
  --token $HF_TOKEN

Note: The rhaiis/vllm-cuda-rhel9, rhaiis/model-opt-cuda-rhel9, and rhaiis/vllm-rocm-rhel9 containers have the same version of the Hugging Face CLI available.

Exit the container:
exit

Edit the Red Hat AI Inference Server configuration file to use the downloaded model in offline mode:
[cloud-user@localhost ~]$ sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf

Update the configuration to enable offline mode and use the local model path:
[Container]
# Set to 1 to run in offline mode and disable model downloading at runtime
Environment=HF_HUB_OFFLINE=1
# Token is not required when running in offline mode with a local model
# Environment=HUGGING_FACE_HUB_TOKEN=
# Configure vLLM to use the locally downloaded model
Exec=--model /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
  --tensor-parallel-size 1 \
  --served-model-name RedHatAI/granite-3.3-8b-instruct \
  --max-model-len 4096
PublishPort=8000:8000
ShmSize=4G

[Install]
WantedBy=multi-user.target

Note: When you set the model location, you must set the location to the folder that is mapped inside the Red Hat AI Inference Server container, /opt/app-root/src/models/.

Reload the systemd configuration:
[cloud-user@localhost ~]$ sudo systemctl daemon-reload

Start the Red Hat AI Inference Server service:
[cloud-user@localhost ~]$ sudo systemctl start rhaiis
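The Exec=--model value in the configuration above must use the container-side path under /opt/app-root/src/models/, not the host path under /var/lib/rhaiis/models where the files were written. The mapping between the two can be sketched as a small shell helper (host_to_container_path is a hypothetical name, not part of the shipped tooling):

```shell
# host_to_container_path: translate a model path under the host bind mount
# (/var/lib/rhaiis/models) into the path the container sees
# (/opt/app-root/src/models). Hypothetical helper for sanity-checking
# the Exec=--model value before restarting the service.
host_to_container_path() {
    case "$1" in
        /var/lib/rhaiis/models/*)
            echo "/opt/app-root/src/models/${1#/var/lib/rhaiis/models/}"
            ;;
        *)
            echo "error: $1 is not under /var/lib/rhaiis/models" >&2
            return 1
            ;;
    esac
}

host_to_container_path /var/lib/rhaiis/models/red-hat-ai-granite-3.3-8b-instruct
# → /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct
```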
Verification
Monitor the service logs to verify the vLLM server is using the local model:
[cloud-user@localhost ~]$ sudo podman logs -f rhaiis

Example output:
(APIServer pid=1) INFO 11-12 14:05:33 [utils.py:233] non-default args: {'model': '/opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct', 'max_model_len': 4096, 'served_model_name': ['RedHatAI/granite-3.3-8b-instruct']}

Test the inference server API:
[cloud-user@localhost ~]$ curl -X POST -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50
  }' \
  http://localhost:8000/v1/completions | jq

Example output:
{
  "id": "cmpl-f3e12cc62bee438c86af676332f8fe55",
  "object": "text_completion",
  "created": 1762956836,
  "model": "RedHatAI/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\n\nThe capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 18,
    "completion_tokens": 11,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
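In a script, the generated text can be pulled out of the /v1/completions response with jq, which is already used in the verification step above. A minimal sketch, using a saved copy of the sample response body (the /tmp/resp.json file name is an assumption for illustration):

```shell
# Hypothetical post-processing: save a completions response body and extract
# only the generated text of the first choice. The sample body mirrors the
# example output above, trimmed to the fields this extraction needs.
cat > /tmp/resp.json <<'EOF'
{
  "model": "RedHatAI/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\n\nThe capital of France is Paris.",
      "finish_reason": "stop"
    }
  ]
}
EOF

# -r prints the raw string, so the leading "\n\n" becomes real blank lines
jq -r '.choices[0].text' /tmp/resp.json
```

In a live setup, the same filter can be appended directly to the curl pipeline, for example `curl … | jq -r '.choices[0].text'`.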