
Chapter 5. Downloading a model from Hugging Face before running Red Hat AI Inference Server


You can download a model from Hugging Face Hub to the local file system before starting the Red Hat AI Inference Server service, and then serve the model in offline mode. This approach is useful when you want to stage models ahead of time or when you run Red Hat AI Inference Server in environments with restricted internet access.

Prerequisites

  • You have deployed a Red Hat Enterprise Linux AI instance with NVIDIA CUDA AI accelerators installed.
  • You are logged in as a user with sudo access.
  • You have a Hugging Face Hub token. You can obtain a token from Hugging Face settings.
  • You have enabled the Red Hat AI Inference Server systemd Quadlet service.

Procedure

  1. Open a shell prompt on the RHEL AI server.
  2. Stop the Red Hat AI Inference Server service:

    [cloud-user@localhost ~]$ sudo systemctl stop rhaiis
  3. Open a command prompt inside the Red Hat AI Inference Server container:

    [cloud-user@localhost ~]$ sudo podman run -it --rm \
      -e HF_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN> \
      -v /var/lib/rhaiis/cache:/opt/app-root/src/.cache:Z \
      -v /var/lib/rhaiis/models:/opt/app-root/src/models:Z \
      --entrypoint /bin/bash \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3
    Note

    You use the sudo command because the download writes to directories owned by the root group.
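    For example, you can confirm the ownership of the mapped host directories from the RHEL AI host. This assumes the default /var/lib/rhaiis directory layout shown in the volume mounts above:

    [cloud-user@localhost ~]$ ls -ld /var/lib/rhaiis/cache /var/lib/rhaiis/models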

  4. Inside the container, set HF_HUB_OFFLINE to 0. Run the following command:

    (app-root) /opt/app-root$ export HF_HUB_OFFLINE=0
  5. Download the model to the models directory that is mapped into the container. For example:

    (app-root) /opt/app-root$ hf download RedHatAI/granite-3.3-8b-instruct \
      --local-dir /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
      --token $HF_TOKEN
    Note

    The rhaiis/vllm-cuda-rhel9 and rhaiis/model-opt-cuda-rhel9 containers both have the same version of the Hugging Face CLI available.
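    If you want to confirm the installed version, you can check the huggingface_hub package from inside either container. This assumes pip is available in the image:

    (app-root) /opt/app-root$ pip show huggingface_hub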

  6. Exit the container:

    (app-root) /opt/app-root$ exit
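    Optionally, confirm from the host that the downloaded model files appear under the mapped directory. Use sudo if the directory permissions require it:

    [cloud-user@localhost ~]$ sudo ls -l /var/lib/rhaiis/models/red-hat-ai-granite-3.3-8b-instruct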
  7. Edit the Red Hat AI Inference Server configuration file to use the downloaded model in offline mode:

    [cloud-user@localhost ~]$ sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf

    Update the configuration to enable offline mode and use the local model path:

    [Container]
    # Set to 1 to run in offline mode and disable model downloading at runtime
    Environment=HF_HUB_OFFLINE=1
    
    # Token is not required when running in offline mode with a local model
    # Environment=HUGGING_FACE_HUB_TOKEN=
    
    # Configure vLLM to use the locally downloaded model
    Exec=--model /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
         --tensor-parallel-size 1 \
         --served-model-name RedHatAI/granite-3.3-8b-instruct \
         --max-model-len 4096
    
    PublishPort=8000:8000
    ShmSize=4G
    
    [Install]
    WantedBy=multi-user.target
    Note

    When you set the model location, use the path as it is mapped inside the Red Hat AI Inference Server container, under /opt/app-root/src/models/, not the host path.

  8. Reload the systemd configuration:

    [cloud-user@localhost ~]$ sudo systemctl daemon-reload
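    systemd regenerates the rhaiis unit from the Quadlet file during the reload, so you can typically review the merged unit, including your drop-in, with:

    [cloud-user@localhost ~]$ sudo systemctl cat rhaiis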
  9. Start the Red Hat AI Inference Server service:

    [cloud-user@localhost ~]$ sudo systemctl start rhaiis
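    Optionally, confirm that the service is active:

    [cloud-user@localhost ~]$ sudo systemctl status rhaiis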

Verification

  1. Monitor the service logs to verify the vLLM server is using the local model:

    [cloud-user@localhost ~]$ sudo podman logs -f rhaiis

    Example output

    (APIServer pid=1) INFO 11-12 14:05:33 [utils.py:233] non-default args: {'model': '/opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct', 'max_model_len': 4096, 'served_model_name': ['RedHatAI/granite-3.3-8b-instruct']}

  2. Test the inference server API:

    [cloud-user@localhost ~]$ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://localhost:8000/v1/completions | jq

    Example output

    {
      "id": "cmpl-f3e12cc62bee438c86af676332f8fe55",
      "object": "text_completion",
      "created": 1762956836,
      "model": "RedHatAI/granite-3.3-8b-instruct",
      "choices": [
        {
          "index": 0,
          "text": "\n\nThe capital of France is Paris.",
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": null,
          "token_ids": null,
          "prompt_logprobs": null,
          "prompt_token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 7,
        "total_tokens": 18,
        "completion_tokens": 11,
        "prompt_tokens_details": null
      },
      "kv_transfer_params": null
    }
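    If the server exposes the OpenAI-compatible /v1/models endpoint, which vLLM provides by default, you can also confirm the served model name:

    [cloud-user@localhost ~]$ curl http://localhost:8000/v1/models | jq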
