Chapter 2. Deploying speculator models


Deploy a trained EAGLE-3 speculator model to accelerate inference using speculative decoding with Red Hat AI Inference Server.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with at least one NVIDIA AI accelerator installed.
  • You have installed the relevant NVIDIA drivers.
  • You have installed the NVIDIA Container Toolkit.

Procedure

  1. Log in to the Red Hat container registry:

    podman login registry.redhat.io
  2. Pull the AI Inference Server container image:

    podman pull registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version}
  3. Set your Hugging Face token as an environment variable:

    export HF_TOKEN=<your_huggingface_token>
Note

This example uses the RedHatAI/Qwen3-8B-speculator.eagle3 pre-trained model. For other available speculator models, see the Red Hat AI speculator models collection. If you encounter an access error, verify that you have accepted a license agreement on Hugging Face before downloading.

  4. Start the inference server with a speculator model:

    podman run --rm -it \
        --device nvidia.com/gpu=all \
        --shm-size=2g \
        -p 8000:8000 \
        --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
        --env "HF_HUB_OFFLINE=0" \
        registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version} \
          --model RedHatAI/Qwen3-8B-speculator.eagle3 \
          --host 0.0.0.0 \
          --port 8000

    vLLM reads the speculator configuration from the model and loads both the draft model and the verifier model.
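
    If you prefer scripting requests instead of using curl, the following Python sketch targets the OpenAI-compatible endpoint that the server above exposes on localhost:8000. The helper names `build_chat_request` and `send_chat_request` are illustrative, not part of vLLM; only the endpoint path and JSON body shape come from the procedure.

    ```python
    import json
    import urllib.request

    # Illustrative helper: builds the JSON body for a /v1/chat/completions
    # request, matching the curl example in the Verification section.
    def build_chat_request(model, prompt, max_tokens=50):
        return {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        }

    # Illustrative helper: POSTs the body to the running inference server.
    def send_chat_request(body, url="http://localhost:8000/v1/chat/completions"):
        req = urllib.request.Request(
            url,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    if __name__ == "__main__":
        body = build_chat_request(
            "RedHatAI/Qwen3-8B-speculator.eagle3",
            "Hello, how are you?",
        )
        print(json.dumps(body, indent=2))
        # Uncomment once the server from step 4 is running:
        # print(json.dumps(send_chat_request(body), indent=2))
    ```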

Verification

  1. In a separate terminal, send a request to the model:

    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
          "messages": [
            {"role": "user", "content": "Hello, how are you?"}
          ],
          "max_tokens": 50
        }'

    Example output

    {
        "id": "chatcmpl-8dc33cb67b69b432",
        "object": "chat.completion",
        "created": 1776107449,
        "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "<think>\nOkay, the user greeted me with \"Hello, how are you?\" I need to respond in a friendly and helpful manner..."
            },
            "finish_reason": "length"
          }
        ],
        "usage": {
          "prompt_tokens": 14,
          "total_tokens": 64,
          "completion_tokens": 50
        }
    }
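
    Responses in this format can be consumed programmatically. A minimal Python sketch, parsing the example output above and extracting the assistant message and token usage:

    ```python
    import json

    # The example response from the Verification section, embedded verbatim.
    # A raw string keeps the \n and \" escapes intact for the JSON parser.
    example = r"""
    {
        "id": "chatcmpl-8dc33cb67b69b432",
        "object": "chat.completion",
        "created": 1776107449,
        "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "<think>\nOkay, the user greeted me with \"Hello, how are you?\" I need to respond in a friendly and helpful manner..."
            },
            "finish_reason": "length"
          }
        ],
        "usage": {
          "prompt_tokens": 14,
          "total_tokens": 64,
          "completion_tokens": 50
        }
    }
    """

    resp = json.loads(example)
    answer = resp["choices"][0]["message"]["content"]
    usage = resp["usage"]

    # The token accounting is additive: prompt + completion = total.
    assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]

    print(resp["model"])
    print(f"completion tokens: {usage['completion_tokens']}")
    print(answer[:60])
    ```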
