Chapter 2. Deploying speculator models


Deploy a trained EAGLE-3 speculator model to accelerate inference using speculative decoding with Red Hat AI Inference Server.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with at least one NVIDIA AI accelerator installed.
  • You have installed the relevant NVIDIA drivers.
  • You have installed the NVIDIA Container Toolkit.

Procedure

  1. Log in to the Red Hat container registry:

    podman login registry.redhat.io
  2. Pull the AI Inference Server container image:

    podman pull registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version}
  3. Set your Hugging Face token as an environment variable:

    export HF_TOKEN=<your_huggingface_token>
Note

This example uses the RedHatAI/Qwen3-8B-speculator.eagle3 pre-trained model. For other available speculator models, see the Red Hat AI speculator models collection. If you encounter an access error, verify that you have accepted a license agreement on Hugging Face before downloading.

  4. Start the inference server with a speculator model:

    podman run --rm -it \
        --device nvidia.com/gpu=all \
        --shm-size=2g \
        -p 8000:8000 \
        --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
        --env "HF_HUB_OFFLINE=0" \
        registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version} \
          --model RedHatAI/Qwen3-8B-speculator.eagle3 \
          --host 0.0.0.0 \
          --port 8000

    vLLM reads the speculator configuration from the model and loads both the draft model and the verifier model.
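
    If you prefer scripting requests instead of using curl, the following Python sketch targets the OpenAI-compatible endpoint that the server above exposes on localhost:8000. The helper names `build_chat_request` and `send_chat_request` are illustrative, not part of vLLM; only the endpoint path and JSON body shape come from the procedure.

    ```python
    import json
    import urllib.request

    # Illustrative helper: builds the JSON body for a /v1/chat/completions
    # request, matching the curl example in the Verification section.
    def build_chat_request(model, prompt, max_tokens=50):
        return {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        }

    # Illustrative helper: POSTs the body to the running inference server.
    def send_chat_request(body, url="http://localhost:8000/v1/chat/completions"):
        req = urllib.request.Request(
            url,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    if __name__ == "__main__":
        body = build_chat_request(
            "RedHatAI/Qwen3-8B-speculator.eagle3",
            "Hello, how are you?",
        )
        print(json.dumps(body, indent=2))
        # Uncomment once the server from step 4 is running:
        # print(json.dumps(send_chat_request(body), indent=2))
    ```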

Verification

  1. In a separate terminal, send a request to the model:

    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
          "messages": [
            {"role": "user", "content": "Hello, how are you?"}
          ],
          "max_tokens": 50
        }'

    Example output

    {
        "id": "chatcmpl-8dc33cb67b69b432",
        "object": "chat.completion",
        "created": 1776107449,
        "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "<think>\nOkay, the user greeted me with \"Hello, how are you?\" I need to respond in a friendly and helpful manner..."
            },
            "finish_reason": "length"
          }
        ],
        "usage": {
          "prompt_tokens": 14,
          "total_tokens": 64,
          "completion_tokens": 50
        }
    }
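
    Responses in this format can be consumed programmatically. A minimal Python sketch, parsing the example output above and extracting the assistant message and token usage:

    ```python
    import json

    # The example response from the Verification section, embedded verbatim.
    # A raw string keeps the \n and \" escapes intact for the JSON parser.
    example = r"""
    {
        "id": "chatcmpl-8dc33cb67b69b432",
        "object": "chat.completion",
        "created": 1776107449,
        "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "<think>\nOkay, the user greeted me with \"Hello, how are you?\" I need to respond in a friendly and helpful manner..."
            },
            "finish_reason": "length"
          }
        ],
        "usage": {
          "prompt_tokens": 14,
          "total_tokens": 64,
          "completion_tokens": 50
        }
    }
    """

    resp = json.loads(example)
    answer = resp["choices"][0]["message"]["content"]
    usage = resp["usage"]

    # The token accounting is additive: prompt + completion = total.
    assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]

    print(resp["model"])
    print(f"completion tokens: {usage['completion_tokens']}")
    print(answer[:60])
    ```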
