Chapter 2. Deploying speculator models


Deploy a trained EAGLE-3 speculator model with Red Hat AI Inference Server to accelerate inference through speculative decoding.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with at least one NVIDIA AI accelerator installed.
  • You have installed the relevant NVIDIA drivers.
  • You have installed the NVIDIA Container Toolkit.
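
Before you start the procedure, you can optionally verify that the NVIDIA Container Toolkit's CDI configuration is in place, because the server container is started with a CDI device reference. The following checks are a sketch, not part of the official procedure; the CDI specification path and the test image are illustrative:

    # Generate (or regenerate) the CDI specification for the installed GPUs.
    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

    # Confirm that a container can see the accelerator through CDI.
    podman run --rm --device nvidia.com/gpu=all \
        registry.access.redhat.com/ubi9/ubi nvidia-smi -L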

Procedure

  1. Log in to the Red Hat container registry:

    podman login registry.redhat.io
  2. Pull the AI Inference Server container image:

    podman pull registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version}
  3. Set your Hugging Face token as an environment variable:

    export HF_TOKEN=<your_huggingface_token>
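
    You can optionally confirm that the token is valid before you start the server. This check uses the public Hugging Face whoami endpoint; it is an illustrative extra step, not part of the official procedure:

    # A valid token returns a JSON description of your account;
    # an invalid token returns an authorization error.
    curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2
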
Note

This example uses the RedHatAI/Qwen3-8B-speculator.eagle3 pre-trained model. For other available speculator models, see the Red Hat AI speculator models collection. If you encounter an access error, verify that you have accepted the model's license agreement on Hugging Face before downloading.

  4. Start the inference server with a speculator model:

    podman run --rm -it \
        --device nvidia.com/gpu=all \
        --shm-size=2g \
        -p 8000:8000 \
        --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
        --env "HF_HUB_OFFLINE=0" \
        registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version} \
          --model RedHatAI/Qwen3-8B-speculator.eagle3 \
          --host 0.0.0.0 \
          --port 8000

    vLLM reads the speculator configuration from the model and loads both the draft model and the verifier model.
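
    The speculators-format checkpoint bundles the draft configuration, so no additional flags are required. If you instead need to attach a standalone EAGLE-3 draft checkpoint to its base model, recent vLLM versions accept an explicit speculative configuration. The following invocation is a sketch under that assumption; support for the --speculative-config flag depends on your vLLM version, and the <eagle3_draft_model> placeholder and token count are illustrative:

    podman run --rm -it \
        --device nvidia.com/gpu=all \
        --shm-size=2g \
        -p 8000:8000 \
        --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
        registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version} \
          --model Qwen/Qwen3-8B \
          --speculative-config '{"model": "<eagle3_draft_model>", "method": "eagle3", "num_speculative_tokens": 3}' \
          --host 0.0.0.0 \
          --port 8000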

Verification

  1. In a separate terminal, send a request to the model:

    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
          "messages": [
            {"role": "user", "content": "Hello, how are you?"}
          ],
          "max_tokens": 50
        }'

    Example output

    {
        "id": "chatcmpl-8dc33cb67b69b432",
        "object": "chat.completion",
        "created": 1776107449,
        "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "<think>\nOkay, the user greeted me with \"Hello, how are you?\" I need to respond in a friendly and helpful manner..."
            },
            "finish_reason": "length"
          }
        ],
        "usage": {
          "prompt_tokens": 14,
          "total_tokens": 64,
          "completion_tokens": 50
        }
    }
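
    To confirm that speculative decoding is engaged, you can also inspect the server's Prometheus metrics endpoint. Exact metric names vary across vLLM versions, so the filter below is a best-effort sketch; non-zero drafted and accepted token counters indicate that the speculator is in use:

    # Filter the metrics endpoint for speculative decoding counters,
    # such as drafted and accepted token totals.
    curl -s http://localhost:8000/metrics | grep -i spec_decode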
