Chapter 2. Deploying speculator models


Deploy a trained EAGLE-3 speculator model with Red Hat AI Inference Server to accelerate inference through speculative decoding.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with at least one NVIDIA AI accelerator installed.
  • You have installed the relevant NVIDIA drivers.
  • You have installed the NVIDIA Container Toolkit.
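
Before you start the procedure, you can optionally verify that the NVIDIA Container Toolkit's CDI configuration is in place, because the server container is started with a CDI device reference. The following checks are a sketch, not part of the official procedure; the CDI specification path and the test image are illustrative:

    # Generate (or regenerate) the CDI specification for the installed GPUs.
    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

    # Confirm that a container can see the accelerator through CDI.
    podman run --rm --device nvidia.com/gpu=all \
        registry.access.redhat.com/ubi9/ubi nvidia-smi -L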

Procedure

  1. Log in to the Red Hat container registry:

    podman login registry.redhat.io
  2. Pull the AI Inference Server container image:

    podman pull registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version}
  3. Set your Hugging Face token as an environment variable:

    export HF_TOKEN=<your_huggingface_token>
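
    You can optionally confirm that the token is valid before you start the server. This check uses the public Hugging Face whoami endpoint; it is an illustrative extra step, not part of the official procedure:

    # A valid token returns a JSON description of your account;
    # an invalid token returns an authorization error.
    curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2
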
Note

This example uses the RedHatAI/Qwen3-8B-speculator.eagle3 pre-trained model. For other available speculator models, see the Red Hat AI speculator models collection. If you encounter an access error, verify that you have accepted the model's license agreement on Hugging Face before downloading.

  4. Start the inference server with a speculator model:

    podman run --rm -it \
        --device nvidia.com/gpu=all \
        --shm-size=2g \
        -p 8000:8000 \
        --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
        --env "HF_HUB_OFFLINE=0" \
        registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version} \
          --model RedHatAI/Qwen3-8B-speculator.eagle3 \
          --host 0.0.0.0 \
          --port 8000

    vLLM reads the speculator configuration from the model and loads both the draft model and the verifier model.
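
    The speculators-format checkpoint bundles the draft configuration, so no additional flags are required. If you instead need to attach a standalone EAGLE-3 draft checkpoint to its base model, recent vLLM versions accept an explicit speculative configuration. The following invocation is a sketch under that assumption; support for the --speculative-config flag depends on your vLLM version, and the <eagle3_draft_model> placeholder and token count are illustrative:

    podman run --rm -it \
        --device nvidia.com/gpu=all \
        --shm-size=2g \
        -p 8000:8000 \
        --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
        registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version} \
          --model Qwen/Qwen3-8B \
          --speculative-config '{"model": "<eagle3_draft_model>", "method": "eagle3", "num_speculative_tokens": 3}' \
          --host 0.0.0.0 \
          --port 8000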

Verification

  1. In a separate terminal, send a request to the model:

    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
          "messages": [
            {"role": "user", "content": "Hello, how are you?"}
          ],
          "max_tokens": 50
        }'

    Example output

    {
        "id": "chatcmpl-8dc33cb67b69b432",
        "object": "chat.completion",
        "created": 1776107449,
        "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "<think>\nOkay, the user greeted me with \"Hello, how are you?\" I need to respond in a friendly and helpful manner..."
            },
            "finish_reason": "length"
          }
        ],
        "usage": {
          "prompt_tokens": 14,
          "total_tokens": 64,
          "completion_tokens": 50
        }
    }
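
    To confirm that speculative decoding is engaged, you can also inspect the server's Prometheus metrics endpoint. Exact metric names vary across vLLM versions, so the filter below is a best-effort sketch; non-zero drafted and accepted token counters indicate that the speculator is in use:

    # Filter the metrics endpoint for speculative decoding counters,
    # such as drafted and accepted token totals.
    curl -s http://localhost:8000/metrics | grep -i spec_decode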
