Chapter 2. Deploying speculator models


Deploy a trained EAGLE-3 speculator model to accelerate inference using speculative decoding with Red Hat AI Inference Server.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
  • You have access to a Linux server with at least one NVIDIA AI accelerator installed.
  • You have installed the relevant NVIDIA drivers.
  • You have installed the NVIDIA Container Toolkit.

Procedure

  1. Log in to the Red Hat container registry:

    podman login registry.redhat.io
  2. Pull the AI Inference Server container image:

    podman pull registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version}
  3. Set your Hugging Face token as an environment variable:

    export HF_TOKEN=<your_huggingface_token>
Note

This example uses the RedHatAI/Qwen3-8B-speculator.eagle3 pre-trained model. For other available speculator models, see the Red Hat AI speculator models collection. If you encounter an access error, verify that you have accepted a license agreement on Hugging Face before downloading.

  4. Start the inference server with a speculator model:

    podman run --rm -it \
        --device nvidia.com/gpu=all \
        --shm-size=2g \
        -p 8000:8000 \
        --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
        --env "HF_HUB_OFFLINE=0" \
        registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version} \
          --model RedHatAI/Qwen3-8B-speculator.eagle3 \
          --host 0.0.0.0 \
          --port 8000

    vLLM reads the speculator configuration from the model and loads both the draft model and the verifier model.
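To illustrate why loading both models speeds up decoding, the following toy sketch (not vLLM internals) shows the acceptance step of speculative decoding: the draft model proposes several tokens cheaply, and the verifier keeps the longest prefix that matches its own predictions, correcting the first mismatch. The function name and token values are illustrative only.

```python
def speculative_step(draft_tokens, verifier_tokens):
    """Return the tokens accepted in one speculative decoding step.

    draft_tokens: tokens proposed by the fast draft (speculator) model.
    verifier_tokens: tokens the verifier model would have produced.
    The accepted output is the run of draft tokens that agree with the
    verifier, with the first disagreement replaced by the verifier's token.
    """
    accepted = []
    for draft, verified in zip(draft_tokens, verifier_tokens):
        if draft == verified:
            accepted.append(draft)
        else:
            accepted.append(verified)  # verifier corrects the first mismatch
            break
    return accepted

# Three of four draft tokens agree with the verifier, so one verifier
# forward pass yields four tokens instead of one.
print(speculative_step(["the", "cat", "sat", "on"],
                       ["the", "cat", "sat", "in"]))
```

When the draft model predicts well, most proposed tokens are accepted per verifier pass, which is where the latency reduction comes from.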

Verification

  1. In a separate terminal, send a request to the model:

    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
          "messages": [
            {"role": "user", "content": "Hello, how are you?"}
          ],
          "max_tokens": 50
        }'

    Example output

    {
        "id": "chatcmpl-8dc33cb67b69b432",
        "object": "chat.completion",
        "created": 1776107449,
        "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
        "choices": [
          {
            "index": 0,
            "message": {
              "role": "assistant",
              "content": "<think>\nOkay, the user greeted me with \"Hello, how are you?\" I need to respond in a friendly and helpful manner..."
            },
            "finish_reason": "length"
          }
        ],
        "usage": {
          "prompt_tokens": 14,
          "total_tokens": 64,
          "completion_tokens": 50
        }
    }
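The response follows the OpenAI-compatible chat completion schema that vLLM serves, so it can be consumed programmatically. A minimal sketch, using a trimmed copy of the example payload above; field names match the schema, but the content string is illustrative:

```python
import json

# Trimmed copy of the example response for illustration.
response_json = '''{
  "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello! I am doing well."},
     "finish_reason": "length"}
  ],
  "usage": {"prompt_tokens": 14, "total_tokens": 64, "completion_tokens": 50}
}'''

response = json.loads(response_json)

# The generated text lives under choices[0].message.content.
answer = response["choices"][0]["message"]["content"]

# Usage accounting: prompt + completion tokens add up to the total.
usage = response["usage"]
assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]

print(answer)
```

Note that the model name in the request and response is the speculator repository ID; the server resolves the draft and verifier models from its configuration.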
