Chapter 2. Deploying speculator models
Deploy a trained EAGLE-3 speculator model to accelerate inference using speculative decoding with Red Hat AI Inference Server.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with at least one NVIDIA AI accelerator installed.
- You have installed the relevant NVIDIA drivers.
- You have installed the NVIDIA Container Toolkit.
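Before starting, it can help to confirm the driver prerequisite is in place on the host. The check below is a minimal sketch; it only verifies that the NVIDIA driver utilities respond, not that the Container Toolkit is fully configured.

```shell
# Optional sanity check: the host NVIDIA driver should respond before
# you try to pass the accelerator through to a container.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi
else
  echo "nvidia-smi not found: install the NVIDIA drivers first" >&2
fi
```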
Procedure
Log in to the Red Hat container registry:

```shell
podman login registry.redhat.io
```

Pull the AI Inference Server container image:

```shell
podman pull registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version}
```

Set your Hugging Face token as an environment variable:

```shell
export HF_TOKEN=<your_huggingface_token>
```
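Before pulling the model, it can help to confirm the token is actually set in the current shell. A minimal check follows; the `hf_` prefix is a common convention for Hugging Face tokens, not something the server enforces.

```shell
# Fail early if the token is missing; warn if it does not look like a typical token.
if [ -z "${HF_TOKEN:-}" ]; then
  echo "HF_TOKEN is not set" >&2
else
  case "$HF_TOKEN" in
    hf_*) echo "HF_TOKEN looks set" ;;
    *)    echo "warning: HF_TOKEN does not start with hf_" >&2 ;;
  esac
fi
```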
This example uses the RedHatAI/Qwen3-8B-speculator.eagle3 pre-trained model. For other available speculator models, see the Red Hat AI speculator models collection. If you encounter an access error, verify that you have accepted a license agreement on Hugging Face before downloading.
Start the inference server with a speculator model:
```shell
podman run --rm -it \
  --device nvidia.com/gpu=all \
  --shm-size=2g \
  -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  registry.redhat.io/{rhaii-registry-namespace}/vllm-cuda-rhel9:{rhaiis-version} \
  --model RedHatAI/Qwen3-8B-speculator.eagle3 \
  --host 0.0.0.0 \
  --port 8000
```

vLLM reads the speculator configuration from the model and loads both the draft model and the verifier model.
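Loading both models can take a few minutes. One way to wait for startup is to poll the server's `/health` endpoint with a bounded number of retries; this is a sketch assuming the default port mapping above, and the retry count and sleep interval are arbitrary.

```shell
# Poll /health until the server responds, giving up after a fixed number of tries
# so the script does not hang if startup fails.
ready=0
for _ in 1 2 3 4 5; do
  if curl -sf http://localhost:8000/health >/dev/null 2>&1; then
    ready=1
    break
  fi
  sleep 2
done
echo "ready=$ready"
```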
Verification
In a separate terminal, send a request to the model:
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ],
    "max_tokens": 50
  }'
```

Example output
```json
{
  "id": "chatcmpl-8dc33cb67b69b432",
  "object": "chat.completion",
  "created": 1776107449,
  "model": "RedHatAI/Qwen3-8B-speculator.eagle3",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nOkay, the user greeted me with \"Hello, how are you?\" I need to respond in a friendly and helpful manner..."
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "total_tokens": 64,
    "completion_tokens": 50
  }
}
```
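In scripts, you will often want just the assistant text rather than the full JSON envelope. A minimal sketch, using an inline `python3` call so no extra tools are needed; here the response is stored in a variable, whereas in practice you would capture the `curl` output.

```shell
# Extract the assistant message content from a chat.completion response.
# The sample payload below is abbreviated for illustration.
response='{"choices":[{"message":{"content":"Hello! I am doing well."}}]}'
echo "$response" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
```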