Chapter 5. Viewing AI Inference Server metrics
vLLM exposes various metrics via the /metrics endpoint on the AI Inference Server OpenAI-compatible API server.
You can start the server by using Python, or using Docker.
Procedure
Launch the AI Inference Server server and load your model as shown in the following example. The command also exposes the OpenAI-compatible API.
$ vllm serve unsloth/Llama-3.2-1B-InstructQuery the
/metricsendpoint of the OpenAI-compatible API to get the latest metrics from the server:$ curl http://0.0.0.0:8000/metricsExample output
# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step. # TYPE vllm:iteration_tokens_total histogram vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0 vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0 #...