Chapter 5. Viewing AI Inference metrics

vLLM exposes metrics in the Prometheus text format through the /metrics endpoint of the AI Inference OpenAI-compatible API server.

You can start the server by using Python or Docker.

Procedure

Launch the AI Inference server and load your model as shown in the following example. The command also exposes the OpenAI-compatible API.
```
$ vllm serve unsloth/Llama-3.2-1B-Instruct
```

Query the /metrics endpoint of the OpenAI-compatible API to get the latest metrics from the server:

```
$ curl http://0.0.0.0:8000/metrics
```

Example output

```
# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
vllm:iteration_tokens_total_bucket{le="1.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="8.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="16.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="32.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="64.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="128.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="256.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-Instruct"} 3.0
#...
```
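
The output uses the Prometheus text format, so it can also be read programmatically. The following is a minimal sketch that uses only the Python standard library, not an official client. The helper names `fetch_metrics`, `parse_metrics`, and `per_bucket_counts` are hypothetical and are not part of vLLM, and the default URL assumes the server is reachable on `localhost` port 8000. Because histogram `_bucket` values are cumulative, the last helper subtracts neighboring buckets to recover the count that falls into each range.

```python
import re
import urllib.request

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces, for example le="1.0".
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url="http://localhost:8000/metrics", timeout=5.0):
    """Return the raw /metrics text. The URL is an assumed default."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Parse metrics text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines such as "# HELP" and "# TYPE".
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def per_bucket_counts(samples, histogram_name):
    """Convert cumulative histogram buckets into per-bucket counts."""
    buckets = sorted(
        (float(labels["le"]), value)
        for name, labels, value in samples
        if name == histogram_name + "_bucket" and "le" in labels
    )
    counts = []
    previous = 0.0
    for upper_bound, cumulative in buckets:
        counts.append((upper_bound, cumulative - previous))
        previous = cumulative
    return counts
```

With a running server, `per_bucket_counts(parse_metrics(fetch_metrics()), "vllm:iteration_tokens_total")` returns one `(upper_bound, count)` pair per bucket. For the buckets shown in the example output, all three observations fall into the first bucket (`le="1.0"`), and the later buckets shown add none. For production monitoring, a Prometheus server or an established client library is likely a better fit than hand-written parsing.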
