
Chapter 5. AI Inference Server metrics


AI Inference Server exposes vLLM metrics that you can use to monitor the health of the system.
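The metrics are served in the Prometheus text exposition format. As a minimal sketch, assuming the server is listening on localhost:8000 (adjust the host and port to your deployment) and that the prometheus_client package is available, you can scrape and parse the endpoint directly:

```python
# Minimal sketch: scrape the metrics endpoint and parse the samples.
# The endpoint path and port are assumptions; adjust for your deployment.
from urllib.request import urlopen

from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8000/metrics"  # assumed default endpoint


def scrape(url: str = METRICS_URL) -> dict[str, float]:
    """Fetch the Prometheus text exposition and return sample values by name.

    Labels (for example, per-model labels) are ignored for brevity, so
    series that differ only by label are collapsed to the last one seen.
    """
    body = urlopen(url).read().decode("utf-8")
    samples: dict[str, float] = {}
    for family in text_string_to_metric_families(body):
        for sample in family.samples:
            samples[sample.name] = sample.value
    return samples


if __name__ == "__main__":
    metrics = scrape()
    print("running:", metrics.get("vllm:num_requests_running"))
    print("waiting:", metrics.get("vllm:num_requests_waiting"))
```

In production, pointing a Prometheus scrape job at the same endpoint is the more typical setup; the snippet is only for quick, ad hoc inspection.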

Table 5.1. vLLM metrics

| Metric name | Description |
| --- | --- |
| vllm:num_requests_running | Number of requests currently running on GPU. |
| vllm:num_requests_waiting | Number of requests waiting to be processed. |
| vllm:lora_requests_info | Running stats on LoRA requests. |
| vllm:num_requests_swapped | Number of requests swapped to CPU. Deprecated: KV cache offloading is not used in V1. |
| vllm:gpu_cache_usage_perc | GPU KV-cache usage. A value of 1 means 100% usage. |
| vllm:cpu_cache_usage_perc | CPU KV-cache usage. A value of 1 means 100% usage. Deprecated: KV cache offloading is not used in V1. |
| vllm:cpu_prefix_cache_hit_rate | CPU prefix cache block hit rate. Deprecated: KV cache offloading is not used in V1. |
| vllm:gpu_prefix_cache_hit_rate | GPU prefix cache block hit rate. Deprecated: Use vllm:gpu_prefix_cache_queries and vllm:gpu_prefix_cache_hits in V1. |
| vllm:num_preemptions_total | Cumulative number of preemptions from the engine. |
| vllm:prompt_tokens_total | Total number of prefill tokens processed. |
| vllm:generation_tokens_total | Total number of generation tokens processed. |
| vllm:iteration_tokens_total | Histogram of the number of tokens per engine step. |
| vllm:time_to_first_token_seconds | Histogram of time to the first token in seconds. |
| vllm:time_per_output_token_seconds | Histogram of time per output token in seconds. |
| vllm:e2e_request_latency_seconds | Histogram of end-to-end request latency in seconds. |
| vllm:request_queue_time_seconds | Histogram of time spent in the WAITING phase for a request. |
| vllm:request_inference_time_seconds | Histogram of time spent in the RUNNING phase for a request. |
| vllm:request_prefill_time_seconds | Histogram of time spent in the PREFILL phase for a request. |
| vllm:request_decode_time_seconds | Histogram of time spent in the DECODE phase for a request. |
| vllm:time_in_queue_requests | Histogram of time the request spent in the queue in seconds. Deprecated: Use vllm:request_queue_time_seconds instead. |
| vllm:model_forward_time_milliseconds | Histogram of time spent in the model forward pass in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead. |
| vllm:model_execute_time_milliseconds | Histogram of time spent in the model execute function in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead. |
| vllm:request_prompt_tokens | Histogram of the number of prefill tokens processed. |
| vllm:request_generation_tokens | Histogram of the number of generation tokens processed. |
| vllm:request_max_num_generation_tokens | Histogram of the maximum number of requested generation tokens. |
| vllm:request_params_n | Histogram of the n request parameter. |
| vllm:request_params_max_tokens | Histogram of the max_tokens request parameter. |
| vllm:request_success_total | Count of successfully processed requests. |
| vllm:spec_decode_draft_acceptance_rate | Speculative token acceptance rate. |
| vllm:spec_decode_efficiency | Speculative decoding system efficiency. |
| vllm:spec_decode_num_accepted_tokens_total | Total number of accepted tokens. |
| vllm:spec_decode_num_draft_tokens_total | Total number of draft tokens. |
| vllm:spec_decode_num_emitted_tokens_total | Total number of emitted tokens. |
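Counters and histograms in the table are cumulative, so useful statistics come from pairs of samples: a histogram such as vllm:time_to_first_token_seconds exposes _sum and _count samples whose ratio is the mean, and the V1 prefix cache hit rate is the ratio of the two counters named in the vllm:gpu_prefix_cache_hit_rate row. The following sketch builds on the scrape() helper above; note that counter samples may carry a _total suffix in the exposition, depending on the client library version:

```python
# Sketch: derive summary statistics from the scraped samples.
# Reuses scrape() from the previous example; metric names are taken from
# the table above, but exact sample names can vary across vLLM versions.
def counter(samples: dict[str, float], name: str) -> float:
    # Counters may be exposed with or without a "_total" suffix.
    return samples.get(name, samples.get(name + "_total", 0.0))


samples = scrape()

# Mean time to first token = histogram sum / histogram count.
ttft_sum = samples.get("vllm:time_to_first_token_seconds_sum", 0.0)
ttft_count = samples.get("vllm:time_to_first_token_seconds_count", 0.0)
if ttft_count:
    print(f"mean time to first token: {ttft_sum / ttft_count:.3f} s")

# V1 prefix cache hit rate = hits / queries.
hits = counter(samples, "vllm:gpu_prefix_cache_hits")
queries = counter(samples, "vllm:gpu_prefix_cache_queries")
if queries:
    print(f"GPU prefix cache hit rate: {hits / queries:.2%}")
```

Because both values are cumulative since server start, a monitoring system would normally compute these ratios over a time window (for example, with Prometheus rate() expressions) rather than over the whole counter history.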
