
Chapter 6. AI Inference Server metrics


AI Inference Server exposes vLLM metrics that you can use to monitor the health of the system.
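
The metrics are served in the Prometheus text exposition format from the server's /metrics HTTP endpoint. As a minimal sketch, assuming a server listening on localhost:8000 (substitute the host and port of your own deployment), the following Python snippet fetches the endpoint and prints only the vLLM series described in Table 6.1:

import urllib.request

# Assumption: the inference server is listening on localhost:8000;
# substitute the host and port of your own deployment.
METRICS_URL = "http://localhost:8000/metrics"

with urllib.request.urlopen(METRICS_URL) as response:
    exposition = response.read().decode("utf-8")

# The response is Prometheus text exposition format: one sample per line,
# plus "# HELP" and "# TYPE" comment lines, which this filter skips.
for line in exposition.splitlines():
    if line.startswith("vllm:"):
        print(line)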

Table 6.1. vLLM metrics
Metric Name
    Description

vllm:num_requests_running
    Number of requests currently running on GPU.

vllm:num_requests_waiting
    Number of requests waiting to be processed.

vllm:lora_requests_info
    Running stats on LoRA requests.

vllm:num_requests_swapped
    Number of requests swapped to CPU. Deprecated: KV cache offloading is not used in V1.

vllm:gpu_cache_usage_perc
    GPU KV-cache usage. A value of 1 means 100% usage.

vllm:cpu_cache_usage_perc
    CPU KV-cache usage. A value of 1 means 100% usage. Deprecated: KV cache offloading is not used in V1.

vllm:cpu_prefix_cache_hit_rate
    CPU prefix cache block hit rate. Deprecated: KV cache offloading is not used in V1.

vllm:gpu_prefix_cache_hit_rate
    GPU prefix cache block hit rate. Deprecated: Use vllm:gpu_prefix_cache_queries and vllm:gpu_prefix_cache_hits in V1.

vllm:num_preemptions_total
    Cumulative number of preemptions from the engine.

vllm:prompt_tokens_total
    Total number of prefill tokens processed.

vllm:generation_tokens_total
    Total number of generation tokens processed.

vllm:iteration_tokens_total
    Histogram of the number of tokens per engine step.

vllm:time_to_first_token_seconds
    Histogram of time to the first token in seconds.

vllm:time_per_output_token_seconds
    Histogram of time per output token in seconds.

vllm:e2e_request_latency_seconds
    Histogram of end-to-end request latency in seconds.

vllm:request_queue_time_seconds
    Histogram of time spent in the WAITING phase for a request.

vllm:request_inference_time_seconds
    Histogram of time spent in the RUNNING phase for a request.

vllm:request_prefill_time_seconds
    Histogram of time spent in the PREFILL phase for a request.

vllm:request_decode_time_seconds
    Histogram of time spent in the DECODE phase for a request.

vllm:time_in_queue_requests
    Histogram of time the request spent in the queue in seconds. Deprecated: Use vllm:request_queue_time_seconds instead.

vllm:model_forward_time_milliseconds
    Histogram of time spent in the model forward pass in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead.

vllm:model_execute_time_milliseconds
    Histogram of time spent in the model execute function in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead.

vllm:request_prompt_tokens
    Histogram of the number of prefill tokens processed.

vllm:request_generation_tokens
    Histogram of the number of generation tokens processed.

vllm:request_max_num_generation_tokens
    Histogram of the maximum number of requested generation tokens.

vllm:request_params_n
    Histogram of the n request parameter.

vllm:request_params_max_tokens
    Histogram of the max_tokens request parameter.

vllm:request_success_total
    Count of successfully processed requests.

vllm:spec_decode_draft_acceptance_rate
    Speculative token acceptance rate.

vllm:spec_decode_efficiency
    Speculative decoding system efficiency.

vllm:spec_decode_num_accepted_tokens_total
    Total number of accepted tokens.

vllm:spec_decode_num_draft_tokens_total
    Total number of draft tokens.

vllm:spec_decode_num_emitted_tokens_total
    Total number of emitted tokens.
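
Counters and gauges can be read directly, while the histogram metrics in Table 6.1 follow the standard Prometheus convention of exporting companion _sum, _count, and _bucket series. The sketch below (same hypothetical localhost:8000 endpoint as above; the parsing deliberately ignores label sets, so it is a simplification rather than a full exposition-format parser) derives a mean time to first token from the _sum and _count series and reads the GPU KV-cache gauge:

import urllib.request

# Assumption: local test deployment; point this at your own server.
METRICS_URL = "http://localhost:8000/metrics"

def scrape_vllm_samples(url=METRICS_URL):
    """Return {metric_name: value}, ignoring label sets for simplicity.

    Metrics that appear with several label combinations keep only the
    last sample seen.
    """
    samples = {}
    with urllib.request.urlopen(url) as response:
        for line in response.read().decode("utf-8").splitlines():
            if not line.startswith("vllm:"):
                continue  # skip "# HELP" / "# TYPE" lines and other collectors
            name_and_labels, _, value = line.rpartition(" ")
            name = name_and_labels.split("{", 1)[0]  # drop any {label="..."} part
            samples[name] = float(value)
    return samples

m = scrape_vllm_samples()

# Prometheus histograms also export <name>_sum and <name>_count series,
# so the mean observation is simply sum / count.
count = m.get("vllm:time_to_first_token_seconds_count", 0.0)
if count:
    mean_ttft = m["vllm:time_to_first_token_seconds_sum"] / count
    print(f"mean time to first token: {mean_ttft:.3f} s over {int(count)} requests")

# vllm:gpu_cache_usage_perc is a gauge where a value of 1 means 100% usage.
print(f"GPU KV-cache usage: {m.get('vllm:gpu_cache_usage_perc', 0.0):.1%}")

In production these series are typically scraped into Prometheus, where the equivalent of the mean computation above is rate(vllm:time_to_first_token_seconds_sum[5m]) / rate(vllm:time_to_first_token_seconds_count[5m]).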
