
Chapter 6. AI Inference Server metrics


AI Inference Server exposes vLLM metrics that you can use to monitor the health of the system.
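
The metrics are served in the Prometheus text exposition format from the server's /metrics HTTP endpoint. As a minimal sketch, assuming a server listening on localhost:8000 (substitute the host and port of your own deployment), the following Python snippet fetches the endpoint and prints only the vLLM series described in Table 6.1:

import urllib.request

# Assumption: the inference server is listening on localhost:8000;
# substitute the host and port of your own deployment.
METRICS_URL = "http://localhost:8000/metrics"

with urllib.request.urlopen(METRICS_URL) as response:
    exposition = response.read().decode("utf-8")

# The response is Prometheus text exposition format: one sample per line,
# plus "# HELP" and "# TYPE" comment lines, which this filter skips.
for line in exposition.splitlines():
    if line.startswith("vllm:"):
        print(line)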

Table 6.1. vLLM metrics
Metric Name
    Description

vllm:num_requests_running
    Number of requests currently running on GPU.

vllm:num_requests_waiting
    Number of requests waiting to be processed.

vllm:lora_requests_info
    Running stats on LoRA requests.

vllm:num_requests_swapped
    Number of requests swapped to CPU. Deprecated: KV cache offloading is not used in V1.

vllm:gpu_cache_usage_perc
    GPU KV-cache usage. A value of 1 means 100% usage.

vllm:cpu_cache_usage_perc
    CPU KV-cache usage. A value of 1 means 100% usage. Deprecated: KV cache offloading is not used in V1.

vllm:cpu_prefix_cache_hit_rate
    CPU prefix cache block hit rate. Deprecated: KV cache offloading is not used in V1.

vllm:gpu_prefix_cache_hit_rate
    GPU prefix cache block hit rate. Deprecated: Use vllm:gpu_prefix_cache_queries and vllm:gpu_prefix_cache_hits in V1.

vllm:num_preemptions_total
    Cumulative number of preemptions from the engine.

vllm:prompt_tokens_total
    Total number of prefill tokens processed.

vllm:generation_tokens_total
    Total number of generation tokens processed.

vllm:iteration_tokens_total
    Histogram of the number of tokens per engine step.

vllm:time_to_first_token_seconds
    Histogram of time to the first token in seconds.

vllm:time_per_output_token_seconds
    Histogram of time per output token in seconds.

vllm:e2e_request_latency_seconds
    Histogram of end-to-end request latency in seconds.

vllm:request_queue_time_seconds
    Histogram of time spent in the WAITING phase for a request.

vllm:request_inference_time_seconds
    Histogram of time spent in the RUNNING phase for a request.

vllm:request_prefill_time_seconds
    Histogram of time spent in the PREFILL phase for a request.

vllm:request_decode_time_seconds
    Histogram of time spent in the DECODE phase for a request.

vllm:time_in_queue_requests
    Histogram of time the request spent in the queue in seconds. Deprecated: Use vllm:request_queue_time_seconds instead.

vllm:model_forward_time_milliseconds
    Histogram of time spent in the model forward pass in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead.

vllm:model_execute_time_milliseconds
    Histogram of time spent in the model execute function in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead.

vllm:request_prompt_tokens
    Histogram of the number of prefill tokens processed.

vllm:request_generation_tokens
    Histogram of the number of generation tokens processed.

vllm:request_max_num_generation_tokens
    Histogram of the maximum number of requested generation tokens.

vllm:request_params_n
    Histogram of the n request parameter.

vllm:request_params_max_tokens
    Histogram of the max_tokens request parameter.

vllm:request_success_total
    Count of successfully processed requests.

vllm:spec_decode_draft_acceptance_rate
    Speculative token acceptance rate.

vllm:spec_decode_efficiency
    Speculative decoding system efficiency.

vllm:spec_decode_num_accepted_tokens_total
    Total number of accepted tokens.

vllm:spec_decode_num_draft_tokens_total
    Total number of draft tokens.

vllm:spec_decode_num_emitted_tokens_total
    Total number of emitted tokens.
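
Counters and gauges can be read directly, while the histogram metrics in Table 6.1 follow the standard Prometheus convention of exporting companion _sum, _count, and _bucket series. The sketch below (same hypothetical localhost:8000 endpoint as above; the parsing deliberately ignores label sets, so it is a simplification rather than a full exposition-format parser) derives a mean time to first token from the _sum and _count series and reads the GPU KV-cache gauge:

import urllib.request

# Assumption: local test deployment; point this at your own server.
METRICS_URL = "http://localhost:8000/metrics"

def scrape_vllm_samples(url=METRICS_URL):
    """Return {metric_name: value}, ignoring label sets for simplicity.

    Metrics that appear with several label combinations keep only the
    last sample seen.
    """
    samples = {}
    with urllib.request.urlopen(url) as response:
        for line in response.read().decode("utf-8").splitlines():
            if not line.startswith("vllm:"):
                continue  # skip "# HELP" / "# TYPE" lines and other collectors
            name_and_labels, _, value = line.rpartition(" ")
            name = name_and_labels.split("{", 1)[0]  # drop any {label="..."} part
            samples[name] = float(value)
    return samples

m = scrape_vllm_samples()

# Prometheus histograms also export <name>_sum and <name>_count series,
# so the mean observation is simply sum / count.
count = m.get("vllm:time_to_first_token_seconds_count", 0.0)
if count:
    mean_ttft = m["vllm:time_to_first_token_seconds_sum"] / count
    print(f"mean time to first token: {mean_ttft:.3f} s over {int(count)} requests")

# vllm:gpu_cache_usage_perc is a gauge where a value of 1 means 100% usage.
print(f"GPU KV-cache usage: {m.get('vllm:gpu_cache_usage_perc', 0.0):.1%}")

In production these series are typically scraped into Prometheus, where the equivalent of the mean computation above is rate(vllm:time_to_first_token_seconds_sum[5m]) / rate(vllm:time_to_first_token_seconds_count[5m]).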
