Chapter 6. AI Inference metrics

AI Inference exposes vLLM metrics that you can use to monitor the health of the system.
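
As a minimal sketch of how the vLLM metrics might be inspected by hand: vLLM publishes them in the Prometheus text exposition format, typically on a `/metrics` path of the serving endpoint. The Python below parses that format and keeps only samples whose name starts with `vllm:`. The URL `http://localhost:8000/metrics`, the sample label value `demo-model`, and the helper names `fetch_metrics` and `parse_metrics` are assumptions for illustration, not part of this documentation.

```python
import re
import urllib.request
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# A sample line is: metric name, optional {labels}, then a numeric value.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Illustrative payload only; real output contains many more series.
SAMPLE = """
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo-model"} 2.0
vllm:num_requests_waiting{model_name="demo-model"} 5.0
process_cpu_seconds_total 12.5
"""


def fetch_metrics(url: str = "http://localhost:8000/metrics",
                  timeout: float = 5.0) -> str:
    """Download the raw metrics text. The default URL is an assumption."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str, prefix: str = "vllm:") -> List[Sample]:
    """Return (name, labels, value) for every sample whose name starts with prefix."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None or not match.group(1).startswith(prefix):
            continue
        labels = dict(_LABEL_RE.findall(match.group(2) or ""))
        samples.append((match.group(1), labels, float(match.group(3))))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

To read from a live server, pass the result of `fetch_metrics()` to `parse_metrics()`. For continuous monitoring, a Prometheus server would normally scrape the endpoint directly; this parser is only meant for ad hoc checks.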

Table 6.1. vLLM metrics
Metric Name	Description
`vllm:num_requests_running`	Number of requests currently running on GPU.
`vllm:num_requests_waiting`	Number of requests waiting to be processed.
`vllm:lora_requests_info`	Running stats on LoRA requests.
`vllm:num_requests_swapped`	Number of requests swapped to CPU. Deprecated: KV cache offloading is not used in V1.
`vllm:gpu_cache_usage_perc`	GPU KV-cache usage. A value of 1 means 100% usage.
`vllm:cpu_cache_usage_perc`	CPU KV-cache usage. A value of 1 means 100% usage. Deprecated: KV cache offloading is not used in V1.
`vllm:cpu_prefix_cache_hit_rate`	CPU prefix cache block hit rate. Deprecated: KV cache offloading is not used in V1.
`vllm:gpu_prefix_cache_hit_rate`	GPU prefix cache block hit rate. Deprecated: Use `vllm:gpu_prefix_cache_queries` and `vllm:gpu_prefix_cache_hits` in V1.
`vllm:num_preemptions_total`	Cumulative number of preemptions from the engine.
`vllm:prompt_tokens_total`	Total number of prefill tokens processed.
`vllm:generation_tokens_total`	Total number of generation tokens processed.
`vllm:iteration_tokens_total`	Histogram of the number of tokens per engine step.
`vllm:time_to_first_token_seconds`	Histogram of time to the first token in seconds.
`vllm:time_per_output_token_seconds`	Histogram of time per output token in seconds.
`vllm:e2e_request_latency_seconds`	Histogram of end-to-end request latency in seconds.
`vllm:request_queue_time_seconds`	Histogram of time spent in the WAITING phase for a request, in seconds.
`vllm:request_inference_time_seconds`	Histogram of time spent in the RUNNING phase for a request, in seconds.
`vllm:request_prefill_time_seconds`	Histogram of time spent in the PREFILL phase for a request, in seconds.
`vllm:request_decode_time_seconds`	Histogram of time spent in the DECODE phase for a request, in seconds.
`vllm:time_in_queue_requests`	Histogram of time the request spent in the queue in seconds. Deprecated: Use `vllm:request_queue_time_seconds` instead.
`vllm:model_forward_time_milliseconds`	Histogram of time spent in the model forward pass in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead.
`vllm:model_execute_time_milliseconds`	Histogram of time spent in the model execute function in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead.
`vllm:request_prompt_tokens`	Histogram of the number of prefill tokens processed.
`vllm:request_generation_tokens`	Histogram of the number of generation tokens processed.
`vllm:request_max_num_generation_tokens`	Histogram of the maximum number of requested generation tokens.
`vllm:request_params_n`	Histogram of the `n` request parameter.
`vllm:request_params_max_tokens`	Histogram of the `max_tokens` request parameter.
`vllm:request_success_total`	Count of successfully processed requests.
`vllm:spec_decode_draft_acceptance_rate`	Speculative token acceptance rate.
`vllm:spec_decode_efficiency`	Speculative decoding system efficiency.
`vllm:spec_decode_num_accepted_tokens_total`	Total number of accepted tokens.
`vllm:spec_decode_num_draft_tokens_total`	Total number of draft tokens.
`vllm:spec_decode_num_emitted_tokens_total`	Total number of emitted tokens.
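
Several of the metrics in Table 6.1 are cumulative counters or histograms, so a useful reading usually comes from the change between two scrapes and not from the raw value. The sketch below assumes the standard Prometheus convention that a histogram exposes `_sum` and `_count` series, and shows one way to derive a mean time to first token and a prefix cache hit rate (from `vllm:gpu_prefix_cache_hits` and `vllm:gpu_prefix_cache_queries`). The helper name `delta_ratio` and all numeric readings are hypothetical.

```python
from typing import Optional


def delta_ratio(num_before: float, num_after: float,
                den_before: float, den_after: float) -> Optional[float]:
    """Ratio of two counter increases between two scrapes.

    Returns None when the denominator did not grow, or when a counter
    went backwards (for example after a server restart).
    """
    num_delta = num_after - num_before
    den_delta = den_after - den_before
    if den_delta <= 0 or num_delta < 0:
        return None
    return num_delta / den_delta


# Hypothetical readings from two scrapes taken some interval apart.

# Mean time to first token over the interval, in seconds:
# increase of ..._seconds_sum divided by increase of ..._seconds_count.
mean_ttft = delta_ratio(10.0, 22.0, 100, 148)

# Prefix cache hit rate over the interval:
# increase of cache hits divided by increase of cache queries.
hit_rate = delta_ratio(100, 150, 200, 400)

if __name__ == "__main__":
    print("mean TTFT (s):", mean_ttft)
    print("prefix cache hit rate:", hit_rate)
```

In a Prometheus deployment the same calculation is typically expressed with `rate()` over the corresponding series; the function above only illustrates the arithmetic.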
