Chapter 6. AI Inference Server metrics


AI Inference Server exposes vLLM metrics that you can use to monitor the health of the system.
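
Assuming the metrics are published in the Prometheus text exposition format, as upstream vLLM does, a scrape can be reduced to a few gauge readings. The following Python is a minimal sketch rather than a complete parser: the sample text, its values, and the model_name label value are made up for illustration, and parse_metrics and gauge_total are hypothetical helper names.

```python
import re

# Illustrative sample only: the values and the model_name label are made up.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="example-model"} 3.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="example-model"} 12.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="example-model"} 0.85
"""

# Matches: metric name, optional {label="value",...} block, then the value.
# Simplification: label values that contain "}" are not handled.
LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return {metric_name: [(labels_dict, value), ...]} for each sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.setdefault(name, []).append((labels, float(raw_value)))
    return samples


def gauge_total(samples, name):
    """Sum a gauge across all label sets; 0 if the metric is absent."""
    return sum(value for _, value in samples.get(name, []))


samples = parse_metrics(SAMPLE)
waiting = gauge_total(samples, "vllm:num_requests_waiting")
cache_usage = gauge_total(samples, "vllm:gpu_cache_usage_perc")
print(f"waiting={waiting} gpu_cache_usage={cache_usage:.0%}")
```

In practice the text would come from the server's metrics endpoint instead of a string constant. A growing vllm:num_requests_waiting combined with vllm:gpu_cache_usage_perc close to 1 suggests that the KV cache is the bottleneck.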

Table 6.1. vLLM metrics
| Metric Name | Description |
| --- | --- |
| vllm:num_requests_running | Number of requests currently running on GPU. |
| vllm:num_requests_waiting | Number of requests waiting to be processed. |
| vllm:lora_requests_info | Running stats on LoRA requests. |
| vllm:num_requests_swapped | Number of requests swapped to CPU. Deprecated: KV cache offloading is not used in V1. |
| vllm:gpu_cache_usage_perc | GPU KV-cache usage. A value of 1 means 100% usage. |
| vllm:cpu_cache_usage_perc | CPU KV-cache usage. A value of 1 means 100% usage. Deprecated: KV cache offloading is not used in V1. |
| vllm:cpu_prefix_cache_hit_rate | CPU prefix cache block hit rate. Deprecated: KV cache offloading is not used in V1. |
| vllm:gpu_prefix_cache_hit_rate | GPU prefix cache block hit rate. Deprecated: Use vllm:gpu_prefix_cache_queries and vllm:gpu_prefix_cache_hits in V1. |
| vllm:num_preemptions_total | Cumulative number of preemptions from the engine. |
| vllm:prompt_tokens_total | Total number of prefill tokens processed. |
| vllm:generation_tokens_total | Total number of generation tokens processed. |
| vllm:iteration_tokens_total | Histogram of the number of tokens per engine step. |
| vllm:time_to_first_token_seconds | Histogram of time to the first token in seconds. |
| vllm:time_per_output_token_seconds | Histogram of time per output token in seconds. |
| vllm:e2e_request_latency_seconds | Histogram of end-to-end request latency in seconds. |
| vllm:request_queue_time_seconds | Histogram of time spent in the WAITING phase for a request. |
| vllm:request_inference_time_seconds | Histogram of time spent in the RUNNING phase for a request. |
| vllm:request_prefill_time_seconds | Histogram of time spent in the PREFILL phase for a request. |
| vllm:request_decode_time_seconds | Histogram of time spent in the DECODE phase for a request. |
| vllm:time_in_queue_requests | Histogram of time the request spent in the queue in seconds. Deprecated: Use vllm:request_queue_time_seconds instead. |
| vllm:model_forward_time_milliseconds | Histogram of time spent in the model forward pass in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead. |
| vllm:model_execute_time_milliseconds | Histogram of time spent in the model execute function in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead. |
| vllm:request_prompt_tokens | Histogram of the number of prefill tokens processed. |
| vllm:request_generation_tokens | Histogram of the number of generation tokens processed. |
| vllm:request_max_num_generation_tokens | Histogram of the maximum number of requested generation tokens. |
| vllm:request_params_n | Histogram of the n request parameter. |
| vllm:request_params_max_tokens | Histogram of the max_tokens request parameter. |
| vllm:request_success_total | Count of successfully processed requests. |
| vllm:spec_decode_draft_acceptance_rate | Speculative token acceptance rate. |
| vllm:spec_decode_efficiency | Speculative decoding system efficiency. |
| vllm:spec_decode_num_accepted_tokens_total | Total number of accepted tokens. |
| vllm:spec_decode_num_draft_tokens_total | Total number of draft tokens. |
| vllm:spec_decode_num_emitted_tokens_total | Total number of emitted tokens. |
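
Several of the rate metrics above are deprecated in favor of raw counters, so a rate over a time window has to be computed from the difference between two scrapes. The following Python is a hedged sketch of that arithmetic, not the server's own calculation: every reading is made up, the helper names are hypothetical, and the dictionary keys follow the metric names in Table 6.1 (the sample names exposed by your server may differ, for example with a _total suffix on counters). The _sum and _count keys assume the usual Prometheus histogram convention.

```python
# Hypothetical readings from two scrapes of the same server; all values made up.
BEFORE = {
    "vllm:gpu_prefix_cache_hits": 100.0,
    "vllm:gpu_prefix_cache_queries": 200.0,
    "vllm:spec_decode_num_accepted_tokens_total": 50.0,
    "vllm:spec_decode_num_draft_tokens_total": 100.0,
    "vllm:time_to_first_token_seconds_sum": 10.0,
    "vllm:time_to_first_token_seconds_count": 40.0,
}
AFTER = {
    "vllm:gpu_prefix_cache_hits": 400.0,
    "vllm:gpu_prefix_cache_queries": 600.0,
    "vllm:spec_decode_num_accepted_tokens_total": 130.0,
    "vllm:spec_decode_num_draft_tokens_total": 260.0,
    "vllm:time_to_first_token_seconds_sum": 25.0,
    "vllm:time_to_first_token_seconds_count": 100.0,
}


def increase(before, after, name):
    """Counter growth between two scrapes (assumes no counter reset)."""
    return after[name] - before[name]


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when nothing was counted."""
    if denominator <= 0:
        return None
    return numerator / denominator


def derive(before, after):
    """Compute window-based ratios from raw counters and histogram totals."""
    return {
        # Share of prefix cache queries that were hits during the window.
        "prefix_cache_hit_rate": safe_ratio(
            increase(before, after, "vllm:gpu_prefix_cache_hits"),
            increase(before, after, "vllm:gpu_prefix_cache_queries"),
        ),
        # Share of draft tokens that were accepted during the window.
        "draft_acceptance_rate": safe_ratio(
            increase(before, after, "vllm:spec_decode_num_accepted_tokens_total"),
            increase(before, after, "vllm:spec_decode_num_draft_tokens_total"),
        ),
        # Mean time to first token: growth of _sum divided by growth of _count.
        "mean_ttft_seconds": safe_ratio(
            increase(before, after, "vllm:time_to_first_token_seconds_sum"),
            increase(before, after, "vllm:time_to_first_token_seconds_count"),
        ),
    }


for key, value in derive(BEFORE, AFTER).items():
    print(key, value)
```

Working from differences rather than lifetime totals keeps the ratios responsive to recent load. A monitoring system such as Prometheus can perform the same calculation with its own rate functions.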
