Chapter 6. AI Inference Server metrics
AI Inference Server exposes vLLM metrics that you can use to monitor the health of the system.
| Metric name | Description |
| --- | --- |
| `vllm:num_requests_running` | Number of requests currently running on GPU. |
| `vllm:num_requests_waiting` | Number of requests waiting to be processed. |
| `vllm:lora_requests_info` | Running stats on LoRA requests. |
| `vllm:num_requests_swapped` | Number of requests swapped to CPU. Deprecated: KV cache offloading is not used in V1. |
| `vllm:gpu_cache_usage_perc` | GPU KV-cache usage. A value of 1 means 100% usage. |
| `vllm:cpu_cache_usage_perc` | CPU KV-cache usage. A value of 1 means 100% usage. Deprecated: KV cache offloading is not used in V1. |
| `vllm:cpu_prefix_cache_hit_rate` | CPU prefix cache block hit rate. Deprecated: KV cache offloading is not used in V1. |
| `vllm:gpu_prefix_cache_hit_rate` | GPU prefix cache block hit rate. Deprecated: Use `vllm:gpu_prefix_cache_queries` and `vllm:gpu_prefix_cache_hits` instead. |
| `vllm:num_preemptions_total` | Cumulative number of preemptions from the engine. |
| `vllm:prompt_tokens_total` | Total number of prefill tokens processed. |
| `vllm:generation_tokens_total` | Total number of generation tokens processed. |
| `vllm:iteration_tokens_total` | Histogram of the number of tokens per engine step. |
| `vllm:time_to_first_token_seconds` | Histogram of time to the first token in seconds. |
| `vllm:time_per_output_token_seconds` | Histogram of time per output token in seconds. |
| `vllm:e2e_request_latency_seconds` | Histogram of end-to-end request latency in seconds. |
| `vllm:request_queue_time_seconds` | Histogram of time spent in the WAITING phase for a request. |
| `vllm:request_inference_time_seconds` | Histogram of time spent in the RUNNING phase for a request. |
| `vllm:request_prefill_time_seconds` | Histogram of time spent in the PREFILL phase for a request. |
| `vllm:request_decode_time_seconds` | Histogram of time spent in the DECODE phase for a request. |
| `vllm:time_in_queue_requests` | Histogram of time the request spent in the queue in seconds. Deprecated: Use `vllm:request_queue_time_seconds` instead. |
| `vllm:model_forward_time_milliseconds` | Histogram of time spent in the model forward pass in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead. |
| `vllm:model_execute_time_milliseconds` | Histogram of time spent in the model execute function in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead. |
| `vllm:request_prompt_tokens` | Histogram of the number of prefill tokens processed. |
| `vllm:request_generation_tokens` | Histogram of the number of generation tokens processed. |
| `vllm:request_max_num_generation_tokens` | Histogram of the maximum number of requested generation tokens. |
| `vllm:request_params_n` | Histogram of the `n` request parameter. |
| `vllm:request_params_max_tokens` | Histogram of the `max_tokens` request parameter. |
| `vllm:request_success_total` | Count of successfully processed requests. |
| `vllm:spec_decode_draft_acceptance_rate` | Speculative token acceptance rate. |
| `vllm:spec_decode_efficiency` | Speculative decoding system efficiency. |
| `vllm:spec_decode_num_accepted_tokens_total` | Total number of accepted tokens. |
| `vllm:spec_decode_num_draft_tokens_total` | Total number of draft tokens. |
| `vllm:spec_decode_num_emitted_tokens_total` | Total number of emitted tokens. |
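The metrics above are served in Prometheus exposition format. As a minimal sketch of how you might consume them without a full Prometheus stack, the following parses exposition text into a name-to-samples mapping; the sample payload and any host/port you fetch it from (for example `http://localhost:8000/metrics`) are illustrative assumptions, not part of this document.

```python
import re

def parse_metrics(text):
    """Parse Prometheus exposition text into {metric_name: [(labels, value)]}.

    Skips comment lines (# HELP / # TYPE). Metric names may contain ':',
    as the vLLM metrics do.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.match(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)', line)
        if m:
            name, labels, value = m.group(1), m.group(2) or "", float(m.group(3))
            metrics.setdefault(name, []).append((labels, value))
    return metrics

# Illustrative sample resembling /metrics output; values are made up.
sample = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.27
vllm:num_requests_running{model_name="demo"} 3.0
"""

parsed = parse_metrics(sample)
print(parsed["vllm:gpu_cache_usage_perc"][0][1])  # 0.27
```

In a real deployment you would typically point a Prometheus scrape job at the server's `/metrics` endpoint instead of parsing by hand; this sketch is only meant to show the shape of the data behind the table above.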