Chapter 5. AI Inference Server metrics
AI Inference Server exposes vLLM metrics that you can use to monitor the health of the system.
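These metrics are served in the Prometheus text exposition format. As a minimal sketch of consuming them, the following parses a scrape into a name-to-value map; the sample text and its label values are hypothetical, and fetching from `http://localhost:8000/metrics` (a common vLLM default, here an assumption) is left to `curl` or an HTTP client.

```python
# Minimal sketch: parse the Prometheus text exposition that the server
# serves at its /metrics endpoint (fetched e.g. with
# `curl http://localhost:8000/metrics`; the host and port are assumptions).

def parse_metrics(text):
    """Map 'name{labels}' -> float value, skipping comment lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the text after the last space on the line.
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Hypothetical sample of what a scrape might contain.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="m"} 2.0
vllm:num_requests_waiting{model_name="m"} 5.0
vllm:gpu_cache_usage_perc{model_name="m"} 0.25
"""

parsed = parse_metrics(sample)
print(parsed['vllm:num_requests_running{model_name="m"}'])  # 2.0
```

In production you would normally point a Prometheus server at the endpoint instead of parsing by hand; this sketch only illustrates the exposition format the table below refers to.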
| Metric Name | Description |
|---|---|
| `vllm:num_requests_running` | Number of requests currently running on GPU. |
| `vllm:num_requests_waiting` | Number of requests waiting to be processed. |
| `vllm:lora_requests_info` | Running stats on LoRA requests. |
| `vllm:num_requests_swapped` | Number of requests swapped to CPU. Deprecated: KV cache offloading is not used in V1. |
| `vllm:gpu_cache_usage_perc` | GPU KV-cache usage. A value of 1 means 100% usage. |
| `vllm:cpu_cache_usage_perc` | CPU KV-cache usage. A value of 1 means 100% usage. Deprecated: KV cache offloading is not used in V1. |
| `vllm:cpu_prefix_cache_hit_rate` | CPU prefix cache block hit rate. Deprecated: KV cache offloading is not used in V1. |
| `vllm:gpu_prefix_cache_hit_rate` | GPU prefix cache block hit rate. Deprecated: Use `vllm:gpu_prefix_cache_queries` and `vllm:gpu_prefix_cache_hits` instead. |
| `vllm:num_preemptions_total` | Cumulative number of preemptions from the engine. |
| `vllm:prompt_tokens_total` | Total number of prefill tokens processed. |
| `vllm:generation_tokens_total` | Total number of generation tokens processed. |
| `vllm:iteration_tokens_total` | Histogram of the number of tokens per engine step. |
| `vllm:time_to_first_token_seconds` | Histogram of time to the first token in seconds. |
| `vllm:time_per_output_token_seconds` | Histogram of time per output token in seconds. |
| `vllm:e2e_request_latency_seconds` | Histogram of end-to-end request latency in seconds. |
| `vllm:request_queue_time_seconds` | Histogram of time spent in the WAITING phase for a request. |
| `vllm:request_inference_time_seconds` | Histogram of time spent in the RUNNING phase for a request. |
| `vllm:request_prefill_time_seconds` | Histogram of time spent in the PREFILL phase for a request. |
| `vllm:request_decode_time_seconds` | Histogram of time spent in the DECODE phase for a request. |
| `vllm:time_in_queue_requests` | Histogram of time the request spent in the queue in seconds. Deprecated: Use `vllm:request_queue_time_seconds` instead. |
| `vllm:model_forward_time_milliseconds` | Histogram of time spent in the model forward pass in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead. |
| `vllm:model_execute_time_milliseconds` | Histogram of time spent in the model execute function in milliseconds. Deprecated: Use prefill/decode/inference time metrics instead. |
| `vllm:request_prompt_tokens` | Histogram of the number of prefill tokens processed. |
| `vllm:request_generation_tokens` | Histogram of the number of generation tokens processed. |
| `vllm:request_max_num_generation_tokens` | Histogram of the maximum number of requested generation tokens. |
| `vllm:request_params_n` | Histogram of the `n` request parameter. |
| `vllm:request_params_max_tokens` | Histogram of the `max_tokens` request parameter. |
| `vllm:request_success_total` | Count of successfully processed requests. |
| `vllm:spec_decode_draft_acceptance_rate` | Speculative token acceptance rate. |
| `vllm:spec_decode_efficiency` | Speculative decoding system efficiency. |
| `vllm:spec_decode_num_accepted_tokens_total` | Total number of accepted tokens. |
| `vllm:spec_decode_num_draft_tokens_total` | Total number of draft tokens. |
| `vllm:spec_decode_num_emitted_tokens_total` | Total number of emitted tokens. |
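Histogram metrics are exported with `_sum` and `_count` series, and the speculative-decoding counters only become meaningful as ratios, so dashboards usually derive summary values from them. A minimal sketch of both derivations follows; the numeric inputs are hypothetical sample readings, not real server output.

```python
# Sketch of deriving summary values from scraped metric values.
# Input numbers below are hypothetical sample readings.

def mean_from_histogram(hist_sum, hist_count):
    """Mean observation of a Prometheus histogram, e.g. the average
    time to first token from the _sum and _count series of
    vllm:time_to_first_token_seconds."""
    return hist_sum / hist_count if hist_count else 0.0

def acceptance_rate(accepted_total, draft_total):
    """Speculative-decoding acceptance rate from
    vllm:spec_decode_num_accepted_tokens_total and
    vllm:spec_decode_num_draft_tokens_total."""
    return accepted_total / draft_total if draft_total else 0.0

print(mean_from_histogram(12.5, 50))  # 0.25 -> 250 ms average TTFT
print(acceptance_rate(800, 1000))     # 0.8  -> 80% of draft tokens accepted
```

Because `_sum` and `_count` are monotonic counters, computing a rate over a time window requires differencing two scrapes (or a PromQL `rate()` expression) rather than dividing raw totals since server start.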