Chapter 4. Validating Red Hat AI Inference Server benefits using key metrics
Use the following metrics to evaluate the performance of an LLM served with AI Inference Server:
- Time to first token (TTFT): How long does it take for the model to provide the first token of its response?
- Time per output token (TPOT): How long does it take for the model to deliver each subsequent output token to a user who has sent a request?
- Latency: How long does it take for the model to generate a complete response?
- Throughput: How many output tokens per second can the model produce across all users and requests?
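As an illustration, the following minimal Python sketch shows how these metrics are typically derived from per-token timestamps recorded on the client side. The timestamp values are placeholders, not output from AI Inference Server.
# Minimal sketch: deriving TTFT, TPOT, latency, and per-request throughput
# from the arrival times of output tokens. Timestamps are illustrative.
request_start = 0.00                    # seconds, when the request was sent
token_times = [0.19, 0.20, 0.21, 0.22]  # arrival time of each output token

ttft = token_times[0] - request_start                 # time to first token
latency = token_times[-1] - request_start             # end-to-end latency
# TPOT excludes the first token, matching the benchmark output shown later
tpot = (latency - ttft) / (len(token_times) - 1)
throughput = len(token_times) / latency               # output tokens per second

print(f"TTFT {ttft * 1000:.0f} ms, TPOT {tpot * 1000:.0f} ms, "
      f"latency {latency:.2f} s, throughput {throughput:.1f} tok/s")
Server-level throughput, as reported by the benchmark below, aggregates generated tokens across all concurrent requests over the wall-clock duration of the run.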
Complete the procedure below to run a benchmark test that shows how AI Inference Server, and other inference servers, perform according to these metrics.
Prerequisites
- AI Inference Server container image
- GitHub account
- Python 3.9 or higher
Procedure
On your host system, start an AI Inference Server container and serve a model:
$ podman run --rm -it --device nvidia.com/gpu=all \
  --shm-size=4GB -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  -v ./rhaiis-cache:/opt/app-root/src/.cache \
  --security-opt=label=disable \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 \
  --model RedHatAI/Llama-3.2-1B-Instruct-FP8
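Optionally, confirm that the server is up and serving the model before continuing. This minimal sketch assumes the container is reachable on localhost port 8000, as in the podman command above, and queries the OpenAI-compatible /v1/models endpoint that vLLM exposes; it requires the requests package.
# Sketch: check that the AI Inference Server container is serving the model.
# Assumes the server started above is listening on localhost:8000.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print("Serving model:", model["id"])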
In a separate terminal tab, install the benchmark tool dependencies:
$ pip install vllm pandas datasets
Clone the vLLM Git repository:
$ git clone https://github.com/vllm-project/vllm.git
Run the ./vllm/benchmarks/benchmark_serving.py script:
$ python vllm/benchmarks/benchmark_serving.py \
  --backend vllm \
  --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
  --num-prompts 100 \
  --dataset-name random \
  --random-input 1024 \
  --random-output 512 \
  --port 8000
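To keep results for later comparison, the benchmark script can also write them to a JSON file when run with --save-result; run the script with --help to confirm which flags are available in the revision you cloned. A minimal sketch for inspecting such a file, assuming it was written to the current directory:
# Sketch: inspect the most recently written benchmark result file.
# Assumes benchmark_serving.py was run with --save-result in this directory.
import glob
import json
import os

latest = max(glob.glob("*.json"), key=os.path.getmtime)
with open(latest) as f:
    result = json.load(f)

print(f"Fields recorded in {latest}:")
for key in sorted(result):
    value = result[key]
    print(" ", key, "=", f"[{len(value)} values]" if isinstance(value, list) else value)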
Verification
The results show how AI Inference Server performs according to key server metrics:
============ Serving Benchmark Result ============
Successful requests: 100
Benchmark duration (s): 4.61
Total input tokens: 102300
Total generated tokens: 40493
Request throughput (req/s): 21.67
Output token throughput (tok/s): 8775.85
Total Token throughput (tok/s): 30946.83
---------------Time to First Token----------------
Mean TTFT (ms): 193.61
Median TTFT (ms): 193.82
P99 TTFT (ms): 303.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 9.06
Median TPOT (ms): 8.57
P99 TPOT (ms): 13.57
---------------Inter-token Latency----------------
Mean ITL (ms): 8.54
Median ITL (ms): 8.49
P99 ITL (ms): 13.14
==================================================
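As a quick sanity check, the three throughput lines follow directly from the reported totals and duration; the small differences come from rounding in the printed benchmark duration.
# Recompute the throughput figures from the totals reported above.
successful_requests = 100
duration_s = 4.61
total_input_tokens = 102300
total_generated_tokens = 40493

print(f"Request throughput:      {successful_requests / duration_s:.2f} req/s")
print(f"Output token throughput: {total_generated_tokens / duration_s:.2f} tok/s")
print(f"Total token throughput:  "
      f"{(total_input_tokens + total_generated_tokens) / duration_s:.2f} tok/s")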
Try changing the parameters of this benchmark and running it again; a small parameter-sweep sketch follows the lists below. Notice how the vllm backend compares with the other backend options: its throughput should be consistently higher and its latency lower.
- Other options for --backend are: tgi, lmdeploy, deepspeed-mii, openai, and openai-chat
- Other options for --dataset-name are: sharegpt, burstgpt, sonnet, random, and hf
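The following sketch shows one way to script such comparisons: it reruns the benchmark against the same running server for a few input and output length combinations. The length values are illustrative; it assumes the server from the first procedure step is still listening on port 8000 and that the vLLM repository was cloned into the current directory.
# Sketch: rerun the benchmark with different random input/output lengths.
import subprocess

MODEL = "RedHatAI/Llama-3.2-1B-Instruct-FP8"
SCRIPT = "vllm/benchmarks/benchmark_serving.py"

for input_len, output_len in [(512, 256), (1024, 512), (2048, 1024)]:
    print(f"\n=== random-input={input_len}, random-output={output_len} ===")
    subprocess.run(
        [
            "python", SCRIPT,
            "--backend", "vllm",
            "--model", MODEL,
            "--num-prompts", "100",
            "--dataset-name", "random",
            "--random-input", str(input_len),
            "--random-output", str(output_len),
            "--port", "8000",
        ],
        check=True,
    )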
Additional resources
- vLLM documentation
- LLM Inference Performance Engineering: Best Practices, by Mosaic AI Research, which explains metrics such as throughput and latency