Chapter 6. Validating Red Hat AI Inference Server benefits using key metrics
Use the following metrics to evaluate the performance of the LLM model being served with AI Inference Server:
- Time to first token (TTFT): The time from when a request is sent to when the first token of the response is received.
- Time per output token (TPOT): The average time it takes to generate each token after the first one.
- Latency: The total time required to generate the full response.
- Throughput: The total number of output tokens per second that the model produces across all concurrent users and requests.
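These metrics are related: for a single request, end-to-end latency is roughly the time to first token plus the time per output token multiplied by the number of tokens generated after the first. As an illustrative example, with a TTFT of 200 ms, a TPOT of 20 ms, and 512 output tokens, latency is approximately 0.2 s + 0.02 s × 511 ≈ 10.4 s.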
Complete the procedure below to run a benchmark test that shows how AI Inference Server, and other inference servers, perform according to these metrics.
Prerequisites
- AI Inference Server container image
- GitHub account
- Python 3.9 or higher
Procedure
On your host system, start an AI Inference Server container and serve a model.
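For example, the following is a minimal sketch of one way to do this with Podman on a CUDA-enabled host. The image name is an assumption, and the Hugging Face token is only needed if the model requires authentication; check the AI Inference Server documentation for the exact image and supported options.
# Minimal sketch only: the image name below is assumed; adjust for your environment.
$ podman run --rm -it \
    --device nvidia.com/gpu=all \
    -p 8000:8000 \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9 \
    --model RedHatAI/Llama-3.2-1B-Instruct-FP8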
In a separate terminal tab, install the benchmark tool dependencies:
$ pip install vllm pandas datasets
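If you prefer to keep these dependencies isolated from system packages, you can run the pip install command inside a Python virtual environment, which you can create and activate first (a common practice, not a requirement of the benchmark):
$ python3 -m venv venv
$ source venv/bin/activate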
Clone the vLLM Git repository:
$ git clone https://github.com/vllm-project/vllm.git
Run the ./vllm/benchmarks/benchmark_serving.py script:
$ python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 100 --dataset-name random --random-input 1024 --random-output 512 --port 8000
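This command sends 100 randomly generated prompts, each approximately 1,024 input tokens long, and requests approximately 512 output tokens per prompt from the model served on port 8000.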
Verification
The benchmark results show how AI Inference Server performs according to the key server metrics: time to first token, time per output token, latency, and throughput.
Try changing the parameters of this benchmark and running it again, and notice how the vllm backend compares to other backend options: throughput should be consistently higher, while latency should be lower. For an example of varying the parameters, see the command after the following list.
- Other options for --backend are: tgi, lmdeploy, deepspeed-mii, openai, and openai-chat.
- Other options for --dataset-name are: sharegpt, burstgpt, sonnet, random, and hf.
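For example, the following variation of the benchmark command uses only the parameters shown above to simulate a larger number of requests with longer prompts and shorter responses; the specific values are illustrative:
$ python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 500 --dataset-name random --random-input 2048 --random-output 256 --port 8000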
Additional resources
- vLLM documentation
- LLM Inference Performance Engineering: Best Practices, by Mosaic AI Research, which explains metrics such as throughput and latency