Chapter 4. Validating Red Hat AI Inference Server benefits using key metrics
Use the following metrics to evaluate the performance of the large language model (LLM) being served with AI Inference Server:
- Time to first token (TTFT): How long does it take for the model to provide the first token of its response?
- Time per output token (TPOT): On average, how long does it take for the model to deliver each subsequent output token to a user who has sent a request?
- Latency: How long does it take for the model to generate a complete response?
- Throughput: How many output tokens per second can the model produce across all users and requests?
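These metrics are related: for a single request, end-to-end latency is approximately the time to first token plus the time per output token multiplied by the number of output tokens generated after the first one. As an illustrative example, a request with a TTFT of 200 ms, a TPOT of 50 ms, and 512 output tokens completes in roughly 0.2 s + 0.05 s × 511 ≈ 25.8 s.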
Complete the procedure below to run a benchmark test that shows how AI Inference Server, and other inference servers, perform according to these metrics.
Prerequisites
- AI Inference Server container image
- GitHub account
- Python 3.9 or higher
Procedure
On your host system, start an AI Inference Server container and serve a model, for example RedHatAI/Llama-3.2-1B-Instruct-FP8, which the benchmark command later in this procedure uses.
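The exact command depends on your accelerator and on the AI Inference Server image that you pulled. The following is a minimal sketch that assumes an NVIDIA CUDA host, the registry.redhat.io/rhaiis/vllm-cuda-rhel9 image, a Hugging Face token exported in the HF_TOKEN environment variable, and that the image entrypoint passes the trailing --model argument to vLLM; adjust the image name, tag, device flags, and cache volume for your environment.
$ podman run --rm -it \
    --device nvidia.com/gpu=all \
    --shm-size=4GB \
    -p 8000:8000 \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    -v "$(pwd)/rhaiis-cache:/opt/app-root/src/.cache:Z" \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:latest \
    --model RedHatAI/Llama-3.2-1B-Instruct-FP8
Leave this terminal open; the server listens on port 8000, which the benchmark command later in this procedure targets.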
In a separate terminal tab, install the benchmark tool dependencies:
$ pip install vllm pandas datasets
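If you want to keep the benchmark dependencies isolated from your system Python, you can optionally create and activate a virtual environment before running the pip install command:
$ python3 -m venv .venv
$ source .venv/bin/activate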
Clone the vLLM Git repository:
$ git clone https://github.com/vllm-project/vllm.git
Run the ./vllm/benchmarks/benchmark_serving.py script:
$ python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 100 --dataset-name random --random-input 1024 --random-output 512 --port 8000
Verification
The benchmark output summarizes how AI Inference Server performs against the key serving metrics, including statistics for time to first token, time per output token, inter-token latency, and request and output token throughput.
Try changing the parameters of this benchmark and running it again, and notice how vllm as a backend compares to the other backend options; see the example command after the following list. Throughput should be consistently higher, while latency should be lower.
- Other options for --backend are: tgi, lmdeploy, deepspeed-mii, openai, and openai-chat.
- Other options for --dataset-name are: sharegpt, burstgpt, sonnet, random, and hf.
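For example, to see how a heavier generation workload affects the same server, you can rerun the script with more prompts and a longer random output length. The values below are illustrative only; the flags are the same ones used earlier in this procedure.
$ python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 500 --dataset-name random --random-input 1024 --random-output 2048 --port 8000
Compare the reported latency and throughput statistics between runs to see how the workload shape affects each metric.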
Additional resources
- vLLM documentation
- LLM Inference Performance Engineering: Best Practices, by Mosaic AI Research, which explains metrics such as throughput and latency