Chapter 6. Validating Red Hat AI Inference Server benefits using key metrics
Use the following metrics to evaluate the performance of the LLM model being served with AI Inference Server:
- Time to first token (TTFT): The time from when a request is sent to when the first token of the response is received.
- Time per output token (TPOT): The average time it takes to generate each token after the first one.
- Latency: The total time required to generate the full response.
- Throughput: The total number of output tokens per second that the model can produce across all concurrent users and requests.
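To make the relationship between these metrics concrete, the following short Python sketch computes them from hypothetical per-request timings. All numbers are illustrative placeholders, not measured values; the benchmark script used later in this procedure reports the real figures.

# Illustrative sketch only: how TTFT, TPOT, latency, and throughput relate.
# The timing values below are hypothetical placeholders.
request_sent = 0.0            # seconds: request issued
first_token_received = 0.35   # seconds: first output token arrives
last_token_received = 4.50    # seconds: final output token arrives
num_output_tokens = 512       # tokens generated for this request

ttft = first_token_received - request_sent                      # time to first token
tpot = (last_token_received - first_token_received) / (num_output_tokens - 1)  # average time per output token after the first
latency = last_token_received - request_sent                    # total time to generate the full response

# Throughput is aggregated across all users and requests over the benchmark run.
total_output_tokens = 51_200  # hypothetical: tokens produced across all requests
benchmark_duration = 60.0     # seconds
throughput = total_output_tokens / benchmark_duration           # output tokens per second

print(f"TTFT: {ttft:.2f} s, TPOT: {tpot * 1000:.1f} ms/token, "
      f"latency: {latency:.2f} s, throughput: {throughput:.0f} tokens/s")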
Complete the procedure below to run a benchmark test that shows how AI Inference Server, and other inference servers, perform according to these metrics.
Prerequisites
- AI Inference Server container image
- GitHub account
- Python 3.9 or higher
Procedure
On your host system, start an AI Inference Server container and serve a model.
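For example, the following podman command is a minimal sketch that assumes an NVIDIA GPU host with Container Device Interface (CDI) support and that the container image's entrypoint starts vLLM. Replace <rhaiis_image> with the AI Inference Server container image from the prerequisites, and set HF_TOKEN if the model requires Hugging Face authentication. The model and port match the benchmark command used later in this procedure.

$ podman run --rm -it \
    --device nvidia.com/gpu=all \
    --shm-size=4GB \
    -p 8000:8000 \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    <rhaiis_image> \
    --model RedHatAI/Llama-3.2-1B-Instruct-FP8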
In a separate terminal tab, install the benchmark tool dependencies.
$ pip install vllm pandas datasets
Clone the vLLM Git repository:
$ git clone https://github.com/vllm-project/vllm.git
Run the ./vllm/benchmarks/benchmark_serving.py script:
$ python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 100 --dataset-name random --random-input 1024 --random-output 512 --port 8000
Verification
The benchmark results show how AI Inference Server performs according to the key server metrics described above.
Try changing the parameters of this benchmark and running it again, and notice how vllm as a backend compares to the other backend options. Throughput should be consistently higher, while latency should be lower. An example of rerunning the benchmark with different parameters follows the lists below.
- Other options for --backend are: tgi, lmdeploy, deepspeed-mii, openai, and openai-chat.
- Other options for --dataset-name are: sharegpt, burstgpt, sonnet, random, and hf.
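For example, to rerun the benchmark with longer inputs, shorter outputs, and more requests, adjust only the parameters already shown above (the values here are illustrative):

$ python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 500 --dataset-name random --random-input 2048 --random-output 256 --port 8000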
Additional resources
- vLLM documentation
- LLM Inference Performance Engineering: Best Practices, by Mosaic AI Research, which explains metrics such as throughput and latency