Chapter 6. Validating Red Hat AI Inference Server benefits using key metrics


Use the following metrics to evaluate the performance of a large language model served with AI Inference Server:

  • Time to first token (TTFT): The time from when a request is sent to when the first token of the response is received.
  • Time per output token (TPOT): The average time it takes to generate each token after the first one.
  • Latency: The total time required to generate the full response.
  • Throughput: The total number of output tokens per second that the model produces across all concurrent users and requests.
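
These metrics are related. For a streamed response, the end-to-end latency of a single request is roughly the time to first token plus the time per output token multiplied by the number of remaining tokens, and throughput is the total number of generated tokens divided by the benchmark duration. The following minimal Python sketch, using hypothetical values, illustrates the arithmetic:

    # Hypothetical per-request figures, for illustration only.
    ttft_s = 0.19          # time to first token, in seconds
    tpot_s = 0.009         # time per output token after the first, in seconds
    output_tokens = 512    # tokens generated for the request

    # The first token arrives after TTFT; each later token adds roughly one TPOT.
    latency_s = ttft_s + tpot_s * (output_tokens - 1)
    print(f"Approximate single-request latency: {latency_s:.2f} s")

    # Throughput aggregates all concurrent requests in a benchmark run.
    total_generated_tokens = 40_000   # hypothetical total across all requests
    benchmark_duration_s = 4.6
    print(f"Output token throughput: {total_generated_tokens / benchmark_duration_s:.0f} tok/s")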

Complete the following procedure to run a benchmark that shows how AI Inference Server and other inference servers perform on these metrics.

Prerequisites

  • AI Inference Server container image
  • Hugging Face token exported in your shell as HF_TOKEN (referenced by the serving command in step 1)
  • GitHub account
  • Python 3.9 or higher

Procedure

  1. On your host system, start an AI Inference Server container and serve a model. A quick check that the server is responding is sketched after this procedure.

    $ podman run --rm -it --device nvidia.com/gpu=all \
    --shm-size=4GB -p 8000:8000 \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "HF_HUB_OFFLINE=0" \
    -v ./rhaiis-cache:/opt/app-root/src/.cache \
    --security-opt=label=disable \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1 \
    --model RedHatAI/Llama-3.2-1B-Instruct-FP8
  2. In a separate terminal tab, install the benchmark tool dependencies.

    $ pip install vllm pandas datasets
  3. Clone the vLLM Git repository:

    $ git clone https://github.com/vllm-project/vllm.git
  4. Run the ./vllm/benchmarks/benchmark_serving.py script.

    $ python vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
    --num-prompts 100 \
    --dataset-name random \
    --random-input 1024 \
    --random-output 512 \
    --port 8000
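
Before running the benchmark in step 4, you can confirm that the server started in step 1 is responding and get a rough feel for single-request behavior. The following sketch is not part of the benchmark tool; it assumes the server is reachable at localhost:8000 with the model name used above, sends one streamed request to the OpenAI-compatible completions endpoint served by the container, and prints an approximate time to first token and total latency:

    import json
    import time
    import urllib.request

    # Assumes the container from step 1 is serving on localhost:8000.
    URL = "http://localhost:8000/v1/completions"
    MODEL = "RedHatAI/Llama-3.2-1B-Instruct-FP8"

    payload = json.dumps({
        "model": MODEL,
        "prompt": "Briefly explain what an inference server does.",
        "max_tokens": 128,
        "stream": True,   # stream tokens so the first-token time is observable
    }).encode("utf-8")

    request = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"}
    )

    start = time.perf_counter()
    first_chunk_time = None
    chunks = 0

    with urllib.request.urlopen(request) as response:
        for raw_line in response:                 # server-sent events, one per line
            line = raw_line.decode("utf-8").strip()
            if not line.startswith("data:") or line.endswith("[DONE]"):
                continue
            if first_chunk_time is None:
                first_chunk_time = time.perf_counter() - start
            chunks += 1

    total_time = time.perf_counter() - start
    print(f"Rough TTFT:            {first_chunk_time:.3f} s")
    print(f"Total request latency: {total_time:.3f} s")
    print(f"Streamed chunks:       {chunks}")   # a chunk is not always exactly one token

A single hand-made request is only a sanity check; the benchmark script in step 4 measures TTFT, TPOT, and throughput under concurrent load.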

Verification

The results show how AI Inference Server performs according to key server metrics:

============ Serving Benchmark Result ============
Successful requests:                    100
Benchmark duration (s):                 4.61
Total input tokens:                     102300
Total generated tokens:                 40493
Request throughput (req/s):             21.67
Output token throughput (tok/s):        8775.85
Total Token throughput (tok/s):         30946.83
---------------Time to First Token----------------
Mean TTFT (ms):                         193.61
Median TTFT (ms):                       193.82
P99 TTFT (ms):                          303.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                         9.06
Median TPOT (ms):                       8.57
P99 TPOT (ms):                          13.57
---------------Inter-token Latency----------------
Mean ITL (ms):                          8.54
Median ITL (ms):                        8.49
P99 ITL (ms):                           13.14
==================================================
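
The aggregate figures in this report can be cross-checked against one another: request throughput is the number of successful requests divided by the benchmark duration, output token throughput is the total generated tokens divided by the duration, and total token throughput also counts the input tokens. The following short sketch reuses the numbers above; small differences from the reported values come from the duration being rounded to two decimal places:

    # Figures copied from the benchmark report above.
    duration_s = 4.61
    total_input_tokens = 102_300
    total_generated_tokens = 40_493
    successful_requests = 100

    print(f"Request throughput:      {successful_requests / duration_s:.2f} req/s")
    print(f"Output token throughput: {total_generated_tokens / duration_s:.2f} tok/s")
    print(f"Total token throughput:  {(total_input_tokens + total_generated_tokens) / duration_s:.2f} tok/s")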

Try changing the benchmark parameters and running it again; a sketch for sweeping one parameter follows the list below. Compare the vllm backend with the other supported backends: its throughput should be consistently higher and its latency lower.

  • Other options for --backend are: tgi, lmdeploy, deepspeed-mii, openai, and openai-chat
  • Other options for --dataset-name are: sharegpt, burstgpt, sonnet, random, hf
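
One way to automate such comparisons is a small wrapper script. The following sketch is a convenience wrapper rather than part of vLLM; it assumes the server from step 1 is still serving the model on port 8000 and that the repository was cloned into ./vllm as in step 3. It re-runs the benchmark at increasing request counts so that you can compare the reported throughput, TTFT, and TPOT as the load grows:

    import subprocess

    # Assumes the AI Inference Server container from step 1 is still running
    # and the vLLM repository was cloned into ./vllm as in step 3.
    MODEL = "RedHatAI/Llama-3.2-1B-Instruct-FP8"

    for num_prompts in ("50", "100", "200"):
        print(f"\n=== num-prompts: {num_prompts} ===")
        subprocess.run(
            [
                "python", "vllm/benchmarks/benchmark_serving.py",
                "--backend", "vllm",
                "--model", MODEL,
                "--num-prompts", num_prompts,
                "--dataset-name", "random",
                "--random-input", "1024",
                "--random-output", "512",
                "--port", "8000",
            ],
            check=True,
        )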
