
Chapter 11. Validating Red Hat AI Inference Server benefits using key metrics


Use the following metrics to evaluate the performance of the LLM served with AI Inference Server:

  • Time to first token (TTFT): The time from when a request is sent to when the first token of the response is received.
  • Time per output token (TPOT): The average time it takes to generate each token after the first one.
  • Latency: The total time required to generate the full response.
  • Throughput: The total number of output tokens the model can produce per second across all users and requests.
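The relationship between these metrics can be sketched in a few lines of Python. This is a minimal illustration, not how the benchmark tool computes them: the function names and the timestamps are hypothetical, and throughput is treated here as output tokens per second of benchmark duration, which is how the benchmark output reports it.

```python
from typing import Dict, List


def request_metrics(sent_at: float, token_times: List[float]) -> Dict[str, float]:
    """Compute per-request metrics from a send time and token arrival times (seconds)."""
    if not token_times:
        raise ValueError("at least one token arrival time is required")
    ttft = token_times[0] - sent_at        # time to first token
    latency = token_times[-1] - sent_at    # time to the full response
    later_tokens = len(token_times) - 1
    # TPOT averages the time after the first token over the remaining tokens.
    tpot = (latency - ttft) / later_tokens if later_tokens > 0 else 0.0
    return {"ttft": ttft, "tpot": tpot, "latency": latency}


def output_throughput(total_output_tokens: int, duration_s: float) -> float:
    """Output tokens per second across all requests in a benchmark run."""
    return total_output_tokens / duration_s


# Hypothetical request: sent at t=0.0 s, five tokens arriving afterwards.
metrics = request_metrics(0.0, [0.5, 0.625, 0.75, 0.875, 1.0])
print(metrics)
```

For a single request, latency is approximately TTFT plus TPOT multiplied by the number of tokens after the first, which is why a low TTFT alone does not guarantee a fast full response.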

Complete the procedure below to run a benchmark test that shows how AI Inference Server, and other inference servers, perform according to these metrics.

Prerequisites

  • AI Inference Server container image
  • GitHub account
  • Python 3.9 or higher

Procedure

  1. On your host system, start an AI Inference Server container and serve a model.

    $ podman run --rm -it --device nvidia.com/gpu=all \
    --shm-size=4GB -p 8000:8000 \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "HF_HUB_OFFLINE=0" \
    -v ./rhaiis-cache:/opt/app-root/src/.cache \
    --security-opt=label=disable \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
    --model RedHatAI/Llama-3.2-1B-Instruct-FP8
  2. In a separate terminal tab, install the benchmark tool dependencies.

    $ pip install vllm pandas datasets
  3. Clone the vLLM Git repository:

    $ git clone https://github.com/vllm-project/vllm.git
  4. Run the ./vllm/benchmarks/benchmark_serving.py script.

    $ python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 100 --dataset-name random --random-input 1024 --random-output 512 --port 8000

Verification

The results show how AI Inference Server performs according to key server metrics:

============ Serving Benchmark Result ============
Successful requests:                    100
Benchmark duration (s):                 4.61
Total input tokens:                     102300
Total generated tokens:                 40493
Request throughput (req/s):             21.67
Output token throughput (tok/s):        8775.85
Total Token throughput (tok/s):         30946.83
---------------Time to First Token----------------
Mean TTFT (ms):                         193.61
Median TTFT (ms):                       193.82
P99 TTFT (ms):                          303.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                         9.06
Median TPOT (ms):                       8.57
P99 TPOT (ms):                          13.57
---------------Inter-token Latency----------------
Mean ITL (ms):                          8.54
Median ITL (ms):                        8.49
P99 ITL (ms):                           13.14
==================================================
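As a sanity check, the throughput figures can be recomputed from the token counts and the benchmark duration. The following sketch parses a copy of the sample result above and compares recomputed values with the reported ones. The helper names are hypothetical, and small differences are expected because the printed duration is rounded to two decimal places.

```python
import re
from typing import Dict

# Copied from the sample benchmark result shown above.
SAMPLE = """\
Successful requests:                    100
Benchmark duration (s):                 4.61
Total input tokens:                     102300
Total generated tokens:                 40493
Request throughput (req/s):             21.67
Output token throughput (tok/s):        8775.85
Total Token throughput (tok/s):         30946.83
"""


def parse_results(text: str) -> Dict[str, float]:
    """Turn 'label:   value' lines into a dictionary of floats."""
    results = {}
    for line in text.splitlines():
        match = re.match(r"^(.+?):\s+([0-9.]+)\s*$", line)
        if match:
            results[match.group(1).strip()] = float(match.group(2))
    return results


def recompute(results: Dict[str, float]) -> Dict[str, float]:
    """Derive the throughput figures from the raw counts and the duration."""
    duration = results["Benchmark duration (s)"]
    generated = results["Total generated tokens"]
    total = results["Total input tokens"] + generated
    return {
        "Request throughput (req/s)": results["Successful requests"] / duration,
        "Output token throughput (tok/s)": generated / duration,
        "Total Token throughput (tok/s)": total / duration,
    }


reported = parse_results(SAMPLE)
for label, value in recompute(reported).items():
    print(f"{label}: recomputed {value:.2f}, reported {reported[label]:.2f}")
```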

Try changing the parameters of this benchmark and running it again, then compare vllm as a backend with the other options. With vllm, throughput should be consistently higher and latency lower.

  • Other options for --backend are: tgi, lmdeploy, deepspeed-mii, openai, and openai-chat
  • Other options for --dataset-name are: sharegpt, burstgpt, sonnet, and hf
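One way to vary the parameters systematically is to generate the benchmark command lines from a small script. The sketch below only prints the commands rather than running them. It reuses the flags from the procedure above; the function name and the chosen prompt counts are illustrative assumptions, and comparing a different backend assumes that a matching inference server is already listening on the same port.

```python
import shlex
from typing import List

# Flags taken from the benchmark command in the procedure above.
BASE_COMMAND = [
    "python", "vllm/benchmarks/benchmark_serving.py",
    "--model", "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "--dataset-name", "random",
    "--random-input", "1024",
    "--random-output", "512",
    "--port", "8000",
]


def sweep_commands(backends: List[str], prompt_counts: List[int]) -> List[List[str]]:
    """Build one benchmark command per (backend, prompt count) combination."""
    commands = []
    for backend in backends:
        for count in prompt_counts:
            commands.append(
                BASE_COMMAND + ["--backend", backend, "--num-prompts", str(count)]
            )
    return commands


# Illustrative sweep: one backend, three request volumes.
for command in sweep_commands(["vllm"], [50, 100, 200]):
    print(shlex.join(command))
```

To execute a generated command instead of printing it, pass the list to `subprocess.run(command, check=True)` from the directory that contains the cloned vllm repository.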
