第6章主要なメトリクスを使用した Red Hat AI Inference Server の利点の検証

AI Inference Server でサービングされる LLM モデルのパフォーマンスを評価するには、次のメトリクスを使用します。

Time to first token (TTFT): 要求が送信されてから応答の最初のトークンが受信されるまでの時間。
Time per output token (TPOT): 最初のトークンの後に各トークンの生成にかかる平均時間。
レイテンシー: 完全な応答の生成に必要な合計時間。
スループット: モデルがすべてのユーザーとリクエストを合わせた全体で同時に生成できる出力トークンの合計数。

AI Inference Server およびその他の推論サーバーがこれらのメトリクスに従ってどのように動作するかを示すベンチマークテストを実行するには、以下の手順を実行します。

前提条件

AI Inference Server コンテナーイメージ
GitHub アカウント
Python 3.9 以降

手順

ホストシステムで、AI Inference Server を起動し、モデルをサービングします。

podman run --rm -it --device nvidia.com/gpu=all \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
--security-opt=label=disable \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.2 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8

$ podman run --rm -it --device nvidia.com/gpu=all \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
--security-opt=label=disable \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.2 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8

Copy to Clipboard

Toggle word wrap

別のターミナルタブで、ベンチマークツールの依存関係をインストールします。
```
pip install vllm pandas datasets
```
```
$ pip install vllm pandas datasets
```
Copy to Clipboard Toggle word wrap
vLLM Git repository のクローンを作成します。
```
git clone https://github.com/vllm-project/vllm.git
```
```
$ git clone https://github.com/vllm-project/vllm.git
```
Copy to Clipboard Toggle word wrap

./vllm/benchmarks/benchmark_serving.py スクリプトを実行します。

python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 100 --dataset-name random  --random-input 1024 --random-output 512 --port 8000

$ python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 100 --dataset-name random  --random-input 1024 --random-output 512 --port 8000

Copy to Clipboard

Toggle word wrap

検証

結果は、主要なサーバーメトリクスをもとにした AI Inference Server のパフォーマンスを示しています。

============ Serving Benchmark Result ============
Successful requests:                    100
Benchmark duration (s):                 4.61
Total input tokens:                     102300
Total generated tokens:                 40493
Request throughput (req/s):             21.67
Output token throughput (tok/s):        8775.85
Total Token throughput (tok/s):         30946.83
---------------Time to First Token----------------
Mean TTFT (ms):                         193.61
Median TTFT (ms):                       193.82
P99 TTFT (ms):                          303.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                         9.06
Median TPOT (ms):                       8.57
P99 TPOT (ms):                          13.57
---------------Inter-token Latency----------------
Mean ITL (ms):                          8.54
Median ITL (ms):                        8.49
P99 ITL (ms):                           13.14
==================================================

============ Serving Benchmark Result ============
Successful requests:                    100
Benchmark duration (s):                 4.61
Total input tokens:                     102300
Total generated tokens:                 40493
Request throughput (req/s):             21.67
Output token throughput (tok/s):        8775.85
Total Token throughput (tok/s):         30946.83
---------------Time to First Token----------------
Mean TTFT (ms):                         193.61
Median TTFT (ms):                       193.82
P99 TTFT (ms):                          303.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                         9.06
Median TPOT (ms):                       8.57
P99 TPOT (ms):                          13.57
---------------Inter-token Latency----------------
Mean ITL (ms):                          8.54
Median ITL (ms):                        8.49
P99 ITL (ms):                           13.14
==================================================

Copy to Clipboard

Toggle word wrap

このベンチマークのパラメーターを変更して、再度実行してみてください。バックエンドとしての vllm が他のオプションとどのように比較されるかに注目してください。スループットは一貫して高くなり、レイテンシーは低くなるはずです。

--backend の他のオプションは、tgi、lmdeploy、deepspeed-mii、openai、openai-chat です。
--dataset-name の他のオプションは sharegpt、burstgpt、sonnet、random、hf です。

関連情報

vLLM ドキュメント
LLM Inference Performance Engineering: Best Practices、Mosaic AI Research 著。スループットやレイテンシーなどのメトリクスを解説しています。

第6章主要なメトリクスを使用した Red Hat AI Inference Server の利点の検証

詳細情報

試用、購入および販売

コミュニティー

Red Hat ドキュメントについて

多様性を受け入れるオープンソースの強化

会社概要

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

第6章 主要なメトリクスを使用した Red Hat AI Inference Server の利点の検証

詳細情報

試用、購入および販売

コミュニティー

Red Hat ドキュメントについて

多様性を受け入れるオープンソースの強化

会社概要

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

第6章主要なメトリクスを使用した Red Hat AI Inference Server の利点の検証