Chapter 4. Validating Red Hat AI Inference Server benefits using key metrics

Use the following metrics to evaluate the performance of the LLM model served with AI Inference Server:

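Time to first token (TTFT): How long does it take for the model to provide the first token of its response?
Time per output token (TPOT): How long does it take for the model to provide an output token to each user who has sent a request?
Latency: How long does it take for the model to generate a complete response?
Throughput: How many output tokens can the model produce simultaneously, across all users and requests?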
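
The four definitions above can be made concrete with a small calculation. The following is a minimal sketch, assuming you have recorded the send time of each request and the arrival time of each streamed output token; the `RequestTrace` type and the function names are hypothetical and are not part of AI Inference Server or vLLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps, in seconds, recorded for one streamed request (hypothetical type)."""
    sent_at: float
    token_times: List[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first output token."""
    return trace.token_times[0] - trace.sent_at


def latency(trace: RequestTrace) -> float:
    """End-to-end latency: delay between sending and the last output token."""
    return trace.token_times[-1] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token, excluding the first token."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (latency(trace) - ttft(trace)) / (n - 1)


def output_throughput(traces: List[RequestTrace]) -> float:
    """Output tokens per second across all requests in the measured window."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)
```

TTFT, TPOT, and latency are per-request values, which is why a benchmark reports their mean, median, and percentiles, whereas throughput is a single figure for the whole run.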

Complete the following procedure to run a benchmark test that shows how AI Inference Server, and other inference servers, perform according to these metrics.

Prerequisites

AI Inference Server container image
GitHub account
Python 3.9 or higher

Procedure

On your host system, start an AI Inference Server container and serve a model.

```
podman run --rm -it --device nvidia.com/gpu=all \
  --shm-size=4GB -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  -v ./rhaiis-cache:/opt/app-root/src/.cache \
  --security-opt=label=disable \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 \
  --model RedHatAI/Llama-3.2-1B-Instruct-FP8
```
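
Before starting the benchmark, it can help to confirm that the server has finished loading the model. The following is a minimal sketch that polls the OpenAI-compatible `/v1/models` endpoint on the port published above; the helper names, retry count, and delay are illustrative assumptions, not part of the product.

```python
import json
import time
import urllib.request


def served_models(base_url="http://localhost:8000", timeout=5.0):
    """Return the model IDs listed by the OpenAI-compatible /v1/models endpoint."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_for_model(model_id, fetch=served_models, attempts=30, delay_s=10.0):
    """Poll until model_id is listed; return False if it never appears."""
    for _ in range(attempts):
        try:
            if model_id in fetch():
                return True
        except OSError:
            # Connection refused or timed out: the server is not ready yet.
            pass
        time.sleep(delay_s)
    return False
```

Calling `wait_for_model("RedHatAI/Llama-3.2-1B-Instruct-FP8")` from a separate terminal returns `True` once the model served in the previous step is listed. The `fetch` parameter exists only so that the polling logic can be exercised without a running server.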

In a separate terminal tab, install the benchmark tool dependencies.
```
pip install vllm pandas datasets
```

Clone the vLLM Git repository:

```
git clone https://github.com/vllm-project/vllm.git
```

Run the ./vllm/benchmarks/benchmark_serving.py script.

```
python vllm/benchmarks/benchmark_serving.py --backend vllm --model RedHatAI/Llama-3.2-1B-Instruct-FP8 --num-prompts 100 --dataset-name random --random-input 1024 --random-output 512 --port 8000
```

Verification

The results show how AI Inference Server performs against key server metrics:

```
============ Serving Benchmark Result ============
Successful requests:                    100
Benchmark duration (s):                 4.61
Total input tokens:                     102300
Total generated tokens:                 40493
Request throughput (req/s):             21.67
Output token throughput (tok/s):        8775.85
Total Token throughput (tok/s):         30946.83
---------------Time to First Token----------------
Mean TTFT (ms):                         193.61
Median TTFT (ms):                       193.82
P99 TTFT (ms):                          303.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                         9.06
Median TPOT (ms):                       8.57
P99 TPOT (ms):                          13.57
---------------Inter-token Latency----------------
Mean ITL (ms):                          8.54
Median ITL (ms):                        8.49
P99 ITL (ms):                           13.14
==================================================
```
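
The throughput figures in this report follow from the token counts and the benchmark duration, which gives a quick way to sanity-check a run. The following is a minimal sketch, assuming the report has been saved as text; the parser and the 1% tolerance are illustrative choices (the tolerance allows for the duration being printed to two decimal places), and `SAMPLE` holds a subset of the lines shown above.

```python
import re

SAMPLE = """\
Successful requests:                    100
Benchmark duration (s):                 4.61
Total input tokens:                     102300
Total generated tokens:                 40493
Request throughput (req/s):             21.67
Output token throughput (tok/s):        8775.85
Total Token throughput (tok/s):         30946.83
Mean TTFT (ms):                         193.61
Mean TPOT (ms):                         9.06
"""

# Matches "Label: number"; separator and banner lines contain no colon and are skipped.
LINE = re.compile(r"^(?P<label>[^:]+):\s+(?P<value>[-+]?\d+(?:\.\d+)?)\s*$")


def parse_report(text):
    """Turn 'Label: value' lines of the benchmark report into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            metrics[match.group("label").strip()] = float(match.group("value"))
    return metrics


def check_consistency(m, rel_tol=0.01):
    """Recompute each throughput from counts and duration; compare to the reported value."""
    duration = m["Benchmark duration (s)"]
    derived = {
        "Request throughput (req/s)": m["Successful requests"] / duration,
        "Output token throughput (tok/s)": m["Total generated tokens"] / duration,
        "Total Token throughput (tok/s)":
            (m["Total input tokens"] + m["Total generated tokens"]) / duration,
    }
    return {
        label: abs(value - m[label]) <= rel_tol * m[label]
        for label, value in derived.items()
    }
```

For the report above, every check passes: for example, 40493 generated tokens over roughly 4.61 seconds is close to the reported output token throughput of 8775.85 tok/s.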

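Try changing the parameters of this benchmark and running it again. Notice how vllm as a backend compares to the other options. Throughput should be consistently higher, while latency should be lower.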

Other options for --backend are: tgi, lmdeploy, deepspeed-mii, openai, and openai-chat
Other options for --dataset-name are: sharegpt, burstgpt, sonnet, random, and hf
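
One way to vary the parameters systematically is to generate the command lines for a small sweep. The following is a minimal sketch that reuses only the flags from the command shown earlier; the helper name and the chosen values are hypothetical, and other backends or datasets may need additional flags that this document does not cover.

```python
import itertools
import shlex

# Fixed part of the benchmark command, copied from the procedure above.
BASE = [
    "python", "vllm/benchmarks/benchmark_serving.py",
    "--backend", "vllm",
    "--model", "RedHatAI/Llama-3.2-1B-Instruct-FP8",
    "--dataset-name", "random",
    "--port", "8000",
]


def sweep_commands(num_prompts=(100, 200), input_lens=(512, 1024), output_len=512):
    """Build one argument list per (prompt count, input length) combination."""
    commands = []
    for n, input_len in itertools.product(num_prompts, input_lens):
        commands.append(BASE + [
            "--num-prompts", str(n),
            "--random-input", str(input_len),
            "--random-output", str(output_len),
        ])
    return commands


def as_shell_lines(commands):
    """Render the argument lists as shell command lines for review or copy-paste."""
    return [shlex.join(cmd) for cmd in commands]
```

Printing `as_shell_lines(sweep_commands())` yields four commands to run one at a time against the same server, so that the resulting reports differ only in the swept parameters.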

Additional resources

vLLM documentation
LLM Inference Performance Engineering: Best Practices, by Mosaic AI Research, which explains metrics such as throughput and latency
