Chapter 2. Key vLLM server arguments

Four key arguments control how AI Inference Server runs on your hardware:

--tensor-parallel-size: distributes the model across the GPUs on your host. Set it to the number of GPUs that the model is split across.
--gpu-memory-utilization: sets the share of accelerator memory that is used for model weights, activations, and KV cache. Specified as a fraction from 0.0 to 1.0; the default is 0.9. For example, set this value to 0.8 to limit GPU memory consumption by AI Inference Server to 80%. To maximize throughput, use the largest value that is stable for your deployment.
--max-model-len: limits the maximum context length of the model, measured in tokens. Set a lower value to avoid out-of-memory problems when the model's default context length is too large for the available accelerator memory.
--max-num-batched-tokens: limits the number of tokens that are processed per step. Increasing this value improves throughput but can increase output token latency.
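
The value ranges described in this list can be checked before you start the server. The following is a minimal sketch and is not part of AI Inference Server: `is_positive_int` and `validate_server_args` are hypothetical helper names. The sketch assumes that the three count-style arguments must be positive integers and treats a memory utilization of 0 as invalid, which is an assumption of the sketch rather than a documented rule.

```shell
# Hypothetical helpers (not shipped with AI Inference Server): sanity-check
# the four key argument values before passing them to the server.

# Succeeds only if $1 is a non-empty string of digits greater than zero.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
  esac
  [ "$1" -gt 0 ]
}

# Arguments: $1 tensor-parallel-size, $2 gpu-memory-utilization,
#            $3 max-model-len, $4 max-num-batched-tokens
validate_server_args() {
  is_positive_int "$1" || { echo "--tensor-parallel-size must be a positive integer" >&2; return 1; }
  is_positive_int "$3" || { echo "--max-model-len must be a positive integer" >&2; return 1; }
  is_positive_int "$4" || { echo "--max-num-batched-tokens must be a positive integer" >&2; return 1; }

  # --gpu-memory-utilization is a fraction. Reject anything that is not
  # made of digits with at most one decimal point.
  case "$2" in
    ''|*[!0-9.]*|*.*.*)
      echo "--gpu-memory-utilization must be a number" >&2
      return 1
      ;;
  esac

  # Assumption of this sketch: the value must be above 0.0 and at most 1.0.
  awk -v v="$2" 'BEGIN { exit !(v + 0 > 0 && v + 0 <= 1) }' || {
    echo "--gpu-memory-utilization must be above 0.0 and at most 1.0" >&2
    return 1
  }
}
```

For example, `validate_server_args 2 0.8 16384 2048` succeeds, while a memory utilization of 1.5 or a tensor parallel size of 0 makes the function return a non-zero status.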

For example, to run the Red Hat AI Inference Server container and serve a model with vLLM, run the following command, changing the server arguments as required:

$ podman run --rm -it \
--device nvidia.com/gpu=all \
--security-opt=label=disable \
--shm-size=4GB -p 8000:8000 \
--userns=keep-id:uid=1001 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
-v ./rhaii-cache:/opt/app-root/src/.cache \
registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.1 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.8 \
--max-model-len 16384 \
--max-num-batched-tokens 2048
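
After the container starts, it can take some time for the model to load. The following is a minimal sketch of waiting for the server and then listing the served models. It assumes that the server is reachable on localhost port 8000, as mapped by `-p 8000:8000` in the command above, and that it exposes the `/health` and `/v1/models` endpoints of the vLLM OpenAI-compatible server. `wait_for_server` is a hypothetical helper name, and the retry count and delay are illustrative defaults.

```shell
# Hypothetical helper: poll a URL until it answers with an HTTP success
# status or the attempts run out.
# Arguments: $1 URL, $2 number of attempts (default 30),
#            $3 delay between attempts in seconds (default 2)
wait_for_server() {
  url="$1"
  tries="${2:-30}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f: treat HTTP error statuses as failure, -s: silent,
    # -o /dev/null: discard the response body
    if curl -fs -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "server at $url did not become ready" >&2
  return 1
}

# Usage, assuming the port mapping from the podman example:
#   wait_for_server http://localhost:8000/health && \
#     curl -s http://localhost:8000/v1/models
```

If the server does not become ready, a value that is too high for `--gpu-memory-utilization` or `--max-model-len` is one possible cause to check in the container logs.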
