Chapter 1. Key vLLM server arguments
There are 4 key arguments that you use to configure AI Inference Server to run on your hardware:
- --tensor-parallel-size: distributes your model across your host GPUs.
- --gpu-memory-utilization: adjusts accelerator memory utilization for model weights, activations, and KV cache. Specify it as a fraction from 0.0 to 1.0; the default is 0.9. For example, you can set this value to 0.8 to limit GPU memory consumption by AI Inference Server to 80%. Use the largest value that is stable for your deployment to maximize throughput.
- --max-model-len: limits the maximum context length of the model, measured in tokens. Set this to prevent memory problems if the model’s default context length is too long for your hardware.
- --max-num-batched-tokens: limits the maximum number of tokens processed per step. Increasing this value improves throughput but can increase output token latency.
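To make the --gpu-memory-utilization fraction concrete, the following minimal sketch computes the per-GPU memory budget that a given fraction implies. The function name gpu_memory_budget_gib and the 80 GiB device size are assumptions chosen for illustration; they are not part of vLLM or AI Inference Server.

```shell
# Hypothetical helper, not part of vLLM: converts a --gpu-memory-utilization
# fraction into an approximate per-GPU memory budget in GiB.
gpu_memory_budget_gib() {
  total_gib="$1"   # total memory of one GPU in GiB (assumed value)
  fraction="$2"    # value you plan to pass to --gpu-memory-utilization
  awk -v t="$total_gib" -v f="$fraction" 'BEGIN {
    # Reject values outside the documented range; 0 would leave no memory at all.
    if (f <= 0 || f > 1) {
      print "fraction must be greater than 0.0 and at most 1.0" > "/dev/stderr"
      exit 1
    }
    printf "%.1f\n", t * f
  }'
}

# Assumed 80 GiB GPU with the 0.8 setting from the example above.
gpu_memory_budget_gib 80 0.8   # prints 64.0
```

The result is only the ceiling that the server is allowed to use. How that budget is split between weights, activations, and KV cache depends on the model and the other arguments.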
For example, to run the Red Hat AI Inference Server container and serve a model with vLLM, run the following command, changing the server arguments as required:
$ podman run --rm -it --device nvidia.com/gpu=all \
--shm-size=4GB -p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
--env "HF_HUB_OFFLINE=0" \
--env=VLLM_NO_USAGE_STATS=1 \
-v ./rhaiis-cache:/opt/app-root/src/.cache \
--security-opt=label=disable \
registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.0.0 \
--model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
--max-model-len 16384 \
--gpu-memory-utilization 0.8 \
--max-num-batched-tokens 2048 \
--tensor-parallel-size 2
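If you change these values often, one option is to keep them in environment variables and generate the argument string from them. The sketch below is an assumption about how you might organize this; the function name vllm_tuning_args and the variable names are hypothetical and are not defined by vLLM or AI Inference Server. The defaults mirror the values used in the command above.

```shell
# Hypothetical helper: prints the four tuning arguments on one line.
# Each value can be overridden with an environment variable of the same
# (illustrative) name; otherwise the value from the example command is used.
vllm_tuning_args() {
  printf '%s %s %s %s\n' \
    "--max-model-len ${MAX_MODEL_LEN:-16384}" \
    "--gpu-memory-utilization ${GPU_MEMORY_UTILIZATION:-0.8}" \
    "--max-num-batched-tokens ${MAX_NUM_BATCHED_TOKENS:-2048}" \
    "--tensor-parallel-size ${TENSOR_PARALLEL_SIZE:-2}"
}

# Review the generated arguments before appending them to the podman command.
vllm_tuning_args
```

You could then append $(vllm_tuning_args) after the --model argument of the podman command instead of typing each flag by hand.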