Chapter 3. Environment variables
You can use environment variables to configure the system-level installation, build, and logging behavior of AI Inference Server.
VLLM_PORT and VLLM_HOST_IP set the host port and IP address that AI Inference Server uses internally. They are not the port and IP address of the API server. Do not use --host $VLLM_HOST_IP and --port $VLLM_PORT to start the API server.
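For example, a launch script might set the internal variables in the environment and pass the API server address separately on the command line. The following Python sketch is illustrative only; it assumes the upstream vllm serve entry point, and the IP address, port numbers, and model name are placeholders.

```python
import os
import subprocess

env = os.environ.copy()
env["VLLM_HOST_IP"] = "10.0.0.5"  # placeholder: IP address of this node for internal traffic
env["VLLM_PORT"] = "51000"        # placeholder: internal communication port

# The API server address is chosen independently of the variables above.
subprocess.run(
    [
        "vllm", "serve", "example-org/example-model",  # placeholder model name
        "--host", "0.0.0.0",
        "--port", "8000",
    ],
    env=env,
    check=True,
)
```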
All environment variables used by AI Inference Server are prefixed with VLLM_. If you are using Kubernetes, do not name the service vllm, otherwise environment variables set by Kubernetes might conflict with AI Inference Server environment variables. This is because Kubernetes sets environment variables for each service with the capitalized service name as the prefix. For more information, see Kubernetes environment variables.
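If a conflicting Service name cannot be ruled out, a startup check can detect the collision: Kubernetes sets service-link variables in the form NAME_PORT=tcp://&lt;ip&gt;:&lt;port&gt;, which is not the integer port that AI Inference Server expects in VLLM_PORT. The following Python sketch is an illustrative check, not part of the product.

```python
import os

def check_kubernetes_service_collision() -> None:
    """Fail fast if VLLM_PORT looks like a Kubernetes service link
    (tcp://ip:port) rather than an integer port."""
    value = os.environ.get("VLLM_PORT", "")
    if value.startswith("tcp://"):
        raise RuntimeError(
            f"VLLM_PORT={value!r} appears to be injected by Kubernetes; "
            "rename the Service so that it is not called 'vllm'."
        )

check_kubernetes_service_collision()
```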
| Environment variable | Description |
| --- | --- |
| VLLM_TARGET_DEVICE | Target device of vLLM, supporting cuda (default), rocm, neuron, cpu, and openvino. |
| MAX_JOBS | Maximum number of compilation jobs to run in parallel. By default, this is the number of CPUs. |
| NVCC_THREADS | Number of threads to use for nvcc. By default, this is 1. If set, MAX_JOBS is reduced to avoid oversubscribing the CPU. |
| VLLM_USE_PRECOMPILED | If set, AI Inference Server uses precompiled binaries (*.so). |
| VLLM_TEST_USE_PRECOMPILED_NIGHTLY_WHEEL | Whether to force using the nightly wheel in the Python build, for testing. |
| CMAKE_BUILD_TYPE | CMake build type. Available options: "Debug", "Release", "RelWithDebInfo". |
| VERBOSE | If set, AI Inference Server prints verbose logs during installation. |
| VLLM_CONFIG_ROOT | Root directory for AI Inference Server configuration files. |
| VLLM_CACHE_ROOT | Root directory for AI Inference Server cache files. |
| VLLM_HOST_IP | Used in a distributed environment to determine the IP address of the current node. |
| VLLM_PORT | Used in a distributed environment to manually set the communication port. |
| VLLM_RPC_BASE_PATH | Path used for IPC when the frontend API server is running in multi-processing mode. |
| VLLM_USE_MODELSCOPE | If true, loads models from ModelScope instead of the Hugging Face Hub. |
| VLLM_RINGBUFFER_WARNING_INTERVAL | Interval in seconds to log a warning message when the ring buffer is full. |
| CUDA_HOME | Path to the cudatoolkit home directory, under which the bin, include, and lib directories should be located. |
| VLLM_NCCL_SO_PATH | Path to the NCCL library file. Needed for NCCL >= 2.19 because of a bug in PyTorch. |
| LD_LIBRARY_PATH | Used when VLLM_NCCL_SO_PATH is not set; AI Inference Server searches the listed locations for the NCCL library file. |
| VLLM_USE_TRITON_FLASH_ATTN | Flag to control whether AI Inference Server uses Triton Flash Attention. |
| VLLM_FLASH_ATTN_VERSION | Forces AI Inference Server to use a specific flash-attention version (2 or 3); only valid with the flash-attention backend. |
| VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE | Internal flag to enable Dynamo fullgraph capture. |
| LOCAL_RANK | Local rank of the process in the distributed setting, used to determine the GPU device ID. |
| CUDA_VISIBLE_DEVICES | Used to control the visible devices in a distributed setting. |
| VLLM_ENGINE_ITERATION_TIMEOUT_S | Timeout for each iteration in the engine. |
| VLLM_API_KEY | API key for the AI Inference Server API server. |
| S3_ACCESS_KEY_ID | S3 access key ID for tensorizer to load the model from S3. |
| S3_SECRET_ACCESS_KEY | S3 secret access key for tensorizer to load the model from S3. |
| S3_ENDPOINT_URL | S3 endpoint URL for tensorizer to load the model from S3. |
| VLLM_USAGE_STATS_SERVER | URL of the AI Inference Server usage stats server. |
| VLLM_NO_USAGE_STATS | If true, disables collection of usage stats. |
| VLLM_DO_NOT_TRACK | If true, disables tracking of AI Inference Server usage stats. |
| VLLM_USAGE_SOURCE | Source for usage stats collection. |
| VLLM_CONFIGURE_LOGGING | If set to 1, AI Inference Server configures logging using the default configuration or the configuration file specified by VLLM_LOGGING_CONFIG_PATH. |
| VLLM_LOGGING_CONFIG_PATH | Path to the logging configuration file. |
| VLLM_LOGGING_LEVEL | Default logging level for vLLM. |
| VLLM_LOGGING_PREFIX | If set, AI Inference Server prepends this prefix to all log messages. |
| VLLM_LOGITS_PROCESSOR_THREADS | Number of threads used for custom logits processors. |
| VLLM_TRACE_FUNCTION | If set to 1, AI Inference Server traces function calls for debugging. |
| VLLM_ATTENTION_BACKEND | Backend for attention computation, for example "TORCH_SDPA", "FLASH_ATTN", or "XFORMERS". |
| VLLM_USE_FLASHINFER_SAMPLER | If set, AI Inference Server uses the FlashInfer sampler. |
| VLLM_FLASHINFER_FORCE_TENSOR_CORES | Forces FlashInfer to use tensor cores; otherwise heuristics are used. |
| VLLM_PP_LAYER_PARTITION | Pipeline stage partition strategy. |
| VLLM_CPU_KVCACHE_SPACE | CPU key-value cache space (default is 4 GB). |
| VLLM_CPU_OMP_THREADS_BIND | CPU core IDs bound by OpenMP threads. |
| VLLM_CPU_MOE_PREPACK | Whether to use prepack for the MoE layer. Set to false on unsupported CPUs. |
| VLLM_OPENVINO_DEVICE | OpenVINO device selection (default is CPU). |
| VLLM_OPENVINO_KVCACHE_SPACE | OpenVINO key-value cache space (default is 4 GB). |
| VLLM_OPENVINO_CPU_KV_CACHE_PRECISION | Precision of the OpenVINO KV cache. |
| VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS | Enables weights compression during model export by using HF Optimum. |
| VLLM_USE_RAY_SPMD_WORKER | Enables the Ray SPMD worker for execution on all workers. |
| VLLM_USE_RAY_COMPILED_DAG | Uses the Compiled Graph API provided by Ray to optimize control plane overhead. |
| VLLM_USE_RAY_COMPILED_DAG_NCCL_CHANNEL | Enables NCCL communication in the Compiled Graph provided by Ray. |
| VLLM_USE_RAY_COMPILED_DAG_OVERLAP_COMM | Enables GPU communication overlap in the Compiled Graph provided by Ray. |
| VLLM_WORKER_MULTIPROC_METHOD | Specifies the method for multiprocess workers, for example "fork" or "spawn". |
| VLLM_ASSETS_CACHE | Path to the cache for storing downloaded assets. |
| VLLM_IMAGE_FETCH_TIMEOUT | Timeout for fetching images when serving multimodal models (default is 5 seconds). |
| VLLM_VIDEO_FETCH_TIMEOUT | Timeout for fetching videos when serving multimodal models (default is 30 seconds). |
| VLLM_AUDIO_FETCH_TIMEOUT | Timeout for fetching audio when serving multimodal models (default is 10 seconds). |
| VLLM_MM_INPUT_CACHE_GIB | Cache size in GiB for the multimodal input cache (default is 8 GiB). |
| VLLM_XLA_CACHE_PATH | Path to the XLA persistent cache directory (XLA devices only). |
| VLLM_XLA_CHECK_RECOMPILATION | If set, asserts on XLA recompilation after each execution step. |
| VLLM_FUSED_MOE_CHUNK_SIZE | Chunk size for the fused MoE layer (default is 32768). |
| VLLM_NO_DEPRECATION_WARNING | If true, skips deprecation warnings. |
| VLLM_KEEP_ALIVE_ON_ENGINE_DEATH | If true, keeps the OpenAI API server alive even after engine errors. |
| VLLM_ALLOW_LONG_MAX_MODEL_LEN | Allows specifying a maximum sequence length greater than the default length of the model. |
| VLLM_TEST_FORCE_FP8_MARLIN | Forces FP8 Marlin for FP8 quantization regardless of hardware support. |
| VLLM_TEST_FORCE_LOAD_FORMAT | Forces a specific load format. |
| VLLM_RPC_TIMEOUT | Timeout for fetching a response from the backend server. |
| VLLM_PLUGINS | List of plugins to load. |
| VLLM_TORCH_PROFILER_DIR | Directory for saving Torch profiler traces. |
| VLLM_USE_TRITON_AWQ | If set, uses Triton implementations of AWQ. |
| VLLM_ALLOW_RUNTIME_LORA_UPDATING | If set, allows updating LoRA adapters at runtime. |
| VLLM_SKIP_P2P_CHECK | Skips the peer-to-peer capability check. |
| VLLM_DISABLED_KERNELS | List of quantization kernels to disable for performance comparisons. |
| VLLM_USE_V1 | If set, uses the V1 code path. |
| VLLM_ROCM_FP8_PADDING | Pads FP8 weights to 256 bytes for ROCm. |
| Q_SCALE_CONSTANT | Divisor for dynamic query scale factor calculation for the FP8 KV cache. |
| K_SCALE_CONSTANT | Divisor for dynamic key scale factor calculation for the FP8 KV cache. |
| V_SCALE_CONSTANT | Divisor for dynamic value scale factor calculation for the FP8 KV cache. |
| VLLM_ENABLE_V1_MULTIPROCESSING | If set, enables multiprocessing in LLM for the V1 code path. |
| VLLM_LOG_BATCHSIZE_INTERVAL | Time interval for logging batch size. |
| VLLM_SERVER_DEV_MODE | If set, AI Inference Server runs in development mode, enabling additional endpoints for debugging, for example /reset_prefix_cache. |
| VLLM_V1_OUTPUT_PROC_CHUNK_SIZE | Controls the maximum number of requests to handle in a single asyncio task when processing per-token outputs in the V1 AsyncLLM interface. It affects high-concurrency streaming requests. |
| VLLM_MLA_DISABLE | If set, AI Inference Server disables the MLA attention optimizations. |
| VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON | If set, AI Inference Server uses the Triton implementation of moe_align_block_size. |
| VLLM_RAY_PER_WORKER_GPUS | Number of GPUs per worker in Ray. Can be a fraction to allow Ray to schedule multiple actors on a single GPU. |
| VLLM_RAY_BUNDLE_INDICES | Specifies the indices used for the Ray bundle, for each worker. Format: comma-separated list of integers, for example "0,1,2,3". |
| VLLM_CUDART_SO_PATH | Specifies the path to the cudart library file when it cannot be found automatically. |
| VLLM_USE_HPU_CONTIGUOUS_CACHE_FETCH | Enables contiguous cache fetching to avoid costly gather operations on Gaudi3. Only applicable to the HPU contiguous cache. |
| VLLM_DP_RANK | Rank of the process in the data parallel setting. |
| VLLM_DP_SIZE | World size of the data parallel setting. |
| VLLM_DP_MASTER_IP | IP address of the master node in the data parallel setting. |
| VLLM_DP_MASTER_PORT | Port of the master node in the data parallel setting. |
| VLLM_CI_USE_S3 | Whether to use the S3 path for model loading in CI by using RunAI Streamer. |
| VLLM_MARLIN_USE_ATOMIC_ADD | Whether to use atomicAdd reduce in the GPTQ/AWQ Marlin kernel. |
| VLLM_V0_USE_OUTLINES_CACHE | Whether to turn on the outlines cache for V0. This cache is unbounded and on disk, so it is unsafe for environments with malicious users. |
| VLLM_TPU_DISABLE_TOPK_TOPP_OPTIMIZATION | If set, disables TPU-specific optimization for top-k and top-p sampling. |
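Most of these variables are simply exported in the shell or in the container specification before the server starts. The following Python sketch shows the same idea programmatically for a few of the variables above, using the upstream vLLM Python API; the values and the model name are placeholders, not recommended defaults.

```python
import os

# Illustrative values only; choose settings appropriate for your deployment.
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"            # default logging level
os.environ["VLLM_NO_USAGE_STATS"] = "1"               # disable usage stats collection
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"   # select the attention backend explicitly

# Set the variables before the engine is created so they take effect.
from vllm import LLM

llm = LLM(model="example-org/example-model")          # placeholder model name
outputs = llm.generate(["Hello, world"])
print(outputs[0].outputs[0].text)
```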