Chapter 3. Environment variables


You can use environment variables to configure the system-level installation, build, and logging behavior of AI Inference Server.
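Most of these variables are set in the shell or container environment before AI Inference Server starts. As a minimal sketch, assuming you are driving vLLM directly from the Python API inside the server image, you can also export them programmatically before the engine is created. The variable values and the model name below are illustrative only:

```python
import os

# Illustrative values; see Table 3.1 for the full list of variables.
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"         # raise the default log level
os.environ["VLLM_CACHE_ROOT"] = "/var/cache/vllm"  # assumed cache location

# Environment variables are read when vLLM initializes, so set them before
# creating the engine.
from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # any model works; this one is only an example
outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```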

Important

VLLM_PORT and VLLM_HOST_IP set the host port and IP address for internal communication within AI Inference Server. They are not the port and IP address of the API server. Do not use --host $VLLM_HOST_IP and --port $VLLM_PORT to start the API server.
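For example, when launching the server from a wrapper script, keep the internal coordination variables separate from the API server listen address. The sketch below assumes the vllm CLI entry point is available on the PATH, as it is inside the AI Inference Server container image; the addresses, ports, and model name are placeholder values:

```python
import os
import subprocess

env = dict(os.environ)
# Internal coordination address and port for distributed workers (example values).
env["VLLM_HOST_IP"] = "192.0.2.10"
env["VLLM_PORT"] = "51000"

# The API server gets its own listen address and port; do not reuse the values above.
subprocess.run(
    [
        "vllm", "serve", "facebook/opt-125m",  # example model
        "--host", "0.0.0.0",
        "--port", "8000",
    ],
    env=env,
    check=True,
)
```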

Important

All environment variables used by AI Inference Server are prefixed with VLLM_. If you are using Kubernetes, do not name the service vllm; otherwise, environment variables set by Kubernetes might conflict with AI Inference Server environment variables. This is because Kubernetes sets environment variables for each service, using the capitalized service name as the prefix. For more information, see Kubernetes environment variables.
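For a Service named vllm, Kubernetes service discovery injects variables such as VLLM_PORT=tcp://<cluster-ip>:<port>, which collides with the integer port that AI Inference Server expects in VLLM_PORT. The following sketch is a hypothetical pre-flight check, not something AI Inference Server performs itself, that illustrates how to detect that collision:

```python
import os

# If a Kubernetes Service is named "vllm", service discovery injects values like
# VLLM_PORT="tcp://10.0.1.57:8000" (example address), which is not a plain port number.
value = os.environ.get("VLLM_PORT")
if value is not None and not value.isdigit():
    raise RuntimeError(
        f"VLLM_PORT={value!r} is not a plain port number; it was probably injected "
        "by Kubernetes service discovery. Rename the Service so its name does not "
        "map to the VLLM_ prefix."
    )
```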

Table 3.1. AI Inference Server environment variables

VLLM_TARGET_DEVICE

Target device of vLLM, supporting cuda (by default), rocm, neuron, cpu, openvino.

MAX_JOBS

Maximum number of compilation jobs to run in parallel. By default, this is the number of CPUs.

NVCC_THREADS

Number of threads to use for nvcc. By default, this is 1. If set, MAX_JOBS will be reduced to avoid oversubscribing the CPU.

VLLM_USE_PRECOMPILED

If set, AI Inference Server uses precompiled binaries (*.so).

VLLM_TEST_USE_PRECOMPILED_NIGHTLY_WHEEL

Whether to force using the nightly wheel in the Python build for testing.

CMAKE_BUILD_TYPE

CMake build type. Available options: "Debug", "Release", "RelWithDebInfo".

VERBOSE

If set, AI Inference Server prints verbose logs during installation.

VLLM_CONFIG_ROOT

Root directory for AI Inference Server configuration files.

VLLM_CACHE_ROOT

Root directory for AI Inference Server cache files.

VLLM_HOST_IP

Used in a distributed environment to determine the IP address of the current node.

VLLM_PORT

Used in a distributed environment to manually set the communication port.

VLLM_RPC_BASE_PATH

Path used for IPC when the frontend API server is running in multi-processing mode.

VLLM_USE_MODELSCOPE

If true, loads models from ModelScope instead of Hugging Face Hub.

VLLM_RINGBUFFER_WARNING_INTERVAL

Interval in seconds to log a warning message when the ring buffer is full.

CUDA_HOME

Path to the CUDA Toolkit home directory, which should contain the bin, include, and lib directories.

VLLM_NCCL_SO_PATH

Path to the NCCL library file. Needed for versions of NCCL >= 2.19 due to a bug in PyTorch.

LD_LIBRARY_PATH

When VLLM_NCCL_SO_PATH is not set, AI Inference Server searches this path for the NCCL library.

VLLM_USE_TRITON_FLASH_ATTN

Flag to control whether AI Inference Server uses Triton Flash Attention.

VLLM_FLASH_ATTN_VERSION

Force AI Inference Server to use a specific flash-attention version (2 or 3), only valid with the flash-attention backend.

VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE

Internal flag to enable Dynamo fullgraph capture.

LOCAL_RANK

Local rank of the process in the distributed setting, used to determine the GPU device ID.

CUDA_VISIBLE_DEVICES

Used to control the visible devices in a distributed setting.

VLLM_ENGINE_ITERATION_TIMEOUT_S

Timeout for each iteration in the engine.

VLLM_API_KEY

API key for AI Inference Server API server.

S3_ACCESS_KEY_ID

S3 access key ID for tensorizer to load model from S3.

S3_SECRET_ACCESS_KEY

S3 secret access key for tensorizer to load model from S3.

S3_ENDPOINT_URL

S3 endpoint URL for tensorizer to load model from S3.

VLLM_USAGE_STATS_SERVER

URL for AI Inference Server usage stats server.

VLLM_NO_USAGE_STATS

If true, disables collection of usage stats.

VLLM_DO_NOT_TRACK

If true, disables tracking of AI Inference Server usage stats.

VLLM_USAGE_SOURCE

Source for usage stats collection.

VLLM_CONFIGURE_LOGGING

If set to 1, AI Inference Server configures logging using the default configuration or the specified config path.

VLLM_LOGGING_CONFIG_PATH

Path to the logging configuration file. See the example after Table 3.1.

VLLM_LOGGING_LEVEL

Default logging level for vLLM.

VLLM_LOGGING_PREFIX

If set, AI Inference Server prepends this prefix to all log messages.

VLLM_LOGITS_PROCESSOR_THREADS

Number of threads used for custom logits processors.

VLLM_TRACE_FUNCTION

If set to 1, AI Inference Server traces function calls for debugging.

VLLM_ATTENTION_BACKEND

Backend for attention computation, for example, "TORCH_SDPA", "FLASH_ATTN", or "XFORMERS".

VLLM_USE_FLASHINFER_SAMPLER

If set, AI Inference Server uses the FlashInfer sampler.

VLLM_FLASHINFER_FORCE_TENSOR_CORES

Forces FlashInfer to use tensor cores; otherwise uses heuristics.

VLLM_PP_LAYER_PARTITION

Pipeline stage partition strategy.

VLLM_CPU_KVCACHE_SPACE

CPU key-value cache space (default is 4 GB).

VLLM_CPU_OMP_THREADS_BIND

CPU core IDs to which OpenMP threads are bound.

VLLM_CPU_MOE_PREPACK

Whether to use prepack for the MoE layer. You might need to set this to false on unsupported CPUs.

VLLM_OPENVINO_DEVICE

OpenVINO device selection (default is CPU).

VLLM_OPENVINO_KVCACHE_SPACE

OpenVINO key-value cache space (default is 4 GB).

VLLM_OPENVINO_CPU_KV_CACHE_PRECISION

Precision for OpenVINO KV cache.

VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS

Enables weights compression during model export by using HF Optimum.

VLLM_USE_RAY_SPMD_WORKER

Enables Ray SPMD worker for execution on all workers.

VLLM_USE_RAY_COMPILED_DAG

Uses the Compiled Graph API provided by Ray to optimize control plane overhead.

VLLM_USE_RAY_COMPILED_DAG_NCCL_CHANNEL

Enables NCCL communication in the Compiled Graph provided by Ray.

VLLM_USE_RAY_COMPILED_DAG_OVERLAP_COMM

Enables GPU communication overlap in the Compiled Graph provided by Ray.

VLLM_WORKER_MULTIPROC_METHOD

Specifies the method for multiprocess workers, for example, "fork".

VLLM_ASSETS_CACHE

Path to the cache for storing downloaded assets.

VLLM_IMAGE_FETCH_TIMEOUT

Timeout for fetching images when serving multimodal models (default is 5 seconds).

VLLM_VIDEO_FETCH_TIMEOUT

Timeout for fetching videos when serving multimodal models (default is 30 seconds).

VLLM_AUDIO_FETCH_TIMEOUT

Timeout for fetching audio when serving multimodal models (default is 10 seconds).

VLLM_MM_INPUT_CACHE_GIB

Cache size in GiB for the multimodal input cache (default is 8 GiB).

VLLM_XLA_CACHE_PATH

Path to the XLA persistent cache directory (only for XLA devices).

VLLM_XLA_CHECK_RECOMPILATION

If set, asserts on XLA recompilation after each execution step.

VLLM_FUSED_MOE_CHUNK_SIZE

Chunk size for fused MoE layer (default is 32768).

VLLM_NO_DEPRECATION_WARNING

If true, skips deprecation warnings.

VLLM_KEEP_ALIVE_ON_ENGINE_DEATH

If true, keeps the OpenAI API server alive even after engine errors.

VLLM_ALLOW_LONG_MAX_MODEL_LEN

Allows specifying a max sequence length greater than the default length of the model.

VLLM_TEST_FORCE_FP8_MARLIN

Forces FP8 Marlin for FP8 quantization regardless of hardware support.

VLLM_TEST_FORCE_LOAD_FORMAT

Forces a specific load format.

VLLM_RPC_TIMEOUT

Timeout for fetching response from backend server.

VLLM_PLUGINS

List of plugins to load.

VLLM_TORCH_PROFILER_DIR

Directory for saving Torch profiler traces.

VLLM_USE_TRITON_AWQ

If set, uses Triton implementations of AWQ.

VLLM_ALLOW_RUNTIME_LORA_UPDATING

If set, allows updating LoRA adapters at runtime.

VLLM_SKIP_P2P_CHECK

Skips peer-to-peer capability check.

VLLM_DISABLED_KERNELS

List of quantization kernels to disable for performance comparisons.

VLLM_USE_V1

If set, uses V1 code path.

VLLM_ROCM_FP8_PADDING

Pads FP8 weights to 256 bytes for ROCm.

Q_SCALE_CONSTANT

Divisor for dynamic query scale factor calculation for FP8 KV Cache.

K_SCALE_CONSTANT

Divisor for dynamic key scale factor calculation for FP8 KV Cache.

V_SCALE_CONSTANT

Divisor for dynamic value scale factor calculation for FP8 KV Cache.

VLLM_ENABLE_V1_MULTIPROCESSING

If set, enables multiprocessing in LLM for the V1 code path.

VLLM_LOG_BATCHSIZE_INTERVAL

Time interval for logging batch size.

VLLM_SERVER_DEV_MODE

If set, AI Inference Server runs in development mode, enabling additional endpoints for debugging, for example, /reset_prefix_cache.

VLLM_V1_OUTPUT_PROC_CHUNK_SIZE

Controls the maximum number of requests to handle in a single asyncio task for processing per-token outputs in the V1 AsyncLLM interface. It affects high-concurrency streaming requests.

VLLM_MLA_DISABLE

If set, AI Inference Server disables the MLA attention optimizations.

VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON

If set, AI Inference Server uses the Triton implementation of moe_align_block_size, that is, moe_align_block_size_triton in fused_moe.py.

VLLM_RAY_PER_WORKER_GPUS

Number of GPUs per worker in Ray. Can be a fraction to allow Ray to schedule multiple actors on a single GPU.

VLLM_RAY_BUNDLE_INDICES

Specifies the indices used for the Ray bundle for each worker, as a comma-separated list of integers, for example, "0,1,2,3".

VLLM_CUDART_SO_PATH

Specifies the path to the CUDA runtime shared library for cases when the find_loaded_library() method does not work properly.

VLLM_USE_HPU_CONTIGUOUS_CACHE_FETCH

Enables contiguous cache fetching to avoid costly gather operations on Gaudi3. Only applicable to HPU contiguous cache.

VLLM_DP_RANK

Rank of the process in the data parallel setting.

VLLM_DP_SIZE

World size of the data parallel setting.

VLLM_DP_MASTER_IP

IP address of the master node in the data parallel setting.

VLLM_DP_MASTER_PORT

Port of the master node in the data parallel setting.

VLLM_CI_USE_S3

Whether to use the S3 path for model loading in CI by using RunAI Streamer.

VLLM_MARLIN_USE_ATOMIC_ADD

Whether to use atomicAdd reduction in the GPTQ/AWQ Marlin kernel.

VLLM_V0_USE_OUTLINES_CACHE

Whether to turn on the outlines cache for V0. This cache is unbounded and on disk, so it is unsafe for environments with malicious users.

VLLM_TPU_DISABLE_TOPK_TOPP_OPTIMIZATION

If set, disables TPU-specific optimization for top-k & top-p sampling.
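
As an example of combining the logging variables in Table 3.1 (VLLM_CONFIGURE_LOGGING, VLLM_LOGGING_CONFIG_PATH, and VLLM_LOGGING_LEVEL), the sketch below writes a standard Python logging dictConfig to a JSON file and points VLLM_LOGGING_CONFIG_PATH at it. It assumes, based on upstream vLLM behavior, that the file is consumed by logging.config.dictConfig when logging configuration is enabled; the file path, formatter, and handler names are illustrative.

```python
import json
import os

# A minimal Python logging dictConfig for the "vllm" logger (names are examples).
logging_config = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "vllm": {"format": "%(asctime)s %(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "vllm",
            "level": "DEBUG",
        },
    },
    "loggers": {
        "vllm": {"handlers": ["console"], "level": "DEBUG", "propagate": False},
    },
}

with open("/tmp/vllm_logging.json", "w") as f:
    json.dump(logging_config, f)

# Enable logging configuration and point AI Inference Server at the file.
os.environ["VLLM_CONFIGURE_LOGGING"] = "1"
os.environ["VLLM_LOGGING_CONFIG_PATH"] = "/tmp/vllm_logging.json"
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"
```

Set these variables in the environment of the serving process; if the process is launched in a container, pass them through the container runtime or the deployment spec instead.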
