Chapter 5. Troubleshooting
The following troubleshooting information for Red Hat AI Inference Server 3.1 describes common problems related to model loading, memory, model response quality, networking, and GPU drivers. Where available, workarounds for common issues are described.
Most common issues in vLLM relate to installation, model loading, memory management, and GPU communication. Most problems can be resolved by using a correctly configured environment, ensuring compatible hardware and software versions, and following the recommended configuration practices.
For persistent issues, export VLLM_LOGGING_LEVEL=DEBUG to enable debug logging and then check the logs.
$ export VLLM_LOGGING_LEVEL=DEBUG
5.1. Model loading errors
When you run the Red Hat AI Inference Server container image without specifying a user namespace, an unrecognized model error is returned.
$ podman run --rm -it \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  --shm-size=4GB -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" \
  --env=VLLM_NO_USAGE_STATS=1 \
  -v ./rhaiis-cache:/opt/app-root/src/.cache \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.1.0 \
  --model RedHatAI/Llama-3.2-1B-Instruct-FP8

Example output
ValueError: Unrecognized model in RedHatAI/Llama-3.2-1B-Instruct-FP8. Should have a model_type key in its config.json

To resolve this error, pass --userns=keep-id:uid=1001 as a Podman parameter so that the container runs with the correct non-root user (UID 1001).

Sometimes when Red Hat AI Inference Server downloads the model, the download fails or gets stuck. To prevent the model download from hanging, first download the model by using the huggingface-cli. For example:

$ huggingface-cli download <MODEL_ID> --local-dir <DOWNLOAD_PATH>

When serving the model, pass the local model path to vLLM to prevent the model from being downloaded again.
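For example, a minimal sketch that assumes the model was downloaded to a local ./models/<MODEL_ID> directory; the local directory, the container mount path, and the use of HF_HUB_OFFLINE=1 to force offline mode are illustrative:

$ podman run --rm -it \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  --userns=keep-id:uid=1001 \
  --shm-size=4GB -p 8000:8000 \
  --env "HF_HUB_OFFLINE=1" \
  -v ./models:/opt/app-root/src/models \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.1.0 \
  --model /opt/app-root/src/models/<MODEL_ID>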
When Red Hat AI Inference Server loads a model from disk, the process sometimes hangs. Large models consume memory, and if memory runs low, the system slows down as it swaps data between RAM and disk. Slow network file system speeds or a lack of available memory can trigger excessive swapping. This can happen in clusters where file systems are shared between cluster nodes.
Where possible, store the model on a local disk to prevent slowdowns during model loading. Ensure that the system has sufficient CPU memory available and enough CPU capacity to handle the model.
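For example, you can check the available memory and swap usage on the host before loading a large model:

$ free -h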
Sometimes, Red Hat AI Inference Server fails to inspect the model. Errors are reported in the log. For example:
#...
  File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures [''] failed to be inspected. Please check the logs for more details.

The error occurs when vLLM fails to import the model file, which is usually related to missing dependencies or outdated binaries in the vLLM build.
Some model architectures are not supported. Refer to the list of Validated models. For example, the following errors indicate that the model you are trying to use is not supported:
Traceback (most recent call last):
#...
  File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
    for arch in architectures:
TypeError: 'NoneType' object is not iterable

#...
  File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures [''] are not supported for now. Supported architectures: #...

Note: Some architectures, such as DeepSeekV2VL, require the architecture to be explicitly specified by using the --hf_overrides flag, for example:

--hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'
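A hedged sketch of a complete command with this override, assuming the model is served with the vllm serve command; the model ID shown is illustrative:

$ vllm serve deepseek-ai/deepseek-vl2 \
    --hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'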
Sometimes a runtime error occurs on certain hardware when you load 8-bit floating point (FP8) models. FP8 requires GPU hardware acceleration. Errors occur when you load FP8 models such as deepseek-r1 or models that are tagged with the F8_E4M3 tensor type. For example:

triton.compiler.errors.CompilationError: at 1:0:
def _per_token_group_quant_fp8(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[rank0]:[W502 11:12:56.323757996 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

Note: Review Getting started to ensure that your specific accelerator supports FP8 models.
Sometimes when serving a model, a runtime error occurs that is related to the host system. For example, you might see errors in the log like this:
INFO 05-07 19:15:17 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048.
OMP: Error #179: Function Can't open SHM failed:
OMP: System error #0: Success
Traceback (most recent call last):
  File "/opt/app-root/bin/vllm", line 8, in <module>
    sys.exit(main())
..........................
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.

You can work around this issue by passing the --shm-size=2g argument to Podman when you start the vllm container.
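For example, a sketch of the container invocation from the model loading section with the larger shared memory size; other arguments from the earlier example are omitted for brevity:

$ podman run --rm -it \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  --shm-size=2g -p 8000:8000 \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.1.0 \
  --model RedHatAI/Llama-3.2-1B-Instruct-FP8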
5.2. Memory optimization
- If the model is too large to run with a single GPU, you will get out-of-memory (OOM) errors. Use memory optimization options such as quantization, tensor parallelism, or reduced precision to reduce the memory consumption. For more information, see Conserving memory.
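For example, a minimal sketch of serving a model across two GPUs with a reduced context length and a tighter GPU memory budget, assuming the model is served with the vllm serve command; the flag values shown are illustrative and should be tuned for your hardware:

$ vllm serve <MODEL_ID> \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192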
5.3. Generated model response quality
In some scenarios, the quality of the generated model responses might deteriorate after an update.
The source of the default sampling parameters has changed in newer versions. For vLLM version 0.8.4 and higher, the default sampling parameters come from the generation_config.json file that is provided by the model creator. In most cases, this leads to higher quality responses, because the model creator is likely to know which sampling parameters are best for their model. However, in some cases the defaults provided by the model creator can lead to degraded performance.

If you experience this problem, try serving the model with the old vLLM defaults by using the --generation-config vllm server argument.
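For example, assuming the model is served with the vllm serve command:

$ vllm serve <MODEL_ID> --generation-config vllm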
Important: If applying the --generation-config vllm server argument improves the model output, continue to use the vLLM defaults and petition the model creator on Hugging Face to update their default generation_config.json so that it produces better quality generations.
5.4. CUDA accelerator errors
You might experience a self.graph.replay() error when running a model by using CUDA accelerators.

If vLLM crashes and the error trace captures the error somewhere around the self.graph.replay() method in the vllm/worker/model_runner.py module, this is most likely a CUDA error that occurs inside the CUDAGraph class.

To identify the particular CUDA operation that causes the error, add the --enforce-eager server argument to the vllm command line to disable CUDAGraph optimization and isolate the problematic CUDA operation.
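For example, assuming the model is served with the vllm serve command:

$ vllm serve <MODEL_ID> --enforce-eager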
You might experience accelerator and CPU communication problems that are caused by incorrect hardware or driver settings.

NVIDIA Fabric Manager is required on multi-GPU systems with some types of NVIDIA GPUs. The nvidia-fabricmanager package and its associated systemd service might not be installed, or the service might not be running.

Run the diagnostic Python script to check whether the NVIDIA Collective Communications Library (NCCL) and Gloo library components are communicating correctly.
On an NVIDIA system, check the fabric manager status by running the following command:
$ systemctl status nvidia-fabricmanager

On successfully configured systems, the service should be active and running with no errors.
- Running vLLM with tensor parallelism enabled and setting --tensor-parallel-size to a value greater than 1 on NVIDIA Multi-Instance GPU (MIG) hardware causes an AssertionError during the initial model loading or shape checking phase. This typically occurs as one of the first errors when starting vLLM.
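One way to check whether MIG is enabled on the host is to list the GPUs and any MIG devices with nvidia-smi, for example:

$ nvidia-smi -L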
5.5. Networking errors
You might experience network errors with complicated network configurations.
To troubleshoot network issues, search the logs for DEBUG statements where an incorrect IP address is listed, for example:
DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://<incorrect_ip_address>:54641 backend=nccl

To correct the issue, set the correct IP address with the VLLM_HOST_IP environment variable, for example:

$ export VLLM_HOST_IP=<correct_ip_address>

Specify the network interface that is tied to the IP address for NCCL and Gloo:

$ export NCCL_SOCKET_IFNAME=<your_network_interface>
$ export GLOO_SOCKET_IFNAME=<your_network_interface>
5.6. Python multiprocessing errors
You might experience Python multiprocessing warnings or runtime errors. This can be caused by code that is not properly structured for Python multiprocessing. The following is an example console warning:
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously initialized. We must use the `spawn` multiprocessing start method. Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing for more information.

The following is an example Python runtime error:

RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == "__main__":
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

    To fix this issue, refer to the "Safe importing of main module"
    section in https://docs.python.org/3/library/multiprocessing.html

To resolve the runtime error, update your Python code to guard the usage of vllm behind an if __name__ == "__main__": block, for example:

if __name__ == "__main__":
    import vllm

    llm = vllm.LLM(...)
5.7. GPU driver or device pass-through issues
When you run the Red Hat AI Inference Server container image, sometimes it is unclear whether device pass-through errors are being caused by GPU drivers or tools such as the NVIDIA Container Toolkit.
Check that the NVIDIA Container Toolkit that is installed on the host machine can see the host GPUs:

$ nvidia-ctk cdi list

Example output
#...
nvidia.com/gpu=GPU-0fe9bb20-207e-90bf-71a7-677e4627d9a1
nvidia.com/gpu=GPU-10eff114-f824-a804-e7b7-e07e3f8ebc26
nvidia.com/gpu=GPU-39af96b4-f115-9b6d-5be9-68af3abd0e52
nvidia.com/gpu=GPU-3a711e90-a1c5-3d32-a2cd-0abeaa3df073
nvidia.com/gpu=GPU-6f5f6d46-3fc1-8266-5baf-582a4de11937
nvidia.com/gpu=GPU-da30e69a-7ba3-dc81-8a8b-e9b3c30aa593
nvidia.com/gpu=GPU-dc3c1c36-841b-bb2e-4481-381f614e6667
nvidia.com/gpu=GPU-e85ffe36-1642-47c2-644e-76f8a0f02ba7
nvidia.com/gpu=all

Ensure that the NVIDIA accelerator configuration has been created on the host machine:

$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:

$ podman run --rm -it --security-opt=label=disable --device nvidia.com/gpu=all nvcr.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi

Example output
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
| N/A   32C    P0             64W / 400W  |       1MiB / 81920MiB  |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
| N/A   29C    P0             63W / 400W  |       1MiB / 81920MiB  |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+