Getting started with Red Hat AI Inference Server
Preface
Red Hat AI Inference Server is a container image that optimizes serving and inferencing with LLMs. Using AI Inference Server, you can serve and inference models in a way that boosts their performance while reducing their costs.
Chapter 1. About AI Inference Server
AI Inference Server provides enterprise-grade stability and security, building on upstream, open source software. AI Inference Server leverages the upstream vLLM project, which provides state-of-the-art inferencing features.
For example, AI Inference Server uses continuous batching to process requests as they arrive instead of waiting for a full batch to be accumulated. It also uses tensor parallelism to distribute LLM workloads across multiple GPUs. These features provide reduced latency and higher throughput.
To reduce the cost of inferencing models, AI Inference Server uses paged attention. LLMs use a mechanism called attention to understand conversations with users. Normally, attention uses a significant amount of memory, much of which is wasted. Paged attention addresses this memory wastage by provisioning memory for LLMs similar to the way that virtual memory works for operating systems. This approach consumes less memory, which lowers costs.
To verify cost savings and performance gains with AI Inference Server, complete the following procedures:
- Serving and inferencing with AI Inference Server
- Validating Red Hat AI Inference Server benefits using key metrics
Chapter 2. Product and version compatibility
The following table lists the supported product versions for Red Hat AI Inference Server 3.2.
Red Hat AI Inference Server version | vLLM core version | LLM Compressor version |
---|---|---|
3.2.3 | v0.11.0 | v0.8.1 |
3.2.2 | v0.10.1.1 | v0.7.1 |
3.2.1 | v0.10.0 | Not included in this release |
3.2.0 | v0.9.2 | Not included in this release |
Chapter 3. Reviewing AI Inference Server Python packages
You can review the Python packages installed in the Red Hat AI Inference Server container image by running the container with Podman and reviewing the pip list output.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
Procedure
Run the Red Hat AI Inference Server container image with the pip list command to view all installed Python packages. For example:

$ podman run --rm --entrypoint=/bin/bash \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.3 \
    -c "pip list"
To view detailed information about a specific package, run the Podman command with pip show <package_name>. For example:

$ podman run --rm --entrypoint=/bin/bash \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.3 \
    -c "pip show vllm"

Example output

Name: vllm
Version: v0.11.0
Chapter 4. Serving and inferencing with Podman using NVIDIA CUDA AI accelerators
Serve and inference a large language model with Podman and Red Hat AI Inference Server running on NVIDIA CUDA AI accelerators.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with data center grade NVIDIA AI accelerators installed.
For NVIDIA GPUs:
- Install NVIDIA drivers
- Install the NVIDIA Container Toolkit
- If your system has multiple NVIDIA GPUs that use NVSwitch, you must have root access to start Fabric Manager
For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.
Procedure
Open a terminal on your server host, and log in to registry.redhat.io:

$ podman login registry.redhat.io
Pull the NVIDIA CUDA image by running the following command:

$ podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.3
If your system has SELinux enabled, configure SELinux to allow device access:

$ sudo setsebool -P container_use_devices 1
Create a volume and mount it into the container. Adjust the directory permissions so that the container can use it.

$ mkdir -p rhaiis-cache

$ chmod g+rwX rhaiis-cache
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create or append your
HF_TOKEN
Hugging Face token to theprivate.env
file. Source theprivate.env
file.echo "export HF_TOKEN=<your_HF_token>" > private.env
$ echo "export HF_TOKEN=<your_HF_token>" > private.env
Copy to Clipboard Copied! Toggle word wrap Toggle overflow source private.env
$ source private.env
Start the AI Inference Server container image.
For NVIDIA CUDA accelerators, if the host system has multiple GPUs and uses NVSwitch, start NVIDIA Fabric Manager. To detect whether your system uses NVSwitch, check whether files are present in /proc/driver/nvidia-nvswitch/devices/. Starting NVIDIA Fabric Manager requires root privileges.

$ ls /proc/driver/nvidia-nvswitch/devices/
Example output

0000:0c:09.0 0000:0c:0a.0 0000:0c:0b.0 0000:0c:0c.0 0000:0c:0d.0 0000:0c:0e.0

Start NVIDIA Fabric Manager:

$ systemctl start nvidia-fabricmanager

Important: NVIDIA Fabric Manager is only required on systems with multiple GPUs that use NVSwitch. For more information, see NVIDIA Server Architectures.
Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:
$ podman run --rm -it \
    --security-opt=label=disable \
    --device nvidia.com/gpu=all \
    nvcr.io/nvidia/cuda:12.4.1-base-ubi9 \
    nvidia-smi
Start the container.
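For example, the following is a minimal sketch of a serving command that matches the numbered callouts below. The model name, shared memory size, published port, and in-container cache path are illustrative assumptions; adjust them for your deployment:

$ podman run --rm -it \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    --shm-size=4g \
    -p 8000:8000 \
    --userns=keep-id:uid=1001 \
    --env "HF_TOKEN=$HF_TOKEN" \
    -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.3 \
    --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
    --tensor-parallel-size 2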
1. Required for systems where SELinux is enabled. --security-opt=label=disable prevents SELinux from relabeling files in the volume mount. If you choose not to use this argument, your container might not run successfully.
2. If you experience an issue with shared memory, increase --shm-size to 8GB.
3. Maps the host UID to the effective UID of the vLLM process in the container. You can also pass --user=0, but this is less secure than the --userns option. Setting --user=0 runs vLLM as root inside the container.
4. Set and export HF_TOKEN with your Hugging Face API access token.
5. Required for systems where SELinux is enabled. On Debian or Ubuntu operating systems, or when using Docker without SELinux, the :Z suffix is not available.
6. Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
In a separate tab in your terminal, make a request to your model with the API.
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"prompt": "What is the capital of France?", "max_tokens": 50}' \
    http://<your_server_ip>:8000/v1/completions | jq
Chapter 5. Serving and inferencing with Podman using AMD ROCm AI accelerators
Serve and inference a large language model with Podman and Red Hat AI Inference Server running on AMD ROCm AI accelerators.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
- You have access to a Linux server with data center grade AMD ROCm AI accelerators installed.
For AMD GPUs:
For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.
Procedure
Open a terminal on your server host, and log in to registry.redhat.io:

$ podman login registry.redhat.io
Pull the AMD ROCm image by running the following command:

$ podman pull registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3
If your system has SELinux enabled, configure SELinux to allow device access:

$ sudo setsebool -P container_use_devices 1
Create a volume and mount it into the container. Adjust the directory permissions so that the container can use it.

$ mkdir -p rhaiis-cache

$ chmod g+rwX rhaiis-cache
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create or append your
HF_TOKEN
Hugging Face token to theprivate.env
file. Source theprivate.env
file.echo "export HF_TOKEN=<your_HF_token>" > private.env
$ echo "export HF_TOKEN=<your_HF_token>" > private.env
Copy to Clipboard Copied! Toggle word wrap Toggle overflow source private.env
$ source private.env
Start the AI Inference Server container image.
For AMD ROCm accelerators:
Use amd-smi static -a to verify that the container can access the host system GPUs, as shown in the example after the following note.
Note: You must belong to both the video and render groups on AMD systems to use the GPUs. To access GPUs, you must pass the --group-add=keep-groups supplementary groups option into the container.
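For example, a minimal sketch of the verification command; the --entrypoint override and the /dev/kfd and /dev/dri device flags are assumptions based on the standard ROCm device nodes:

$ podman run --rm -it \
    --group-add=keep-groups \
    --security-opt=label=disable \
    --device=/dev/kfd \
    --device=/dev/dri \
    --entrypoint=amd-smi \
    registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3 \
    static -a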
Start the container:
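For example, a minimal sketch that matches the numbered callouts below; the device flags, model name, shared memory size, published port, and in-container cache path are illustrative assumptions:

$ podman run --rm -it \
    --group-add=keep-groups \
    --device=/dev/kfd \
    --device=/dev/dri \
    --security-opt=label=disable \
    --shm-size=4g \
    -p 8000:8000 \
    --env "HF_TOKEN=$HF_TOKEN" \
    -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3 \
    --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
    --tensor-parallel-size 2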
1. --security-opt=label=disable prevents SELinux from relabeling files in the volume mount. If you choose not to use this argument, your container might not run successfully.
2. If you experience an issue with shared memory, increase --shm-size to 8GB.
3. Set --tensor-parallel-size to match the number of GPUs when running the AI Inference Server container on multiple GPUs.
In a separate tab in your terminal, make a request to the model with the API.
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"prompt": "What is the capital of France?", "max_tokens": 50}' \
    http://<your_server_ip>:8000/v1/completions | jq
Chapter 6. Serving and inferencing language models with Podman using Google TPU AI accelerators
Serve and inference a large language model with Podman or Docker and Red Hat AI Inference Server in a Google cloud VM that has Google TPU AI accelerators available.
Prerequisites
- You have access to a Google Cloud TPU VM with Google TPU AI accelerators configured. For more information, see the Google Cloud TPU documentation.
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
For more information about supported vLLM quantization schemes for accelerators, see Supported hardware.
Procedure
Open a terminal on your TPU server host, and log in to registry.redhat.io:

$ podman login registry.redhat.io
Pull the Red Hat AI Inference Server image by running the following command:

$ podman pull registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.3
Optional: Verify that the TPUs are available on the host.
Open a shell prompt in the Red Hat AI Inference Server container. Run the following command:
$ podman run -it --net=host --privileged -e PJRT_DEVICE=TPU --rm --entrypoint /bin/bash registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.3
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify system TPU access and basic operations by running the following Python code in the container shell prompt:
Exit the shell prompt.
$ exit
Create a volume and mount it into the container. Adjust the directory permissions so that the container can use it.
$ mkdir ./.cache/rhaiis

$ chmod g+rwX ./.cache/rhaiis
Add the HF_TOKEN Hugging Face token to the private.env file.

$ echo "export HF_TOKEN=<huggingface_token>" > private.env
Append the HF_HOME variable to the private.env file.

$ echo "export HF_HOME=./.cache/rhaiis" >> private.env
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Source the
private.env
file.source private.env
$ source private.env
Start the AI Inference Server container image:
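For example, a minimal sketch that reuses the --privileged and PJRT_DEVICE=TPU settings from the verification step; the model name, tensor parallel size, and in-container cache path are illustrative assumptions:

$ podman run --rm -it \
    --net=host \
    --privileged \
    -e PJRT_DEVICE=TPU \
    -e HF_TOKEN=$HF_TOKEN \
    -v ./.cache/rhaiis:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.3 \
    --model <model_name> \
    --tensor-parallel-size <number_of_TPU_devices>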
Verification
Check that the AI Inference Server is running. Open a separate tab in your terminal, and make a model request with the API:
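For example, a request that mirrors the completions request used in the previous chapters; the host, prompt, and max_tokens values are illustrative:

$ curl -X POST -H "Content-Type: application/json" \
    -d '{"prompt": "What is the capital of France?", "max_tokens": 50}' \
    http://<your_server_ip>:8000/v1/completions | jq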
Chapter 7. Validating Red Hat AI Inference Server benefits using key metrics
Use the following metrics to evaluate the performance of the LLM model being served with AI Inference Server:
- Time to first token (TTFT): The time from when a request is sent to when the first token of the response is received.
- Time per output token (TPOT): The average time it takes to generate each token after the first one.
- Latency: The total time required to generate the full response.
- Throughput: The total number of output tokens per second that the model produces across all users and requests.
Complete the procedure below to run a benchmark test that shows how AI Inference Server, and other inference servers, perform according to these metrics.
Prerequisites
- AI Inference Server container image
- GitHub account
- Python 3.9 or higher
Procedure
On your host system, start an AI Inference Server container and serve a model.
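For example, a minimal sketch that mirrors the NVIDIA CUDA serving command from Chapter 4 and serves the model used by the benchmark script below; the port and cache path are illustrative assumptions:

$ podman run --rm -it \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    --shm-size=4g \
    -p 8000:8000 \
    --userns=keep-id:uid=1001 \
    --env "HF_TOKEN=$HF_TOKEN" \
    -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.3 \
    --model RedHatAI/Llama-3.2-1B-Instruct-FP8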
In a separate terminal tab, install the benchmark tool dependencies.
$ pip install vllm pandas datasets

Clone the vLLM Git repository:
$ git clone https://github.com/vllm-project/vllm.git
Run the ./vllm/benchmarks/benchmark_serving.py script:

$ python vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --model RedHatAI/Llama-3.2-1B-Instruct-FP8 \
    --num-prompts 100 \
    --dataset-name random \
    --random-input 1024 \
    --random-output 512 \
    --port 8000
Verification
The results show how AI Inference Server performs according to the key server metrics.

Try changing the parameters of this benchmark and running it again. Notice how vllm as a backend compares to other options. Throughput should be consistently higher, while latency should be lower.
- Other options for --backend are: tgi, lmdeploy, deepspeed-mii, openai, and openai-chat.
- Other options for --dataset-name are: sharegpt, burstgpt, sonnet, random, and hf.
Additional resources
- vLLM documentation
- LLM Inference Performance Engineering: Best Practices, by Mosaic AI Research, which explains metrics such as throughput and latency
Chapter 8. Troubleshooting
The following troubleshooting information for Red Hat AI Inference Server 3.2.3 describes common problems related to model loading, memory, model response quality, networking, and GPU drivers. Where available, workarounds for common issues are described.
Most common issues in vLLM relate to installation, model loading, memory management, and GPU communication. Most problems can be resolved by using a correctly configured environment, ensuring compatible hardware and software versions, and following the recommended configuration practices.
For persistent issues, export VLLM_LOGGING_LEVEL=DEBUG to enable debug logging, and then check the logs.

$ export VLLM_LOGGING_LEVEL=DEBUG
8.1. Model loading errors
When you run the Red Hat AI Inference Server container image without specifying a user namespace, an unrecognized model error is returned.
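For example, a minimal sketch of a run that can trigger the error; it is the serving command from Chapter 4 with the --userns option omitted:

$ podman run --rm -it \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    -p 8000:8000 \
    --env "HF_TOKEN=$HF_TOKEN" \
    -v ./rhaiis-cache:/opt/app-root/src/.cache:Z \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.3 \
    --model RedHatAI/Llama-3.2-1B-Instruct-FP8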
Example output
ValueError: Unrecognized model in RedHatAI/Llama-3.2-1B-Instruct-FP8. Should have a model_type key in its config.json

To resolve this error, pass --userns=keep-id:uid=1001 as a Podman parameter so that your host UID is mapped to the effective UID of the vLLM process in the container.

Sometimes when Red Hat AI Inference Server downloads the model, the download fails or gets stuck. To prevent the model download from hanging, first download the model using the huggingface-cli. For example:

$ huggingface-cli download <MODEL_ID> --local-dir <DOWNLOAD_PATH>

When serving the model, pass the local model path to vLLM to prevent the model from being downloaded again.
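For example, a minimal sketch, assuming the downloaded model directory is mounted into the container at /models:

$ podman run --rm -it \
    --device nvidia.com/gpu=all \
    -v <DOWNLOAD_PATH>:/models:Z \
    registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.3 \
    --model /models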
When Red Hat AI Inference Server loads a model from disk, the process sometimes hangs. Large models consume memory, and if memory runs low, the system slows down as it swaps data between RAM and disk. Slow network file system speeds or a lack of available memory can trigger excessive swapping. This can happen in clusters where file systems are shared between cluster nodes.
Where possible, store the model on a local disk to prevent slowdowns during model loading. Ensure that the system has sufficient CPU memory available.
Ensure that your system has enough CPU capacity to handle the model.
Sometimes, Red Hat AI Inference Server fails to inspect the model. Errors are reported in the log. For example:
#... File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures [''] failed to be inspected. Please check the logs for more details.

The error occurs when vLLM fails to import the model file, which is usually related to missing dependencies or outdated binaries in the vLLM build.
Some model architectures are not supported. Refer to the list of Validated models. For example, the following errors indicate that the model you are trying to use is not supported:
Traceback (most recent call last):
#... File "vllm/model_executor/models/registry.py", line xxx, in inspect_model_cls
    for arch in architectures:
TypeError: 'NoneType' object is not iterable

#... File "vllm/model_executor/models/registry.py", line xxx, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures [''] are not supported for now. Supported architectures: #...

Note: Some architectures, such as DeepSeekV2VL, require the architecture to be explicitly specified by using the --hf_overrides flag, for example:

--hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Sometimes a runtime error occurs for certain hardware when you load 8-bit floating point (FP8) models. FP8 requires GPU hardware acceleration. Errors occur when you load FP8 models like
deepseek-r1
or models tagged with theF8_E4M3
tensor type. For example:triton.compiler.errors.CompilationError: at 1:0: def \_per_token_group_quant_fp8( \^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") [rank0]:[W502 11:12:56.323757996 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
triton.compiler.errors.CompilationError: at 1:0: def \_per_token_group_quant_fp8( \^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") [rank0]:[W502 11:12:56.323757996 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Copy to Clipboard Copied! Toggle word wrap Toggle overflow NoteReview Getting started to ensure your specific accelerator is supported. Accelerators that are currently supported for FP8 models include:
Sometimes when serving a model, a runtime error related to the host system occurs and is reported in the log. You can work around this issue by passing the --shm-size=2g argument when starting vllm.
8.2. Memory optimization
- If the model is too large to run with a single GPU, you will get out-of-memory (OOM) errors. Use memory optimization options such as quantization, tensor parallelism, or reduced precision to reduce the memory consumption. For more information, see Conserving memory.
8.3. Generated model response quality
In some scenarios, the quality of the generated model responses might deteriorate after an update.
The source of the default sampling parameters changed in newer versions. For vLLM version 0.8.4 and higher, the default sampling parameters come from the generation_config.json file that is provided by the model creator. In most cases, this leads to higher quality responses, because the model creator is likely to know which sampling parameters are best for their model. However, in some cases the defaults provided by the model creator can lead to degraded performance.

If you experience this problem, try serving the model with the old vLLM defaults by using the --generation-config vllm server argument.

Important: If applying the --generation-config vllm server argument improves the model output, continue to use the vLLM defaults and petition the model creator on Hugging Face to update their default generation_config.json so that it produces better quality generations.
8.4. CUDA accelerator errors
You might experience a self.graph.replay() error when running a model using CUDA accelerators. If vLLM crashes and the error trace captures the error somewhere around the self.graph.replay() method in the vllm/worker/model_runner.py module, this is most likely a CUDA error that occurs inside the CUDAGraph class.

To identify the particular CUDA operation that causes the error, add the --enforce-eager server argument to the vllm command line to disable CUDAGraph optimization and isolate the problematic CUDA operation.

You might experience accelerator and CPU communication problems that are caused by incorrect hardware or driver settings.
NVIDIA Fabric Manager is required for multi-GPU systems with some types of NVIDIA GPUs. The nvidia-fabricmanager package and its associated systemd service might not be installed, or the service might not be running.

Run a diagnostic Python script to check whether the NCCL and Gloo library components are communicating correctly, as shown in the sketch that follows.
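A minimal sketch of such a script, adapted from the upstream vLLM troubleshooting guide; it assumes a single node and is launched with torchrun, one process per GPU:

# save as check_comms.py and run with:
#   torchrun --nproc-per-node=<number_of_GPUs> check_comms.py
import torch
import torch.distributed as dist

# NCCL check: all-reduce a tensor across all GPUs
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)
data = torch.ones(128, device="cuda")
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
assert data.mean().item() == dist.get_world_size()

# Gloo check: repeat the all-reduce on CPU tensors
gloo_group = dist.new_group(backend="gloo")
cpu_data = torch.ones(128)
dist.all_reduce(cpu_data, op=dist.ReduceOp.SUM, group=gloo_group)
assert cpu_data.mean().item() == dist.get_world_size()

print(f"Rank {dist.get_rank()}: NCCL and Gloo communication OK")
dist.destroy_process_group()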
On an NVIDIA system, check the fabric manager status by running the following command:
$ systemctl status nvidia-fabricmanager

On successfully configured systems, the service should be active and running with no errors.
- Running vLLM with tensor parallelism enabled and setting --tensor-parallel-size greater than 1 on NVIDIA Multi-Instance GPU (MIG) hardware causes an AssertionError during the initial model loading or shape checking phase. This typically occurs as one of the first errors when starting vLLM.
8.5. Networking errors
You might experience network errors with complicated network configurations.
To troubleshoot network issues, search the logs for DEBUG statements where an incorrect IP address is listed, for example:
DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://<incorrect_ip_address>:54641 backend=nccl

To correct the issue, set the correct IP address with the VLLM_HOST_IP environment variable, for example:

$ export VLLM_HOST_IP=<correct_ip_address>

Specify the network interface that is tied to the IP address for NCCL and Gloo:
$ export NCCL_SOCKET_IFNAME=<your_network_interface>

$ export GLOO_SOCKET_IFNAME=<your_network_interface>
8.6. Python multiprocessing errors
You might experience Python multiprocessing warnings or runtime errors. This can be caused by code that is not properly structured for Python multiprocessing. The following is an example console warning:
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously initialized. We must use the `spawn` multiprocessing start method. Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing for more information.
vllm
behind anif__name__ = "__main__":
block, for example:if __name__ = "__main__": import vllm llm = vllm.LLM(...)
if __name__ = "__main__": import vllm llm = vllm.LLM(...)
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
8.7. GPU driver or device pass-through issues
When you run the Red Hat AI Inference Server container image, sometimes it is unclear whether device pass-through errors are being caused by GPU drivers or tools such as the NVIDIA Container Toolkit.
Check that the NVIDIA Container Toolkit installed on the host machine can see the host GPUs:

$ nvidia-ctk cdi list

Ensure that the NVIDIA accelerator configuration has been created on the host machine:
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:
$ podman run --rm -it --security-opt=label=disable --device nvidia.com/gpu=all nvcr.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi
Chapter 9. Gathering system information with the vLLM collect environment script
Run the vllm collect-env command from the Red Hat AI Inference Server container to gather system information for troubleshooting AI Inference Server deployments. The command collects system details, hardware configuration, and dependency information that can help diagnose deployment problems and model inference serving issues.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to a Linux server with data center grade AI accelerators installed.
- You have pulled and successfully deployed the Red Hat AI Inference Server container.
Procedure
Open a terminal and log in to registry.redhat.io:

$ podman login registry.redhat.io
Pull the Red Hat AI Inference Server container image for the AI accelerator that is installed. For example, to pull the Red Hat AI Inference Server container for Google Cloud TPUs, run the following command:

$ podman pull registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.3
Run the vllm collect-env command in the container:
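For example, a minimal sketch that overrides the container entrypoint to run vllm collect-env in the TPU image pulled in the previous step; the --privileged and PJRT_DEVICE=TPU settings mirror the TPU serving chapter so that accelerator details can be collected:

$ podman run --rm -it \
    --net=host \
    --privileged \
    -e PJRT_DEVICE=TPU \
    --entrypoint=/bin/bash \
    registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.3 \
    -c "vllm collect-env"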
Verification
The vllm collect-env command output details environment information, including the following:
- System hardware details
- Operating system details
- Python environment and dependencies
- GPU/TPU accelerator information
Review the output for any warnings or errors that might indicate configuration issues. Include the collect-env output for your system when reporting problems to Red Hat Support.
An example Google Cloud TPU report is provided below: