10.7. GPU driver or device pass-through issues
When you run the Red Hat AI Inference Server container image, it can be unclear whether device pass-through errors are caused by the GPU drivers or by tools such as the NVIDIA Container Toolkit.
Check that the NVIDIA Container Toolkit installed on the host machine can see the host GPUs:
$ nvidia-ctk cdi list
Example output
#...
nvidia.com/gpu=GPU-0fe9bb20-207e-90bf-71a7-677e4627d9a1
nvidia.com/gpu=GPU-10eff114-f824-a804-e7b7-e07e3f8ebc26
nvidia.com/gpu=GPU-39af96b4-f115-9b6d-5be9-68af3abd0e52
nvidia.com/gpu=GPU-3a711e90-a1c5-3d32-a2cd-0abeaa3df073
nvidia.com/gpu=GPU-6f5f6d46-3fc1-8266-5baf-582a4de11937
nvidia.com/gpu=GPU-da30e69a-7ba3-dc81-8a8b-e9b3c30aa593
nvidia.com/gpu=GPU-dc3c1c36-841b-bb2e-4481-381f614e6667
nvidia.com/gpu=GPU-e85ffe36-1642-47c2-644e-76f8a0f02ba7
nvidia.com/gpu=all
Ensure that the NVIDIA accelerator configuration has been created on the host machine:
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
Check that the Red Hat AI Inference Server container can access NVIDIA GPUs on the host by running the following command:
$ podman run --rm -it --security-opt=label=disable --device nvidia.com/gpu=all nvcr.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi
Example output
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
| N/A   32C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
| N/A   29C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
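The host-side checks in this section can be partly scripted. The following is a minimal sketch and is not part of the Red Hat AI Inference Server or NVIDIA tooling: the function names count_cdi_gpus and check_cdi_spec are hypothetical helpers introduced here for illustration. The sketch assumes that nvidia-ctk cdi list prints one nvidia.com/gpu=<id> entry per line, as in the example output earlier in this section, and that the CDI specification was written to /etc/cdi/nvidia.yaml.

```shell
#!/bin/sh
# Hypothetical helpers for triaging GPU pass-through problems on the host.
# Assumption: `nvidia-ctk cdi list` prints one "nvidia.com/gpu=<id>" entry per line.

# Read `nvidia-ctk cdi list` output on stdin and print the number of
# individual GPU entries, ignoring the aggregate "nvidia.com/gpu=all" entry.
count_cdi_gpus() {
    # The first grep keeps only CDI GPU entries; the second counts the entries
    # that do not end in "=all". "|| true" keeps the exit status at 0 when the
    # count is 0, because grep exits non-zero when it selects no lines.
    grep -E '^[[:space:]]*nvidia\.com/gpu=' | grep -Evc '=all[[:space:]]*$' || true
}

# Succeed if the CDI specification file exists and is not empty.
# The default path matches the --output value passed to `nvidia-ctk cdi generate`.
check_cdi_spec() {
    [ -s "${1:-/etc/cdi/nvidia.yaml}" ]
}
```

After sourcing the file on the host, running nvidia-ctk cdi list | count_cdi_gpus prints the number of per-GPU CDI devices that the toolkit reports, and check_cdi_spec returns a non-zero status if the specification file is missing or empty. A count of 0 or a failing check_cdi_spec suggests regenerating the CDI specification before investigating the container itself.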