10.7. GPU driver or device passthrough issues
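When you run the Red Hat AI Inference Server container image, it is sometimes unclear whether device passthrough errors are caused by the GPU drivers or by tools such as the NVIDIA Container Toolkit.
Check that the NVIDIA Container Toolkit installed on the host machine can see the host GPUs: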
$ nvidia-ctk cdi list
Example output
#...
nvidia.com/gpu=GPU-0fe9bb20-207e-90bf-71a7-677e4627d9a1
nvidia.com/gpu=GPU-10eff114-f824-a804-e7b7-e07e3f8ebc26
nvidia.com/gpu=GPU-39af96b4-f115-9b6d-5be9-68af3abd0e52
nvidia.com/gpu=GPU-3a711e90-a1c5-3d32-a2cd-0abeaa3df073
nvidia.com/gpu=GPU-6f5f6d46-3fc1-8266-5baf-582a4de11937
nvidia.com/gpu=GPU-da30e69a-7ba3-dc81-8a8b-e9b3c30aa593
nvidia.com/gpu=GPU-dc3c1c36-841b-bb2e-4481-381f614e6667
nvidia.com/gpu=GPU-e85ffe36-1642-47c2-644e-76f8a0f02ba7
nvidia.com/gpu=all

Ensure that the NVIDIA accelerator configuration has been created on the host machine:
$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Check that a container can access the NVIDIA GPUs on the host. The following command runs nvidia-smi in a CUDA base image with all GPUs passed through:
$ podman run --rm -it --security-opt=label=disable --device nvidia.com/gpu=all nvcr.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi
Example output
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:08:01.0 Off |                    0 |
| N/A   32C    P0             64W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:08:02.0 Off |                    0 |
| N/A   29C    P0             63W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
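
The host-side checks in this section can be partly scripted. The following is a minimal sketch, not part of any NVIDIA or Red Hat tooling: the helper names cdi_spec_present and count_cdi_gpus are hypothetical, and the sketch assumes that nvidia-ctk cdi list prints device names in the nvidia.com/gpu=<name> form shown in the example output.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not shipped with
# the NVIDIA Container Toolkit or Red Hat AI Inference Server.

# Succeeds if a non-empty CDI spec file exists at the given path.
# Defaults to the path used with "nvidia-ctk cdi generate" in this section.
cdi_spec_present() {
  [ -s "${1:-/etc/cdi/nvidia.yaml}" ]
}

# Reads "nvidia-ctk cdi list" output on stdin and prints the number of
# nvidia.com/gpu entries, not counting the aggregate "all" device.
# Assumption: device names look like the example output in this section.
count_cdi_gpus() {
  # grep -c prints 0 but exits non-zero when nothing matches;
  # "|| true" keeps the function's exit status at 0 in that case.
  grep -o 'nvidia\.com/gpu=[^[:space:]]*' | grep -vc '=all$' || true
}
```

After sourcing the helpers, running nvidia-ctk cdi list | count_cdi_gpus on the host prints the number of individual GPU devices that the toolkit exposes. With the example output in this section, that would be 8. A count of 0, or a failing cdi_spec_present, points at the host CDI configuration rather than at the container image.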