Chapter 2. Installing the NVIDIA GPU Operator
Install the NVIDIA GPU Operator to use the underlying NVIDIA CUDA AI accelerators that are available in the cluster.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You are logged in as a user with cluster-admin privileges.
- You have installed the Node Feature Discovery Operator.
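
The prerequisites above can be checked from the command line before you start. The following is a minimal sketch, not part of the official procedure: `check_prereqs` is a hypothetical helper name, and `openshift-nfd` is only an assumed namespace for the Node Feature Discovery Operator, so adjust `NFD_NAMESPACE` to match your cluster.

```shell
# Assumption: NFD runs in the openshift-nfd namespace; override if yours differs.
NFD_NAMESPACE="${NFD_NAMESPACE:-openshift-nfd}"

# Hypothetical helper: returns 0 only if all three prerequisites appear to hold.
check_prereqs() {
  # 1. The oc CLI must be available.
  command -v oc >/dev/null 2>&1 || { echo "oc CLI not found"; return 1; }

  # 2. A cluster-admin user can perform any verb on any resource in all namespaces.
  [ "$(oc auth can-i '*' '*' --all-namespaces 2>/dev/null)" = "yes" ] \
    || { echo "current user lacks cluster-admin privileges"; return 1; }

  # 3. At least one pod should exist in the (assumed) NFD namespace.
  [ -n "$(oc get pods -n "$NFD_NAMESPACE" --no-headers 2>/dev/null)" ] \
    || { echo "no NFD pods found in $NFD_NAMESPACE"; return 1; }

  echo "prerequisites look satisfied"
}
```

After sourcing the snippet, run `check_prereqs`. A failure message names the first prerequisite that does not appear to be met.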
Procedure
1. Create a Namespace CR for the NVIDIA GPU Operator:

oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
EOF

2. Create an OperatorGroup CR:

oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
EOF

3. Create a Subscription CR:

oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: "stable"
  installPlanApproval: Manual
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
EOF
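
Because the Subscription sets `installPlanApproval: Manual`, Operator Lifecycle Manager normally waits for the generated InstallPlan to be approved before it installs the Operator. The procedure does not show that step, so the following is only a sketch of one way to do it. It assumes a fresh `nvidia-gpu-operator` namespace that contains a single InstallPlan, and `approve_install_plan` is a hypothetical helper name.

```shell
NAMESPACE="nvidia-gpu-operator"

# Hypothetical helper: approve the first InstallPlan found in the namespace.
approve_install_plan() {
  local plan
  # -o name prints the resource as <type>/<name>, which oc patch accepts directly.
  plan="$(oc get installplan -n "$NAMESPACE" -o name | head -n 1)"
  [ -n "$plan" ] || { echo "no InstallPlan found in $NAMESPACE yet"; return 1; }

  # Setting spec.approved to true allows the installation to proceed.
  oc patch "$plan" -n "$NAMESPACE" --type merge --patch '{"spec":{"approved":true}}'
}
```

Run `approve_install_plan` shortly after creating the Subscription. If it reports that no InstallPlan exists yet, wait a few seconds and retry. If the namespace holds more than one InstallPlan, review them with `oc get installplan -n nvidia-gpu-operator` and approve the intended one by name instead.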
Verification
Run the following command to verify that the NVIDIA GPU Operator deployed successfully:
$ oc get pods -n nvidia-gpu-operator
Example output
NAME                                                  READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-c2rfm                           1/1     Running     0          6m28s
gpu-operator-84b7f5bcb9-vqds7                         1/1     Running     0          39m
nvidia-container-toolkit-daemonset-pgcrf              1/1     Running     0          6m28s
nvidia-cuda-validator-p8gv2                           0/1     Completed   0          99s
nvidia-dcgm-exporter-kv6k8                            1/1     Running     0          6m28s
nvidia-dcgm-tpsps                                     1/1     Running     0          6m28s
nvidia-device-plugin-daemonset-gbn55                  1/1     Running     0          6m28s
nvidia-device-plugin-validator-z7ltr                  0/1     Completed   0          82s
nvidia-driver-daemonset-410.84.202203290245-0-xxgdv   2/2     Running     0          6m28s
nvidia-node-status-exporter-snmsm                     1/1     Running     0          6m28s
nvidia-operator-validator-6pfk6                       1/1     Running     0          6m28s
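
In the example output, every pod is either Running with all containers ready or Completed (the validator pods). Scanning a long pod list by eye is error-prone, so the sketch below filters the table for anything else. `unhealthy_pods` is a hypothetical helper name, and treating "Running with all containers ready, or Completed" as healthy is an assumption drawn from the example output rather than an official health definition.

```shell
# Hypothetical helper: reads `oc get pods` table output on stdin and prints
# "<name> <status> <ready>" for each pod that does not look healthy.
unhealthy_pods() {
  awk 'NR > 1 {
    if ($3 == "Completed") next              # finished validator pods are fine
    split($2, ready, "/")                    # READY column, for example 1/1
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
  }'
}
```

For example, `oc get pods -n nvidia-gpu-operator | unhealthy_pods` prints nothing when the pod list matches the example output above. Any line it does print identifies a pod to inspect with `oc describe pod` or `oc logs`.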