12.2. 安装 NVIDIA GPU 管理仪表板
通过在 OpenShift Container Platform (OCP) 控制台上使用 Helm 安装 NVIDIA GPU 插件来添加 GPU 功能。
OpenShift Console NVIDIA GPU 插件作为 OCP 控制台的远程捆绑包运行。要运行 OpenShift Console NVIDIA GPU 插件,必须将 OCP 控制台实例正在运行。
先决条件
- Red Hat OpenShift 4.11+
- NVIDIA GPU operator
- Helm
流程
使用以下步骤安装 OpenShift Console NVIDIA GPU 插件。
添加 Helm 仓库:
$ helm repo add rh-ecosystem-edge https://rh-ecosystem-edge.github.io/console-plugin-nvidia-gpu
$ helm repo update
在默认的 NVIDIA GPU Operator 命名空间中安装 Helm Chart:
$ helm install -n nvidia-gpu-operator console-plugin-nvidia-gpu rh-ecosystem-edge/console-plugin-nvidia-gpu
输出示例
NAME: console-plugin-nvidia-gpu LAST DEPLOYED: Tue Aug 23 15:37:35 2022 NAMESPACE: nvidia-gpu-operator STATUS: deployed REVISION: 1 NOTES: View the Console Plugin NVIDIA GPU deployed resources by running the following command: $ oc -n {{ .Release.Namespace }} get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu Enable the plugin by running the following command: # Check if a plugins field is specified $ oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}" # if not, then run the following command to enable the plugin $ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-plugin-nvidia-gpu"] } }' --type=merge # if yes, then run the following command to enable the plugin $ oc patch consoles.operator.openshift.io cluster --patch '[{"op": "add", "path": "/spec/plugins/-", "value": "console-plugin-nvidia-gpu" }]' --type=json # add the required DCGM Exporter metrics ConfigMap to the existing NVIDIA operator ClusterPolicy CR: oc patch clusterpolicies.nvidia.com gpu-cluster-policy --patch '{ "spec": { "dcgmExporter": { "config": { "name": "console-plugin-nvidia-gpu" } } } }' --type=merge
仪表板主要依赖于 NVIDIA DCGM Exporter 公开的 Prometheus 指标,但默认公开的指标不足以便仪表板显示所需的量表。因此,DGCM 导出器配置为公开一组自定义指标,如下所示。
apiVersion: v1 data: dcgm-metrics.csv: | DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, gpu utilization. DCGM_FI_DEV_MEM_COPY_UTIL, gauge, mem utilization. DCGM_FI_DEV_ENC_UTIL, gauge, enc utilization. DCGM_FI_DEV_DEC_UTIL, gauge, dec utilization. DCGM_FI_DEV_POWER_USAGE, gauge, power usage. DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, gauge, power mgmt limit. DCGM_FI_DEV_GPU_TEMP, gauge, gpu temp. DCGM_FI_DEV_SM_CLOCK, gauge, sm clock. DCGM_FI_DEV_MAX_SM_CLOCK, gauge, max sm clock. DCGM_FI_DEV_MEM_CLOCK, gauge, mem clock. DCGM_FI_DEV_MAX_MEM_CLOCK, gauge, max mem clock. kind: ConfigMap metadata: annotations: meta.helm.sh/release-name: console-plugin-nvidia-gpu meta.helm.sh/release-namespace: nvidia-gpu-operator creationTimestamp: "2022-10-26T19:46:41Z" labels: app.kubernetes.io/component: console-plugin-nvidia-gpu app.kubernetes.io/instance: console-plugin-nvidia-gpu app.kubernetes.io/managed-by: Helm app.kubernetes.io/name: console-plugin-nvidia-gpu app.kubernetes.io/part-of: console-plugin-nvidia-gpu app.kubernetes.io/version: latest helm.sh/chart: console-plugin-nvidia-gpu-0.2.3 name: console-plugin-nvidia-gpu namespace: nvidia-gpu-operator resourceVersion: "19096623" uid: 96cdf700-dd27-437b-897d-5cbb1c255068
安装 ConfigMap 并编辑 NVIDIA Operator ClusterPolicy CR,以在 DCGM 导出器配置中添加该 ConfigMap。ConfigMap 的安装由 Console Plugin NVIDIA GPU Helm Chart 的新版本完成,但 ClusterPolicy CR 编辑由用户执行。
查看部署的资源:
$ oc -n nvidia-gpu-operator get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu
输出示例
NAME READY STATUS RESTARTS AGE pod/console-plugin-nvidia-gpu-7dc9cfb5df-ztksx 1/1 Running 0 2m6s NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/console-plugin-nvidia-gpu ClusterIP 172.30.240.138 <none> 9443/TCP 2m6s NAME READY UP-TO-DATE AVAILABLE AGE deployment.apps/console-plugin-nvidia-gpu 1/1 1 1 2m6s NAME DESIRED CURRENT READY AGE replicaset.apps/console-plugin-nvidia-gpu-7dc9cfb5df 1 1 1 2m6s