Chapter 3. Installing the AMD GPU Operator
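Install the AMD GPU Operator to use the underlying AMD ROCm AI accelerators that are available in your cluster.
Installing the AMD GPU Operator is a multi-step process: install the Node Feature Discovery (NFD) Operator, then the Kernel Module Management (KMM) Operator, and then install the AMD GPU Operator through the OpenShift OperatorHub.
The AMD GPU Operator is supported only in clusters with full internet access. It is not supported in disconnected environments, because the Operator builds the driver inside the cluster and that build requires full internet access.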
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have installed the following Operators in your cluster:
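Table 3.1. Required Operators

| Operator | Description |
|---|---|
| Service CA Operator | Issues TLS serving certificates for Service objects. Required for certificate signing and authentication between the kube-apiserver and the KMM webhook server. |
| Operator Lifecycle Manager (OLM) | Manages Operator installation and lifecycle maintenance. |
| Machine Config Operator | Manages the operating system configuration of worker and control-plane nodes. Required to configure the kernel blacklist for the amdgpu driver. |
| Cluster Image Registry Operator | The Cluster Image Registry Operator (CIRO) manages the internal container image registry that OpenShift Container Platform clusters use to store and serve container images. Required for driver images that are built and stored in the cluster. |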
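As a quick way to confirm the prerequisites in the table above, the following sketch checks the AVAILABLE column of the matching ClusterOperator resources. It is an illustration under stated assumptions, not part of the documented procedure: the helper name check_required_operators is hypothetical, and the mapping of each required Operator to the ClusterOperator names service-ca, operator-lifecycle-manager, machine-config, and image-registry is an assumption that you should confirm on your own cluster.

```shell
# Hypothetical helper: reads `oc get clusteroperators --no-headers` output on
# stdin and reports every required ClusterOperator that is missing or whose
# AVAILABLE column (third field) is not "True". The names in "required" are
# assumptions about how the Operators in Table 3.1 appear in the cluster.
check_required_operators() {
  awk -v required="service-ca operator-lifecycle-manager machine-config image-registry" '
    { available[$1] = $3 }
    END {
      n = split(required, names, " ")
      bad = 0
      for (i = 1; i <= n; i++) {
        if (available[names[i]] != "True") {
          printf "%s: not available\n", names[i]
          bad = 1
        }
      }
      exit bad
    }'
}

# Run the check only when the oc client is installed; it also needs a
# logged-in session to return useful results.
if command -v oc >/dev/null 2>&1; then
  oc get clusteroperators --no-headers | check_required_operators \
    && echo "all required Operators report Available"
fi
```

The function returns a non-zero status when any listed Operator is missing or unavailable, so it can gate the rest of an installation script. The pod-level checks in the procedure that follows remain the documented way to verify each Operator.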
Procedure
Create the Namespace CR for the AMD GPU Operator:

    oc apply -f - <<EOF
    apiVersion: v1
    kind: Namespace
    metadata:
      name: openshift-amd-gpu
      labels:
        name: openshift-amd-gpu
        openshift.io/cluster-monitoring: "true"
    EOF

Verify that the Service CA Operator is operational. Run the following command:

    $ oc get pods -A | grep service-ca

Example output

    openshift-service-ca-operator   service-ca-operator-7cfd997ddf-llhdg   1/1   Running   7   35d
    openshift-service-ca            service-ca-8675b766d5-vz8gg            1/1   Running   6   35d

Verify that the Machine Config Operator is operational:

    $ oc get pods -A | grep machine-config-daemon

Example output

    openshift-machine-config-operator   machine-config-daemon-sdsjj   2/2   Running   10   35d
    openshift-machine-config-operator   machine-config-daemon-xc6rm   2/2   Running   0    2d21h

Verify that the Cluster Image Registry Operator is operational:

    $ oc get pods -n openshift-image-registry

Example output

    NAME                                               READY   STATUS      RESTARTS   AGE
    cluster-image-registry-operator-58f9dc9976-czt2w   1/1     Running     5          35d
    image-pruner-29259360-2tdrk                        0/1     Completed   0          2d8h
    image-pruner-29260800-v9lkc                        0/1     Completed   0          32h
    image-pruner-29262240-swcmb                        0/1     Completed   0          8h
    image-registry-7b67584cd-sdxpk                     1/1     Running     10         35d
    node-ca-d2kzl                                      1/1     Running     0          2d21h
    node-ca-xxzrw                                      1/1     Running     5          35d

Optional: If you plan to build driver images in the cluster, you must enable the OpenShift internal registry. Run the following commands:
Verify the current registry status:

    $ oc get pods -n openshift-image-registry

    NAME                             READY   STATUS    RESTARTS   AGE
    #...
    image-registry-7b67584cd-sdxpk   1/1     Running   10         36d

Configure the registry storage. The following example patches the cluster to use an emptyDir ephemeral volume, so stored images do not survive a restart of the registry pod. Run the following command:

    $ oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
        --patch '{"spec":{"storage":{"emptyDir":{}}}}'

Enable the registry:

    $ oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
        --patch '{"spec":{"managementState":"Managed"}}'
- Install the Node Feature Discovery (NFD) Operator. See Installing the Node Feature Discovery Operator.
- Install the Kernel Module Management (KMM) Operator. See Installing the Kernel Module Management Operator.
Configure node feature discovery for AMD AI accelerators:

Create a NodeFeatureDiscovery (NFD) custom resource (CR) to detect AMD GPU hardware. For example:

    apiVersion: nfd.openshift.io/v1
    kind: NodeFeatureDiscovery
    metadata:
      name: amd-gpu-operator-nfd-instance
      namespace: openshift-nfd
    spec:
      workerConfig:
        configData: |
          core:
            sleepInterval: 60s
          sources:
            pci:
              deviceClassWhitelist:
                - "0200"
                - "03"
                - "12"
              deviceLabelFields:
                - "vendor"
                - "device"
            custom:
            - name: amd-gpu
              labels:
                feature.node.kubernetes.io/amd-gpu: "true"
              matchAny:
                - matchFeatures:
                    - feature: pci.device
                      matchExpressions:
                        vendor: {op: In, value: ["1002"]}
                        device: {op: In, value: [
                          "740f", # MI210
                        ]}
            - name: amd-vgpu
              labels:
                feature.node.kubernetes.io/amd-vgpu: "true"
              matchAny:
                - matchFeatures:
                    - feature: pci.device
                      matchExpressions:
                        vendor: {op: In, value: ["1002"]}
                        device: {op: In, value: [
                          "74b5", # MI300X VF
                        ]}

Note
Depending on your specific cluster deployment, you might need a NodeFeatureDiscovery CR or a NodeFeatureRule CR. For example, the cluster might already have a NodeFeatureDiscovery resource deployed that you do not want to change. For more information, see Creating node feature discovery rules.
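To make the matching logic of the amd-gpu rule concrete, the following sketch mimics it in plain shell: a device matches only when its PCI vendor ID is 1002 and its device ID is in the rule's list. The helper name is_amd_gpu is hypothetical, the vendor:device input format is an assumption (it is the form that tools such as lspci -nn print), and 10de:20b0 is an arbitrary non-AMD ID used only as a negative case. This is an illustration of the rule, not how NFD itself evaluates it.

```shell
# Hypothetical helper: mirrors the vendor/device match of the "amd-gpu" rule
# in the example CR. Extend the case list if you add device IDs to the rule.
is_amd_gpu() {
  # $1 is a PCI ID in "vendor:device" form, for example "1002:740f".
  vendor=${1%%:*}
  device=${1##*:}
  [ "$vendor" = "1002" ] || return 1
  case "$device" in
    740f) return 0 ;;   # MI210, the only device ID listed in the example rule
    *)    return 1 ;;
  esac
}

# 1002:74b5 is the MI300X VF ID; it belongs to the separate "amd-vgpu" rule,
# so it does not match here.
for id in 1002:740f 1002:74b5 10de:20b0; do
  if is_amd_gpu "$id"; then
    echo "$id matches the amd-gpu rule"
  else
    echo "$id does not match the amd-gpu rule"
  fi
done
```

After the CR is applied and NFD has labeled the nodes, `oc get nodes -l feature.node.kubernetes.io/amd-gpu=true` lists the nodes that matched the rule.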
Create a MachineConfig CR that adds the amdgpu kernel module to the modprobe blacklist, which keeps the in-tree module from loading automatically so that the out-of-tree driver can be used. For example:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: worker  # (1)
      name: amdgpu-module-blacklist
    spec:
      config:
        ignition:
          version: 3.2.0
        storage:
          files:
            - path: "/etc/modprobe.d/amdgpu-blacklist.conf"
              mode: 420
              overwrite: true
              contents:
                source: "data:text/plain;base64,YmxhY2tsaXN0IGFtZGdwdQo="

(1) For single-node OpenShift clusters, set machineconfiguration.openshift.io/role: master.
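The contents.source and mode fields in the MachineConfig example are easier to review once decoded. The following sketch decodes the base64 payload, converts the decimal mode to octal, and re-encodes the text to confirm the round trip. It assumes a base64 utility that accepts -d (as in GNU coreutils); the variable names are illustrative only.

```shell
# Payload copied from the contents.source field of the MachineConfig example.
payload='YmxhY2tsaXN0IGFtZGdwdQo='

# Decode the base64 body of the data: URL to see the text that is written to
# /etc/modprobe.d/amdgpu-blacklist.conf.
decoded=$(printf '%s' "$payload" | base64 -d)
printf 'file contents: %s\n' "$decoded"    # prints: file contents: blacklist amdgpu

# Ignition takes the file mode as a decimal integer; 420 is octal 644.
printf 'mode 420 in octal: %o\n' 420       # prints: mode 420 in octal: 644

# Re-encode the same text, including its trailing newline, to confirm that it
# reproduces the payload. Use this pattern to encode your own modprobe lines.
reencoded=$(printf 'blacklist amdgpu\n' | base64)
printf 're-encoded: %s\n' "$reencoded"
```

Decoding before you apply the CR is a cheap safeguard, because the MachineConfig triggers a node reboot and a typo in the payload is only discovered afterwards.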
Important
After you apply the MachineConfig CR, the Machine Config Operator automatically reboots the selected nodes.

Create a DeviceConfig CR to start the AMD AI accelerator driver installation. For example:

    apiVersion: amd.com/v1alpha1
    kind: DeviceConfig
    metadata:
      name: driver-cr
      namespace: openshift-amd-gpu
    spec:
      driver:
        enable: true
        image: image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod  # (1)
        version: 6.2.2
      selector:
        "feature.node.kubernetes.io/amd-gpu": "true"

(1) By default, you do not need to configure a value for the image field. The default value is shown.
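After you apply the DeviceConfig CR, the AMD GPU Operator collects the system specifications of the worker nodes, builds or retrieves the appropriate driver image, uses KMM to deploy the driver, and finally deploys the ROCm device plugin and the node labeller.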
Verification
Verify that the KMM worker pods are running:

    $ oc get pods -n openshift-kmm

Example output

    NAME                                       READY   STATUS    RESTARTS         AGE
    kmm-operator-controller-774c7ccff6-hr76v   1/1     Running   30 (2d23h ago)   35d
    kmm-operator-webhook-76d7b9555-ltmps       1/1     Running   5                35d

Check the status of the device plugin and the labeller:

    $ oc -n openshift-amd-gpu get pods

Example output

    NAME                                                   READY   STATUS    RESTARTS        AGE
    amd-gpu-operator-controller-manager-59dd964777-zw4bg   1/1     Running   8 (2d23h ago)   9d
    test-deviceconfig-device-plugin-kbrp7                  1/1     Running   0               2d
    test-deviceconfig-metrics-exporter-k5v4x               1/1     Running   0               2d
    test-deviceconfig-node-labeller-fqz7x                  1/1     Running   0               2d

Confirm that the GPU resource labels are applied to the nodes:

    $ oc get node -o json | grep amd.com

Example output

    "amd.com/gpu.cu-count": "304",
    "amd.com/gpu.device-id": "74b5",
    "amd.com/gpu.driver-version": "6.12.12",
    "amd.com/gpu.family": "AI",
    "amd.com/gpu.simd-count": "1216",
    "amd.com/gpu.vram": "191G",
    "beta.amd.com/gpu.cu-count": "304",
    "beta.amd.com/gpu.cu-count.304": "8",
    "beta.amd.com/gpu.device-id": "74b5",
    "beta.amd.com/gpu.device-id.74b5": "8",
    "beta.amd.com/gpu.family": "AI",
    "beta.amd.com/gpu.family.AI": "8",
    "beta.amd.com/gpu.simd-count": "1216",
    "beta.amd.com/gpu.simd-count.1216": "8",
    "beta.amd.com/gpu.vram": "191G",
    "beta.amd.com/gpu.vram.191G": "8",
    "amd.com/gpu": "8",
    "amd.com/gpu": "8",