Chapter 4. Installing the AMD GPU Operator
Install the AMD GPU Operator to use the underlying AMD ROCm AI accelerators that are available in the cluster.
Installing the AMD GPU Operator is a multi-step procedure that requires installing the Node Feature Discovery Operator, the Kernel Module Management Operator (KMM), and then the AMD GPU Operator through the OpenShift OperatorHub.
The AMD GPU Operator is only supported in clusters with full access to the internet; it is not supported in disconnected environments. This is because the Operator builds the driver inside the cluster, which requires full internet access.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have installed the following Operators in the cluster:
Table 4.1. Required Operators

  Service CA Operator
    Issues TLS serving certificates for Service objects. Required for certificate signing and authentication between the kube-apiserver and the KMM webhook server.

  Operator Lifecycle Manager (OLM)
    Manages Operator installation and lifecycle maintenance.

  Machine Config Operator
    Manages the operating system configuration of worker and control-plane nodes. Required for configuring the kernel blacklist for the amdgpu driver.

  Cluster Image Registry Operator
    Manages the internal container image registry that OpenShift Container Platform clusters use to store and serve container images. Required for driver image building and storage in the cluster.
Procedure
- Create the Namespace CR for the AMD GPU Operator:

    $ oc apply -f - <<EOF
    apiVersion: v1
    kind: Namespace
    metadata:
      name: openshift-amd-gpu
      labels:
        name: openshift-amd-gpu
        openshift.io/cluster-monitoring: "true"
    EOF

- Verify that the Service CA Operator is operational. Run the following command:
    $ oc get pods -A | grep service-ca

  Example output

    openshift-service-ca-operator   service-ca-operator-7cfd997ddf-llhdg   1/1   Running   7   35d
    openshift-service-ca            service-ca-8675b766d5-vz8gg            1/1   Running   6   35d

- Verify that the Machine Config Operator is operational:

    $ oc get pods -A | grep machine-config-daemon

  Example output

    openshift-machine-config-operator   machine-config-daemon-sdsjj   2/2   Running   10   35d
    openshift-machine-config-operator   machine-config-daemon-xc6rm   2/2   Running   0    2d21h

- Verify that the Cluster Image Registry Operator is operational:

    $ oc get pods -n openshift-image-registry

  Example output

    NAME                                               READY   STATUS      RESTARTS   AGE
    cluster-image-registry-operator-58f9dc9976-czt2w   1/1     Running     5          35d
    image-pruner-29259360-2tdrk                        0/1     Completed   0          2d8h
    image-pruner-29260800-v9lkc                        0/1     Completed   0          32h
    image-pruner-29262240-swcmb                        0/1     Completed   0          8h
    image-registry-7b67584cd-sdxpk                     1/1     Running     10         35d
    node-ca-d2kzl                                      1/1     Running     0          2d21h
    node-ca-xxzrw                                      1/1     Running     5          35d

- Optional: If you plan to build driver images in the cluster, you must enable the OpenShift internal registry. Run the following commands:
  - Verify the current registry status:

      $ oc get pods -n openshift-image-registry

      NAME                             READY   STATUS    RESTARTS   AGE
      #...
      image-registry-7b67584cd-sdxpk   1/1     Running   10         36d

  - Configure the registry storage. The following example patches an emptyDir ephemeral volume in the cluster. Run the following command:

      $ oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
          --patch '{"spec":{"storage":{"emptyDir":{}}}}'

  - Enable the registry:

      $ oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
          --patch '{"spec":{"managementState":"Managed"}}'
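Keep in mind that emptyDir storage is ephemeral: images stored in the registry are lost whenever the registry pod restarts, so it is suitable only for testing. For longer-lived clusters, persistent storage is preferable. The following is a sketch of the equivalent registry configuration backed by a PersistentVolumeClaim; the empty claim name is an assumption that lets the Operator create and use its default claim, so verify this against your cluster's storage setup:

```yaml
# Hypothetical sketch: spec stanza of the configs.imageregistry.operator.openshift.io/cluster
# resource using PVC-backed storage instead of emptyDir.
spec:
  managementState: Managed
  storage:
    pvc:
      claim: ""   # empty string lets the Operator provision its default claim
```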
- Install the Node Feature Discovery (NFD) Operator. See Installing the Node Feature Discovery Operator.
- Install the Kernel Module Management (KMM) Operator. See Installing the Kernel Module Management Operator.
- Configure node feature discovery for the AMD AI accelerator:
  - Create a NodeFeatureDiscovery (NFD) custom resource (CR) to detect AMD GPU hardware. For example:

      apiVersion: nfd.openshift.io/v1
      kind: NodeFeatureDiscovery
      metadata:
        name: amd-gpu-operator-nfd-instance
        namespace: openshift-nfd
      spec:
        workerConfig:
          configData: |
            core:
              sleepInterval: 60s
            sources:
              pci:
                deviceClassWhitelist:
                  - "0200"
                  - "03"
                  - "12"
                deviceLabelFields:
                  - "vendor"
                  - "device"
              custom:
                - name: amd-gpu
                  labels:
                    feature.node.kubernetes.io/amd-gpu: "true"
                  matchAny:
                    - matchFeatures:
                        - feature: pci.device
                          matchExpressions:
                            vendor: {op: In, value: ["1002"]}
                            device: {op: In, value: [
                              "740f", # MI210
                            ]}
                - name: amd-vgpu
                  labels:
                    feature.node.kubernetes.io/amd-vgpu: "true"
                  matchAny:
                    - matchFeatures:
                        - feature: pci.device
                          matchExpressions:
                            vendor: {op: In, value: ["1002"]}
                            device: {op: In, value: [
                              "74b5", # MI300X VF
                            ]}

    Note
    Depending on your specific cluster deployment, you might require a NodeFeatureDiscovery or NodeFeatureRule CR. For example, the cluster might already have the NodeFeatureDiscovery resource deployed and you don't want to change it. For more information, see Create Node Feature Discovery Rule.
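If you prefer not to modify an existing NodeFeatureDiscovery resource, a separate NodeFeatureRule can carry just the AMD GPU matching logic. The following is a minimal sketch based on the upstream NFD NodeFeatureRule API; the rule name and apiVersion are assumptions and may differ depending on your NFD Operator version:

```yaml
# Hypothetical minimal NodeFeatureRule matching the same AMD PCI IDs
# as the NodeFeatureDiscovery example above.
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: amd-gpu-rule   # illustrative name
spec:
  rules:
    - name: "amd-gpu"
      labels:
        "feature.node.kubernetes.io/amd-gpu": "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["1002"]}
            device: {op: In, value: ["740f"]} # MI210
```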
- Create a MachineConfig CR to add the out-of-tree amdgpu kernel module to the modprobe blacklist. For example:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: worker
      name: amdgpu-module-blacklist
    spec:
      config:
        ignition:
          version: 3.2.0
        storage:
          files:
            - path: "/etc/modprobe.d/amdgpu-blacklist.conf"
              mode: 420
              overwrite: true
              contents:
                source: "data:text/plain;base64,YmxhY2tsaXN0IGFtZGdwdQo="

  Where:

  machineconfiguration.openshift.io/role: worker
    Specifies the node role for the machine configuration. Set this value to master for single-node OpenShift clusters.
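The contents.source field in the MachineConfig embeds the blacklist file as a base64-encoded data URL. You can generate or verify that payload locally:

```shell
# Encode the one-line modprobe rule the way the MachineConfig expects it:
printf 'blacklist amdgpu\n' | base64
# -> YmxhY2tsaXN0IGFtZGdwdQo=

# Decode the payload from the CR to confirm the file contents:
printf '%s' 'YmxhY2tsaXN0IGFtZGdwdQo=' | base64 -d
# -> blacklist amdgpu
```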
  Important
  The Machine Config Operator automatically reboots selected nodes after you apply the MachineConfig CR.

- Create the DeviceConfig CR to start the AMD AI accelerator driver installation. For example:

    apiVersion: amd.com/v1alpha1
    kind: DeviceConfig
    metadata:
      name: driver-cr
      namespace: openshift-amd-gpu
    spec:
      driver:
        enable: true
        image: image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod
        version: 6.2.2
      selector:
        "feature.node.kubernetes.io/amd-gpu": "true"

  Where:

  image: image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod
    Specifies the driver image location. By default, you do not need to configure a value for this field because the default value is used.
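The selector field determines which nodes receive the driver. If your cluster instead exposes virtualized accelerators, which the node feature discovery configuration shown earlier labels as amd-vgpu, a hypothetical variant of the CR would select on that label. This is only a sketch; verify driver and version support for virtual functions in the AMD GPU Operator documentation before using it:

```yaml
# Hypothetical variant: target nodes labelled for virtualized MI300X functions.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: vgpu-driver-cr   # illustrative name
  namespace: openshift-amd-gpu
spec:
  driver:
    enable: true
    version: 6.2.2
  selector:
    "feature.node.kubernetes.io/amd-vgpu": "true"
```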
After you apply the DeviceConfig CR, the AMD GPU Operator collects the worker node system specifications, builds or retrieves the appropriate driver image, uses KMM to deploy the driver, and finally deploys the ROCm device plugin and node labeller.
Verification
Verify that the KMM worker pods are running:
    $ oc get pods -n openshift-kmm

  Example output

    NAME                                       READY   STATUS    RESTARTS        AGE
    kmm-operator-controller-774c7ccff6-hr76v   1/1     Running   30 (2d23h ago)  35d
    kmm-operator-webhook-76d7b9555-ltmps       1/1     Running   5               35d

Check the device plugin and labeller status:
    $ oc -n openshift-amd-gpu get pods

  Example output

    NAME                                                   READY   STATUS    RESTARTS      AGE
    amd-gpu-operator-controller-manager-59dd964777-zw4bg   1/1     Running   8 (2d23h ago) 9d
    test-deviceconfig-device-plugin-kbrp7                  1/1     Running   0             2d
    test-deviceconfig-metrics-exporter-k5v4x               1/1     Running   0             2d
    test-deviceconfig-node-labeller-fqz7x                  1/1     Running   0             2d

Confirm that GPU resource labels are applied to the nodes:
    $ oc get node -o json | grep amd.com

  Example output

    "amd.com/gpu.cu-count": "304",
    "amd.com/gpu.device-id": "74b5",
    "amd.com/gpu.driver-version": "6.12.12",
    "amd.com/gpu.family": "AI",
    "amd.com/gpu.simd-count": "1216",
    "amd.com/gpu.vram": "191G",
    "beta.amd.com/gpu.cu-count": "304",
    "beta.amd.com/gpu.cu-count.304": "8",
    "beta.amd.com/gpu.device-id": "74b5",
    "beta.amd.com/gpu.device-id.74b5": "8",
    "beta.amd.com/gpu.family": "AI",
    "beta.amd.com/gpu.family.AI": "8",
    "beta.amd.com/gpu.simd-count": "1216",
    "beta.amd.com/gpu.simd-count.1216": "8",
    "beta.amd.com/gpu.vram": "191G",
    "beta.amd.com/gpu.vram.191G": "8",
    "amd.com/gpu": "8",