Chapter 4. Installing the AMD GPU Operator


Install the AMD GPU Operator to use the underlying AMD ROCm AI accelerators that are available in the cluster.

Installing the AMD GPU Operator is a multi-step procedure that requires installing the Node Feature Discovery Operator, the Kernel Module Management Operator (KMM), and then the AMD GPU Operator through the OpenShift OperatorHub.

Important

The AMD GPU Operator is supported only in clusters with full access to the internet, not in disconnected environments, because the Operator builds the driver image inside the cluster, which requires internet access.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have logged in as a user with cluster-admin privileges.
  • You have installed the following Operators in the cluster:

    Table 4.1. Required Operators

    Service CA Operator

    Issues TLS serving certificates for Service objects. Required for certificate signing and authentication between the kube-apiserver and the KMM webhook server.

    Operator Lifecycle Manager (OLM)

    Manages Operator installation and lifecycle maintenance.

    Machine Config Operator

    Manages the operating system configuration of worker and control-plane nodes. Required for configuring the kernel blacklist for the amdgpu driver.

    Cluster Image Registry Operator

    The Cluster Image Registry Operator (CIRO) manages the internal container image registry that OpenShift Container Platform clusters use to store and serve container images. Required for driver image building and storage in the cluster.

Procedure

  1. Create the Namespace CR for the AMD GPU Operator:

    $ oc apply -f - <<EOF
    apiVersion: v1
    kind: Namespace
    metadata:
      name: openshift-amd-gpu
      labels:
        name: openshift-amd-gpu
        openshift.io/cluster-monitoring: "true"
    EOF
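    Before continuing, you can confirm that the namespace was created. One way to check:

    ```shell
    $ oc get namespace openshift-amd-gpu
    ```

    The command returns the namespace with a STATUS of Active if the previous step succeeded.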
  2. Verify that the Service CA Operator is operational. Run the following command:

    $ oc get pods -A | grep service-ca

    Example output

    openshift-service-ca-operator   service-ca-operator-7cfd997ddf-llhdg    1/1    Running    7    35d
    openshift-service-ca            service-ca-8675b766d5-vz8gg             1/1    Running    6    35d

  3. Verify that the Machine Config Operator is operational:

    $ oc get pods -A | grep machine-config-daemon

    Example output

    openshift-machine-config-operator   machine-config-daemon-sdsjj   2/2    Running    10   35d
    openshift-machine-config-operator   machine-config-daemon-xc6rm   2/2    Running    0    2d21h

  4. Verify that the Cluster Image Registry Operator is operational:

    $ oc get pods -n openshift-image-registry

    Example output

    NAME                                               READY   STATUS      RESTARTS   AGE
    cluster-image-registry-operator-58f9dc9976-czt2w   1/1     Running     5          35d
    image-pruner-29259360-2tdrk                        0/1     Completed   0          2d8h
    image-pruner-29260800-v9lkc                        0/1     Completed   0          32h
    image-pruner-29262240-swcmb                        0/1     Completed   0          8h
    image-registry-7b67584cd-sdxpk                     1/1     Running     10         35d
    node-ca-d2kzl                                      1/1     Running     0          2d21h
    node-ca-xxzrw                                      1/1     Running     5          35d

  5. Optional: If you plan to build driver images in the cluster, you must enable the OpenShift internal registry. Run the following commands:

    1. Verify current registry status:

      $ oc get pods -n openshift-image-registry

      Example output

      NAME                                               READY   STATUS      RESTARTS   AGE
      #...
      image-registry-7b67584cd-sdxpk                     1/1     Running     10         36d
    2. Configure the registry storage. The following example configures the registry to use an emptyDir ephemeral volume, which is suitable only for testing because stored images are lost when the registry pod restarts. Run the following command:

      $ oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
        --patch '{"spec":{"storage":{"emptyDir":{}}}}'
    3. Enable the registry:

      $ oc patch configs.imageregistry.operator.openshift.io cluster --type merge \
        --patch '{"spec":{"managementState":"Managed"}}'
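      To confirm that the patches were applied, you can read the fields back from the registry configuration. The jsonpath expressions below are one way to do this:

      ```shell
      $ oc get configs.imageregistry.operator.openshift.io cluster \
          -o jsonpath='{.spec.managementState}{"\n"}'
      $ oc get configs.imageregistry.operator.openshift.io cluster \
          -o jsonpath='{.spec.storage}{"\n"}'
      ```

      The first command prints Managed if the registry was enabled; the second prints the storage configuration set in the previous step.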
  6. Install the Node Feature Discovery (NFD) Operator. See Installing the Node Feature Discovery Operator.
  7. Install the Kernel Module Management (KMM) Operator. See Installing the Kernel Module Management Operator.
  8. Configure node feature discovery for the AMD AI accelerator:

    1. Create a NodeFeatureDiscovery (NFD) custom resource (CR) to detect AMD GPU hardware. For example:

      apiVersion: nfd.openshift.io/v1
      kind: NodeFeatureDiscovery
      metadata:
        name: amd-gpu-operator-nfd-instance
        namespace: openshift-nfd
      spec:
        workerConfig:
          configData: |
            core:
              sleepInterval: 60s
            sources:
              pci:
                deviceClassWhitelist:
                  - "0200"
                  - "03"
                  - "12"
                deviceLabelFields:
                  - "vendor"
                  - "device"
              custom:
              - name: amd-gpu
                labels:
                  feature.node.kubernetes.io/amd-gpu: "true"
                matchAny:
                  - matchFeatures:
                      - feature: pci.device
                        matchExpressions:
                          vendor: {op: In, value: ["1002"]}
                          device: {op: In, value: [
                            "740f", # MI210
                          ]}
              - name: amd-vgpu
                labels:
                  feature.node.kubernetes.io/amd-vgpu: "true"
                matchAny:
                  - matchFeatures:
                      - feature: pci.device
                        matchExpressions:
                          vendor: {op: In, value: ["1002"]}
                          device: {op: In, value: [
                            "74b5", # MI300X VF
                          ]}
      Note

      Depending on your cluster deployment, you might require a NodeFeatureDiscovery CR or a NodeFeatureRule CR. For example, the cluster might already have a NodeFeatureDiscovery resource deployed that you do not want to change. For more information, see Create Node Feature Discovery Rule.
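
      However you deliver the rules, you can check that the discovery labels were applied after the NFD worker pods rescan the nodes. The filename below is arbitrary; the label key matches the custom rule in the example CR:

      ```shell
      # Save the CR to a file and apply it (the filename is an example):
      $ oc apply -f amd-gpu-nfd-instance.yaml

      # List the nodes that NFD labelled as having an AMD GPU:
      $ oc get nodes -l feature.node.kubernetes.io/amd-gpu=true
      ```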

  9. Create a MachineConfig CR that adds the in-tree amdgpu kernel module to the modprobe blacklist, so that KMM can load the out-of-tree driver instead. For example:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: worker
      name: amdgpu-module-blacklist
    spec:
      config:
        ignition:
          version: 3.2.0
        storage:
          files:
            - path: "/etc/modprobe.d/amdgpu-blacklist.conf"
              mode: 420
              overwrite: true
              contents:
                source: "data:text/plain;base64,YmxhY2tsaXN0IGFtZGdwdQo="

    Where:

    machineconfiguration.openshift.io/role: worker
    Specifies the node role for the machine configuration. Set this value to master for single-node OpenShift clusters.
    Important

    The Machine Config Operator automatically reboots selected nodes after you apply the MachineConfig CR.
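
    The base64 payload in the contents.source field is the one-line modprobe directive. You can decode it locally to confirm exactly what is written to the node:

    ```shell
    # Decode the Ignition file payload; the directive prevents the
    # in-tree amdgpu module from loading at boot.
    echo "YmxhY2tsaXN0IGFtZGdwdQo=" | base64 -d
    ```

    The command prints `blacklist amdgpu`. You can then watch the rollout with `oc get machineconfigpool worker -w` until the UPDATED column returns to True.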

  10. Create the DeviceConfig CR to start the AMD AI accelerator driver installation. For example:

    apiVersion: amd.com/v1alpha1
    kind: DeviceConfig
    metadata:
      name: driver-cr
      namespace: openshift-amd-gpu
    spec:
      driver:
        enable: true
        image: image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod
        version: 6.2.2
      selector:
        "feature.node.kubernetes.io/amd-gpu": "true"

    Where:

    image: image-registry.openshift-image-registry.svc:5000/$MOD_NAMESPACE/amdgpu_kmod
    Specifies the driver image location. You can omit this field; the Operator then uses the default location shown here.

    After you apply the DeviceConfig CR, the AMD GPU Operator collects the worker node system specifications, builds or retrieves the appropriate driver image, uses KMM to deploy the driver, and then deploys the ROCm device plugin and node labeller.
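
    You can follow this rollout from the DeviceConfig side. The resource name below assumes the driver-cr example above; the status fields reported can vary by Operator version:

    ```shell
    # Inspect the DeviceConfig status conditions:
    $ oc get deviceconfig driver-cr -n openshift-amd-gpu -o yaml

    # KMM creates a Module resource for the driver; check that it exists:
    $ oc get modules -n openshift-amd-gpu
    ```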

Verification

  1. Verify that the KMM worker pods are running:

    $ oc get pods -n openshift-kmm

    Example output

    NAME                                       READY   STATUS    RESTARTS         AGE
    kmm-operator-controller-774c7ccff6-hr76v   1/1     Running   30 (2d23h ago)   35d
    kmm-operator-webhook-76d7b9555-ltmps       1/1     Running   5                35d

  2. Check device plugin and labeller status:

    $ oc -n openshift-amd-gpu get pods

    Example output

    NAME                                                   READY   STATUS    RESTARTS        AGE
    amd-gpu-operator-controller-manager-59dd964777-zw4bg   1/1     Running   8 (2d23h ago)   9d
    test-deviceconfig-device-plugin-kbrp7                  1/1     Running   0               2d
    test-deviceconfig-metrics-exporter-k5v4x               1/1     Running   0               2d
    test-deviceconfig-node-labeller-fqz7x                  1/1     Running   0               2d

  3. Confirm that GPU resource labels are applied to the nodes:

    $ oc get node -o json | grep amd.com

    Example output

    "amd.com/gpu.cu-count": "304",
    "amd.com/gpu.device-id": "74b5",
    "amd.com/gpu.driver-version": "6.12.12",
    "amd.com/gpu.family": "AI",
    "amd.com/gpu.simd-count": "1216",
    "amd.com/gpu.vram": "191G",
    "beta.amd.com/gpu.cu-count": "304",
    "beta.amd.com/gpu.cu-count.304": "8",
    "beta.amd.com/gpu.device-id": "74b5",
    "beta.amd.com/gpu.device-id.74b5": "8",
    "beta.amd.com/gpu.family": "AI",
    "beta.amd.com/gpu.family.AI": "8",
    "beta.amd.com/gpu.simd-count": "1216",
    "beta.amd.com/gpu.simd-count.1216": "8",
    "beta.amd.com/gpu.vram": "191G",
    "beta.amd.com/gpu.vram.191G": "8",
    "amd.com/gpu": "8",
    "amd.com/gpu": "8",
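
  4. Optional: As a final check, you can schedule a test workload that requests the amd.com/gpu extended resource advertised by the device plugin. The pod name and image below are placeholders; substitute any trusted image that can run rocminfo or rocm-smi:

    ```shell
    $ oc apply -f - <<EOF
    apiVersion: v1
    kind: Pod
    metadata:
      name: amd-gpu-smoke-test
      namespace: openshift-amd-gpu
    spec:
      restartPolicy: Never
      containers:
        - name: smoke-test
          image: rocm/rocm-terminal   # placeholder; use a trusted ROCm image
          command: ["rocminfo"]
          resources:
            limits:
              amd.com/gpu: "1"
    EOF
    ```

    If the pod schedules and completes, the device plugin is allocating GPUs correctly. Delete the pod with `oc delete pod amd-gpu-smoke-test -n openshift-amd-gpu` when you are done.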
