Chapter 3. Installing the Node Feature Discovery Operator and NVIDIA GPU Operator


Install the Node Feature Discovery Operator and the NVIDIA GPU Operator so that you can use the AI accelerators of the underlying host.

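Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have logged in as a user with cluster-admin privileges.
  • You have successfully mirrored the required Operator images in the disconnected environment.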
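
The first two prerequisites can be checked from the command line before you start. The following is a minimal sketch, not an official procedure: `check_prereqs` is a hypothetical helper name, and asking `oc auth can-i` about every verb on every resource is used here only as an approximation of cluster-admin access.

```shell
# check_prereqs is a hypothetical helper, not part of the oc CLI.
check_prereqs() {
  # The oc client must be resolvable in the current shell.
  command -v oc >/dev/null 2>&1 || { echo "oc not found on PATH" >&2; return 1; }

  # oc whoami fails when there is no active login session.
  prereq_user=$(oc whoami) || { echo "no active login session" >&2; return 1; }

  # oc auth can-i exits 0 when the action is allowed and non-zero otherwise.
  # '*' '*' asks about every verb on every resource, in all namespaces.
  if oc auth can-i '*' '*' --all-namespaces >/dev/null 2>&1; then
    echo "ok: $prereq_user can perform any action cluster-wide"
  else
    echo "$prereq_user lacks cluster-wide permissions" >&2
    return 1
  fi
}
```

Call `check_prereqs` after logging in. A non-zero exit status means that at least one of the two checks failed.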

Procedure

  1. Disable the default OperatorHub sources by running the following command:

    $ oc patch OperatorHub cluster --type json \
        -p '[{"op": "add", "path": "/spec/disableAllDefaultSources", "value": true}]'
  2. Apply the Namespace, OperatorGroup, and Subscription CRs for the Node Feature Discovery Operator and the NVIDIA GPU Operator.

    1. Create the Namespace CRs:

      oc apply -f - <<EOF
      apiVersion: v1
      kind: Namespace
      metadata:
        name: nvidia-gpu-operator
      ---
      apiVersion: v1
      kind: Namespace
      metadata:
        name: openshift-nfd
        labels:
          name: openshift-nfd
          openshift.io/cluster-monitoring: "true"
      EOF
    2. Create the OperatorGroup CRs:

      oc apply -f - <<EOF
      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        name: gpu-operator-certified
        namespace: nvidia-gpu-operator
      spec:
        targetNamespaces:
        - nvidia-gpu-operator
      ---
      apiVersion: operators.coreos.com/v1
      kind: OperatorGroup
      metadata:
        generateName: openshift-nfd-
        name: openshift-nfd
        namespace: openshift-nfd
      spec:
        targetNamespaces:
        - openshift-nfd
      EOF
    3. Create the Subscription CRs:

      oc apply -f - <<EOF
      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: gpu-operator-certified
        namespace: nvidia-gpu-operator
      spec:
        channel: "stable"
        installPlanApproval: Manual
        name: gpu-operator-certified
        source: certified-operators
        sourceNamespace: openshift-marketplace
      ---
      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: nfd
        namespace: openshift-nfd
      spec:
        channel: "stable"
        installPlanApproval: Automatic
        name: nfd
        source: redhat-operators
        sourceNamespace: openshift-marketplace
      EOF
  3. Create a Secret custom resource (CR) for your Hugging Face token.

    1. Set the HF_TOKEN variable to the token that you created in Hugging Face:

      $ HF_TOKEN=<your_huggingface_token>
    2. Set the cluster namespace to match the namespace where the Red Hat AI Inference Server image is deployed, for example:

      $ NAMESPACE=rhaiis-namespace
    3. Create the Secret CR in the cluster:

      $ oc create secret generic hf-secret --from-literal=HF_TOKEN="$HF_TOKEN" -n "$NAMESPACE"
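
The gpu-operator-certified Subscription in the procedure sets `installPlanApproval: Manual`, so Operator Lifecycle Manager creates an InstallPlan that waits for approval before the NVIDIA GPU Operator is installed. The following is a minimal sketch of one way to approve pending InstallPlans. `approve_pending_installplans` is a hypothetical helper name, and the default namespace is taken from the Subscription above. Review each InstallPlan before approving it in a production cluster.

```shell
# approve_pending_installplans is a hypothetical helper, not part of the oc CLI.
# Usage: approve_pending_installplans [namespace]
approve_pending_installplans() {
  ip_ns="${1:-nvidia-gpu-operator}"

  # Print one "name approved" pair per line for every InstallPlan in the namespace.
  oc get installplan -n "$ip_ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.approved}{"\n"}{end}' |
  while read -r ip_name ip_approved; do
    # Only InstallPlans with spec.approved=false still need approval.
    [ "$ip_approved" = "false" ] || continue
    echo "approving $ip_name"
    # A merge patch that sets spec.approved to true lets OLM proceed.
    # stdin is redirected so the command cannot consume the loop input.
    oc patch installplan "$ip_name" -n "$ip_ns" --type merge \
      -p '{"spec":{"approved":true}}' </dev/null
  done
}
```

Running `approve_pending_installplans` with no argument targets the nvidia-gpu-operator namespace. The nfd Subscription uses `installPlanApproval: Automatic` and does not need this step.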

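Verification

Run the following command to verify that the Operators deployed successfully. The command lists pods in the current project only; add -n <namespace> to inspect the nvidia-gpu-operator or openshift-nfd namespace: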

$ oc get pods

Example output

NAME                                                  READY   STATUS     RESTARTS   AGE
nfd-controller-manager-7f86ccfb58-vgr4x               2/2     Running    0          10m
gpu-feature-discovery-c2rfm                           1/1     Running    0          6m28s
gpu-operator-84b7f5bcb9-vqds7                         1/1     Running    0          39m
nvidia-container-toolkit-daemonset-pgcrf              1/1     Running    0          6m28s
nvidia-cuda-validator-p8gv2                           0/1     Completed  0          99s
nvidia-dcgm-exporter-kv6k8                            1/1     Running    0          6m28s
nvidia-dcgm-tpsps                                     1/1     Running    0          6m28s
nvidia-device-plugin-daemonset-gbn55                  1/1     Running    0          6m28s
nvidia-device-plugin-validator-z7ltr                  0/1     Completed  0          82s
nvidia-driver-daemonset-410.84.202203290245-0-xxgdv   2/2     Running    0          6m28s
nvidia-node-status-exporter-snmsm                     1/1     Running    0          6m28s
nvidia-operator-validator-6pfk6                       1/1     Running    0          6m28s
...
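
In the example output, every pod is either Running or Completed. For a longer listing, that check can be scripted. The following is a minimal sketch: `pods_with_unexpected_status` is a hypothetical helper name, and it inspects only the STATUS column, so a pod that is Running but not yet fully ready (for example READY 1/2) is not reported.

```shell
# pods_with_unexpected_status is a hypothetical helper, not part of the oc CLI.
# It reads `oc get pods` output on stdin and prints the name of every pod
# whose STATUS column is neither Running nor Completed.
pods_with_unexpected_status() {
  # NR > 1 skips the header row; $1 is NAME and $3 is STATUS.
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Example use against both namespaces created in the procedure
# (commented out so that the sketch can be loaded without a cluster):
# for ns in openshift-nfd nvidia-gpu-operator; do
#   oc get pods -n "$ns" | pods_with_unexpected_status
# done
```

Empty output means that no pod in the listing has an unexpected status.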
