Chapter 3. Installing the Node Feature Discovery Operator and the NVIDIA GPU Operator
Install the Node Feature Discovery Operator and the NVIDIA GPU Operator so that you can use the AI accelerators of the underlying host.
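Prerequisites
- You have installed the OpenShift CLI (oc).
- You are logged in as a user with cluster-admin privileges.
- You have successfully mirrored the required Operator images in the disconnected environment.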
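The first two prerequisites can be checked from the command line before you start. The following is a minimal sketch, not part of the product procedure: `check_prereqs` is a hypothetical helper name, and the sketch assumes that a `yes` answer from `oc auth can-i '*' '*' --all-namespaces` is an adequate proxy for cluster-admin privileges. It does not check whether the Operator images were mirrored.

```shell
# Hypothetical helper: checks that oc is installed, that a login session
# exists, and that the current user can perform any action cluster-wide.
check_prereqs() {
  # The oc binary must be resolvable in the current shell.
  command -v oc >/dev/null 2>&1 || { echo "oc CLI not found" >&2; return 1; }
  # "oc whoami" fails when there is no active login session.
  oc whoami >/dev/null 2>&1 || { echo "not logged in" >&2; return 1; }
  # Assumption: "yes" for all verbs on all resources approximates cluster-admin.
  [ "$(oc auth can-i '*' '*' --all-namespaces 2>/dev/null)" = "yes" ] ||
    { echo "user lacks cluster-wide privileges" >&2; return 1; }
  echo "prerequisites look OK"
}
```

After defining the function, run `check_prereqs` in the same shell session. A non-zero return status, together with the message on standard error, indicates which prerequisite is not met.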
Procedure
1. Disable the default OperatorHub sources. Run the following command:

$ oc patch OperatorHub cluster --type json \
    -p '[{"op": "add", "path": "/spec/disableAllDefaultSources", "value": true}]'

2. Apply the Namespace, OperatorGroup, and Subscription CRs for the Node Feature Discovery Operator and the NVIDIA GPU Operator.

   a. Create the Namespace CRs:

$ oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
  labels:
    name: openshift-nfd
    openshift.io/cluster-monitoring: "true"
EOF

   b. Create the OperatorGroup CRs:

$ oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: openshift-nfd-
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
EOF

   c. Create the Subscription CRs:

$ oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: "stable"
  installPlanApproval: Manual
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
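3. Create the Secret custom resource (CR) for the Hugging Face token.

   a. Set the HF_TOKEN variable to the token that you set up in Hugging Face:

$ HF_TOKEN=<your_huggingface_token>

   b. Set the cluster namespace to match the location where the Red Hat AI Inference Server image is deployed, for example:

$ NAMESPACE=rhaiis-namespace

   c. Create the Secret CR in the cluster:

$ oc create secret generic hf-secret --from-literal=HF_TOKEN="$HF_TOKEN" -n "$NAMESPACE"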
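The gpu-operator-certified Subscription in the procedure sets installPlanApproval to Manual. With manual approval, Operator Lifecycle Manager creates an InstallPlan and waits until it is approved before installing the Operator. The following is a minimal sketch of one way to approve it, not a step taken from the procedure: `approve_installplans` is a hypothetical helper name, and the sketch assumes a fresh installation in which the only InstallPlan in the namespace is the pending one for this Subscription.

```shell
# Hypothetical helper: approves every InstallPlan in the given namespace.
# Assumption: a fresh installation, so no unrelated upgrade plans exist.
approve_installplans() {
  ip_namespace="$1"
  # "-o name" prints one TYPE/NAME entry per line.
  for ip in $(oc get installplan -n "$ip_namespace" -o name); do
    # Setting spec.approved to true lets OLM proceed with the installation.
    oc patch "$ip" -n "$ip_namespace" --type merge \
      -p '{"spec":{"approved":true}}'
  done
}
```

For the namespace used in this chapter, the call would be `approve_installplans nvidia-gpu-operator`. On a cluster that already has other InstallPlans in that namespace, review the output of `oc get installplan -n nvidia-gpu-operator` first and patch only the plan you intend to approve.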
Verification
Run the following command to verify that the Operators deployed successfully:
$ oc get pods
Example output
NAME                                                  READY   STATUS      RESTARTS   AGE
nfd-controller-manager-7f86ccfb58-vgr4x               2/2     Running     0          10m
gpu-feature-discovery-c2rfm                           1/1     Running     0          6m28s
gpu-operator-84b7f5bcb9-vqds7                         1/1     Running     0          39m
nvidia-container-toolkit-daemonset-pgcrf              1/1     Running     0          6m28s
nvidia-cuda-validator-p8gv2                           0/1     Completed   0          99s
nvidia-dcgm-exporter-kv6k8                            1/1     Running     0          6m28s
nvidia-dcgm-tpsps                                     1/1     Running     0          6m28s
nvidia-device-plugin-daemonset-gbn55                  1/1     Running     0          6m28s
nvidia-device-plugin-validator-z7ltr                  0/1     Completed   0          82s
nvidia-driver-daemonset-410.84.202203290245-0-xxgdv   2/2     Running     0          6m28s
nvidia-node-status-exporter-snmsm                     1/1     Running     0          6m28s
nvidia-operator-validator-6pfk6                       1/1     Running     0          6m28s
...
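The example output lists pods from both Operators, which the procedure installs into two namespaces, openshift-nfd and nvidia-gpu-operator. The following is a minimal sketch for checking one namespace at a time. It is an illustration only: `unhealthy_pods` is a hypothetical helper name, and it assumes that a STATUS of Running or Completed, as in the example output, is the only healthy state.

```shell
# Hypothetical helper: prints the names of pods in a namespace whose STATUS
# is neither Running nor Completed. Returns non-zero if any are found.
unhealthy_pods() {
  # --no-headers drops the column header so every line describes one pod.
  oc get pods -n "$1" --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { print $1; bad = 1 }
         END { exit bad }'
}
```

For example, running `unhealthy_pods openshift-nfd` and then `unhealthy_pods nvidia-gpu-operator` prints nothing and returns zero when all pods in both namespaces have reached one of the two states.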