Chapter 3. Installing the Node Feature Discovery Operator and NVIDIA GPU Operator
Install the Node Feature Discovery Operator and the NVIDIA GPU Operator that allow you to use the underlying host AI accelerators.
Prerequisites
- You have installed the OpenShift CLI (oc).
- You have logged in as a user with cluster-admin privileges.
- You have successfully mirrored the required Operator images in the disconnected environment.
Procedure
Disable the default OperatorHub sources. Run the following command:

$ oc patch OperatorHub cluster --type json \
    -p '[{"op": "add", "path": "/spec/disableAllDefaultSources", "value": true}]'

Apply the Namespace, OperatorGroup, and Subscription CRs for the Node Feature Discovery Operator and the NVIDIA GPU Operator.

Create the Namespace CRs:

oc apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
  labels:
    name: openshift-nfd
    openshift.io/cluster-monitoring: "true"
EOF

Create the OperatorGroup CRs:

oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
  - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: openshift-nfd-
  name: openshift-nfd
  namespace: openshift-nfd
spec:
  targetNamespaces:
  - openshift-nfd
EOF

Create the Subscription CRs:

oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: "stable"
  installPlanApproval: Manual
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nfd
  namespace: openshift-nfd
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: nfd
  source: redhat-operators
  sourceNamespace: openshift-marketplace
EOF
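
The NVIDIA GPU Operator Subscription sets installPlanApproval to Manual, so Operator Lifecycle Manager waits for the generated InstallPlan to be approved before it installs the Operator. The following is a minimal sketch of one way to approve pending plans from the CLI. The helper name approve_install_plans is hypothetical, and the sketch assumes that approving every pending InstallPlan in the namespace is acceptable in your environment.

```shell
# Hypothetical helper: approve every InstallPlan in a namespace that is
# still waiting for manual approval.
approve_install_plans() {
  ns="$1"
  # Print "<name> <approved>" for each InstallPlan, one per line.
  oc get installplan -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.approved}{"\n"}{end}' |
  while read -r name approved; do
    if [ "$approved" = "false" ]; then
      # Setting spec.approved to true lets OLM proceed with the install.
      oc patch installplan "$name" -n "$ns" --type merge \
        -p '{"spec":{"approved":true}}' </dev/null
    fi
  done
}
```

For the CRs above, the call would be approve_install_plans nvidia-gpu-operator. The Node Feature Discovery Operator Subscription uses Automatic approval and does not need this step.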
Create a Secret custom resource (CR) for the Hugging Face token.

Set the HF_TOKEN variable to the access token that you created in Hugging Face:

$ HF_TOKEN=<your_huggingface_token>

Set the cluster namespace to match where you deploy the Red Hat AI Inference Server image, for example:

$ NAMESPACE=rhaiis-namespace

Create the Secret CR in the cluster:

$ oc create secret generic hf-secret --from-literal=HF_TOKEN="$HF_TOKEN" -n "$NAMESPACE"
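
If either variable is empty when the Secret is created, the result is a Secret with an empty token or a command that fails on a missing namespace argument. A small guard, sketched below, checks both variables first. The function name create_hf_secret is hypothetical; the oc create secret command inside it is the one from the step above.

```shell
# Hypothetical wrapper: create hf-secret only when both variables are set.
create_hf_secret() {
  if [ -z "${HF_TOKEN:-}" ] || [ -z "${NAMESPACE:-}" ]; then
    echo "HF_TOKEN and NAMESPACE must be set" >&2
    return 1
  fi
  oc create secret generic hf-secret \
    --from-literal=HF_TOKEN="$HF_TOKEN" -n "$NAMESPACE"
}
```

After setting HF_TOKEN and NAMESPACE as described above, run create_hf_secret in the same shell session.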
Verification
Verify that the Operator deployments are successful by running the following command:
$ oc get pods
Example output
NAME                                                  READY   STATUS      RESTARTS   AGE
nfd-controller-manager-7f86ccfb58-vgr4x               2/2     Running     0          10m
gpu-feature-discovery-c2rfm                           1/1     Running     0          6m28s
gpu-operator-84b7f5bcb9-vqds7                         1/1     Running     0          39m
nvidia-container-toolkit-daemonset-pgcrf              1/1     Running     0          6m28s
nvidia-cuda-validator-p8gv2                           0/1     Completed   0          99s
nvidia-dcgm-exporter-kv6k8                            1/1     Running     0          6m28s
nvidia-dcgm-tpsps                                     1/1     Running     0          6m28s
nvidia-device-plugin-daemonset-gbn55                  1/1     Running     0          6m28s
nvidia-device-plugin-validator-z7ltr                  0/1     Completed   0          82s
nvidia-driver-daemonset-410.84.202203290245-0-xxgdv   2/2     Running     0          6m28s
nvidia-node-status-exporter-snmsm                     1/1     Running     0          6m28s
nvidia-operator-validator-6pfk6                       1/1     Running     0          6m28s
...
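
Without a -n option, oc get pods lists only the pods in the current project, while the Subscriptions above install the Operators into the openshift-nfd and nvidia-gpu-operator namespaces. The sketch below is one way to scan a pod listing for anything that is neither Running nor Completed. The function name not_ready_pods is hypothetical, and it assumes the default column layout shown in the example output, with STATUS in the third column.

```shell
# Hypothetical helper: read `oc get pods` output on stdin and print the
# names of pods whose STATUS is neither Running nor Completed.
not_ready_pods() {
  # NR > 1 skips the header row; $1 is NAME and $3 is STATUS.
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}
```

For example, pipe each namespace through the helper with oc get pods -n openshift-nfd | not_ready_pods and oc get pods -n nvidia-gpu-operator | not_ready_pods. No output means every listed pod is in one of the two expected states.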