Chapter 6. Dynamic Accelerator Slicer (DAS) Operator
Dynamic Accelerator Slicer Operator is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
The Dynamic Accelerator Slicer (DAS) Operator allows you to dynamically slice GPU accelerators in OpenShift Container Platform, instead of relying on statically sliced GPUs defined when the node is booted. This allows you to dynamically slice GPUs based on specific workload demands, ensuring efficient resource utilization.
Dynamic slicing is useful if you do not know all the accelerator partitions needed in advance on every node on the cluster.
The DAS Operator currently includes a reference implementation for NVIDIA Multi-Instance GPU (MIG) and is designed to support additional technologies such as NVIDIA MPS or GPUs from other vendors in the future.
Limitations
The following limitations apply when using the Dynamic Accelerator Slicer Operator:
- You need to identify potential incompatibilities and ensure the system works seamlessly with various GPU drivers and operating systems.
- The Operator only works with specific MIG-compatible NVIDIA GPUs and drivers, such as the H100 and A100.
- The Operator cannot manage only a subset of the GPUs on a node.
- The NVIDIA device plugin cannot be used together with the Dynamic Accelerator Slicer Operator to manage the GPU resources of a cluster.
The DAS Operator is designed to work with MIG-enabled GPUs. It allocates MIG slices instead of whole GPUs. Installing the DAS Operator prevents the use of the standard resource request through the NVIDIA device plugin, such as nvidia.com/gpu: "1", for allocating an entire GPU.
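For example, a workload that previously requested a whole GPU through the device plugin instead requests a MIG slice extended resource. The following snippet is a minimal illustration; the exact resource name (nvidia.com/mig-1g.5gb here) is an assumption and depends on the GPU model and the MIG profiles exposed on your nodes:

  # Illustrative container resources stanza; the MIG profile name varies by GPU and configuration.
  resources:
    limits:
      nvidia.com/mig-1g.5gb: "1"   # request one MIG slice instead of nvidia.com/gpu: "1"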
6.1. Installing the Dynamic Accelerator Slicer Operator
As a cluster administrator, you can install the Dynamic Accelerator Slicer (DAS) Operator by using the OpenShift Container Platform web console or the OpenShift CLI.
6.1.1. Installing the Dynamic Accelerator Slicer Operator using the web console
As a cluster administrator, you can install the Dynamic Accelerator Slicer (DAS) Operator using the OpenShift Container Platform web console.
Prerequisites
- You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.
- You have installed the required prerequisites:
- cert-manager Operator for Red Hat OpenShift
- Node Feature Discovery (NFD) Operator
- NVIDIA GPU Operator
- NodeFeatureDiscovery CR
Procedure
Configure the NVIDIA GPU Operator for MIG support:
- In the OpenShift Container Platform web console, navigate to Operators → Installed Operators.
- Select the NVIDIA GPU Operator from the list of installed operators.
- Click the ClusterPolicy tab and then click Create ClusterPolicy.
- In the YAML editor, replace the default content with the following cluster policy configuration to disable the default NVIDIA device plugin and enable MIG support:
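The following ClusterPolicy is a minimal sketch that illustrates the relevant settings: the default NVIDIA device plugin is disabled and MIG management is enabled with the mixed MIG strategy. The field names follow the NVIDIA GPU Operator ClusterPolicy API, but treat this as an outline and verify the full specification against the ClusterPolicy version installed in your cluster:

  apiVersion: nvidia.com/v1
  kind: ClusterPolicy
  metadata:
    name: gpu-cluster-policy
  spec:
    devicePlugin:
      enabled: false        # disable the default NVIDIA device plugin; the DAS Operator allocates MIG slices instead
    migManager:
      enabled: true         # let the GPU Operator manage MIG geometry on labeled nodes
    mig:
      strategy: mixed       # expose individual MIG profiles as extended resources
    driver:
      enabled: true
    toolkit:
      enabled: true
    dcgmExporter:
      enabled: true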
- Click Create to apply the cluster policy.
- Navigate to Workloads → Pods and select the nvidia-gpu-operator namespace to monitor the cluster policy deployment.
- Wait for the NVIDIA GPU Operator cluster policy to reach the Ready state. You can monitor this by:
  - Navigating to Operators → Installed Operators → NVIDIA GPU Operator.
  - Clicking the ClusterPolicy tab and checking that the status shows ready.
- Verify that all pods in the NVIDIA GPU Operator namespace are running by selecting the nvidia-gpu-operator namespace and navigating to Workloads → Pods.
Label nodes with MIG-capable GPUs to enable MIG mode:
- Navigate to Compute → Nodes.
- Select a node that has MIG-capable GPUs.
- Click Actions → Edit Labels.
- Add the label nvidia.com/mig.config=all-enabled.
- Click Save.
- Repeat for each node with MIG-capable GPUs.
Important: After applying the MIG label, the labeled nodes reboot to enable MIG mode. Wait for the nodes to come back online before proceeding.
- Verify that MIG mode is successfully enabled on the GPU nodes by checking that the nvidia.com/mig.config=all-enabled label appears in the Labels section. To locate the label, navigate to Compute → Nodes, select the GPU node, and click the Details tab.
Install the DAS Operator:
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Search for Dynamic Accelerator Slicer or DAS in the filter box to locate the DAS Operator.
- Select the Dynamic Accelerator Slicer and click Install.
On the Install Operator page:
- Select All namespaces on the cluster (default) for the installation mode.
- For Installed Namespace, select Operator recommended Namespace: Project das-operator.
- If creating a new namespace, enter das-operator as the namespace name.
- Select an update channel.
- Select Automatic or Manual for the approval strategy.
- Click Install.
Create the DASOperator CR:
- In the OpenShift Container Platform web console, click Operators → Installed Operators.
- Select DAS Operator from the list.
- In the Provided APIs table column, click DASOperator. This takes you to the DASOperator tab of the Operator details page.
- Click Create DASOperator. This takes you to the Create DASOperator YAML view.
- In the YAML editor, paste the following example:
  Example DASOperator CR
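A minimal DASOperator CR sketch follows. The apiVersion shown is an assumption; keep the apiVersion that the Create DASOperator YAML view prepopulates if it differs. The spec fields follow the common OpenShift operator configuration pattern:

  apiVersion: inference.redhat.com/v1alpha1   # assumption; use the apiVersion shown in the Create DASOperator YAML view
  kind: DASOperator
  metadata:
    name: cluster            # 1
    namespace: das-operator
  spec:
    managementState: Managed
    logLevel: Normal
    operatorLogLevel: Normal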
  1. The name of the DASOperator CR must be cluster.
- Click Create.
Verification
To verify that the DAS Operator installed successfully:
- Navigate to the Operators → Installed Operators page.
- Ensure that Dynamic Accelerator Slicer is listed in the das-operator namespace with a Status of Succeeded.
To verify that the DASOperator CR installed successfully:
- After you create the DASOperator CR, the web console brings you to the DASOperator list view. The Status field of the CR changes to Available when all of the components are running.
- Optional: You can verify that the DASOperator CR installed successfully by running the following command in the OpenShift CLI:
  $ oc get dasoperator -n das-operator
  Example output
  NAME      STATUS      AGE
  cluster   Available   3m
Note: During installation, an Operator might display a Failed status. If the installation later succeeds with a Succeeded message, you can ignore the Failed message.
You can also verify the installation by checking the pods:
- Navigate to the Workloads → Pods page and select the das-operator namespace.
- Verify that all DAS Operator component pods are running:
  - das-operator pods (main operator controllers)
  - das-operator-webhook pods (webhook servers)
  - das-scheduler pods (scheduler plugins)
  - das-daemonset pods (only on nodes with MIG-compatible GPUs)
Note: The das-daemonset pods only appear on nodes that have MIG-compatible GPU hardware. If you do not see any daemonset pods, verify that your cluster has nodes with supported GPU hardware and that the NVIDIA GPU Operator is properly configured.
Troubleshooting
Use the following procedure if the Operator does not appear to be installed:
- Navigate to the Operators → Installed Operators page and inspect the Operator Subscriptions and Install Plans tabs for any failures or errors under Status.
- Navigate to the Workloads → Pods page and check the logs of the pods in the das-operator namespace.
6.1.2. Installing the Dynamic Accelerator Slicer Operator using the CLI
As a cluster administrator, you can install the Dynamic Accelerator Slicer (DAS) Operator using the OpenShift CLI.
Prerequisites
- You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.
- You have installed the OpenShift CLI (oc).
- You have installed the required prerequisites:
- cert-manager Operator for Red Hat OpenShift
- Node Feature Discovery (NFD) Operator
- NVIDIA GPU Operator
- NodeFeatureDiscovery CR
Procedure
Configure the NVIDIA GPU Operator for MIG support:
- Apply the following cluster policy to disable the default NVIDIA device plugin and enable MIG support. Create a file named gpu-cluster-policy.yaml with the following content:
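The following content is a minimal sketch of such a cluster policy: it disables the default NVIDIA device plugin and enables MIG management with the mixed MIG strategy. The field names follow the NVIDIA GPU Operator ClusterPolicy API; verify the full specification against the ClusterPolicy version installed in your cluster before applying it:

  apiVersion: nvidia.com/v1
  kind: ClusterPolicy
  metadata:
    name: gpu-cluster-policy
  spec:
    devicePlugin:
      enabled: false        # disable the default NVIDIA device plugin; the DAS Operator allocates MIG slices instead
    migManager:
      enabled: true         # manage MIG geometry on nodes labeled nvidia.com/mig.config=all-enabled
    mig:
      strategy: mixed       # expose individual MIG profiles as extended resources
    driver:
      enabled: true
    toolkit:
      enabled: true
    dcgmExporter:
      enabled: true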
- Apply the cluster policy by running the following command:
  $ oc apply -f gpu-cluster-policy.yaml
- Verify that the NVIDIA GPU Operator cluster policy reaches the Ready state by running the following command:
  $ oc get clusterpolicies.nvidia.com gpu-cluster-policy -w
  Wait until the STATUS column shows ready.
  Example output
  NAME                 STATUS   AGE
  gpu-cluster-policy   ready    2025-08-14T08:56:45Z
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that all pods in the NVIDIA GPU Operator namespace are running by running the following command:
oc get pods -n nvidia-gpu-operator
$ oc get pods -n nvidia-gpu-operator
Copy to Clipboard Copied! Toggle word wrap Toggle overflow All pods should show a
Running
orCompleted
status.Label nodes with MIG-capable GPUs to enable MIG mode by running the following command:
  $ oc label node $NODE_NAME nvidia.com/mig.config=all-enabled --overwrite
  Replace $NODE_NAME with the name of each node that has MIG-capable GPUs.
  Important: After applying the MIG label, the labeled nodes reboot to enable MIG mode. Wait for the nodes to come back online before proceeding.
- Verify that the nodes have successfully enabled MIG mode by running the following command:
  $ oc get nodes -l nvidia.com/mig.config=all-enabled
Create a namespace for the DAS Operator:
- Create the following Namespace custom resource (CR) that defines the das-operator namespace, and save the YAML in the das-namespace.yaml file:
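A minimal Namespace CR sketch, assuming only the namespace name is required; the cluster-monitoring label is a common optional addition, not a documented requirement:

  apiVersion: v1
  kind: Namespace
  metadata:
    name: das-operator
    labels:
      # Optional: enables cluster monitoring for the namespace.
      openshift.io/cluster-monitoring: "true"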
- Create the namespace by running the following command:
  $ oc create -f das-namespace.yaml
Install the DAS Operator in the namespace you created in the previous step by creating the following objects:
- Create the following OperatorGroup CR and save the YAML in the das-operatorgroup.yaml file:
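A minimal OperatorGroup CR sketch, assuming the All namespaces install mode used in the web console procedure (an OperatorGroup with an empty spec selects all namespaces); adjust spec.targetNamespaces if you scope the Operator differently:

  apiVersion: operators.coreos.com/v1
  kind: OperatorGroup
  metadata:
    name: das-operator
    namespace: das-operator
  spec: {}   # empty spec targets all namespaces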
- Create the OperatorGroup CR by running the following command:
  $ oc create -f das-operatorgroup.yaml
- Create the following Subscription CR and save the YAML in the das-sub.yaml file:
  Example Subscription
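A minimal Subscription CR sketch. The package name, catalog source, and channel match the values shown in the uninstall example output later in this chapter; the approval strategy is an assumption and can be Automatic or Manual:

  apiVersion: operators.coreos.com/v1alpha1
  kind: Subscription
  metadata:
    name: das-operator
    namespace: das-operator
  spec:
    channel: stable
    name: das-operator
    source: redhat-operators
    sourceNamespace: openshift-marketplace
    installPlanApproval: Automatic   # assumption; Manual is also valid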
- Create the subscription object by running the following command:
  $ oc create -f das-sub.yaml
- Change to the das-operator project:
  $ oc project das-operator
- Create the following DASOperator CR and save the YAML in the das-dasoperator.yaml file:
  Example DASOperator CR
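A minimal DASOperator CR sketch follows. The apiVersion is an assumption; after the Operator is installed you can confirm it with oc api-resources | grep -i dasoperator. The spec fields follow the common OpenShift operator configuration pattern:

  apiVersion: inference.redhat.com/v1alpha1   # assumption; confirm with `oc api-resources | grep -i dasoperator`
  kind: DASOperator
  metadata:
    name: cluster            # 1
    namespace: das-operator
  spec:
    managementState: Managed
    logLevel: Normal
    operatorLogLevel: Normal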
  1. The name of the DASOperator CR must be cluster.
- Create the dasoperator CR by running the following command:
  $ oc create -f das-dasoperator.yaml
Verification
- Verify that the Operator deployment is successful by running the following command:
  $ oc get pods
  A successful deployment shows all pods with a Running status. The deployment includes:
  - das-operator: Main Operator controller pods
  - das-operator-webhook: Webhook server pods for mutating pod requests
  - das-scheduler: Scheduler plugin pods for MIG slice allocation
  - das-daemonset: Daemon set pods that run only on nodes with MIG-compatible GPUs
  Note: The das-daemonset pods only appear on nodes that have MIG-compatible GPU hardware. If you do not see any daemonset pods, verify that your cluster has nodes with supported GPU hardware and that the NVIDIA GPU Operator is properly configured.
6.2. Uninstalling the Dynamic Accelerator Slicer Operator
Use one of the following procedures to uninstall the Dynamic Accelerator Slicer (DAS) Operator, depending on how the Operator was installed.
6.2.1. Uninstalling the Dynamic Accelerator Slicer Operator using the web console
You can uninstall the Dynamic Accelerator Slicer (DAS) Operator using the OpenShift Container Platform web console.
Prerequisites
- You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.
- The DAS Operator is installed in your cluster.
Procedure
- In the OpenShift Container Platform web console, navigate to Operators → Installed Operators.
- Locate the Dynamic Accelerator Slicer in the list of installed Operators.
- Click the Options menu for the DAS Operator and select Uninstall Operator.
- In the confirmation dialog, click Uninstall to confirm the removal.
- Navigate to Home → Projects.
- Search for das-operator in the search box to locate the DAS Operator project.
- Click the Options menu next to the das-operator project, and select Delete Project.
- In the confirmation dialog, type das-operator in the dialog box, and click Delete to confirm the deletion.
Verification
- Navigate to the Operators → Installed Operators page.
- Verify that the Dynamic Accelerator Slicer (DAS) Operator is no longer listed.
- Optional: Verify that the das-operator namespace and its resources have been removed by running the following command:
  $ oc get namespace das-operator
  The command should return an error indicating that the namespace is not found.
Uninstalling the DAS Operator removes all GPU slice allocations and might cause running workloads that depend on GPU slices to fail. Ensure that no critical workloads are using GPU slices before proceeding with the uninstallation.
6.2.2. Uninstalling the Dynamic Accelerator Slicer Operator using the CLI
You can uninstall the Dynamic Accelerator Slicer (DAS) Operator using the OpenShift CLI.
Prerequisites
- You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.
- You have installed the OpenShift CLI (oc).
- The DAS Operator is installed in your cluster.
Procedure
- List the installed Operators to find the DAS Operator subscription by running the following command:
  $ oc get subscriptions -n das-operator
  Example output
  NAME           PACKAGE        SOURCE             CHANNEL
  das-operator   das-operator   redhat-operators   stable
- Delete the subscription by running the following command:
  $ oc delete subscription das-operator -n das-operator
- List and delete the cluster service version (CSV) by running the following commands:
  $ oc get csv -n das-operator
  $ oc delete csv <csv-name> -n das-operator
- Remove the Operator group by running the following command:
  $ oc delete operatorgroup das-operator -n das-operator
- Delete any remaining AllocationClaim resources by running the following command:
  $ oc delete allocationclaims --all -n das-operator
- Remove the DAS Operator namespace by running the following command:
  $ oc delete namespace das-operator
Verification
- Verify that the DAS Operator resources have been removed by running the following command:
  $ oc get namespace das-operator
  The command should return an error indicating that the namespace is not found.
- Verify that no AllocationClaim custom resource definitions remain by running the following command:
  $ oc get crd | grep allocationclaim
  The command should return no output, indicating that no AllocationClaim custom resource definitions remain.
Uninstalling the DAS Operator removes all GPU slice allocations and might cause running workloads that depend on GPU slices to fail. Ensure that no critical workloads are using GPU slices before proceeding with the uninstallation.
6.3. Deploying GPU workloads with the Dynamic Accelerator Slicer Operator
You can deploy workloads that request GPU slices managed by the Dynamic Accelerator Slicer (DAS) Operator. The Operator dynamically partitions GPU accelerators and schedules workloads to available GPU slices.
Prerequisites
- You have MIG-supported GPU hardware available in your cluster.
- The NVIDIA GPU Operator is installed and the ClusterPolicy shows a Ready state.
- You have installed the DAS Operator.
Procedure
- Create a namespace by running the following command:
  $ oc new-project cuda-workloads
- Create a deployment that requests GPU resources by using the NVIDIA MIG resource:
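The following Deployment is a sketch that matches the names used in the verification steps (deployment cuda-vectoradd, label app=cuda-vectoradd, two replicas). The container image reference and the MIG extended resource name nvidia.com/mig-1g.5gb are assumptions; substitute a CUDA vectorAdd sample image available to your cluster and a MIG profile that your GPUs expose. Save the file as cuda-vectoradd-deployment.yaml:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: cuda-vectoradd
    namespace: cuda-workloads
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: cuda-vectoradd
    template:
      metadata:
        labels:
          app: cuda-vectoradd
      spec:
        containers:
        - name: cuda-vectoradd
          # Illustrative image reference; use a CUDA vectorAdd sample image available in your environment.
          image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
          resources:
            limits:
              # The MIG profile name is an assumption; request a profile exposed on your nodes.
              nvidia.com/mig-1g.5gb: "1"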
- Apply the deployment configuration by running the following command:
  $ oc apply -f cuda-vectoradd-deployment.yaml
- Verify that the deployment is created and pods are scheduled by running the following command:
  $ oc get deployment cuda-vectoradd
  Example output
  NAME             READY   UP-TO-DATE   AVAILABLE   AGE
  cuda-vectoradd   2/2     2            2           2m
- Check the status of the pods by running the following command:
  $ oc get pods -l app=cuda-vectoradd
  Example output
  NAME                              READY   STATUS    RESTARTS   AGE
  cuda-vectoradd-6b8c7d4f9b-abc12   1/1     Running   0          2m
  cuda-vectoradd-6b8c7d4f9b-def34   1/1     Running   0          2m
Verification
- Check that AllocationClaim resources were created for your deployment pods by running the following command:
  $ oc get allocationclaims -n das-operator
  Example output
  NAME                                                                                            AGE
  13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0    2m
  ce997b60-a0b8-4ea4-9107-cf59b425d049-harpatil000034jma-qh5fm-worker-f-fl4wg-cuda-vectoradd-0    2m
- Verify that the GPU slices are properly allocated by checking the resource allocation of one of the pods by running the following command:
  $ oc describe pod -l app=cuda-vectoradd
- Check the logs to verify that the CUDA sample application runs successfully by running the following command:
  $ oc logs -l app=cuda-vectoradd
  Example output
  [Vector addition of 50000 elements]
  Copy input data from the host memory to the CUDA device
  CUDA kernel launch with 196 blocks of 256 threads
  Copy output data from the CUDA device to the host memory
  Test PASSED
- Check the environment variables to verify that the GPU devices are properly exposed to the container by running the following command:
  $ oc exec deployment/cuda-vectoradd -- env | grep -E "(NVIDIA_VISIBLE_DEVICES|CUDA_VISIBLE_DEVICES)"
  Example output
  NVIDIA_VISIBLE_DEVICES=MIG-d8ac9850-d92d-5474-b238-0afeabac1652
  CUDA_VISIBLE_DEVICES=MIG-d8ac9850-d92d-5474-b238-0afeabac1652
  These environment variables indicate that the GPU MIG slice has been properly allocated and is visible to the CUDA runtime within the container.
6.4. Troubleshooting the Dynamic Accelerator Slicer Operator
If you experience issues with the Dynamic Accelerator Slicer (DAS) Operator, use the following troubleshooting steps to diagnose and resolve problems.
Prerequisites
- You have installed the DAS Operator.
- You have access to the OpenShift Container Platform cluster as a user with the cluster-admin role.
6.4.1. Debugging DAS Operator components
Procedure
- Check the status of all DAS Operator components by running the following command:
  $ oc get pods -n das-operator
- Inspect the logs of the DAS Operator controller by running the following command:
  $ oc logs -n das-operator deployment/das-operator
- Check the logs of the webhook server by running the following command:
  $ oc logs -n das-operator deployment/das-operator-webhook
- Check the logs of the scheduler plugin by running the following command:
  $ oc logs -n das-operator deployment/das-scheduler
- Check the logs of the device plugin daemon set by running the following command:
  $ oc logs -n das-operator daemonset/das-daemonset
6.4.2. Monitoring AllocationClaims
Procedure
- Inspect active AllocationClaim resources by running the following command:
  $ oc get allocationclaims -n das-operator
  Example output
  NAME                                                                                            AGE
  13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0    5m
  ce997b60-a0b8-4ea4-9107-cf59b425d049-harpatil000034jma-qh5fm-worker-f-fl4wg-cuda-vectoradd-0    5m
- View detailed information about a specific AllocationClaim by running the following command:
  $ oc get allocationclaims -n das-operator -o yaml
- Check for claims in different states by running the following command:
  $ oc get allocationclaims -n das-operator -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}'
  Example output
  13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0   inUse
  ce997b60-a0b8-4ea4-9107-cf59b425d049-harpatil000034jma-qh5fm-worker-f-fl4wg-cuda-vectoradd-0   inUse
- View events related to AllocationClaim resources by running the following command:
  $ oc get events -n das-operator --field-selector involvedObject.kind=AllocationClaim
- Check NodeAccelerator resources to verify GPU hardware detection by running the following command:
  $ oc get nodeaccelerator -n das-operator
  Example output
  NAME                                     AGE
  harpatil000034jma-qh5fm-worker-f-57md9   96m
  harpatil000034jma-qh5fm-worker-f-fl4wg   96m
  The NodeAccelerator resources represent the GPU-capable nodes detected by the DAS Operator.
Additional information
The AllocationClaim custom resource tracks the following information:
- GPU UUID: The unique identifier of the GPU device.
- Slice position: The position of the MIG slice on the GPU.
- Pod reference: The pod that requested the GPU slice.
- State: The current state of the claim (staged, created, or released).
Claims start in the staged state and transition to created when all requests are satisfied. When a pod is deleted, the associated claim is automatically cleaned up.
6.4.3. Verifying GPU device availability
Procedure
- On a node with GPU hardware, verify that CDI devices were created by running the following commands:
  $ oc debug node/<node-name>
  sh-4.4# chroot /host
  sh-4.4# ls -l /var/run/cdi/
- Check the NVIDIA GPU Operator status by running the following command:
  $ oc get clusterpolicies.nvidia.com -o jsonpath='{.items[0].status.state}'
  The output should show ready.
6.4.4. Increasing log verbosity
Procedure
To get more detailed debugging information:
- Edit the DASOperator resource to increase log verbosity by running the following command:
  $ oc edit dasoperator -n das-operator
- Set the operatorLogLevel field to Debug or Trace:
  spec:
    operatorLogLevel: Debug
- Save the changes and verify that the Operator pods restart with increased verbosity.
6.4.5. Common issues and solutions
Due to kubernetes/kubernetes#128043, pods might enter an UnexpectedAdmissionError state if admission fails. Pods managed by higher-level controllers, such as Deployments, are recreated automatically. Bare pods, however, must be cleaned up manually with oc delete pod. Using controllers is recommended until the upstream issue is resolved.
Prerequisites not met
If the DAS Operator fails to start or function properly, verify that all prerequisites are installed:
- cert-manager Operator for Red Hat OpenShift
- Node Feature Discovery (NFD) Operator
- NVIDIA GPU Operator