Chapter 6. Dynamic Accelerator Slicer (DAS) Operator
Dynamic Accelerator Slicer Operator is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
The Dynamic Accelerator Slicer (DAS) Operator allows you to dynamically slice GPU accelerators in OpenShift Container Platform, instead of relying on statically sliced GPUs defined when the node is booted. This allows you to dynamically slice GPUs based on specific workload demands, ensuring efficient resource utilization.
Dynamic slicing is useful if you do not know all the accelerator partitions needed in advance on every node on the cluster.
The DAS Operator currently includes a reference implementation for NVIDIA Multi-Instance GPU (MIG) and is designed to support additional technologies such as NVIDIA MPS or GPUs from other vendors in the future.
Limitations
The following limitations apply when using the Dynamic Accelerator Slicer Operator:
- You need to identify potential incompatibilities and ensure the system works seamlessly with various GPU drivers and operating systems.
- The Operator only works with specific MIG-compatible NVIDIA GPUs and drivers, such as the H100 and A100.
- The Operator cannot manage only a subset of the GPUs on a node.
- The NVIDIA device plugin cannot be used together with the Dynamic Accelerator Slicer Operator to manage the GPU resources of a cluster.
The DAS Operator is designed to work with MIG-enabled GPUs. It allocates MIG slices instead of whole GPUs. Installing the DAS Operator prevents the use of the standard resource request through the NVIDIA device plugin, such as nvidia.com/gpu: "1", for allocating an entire GPU.
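For example, a workload that previously requested a whole GPU through the device plugin instead requests a MIG slice extended resource. The following snippet is a minimal illustration; the exact resource name (nvidia.com/mig-1g.5gb here) is an assumption and depends on the GPU model and the MIG profiles exposed on your nodes:

  # Illustrative container resources stanza; the MIG profile name varies by GPU and configuration.
  resources:
    limits:
      nvidia.com/mig-1g.5gb: "1"   # request one MIG slice instead of nvidia.com/gpu: "1"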
6.1. Installing the Dynamic Accelerator Slicer Operator
As a cluster administrator, you can install the Dynamic Accelerator Slicer (DAS) Operator by using the OpenShift Container Platform web console or the OpenShift CLI.
6.1.1. Installing the Dynamic Accelerator Slicer Operator using the web console
As a cluster administrator, you can install the Dynamic Accelerator Slicer (DAS) Operator using the OpenShift Container Platform web console.
Prerequisites
- You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.
- You have installed the required prerequisites:
- cert-manager Operator for Red Hat OpenShift
- Node Feature Discovery (NFD) Operator
- NVIDIA GPU Operator
- NodeFeatureDiscovery CR
Procedure
Configure the NVIDIA GPU Operator for MIG support:
- In the OpenShift Container Platform web console, navigate to Operators → Installed Operators.
- Select the NVIDIA GPU Operator from the list of installed operators.
- Click the ClusterPolicy tab and then click Create ClusterPolicy.
- In the YAML editor, replace the default content with the following cluster policy configuration to disable the default NVIDIA device plugin and enable MIG support:
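The following ClusterPolicy is a minimal sketch that illustrates the relevant settings: the default NVIDIA device plugin is disabled and MIG management is enabled with the mixed MIG strategy. The field names follow the NVIDIA GPU Operator ClusterPolicy API, but treat this as an outline and verify the full specification against the ClusterPolicy version installed in your cluster:

  apiVersion: nvidia.com/v1
  kind: ClusterPolicy
  metadata:
    name: gpu-cluster-policy
  spec:
    devicePlugin:
      enabled: false        # disable the default NVIDIA device plugin; the DAS Operator allocates MIG slices instead
    migManager:
      enabled: true         # let the GPU Operator manage MIG geometry on labeled nodes
    mig:
      strategy: mixed       # expose individual MIG profiles as extended resources
    driver:
      enabled: true
    toolkit:
      enabled: true
    dcgmExporter:
      enabled: true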
- Click Create to apply the cluster policy.
- Navigate to Workloads → Pods and select the nvidia-gpu-operator namespace to monitor the cluster policy deployment.
- Wait for the NVIDIA GPU Operator cluster policy to reach the Ready state. You can monitor this by:
  - Navigating to Operators → Installed Operators → NVIDIA GPU Operator.
  - Clicking the ClusterPolicy tab and checking that the status shows ready.
- Verify that all pods in the NVIDIA GPU Operator namespace are running by selecting the nvidia-gpu-operator namespace and navigating to Workloads → Pods.
Label nodes with MIG-capable GPUs to enable MIG mode:
- Navigate to Compute → Nodes.
- Select a node that has MIG-capable GPUs.
- Click Actions → Edit Labels.
- Add the label nvidia.com/mig.config=all-enabled.
- Click Save.
- Repeat for each node with MIG-capable GPUs.
Important: After applying the MIG label, the labeled nodes reboot to enable MIG mode. Wait for the nodes to come back online before proceeding.
- Verify that MIG mode is successfully enabled on the GPU nodes by checking that the nvidia.com/mig.config=all-enabled label appears in the Labels section. To locate the label, navigate to Compute → Nodes, select the GPU node, and click the Details tab.
Install the DAS Operator:
- In the OpenShift Container Platform web console, click Operators → OperatorHub.
- Search for Dynamic Accelerator Slicer or DAS in the filter box to locate the DAS Operator.
- Select the Dynamic Accelerator Slicer and click Install.
On the Install Operator page:
- Select All namespaces on the cluster (default) for the installation mode.
- For Installed Namespace, select Operator recommended Namespace: Project das-operator.
- If creating a new namespace, enter das-operator as the namespace name.
- Select an update channel.
- Select Automatic or Manual for the approval strategy.
- Click Install.
Create the DASOperator CR:
- In the OpenShift Container Platform web console, click Operators → Installed Operators.
- Select DAS Operator from the list.
- In the Provided APIs table column, click DASOperator. This takes you to the DASOperator tab of the Operator details page.
- Click Create DASOperator. This takes you to the Create DASOperator YAML view.
- In the YAML editor, paste the following example:
  Example DASOperator CR
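A minimal DASOperator CR sketch follows. The apiVersion shown is an assumption; keep the apiVersion that the Create DASOperator YAML view prepopulates if it differs. The spec fields follow the common OpenShift operator configuration pattern:

  apiVersion: inference.redhat.com/v1alpha1   # assumption; use the apiVersion shown in the Create DASOperator YAML view
  kind: DASOperator
  metadata:
    name: cluster            # 1
    namespace: das-operator
  spec:
    managementState: Managed
    logLevel: Normal
    operatorLogLevel: Normal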
  1. The name of the DASOperator CR must be cluster.
- Click Create.
Verification
To verify that the DAS Operator installed successfully:
- Navigate to the Operators → Installed Operators page.
- Ensure that Dynamic Accelerator Slicer is listed in the das-operator namespace with a Status of Succeeded.
To verify that the DASOperator CR installed successfully:
- After you create the DASOperator CR, the web console brings you to the DASOperator list view. The Status field of the CR changes to Available when all of the components are running.
- Optional: You can verify that the DASOperator CR installed successfully by running the following command in the OpenShift CLI:
  $ oc get dasoperator -n das-operator
  Example output
  NAME      STATUS      AGE
  cluster   Available   3m
Note: During installation, an Operator might display a Failed status. If the installation later succeeds with a Succeeded message, you can ignore the Failed message.
You can also verify the installation by checking the pods:
- Navigate to the Workloads → Pods page and select the das-operator namespace.
- Verify that all DAS Operator component pods are running:
  - das-operator pods (main operator controllers)
  - das-operator-webhook pods (webhook servers)
  - das-scheduler pods (scheduler plugins)
  - das-daemonset pods (only on nodes with MIG-compatible GPUs)
Note: The das-daemonset pods only appear on nodes that have MIG-compatible GPU hardware. If you do not see any daemonset pods, verify that your cluster has nodes with supported GPU hardware and that the NVIDIA GPU Operator is properly configured.
Troubleshooting
Use the following procedure if the Operator does not appear to be installed:
- Navigate to the Operators → Installed Operators page and inspect the Operator Subscriptions and Install Plans tabs for any failures or errors under Status.
- Navigate to the Workloads → Pods page and check the logs of the pods in the das-operator namespace.
6.1.2. Installing the Dynamic Accelerator Slicer Operator using the CLI
As a cluster administrator, you can install the Dynamic Accelerator Slicer (DAS) Operator using the OpenShift CLI.
Prerequisites
- You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.
- You have installed the OpenShift CLI (oc).
- You have installed the required prerequisites:
- cert-manager Operator for Red Hat OpenShift
- Node Feature Discovery (NFD) Operator
- NVIDIA GPU Operator
- NodeFeatureDiscovery CR
Procedure
Configure the NVIDIA GPU Operator for MIG support:
- Apply the following cluster policy to disable the default NVIDIA device plugin and enable MIG support. Create a file named gpu-cluster-policy.yaml with the following content:
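The following content is a minimal sketch of such a cluster policy: it disables the default NVIDIA device plugin and enables MIG management with the mixed MIG strategy. The field names follow the NVIDIA GPU Operator ClusterPolicy API; verify the full specification against the ClusterPolicy version installed in your cluster before applying it:

  apiVersion: nvidia.com/v1
  kind: ClusterPolicy
  metadata:
    name: gpu-cluster-policy
  spec:
    devicePlugin:
      enabled: false        # disable the default NVIDIA device plugin; the DAS Operator allocates MIG slices instead
    migManager:
      enabled: true         # manage MIG geometry on nodes labeled nvidia.com/mig.config=all-enabled
    mig:
      strategy: mixed       # expose individual MIG profiles as extended resources
    driver:
      enabled: true
    toolkit:
      enabled: true
    dcgmExporter:
      enabled: true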
- Apply the cluster policy by running the following command:
  $ oc apply -f gpu-cluster-policy.yaml
- Verify that the NVIDIA GPU Operator cluster policy reaches the Ready state by running the following command:
  $ oc get clusterpolicies.nvidia.com gpu-cluster-policy -w
  Wait until the STATUS column shows ready.
  Example output
  NAME                 STATUS   AGE
  gpu-cluster-policy   ready    2025-08-14T08:56:45Z
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that all pods in the NVIDIA GPU Operator namespace are running by running the following command:
oc get pods -n nvidia-gpu-operator
$ oc get pods -n nvidia-gpu-operator
Copy to Clipboard Copied! Toggle word wrap Toggle overflow All pods should show a
Running
orCompleted
status.Label nodes with MIG-capable GPUs to enable MIG mode by running the following command:
  $ oc label node $NODE_NAME nvidia.com/mig.config=all-enabled --overwrite
  Replace $NODE_NAME with the name of each node that has MIG-capable GPUs.
  Important: After applying the MIG label, the labeled nodes reboot to enable MIG mode. Wait for the nodes to come back online before proceeding.
- Verify that the nodes have successfully enabled MIG mode by running the following command:
  $ oc get nodes -l nvidia.com/mig.config=all-enabled
Create a namespace for the DAS Operator:
- Create the following Namespace custom resource (CR) that defines the das-operator namespace, and save the YAML in the das-namespace.yaml file:
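A minimal Namespace CR sketch, assuming only the namespace name is required; the cluster-monitoring label is a common optional addition, not a documented requirement:

  apiVersion: v1
  kind: Namespace
  metadata:
    name: das-operator
    labels:
      # Optional: enables cluster monitoring for the namespace.
      openshift.io/cluster-monitoring: "true"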
- Create the namespace by running the following command:
  $ oc create -f das-namespace.yaml
Install the DAS Operator in the namespace you created in the previous step by creating the following objects:
- Create the following OperatorGroup CR and save the YAML in the das-operatorgroup.yaml file:
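A minimal OperatorGroup CR sketch, assuming the All namespaces install mode used in the web console procedure (an OperatorGroup with an empty spec selects all namespaces); adjust spec.targetNamespaces if you scope the Operator differently:

  apiVersion: operators.coreos.com/v1
  kind: OperatorGroup
  metadata:
    name: das-operator
    namespace: das-operator
  spec: {}   # empty spec targets all namespaces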
- Create the OperatorGroup CR by running the following command:
  $ oc create -f das-operatorgroup.yaml
- Create the following Subscription CR and save the YAML in the das-sub.yaml file:
  Example Subscription
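A minimal Subscription CR sketch. The package name, catalog source, and channel match the values shown in the uninstall example output later in this chapter; the approval strategy is an assumption and can be Automatic or Manual:

  apiVersion: operators.coreos.com/v1alpha1
  kind: Subscription
  metadata:
    name: das-operator
    namespace: das-operator
  spec:
    channel: stable
    name: das-operator
    source: redhat-operators
    sourceNamespace: openshift-marketplace
    installPlanApproval: Automatic   # assumption; Manual is also valid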
- Create the subscription object by running the following command:
  $ oc create -f das-sub.yaml
- Change to the das-operator project:
  $ oc project das-operator
- Create the following DASOperator CR and save the YAML in the das-dasoperator.yaml file:
  Example DASOperator CR
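A minimal DASOperator CR sketch follows. The apiVersion is an assumption; after the Operator is installed you can confirm it with oc api-resources | grep -i dasoperator. The spec fields follow the common OpenShift operator configuration pattern:

  apiVersion: inference.redhat.com/v1alpha1   # assumption; confirm with `oc api-resources | grep -i dasoperator`
  kind: DASOperator
  metadata:
    name: cluster            # 1
    namespace: das-operator
  spec:
    managementState: Managed
    logLevel: Normal
    operatorLogLevel: Normal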
  1. The name of the DASOperator CR must be cluster.
- Create the dasoperator CR by running the following command:
  $ oc create -f das-dasoperator.yaml
Verification
- Verify that the Operator deployment is successful by running the following command:
  $ oc get pods
  A successful deployment shows all pods with a Running status. The deployment includes:
  - das-operator: Main Operator controller pods
  - das-operator-webhook: Webhook server pods for mutating pod requests
  - das-scheduler: Scheduler plugin pods for MIG slice allocation
  - das-daemonset: Daemon set pods that run only on nodes with MIG-compatible GPUs
  Note: The das-daemonset pods only appear on nodes that have MIG-compatible GPU hardware. If you do not see any daemonset pods, verify that your cluster has nodes with supported GPU hardware and that the NVIDIA GPU Operator is properly configured.
6.2. Uninstalling the Dynamic Accelerator Slicer Operator
Use one of the following procedures to uninstall the Dynamic Accelerator Slicer (DAS) Operator, depending on how the Operator was installed.
6.2.1. Uninstalling the Dynamic Accelerator Slicer Operator using the web console
You can uninstall the Dynamic Accelerator Slicer (DAS) Operator using the OpenShift Container Platform web console.
Prerequisites
- You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.
- The DAS Operator is installed in your cluster.
Procedure
- In the OpenShift Container Platform web console, navigate to Operators → Installed Operators.
- Locate the Dynamic Accelerator Slicer in the list of installed Operators.
- Click the Options menu for the DAS Operator and select Uninstall Operator.
- In the confirmation dialog, click Uninstall to confirm the removal.
- Navigate to Home → Projects.
- Search for das-operator in the search box to locate the DAS Operator project.
- Click the Options menu next to the das-operator project, and select Delete Project.
- In the confirmation dialog, type das-operator in the dialog box, and click Delete to confirm the deletion.
Verification
- Navigate to the Operators → Installed Operators page.
- Verify that the Dynamic Accelerator Slicer (DAS) Operator is no longer listed.
- Optional: Verify that the das-operator namespace and its resources have been removed by running the following command:
  $ oc get namespace das-operator
  The command should return an error indicating that the namespace is not found.
Uninstalling the DAS Operator removes all GPU slice allocations and might cause running workloads that depend on GPU slices to fail. Ensure that no critical workloads are using GPU slices before proceeding with the uninstallation.
6.2.2. Uninstalling the Dynamic Accelerator Slicer Operator using the CLI
You can uninstall the Dynamic Accelerator Slicer (DAS) Operator using the OpenShift CLI.
Prerequisites
- You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.
- You have installed the OpenShift CLI (oc).
- The DAS Operator is installed in your cluster.
Procedure
- List the installed Operators to find the DAS Operator subscription by running the following command:
  $ oc get subscriptions -n das-operator
  Example output
  NAME           PACKAGE        SOURCE             CHANNEL
  das-operator   das-operator   redhat-operators   stable
- Delete the subscription by running the following command:
  $ oc delete subscription das-operator -n das-operator
- List and delete the cluster service version (CSV) by running the following commands:
  $ oc get csv -n das-operator
  $ oc delete csv <csv-name> -n das-operator
- Remove the Operator group by running the following command:
  $ oc delete operatorgroup das-operator -n das-operator
- Delete any remaining AllocationClaim resources by running the following command:
  $ oc delete allocationclaims --all -n das-operator
- Remove the DAS Operator namespace by running the following command:
  $ oc delete namespace das-operator
Verification
- Verify that the DAS Operator resources have been removed by running the following command:
  $ oc get namespace das-operator
  The command should return an error indicating that the namespace is not found.
- Verify that no AllocationClaim custom resource definitions remain by running the following command:
  $ oc get crd | grep allocationclaim
  The command should return no output, indicating that no AllocationClaim custom resource definitions remain.
Uninstalling the DAS Operator removes all GPU slice allocations and might cause running workloads that depend on GPU slices to fail. Ensure that no critical workloads are using GPU slices before proceeding with the uninstallation.
6.3. Deploying GPU workloads with the Dynamic Accelerator Slicer Operator
You can deploy workloads that request GPU slices managed by the Dynamic Accelerator Slicer (DAS) Operator. The Operator dynamically partitions GPU accelerators and schedules workloads to available GPU slices.
Prerequisites
- You have MIG-supported GPU hardware available in your cluster.
- The NVIDIA GPU Operator is installed and the ClusterPolicy shows a Ready state.
- You have installed the DAS Operator.
Procedure
- Create a namespace by running the following command:
  $ oc new-project cuda-workloads
- Create a deployment that requests GPU resources by using the NVIDIA MIG resource:
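The following Deployment is a sketch that matches the names used in the verification steps (deployment cuda-vectoradd, label app=cuda-vectoradd, two replicas). The container image reference and the MIG extended resource name nvidia.com/mig-1g.5gb are assumptions; substitute a CUDA vectorAdd sample image available to your cluster and a MIG profile that your GPUs expose. Save the file as cuda-vectoradd-deployment.yaml:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: cuda-vectoradd
    namespace: cuda-workloads
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: cuda-vectoradd
    template:
      metadata:
        labels:
          app: cuda-vectoradd
      spec:
        containers:
        - name: cuda-vectoradd
          # Illustrative image reference; use a CUDA vectorAdd sample image available in your environment.
          image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
          resources:
            limits:
              # The MIG profile name is an assumption; request a profile exposed on your nodes.
              nvidia.com/mig-1g.5gb: "1"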
- Apply the deployment configuration by running the following command:
  $ oc apply -f cuda-vectoradd-deployment.yaml
- Verify that the deployment is created and pods are scheduled by running the following command:
  $ oc get deployment cuda-vectoradd
  Example output
  NAME             READY   UP-TO-DATE   AVAILABLE   AGE
  cuda-vectoradd   2/2     2            2           2m
- Check the status of the pods by running the following command:
  $ oc get pods -l app=cuda-vectoradd
  Example output
  NAME                              READY   STATUS    RESTARTS   AGE
  cuda-vectoradd-6b8c7d4f9b-abc12   1/1     Running   0          2m
  cuda-vectoradd-6b8c7d4f9b-def34   1/1     Running   0          2m
Verification
- Check that AllocationClaim resources were created for your deployment pods by running the following command:
  $ oc get allocationclaims -n das-operator
  Example output
  NAME                                                                                            AGE
  13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0    2m
  ce997b60-a0b8-4ea4-9107-cf59b425d049-harpatil000034jma-qh5fm-worker-f-fl4wg-cuda-vectoradd-0    2m
- Verify that the GPU slices are properly allocated by checking the resource allocation of one of the pods by running the following command:
  $ oc describe pod -l app=cuda-vectoradd
- Check the logs to verify that the CUDA sample application runs successfully by running the following command:
  $ oc logs -l app=cuda-vectoradd
  Example output
  [Vector addition of 50000 elements]
  Copy input data from the host memory to the CUDA device
  CUDA kernel launch with 196 blocks of 256 threads
  Copy output data from the CUDA device to the host memory
  Test PASSED
- Check the environment variables to verify that the GPU devices are properly exposed to the container by running the following command:
  $ oc exec deployment/cuda-vectoradd -- env | grep -E "(NVIDIA_VISIBLE_DEVICES|CUDA_VISIBLE_DEVICES)"
  Example output
  NVIDIA_VISIBLE_DEVICES=MIG-d8ac9850-d92d-5474-b238-0afeabac1652
  CUDA_VISIBLE_DEVICES=MIG-d8ac9850-d92d-5474-b238-0afeabac1652
  These environment variables indicate that the GPU MIG slice has been properly allocated and is visible to the CUDA runtime within the container.
6.4. Troubleshooting the Dynamic Accelerator Slicer Operator
If you experience issues with the Dynamic Accelerator Slicer (DAS) Operator, use the following troubleshooting steps to diagnose and resolve problems.
Prerequisites
- You have installed the DAS Operator.
- You have access to the OpenShift Container Platform cluster as a user with the cluster-admin role.
6.4.1. Debugging DAS Operator components
Procedure
- Check the status of all DAS Operator components by running the following command:
  $ oc get pods -n das-operator
- Inspect the logs of the DAS Operator controller by running the following command:
  $ oc logs -n das-operator deployment/das-operator
- Check the logs of the webhook server by running the following command:
  $ oc logs -n das-operator deployment/das-operator-webhook
- Check the logs of the scheduler plugin by running the following command:
  $ oc logs -n das-operator deployment/das-scheduler
- Check the logs of the device plugin daemon set by running the following command:
  $ oc logs -n das-operator daemonset/das-daemonset
6.4.2. Monitoring AllocationClaims
Procedure
- Inspect active AllocationClaim resources by running the following command:
  $ oc get allocationclaims -n das-operator
  Example output
  NAME                                                                                            AGE
  13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0    5m
  ce997b60-a0b8-4ea4-9107-cf59b425d049-harpatil000034jma-qh5fm-worker-f-fl4wg-cuda-vectoradd-0    5m
- View detailed information about a specific AllocationClaim by running the following command:
  $ oc get allocationclaims -n das-operator -o yaml
- Check for claims in different states by running the following command:
  $ oc get allocationclaims -n das-operator -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}'
  Example output
  13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0   inUse
  ce997b60-a0b8-4ea4-9107-cf59b425d049-harpatil000034jma-qh5fm-worker-f-fl4wg-cuda-vectoradd-0   inUse
- View events related to AllocationClaim resources by running the following command:
  $ oc get events -n das-operator --field-selector involvedObject.kind=AllocationClaim
- Check NodeAccelerator resources to verify GPU hardware detection by running the following command:
  $ oc get nodeaccelerator -n das-operator
  Example output
  NAME                                     AGE
  harpatil000034jma-qh5fm-worker-f-57md9   96m
  harpatil000034jma-qh5fm-worker-f-fl4wg   96m
  The NodeAccelerator resources represent the GPU-capable nodes detected by the DAS Operator.
Additional information
The AllocationClaim custom resource tracks the following information:
- GPU UUID: The unique identifier of the GPU device.
- Slice position: The position of the MIG slice on the GPU.
- Pod reference: The pod that requested the GPU slice.
- State: The current state of the claim (staged, created, or released).
Claims start in the staged state and transition to created when all requests are satisfied. When a pod is deleted, the associated claim is automatically cleaned up.
6.4.3. Verifying GPU device availability
Procedure
- On a node with GPU hardware, verify that CDI devices were created by running the following commands:
  $ oc debug node/<node-name>
  sh-4.4# chroot /host
  sh-4.4# ls -l /var/run/cdi/
- Check the NVIDIA GPU Operator status by running the following command:
  $ oc get clusterpolicies.nvidia.com -o jsonpath='{.items[0].status.state}'
  The output should show ready.
6.4.4. Increasing log verbosity
Procedure
To get more detailed debugging information:
- Edit the DASOperator resource to increase log verbosity by running the following command:
  $ oc edit dasoperator -n das-operator
- Set the operatorLogLevel field to Debug or Trace:
  spec:
    operatorLogLevel: Debug
- Save the changes and verify that the Operator pods restart with increased verbosity.
6.4.5. Common issues and solutions
Due to kubernetes/kubernetes#128043, pods might enter an UnexpectedAdmissionError state if admission fails. Pods managed by higher-level controllers, such as Deployments, are recreated automatically. Bare pods, however, must be cleaned up manually with oc delete pod. Using controllers is recommended until the upstream issue is resolved.
Prerequisites not met
If the DAS Operator fails to start or function properly, verify that all prerequisites are installed:
- cert-manager Operator for Red Hat OpenShift
- Node Feature Discovery (NFD) Operator
- NVIDIA GPU Operator