Chapter 8. Managing workloads with Kueue
As a cluster administrator, you can manage AI and machine learning workloads at scale by integrating the Red Hat build of Kueue with Red Hat OpenShift AI. This integration provides capabilities for quota management, resource allocation, and prioritized job scheduling.
Starting with OpenShift AI 2.24, the embedded Kueue component for managing distributed workloads is deprecated. Kueue is now provided through Red Hat build of Kueue, which is installed and managed by the Red Hat build of Kueue Operator. You cannot install both the embedded Kueue and the Red Hat build of Kueue Operator on the same cluster because this creates conflicting controllers that manage the same resources.
OpenShift AI does not automatically migrate existing workloads. To ensure your workloads continue using queue management after upgrading, cluster administrators must manually migrate from the embedded Kueue to the Red Hat build of Kueue Operator. For more information, see Migrating to the Red Hat build of Kueue Operator.
8.1. Overview of managing workloads with Kueue
You can use Kueue in OpenShift AI to manage AI and machine learning workloads at scale. Kueue controls how cluster resources are allocated and shared through hierarchical quota management, dynamic resource allocation, and prioritized job scheduling. These capabilities help prevent cluster contention, ensure fair access across teams, and optimize the use of heterogeneous compute resources, such as hardware accelerators.
Kueue lets you schedule diverse workloads, including distributed training jobs (RayJob, RayCluster, PyTorchJob), workbenches (Notebook), and model serving (InferenceService). Kueue validation and queue enforcement apply only to workloads in namespaces with the kueue.openshift.io/managed=true label.
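For illustration, a project opts in to Kueue management with a single namespace label. The following sketch uses a hypothetical project name:

```yaml
# Illustrative only: this label enables Kueue validation and queue
# enforcement for workloads created in this namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: my-data-science-project   # hypothetical project name
  labels:
    kueue.openshift.io/managed: "true"
```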
Using Kueue in OpenShift AI provides these benefits:
- Prevents resource conflicts and prioritizes workload processing
- Manages quotas across teams and projects
- Ensures consistent scheduling for all workload types
- Maximizes GPU and other specialized hardware utilization
8.1.1. Kueue management states
You configure how OpenShift AI interacts with Kueue by setting the managementState in the DataScienceCluster object.
Unmanaged
This state is supported for using Kueue with OpenShift AI. In the Unmanaged state, OpenShift AI integrates with an existing Kueue installation managed by the Red Hat build of Kueue Operator. You must have the Red Hat build of Kueue Operator installed and running on the cluster. When you enable Unmanaged mode, the OpenShift AI Operator creates a default Kueue custom resource (CR) if one does not already exist. This prompts the Red Hat build of Kueue Operator to activate Kueue on the cluster.
Managed
This state is deprecated. Previously, OpenShift AI deployed and managed an embedded Kueue distribution. Managed mode is not compatible with the Red Hat build of Kueue Operator. If both are installed, OpenShift AI stops reconciliation to avoid conflicts. You must migrate any environment using the Managed state to the Unmanaged state to ensure continued support.
Removed
This state disables Kueue in OpenShift AI. If the state was previously Managed, OpenShift AI uninstalls the embedded Kueue distribution. If the state was previously Unmanaged, OpenShift AI stops checking for the external Kueue integration but does not uninstall the Red Hat build of Kueue Operator. An empty managementState value also functions as Removed.
8.1.2. Queue enforcement for projects
To ensure workloads do not bypass the queuing system, a validating webhook automatically enforces queuing rules on any project that is enabled for Kueue management. You enable a project for Kueue management by applying the kueue.openshift.io/managed=true label to the project namespace.
This validating webhook enforcement method replaces the Validating Admission Policy that was used with the deprecated embedded Kueue component. The system also supports the legacy kueue-managed label for backward compatibility, but kueue.openshift.io/managed=true is the recommended label going forward.
After a project is enabled for Kueue management, the webhook requires that any new or updated workload has the kueue.x-k8s.io/queue-name label. If this label is missing, the webhook prevents the workload from being created or updated.
OpenShift AI creates a default, cluster-scoped ClusterQueue (if one does not already exist) and a namespace-scoped LocalQueue for that namespace (if one does not already exist). These default resources are created with the opendatahub.io/managed=false annotation, so they are not managed after creation. Cluster administrators can change or delete them.
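The default queue resources look roughly like the following sketch. The queue names shown are the predefined defaults, but the namespace is a hypothetical example and the exact spec on your cluster may differ:

```yaml
# Sketch of the default queue resources. The opendatahub.io/managed=false
# annotation means OpenShift AI does not reconcile them after creation,
# so administrators can modify or delete them.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: default
  annotations:
    opendatahub.io/managed: "false"
spec:
  namespaceSelector: {}   # admits workloads from any eligible namespace
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default
  namespace: my-data-science-project   # hypothetical project namespace
  annotations:
    opendatahub.io/managed: "false"
spec:
  clusterQueue: default
```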
The webhook enforces this rule on the create and update operations for the following resource types:
- InferenceService
- Notebook
- PyTorchJob
- RayCluster
- RayJob
You can apply hardware profiles to other workload types, but the validation webhook enforces the kueue.x-k8s.io/queue-name label requirement only for these specific resource types.
8.1.3. Restrictions for managing workloads with Kueue
When you use Kueue to manage workloads in OpenShift AI, the following restrictions apply:
- Namespaces must be labeled with kueue.openshift.io/managed=true to enable Kueue validation and queue enforcement.
- All workloads that you create from the OpenShift AI dashboard, such as workbenches and model servers, must use a hardware profile that specifies a local queue.
- When you specify a local queue in a hardware profile, OpenShift AI automatically applies the corresponding kueue.x-k8s.io/queue-name label to workloads that use that profile.
- You cannot use hardware profiles that contain node selectors or tolerations for node placement. To direct workloads to specific nodes, use a hardware profile that specifies a local queue that is associated with a cluster queue configured with the appropriate resource flavors.
- You cannot use accelerator profiles with Kueue. You must migrate any existing accelerator profiles to hardware profiles.
- Because workbenches are not suspendable workloads, you can only assign them to a local queue that is associated with a non-preemptive cluster queue. The default cluster queue that OpenShift AI creates is non-preemptive.
8.1.4. Kueue workflow
Managing workloads with Kueue in OpenShift AI involves tasks for OpenShift cluster administrators, OpenShift AI administrators, and machine learning (ML) engineers or data scientists:
Cluster administrator
Installs and configures Kueue:
- Installs the Red Hat build of Kueue Operator on the cluster, as described in the Red Hat build of Kueue documentation.
- Activates the Kueue integration by setting the managementState to Unmanaged in the DataScienceCluster custom resource, as described in Configuring workload management with Kueue.
- Configures quotas to optimize resource allocation for user workloads, as described in the Red Hat build of Kueue documentation.
- Enables Kueue in the dashboard by setting disableKueue to false in the OdhDashboardConfig custom resource, as described in Enabling Kueue in the dashboard.
Note: When Kueue is enabled in the dashboard, OpenShift AI automatically enables Kueue management for all new projects created from the dashboard. For existing projects, or for projects created by using the OpenShift CLI (oc), you must enable Kueue management manually by applying the kueue.openshift.io/managed=true label to the project namespace.
OpenShift AI administrator
Prepares the OpenShift AI environment:
- Creates Kueue-enabled hardware profiles so that users can submit workloads from the OpenShift AI dashboard, as described in Working with hardware profiles.
ML Engineer or data scientist
Submits workloads to the queuing system:
- For workloads created from the OpenShift AI dashboard, such as workbenches and model servers, selects a Kueue-enabled hardware profile during creation.
- For workloads created by using a command-line interface or an SDK, such as distributed training jobs, adds the kueue.x-k8s.io/queue-name label to the workload’s YAML manifest and sets its value to the target LocalQueue name.
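As a minimal sketch, a distributed training job submitted from the CLI carries the label in its metadata. The job name, namespace, queue name, and image below are illustrative placeholders:

```yaml
# Sketch: the kueue.x-k8s.io/queue-name label routes the job to a LocalQueue
# in the same namespace. In a Kueue-managed namespace, the validating
# webhook rejects the job if this label is missing.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: train-example                      # hypothetical job name
  namespace: my-data-science-project       # hypothetical project namespace
  labels:
    kueue.x-k8s.io/queue-name: default     # target LocalQueue name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: quay.io/example/train:latest   # placeholder image
```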
8.2. Configuring workload management with Kueue
To use workload queuing in OpenShift AI, install the Red Hat build of Kueue Operator and activate the Kueue integration in OpenShift AI.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- You are using OpenShift 4.18 or later.
- You have installed and configured the cert-manager Operator for Red Hat OpenShift for your cluster.
- You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
  - Installing the OpenShift CLI for OpenShift Container Platform
  - Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
Procedure
- In a terminal window, log in to the OpenShift CLI (oc) as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>

- Install the Red Hat build of Kueue Operator on your OpenShift cluster as described in the Red Hat build of Kueue documentation.
- Activate the Kueue integration. You can use the predefined names for the default cluster queue and default local queue, or specify custom names.
  - To use the predefined queue names (default), run the following command. Replace <operator-namespace> with your operator namespace. The default operator namespace is redhat-ods-operator.

      $ oc patch datasciencecluster default-dsc --type='merge' -p '{"spec":{"components":{"kueue":{"managementState":"Unmanaged"}}}}' -n <operator-namespace>

  - To specify custom queue names, run the following command. Replace <example-cluster-queue> and <example-local-queue> with your custom queue names, and replace <operator-namespace> with your operator namespace. The default operator namespace is redhat-ods-operator.

      $ oc patch datasciencecluster default-dsc --type='merge' -p '{"spec":{"components":{"kueue":{"managementState":"Unmanaged","defaultClusterQueueName":"<example-cluster-queue>","defaultLocalQueueName":"<example-local-queue>"}}}}' -n <operator-namespace>
Verification
- Verify that the Red Hat build of Kueue pods are running:

    $ oc get pods -n openshift-kueue-operator

  You should see output similar to the following example:

    kueue-controller-manager-d9fc745df-ph77w     1/1   Running
    openshift-kueue-operator-69cfbf45cf-lwtpm    1/1   Running

- Verify that the default ClusterQueue was created:

    $ oc get clusterqueues
Next steps
- Configure quotas by creating and modifying ResourceFlavor, ClusterQueue, and LocalQueue objects. For details, see the Red Hat build of Kueue documentation.
- Enable Kueue in the dashboard so that users can select Kueue-enabled options when creating workloads. When you enable Kueue, you also enable Kueue management for all new projects created from the dashboard. See Enabling Kueue in the dashboard.
- Cluster administrators and OpenShift AI administrators can create hardware profiles so that users can submit workloads from the OpenShift AI dashboard. See Working with hardware profiles.
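A minimal quota configuration chains the three objects together: a ResourceFlavor describes a category of nodes, a ClusterQueue defines quota against that flavor, and a LocalQueue in a project namespace points at the ClusterQueue. All names and quota values below are illustrative:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-queue                 # hypothetical name
spec:
  namespaceSelector: {}            # admit workloads from any eligible namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: 16
      - name: memory
        nominalQuota: 64Gi
      - name: nvidia.com/gpu
        nominalQuota: 4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-local-queue           # hypothetical name
  namespace: my-data-science-project
spec:
  clusterQueue: team-queue
```

Workloads labeled with kueue.x-k8s.io/queue-name: team-local-queue are then admitted only while quota is available in team-queue.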
8.2.1. Enabling Kueue in the dashboard
Enable Kueue in the OpenShift AI dashboard so that users can select Kueue-enabled options when creating workloads.
When you enable Kueue in the dashboard, OpenShift AI automatically enables Kueue management for all new projects created from the dashboard. For these projects, OpenShift AI applies the kueue.openshift.io/managed=true label to the namespace and creates a LocalQueue object if one does not already exist. The LocalQueue object is created with the opendatahub.io/managed=false annotation, so it is not managed after creation. Cluster administrators can modify or delete it as needed. A validating webhook then enforces that any new or updated workload resource in a Kueue-enabled project includes the kueue.x-k8s.io/queue-name label.
For existing projects, or for projects created by using the OpenShift CLI (oc), you must enable Kueue management manually by applying the kueue.openshift.io/managed=true label to the project namespace:

$ oc label namespace <project-namespace> kueue.openshift.io/managed=true --overwrite
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- You are using OpenShift 4.18 or later.
- You have installed and activated the Red Hat build of Kueue Operator, as described in Configuring workload management with Kueue.
- You have configured quotas, as described in the Red Hat build of Kueue documentation.
Procedure
- In a terminal window, log in to the OpenShift CLI (oc) as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>

- Update the odh-dashboard-config custom resource in the OpenShift AI applications namespace. Replace <applications-namespace> with your OpenShift AI applications namespace. The default is redhat-ods-applications.

    $ oc patch odhdashboardconfig odh-dashboard-config \
        -n <applications-namespace> \
        --type merge \
        -p '{"spec":{"dashboardConfig":{"disableHardwareProfiles":false,"disableKueue":false}}}'
Verification
- From the OpenShift AI dashboard, create a new project.
- Verify that the project namespace is labeled for Kueue management:

    $ oc get ns <project-namespace> -o jsonpath='{.metadata.labels.kueue\.openshift\.io/managed}{"\n"}'

  The output should be true.

- Confirm that a default LocalQueue exists for the project namespace:

    $ oc get localqueues -n <project-namespace>

- Create a test workload (for example, a Notebook) and verify that it includes the kueue.x-k8s.io/queue-name label.
Next step
- Cluster administrators and OpenShift AI administrators can create hardware profiles so that users can submit workloads from the OpenShift AI dashboard. See Working with hardware profiles.
8.3. Troubleshooting common problems with Kueue
If your users are experiencing errors in Red Hat OpenShift AI related to Kueue workloads, read this section to understand the likely causes and how to resolve them.
If the problem is not documented here or in the release notes, contact Red Hat Support.
8.3.1. A user receives a "failed to call webhook" error message for Kueue
Problem
After the user runs the cluster.apply() command, the following error is shown:
ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}
Diagnosis
The Kueue pod might not be running.
Resolution
- In the OpenShift console, select the user’s project from the Project list.
-
Click Workloads
Pods. - Verify that the Kueue pod is running. If necessary, restart the Kueue pod.
Review the logs for the Kueue pod to verify that the webhook server is serving, as shown in the following example:
{"level":"info","ts":"2024-06-24T14:36:24.255137871Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}
8.3.2. A user receives a "Default Local Queue … not found" error message
Problem
After the user runs the cluster.apply() command, the following error is shown:
Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.
Diagnosis
No default local queue is defined, and a local queue is not specified in the cluster configuration.
Resolution
Check whether a local queue exists in the user’s project, as follows:
- In the OpenShift console, select the user’s project from the Project list.
- Click Home → Search, and from the Resources list, select LocalQueue.
- If no local queues are found, create a local queue.
- Provide the user with the details of the local queues in their project, and advise them to add a local queue to their cluster configuration.
- Define a default local queue.
For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
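A default local queue is an ordinary LocalQueue that carries the annotation named in the error message. The following sketch uses hypothetical queue and namespace names:

```yaml
# Sketch: the kueue.x-k8s.io/default-queue annotation marks this LocalQueue
# as the default for workloads that do not specify a local_queue name.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue                        # hypothetical name
  namespace: my-data-science-project         # user's project namespace
  annotations:
    kueue.x-k8s.io/default-queue: "true"
spec:
  clusterQueue: default                      # must reference an existing ClusterQueue
```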
8.3.3. A user receives a "local_queue provided does not exist" error message
Problem
After the user runs the cluster.apply() command, the following error is shown:
local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.
Diagnosis
An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.
Resolution
- In the OpenShift console, select the user’s project from the Project list.
- Click Search, and from the Resources list, select LocalQueue.
- Resolve the problem in one of the following ways:
  - If no local queues are found, create a local queue.
  - If one or more local queues are found, provide the user with the details of the local queues in their project. Advise the user to ensure that they spelled the local queue name correctly in their cluster configuration, and that the namespace value in the cluster configuration matches their project name.
- Define a default local queue.
For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
8.3.4. The pod provisioned by Kueue is terminated before the image is pulled
Problem
Kueue waits for a period of time for all of the workload pods to become provisioned and running before marking a workload as ready. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.
Diagnosis
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Click the user’s pod name to open the pod details page.
- Click the Events tab, and review the pod events to check whether the image pull completed successfully.
Resolution
If the pod takes more than 5 minutes to pull the image, resolve the problem in one of the following ways:
- Add an OnFailure restart policy for resources that are managed by Kueue.
- Configure a custom timeout for the waitForPodsReady property in the Kueue custom resource (CR). The CR is installed in the openshift-kueue-operator namespace by the Red Hat build of Kueue Operator.
For more information about this configuration option, see Enabling waitForPodsReady in the Kueue documentation.
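As a sketch, the setting follows the upstream Kueue configuration format shown below. The exact field path in the Kueue CR provided by the Red Hat build of Kueue Operator may differ, so verify the values against the CR schema installed on your cluster:

```yaml
# Upstream Kueue configuration format (assumption: the Red Hat build of
# Kueue CR exposes equivalent fields; check your cluster's CR schema).
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true
  timeout: 10m          # default is 5m; increase for large images
  blockAdmission: true  # hold new admissions while pods start
```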
8.4. Migrating to the Red Hat build of Kueue Operator
Starting with OpenShift AI 2.24, the embedded Kueue component for managing distributed workloads is deprecated.
OpenShift AI now uses the Red Hat build of Kueue Operator to provide enhanced workload scheduling for distributed training, workbench, and model serving workloads.
Check if your environment is using the embedded Kueue component by verifying the spec.components.kueue.managementState field in the DataScienceCluster custom resource. If the field is set to Managed, you must migrate to the Red Hat build of Kueue Operator before upgrading OpenShift AI to avoid controller conflicts and ensure continued support for queue-based workloads.
OpenShift AI does not automatically migrate workloads, and you cannot install both the embedded Kueue and the Red Hat build of Kueue Operator on the same cluster.
Prerequisites
- Your environment is currently using the embedded Kueue component. That is, the spec.components.kueue.managementState field in the DataScienceCluster custom resource is set to Managed.
  Note: If spec.components.kueue.managementState is set to Removed or Unmanaged, skip this migration.
- You have cluster administrator privileges for your OpenShift cluster.
- You are using OpenShift 4.18 or later.
- You have installed and configured the cert-manager Operator for Red Hat OpenShift for your cluster.
Procedure
- Optional: When you migrate from the embedded Kueue to the Red Hat build of Kueue, the OpenShift AI Operator automatically moves your existing Kueue configuration from the kueue-manager-config ConfigMap to the Kueue custom resource (CR).
  If you want to keep the kueue-manager-config ConfigMap for reference, run the following command. Replace <applications-namespace> with your OpenShift AI applications namespace. The default namespace is redhat-ods-applications.

    $ oc annotate configmap kueue-manager-config -n <applications-namespace> opendatahub.io/managed=false

- Log in to the OpenShift web console as a cluster administrator.
- Uninstall the embedded Kueue component to avoid potential configuration conflicts.
  Note: If you need to keep workloads running without interruption, you can skip this step. However, skipping it is not recommended because it might cause temporary configuration issues during the OpenShift AI upgrade.
  - In the web console, click Operators → Installed Operators and then click the Red Hat OpenShift AI Operator.
  - Click the Data Science Cluster tab.
  - Click the default-dsc object.
  - Click the YAML tab.
  - Set spec.components.kueue.managementState to Removed as shown:

      spec:
        components:
          kueue:
            managementState: Removed

  - Click Save.
  - Wait for the OpenShift AI Operator to reconcile, and then verify that the embedded Kueue was removed:
    - On the Details tab of the default-dsc object, check that the KueueReady condition has a Status of False and a Reason of Removed.
    - Go to Workloads → Deployments, select the project where OpenShift AI is installed (for example, redhat-ods-applications), and confirm that Kueue-related deployments (for example, kueue-controller-manager) are no longer present.
Install the Red Hat build of Kueue Operator on your OpenShift cluster:
- Follow the steps to install the Red Hat build of Kueue Operator, as described in the Red Hat build of Kueue documentation.
- Go to Operators → Installed Operators and confirm that the Red Hat build of Kueue Operator is listed with Status as Succeeded.
Activate the Red Hat build of Kueue Operator in OpenShift AI:
- In the web console, click Operators → Installed Operators and then click the Red Hat OpenShift AI Operator.
- Click the Data Science Cluster tab.
- Click the default-dsc object.
- Click the YAML tab.
- Set spec.components.kueue.managementState to Unmanaged. You can either use the predefined names (default) for the default cluster queue and default local queue, or specify custom names, as shown in the following examples.

  To use the predefined queue names, apply the following configuration:

    spec:
      components:
        kueue:
          managementState: Unmanaged

  To specify custom queue names, apply the following configuration, replacing <example-cluster-queue> and <example-local-queue> with your custom values:

    spec:
      components:
        kueue:
          managementState: Unmanaged
          defaultClusterQueueName: <example-cluster-queue>
          defaultLocalQueueName: <example-local-queue>

- Click Save.
- Enable Kueue management for existing projects by applying the kueue.openshift.io/managed=true label to each project namespace:

    $ oc label namespace <project-namespace> kueue.openshift.io/managed=true --overwrite

  Replace <project-namespace> with the name of your project.
  Note: Kueue validation and queue enforcement apply only to workloads in namespaces labeled with kueue.openshift.io/managed=true.
Verification
- Verify that the embedded Kueue component was removed.
- Verify that the DataScienceCluster resource shows a healthy Unmanaged status for Kueue.
- Verify that existing workloads in the queue continue to be processed by the Red Hat build of Kueue controllers. Submit a new test workload to confirm functionality.
Next steps
- Configure quotas by creating and modifying ResourceFlavor, ClusterQueue, and LocalQueue objects. For details, see the Red Hat build of Kueue documentation.
- Enable Kueue in the dashboard so that users can select Kueue-enabled options when creating workloads. When enabled, Kueue management is automatically applied to all new projects created from the dashboard. See Enabling Kueue in the dashboard.
- Cluster administrators and OpenShift AI administrators can create hardware profiles so that users can submit workloads from the OpenShift AI dashboard. See Working with hardware profiles.