
Chapter 8. Managing workloads with Kueue


As a cluster administrator, you can manage AI and machine learning workloads at scale by integrating the Red Hat build of Kueue with Red Hat OpenShift AI. This integration provides capabilities for quota management, resource allocation, and prioritized job scheduling.

Important

Starting with OpenShift AI 2.24, the embedded Kueue component for managing distributed workloads is deprecated. Kueue is now provided through Red Hat build of Kueue, which is installed and managed by the Red Hat build of Kueue Operator. You cannot install both the embedded Kueue and the Red Hat build of Kueue Operator on the same cluster because this creates conflicting controllers that manage the same resources.

OpenShift AI does not automatically migrate existing workloads. To ensure your workloads continue using queue management after upgrading, cluster administrators must manually migrate from the embedded Kueue to the Red Hat build of Kueue Operator. For more information, see Migrating to the Red Hat build of Kueue Operator.

8.1. Overview of managing workloads with Kueue

You can use Kueue in OpenShift AI to manage AI and machine learning workloads at scale. Kueue controls how cluster resources are allocated and shared through hierarchical quota management, dynamic resource allocation, and prioritized job scheduling. These capabilities help prevent cluster contention, ensure fair access across teams, and optimize the use of heterogeneous compute resources, such as hardware accelerators.

Kueue lets you schedule diverse workloads, including distributed training jobs (RayJob, RayCluster, PyTorchJob), workbenches (Notebook), and model serving (InferenceService). Kueue validation and queue enforcement apply only to workloads in namespaces with the kueue.openshift.io/managed=true label.

Using Kueue in OpenShift AI provides these benefits:

  • Prevents resource conflicts and prioritizes workload processing
  • Manages quotas across teams and projects
  • Ensures consistent scheduling for all workload types
  • Maximizes GPU and other specialized hardware utilization

8.1.1. Kueue management states

You configure how OpenShift AI interacts with Kueue by setting the managementState in the DataScienceCluster object.

Unmanaged

This is the supported state for using Kueue with OpenShift AI. In the Unmanaged state, OpenShift AI integrates with an existing Kueue installation managed by the Red Hat build of Kueue Operator. You must have the Red Hat build of Kueue Operator installed and running on the cluster.

When you enable Unmanaged mode, the OpenShift AI Operator creates a default Kueue custom resource (CR) if one does not already exist. This prompts the Red Hat build of Kueue Operator to activate Kueue on the cluster.

Managed
This state is deprecated. Previously, OpenShift AI deployed and managed an embedded Kueue distribution. Managed mode is not compatible with the Red Hat build of Kueue Operator. If both are installed, OpenShift AI stops reconciliation to avoid conflicts. You must migrate any environment using the Managed state to the Unmanaged state to ensure continued support.
Removed
This state disables Kueue in OpenShift AI. If the state was previously Managed, OpenShift AI uninstalls the embedded Kueue distribution. If the state was previously Unmanaged, OpenShift AI stops checking for the external Kueue integration but does not uninstall the Red Hat build of Kueue Operator. An empty managementState value also functions as Removed.

8.1.2. Queue enforcement for projects

To ensure workloads do not bypass the queuing system, a validating webhook automatically enforces queuing rules on any project that is enabled for Kueue management. You enable a project for Kueue management by applying the kueue.openshift.io/managed=true label to the project namespace.

Note

This validating webhook enforcement method replaces the Validating Admission Policy that was used with the deprecated embedded Kueue component. The system also supports the legacy kueue-managed label for backward compatibility, but kueue.openshift.io/managed=true is the recommended label going forward.

After a project is enabled for Kueue management, the webhook requires that any new or updated workload has the kueue.x-k8s.io/queue-name label. If this label is missing, the webhook prevents the workload from being created or updated.
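For example, the following sketch shows the metadata of a workload that satisfies the webhook; the Notebook name, namespace, and queue name are illustrative placeholders:

```yaml
# Metadata fragment only; the rest of the Notebook spec is unchanged.
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: my-workbench                                # illustrative name
  namespace: my-project                             # namespace labeled kueue.openshift.io/managed=true
  labels:
    kueue.x-k8s.io/queue-name: example-local-queue  # required by the validating webhook
```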

OpenShift AI creates a default, cluster-scoped ClusterQueue (if one does not already exist) and a namespace-scoped LocalQueue for that namespace (if one does not already exist). These default resources are created with the opendatahub.io/managed=false annotation, so they are not managed after creation. Cluster administrators can change or delete them.

The webhook enforces this rule on the create and update operations for the following resource types:

  • InferenceService
  • Notebook
  • PyTorchJob
  • RayCluster
  • RayJob
Note

You can apply hardware profiles to other workload types, but the validation webhook enforces the kueue.x-k8s.io/queue-name label requirement only for these specific resource types.

8.1.3. Restrictions for managing workloads with Kueue

When you use Kueue to manage workloads in OpenShift AI, the following restrictions apply:

  • Namespaces must be labeled with kueue.openshift.io/managed=true to enable Kueue validation and queue enforcement.
  • All workloads that you create from the OpenShift AI dashboard, such as workbenches and model servers, must use a hardware profile that specifies a local queue.
  • When you specify a local queue in a hardware profile, OpenShift AI automatically applies the corresponding kueue.x-k8s.io/queue-name label to workloads that use that profile.
  • You cannot use hardware profiles that contain node selectors or tolerations for node placement. To direct workloads to specific nodes, use a hardware profile that specifies a local queue that is associated with a queue configured with the appropriate resource flavors.
  • You cannot use accelerator profiles with Kueue. You must migrate any existing accelerator profiles to hardware profiles.
  • Because workbenches are not suspendable workloads, you can only assign them to a local queue that is associated with a non-preemptive cluster queue. The default cluster queue that OpenShift AI creates is non-preemptive.
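The non-preemptive behavior described above corresponds to the preemption policies in the upstream Kueue ClusterQueue API. As a sketch, the relevant fragment of a non-preemptive ClusterQueue spec looks like the following (Never is also the upstream default for both fields):

```yaml
# Fragment of a ClusterQueue spec; other fields are unchanged.
spec:
  preemption:
    withinClusterQueue: Never     # do not preempt workloads already admitted to this queue
    reclaimWithinCohort: Never    # do not reclaim quota borrowed within the cohort
```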

8.1.4. Kueue workflow

Managing workloads with Kueue in OpenShift AI involves tasks for OpenShift cluster administrators, OpenShift AI administrators, and machine learning (ML) engineers or data scientists:

Cluster administrator

Installs and configures Kueue:

  1. Installs the Red Hat build of Kueue Operator on the cluster, as described in the Red Hat build of Kueue documentation.
  2. Activates the Kueue integration by setting the managementState to Unmanaged in the DataScienceCluster custom resource, as described in Configuring workload management with Kueue.
  3. Configures quotas to optimize resource allocation for user workloads, as described in the Red Hat build of Kueue documentation.
  4. Enables Kueue in the dashboard by setting disableKueue to false in the OdhDashboardConfig custom resource, as described in Enabling Kueue in the dashboard.

    Note

    When Kueue is enabled in the dashboard, OpenShift AI automatically enables Kueue management for all new projects created from the dashboard. For existing projects, or for projects created by using the command-line interface, you must enable Kueue management manually by applying the kueue.openshift.io/managed=true label to the project namespace.

OpenShift AI administrator

Prepares the OpenShift AI environment:

  1. Creates Kueue-enabled hardware profiles so that users can submit workloads from the OpenShift AI dashboard, as described in Working with hardware profiles.

ML Engineer or data scientist

Submits workloads to the queuing system:

  1. For workloads created from the OpenShift AI dashboard, such as workbenches and model servers, selects a Kueue-enabled hardware profile during creation.
  2. For workloads created by using a command-line interface or an SDK, such as distributed training jobs, adds the kueue.x-k8s.io/queue-name label to the workload’s YAML manifest and sets its value to the target LocalQueue name.
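As an illustration of the second case, the following sketch shows a PyTorchJob manifest with the queue label applied; the job name, namespace, container image, and queue name are placeholders:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: train-demo
  namespace: my-project
  labels:
    kueue.x-k8s.io/queue-name: example-local-queue  # name of the target LocalQueue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: quay.io/example/train:latest     # placeholder image
            resources:
              requests:
                cpu: "2"
                memory: 4Gi
```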

8.2. Configuring workload management with Kueue

To use workload queuing in OpenShift AI, install the Red Hat build of Kueue Operator and activate the Kueue integration in OpenShift AI.

Prerequisites

  • You have cluster administrator privileges for your OpenShift cluster.
  • You are using OpenShift 4.18 or later.
  • You have installed and configured the cert-manager Operator for Red Hat OpenShift for your cluster.
  • You have installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.

Procedure

  1. In a terminal window, log in to the OpenShift CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Install the Red Hat build of Kueue Operator on your OpenShift cluster as described in the Red Hat build of Kueue documentation.
  3. Activate the Kueue integration. You can use the predefined names for the default cluster queue and default local queue, or specify custom names.

    • To use the predefined queue names (default), run the following command. Replace <operator-namespace> with your operator namespace. The default operator namespace is redhat-ods-operator.

      $ oc patch datasciencecluster default-dsc --type='merge' -p '{"spec":{"components":{"kueue":{"managementState":"Unmanaged"}}}}' -n <operator-namespace>
    • To specify custom queue names, run the following command. Replace <example-cluster-queue> and <example-local-queue> with your custom queue names, and replace <operator-namespace> with your operator namespace. The default operator namespace is redhat-ods-operator.

      $ oc patch datasciencecluster default-dsc --type='merge' -p '{"spec":{"components":{"kueue":{"managementState":"Unmanaged","defaultClusterQueueName":"<example-cluster-queue>","defaultLocalQueueName":"<example-local-queue>"}}}}' -n <operator-namespace>

Verification

  1. Verify that the Red Hat build of Kueue pods are running:

    $ oc get pods -n openshift-kueue-operator

    You should see output similar to the following example:

    kueue-controller-manager-d9fc745df-ph77w    1/1     Running
    openshift-kueue-operator-69cfbf45cf-lwtpm   1/1     Running
  2. Verify that the default ClusterQueue was created:

    $ oc get clusterqueues

Next steps

  • Configure quotas by creating and modifying ResourceFlavor, ClusterQueue, and LocalQueue objects. For details, see the Red Hat build of Kueue documentation.
  • Enable Kueue in the dashboard so that users can select Kueue-enabled options when creating workloads. When you enable Kueue, you also enable Kueue management for all new projects created from the dashboard. See Enabling Kueue in the dashboard.
  • Cluster administrators and OpenShift AI administrators can create hardware profiles so that users can submit workloads from the OpenShift AI dashboard. See Working with hardware profiles.
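As an illustration of how the three objects fit together, the following sketch defines a resource flavor, a cluster queue with quota against that flavor, and a local queue that points to the cluster queue. All names and quota values are placeholders; see the Red Hat build of Kueue documentation for the full API:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-cluster-queue
spec:
  namespaceSelector: {}            # admit workloads from any Kueue-managed namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-local-queue
  namespace: my-project            # the Kueue-managed project namespace
spec:
  clusterQueue: team-cluster-queue
```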

8.2.1. Enabling Kueue in the dashboard

Enable Kueue in the OpenShift AI dashboard so that users can select Kueue-enabled options when creating workloads.

When you enable Kueue in the dashboard, OpenShift AI automatically enables Kueue management for all new projects created from the dashboard. For these projects, OpenShift AI applies the kueue.openshift.io/managed=true label to the namespace and creates a LocalQueue object if one does not already exist. The LocalQueue object is created with the opendatahub.io/managed=false annotation, so it is not managed after creation. Cluster administrators can modify or delete it as needed. A validating webhook then enforces that any new or updated workload resource in a Kueue-enabled project includes the kueue.x-k8s.io/queue-name label.

Note

For existing projects, or for projects created by using the command-line interface, you must enable Kueue management manually by applying the kueue.openshift.io/managed=true label to the project namespace.

$ oc label namespace <project-namespace> kueue.openshift.io/managed=true --overwrite

Prerequisites

  • You have cluster administrator privileges for your OpenShift cluster.
  • You have activated the Kueue integration, as described in Configuring workload management with Kueue.
  • You have installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.

Procedure

  1. In a terminal window, log in to the OpenShift CLI as shown in the following example:

    $ oc login <openshift_cluster_url> -u <admin_username> -p <password>
  2. Update the odh-dashboard-config custom resource in the OpenShift AI applications namespace. Replace <applications-namespace> with your OpenShift AI applications namespace. The default is redhat-ods-applications.

    $ oc patch odhdashboardconfig odh-dashboard-config \
      -n <applications-namespace> \
      --type merge \
      -p '{"spec":{"dashboardConfig":{"disableHardwareProfiles":false,"disableKueue":false}}}'

Verification

  1. From the OpenShift AI dashboard, create a new project.
  2. Verify that the project namespace is labeled for Kueue management:

    $ oc get ns <project-namespace> -o jsonpath='{.metadata.labels.kueue\.openshift\.io/managed}{"\n"}'

    The output should be true.

  3. Confirm that a default LocalQueue exists for the project namespace:

    $ oc get localqueues -n <project-namespace>
  4. Create a test workload (for example, a Notebook) and verify that it includes the kueue.x-k8s.io/queue-name label.

Next step

  • Cluster administrators and OpenShift AI administrators can create hardware profiles so that users can submit workloads from the OpenShift AI dashboard. See Working with hardware profiles.

8.3. Troubleshooting common problems with Kueue

If your users are experiencing errors in Red Hat OpenShift AI relating to Kueue workloads, read this section to understand the possible causes and how to resolve them.

If the problem is not documented here or in the release notes, contact Red Hat Support.

Problem

After the user runs the cluster.apply() command, the following error is shown:

ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}

Diagnosis

The Kueue pod might not be running.

Resolution

  1. In the OpenShift console, select the user’s project from the Project list.
  2. Click Workloads → Pods.
  3. Verify that the Kueue pod is running. If necessary, restart the Kueue pod.
  4. Review the logs for the Kueue pod to verify that the webhook server is serving, as shown in the following example:

    {"level":"info","ts":"2024-06-24T14:36:24.255137871Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}

Problem

After the user runs the cluster.apply() command, the following error is shown:

Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.

Diagnosis

No default local queue is defined, and a local queue is not specified in the cluster configuration.

Resolution

  1. Check whether a local queue exists in the user’s project, as follows:

    1. In the OpenShift console, select the user’s project from the Project list.
    2. Click Home → Search, and from the Resources list, select LocalQueue.
    3. If no local queues are found, create a local queue.
    4. Provide the user with the details of the local queues in their project, and advise them to add a local queue to their cluster configuration.
  2. Define a default local queue.

    For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
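A default local queue is an ordinary LocalQueue object that carries the kueue.x-k8s.io/default-queue: "true" annotation referenced in the error message. A minimal sketch, with illustrative queue names:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
  namespace: my-project
  annotations:
    kueue.x-k8s.io/default-queue: "true"  # marks this queue as the namespace default
spec:
  clusterQueue: example-cluster-queue
```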

Problem

After the user runs the cluster.apply() command, the following error is shown:

local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.

Diagnosis

An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.

Resolution

  1. In the OpenShift console, select the user’s project from the Project list.

    1. Click Search, and from the Resources list, select LocalQueue.
    2. Resolve the problem in one of the following ways:

      • If no local queues are found, create a local queue.
      • If one or more local queues are found, provide the user with the details of the local queues in their project. Advise the user to ensure that they spelled the local queue name correctly in their cluster configuration, and that the namespace value in the cluster configuration matches their project name.
    3. Define a default local queue.

      For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.

Problem

Kueue waits for a period of time for all of the workload pods to become provisioned and running before marking the workload as ready. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled when the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.

Diagnosis

  1. In the OpenShift console, select the user’s project from the Project list.
  2. Click Workloads → Pods.
  3. Click the user’s pod name to open the pod details page.
  4. Click the Events tab, and review the pod events to check whether the image pull completed successfully.

Resolution

If the pod takes more than 5 minutes to pull the image, resolve the problem in one of the following ways:

  • Add an OnFailure restart policy for resources that are managed by Kueue.
  • Configure a custom timeout for the waitForPodsReady property in the Kueue custom resource (CR). The CR is installed in the openshift-kueue-operator namespace by the Red Hat build of Kueue Operator.

For more information about this configuration option, see Enabling waitForPodsReady in the Kueue documentation.
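In the upstream Kueue configuration, the relevant fragment looks like the following sketch. The 10-minute timeout is an illustrative value, and the exact location of this fragment within the Red Hat build of Kueue CR may differ; see the Red Hat build of Kueue documentation for the authoritative schema:

```yaml
waitForPodsReady:
  enable: true
  timeout: 10m          # extend beyond the 5-minute default for large images
  blockAdmission: true  # hold further admissions until the pods are ready
```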

8.4. Migrating to the Red Hat build of Kueue Operator

Starting with OpenShift AI 2.24, the embedded Kueue component for managing distributed workloads is deprecated. You must migrate to the Red Hat build of Kueue Operator. You cannot install both the embedded Kueue and the Red Hat build of Kueue Operator on the same cluster because this creates conflicting controllers that manage the same resources.

OpenShift AI does not automatically migrate existing workloads to Red Hat build of Kueue. Cluster administrators must manually migrate from the embedded Kueue to the Red Hat build of Kueue Operator to ensure workloads continue using queue management after upgrading.

Prerequisites

  • You have cluster administrator privileges for your OpenShift cluster.
  • You are using OpenShift 4.18 or later.
  • You have installed and configured the cert-manager Operator for Red Hat OpenShift for your cluster.
  • The embedded Kueue component is enabled (that is, the spec.components.kueue.managementState field in the DataScienceCluster object is set to Managed).

Procedure

  1. Optional: When you migrate from the embedded Kueue to Red Hat build of Kueue, the OpenShift AI Operator automatically moves your existing Kueue configuration from the kueue-manager-config ConfigMap to the Kueue custom resource (CR).

    To retain the kueue-manager-config ConfigMap, run the following command. Replace <applications-namespace> with your OpenShift AI applications namespace. The default is redhat-ods-applications.

    $ oc annotate configmap kueue-manager-config -n <applications-namespace> opendatahub.io/managed=false
  2. Log in to the OpenShift web console as a cluster administrator.
  3. Optional (recommended): To avoid potential configuration conflicts, uninstall the embedded Kueue component before installing Red Hat build of Kueue.

    1. In the web console, click Operators → Installed Operators and then click the Red Hat OpenShift AI Operator.
    2. Click the Data Science Cluster tab.
    3. Click the default-dsc object.
    4. Click the YAML tab.
    5. Set spec.components.kueue.managementState to Removed as shown:

      spec:
        components:
          kueue:
            managementState: Removed
    6. Click Save.
    7. Wait for the OpenShift AI Operator to reconcile, and then verify that the embedded Kueue was removed:

      • On the Details tab of the default-dsc object, check that the KueueReady condition has a Status of False and a Reason of Removed.
      • Go to Workloads → Deployments, select the project where OpenShift AI is installed (for example, redhat-ods-applications), and confirm that Kueue-related deployments (for example, kueue-controller-manager) are no longer present.
  4. Install the Red Hat build of Kueue Operator on your OpenShift cluster:

    1. Follow the steps to install the Red Hat build of Kueue Operator, as described in the Red Hat build of Kueue documentation.
    2. Go to Operators → Installed Operators and confirm that the Red Hat build of Kueue Operator is listed with Status as Succeeded.
  5. Activate the Red Hat build of Kueue Operator in OpenShift AI:

    1. In the web console, click Operators → Installed Operators and then click the Red Hat OpenShift AI Operator.
    2. Click the Data Science Cluster tab.
    3. Click the default-dsc object.
    4. Click the YAML tab.
    5. Set spec.components.kueue.managementState to Unmanaged. You can either use the predefined names (default) for the default cluster queue and default local queue, or specify custom names, as shown in the following examples.

      • To use the predefined queue names, apply the following configuration:

        spec:
          components:
            kueue:
              managementState: Unmanaged
      • To specify custom queue names, apply the following configuration, replacing <example-cluster-queue> and <example-local-queue> with your custom values:

        spec:
          components:
            kueue:
              managementState: Unmanaged
              defaultClusterQueueName: <example-cluster-queue>
              defaultLocalQueueName: <example-local-queue>
    6. Click Save.
  6. Enable Kueue management for existing projects by applying the kueue.openshift.io/managed=true label to each project namespace:

    $ oc label namespace <project-namespace> kueue.openshift.io/managed=true --overwrite

    Replace <project-namespace> with the name of your project.

    Note

    Kueue validation and queue enforcement apply only to workloads in namespaces with the kueue.openshift.io/managed=true label.

Verification

  • Verify that the embedded Kueue is removed.
  • Verify that the DataScienceCluster resource shows a healthy Unmanaged status for Kueue.
  • Verify that existing workloads in the queue continue to be processed by the new operator-managed Kueue controllers. Submit a new test workload to confirm functionality.

Next steps

  • Configure quotas by creating and modifying ResourceFlavor, ClusterQueue, and LocalQueue objects. For details, see the Red Hat build of Kueue documentation.
  • Enable Kueue in the dashboard so that users can select Kueue-enabled options when creating workloads. When you enable Kueue, you also enable Kueue management for all new projects created from the dashboard. See Enabling Kueue in the dashboard.
  • Cluster administrators and OpenShift AI administrators can create hardware profiles so that users can submit workloads from the OpenShift AI dashboard. See Working with hardware profiles.