Chapter 8. Managing distributed workloads

8.1. Overview of Kueue resources

Cluster administrators can configure Kueue objects (such as resource flavors, cluster queues, and local queues) to manage distributed workload resources across multiple nodes in an OpenShift cluster.

Note

In OpenShift AI 2.16, Red Hat does not support shared cohorts.

8.1.1. Resource flavor

The Kueue ResourceFlavor object describes the resource variations that are available in a cluster.

Resources in a cluster can be homogeneous or heterogeneous:

Homogeneous resources are identical across the cluster: same node type, CPUs, memory, accelerators, and so on.
Heterogeneous resources have variations across the cluster.

If a cluster has homogeneous resources, or if it is not necessary to manage separate quotas for different flavors of a resource, a cluster administrator can create an empty ResourceFlavor object named default-flavor, without any labels or taints, as follows:

Empty Kueue resource flavor for homogeneous resources

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor

If a cluster has heterogeneous resources, cluster administrators can define a different resource flavor for each variation in the resources available. Example variations include different CPUs, different memory, or different accelerators. If a cluster has multiple types of accelerator, cluster administrators can set up a resource flavor for each accelerator type. Cluster administrators can then associate the resource flavors with cluster nodes by using labels, taints, and tolerations, as shown in the following example.

Example Kueue resource flavor for heterogeneous resources

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "spot"
spec:
  nodeLabels:
    instance-type: spot
  nodeTaints:
  - effect: NoSchedule
    key: spot
    value: "true"
  tolerations:
  - key: "spot-taint"
    operator: "Exists"
    effect: "NoSchedule"

Make sure that each resource flavor has the correct label selectors and taint tolerations so that workloads run on the expected nodes.
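
The way a toleration is matched against a taint can be illustrated with a short Python sketch. It is a simplified model of the Kubernetes matching rule (key, operator, value, and effect must agree), not Kueue or scheduler code, and the function name tolerates is hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified model of how a Kubernetes toleration matches a taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    # An empty key is only meaningful with Exists, where it matches every key.
    if key == "":
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    # Operator Equal: the values must also be identical.
    return toleration.get("value", "") == taint.get("value", "")


# Taint taken from the nodeTaints section of the "spot" flavor above.
spot_taint = {"key": "spot", "value": "true", "effect": "NoSchedule"}

matching = {"key": "spot", "operator": "Equal", "value": "true", "effect": "NoSchedule"}
other_key = {"key": "spot-taint", "operator": "Exists", "effect": "NoSchedule"}

print(tolerates(matching, spot_taint))   # True
print(tolerates(other_key, spot_taint))  # False
```

In this model, a toleration for the spot-taint key does not tolerate a taint whose key is spot, so the key names in tolerations must agree exactly with the taints that are applied to the nodes.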

See the example configurations provided in Example Kueue resource configurations.

For more information about configuring resource flavors, see Resource Flavor in the Kueue documentation.

8.1.2. Cluster queue

The Kueue ClusterQueue object manages a pool of cluster resources such as pods, CPUs, memory, and accelerators. A cluster can have multiple cluster queues, and each cluster queue can reference multiple resource flavors.

Cluster administrators can configure a cluster queue to define the resource flavors that the queue manages, and assign a quota for each resource in each resource flavor.

The following example configures a cluster queue to assign a quota of 9 CPUs, 36 GiB memory, 5 pods, and 5 NVIDIA GPUs.

Example cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "pods", "nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
      - name: "pods"
        nominalQuota: 5
      - name: "nvidia.com/gpu"
        nominalQuota: '5'

A cluster administrator should notify the consumers of a cluster queue about the quota limits for that cluster queue. The cluster queue starts a distributed workload only if the total required resources are within these quota limits. If the sum of the requests for a resource in a distributed workload is greater than the specified quota for that resource in the cluster queue, the cluster queue does not admit the distributed workload.
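
The admission rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Kueue's implementation: it ignores resources that other admitted workloads already consume, it expresses memory as a plain number of GiB, and the function name and the sample workload are hypothetical.

```python
def exceeded_resources(pod_requests: list, quota: dict) -> dict:
    """Return {resource: (total_requested, quota)} for each resource over quota."""
    totals = {}
    for request in pod_requests:
        for name, amount in request.items():
            totals[name] = totals.get(name, 0) + amount
    # A resource with no quota entry is treated as having a quota of 0.
    return {
        name: (total, quota.get(name, 0))
        for name, total in totals.items()
        if total > quota.get(name, 0)
    }


# Quota values from the example cluster queue above (memory in GiB).
QUOTA = {"cpu": 9, "memory": 36, "pods": 5, "nvidia.com/gpu": 5}

# Hypothetical distributed workload: one head pod and two GPU worker pods.
workload = [
    {"cpu": 2, "memory": 8, "pods": 1},
    {"cpu": 4, "memory": 16, "pods": 1, "nvidia.com/gpu": 1},
    {"cpu": 4, "memory": 16, "pods": 1, "nvidia.com/gpu": 1},
]

print(exceeded_resources(workload, QUOTA))  # {'cpu': (10, 9), 'memory': (40, 36)}
```

An empty result means that the summed requests fit within the quota. In the sample, the workload would not be admitted because its CPU and memory totals exceed the quota, even though its pod and GPU totals do not.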

See the example configurations provided in Example Kueue resource configurations.

For more information about configuring cluster queues, see Cluster Queue in the Kueue documentation.

8.1.3. Local queue

The Kueue LocalQueue object groups closely related distributed workloads in a project. Cluster administrators can configure local queues to specify the project name and the associated cluster queue. Each local queue then grants access to the resources that its specified cluster queue manages. A cluster administrator can optionally define one local queue in a project as the default local queue for that project.

When configuring a distributed workload, the user specifies the local queue name. If a cluster administrator configured a default local queue, the user can omit the local queue specification from the distributed workload code.

Kueue allocates the resources for a distributed workload from the cluster queue that is associated with the local queue, if the total requested resources are within the quota limits specified in that cluster queue.

The following example configures a local queue called team-a-queue for the team-a project, and specifies cluster-queue as the associated cluster queue.

Example local queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: team-a
  name: team-a-queue
  annotations:
    kueue.x-k8s.io/default-queue: "true"
spec:
  clusterQueue: cluster-queue

In this example, the kueue.x-k8s.io/default-queue: "true" annotation defines this local queue as the default local queue for the team-a project. If a user submits a distributed workload in the team-a project and that distributed workload does not specify a local queue in the cluster configuration, Kueue automatically routes the distributed workload to the team-a-queue local queue. The distributed workload can then access the resources that the cluster-queue cluster queue manages.
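
The routing behavior described above can be sketched in Python. This is only an illustration of the selection logic, not the code that Kueue or the CodeFlare SDK runs. The function name resolve_local_queue and the error wording are hypothetical.

```python
DEFAULT_QUEUE_ANNOTATION = "kueue.x-k8s.io/default-queue"


def resolve_local_queue(local_queues, requested=None):
    """Pick the local queue for a workload from the queues in its project."""
    names = [queue["metadata"]["name"] for queue in local_queues]
    if requested is not None:
        # An explicitly requested queue must exist in the same project.
        if requested not in names:
            raise LookupError(f"local queue {requested!r} not found in this project")
        return requested
    # Otherwise fall back to the queue annotated as the default, if any.
    for queue in local_queues:
        annotations = queue["metadata"].get("annotations", {})
        if annotations.get(DEFAULT_QUEUE_ANNOTATION) == "true":
            return queue["metadata"]["name"]
    raise LookupError("no default local queue is defined in this project")


# The local queue from the example above, expressed as a Python dictionary.
team_a_queues = [
    {
        "metadata": {
            "namespace": "team-a",
            "name": "team-a-queue",
            "annotations": {DEFAULT_QUEUE_ANNOTATION: "true"},
        },
        "spec": {"clusterQueue": "cluster-queue"},
    }
]

print(resolve_local_queue(team_a_queues))  # team-a-queue
```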

For more information about configuring local queues, see Local Queue in the Kueue documentation.

8.2. Example Kueue resource configurations

These examples show how to configure Kueue resource flavors and cluster queues.

Note

In OpenShift AI 2.16, Red Hat does not support shared cohorts.

8.2.1. NVIDIA GPUs without shared cohort

8.2.1.1. NVIDIA RTX A400 GPU resource flavor

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "A400-node"
spec:
  nodeLabels:
    instance-type: nvidia-a400-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"

8.2.1.2. NVIDIA RTX A1000 GPU resource flavor

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "A1000-node"
spec:
  nodeLabels:
    instance-type: nvidia-a1000-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"

8.2.1.3. NVIDIA RTX A400 GPU cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "A400-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "A400-node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2

8.2.1.4. NVIDIA RTX A1000 GPU cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "A1000-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "A1000-node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2

8.2.2. NVIDIA GPUs and AMD GPUs without shared cohort

8.2.2.1. AMD GPU resource flavor

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "amd-node"
spec:
  nodeLabels:
    instance-type: amd-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"

8.2.2.2. NVIDIA GPU resource flavor

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "nvidia-node"
spec:
  nodeLabels:
    instance-type: nvidia-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"

8.2.2.3. AMD GPU cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "team-a-amd-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "amd.com/gpu"]
    flavors:
    - name: "amd-node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "amd.com/gpu"
        nominalQuota: 2

8.2.2.4. NVIDIA GPU cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "team-a-nvidia-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "nvidia-node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2

8.3. Configuring quota management for distributed workloads

Configure quotas for distributed workloads on a cluster, so that you can share resources between several data science projects.

Prerequisites

You have logged in to OpenShift with the cluster-admin role.
You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
You have installed the required distributed workloads components as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
You have created a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about how to create a project, see Creating a data science project.
You have sufficient resources. In addition to the base OpenShift AI resources, you need 1.6 vCPU and 2 GiB memory to deploy the distributed workloads infrastructure.
The resources are physically available in the cluster.
Note
In OpenShift AI 2.16, Red Hat supports only a single cluster queue per cluster (that is, homogeneous clusters), and only empty resource flavors. For more information about Kueue resources, see Overview of Kueue resources.
If you want to use graphics processing units (GPUs), you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.
Note
In OpenShift AI 2.16, Red Hat supports only NVIDIA and AMD GPU accelerators for distributed workloads.

Procedure

In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
```
oc login <openshift_cluster_url> -u <admin_username> -p <password>
```
Create an empty Kueue resource flavor, as follows:
1. Create a file called default_flavor.yaml and populate it with the following content:
  Empty Kueue resource flavor
  apiVersion: kueue.x-k8s.io/v1beta1
  kind: ResourceFlavor
  metadata:
    name: default-flavor
2. Apply the configuration to create the default-flavor object:
  $ oc apply -f default_flavor.yaml

Create a cluster queue to manage the empty Kueue resource flavor, as follows:

Create a file called cluster_queue.yaml and populate it with the following content:

Example cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {}  # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"] # If you use AMD GPUs, substitute "nvidia.com/gpu" with "amd.com/gpu"
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
      - name: "nvidia.com/gpu" # If you use AMD GPUs, substitute "nvidia.com/gpu" with "amd.com/gpu"
        nominalQuota: 5

Replace the example quota values (9 CPUs, 36 GiB memory, and 5 NVIDIA GPUs) with the appropriate values for your cluster queue. The cluster queue will start a distributed workload only if the total required resources are within these quota limits.
You must specify a quota for each resource that the user can request, even if the requested value is 0, by updating the spec.resourceGroups section as follows:
- Include the resource name in the coveredResources list.
- Specify the resource name and nominalQuota in the flavors.resources section, even if the nominalQuota value is 0.
Apply the configuration to create the cluster-queue object:
```
oc apply -f cluster_queue.yaml
```

Create a local queue that points to your cluster queue, as follows:
1. Create a file called local_queue.yaml and populate it with the following content:
  Example local queue
  apiVersion: kueue.x-k8s.io/v1beta1
  kind: LocalQueue
  metadata:
    namespace: test
    name: local-queue-test
    annotations:
      kueue.x-k8s.io/default-queue: 'true'
  spec:
    clusterQueue: cluster-queue
  The kueue.x-k8s.io/default-queue: 'true' annotation defines this queue as the default queue. Distributed workloads are submitted to this queue if no local_queue value is specified in the ClusterConfiguration section of the data science pipeline, Jupyter notebook, or Microsoft Visual Studio Code file.
2. Update the namespace value to specify the same namespace as in the ClusterConfiguration section that creates the Ray cluster.
3. Optional: Update the name value accordingly.
4. Apply the configuration to create the local-queue object:
  $ oc apply -f local_queue.yaml
  The cluster queue allocates the resources to run distributed workloads in the local queue.
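
The rule in this procedure that every requestable resource needs both a coveredResources entry and a nominalQuota value can be checked before the cluster queue file is applied. The following Python sketch assumes that the YAML content has already been loaded into a dictionary. The function name quota_problems is hypothetical, and the sample queue deliberately omits the GPU quota to show the kind of problem that the check reports.

```python
def quota_problems(cluster_queue: dict) -> list:
    """List mismatches between coveredResources and per-flavor quotas."""
    problems = []
    for group in cluster_queue["spec"]["resourceGroups"]:
        covered = set(group.get("coveredResources", []))
        for flavor in group.get("flavors", []):
            with_quota = {
                resource["name"]
                for resource in flavor.get("resources", [])
                if "nominalQuota" in resource
            }
            for name in sorted(covered - with_quota):
                problems.append(f"flavor {flavor['name']}: no nominalQuota for {name}")
            for name in sorted(with_quota - covered):
                problems.append(f"flavor {flavor['name']}: {name} is not in coveredResources")
    return problems


# Sample cluster queue, based on cluster_queue.yaml in this procedure.
queue = {
    "spec": {
        "resourceGroups": [{
            "coveredResources": ["cpu", "memory", "nvidia.com/gpu"],
            "flavors": [{
                "name": "default-flavor",
                "resources": [
                    {"name": "cpu", "nominalQuota": 9},
                    {"name": "memory", "nominalQuota": "36Gi"},
                    # The GPU entry is deliberately left without a quota.
                    {"name": "nvidia.com/gpu"},
                ],
            }],
        }],
    }
}

print(quota_problems(queue))  # ['flavor default-flavor: no nominalQuota for nvidia.com/gpu']
```

A quota of 0 still counts as specified in this sketch, which matches the requirement to list a resource even when its quota is 0.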

Verification

Check the status of the local queue in a project, as follows:

oc get -n <project-name> localqueues

8.4. Configuring the CodeFlare Operator

If you want to change the default configuration of the CodeFlare Operator for distributed workloads in OpenShift AI, you can edit the associated config map.

Prerequisites

You have logged in to OpenShift with the cluster-admin role.
You have installed the required distributed workloads components as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).

Procedure

In the OpenShift console, click Workloads → ConfigMaps.
From the Project list, select redhat-ods-applications.
Search for the codeflare-operator-config config map, and click the config map name to open the ConfigMap details page.
Click the YAML tab to show the config map specifications.
In the data:config.yaml:kuberay section, you can edit the following entries:
ingressDomain
This configuration option is null (ingressDomain: "") by default. Do not change this option unless the Ingress Controller is not running on OpenShift. OpenShift AI uses this value to generate the dashboard and client routes for every Ray Cluster, as shown in the following examples:
Example dashboard and client routes

ray-dashboard-<clustername>-<namespace>.<your.ingress.domain>
ray-client-<clustername>-<namespace>.<your.ingress.domain>
mTLSEnabled
This configuration option is enabled (mTLSEnabled: true) by default. When this option is enabled, the Ray Cluster pods create certificates that are used for mutual Transport Layer Security (mTLS), a form of mutual authentication, between Ray Cluster nodes. Ray clients then cannot connect to the Ray head node unless they download the generated certificates from the ca-secret-<cluster_name> secret, generate the necessary certificates for mTLS communication, and set the required Ray environment variables. Users must then re-initialize the Ray clients to apply the changes. The CodeFlare SDK provides the following functions to simplify the authentication process for Ray clients:
Example Ray client authentication code

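import ray
from codeflare_sdk import generate_cert

# Assumes that `cluster` is an existing CodeFlare SDK Cluster object.
# Generate the client certificates and export the Ray TLS environment variables.
generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)
generate_cert.export_env(cluster.config.name, cluster.config.namespace)

# Re-initialize the Ray client so that it uses the generated certificates.
ray.init(cluster.cluster_uri())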
rayDashboardOAuthEnabled
This configuration option is enabled (rayDashboardOAuthEnabled: true) by default. When this option is enabled, OpenShift AI places an OpenShift OAuth proxy in front of the Ray Cluster head node. Users must then authenticate by using their OpenShift cluster login credentials when accessing the Ray Dashboard through the browser. If users want to access the Ray Dashboard in another way (for example, by using the Ray JobSubmissionClient class), they must set an authorization header as part of their request, as shown in the following example:
Example authorization header

{Authorization: "Bearer <your-openshift-token>"}

To save your changes, click Save.
To apply your changes, delete the pod:
1. Click Workloads → Pods.
2. Find the codeflare-operator-manager-<pod-id> pod.
3. Click the options menu (⋮) for that pod, and then click Delete Pod. The pod restarts with your changes applied.
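
For the authorization header described under rayDashboardOAuthEnabled in this procedure, the following Python sketch shows one way to pass the header to the Ray JobSubmissionClient class through its headers parameter. The helper names and the dashboard URL are hypothetical, and the token placeholder must be replaced with a real OpenShift token. Depending on the cluster certificates, additional TLS settings might be needed.

```python
def bearer_headers(token: str) -> dict:
    """Build the authorization header expected by the OAuth proxy."""
    return {"Authorization": f"Bearer {token}"}


def make_job_client(dashboard_url: str, token: str):
    """Create a Ray job submission client that sends the bearer token."""
    # Imported here so that bearer_headers() can be used without Ray installed.
    from ray.job_submission import JobSubmissionClient

    return JobSubmissionClient(dashboard_url, headers=bearer_headers(token))


print(bearer_headers("<your-openshift-token>"))

# Example call (requires Ray and a reachable dashboard route, so it is commented out):
# client = make_job_client("https://ray-dashboard-<clustername>-<namespace>.<your.ingress.domain>", token)
```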

Verification

Check the status of the codeflare-operator-manager pod, as follows:

In the OpenShift console, click Workloads → Deployments.
Search for the codeflare-operator-manager deployment, and then click the deployment name to open the deployment details page.
Click the Pods tab. When the status of the codeflare-operator-manager-<pod-id> pod is Running, the pod is ready to use. To see more information about the pod, click the pod name to open the pod details page, and then click the Logs tab.

8.5. Troubleshooting common problems with distributed workloads for administrators

If your users are experiencing errors in Red Hat OpenShift AI relating to distributed workloads, read this section to understand what could be causing the problem, and how to resolve the problem.

If the problem is not documented here or in the release notes, contact Red Hat Support.

8.5.1. A user’s Ray cluster is in a suspended state

Problem

The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.

Diagnosis

The user’s Ray cluster head pod or worker pods remain in a suspended state. Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the suspended state, as shown in the following example:

status:
 conditions:
   - lastTransitionTime: '2024-05-29T13:05:09Z'
     message: 'couldn''t assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue'
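
When the status is available as structured data, the condition messages can be collected with a few lines of Python. This is a minimal sketch that assumes the resource has already been loaded into a dictionary (for example, from YAML or JSON output). The function name condition_messages is hypothetical.

```python
def condition_messages(resource: dict) -> list:
    """Collect the non-empty status.conditions messages of a resource."""
    conditions = resource.get("status", {}).get("conditions", [])
    return [condition["message"] for condition in conditions if condition.get("message")]


# The status from the example above, expressed as a Python dictionary.
workload = {
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2024-05-29T13:05:09Z",
                "message": "couldn't assign flavors to pod set small-group-jobtest12: "
                           "insufficient quota for nvidia.com/gpu in flavor "
                           "default-flavor in ClusterQueue",
            }
        ]
    }
}

for message in condition_messages(workload):
    print(message)
```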

Resolution

Check whether the resource flavor is created, as follows:
1. In the OpenShift console, select the user’s project from the Project list.
2. Click Home → Search, and from the Resources list, select ResourceFlavor.
3. If necessary, create the resource flavor.
Check the cluster queue configuration in the user’s code, to ensure that the resources that they requested are within the limits defined for the project.
If necessary, increase the resource quota.

For information about configuring resource flavors and quotas, see Configuring quota management for distributed workloads.

8.5.2. A user’s Ray cluster is in a failed state

Problem

The user might have insufficient resources.

Diagnosis

The user’s Ray cluster head pod or worker pods are not running. When a Ray cluster is created, it initially enters a failed state. This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.

Resolution

If the failed state persists, complete the following steps:

In the OpenShift console, select the user’s project from the Project list.
Click Workloads → Pods.
Click the user’s pod name to open the pod details page.
Click the Events tab, and review the pod events to identify the cause of the problem.
Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the failed state.

8.5.3. A user receives a failed to call webhook error message for the CodeFlare Operator

Problem

After the user runs the cluster.up() command, the following error is shown:

ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\""}]},"code":500}

Diagnosis

The CodeFlare Operator pod might not be running.

Resolution

In the OpenShift console, select the user’s project from the Project list.
Click Workloads → Pods.
Verify that the CodeFlare Operator pod is running. If necessary, restart the CodeFlare Operator pod.
Review the logs for the CodeFlare Operator pod to verify that the webhook server is serving, as shown in the following example:
```
INFO	controller-runtime.webhook	  Serving webhook server	{"host": "", "port": 9443}
```

8.5.4. A user receives a failed to call webhook error message for Kueue

Problem

After the user runs the cluster.up() command, the following error is shown:

ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}

Diagnosis

The Kueue pod might not be running.

Resolution

In the OpenShift console, select the user’s project from the Project list.
Click Workloads → Pods.
Verify that the Kueue pod is running. If necessary, restart the Kueue pod.

Review the logs for the Kueue pod to verify that the webhook server is serving, as shown in the following example:

{"level":"info","ts":"2024-06-24T14:36:24.255137871Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}
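
The webhook errors for the CodeFlare Operator and for Kueue differ only in the webhook and service names, so it can help to extract the service name from the error body before deciding which pod to inspect. The following Python sketch parses the JSON response body and returns the service that has no endpoints. The function name unavailable_service is hypothetical, and the sample body is a shortened stand-in for the full response shown in the Problem description.

```python
import json
import re
from typing import Optional


def unavailable_service(response_body: str) -> Optional[str]:
    """Return the service named in a 'no endpoints available' webhook error."""
    message = json.loads(response_body).get("message", "")
    match = re.search(r'no endpoints available for service "([^"]+)"', message)
    return match.group(1) if match else None


# Shortened, hypothetical stand-in for the HTTP response body of the error.
body = json.dumps({
    "kind": "Status",
    "status": "Failure",
    "message": 'failed calling webhook "mraycluster.kb.io": '
               'no endpoints available for service "kueue-webhook-service"',
    "code": 500,
})

print(unavailable_service(body))  # kueue-webhook-service
```

A result of kueue-webhook-service points to the Kueue pod, and a result of codeflare-operator-webhook-service points to the CodeFlare Operator pod.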

8.5.5. A user’s Ray cluster does not start

Problem

After the user runs the cluster.up() command, when they run either the cluster.details() command or the cluster.status() command, the Ray cluster status remains as Starting instead of changing to Ready. No pods are created.

Diagnosis

Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for remaining in the Starting state. Similarly, check the status.conditions.message field for the RayCluster resource.

Resolution

In the OpenShift console, select the user’s project from the Project list.
Click Workloads → Pods.
Verify that the KubeRay pod is running. If necessary, restart the KubeRay pod.
Review the logs for the KubeRay pod to identify errors.

8.5.6. A user receives a Default Local Queue … not found error message

Problem

After the user runs the cluster.up() command, the following error is shown:

Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.

Diagnosis

No default local queue is defined, and a local queue is not specified in the cluster configuration.

Resolution

1. Check whether a local queue exists in the user’s project, as follows:
  a. In the OpenShift console, select the user’s project from the Project list.
  b. Click Home → Search, and from the Resources list, select LocalQueue.
  c. If no local queues are found, create a local queue.
  d. Provide the user with the details of the local queues in their project, and advise them to add a local queue to their cluster configuration.
2. Define a default local queue.
  For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
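
The error message suggests that a default local queue is identified by the kueue.x-k8s.io/default-queue: true annotation. The following Python sketch shows one way to check a list of LocalQueue objects (already retrieved as JSON) for that annotation. It is a minimal illustration, not the SDK's actual lookup; the find_default_local_queue helper and the queue names are hypothetical.

```python
from typing import Any, Optional

# Annotation key taken from the error message above.
DEFAULT_QUEUE_ANNOTATION = "kueue.x-k8s.io/default-queue"


def find_default_local_queue(local_queues: list[dict[str, Any]]) -> Optional[str]:
    """Return the name of the first LocalQueue annotated as default, else None."""
    for lq in local_queues:
        annotations = lq.get("metadata", {}).get("annotations") or {}
        # Kubernetes annotation values are strings, so compare with "true".
        if annotations.get(DEFAULT_QUEUE_ANNOTATION) == "true":
            return lq["metadata"]["name"]
    return None


# Hypothetical LocalQueue objects in the user's project.
sample_queues = [
    {"metadata": {"name": "team-queue"}},
    {
        "metadata": {
            "name": "default-team-queue",
            "annotations": {DEFAULT_QUEUE_ANNOTATION: "true"},
        }
    },
]

print(find_default_local_queue(sample_queues))
```

If the helper returns None for every queue in the project, the user must either name a local queue in the cluster configuration or an administrator must define a default one.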

8.5.7. A user receives a local_queue provided does not exist error message

Problem

After the user runs the cluster.up() command, the following error is shown:

local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.


Diagnosis

An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.

Resolution

1. In the OpenShift console, select the user’s project from the Project list.
2. Click Search, and from the Resources list, select LocalQueue.
3. Resolve the problem in one of the following ways:
  - If no local queues are found, create a local queue.
  - If one or more local queues are found, provide the user with the details of the local queues in their project. Advise the user to ensure that they spelled the local queue name correctly in their cluster configuration, and that the namespace value in the cluster configuration matches their project name. If the user does not specify a namespace value in the cluster configuration, the Ray cluster is created in the current project.
4. Define a default local queue.
  For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
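
The distinction between a misspelled queue name and a queue that exists in a different namespace can be sketched in Python as follows. This is an illustrative check over LocalQueue objects that have already been retrieved as JSON, under the assumption that each object carries metadata.name and metadata.namespace; the diagnose_local_queue helper and all names in the sample data are hypothetical.

```python
from typing import Any


def diagnose_local_queue(
    name: str, namespace: str, local_queues: list[dict[str, Any]]
) -> str:
    """Explain why `name` can or cannot be used as a local queue in `namespace`."""
    in_project = [
        lq["metadata"]["name"]
        for lq in local_queues
        if lq["metadata"].get("namespace", "") == namespace
    ]
    if name in in_project:
        return "ok"
    # Same queue name, but defined in another namespace.
    elsewhere = sorted(
        lq["metadata"].get("namespace", "")
        for lq in local_queues
        if lq["metadata"]["name"] == name
        and lq["metadata"].get("namespace", "") != namespace
    )
    if elsewhere:
        return f"'{name}' exists only in: {', '.join(elsewhere)}"
    if not in_project:
        return f"no local queues in '{namespace}'"
    return f"'{name}' not found; available: {', '.join(sorted(in_project))}"


# Hypothetical LocalQueue objects from two projects.
sample_queues = [
    {"metadata": {"name": "team-queue", "namespace": "project-a"}},
    {"metadata": {"name": "gpu-queue", "namespace": "project-b"}},
]

print(diagnose_local_queue("gpu-queue", "project-a", sample_queues))
```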

8.5.8. A user cannot create a Ray cluster or submit jobs

Problem

After the user runs the cluster.up() command, an error similar to the following text is shown:

RuntimeError: Failed to get RayCluster CustomResourceDefinition: (403)
Reason: Forbidden
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io is forbidden: User \"system:serviceaccount:regularuser-project:regularuser-workbench\" cannot list resource \"rayclusters\" in API group \"ray.io\" in the namespace \"regularuser-project\"","reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},"code":403}


Diagnosis

The correct OpenShift login credentials are not specified in the TokenAuthentication section of the user’s notebook code.

Resolution

1. Advise the user to identify and specify the correct OpenShift login credentials as follows:
  a. In the OpenShift console header, click your username and click Copy login command.
  b. In the new tab that opens, log in as the user whose credentials you want to use.
  c. Click Display Token.
  d. From the Log in with this token section, copy the token and server values.
  e. Specify the copied token and server values in your notebook code as follows:
    # Skip this import if the notebook already imports TokenAuthentication
    from codeflare_sdk import TokenAuthentication
    auth = TokenAuthentication(
        token="<token>",
        server="<server>",
        skip_tls=False
    )
    auth.login()
2. Verify that the user has the correct permissions and is part of the rhoai-users group.
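
When reading an error like the one above, it can help to pull out which identity the API server rejected. The following Python sketch parses the HTTP response body from the example error and reports the status code, the resource, and the identity named in the message. An identity that starts with system:serviceaccount: may indicate that the request was made without the user's own token, which would be consistent with the diagnosis above. The summarize_forbidden helper is hypothetical, and the way it extracts the identity is an assumption based on the wording of this one message.

```python
import json

# Response body copied from the example error above.
BODY = (
    r'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",'
    r'"message":"rayclusters.ray.io is forbidden: User '
    r'\"system:serviceaccount:regularuser-project:regularuser-workbench\" '
    r'cannot list resource \"rayclusters\" in API group \"ray.io\" '
    r'in the namespace \"regularuser-project\"",'
    r'"reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},'
    r'"code":403}'
)


def summarize_forbidden(body: str) -> dict:
    """Extract the code, resource, and rejected identity from a Status body."""
    status = json.loads(body)
    details = status.get("details", {})
    message = status.get("message", "")
    # Assumption: the identity is the quoted text that follows 'User'.
    identity = None
    if 'User "' in message:
        identity = message.split('User "', 1)[1].split('"', 1)[0]
    return {
        "code": status.get("code"),
        "resource": f"{details.get('kind')}.{details.get('group')}",
        "identity": identity,
        "is_service_account": bool(identity)
        and identity.startswith("system:serviceaccount:"),
    }


summary = summarize_forbidden(BODY)
print(summary)
```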

8.5.9. The user’s pod provisioned by Kueue is terminated before the user’s image is pulled

Problem

Kueue waits for a period of time before marking a workload as ready, to enable all of the workload pods to become provisioned and running. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.

Diagnosis

1. In the OpenShift console, select the user’s project from the Project list.
2. Click Workloads → Pods.
3. Click the user’s pod name to open the pod details page.
4. Click the Events tab, and review the pod events to check whether the image pull completed successfully.
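
The comparison behind this diagnosis can be sketched in Python: measure the time between the pod's Pulling and Pulled events and compare it with the 5-minute default wait. This is a minimal sketch, assuming the events have been exported as JSON objects with reason and lastTimestamp fields; the image_pull_duration helper and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Any, Optional

# Default Kueue wait described in the Problem statement above.
DEFAULT_WAIT = timedelta(minutes=5)


def _parse(ts: str) -> datetime:
    """Parse a timestamp such as 2024-06-24T14:30:05Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def image_pull_duration(events: list[dict[str, Any]]) -> Optional[timedelta]:
    """Return time from first 'Pulling' to last 'Pulled', or None if incomplete."""
    started = [_parse(e["lastTimestamp"]) for e in events if e.get("reason") == "Pulling"]
    finished = [_parse(e["lastTimestamp"]) for e in events if e.get("reason") == "Pulled"]
    if not started or not finished:
        return None
    return max(finished) - min(started)


# Hypothetical pod events.
sample_events = [
    {"reason": "Scheduled", "lastTimestamp": "2024-06-24T14:30:00Z"},
    {"reason": "Pulling", "lastTimestamp": "2024-06-24T14:30:05Z"},
    {"reason": "Pulled", "lastTimestamp": "2024-06-24T14:37:35Z"},
]

duration = image_pull_duration(sample_events)
if duration is None:
    print("image pull did not complete")
elif duration > DEFAULT_WAIT:
    print(f"image pull took {duration}, longer than the {DEFAULT_WAIT} wait")
else:
    print(f"image pull took {duration}")
```

A result of None means that no Pulled event was recorded, which is what you would expect if the pod was terminated while the image was still being pulled.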

Resolution

If the pod takes more than 5 minutes to pull the image, resolve the problem in one of the following ways:

- Add an OnFailure restart policy for resources that are managed by Kueue.
- In the redhat-ods-applications namespace, edit the kueue-manager-config ConfigMap to set a custom timeout for the waitForPodsReady property. For more information about this configuration option, see Enabling waitForPodsReady in the Kueue documentation.
