Chapter 6. Troubleshooting common problems with distributed workloads for users
If you are experiencing errors in Red Hat OpenShift AI relating to distributed workloads, read this section to understand what might be causing the problem and how to resolve it.
If the problem is not documented here or in the release notes, contact Red Hat Support.
6.1. My Ray cluster is in a suspended state
Problem
The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.
Diagnosis
The Ray cluster head pod or worker pods remain in a suspended state.
Resolution
- In the OpenShift console, select your project from the Project list.
- Check the workload resource:
  - Click Search, and from the Resources list, select Workload.
  - Select the workload resource that is created with the Ray cluster resource, and click the YAML tab.
  - Check the text in the status.conditions.message field, which provides the reason for the suspended state, as shown in the following example:
      status:
        conditions:
          - lastTransitionTime: '2024-05-29T13:05:09Z'
            message: 'couldn''t assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue'
- Check the Ray cluster resource:
  - Click Search, and from the Resources list, select RayCluster.
  - Select the Ray cluster resource, and click the YAML tab.
  - Check the text in the status.conditions.message field.
- Check the cluster queue resource:
  - Click Search, and from the Resources list, select ClusterQueue.
  - Check your cluster queue configuration to ensure that the resources that you requested are within the limits defined for the project.
  - Either reduce your requested resources, or contact your administrator to request more resources.
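If you prefer to inspect a resource outside the console, the status.conditions.message check can be sketched in Python. This is a minimal sketch, not part of the product: it assumes that you have already loaded the Workload or RayCluster YAML into a dictionary (for example, by copying it from the YAML tab), and the helper name condition_messages is hypothetical.

```python
def condition_messages(resource):
    """Return every non-empty message found under status.conditions."""
    status = resource.get("status") or {}
    conditions = status.get("conditions") or []
    return [c["message"] for c in conditions if c.get("message")]


# Sample data based on the example message in this section;
# a real resource contains many more fields.
workload = {
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2024-05-29T13:05:09Z",
                "message": (
                    "couldn't assign flavors to pod set small-group-jobtest12: "
                    "insufficient quota for nvidia.com/gpu in flavor "
                    "default-flavor in ClusterQueue"
                ),
            }
        ]
    }
}

for message in condition_messages(workload):
    print(message)
```

A message that mentions insufficient quota, as in this sample, points to the cluster queue check in the last step.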
6.2. My Ray cluster is in a failed state
Problem
You might have insufficient resources.
Diagnosis
The Ray cluster head pod or worker pods are not running. When a Ray cluster is created, it initially enters a failed state. This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.
Resolution
If the failed state persists, complete the following steps:
- In the OpenShift console, select your project from the Project list.
- Click Search, and from the Resources list, select Pod.
- Click your pod name to open the pod details page.
- Click the Events tab, and review the pod events to identify the cause of the problem.
- If you cannot resolve the problem, contact your administrator to request assistance.
6.3. I see a failed to call webhook error message for the CodeFlare Operator
Problem
After you run the cluster.up() command, the following error is shown:
ApiException: (500) Reason: Internal Server Error HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\""}]},"code":500}
Diagnosis
The CodeFlare Operator pod might not be running.
Resolution
Contact your administrator to request assistance.
6.4. I see a failed to call webhook error message for Kueue
Problem
After you run the cluster.up() command, the following error is shown:
ApiException: (500) Reason: Internal Server Error HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}
Diagnosis
The Kueue pod might not be running.
Resolution
Contact your administrator to request assistance.
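The webhook errors in sections 6.3 and 6.4 name the service that has no endpoints, which is useful information to pass to your administrator. The following minimal sketch extracts that service name from the HTTP response body. The function name is hypothetical, and the sample body is a shortened version of the Kueue error shown above.

```python
import json
import re

# Matches the phrase that appears in both webhook error messages.
NO_ENDPOINTS = re.compile(r'no endpoints available for service "([^"]+)"')


def unavailable_webhook_service(response_body):
    """Return the name of the service without endpoints, or None."""
    try:
        body = json.loads(response_body)
    except ValueError:
        body = None
    # Fall back to the raw text if the body is not a JSON object.
    message = body.get("message", "") if isinstance(body, dict) else response_body
    match = NO_ENDPOINTS.search(message)
    return match.group(1) if match else None


sample_body = json.dumps({
    "kind": "Status",
    "status": "Failure",
    "message": (
        'Internal error occurred: failed calling webhook "mraycluster.kb.io": '
        'failed to call webhook: Post "https://kueue-webhook-service.'
        'redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s": '
        'no endpoints available for service "kueue-webhook-service"'
    ),
    "code": 500,
})

print(unavailable_webhook_service(sample_body))
```

A result of codeflare-operator-webhook-service corresponds to section 6.3, and kueue-webhook-service corresponds to section 6.4.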
6.5. My Ray cluster doesn’t start
Problem
After you run the cluster.up() command, when you run either the cluster.details() command or the cluster.status() command, the Ray cluster remains in the Starting status instead of changing to the Ready status. No pods are created.
Diagnosis
- In the OpenShift console, select your project from the Project list.
- Check the workload resource:
  - Click Search, and from the Resources list, select Workload.
  - Select the workload resource that is created with the Ray cluster resource, and click the YAML tab.
  - Check the text in the status.conditions.message field, which provides the reason for remaining in the Starting state.
- Check the Ray cluster resource:
  - Click Search, and from the Resources list, select RayCluster.
  - Select the Ray cluster resource, and click the YAML tab.
  - Check the text in the status.conditions.message field.
Resolution
If you cannot resolve the problem, contact your administrator to request assistance.
6.6. I see a Default Local Queue … not found error message
Problem
After you run the cluster.up() command, the following error is shown:
Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.
Diagnosis
No default local queue is defined, and a local queue is not specified in the cluster configuration.
Resolution
- In the OpenShift console, select your project from the Project list.
- Click Search, and from the Resources list, select LocalQueue.
- Resolve the problem in one of the following ways:
  - If a local queue exists, add it to your cluster configuration as follows:
      local_queue="<local_queue_name>"
  - If no local queue exists, contact your administrator to request assistance.
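The error message indicates that a default local queue is identified by the kueue.x-k8s.io/default-queue: true annotation. As a minimal sketch, assuming that you have the LocalQueue resources for your project available as a list of dictionaries (for example, loaded from their YAML), the following hypothetical helper shows how such a default queue could be located. The queue names in the sample data are invented for illustration.

```python
DEFAULT_QUEUE_ANNOTATION = "kueue.x-k8s.io/default-queue"


def find_default_local_queue(local_queues):
    """Return the name of the first queue annotated as default, or None."""
    for queue in local_queues:
        metadata = queue.get("metadata") or {}
        annotations = metadata.get("annotations") or {}
        # Annotation values are strings, so compare against "true".
        if annotations.get(DEFAULT_QUEUE_ANNOTATION) == "true":
            return metadata.get("name")
    return None


# Hypothetical sample data: two queues, one of them marked as default.
queues = [
    {"metadata": {"name": "team-a-queue", "namespace": "my-project"}},
    {
        "metadata": {
            "name": "shared-queue",
            "namespace": "my-project",
            "annotations": {DEFAULT_QUEUE_ANNOTATION: "true"},
        }
    },
]

print(find_default_local_queue(queues))
```

If the helper returns None but a queue exists, specifying that queue name as the local_queue value is the route described in the resolution steps.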
6.7. I see a local_queue provided does not exist error message
Problem
After you run the cluster.up() command, the following error is shown:
local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.
Diagnosis
An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.
Resolution
- In the OpenShift console, select your project from the Project list.
- Click Search, and from the Resources list, select LocalQueue.
- Resolve the problem in one of the following ways:
  - If a local queue exists, ensure that you spelled the local queue name correctly in your cluster configuration, and that the namespace value in the cluster configuration matches your project name. If you do not specify a namespace value in the cluster configuration, the Ray cluster is created in the current project.
  - If no local queue exists, contact your administrator to request assistance.
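The two causes named in the diagnosis (a misspelled name, or a queue that exists in a different namespace) can be told apart with a small check. The following is a minimal sketch under the assumption that you have a list of LocalQueue resources as dictionaries, possibly from more than one project. The helper name and the sample names are hypothetical.

```python
def explain_local_queue_lookup(local_queues, name, namespace):
    """Return None if the queue is found, otherwise a short explanation."""
    namespaces = sorted(
        q["metadata"].get("namespace", "")
        for q in local_queues
        if (q.get("metadata") or {}).get("name") == name
    )
    if namespace in namespaces:
        return None
    if namespaces:
        # The name is right, but the queue lives in another namespace.
        return f"'{name}' exists only in: {', '.join(namespaces)}"
    return f"no local queue named '{name}' was found"


# Hypothetical sample data.
queues = [{"metadata": {"name": "team-queue", "namespace": "project-a"}}]

print(explain_local_queue_lookup(queues, "team-queue", "project-b"))
print(explain_local_queue_lookup(queues, "team-quue", "project-a"))
```

The first call reports a namespace mismatch, and the second reports a name that does not match any queue, which usually means a spelling error.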
6.8. I cannot create a Ray cluster or submit jobs
Problem
After you run the cluster.up() command, an error similar to the following error is shown:
RuntimeError: Failed to get RayCluster CustomResourceDefinition: (403) Reason: Forbidden HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io is forbidden: User \"system:serviceaccount:regularuser-project:regularuser-workbench\" cannot list resource \"rayclusters\" in API group \"ray.io\" in the namespace \"regularuser-project\"","reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},"code":403}
Diagnosis
The correct OpenShift login credentials are not specified in the TokenAuthentication section of your notebook code.
Resolution
- Identify the correct OpenShift login credentials as follows:
  - In the OpenShift console header, click your username and click Copy login command.
  - In the new tab that opens, log in as the user whose credentials you want to use.
  - Click Display Token.
  - From the Log in with this token section, copy the token and server values.
- In your notebook code, specify the copied token and server values as follows:
    auth = TokenAuthentication(
        token = "<token>",
        server = "<server>",
        skip_tls=False
    )
    auth.login()
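In the sample error in this section, the forbidden user is a service account (system:serviceaccount:regularuser-project:regularuser-workbench) rather than a named user. As a minimal sketch, the following hypothetical helper extracts the service account from such a message. Treating a service-account identity as a sign that the request did not use your own token is an assumption drawn from the sample error, not a documented rule.

```python
import re

# Accepts both the decoded message and the raw text with escaped quotes (\").
SERVICE_ACCOUNT = re.compile(
    r'User \\?"system:serviceaccount:([^:"\\]+):([^:"\\]+)'
)


def service_account_in_error(message):
    """Return (namespace, name) of the service account, or None."""
    match = SERVICE_ACCOUNT.search(message)
    return (match.group(1), match.group(2)) if match else None


sample_message = (
    'rayclusters.ray.io is forbidden: User '
    '"system:serviceaccount:regularuser-project:regularuser-workbench" '
    'cannot list resource "rayclusters" in API group "ray.io" '
    'in the namespace "regularuser-project"'
)

print(service_account_in_error(sample_message))
```

If the helper returns a service account, check that the token and server values in your TokenAuthentication section are the ones copied from the Display Token page.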
6.9. My pod provisioned by Kueue is terminated before my image is pulled
Problem
Kueue waits for a period of time before marking a workload as ready, to enable all of the workload pods to become provisioned and running. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.
Diagnosis
- In the OpenShift console, select your project from the Project list.
- Click Search, and from the Resources list, select Pod.
- Click the Ray head pod name to open the pod details page.
- Click the Events tab, and review the pod events to check whether the image pull completed successfully.
Resolution
If the pod takes more than 5 minutes to pull the image, contact your administrator to request assistance.
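To compare the image pull time with the 5-minute default, you can work from the timestamps on the Events tab. The following is a minimal sketch: the event dictionaries use a simplified, hypothetical shape (a reason and a time field), and the sample timestamps are invented. Pulling and Pulled are the reasons that the kubelet reports when an image pull starts and completes.

```python
from datetime import datetime, timedelta

KUEUE_DEFAULT_WAIT = timedelta(minutes=5)  # default stated in this section
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def image_pull_duration(events):
    """Return the time between the first Pulling and the next Pulled event.

    Returns None if the pull never started or never completed.
    """
    started = finished = None
    for event in events:
        when = datetime.strptime(event["time"], TIME_FORMAT)
        if event["reason"] == "Pulling" and started is None:
            started = when
        elif event["reason"] == "Pulled" and started is not None and finished is None:
            finished = when
    if started is None or finished is None:
        return None
    return finished - started


# Hypothetical events, as you might transcribe them from the Events tab.
events = [
    {"reason": "Pulling", "time": "2024-05-29T13:05:09Z"},
    {"reason": "Pulled", "time": "2024-05-29T13:12:39Z"},
]

duration = image_pull_duration(events)
if duration is None:
    print("The image pull did not complete.")
elif duration > KUEUE_DEFAULT_WAIT:
    print(f"The image pull took {duration}, longer than the default wait.")
```

A result of None can also mean that the pod was terminated before the pull finished, which is the situation that this section describes.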