Chapter 3. Managing cluster resources
3.1. Configuring the default PVC size for your cluster
To configure how resources are claimed within your OpenShift AI cluster, you can change the default size of the cluster’s persistent volume claim (PVC) to ensure that the requested storage matches your common storage workflow. PVCs are requests for resources in your cluster and also act as claim checks to the resource.
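In Kubernetes terms, the default PVC size becomes the storage request of each new claim. The following minimal sketch shows the kind of claim that results with the 20 GiB default; the claim name is an illustrative placeholder and the storage class is whichever one your cluster uses by default:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-workbench-storage    # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi                  # matches the default PVC size configured in the dashboard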
Prerequisites
- You have logged in to Red Hat OpenShift AI.
Changing the PVC setting restarts the Jupyter pod and makes Jupyter unavailable for up to 30 seconds. It is recommended that you perform this action outside of your organization’s typical working hours.
Procedure
- From the OpenShift AI dashboard, click Settings → Cluster settings.
- Under PVC size, enter a new size in gibibytes. The minimum size is 1 GiB, and the maximum size is 16384 GiB.
- Click Save changes.
Verification
- New PVCs are created with the default storage size that you configured.
3.2. Restoring the default PVC size for your cluster
To change the size of resources used within your OpenShift AI cluster, you can restore the default size of the cluster’s persistent volume claim (PVC).
Prerequisites
- You have logged in to Red Hat OpenShift AI.
- You are part of the administrator group for OpenShift AI in OpenShift.
Procedure
- From the OpenShift AI dashboard, click Settings → Cluster settings.
- Click Restore Default to restore the default PVC size of 20 GiB.
- Click Save changes.
Verification
- New PVCs are created with the default storage size of 20 GiB.
3.3. Overview of accelerators
If you work with large data sets, you can use accelerators to optimize the performance of your data science models in OpenShift AI. With accelerators, you can scale your work, reduce latency, and increase productivity. You can use accelerators in OpenShift AI to assist your data scientists in the following tasks:
- Natural language processing (NLP)
- Inference
- Training deep neural networks
- Data cleansing and data processing
OpenShift AI supports the following accelerators:
NVIDIA graphics processing units (GPUs)
- To use compute-heavy workloads in your models, you can enable NVIDIA graphics processing units (GPUs) in OpenShift AI.
- To enable GPUs on OpenShift, you must install the NVIDIA GPU Operator.
Intel Gaudi AI accelerators
- Intel provides hardware accelerators intended for deep learning workloads. You can use the Habana libraries and software associated with Intel Gaudi AI accelerators, which are available from your notebook.
- Before you can enable Intel Gaudi AI accelerators in OpenShift AI, you must install the necessary dependencies and the version of the HabanaAI Operator that matches the Habana version of the HabanaAI workbench image in your deployment. For more information about how to enable your OpenShift environment for Intel Gaudi AI accelerators, see HabanaAI Operator v1.10 for OpenShift and HabanaAI Operator v1.13 for OpenShift.
- You can enable Intel Gaudi AI accelerators on-premises or with AWS DL1 compute nodes on an AWS instance.
Before you can use an accelerator in OpenShift AI, your OpenShift instance must contain an associated accelerator profile. For accelerators that are new to your deployment, you must configure an accelerator profile for that accelerator. You can create an accelerator profile from the Settings → Accelerator profiles page on the OpenShift AI dashboard.
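Behind the dashboard, each accelerator profile is stored as an AcceleratorProfile custom resource in the OpenShift AI applications namespace. The following minimal sketch shows such a resource for NVIDIA GPUs; the resource name is a placeholder, and the API group shown is an assumption that you should verify against the AcceleratorProfile CRD in your cluster:
apiVersion: dashboard.opendatahub.io/v1    # assumed API group; verify against the CRD in your cluster
kind: AcceleratorProfile
metadata:
  name: nvidia-gpu-example                 # placeholder name
  namespace: redhat-ods-applications
spec:
  displayName: NVIDIA GPU
  enabled: true
  identifier: nvidia.com/gpu               # resource name that workloads request
  tolerations: []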
3.3.1. Enabling GPU support in OpenShift AI
Optionally, to ensure that your data scientists can use compute-heavy workloads in their models, you can enable graphics processing units (GPUs) in OpenShift AI.
If you are using OpenShift AI in a disconnected self-managed environment, see Enabling GPU support in OpenShift AI instead.
Prerequisites
- You have logged in to your OpenShift cluster.
- You have the cluster-admin role in your OpenShift cluster.
Procedure
- To enable GPU support on an OpenShift cluster, follow the instructions here: NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
Delete the migration-gpu-status ConfigMap.
- In the OpenShift web console, switch to the Administrator perspective.
- Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate ConfigMap.
- Search for the migration-gpu-status ConfigMap.
- Click the action menu (⋮) and select Delete ConfigMap from the list. The Delete ConfigMap dialog appears.
- Inspect the dialog and confirm that you are deleting the correct ConfigMap.
- Click Delete.
Restart the dashboard replicaset.
- In the OpenShift web console, switch to the Administrator perspective.
- Click Workloads → Deployments.
- Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate deployment.
- Search for the rhods-dashboard deployment.
- Click the action menu (⋮) and select Restart Rollout from the list.
- Wait until the Status column indicates that all pods in the rollout have fully restarted.
Verification
- The NVIDIA GPU Operator appears on the Operators → Installed Operators page in the OpenShift web console.
- The reset migration-gpu-status instance is present in the Instances tab on the AcceleratorProfile custom resource definition (CRD) details page.
After installing the NVIDIA GPU Operator, create an accelerator profile as described in Working with accelerator profiles.
3.3.2. Enabling Intel Gaudi AI accelerators
Before you can use Intel Gaudi AI accelerators in OpenShift AI, you must install the necessary dependencies and deploy the HabanaAI Operator.
Prerequisites
- You have logged in to OpenShift.
- You have the cluster-admin role in OpenShift.
Procedure
- To enable Intel Gaudi AI accelerators in OpenShift AI, follow the instructions at HabanaAI Operator for OpenShift.
- From the OpenShift AI dashboard, click Settings → Accelerator profiles. The Accelerator profiles page appears, displaying existing accelerator profiles. To enable or disable an existing accelerator profile, on the row containing the relevant accelerator profile, click the toggle in the Enable column.
- Click Create accelerator profile. The Create accelerator profile dialog opens.
- In the Name field, enter a name for the Intel Gaudi AI Accelerator.
- In the Identifier field, enter a unique string that identifies the Intel Gaudi AI Accelerator, for example, habana.ai/gaudi.
- Optional: In the Description field, enter a description for the Intel Gaudi AI Accelerator.
- To enable or disable the accelerator profile for the Intel Gaudi AI Accelerator immediately after creation, click the toggle in the Enable column.
- Optional: Add a toleration to schedule pods with matching taints (see the example after this procedure).
- Click Add toleration. The Add toleration dialog opens.
- From the Operator list, select one of the following options:
- Equal - The key/value/effect parameters must match. This is the default.
- Exists - The key/effect parameters must match. You must leave a blank value parameter, which matches any.
- From the Effect list, select one of the following options:
- None
- NoSchedule - New pods that do not match the taint are not scheduled onto that node. Existing pods on the node remain.
- PreferNoSchedule - New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to. Existing pods on the node remain.
- NoExecute - New pods that do not match the taint cannot be scheduled onto that node. Existing pods on the node that do not have a matching toleration are removed.
- In the Key field, enter the toleration key habana.ai/gaudi. The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
- In the Value field, enter a toleration value. The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.
- In the Toleration Seconds section, select one of the following options to specify how long a pod stays bound to a node that has a node condition:
- Forever - Pods stay permanently bound to a node.
- Custom value - Enter a value, in seconds, to define how long pods stay bound to a node that has a node condition.
- Click Add.
- Click Create accelerator profile.
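Completing the dialog results in an AcceleratorProfile resource similar to the following sketch. The toleration shown assumes that your Gaudi nodes carry a habana.ai/gaudi taint with the NoSchedule effect, and the API group is an assumption that you should verify against the AcceleratorProfile CRD in your cluster:
apiVersion: dashboard.opendatahub.io/v1    # assumed API group; verify against the CRD in your cluster
kind: AcceleratorProfile
metadata:
  name: intel-gaudi-example                # placeholder name
  namespace: redhat-ods-applications
spec:
  displayName: Intel Gaudi AI Accelerator
  enabled: true
  identifier: habana.ai/gaudi
  tolerations:
    - key: habana.ai/gaudi                 # Key field from the Add toleration dialog
      operator: Exists                     # Exists matches any value for the key
      effect: NoSchedule
      # Forever in the Toleration Seconds section corresponds to omitting tolerationSeconds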
Verification
- From the Administrator perspective, the following Operators appear on the Operators → Installed Operators page:
- HabanaAI
- Node Feature Discovery (NFD)
- Kernel Module Management (KMM)
- The Accelerator list displays the Intel Gaudi AI Accelerator on the Start a notebook server page. After you select an accelerator, the Number of accelerators field appears, which you can use to choose the number of accelerators for your notebook server.
- The accelerator profile appears on the Accelerator profiles page.
- The accelerator profile appears on the Instances tab on the details page for the AcceleratorProfile custom resource definition (CRD).
3.4. Allocating additional resources to OpenShift AI users
As a cluster administrator, you can allocate additional resources to a cluster to support compute-intensive data science work. This support includes increasing the number of nodes in the cluster and changing the cluster’s allocated machine pool.
For more information about allocating additional resources to an OpenShift cluster, see Manually scaling a compute machine set.
3.5. Troubleshooting common problems with distributed workloads for administrators
If your users are experiencing errors in Red Hat OpenShift AI relating to distributed workloads, read this section to understand what could be causing the problem and how to resolve it.
If the problem is not documented here or in the release notes, contact Red Hat Support.
3.5.1. A user’s Ray cluster is in a suspended state
Problem
The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.
Diagnosis
The user’s Ray cluster head pod or worker pods remain in a suspended state. Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the suspended state, as shown in the following example:
status:
  conditions:
    - lastTransitionTime: '2024-05-29T13:05:09Z'
      message: 'couldn''t assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue'
Resolution
Check whether the resource flavor is created, as follows:
- In the OpenShift console, select the user’s project from the Project list.
- Click Home → Search, and from the Resources list, select ResourceFlavor.
- If necessary, create the resource flavor, as shown in the example after this procedure.
- Check the cluster queue configuration in the user’s code, to ensure that the resources that they requested are within the limits defined for the project.
- If necessary, increase the resource quota.
For information about configuring resource flavors and quotas, see Configuring quota management for distributed workloads.
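For reference, the example error above indicates that the cluster queue does not provide enough nvidia.com/gpu quota in the default-flavor resource flavor. The following sketch shows a resource flavor and a cluster queue that would admit such a workload; the cluster queue name and the quota values are illustrative placeholders, not recommendations:
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor                 # flavor name referenced by the cluster queue
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: example-cluster-queue          # placeholder name
spec:
  namespaceSelector: {}                # admit workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 8
            - name: "memory"
              nominalQuota: 32Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 2          # without a nvidia.com/gpu quota, GPU workloads remain suspended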
3.5.2. A user’s Ray cluster is in a failed state
Problem
The user might have insufficient resources.
Diagnosis
The user’s Ray cluster head pod or worker pods are not running. When a Ray cluster is created, it initially enters a failed state. This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.
Resolution
If the failed state persists, complete the following steps:
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Click the user’s pod name to open the pod details page.
- Click the Events tab, and review the pod events to identify the cause of the problem.
- Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the failed state.
3.5.3. A user receives a failed to call webhook error message for the CodeFlare Operator
Problem
After the user runs the cluster.up() command, the following error is shown:
ApiException: (500) Reason: Internal Server Error HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\""}]},"code":500}
Diagnosis
The CodeFlare Operator pod might not be running.
Resolution
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Verify that the CodeFlare Operator pod is running. If necessary, restart the CodeFlare Operator pod.
- Review the logs for the CodeFlare Operator pod to verify that the webhook server is serving, as shown in the following example:
INFO controller-runtime.webhook Serving webhook server {"host": "", "port": 9443}
3.5.4. A user receives a failed to call webhook error message for Kueue
Problem
After the user runs the cluster.up() command, the following error is shown:
ApiException: (500) Reason: Internal Server Error HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}
Diagnosis
The Kueue pod might not be running.
Resolution
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Verify that the Kueue pod is running. If necessary, restart the Kueue pod.
- Review the logs for the Kueue pod to verify that the webhook server is serving, as shown in the following example:
{"level":"info","ts":"2024-06-24T14:36:24.255137871Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}
3.5.5. A user’s Ray cluster does not start
Problem
After the user runs the cluster.up() command, when they run either the cluster.details() command or the cluster.status() command, the Ray cluster status remains as Starting instead of changing to Ready. No pods are created.
Diagnosis
Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for remaining in the Starting state. Similarly, check the status.conditions.message field for the RayCluster resource.
Resolution
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Verify that the KubeRay pod is running. If necessary, restart the KubeRay pod.
- Review the logs for the KubeRay pod to identify errors.
3.5.6. A user receives a Default Local Queue … not found error message
Problem
After the user runs the cluster.up() command, the following error is shown:
Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.
Diagnosis
No default local queue is defined, and a local queue is not specified in the cluster configuration.
Resolution
Check whether a local queue exists in the user’s project, as follows:
- In the OpenShift console, select the user’s project from the Project list.
- Click Home → Search, and from the Resources list, select LocalQueue.
- If no local queues are found, create a local queue.
- Provide the user with the details of the local queues in their project, and advise them to add a local queue to their cluster configuration.
- Define a default local queue, as shown in the example after this procedure.
For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
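For reference, a default local queue is a LocalQueue resource in the user’s project that carries the kueue.x-k8s.io/default-queue annotation. The following is a minimal sketch; the queue name, project namespace, and cluster queue name are placeholders:
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: local-queue-example                # placeholder name
  namespace: example-project               # the user's data science project
  annotations:
    kueue.x-k8s.io/default-queue: "true"   # marks this queue as the project default
spec:
  clusterQueue: example-cluster-queue      # must reference an existing ClusterQueue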
3.5.7. A user receives a local_queue provided does not exist error message
Problem
After the user runs the cluster.up() command, the following error is shown:
local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.
Diagnosis
An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.
Resolution
- In the OpenShift console, select the user’s project from the Project list.
- Click Search, and from the Resources list, select LocalQueue.
Resolve the problem in one of the following ways:
- If no local queues are found, create a local queue.
- If one or more local queues are found, provide the user with the details of the local queues in their project. Advise the user to ensure that they spelled the local queue name correctly in their cluster configuration, and that the namespace value in the cluster configuration matches their project name. If the user does not specify a namespace value in the cluster configuration, the Ray cluster is created in the current project.
- Define a default local queue.
For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.
3.5.8. A user cannot create a Ray cluster or submit jobs
Problem
After the user runs the cluster.up() command, an error similar to the following text is shown:
RuntimeError: Failed to get RayCluster CustomResourceDefinition: (403) Reason: Forbidden HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io is forbidden: User \"system:serviceaccount:regularuser-project:regularuser-workbench\" cannot list resource \"rayclusters\" in API group \"ray.io\" in the namespace \"regularuser-project\"","reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},"code":403}
Diagnosis
The correct OpenShift login credentials are not specified in the TokenAuthentication section of the user’s notebook code.
Resolution
Advise the user to identify and specify the correct OpenShift login credentials as follows:
- In the OpenShift console header, click your username and click Copy login command.
- In the new tab that opens, log in as the user whose credentials you want to use.
- Click Display Token.
- From the Log in with this token section, copy the token and server values.
- Specify the copied token and server values in your notebook code as follows:
auth = TokenAuthentication(
    token = "<token>",
    server = "<server>",
    skip_tls=False
)
auth.login()
- Verify that the user has the correct permissions and is part of the rhoai-users group.
3.5.9. The user’s pod provisioned by Kueue is terminated before the user’s image is pulled
Problem
Kueue waits for a period of time before marking a workload as ready, to enable all of the workload pods to become provisioned and running. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.
Diagnosis
- In the OpenShift console, select the user’s project from the Project list.
- Click Workloads → Pods.
- Click the user’s pod name to open the pod details page.
- Click the Events tab, and review the pod events to check whether the image pull completed successfully.
Resolution
If the pod takes more than 5 minutes to pull the image, resolve the problem in one of the following ways:
- Add an OnFailure restart policy for resources that are managed by Kueue.
- In the redhat-ods-applications namespace, edit the kueue-manager-config ConfigMap to set a custom timeout for the waitForPodsReady property, as shown in the example after this list. For more information about this configuration option, see Enabling waitForPodsReady in the Kueue documentation.