Chapter 4. Monitoring distributed workloads

In OpenShift AI, you can view project metrics for distributed workloads, and view the status of all distributed workloads in the selected project. You can use these metrics to monitor the resources used by distributed workloads, assess whether project resources are allocated correctly, track the progress of distributed workloads, and identify corrective action when necessary.

Note

Data Science Pipelines (DSP) workloads are not managed by the distributed workloads feature, and are not included in the distributed workloads metrics.

4.1. Viewing project metrics for distributed workloads

In OpenShift AI, you can view the following project metrics for distributed workloads:

  • CPU - The number of CPU cores that are currently being used by all distributed workloads in the selected project.
  • Memory - The amount of memory in gibibytes (GiB) that is currently being used by all distributed workloads in the selected project.

You can use these metrics to monitor the resources used by the distributed workloads, and assess whether project resources are allocated correctly.
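Note that the memory metric uses gibibytes (GiB), a binary unit (2³⁰ bytes), not decimal gigabytes (GB). A minimal sketch of the conversion (the function name is illustrative, not part of OpenShift AI):

```python
def bytes_to_gib(num_bytes: float) -> float:
    """Convert a raw byte count to gibibytes (GiB), the binary unit
    used by the project metrics graphs (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30

# A nominal 8 GB (decimal) of memory is only about 7.45 GiB (binary).
print(round(bytes_to_gib(8e9), 2))
```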

Prerequisites

  • You have installed Red Hat OpenShift AI.
  • On the OpenShift cluster where OpenShift AI is installed, user workload monitoring is enabled.
  • You have logged in to OpenShift AI.
  • If you are using specialized OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
  • Your data science project contains distributed workloads.

Procedure

  1. In the OpenShift AI left navigation pane, click Distributed Workloads Metrics.
  2. From the Project list, select the project that contains the distributed workloads that you want to monitor.
  3. Click the Project metrics tab.
  4. Optional: From the Refresh interval list, select a value to specify how frequently the graphs on the metrics page are refreshed to show the latest data.

    You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, or 1 day.

  5. In the Requested resources section, review the CPU and Memory graphs to identify the resources requested by distributed workloads as follows:

    • Requested by the selected project
    • Requested by all projects, including the selected project and projects that you cannot access
    • Total shared quota for all projects, as provided by the cluster queue

    For each resource type (CPU and Memory), subtract the Requested by all projects value from the Total shared quota value to calculate how much of that resource quota has not been requested and is available for all projects.
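The subtraction described in this step can be expressed as a short calculation. The following sketch uses hypothetical example values, not data from a real cluster:

```python
def available_quota(total_shared_quota: float, requested_by_all: float) -> float:
    """Return the portion of a resource quota (CPU cores or memory GiB)
    that has not been requested by any project and is still available."""
    return total_shared_quota - requested_by_all

# Hypothetical example: the cluster queue provides a shared quota of
# 100 CPU cores, and all projects together have requested 64 cores.
print(available_quota(100, 64))  # 36 cores remain available to all projects
```

The same calculation applies to the Memory graph, with values in GiB instead of cores.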

  6. Scroll down to the Top resource-consuming distributed workloads section to review the following graphs:

    • Top 5 distributed workloads that are consuming the most CPU resources
    • Top 5 distributed workloads that are consuming the most memory

    Each graph also shows the amount of CPU or memory that each of these workloads is currently consuming.

  7. Scroll down to view the Distributed workload resource metrics table, which lists all of the distributed workloads in the selected project, and indicates the current resource usage and the status of each distributed workload.

    In each table entry, progress bars indicate how much of the requested CPU and memory is currently being used by this distributed workload. To see numeric values for the actual usage and requested usage for CPU (measured in cores) and memory (measured in GiB), hover the cursor over each progress bar. Compare the actual usage with the requested usage to assess the distributed workload configuration. If necessary, reconfigure the distributed workload to reduce or increase the requested resources.
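Comparing actual usage with requested usage amounts to computing a utilization ratio for each resource. A minimal sketch of that assessment, with illustrative thresholds that are not part of OpenShift AI:

```python
def assess_request(used: float, requested: float,
                   low: float = 0.5, high: float = 0.9) -> str:
    """Compare actual usage with the requested amount for one resource
    (CPU cores or memory GiB) and suggest a configuration change.
    The low/high thresholds are illustrative: well below the request
    suggests over-provisioning; close to the request suggests the
    workload may need more headroom."""
    ratio = used / requested
    if ratio < low:
        return "over-provisioned: consider reducing the request"
    if ratio > high:
        return "near the limit: consider increasing the request"
    return "request looks appropriately sized"

print(assess_request(used=1.2, requested=8))  # 15% used: over-provisioned
print(assess_request(used=7.5, requested=8))  # ~94% used: near the limit
```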

Verification

On the Project metrics tab, the graphs and table provide resource-usage data for the distributed workloads in the selected project.

4.2. Viewing the status of distributed workloads

In OpenShift AI, you can view the status of all distributed workloads in the selected project. You can track the progress of the distributed workloads, and identify corrective action when necessary.

Prerequisites

  • You have installed Red Hat OpenShift AI.
  • On the OpenShift cluster where OpenShift AI is installed, user workload monitoring is enabled.
  • You have logged in to OpenShift AI.
  • If you are using specialized OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
  • Your data science project contains distributed workloads.

Procedure

  1. In the OpenShift AI left navigation pane, click Distributed Workloads Metrics.
  2. From the Project list, select the project that contains the distributed workloads that you want to monitor.
  3. Click the Distributed workload status tab.
  4. Optional: From the Refresh interval list, select a value to specify how frequently the graphs on the metrics page are refreshed to show the latest data.

    You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, or 1 day.

  5. In the Status overview section, review a summary of the status of all distributed workloads in the selected project.

    The status can be Pending, Inadmissible, Admitted, Running, Evicted, Succeeded, or Failed.

  6. Scroll down to view the Distributed workloads table, which lists all of the distributed workloads in the selected project. The table provides the priority, status, creation date, and latest message for each distributed workload.

    The latest message provides more information about the current status of the distributed workload. Review the latest message to identify any corrective action needed. For example, a distributed workload might be Inadmissible because the requested resources exceed the available resources. In such cases, you can either reconfigure the distributed workload to reduce the requested resources, or reconfigure the cluster queue for the project to increase the resource quota.
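If you inspect workload statuses programmatically rather than in the dashboard, the corrective-action reasoning described above can be sketched as a simple mapping. The status strings match the ones listed in this section; the function itself is illustrative, not part of OpenShift AI:

```python
def corrective_action(status: str) -> str:
    """Map a distributed workload status (as shown on the Distributed
    workload status tab) to a suggested next step."""
    terminal = {
        "Succeeded": "no action needed",
        "Failed": "review the latest message and the workload logs",
        "Evicted": "check why the workload lost its quota allocation",
    }
    if status in terminal:
        return terminal[status]
    if status == "Inadmissible":
        # Typically the requested resources exceed the available quota:
        # either shrink the request or grow the cluster queue quota.
        return "reduce the requested resources or increase the cluster queue quota"
    if status in ("Pending", "Admitted", "Running"):
        return "monitor progress"
    return f"unknown status: {status}"

print(corrective_action("Inadmissible"))
```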

Verification

On the Distributed workload status tab, the graph provides a summarized view of the status of all distributed workloads in the selected project, and the table provides more details about the status of each distributed workload.
