Chapter 4. Monitoring distributed workloads

In OpenShift AI, you can view project metrics for distributed workloads, and view the status of all distributed workloads in the selected project. You can use these metrics to monitor the resources used by distributed workloads, assess whether project resources are allocated correctly, track the progress of distributed workloads, and identify corrective action when necessary.

Note

Data science pipelines workloads are not managed by the distributed workloads feature, and are not included in the distributed workloads metrics.

4.1. Viewing project metrics for distributed workloads

In OpenShift AI, you can view the following project metrics for distributed workloads:

  • CPU - The number of CPU cores that are currently being used by all distributed workloads in the selected project.
  • Memory - The amount of memory in gibibytes (GiB) that is currently being used by all distributed workloads in the selected project.

You can use these metrics to monitor the resources used by the distributed workloads, and assess whether project resources are allocated correctly.

Prerequisites

  • You have installed Red Hat OpenShift AI.
  • On the OpenShift cluster where OpenShift AI is installed, user workload monitoring is enabled.
  • You have logged in to OpenShift AI.
  • If you are using specialized OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
  • Your data science project contains distributed workloads.

Procedure

  1. In the OpenShift AI left navigation pane, click Distributed Workloads Metrics.
  2. From the Project list, select the project that contains the distributed workloads that you want to monitor.
  3. Click the Project metrics tab.
  4. Optional: From the Refresh interval list, select a value to specify how frequently the graphs on the metrics page are refreshed to show the latest data.

    You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, or 1 day.

  5. In the Requested resources section, review the CPU and Memory graphs to identify the resources requested by distributed workloads as follows:

    • Requested by the selected project
    • Requested by all projects, including the selected project and projects that you cannot access
    • Total shared quota for all projects, as provided by the cluster queue

    For each resource type (CPU and Memory), subtract the Requested by all projects value from the Total shared quota value to calculate how much of that resource quota remains unrequested and available to all projects. A worked example of this arithmetic follows this procedure.

  6. Scroll down to the Top resource-consuming distributed workloads section to review the following graphs:

    • Top 5 distributed workloads that are consuming the most CPU resources
    • Top 5 distributed workloads that are consuming the most memory

    You can also identify how much CPU or memory is used in each case.

  7. Scroll down to view the Distributed workload resource metrics table, which lists all of the distributed workloads in the selected project, and indicates the current resource usage and the status of each distributed workload.

    In each table entry, progress bars indicate how much of the requested CPU and memory is currently being used by this distributed workload. To see numeric values for the actual usage and requested usage for CPU (measured in cores) and memory (measured in GiB), hover the cursor over each progress bar. Compare the actual usage with the requested usage to assess the distributed workload configuration. If necessary, reconfigure the distributed workload to reduce or increase the requested resources.
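
The following minimal Python sketch works through the arithmetic in steps 5 and 7 with hypothetical values. The numbers are placeholders for the values that you read from the Requested resources graphs and the Distributed workload resource metrics table.

    # A worked example of the quota arithmetic in steps 5 and 7.
    # All values are hypothetical placeholders; read the real values
    # from the graphs and table in the console.

    total_shared_quota = {"cpu_cores": 32, "memory_gib": 128}     # from the cluster queue
    requested_all_projects = {"cpu_cores": 20, "memory_gib": 96}  # all projects combined

    # Step 5: quota that has not been requested and is available to all projects.
    for resource, total in total_shared_quota.items():
        available = total - requested_all_projects[resource]
        print(f"{resource}: {available} of {total} still available")

    # Step 7: compare actual usage with the requested resources for one workload.
    requested = {"cpu_cores": 8, "memory_gib": 32}  # hypothetical workload request
    used = {"cpu_cores": 2, "memory_gib": 30}       # hypothetical current usage

    for resource, req in requested.items():
        utilization = used[resource] / req
        if utilization < 0.5:
            print(f"{resource}: only {utilization:.0%} of the request is used; "
                  f"consider reducing the requested amount")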

Verification

On the Project metrics tab, the graphs and table provide resource-usage data for the distributed workloads in the selected project.

4.2. Viewing the status of distributed workloads

In OpenShift AI, you can view the status of all distributed workloads in the selected project. You can track the progress of the distributed workloads, and identify corrective action when necessary.

Prerequisites

  • You have installed Red Hat OpenShift AI.
  • On the OpenShift cluster where OpenShift AI is installed, user workload monitoring is enabled.
  • You have logged in to OpenShift AI.
  • If you are using specialized OpenShift AI groups, you are part of the user group or admin group (for example, rhoai-users or rhoai-admins) in OpenShift.
  • Your data science project contains distributed workloads.

Procedure

  1. In the OpenShift AI left navigation pane, click Distributed Workloads Metrics.
  2. From the Project list, select the project that contains the distributed workloads that you want to monitor.
  3. Click the Distributed workload status tab.
  4. Optional: From the Refresh interval list, select a value to specify how frequently the graphs on the metrics page are refreshed to show the latest data.

    You can select one of these values: 15 seconds, 30 seconds, 1 minute, 5 minutes, 15 minutes, 30 minutes, 1 hour, 2 hours, or 1 day.

  5. In the Status overview section, review a summary of the status of all distributed workloads in the selected project.

    The status can be Pending, Inadmissible, Admitted, Running, Evicted, Succeeded, or Failed.

  6. Scroll down to view the Distributed workloads table, which lists all of the distributed workloads in the selected project. The table provides the priority, status, creation date, and latest message for each distributed workload.

    The latest message provides more information about the current status of the distributed workload. Review the latest message to identify any corrective action needed. For example, a distributed workload might be Inadmissible because the requested resources exceed the available resources. In such cases, you can either reconfigure the distributed workload to reduce the requested resources, or reconfigure the cluster queue for the project to increase the resource quota.
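
If you want to inspect the same status information from outside the console, the following minimal sketch lists the Kueue Workload resources in a project and prints the most recent status condition and message for each. It assumes the kubernetes Python client, a valid kubeconfig, and the Kueue v1beta1 API; the namespace name is a hypothetical placeholder.

    # A minimal sketch (not the console's own implementation): list Kueue
    # Workload resources in a project and print each workload's most
    # recent status condition and message. Assumes the `kubernetes`
    # Python client and the Kueue v1beta1 API; the namespace is a
    # hypothetical placeholder.

    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    api = client.CustomObjectsApi()

    namespace = "my-data-science-project"  # hypothetical project namespace
    workloads = api.list_namespaced_custom_object(
        group="kueue.x-k8s.io",
        version="v1beta1",
        namespace=namespace,
        plural="workloads",
    )

    for wl in workloads.get("items", []):
        name = wl["metadata"]["name"]
        conditions = wl.get("status", {}).get("conditions", [])
        if not conditions:
            print(f"{name}: no status conditions reported yet")
            continue
        # The newest condition carries the latest message shown in the table.
        latest = max(conditions, key=lambda c: c.get("lastTransitionTime", ""))
        print(f"{name}: {latest['type']}={latest['status']} - {latest.get('message', '')}")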

Verification

On the Distributed workload status tab, the graph provides a summarized view of the status of all distributed workloads in the selected project, and the table provides more details about the status of each distributed workload.

4.3. Viewing Kueue alerts for distributed workloads

In OpenShift AI, you can view Kueue alerts for your cluster. Each alert provides a link to a runbook. The runbook provides instructions on how to resolve the situation that triggered the alert.

Prerequisites

  • You have logged in to OpenShift with the cluster-admin role.
  • You can access a data science cluster that is configured to run distributed workloads as described in Configuring distributed workloads.
  • You can access a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about projects and workbenches, see Working on data science projects.
  • You have logged in to Red Hat OpenShift AI.
  • Your data science project contains distributed workloads.

Procedure

  1. In the OpenShift console, in the Administrator perspective, click Observe → Alerting.
  2. Click the Alerting rules tab to view a list of alerting rules for default and user-defined projects.

    • The Severity column indicates whether the alert is informational, a warning, or critical.
    • The Alert state column indicates whether a rule is currently firing.
  3. Click the name of an alerting rule to see more details, such as the condition that triggers the alert. The following table summarizes the alerting rules for Kueue resources.

    Table 4.1. Alerting rules for Kueue resources

    Severity | Name                            | Alert condition
    ---------+---------------------------------+----------------------------------------------------------------
    Critical | KueuePodDown                    | The Kueue pod is not ready for a period of 5 minutes.
    Info     | LowClusterQueueResourceUsage    | Resource usage in the cluster queue is below 20% of its nominal quota for more than 1 day. Resource usage refers to any resource listed in the cluster queue, such as CPU and memory.
    Info     | ResourceReservationExceedsQuota | Resource reservation is 10 times the available quota in the cluster queue. Resource reservation refers to any resource listed in the cluster queue, such as CPU and memory.
    Info     | PendingWorkloadPods             | A pod has been in a Pending state for more than 3 days.

  4. If the Alert state of an alerting rule is set to Firing, complete the following steps:

    1. Click Observe → Alerting, and then click the Alerts tab.
    2. Click each alert for the firing rule to see more details. Note that a separate alert is fired for each resource type affected by the alerting rule.
    3. On the alert details page, in the Runbook section, click the link to open a GitHub page that provides troubleshooting information.
    4. Complete the runbook steps to identify the cause of the alert and resolve the situation.
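
As an alternative to browsing the console, you can also query firing alerts programmatically. The following sketch is not part of the documented procedure: it calls the Alertmanager v2 API exposed by OpenShift monitoring. The route hostname and token are placeholders (use oc get route -n openshift-monitoring and oc whoami -t to obtain real values), the alert names come from Table 4.1, and the runbook_url annotation name is an assumption based on common OpenShift convention.

    # A hedged sketch: query the cluster's Alertmanager v2 API for firing
    # Kueue-related alerts. The route URL and token are placeholders, and
    # the runbook_url annotation name is an assumption.

    import requests

    ALERTMANAGER = "https://alertmanager-main-openshift-monitoring.apps.example.com"  # placeholder route
    TOKEN = "sha256~replace-with-oc-whoami-t-output"  # placeholder bearer token

    # Alert names taken from Table 4.1.
    matcher = ('alertname=~"KueuePodDown|LowClusterQueueResourceUsage|'
               'ResourceReservationExceedsQuota|PendingWorkloadPods"')

    resp = requests.get(
        f"{ALERTMANAGER}/api/v2/alerts",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"filter": matcher},
        timeout=30,
    )
    resp.raise_for_status()

    for alert in resp.json():
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        print(f"{labels.get('alertname')} [{labels.get('severity')}]")
        print(f"  runbook: {annotations.get('runbook_url', 'not provided')}")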

Verification

After you resolve the cause of the alert, the alerting rule stops firing.
