Chapter 11. Managing observability


Red Hat OpenShift AI provides centralized platform observability: an integrated, out-of-the-box solution for monitoring the health and performance of your OpenShift AI instance and user workloads.

This centralized solution includes a dedicated, pre-configured observability stack, featuring the OpenTelemetry Collector (OTC) for standardized data ingestion, Prometheus for metrics, and the Red Hat build of Tempo for distributed tracing. This architecture enables a common set of health metrics and alerts for OpenShift AI components and offers mechanisms to integrate with your existing external observability tools.

Important

This feature is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

11.1. Enabling the observability stack

The observability stack collects and correlates metrics, traces, and alerts for OpenShift AI so that you can monitor, troubleshoot, and optimize its components. A cluster administrator must explicitly enable this capability in the DSCInitialization (DSCI) custom resource.

After you enable the observability stack, you can perform the following actions:

  • Accelerate troubleshooting by viewing metrics, traces, and alerts for OpenShift AI components in one place.
  • Maintain platform stability by monitoring health and resource usage and receiving alerts for critical issues.
  • Integrate with existing tools by exporting telemetry to third-party observability solutions through the Red Hat build of OpenTelemetry.

Important

This feature is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Prerequisites

  • You have cluster administrator privileges for your OpenShift cluster.
  • You have installed Red Hat OpenShift AI.
  • You have installed the following Operators, which provide the components of the observability stack:

    • Cluster Observability Operator: Deploys and manages Prometheus and Alertmanager for metrics and alerts.
    • Tempo Operator: Provides the Tempo backend for distributed tracing.
    • Red Hat build of OpenTelemetry: Deploys the OpenTelemetry Collector for collecting and exporting telemetry data.
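
    You can optionally confirm from the CLI that these Operators are installed by listing their ClusterServiceVersions. This is a sketch; the exact CSV names and versions vary by release and installation namespace:

      $ oc get csv -n openshift-operators | grep -E 'observability|tempo|opentelemetry'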

Procedure

  1. Log in to the OpenShift web console as a cluster administrator.
  2. In the OpenShift web console, click Operators → Installed Operators.
  3. Search for the Red Hat OpenShift AI Operator, and then click the Operator name to open the Operator details page.
  4. Click the DSCInitialization tab.
  5. Click the default instance name (for example, default-dsci) to open the instance details page.
  6. Click the YAML tab to show the instance specifications.
  7. In the spec.monitoring section, set the value of the managementState field to Managed, and configure metrics, alerting, and tracing settings as shown in the following example:

    Example monitoring configuration

    # ...
    spec:
      monitoring:
        managementState: Managed                 # Required: Enables and manages the observability stack
        namespace: redhat-ods-monitoring    # Required: Namespace where monitoring components are deployed
        alerting: {}                              # Alertmanager configuration, uses default settings if empty
        metrics:                                  # Prometheus configuration for metrics collection
          replicas: 1                             # Optional: Number of Prometheus instances
          resources:                              # CPU and memory requests and limits for Prometheus pods
            cpulimit: 500m                        # Optional: Maximum CPU allocation in millicores
            cpurequest: 100m                      # Optional: Minimum CPU allocation in millicores
            memorylimit: 512Mi                    # Optional: Maximum memory allocation in mebibytes
            memoryrequest: 256Mi                  # Optional: Minimum memory allocation in mebibytes
          storage:                                # Storage configuration for metrics data
            size: 5Gi                             # Required: Storage size for Prometheus data
            retention: 90d                        # Required: Retention period for metrics data in days
          exporters: {}                           # External metrics exporters
        traces:                                   # Tempo backend for distributed tracing
          sampleRatio: '0.1'                      # Optional: Portion of traces to sample, expressed as a decimal
          storage:                                # Storage configuration for trace data
            backend: pv                           # Required: Storage backend for Tempo traces (pv, s3, or gcs)
            retention: 2160h                      # Optional: Retention period for trace data in hours
          exporters: {}                           # External traces exporters
    # ...

  8. Click Save to apply your changes.
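
    Alternatively, you can enable monitoring from the command line. The following oc patch command is a minimal sketch that assumes the default instance name default-dsci and sets only the required fields; use the YAML editor to configure the full metrics and traces settings:

      $ oc patch dscinitialization default-dsci --type merge -p '{"spec":{"monitoring":{"managementState":"Managed","namespace":"redhat-ods-monitoring"}}}'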

Verification

Verify that the observability stack components are running in the configured namespace:

  1. In the OpenShift web console, click Workloads → Pods.
  2. From the project list, select redhat-ods-monitoring.
  3. Confirm that there are running pods for your configuration. The following pods indicate that the observability stack is active:

    alertmanager-data-science-monitoringstack-#      2/2   Running   0   1m
    data-science-collector-collector-#               1/1   Running   0   1m
    prometheus-data-science-monitoringstack-#        2/2   Running   0   1m
    tempo-data-science-tempomonolithic-#             1/1   Running   0   1m
    thanos-querier-data-science-thanos-querier-#     2/2   Running   0   1m
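
    You can produce the same listing from the command line:

    $ oc get pods -n redhat-ods-monitoring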

Next step

  • Collecting metrics from user workloads

11.2. Collecting metrics from user workloads

After a cluster administrator enables the observability stack in your cluster, metric collection becomes available but is not automatically active for all deployed workloads. The monitoring system relies on a specific label to identify which pods Prometheus should scrape for metrics.

To include a workload, such as a user-created workbench, training job, or inference service, in the centralized observability stack, add the label monitoring.opendatahub.io/scrape=true to the pod template in the workload’s deployment configuration. This ensures that all pods created by the deployment include the label and are automatically scraped by Prometheus.
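
As an alternative to editing the Deployment YAML in the web console (described in the procedure that follows), you can apply the same label from the command line. The following command is a sketch that uses the placeholder names <example_name> and <example_namespace>:

    $ oc patch deployment <example_name> -n <example_namespace> --type merge -p '{"spec":{"template":{"metadata":{"labels":{"monitoring.opendatahub.io/scrape":"true"}}}}}'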

Note

Apply the monitoring.opendatahub.io/scrape=true label only to workloads that expose metrics and that you want the observability stack to monitor. Do not add this label to operator-managed workloads, because the operator might overwrite or remove it during reconciliation.

Prerequisites

  • A cluster administrator has enabled the observability stack as described in Enabling the observability stack.
  • You have OpenShift AI administrator privileges or you are the project owner.
  • You have deployed a workload that exposes a /metrics endpoint, such as a workbench server or model service pod.
  • You have access to the project where the workload is running.

Procedure

  1. Log in to the OpenShift web console as a cluster administrator or project owner.
  2. Click Workloads → Deployments.
  3. In the Project list at the top of the page, select the project where your workload is deployed.
  4. Identify the deployment that you want to collect metrics from and click its name.
  5. On the Deployment details page, click the YAML tab.
  6. In the YAML editor, add the required label under the spec.template.metadata.labels section, as shown in the following example:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: <example_name>
      namespace: <example_namespace>
    spec:
      template:
        metadata:
          labels:
            monitoring.opendatahub.io/scrape: 'true'
    # ...
  7. Click Save.

    OpenShift automatically rolls out a new ReplicaSet and pods with the updated label. When the new pods start, the observability stack begins scraping their metrics.
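
    To confirm that the new pods carry the label, you can list them by label selector; replace <project_name> with the name of your project:

    $ oc get pods -n <project_name> -l monitoring.opendatahub.io/scrape=true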

Verification

Verify that metrics are being collected by accessing the Prometheus instance deployed by OpenShift AI.

  1. Access Prometheus by using a route:

    1. In the OpenShift web console, click Networking → Routes.
    2. From the project list, select redhat-ods-monitoring.
    3. Locate the route associated with the Prometheus service, such as data-science-prometheus-route.
    4. Click the Location URL to open the Prometheus web console.
  2. Alternatively, access Prometheus locally by using port forwarding:

    1. List the Prometheus pods:

      $ oc get pods -n redhat-ods-monitoring -l prometheus=data-science-monitoringstack
    2. Start port forwarding:

      $ oc port-forward <prometheus-pod-name> 9090:9090 -n redhat-ods-monitoring
    3. In a web browser, open the following URL:

      http://localhost:9090
  3. In the Prometheus web console, search for a metric exposed by your workload.

    If the label is applied correctly and the workload exposes metrics, the metrics appear in the Prometheus instance deployed by OpenShift AI.
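
    For example, the built-in up metric reports a value of 1 for each target that Prometheus scrapes successfully. The following query is a sketch; the exact label names depend on how your workload is discovered:

    up{namespace="<project_name>"}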

11.3. Exporting metrics to an external observability platform

You can export OpenShift AI operational metrics to an external observability platform, such as Grafana, Prometheus, or any OpenTelemetry-compatible backend. This allows you to visualize and monitor OpenShift AI metrics alongside data from other systems in your existing observability environment.

Metrics export is configured in the DSCInitialization (DSCI) custom resource by populating the .spec.monitoring.metrics.exporters field. When you define one or more exporters in this field, the OpenTelemetry Collector (OTC) deployed by OpenShift AI automatically updates its configuration to include each exporter in its metrics pipeline. If this field is empty or undefined, metrics are collected only by the in-cluster Prometheus instance that is deployed with OpenShift AI.

Prerequisites

  • You have cluster administrator privileges for your OpenShift cluster.
  • The observability stack is enabled as described in Enabling the observability stack.
  • The external observability platform can receive metrics through a supported export protocol.
  • You know the URL of your external metrics receiver endpoint.

Procedure

  1. Log in to the OpenShift web console as a cluster administrator.
  2. Click Operators → Installed Operators.
  3. Select the Red Hat OpenShift AI Operator from the list.
  4. Click the DSCInitialization tab.
  5. Click the default DSCI instance, for example, default-dsci, to open its details page.
  6. Click the YAML tab.
  7. In the spec.monitoring.metrics section, add an exporters list that defines the external receiver configuration, as shown in the following example:

    spec:
      monitoring:
        metrics:
          exporters:
            - name: <external_exporter_name>
              type: <type>
              endpoint: https://example-otlp-receiver.example.com:4317
    • name: A unique, descriptive name for the exporter configuration. Do not use reserved names such as prometheus or otlp/tempo.
    • type: The protocol used for export, for example:

      • otlp: For OpenTelemetry-compatible backends using gRPC or HTTP.
      • prometheusremotewrite: For Prometheus-compatible systems that use the remote write protocol.
    • endpoint: The full URL of your external metrics receiver. For OTLP, endpoints typically use port 4317 (gRPC) or 4318 (HTTP). For Prometheus remote write, endpoints typically end with /api/v1/write. For example:

      • otlp: https://example-otlp-receiver.example.com:4317 (gRPC) or https://example-otlp-receiver.example.com:4318 (HTTP)
      • prometheusremotewrite: https://example-prometheus-remote.example.com/api/v1/write
  8. Click Save.

    The OpenTelemetry Collector automatically reloads its configuration and begins forwarding metrics to the specified external endpoint.
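
    To see the collector configuration that the Operator generates, and to confirm that the collector picked up the new exporter, you can inspect the OpenTelemetryCollector custom resource and the collector logs. The resource and Deployment names in this sketch are inferred from the pod names shown in the verification and might differ in your cluster:

    $ oc get opentelemetrycollector data-science-collector -n redhat-ods-monitoring -o yaml
    $ oc logs deployment/data-science-collector-collector -n redhat-ods-monitoring --tail=20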

Verification

  1. Verify that the OpenTelemetry Collector pods restart and apply the new configuration:

    $ oc get pods -n redhat-ods-monitoring

    The data-science-collector-collector-* pods should restart and display a Running status.

  2. In your external observability platform, verify that new metrics from OpenShift AI appear in the metrics list or dashboard.

Note

If you remove the .spec.monitoring.metrics.exporters configuration from the DSCI, the OpenTelemetry Collector automatically reverts to collecting metrics only for the in-cluster Prometheus instance.

11.4. Viewing traces in external tracing platforms

When tracing is enabled in the DSCInitialization (DSCI) custom resource, OpenShift AI deploys the Red Hat build of Tempo as the tracing backend and the Red Hat build of OpenTelemetry Collector (OTC) to receive and route trace data.

To view and analyze traces outside of OpenShift AI, complete the following tasks:

  • Configure your instrumented applications to send traces to the OpenTelemetry Collector.
  • Connect your preferred visualization tool, such as Grafana or Jaeger, to the Tempo Query API.

Prerequisites

  • A cluster administrator has enabled tracing as part of the observability stack in the DSCI configuration.
  • You have access to the monitoring namespace, for example redhat-ods-monitoring.
  • You have network access or cluster administrator privileges to create a route or port forward from the cluster.
  • Your application is instrumented with an OpenTelemetry SDK or library to generate and export trace data.

Procedure

  1. Find the OpenTelemetry Collector endpoint.

    The OpenTelemetry Collector receives trace data from instrumented applications by using the OpenTelemetry Protocol (OTLP).

    1. In the OpenShift web console, navigate to Networking → Services.
    2. In the Project list, select the monitoring namespace, for example, redhat-ods-monitoring.
    3. Locate the Service named data-science-collector or a similar name associated with the OpenTelemetry Collector.
    4. Use the Service name or ClusterIP as the OTLP endpoint in your application configuration.

      Your application must export traces to one of the following ports on the collector service:

      • gRPC: 4317
      • HTTP: 4318

        Example environment variable (a workload-level Deployment sketch follows this procedure):

        OTEL_EXPORTER_OTLP_ENDPOINT=http://data-science-collector.redhat-ods-monitoring.svc.cluster.local:4318
        Note

        See the Red Hat build of OpenTelemetry documentation for details about configuring application instrumentation.

  2. Connect your visualization tool to the Tempo query service.

    You can use a visualization tool, such as Grafana or Jaeger, to query and display traces from the Red Hat build of Tempo deployed by OpenShift AI.

    1. In the OpenShift web console, navigate to Networking → Services.
    2. In the Project list, select the monitoring namespace, for example, redhat-ods-monitoring.
    3. Locate the Service named tempo-query or tempo-query-frontend.
    4. To make the service accessible to external tools, a cluster administrator must perform one of the following actions:

      • Create a route: Expose the Tempo Query service externally by creating an OpenShift route.
      • Use port forwarding: Temporarily forward a local port to the Tempo Query service by using the OpenShift CLI (oc):

        $ oc port-forward svc/tempo-query-frontend 3200:3200 -n redhat-ods-monitoring

        After the port is forwarded, connect your visualization tool to the Tempo Query API endpoint, for example:

        http://localhost:3200
        Note

        See the Tempo Operator documentation for details about connecting to Tempo.
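
The endpoint setting from step 1 can also be defined directly in a workload manifest. The following Deployment fragment is a sketch; the container name and image are hypothetical placeholders, and only the env entry is the relevant part:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: <example_name>
      namespace: <example_namespace>
    spec:
      template:
        spec:
          containers:
            - name: <example_container>   # Hypothetical container name
              image: <example_image>      # Hypothetical image reference
              env:
                - name: OTEL_EXPORTER_OTLP_ENDPOINT   # Endpoint variable read by OpenTelemetry SDK exporters
                  value: http://data-science-collector.redhat-ods-monitoring.svc.cluster.local:4318
    # ...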

Verification

  1. Confirm that your instrumented application is generating and exporting trace data.
  2. Verify that the OpenTelemetry Collector pod is running in the monitoring namespace:

    $ oc get pods -n redhat-ods-monitoring | grep collector

    The data-science-collector-collector-* pod should display a Running status.

  3. Access your visualization tool and confirm that new traces appear in the trace list or search view.

11.5. Accessing built-in alerts

The centralized observability stack deploys a Prometheus Alertmanager instance that provides a common set of built-in alerts for OpenShift AI components. These alerts monitor critical platform conditions, such as operator downtime, crashlooping pods, and unresponsive services.

By default, the Alertmanager is internal to the cluster and is not exposed through a route. You can access the Alertmanager web interface locally by using the OpenShift CLI (oc).

Prerequisites

  • You have OpenShift AI administrator privileges.
  • The observability stack is enabled as described in Enabling the observability stack.
  • You know the monitoring namespace, for example redhat-ods-monitoring.
  • You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster.

Procedure

  1. In a terminal window, log in to the OpenShift CLI (oc) as a cluster administrator:

    $ oc login https://api.198.51.100.10:6443
  2. Verify that the Alertmanager pods are running in the monitoring namespace:

    $ oc get pods -n redhat-ods-monitoring | grep alertmanager

    Example output:

    alertmanager-data-science-monitoringstack-0   2/2   Running   0   2h
    alertmanager-data-science-monitoringstack-1   2/2   Running   0   2h
  3. Confirm that a ClusterIP service exposes the Alertmanager web interface on port 9093:

    $ oc get svc -n redhat-ods-monitoring | grep alertmanager

    Example output:

    data-science-monitoringstack-alertmanager     ClusterIP   198.51.100.5   <none>   9093/TCP
  4. Start a local port forward to the Alertmanager service:

    $ oc port-forward svc/data-science-monitoringstack-alertmanager 9093:9093 -n redhat-ods-monitoring
  5. In a web browser, open the following URL to access the Alertmanager web interface:

    http://localhost:9093

Verification

  • Confirm that the Alertmanager web interface opens at http://localhost:9093 and displays active alerts for OpenShift AI components.
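
While the port forward from the procedure is running, you can also list active alerts through the Alertmanager v2 API, for example:

    $ curl -s http://localhost:9093/api/v2/alerts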