Chapter 11. Managing observability
Red Hat OpenShift AI provides centralized platform observability: an integrated, out-of-the-box solution for monitoring the health and performance of your OpenShift AI instance and user workloads.
This centralized solution includes a dedicated, pre-configured observability stack, featuring the OpenTelemetry Collector (OTC) for standardized data ingestion, Prometheus for metrics, and the Red Hat build of Tempo for distributed tracing. This architecture enables a common set of health metrics and alerts for OpenShift AI components and offers mechanisms to integrate with your existing external observability tools.
This feature is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
11.1. Enabling the observability stack
The observability stack collects and correlates metrics, traces, and alerts for OpenShift AI so that you can monitor, troubleshoot, and optimize OpenShift AI components. A cluster administrator must explicitly enable this capability in the DSCInitialization (DSCI) custom resource.
Once enabled, you can perform the following actions:
- Accelerate troubleshooting by viewing metrics, traces, and alerts for OpenShift AI components in one place.
- Maintain platform stability by monitoring health and resource usage and receiving alerts for critical issues.
- Integrate with existing tools by exporting telemetry to third-party observability solutions through the Red Hat build of OpenTelemetry.
This feature is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- You have installed Red Hat OpenShift AI.
- You have installed the following Operators, which provide the components of the observability stack:
- Cluster Observability Operator: Deploys and manages Prometheus and Alertmanager for metrics and alerts.
- Tempo Operator: Provides the Tempo backend for distributed tracing.
- Red Hat build of OpenTelemetry: Deploys the OpenTelemetry Collector for collecting and exporting telemetry data.
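You typically install these Operators from OperatorHub in the OpenShift web console. As an illustration only, the following minimal sketch shows a Subscription for the Tempo Operator; the package name (tempo-product), channel, catalog source, and target namespace are assumptions that you should confirm against the Operator documentation for your cluster:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: tempo-product                  # assumed package name for the Tempo Operator
  namespace: openshift-operators       # assumes the default global Operator group
spec:
  channel: stable                      # confirm the supported channel for your release
  name: tempo-product
  source: redhat-operators
  sourceNamespace: openshift-marketplace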
Procedure
- Log in to the OpenShift web console as a cluster administrator.
- In the OpenShift console, click Operators → Installed Operators.
- Search for the Red Hat OpenShift AI Operator, and then click the Operator name to open the Operator details page.
- Click the DSCInitialization tab.
- Click the default instance name (for example, default-dsci) to open the instance details page.
- Click the YAML tab to show the instance specifications.
- In the spec.monitoring section, set the value of the managementState field to Managed, and configure metrics, alerting, and tracing settings as shown in the following example:

Example monitoring configuration
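The following is a minimal sketch of such a configuration. The managementState and namespace fields are standard DSCInitialization settings; the metrics, alerting, and traces blocks are illustrative assumptions, so confirm the exact field names and values for your OpenShift AI release before applying them:

apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
  name: default-dsci
spec:
  monitoring:
    managementState: Managed
    namespace: redhat-ods-monitoring   # namespace where the observability stack is deployed
    # The blocks below are illustrative assumptions; verify the schema for your release.
    metrics:
      storage:
        size: 5Gi        # persistent storage for Prometheus metrics
        retention: 1d    # how long metrics are kept
    alerting: {}         # enables the built-in Alertmanager and default alerts
    traces:
      storage:
        backend: pv      # persistent volume backend for the Tempo instance
        size: 5Gi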
- Click Save to apply your changes.
Verification
Verify that the observability stack components are running in the configured namespace:
- In the OpenShift web console, click Workloads → Pods.
- From the project list, select redhat-ods-monitoring.
Confirm that there are running pods for your configuration. The following pods indicate that the observability stack is active:
alertmanager-data-science-monitoringstack-#     2/2   Running   0   1m
data-science-collector-collector-#              1/1   Running   0   1m
prometheus-data-science-monitoringstack-#       2/2   Running   0   1m
tempo-data-science-tempomonolithic-#            1/1   Running   0   1m
thanos-querier-data-science-thanos-querier-#    2/2   Running   0   1m
Next step
- Collecting metrics from user workloads
11.2. Collecting metrics from user workloads
After a cluster administrator enables the observability stack in your cluster, metric collection becomes available but is not automatically active for all deployed workloads. The monitoring system relies on a specific label to identify which pods Prometheus should scrape for metrics.
To include a workload, such as a user-created workbench, training job, or inference service, in the centralized observability stack, add the label monitoring.opendatahub.io/scrape=true to the pod template in the workload’s deployment configuration. This ensures that all pods created by the deployment include the label and are automatically scraped by Prometheus.
Apply the monitoring.opendatahub.io/scrape=true label only to workloads that expose metrics and that you want the observability stack to monitor. Do not add this label to operator-managed workloads, because the operator might overwrite or remove it during reconciliation.
Prerequisites
- A cluster administrator has enabled the observability stack as described in Enabling the observability stack.
- You have OpenShift AI administrator privileges or you are the project owner.
- You have deployed a workload that exposes a /metrics endpoint, such as a workbench server or model service pod.
- You have access to the project where the workload is running.
Procedure
- Log in to the OpenShift web console as a cluster administrator or project owner.
- Click Workloads → Deployments.
- In the Project list at the top of the page, select the project where your workload is deployed.
- Identify the deployment that you want to collect metrics from and click its name.
- On the Deployment details page, click the YAML tab.
- In the YAML editor, add the required label under the spec.template.metadata.labels section, as shown in the following example:
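A minimal sketch of the relevant part of a Deployment follows; the Deployment, container, and image names are hypothetical, and only the monitoring.opendatahub.io/scrape label is required for scraping:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-workbench                               # hypothetical workload name
spec:
  selector:
    matchLabels:
      app: my-workbench
  template:
    metadata:
      labels:
        app: my-workbench                          # keep any existing labels
        monitoring.opendatahub.io/scrape: "true"   # enables scraping by the observability stack
    spec:
      containers:
      - name: workbench                            # hypothetical container name
        image: example.com/my-workbench:latest     # hypothetical image
        ports:
        - containerPort: 8080                      # port where /metrics is exposed (hypothetical)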
- Click Save.

OpenShift automatically rolls out a new ReplicaSet and pods with the updated label. When the new pods start, the observability stack begins scraping their metrics.
Verification
Verify that metrics are being collected by accessing the Prometheus instance deployed by OpenShift AI.
Access Prometheus by using a route:
- In the OpenShift web console, click Networking → Routes.
- From the project list, select redhat-ods-monitoring.
- Locate the route associated with the Prometheus service, such as data-science-prometheus-route.
- Click the Location URL to open the Prometheus web console.
Alternatively, access Prometheus locally by using port forwarding:
List the Prometheus pods:
$ oc get pods -n redhat-ods-monitoring -l prometheus=data-science-monitoringstack

Start port forwarding:
$ oc port-forward <prometheus-pod-name> 9090:9090 -n redhat-ods-monitoring

In a web browser, open the following URL:
http://localhost:9090
In the Prometheus web console, search for a metric exposed by your workload.
If the label is applied correctly and the workload exposes metrics, the metrics appear in the Prometheus instance deployed by OpenShift AI.
11.3. Exporting metrics to external observability tools
You can export OpenShift AI operational metrics to an external observability platform, such as Grafana, Prometheus, or any OpenTelemetry-compatible backend. This allows you to visualize and monitor OpenShift AI metrics alongside data from other systems in your existing observability environment.
Metrics export is configured in the DSCInitialization (DSCI) custom resource by populating the .spec.monitoring.metrics.exporters field. When you define one or more exporters in this field, the OpenTelemetry Collector (OTC) deployed by OpenShift AI automatically updates its configuration to include each exporter in its metrics pipeline. If this field is empty or undefined, metrics are collected only by the in-cluster Prometheus instance that is deployed with OpenShift AI.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- The observability stack is enabled as described in Enabling the observability stack.
- The external observability platform can receive metrics through a supported export protocol.
- You know the URL of your external metrics receiver endpoint.
Procedure
- Log in to the OpenShift web console as a cluster administrator.
- Click Operators → Installed Operators.
- Select the Red Hat OpenShift AI Operator from the list.
- Click the DSCInitialization tab.
- Click the default DSCI instance, for example, default-dsci, to open its details page.
- Click the YAML tab.
- In the spec.monitoring.metrics section, add an exporters list that defines the external receiver configuration, as shown in the following example:
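The following is a minimal sketch with one exporter of each supported type. The exporter names and endpoint URLs are placeholders, and the sketch assumes that each entry in the exporters list is an object with the name, type, and endpoint fields described below; verify the exact schema for your release:

spec:
  monitoring:
    managementState: Managed
    metrics:
      exporters:
      - name: my-otlp-backend           # placeholder name; avoid reserved names
        type: otlp
        endpoint: https://example-otlp-receiver.example.com:4317
      - name: my-prometheus-remote      # placeholder name
        type: prometheusremotewrite
        endpoint: https://example-prometheus-remote.example.com/api/v1/write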
- name: A unique, descriptive name for the exporter configuration. Do not use reserved names such as prometheus or otlp/tempo.
- type: The protocol used for export, for example:
  - otlp: For OpenTelemetry-compatible backends using gRPC or HTTP.
  - prometheusremotewrite: For Prometheus-compatible systems that use the remote write protocol.
- endpoint: The full URL of your external metrics receiver. For OTLP, endpoints typically use port 4317 (gRPC) or 4318 (HTTP). For Prometheus remote write, endpoints typically end with /api/v1/write. For example:
  - otlp: https://example-otlp-receiver.example.com:4317 (gRPC) or https://example-otlp-receiver.example.com:4318 (HTTP)
  - prometheusremotewrite: https://example-prometheus-remote.example.com/api/v1/write
- Click Save.
The OpenTelemetry Collector automatically reloads its configuration and begins forwarding metrics to the specified external endpoint.
Verification
Verify that the OpenTelemetry Collector pods restart and apply the new configuration:
$ oc get pods -n redhat-ods-monitoring

The data-science-collector-collector-* pods should restart and display a Running status.
- In your external observability platform, verify that new metrics from OpenShift AI appear in the metrics list or dashboard.
If you remove the .spec.monitoring.metrics.exporters configuration from the DSCI, the OpenTelemetry Collector automatically reverts to collecting metrics only for the in-cluster Prometheus instance.
11.4. Viewing traces in external tracing platforms
When tracing is enabled in the DSCInitialization (DSCI) custom resource, OpenShift AI deploys the Red Hat build of Tempo as the tracing backend and the Red Hat build of OpenTelemetry Collector (OTC) to receive and route trace data.
To view and analyze traces outside of OpenShift AI, complete the following tasks:
- Configure your instrumented applications to send traces to the OpenTelemetry Collector.
- Connect your preferred visualization tool, such as Grafana or Jaeger, to the Tempo Query API.
Prerequisites
- A cluster administrator has enabled tracing as part of the observability stack in the DSCI configuration.
- You have access to the monitoring namespace, for example redhat-ods-monitoring.
- You have network access or cluster administrator privileges to create a route or port forward from the cluster.
- Your application is instrumented with an OpenTelemetry SDK or library to generate and export trace data.
Procedure
Find the OpenTelemetry Collector endpoint.
The OpenTelemetry Collector receives trace data from instrumented applications by using the OpenTelemetry Protocol (OTLP).
- In the OpenShift web console, navigate to Networking → Services.
- In the Project list, select the monitoring namespace, for example, redhat-ods-monitoring.
- Locate the Service named data-science-collector or a similar name associated with the OpenTelemetry Collector.
- Use the Service name or ClusterIP as the OTLP endpoint in your application configuration.
Your application must export traces to one of the following ports on the collector service:
- gRPC: 4317
- HTTP: 4318

Example environment variable:

OTEL_EXPORTER_OTLP_ENDPOINT=http://data-science-collector.redhat-ods-monitoring.svc.cluster.local:4318
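If your workload runs in the cluster, one way to set this variable is in the workload's Deployment. The following is a minimal sketch with hypothetical workload, container, and image names:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-instrumented-app                  # hypothetical workload name
spec:
  selector:
    matchLabels:
      app: my-instrumented-app
  template:
    metadata:
      labels:
        app: my-instrumented-app
    spec:
      containers:
      - name: app                             # hypothetical container name
        image: example.com/my-app:latest      # hypothetical image
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT   # read by OpenTelemetry SDKs at startup
          value: http://data-science-collector.redhat-ods-monitoring.svc.cluster.local:4318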
Note: See the Red Hat build of OpenTelemetry documentation for details about configuring application instrumentation.
Connect your visualization tool to the Tempo query service.
You can use a visualization tool, such as Grafana or Jaeger, to query and display traces from the Red Hat build of Tempo deployed by OpenShift AI.
- In the OpenShift web console, navigate to Networking → Services.
- In the Project list, select the monitoring namespace, for example, redhat-ods-monitoring.
- Locate the Service named tempo-query or tempo-query-frontend.
- To make the service accessible to external tools, a cluster administrator must perform one of the following actions:
- Create a route: Expose the Tempo Query service externally by creating an OpenShift route.
- Use port forwarding: Temporarily forward a local port to the Tempo Query service by using the OpenShift CLI (oc):

$ oc port-forward svc/tempo-query-frontend 3200:3200 -n redhat-ods-monitoring

After the port is forwarded, connect your visualization tool to the Tempo Query API endpoint, for example:
http://localhost:3200
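For example, if you use Grafana, a data source provisioning file similar to the following minimal sketch can point Grafana at that endpoint. The data source name is arbitrary, and the URL assumes the local port forward from the previous step; use your route URL instead if you created one:

apiVersion: 1
datasources:
- name: OpenShift AI Tempo       # arbitrary display name
  type: tempo                    # Grafana's built-in Tempo data source type
  access: proxy
  url: http://localhost:3200     # forwarded Tempo Query API endpoint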
Note: See the Tempo Operator documentation for details about connecting to Tempo.
Verification
- Confirm that your instrumented application is generating and exporting trace data.
Verify that the OpenTelemetry Collector pod is running in the monitoring namespace:
$ oc get pods -n redhat-ods-monitoring | grep collector

The data-science-collector-collector-* pod should display a Running status.
- Access your visualization tool and confirm that new traces appear in the trace list or search view.
11.5. Accessing built-in alerts
The centralized observability stack deploys a Prometheus Alertmanager instance that provides a common set of built-in alerts for OpenShift AI components. These alerts monitor critical platform conditions, such as operator downtime, crashlooping pods, and unresponsive services.
By default, the Alertmanager is internal to the cluster and is not exposed through a route. You can access the Alertmanager web interface locally by using the OpenShift CLI (oc).
Prerequisites
- You have OpenShift AI administrator privileges.
- The observability stack is enabled as described in Enabling the observability stack.
- You know the monitoring namespace, for example redhat-ods-monitoring.
- You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster:
  - Installing the OpenShift CLI for OpenShift Container Platform
  - Installing the OpenShift CLI for Red Hat OpenShift Service on AWS
Procedure
- In a terminal window, log in to the OpenShift CLI (oc) as a cluster administrator:

$ oc login https://api.198.51.100.10:6443

- Verify that the Alertmanager pods are running in the monitoring namespace:
$ oc get pods -n redhat-ods-monitoring | grep alertmanager

Example output:

alertmanager-data-science-monitoringstack-0   2/2   Running   0   2h
alertmanager-data-science-monitoringstack-1   2/2   Running   0   2h

- Confirm that a ClusterIP service exposes the Alertmanager web interface on port 9093:
$ oc get svc -n redhat-ods-monitoring | grep alertmanager

Example output:

data-science-monitoringstack-alertmanager   ClusterIP   198.51.100.5   <none>   9093/TCP

- Start a local port forward to the Alertmanager service:
$ oc port-forward svc/data-science-monitoringstack-alertmanager 9093:9093 -n redhat-ods-monitoring

- In a web browser, open the following URL to access the Alertmanager web interface:

http://localhost:9093
Verification
- Confirm that the Alertmanager web interface opens at http://localhost:9093 and displays active alerts for OpenShift AI components.