Chapter 11. Managing observability


Red Hat OpenShift AI provides centralized platform observability: an integrated, out-of-the-box solution for monitoring the health and performance of your OpenShift AI instance and user workloads.

This centralized solution includes a dedicated, pre-configured observability stack, featuring the OpenTelemetry Collector (OTC) for standardized data ingestion, Prometheus for metrics, and the Red Hat build of Tempo for distributed tracing. This architecture enables a common set of health metrics and alerts for OpenShift AI components and offers mechanisms to integrate with your existing external observability tools.

Important

This feature is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

11.1. Enabling the observability stack

The observability stack collects and correlates metrics, traces, and alerts for OpenShift AI so that you can monitor, troubleshoot, and optimize its components. A cluster administrator must explicitly enable this capability in the DSCInitialization (DSCI) custom resource.

After you enable the observability stack, you can perform the following actions:

  • Accelerate troubleshooting by viewing metrics, traces, and alerts for OpenShift AI components in one place.
  • Maintain platform stability by monitoring health and resource usage and receiving alerts for critical issues.
  • Integrate with existing tools by exporting telemetry to third-party observability solutions through the Red Hat build of OpenTelemetry.

Important

This feature is currently available in Red Hat OpenShift AI 2.25 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Prerequisites

  • You have cluster administrator privileges for your OpenShift cluster.
  • You have installed Red Hat OpenShift AI.
  • You have installed the following Operators, which provide the components of the observability stack:

    • Cluster Observability Operator: Deploys and manages Prometheus and Alertmanager for metrics and alerts.
    • Tempo Operator: Provides the Tempo backend for distributed tracing.
    • Red Hat build of OpenTelemetry: Deploys the OpenTelemetry Collector for collecting and exporting telemetry data.
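
    You can optionally confirm from the CLI that these Operators are installed by listing their ClusterServiceVersions. This is a sketch; the exact CSV names and versions vary by release and installation namespace:

      $ oc get csv -n openshift-operators | grep -E 'observability|tempo|opentelemetry'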

Procedure

  1. Log in to the OpenShift web console as a cluster administrator.
  2. In the OpenShift web console, click Operators → Installed Operators.
  3. Search for the Red Hat OpenShift AI Operator, and then click the Operator name to open the Operator details page.
  4. Click the DSCInitialization tab.
  5. Click the default instance name (for example, default-dsci) to open the instance details page.
  6. Click the YAML tab to show the instance specifications.
  7. In the spec.monitoring section, set the value of the managementState field to Managed, and configure metrics, alerting, and tracing settings as shown in the following example:

    Example monitoring configuration

    # ...
    spec:
      monitoring:
        managementState: Managed                 # Required: Enables and manages the observability stack
        namespace: redhat-ods-monitoring    # Required: Namespace where monitoring components are deployed
        alerting: {}                              # Alertmanager configuration, uses default settings if empty
        metrics:                                  # Prometheus configuration for metrics collection
          replicas: 1                             # Optional: Number of Prometheus instances
          resources:                              # CPU and memory requests and limits for Prometheus pods
            cpulimit: 500m                        # Optional: Maximum CPU allocation in millicores
            cpurequest: 100m                      # Optional: Minimum CPU allocation in millicores
            memorylimit: 512Mi                    # Optional: Maximum memory allocation in mebibytes
            memoryrequest: 256Mi                  # Optional: Minimum memory allocation in mebibytes
          storage:                                # Storage configuration for metrics data
            size: 5Gi                             # Required: Storage size for Prometheus data
            retention: 90d                        # Required: Retention period for metrics data in days
          exporters: {}                           # External metrics exporters
        traces:                                   # Tempo backend for distributed tracing
          sampleRatio: '0.1'                      # Optional: Portion of traces to sample, expressed as a decimal
          storage:                                # Storage configuration for trace data
            backend: pv                           # Required: Storage backend for Tempo traces (pv, s3, or gcs)
            retention: 2160h                      # Optional: Retention period for trace data in hours
          exporters: {}                           # External traces exporters
    # ...

  8. Click Save to apply your changes.
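
    Alternatively, you can enable monitoring from the command line. The following oc patch command is a minimal sketch that assumes the default instance name default-dsci and sets only the required fields; use the YAML editor to configure the full metrics and traces settings:

      $ oc patch dscinitialization default-dsci --type merge -p '{"spec":{"monitoring":{"managementState":"Managed","namespace":"redhat-ods-monitoring"}}}'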

Verification

Verify that the observability stack components are running in the configured namespace:

  1. In the OpenShift web console, click Workloads → Pods.
  2. From the project list, select redhat-ods-monitoring.
  3. Confirm that there are running pods for your configuration. The following pods indicate that the observability stack is active:

    alertmanager-data-science-monitoringstack-#      2/2   Running   0   1m
    data-science-collector-collector-#               1/1   Running   0   1m
    prometheus-data-science-monitoringstack-#        2/2   Running   0   1m
    tempo-data-science-tempomonolithic-#             1/1   Running   0   1m
    thanos-querier-data-science-thanos-querier-#     2/2   Running   0   1m
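
    You can produce the same listing from the command line:

    $ oc get pods -n redhat-ods-monitoring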

Next step

  • Collecting metrics from user workloads

11.2. Collecting metrics from user workloads

After a cluster administrator enables the observability stack in your cluster, metric collection becomes available but is not automatically active for all deployed workloads. The monitoring system relies on a specific label to identify which pods Prometheus should scrape for metrics.

To include a workload, such as a user-created workbench, training job, or inference service, in the centralized observability stack, add the label monitoring.opendatahub.io/scrape=true to the pod template in the workload’s deployment configuration. This ensures that all pods created by the deployment include the label and are automatically scraped by Prometheus.
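
As an alternative to editing the Deployment YAML in the web console (described in the procedure that follows), you can apply the same label from the command line. The following command is a sketch that uses the placeholder names <example_name> and <example_namespace>:

    $ oc patch deployment <example_name> -n <example_namespace> --type merge -p '{"spec":{"template":{"metadata":{"labels":{"monitoring.opendatahub.io/scrape":"true"}}}}}'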

Note

Apply the monitoring.opendatahub.io/scrape=true label only to workloads that expose metrics and that you want the observability stack to monitor. Do not add this label to operator-managed workloads, because the operator might overwrite or remove it during reconciliation.

Prerequisites

  • A cluster administrator has enabled the observability stack as described in Enabling the observability stack.
  • You have OpenShift AI administrator privileges or you are the project owner.
  • You have deployed a workload that exposes a /metrics endpoint, such as a workbench server or model service pod.
  • You have access to the project where the workload is running.

Procedure

  1. Log in to the OpenShift web console as a cluster administrator or project owner.
  2. Click Workloads → Deployments.
  3. In the Project list at the top of the page, select the project where your workload is deployed.
  4. Identify the deployment that you want to collect metrics from and click its name.
  5. On the Deployment details page, click the YAML tab.
  6. In the YAML editor, add the required label under the spec.template.metadata.labels section, as shown in the following example:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: <example_name>
      namespace: <example_namespace>
    spec:
      template:
        metadata:
          labels:
            monitoring.opendatahub.io/scrape: 'true'
    # ...
  7. Click Save.

    OpenShift automatically rolls out a new ReplicaSet and pods with the updated label. When the new pods start, the observability stack begins scraping their metrics.
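
    To confirm that the new pods carry the label, you can list them by label selector; replace <project_name> with the name of your project:

    $ oc get pods -n <project_name> -l monitoring.opendatahub.io/scrape=true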

Verification

Verify that metrics are being collected by accessing the Prometheus instance deployed by OpenShift AI.

  1. Access Prometheus by using a route:

    1. In the OpenShift web console, click Networking → Routes.
    2. From the project list, select redhat-ods-monitoring.
    3. Locate the route associated with the Prometheus service, such as data-science-prometheus-route.
    4. Click the Location URL to open the Prometheus web console.
  2. Alternatively, access Prometheus locally by using port forwarding:

    1. List the Prometheus pods:

      $ oc get pods -n redhat-ods-monitoring -l prometheus=data-science-monitoringstack
    2. Start port forwarding:

      $ oc port-forward <prometheus-pod-name> 9090:9090 -n redhat-ods-monitoring
    3. In a web browser, open the following URL:

      http://localhost:9090
  3. In the Prometheus web console, search for a metric exposed by your workload.

    If the label is applied correctly and the workload exposes metrics, the metrics appear in the Prometheus instance deployed by OpenShift AI.
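
    For example, the built-in up metric reports a value of 1 for each target that Prometheus scrapes successfully. The following query is a sketch; the exact label names depend on how your workload is discovered:

    up{namespace="<project_name>"}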

11.3. Exporting metrics to an external observability platform

You can export OpenShift AI operational metrics to an external observability platform, such as Grafana, Prometheus, or any OpenTelemetry-compatible backend. This allows you to visualize and monitor OpenShift AI metrics alongside data from other systems in your existing observability environment.

Metrics export is configured in the DSCInitialization (DSCI) custom resource by populating the .spec.monitoring.metrics.exporters field. When you define one or more exporters in this field, the OpenTelemetry Collector (OTC) deployed by OpenShift AI automatically updates its configuration to include each exporter in its metrics pipeline. If this field is empty or undefined, metrics are collected only by the in-cluster Prometheus instance that is deployed with OpenShift AI.

Prerequisites

  • You have cluster administrator privileges for your OpenShift cluster.
  • The observability stack is enabled as described in Enabling the observability stack.
  • The external observability platform can receive metrics through a supported export protocol.
  • You know the URL of your external metrics receiver endpoint.

Procedure

  1. Log in to the OpenShift web console as a cluster administrator.
  2. Click Operators → Installed Operators.
  3. Select the Red Hat OpenShift AI Operator from the list.
  4. Click the DSCInitialization tab.
  5. Click the default DSCI instance, for example, default-dsci, to open its details page.
  6. Click the YAML tab.
  7. In the spec.monitoring.metrics section, add an exporters list that defines the external receiver configuration, as shown in the following example:

    spec:
      monitoring:
        metrics:
          exporters:
            - name: <external_exporter_name>
              type: <type>
              endpoint: https://example-otlp-receiver.example.com:4317
    • name: A unique, descriptive name for the exporter configuration. Do not use reserved names such as prometheus or otlp/tempo.
    • type: The protocol used for export, for example:

      • otlp: For OpenTelemetry-compatible backends using gRPC or HTTP.
      • prometheusremotewrite: For Prometheus-compatible systems that use the remote write protocol.
    • endpoint: The full URL of your external metrics receiver. For OTLP, endpoints typically use port 4317 (gRPC) or 4318 (HTTP). For Prometheus remote write, endpoints typically end with /api/v1/write. For example:

      • otlp: https://example-otlp-receiver.example.com:4317 (gRPC) or https://example-otlp-receiver.example.com:4318 (HTTP)
      • prometheusremotewrite: https://example-prometheus-remote.example.com/api/v1/write
  8. Click Save.

    The OpenTelemetry Collector automatically reloads its configuration and begins forwarding metrics to the specified external endpoint.
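
    To see the collector configuration that the Operator generates, and to confirm that the collector picked up the new exporter, you can inspect the OpenTelemetryCollector custom resource and the collector logs. The resource and Deployment names in this sketch are inferred from the pod names shown in the verification and might differ in your cluster:

    $ oc get opentelemetrycollector data-science-collector -n redhat-ods-monitoring -o yaml
    $ oc logs deployment/data-science-collector-collector -n redhat-ods-monitoring --tail=20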

Verification

  1. Verify that the OpenTelemetry Collector pods restart and apply the new configuration:

    $ oc get pods -n redhat-ods-monitoring

    The data-science-collector-collector-* pods should restart and display a Running status.

  2. In your external observability platform, verify that new metrics from OpenShift AI appear in the metrics list or dashboard.

Note

If you remove the .spec.monitoring.metrics.exporters configuration from the DSCI, the OpenTelemetry Collector automatically reverts to collecting metrics only for the in-cluster Prometheus instance.

11.4. Viewing traces in external tracing platforms

When tracing is enabled in the DSCInitialization (DSCI) custom resource, OpenShift AI deploys the Red Hat build of Tempo as the tracing backend and the Red Hat build of OpenTelemetry Collector (OTC) to receive and route trace data.

To view and analyze traces outside of OpenShift AI, complete the following tasks:

  • Configure your instrumented applications to send traces to the OpenTelemetry Collector.
  • Connect your preferred visualization tool, such as Grafana or Jaeger, to the Tempo Query API.

Prerequisites

  • A cluster administrator has enabled tracing as part of the observability stack in the DSCI configuration.
  • You have access to the monitoring namespace, for example redhat-ods-monitoring.
  • You have network access or cluster administrator privileges to create a route or port forward from the cluster.
  • Your application is instrumented with an OpenTelemetry SDK or library to generate and export trace data.

Procedure

  1. Find the OpenTelemetry Collector endpoint.

    The OpenTelemetry Collector receives trace data from instrumented applications by using the OpenTelemetry Protocol (OTLP).

    1. In the OpenShift web console, navigate to Networking → Services.
    2. In the Project list, select the monitoring namespace, for example, redhat-ods-monitoring.
    3. Locate the Service named data-science-collector or a similar name associated with the OpenTelemetry Collector.
    4. Use the Service name or ClusterIP as the OTLP endpoint in your application configuration.

      Your application must export traces to one of the following ports on the collector service:

      • gRPC: 4317
      • HTTP: 4318

        Example environment variable (a workload-level Deployment sketch follows this procedure):

        OTEL_EXPORTER_OTLP_ENDPOINT=http://data-science-collector.redhat-ods-monitoring.svc.cluster.local:4318
        Note

        See the Red Hat build of OpenTelemetry documentation for details about configuring application instrumentation.

  2. Connect your visualization tool to the Tempo query service.

    You can use a visualization tool, such as Grafana or Jaeger, to query and display traces from the Red Hat build of Tempo deployed by OpenShift AI.

    1. In the OpenShift web console, navigate to Networking → Services.
    2. In the Project list, select the monitoring namespace, for example, redhat-ods-monitoring.
    3. Locate the Service named tempo-query or tempo-query-frontend.
    4. To make the service accessible to external tools, a cluster administrator must perform one of the following actions:

      • Create a route: Expose the Tempo Query service externally by creating an OpenShift route.
      • Use port forwarding: Temporarily forward a local port to the Tempo Query service by using the OpenShift CLI (oc):

        $ oc port-forward svc/tempo-query-frontend 3200:3200 -n redhat-ods-monitoring

        After the port is forwarded, connect your visualization tool to the Tempo Query API endpoint, for example:

        http://localhost:3200
        Note

        See the Tempo Operator documentation for details about connecting to Tempo.
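
The endpoint setting from step 1 can also be defined directly in a workload manifest. The following Deployment fragment is a sketch; the container name and image are hypothetical placeholders, and only the env entry is the relevant part:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: <example_name>
      namespace: <example_namespace>
    spec:
      template:
        spec:
          containers:
            - name: <example_container>   # Hypothetical container name
              image: <example_image>      # Hypothetical image reference
              env:
                - name: OTEL_EXPORTER_OTLP_ENDPOINT   # Endpoint variable read by OpenTelemetry SDK exporters
                  value: http://data-science-collector.redhat-ods-monitoring.svc.cluster.local:4318
    # ...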

Verification

  1. Confirm that your instrumented application is generating and exporting trace data.
  2. Verify that the OpenTelemetry Collector pod is running in the monitoring namespace:

    $ oc get pods -n redhat-ods-monitoring | grep collector

    The data-science-collector-collector-* pod should display a Running status.

  3. Access your visualization tool and confirm that new traces appear in the trace list or search view.

11.5. Accessing built-in alerts

The centralized observability stack deploys a Prometheus Alertmanager instance that provides a common set of built-in alerts for OpenShift AI components. These alerts monitor critical platform conditions, such as operator downtime, crashlooping pods, and unresponsive services.

By default, the Alertmanager is internal to the cluster and is not exposed through a route. You can access the Alertmanager web interface locally by using the OpenShift CLI (oc).

Prerequisites

  • You have OpenShift AI administrator privileges.
  • The observability stack is enabled as described in Enabling the observability stack.
  • You know the monitoring namespace, for example redhat-ods-monitoring.
  • You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster.

Procedure

  1. In a terminal window, log in to the OpenShift CLI (oc) as a cluster administrator:

    $ oc login https://api.198.51.100.10:6443
  2. Verify that the Alertmanager pods are running in the monitoring namespace:

    $ oc get pods -n redhat-ods-monitoring | grep alertmanager

    Example output:

    alertmanager-data-science-monitoringstack-0   2/2   Running   0   2h
    alertmanager-data-science-monitoringstack-1   2/2   Running   0   2h
  3. Confirm that a ClusterIP service exposes the Alertmanager web interface on port 9093:

    $ oc get svc -n redhat-ods-monitoring | grep alertmanager

    Example output:

    data-science-monitoringstack-alertmanager     ClusterIP   198.51.100.5   <none>   9093/TCP
  4. Start a local port forward to the Alertmanager service:

    $ oc port-forward svc/data-science-monitoringstack-alertmanager 9093:9093 -n redhat-ods-monitoring
  5. In a web browser, open the following URL to access the Alertmanager web interface:

    http://localhost:9093

Verification

  • Confirm that the Alertmanager web interface opens at http://localhost:9093 and displays active alerts for OpenShift AI components.
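
While the port forward from the procedure is running, you can also list active alerts through the Alertmanager v2 API, for example:

    $ curl -s http://localhost:9093/api/v2/alerts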