Chapter 2. Customizing observability

Review the following sections to learn more about customizing, managing, and viewing data that is collected by the observability service.

Collect logs about new information that is created for observability resources with the must-gather command. For more information, see the Must-gather section in the Troubleshooting documentation.

2.1. Creating custom rules

Create custom rules for the observability installation by adding Prometheus recording rules and alerting rules to the observability resource.

  • Recording rules let you precompute expensive expressions as needed. The results are saved as a new set of time series.
  • Alerting rules let you specify the conditions under which an alert is sent to an external service.

    Define custom rules with Prometheus to create alert conditions, and send notifications to an external messaging service.

    Note: When you update your custom rules, observability-thanos-rule pods are restarted automatically.

    Create a ConfigMap named thanos-ruler-custom-rules in the open-cluster-management-observability namespace. The key must be named custom_rules.yaml, as shown in the following example. You can create multiple rules in the configuration.

    • By default, the out-of-the-box alert rules are defined in the thanos-ruler-default-rules ConfigMap in the open-cluster-management-observability namespace.

      For example, you can create a custom alert rule that notifies you when your CPU usage passes your defined value. Your YAML might resemble the following content:

        data:
          custom_rules.yaml: |
            groups:
              - name: cluster-health
                rules:
                - alert: ClusterCPUHealth-jb
                  annotations:
                    summary: Notify when CPU utilization on a cluster is greater than the defined utilization limit
                    description: "The cluster has a high CPU usage: {{ $value }} core for {{ $labels.cluster }} {{ $labels.clusterID }}."
                  expr: |
                    max(cluster:cpu_usage_cores:sum) by (clusterID, cluster, prometheus) > 0
                  for: 5s
                  labels:
                    cluster: "{{ $labels.cluster }}"
                    prometheus: "{{ $labels.prometheus }}"
                    severity: critical
    • You can also create a custom recording rule within the thanos-ruler-custom-rules ConfigMap.

      For example, you can create a recording rule that computes the sum of the container memory cache of a pod. Your YAML might resemble the following content:

      data:
        custom_rules.yaml: |
          groups:
            - name: container-memory
              rules:
              - record: pod:container_memory_cache:sum
                expr: sum(container_memory_cache{pod!=""}) BY (pod, container)

    Note: If this is the first new custom rule, it is created immediately. For later changes to the ConfigMap, the configuration is reloaded automatically by the config-reload within the observability-thanos-ruler sidecar.
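
The examples in this section show only the data portion of the ConfigMap. As a minimal sketch, a complete manifest that wraps the recording rule from the example might look like the following. The ConfigMap name, namespace, and key come from this section; the groups and rules keys are assumed to follow the standard Prometheus rule-file layout.

```yaml
# Sketch of a complete manifest for the custom rules ConfigMap.
# Assumption: the rule body uses the standard Prometheus rule-file layout.
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-ruler-custom-rules
  namespace: open-cluster-management-observability
data:
  custom_rules.yaml: |
    groups:
      - name: container-memory
        rules:
          - record: pod:container_memory_cache:sum
            expr: sum(container_memory_cache{pod!=""}) BY (pod, container)
```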

To verify that the alert rules are functioning correctly, launch the Grafana dashboard, navigate to the Explore page, and query ALERTS. The alert is only available in Grafana if the alert is initiated.

2.2. Adding custom metrics

Add metrics to the metrics_list.yaml file so that they are collected from managed clusters.

Before you add a custom metric, verify that mco observability is enabled with the following command: oc get mco observability -o yaml. Check that the status.conditions.message field reads: Observability components are deployed and running.

Create a file named observability-metrics-custom-allowlist.yaml and add the name of the custom metric to the metrics_list.yaml parameter. Your YAML for the ConfigMap might resemble the following content:

kind: ConfigMap
apiVersion: v1
metadata:
  name: observability-metrics-custom-allowlist
data:
  metrics_list.yaml: |
    names:
      - node_memory_MemTotal_bytes
    rules:
    - record: apiserver_request_duration_seconds:histogram_quantile_90
      expr: histogram_quantile(0.90,sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",
        verb!=\"WATCH\"}[5m])) by (verb,le))

For user workload metrics, see the Adding user workload metrics section.

  • In the names section, add the names of the custom metrics to collect from the managed cluster.
  • In the rules section, enter only one value for the expr and record parameter pair to define the query expression. The metrics are collected from your managed cluster under the name that is defined in the record parameter. The metric values returned are the results of running the query expression.
  • The names and rules sections are optional. You can use either one or both of the sections.

Create the observability-metrics-custom-allowlist ConfigMap in the open-cluster-management-observability namespace with the following command: oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml.

Verify that data from your custom metric is being collected by querying the metric on the Explore page of the Grafana dashboard. You can also use the custom metrics in your own dashboard. For more information about viewing the dashboard, see Using Grafana dashboards.

2.2.1. Adding user workload metrics

You can collect OpenShift Container Platform user-defined metrics from workloads in OpenShift Container Platform. You must first enable monitoring. See Enabling monitoring for user-defined projects.

The following example assumes a managed cluster with monitoring for user-defined workloads enabled, where the user workloads are located in the test namespace and generate metrics. Prometheus collects these metrics from the OpenShift Container Platform user workload.

Collect the metrics from the user workloads by creating a ConfigMap named observability-metrics-custom-allowlist in the test namespace. View the following example:

kind: ConfigMap
apiVersion: v1
metadata:
  name: observability-metrics-custom-allowlist
  namespace: test
data:
  uwl_metrics_list.yaml: |
    names:
      - sample_metrics

  • The uwl_metrics_list.yaml is the key for the ConfigMap data.
  • The value of the ConfigMap data is in YAML format. The names section lists the metric names that you want to collect from the test namespace. After you create the ConfigMap, the specified metrics from the target namespace are collected by the observability collector and pushed to the hub cluster.

2.2.2. Removing default metrics

To stop collecting data for a specific metric in your managed clusters, remove the metric from the observability-metrics-custom-allowlist.yaml file. As mentioned previously, first verify that mco observability is enabled.

Add the name of the default metric to the metrics_list.yaml parameter with a hyphen - at the start of the metric name. For example, -cluster_infrastructure_provider.
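
As a rough sketch of the step above, the ConfigMap file might look like the following. It assumes that the hyphen-prefixed metric name goes in the same names section that is used for adding custom metrics; the double hyphen is intentional, because the first hyphen is the YAML list marker and the second is part of the entry.

```yaml
# Sketch only: removes one default metric from collection.
# Assumption: removed metrics are listed in the names section.
kind: ConfigMap
apiVersion: v1
metadata:
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
data:
  metrics_list.yaml: |
    names:
      - -cluster_infrastructure_provider
```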

Create the observability-metrics-custom-allowlist ConfigMap in the open-cluster-management-observability namespace with the following command: oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml.

Verify that the specific metric is not being collected from your managed clusters. When you query the metric from the Grafana dashboard, the metric is not displayed.

2.3. Exporting metrics to external endpoints

You can customize observability to export metrics in real time to external endpoints that support the Prometheus Remote-Write specification. For more information, see Prometheus Remote-Write specification.

2.3.1. Creating the Kubernetes secret for an external endpoint

You must create a Kubernetes secret with the access information of the external endpoint in the open-cluster-management-observability namespace. View the following example secret:

apiVersion: v1
kind: Secret
metadata:
  name: victoriametrics
  namespace: open-cluster-management-observability
type: Opaque
stringData:
  ep.yaml: |
    url: http://victoriametrics:8428/api/v1/write
    http_client_config:
      basic_auth:
        username: test
        password: test

The ep.yaml is the key of the content and is used in the MultiClusterObservability custom resource in the next step. Currently, observability supports exporting metrics to endpoints without any security checks, with basic authentication, or with TLS enabled. View the following tables for a full list of supported parameters:



  • URL for the external endpoint.
  • Advanced configuration for the HTTP client.

  • HTTP client configuration for basic authentication.
  • HTTP client configuration for TLS.

  • User name for basic authorization.
  • Password for basic authorization.

  • Name of the secret that contains certificates.
  • Key of the CA certificate in the secret (only optional if insecure_skip_verify is set to true).
  • Key of the client certificate in the secret.
  • Key of the client key in the secret.
  • Parameter to skip the verification for target certificate.


2.3.2. Updating the MultiClusterObservability custom resource

After you create the Kubernetes secret, you must update the MultiClusterObservability custom resource to add writeStorage in the spec.storageConfig parameter. View the following example:

  storageConfig:
    writeStorage:
    - key: ep.yaml
      name: victoriametrics

The value for writeStorage is a list. Add one item to the list to export metrics to one external endpoint. If you add more than one item to the list, the metrics are exported to multiple external endpoints. Each item contains two attributes: name and key. The name attribute is the name of the Kubernetes secret that contains the endpoint access information, and the key attribute is the key of the content in the secret.
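
To illustrate exporting to more than one endpoint, the following sketch lists two items under writeStorage. The first item reuses the secret from the previous section. The second secret name, second-endpoint, is hypothetical and stands for another secret that you would create in the same way.

```yaml
# Sketch only: export metrics to two external endpoints.
spec:
  storageConfig:
    writeStorage:
      # Secret and key from the example in the previous section.
      - key: ep.yaml
        name: victoriametrics
      # Hypothetical second secret; the name is an assumption.
      - key: ep.yaml
        name: second-endpoint
```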

2.3.3. Viewing the status of metric export

After the metrics export is enabled, you can view the status of metrics export by checking the acm_remote_write_requests_total metric. From the OpenShift console of your hub cluster, navigate to the Metrics page by clicking Metrics in the Observe section.

Then query the acm_remote_write_requests_total metric. The value of that metric is the total number of requests with a specific response for one external endpoint, on one observatorium API instance. The name label is the name for the external endpoint. The code label is the return code of the HTTP request for the metrics export.

2.4. Adding advanced configuration

Add the advanced configuration section to update the retention for each observability component, according to your needs.

Edit the MultiClusterObservability custom resource and add the advanced section with the following command: oc edit mco observability -o yaml. Your YAML file might resemble the following contents:

spec:
  advanced:
    retentionConfig:
      blockDuration: 2h
      deleteDelay: 48h
      retentionInLocal: 24h
      retentionResolutionRaw: 30d
      retentionResolution5m: 180d
      retentionResolution1h: 0d
    receive:
      resources:
        limits:
          memory: 4096Gi
      replicas: 3

For descriptions of all the parameters that can be added to the advanced configuration, see the Observability API.

2.5. Updating the MultiClusterObservability custom resource replicas from the console

If your workload increases, increase the number of replicas of your observability pods. Navigate to the Red Hat OpenShift Container Platform console from your hub cluster. Locate the MultiClusterObservability custom resource, and update the replicas parameter value for the component where you want to change the replicas. Your updated YAML might resemble the following content:

         replicas: 6

For more information about the parameters within the mco observability custom resource, see the Observability API.

2.6. Customizing route certificate

If you want to customize the OpenShift Container Platform route certificate, you must add the routes in the alt_names section. To ensure that your OpenShift Container Platform routes are accessible, add the following information: alertmanager.apps.<domainname>, observatorium-api.apps.<domainname>, rbac-query-proxy.apps.<domainname>.

Note: Users are responsible for certificate rotations and updates.

2.7. Customizing certificates for accessing the object store

Complete the following steps to customize certificates for accessing the object store:

  1. Edit the http_config section by adding the certificate in the object store secret. View the following example:

     thanos.yaml: |
        type: s3
        config:
          bucket: "thanos"
          endpoint: "minio:9000"
          insecure: false
          access_key: "minio"
          secret_key: "minio123"
          http_config:
            tls_config:
              ca_file: /etc/minio/certs/ca.crt
              insecure_skip_verify: false
  2. Add the object store secret in the open-cluster-management-observability namespace. The secret must contain the ca.crt that you defined in the previous secret example. If you want to enable Mutual TLS, you need to provide public.crt and private.key in the previous secret. View the following example:

     thanos.yaml: |
        type: s3
        config:
          http_config:
            tls_config:
              ca_file: /etc/minio/certs/ca.crt
              cert_file: /etc/minio/certs/public.crt
              key_file: /etc/minio/certs/private.key
              insecure_skip_verify: false

     The ca_file, cert_file, and key_file values are the paths to the certificates and key for the thanos-object-storage secret.
  3. Configure the secret name by updating the tlsSecretName parameter in the MultiClusterObservability custom resource. View the following example where the secret name is tls-certs-secret:

    metricObjectStorage:
      key: thanos.yaml
      name: thanos-object-storage
      tlsSecretName: tls-certs-secret
  4. Mount the secret in the tlsSecretMountPath resource of all components that need to access the object store by renaming the existing TLS. See the following example:

        metricObjectStorage:
          key: thanos.yaml
          name: thanos-object-storage
          tlsSecretName: <existing-tls-certs-secret>
          tlsSecretMountPath: /etc/minio/certs
  5. To verify that you can access the object store, check that the pods are displayed.
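
The steps above refer to a secret that holds the certificate files. A minimal sketch of such a secret follows. It assumes that the secret is a standard Kubernetes secret mounted as a volume at tlsSecretMountPath, so that each key appears as a file with the same name under /etc/minio/certs. The angle-bracket values are placeholders, not real data.

```yaml
# Sketch only: TLS secret referenced by tlsSecretName.
# Assumption: keys are mounted as files under tlsSecretMountPath.
apiVersion: v1
kind: Secret
metadata:
  name: tls-certs-secret
  namespace: open-cluster-management-observability
type: Opaque
stringData:
  ca.crt: |
    <PEM-encoded CA certificate>
  # The next two entries are needed only for Mutual TLS.
  public.crt: |
    <PEM-encoded client certificate>
  private.key: |
    <PEM-encoded client key>
```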

2.8. Viewing and exploring data

View the data from your managed clusters by accessing Grafana from the hub cluster. You can query specific alerts and add filters for the query.

For example, to query cluster_infrastructure_provider from a single-node cluster, use the following query expression: cluster_infrastructure_provider{clusterType="SNO"}

Note: Do not set the ObservabilitySpec.resources.CPU.limits parameter if observability is enabled on single node managed clusters. When you set the CPU limits, it causes the observability pod to be counted against the capacity for your managed cluster. See the reference for Management Workload Partitioning in the Additional resources section.

2.8.1. Viewing historical data

When you query historical data, manually set your query parameter options to control how much data is displayed from the dashboard. Complete the following steps:

  1. From your hub cluster, select the Grafana link that is in the console header.
  2. Edit your cluster dashboard by selecting Edit Panel.
  3. From the Query front-end data source in Grafana, click the Query tab.
  4. Select $datasource.
  5. If you want to see more data, increase the value of the Step parameter section. If the Step parameter section is empty, it is automatically calculated.
  6. Find the Custom query parameters field and select max_source_resolution=auto.
  7. To verify that the data is displayed, refresh your Grafana page.

Your query data appears from the Grafana dashboard.

2.8.2. Viewing the etcd table

View the etcd table from the hub cluster dashboard in Grafana to learn about the stability of etcd as a data store.

Select the Grafana link from your hub cluster to view the etcd table data, which is collected from your hub cluster. The Leader election changes across managed clusters are displayed.

2.8.3. Viewing the cluster fleet service-level overview for the Kubernetes API server dashboard

View the cluster fleet Kubernetes API service-level overview from the hub cluster dashboard in Grafana.

After you navigate to the Grafana dashboard, access the managed dashboard menu by selecting Kubernetes > Service-Level Overview > API Server. The Fleet Overview and Top Cluster details are displayed.

View the total number of clusters that are exceeding or meeting the targeted service-level objective (SLO) value for the past seven or 30-day period, offending and non-offending clusters, and API Server Request Duration.

2.8.4. Viewing the cluster service-level overview for the Kubernetes API server dashboard

View the Kubernetes API service-level overview table from the hub cluster dashboard in Grafana.

After you navigate to the Grafana dashboard, access the managed dashboard menu by selecting Kubernetes > Service-Level Overview > API Server. The Fleet Overview and Top Cluster details are displayed.

View the error budget for the past seven or 30-day period, the remaining downtime, and trend.

2.9. Disabling observability

You can disable observability, which stops data collection on the Red Hat Advanced Cluster Management hub cluster.

2.9.1. Disabling observability on all clusters

Disable observability by removing observability components on all managed clusters.

Update the multicluster-observability-operator resource by setting enableMetrics to false. Your updated resource might resemble the following change:

spec:
  imagePullPolicy: Always
  imagePullSecret: multiclusterhub-operator-pull-secret
  observabilityAddonSpec: # Global settings for all managed clusters that have the observability add-on enabled
    enableMetrics: false # Indicates whether the observability add-on pushes metrics to the hub server

2.9.2. Disabling observability on a single cluster

Disable observability by removing observability components on specific managed clusters. Add the observability: disabled label to the managedclusters.cluster.open-cluster-management.io custom resource.

From the Red Hat Advanced Cluster Management console Clusters page, add the observability=disabled label to the specified cluster.
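
If you edit the custom resource directly instead of using the console, the label might appear as in the following fragment. This is a sketch that shows only the relevant fields; the cluster name is a placeholder.

```yaml
# Sketch only: ManagedCluster with observability disabled.
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: <managed-cluster-name>  # placeholder for your cluster name
  labels:
    observability: disabled
```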

Note: When a managed cluster with the observability component is detached, the metrics-collector deployments are removed.

2.10. Additional resources

