Chapter 1. Observability service introduction


The observability component is a service that you can use to understand the health and utilization of clusters across your fleet. By default, the multicluster observability operator is enabled during the installation of Red Hat Advanced Cluster Management.

Read the following documentation for more details about the observability component:

1.1. Observability architecture

The multiclusterhub-operator enables the multicluster-observability-operator pod by default. You must configure the multicluster-observability-operator pod.

When you enable the service, the observability-endpoint-operator is automatically deployed to each imported or created cluster. This controller collects the data from Red Hat OpenShift Container Platform Prometheus, then sends it to the Red Hat Advanced Cluster Management hub cluster. If the hub cluster is self-managed and imports itself as the local-cluster, observability is also enabled on it and metrics are collected from the hub cluster. As a reminder, when the hub cluster is self-managed, the disableHubSelfManagement parameter is set to false.

The following diagram shows the components of observability:

[Figure: Multicluster observability architecture]

The components of the observability architecture include the following items:

  • The multicluster hub operator, also known as the multiclusterhub-operator pod, deploys the multicluster-observability-operator pod. It sends hub cluster data to your managed clusters.
  • The observability add-on controller is the API server that automatically updates the log of the managed cluster.
  • The Thanos infrastructure includes the Thanos Compactor, which is deployed by the multicluster-observability-operator pod. The Thanos Compactor keeps queries performing well by applying the retention configuration and by compacting the data in storage.

    To help identify when the Thanos Compactor is experiencing issues, use the four default alerts that are monitoring its health. Read the following table of default alerts:

    Table 1.1. Default Thanos alerts

    | Alert | Severity | Description |
    |---|---|---|
    | ACMThanosCompactHalted | critical | An alert is sent when the compactor stops. |
    | ACMThanosCompactHighCompactionFailures | warning | An alert is sent when the compaction failure rate is greater than 5 percent. |
    | ACMThanosCompactBucketHighOperationFailures | warning | An alert is sent when the bucket operation failure rate is greater than 5 percent. |
    | ACMThanosCompactHasNotRun | warning | An alert is sent when the compactor has not uploaded anything in the last 24 hours. |

  • The observability component deploys an instance of Grafana to enable data visualization with dashboards (static) or data exploration. Red Hat Advanced Cluster Management supports version 8.5.20 of Grafana. You can also design your Grafana dashboard. For more information, see Designing your Grafana dashboard.
  • The Prometheus Alertmanager enables alerts to be forwarded to third-party applications. You can customize the observability service by creating custom recording rules or alerting rules. Red Hat Advanced Cluster Management supports version 0.25 of Prometheus Alertmanager.
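
The thresholds in Table 1.1 can be read as simple predicates. The following Python sketch is illustrative only: the actual alerts are evaluated by the monitoring stack, and the function name, parameters, and input values here are hypothetical. It shows which of the four default alerts would fire for a given set of compactor measurements.

```python
from typing import List, Tuple

# Thresholds taken from Table 1.1; everything else in this sketch is hypothetical.
FAILURE_RATE_THRESHOLD = 0.05    # 5 percent
MAX_HOURS_WITHOUT_UPLOAD = 24    # hours


def firing_compactor_alerts(
    halted: bool,
    compaction_failure_rate: float,
    bucket_operation_failure_rate: float,
    hours_since_last_upload: float,
) -> List[Tuple[str, str]]:
    """Return (alert name, severity) pairs that would fire for the inputs."""
    alerts = []
    if halted:
        alerts.append(("ACMThanosCompactHalted", "critical"))
    if compaction_failure_rate > FAILURE_RATE_THRESHOLD:
        alerts.append(("ACMThanosCompactHighCompactionFailures", "warning"))
    if bucket_operation_failure_rate > FAILURE_RATE_THRESHOLD:
        alerts.append(("ACMThanosCompactBucketHighOperationFailures", "warning"))
    if hours_since_last_upload > MAX_HOURS_WITHOUT_UPLOAD:
        alerts.append(("ACMThanosCompactHasNotRun", "warning"))
    return alerts


if __name__ == "__main__":
    # Hypothetical measurements: 8 percent compaction failures, last upload 30 hours ago.
    for name, severity in firing_compactor_alerts(False, 0.08, 0.01, 30):
        print(f"{severity}: {name}")
```

Both rate alerts use a strict "greater than" comparison, so a rate of exactly 5 percent does not fire in this sketch.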

1.1.1. Persistent stores used in the observability service

Important: Do not use the local storage operator or a storage class that uses local volumes for persistent storage. You can lose data if the pod is relaunched on a different node after a restart. When this happens, the pod can no longer access the local storage on the original node. Be sure that you can access the persistent volumes of the receive and rules pods to avoid data loss.

When you install Red Hat Advanced Cluster Management, the following persistent volumes (PV) must be created so that persistent volume claims (PVC) can attach to them automatically. As a reminder, you must define a storage class in the MultiClusterObservability custom resource when no default storage class is specified or when you want to use a non-default storage class to host the PVs. Block storage is recommended, similar to what Prometheus uses. Also, each replica of alertmanager, thanos-compactor, thanos-ruler, thanos-receive-default, and thanos-store-shard must have its own PV. View the following table:

Table 1.2. List of persistent volumes

| Persistent volume name | Purpose |
|---|---|
| alertmanager | Alertmanager stores the nflog data and silenced alerts in its storage. nflog is an append-only log of active and resolved notifications along with the notified receiver, and a hash digest of contents that the notification identified. |
| thanos-compact | The compactor needs local disk space to store intermediate data for its processing, as well as the bucket state cache. The required space depends on the size of the underlying blocks. The compactor must have enough space to download all of the source blocks, then build the compacted blocks on the disk. On-disk data is safe to delete between restarts, and deleting it should be the first attempt to get crash-looping compactors unstuck. However, it is recommended to give the compactor persistent disks so that it can use the bucket state cache effectively between restarts. |
| thanos-rule | The Thanos ruler evaluates Prometheus recording and alerting rules against a chosen query API by issuing queries at a fixed interval. Rule results are written back to the disk in the Prometheus 2.0 storage format. The number of hours or days of data retained in this stateful set was fixed in the API version observability.open-cluster-management.io/v1beta1. It is exposed as an API parameter in observability.open-cluster-management.io/v1beta2: RetentionInLocal |
| thanos-receive-default | The Thanos receiver accepts incoming data (Prometheus remote-write requests) and writes it into a local instance of the Prometheus TSDB. Periodically (every 2 hours), TSDB blocks are uploaded to the object storage for long-term storage and compaction. The number of hours or days of data retained in this stateful set, which acts as a local cache, was fixed in the API version observability.open-cluster-management.io/v1beta1. It is exposed as an API parameter in observability.open-cluster-management.io/v1beta2: RetentionInLocal |
| thanos-store-shard | It acts primarily as an API gateway and therefore does not need a significant amount of local disk space. It joins a Thanos cluster on startup and advertises the data it can access. It keeps a small amount of information about all remote blocks on local disk and keeps it in sync with the bucket. This data is generally safe to delete across restarts at the cost of increased startup times. |

Note: The time series historical data is stored in object stores. Thanos uses object storage as the primary storage for metrics and metadata related to them. For more details about the object storage and downsampling, see Enabling observability service.
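
Because each replica of the stateful components listed in this section must have its own PV, the number of PVs to plan for follows from the replica counts. The following Python sketch assumes the replica counts shown in the pod capacity table later in this chapter (Table 1.4); the dictionary and function names are hypothetical, and your counts differ if you change the replicas.

```python
# Replica counts as listed in Table 1.4 for the components that need a PV.
# The names below are illustrative, not part of any product API.
DEFAULT_REPLICAS = {
    "alertmanager": 3,
    "thanos-compact": 1,
    "thanos-rule": 3,
    "thanos-receive-default": 3,
    "thanos-store-shard": 3,
}


def required_persistent_volumes(replicas):
    """One PV per replica of every stateful component."""
    return sum(replicas.values())


print(required_persistent_volumes(DEFAULT_REPLICAS))  # prints 13
```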

1.1.2. Additional resources

1.2. Observability configuration

Continue reading to understand what metrics can be collected with the observability component, and for information about the observability pod capacity.

1.2.1. Metric types

By default, OpenShift Container Platform sends metrics to Red Hat using the Telemetry service. The acm_managed_cluster_info metric is available with Red Hat Advanced Cluster Management and is included with telemetry, but it is not displayed on the Red Hat Advanced Cluster Management Observe environments overview dashboard.

View the following table of metric types that are supported by the framework:

Table 1.3. Supported metric types

| Metric name | Metric type | Labels/tags | Status |
|---|---|---|---|
| acm_managed_cluster_info | Gauge | hub_cluster_id, managed_cluster_id, vendor, cloud, version, available, created_via, core_worker, socket_worker | Stable |
| config_policies_evaluation_duration_seconds_bucket | Histogram | None | Stable. Read Governance metric for more details. |
| config_policies_evaluation_duration_seconds_count | Histogram | None | Stable. Refer to Governance metric for more details. |
| config_policies_evaluation_duration_seconds_sum | Histogram | None | Stable. Read Governance metric for more details. |
| policy_governance_info | Gauge | type, policy, policy_namespace, cluster_namespace | Stable. Review Governance metric for more details. |
| policyreport_info | Gauge | managed_cluster_id, category, policy, result, severity | Stable. Read Managing insight PolicyReports for more details. |
| search_api_db_connection_failed_total | Counter | None | Stable. See the Search components section in the Searching in the console introduction documentation. |
| search_api_dbquery_duration_seconds | Histogram | None | Stable. See the Search components section in the Searching in the console introduction documentation. |
| search_api_requests | Histogram | None | Stable. See the Search components section in the Searching in the console introduction documentation. |
| search_indexer_request_count | Counter | None | Stable. See the Search components section in the Searching in the console introduction documentation. |
| search_indexer_request_duration | Histogram | None | Stable. See the Search components section in the Searching in the console introduction documentation. |
| search_indexer_requests_in_flight | Gauge | None | Stable. See the Search components section in the Searching in the console introduction documentation. |
| search_indexer_request_size | Histogram | None | Stable. See the Search components section in the Searching in the console introduction documentation. |
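
The three config_policies_evaluation_duration_seconds series in the table are the usual parts of a Prometheus histogram: cumulative buckets, a count of observations, and a sum of the observed values. As a minimal sketch, the average evaluation duration over a window is the increase in the _sum series divided by the increase in the _count series. The function name and the sample values below are hypothetical and are not taken from a real cluster.

```python
def average_duration_seconds(sum_start, sum_end, count_start, count_end):
    """Average observed duration between two scrapes of a histogram.

    Returns None when no new observations were recorded in the window
    (or when the counters were reset), because the ratio is undefined.
    """
    observations = count_end - count_start
    if observations <= 0:
        return None
    return (sum_end - sum_start) / observations


# Hypothetical scrapes: _sum grew from 120.0 to 150.0 seconds,
# _count grew from 400 to 460 evaluations.
print(average_duration_seconds(120.0, 150.0, 400, 460))  # prints 0.5
```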

1.2.2. Observability pod capacity requests

Installing the observability service requires a total of 2701 mCPU and 11972 Mi of memory for the observability components. The following table lists the pod capacity requests for five managed clusters with observability-addons enabled:

Table 1.4. Observability pod capacity requests

| Deployment or StatefulSet | Container name | CPU (mCPU) | Memory (Mi) | Replicas | Pod total CPU | Pod total memory |
|---|---|---|---|---|---|---|
| observability-alertmanager | alertmanager | 4 | 200 | 3 | 12 | 600 |
| | config-reloader | 4 | 25 | 3 | 12 | 75 |
| | alertmanager-proxy | 1 | 20 | 3 | 3 | 60 |
| observability-grafana | grafana | 4 | 100 | 2 | 8 | 200 |
| | grafana-dashboard-loader | 4 | 50 | 2 | 8 | 100 |
| observability-observatorium-api | observatorium-api | 20 | 128 | 2 | 40 | 256 |
| observability-observatorium-operator | observatorium-operator | 100 | 100 | 1 | 10 | 50 |
| observability-rbac-query-proxy | rbac-query-proxy | 20 | 100 | 2 | 40 | 200 |
| | oauth-proxy | 1 | 20 | 2 | 2 | 40 |
| observability-thanos-compact | thanos-compact | 100 | 512 | 1 | 100 | 512 |
| observability-thanos-query | thanos-query | 300 | 1024 | 2 | 600 | 2048 |
| observability-thanos-query-frontend | thanos-query-frontend | 100 | 256 | 2 | 200 | 512 |
| observability-thanos-query-frontend-memcached | memcached | 45 | 128 | 3 | 135 | 384 |
| | exporter | 5 | 50 | 3 | 15 | 150 |
| observability-thanos-receive-controller | thanos-receive-controller | 4 | 32 | 1 | 4 | 32 |
| observability-thanos-receive-default | thanos-receive | 300 | 512 | 3 | 900 | 1536 |
| observability-thanos-rule | thanos-rule | 50 | 512 | 3 | 150 | 1536 |
| | configmap-reloader | 4 | 25 | 3 | 12 | 75 |
| observability-thanos-store-memcached | memcached | 45 | 128 | 3 | 135 | 384 |
| | exporter | 5 | 50 | 3 | 15 | 150 |
| observability-thanos-store-shard | thanos-store | 100 | 1024 | 3 | 300 | 3072 |
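
The overall requirement of 2701 mCPU and 11972 Mi stated at the start of this section is the sum of the "Pod total CPU" and "Pod total memory" columns. The following Python sketch reproduces that arithmetic using the values exactly as they appear in Table 1.4; the variable names are hypothetical. It can serve as a starting point if you want to recompute the totals after changing replica counts.

```python
# (deployment or StatefulSet, container, pod total CPU in mCPU, pod total memory in Mi)
# Values are copied from the "Pod total" columns of Table 1.4.
POD_TOTALS = [
    ("observability-alertmanager", "alertmanager", 12, 600),
    ("observability-alertmanager", "config-reloader", 12, 75),
    ("observability-alertmanager", "alertmanager-proxy", 3, 60),
    ("observability-grafana", "grafana", 8, 200),
    ("observability-grafana", "grafana-dashboard-loader", 8, 100),
    ("observability-observatorium-api", "observatorium-api", 40, 256),
    ("observability-observatorium-operator", "observatorium-operator", 10, 50),
    ("observability-rbac-query-proxy", "rbac-query-proxy", 40, 200),
    ("observability-rbac-query-proxy", "oauth-proxy", 2, 40),
    ("observability-thanos-compact", "thanos-compact", 100, 512),
    ("observability-thanos-query", "thanos-query", 600, 2048),
    ("observability-thanos-query-frontend", "thanos-query-frontend", 200, 512),
    ("observability-thanos-query-frontend-memcached", "memcached", 135, 384),
    ("observability-thanos-query-frontend-memcached", "exporter", 15, 150),
    ("observability-thanos-receive-controller", "thanos-receive-controller", 4, 32),
    ("observability-thanos-receive-default", "thanos-receive", 900, 1536),
    ("observability-thanos-rule", "thanos-rule", 150, 1536),
    ("observability-thanos-rule", "configmap-reloader", 12, 75),
    ("observability-thanos-store-memcached", "memcached", 135, 384),
    ("observability-thanos-store-memcached", "exporter", 15, 150),
    ("observability-thanos-store-shard", "thanos-store", 300, 3072),
]

# Sum the per-container pod totals across all deployments and stateful sets.
total_cpu_mcpu = sum(row[2] for row in POD_TOTALS)
total_memory_mi = sum(row[3] for row in POD_TOTALS)

print(total_cpu_mcpu, total_memory_mi)  # prints 2701 11972
```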

1.2.3. Additional resources

© 2024 Red Hat, Inc.