Chapter 1. Observability service introduction

The observability component is a service you can use to understand the health and utilization of clusters across your fleet. By default, the multicluster observability operator is enabled during the installation of Red Hat Advanced Cluster Management.

Read the following documentation for more details about the observability component:

1.1. Observability architecture

The multiclusterhub-operator enables the multicluster-observability-operator pod by default. You must configure the multicluster-observability-operator pod.

When you enable the service, the observability-endpoint-operator is automatically deployed to each imported or created cluster. This controller collects the data from Red Hat OpenShift Container Platform Prometheus, then sends it to the Red Hat Advanced Cluster Management hub cluster. If the hub cluster is self-managed and imports itself as the local-cluster, observability is also enabled on it and metrics are collected from the hub cluster. As a reminder, when the hub cluster is self-managed, the disableHubSelfManagement parameter is set to false.

The following diagram shows the components of observability:

Multicluster observability architecture

The components of the observability architecture include the following items:

The multicluster hub operator, also known as the multiclusterhub-operator pod, deploys the multicluster-observability-operator pod. It sends hub cluster data to your managed clusters.
The observability add-on controller is the API server that automatically updates the log of the managed cluster.

The Thanos infrastructure includes the Thanos Compactor, which is deployed by the multicluster-observability-operator pod. The Thanos Compactor helps queries perform well by applying the retention configuration and compacting the data in storage.

To help identify when the Thanos Compactor is experiencing issues, use the four default alerts that are monitoring its health. Read the following table of default alerts:

Table 1.1. Table of default Thanos alerts
Alert	Severity	Description
`ACMThanosCompactHalted`	critical	An alert is sent when the compactor stops.
`ACMThanosCompactHighCompactionFailures`	warning	An alert is sent when the compaction failure rate is greater than 5 percent.
`ACMThanosCompactBucketHighOperationFailures`	warning	An alert is sent when the bucket operation failure rate is greater than 5 percent.
`ACMThanosCompactHasNotRun`	warning	An alert is sent when the compactor has not uploaded anything in the last 24 hours.
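
To make the thresholds in the table concrete, the following Python sketch evaluates the four conditions against hypothetical input values. It is an illustration only: the function name and its parameters are assumptions, and it does not reproduce the alert rule expressions that the product ships.

```python
# Illustrative only: mirrors the thresholds in Table 1.1,
# not the alert rule expressions shipped with the product.
FAILURE_RATE_THRESHOLD = 0.05   # 5 percent, from the table
MAX_HOURS_WITHOUT_UPLOAD = 24   # from the table


def thanos_compact_alerts(halted, compactions, compaction_failures,
                          bucket_ops, bucket_op_failures, hours_since_upload):
    """Return the (alert, severity) pairs that match the given hypothetical inputs."""

    def rate(failures, total):
        # Treat "no operations yet" as a zero failure rate.
        return failures / total if total else 0.0

    alerts = []
    if halted:
        alerts.append(("ACMThanosCompactHalted", "critical"))
    if rate(compaction_failures, compactions) > FAILURE_RATE_THRESHOLD:
        alerts.append(("ACMThanosCompactHighCompactionFailures", "warning"))
    if rate(bucket_op_failures, bucket_ops) > FAILURE_RATE_THRESHOLD:
        alerts.append(("ACMThanosCompactBucketHighOperationFailures", "warning"))
    if hours_since_upload > MAX_HOURS_WITHOUT_UPLOAD:
        alerts.append(("ACMThanosCompactHasNotRun", "warning"))
    return alerts
```

For example, a halted compactor with 6 failures in 100 compactions and no upload for 30 hours would match three of the four alerts in this sketch.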

The observability component deploys an instance of Grafana to enable data visualization with static dashboards or data exploration. Red Hat Advanced Cluster Management supports version 8.5.20 of Grafana. You can also design your own Grafana dashboard. For more information, see Designing your Grafana dashboard.
The Prometheus Alertmanager enables alerts to be forwarded to third-party applications. You can customize the observability service by creating custom recording rules or alerting rules. Red Hat Advanced Cluster Management supports version 0.25 of Prometheus Alertmanager.

1.1.1. Persistent stores used in the observability service

Important: Do not use the local storage operator or a storage class that uses local volumes for persistent storage. You can lose data if the pod is relaunched on a different node after a restart. When this happens, the pod can no longer access the local storage on the original node. Be sure that you can access the persistent volumes of the receive and rules pods to avoid data loss.

When you install Red Hat Advanced Cluster Management, the following persistent volumes (PV) must be created so that Persistent Volume Claims (PVC) can attach to them automatically. As a reminder, you must define a storage class in the MultiClusterObservability custom resource when there is no default storage class specified or you want to use a non-default storage class to host the PVs. It is recommended to use Block Storage, similar to what Prometheus uses. Also, each replica of alertmanager, thanos-compactor, thanos-ruler, thanos-receive-default, and thanos-store-shard must have its own PV. View the following table:

Table 1.2. List of persistent volumes
Persistent volume name	Purpose
alertmanager	Alertmanager stores the `nflog` data and silenced alerts in its storage. `nflog` is an append-only log of active and resolved notifications along with the notified receiver, and a hash digest of contents that the notification identified.
thanos-compact	The compactor needs local disk space to store intermediate data for its processing, as well as the bucket state cache. The required space depends on the size of the underlying blocks. The compactor must have enough space to download all of the source blocks, then build the compacted blocks on the disk. On-disk data is safe to delete between restarts, and deleting it should be the first step when you try to get crash-looping compactors unstuck. However, it is recommended to give the compactor persistent disks so that it can use the bucket state cache effectively between restarts.
thanos-rule	The Thanos ruler evaluates Prometheus recording and alerting rules against a chosen query API by issuing queries at a fixed interval. Rule results are written back to the disk in the Prometheus 2.0 storage format. The number of hours or days of data retained in this stateful set was fixed in the API version `observability.open-cluster-management.io/v1beta1`. It has been exposed as an API parameter in `observability.open-cluster-management.io/v1beta2`: `RetentionInLocal`
thanos-receive-default	The Thanos receiver accepts incoming data (Prometheus remote-write requests) and writes it into a local instance of the Prometheus TSDB. Periodically (every 2 hours), TSDB blocks are uploaded to the object storage for long-term storage and compaction. The number of hours or days of data retained in this stateful set, which acts as a local cache, was fixed in the API version `observability.open-cluster-management.io/v1beta1`. It has been exposed as an API parameter in `observability.open-cluster-management.io/v1beta2`: `RetentionInLocal`
thanos-store-shard	It acts primarily as an API gateway and therefore does not need a significant amount of local disk space. It joins a Thanos cluster on startup and advertises the data it can access. It keeps a small amount of information about all remote blocks on local disk and keeps it in sync with the bucket. This data is generally safe to delete across restarts at the cost of increased startup times.
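
Because each replica of these components needs its own PV, the number of PVs to plan for follows from the replica counts. The following Python sketch shows the arithmetic. The replica counts are taken from the pod capacity requests table later in this chapter and are assumptions for any installation that overrides them. The function name is hypothetical.

```python
# Components from Table 1.2. Each replica needs its own persistent volume.
# Replica counts come from the pod capacity requests table (Table 1.4);
# treat them as assumptions if your installation overrides the defaults.
REPLICAS_BY_COMPONENT = {
    "alertmanager": 3,
    "thanos-compact": 1,
    "thanos-rule": 3,
    "thanos-receive-default": 3,
    "thanos-store-shard": 3,
}


def required_persistent_volumes(replicas_by_component):
    """Return the total number of PVs needed: one per replica of each component."""
    return sum(replicas_by_component.values())


if __name__ == "__main__":
    print(required_persistent_volumes(REPLICAS_BY_COMPONENT))
```

With the replica counts listed in the sketch, the result is 13 persistent volumes.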

Note: The time series historical data is stored in object stores. Thanos uses object storage as the primary storage for metrics and metadata related to them. For more details about the object storage and downsampling, see Enabling observability service.

1.1.2. Additional resources

1.2. Observability configuration

Continue reading to understand what metrics can be collected with the observability component, and for information about the observability pod capacity.

1.2.1. Metric types

By default, OpenShift Container Platform sends metrics to Red Hat using the Telemetry service. The acm_managed_cluster_info metric is available with Red Hat Advanced Cluster Management and is included with telemetry, but is not displayed on the Red Hat Advanced Cluster Management Observe environments overview dashboard.

View the following table of metric types that are supported by the framework:

Table 1.3. Parameter table
Metric name	Metric type	Labels/tags	Status
`acm_managed_cluster_info`	Gauge	`hub_cluster_id`, `managed_cluster_id`, `vendor`, `cloud`, `version`, `available`, `created_via`, `core_worker`, `socket_worker`	Stable
`config_policies_evaluation_duration_seconds_bucket`	Histogram	None	Stable. Read Governance metric for more details.
`config_policies_evaluation_duration_seconds_count`	Histogram	None	Stable. Read Governance metric for more details.
`config_policies_evaluation_duration_seconds_sum`	Histogram	None	Stable. Read Governance metric for more details.
`policy_governance_info`	Gauge	`type`, `policy`, `policy_namespace`, `cluster_namespace`	Stable. Review Governance metric for more details.
`policyreport_info`	Gauge	`managed_cluster_id`, `category`, `policy`, `result`, `severity`	Stable. Read Managing insight _PolicyReports_ for more details.
`search_api_db_connection_failed_total`	Counter	None	Stable. See the Search components section in the Searching in the console introduction documentation.
`search_api_dbquery_duration_seconds`	Histogram	None	Stable. See the Search components section in the Searching in the console introduction documentation.
`search_api_requests`	Histogram	None	Stable. See the Search components section in the Searching in the console introduction documentation.
`search_indexer_request_count`	Counter	None	Stable. See the Search components section in the Searching in the console introduction documentation.
`search_indexer_request_duration`	Histogram	None	Stable. See the Search components section in the Searching in the console introduction documentation.
`search_indexer_requests_in_flight`	Gauge	None	Stable. See the Search components section in the Searching in the console introduction documentation.
`search_indexer_request_size`	Histogram	None	Stable. See the Search components section in the Searching in the console introduction documentation.
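
A Prometheus histogram exposes `_sum` and `_count` series, and dividing the first by the second gives the mean observed value. The following Python sketch applies that to `config_policies_evaluation_duration_seconds`. The sample text and its values are hypothetical, the helper names are assumptions, and the parser handles only unlabelled samples.

```python
def parse_samples(text):
    """Parse unlabelled 'name value' lines from Prometheus text output into a dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, value = line.rsplit(None, 1)
        samples[name] = float(value)
    return samples


def mean_duration_seconds(samples, metric):
    """Return sum/count for a histogram, or None if nothing was observed."""
    count = samples.get(metric + "_count", 0.0)
    if count == 0:
        return None
    return samples[metric + "_sum"] / count


# Hypothetical scrape output, for illustration only.
EXAMPLE = """
# TYPE config_policies_evaluation_duration_seconds histogram
config_policies_evaluation_duration_seconds_sum 12.5
config_policies_evaluation_duration_seconds_count 50
"""

if __name__ == "__main__":
    samples = parse_samples(EXAMPLE)
    print(mean_duration_seconds(samples, "config_policies_evaluation_duration_seconds"))
```

With the hypothetical values in the sketch, the mean evaluation duration is 0.25 seconds.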

1.2.2. Observability pod capacity requests

To install the observability service, the observability components request a total of 2701 mCPU and 11972 Mi of memory. The following table lists the pod capacity requests for five managed clusters with observability-addons enabled:

Table 1.4. Observability pod capacity requests
Deployment or StatefulSet	Container name	CPU (mCPU)	Memory (Mi)	Replicas	Pod total CPU	Pod total memory
observability-alertmanager	alertmanager	4	200	3	12	600
observability-alertmanager	config-reloader	4	25	3	12	75
observability-alertmanager	alertmanager-proxy	1	20	3	3	60
observability-grafana	grafana	4	100	2	8	200
observability-grafana	grafana-dashboard-loader	4	50	2	8	100
observability-observatorium-api	observatorium-api	20	128	2	40	256
observability-observatorium-operator	observatorium-operator	10	50	1	10	50
observability-rbac-query-proxy	rbac-query-proxy	20	100	2	40	200
observability-rbac-query-proxy	oauth-proxy	1	20	2	2	40
observability-thanos-compact	thanos-compact	100	512	1	100	512
observability-thanos-query	thanos-query	300	1024	2	600	2048
observability-thanos-query-frontend	thanos-query-frontend	100	256	2	200	512
observability-thanos-query-frontend-memcached	memcached	45	128	3	135	384
observability-thanos-query-frontend-memcached	exporter	5	50	3	15	150
observability-thanos-receive-controller	thanos-receive-controller	4	32	1	4	32
observability-thanos-receive-default	thanos-receive	300	512	3	900	1536
observability-thanos-rule	thanos-rule	50	512	3	150	1536
observability-thanos-rule	configmap-reloader	4	25	3	12	75
observability-thanos-store-memcached	memcached	45	128	3	135	384
observability-thanos-store-memcached	exporter	5	50	3	15	150
observability-thanos-store-shard	thanos-store	100	1024	3	300	3072
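
The overall request figures can be checked by summing the "Pod total CPU" and "Pod total memory" columns. The following Python sketch does that with the values copied from the table. The variable and function names are hypothetical.

```python
# (deployment or StatefulSet, container, pod total CPU in mCPU, pod total memory in Mi)
# Values are copied from the "Pod total" columns of Table 1.4.
POD_TOTALS = [
    ("observability-alertmanager", "alertmanager", 12, 600),
    ("observability-alertmanager", "config-reloader", 12, 75),
    ("observability-alertmanager", "alertmanager-proxy", 3, 60),
    ("observability-grafana", "grafana", 8, 200),
    ("observability-grafana", "grafana-dashboard-loader", 8, 100),
    ("observability-observatorium-api", "observatorium-api", 40, 256),
    ("observability-observatorium-operator", "observatorium-operator", 10, 50),
    ("observability-rbac-query-proxy", "rbac-query-proxy", 40, 200),
    ("observability-rbac-query-proxy", "oauth-proxy", 2, 40),
    ("observability-thanos-compact", "thanos-compact", 100, 512),
    ("observability-thanos-query", "thanos-query", 600, 2048),
    ("observability-thanos-query-frontend", "thanos-query-frontend", 200, 512),
    ("observability-thanos-query-frontend-memcached", "memcached", 135, 384),
    ("observability-thanos-query-frontend-memcached", "exporter", 15, 150),
    ("observability-thanos-receive-controller", "thanos-receive-controller", 4, 32),
    ("observability-thanos-receive-default", "thanos-receive", 900, 1536),
    ("observability-thanos-rule", "thanos-rule", 150, 1536),
    ("observability-thanos-rule", "configmap-reloader", 12, 75),
    ("observability-thanos-store-memcached", "memcached", 135, 384),
    ("observability-thanos-store-memcached", "exporter", 15, 150),
    ("observability-thanos-store-shard", "thanos-store", 300, 3072),
]


def total_requests(rows):
    """Sum the pod total CPU (mCPU) and memory (Mi) columns."""
    cpu = sum(row[2] for row in rows)
    memory = sum(row[3] for row in rows)
    return cpu, memory


if __name__ == "__main__":
    cpu, memory = total_requests(POD_TOTALS)
    print(f"{cpu} mCPU, {memory} Mi")
```

The sums match the totals stated at the start of this section: 2701 mCPU and 11972 Mi.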

1.2.3. Additional resources

For more information about enabling observability, read Enabling the observability service.
Read Customizing observability to learn how to configure the observability service, view metrics and other data.
Read Using Grafana dashboards.
To learn what types of metrics are collected and sent using Telemetry, see Information collected by Telemetry in the OpenShift Container Platform documentation.
Refer to Governance metric for details.
Read Managing insight PolicyReports.
Refer to Prometheus recording rules.
Also refer to Prometheus alerting rules.
Return to Observability service introduction.
