Home
Products
OpenShift Container Platform
4.15
Hosted control planes
Chapter 8. Hosted control planes Observability

Chapter 8. Hosted control planes Observability

You can gather metrics for hosted control planes by configuring metrics sets. The HyperShift Operator can create or delete monitoring dashboards in the management cluster for each hosted cluster that it manages.

8.1. Configuring metrics sets for hosted control planes
Copy link

Hosted control planes for Red Hat OpenShift Container Platform creates ServiceMonitor resources in each control plane namespace that allow a Prometheus stack to gather metrics from the control planes. The ServiceMonitor resources use metrics relabelings to define which metrics are included or excluded from a particular component, such as etcd or the Kubernetes API server. The number of metrics that are produced by control planes directly impacts the resource requirements of the monitoring stack that gathers them.

Instead of producing a fixed number of metrics that apply to all situations, you can configure a metrics set that identifies a set of metrics to produce for each control plane. The following metrics sets are supported:

Telemetry: These metrics are needed for telemetry. This set is the default set and is the smallest set of metrics.
SRE: This set includes the necessary metrics to produce alerts and allow the troubleshooting of control plane components.
All: This set includes all of the metrics that are produced by standalone OpenShift Container Platform control plane components.

To configure a metrics set, set the METRICS_SET environment variable in the HyperShift Operator deployment by entering the following command:

oc set env -n hypershift deployment/operator METRICS_SET=All

$ oc set env -n hypershift deployment/operator METRICS_SET=All

Copy to Clipboard

Toggle word wrap

8.1.1. Configuring the SRE metrics set
Copy link

When you specify the SRE metrics set, the HyperShift Operator looks for a config map named sre-metric-set with a single key: config. The value of the config key must contain a set of RelabelConfigs that are organized by control plane component.

You can specify the following components:

etcd
kubeAPIServer
kubeControllerManager
openshiftAPIServer
openshiftControllerManager
openshiftRouteControllerManager
cvo
olm
catalogOperator
registryOperator
nodeTuningOperator
controlPlaneOperator
hostedClusterConfigOperator

A configuration of the SRE metrics set is illustrated in the following example:

kubeAPIServer:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|server).*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_admission_controller_admission_latencies_seconds_.*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_admission_step_admission_latencies_seconds_.*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "transformation_(transformation_latencies_microseconds|failures_total)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "network_plugin_operations_latency_microseconds|sync_proxy_rules_latency_microseconds|rest_client_request_latency_seconds"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_request_duration_seconds_bucket;(0.15|0.25|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2.5|3|3.5|4.5|6|7|8|9|15|25|30|50)"
    sourceLabels: ["__name__", "le"]
kubeControllerManager:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|request|server).*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "rest_client_request_latency_seconds_(bucket|count|sum)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "root_ca_cert_publisher_sync_duration_seconds_(bucket|count|sum)"
    sourceLabels: ["__name__"]
openshiftAPIServer:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|server).*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_admission_controller_admission_latencies_seconds_.*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_admission_step_admission_latencies_seconds_.*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_request_duration_seconds_bucket;(0.15|0.25|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2.5|3|3.5|4.5|6|7|8|9|15|25|30|50)"
    sourceLabels: ["__name__", "le"]
openshiftControllerManager:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|request|server).*"
    sourceLabels: ["__name__"]
openshiftRouteControllerManager:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|request|server).*"
    sourceLabels: ["__name__"]
olm:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|server).*"
    sourceLabels: ["__name__"]
catalogOperator:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|server).*"
    sourceLabels: ["__name__"]
cvo:
  - action: drop
    regex: "etcd_(debugging|disk|server).*"
    sourceLabels: ["__name__"]

kubeAPIServer:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|server).*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_admission_controller_admission_latencies_seconds_.*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_admission_step_admission_latencies_seconds_.*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "scheduler_(e2e_scheduling_latency_microseconds|scheduling_algorithm_predicate_evaluation|scheduling_algorithm_priority_evaluation|scheduling_algorithm_preemption_evaluation|scheduling_algorithm_latency_microseconds|binding_latency_microseconds|scheduling_latency_seconds)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_(request_count|request_latencies|request_latencies_summary|dropped_requests|storage_data_key_generation_latencies_microseconds|storage_transformation_failures_total|storage_transformation_latencies_microseconds|proxy_tunnel_sync_latency_secs)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "docker_(operations|operations_latency_microseconds|operations_errors|operations_timeout)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "reflector_(items_per_list|items_per_watch|list_duration_seconds|lists_total|short_watches_total|watch_duration_seconds|watches_total)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "etcd_(helper_cache_hit_count|helper_cache_miss_count|helper_cache_entry_count|request_cache_get_latencies_summary|request_cache_add_latencies_summary|request_latencies_summary)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "transformation_(transformation_latencies_microseconds|failures_total)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "network_plugin_operations_latency_microseconds|sync_proxy_rules_latency_microseconds|rest_client_request_latency_seconds"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_request_duration_seconds_bucket;(0.15|0.25|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2.5|3|3.5|4.5|6|7|8|9|15|25|30|50)"
    sourceLabels: ["__name__", "le"]
kubeControllerManager:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|request|server).*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "rest_client_request_latency_seconds_(bucket|count|sum)"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "root_ca_cert_publisher_sync_duration_seconds_(bucket|count|sum)"
    sourceLabels: ["__name__"]
openshiftAPIServer:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|server).*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_admission_controller_admission_latencies_seconds_.*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_admission_step_admission_latencies_seconds_.*"
    sourceLabels: ["__name__"]
  - action:       "drop"
    regex:        "apiserver_request_duration_seconds_bucket;(0.15|0.25|0.3|0.35|0.4|0.45|0.6|0.7|0.8|0.9|1.25|1.5|1.75|2.5|3|3.5|4.5|6|7|8|9|15|25|30|50)"
    sourceLabels: ["__name__", "le"]
openshiftControllerManager:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|request|server).*"
    sourceLabels: ["__name__"]
openshiftRouteControllerManager:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|request|server).*"
    sourceLabels: ["__name__"]
olm:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|server).*"
    sourceLabels: ["__name__"]
catalogOperator:
  - action:       "drop"
    regex:        "etcd_(debugging|disk|server).*"
    sourceLabels: ["__name__"]
cvo:
  - action: drop
    regex: "etcd_(debugging|disk|server).*"
    sourceLabels: ["__name__"]

Copy to Clipboard

Toggle word wrap

8.2. Enabling monitoring dashboards in a hosted cluster
Copy link

You can enable monitoring dashboards in a hosted cluster by creating a config map.

Procedure

Create the hypershift-operator-install-flags config map in the local-cluster namespace. See the following example configuration:

kind: ConfigMap
apiVersion: v1
metadata:
  name: hypershift-operator-install-flags
  namespace: local-cluster
data:
  installFlagsToAdd: "--monitoring-dashboards --metrics-set=All" 
  installFlagsToRemove: ""

kind: ConfigMap
apiVersion: v1
metadata:
  name: hypershift-operator-install-flags
  namespace: local-cluster
data:
  installFlagsToAdd: "--monitoring-dashboards --metrics-set=All"


  installFlagsToRemove: ""

Copy to Clipboard

Toggle word wrap

1: The --monitoring-dashboards --metrics-set=All flag adds the monitoring dashboard for all metrics.

Wait a couple of minutes for the HyperShift Operator deployment in the hypershift namespace to be updated to include the following environment variable:
```
    - name: MONITORING_DASHBOARDS
      value: "1"
```
```
    - name: MONITORING_DASHBOARDS
      value: "1"
```
Copy to Clipboard Toggle word wrap
When monitoring dashboards are enabled, for each hosted cluster that the HyperShift Operator manages, the Operator creates a config map named cp-<hosted_cluster_namespace>-<hosted_cluster_name> in the openshift-config-managed namespace, where <hosted_cluster_namespace> is the namespace of the hosted cluster and <hosted_cluster_name> is the name of the hosted cluster. As a result, a new dashboard is added in the administrative console of the management cluster.
To view the dashboard, log in to the management cluster’s console and go to the dashboard for the hosted cluster by clicking Observe Dashboards.
Optional: To disable monitoring dashboards in a hosted cluster, remove the --monitoring-dashboards --metrics-set=All flag from the hypershift-operator-install-flags config map. When you delete a hosted cluster, its corresponding dashboard is also deleted.

8.2.1. Dashboard customization
Copy link

To generate dashboards for each hosted cluster, the HyperShift Operator uses a template that is stored in the monitoring-dashboard-template config map in the Operator namespace (hypershift). This template contains a set of Grafana panels that contain the metrics for the dashboard. You can edit the content of the config map to customize the dashboards.

When a dashboard is generated, the following strings are replaced with values that correspond to a specific hosted cluster:

Expand

Name	Description
`__NAME__`	The name of the hosted cluster
`__NAMESPACE__`	The namespace of the hosted cluster
`__CONTROL_PLANE_NAMESPACE__`	The namespace where the control plane pods of the hosted cluster are placed
`__CLUSTER_ID__`	The UUID of the hosted cluster, which matches the `_id` label of the hosted cluster metrics

Chapter 8. Hosted control planes Observability

8.1. Configuring metrics sets for hosted control planes
Copy link

8.1.1. Configuring the SRE metrics set
Copy link

8.2. Enabling monitoring dashboards in a hosted cluster
Copy link

8.2.1. Dashboard customization
Copy link

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

Making open source more inclusive

About Red Hat

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

Chapter 8. Hosted control planes Observability

8.1. Configuring metrics sets for hosted control planesCopy linkLink copied to clipboard!

8.1.1. Configuring the SRE metrics setCopy linkLink copied to clipboard!

8.2. Enabling monitoring dashboards in a hosted clusterCopy linkLink copied to clipboard!

8.2.1. Dashboard customizationCopy linkLink copied to clipboard!

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

Making open source more inclusive

About Red Hat

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

8.1. Configuring metrics sets for hosted control planes
Copy link

8.1.1. Configuring the SRE metrics set
Copy link

8.2. Enabling monitoring dashboards in a hosted cluster
Copy link

8.2.1. Dashboard customization
Copy link