Chapter 2. Customizing observability
Review the following sections to learn more about customizing, managing, and viewing data that is collected by the observability service.
To collect logs that capture new information created for observability resources, use the must-gather command. For more information, see the Must-gather section in the Troubleshooting documentation.
- Creating custom rules
- Adding custom metrics
- Exporting metrics to external endpoints
- Adding advanced configuration
- Updating the MultiClusterObservability custom resource replicas from the console
- Customizing route certificate
- Customizing certificates for accessing the object store
- Viewing and exploring data
- Disabling observability
2.1. Creating custom rules
Create custom rules for the observability installation by adding Prometheus recording rules and alerting rules to the observability resource.
- Recording rules provide you the ability to precalculate, or compute, expensive expressions as needed. The results are saved as a new set of time series.
- Alerting rules provide you the ability to specify the alert conditions based on how an alert should be sent to an external service.
Define custom rules with Prometheus to create alert conditions, and send notifications to an external messaging service.
Note: When you update your custom rules, observability-thanos-rule pods are restarted automatically.

Create a ConfigMap named thanos-ruler-custom-rules in the open-cluster-management-observability namespace. The key must be named custom_rules.yaml, as shown in the following example. You can create multiple rules in the configuration.

By default, the out-of-the-box alert rules are defined in the thanos-ruler-default-rules ConfigMap in the open-cluster-management-observability namespace.

For example, you can create a custom alert rule that notifies you when your CPU usage passes your defined value. Your YAML might resemble the following content:
data:
  custom_rules.yaml: |
    groups:
      - name: cluster-health
        rules:
        - alert: ClusterCPUHealth-jb
          annotations:
            summary: Notify when CPU utilization on a cluster is greater than the defined utilization limit
            description: "The cluster has a high CPU usage: {{ $value }} core for {{ $labels.cluster }} {{ $labels.clusterID }}."
          expr: |
            max(cluster:cpu_usage_cores:sum) by (clusterID, cluster, prometheus) > 0
          for: 5s
          labels:
            cluster: "{{ $labels.cluster }}"
            prometheus: "{{ $labels.prometheus }}"
            severity: critical
You can also create a custom recording rule within the thanos-ruler-custom-rules ConfigMap. For example, you can create a recording rule that provides you the ability to get the sum of the container memory cache of a pod. Your YAML might resemble the following content:
data:
  custom_rules.yaml: |
    groups:
      - name: container-memory
        rules:
        - record: pod:container_memory_cache:sum
          expr: sum(container_memory_cache{pod!=""}) BY (pod, container)
Note: If this is the first new custom rule, it is created immediately. For changes to the ConfigMap, the configuration is automatically reloaded by the config-reload sidecar within observability-thanos-ruler.
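For reference, a complete ConfigMap that wraps the earlier alert rule example might resemble the following content; the file name in the oc apply command is arbitrary:

apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-ruler-custom-rules
  namespace: open-cluster-management-observability
data:
  custom_rules.yaml: |
    groups:
      - name: cluster-health
        rules:
        # ... alert and recording rules from the previous examples ...

Apply it with: oc apply -f thanos-ruler-custom-rules.yaml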
To verify that the alert rules are functioning correctly, launch the Grafana dashboard, navigate to the Explore page, and query ALERTS. The alert is only available in Grafana if the alert is initiated.
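For example, to check only the sample rule from this section, you can filter the ALERTS series by alert name, which matches the alert field of the rule:

ALERTS{alertname="ClusterCPUHealth-jb"}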
2.2. Adding custom metrics
Add metrics to the metrics_list.yaml file so that they are collected from managed clusters.
Before you add a custom metric, verify that mco observability is enabled with the following command: oc get mco observability -o yaml. Check that the status.conditions.message value reads: Observability components are deployed and running.
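For reference, the relevant portion of the status might resemble the following snippet; the exact condition type and reason can vary by version:

status:
  conditions:
    - message: Observability components are deployed and running
      reason: Ready
      type: Ready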
Create a file named observability-metrics-custom-allowlist.yaml and add the name of the custom metric to the metrics_list.yaml parameter. Your YAML for the ConfigMap might resemble the following content:
kind: ConfigMap
apiVersion: v1
metadata:
  name: observability-metrics-custom-allowlist
data:
  metrics_list.yaml: |
    names:
      - node_memory_MemTotal_bytes
    rules:
    - record: apiserver_request_duration_seconds:histogram_quantile_90
      expr: histogram_quantile(0.90,sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\", verb!=\"WATCH\"}[5m])) by (verb,le))
For user workload metrics, see the Adding user workload metrics section.
- In the names section, add the name of the custom metric that you want to collect from the managed cluster.
- In the rules section, enter only one value for the expr and record parameter pair to define the query expression. The metrics are collected as the name that is defined in the record parameter from your managed cluster. The metric values that are returned are the results after you run the query expression.
- The names and rules sections are optional. You can use either one or both of the sections.
Create the observability-metrics-custom-allowlist ConfigMap in the open-cluster-management-observability namespace with the following command: oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml.
Verify that data from your custom metric is being collected by querying the metric from the Explore page of the Grafana dashboard. You can also use the custom metrics in your own dashboard. For more information about viewing the dashboard, see Using Grafana dashboards.
2.2.1. Adding user workload metrics
You can collect metrics from user-defined workloads in OpenShift Container Platform. Before you begin, you must enable monitoring; see Enabling monitoring for user-defined projects.
If you have a managed cluster with monitoring for user-defined workloads enabled, the user workloads are located in the test namespace and generate metrics. These metrics are collected by Prometheus from the OpenShift Container Platform user workload.

Collect the metrics from the user workloads by creating a ConfigMap named observability-metrics-custom-allowlist in the test namespace. View the following example:
kind: ConfigMap
apiVersion: v1
metadata:
  name: observability-metrics-custom-allowlist
  namespace: test
data:
  uwl_metrics_list.yaml: |
    names:
      - sample_metrics
- The uwl_metrics_list.yaml value is the key for the ConfigMap data.
- The value of the ConfigMap data is in YAML format. The names section includes the list of metric names that you want to collect from the test namespace. After you create the ConfigMap, the specified metrics from the target namespace are collected by the observability collector and pushed to the hub cluster.
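Assuming you saved the example as observability-metrics-custom-allowlist.yaml, you can create the ConfigMap with the following command:

oc apply -n test -f observability-metrics-custom-allowlist.yaml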
2.2.2. Removing default metrics
If you do not want data to be collected in your managed cluster for a specific metric, remove the metric from the observability-metrics-custom-allowlist.yaml file. When you remove a metric, the metric data is not collected in your managed clusters. As mentioned previously, first verify that mco observability is enabled.
Add the name of the default metric to the metrics_list.yaml parameter with a hyphen (-) at the start of the metric name. For example, -cluster_infrastructure_provider.
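For example, a ConfigMap that stops the collection of the cluster_infrastructure_provider default metric might resemble the following content:

kind: ConfigMap
apiVersion: v1
metadata:
  name: observability-metrics-custom-allowlist
data:
  metrics_list.yaml: |
    names:
      - -cluster_infrastructure_provider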
Create the observability-metrics-custom-allowlist ConfigMap in the open-cluster-management-observability namespace with the following command: oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml.
Verify that the specific metric is not being collected from your managed clusters. When you query the metric from the Grafana dashboard, the metric is not displayed.
2.3. Exporting metrics to external endpoints
You can customize observability to export the metrics to external endpoints that support the Prometheus Remote-Write specification in real time. For more information, see the Prometheus Remote-Write specification.
2.3.1. Creating the Kubernetes secret for an external endpoint
You must create a Kubernetes secret with the access information for the external endpoint in the open-cluster-management-observability namespace. View the following example secret:
apiVersion: v1
kind: Secret
metadata:
  name: victoriametrics
  namespace: open-cluster-management-observability
type: Opaque
stringData:
  ep.yaml: |
    url: http://victoriametrics:8428/api/v1/write
    http_client_config:
      basic_auth:
        username: test
        password: test
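Assuming you saved the example secret as victoriametrics-secret.yaml (a hypothetical file name), you can create it with the following command:

oc apply -n open-cluster-management-observability -f victoriametrics-secret.yaml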
The ep.yaml value is the key of the content and is used in the MultiClusterObservability custom resource in the next step. Currently, observability supports exporting metrics to endpoints without any security checks, with basic authentication, or with tls enabled. View the following tables for the full list of supported parameters:
Name | Description | Schema |
---|---|---|
url | URL for the external endpoint. | string |
http_client_config | Advanced configuration for the HTTP client. | HttpClientConfig |

HttpClientConfig

Name | Description | Schema |
---|---|---|
basic_auth | HTTP client configuration for basic authentication. | BasicAuth |
tls_config | HTTP client configuration for TLS. | TLSConfig |

BasicAuth

Name | Description | Schema |
---|---|---|
username | User name for basic authorization. | string |
password | Password for basic authorization. | string |

TLSConfig

Name | Description | Schema |
---|---|---|
secret_name | Name of the secret that contains certificates. | string |
ca_file_key | Key of the CA certificate in the secret (only optional if insecure_skip_verify is set to true). | string |
cert_file_key | Key of the client certificate in the secret. | string |
key_file_key | Key of the client key in the secret. | string |
insecure_skip_verify | Parameter to skip the verification for the target certificate. | bool |
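For example, an ep.yaml that authenticates with TLS instead of basic authentication might resemble the following sketch; the mtls-certs-secret name and the key names in the secret are hypothetical:

url: https://victoriametrics:8428/api/v1/write
http_client_config:
  tls_config:
    secret_name: mtls-certs-secret
    ca_file_key: ca.crt
    cert_file_key: tls.crt
    key_file_key: tls.key
    insecure_skip_verify: false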
2.3.2. Updating the MultiClusterObservability custom resource
After you create the Kubernetes secret, you must update the MultiClusterObservability custom resource to add writeStorage in the spec.storageConfig parameter. View the following example:
spec:
  storageConfig:
    writeStorage:
    - key: ep.yaml
      name: victoriametrics
The value for writeStorage is a list. You can add an item to the list when you want to export metrics to one external endpoint. If you add more than one item to the list, the metrics are exported to multiple external endpoints. Each item contains two attributes: name and key. Name is the name of the Kubernetes secret that contains the endpoint access information, and key is the key of the content in the secret.
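For example, a writeStorage list that exports metrics to two external endpoints might resemble the following content, where my-second-endpoint is a hypothetical second secret:

spec:
  storageConfig:
    writeStorage:
    - key: ep.yaml
      name: victoriametrics
    - key: ep.yaml
      name: my-second-endpoint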
2.3.3. Viewing the status of metric export
After the metrics export is enabled, you can view the status of the metrics export by checking the acm_remote_write_requests_total metric. From the OpenShift console of your hub cluster, navigate to the Metrics page by clicking Metrics in the Observe section.
Then query the acm_remote_write_requests_total metric. The value of that metric is the total number of requests with a specific response for one external endpoint, on one observatorium API instance. The name label is the name for the external endpoint. The code label is the return code of the HTTP request for the metrics export.
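For example, to view only failed export requests grouped by endpoint and return code, you might run a query such as the following, which assumes that 2xx codes indicate success:

sum(acm_remote_write_requests_total{code!~"2.."}) by (name, code)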
2.4. Adding advanced configuration
Add the advanced configuration section to update the retention for each observability component, according to your needs.
Edit the MultiClusterObservability custom resource and add the advanced section with the following command: oc edit mco observability -o yaml. Your YAML file might resemble the following contents:
spec:
  advanced:
    retentionConfig:
      blockDuration: 2h
      deleteDelay: 48h
      retentionInLocal: 24h
      retentionResolutionRaw: 30d
      retentionResolution5m: 180d
      retentionResolution1h: 0d
    receive:
      resources:
        limits:
          memory: 4096Gi
      replicas: 3
For descriptions of all the parameters that can be added to the advanced configuration, see the Observability API.
2.5. Updating the MultiClusterObservability custom resource replicas from the console
If your workload increases, increase the number of replicas of your observability pods. Navigate to the Red Hat OpenShift Container Platform console from your hub cluster. Locate the MultiClusterObservability custom resource, and update the replicas parameter value for the component where you want to change the replicas. Your updated YAML might resemble the following content:
spec:
  advanced:
    receive:
      replicas: 6
For more information about the parameters within the mco observability custom resource, see the Observability API.
2.6. Customizing route certificate
If you want to customize the OpenShift Container Platform route certificate, you must add the routes in the alt_names section. To ensure that your OpenShift Container Platform routes are accessible, add the following information: alertmanager.apps.<domainname>, observatorium-api.apps.<domainname>, rbac-query-proxy.apps.<domainname>.
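For example, if you generate the certificate with OpenSSL, the alt_names section of a hypothetical openssl.cnf configuration might resemble the following sketch; replace <domainname> with your cluster domain:

[ alt_names ]
DNS.1 = alertmanager.apps.<domainname>
DNS.2 = observatorium-api.apps.<domainname>
DNS.3 = rbac-query-proxy.apps.<domainname>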
Note: Users are responsible for certificate rotations and updates.
2.7. Customizing certificates for accessing the object store
Complete the following steps to customize certificates for accessing the object store:
- Edit the http_config section by adding the certificate in the object store secret. View the following example:

thanos.yaml: |
  type: s3
  config:
    bucket: "thanos"
    endpoint: "minio:9000"
    insecure: false
    access_key: "minio"
    secret_key: "minio123"
    http_config:
      tls_config:
        ca_file: /etc/minio/certs/ca.crt
        insecure_skip_verify: false
- Add the object store secret in the open-cluster-management-observability namespace. The secret must contain the ca.crt that you defined in the previous secret example. If you want to enable Mutual TLS, you must also provide the public.crt and private.key in the same secret. View the following example:

thanos.yaml: |
  type: s3
  config:
    ...
    http_config:
      tls_config:
        ca_file: /etc/minio/certs/ca.crt 1
        cert_file: /etc/minio/certs/public.crt
        key_file: /etc/minio/certs/private.key
        insecure_skip_verify: false

1: The path to the certificates and key values for the thanos-object-storage secret.
- Configure the secret name by updating the tlsSecretName parameter in the MultiClusterObservability custom resource. View the following example, where the secret name is tls-certs-secret:

metricObjectStorage:
  key: thanos.yaml
  name: thanos-object-storage
  tlsSecretName: tls-certs-secret
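A secret such as tls-certs-secret can be created from your certificate files with a command like the following; the local file paths are hypothetical:

oc create secret generic tls-certs-secret -n open-cluster-management-observability --from-file=ca.crt=./ca.crt --from-file=public.crt=./public.crt --from-file=private.key=./private.key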
- Mount the secret in the tlsSecretMountPath resource of all components that need to access the object store, which includes the receiver, store, ruler, and compact components. Use the name of your existing TLS secret for the tlsSecretName value. See the following example:

metricObjectStorage:
  key: thanos.yaml
  name: thanos-object-storage
  tlsSecretName: <existing-tls-certs-secret>
  tlsSecretMountPath: /etc/minio/certs
- To verify that you can access the object store, check that the pods are displayed.
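For example, you can list the pods with the following command:

oc get pods -n open-cluster-management-observability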
2.8. Viewing and exploring data
View the data from your managed clusters by accessing Grafana from the hub cluster. You can query specific alerts and add filters for the query.
For example, to query cluster_infrastructure_provider from a single node cluster, use the following query expression: cluster_infrastructure_provider{clusterType="SNO"}
Note: Do not set the ObservabilitySpec.resources.CPU.limits parameter if observability is enabled on single node managed clusters. When you set the CPU limits, it causes the observability pod to be counted against the capacity for your managed cluster. See the reference for Management Workload Partitioning in the Additional resources section.
2.8.1. Viewing historical data
When you query historical data, manually set your query parameter options to control how much data is displayed from the dashboard. Complete the following steps:
- From your hub cluster, select the Grafana link that is in the console header.
- Edit your cluster dashboard by selecting Edit Panel.
- From the Query front-end data source in Grafana, click the Query tab.
- Select $datasource.
- If you want to see more data, increase the value of the Step parameter section. If the Step parameter section is empty, it is automatically calculated.
- Find the Custom query parameters field and select max_source_resolution=auto.
- To verify that the data is displayed, refresh your Grafana page.
Your query data appears in the Grafana dashboard.
2.8.2. Viewing the etcd table
View the etcd table from the hub cluster dashboard in Grafana to learn about the stability of etcd as a data store.
Select the Grafana link from your hub cluster to view the etcd table data, which is collected from your hub cluster. The Leader election changes across managed clusters are displayed.
2.8.3. Viewing the cluster fleet service-level overview for the Kubernetes API server dashboard
View the cluster fleet Kubernetes API service-level overview from the hub cluster dashboard in Grafana.
After you navigate to the Grafana dashboard, access the managed dashboard menu by selecting Kubernetes > Service-Level Overview > API Server. The Fleet Overview and Top Cluster details are displayed.
View the total number of clusters that are exceeding or meeting the targeted service-level objective (SLO) value for the past seven or 30-day period, offending and non-offending clusters, and API Server Request Duration.
2.8.4. Viewing the cluster service-level overview for the Kubernetes API server dashboard
View the Kubernetes API service-level overview table from the hub cluster dashboard in Grafana.
After you navigate to the Grafana dashboard, access the managed dashboard menu by selecting Kubernetes > Service-Level Overview > API Server. The Fleet Overview and Top Cluster details are displayed.
View the error budget for the past seven or 30-day period, the remaining downtime, and trend.
2.9. Disabling observability
You can disable observability, which stops data collection on the Red Hat Advanced Cluster Management hub cluster.
2.9.1. Disabling observability on all clusters
Disable observability by removing observability components on all managed clusters.
Update the multicluster-observability-operator resource by setting enableMetrics to false. Your updated resource might resemble the following change:
spec:
  imagePullPolicy: Always
  imagePullSecret: multiclusterhub-operator-pull-secret
  observabilityAddonSpec: # The ObservabilityAddonSpec defines the global settings for all managed clusters which have observability add-on enabled
    enableMetrics: false # indicates whether the observability add-on pushes metrics to the hub server
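As an alternative to editing the YAML by hand, a merge patch such as the following sketch makes the same change; it assumes the mco resource short name that is used earlier in this chapter:

oc patch mco observability --type=merge -p '{"spec":{"observabilityAddonSpec":{"enableMetrics":false}}}'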
2.9.2. Disabling observability on a single cluster
Disable observability by removing observability components on specific managed clusters. Add the observability: disabled label to the managedclusters.cluster.open-cluster-management.io custom resource.
From the Red Hat Advanced Cluster Management console Clusters page, add the observability=disabled label to the specified cluster.
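Alternatively, you can add the label from the command line, where <cluster-name> is the name of your managed cluster:

oc label managedcluster <cluster-name> observability=disabled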
Note: When a managed cluster with the observability component is detached, the metrics-collector deployments are removed.
2.10. Additional resources
- For more details about observability alerts, see Observability alerts.
- To learn more about alert forwarding, see the Prometheus Alertmanager documentation.
- For more topics about the observability service, see Observability service introduction.
- Refer to Prometheus configuration for more information.
- See Management Workload Partitioning for more information.
- Return to the beginning of this topic, Customizing observability.