Chapter 4. Customizing observability configuration
Customize the observability configuration to the specific needs of your environment, after you enable observability.
To learn more about how you want to manage and view cluster fleet data that the observability service collects, read the following sections:
Required access: Cluster administrator
- Creating custom rules
- Adding custom metrics
- Adding advanced configuration
- Updating the MultiClusterObservability custom resource replicas from the console
- Customizing route certification
- Customizing certificates for accessing the object store
- Configuring proxy settings for observability add-ons
- Disabling proxy settings for observability add-ons
4.1. Creating custom rules
Create custom rules for the observability installation by adding Prometheus recording rules and alerting rules to the observability resource.
To precalculate expensive expressions, use the recording rules abilities. The results are saved as a new set of time series. With the alerting rules, you can specify the alert conditions based on how you want to send an alert to an external service.
Note: When you update your custom rules, observability-thanos-rule
pods restart automatically.
Define custom rules with Prometheus to create alert conditions, and send notifications to an external messaging service. View the following examples of custom rules:
Create a custom alert rule. Create a config map named
thanos-ruler-custom-rules
in theopen-cluster-management-observability
namespace. You must name the key,custom_rules.yaml
, as shown in the following example. You can create multiple rules in the configuration.- Create a custom alert rule that notifies you when your CPU usage passes your defined value. Your YAML might resemble the following content:
data: custom_rules.yaml: | groups: - name: cluster-health rules: - alert: ClusterCPUHealth-jb annotations: summary: Notify when CPU utilization on a cluster is greater than the defined utilization limit description: "The cluster has a high CPU usage: {{ $value }} core for {{ $labels.cluster }} {{ $labels.clusterID }}." expr: | max(cluster:cpu_usage_cores:sum) by (clusterID, cluster, prometheus) > 0 for: 5s labels: cluster: "{{ $labels.cluster }}" prometheus: "{{ $labels.prometheus }}" severity: critical
+
-
The default alert rules are in the
thanos-ruler-default-rules
config map of theopen-cluster-management-observability
namespace.
Create a custom recording rule within the
thanos-ruler-custom-rules
config map. Create a recording rule that gives you the ability to get the sum of the container memory cache of a pod. Your YAML might resemble the following content:data: custom_rules.yaml: | groups: - name: container-memory rules: - record: pod:container_memory_cache:sum expr: sum(container_memory_cache{pod!=""}) BY (pod, container)
Note: After you make changes to the config map, the configuration automatically reloads. The configuration reloads because of the
config-reload
within theobservability-thanos-ruler
sidecar.-
To verify that the alert rules are functioning correctly, go to the Grafana dashboard, select to the Explore page, and query
ALERTS
. The alert is only available in Grafana if you created the alert.
4.2. Adding custom metrics
Add metrics to the metrics_list.yaml
file, to collect from managed clusters. Complete the following steps:
Before you add a custom metric, verify that
mco observability
is enabled with the following command:oc get mco observability -o yaml
Check for the following message in the
status.conditions.message
section reads:Observability components are deployed and running
Create the
observability-metrics-custom-allowlist
config map in theopen-cluster-management-observability
namespace with the following command:oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml
Add the name of the custom metric to the
metrics_list.yaml
parameter. Your YAML for the config map might resemble the following content:kind: ConfigMap apiVersion: v1 metadata: name: observability-metrics-custom-allowlist data: metrics_list.yaml: | names: 1 - node_memory_MemTotal_bytes rules: 2 - record: apiserver_request_duration_seconds:histogram_quantile_90 expr: histogram_quantile(0.90,sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\", verb!=\"WATCH\"}[5m])) by (verb,le))
- 1
- Optional: Add the name of the custom metrics that are to be collected from the managed cluster.
- 2
- Optional: Enter only one value for the
expr
andrecord
parameter pair to define the query expression. The metrics are collected as the name that is defined in therecord
parameter from your managed cluster. The metric value returned are the results after you run the query expression.
You can use either one or both of the sections. For user workload metrics, see the Adding user workload metrics section.
- Verify the data collection from your custom metric by querying the metric from the Explore page. You can also use the custom metrics in your own dashboard.
4.2.1. Adding user workload metrics
Collect OpenShift Container Platform user-defined metrics from workloads in OpenShift Container Platform to display the metrics from your Grafana dashboard. Complete the following steps:
Enable monitoring on your OpenShift Container Platform cluster. See Enabling monitoring for user-defined projects in the Additional resources section.
If you have a managed cluster with monitoring for user-defined workloads enabled, the user workloads are located in the
test
namespace and generate metrics. These metrics are collected by Prometheus from the OpenShift Container Platform user workload.Add user workload metrics to the
observability-metrics-custom-allowlist
config map to collect the metrics in thetest
namespace. View the following example:kind: ConfigMap apiVersion: v1 metadata: name: observability-metrics-custom-allowlist namespace: test data: uwl_metrics_list.yaml: 1 names: 2 - sample_metrics
- 1
- Enter the key for the config map data.
- 2
- Enter the value of the config map data in YAML format. The
names
section includes the list of metric names, which you want to collect from thetest
namespace. After you create the config map, the observability collector collects and pushes the metrics from the target namespace to the hub cluster.
4.2.2. Removing default metrics
If you do not want to collect data for a specific metric from your managed cluster, remove the metric from the observability-metrics-custom-allowlist.yaml
file. When you remove a metric, the metric data is not collected from your managed clusters. Complete the following steps to remove a default metric:
Verify that
mco observability
is enabled by using the following command:oc get mco observability -o yaml
Add the name of the default metric to the
metrics_list.yaml
parameter with a hyphen-
at the start of the metric name. View the following metric example:-cluster_infrastructure_provider
Create the
observability-metrics-custom-allowlist
config map in theopen-cluster-management-observability
namespace with the following command:oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml
- Verify that the observability service is not collecting the specific metric from your managed clusters. When you query the metric from the Grafana dashboard, the metric is not displayed.
4.3. Adding advanced configuration for retention
Add the advanced
configuration section to update the retention for each observability component, according to your needs.
Edit the MultiClusterObservability
custom resource and add the advanced
section with the following command:
oc edit mco observability -o yaml
Your YAML file might resemble the following contents:
spec: advanced: retentionConfig: blockDuration: 2h deleteDelay: 48h retentionInLocal: 24h retentionResolutionRaw: 30d retentionResolution5m: 180d retentionResolution1h: 0d receive: resources: limits: memory: 4096Gi replicas: 3
For descriptions of all the parameters that can added into the advanced
configuration, see the Observability API documentation.
4.4. Dynamic metrics for single-node OpenShift clusters
Dynamic metrics collection supports automatic metric collection based on certain conditions. By default, a single-node OpenShift cluster does not collect pod and container resource metrics. Once a single-node OpenShift cluster reaches a specific level of resource consumption, the defined granular metrics are collected dynamically. When the cluster resource consumption is consistently less than the threshold for a period of time, granular metric collection stops.
The metrics are collected dynamically based on the conditions on the managed cluster specified by a collection rule. Because these metrics are collected dynamically, the following Red Hat Advanced Cluster Management Grafana dashboards do not display any data. When a collection rule is activated and the corresponding metrics are collected, the following panels display data for the duration of the time that the collection rule is initiated:
- Kubernetes/Compute Resources/Namespace (Pods)
- Kubernetes/Compute Resources/Namespace (Workloads)
- Kubernetes/Compute Resources/Nodes (Pods)
- Kubernetes/Compute Resources/Pod
- Kubernetes/Compute Resources/Workload A collection rule includes the following conditions:
- A set of metrics to collect dynamically.
- Conditions written as a PromQL expression.
-
A time interval for the collection, which must be set to
true
. - A match expression to select clusters where the collect rule must be evaluated.
By default, collection rules are evaluated continuously on managed clusters every 30 seconds, or at a specific time interval. The lowest value between the collection interval and time interval takes precedence. Once the collection rule condition persists for the duration specified by the for
attribute, the collection rule starts and the metrics specified by the rule are automatically collected on the managed cluster. Metrics collection stops automatically after the collection rule condition no longer exists on the managed cluster, at least 15 minutes after it starts.
The collection rules are grouped together as a parameter section named collect_rules
, where it can be enabled or disabled as a group. Red Hat Advanced Cluster Management installation includes the collection rule group, SNOResourceUsage
with two default collection rules: HighCPUUsage
and HighMemoryUsage
. The HighCPUUsage
collection rule begins when the node CPU usage exceeds 70%. The HighMemoryUsage
collection rule begins if the overall memory utilization of the single-node OpenShift cluster exceeds 70% of the available node memory. Currently, the previously mentioned thresholds are fixed and cannot be changed. When a collection rule begins for more than the interval specified by the for
attribute, the system automatically starts collecting the metrics that are specified in the dynamic_metrics
section.
View the list of dynamic metrics that from the collect_rules
section, in the following YAML file:
collect_rules: - group: SNOResourceUsage annotations: description: > By default, a {sno} cluster does not collect pod and container resource metrics. Once a {sno} cluster reaches a level of resource consumption, these granular metrics are collected dynamically. When the cluster resource consumption is consistently less than the threshold for a period of time, collection of the granular metrics stops. selector: matchExpressions: - key: clusterType operator: In values: ["{sno}"] rules: - collect: SNOHighCPUUsage annotations: description: > Collects the dynamic metrics specified if the cluster cpu usage is constantly more than 70% for 2 minutes expr: (1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100 > 70 for: 2m dynamic_metrics: names: - container_cpu_cfs_periods_total - container_cpu_cfs_throttled_periods_total - kube_pod_container_resource_limits - kube_pod_container_resource_requests - namespace_workload_pod:kube_pod_owner:relabel - node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate - node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate - collect: SNOHighMemoryUsage annotations: description: > Collects the dynamic metrics specified if the cluster memory usage is constantly more than 70% for 2 minutes expr: (1 - sum(:node_memory_MemAvailable_bytes:sum) / sum(kube_node_status_allocatable{resource=\"memory\"})) * 100 > 70 for: 2m dynamic_metrics: names: - kube_pod_container_resource_limits - kube_pod_container_resource_requests - namespace_workload_pod:kube_pod_owner:relabel matches: - __name__="container_memory_cache",container!="" - __name__="container_memory_rss",container!="" - __name__="container_memory_swap",container!="" - __name__="container_memory_working_set_bytes",container!=""
A collect_rules.group
can be disabled in the custom-allowlist
as shown in the following example. When a collect_rules.group
is disabled, metrics collection reverts to the previous behavior. These metrics are collected at regularly, specified intervals:
collect_rules: - group: -SNOResourceUsage
The data is only displayed in Grafana when the rule is initiated.
4.5. Updating the MultiClusterObservability custom resource replicas from the console
If your workload increases, increase the number of replicas of your observability pods. Navigate to the Red Hat OpenShift Container Platform console from your hub cluster. Locate the MultiClusterObservability
custom resource, and update the replicas
parameter value for the component where you want to change the replicas. Your updated YAML might resemble the following content:
spec: advanced: receive: replicas: 6
For more information about the parameters within the mco observability
custom resource, see the Observability API documentation.
4.6. Customizing route certificate
If you want to customize the OpenShift Container Platform route certification, you must add the routes in the alt_names
section. To ensure your OpenShift Container Platform routes are accessible, add the following information: alertmanager.apps.<domainname>
, observatorium-api.apps.<domainname>
, rbac-query-proxy.apps.<domainname>
.
For more details, see Replacing certificates for alertmanager route in the Governance documentation.
Note: Users are responsible for certificate rotations and updates.
4.7. Customizing certificates for accessing the object store
Complete the following steps to customize certificates for accessing the object store:
Edit the
http_config
section by adding the certificate in the object store secret. View the following example:thanos.yaml: | type: s3 config: bucket: "thanos" endpoint: "minio:9000" insecure: false access_key: "minio" secret_key: "minio123" http_config: tls_config: ca_file: /etc/minio/certs/ca.crt insecure_skip_verify: false
Add the object store secret in the
open-cluster-management-observability
namespace. The secret must contain theca.crt
that you defined in the previous secret example. If you want to enable Mutual TLS, you need to providepublic.crt
, andprivate.key
in the previous secret. View the following example:thanos.yaml: | type: s3 config: ... http_config: tls_config: ca_file: /etc/minio/certs/ca.crt 1 cert_file: /etc/minio/certs/public.crt key_file: /etc/minio/certs/private.key insecure_skip_verify: false
- 1
- The path to certificates and key values for the
thanos-object-storage
secret.
Configure the secret name and mount path by updating the
tlsSecretName
andtlsSecretMountPath
parameters in theMultiClusterObservability
custom resource. View the following example where the secret name istls-certs-secret
and the mount path for the certificates and key value is the directory that is used in the prior example:metricObjectStorage: key: thanos.yaml name: thanos-object-storage tlsSecretName: tls-certs-secret tlsSecretMountPath: /etc/minio/certs
Mount the secret in the
tlsSecretMountPath
resource of all components that need to access the object store by renaming the existing TLS. See the following example:metricObjectStorage: key: thanos.yaml name: thanos-object-storage tlsSecretName: <existing-tls-certs-secret> tlsSecretMountPath: /etc/minio/certs receiver: store: ruler: compact:
- To verify that you can access the object store, check that the pods are displayed.
4.8. Configuring proxy settings for observability add-ons
Configure the proxy settings to allow the communications from the managed cluster to access the hub cluster through a HTTP and HTTPS proxy server. Typically, add-ons do not need any special configuration to support HTTP and HTTPS proxy servers between a hub cluster and a managed cluster. But if you enabled the observability add-on, you must complete the proxy configuration.
4.9. Prerequisite
- You have a hub cluster.
- You have enabled the proxy settings between the hub cluster and managed cluster.
Complete the following steps to configure the proxy settings for the observability add-on:
- Go to the cluster namespace on your hub cluster.
Create an
AddOnDeploymentConfig
resource with the proxy settings by adding aspec.proxyConfig
parameter. View the following YAML example:apiVersion: addon.open-cluster-management.io/v1alpha1 kind: AddOnDeploymentConfig metadata: name: <addon-deploy-config-name> namespace: <managed-cluster-name> spec: agentInstallNamespace: open-cluster-managment-addon-observability proxyConfig: httpsProxy: "http://<username>:<password>@<ip>:<port>" 1 noProxy: ".cluster.local,.svc,172.30.0.1" 2
To get the IP address, run following command on your managed cluster:
oc -n default describe svc kubernetes | grep IP:
Go to the
ManagedClusterAddOn
resource and update it by referencing theAddOnDeploymentConfig
resource that you made. View the following YAML example:apiVersion: addon.open-cluster-management.io/v1alpha1 kind: ManagedClusterAddOn metadata: name: observability-controller namespace: <managed-cluster-name> spec: installNamespace: open-cluster-managment-addon-observability configs: - group: addon.open-cluster-management.io resource: AddonDeploymentConfig name: <addon-deploy-config-name> namespace: <managed-cluster-name>
Verify the proxy settings. If you successfully configured the proxy settings, the metric collector deployed by the observability add-on agent on the managed cluster sends the data to the hub cluster. Complete the following steps:
- Go to the hub cluster then the managed cluster on the Grafana dashboard.
- View the metrics for the proxy settings.
4.10. Disabling proxy settings for observability add-ons
If your development needs change, you might need to disable the proxy setting for the observability add-ons you configured for the hub cluster and managed cluster. You can disable the proxy settings for the observability add-on at any time. Complete the following steps:
-
Go to the
ManagedClusterAddOn
resource. -
Remove the referenced
AddOnDeploymentConfig
resource.
4.11. Additional resources
- Refer to Prometheus configuration for more information. For more information about recording rules and alerting rules, refer to the recording rules and alerting rules from the Prometheus documentation.
- For more information about viewing the dashboard, see Using Grafana dashboards.
- See Exporting metrics to external endpoints.
- See Enabling monitoring for user-defined projects.
- See the Observability API.
- For information about updating the certificate for the alertmanager route, see Replacing certificates for alertmanager.
- For more details about observability alerts, see Observability alerts
- To learn more about alert forwarding, see the Prometheus Alertmanager documentation.
- See Observability alerts for more information.
- For more topics about the observability service, see Observability service introduction.
- See Management Workload Partitioning for more information.
- Return to the beginning of this topic, Customizing observability.