Ce contenu n'est pas disponible dans la langue sélectionnée.

Chapter 4. Customizing observability configuration

Customize the observability configuration to the specific needs of your environment, after you enable observability.

To learn more about how you want to manage and view cluster fleet data that the observability service collects, read the following sections:

Required access: Cluster administrator

4.1. Creating custom rules

Create custom rules for the observability installation by adding Prometheus recording rules and alerting rules to the observability resource.

To precalculate expensive expressions, use the recording rules abilities. The results are saved as a new set of time series. With the alerting rules, you can specify the alert conditions based on how you want to send an alert to an external service.

Note: When you update your custom rules, observability-thanos-rule pods restart automatically.

Define custom rules with Prometheus to create alert conditions, and send notifications to an external messaging service. View the following examples of custom rules:

Create a custom alert rule. Create a config map named thanos-ruler-custom-rules in the open-cluster-management-observability namespace. You must name the key, custom_rules.yaml, as shown in the following example. You can create multiple rules in the configuration.

Create a custom alert rule that notifies you when your CPU usage passes your defined value. Your YAML might resemble the following content:

data:
  custom_rules.yaml: |
    groups:
      - name: cluster-health
        rules:
        - alert: ClusterCPUHealth-jb
          annotations:
            summary: Notify when CPU utilization on a cluster is greater than the defined utilization limit
            description: "The cluster has a high CPU usage: {{ $value }} core for {{ $labels.cluster }} {{ $labels.clusterID }}."
          expr: |
            max(cluster:cpu_usage_cores:sum) by (clusterID, cluster, prometheus) > 0
          for: 5s
          labels:
            cluster: "{{ $labels.cluster }}"
            prometheus: "{{ $labels.prometheus }}"
            severity: critical

The default alert rules are in the thanos-ruler-default-rules config map of the open-cluster-management-observability namespace.

Create a custom recording rule within the thanos-ruler-custom-rules config map. Create a recording rule that gives you the ability to get the sum of the container memory cache of a pod. Your YAML might resemble the following content:
```
data:
  custom_rules.yaml: |
    groups:
      - name: container-memory
        rules:
        - record: pod:container_memory_cache:sum
          expr: sum(container_memory_cache{pod!=""}) BY (pod, container)
```
Note: After you make changes to the config map, the configuration automatically reloads. The configuration reloads because of the config-reload within the observability-thanos-ruler sidecar.
To verify that the alert rules are functioning correctly, go to the Grafana dashboard, select the Explore page, and query ALERTS. The alert is only available in Grafana if you created the alert.

4.2. Adding custom metrics

Add metrics to the metrics_list.yaml file to collect from managed clusters. Complete the following steps:

Before you add a custom metric, verify that mco observability is enabled with the following command:
```
oc get mco observability -o yaml
```
Check for the following message in the status.conditions.message section reads:
```
Observability components are deployed and running
```
Create the observability-metrics-custom-allowlist config map in the open-cluster-management-observability namespace with the following command:
```
oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml
```
Add the name of the custom metric to the metrics_list.yaml parameter. Your YAML for the config map might resemble the following content:
```
kind: ConfigMap
apiVersion: v1
metadata:
  name: observability-metrics-custom-allowlist
data:
  metrics_list.yaml: |
    names: 1
      - node_memory_MemTotal_bytes
    rules: 2
    - record: apiserver_request_duration_seconds:histogram_quantile_90
      expr: histogram_quantile(0.90,sum(rate(apiserver_request_duration_seconds_bucket{job=\"apiserver\",
        verb!=\"WATCH\"}[5m])) by (verb,le))
```
1
Optional: Add the name of the custom metrics that are to be collected from the managed cluster.
2
Optional: Enter only one value for the expr and record parameter pair to define the query expression. The metrics are collected as the name that is defined in the record parameter from your managed cluster. The metric value returned are the results after you run the query expression.
You can use either one or both of the sections. For user workload metrics, see the Adding user workload metrics section.
Verify the data collection from your custom metric by querying the metric from the Grafana dashboard Explore page. You can also use the custom metrics in your own dashboard.

4.2.1. Adding user workload metrics

Collect OpenShift Container Platform user-defined metrics from workloads in OpenShift Container Platform to display the metrics from your Grafana dashboard. Complete the following steps:

Enable monitoring on your OpenShift Container Platform cluster. See Enabling monitoring for user-defined projects in the Additional resources section.
If you have a managed cluster with monitoring for user-defined workloads enabled, the user workloads are located in the test namespace and generate metrics. These metrics are collected by Prometheus from the OpenShift Container Platform user workload.
Add user workload metrics to the observability-metrics-custom-allowlist config map to collect the metrics in the test namespace. View the following example:
```
kind: ConfigMap
apiVersion: v1
metadata:
  name: observability-metrics-custom-allowlist
  namespace: test
data:
  uwl_metrics_list.yaml: 1
    names: 2
      - sample_metrics
```
1
Enter the key for the config map data.
2
Enter the value of the config map data in YAML format. The names section includes the list of metric names, which you want to collect from the test namespace. After you create the config map, the observability collector collects and pushes the metrics from the target namespace to the hub cluster.

4.2.2. Removing default metrics

If you do not want to collect data for a specific metric from your managed cluster, remove the metric from the observability-metrics-custom-allowlist.yaml file. When you remove a metric, the metric data is not collected from your managed clusters. Complete the following steps to remove a default metric:

Verify that mco observability is enabled by using the following command:
```
oc get mco observability -o yaml
```
Add the name of the default metric to the metrics_list.yaml parameter with a hyphen - at the start of the metric name. View the following metric example:
```
-cluster_infrastructure_provider`
```
Create the observability-metrics-custom-allowlist config map in the open-cluster-management-observability namespace with the following command:
```
oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml
```
Verify that the observability service is not collecting the specific metric from your managed clusters. When you query the metric from the Grafana dashboard, the metric is not displayed.

4.3. Adding advanced configuration for retention

Add the advanced configuration section to update the retention for each observability component, according to your needs.

Edit the MultiClusterObservability custom resource and add the advanced section with the following command:

oc edit mco observability -o yaml

Your YAML file might resemble the following contents:

spec:
  advanced:
    retentionConfig:
      blockDuration: 2h
      deleteDelay: 48h
      retentionInLocal: 24h
      retentionResolutionRaw: 30d
      retentionResolution5m: 180d
      retentionResolution1h: 0d
    receive:
      resources:
        limits:
          memory: 4096Gi
      replicas: 3

For descriptions of all the parameters that can added into the advanced configuration, see the Observability API documentation.

4.4. Dynamic metrics for single-node OpenShift clusters

Dynamic metrics collection supports automatic metric collection based on certain conditions. By default, a single-node OpenShift cluster does not collect pod and container resource metrics. Once a single-node OpenShift cluster reaches a specific level of resource consumption, the defined granular metrics are collected dynamically. When the cluster resource consumption is consistently less than the threshold for a period of time, granular metric collection stops.

The metrics are collected dynamically based on the conditions on the managed cluster specified by a collection rule. Because these metrics are collected dynamically, the following Red Hat Advanced Cluster Management Grafana dashboards do not display any data. When a collection rule is activated and the corresponding metrics are collected, the following panels display data for the duration of the time that the collection rule is initiated:

Kubernetes/Compute Resources/Namespace (Pods)
Kubernetes/Compute Resources/Namespace (Workloads)
Kubernetes/Compute Resources/Nodes (Pods)
Kubernetes/Compute Resources/Pod
Kubernetes/Compute Resources/Workload A collection rule includes the following conditions:
A set of metrics to collect dynamically.
Conditions written as a PromQL expression.
A time interval for the collection, which must be set to true.
A match expression to select clusters where the collect rule must be evaluated.

By default, collection rules are evaluated continuously on managed clusters every 30 seconds, or at a specific time interval. The lowest value between the collection interval and time interval takes precedence. Once the collection rule condition persists for the duration specified by the for attribute, the collection rule starts and the metrics specified by the rule are automatically collected on the managed cluster. Metrics collection stops automatically after the collection rule condition no longer exists on the managed cluster, at least 15 minutes after it starts.

The collection rules are grouped together as a parameter section named collect_rules, where it can be enabled or disabled as a group. Red Hat Advanced Cluster Management installation includes the collection rule group, SNOResourceUsage with two default collection rules: HighCPUUsage and HighMemoryUsage. The HighCPUUsage collection rule begins when the node CPU usage exceeds 70%. The HighMemoryUsage collection rule begins if the overall memory utilization of the single-node OpenShift cluster exceeds 70% of the available node memory. Currently, the previously mentioned thresholds are fixed and cannot be changed. When a collection rule begins for more than the interval specified by the for attribute, the system automatically starts collecting the metrics that are specified in the dynamic_metrics section.

View the list of dynamic metrics that from the collect_rules section, in the following YAML file:

collect_rules:
  - group: SNOResourceUsage
    annotations:
      description: >
        By default, a {sno} cluster does not collect pod and container resource metrics. Once a {sno} cluster
        reaches a level of resource consumption, these granular metrics are collected dynamically.
        When the cluster resource consumption is consistently less than the threshold for a period of time,
        collection of the granular metrics stops.
    selector:
      matchExpressions:
        - key: clusterType
          operator: In
          values: ["{sno}"]
    rules:
    - collect: SNOHighCPUUsage
      annotations:
        description: >
          Collects the dynamic metrics specified if the cluster cpu usage is constantly more than 70% for 2 minutes
      expr: (1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))) * 100 > 70
      for: 2m
      dynamic_metrics:
        names:
          - container_cpu_cfs_periods_total
          - container_cpu_cfs_throttled_periods_total
          - kube_pod_container_resource_limits
          - kube_pod_container_resource_requests
          - namespace_workload_pod:kube_pod_owner:relabel
          - node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
          - node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate
    - collect: SNOHighMemoryUsage
      annotations:
        description: >
          Collects the dynamic metrics specified if the cluster memory usage is constantly more than 70% for 2 minutes
      expr: (1 - sum(:node_memory_MemAvailable_bytes:sum) / sum(kube_node_status_allocatable{resource=\"memory\"})) * 100 > 70
      for: 2m
      dynamic_metrics:
        names:
          - kube_pod_container_resource_limits
          - kube_pod_container_resource_requests
          - namespace_workload_pod:kube_pod_owner:relabel
        matches:
          - __name__="container_memory_cache",container!=""
          - __name__="container_memory_rss",container!=""
          - __name__="container_memory_swap",container!=""
          - __name__="container_memory_working_set_bytes",container!=""

A collect_rules.group can be disabled in the custom-allowlist as shown in the following example. When a collect_rules.group is disabled, metrics collection reverts to the previous behavior. These metrics are collected at regularly, specified intervals:

collect_rules:
  - group: -SNOResourceUsage

The data is only displayed in Grafana when the rule is initiated.

4.5. Updating the MultiClusterObservability custom resource replicas from the console

If your workload increases, increase the number of replicas of your observability pods. Navigate to the Red Hat OpenShift Container Platform console from your hub cluster. Locate the MultiClusterObservability custom resource, and update the replicas parameter value for the component where you want to change the replicas. Your updated YAML might resemble the following content:

spec:
   advanced:
      receive:
         replicas: 6

For more information about the parameters within the mco observability custom resource, see the Observability API documentation.

4.6. Increasing and decreasing persistent volumes and persistent volume claims

Increase and decrease the persistent volume and persistent volume claims to change the amount of storage in your storage class. Complete the following steps:

To increase the size of the persistent volume, update the MultiClusterObservability custom resource if the storage class support expanding volumes.
To decrease the size of the persistent volumes remove the pods using the persistent volumes, delete the persistent volume and recreate them. You might experience data loss in the persistent volume. Complete the following steps:
1. Pause the MultiClusterObservability operator by adding the annotation mco-pause: "true" to the MultiClusterObservability custom resource.
2. Look for the stateful sets or deployments of the desired component. Change their replica count to 0. This initiates a shutdown, which involves uploading local data when applicable to avoid data loss. For example, the Thanos Receive stateful set is named observability-thanos-receive-default and has three replicas by default. Therefore, you are looking for the following persistent volume claims:
  - data-observability-thanos-receive-default-0
  - data-observability-thanos-receive-default-1
  - data-observability-thanos-receive-default-2
3. Delete the persistent volumes and persistent volume claims used by the desired component.
4. In the MultiClusterObservability custom resource, edit the storage size in the configuration of the component to the desired amount in the storage size field. Prefix with the name of the component.
5. Unpause the MultiClusterObservability operator by removing the previously added annotation.
6. To initiate a reconcilation after having the operator paused, delete the multicluster-observability-operator and observatorium-operator pods. The pods are recreated and reconciled immediately.
Verify that persistent volume and volume claims are updated by checking the MultiClusterObservability custom resource.

4.7. Customizing route certificate

If you want to customize the OpenShift Container Platform route certification, you must add the routes in the alt_names section. To ensure your OpenShift Container Platform routes are accessible, add the following information: alertmanager.apps.<domainname>, observatorium-api.apps.<domainname>, rbac-query-proxy.apps.<domainname>.

For more details, see Replacing certificates for alertmanager route in the Governance documentation.

Note: Users are responsible for certificate rotations and updates.

4.8. Customizing certificates for accessing the object store

Complete the following steps to customize certificates for accessing the object store:

Edit the http_config section by adding the certificate in the object store secret. View the following example:

 thanos.yaml: |
    type: s3
    config:
      bucket: "thanos"
      endpoint: "minio:9000"
      insecure: false
      access_key: "minio"
      secret_key: "minio123"
      http_config:
        tls_config:
          ca_file: /etc/minio/certs/ca.crt
          insecure_skip_verify: false

Add the object store secret in the open-cluster-management-observability namespace. The secret must contain the ca.crt that you defined in the previous secret example. If you want to enable mutual TLS, you need to add the public.crt and private.key keys in the earlier secret. View the following example:
```
 thanos.yaml: |
    type: s3
    config:
      ...
      http_config:
        tls_config:
          ca_file: /etc/minio/certs/ca.crt 1
          cert_file: /etc/minio/certs/public.crt
          key_file: /etc/minio/certs/private.key
          insecure_skip_verify: false
```
1
The path to certificates and key values for the thanos-object-storage secret.
Configure the secret name and mount path by updating the tlsSecretName and tlsSecretMountPath parameters in the MultiClusterObservability custom resource. View the following example where the secret name is tls-certs-secret and the mount path for the certificates and key value is the directory that is used in the prior example:
```
metricObjectStorage:
      key: thanos.yaml
      name: thanos-object-storage
      tlsSecretName: tls-certs-secret
      tlsSecretMountPath: /etc/minio/certs
```

Mount the secret in the tlsSecretMountPath resource of all components that need to access the object store by renaming the existing TLS. See the following example:

metricObjectStorage:
      key: thanos.yaml
      name: thanos-object-storage
      tlsSecretName: <existing-tls-certs-secret>
      tlsSecretMountPath: /etc/minio/certs
        receiver:
        store:
        ruler:
        compact:

To verify that you can access the object store, check that the pods are displayed.

4.9. Configuring proxy settings for observability add-ons

Configure the proxy settings to allow the communications from the managed cluster to access the hub cluster through a HTTP and HTTPS proxy server. Typically, add-ons do not need any special configuration to support HTTP and HTTPS proxy servers between a hub cluster and a managed cluster. But if you enabled the observability add-on, you must complete the proxy configuration.

4.10. Prerequisite

You have a hub cluster.
You have enabled the proxy settings between the hub cluster and managed cluster.

Complete the following steps to configure the proxy settings for the observability add-on:

Go to the cluster namespace on your hub cluster.

Create an AddOnDeploymentConfig resource with the proxy settings by adding a spec.proxyConfig parameter. View the following YAML example:

apiVersion: addon.open-cluster-management.io/v1alpha1
kind: AddOnDeploymentConfig
metadata:
  name: <addon-deploy-config-name>
  namespace: <managed-cluster-name>
spec:
  agentInstallNamespace: open-cluster-managment-addon-observability
  proxyConfig:
    httpsProxy: "http://<username>:<password>@<ip>:<port>" 1
    noProxy: ".cluster.local,.svc,172.30.0.1" 2

1: For this field, you can specify either a HTTP proxy or a HTTPS proxy.
2: Include the IP address of the kube-apiserver.

To get the IP address, run following command on your managed cluster:
```
oc -n default describe svc kubernetes | grep IP:
```

Go to the ManagedClusterAddOn resource and update it by referencing the AddOnDeploymentConfig resource that you made. View the following YAML example:

apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ManagedClusterAddOn
metadata:
  name: observability-controller
  namespace: <managed-cluster-name>
spec:
  installNamespace: open-cluster-managment-addon-observability
  configs:
  - group: addon.open-cluster-management.io
    resource: AddonDeploymentConfig
    name: <addon-deploy-config-name>
    namespace: <managed-cluster-name>

Verify the proxy settings. If you successfully configured the proxy settings, the metric collector deployed by the observability add-on agent on the managed cluster sends the data to the hub cluster. Complete the following steps:
1. Go to the hub cluster then the managed cluster on the Grafana dashboard.
2. View the metrics for the proxy settings.

4.11. Disabling proxy settings for observability add-ons

If your development needs change, you might need to disable the proxy setting for the observability add-ons you configured for the hub cluster and managed cluster. You can disable the proxy settings for the observability add-on at any time. Complete the following steps:

Go to the ManagedClusterAddOn resource.
Remove the referenced AddOnDeploymentConfig resource.

4.12. Additional resources

Refer to Prometheus configuration for more information. For more information about recording rules and alerting rules, refer to the recording rules and alerting rules from the Prometheus documentation.
For more information about viewing the dashboard, see Using Grafana dashboards.
See Exporting metrics to external endpoints.
See Enabling monitoring for user-defined projects.
See the Observability API.
For information about updating the certificate for the alertmanager route, see Replacing certificates for alertmanager.
For more details about observability alerts, see Observability alerts
To learn more about alert forwarding, see the Prometheus Alertmanager documentation.
See Observability alerts for more information.
For more topics about the observability service, see Observability service introduction.
See Management Workload Partitioning for more information.
Return to the beginning of this topic, Customizing observability.