Chapter 2. Getting started


2.1. Maintenance and support for monitoring

Not all configuration options for the monitoring stack are exposed. The only supported way of configuring OpenShift Container Platform monitoring is by configuring the Cluster Monitoring Operator (CMO) using the options described in the Config map reference for the Cluster Monitoring Operator. Do not use other configurations, as they are unsupported.

Configuration paradigms might change across Prometheus releases, and such changes can be handled gracefully only if all configuration possibilities are controlled. If you use configurations other than those described in the Config map reference for the Cluster Monitoring Operator, your changes will be lost: by design, the CMO automatically reconciles any differences and resets unsupported changes to the originally defined state.

2.1.1. Support considerations for monitoring

Note

Backward compatibility for metrics, recording rules, or alerting rules is not guaranteed.

The following modifications are explicitly not supported:

  • Creating additional ServiceMonitor, PodMonitor, and PrometheusRule objects in the openshift-* and kube-* projects.
  • Modifying any resources or objects deployed in the openshift-monitoring or openshift-user-workload-monitoring projects. The resources created by the OpenShift Container Platform monitoring stack are not meant to be used by any other resources, as there are no guarantees about their backward compatibility.

    Note

    The Alertmanager configuration is deployed as the alertmanager-main secret resource in the openshift-monitoring namespace. If you have enabled a separate Alertmanager instance for user-defined alert routing, an Alertmanager configuration is also deployed as the alertmanager-user-workload secret resource in the openshift-user-workload-monitoring namespace. To configure additional routes for any instance of Alertmanager, you need to decode, modify, and then encode that secret. This procedure is a supported exception to the preceding statement.

  • Modifying resources of the stack. The OpenShift Container Platform monitoring stack ensures its resources are always in the state it expects them to be. If they are modified, the stack will reset them.
  • Deploying user-defined workloads to openshift-* and kube-* projects. These projects are reserved for Red Hat provided components and should not be used for user-defined workloads.
  • Enabling symptom-based monitoring by using the Probe custom resource definition (CRD) in Prometheus Operator.
  • Manually deploying monitoring resources into namespaces that have the openshift.io/cluster-monitoring: "true" label.
  • Adding the openshift.io/cluster-monitoring: "true" label to namespaces. This label is reserved only for the namespaces with core OpenShift Container Platform components and Red Hat certified components.
  • Installing custom Prometheus instances on OpenShift Container Platform. A custom instance is a Prometheus custom resource (CR) managed by the Prometheus Operator.
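
As an illustration of the supported Alertmanager exception described in the note above, the decoded contents of the alertmanager-main secret might look like the following sketch after one additional route is added. The receiver name team-frontend-page, the service matcher value, the timing values, and the PagerDuty key placeholder are all hypothetical; the configuration shipped in your cluster can differ, so treat this as an assumption-laden example rather than the default file.

```yaml
# Decoded alertmanager.yaml (sketch; all values are illustrative assumptions)
global:
  resolve_timeout: 5m
route:
  receiver: default            # fallback receiver for alerts that match no child route
  group_by:
  - namespace
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
  # Hypothetical additional route: send alerts for one service to its own receiver
  - matchers:
    - 'service = "example-app"'
    receiver: team-frontend-page
receivers:
- name: default
- name: team-frontend-page
  pagerduty_configs:
  - service_key: "<your_pagerduty_integration_key>"   # placeholder, not a real key
```

After editing, the file is encoded back into the same secret, which is the decode, modify, and encode cycle that the note refers to.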

2.1.2. Support policy for monitoring Operators

Monitoring Operators ensure that OpenShift Container Platform monitoring resources function as designed and tested. If Cluster Version Operator (CVO) control of an Operator is overridden, the Operator does not respond to configuration changes, reconcile the intended state of cluster objects, or receive updates.

While overriding CVO control for an Operator can be helpful during debugging, this is unsupported and the cluster administrator assumes full control of the individual component configurations and upgrades.

Overriding the Cluster Version Operator

The spec.overrides parameter can be added to the configuration for the CVO to allow administrators to provide a list of overrides to the behavior of the CVO for a component. Setting the spec.overrides[].unmanaged parameter to true for a component blocks cluster upgrades and alerts the administrator after a CVO override has been set:

Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.

Warning

Setting a CVO override puts the entire cluster in an unsupported state and prevents the monitoring stack from being reconciled to its intended state. This impacts the reliability features built into Operators and prevents updates from being received. Reported issues must be reproduced after removing any overrides for support to proceed.
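
To help recognize an override when reviewing a cluster, the following fragment sketches the general shape of a spec.overrides entry in the ClusterVersion resource. The choice of the cluster-monitoring-operator deployment as the target is an assumption made purely for illustration, and other fields of the resource are omitted. As the warning above states, a cluster with such an entry is in an unsupported state.

```yaml
# Fragment of the ClusterVersion resource (sketch; target component is an assumed example)
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version
spec:
  overrides:
  - group: apps                         # API group of the overridden object
    kind: Deployment
    name: cluster-monitoring-operator   # assumed example target
    namespace: openshift-monitoring
    unmanaged: true                     # CVO stops managing this object; upgrades are blocked
```

Removing the entry from spec.overrides returns the component to CVO management.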

2.1.3. Support version matrix for monitoring components

The following matrix contains information about versions of monitoring components for OpenShift Container Platform 4.12 and later releases:

Table 2.1. OpenShift Container Platform and component versions
OpenShift Container Platform  Prometheus Operator  Prometheus  Metrics Server  Alertmanager  kube-state-metrics agent  monitoring-plugin  node-exporter agent  Thanos
4.18                          0.78.1               2.55.1      0.7.2           0.27.0        2.13.0                    1.0.0              1.8.2                0.36.1
4.17                          0.75.2               2.53.1      0.7.1           0.27.0        2.13.0                    1.0.0              1.8.2                0.35.1
4.16                          0.73.2               2.52.0      0.7.1           0.26.0        2.12.0                    1.0.0              1.8.0                0.35.0
4.15                          0.70.0               2.48.0      0.6.4           0.26.0        2.10.1                    1.0.0              1.7.0                0.32.5
4.14                          0.67.1               2.46.0      N/A             0.25.0        2.9.2                     1.0.0              1.6.1                0.30.2
4.13                          0.63.0               2.42.0      N/A             0.25.0        2.8.1                     N/A                1.5.0                0.30.2
4.12                          0.60.1               2.39.1      N/A             0.24.0        2.6.0                     N/A                1.4.0                0.28.1

Note

The openshift-state-metrics agent and Telemeter Client are OpenShift-specific components. Therefore, their versions correspond with the versions of OpenShift Container Platform.

2.2. Core platform monitoring first steps

After OpenShift Container Platform is installed, core platform monitoring components immediately begin collecting metrics, which you can query and view. The default in-cluster monitoring stack includes the core platform Prometheus instance that collects metrics from your cluster and the core Alertmanager instance that routes alerts, among other components. Depending on who will use the monitoring stack and for what purposes, as a cluster administrator, you can further configure these monitoring components to suit the needs of different users in various scenarios.

2.2.1. Configuring core platform monitoring: Postinstallation steps

After OpenShift Container Platform is installed, cluster administrators typically configure core platform monitoring to suit their needs. These activities include setting up storage and configuring options for Prometheus, Alertmanager, and other monitoring components.

Note

By default, in a newly installed OpenShift Container Platform system, users can query and view collected metrics. You need only configure an alert receiver if you want users to receive alert notifications. Any other configuration options listed here are optional.

  • Create the cluster-monitoring-config ConfigMap object if it does not exist.
  • Configure notifications for default platform alerts so that Alertmanager can send alerts to an external notification system such as email, Slack, or PagerDuty.
  • For shorter-term data retention, configure persistent storage for Prometheus and Alertmanager to store metrics and alert data. Specify the metrics data retention parameters for Prometheus and Thanos Ruler.

    Important
    • In multi-node clusters, you must configure persistent storage for Prometheus, Alertmanager, and Thanos Ruler to ensure high availability.
    • By default, in a newly installed OpenShift Container Platform system, the monitoring ClusterOperator resource reports a PrometheusDataPersistenceNotConfigured status message to remind you that storage is not configured.
  • For longer-term data retention, configure the remote write feature to enable Prometheus to send ingested metrics to remote systems for storage.

    Important

    Be sure to add cluster ID labels to metrics for use with your remote write storage configuration.

  • Grant monitoring cluster roles to any non-administrator users that need to access certain monitoring features.
  • Assign tolerations to monitoring stack components so that administrators can move them to tainted nodes.
  • Set the body size limit for metrics collection to help avoid situations in which Prometheus consumes excessive amounts of memory when scraped targets return a response that contains a large amount of data.
  • Modify or create alerting rules for your cluster. These rules specify the conditions that trigger alerts, such as high CPU or memory usage, network latency, and so forth.
  • Specify resource limits and requests for monitoring components to ensure that the containers that run monitoring components have enough CPU and memory resources.
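
Several of the postinstallation steps listed above are expressed as entries in the cluster-monitoring-config ConfigMap object. The following is a minimal sketch, not a recommended configuration: the storage class name, the remote write URL, the taint key and value, and every size, duration, and resource figure are hypothetical. The available fields can also vary by release, so check the Config map reference for the Cluster Monitoring Operator for your version before using any of them.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      # Shorter-term retention: keep metrics locally on a persistent volume.
      retention: 24h                                  # assumed value
      volumeClaimTemplate:
        spec:
          storageClassName: example-storage-class     # hypothetical storage class
          resources:
            requests:
              storage: 40Gi                           # assumed size
      # Longer-term retention: forward ingested metrics to a remote system.
      remoteWrite:
      - url: "https://remote-write.example.com/api/v1/write"   # hypothetical endpoint
        writeRelabelConfigs:
        # Add a cluster ID label to every metric that is sent.
        - sourceLabels: [__tmp_openshift_cluster_id__]
          targetLabel: cluster_id
          action: replace
      # Limit the size of scrape responses that Prometheus accepts.
      enforcedBodySizeLimit: 40MB                     # assumed value
      # Resource requests and limits for the Prometheus container.
      resources:
        requests:
          cpu: 200m
          memory: 500Mi
        limits:
          cpu: 500m
          memory: 3Gi
      # Allow the pods to be scheduled on nodes with a matching taint.
      tolerations:
      - key: "example-key"                            # hypothetical taint
        operator: "Equal"
        value: "example-value"
        effect: "NoSchedule"
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: example-storage-class     # hypothetical storage class
          resources:
            requests:
              storage: 10Gi                           # assumed size
```

Because the CMO reconciles the monitoring stack from this object, settings are changed by editing the ConfigMap rather than the deployed resources themselves.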

With the monitoring stack configured to suit your needs, Prometheus collects metrics from the specified services and stores these metrics according to your settings. You can go to the Observe pages in the OpenShift Container Platform web console to view and query collected metrics, manage alerts, identify performance bottlenecks, and scale resources as needed:

  • View dashboards to visualize collected metrics, troubleshoot alerts, and monitor other information about your cluster.
  • Query collected metrics by creating PromQL queries or using predefined queries.

2.3. User workload monitoring first steps

As a cluster administrator, you can optionally enable monitoring for user-defined projects in addition to core platform monitoring. Non-administrator users such as developers can then monitor their own projects outside of core platform monitoring.
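
As a minimal sketch of that opt-in, monitoring for user-defined projects is typically switched on with a single field in the cluster-monitoring-config ConfigMap object. The example below assumes that no other settings are present in the object; in an existing cluster, the field would be added alongside the current contents of config.yaml.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Deploys the user workload monitoring components
    # into the openshift-user-workload-monitoring project.
    enableUserWorkload: true
```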

Cluster administrators typically complete the following activities to configure user-defined projects so that users can view collected metrics, query these metrics, and receive alerts for their own projects:

2.4. Developer and non-administrator steps

After monitoring for user-defined projects is enabled and configured, developers and other non-administrator users can then perform the following activities to set up and use monitoring for their own projects:
