Chapter 2. Getting started
2.1. Maintenance and support for monitoring
Not all configuration options for the monitoring stack are exposed. The only supported way of configuring OpenShift Container Platform monitoring is by configuring the Cluster Monitoring Operator (CMO) using the options described in the Config map reference for the Cluster Monitoring Operator. Do not use other configurations, as they are unsupported.
Configuration paradigms might change across Prometheus releases, and such changes can be handled gracefully only if all configuration possibilities are controlled. If you use configurations other than those described in the Config map reference for the Cluster Monitoring Operator, your changes will disappear, because the CMO by design automatically reconciles any differences and resets unsupported changes to the originally defined state.
2.1.1. Support considerations for monitoring
Backward compatibility for metrics, recording rules, or alerting rules is not guaranteed.
The following modifications are explicitly not supported:
- Creating additional ServiceMonitor, PodMonitor, and PrometheusRule objects in the openshift-* and kube-* projects.
- Modifying any resources or objects deployed in the openshift-monitoring or openshift-user-workload-monitoring projects. The resources created by the OpenShift Container Platform monitoring stack are not meant to be used by any other resources, as there are no guarantees about their backward compatibility.
  Note: The Alertmanager configuration is deployed as the alertmanager-main secret resource in the openshift-monitoring namespace. If you have enabled a separate Alertmanager instance for user-defined alert routing, an Alertmanager configuration is also deployed as the alertmanager-user-workload secret resource in the openshift-user-workload-monitoring namespace. To configure additional routes for any instance of Alertmanager, you need to decode, modify, and then encode that secret. This procedure is a supported exception to the preceding statement.
- Modifying resources of the stack. The OpenShift Container Platform monitoring stack ensures its resources are always in the state it expects them to be. If they are modified, the stack will reset them.
- Deploying user-defined workloads to openshift-*, and kube-* projects. These projects are reserved for Red Hat provided components and they should not be used for user-defined workloads.
- Enabling symptom based monitoring by using the Probe custom resource definition (CRD) in Prometheus Operator.
- Manually deploying monitoring resources into namespaces that have the openshift.io/cluster-monitoring: "true" label.
- Adding the openshift.io/cluster-monitoring: "true" label to namespaces. This label is reserved only for the namespaces with core OpenShift Container Platform components and Red Hat certified components.
- Installing custom Prometheus instances on OpenShift Container Platform. A custom instance is a Prometheus custom resource (CR) managed by the Prometheus Operator.
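The decode, modify, and encode procedure mentioned in the note on the Alertmanager secret can be sketched as follows. This is a minimal illustration, not the documented procedure: it assumes the configuration is stored under an `alertmanager.yaml` key in the secret's `data` map, it works on a secret that has already been exported as a JSON-style dictionary, and the `team-pager` receiver is a hypothetical name used only for the example.

```python
import base64

# Assumption: the Alertmanager configuration lives under this key in the
# secret's `data` map.
CONFIG_KEY = "alertmanager.yaml"


def decode_config(secret: dict) -> str:
    """Return the plain-text Alertmanager configuration from a Secret object."""
    return base64.b64decode(secret["data"][CONFIG_KEY]).decode("utf-8")


def encode_config(secret: dict, new_config: str) -> dict:
    """Return a copy of the Secret with `new_config` base64-encoded under CONFIG_KEY."""
    updated = dict(secret)
    updated["data"] = dict(secret["data"])
    updated["data"][CONFIG_KEY] = base64.b64encode(
        new_config.encode("utf-8")
    ).decode("ascii")
    return updated


if __name__ == "__main__":
    # Stand-in for a secret exported from the cluster; the content is illustrative.
    original = "route:\n  receiver: default\nreceivers:\n- name: default\n"
    secret = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "alertmanager-main", "namespace": "openshift-monitoring"},
        "data": {CONFIG_KEY: base64.b64encode(original.encode("utf-8")).decode("ascii")},
    }

    config = decode_config(secret)
    config += "- name: team-pager\n"  # hypothetical extra receiver appended to the list
    updated = encode_config(secret, config)
    print(decode_config(updated))
```

The helper returns a modified copy rather than changing the input, so the original secret stays available for comparison before anything is applied back to the cluster.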
2.1.2. Support policy for monitoring Operators
Monitoring Operators ensure that OpenShift Container Platform monitoring resources function as designed and tested. If Cluster Version Operator (CVO) control of an Operator is overridden, the Operator does not respond to configuration changes, reconcile the intended state of cluster objects, or receive updates.
While overriding CVO control for an Operator can be helpful during debugging, this is unsupported and the cluster administrator assumes full control of the individual component configurations and upgrades.
Overriding the Cluster Version Operator
The spec.overrides parameter can be added to the configuration for the CVO to allow administrators to provide a list of overrides to the behavior of the CVO for a component. Setting the spec.overrides[].unmanaged parameter to true for a component blocks cluster upgrades and alerts the administrator after a CVO override has been set:
Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.
Setting a CVO override puts the entire cluster in an unsupported state and prevents the monitoring stack from being reconciled to its intended state. This impacts the reliability features built into Operators and prevents updates from being received. Reported issues must be reproduced after removing any overrides for support to proceed.
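Before opening a support case, it can help to check whether any override is still in place. The following sketch inspects a ClusterVersion object that has already been exported as a dictionary and lists the entries whose `unmanaged` field is true. The `kind`, `namespace`, and `name` fields and the sample Deployment entry are assumptions made for illustration; only `spec.overrides[].unmanaged` is taken from the text above.

```python
def unmanaged_overrides(cluster_version: dict) -> list:
    """List 'kind/namespace/name' for every override with unmanaged set to true."""
    overrides = cluster_version.get("spec", {}).get("overrides") or []
    return [
        f'{o.get("kind")}/{o.get("namespace")}/{o.get("name")}'
        for o in overrides
        if o.get("unmanaged") is True
    ]


if __name__ == "__main__":
    # Illustrative ClusterVersion fragment; the override entry is hypothetical.
    sample = {
        "kind": "ClusterVersion",
        "spec": {
            "overrides": [
                {
                    "kind": "Deployment",
                    "namespace": "openshift-monitoring",
                    "name": "cluster-monitoring-operator",
                    "unmanaged": True,
                }
            ]
        },
    }
    for entry in unmanaged_overrides(sample):
        print("CVO override still set for:", entry)
```

An empty result means no component is marked as unmanaged in the exported object.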
2.1.3. Support version matrix for monitoring components
The following matrix contains information about versions of monitoring components for OpenShift Container Platform 4.12 and later releases:
OpenShift Container Platform | Prometheus Operator | Prometheus | Metrics Server | Alertmanager | kube-state-metrics agent | monitoring-plugin | node-exporter agent | Thanos |
---|---|---|---|---|---|---|---|---|
4.18 | 0.78.1 | 2.55.1 | 0.7.2 | 0.27.0 | 2.13.0 | 1.0.0 | 1.8.2 | 0.36.1 |
4.17 | 0.75.2 | 2.53.1 | 0.7.1 | 0.27.0 | 2.13.0 | 1.0.0 | 1.8.2 | 0.35.1 |
4.16 | 0.73.2 | 2.52.0 | 0.7.1 | 0.26.0 | 2.12.0 | 1.0.0 | 1.8.0 | 0.35.0 |
4.15 | 0.70.0 | 2.48.0 | 0.6.4 | 0.26.0 | 2.10.1 | 1.0.0 | 1.7.0 | 0.32.5 |
4.14 | 0.67.1 | 2.46.0 | N/A | 0.25.0 | 2.9.2 | 1.0.0 | 1.6.1 | 0.30.2 |
4.13 | 0.63.0 | 2.42.0 | N/A | 0.25.0 | 2.8.1 | N/A | 1.5.0 | 0.30.2 |
4.12 | 0.60.1 | 2.39.1 | N/A | 0.24.0 | 2.6.0 | N/A | 1.4.0 | 0.28.1 |
The openshift-state-metrics agent and Telemeter Client are OpenShift-specific components. Therefore, their versions correspond with the versions of OpenShift Container Platform.
2.2. Core platform monitoring first steps
After OpenShift Container Platform is installed, core platform monitoring components immediately begin collecting metrics, which you can query and view. The default in-cluster monitoring stack includes the core platform Prometheus instance that collects metrics from your cluster and the core Alertmanager instance that routes alerts, among other components. Depending on who will use the monitoring stack and for what purposes, as a cluster administrator, you can further configure these monitoring components to suit the needs of different users in various scenarios.
2.2.1. Configuring core platform monitoring: Postinstallation steps
After OpenShift Container Platform is installed, cluster administrators typically configure core platform monitoring to suit their needs. These activities include setting up storage and configuring options for Prometheus, Alertmanager, and other monitoring components.
By default, in a newly installed OpenShift Container Platform system, users can query and view collected metrics. You need only configure an alert receiver if you want users to receive alert notifications. Any other configuration options listed here are optional.
- Create the cluster-monitoring-config ConfigMap object if it does not exist.
- Configure notifications for default platform alerts so that Alertmanager can send alerts to an external notification system such as email, Slack, or PagerDuty.
- For shorter term data retention, configure persistent storage for Prometheus and Alertmanager to store metrics and alert data. Specify the metrics data retention parameters for Prometheus and Thanos Ruler.
  Important:
  - In multi-node clusters, you must configure persistent storage for Prometheus, Alertmanager, and Thanos Ruler to ensure high availability.
  - By default, in a newly installed OpenShift Container Platform system, the monitoring ClusterOperator resource reports a PrometheusDataPersistenceNotConfigured status message to remind you that storage is not configured.
- For longer term data retention, configure the remote write feature to enable Prometheus to send ingested metrics to remote systems for storage.
  Important: Be sure to add cluster ID labels to metrics for use with your remote write storage configuration.
- Grant monitoring cluster roles to any non-administrator users that need to access certain monitoring features.
- Assign tolerations to monitoring stack components so that administrators can move them to tainted nodes.
- Set the body size limit for metrics collection to help avoid situations in which Prometheus consumes excessive amounts of memory when scraped targets return a response that contains a large amount of data.
- Modify or create alerting rules for your cluster. These rules specify the conditions that trigger alerts, such as high CPU or memory usage, network latency, and so forth.
- Specify resource limits and requests for monitoring components to ensure that the containers that run monitoring components have enough CPU and memory resources.
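As a rough illustration of the first and third steps in the list above, the following sketch builds a `cluster-monitoring-config` ConfigMap manifest and prints it as JSON. The `config.yaml` data key, the `prometheusK8s` settings, and the retention and storage values are assumptions chosen for the example; check them against the Config map reference for the Cluster Monitoring Operator before use.

```python
import json

# Illustrative settings only: a retention period and a persistent volume claim
# for the core platform Prometheus. Values are placeholders, not recommendations.
EXAMPLE_CONFIG_YAML = """\
prometheusK8s:
  retention: 24h
  volumeClaimTemplate:
    spec:
      resources:
        requests:
          storage: 40Gi
"""


def build_config_map(config_yaml: str) -> dict:
    """Wrap CMO settings in a cluster-monitoring-config ConfigMap manifest."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "cluster-monitoring-config",
            "namespace": "openshift-monitoring",
        },
        # Assumption: the CMO reads its settings from the `config.yaml` key.
        "data": {"config.yaml": config_yaml},
    }


if __name__ == "__main__":
    print(json.dumps(build_config_map(EXAMPLE_CONFIG_YAML), indent=2))
```

The manifest is emitted as JSON so that the sketch needs only the standard library; the printed output could be saved to a file and reviewed before it is applied to a cluster.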
With the monitoring stack configured to suit your needs, Prometheus collects metrics from the specified services and stores these metrics according to your settings. You can go to the Observe pages in the OpenShift Container Platform web console to view and query collected metrics, manage alerts, identify performance bottlenecks, and scale resources as needed:
- View dashboards to visualize collected metrics, troubleshoot alerts, and monitor other information about your cluster.
- Query collected metrics by creating PromQL queries or using predefined queries.
2.3. User workload monitoring first steps
As a cluster administrator, you can optionally enable monitoring for user-defined projects in addition to core platform monitoring. Non-administrator users such as developers can then monitor their own projects outside of core platform monitoring.
Cluster administrators typically complete the following activities to configure user-defined projects so that users can view collected metrics, query these metrics, and receive alerts for their own projects:
- Enable user workload monitoring.
- Grant non-administrator users permissions to monitor user-defined projects by assigning the monitoring-rules-view, monitoring-rules-edit, or monitoring-edit cluster roles.
- Assign the user-workload-monitoring-config-edit role to grant non-administrator users permission to configure user-defined projects.
- Enable alert routing for user-defined projects so that developers and other users can configure custom alerts and alert routing for their projects.
- If needed, configure alert routing for user-defined projects to use an optional Alertmanager instance dedicated for use only by user-defined projects.
- Configure notifications for user-defined alerts.
- If you use the platform Alertmanager instance for user-defined alert routing, configure different alert receivers for default platform alerts and user-defined alerts.
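The role-granting step in the list above could look like the following sketch, which builds a RoleBinding manifest that binds one of the three monitoring cluster roles to a user in a single project. The user name `dev-user`, the project `my-project`, and the binding name are hypothetical. The `enableUserWorkload: true` setting shown in the constant is an assumption about how user workload monitoring is enabled through the `cluster-monitoring-config` ConfigMap.

```python
import json

# Assumption: this setting in the cluster-monitoring-config ConfigMap enables
# monitoring for user-defined projects.
ENABLE_USER_WORKLOAD_CONFIG = "enableUserWorkload: true\n"

# Cluster roles named in the list above.
MONITORING_ROLES = ("monitoring-rules-view", "monitoring-rules-edit", "monitoring-edit")


def monitoring_role_binding(user: str, namespace: str, cluster_role: str) -> dict:
    """Build a RoleBinding granting `cluster_role` to `user` within `namespace`."""
    if cluster_role not in MONITORING_ROLES:
        raise ValueError(f"unexpected monitoring role: {cluster_role}")
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{cluster_role}-{user}", "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": cluster_role,
        },
        "subjects": [
            {"apiGroup": "rbac.authorization.k8s.io", "kind": "User", "name": user}
        ],
    }


if __name__ == "__main__":
    # Hypothetical user and project names.
    binding = monitoring_role_binding("dev-user", "my-project", "monitoring-edit")
    print(json.dumps(binding, indent=2))
```

Binding the cluster role through a namespaced RoleBinding limits the permission to one project, which matches the intent of letting developers monitor only their own projects.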
2.4. Developer and non-administrator steps
After monitoring for user-defined projects is enabled and configured, developers and other non-administrator users can then perform the following activities to set up and use monitoring for their own projects:
- Deploy and monitor services.
- Create and manage alerting rules.
- Receive and manage alerts for your projects.
- If granted the alert-routing-edit cluster role, configure alert routing.
- View dashboards by using the OpenShift Container Platform web console.
- Query the collected metrics by creating PromQL queries or using predefined queries.
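For the "Deploy and monitor services" step, a developer typically describes which service endpoints Prometheus should scrape. The sketch below builds a ServiceMonitor manifest for a user-defined project and refuses the `openshift-*` and `kube-*` projects, which the support considerations reserve for Red Hat components. The project, application label, port name, and scrape interval are all hypothetical values for illustration.

```python
import json
from fnmatch import fnmatch

# Project name patterns reserved for Red Hat provided components.
RESERVED_PATTERNS = ("openshift-*", "kube-*")


def service_monitor(name: str, namespace: str, app_label: str,
                    port_name: str, interval: str = "30s") -> dict:
    """Build a ServiceMonitor that scrapes services labeled app=<app_label>."""
    if any(fnmatch(namespace, pattern) for pattern in RESERVED_PATTERNS):
        raise ValueError(f"{namespace} is reserved; use a user-defined project")
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Scrape the named service port at the given interval.
            "endpoints": [{"port": port_name, "interval": interval}],
            # Select the services to scrape by label.
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


if __name__ == "__main__":
    # Hypothetical project and application names.
    manifest = service_monitor("example-app-monitor", "my-project", "example-app", "web")
    print(json.dumps(manifest, indent=2))
```

The namespace check is a local convenience only; it does not replace the cluster-side rules described in the support considerations.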