OpenShift Container Platform 4.10 Monitoring

Chapter 13. Troubleshooting monitoring issues

13.1. Investigating why user-defined metrics are unavailable

ServiceMonitor resources enable you to determine how to use the metrics exposed by a service in user-defined projects. Follow the steps outlined in this procedure if you have created a ServiceMonitor resource but cannot see any corresponding metrics in the Metrics UI.

Prerequisites

You have access to the cluster as a user with the cluster-admin cluster role.
You have installed the OpenShift CLI (oc).
You have enabled and configured monitoring for user-defined workloads.
You have created the user-workload-monitoring-config ConfigMap object.
You have created a ServiceMonitor resource.

Procedure

Check that the corresponding labels match in the service and ServiceMonitor resource configurations.
1. Obtain the label defined in the service. The following example queries the prometheus-example-app service in the ns1 project:
  $ oc -n ns1 get service prometheus-example-app -o yaml
  Example output
    labels:
      app: prometheus-example-app
2. Check that the matchLabels app label in the ServiceMonitor resource configuration matches the label output in the preceding step:
  $ oc -n ns1 get servicemonitor prometheus-example-monitor -o yaml
  Example output
    spec:
      endpoints:
      - interval: 30s
        port: web
        scheme: http
      selector:
        matchLabels:
          app: prometheus-example-app
  Note
  You can check service and ServiceMonitor resource labels as a developer with view permissions for the project.
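The label comparison in this step can also be expressed programmatically. The following Python sketch assumes that you have already loaded the service labels and the ServiceMonitor matchLabels into dictionaries, for example from the output of the commands above. The function name selector_matches is hypothetical, and the sketch handles only matchLabels, not matchExpressions.

```python
def selector_matches(service_labels, match_labels):
    """Return True if every matchLabels pair is present in the service labels."""
    return all(service_labels.get(key) == value
               for key, value in match_labels.items())


# Values taken from the example output in this step.
service_labels = {"app": "prometheus-example-app"}
match_labels = {"app": "prometheus-example-app"}

print(selector_matches(service_labels, match_labels))        # True
print(selector_matches({"app": "other-app"}, match_labels))  # False
```

A service can carry extra labels and still match; only the pairs listed under matchLabels have to be present.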

Inspect the logs for the Prometheus Operator in the openshift-user-workload-monitoring project.

1. List the pods in the openshift-user-workload-monitoring project:
  $ oc -n openshift-user-workload-monitoring get pods
  Example output
  NAME                                   READY   STATUS    RESTARTS   AGE
  prometheus-operator-776fcbbd56-2nbfm   2/2     Running   0          132m
  prometheus-user-workload-0             5/5     Running   1          132m
  prometheus-user-workload-1             5/5     Running   1          132m
  thanos-ruler-user-workload-0           3/3     Running   0          132m
  thanos-ruler-user-workload-1           3/3     Running   0          132m
2. Obtain the logs from the prometheus-operator container in the prometheus-operator pod. In the following example, the pod is called prometheus-operator-776fcbbd56-2nbfm:
  $ oc -n openshift-user-workload-monitoring logs prometheus-operator-776fcbbd56-2nbfm -c prometheus-operator
  If there is an issue with the service monitor, the logs might include an error similar to this example:
  level=warn ts=2020-08-10T11:48:20.906739623Z caller=operator.go:1829 component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via bearer token file which Prometheus specification prohibits" servicemonitor=eagle/eagle namespace=openshift-user-workload-monitoring prometheus=user-workload
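When the operator log is long, it can help to filter it for skipped service monitors. The following Python sketch assumes the log lines use the key=value (logfmt) layout of the example entry above and that you have saved or piped the log output yourself. The function names parse_logfmt and skipped_service_monitors are hypothetical, and shlex raises ValueError on lines with unbalanced quotes, so treat this as a starting point rather than a robust parser.

```python
import shlex


def parse_logfmt(line):
    """Split a key=value log line into a dict; shlex handles the quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def skipped_service_monitors(log_lines):
    """Yield (servicemonitor, error) for each "skipping servicemonitor" entry."""
    for line in log_lines:
        fields = parse_logfmt(line)
        if fields.get("msg") == "skipping servicemonitor":
            yield fields.get("servicemonitor"), fields.get("error")


# The example log entry from this step.
sample = (
    'level=warn ts=2020-08-10T11:48:20.906739623Z caller=operator.go:1829 '
    'component=prometheusoperator msg="skipping servicemonitor" '
    'error="it accesses file system via bearer token file which Prometheus '
    'specification prohibits" servicemonitor=eagle/eagle '
    'namespace=openshift-user-workload-monitoring prometheus=user-workload'
)

for name, error in skipped_service_monitors([sample]):
    print(f"{name}: {error}")
```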

Review the target status for your project in the Prometheus UI directly.
1. Establish port-forwarding to the Prometheus instance in the openshift-user-workload-monitoring project:
  $ oc port-forward -n openshift-user-workload-monitoring pod/prometheus-user-workload-0 9090
2. Open http://localhost:9090/targets in a web browser and review the status of the target for your project directly in the Prometheus UI. Check for error messages relating to the target.
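The information on the targets page is also available from the Prometheus HTTP API at /api/v1/targets. The following Python sketch assumes that the port-forward from this step is still running on localhost:9090. The function names and the example payload are hypothetical; the payload only imitates the shape of a real response so that the filtering logic can be shown without a cluster.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (scrape URL, last error) for active targets whose health is not "up"."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [(t.get("scrapeUrl"), t.get("lastError"))
            for t in targets if t.get("health") != "up"]


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets API; assumes the port-forward is active on base_url."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as response:
        return json.load(response)


# Hypothetical payload for illustration; the addresses are made up.
example_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://10.128.2.10:8080/metrics",
             "health": "up", "lastError": ""},
            {"scrapeUrl": "http://10.128.2.11:8080/metrics",
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

print(unhealthy_targets(example_payload))
```

Against a live instance, you would call unhealthy_targets(fetch_targets()) instead of passing the example payload.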
Configure debug level logging for the Prometheus Operator in the openshift-user-workload-monitoring project.
1. Edit the user-workload-monitoring-config ConfigMap object in the openshift-user-workload-monitoring project:
  $ oc -n openshift-user-workload-monitoring edit configmap user-workload-monitoring-config
2. Add logLevel: debug for prometheusOperator under data/config.yaml to set the log level to debug:
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: user-workload-monitoring-config
    namespace: openshift-user-workload-monitoring
  data:
    config.yaml: |
      prometheusOperator:
        logLevel: debug
3. Save the file to apply the changes.
  Note
  The prometheus-operator in the openshift-user-workload-monitoring project restarts automatically when you apply the log-level change.
4. Confirm that the debug log-level has been applied to the prometheus-operator deployment in the openshift-user-workload-monitoring project:
  $ oc -n openshift-user-workload-monitoring get deploy prometheus-operator -o yaml | grep "log-level"
  Example output
  - --log-level=debug
  Debug level logging will show all calls made by the Prometheus Operator.
5. Check that the prometheus-operator pod is running:
  $ oc -n openshift-user-workload-monitoring get pods
  Note
  If the config map contains an unrecognized logLevel value for the Prometheus Operator, the prometheus-operator pod might not restart successfully.
6. Review the debug logs to see if the Prometheus Operator is using the ServiceMonitor resource. Review the logs for other related errors.

13.2. Determining why Prometheus is consuming a lot of disk space

Developers can create labels to define attributes for metrics in the form of key-value pairs. The number of potential key-value pairs corresponds to the number of possible values for an attribute. An attribute that has an unlimited number of potential values is called an unbound attribute. For example, a customer_id attribute is unbound because it has an effectively unlimited number of possible values.

Every unique combination of assigned key-value pairs has its own time series. The use of many unbound attributes in labels can therefore result in an exponential increase in the number of time series created, which can degrade Prometheus performance and consume a lot of disk space.
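To see why unbound attributes matter, consider the upper bound on the number of time series for a single metric: the product of the number of distinct values of each label. The following Python sketch uses hypothetical label names and value counts; max_series is an illustrative helper, not part of any Prometheus tooling.

```python
from math import prod


def max_series(label_value_counts):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical label sets for a single metric.
bounded = {"method": 4, "status": 5}
unbounded = {"method": 4, "status": 5, "customer_id": 100_000}

print(max_series(bounded))    # 20
print(max_series(unbounded))  # 2000000
```

Adding one label with 100,000 possible values multiplies the potential series count by 100,000, which is why a single unbound attribute can dominate disk usage.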

You can use the following measures when Prometheus consumes a lot of disk space:

Check the number of scrape samples that are being collected.
Check the time series database (TSDB) status in the Prometheus UI for more information on which labels are creating the most time series. This requires cluster administrator privileges.
Reduce the number of unique time series that are created by reducing the number of unbound attributes that are assigned to user-defined metrics.
Note
Using attributes that are bound to a limited set of possible values reduces the number of potential key-value pair combinations.
Enforce limits on the number of samples that can be scraped across user-defined projects. This requires cluster administrator privileges.

Prerequisites

You have access to the cluster as a user with the cluster-admin cluster role.
You have installed the OpenShift CLI (oc).

Procedure

In the Administrator perspective, navigate to Observe → Metrics.
Run the following Prometheus Query Language (PromQL) query in the Expression field. The query counts the time series for each job and returns the ten jobs with the highest counts, which indicates where the most scrape samples are being collected:
```
topk(10,count by (job)({__name__=~".+"}))
```
Investigate the number of unbound label values assigned to metrics with higher than expected scrape sample counts.
- If the metrics relate to a user-defined project, review the metric key-value pairs assigned to your workload. These are implemented through Prometheus client libraries at the application level. Try to limit the number of unbound attributes referenced in your labels.
- If the metrics relate to a core OpenShift Container Platform project, create a Red Hat support case on the Red Hat Customer Portal.
Check the TSDB status in the Prometheus UI.
1. In the Administrator perspective, navigate to Networking → Routes.
2. Select the openshift-monitoring project in the Project list.
3. Select the URL in the prometheus-k8s row to open the login page for the Prometheus UI.
4. Choose Log in with OpenShift to log in using your OpenShift Container Platform credentials.
5. In the Prometheus UI, navigate to Status → TSDB Status.
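The statistics on the TSDB status page are also exposed by the Prometheus HTTP API at /api/v1/status/tsdb. The following Python sketch ranks labels by their number of distinct values, assuming that you have already retrieved the response as a dictionary. The function name top_label_cardinality and the example payload are hypothetical; the values are made up for illustration.

```python
def top_label_cardinality(tsdb_status, limit=5):
    """Return (label name, distinct value count) pairs, highest count first."""
    entries = tsdb_status.get("data", {}).get("labelValueCountByLabelName", [])
    ranked = sorted(entries, key=lambda entry: entry["value"], reverse=True)
    return [(entry["name"], entry["value"]) for entry in ranked[:limit]]


# Hypothetical payload shaped like a /api/v1/status/tsdb response.
example_status = {
    "status": "success",
    "data": {
        "labelValueCountByLabelName": [
            {"name": "pod", "value": 310},
            {"name": "customer_id", "value": 98000},
            {"name": "namespace", "value": 42},
        ]
    },
}

for name, count in top_label_cardinality(example_status, limit=2):
    print(f"{name}: {count}")
```

Labels that appear at the top of this ranking with very high counts are the likeliest candidates for unbound attributes.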
