Home
Products
OpenShift Container Platform
4.13
Monitoring
Chapter 12. Troubleshooting monitoring issues

Chapter 12. Troubleshooting monitoring issues

Find troubleshooting steps for common issues with core platform and user-defined project monitoring.

12.1. Investigating why user-defined project metrics are unavailable
Copy link

ServiceMonitor resources enable you to determine how to use the metrics exposed by a service in user-defined projects. Follow the steps outlined in this procedure if you have created a ServiceMonitor resource but cannot see any corresponding metrics in the Metrics UI.

Prerequisites

You have access to the cluster as a user with the cluster-admin role.
You have installed the OpenShift CLI (oc).
You have enabled and configured monitoring for user-defined projects.
You have created a ServiceMonitor resource.

Procedure

Ensure that your project is not excluded from user workload monitoring. The following examples use the ns1 project.
1. Verify that the project does not have the openshift.io/user-monitoring=false label attached:
  $ oc get namespace ns1 --show-labels | grep 'openshift.io/user-monitoring=false'
  Copy to Clipboard Toggle word wrap
  Note
  The default label set for user workload projects is openshift.io/user-monitoring=true. However, the label is not visible unless you manually apply it.
2. If the label is attached, remove the label:
  Example of removing the label from the project
  $ oc label namespace ns1 'openshift.io/user-monitoring-'
  
  Copy to Clipboard Toggle word wrap
  Example output
  namespace/ns1 unlabeled
  
  Copy to Clipboard Toggle word wrap

Check that the corresponding labels match in the service and ServiceMonitor resource configurations. The following examples use the prometheus-example-app service, the prometheus-example-monitor service monitor, and the ns1 project.

Obtain the label defined in the service.

oc -n ns1 get service prometheus-example-app -o yaml

$ oc -n ns1 get service prometheus-example-app -o yaml

Copy to Clipboard

Toggle word wrap

Example output

  labels:
    app: prometheus-example-app

  labels:
    app: prometheus-example-app

Copy to Clipboard

Toggle word wrap

Check that the matchLabels definition in the ServiceMonitor resource configuration matches the label output in the preceding step.

oc -n ns1 get servicemonitor prometheus-example-monitor -o yaml

$ oc -n ns1 get servicemonitor prometheus-example-monitor -o yaml

Copy to Clipboard

Toggle word wrap

Example output

apiVersion: v1
kind: ServiceMonitor
metadata:
  name: prometheus-example-monitor
  namespace: ns1
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
  selector:
    matchLabels:
      app: prometheus-example-app

apiVersion: v1
kind: ServiceMonitor
metadata:
  name: prometheus-example-monitor
  namespace: ns1
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
  selector:
    matchLabels:
      app: prometheus-example-app

Copy to Clipboard

Toggle word wrap

Note

You can check service and ServiceMonitor resource labels as a developer with view permissions for the project.

Inspect the logs for the Prometheus Operator in the openshift-user-workload-monitoring project.

List the pods in the openshift-user-workload-monitoring project:

oc -n openshift-user-workload-monitoring get pods

$ oc -n openshift-user-workload-monitoring get pods

Copy to Clipboard

Toggle word wrap

Example output

NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-776fcbbd56-2nbfm   2/2     Running   0          132m
prometheus-user-workload-0             5/5     Running   1          132m
prometheus-user-workload-1             5/5     Running   1          132m
thanos-ruler-user-workload-0           3/3     Running   0          132m
thanos-ruler-user-workload-1           3/3     Running   0          132m

NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-776fcbbd56-2nbfm   2/2     Running   0          132m
prometheus-user-workload-0             5/5     Running   1          132m
prometheus-user-workload-1             5/5     Running   1          132m
thanos-ruler-user-workload-0           3/3     Running   0          132m
thanos-ruler-user-workload-1           3/3     Running   0          132m

Copy to Clipboard

Toggle word wrap

Obtain the logs from the prometheus-operator container in the prometheus-operator pod. In the following example, the pod is called prometheus-operator-776fcbbd56-2nbfm:

oc -n openshift-user-workload-monitoring logs prometheus-operator-776fcbbd56-2nbfm -c prometheus-operator

$ oc -n openshift-user-workload-monitoring logs prometheus-operator-776fcbbd56-2nbfm -c prometheus-operator

Copy to Clipboard

Toggle word wrap

If there is a issue with the service monitor, the logs might include an error similar to this example:

level=warn ts=2020-08-10T11:48:20.906739623Z caller=operator.go:1829 component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via bearer token file which Prometheus specification prohibits" servicemonitor=eagle/eagle namespace=openshift-user-workload-monitoring prometheus=user-workload

level=warn ts=2020-08-10T11:48:20.906739623Z caller=operator.go:1829 component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via bearer token file which Prometheus specification prohibits" servicemonitor=eagle/eagle namespace=openshift-user-workload-monitoring prometheus=user-workload

Copy to Clipboard

Toggle word wrap

Review the target status for your endpoint on the Metrics targets page in the OpenShift Container Platform web console UI.
1. Log in to the OpenShift Container Platform web console and navigate to Observe Targets in the Administrator perspective.
2. Locate the metrics endpoint in the list, and review the status of the target in the Status column.
3. If the Status is Down, click the URL for the endpoint to view more information on the Target Details page for that metrics target.
Configure debug level logging for the Prometheus Operator in the openshift-user-workload-monitoring project.
1. Edit the user-workload-monitoring-config ConfigMap object in the openshift-user-workload-monitoring project:
  $ oc -n openshift-user-workload-monitoring edit configmap user-workload-monitoring-config
  Copy to Clipboard Toggle word wrap
2. Add logLevel: debug for prometheusOperator under data/config.yaml to set the log level to debug:
  apiVersion: v1 kind: ConfigMap metadata: name: user-workload-monitoring-config namespace: openshift-user-workload-monitoring data: config.yaml: | prometheusOperator: logLevel: debug # ...
  Copy to Clipboard Toggle word wrap
3. Save the file to apply the changes. The affected prometheus-operator pod is automatically redeployed.
4. Confirm that the debug log-level has been applied to the prometheus-operator deployment in the openshift-user-workload-monitoring project:
  $ oc -n openshift-user-workload-monitoring get deploy prometheus-operator -o yaml | grep "log-level"
  Copy to Clipboard Toggle word wrap
  Example output
  - --log-level=debug
  
  Copy to Clipboard Toggle word wrap
  Debug level logging will show all calls made by the Prometheus Operator.
5. Check that the prometheus-operator pod is running:
  $ oc -n openshift-user-workload-monitoring get pods
  Copy to Clipboard Toggle word wrap
  Note
  If an unrecognized Prometheus Operator loglevel value is included in the config map, the prometheus-operator pod might not restart successfully.
6. Review the debug logs to see if the Prometheus Operator is using the ServiceMonitor resource. Review the logs for other related errors.

12.2. Determining why Prometheus is consuming a lot of disk space
Copy link

Developers can create labels to define attributes for metrics in the form of key-value pairs. The number of potential key-value pairs corresponds to the number of possible values for an attribute. An attribute that has an unlimited number of potential values is called an unbound attribute. For example, a customer_id attribute is unbound because it has an infinite number of possible values.

Every assigned key-value pair has a unique time series. The use of many unbound attributes in labels can result in an exponential increase in the number of time series created. This can impact Prometheus performance and can consume a lot of disk space.

You can use the following measures when Prometheus consumes a lot of disk:

Check the time series database (TSDB) status using the Prometheus HTTP API for more information about which labels are creating the most time series data. Doing so requires cluster administrator privileges.
Check the number of scrape samples that are being collected.
Reduce the number of unique time series that are created by reducing the number of unbound attributes that are assigned to user-defined metrics.
Note
Using attributes that are bound to a limited set of possible values reduces the number of potential key-value pair combinations.
Enforce limits on the number of samples that can be scraped across user-defined projects. This requires cluster administrator privileges.

Prerequisites

You have access to the cluster as a user with the cluster-admin cluster role.
You have installed the OpenShift CLI (oc).

Procedure

In the Administrator perspective, navigate to Observe Metrics.
Enter a Prometheus Query Language (PromQL) query in the Expression field. The following example queries help to identify high cardinality metrics that might result in high disk space consumption:
- By running the following query, you can identify the ten jobs that have the highest number of scrape samples:
  topk(10, max by(namespace, job) (topk by(namespace, job) (1, scrape_samples_post_metric_relabeling)))
  Copy to Clipboard Toggle word wrap
- By running the following query, you can pinpoint time series churn by identifying the ten jobs that have created the most time series data in the last hour:
  topk(10, sum by(namespace, job) (sum_over_time(scrape_series_added[1h])))
  Copy to Clipboard Toggle word wrap
Investigate the number of unbound label values assigned to metrics with higher than expected scrape sample counts:
- If the metrics relate to a user-defined project, review the metrics key-value pairs assigned to your workload. These are implemented through Prometheus client libraries at the application level. Try to limit the number of unbound attributes referenced in your labels.
- If the metrics relate to a core OpenShift Container Platform project, create a Red Hat support case on the Red Hat Customer Portal.

Review the TSDB status using the Prometheus HTTP API by following these steps when logged in as a cluster administrator:

Get the Prometheus API route URL by running the following command:

HOST=$(oc -n openshift-monitoring get route prometheus-k8s -ojsonpath='{.status.ingress[].host}')

$ HOST=$(oc -n openshift-monitoring get route prometheus-k8s -ojsonpath='{.status.ingress[].host}')

Copy to Clipboard

Toggle word wrap

Extract an authentication token by running the following command:
```
TOKEN=$(oc whoami -t)
```
```
$ TOKEN=$(oc whoami -t)
```
Copy to Clipboard Toggle word wrap

Query the TSDB status for Prometheus by running the following command:

curl -H "Authorization: Bearer $TOKEN" -k "https://$HOST/api/v1/status/tsdb"

$ curl -H "Authorization: Bearer $TOKEN" -k "https://$HOST/api/v1/status/tsdb"

Copy to Clipboard

Toggle word wrap

Example output

"status": "success","data":{"headStats":{"numSeries":507473,
"numLabelPairs":19832,"chunkCount":946298,"minTime":1712253600010,
"maxTime":1712257935346},"seriesCountByMetricName":
[{"name":"etcd_request_duration_seconds_bucket","value":51840},
{"name":"apiserver_request_sli_duration_seconds_bucket","value":47718},
...

"status": "success","data":{"headStats":{"numSeries":507473,
"numLabelPairs":19832,"chunkCount":946298,"minTime":1712253600010,
"maxTime":1712257935346},"seriesCountByMetricName":
[{"name":"etcd_request_duration_seconds_bucket","value":51840},
{"name":"apiserver_request_sli_duration_seconds_bucket","value":47718},
...

Copy to Clipboard

Toggle word wrap

12.3. Resolving the KubePersistentVolumeFillingUp alert firing for Prometheus
Copy link

As a cluster administrator, you can resolve the KubePersistentVolumeFillingUp alert being triggered for Prometheus.

The critical alert fires when a persistent volume (PV) claimed by a prometheus-k8s-* pod in the openshift-monitoring project has less than 3% total space remaining. This can cause Prometheus to function abnormally.

Note

There are two KubePersistentVolumeFillingUp alerts:

Critical alert: The alert with the severity="critical" label is triggered when the mounted PV has less than 3% total space remaining.
Warning alert: The alert with the severity="warning" label is triggered when the mounted PV has less than 15% total space remaining and is expected to fill up within four days.

To address this issue, you can remove Prometheus time-series database (TSDB) blocks to create more space for the PV.

Prerequisites

You have access to the cluster as a user with the cluster-admin cluster role.
You have installed the OpenShift CLI (oc).

Procedure

List the size of all TSDB blocks, sorted from oldest to newest, by running the following command:

oc debug <prometheus_k8s_pod_name> -n openshift-monitoring \
-c prometheus --image=$(oc get po -n openshift-monitoring <prometheus_k8s_pod_name> \
-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') \
-- sh -c 'cd /prometheus/;du -hs $(ls -dtr */ | grep -Eo "[0-9|A-Z]{26}")'

$ oc debug <prometheus_k8s_pod_name> -n openshift-monitoring \


-c prometheus --image=$(oc get po -n openshift-monitoring <prometheus_k8s_pod_name> \


-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') \
-- sh -c 'cd /prometheus/;du -hs $(ls -dtr */ | grep -Eo "[0-9|A-Z]{26}")'

Copy to Clipboard

Toggle word wrap

1 2: Replace <prometheus_k8s_pod_name> with the pod mentioned in the KubePersistentVolumeFillingUp alert description.

Example output

308M    01HVKMPKQWZYWS8WVDAYQHNMW6
52M     01HVK64DTDA81799TBR9QDECEZ
102M    01HVK64DS7TRZRWF2756KHST5X
140M    01HVJS59K11FBVAPVY57K88Z11
90M     01HVH2A5Z58SKT810EM6B9AT50
152M    01HV8ZDVQMX41MKCN84S32RRZ1
354M    01HV6Q2N26BK63G4RYTST71FBF
156M    01HV664H9J9Z1FTZD73RD1563E
216M    01HTHXB60A7F239HN7S2TENPNS
104M    01HTHMGRXGS0WXA3WATRXHR36B

308M    01HVKMPKQWZYWS8WVDAYQHNMW6
52M     01HVK64DTDA81799TBR9QDECEZ
102M    01HVK64DS7TRZRWF2756KHST5X
140M    01HVJS59K11FBVAPVY57K88Z11
90M     01HVH2A5Z58SKT810EM6B9AT50
152M    01HV8ZDVQMX41MKCN84S32RRZ1
354M    01HV6Q2N26BK63G4RYTST71FBF
156M    01HV664H9J9Z1FTZD73RD1563E
216M    01HTHXB60A7F239HN7S2TENPNS
104M    01HTHMGRXGS0WXA3WATRXHR36B

Copy to Clipboard

Toggle word wrap

Identify which and how many blocks could be removed, then remove the blocks. The following example command removes the three oldest Prometheus TSDB blocks from the prometheus-k8s-0 pod:

oc debug prometheus-k8s-0 -n openshift-monitoring \
-c prometheus --image=$(oc get po -n openshift-monitoring prometheus-k8s-0 \
-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') \
-- sh -c 'ls -latr /prometheus/ | egrep -o "[0-9|A-Z]{26}" | head -3 | \
while read BLOCK; do rm -r /prometheus/$BLOCK; done'

$ oc debug prometheus-k8s-0 -n openshift-monitoring \
-c prometheus --image=$(oc get po -n openshift-monitoring prometheus-k8s-0 \
-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') \
-- sh -c 'ls -latr /prometheus/ | egrep -o "[0-9|A-Z]{26}" | head -3 | \
while read BLOCK; do rm -r /prometheus/$BLOCK; done'

Copy to Clipboard

Toggle word wrap

Verify the usage of the mounted PV and ensure there is enough space available by running the following command:

oc debug <prometheus_k8s_pod_name> -n openshift-monitoring \
--image=$(oc get po -n openshift-monitoring <prometheus_k8s_pod_name> \
-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') -- df -h /prometheus/

$ oc debug <prometheus_k8s_pod_name> -n openshift-monitoring \


--image=$(oc get po -n openshift-monitoring <prometheus_k8s_pod_name> \


-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') -- df -h /prometheus/

Copy to Clipboard

Toggle word wrap

1 2: Replace <prometheus_k8s_pod_name> with the pod mentioned in the KubePersistentVolumeFillingUp alert description.

The following example output shows the mounted PV claimed by the prometheus-k8s-0 pod that has 63% of space remaining:

Example output

Starting pod/prometheus-k8s-0-debug-j82w4 ...
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  40G   15G  40G  37% /prometheus

Removing debug pod ...

Starting pod/prometheus-k8s-0-debug-j82w4 ...
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  40G   15G  40G  37% /prometheus

Removing debug pod ...

Copy to Clipboard

Toggle word wrap

12.4. Resolving the AlertmanagerReceiversNotConfigured alert
Copy link

Every cluster that is deployed has the AlertmanagerReceiversNotConfigured alert firing by default. To resolve the issue, you must configure alert receivers.

For default platform monitoring, follow the steps in "Configuring alert notifications" in Configuring core platform monitoring.
For user workload monitoring, follow the steps in "Configuring alert notifications" in Configuring user workload monitoring.

Chapter 12. Troubleshooting monitoring issues

12.1. Investigating why user-defined project metrics are unavailable
Copy link

12.2. Determining why Prometheus is consuming a lot of disk space
Copy link

12.3. Resolving the KubePersistentVolumeFillingUp alert firing for Prometheus
Copy link

12.4. Resolving the AlertmanagerReceiversNotConfigured alert
Copy link

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

Making open source more inclusive

About Red Hat

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

Chapter 12. Troubleshooting monitoring issues

12.1. Investigating why user-defined project metrics are unavailableCopy linkLink copied to clipboard!

12.2. Determining why Prometheus is consuming a lot of disk spaceCopy linkLink copied to clipboard!

12.3. Resolving the KubePersistentVolumeFillingUp alert firing for PrometheusCopy linkLink copied to clipboard!

12.4. Resolving the AlertmanagerReceiversNotConfigured alertCopy linkLink copied to clipboard!

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

Making open source more inclusive

About Red Hat

Theme

Red Hat legal and privacy links

Red Hat legal and privacy links

12.1. Investigating why user-defined project metrics are unavailable
Copy link

12.2. Determining why Prometheus is consuming a lot of disk space
Copy link

12.3. Resolving the KubePersistentVolumeFillingUp alert firing for Prometheus
Copy link

12.4. Resolving the AlertmanagerReceiversNotConfigured alert
Copy link