This documentation is for a release that is no longer maintained
See documentation for the latest supported version 3 or the latest supported version 4.Chapter 14. Troubleshooting monitoring issues
14.2. Determining why Prometheus is consuming a lot of disk space
Developers can create labels to define attributes for metrics in the form of key-value pairs. The number of potential key-value pairs corresponds to the number of possible values for an attribute. An attribute that has an unlimited number of potential values is called an unbound attribute. For example, a customer_id
attribute is unbound because it has an infinite number of possible values.
Every assigned key-value pair has a unique time series. The use of many unbound attributes in labels can result in an exponential increase in the number of time series created. This can impact Prometheus performance and can consume a lot of disk space.
You can use the following measures when Prometheus consumes a lot of disk:
- Check the number of scrape samples that are being collected.
- Check the time series database (TSDB) status using the Prometheus HTTP API for more information about which labels are creating the most time series. Doing so requires cluster administrator privileges.
Reduce the number of unique time series that are created by reducing the number of unbound attributes that are assigned to user-defined metrics.
NoteUsing attributes that are bound to a limited set of possible values reduces the number of potential key-value pair combinations.
- Enforce limits on the number of samples that can be scraped across user-defined projects. This requires cluster administrator privileges.
Prerequisites
-
You have access to the cluster as a user with the
cluster-admin
cluster role. -
You have installed the OpenShift CLI (
oc
).
Procedure
-
In the Administrator perspective, navigate to Observe
Metrics. Run the following Prometheus Query Language (PromQL) query in the Expression field. This returns the ten metrics that have the highest number of scrape samples:
topk(10,count by (job)({__name__=~".+"}))
topk(10,count by (job)({__name__=~".+"}))
Copy to Clipboard Copied! Investigate the number of unbound label values assigned to metrics with higher than expected scrape sample counts.
- If the metrics relate to a user-defined project, review the metrics key-value pairs assigned to your workload. These are implemented through Prometheus client libraries at the application level. Try to limit the number of unbound attributes referenced in your labels.
- If the metrics relate to a core OpenShift Container Platform project, create a Red Hat support case on the Red Hat Customer Portal.
Review the TSDB status using the Prometheus HTTP API by running the following commands as a cluster administrator:
oc login -u <username> -p <password>
$ oc login -u <username> -p <password>
Copy to Clipboard Copied! host=$(oc -n openshift-monitoring get route prometheus-k8s -ojsonpath={.spec.host})
$ host=$(oc -n openshift-monitoring get route prometheus-k8s -ojsonpath={.spec.host})
Copy to Clipboard Copied! token=$(oc whoami -t)
$ token=$(oc whoami -t)
Copy to Clipboard Copied! curl -H "Authorization: Bearer $token" -k "https://$host/api/v1/status/tsdb"
$ curl -H "Authorization: Bearer $token" -k "https://$host/api/v1/status/tsdb"
Copy to Clipboard Copied! Example output
"status": "success",
"status": "success",
Copy to Clipboard Copied!