Chapter 6. Troubleshooting monitoring issues

Find troubleshooting steps for common issues with user-defined project monitoring.

6.1. Determining why user-defined project metrics are unavailable

If metrics are not displaying when monitoring user-defined projects, follow these steps to troubleshoot the issue.

Procedure

Query the metric name and verify that the project is correct:
1. In the Developer perspective of the web console, click Observe and go to the Metrics tab.
2. Select the project that you want to view metrics for in the Project: list.
3. Select an existing query from the Select query list, or run a custom query by adding a PromQL query to the Expression field.
  The metrics are displayed in a chart.
  Queries must be done on a per-project basis. The metrics that are shown relate to the project that you have selected.

Verify that the pod that you want metrics from is actively serving metrics. Run the following oc exec command from a pod to query the /metrics endpoint at the target pod IP and port:

oc exec <sample_pod> -n <sample_namespace> -- curl <target_pod_IP>:<port>/metrics

Note

You must run the command on a pod that has curl installed.

The following example output shows a result with a valid version metric.

Example output

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   102  100   102    0     0  51000      0 --:--:-- --:--:-- --:--:-- 51000
# HELP version Version information about this binary
# TYPE version gauge
version{version="v0.1.0"} 1

An invalid output indicates that there is a problem with the corresponding application.
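As a rough illustration of what valid output means here, the following Python sketch checks that text in the Prometheus exposition format contains at least one parseable sample line. The parse_samples helper and its simplified rules (no trailing timestamps, no spaces inside label values) are assumptions made for this example; they are not part of oc or Prometheus.

```python
import re

# A series is a metric name optionally followed by a {label="value"} block.
SERIES_RE = re.compile(r'^[a-zA-Z_:][a-zA-Z0-9_:]*(\{.*\})?$')


def parse_samples(text):
    """Return (series, value) pairs found in Prometheus text-format output.

    Simplified on purpose: comment lines (# HELP, # TYPE) and blank lines
    are skipped, and every other line is expected to look like
    '<name>{<labels>} <value>' with no trailing timestamp.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        if not SERIES_RE.match(series):
            # Not a sample line, for example curl progress output.
            continue
        try:
            samples.append((series, float(value)))
        except ValueError:
            continue
    return samples


output = """\
# HELP version Version information about this binary
# TYPE version gauge
version{version="v0.1.0"} 1
"""

samples = parse_samples(output)
print(samples if samples else "no valid samples found")
```

An empty result suggests that the endpoint is not returning metrics in the expected format.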

If you are using a PodMonitor CRD, verify that the PodMonitor CRD is configured to point to the correct pods using label matching. For more information, see the Prometheus Operator documentation.

If you are using a ServiceMonitor CRD, and if the /metrics endpoint of the pod is showing metric data, follow these steps to verify the configuration:

Verify that the service points to the correct /metrics endpoint. The service labels in this output must match the ServiceMonitor labels and the /metrics endpoint defined by the service in the subsequent steps.

oc get service <service_name> -n <target_namespace> -o yaml

Example output

apiVersion: v1
kind: Service 
metadata:
  labels: 
    app: prometheus-example-app
  name: prometheus-example-app
  namespace: ns1
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
    name: web
  selector:
    app: prometheus-example-app
  type: ClusterIP

- kind: Specifies that this is a service API.
- metadata.labels: Specifies the labels that are being used for this service.

Query the service IP, port, and /metrics endpoint to check whether it returns the same metrics as the curl command that you ran on the pod previously:

Run the following command to find the service IP:
```
oc get service -n <target_namespace>
```

Query the /metrics endpoint:

oc exec <sample_pod> -n <sample_namespace> -- curl <service_IP>:<port>/metrics

Valid metrics are returned in the following example.

Example output

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                               Dload  Upload   Total   Spent    Left  Speed
100   102  100   102    0     0  51000      0 --:--:-- --:--:-- --:--:--   99k
# HELP version Version information about this binary
# TYPE version gauge
version{version="v0.1.0"} 1

Use label matching to verify that the ServiceMonitor object is configured to point to the desired service. To do this, compare the Service object from the oc get service output to the ServiceMonitor object from the oc get servicemonitor output. The labels must match for the metrics to be displayed.
For example, from the previous steps, notice how the Service object has the app: prometheus-example-app label and the ServiceMonitor object has the same app: prometheus-example-app match label.
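The label comparison can be sketched in a few lines of Python. The Service labels below come from the example output. The ServiceMonitor selector is an assumed example that is consistent with the app: prometheus-example-app match label, and selector_matches is a hypothetical helper rather than an oc or Prometheus Operator API. The sketch covers only matchLabels, not matchExpressions.

```python
def selector_matches(match_labels, object_labels):
    """Return True if every matchLabels pair is present on the object."""
    return all(object_labels.get(key) == value
               for key, value in match_labels.items())


# Labels taken from metadata.labels of the example Service.
service_labels = {"app": "prometheus-example-app"}

# Assumed spec.selector.matchLabels of the ServiceMonitor.
servicemonitor_match_labels = {"app": "prometheus-example-app"}

matching = selector_matches(servicemonitor_match_labels, service_labels)
mismatched = selector_matches({"app": "other-app"}, service_labels)
print(matching)    # True
print(mismatched)  # False
```

A False result means that the ServiceMonitor does not select the service, so its metrics are not scraped.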

If everything looks valid and the metrics are still unavailable, please contact the support team for further help.

6.2. Determining why Prometheus is consuming a lot of disk space

Developers can create labels to define attributes for metrics in the form of key-value pairs. The number of potential key-value pairs corresponds to the number of possible values for an attribute. An attribute that has an unlimited number of potential values is called an unbound attribute. For example, a customer_id attribute is unbound because it has an infinite number of possible values.

Every assigned key-value pair has a unique time series. The use of many unbound attributes in labels can result in an exponential increase in the number of time series created. This can impact Prometheus performance and can consume a lot of disk space.
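To make the growth concrete, the upper bound on the number of time series for one metric is the product of the number of distinct values of each label. The following Python sketch works through that arithmetic. Only customer_id comes from the text above; the other label names and all of the counts are hypothetical.

```python
from math import prod


def max_series(label_value_counts):
    """Upper bound on time series for one metric: product of value counts."""
    return prod(label_value_counts.values())


# Two bounded labels: a handful of possible values each (hypothetical).
bounded = {"method": 4, "status_code": 5}

# The same metric after adding an unbound label with 10,000 observed values.
unbound = {**bounded, "customer_id": 10_000}

print(max_series(bounded))  # 20
print(max_series(unbound))  # 200000
```

Each additional unbound label multiplies the total again, which is why a few such labels can dominate disk usage.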

You can use the following measures when Prometheus consumes a lot of disk space:

- Check the time series database (TSDB) status using the Prometheus HTTP API for more information about which labels are creating the most time series data. Doing so requires cluster administrator privileges.
- Check the number of scrape samples that are being collected.
- Reduce the number of unique time series that are created by reducing the number of unbound attributes that are assigned to user-defined metrics.
  Note
  Using attributes that are bound to a limited set of possible values reduces the number of potential key-value pair combinations.
- Enforce limits on the number of samples that can be scraped across user-defined projects. This requires cluster administrator privileges.

Prerequisites

You have access to the cluster as a user with the dedicated-admin role.
You have installed the OpenShift CLI (oc).

Procedure

In the OpenShift Dedicated web console, go to Observe > Metrics.
Enter a Prometheus Query Language (PromQL) query in the Expression field. The following example queries help to identify high cardinality metrics that might result in high disk space consumption:
- By running the following query, you can identify the ten jobs that have the highest number of scrape samples:
  topk(10, max by(namespace, job) (topk by(namespace, job) (1, scrape_samples_post_metric_relabeling)))
- By running the following query, you can pinpoint time series churn by identifying the ten jobs that have created the most time series data in the last hour:
  topk(10, sum by(namespace, job) (sum_over_time(scrape_series_added[1h])))
Investigate the number of unbound label values assigned to metrics with higher than expected scrape sample counts:
- If the metrics relate to a user-defined project, review the metrics key-value pairs assigned to your workload. These are implemented through Prometheus client libraries at the application level. Try to limit the number of unbound attributes referenced in your labels.
- If the metrics relate to a core OpenShift Dedicated project, create a Red Hat support case on the Red Hat Customer Portal.

Review the TSDB status using the Prometheus HTTP API by following these steps when logged in as a dedicated-admin:

Get the Prometheus API route URL by running the following command:

HOST=$(oc -n openshift-monitoring get route prometheus-k8s -ojsonpath='{.status.ingress[].host}')

Extract an authentication token by running the following command:
```
TOKEN=$(oc whoami -t)
```

Query the TSDB status for Prometheus by running the following command:

curl -H "Authorization: Bearer $TOKEN" -k "https://$HOST/api/v1/status/tsdb"

Example output

"status": "success","data":{"headStats":{"numSeries":507473,
"numLabelPairs":19832,"chunkCount":946298,"minTime":1712253600010,
"maxTime":1712257935346},"seriesCountByMetricName":
[{"name":"etcd_request_duration_seconds_bucket","value":51840},
{"name":"apiserver_request_sli_duration_seconds_bucket","value":47718},
...
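One way to read this response is to rank the seriesCountByMetricName entries by their series count. The following Python sketch does that for a sample that contains only the two entries shown above; the closing brackets are added so that the sample parses, and top_series is a hypothetical helper.

```python
import json


def top_series(tsdb_status, limit=10):
    """Return (metric name, series count) pairs, largest first."""
    entries = tsdb_status["data"]["seriesCountByMetricName"]
    ranked = sorted(entries, key=lambda entry: entry["value"], reverse=True)
    return [(entry["name"], entry["value"]) for entry in ranked[:limit]]


# Sample trimmed to the entries shown in the example output.
sample = json.loads("""
{"status": "success", "data": {"headStats": {"numSeries": 507473,
 "numLabelPairs": 19832, "chunkCount": 946298, "minTime": 1712253600010,
 "maxTime": 1712257935346}, "seriesCountByMetricName": [
 {"name": "etcd_request_duration_seconds_bucket", "value": 51840},
 {"name": "apiserver_request_sli_duration_seconds_bucket", "value": 47718}]}}
""")

total = sample["data"]["headStats"]["numSeries"]
for name, count in top_series(sample):
    # Share of all series currently held in the TSDB head block.
    print(f"{name}: {count} series ({count / total:.1%} of head)")
```

Metrics that account for a large share of numSeries are the first candidates for a label review.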

6.3. Resolving the KubePersistentVolumeFillingUp alert firing for Prometheus

As a cluster administrator, you can resolve the KubePersistentVolumeFillingUp alert being triggered for Prometheus.

The critical alert fires when a persistent volume (PV) claimed by a prometheus-k8s-* pod in the openshift-monitoring project has less than 3% total space remaining. This can cause Prometheus to function abnormally.

Note

There are two KubePersistentVolumeFillingUp alerts:

Critical alert: The alert with the severity="critical" label is triggered when the mounted PV has less than 3% total space remaining.
Warning alert: The alert with the severity="warning" label is triggered when the mounted PV has less than 15% total space remaining and is expected to fill up within four days.
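The two thresholds can be summarized as a small decision function. The following Python sketch is a simplified reading of the conditions described in this note, not the actual alerting rule; classify_pv_alert and its parameters are hypothetical.

```python
def classify_pv_alert(available_ratio, days_until_full=None):
    """Map PV free space to the alert severity described in the note.

    available_ratio: free space as a fraction of total (0.0 to 1.0).
    days_until_full: projected days until the PV is full, if known.
    """
    if available_ratio < 0.03:
        return "critical"
    if (available_ratio < 0.15
            and days_until_full is not None
            and days_until_full <= 4):
        return "warning"
    return None


print(classify_pv_alert(0.02))                     # critical
print(classify_pv_alert(0.10, days_until_full=3))  # warning
print(classify_pv_alert(0.10, days_until_full=30)) # None
```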

To address this issue, you can remove Prometheus time-series database (TSDB) blocks to create more space for the PV.

Prerequisites

You have access to the cluster as a user with the dedicated-admin role.
You have installed the OpenShift CLI (oc).

Procedure

List the size of all TSDB blocks, sorted from oldest to newest, by running the following command:

oc debug <prometheus_k8s_pod_name> -n openshift-monitoring \
-c prometheus --image=$(oc get po -n openshift-monitoring <prometheus_k8s_pod_name> \
-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') \
-- sh -c 'cd /prometheus/;du -hs $(ls -dtr */ | grep -Eo "[0-9A-Z]{26}")'

Replace <prometheus_k8s_pod_name> with the pod mentioned in the KubePersistentVolumeFillingUp alert description.

Example output

308M    01HVKMPKQWZYWS8WVDAYQHNMW6
52M     01HVK64DTDA81799TBR9QDECEZ
102M    01HVK64DS7TRZRWF2756KHST5X
140M    01HVJS59K11FBVAPVY57K88Z11
90M     01HVH2A5Z58SKT810EM6B9AT50
152M    01HV8ZDVQMX41MKCN84S32RRZ1
354M    01HV6Q2N26BK63G4RYTST71FBF
156M    01HV664H9J9Z1FTZD73RD1563E
216M    01HTHXB60A7F239HN7S2TENPNS
104M    01HTHMGRXGS0WXA3WATRXHR36B
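TSDB block directory names are ULIDs, whose first 10 characters encode the block creation time in milliseconds using Crockford base32. As a cross-check when deciding which blocks are oldest, the following Python sketch decodes that timestamp for a few block names from the example output. The helper name is hypothetical, and creation time can differ from the directory modification time that ls -t sorts by.

```python
from datetime import datetime, timezone

# Crockford base32 alphabet used by ULIDs (no I, L, O, or U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_timestamp_ms(ulid):
    """Decode the 48-bit millisecond timestamp from a ULID's first 10 chars."""
    timestamp = 0
    for char in ulid[:10]:
        timestamp = timestamp * 32 + CROCKFORD.index(char)
    return timestamp


blocks = [
    "01HVKMPKQWZYWS8WVDAYQHNMW6",
    "01HV664H9J9Z1FTZD73RD1563E",
    "01HTHXB60A7F239HN7S2TENPNS",
    "01HTHMGRXGS0WXA3WATRXHR36B",
]

# Sort by decoded creation time, oldest block first.
oldest_first = sorted(blocks, key=ulid_timestamp_ms)
for block in oldest_first:
    created = datetime.fromtimestamp(ulid_timestamp_ms(block) / 1000,
                                     tz=timezone.utc)
    print(block, created.isoformat())
```

Because the encoding preserves order, sorting the names alphabetically gives the same result as sorting by decoded timestamp.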

Identify which and how many blocks could be removed, then remove the blocks. The following example command removes the three oldest Prometheus TSDB blocks from the prometheus-k8s-0 pod:

oc debug prometheus-k8s-0 -n openshift-monitoring \
-c prometheus --image=$(oc get po -n openshift-monitoring prometheus-k8s-0 \
-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') \
-- sh -c 'ls -latr /prometheus/ | grep -Eo "[0-9A-Z]{26}" | head -3 | \
while read BLOCK; do rm -r /prometheus/$BLOCK; done'

Verify the usage of the mounted PV and ensure there is enough space available by running the following command:

oc debug <prometheus_k8s_pod_name> -n openshift-monitoring \
--image=$(oc get po -n openshift-monitoring <prometheus_k8s_pod_name> \
-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') -- df -h /prometheus/

Replace <prometheus_k8s_pod_name> with the pod mentioned in the KubePersistentVolumeFillingUp alert description.

The following example output shows the mounted PV claimed by the prometheus-k8s-0 pod that has 63% of space remaining:

Example output

Starting pod/prometheus-k8s-0-debug-j82w4 ...
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  40G   15G  25G  37% /prometheus

Removing debug pod ...
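The remaining space follows directly from the Use% column: 100% minus 37% used leaves 63%. The following Python sketch extracts that figure from df -h text modeled on the example output. The remaining_percent helper is hypothetical, and the sketch assumes that the mount point is the last field on the line.

```python
def remaining_percent(df_output, mount="/prometheus"):
    """Return the free-space percentage for a mount point from df -h text."""
    for line in df_output.splitlines():
        fields = line.split()
        # The data line ends with the mount point; Use% is the field before it.
        if fields and fields[-1] == mount:
            used = int(fields[-2].rstrip("%"))
            return 100 - used
    raise ValueError(f"mount point {mount} not found")


df_output = """\
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  40G   15G  25G  37% /prometheus
"""

remaining = remaining_percent(df_output)
print(f"{remaining}% of space remaining")  # 63% of space remaining
```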
