7.11. Investigating monitoring issues

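OpenShift Container Platform includes a preconfigured, preinstalled, and self-updating monitoring stack that provides monitoring for core platform components. In OpenShift Container Platform 4.12, cluster administrators can optionally enable monitoring for user-defined projects.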

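Use these procedures if the following issues occur:

- Your own metrics are unavailable.
- Prometheus is consuming a lot of disk space.
- The KubePersistentVolumeFillingUp alert is firing for Prometheus.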

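7.11.1. Investigating why user-defined metrics are unavailable

ServiceMonitor resources specify how the metrics exposed by a service in user-defined projects are scraped. If you have created a ServiceMonitor resource but cannot see any corresponding metrics in the Metrics UI, follow the steps in this procedure.

Prerequisites

- You have access to the cluster as a user with the cluster-admin cluster role.
- You have installed the OpenShift CLI (oc).
- You have enabled and configured monitoring for user-defined projects.
- You have created a ServiceMonitor resource.

Procedure

Check that the corresponding labels match in the service and ServiceMonitor resource configurations.
1. Obtain the label defined in the service. The following example queries the prometheus-example-app service in the ns1 project: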
```
$ oc -n ns1 get service prometheus-example-app -o yaml
```
  Example output
```
  labels:
    app: prometheus-example-app
```
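2. Check that the matchLabels definition in the ServiceMonitor resource configuration matches the label output from the preceding step. The following example queries the prometheus-example-monitor service monitor in the ns1 project: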
```
$ oc -n ns1 get servicemonitor prometheus-example-monitor -o yaml
```
  Example output
```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-example-monitor
  namespace: ns1
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
  selector:
    matchLabels:
      app: prometheus-example-app
```
  Note
  You can check service and ServiceMonitor resource labels as a developer with view permissions for the project.
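
The label comparison in the two steps above can also be scripted. The following is a minimal sketch and not part of the product: `selector_matches` is a hypothetical helper that takes two space-separated key=value lists and succeeds only if every selector pair is present among the service labels.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an oc feature): returns 0 if every key=value pair
# in the selector list ($2) also appears in the service label list ($1).
selector_matches() {
  local labels=" $1 " pair
  for pair in $2; do
    case "$labels" in
      *" $pair "*) ;;      # this selector pair is present on the service
      *) return 1 ;;       # key missing or value differs: no match
    esac
  done
  return 0
}

# Example using the values from the outputs shown above.
if selector_matches "app=prometheus-example-app" "app=prometheus-example-app"; then
  echo "selector matches service labels"
fi
```

Against a live cluster, the two lists could probably be produced with `oc -n ns1 get service prometheus-example-app -o go-template='{{range $k, $v := .metadata.labels}}{{$k}}={{$v}} {{end}}'` and the same template over `.spec.selector.matchLabels` for the ServiceMonitor. Treat those invocations as an assumption to verify in your environment.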

Inspect the logs for the Prometheus Operator in the openshift-user-workload-monitoring project.

1. List the pods in the openshift-user-workload-monitoring project:

```
$ oc -n openshift-user-workload-monitoring get pods
```

  Example output

```
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-776fcbbd56-2nbfm   2/2     Running   0          132m
prometheus-user-workload-0             5/5     Running   1          132m
prometheus-user-workload-1             5/5     Running   1          132m
thanos-ruler-user-workload-0           3/3     Running   0          132m
thanos-ruler-user-workload-1           3/3     Running   0          132m
```

2. Obtain the logs from the prometheus-operator container in the prometheus-operator pod. In the following example, the pod is called prometheus-operator-776fcbbd56-2nbfm:

```
$ oc -n openshift-user-workload-monitoring logs prometheus-operator-776fcbbd56-2nbfm -c prometheus-operator
```

  If there is an issue with the service monitor, the logs might include an error similar to this example:

```
level=warn ts=2020-08-10T11:48:20.906739623Z caller=operator.go:1829 component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via bearer token file which Prometheus specification prohibits" servicemonitor=eagle/eagle namespace=openshift-user-workload-monitoring prometheus=user-workload
```
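
In a long log it can help to pull out only the skipped service monitors. The following sketch assumes log lines shaped like the example above. `skipped_servicemonitors` is a hypothetical filter, not an oc or Prometheus Operator feature.

```shell
#!/usr/bin/env bash
# Hypothetical filter: reads Prometheus Operator log lines on stdin and prints
# "<namespace>/<name>: <error>" for every service monitor that was skipped.
skipped_servicemonitors() {
  grep 'msg="skipping servicemonitor"' |
    sed -E 's/.*error="([^"]*)".* servicemonitor=([^ ]+).*/\2: \1/'
}

# Possible usage against a cluster (the pod name comes from the previous step):
#   oc -n openshift-user-workload-monitoring logs <pod_name> -c prometheus-operator \
#     | skipped_servicemonitors
```

The filter relies on the `error="..."` field preceding the `servicemonitor=` field, as in the example line. Other log formats would need a different expression.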

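Review the target status for your endpoint on the Metrics targets page in the OpenShift Container Platform web console UI.
1. Log in to the OpenShift Container Platform web console and navigate to Observe → Targets in the Administrator perspective.
2. Locate the metrics endpoint in the list, and review the status of the target in the Status column.
3. If the Status is Down, click the URL for the endpoint to view more information on the Target Details page for that metrics target.
Configure debug level logging for the Prometheus Operator in the openshift-user-workload-monitoring project.
1. Edit the user-workload-monitoring-config ConfigMap object in the openshift-user-workload-monitoring project: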
```
$ oc -n openshift-user-workload-monitoring edit configmap user-workload-monitoring-config
```
2. Add logLevel: debug for prometheusOperator under data/config.yaml to set the log level to debug:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheusOperator:
      logLevel: debug
# ...
```
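3. Save the file to apply the changes. The affected prometheus-operator pod is automatically redeployed.
4. Confirm that the debug log level has been applied to the prometheus-operator deployment in the openshift-user-workload-monitoring project: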
```
$ oc -n openshift-user-workload-monitoring get deploy prometheus-operator -o yaml |  grep "log-level"
```
  Example output
```
        - --log-level=debug
```
  Debug level logging shows all calls made by the Prometheus Operator.
5. Check that the prometheus-operator pod is running:
```
$ oc -n openshift-user-workload-monitoring get pods
```
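  Note
  If an unrecognized Prometheus Operator logLevel value is included in the config map, the prometheus-operator pod might not restart successfully.
6. Review the debug logs to see if the Prometheus Operator is using the ServiceMonitor resource. Review the logs for other related errors.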

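7.11.2. Determining why Prometheus is consuming a lot of disk space

Developers can create labels to define attributes for metrics in the form of key-value pairs. The number of potential key-value pairs corresponds to the number of possible values for an attribute. An attribute that has an unlimited number of potential values is called an unbound attribute. For example, a customer_id attribute is unbound because it has an infinite number of possible values.

Every unique combination of assigned key-value pairs produces its own time series. The use of many unbound attributes in labels can result in an exponential increase in the number of time series created. This can impact Prometheus performance and can consume a lot of disk space.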
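
To make the growth concrete, the upper bound on the number of time series for a single metric is the product of the number of distinct values each label can take. The following sketch uses made-up counts, and `series_upper_bound` is a hypothetical helper introduced only for this illustration.

```shell
#!/usr/bin/env bash
# Upper bound on the time series of one metric: the product of the number of
# distinct values each label can take. All counts below are illustrative.
series_upper_bound() {
  local total=1 count
  for count in "$@"; do
    total=$((total * count))
  done
  echo "$total"
}

# Two bounded labels, for example 5 HTTP methods and 4 status classes.
series_upper_bound 5 4
# The same metric after adding a customer_id label covering 100000 customers.
series_upper_bound 5 4 100000
```

With these assumed counts, the first call prints 20 and the second prints 2000000, which shows how a single unbound label dominates the total.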

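You can use the following measures when Prometheus consumes a lot of disk space:

- Check the time series database (TSDB) status using the Prometheus HTTP API for more information about which labels are creating the most time series data. Doing so requires cluster administrator privileges.
- Check the number of scrape samples that are being collected.
- Reduce the number of unique time series that are created by reducing the number of unbound attributes that are assigned to user-defined metrics.
  Note
  Using attributes that are bound to a limited set of possible values reduces the number of potential key-value pair combinations.
- Enforce limits on the number of samples that can be scraped across user-defined projects. This requires cluster administrator privileges.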

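Prerequisites

- You have access to the cluster as a user with the cluster-admin cluster role.
- You have installed the OpenShift CLI (oc).

Procedure

In the Administrator perspective, navigate to Observe → Metrics.
Enter a Prometheus Query Language (PromQL) query in the Expression field. The following example queries help to identify high cardinality metrics that might result in high disk space consumption:
- By running the following query, you can identify the ten jobs that have the highest number of scrape samples: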
```
topk(10, max by(namespace, job) (topk by(namespace, job) (1, scrape_samples_post_metric_relabeling)))
```
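- By running the following query, you can pinpoint time series churn by identifying the ten jobs that have created the most time series data in the last hour: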
```
topk(10, sum by(namespace, job) (sum_over_time(scrape_series_added[1h])))
```
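If a metric has a greater number of scrape samples than expected, investigate the number of unbound label values assigned to it:
- If the metrics relate to a user-defined project, review the metrics key-value pairs assigned to your workload. These are implemented through Prometheus client libraries at the application level. Try to limit the number of unbound attributes referenced in your labels.
- If the metrics relate to a core OpenShift Container Platform project, create a Red Hat support case on the Red Hat Customer Portal.

Review the TSDB status using the Prometheus HTTP API by following these steps when logged in as a cluster administrator:

Get the Prometheus API route URL by running the following command: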

```
$ HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.status.ingress[].host}')
```

Extract an authentication token by running the following command:
```
$ TOKEN=$(oc whoami -t)
```

Query the TSDB status for Prometheus by running the following command:

```
$ curl -H "Authorization: Bearer $TOKEN" -k "https://$HOST/api/v1/status/tsdb"
```

Example output

```
"status": "success","data":{"headStats":{"numSeries":507473,
"numLabelPairs":19832,"chunkCount":946298,"minTime":1712253600010,
"maxTime":1712257935346},"seriesCountByMetricName":
[{"name":"etcd_request_duration_seconds_bucket","value":51840},
{"name":"apiserver_request_sli_duration_seconds_bucket","value":47718},
...
```

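Additional resources

For details on how to set a scrape sample limit and create related alerting rules, see Setting a scrape sample limit for user-defined projects.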

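7.11.3. Resolving the KubePersistentVolumeFillingUp alert firing for Prometheus

As a cluster administrator, you can resolve the KubePersistentVolumeFillingUp alert being triggered for Prometheus.

The critical alert fires when a persistent volume (PV) claimed by a prometheus-k8s-* pod in the openshift-monitoring project has less than 3% total space remaining. This can cause Prometheus to function abnormally.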

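Note

There are two KubePersistentVolumeFillingUp alerts:

- Critical alert: The alert with the severity="critical" label is triggered when the mounted PV has less than 3% total space remaining.
- Warning alert: The alert with the severity="warning" label is triggered when the mounted PV has less than 15% total space remaining and is expected to fill up within four days.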
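
The two thresholds in the note can be read as a simple decision. The following sketch only restates that logic for illustration. It is not the alerting rule itself, and `pv_alert_severity` and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the thresholds described in the note; this is not the alerting rule.
# $1: integer percentage of PV space still free
# $2: "yes" if the volume is predicted to fill up within four days
pv_alert_severity() {
  local free_pct=$1 fills_soon=$2
  if [ "$free_pct" -lt 3 ]; then
    echo critical
  elif [ "$free_pct" -lt 15 ] && [ "$fills_soon" = yes ]; then
    echo warning
  else
    echo none
  fi
}

pv_alert_severity 2 no     # below 3% free
pv_alert_severity 10 yes   # below 15% free and filling up
pv_alert_severity 10 no    # below 15% free but not predicted to fill up
```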

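To address this issue, you can remove Prometheus time-series database (TSDB) blocks to create more space for the PV.

Prerequisites

- You have access to the cluster as a user with the cluster-admin cluster role.
- You have installed the OpenShift CLI (oc).

Procedure

List the size of all TSDB blocks, sorted from newest to oldest, by running the following command: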

```
$ oc debug <prometheus_k8s_pod_name> -n openshift-monitoring \
-c prometheus --image=$(oc get po -n openshift-monitoring <prometheus_k8s_pod_name> \
-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') \
-- sh -c 'cd /prometheus/;du -hs $(ls -dt */ | grep -Eo "[0-9|A-Z]{26}")'
```

Replace both occurrences of <prometheus_k8s_pod_name> with the pod mentioned in the KubePersistentVolumeFillingUp alert description.

Example output

```
308M    01HVKMPKQWZYWS8WVDAYQHNMW6
52M     01HVK64DTDA81799TBR9QDECEZ
102M    01HVK64DS7TRZRWF2756KHST5X
140M    01HVJS59K11FBVAPVY57K88Z11
90M     01HVH2A5Z58SKT810EM6B9AT50
152M    01HV8ZDVQMX41MKCN84S32RRZ1
354M    01HV6Q2N26BK63G4RYTST71FBF
156M    01HV664H9J9Z1FTZD73RD1563E
216M    01HTHXB60A7F239HN7S2TENPNS
104M    01HTHMGRXGS0WXA3WATRXHR36B
```
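Identify which and how many blocks could be removed, then remove the blocks. The following example command removes the three oldest Prometheus TSDB blocks from the prometheus-k8s-0 pod: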

```
$ oc debug prometheus-k8s-0 -n openshift-monitoring \
-c prometheus --image=$(oc get po -n openshift-monitoring prometheus-k8s-0 \
-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') \
-- sh -c 'ls -latr /prometheus/ | egrep -o "[0-9|A-Z]{26}" | head -3 | \
while read BLOCK; do rm -r /prometheus/$BLOCK; done'
```

Verify the usage of the mounted PV and ensure there is enough space available by running the following command:

```
$ oc debug <prometheus_k8s_pod_name> -n openshift-monitoring \
--image=$(oc get po -n openshift-monitoring <prometheus_k8s_pod_name> \
-o jsonpath='{.spec.containers[?(@.name=="prometheus")].image}') -- df -h /prometheus/
```

Replace both occurrences of <prometheus_k8s_pod_name> with the pod mentioned in the KubePersistentVolumeFillingUp alert description.

The following example output shows the mounted PV claimed by the prometheus-k8s-0 pod, which has 63% of its space remaining:

Example output

```
Starting pod/prometheus-k8s-0-debug-j82w4 ...
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  40G   15G  25G  37% /prometheus

Removing debug pod ...
```
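
The remaining percentage quoted above is 100 minus the Use% column. The following sketch derives it from `df -h` output piped on stdin. `prometheus_free_pct` is a hypothetical helper, and it assumes the standard six-column `df` layout shown in the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `df -h` output on stdin and prints the percentage
# of space still free on the filesystem mounted at /prometheus.
prometheus_free_pct() {
  awk '$NF == "/prometheus" { sub(/%/, "", $5); print 100 - $5 }'
}

# Possible usage: append "| prometheus_free_pct" to the oc debug command above.
# The header line and the "Starting pod" and "Removing debug pod" lines are
# ignored because their last field is not /prometheus.
```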
