Chapter 14. Troubleshooting monitoring issues
14.1. Investigating why user-defined metrics are unavailable
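ServiceMonitor resources let you determine how the metrics exposed by a service in a user-defined project are used. If you have created a ServiceMonitor resource but cannot see any corresponding metrics in the Metrics UI, follow the steps in this procedure.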
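Prerequisites
- You have access to the cluster as a user with the cluster-admin cluster role.
- You have installed the OpenShift CLI (oc).
- You have enabled and configured monitoring for user-defined projects.
- You have created a ServiceMonitor resource.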
Procedure
Ensure that your project is not excluded from user workload monitoring. The following examples use the ns1 project.
- Verify that the project does not have the openshift.io/user-monitoring=false label attached. If the command prints nothing, the label is not attached:
  $ oc get namespace ns1 --show-labels | grep 'openshift.io/user-monitoring=false'
  Note: The default label set for user workload projects is openshift.io/user-monitoring=true. However, the label is not visible unless you manually apply it.
- If the label is attached, remove it:
  Example of removing the label from the project
  $ oc label namespace ns1 'openshift.io/user-monitoring-'
  Example output
  namespace/ns1 unlabeled
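The label check can also be scripted by reading the label value directly instead of filtering the `--show-labels` output. The following is a minimal sketch rather than part of the product tooling: the function name `monitoring_state` is hypothetical, and the query in the usage comment assumes an active `oc` login.

```shell
# Hypothetical helper: interprets the value of the
# openshift.io/user-monitoring label on a namespace.
# An empty value means the label is not set on the namespace.
monitoring_state() {
  case "$1" in
    false)   echo "excluded" ;;  # project is excluded from user workload monitoring
    ""|true) echo "included" ;;  # label absent or explicitly true
    *)       echo "unknown" ;;   # any other value is not covered by this procedure
  esac
}

# Usage against a live cluster (not run here):
#   value=$(oc get namespace ns1 \
#     -o jsonpath='{.metadata.labels.openshift\.io/user-monitoring}')
#   monitoring_state "$value"
```

If the result is `excluded`, remove the label as described in the step above.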
Check that the corresponding labels match in the service and ServiceMonitor resource configurations. The following examples use the prometheus-example-app service, the prometheus-example-monitor service monitor, and the ns1 project.
- Obtain the label defined in the service:
  $ oc -n ns1 get service prometheus-example-app -o yaml
  Example output
    labels:
      app: prometheus-example-app
- Check that the matchLabels definition in the ServiceMonitor resource configuration matches the label output from the previous step:
  $ oc -n ns1 get servicemonitor prometheus-example-monitor -o yaml
  Example output
  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    name: prometheus-example-monitor
    namespace: ns1
  spec:
    endpoints:
    - interval: 30s
      port: web
      scheme: http
    selector:
      matchLabels:
        app: prometheus-example-app
  Note: You can check service and ServiceMonitor resource labels as a developer with view permissions for the project.
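Comparing the two YAML outputs by eye is error-prone when a service carries many labels. The sketch below shows one way to automate the comparison. The function name `selector_matches` is hypothetical, it only handles `matchLabels` (not `matchExpressions`), and the `go-template` queries in the usage comment are an assumption about how you might extract the pairs.

```shell
# selector_matches SERVICE_LABELS SELECTOR_LABELS
# Both arguments are newline-separated key=value pairs.
# Succeeds only if every selector pair also appears among the service labels.
selector_matches() {
  service_labels=$1
  selector_labels=$2
  # Read the selector pairs one per line from a here-document.
  while IFS= read -r pair; do
    [ -z "$pair" ] && continue
    # -x matches whole lines, -F treats the pair as a literal string.
    if ! printf '%s\n' "$service_labels" | grep -qxF -- "$pair"; then
      echo "missing on service: $pair"
      return 1
    fi
  done <<EOF
$selector_labels
EOF
  return 0
}

# Usage against a live cluster (not run here):
#   tmpl='{{range $k, $v := .metadata.labels}}{{$k}}={{$v}}{{"\n"}}{{end}}'
#   svc=$(oc -n ns1 get service prometheus-example-app -o go-template="$tmpl")
#   tmpl='{{range $k, $v := .spec.selector.matchLabels}}{{$k}}={{$v}}{{"\n"}}{{end}}'
#   sel=$(oc -n ns1 get servicemonitor prometheus-example-monitor -o go-template="$tmpl")
#   selector_matches "$svc" "$sel" && echo "labels match"
```

The service may carry extra labels. Only the pairs listed in the selector must be present.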
Inspect the logs for the Prometheus Operator in the openshift-user-workload-monitoring project.
- List the pods in the openshift-user-workload-monitoring project:
  $ oc -n openshift-user-workload-monitoring get pods
  Example output
  NAME                                   READY   STATUS    RESTARTS   AGE
  prometheus-operator-776fcbbd56-2nbfm   2/2     Running   0          132m
  prometheus-user-workload-0             5/5     Running   1          132m
  prometheus-user-workload-1             5/5     Running   1          132m
  thanos-ruler-user-workload-0           3/3     Running   0          132m
  thanos-ruler-user-workload-1           3/3     Running   0          132m
- Obtain the logs from the prometheus-operator container in the prometheus-operator pod. In the following example, the pod is called prometheus-operator-776fcbbd56-2nbfm:
  $ oc -n openshift-user-workload-monitoring logs prometheus-operator-776fcbbd56-2nbfm -c prometheus-operator
  If there is an issue with the service monitor, the logs might include an error similar to this example:
  level=warn ts=2020-08-10T11:48:20.906739623Z caller=operator.go:1829 component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via bearer token file which Prometheus specification prohibits" servicemonitor=eagle/eagle namespace=openshift-user-workload-monitoring prometheus=user-workload
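Operator logs can be long, so it can help to filter them for the "skipping servicemonitor" warning shown above. The following sketch assumes the log lines keep the key=value layout of that example, which may differ between Prometheus Operator versions. The function name `skipped_servicemonitors` is hypothetical.

```shell
# skipped_servicemonitors: reads Prometheus Operator log lines on stdin and
# prints the namespace/name of each ServiceMonitor the operator skipped.
skipped_servicemonitors() {
  # Keep only the warning lines, extract the servicemonitor=<value> field,
  # and remove duplicates.
  grep 'msg="skipping servicemonitor"' |
    sed -n 's/.* servicemonitor=\([^ ]*\).*/\1/p' |
    sort -u
}

# Usage against a live cluster (not run here):
#   oc -n openshift-user-workload-monitoring logs \
#     prometheus-operator-776fcbbd56-2nbfm -c prometheus-operator |
#     skipped_servicemonitors
```

An empty result means no such warning was logged. It does not by itself prove that the ServiceMonitor resource is being used.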
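Review the target status for your endpoint on the Metrics targets page in the OpenShift Container Platform web console UI.
- Log in to the OpenShift Container Platform web console and navigate to Observe → Targets in the Administrator perspective.
- Locate the metrics endpoint in the list, and review the status of the target in the Status column.
- If the Status is Down, click the URL for the endpoint to view more information on the Target Details page for that metrics target.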
Configure debug level logging for the Prometheus Operator in the openshift-user-workload-monitoring project.
- Edit the user-workload-monitoring-config ConfigMap object in the openshift-user-workload-monitoring project:
  $ oc -n openshift-user-workload-monitoring edit configmap user-workload-monitoring-config
- Add logLevel: debug for prometheusOperator under data/config.yaml to set the log level to debug:
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: user-workload-monitoring-config
    namespace: openshift-user-workload-monitoring
  data:
    config.yaml: |
      prometheusOperator:
        logLevel: debug
  # ...
- Save the file to apply the changes. The affected prometheus-operator pod is automatically redeployed.
- Confirm that the debug log level has been applied to the prometheus-operator deployment in the openshift-user-workload-monitoring project:
  $ oc -n openshift-user-workload-monitoring get deploy prometheus-operator -o yaml | grep "log-level"
  Example output
  - --log-level=debug
  Debug level logging shows all calls made by the Prometheus Operator.
- Check that the prometheus-operator pod is running:
  $ oc -n openshift-user-workload-monitoring get pods
  Note: If an unrecognized Prometheus Operator logLevel value is included in the config map, the prometheus-operator pod might not restart successfully.
- Review the debug logs to see if the Prometheus Operator is using the ServiceMonitor resource. Review the logs for other related errors.
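The confirmation of the applied log level can be scripted so that it yields a bare value that is easy to compare. This is a minimal sketch: the function name `applied_log_level` is hypothetical, and it assumes the deployment manifest contains an argument of the form `--log-level=<value>`, as in the example output above.

```shell
# applied_log_level: reads a deployment manifest on stdin and prints the
# value of the first --log-level argument it finds (nothing if absent).
applied_log_level() {
  sed -n 's/.*--log-level=\([A-Za-z]*\).*/\1/p' | head -n 1
}

# Usage against a live cluster (not run here):
#   level=$(oc -n openshift-user-workload-monitoring get deploy \
#     prometheus-operator -o yaml | applied_log_level)
#   [ "$level" = "debug" ] && echo "debug logging is active"
```

If the function prints nothing or a value other than `debug`, recheck the `logLevel` entry in the config map and the state of the prometheus-operator pod.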