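1.2. Configuring the monitoring stack

Before OpenShift Container Platform 4, the Prometheus Cluster Monitoring stack was configured through the Ansible inventory file. For that purpose, the stack exposed a subset of its available configuration options as Ansible variables. You had to configure the stack before installing OpenShift Container Platform.

In OpenShift Container Platform 4, Ansible is no longer the primary method for installing OpenShift Container Platform. The installation program provides only a very small number of configuration options before installation. Most OpenShift framework components, including the Prometheus Cluster Monitoring stack, are configured after installation.

This section explains which configurations are supported, shows how to configure the monitoring stack, and demonstrates several common configuration scenarios.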

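1.2.1. Prerequisites

The monitoring stack imposes additional resource requirements. Consult the computing resources recommendations in Scaling the Cluster Monitoring Operator and verify that you have sufficient resources.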

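1.2.2. Maintenance and support

The supported way of configuring OpenShift Container Platform Monitoring is to use the options described in this document. Do not use other configurations, as they are unsupported. Configuration paradigms might change across Prometheus releases, and such changes can only be handled gracefully if all configuration possibilities are controlled. If you use configurations other than those described in this section, your changes might be lost because the cluster-monitoring-operator reconciles any differences. By design, the Operator resets everything to the defined state by default.

Explicitly unsupported cases include:

Creating additional ServiceMonitor objects in the openshift-* namespaces. This extends the set of targets that the cluster monitoring Prometheus instance scrapes, which can cause collisions and load differences that cannot be accounted for. These factors might make the Prometheus setup unstable.
Creating unexpected ConfigMap objects or PrometheusRule objects. This causes the cluster monitoring Prometheus instance to include additional alerting and recording rules.
Modifying resources of the stack. The Prometheus monitoring stack ensures that its resources are always in the expected state. If they are modified, the stack resets them.
Using resources of the stack for other purposes. The resources created by the Prometheus Cluster Monitoring stack are not meant to be used by any other resources, because there are no guarantees about their backward compatibility.
Stopping the Cluster Monitoring Operator from reconciling the monitoring stack.
Adding new alerting rules.
Modifying the monitoring stack Grafana instance.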

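1.2.3. Creating a cluster monitoring config map

To configure the OpenShift Container Platform monitoring stack, you must create the cluster monitoring ConfigMap object.

Prerequisites

You have access to the cluster as a user with the cluster-admin role.
You have installed the OpenShift CLI (oc).

Procedure

Check whether the cluster-monitoring-config ConfigMap object exists: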

$ oc -n openshift-monitoring get configmap cluster-monitoring-config

If the ConfigMap object does not exist:

Create the following YAML manifest. In this example, the file is called cluster-monitoring-config.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |

Apply the configuration to create the ConfigMap object:

$ oc apply -f cluster-monitoring-config.yaml
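
The check-then-create flow above can also be scripted. The following is a minimal sketch rather than part of the documented procedure: it writes the same manifest with a here-document and defines a hypothetical helper, ensure_config_map, that applies the manifest only when the lookup fails. The lookup can fail for other reasons too (for example, a missing login), so treat the helper as illustrative.

```shell
#!/bin/sh
# Write the empty cluster-monitoring-config manifest shown above.
manifest=cluster-monitoring-config.yaml
cat > "$manifest" <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
EOF

# Hypothetical helper (not an oc feature): create the object only if
# "oc get" cannot find it. It is defined here but not called.
ensure_config_map() {
  if oc -n openshift-monitoring get configmap cluster-monitoring-config >/dev/null 2>&1; then
    echo "cluster-monitoring-config already exists"
  else
    oc apply -f "$manifest"
  fi
}
```

Call ensure_config_map from a shell that is already logged in to the cluster as a cluster-admin user.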

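1.2.4. Configuring the cluster monitoring stack

You can configure the Prometheus Cluster Monitoring stack by using config maps. Config maps configure the Cluster Monitoring Operator, which in turn configures the components of the stack.

Prerequisites

You have access to the cluster as a user with the cluster-admin role.
You have installed the OpenShift CLI (oc).
You have created the cluster-monitoring-config ConfigMap object.

Procedure

Edit the cluster-monitoring-config ConfigMap object: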

$ oc -n openshift-monitoring edit configmap cluster-monitoring-config

Put your configuration under data/config.yaml as a key-value pair of the form <component_name>: <component_configuration>:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    <component>:
      <configuration_for_the_component>

Substitute <component> and <configuration_for_the_component> accordingly.

For example, create this ConfigMap object to configure a persistent volume claim (PVC) for Prometheus:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          storageClassName: fast
          volumeMode: Filesystem
          resources:
            requests:
              storage: 40Gi

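Here, prometheusK8s defines the Prometheus component, and the lines that follow define its configuration.

Save the file to apply the changes. The pods affected by the new configuration are restarted automatically.

Additional resources

See Creating a cluster monitoring config map to learn how to create the cluster-monitoring-config ConfigMap object.

1.2.5. Configurable monitoring components

This table shows the monitoring components you can configure and the keys used to specify them in the config map:

Table 1.2. Configurable monitoring components
Component	Key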
Prometheus Operator	`prometheusOperator`
Prometheus	`prometheusK8s`
Alertmanager	`alertmanagerMain`
kube-state-metrics	`kubeStateMetrics`
openshift-state-metrics	`openshiftStateMetrics`
Grafana	`grafana`
Telemeter Client	`telemeterClient`
Prometheus Adapter	`k8sPrometheusAdapter`

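Of the components in this list, only Prometheus and Alertmanager have extensive configuration options. All other components usually provide only the nodeSelector field, which is used to deploy them on specified nodes.

1.2.6. Moving monitoring components to different nodes

You can move any of the monitoring stack components to specific nodes.

Prerequisites

You have access to the cluster as a user with the cluster-admin role.
You have installed the OpenShift CLI (oc).
You have created the cluster-monitoring-config ConfigMap object.

Procedure

Edit the cluster-monitoring-config ConfigMap object: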

$ oc -n openshift-monitoring edit configmap cluster-monitoring-config

Specify the nodeSelector constraint for the component under data/config.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    <component>:
      nodeSelector:
        <node_key>: <node_value>
        <node_key>: <node_value>
        <...>

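Substitute <component> accordingly, and substitute <node_key>: <node_value> with the map of key-value pairs that specifies the destination nodes. Often, only a single key-value pair is used.

The component can run only on nodes that have every specified key-value pair as a label. The nodes can have additional labels as well.

For example, to move components to the nodes that are labeled foo: bar, use: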

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusOperator:
      nodeSelector:
        foo: bar
    prometheusK8s:
      nodeSelector:
        foo: bar
    alertmanagerMain:
      nodeSelector:
        foo: bar
    kubeStateMetrics:
      nodeSelector:
        foo: bar
    grafana:
      nodeSelector:
        foo: bar
    telemeterClient:
      nodeSelector:
        foo: bar
    k8sPrometheusAdapter:
      nodeSelector:
        foo: bar
    openshiftStateMetrics:
      nodeSelector:
        foo: bar

Save the file to apply the changes. The components affected by the new configuration are moved to the new nodes automatically.
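
Because the example above repeats the same nodeSelector for every component, the body of config.yaml can also be generated. The following is a minimal sketch that assumes the foo: bar label from the example; the output file name node-selector-fragment.yaml is hypothetical.

```shell
#!/bin/sh
# Component keys taken from Table 1.2.
components="prometheusOperator prometheusK8s alertmanagerMain kubeStateMetrics openshiftStateMetrics grafana telemeterClient k8sPrometheusAdapter"

# Emit one nodeSelector stanza per component, indented so that it can sit
# directly under the "config.yaml: |" line of the ConfigMap.
for component in $components; do
  printf '    %s:\n      nodeSelector:\n        foo: bar\n' "$component"
done > node-selector-fragment.yaml

cat node-selector-fragment.yaml
```

The generated text can be pasted under data/config.yaml while editing the ConfigMap object. The target nodes must already carry the label; oc label nodes <node_name> foo=bar is one way to add it.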

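Additional resources

See Creating a cluster monitoring config map to learn how to create the cluster-monitoring-config ConfigMap object.
For more information about using node selectors, see Placing pods on specific nodes using node selectors.
See the Kubernetes documentation for details on the nodeSelector constraint.

1.2.7. Assigning tolerations to monitoring components

You can assign tolerations to any of the monitoring stack components so that they can be moved to tainted nodes.

Prerequisites

You have access to the cluster as a user with the cluster-admin role.
You have installed the OpenShift CLI (oc).
You have created the cluster-monitoring-config ConfigMap object.

Procedure

Edit the cluster-monitoring-config ConfigMap object: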

$ oc -n openshift-monitoring edit configmap cluster-monitoring-config

Specify tolerations for the component:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    <component>:
      tolerations:
        <toleration_specification>

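Substitute <component> and <toleration_specification> accordingly.

For example, the taint created by oc adm taint nodes node1 key1=value1:NoSchedule prevents the scheduler from placing pods on the foo: bar node. To make the alertmanagerMain component ignore that taint and be placed on the foo: bar node as usual, use this toleration: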

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      nodeSelector:
        foo: bar
      tolerations:
      - key: "key1"
        operator: "Equal"
        value: "value1"
        effect: "NoSchedule"

Save the file to apply the changes. The new component placement configuration is applied automatically.
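
To illustrate why the toleration above lets the pod onto the tainted node, the following sketch mimics the matching rule for the Equal operator only: the toleration matches the taint when key, value, and effect are all identical. The function name toleration_matches is hypothetical, and the real scheduler handles more cases (for example, the Exists operator), so this is a simplified model rather than the actual implementation.

```shell
#!/bin/sh
# toleration_matches TAINT KEY VALUE EFFECT
# TAINT uses the key=value:effect form passed to "oc adm taint".
# Simplified model: only the "Equal" operator is covered.
toleration_matches() {
  taint=$1; tol_key=$2; tol_value=$3; tol_effect=$4
  taint_key=${taint%%=*}      # text before "="
  rest=${taint#*=}            # text after "="
  taint_value=${rest%%:*}     # text before ":"
  taint_effect=${rest#*:}     # text after ":"
  [ "$taint_key" = "$tol_key" ] &&
    [ "$taint_value" = "$tol_value" ] &&
    [ "$taint_effect" = "$tol_effect" ]
}

if toleration_matches "key1=value1:NoSchedule" key1 value1 NoSchedule; then
  echo "toleration matches the taint"
fi
```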

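Additional resources

See Creating a cluster monitoring config map to learn how to create the cluster-monitoring-config ConfigMap object.
See the OpenShift Container Platform documentation on taints and tolerations.
See the Kubernetes documentation on taints and tolerations.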

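1.2.8. Configuring persistent storage

When you run cluster monitoring with persistent storage, your metrics are stored on a persistent volume (PV) and survive a pod being restarted or recreated. This is ideal if you need to guard your metrics or alerting data against loss. For production environments, configuring persistent storage is strongly recommended. Because of the high IO demands, local storage is advantageous.

Important

See Recommended configurable storage technology.

1.2.9. Prerequisites

Dedicate sufficient local persistent storage to ensure that the disk does not become full. How much storage you need depends on the number of pods. For information on system requirements for persistent storage, see Prometheus database storage requirements.
Make sure that persistent volumes (PVs) are ready to be claimed by the persistent volume claims (PVCs), one PV for each replica. Because Prometheus has two replicas and Alertmanager has three replicas, you need five PVs to support the entire monitoring stack. The PVs should be provided by the Local Storage Operator. This requirement does not apply if you enable dynamically provisioned storage.
Use the block type of storage.
Configure local persistent storage.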

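1.2.9.1. Configuring a local persistent volume claim

For Prometheus or Alertmanager to use a persistent volume (PV), you must first configure a persistent volume claim (PVC).

Prerequisites

You have access to the cluster as a user with the cluster-admin role.
You have installed the OpenShift CLI (oc).
You have created the cluster-monitoring-config ConfigMap object.

Procedure

Edit the cluster-monitoring-config ConfigMap object: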

$ oc -n openshift-monitoring edit configmap cluster-monitoring-config

Put the PVC configuration for the component under data/config.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    <component>:
      volumeClaimTemplate:
        metadata:
          name: <PVC_name_prefix>
        spec:
          storageClassName: <storage_class>
          resources:
            requests:
              storage: <amount_of_storage>

See the Kubernetes documentation on PersistentVolumeClaims for information on how to specify volumeClaimTemplate.

For example, to configure a PVC that claims local persistent storage for Prometheus, use:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        metadata:
          name: localpvc
        spec:
          storageClassName: local-storage
          resources:
            requests:
              storage: 40Gi

In the above example, the storage class created by the Local Storage Operator is called local-storage.

To configure a PVC that claims local persistent storage for Alertmanager, use:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      volumeClaimTemplate:
        metadata:
          name: localpvc
        spec:
          storageClassName: local-storage
          resources:
            requests:
              storage: 40Gi

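Save the file to apply the changes. The pods affected by the new configuration are restarted automatically, and the new storage configuration is applied.

1.2.9.2. Modifying the retention time for Prometheus metrics data

By default, the Prometheus Cluster Monitoring stack configures the retention time for Prometheus data to be 15 days. You can modify the retention time to change how soon the data is deleted.

Prerequisites

You have access to the cluster as a user with the cluster-admin role.
You have installed the OpenShift CLI (oc).
You have created the cluster-monitoring-config ConfigMap object.

Procedure

Edit the cluster-monitoring-config ConfigMap object: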

$ oc -n openshift-monitoring edit configmap cluster-monitoring-config

Put your retention time configuration under data/config.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: <time_specification>

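Substitute <time_specification> with a number directly followed by ms (milliseconds), s (seconds), m (minutes), h (hours), d (days), w (weeks), or y (years).

For example, to configure the retention time to be 24 hours, use: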

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 24h

Save the file to apply the changes. The pods affected by the new configuration are restarted automatically.
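
A retention value can be checked against the format described above before it is saved. The following is a minimal sketch; the helper name is_valid_retention is hypothetical, and it only checks the "number directly followed by a unit" form given in this section. What Prometheus itself accepts may differ.

```shell
#!/bin/sh
# is_valid_retention VALUE
# Succeeds when VALUE is one or more digits directly followed by one of
# the units listed in this section: ms, s, m, h, d, w, or y.
is_valid_retention() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(ms|s|m|h|d|w|y)$'
}

for value in 24h 15d 24hours; do
  if is_valid_retention "$value"; then
    echo "$value: ok"
  else
    echo "$value: rejected"
  fi
done
```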

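Additional resources

See Creating a cluster monitoring config map to learn how to create the cluster-monitoring-config ConfigMap object.
Understanding persistent storage
Optimizing storage

1.2.10. Configuring Alertmanager

The Prometheus Alertmanager is the component that manages incoming alerts. Its functions include:

Silencing alerts
Inhibiting alerts
Aggregating alerts
Reliably deduplicating alerts
Grouping alerts
Sending grouped alerts as notifications through receivers such as email, PagerDuty, and HipChat

1.2.10.1. Alertmanager default configuration

The default configuration of the OpenShift Container Platform Monitoring Alertmanager cluster is as follows: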

global:
  resolve_timeout: 5m
route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: default
  routes:
  - match:
      alertname: Watchdog
    repeat_interval: 5m
    receiver: watchdog
receivers:
- name: default
- name: watchdog

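OpenShift Container Platform monitoring ships with the Watchdog alert, which fires continuously. Alertmanager repeatedly sends notifications for the Watchdog alert to the notification provider, for example, PagerDuty. The provider is usually configured to notify the administrator when it stops receiving the Watchdog alert. This mechanism helps ensure that Prometheus keeps operating and that communication between Alertmanager and the notification provider is uninterrupted.

1.2.10.2. Applying a custom Alertmanager configuration

You can overwrite the default Alertmanager configuration by editing the alertmanager-main secret in the openshift-monitoring namespace.

Prerequisites

You have installed the jq tool for processing JSON data.

Procedure

Print the currently active Alertmanager configuration into the file alertmanager.yaml: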

$ oc -n openshift-monitoring get secret alertmanager-main --template='{{ index .data "alertmanager.yaml" }}' |base64 -d > alertmanager.yaml

Change the configuration in the file alertmanager.yaml to your new configuration:

global:
  resolve_timeout: 5m
route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: default
  routes:
  - match:
      alertname: Watchdog
    repeat_interval: 5m
    receiver: watchdog
  - match:
      service: <your_service> # 1
    routes:
    - match:
        <your_matching_rules> # 2
      receiver: <receiver> # 3
receivers:
- name: default
- name: watchdog
- name: <receiver>
  <receiver_configuration>

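1: service specifies the service that fires the alerts.
2: <your_matching_rules> specifies the target alerts.
3: receiver specifies the receiver to use for the alert.

For example, this listing configures PagerDuty for notifications: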

global:
  resolve_timeout: 5m
route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: default
  routes:
  - match:
      alertname: Watchdog
    repeat_interval: 5m
    receiver: watchdog
  - match:
      service: example-app
    routes:
    - match:
        severity: critical
      receiver: team-frontend-page
receivers:
- name: default
- name: watchdog
- name: team-frontend-page
  pagerduty_configs:
  - service_key: "your-key"

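With this configuration, alerts of critical severity fired by the example-app service are sent through the team-frontend-page receiver, which means that these alerts are paged to a chosen person.

Apply the new configuration in the file: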

$ oc -n openshift-monitoring create secret generic alertmanager-main --from-file=alertmanager.yaml --dry-run -o=yaml |  oc -n openshift-monitoring replace secret --filename=-
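
The export step in this procedure pipes the secret value through base64 -d because secret data is stored base64-encoded. The following local sketch illustrates that round trip without touching a cluster; the file name alertmanager-sample.yaml and its content are placeholders.

```shell
#!/bin/sh
# Stand-in for the real configuration file (sample content only).
printf 'global:\n  resolve_timeout: 5m\n' > alertmanager-sample.yaml

# Encode the file the way a secret value is stored, then decode it again,
# which is what "base64 -d" does in the export step.
encoded=$(base64 < alertmanager-sample.yaml | tr -d '\n')
decoded=$(printf '%s' "$encoded" | base64 -d)

if [ "$decoded" = "$(cat alertmanager-sample.yaml)" ]; then
  echo "round trip ok"
fi
```

After replacing the secret, one way to confirm the change is to rerun the export command from the first step and compare its output with alertmanager.yaml, for example with diff.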

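Additional resources

See the PagerDuty official site for more information on PagerDuty.
See the PagerDuty Prometheus Integration Guide to learn how to retrieve the service_key.
See Alertmanager configuration for configuring alerting through different alert receivers.

1.2.10.3. Alerting rules

OpenShift Container Platform Cluster Monitoring ships with a set of predefined alerting rules by default.

Note that:

The default alerting rules are used specifically for the OpenShift Container Platform cluster and nothing else. For example, you get alerts for a persistent volume in the cluster, but you do not get them for a persistent volume in your custom namespace.
Currently, you cannot add custom alerting rules.
Some alerting rules have identical names. This is intentional. They send alerts about the same event with different thresholds, different severity, or both.
With the inhibition rules, the lower severity is inhibited when the higher severity is firing.

1.2.10.4. Listing acting alerting rules

You can list the alerting rules that currently apply to the cluster.

Procedure

Configure the necessary port forwarding: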

$ oc -n openshift-monitoring port-forward svc/prometheus-operated 9090

Fetch the JSON object that contains the acting alerting rules and their properties:

$ curl -s http://localhost:9090/api/v1/rules | jq '[.data.groups[].rules[] | select(.type=="alerting")]'

Example output

[
  {
    "name": "ClusterOperatorDown",
    "query": "cluster_operator_up{job=\"cluster-version-operator\"} == 0",
    "duration": 600,
    "labels": {
      "severity": "critical"
    },
    "annotations": {
      "message": "Cluster operator {{ $labels.name }} has not been available for 10 mins. Operator may be down or disabled, cluster will not be kept up to date and upgrades will not be possible."
    },
    "alerts": [],
    "health": "ok",
    "type": "alerting"
  },
  {
    "name": "ClusterOperatorDegraded",
    ...
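
The jq filter from the procedure can be narrowed further. The following sketch runs against a small local sample instead of the live endpoint and prints only the names of alerting rules whose severity is critical. The sample file rules-sample.json, the rules ExampleWarning and example:recording in it, and the helper name critical_alert_names are hypothetical; only the field names follow the example output above.

```shell
#!/bin/sh
# Hypothetical sample that follows the shape of the /api/v1/rules response.
cat > rules-sample.json <<'EOF'
{"data":{"groups":[{"rules":[
  {"name":"ClusterOperatorDown","type":"alerting","labels":{"severity":"critical"}},
  {"name":"ExampleWarning","type":"alerting","labels":{"severity":"warning"}},
  {"name":"example:recording","type":"recording"}
]}]}}
EOF

# Print the names of alerting rules with severity "critical" from FILE.
critical_alert_names() {
  jq -r '.data.groups[].rules[]
         | select(.type=="alerting" and .labels.severity=="critical")
         | .name' "$1"
}
```

Calling critical_alert_names rules-sample.json runs the filter on the sample. Against a cluster, the same filter could be applied to the curl output from the procedure.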

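Additional resources

See also the Alertmanager documentation.

1.2.11. Next steps

Manage cluster alerts.
Learn about remote health reporting and, if necessary, opt out of it.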
