Chapter 6. Monitoring disaster recovery health
6.1. Enabling the disaster recovery dashboard on the Hub cluster
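You can enable the disaster recovery dashboard on the Hub cluster only after ODF Multicluster Orchestrator has been installed with the console plugin enabled.
For Regional-DR, the dashboard uses monitoring status cards, such as operator health and cluster health, to show metrics, alerts, and application count.
For Metro-DR, you can configure the dashboard to monitor only the ramen setup health and application count.
The dashboard shows data only for ApplicationSet-based applications, not for Subscription-based applications.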
Prerequisites
Procedure
- On the Hub cluster, open a terminal window and perform the following steps.
- Add a label to the openshift-operators namespace.

    $ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'

- Create a configmap file named observability-metrics-custom-allowlist.yaml. You can use the following YAML to list the disaster recovery metrics on the Hub cluster. For details, see Adding custom metrics. To learn more about the ramen metrics, see Disaster recovery metrics.

    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: observability-metrics-custom-allowlist
      namespace: open-cluster-management-observability
    data:
      metrics_list.yaml: |
        names:
          - ramen_last_sync_timestamp_seconds
          - ramen_policy_schedule_interval_seconds
        matches:
          - __name__="csv_succeeded",exported_namespace="openshift-dr-system",name=~"odr-cluster-operator.*"
          - __name__="csv_succeeded",exported_namespace="openshift-operators",name=~"volsync.*"
        recording_rules:
          - record: count_persistentvolumeclaim_total
            expr: count(kube_persistentvolumeclaim_info)
          - record: ramen_sync_duration_seconds
            expr: sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))

- Run the following command in the open-cluster-management-observability namespace:

    $ oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml

  After the observability-metrics-custom-allowlist configmap is created, RHACM starts collecting the listed OpenShift Data Foundation metrics from all managed clusters. To exclude a specific managed cluster from observability data collection, add the following cluster label to that cluster: observability: disabled.

- Create a configmap file named thanos-ruler-custom-rules.yaml and add the name of the custom alert rule to the custom_rules.yaml parameter. You can use the following YAML to create alerts on the Hub cluster based on the ramen metrics. For details, see Adding custom metrics. To learn more about the alerts, see Disaster recovery alerts.

    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: thanos-ruler-custom-rules
      namespace: open-cluster-management-observability
    data:
      custom_rules.yaml: |
        groups:
          - name: ramen-alerts
            rules:
            - record: ramen_rpo_difference
              expr: ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
            - alert: VolumeSynchronizationDelay
              expr: ramen_rpo_difference >= 3
              for: 5s
              labels:
                cluster: "{{ $labels.cluster }}"
                severity: critical
              annotations:
                description: "Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }}) is taking more than thrice the scheduled snapshot interval. This may cause data loss and a backlog of replication requests."
                alert_type: "DisasterRecovery"
            - alert: VolumeSynchronizationDelay
              expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
              for: 5s
              labels:
                cluster: "{{ $labels.cluster }}"
                severity: warning
              annotations:
                description: "Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }}) is taking more than twice the scheduled snapshot interval. This may cause data loss and impact replication requests."
                alert_type: "DisasterRecovery"

- Run the following command in the open-cluster-management-observability namespace:

    $ oc apply -n open-cluster-management-observability -f thanos-ruler-custom-rules.yaml
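The two VolumeSynchronizationDelay rules compare the time since the last successful sync with the scheduled interval. The following sketch reproduces that arithmetic locally so the thresholds are easier to reason about. The function name classify_rpo and the sample values are hypothetical and are not part of Ramen or RHACM. The sketch also ignores the `for: 5s` hold time and assumes the interval is greater than zero.

```shell
#!/bin/sh
# Hypothetical helper: mirrors the ramen_rpo_difference recording rule
# and the thresholds of the two VolumeSynchronizationDelay alerts.
#   $1 = seconds since the last successful sync (ramen_sync_duration_seconds)
#   $2 = scheduled interval in seconds (ramen_policy_schedule_interval_seconds)
classify_rpo() {
  awk -v elapsed="$1" -v interval="$2" 'BEGIN {
    if (interval <= 0) { print "unknown"; exit }   # guard against division by zero
    ratio = elapsed / interval                      # same division as ramen_rpo_difference
    if (ratio >= 3)     print "critical"            # expr: ramen_rpo_difference >= 3
    else if (ratio > 2) print "warning"             # expr: > 2 and < 3
    else                print "ok"                  # no alert fires
  }'
}

# Sample values, assuming a 300-second schedule interval.
classify_rpo 1000 300   # ratio is about 3.33
classify_rpo 750 300    # ratio is 2.5
classify_rpo 300 300    # ratio is 1
```

With a 300-second interval, a sync delay of exactly 600 seconds gives a ratio of 2 and does not trigger the warning, because the rule uses a strict `> 2` comparison.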