7.5. Disaster recovery alerts
This section lists all supported alerts associated with Red Hat OpenShift Data Foundation in a disaster recovery environment.
Recording rules
record: ramen_sync_duration_seconds
- Expression: sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
- Purpose: The time interval, in seconds, between the volume group's last sync time and the current time.
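record: ramen_rpo_difference
- Expression: ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
- Purpose: The difference between the expected sync delay and the actual sync delay taken by the volume replication group, computed as the ratio of the actual sync delay to the scheduled interval.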
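As a worked illustration of the two recording rules above, the following Python sketch reproduces their arithmetic outside Prometheus. The function names and sample values are hypothetical and only mirror the expressions; they are not part of Ramen or OpenShift Data Foundation. The sketch ignores the `sum by` label aggregation and assumes a positive schedule interval.

```python
from typing import Optional


def sync_duration_seconds(now: float, last_sync_timestamp: float) -> Optional[float]:
    """Mirror of time() - (ramen_last_sync_timestamp_seconds > 0).

    The `> 0` filter drops series that have never synced, so this
    sketch returns None in that case instead of a huge duration.
    """
    if last_sync_timestamp <= 0:
        return None
    return now - last_sync_timestamp


def rpo_difference(sync_duration: float, schedule_interval: float) -> float:
    """Mirror of ramen_sync_duration_seconds / ramen_policy_schedule_interval_seconds."""
    return sync_duration / schedule_interval


# Hypothetical sample: last sync 12 minutes ago, policy interval of 5 minutes.
now = 1_700_000_720.0
duration = sync_duration_seconds(now, last_sync_timestamp=1_700_000_000.0)
ratio = rpo_difference(duration, schedule_interval=300.0)
print(duration, ratio)  # 720.0 2.4
```

With these sample values the last sync is 2.4 scheduled intervals old, which is the number the alerts in this section compare against their thresholds.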
record: count_persistentvolumeclaim_total
- Expression: count(kube_persistentvolumeclaim_info)
- Purpose: The total number of PVCs in the managed cluster.
Alerts
alert: VolumeSynchronizationDelay
- Impact: Critical
- Purpose: The actual sync delay taken by the volume group is at least three times the expected sync delay.
- YAML:
    alert: VolumeSynchronizationDelay
    expr: ramen_rpo_difference >= 3
    for: 5s
    labels:
      severity: critical
    annotations:
      description: "The syncing of volumes is exceeding three times the scheduled snapshot interval, or the volumes have been recently protected. (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})"
      alert_type: "DisasterRecovery"
alert: VolumeSynchronizationDelay
- Impact: Warning
- Purpose: The actual sync delay taken by the volume group is more than twice, but less than three times, the expected sync delay.
- YAML:
    alert: VolumeSynchronizationDelay
    expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
    for: 5s
    labels:
      severity: warning
    annotations:
      description: "The syncing of volumes is exceeding two times the scheduled snapshot interval, or the volumes have been recently protected. (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})"
      alert_type: "DisasterRecovery"
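The two VolumeSynchronizationDelay expressions split the ramen_rpo_difference range into bands. The Python sketch below maps a value to the severity whose expression it matches. The function name is hypothetical, and the sketch ignores the `for: 5s` hold period.

```python
from typing import Optional


def volume_sync_severity(rpo_difference: float) -> Optional[str]:
    """Map a ramen_rpo_difference value to the severity whose expr it matches."""
    if rpo_difference >= 3:
        return "critical"  # expr: ramen_rpo_difference >= 3
    if 2 < rpo_difference < 3:
        return "warning"   # expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
    return None            # neither expression matches


for value in (1.5, 2.0, 2.4, 3.0):
    print(value, volume_sync_severity(value))
```

Because the warning expression uses a strict `> 2`, a value of exactly 2 matches neither alert, while a value of exactly 3 already matches the critical one.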
alert: WorkloadUnprotected
- Impact: Warning
- Purpose: The application protection status has been degraded for more than 10 minutes.
- YAML:
    alert: WorkloadUnprotected
    expr: ramen_workload_protection_status == 0
    for: 10m
    labels:
      severity: warning
    annotations:
      description: "Workload is not protected for disaster recovery (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})."
      alert_type: "DisasterRecovery"
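The `for: 10m` clause means the condition must hold continuously before the alert fires. The following Python sketch is a simplified model of that behavior, assuming one rule evaluation per sample; the function name and sample data are hypothetical, and this is not Prometheus code.

```python
from typing import Iterable, Tuple


def alert_state(samples: Iterable[Tuple[float, float]], hold_seconds: float) -> str:
    """Return 'inactive', 'pending', or 'firing' after the last sample.

    samples: (timestamp_seconds, ramen_workload_protection_status) pairs
    in increasing time order. The alert condition is status == 0.
    """
    active_since = None
    state = "inactive"
    for ts, status in samples:
        if status == 0:
            if active_since is None:
                active_since = ts  # condition just became true
            held = ts - active_since
            state = "firing" if held >= hold_seconds else "pending"
        else:
            active_since = None  # any recovery resets the hold timer
            state = "inactive"
    return state


# Unprotected for 9 minutes, sampled every 60 seconds: still pending.
nine_minutes = [(float(t), 0.0) for t in range(0, 541, 60)]
# Unprotected for the full 10 minutes: firing.
ten_minutes = [(float(t), 0.0) for t in range(0, 601, 60)]
print(alert_state(nine_minutes, 600), alert_state(ten_minutes, 600))
```

In this model a single sample with a status other than 0 resets the timer, so a workload that briefly recovers must stay unprotected for another full 10 minutes before the warning fires.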