6.5. Disaster recovery alerts
This section lists all supported alerts associated with Red Hat OpenShift Data Foundation in a disaster recovery environment.
Recording rules
Record:
ramen_sync_duration_seconds
- Expression
sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
- Purpose
- The time interval, in seconds, between the volume group's last sync time and the current time.
Record:
ramen_rpo_difference
- Expression
ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
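- Purpose
- The difference between the expected sync delay and the actual sync delay taken by the volume replication group, expressed as the ratio of the actual sync duration to the scheduled sync interval.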
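The arithmetic behind the two recording rules above can be sketched in Python for a single time series. This is only an illustration under stated assumptions: the function names and the sample timestamps are hypothetical, the `sum by (...)` aggregation and the `on(policyname, job)` label matching are omitted, and the schedule interval is assumed to be positive.

```python
import time


def sync_duration_seconds(last_sync_timestamp, now=None):
    """Mirror `time() - (ramen_last_sync_timestamp_seconds > 0)` for one series.

    Returns None when the timestamp is not positive, because the `> 0`
    filter in the PromQL expression drops such series.
    """
    if last_sync_timestamp <= 0:
        return None
    if now is None:
        now = time.time()
    return now - last_sync_timestamp


def rpo_difference(sync_duration, schedule_interval_seconds):
    """Ratio of the actual sync duration to the scheduled interval.

    Assumes schedule_interval_seconds is positive (hypothetical guard-free sketch).
    """
    return sync_duration / schedule_interval_seconds


# Hypothetical sample values: last sync 900 s ago, 5-minute schedule.
now = 1_700_000_900.0
duration = sync_duration_seconds(1_700_000_000.0, now=now)
ratio = rpo_difference(duration, 300.0)
print(duration, ratio)  # 900.0 3.0
```

With these sample values the ratio is 3, meaning the last sync is three scheduled intervals old.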
Record:
count_persistentvolumeclaim_total
- Expression
count(kube_persistentvolumeclaim_info)
- Purpose
- The total number of PVCs from the managed cluster.
Alerts
Alert:
VolumeSynchronizationDelay
- Impact
- Critical
- Purpose
- The actual sync delay taken by the volume replication group is at least three times the expected sync delay.
- YAML
alert: VolumeSynchronizationDelay
expr: ramen_rpo_difference >= 3
for: 5s
labels:
  cluster: '{{ $labels.cluster }}'
  severity: critical
annotations:
  description: >-
    Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }}) is taking more than thrice the scheduled snapshot interval. This may cause data loss and a backlog of replication requests.
  alert_type: DisasterRecovery
Alert:
VolumeSynchronizationDelay
- Impact
- Warning
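- Purpose
- The actual sync delay taken by the volume replication group is more than twice, but less than three times, the expected sync delay.
- YAML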
alert: VolumeSynchronizationDelay
expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
for: 5s
labels:
  cluster: '{{ $labels.cluster }}'
  severity: warning
annotations:
  description: >-
    Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }}) is taking more than twice the scheduled snapshot interval. This may cause data loss and a backlog of replication requests.
  alert_type: DisasterRecovery
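As a rough illustration of how the two alert expressions partition the values of `ramen_rpo_difference`, the thresholds can be sketched in Python. The function name `classify_sync_delay` is hypothetical, the returned strings follow the Impact levels listed for each alert, and the `for: 5s` hold period is not modeled.

```python
def classify_sync_delay(rpo_difference):
    """Map a ramen_rpo_difference value to the alert impact it would match.

    Thresholds are taken from the two alert expressions:
      critical: ramen_rpo_difference >= 3
      warning:  ramen_rpo_difference > 2 and ramen_rpo_difference < 3
    Returns None when neither expression matches.
    """
    if rpo_difference >= 3:
        return "critical"
    if 2 < rpo_difference < 3:
        return "warning"
    return None


# Hypothetical sample ratios.
for ratio in (1.0, 2.0, 2.5, 3.0):
    print(ratio, classify_sync_delay(ratio))
```

Note that a ratio of exactly 2 matches neither expression, because the warning condition uses a strict `> 2` comparison.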