6.5. Disaster recovery alerts
This section provides a list of all supported alerts associated with Red Hat OpenShift Data Foundation within a disaster recovery environment.
Recording rules
Record: ramen_sync_duration_seconds
- Expression: sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
- Purpose: The time, in seconds, elapsed between the volume group’s last sync and the current time.
Record: ramen_rpo_difference
- Expression: ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
- Purpose: The ratio of the actual sync delay of the volume replication group to the expected sync delay (the policy's schedule interval).
Record: count_persistentvolumeclaim_total
- Expression: count(kube_persistentvolumeclaim_info)
- Purpose: Count of all PVCs in the managed cluster.
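The arithmetic behind the first two recording rules can be illustrated with a minimal Python sketch. It is not part of the product: the function names and the sample timestamps are hypothetical, and the label matching and sum by (...) aggregation of the real expressions are deliberately left out.

```python
import time


def sync_duration_seconds(last_sync_timestamp, now=None):
    """Seconds elapsed since the last sync (hypothetical helper).

    Returns None when no sync has been recorded (timestamp <= 0),
    mirroring the `> 0` filter in the recording rule expression.
    """
    if last_sync_timestamp <= 0:
        return None
    if now is None:
        now = time.time()
    return now - last_sync_timestamp


def rpo_difference(sync_duration, schedule_interval_seconds):
    """Actual sync delay divided by the scheduled interval (hypothetical helper)."""
    return sync_duration / schedule_interval_seconds


# Assumed sample values: last sync 900 seconds ago, 5-minute schedule interval.
now = 1_700_000_900.0
last_sync = 1_700_000_000.0
duration = sync_duration_seconds(last_sync, now)
ratio = rpo_difference(duration, 300.0)
print(duration, ratio)
```

With these assumed values the sync is 900 seconds old against a 300-second schedule, so the ratio is 3.0.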
Alerts
Alert: VolumeSynchronizationDelay
- Impact: Critical
- Purpose: The actual sync delay taken by the volume replication group is at least three times the expected sync delay.
- YAML:
  alert: VolumeSynchronizationDelay
  expr: ramen_rpo_difference >= 3
  for: 5s
  labels:
    cluster: '{{ $labels.cluster }}'
    severity: critical
  annotations:
    description: >-
      Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace:
      {{ $labels.obj_namespace }}) is taking more than thrice the
      scheduled snapshot interval. This may cause data loss and a
      backlog of replication requests.
    alert_type: DisasterRecovery
Alert: VolumeSynchronizationDelay
- Impact: Warning
- Purpose: The actual sync delay taken by the volume replication group is more than twice, but less than three times, the expected sync delay.
- YAML:
  alert: VolumeSynchronizationDelay
  expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
  for: 5s
  labels:
    cluster: '{{ $labels.cluster }}'
    severity: warning
  annotations:
    description: >-
      Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace:
      {{ $labels.obj_namespace }}) is taking more than twice the
      scheduled snapshot interval. This may cause data loss and a
      backlog of replication requests.
    alert_type: DisasterRecovery
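As a rough illustration of how the two alert expressions divide the range of ramen_rpo_difference, the following Python sketch maps a ratio to the impact level listed above. The function name is hypothetical, and the sketch does not model the for: 5s hold period or any label handling.

```python
def classify_sync_delay(rpo_difference):
    """Map an RPO ratio to the documented impact level (hypothetical helper).

    Mirrors the two alert expressions:
      ramen_rpo_difference >= 3                            -> Critical
      ramen_rpo_difference > 2 and ramen_rpo_difference < 3 -> Warning
    Returns None when neither expression matches.
    """
    if rpo_difference >= 3:
        return "Critical"
    if 2 < rpo_difference < 3:
        return "Warning"
    return None


# Assumed sample ratios, chosen to hit each branch.
for ratio in (1.0, 2.0, 2.5, 3.0):
    print(ratio, classify_sync_delay(ratio))
```

Note that a ratio of exactly 2 matches neither expression, because the warning condition uses a strict greater-than comparison.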