Chapter 6. Monitoring disaster recovery health
6.1. Enabling disaster recovery dashboard on Hub cluster
You can enable the disaster recovery dashboard after installing ODF Multicluster Orchestrator with the console plugin enabled.
For Regional-DR, the dashboard makes use of monitoring status cards like operator health and cluster health to show metrics, alerts and application count.
For Metro-DR, you can configure the dashboard to only monitor the ramen setup health and application count.
The dashboard only shows data for ApplicationSet-based applications, and not for Subscription-based applications.
Prerequisites
- Ensure that you have installed OpenShift Container Platform version 4.13 and have administrator privileges.
- Ensure that you have installed Red Hat Advanced Cluster Management for Kubernetes 2.8 (RHACM) from Operator Hub. For instructions on how to install, see Installing RHACM.
- Ensure you have enabled observability on RHACM. See Enabling observability guidelines.
Procedure
- On the Hub cluster, open a terminal window and perform the next steps.
- Add the label to the openshift-operators namespace.

      $ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'

- Create the configmap file named observability-metrics-custom-allowlist.yaml.

  You can use the following YAML to list the disaster recovery metrics on the Hub cluster. For details, see Adding custom metrics. To know more about ramen metrics, see Disaster recovery metrics.

      kind: ConfigMap
      apiVersion: v1
      metadata:
        name: observability-metrics-custom-allowlist
        namespace: open-cluster-management-observability
      data:
        metrics_list.yaml: |
          names:
            - ramen_last_sync_timestamp_seconds
            - ramen_policy_schedule_interval_seconds
          matches:
            - __name__="csv_succeeded",exported_namespace="openshift-dr-system",name=~"odr-cluster-operator.*"
            - __name__="csv_succeeded",exported_namespace="openshift-operators",name=~"volsync.*"
          recording_rules:
            - record: count_persistentvolumeclaim_total
              expr: count(kube_persistentvolumeclaim_info)
            - record: ramen_sync_duration_seconds
              expr: sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))

- In the open-cluster-management-observability namespace, run the following command:

      $ oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml

  After the observability-metrics-custom-allowlist ConfigMap is created, RHACM starts collecting the listed OpenShift Data Foundation metrics from all the managed clusters.

  To exclude a specific managed cluster from collecting the observability data, add the following cluster label to the cluster: observability: disabled.

- Create the configmap file named thanos-ruler-custom-rules.yaml and add the name of the custom alert rules to the custom_rules.yaml parameter.

  You can use the following YAML to create an alert against the ramen metrics on the Hub cluster. For details, see Adding custom metrics. To know more about the alerts, see Disaster recovery alerts.

      kind: ConfigMap
      apiVersion: v1
      metadata:
        name: thanos-ruler-custom-rules
        namespace: open-cluster-management-observability
      data:
        custom_rules.yaml: |
          groups:
            - name: ramen-alerts
              rules:
                - record: ramen_rpo_difference
                  expr: ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
                - alert: VolumeSynchronizationDelay
                  expr: ramen_rpo_difference >= 3
                  for: 5s
                  labels:
                    cluster: "{{ $labels.cluster }}"
                    severity: critical
                  annotations:
                    description: "Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }}) is taking more than thrice the scheduled snapshot interval. This may cause data loss and a backlog of replication requests."
                    alert_type: "DisasterRecovery"
                - alert: VolumeSynchronizationDelay
                  expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
                  for: 5s
                  labels:
                    cluster: "{{ $labels.cluster }}"
                    severity: warning
                  annotations:
                    description: "Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }}) is taking more than twice the scheduled snapshot interval. This may cause data loss and impact replication requests."
                    alert_type: "DisasterRecovery"

- Run the following command in the open-cluster-management-observability namespace:

      $ oc apply -n open-cluster-management-observability -f thanos-ruler-custom-rules.yaml
6.2. Viewing health status of disaster recovery replication relationships
Prerequisites
Ensure that you have enabled the disaster recovery dashboard for monitoring. For instructions, see Enabling disaster recovery dashboard on Hub cluster.
Procedure
- On the Hub cluster, ensure All Clusters option is selected.
- Refresh the console to make the DR monitoring dashboard tab accessible.
- Navigate to Data Services and click Data policies.
- On the Overview tab, you can view the health status of the operators, clusters and applications. A green tick indicates that the operators are running and available.
- Click the Disaster recovery tab to view a list of DR policy details and connected applications.
6.3. Disaster recovery metrics
These are the ramen metrics that are scraped by Prometheus.
- ramen_last_sync_timestamp_seconds
- ramen_policy_schedule_interval_seconds
Ramen’s last synchronization timestamp in seconds
The time, in seconds, of the most recent successful synchronization of all PVCs.
- Metric name: ramen_last_sync_timestamp_seconds
- Metric type: Gauge
- Labels:
  - ObjType: Type of the object, here it is DRPC
  - ObjName: Name of the object, here it is DRPC-Name
  - ObjNamespace: DRPC namespace
  - Policyname: Name of the DRPolicy
  - SchedulingInterval: Scheduling interval value from DRPolicy
- Metric value: Set to lastGroupSyncTime from DRPC, in seconds.
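As a rough illustration of the metric value, the following sketch converts a timestamp into the epoch seconds that a gauge such as ramen_last_sync_timestamp_seconds would carry. It assumes that lastGroupSyncTime is serialized as an RFC 3339 UTC string (for example, 2023-06-01T12:00:00Z) and that an unset value is reported as 0. The function name to_epoch_seconds and the sample value are hypothetical and are not part of Ramen.

```python
from datetime import datetime, timezone


def to_epoch_seconds(last_group_sync_time: str) -> float:
    """Convert an RFC 3339 UTC timestamp (assumed format) to Unix epoch seconds.

    An empty value is mapped to 0 (an assumption of this sketch); the recording
    rule `ramen_last_sync_timestamp_seconds > 0` would drop such a sample.
    """
    if not last_group_sync_time:
        return 0.0
    # Parse the "Z"-suffixed form explicitly, then mark the result as UTC.
    parsed = datetime.strptime(last_group_sync_time, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


# Hypothetical sample value, not taken from a real cluster.
print(to_epoch_seconds("2023-06-01T12:00:00Z"))
```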
Ramen’s policy schedule interval in seconds
This gives the scheduling interval in seconds from DRPolicy.
- Metric name: ramen_policy_schedule_interval_seconds
- Metric type: Gauge
- Labels:
  - Policyname: Name of the DRPolicy
- Metric value: Set to the scheduling interval, in seconds, taken from DRPolicy.
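To make the unit conversion concrete, the sketch below turns a scheduling interval string into seconds. It assumes that the DRPolicy scheduling interval is written as a whole number followed by m, h, or d (minutes, hours, or days), for example 5m. The helper interval_to_seconds is hypothetical and only illustrates the arithmetic; it is not how Ramen computes the metric.

```python
import re

# Seconds per supported unit suffix (assumed set of suffixes: m, h, d).
_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def interval_to_seconds(scheduling_interval: str) -> int:
    """Convert an interval such as "5m", "1h" or "2d" into seconds."""
    match = re.fullmatch(r"(\d+)([mhd])", scheduling_interval)
    if match is None:
        raise ValueError(f"unsupported scheduling interval: {scheduling_interval!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


# Hypothetical sample value: a five-minute replication schedule.
print(interval_to_seconds("5m"))
```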
6.4. Disaster recovery alerts
This section lists the supported alerts associated with Red Hat OpenShift Data Foundation 4.13 and above in a disaster recovery environment.
Recording rules
Record: ramen_sync_duration_seconds
- Expression: sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
- Purpose: The time interval, in seconds, between the volume group’s last sync time and the current time.

Record: ramen_rpo_difference
- Expression: ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
- Purpose: The ratio of the actual sync delay taken by the volume replication group to the expected sync delay (the scheduling interval).

Record: count_persistentvolumeclaim_total
- Expression: count(kube_persistentvolumeclaim_info)
- Purpose: Count of all PVCs from the managed cluster.
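The arithmetic behind the first two recording rules can be sketched for a single DRPC as follows. This is a simplified model, not a PromQL evaluator: it ignores label matching and the sum by aggregation (which is an identity when there is one series per label set), and the function names and sample numbers are hypothetical.

```python
from typing import Optional


def sync_duration_seconds(now: float, last_sync_timestamp: float) -> Optional[float]:
    """Model time() - (ramen_last_sync_timestamp_seconds > 0) for one series."""
    if last_sync_timestamp <= 0:
        # The "> 0" filter drops the sample, so the rule produces no value.
        return None
    return now - last_sync_timestamp


def rpo_difference(sync_duration: Optional[float], schedule_interval: float) -> Optional[float]:
    """Model ramen_sync_duration_seconds / ramen_policy_schedule_interval_seconds."""
    if sync_duration is None:
        return None
    if schedule_interval <= 0:
        # PromQL would yield +Inf or NaN here; this sketch simply returns None.
        return None
    return sync_duration / schedule_interval


# Hypothetical numbers: last sync 900 seconds ago, 300-second schedule.
duration = sync_duration_seconds(now=1000.0, last_sync_timestamp=100.0)
print(duration, rpo_difference(duration, 300.0))
```

With these sample values the sync duration is 900 seconds and the ratio is 3.0, that is, the last sync is three scheduling intervals old.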
Alerts
Alert: VolumeSynchronizationDelay
- Impact: Critical
- Purpose: Actual sync delay taken by the volume replication group is at least three times the expected sync delay.
- YAML:

      alert: VolumeSynchronizationDelay
      expr: ramen_rpo_difference >= 3
      for: 5s
      labels:
        cluster: '{{ $labels.cluster }}'
        severity: critical
      annotations:
        description: >-
          Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }}) is taking more than thrice the scheduled snapshot interval. This may cause data loss and a backlog of replication requests.
        alert_type: DisasterRecovery
Alert: VolumeSynchronizationDelay
- Impact: Warning
- Purpose: Actual sync delay taken by the volume replication group is more than twice, but less than three times, the expected sync delay.
- YAML:

      alert: VolumeSynchronizationDelay
      expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
      for: 5s
      labels:
        cluster: '{{ $labels.cluster }}'
        severity: warning
      annotations:
        description: >-
          Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }}) is taking more than twice the scheduled snapshot interval. This may cause data loss and a backlog of replication requests.
        alert_type: DisasterRecovery
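The thresholds of the two alert rules can be summarized in a small sketch that maps a ramen_rpo_difference value to the severity that would apply. The function alert_severity is hypothetical, and the sketch does not model the for: 5s hold period or Prometheus alert state.

```python
from typing import Optional


def alert_severity(rpo_difference: float) -> Optional[str]:
    """Return the severity matching the alert expressions, or None if neither fires."""
    if rpo_difference >= 3:
        # expr: ramen_rpo_difference >= 3
        return "critical"
    if 2 < rpo_difference < 3:
        # expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
        return "warning"
    return None


# Hypothetical ratios: on schedule, exactly twice, between two and three, three times.
for ratio in (1.0, 2.0, 2.5, 3.0):
    print(ratio, alert_severity(ratio))
```

Note that a ratio of exactly 2 matches neither expression, because the warning rule uses a strict comparison.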