Chapter 6. Monitoring disaster recovery health
6.1. Enable monitoring for disaster recovery
Use this procedure to enable basic monitoring for your disaster recovery setup.
Procedure
- On the Hub cluster, open a terminal window.
- Add the following label to the openshift-operators namespace:
  $ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'
You must always add this label for the Regional-DR solution.
6.2. Enabling disaster recovery dashboard on Hub cluster
This section guides you to enable the disaster recovery dashboard for advanced monitoring on the Hub cluster.
For Regional-DR, the dashboard shows monitoring status cards for operator health, cluster health, metrics, alerts and application count.
For Metro-DR, you can configure the dashboard to only monitor the ramen setup health and application count.
Prerequisites
Ensure that you have already installed the following:
- OpenShift Container Platform version 4.16 and have administrator privileges.
- ODF Multicluster Orchestrator with the console plugin enabled.
- Red Hat Advanced Cluster Management for Kubernetes 2.11 (RHACM) from Operator Hub. For instructions on how to install, see Installing RHACM.
- Ensure you have enabled observability on RHACM. See Enabling observability guidelines.
Procedure
- On the Hub cluster, open a terminal window and perform the next steps.
- Create the ConfigMap file named observability-metrics-custom-allowlist.yaml. You can use the following YAML to list the disaster recovery metrics on the Hub cluster. For details, see Adding custom metrics. To know more about ramen metrics, see Disaster recovery metrics.
    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: observability-metrics-custom-allowlist
      namespace: open-cluster-management-observability
    data:
      metrics_list.yaml: |
        names:
          - ceph_rbd_mirror_snapshot_sync_bytes
          - ceph_rbd_mirror_snapshot_snapshots
        matches:
          - __name__="csv_succeeded",exported_namespace="openshift-dr-system",name=~"odr-cluster-operator.*"
          - __name__="csv_succeeded",exported_namespace="openshift-operators",name=~"volsync.*"
- In the open-cluster-management-observability namespace, run the following command:
  $ oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml
After the observability-metrics-custom-allowlist ConfigMap is created, RHACM starts collecting the listed OpenShift Data Foundation metrics from all the managed clusters. To exclude a specific managed cluster from collecting the observability data, add the following cluster label to that cluster: observability: disabled.
6.3. Viewing health status of disaster recovery replication relationships
Prerequisites
Ensure that you have enabled the disaster recovery dashboard for monitoring. For instructions, see Enabling disaster recovery dashboard on Hub cluster.
Procedure
- On the Hub cluster, ensure that the All Clusters option is selected.
- Refresh the console to make the DR monitoring dashboard tab accessible.
- Navigate to Data Services and click Data policies.
- On the Overview tab, you can view the health status of the operators, clusters and applications. A green tick indicates that the operators are running and available.
- Click the Disaster recovery tab to view a list of DR policy details and connected applications.
6.4. Disaster recovery metrics
These are the ramen metrics that are scraped by Prometheus.
- ramen_last_sync_timestamp_seconds
- ramen_policy_schedule_interval_seconds
- ramen_last_sync_duration_seconds
- ramen_last_sync_data_bytes
- ramen_workload_protection_status
Query these metrics from the Hub cluster where the Red Hat Advanced Cluster Management for Kubernetes (RHACM) operator is installed.
6.4.1. Last synchronization timestamp in seconds
This is the time, in Unix seconds, of the most recent successful synchronization of all PVCs per application.
- Metric name: ramen_last_sync_timestamp_seconds
- Metrics type: Gauge
- Labels:
  - obj_type: Type of the object, here it is DRPC
  - obj_name: Name of the object, here it is DRPC-Name
  - obj_namespace: DRPC namespace
  - policyname: Name of the DRPolicy
  - scheduling_interval: Scheduling interval value from DRPolicy
- Metric value: The value is set as Unix seconds, obtained from lastGroupSyncTime in the DRPC status.
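Because the metric value is a Unix timestamp, it usually has to be converted before it is readable. The following is a minimal Python sketch of that conversion, not part of the product. The function name and the sample value are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone


def last_sync_time_utc(metric_value: float) -> datetime:
    """Convert a ramen_last_sync_timestamp_seconds sample (Unix seconds) to a UTC datetime."""
    return datetime.fromtimestamp(metric_value, tz=timezone.utc)


# Hypothetical sample value, for illustration only.
sample = 1_700_000_000
print(last_sync_time_utc(sample).isoformat())
```

Keeping the result timezone-aware (UTC) avoids ambiguity when the Hub cluster and the person reading the value are in different time zones.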
6.4.2. Policy schedule interval in seconds
This gives the scheduling interval in seconds from DRPolicy.
- Metric name: ramen_policy_schedule_interval_seconds
- Metrics type: Gauge
- Labels:
  - policyname: Name of the DRPolicy
- Metric value: This is set to the scheduling interval, in seconds, taken from the DRPolicy.
6.4.3. Last synchronization duration in seconds
This represents the longest time taken to sync from the most recent successful synchronization of all PVCs per application.
- Metric name: ramen_last_sync_duration_seconds
- Metrics type: Gauge
- Labels:
  - obj_type: Type of the object, here it is DRPC
  - obj_name: Name of the object, here it is DRPC-Name
  - obj_namespace: DRPC namespace
  - scheduling_interval: Scheduling interval value from DRPolicy
- Metric value: The value is taken from lastGroupSyncDuration in the DRPC status.
6.4.4. Total bytes transferred from most recent synchronization
This value represents the total bytes transferred from the most recent successful synchronization of all PVCs per application.
- Metric name: ramen_last_sync_data_bytes
- Metrics type: Gauge
- Labels:
  - obj_type: Type of the object, here it is DRPC
  - obj_name: Name of the object, here it is DRPC-Name
  - obj_namespace: DRPC namespace
  - scheduling_interval: Scheduling interval value from DRPolicy
- Metric value: The value is taken from lastGroupSyncBytes in the DRPC status.
6.4.5. Workload protection status
This value provides the application protection status per application that is DR protected.
- Metric name: ramen_workload_protection_status
- Metrics type: Gauge
- Labels:
  - obj_type: Type of the object, here it is DRPC
  - obj_name: Name of the object, here it is DRPC-Name
  - obj_namespace: DRPC namespace
- Metric value: The value is either "1" or "0", where "1" indicates that application DR protection is healthy and "0" indicates that application protection is degraded and the application is potentially unprotected.
6.5. Disaster recovery alerts
This section provides a list of all supported alerts associated with Red Hat OpenShift Data Foundation within a disaster recovery environment.
Recording rules
Record: ramen_sync_duration_seconds
- Expression: sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
- Purpose: The time interval, in seconds, between the volume group’s last sync time and the current time.
Record: ramen_rpo_difference
- Expression: ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
- Purpose: The ratio of the actual sync delay taken by the volume replication group to the expected sync delay, which is the scheduling interval of the DRPolicy. A value of 1 means that the time since the last sync equals one scheduling interval.
Record: count_persistentvolumeclaim_total
- Expression: count(kube_persistentvolumeclaim_info)
- Purpose: Count of all PVCs from the managed cluster.
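To make the arithmetic behind the first two recording rules concrete, here is a minimal Python sketch that mirrors them for a single DRPC. It is an illustration only and not how Prometheus evaluates the rules: label matching and the sum aggregation are left out, and the function names and sample numbers are hypothetical.

```python
import time
from typing import Optional


def sync_duration_seconds(last_sync_timestamp: float,
                          now: Optional[float] = None) -> Optional[float]:
    """Mirror time() - (ramen_last_sync_timestamp_seconds > 0) for one series.

    Returns None when the timestamp is not positive, because the `> 0`
    filter in the recording rule drops such samples.
    """
    if last_sync_timestamp <= 0:
        return None
    if now is None:
        now = time.time()
    return now - last_sync_timestamp


def rpo_difference(sync_duration: float, schedule_interval: float) -> float:
    """Mirror ramen_sync_duration_seconds / ramen_policy_schedule_interval_seconds."""
    return sync_duration / schedule_interval


# Hypothetical values: last sync 900 s ago, policy interval of 5 minutes (300 s).
now = 1_700_000_900.0
duration = sync_duration_seconds(1_700_000_000.0, now=now)
print(duration, rpo_difference(duration, 300.0))
```

With these sample values the time since the last sync is three scheduling intervals, so the ratio is 3.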
Alerts
Alert: VolumeSynchronizationDelay
- Impact: Critical
- Purpose: The actual sync delay taken by the volume replication group is at least three times the expected sync delay.
- YAML:
    alert: VolumeSynchronizationDelay
    expr: ramen_rpo_difference >= 3
    for: 5s
    labels:
      severity: critical
    annotations:
      description: "The syncing of volumes is exceeding three times the scheduled snapshot interval, or the volumes have been recently protected. (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})"
      alert_type: "DisasterRecovery"
Alert: VolumeSynchronizationDelay
- Impact: Warning
- Purpose: The actual sync delay taken by the volume replication group is more than twice, but less than three times, the expected sync delay.
- YAML:
    alert: VolumeSynchronizationDelay
    expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
    for: 5s
    labels:
      severity: warning
    annotations:
      description: "The syncing of volumes is exceeding two times the scheduled snapshot interval, or the volumes have been recently protected. (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})"
      alert_type: "DisasterRecovery"
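The two VolumeSynchronizationDelay expressions partition the ramen_rpo_difference value into ranges. The following Python sketch restates those thresholds so the boundaries are easy to check. It is only an illustration: the function name is hypothetical, and the `for: 5s` hold period that the alert must satisfy before firing is not modeled.

```python
from typing import Optional


def volume_sync_delay_severity(rpo_difference: float) -> Optional[str]:
    """Return the severity whose alert expression matches the given ratio.

    critical: ramen_rpo_difference >= 3
    warning:  ramen_rpo_difference > 2 and ramen_rpo_difference < 3
    None:     neither expression matches
    """
    if rpo_difference >= 3:
        return "critical"
    if 2 < rpo_difference < 3:
        return "warning"
    return None


# Hypothetical ratios, for illustration only.
for ratio in (1.5, 2.0, 2.5, 3.0):
    print(ratio, volume_sync_delay_severity(ratio))
```

Note that a ratio of exactly 2 matches neither expression, because the warning condition uses a strict comparison.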
Alert: WorkloadUnprotected
- Impact: Warning
- Purpose: Application protection status is degraded for more than 10 minutes.
- YAML:
    alert: WorkloadUnprotected
    expr: ramen_workload_protection_status == 0
    for: 10m
    labels:
      severity: warning
    annotations:
      description: "Workload is not protected for disaster recovery (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})."
      alert_type: "DisasterRecovery"