Chapter 6. Monitoring disaster recovery health


You can enable the disaster recovery dashboard after installing ODF Multicluster Orchestrator with the console plugin enabled.

For Regional-DR, the dashboard uses monitoring status cards such as operator health and cluster health to show metrics, alerts, and application count.

For Metro-DR, you can configure the dashboard to only monitor the ramen setup health and application count.

Note

The dashboard only shows data for ApplicationSet-based applications, and not for Subscription-based applications.

Prerequisites

  • Ensure that you have installed OpenShift Container Platform version 4.13 and have administrator privileges.
  • Ensure that you have installed Red Hat Advanced Cluster Management for Kubernetes 2.8 (RHACM) from Operator Hub. For instructions on how to install, see Installing RHACM.
  • Ensure you have enabled observability on RHACM. See Enabling observability guidelines.

Procedure

  1. On the Hub cluster, open a terminal window and perform the next steps.
  2. Add the cluster-monitoring label to the openshift-operators namespace.

    $ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'
  3. Create the configmap file named observability-metrics-custom-allowlist.yaml.

    You can use the following YAML to list the disaster recovery metrics on the Hub cluster. For details, see Adding custom metrics. To know more about ramen metrics, see Disaster recovery metrics.

    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: observability-metrics-custom-allowlist
      namespace: open-cluster-management-observability
    data:
      metrics_list.yaml: |
        names:
          - ramen_last_sync_timestamp_seconds
          - ramen_policy_schedule_interval_seconds
        matches:
          - __name__="csv_succeeded",exported_namespace="openshift-dr-system",name=~"odr-cluster-operator.*"
          - __name__="csv_succeeded",exported_namespace="openshift-operators",name=~"volsync.*"
        recording_rules:
          - record: count_persistentvolumeclaim_total
            expr: count(kube_persistentvolumeclaim_info)
          - record: ramen_sync_duration_seconds
            expr: sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
  4. In the open-cluster-management-observability namespace, run the following command:

    $ oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml
  5. After the observability-metrics-custom-allowlist ConfigMap is created, RHACM starts collecting the listed OpenShift Data Foundation metrics from all the managed clusters.

    To stop a specific managed cluster from collecting observability data, add the following label to that cluster: observability: disabled.

  6. Create the configmap file named thanos-ruler-custom-rules.yaml and add the name of the custom alert rules to the custom_rules.yaml parameter.

    You can use the following YAML to create an alert against the ramen metrics on the Hub cluster. For details, see Adding custom metrics. To know more about the alerts, see Disaster Recovery alerts.

    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: thanos-ruler-custom-rules
      namespace: open-cluster-management-observability
    data:
      custom_rules.yaml: |
        groups:
          - name: ramen-alerts
            rules:
            - record: ramen_rpo_difference
              expr: ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
            - alert: VolumeSynchronizationDelay
              expr: ramen_rpo_difference >= 3
              for: 5s
              labels:
                cluster: "{{ $labels.cluster }}"
                severity: critical
              annotations:
                description: "Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }}) is taking more than thrice the scheduled snapshot interval. This may cause data loss and a backlog of replication requests."
                alert_type: "DisasterRecovery"
            - alert: VolumeSynchronizationDelay
              expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
              for: 5s
              labels:
                cluster: "{{ $labels.cluster }}"
                severity: warning
              annotations:
                description: "Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }}) is taking more than twice the scheduled snapshot interval. This may cause data loss and impact replication requests."
                alert_type: "DisasterRecovery"
  7. Run the following command in the open-cluster-management-observability namespace:

    $ oc apply -n open-cluster-management-observability -f thanos-ruler-custom-rules.yaml

Prerequisites

Ensure that you have enabled the disaster recovery dashboard for monitoring. For instructions, see chapter Enabling disaster recovery dashboard on Hub cluster.

Procedure

  1. On the Hub cluster, ensure that the All Clusters option is selected.
  2. Refresh the console to make the DR monitoring dashboard tab accessible.
  3. Navigate to Data Services and click Data policies.
  4. On the Overview tab, you can view the health status of the operators, clusters, and applications. A green tick indicates that the operators are running and available.
  5. Click the Disaster recovery tab to view a list of DR policy details and connected applications.

6.3. Disaster recovery metrics

These are the ramen metrics that are scraped by Prometheus.

  • ramen_last_sync_timestamp_seconds
  • ramen_policy_schedule_interval_seconds

Ramen’s last synchronization timestamp in seconds

This is the timestamp, in seconds, of the most recent successful synchronization of all PVCs.

Metric name
ramen_last_sync_timestamp_seconds
Metrics type
Gauge
Labels
  • ObjType: Type of the object, here it is DRPC
  • ObjName: Name of the object, here it is DRPC-Name
  • ObjNamespace: DRPC namespace
  • Policyname: Name of the DRPolicy
  • SchedulingInterval: scheduling interval value from DRPolicy
Metric value
Set to lastGroupSyncTime from DRPC in seconds.

Ramen’s policy schedule interval in seconds

This gives the scheduling interval in seconds from DRPolicy.

Metric name
ramen_policy_schedule_interval_seconds
Metrics type
Gauge
Labels
  • Policyname: Name of the DRPolicy
Metric value
Set to scheduling interval in seconds which is taken from DRPolicy.

6.4. Disaster recovery alerts

This section lists all supported alerts associated with Red Hat OpenShift Data Foundation 4.13 and above in a disaster recovery environment.

Recording rules

  • Record: ramen_sync_duration_seconds

    Expression
    sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
    Purpose
    The time interval between the volume group’s last sync time and the time now in seconds.
  • Record: ramen_rpo_difference

    Expression
    ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
    Purpose
    The ratio of the actual sync delay taken by the volume replication group to the expected sync delay (the scheduling interval).
  • Record: count_persistentvolumeclaim_total

    Expression
    count(kube_persistentvolumeclaim_info)
    Purpose
    Count of all PVCs from the managed cluster.
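
As a rough illustration of how the first two recording rules relate, the following Python sketch reproduces their arithmetic for a single DRPC. It is not part of Ramen or RHACM: the function names and sample values are hypothetical, and the sketch ignores the sum by (...) aggregation and the label matching that Prometheus performs.

```python
from typing import Optional


def sync_duration_seconds(now: float, last_sync_timestamp: float) -> Optional[float]:
    """Model of: time() - (ramen_last_sync_timestamp_seconds > 0).

    The "> 0" filter drops series that have never synced, so this
    sketch returns None in that case instead of a number.
    """
    if last_sync_timestamp <= 0:
        return None
    return now - last_sync_timestamp


def rpo_difference(sync_duration: float, schedule_interval: float) -> float:
    """Model of: ramen_sync_duration_seconds / ramen_policy_schedule_interval_seconds."""
    return sync_duration / schedule_interval


# Hypothetical sample: last sync 600 s ago, DRPolicy interval of 300 s (5 minutes).
duration = sync_duration_seconds(now=1_700_000_600.0, last_sync_timestamp=1_700_000_000.0)
ratio = rpo_difference(duration, schedule_interval=300.0)
print(duration, ratio)  # 600.0 2.0
```

A ratio of 2.0 here means that the last successful sync is two scheduling intervals old.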

Alerts

  • Alert: VolumeSynchronizationDelay

    Impact
    Critical
    Purpose
    Actual sync delay taken by the volume replication group is at least thrice the expected sync delay.
    YAML
      alert: VolumeSynchronizationDelay
      expr: ramen_rpo_difference >= 3
      for: 5s
      labels:
        cluster: '{{ $labels.cluster }}'
        severity: critical
      annotations:
        description: >-
          Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{
          $labels.obj_namespace }}) is taking more than thrice the scheduled
          snapshot interval. This may cause data loss and a backlog of replication
          requests.
        alert_type: DisasterRecovery
  • Alert: VolumeSynchronizationDelay

    Impact
    Warning
    Purpose
    Actual sync delay taken by the volume replication group is more than twice, but less than thrice, the expected sync delay.
    YAML
      alert: VolumeSynchronizationDelay
      expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
      for: 5s
      labels:
        cluster: '{{ $labels.cluster }}'
        severity: warning
      annotations:
        description: >-
          Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{
          $labels.obj_namespace }}) is taking more than twice the scheduled
          snapshot interval. This may cause data loss and a backlog of replication
          requests.
        alert_type: DisasterRecovery
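
The threshold logic of the two alert rules can be sketched in Python as follows. This is only an illustrative model, not how Thanos Ruler evaluates rules: the function name is hypothetical, and the for: 5s hold period is not modelled.

```python
from typing import Optional


def volume_sync_delay_severity(rpo_difference: float) -> Optional[str]:
    """Map a ramen_rpo_difference value to the severity of the
    VolumeSynchronizationDelay alert that would match it, if any."""
    if rpo_difference >= 3:
        return "critical"  # expr: ramen_rpo_difference >= 3
    if 2 < rpo_difference < 3:
        return "warning"   # expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
    return None            # neither rule matches


for value in (1.5, 2.0, 2.5, 3.0):
    print(value, volume_sync_delay_severity(value))
```

Note that a value of exactly 2 matches neither rule, because the warning expression uses a strict greater-than comparison.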