Chapter 6. Monitoring disaster recovery health

6.1. Enable monitoring for disaster recovery

Use this procedure to enable basic monitoring for your disaster recovery setup.

Procedure

  1. On the Hub cluster, open a terminal window.
  2. Add the following label to the openshift-operators namespace:

    $ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'

Note

You must always add this label for the Regional-DR solution.

6.2. Enabling disaster recovery dashboard on Hub cluster

This section describes how to enable the disaster recovery dashboard for advanced monitoring on the Hub cluster.

For Regional-DR, the dashboard shows monitoring status cards for operator health, cluster health, metrics, alerts and application count.

For Metro-DR, you can configure the dashboard to only monitor the ramen setup health and application count.

Prerequisites

  • Ensure that you have already installed the following:

    • OpenShift Container Platform version 4.16, with administrator privileges.
    • ODF Multicluster Orchestrator with the console plugin enabled.
    • Red Hat Advanced Cluster Management for Kubernetes 2.11 (RHACM) from Operator Hub. For instructions on how to install, see Installing RHACM.
  • Ensure you have enabled observability on RHACM. See Enabling observability guidelines.

Procedure

  1. On the Hub cluster, open a terminal window and perform the next steps.
  2. Create a ConfigMap file named observability-metrics-custom-allowlist.yaml.

    You can use the following YAML to list the disaster recovery metrics on the Hub cluster. For details, see Adding custom metrics. For more information about the ramen metrics, see Disaster recovery metrics.

    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: observability-metrics-custom-allowlist
      namespace: open-cluster-management-observability
    data:
      metrics_list.yaml: |
        names:
          - ceph_rbd_mirror_snapshot_sync_bytes
          - ceph_rbd_mirror_snapshot_snapshots
        matches:
          - __name__="csv_succeeded",exported_namespace="openshift-dr-system",name=~"odr-cluster-operator.*"
          - __name__="csv_succeeded",exported_namespace="openshift-operators",name=~"volsync.*"
  3. In the open-cluster-management-observability namespace, run the following command:

    $ oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml
  4. After the observability-metrics-custom-allowlist ConfigMap is created, RHACM starts collecting the listed OpenShift Data Foundation metrics from all the managed clusters.

    To exclude a specific managed cluster from observability data collection, add the following label to that cluster: observability: disabled.

6.3. Viewing health status of disaster recovery replication relationships

Prerequisites

Ensure that you have enabled the disaster recovery dashboard for monitoring. For instructions, see Enabling disaster recovery dashboard on Hub cluster.

Procedure

  1. On the Hub cluster, ensure that the All Clusters option is selected.
  2. Refresh the console to make the DR monitoring dashboard tab accessible.
  3. Navigate to Data Services and click Data policies.
  4. On the Overview tab, view the health status of the operators, clusters and applications. A green tick indicates that the operators are running and available.
  5. Click the Disaster recovery tab to view a list of DR policy details and connected applications.

6.4. Disaster recovery metrics

These are the ramen metrics that are scraped by Prometheus.

  • ramen_last_sync_timestamp_seconds
  • ramen_policy_schedule_interval_seconds
  • ramen_last_sync_duration_seconds
  • ramen_last_sync_data_bytes
  • ramen_workload_protection_status

Query these metrics on the Hub cluster where the Red Hat Advanced Cluster Management for Kubernetes (RHACM) operator is installed.

6.4.1. Last synchronization timestamp in seconds

This metric gives the time, in seconds, of the most recent successful synchronization of all PVCs per application.

Metric name
ramen_last_sync_timestamp_seconds
Metrics type
Gauge
Labels
  • obj_type: Type of the object, here it is DRPC
  • obj_name: Name of the object, here it is DRPC-Name
  • obj_namespace: DRPC namespace
  • policyname: Name of the DRPolicy
  • scheduling_interval: Scheduling interval value from DRPolicy
Metric value
The value is a Unix timestamp in seconds, obtained from lastGroupSyncTime in the DRPC status.
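
Because the value is a Unix timestamp, it is easier to read once converted to a calendar time. The following is a minimal sketch of that conversion; the function name sync_timestamp_to_utc and the sample value are hypothetical and are not part of ramen or RHACM.

```python
from datetime import datetime, timezone


def sync_timestamp_to_utc(value: float) -> str:
    """Convert a ramen_last_sync_timestamp_seconds sample to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(value, tz=timezone.utc).isoformat()


# Hypothetical sample value, used only for illustration.
print(sync_timestamp_to_utc(1_700_000_000))  # 2023-11-14T22:13:20+00:00
```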

6.4.2. Policy schedule interval in seconds

This gives the scheduling interval in seconds from DRPolicy.

Metric name
ramen_policy_schedule_interval_seconds
Metrics type
Gauge
Labels
  • policyname: Name of the DRPolicy
Metric value
The value is the scheduling interval in seconds, taken from the DRPolicy.

6.4.3. Last synchronization duration in seconds

This metric represents the longest time taken to sync during the most recent successful synchronization of all PVCs per application.

Metric name
ramen_last_sync_duration_seconds
Metrics type
Gauge
Labels
  • obj_type: Type of the object, here it is DRPC
  • obj_name: Name of the object, here it is DRPC-Name
  • obj_namespace: DRPC namespace
  • scheduling_interval: Scheduling interval value from DRPolicy
Metric value
The value is taken from lastGroupSyncDuration from DRPC status.

6.4.4. Total bytes transferred from most recent synchronization

This value represents the total bytes transferred from the most recent successful synchronization of all PVCs per application.

Metric name
ramen_last_sync_data_bytes
Metrics type
Gauge
Labels
  • obj_type: Type of the object, here it is DRPC
  • obj_name: Name of the object, here it is DRPC-Name
  • obj_namespace: DRPC namespace
  • scheduling_interval: Scheduling interval value from DRPolicy
Metric value
The value is taken from lastGroupSyncBytes from DRPC status.

6.4.5. Workload protection status

This value provides the application protection status per application that is DR protected.

Metric name
ramen_workload_protection_status
Metrics type
Gauge
Labels
  • obj_type: Type of the object, here it is DRPC
  • obj_name: Name of the object, here it is DRPC-Name
  • obj_namespace: DRPC namespace
Metric value
The value is either "1" or "0", where "1" indicates that application DR protection is healthy and "0" indicates that application protection is degraded and the application is potentially unprotected.

6.5. Disaster recovery alerts

This section provides a list of all supported alerts associated with Red Hat OpenShift Data Foundation within a disaster recovery environment.

Recording rules

  • Record: ramen_sync_duration_seconds

    Expression
    sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
    Purpose
    The time elapsed, in seconds, between the volume group’s last sync time and the current time.
  • Record: ramen_rpo_difference

    Expression
    ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
    Purpose
    The ratio of the actual sync delay of the volume replication group to the expected sync delay, which is the scheduling interval from the DRPolicy.
  • Record: count_persistentvolumeclaim_total

    Expression
    count(kube_persistentvolumeclaim_info)
    Purpose
    Count of all PVCs in the managed cluster.
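
The arithmetic behind the first two recording rules can be sketched as follows. This is an illustration only, for a single time series: it does not model the sum by aggregation or the label matching done by Prometheus. The function names and sample values are hypothetical.

```python
from typing import Optional


def sync_duration_seconds(now: float, last_sync_timestamp: float) -> Optional[float]:
    """Mirror time() - (ramen_last_sync_timestamp_seconds > 0) for one series.

    The "> 0" filter drops series that have no recorded sync yet,
    so this sketch returns None in that case.
    """
    if last_sync_timestamp <= 0:
        return None
    return now - last_sync_timestamp


def rpo_difference(sync_duration: float, schedule_interval: float) -> float:
    """Mirror ramen_sync_duration_seconds / ramen_policy_schedule_interval_seconds."""
    return sync_duration / schedule_interval


# Hypothetical sample values: last sync 900 seconds ago, 5 minute schedule.
now = 1_700_000_900.0
last_sync = 1_700_000_000.0
interval = 300.0

elapsed = sync_duration_seconds(now, last_sync)
print(elapsed)                            # 900.0
print(rpo_difference(elapsed, interval))  # 3.0
```

In this example the last sync is three scheduling intervals old, so the resulting ratio is 3.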

Alerts

  • Alert: VolumeSynchronizationDelay

    Impact
    Critical
    Purpose
    Actual sync delay taken by the volume replication group is at least three times the expected sync delay.
    YAML
    alert: VolumeSynchronizationDelay
    expr: ramen_rpo_difference >= 3
    for: 5s
    labels:
      severity: critical
    annotations:
      description: "The syncing of volumes is exceeding three times the scheduled snapshot interval, or the volumes have been recently protected. (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})"
      alert_type: "DisasterRecovery"
  • Alert: VolumeSynchronizationDelay

    Impact
    Warning
    Purpose
    Actual sync delay taken by the volume replication group is more than two times, but less than three times, the expected sync delay.
    YAML
    alert: VolumeSynchronizationDelay
    expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
    for: 5s
    labels:
      severity: warning
    annotations:
      description: "The syncing of volumes is exceeding two times the scheduled snapshot interval, or the volumes have been recently protected. (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})"
      alert_type: "DisasterRecovery"
  • Alert: WorkloadUnprotected

    Impact
    Warning
    Purpose
    Application protection status is degraded for more than 10 minutes.
    YAML
    alert: WorkloadUnprotected
    expr: ramen_workload_protection_status == 0
    for: 10m
    labels:
      severity: warning
    annotations:
      description: "Workload is not protected for disaster recovery (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }})."
      alert_type: "DisasterRecovery"
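
The thresholds in the two VolumeSynchronizationDelay expressions above can be summarized in a short sketch. It only mirrors the comparison operators of the expr fields and does not model the for: 5s pending period. The function name is hypothetical.

```python
from typing import Optional


def volume_sync_delay_severity(rpo_ratio: float) -> Optional[str]:
    """Map a ramen_rpo_difference value to the severity of the matching alert rule."""
    if rpo_ratio >= 3:
        # expr: ramen_rpo_difference >= 3
        return "critical"
    if 2 < rpo_ratio < 3:
        # expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
        return "warning"
    # Neither expression matches, including a ratio of exactly 2.
    return None


for ratio in (1.0, 2.0, 2.5, 3.0):
    print(ratio, volume_sync_delay_severity(ratio))
```

As written, the warning expression uses a strict comparison, so a ratio of exactly 2 matches neither rule.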

© 2024 Red Hat, Inc.