Chapter 6. Monitoring disaster recovery health

6.1. Enabling disaster recovery dashboard on Hub cluster

You can enable the disaster recovery dashboard after installing ODF Multicluster Orchestrator with the console plugin enabled.

For Regional-DR, the dashboard makes use of monitoring status cards like operator health and cluster health to show metrics, alerts and application count.

For Metro-DR, you can configure the dashboard to only monitor the ramen setup health and application count.

Note

The dashboard only shows data for ApplicationSet-based applications, and not for Subscription-based applications.

Prerequisites

Ensure that you have installed OpenShift Container Platform version 4.13 and have administrator privileges.
Ensure that you have installed Red Hat Advanced Cluster Management for Kubernetes 2.8 (RHACM) from Operator Hub. For instructions on how to install, see Installing RHACM.
Ensure you have enabled observability on RHACM. See Enabling observability guidelines.

Procedure

On the Hub cluster, open a terminal window and perform the next steps.

Add the following label to the openshift-operators namespace.

$ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'

Create the configmap file named observability-metrics-custom-allowlist.yaml.

You can use the following YAML to list the disaster recovery metrics on Hub cluster. For details, see Adding custom metrics. To know more about ramen metrics, see Disaster recovery metrics.

kind: ConfigMap
apiVersion: v1
metadata:
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
data:
  metrics_list.yaml: |
    names:
      - ramen_last_sync_timestamp_seconds
      - ramen_policy_schedule_interval_seconds
    matches:
      - __name__="csv_succeeded",exported_namespace="openshift-dr-system",name=~"odr-cluster-operator.*"
      - __name__="csv_succeeded",exported_namespace="openshift-operators",name=~"volsync.*"
    recording_rules:
      - record: count_persistentvolumeclaim_total
        expr: count(kube_persistentvolumeclaim_info)
      - record: ramen_sync_duration_seconds
        expr: sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))

In the open-cluster-management-observability namespace, run the following command:

$ oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml

After the observability-metrics-custom-allowlist ConfigMap is created, RHACM starts collecting the listed OpenShift Data Foundation metrics from all the managed clusters.
To exclude a specific managed cluster from observability data collection, add the following label to that cluster: observability: disabled.

Create the configmap file named thanos-ruler-custom-rules.yaml and add the name of the custom alert rules to the custom_rules.yaml parameter.

You can use the following YAML to create an alert against the ramen metrics on the Hub cluster. For details, see Adding custom metrics. To know more about the alerts, see Disaster Recovery alerts.

kind: ConfigMap
apiVersion: v1
metadata:
  name: thanos-ruler-custom-rules
  namespace: open-cluster-management-observability
data:
  custom_rules.yaml: |
    groups:
      - name: ramen-alerts
        rules:
        - record: ramen_rpo_difference
          expr: ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
        - alert: VolumeSynchronizationDelay
          expr: ramen_rpo_difference >= 3
          for: 5s
          labels:
            cluster: "{{ $labels.cluster }}"
            severity: critical
          annotations:
            description: "Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }}) is taking more than thrice the scheduled snapshot interval. This may cause data loss and a backlog of replication requests."
            alert_type: "DisasterRecovery"
        - alert: VolumeSynchronizationDelay
          expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
          for: 5s
          labels:
            cluster: "{{ $labels.cluster }}"
            severity: warning
          annotations:
            description: "Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{ $labels.obj_namespace }}) is taking more than twice the scheduled snapshot interval. This may cause data loss and impact replication requests."
            alert_type: "DisasterRecovery"

Run the following command in the open-cluster-management-observability namespace:

$ oc apply -n open-cluster-management-observability -f thanos-ruler-custom-rules.yaml

6.2. Viewing health status of disaster recovery replication relationships

Prerequisites

Ensure that you have enabled the disaster recovery dashboard for monitoring. For instructions, see Enabling disaster recovery dashboard on Hub cluster.

Procedure

On the Hub cluster, ensure that the All Clusters option is selected.
Refresh the console to make the DR monitoring dashboard tab accessible.
Navigate to Data Services and click Data policies.
On the Overview tab, you can view the health status of the operators, clusters, and applications. A green tick indicates that the operators are running and available.
Click the Disaster recovery tab to view a list of DR policy details and connected applications.

6.3. Disaster recovery metrics

These are the ramen metrics that are scraped by Prometheus.

ramen_last_sync_timestamp_seconds
ramen_policy_schedule_interval_seconds

Ramen’s last synchronization timestamp in seconds

This is the timestamp, in seconds, of the most recent successful synchronization of all PVCs.

Metric name

ramen_last_sync_timestamp_seconds

Metrics type

Gauge

Labels

ObjType: Type of the object, here it is DRPC
ObjName: Name of the object, here it is DRPC-Name
ObjNamespace: DRPC namespace
Policyname: Name of the DRPolicy
SchedulingInterval: scheduling interval value from DRPolicy

Metric value

Set to lastGroupSyncTime from DRPC in seconds.

Ramen’s policy schedule interval in seconds

This gives the scheduling interval in seconds from DRPolicy.

Metric name

ramen_policy_schedule_interval_seconds

Metrics type

Gauge

Labels

Policyname: Name of the DRPolicy

Metric value

Set to scheduling interval in seconds which is taken from DRPolicy.
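
As a rough illustration of the unit that ramen_last_sync_timestamp_seconds reports, the following sketch converts a timestamp string to epoch seconds. It assumes that lastGroupSyncTime is serialized as an RFC 3339 UTC timestamp, as Kubernetes time fields usually are. The function name and the sample value are hypothetical and are not taken from ramen.

```python
from datetime import datetime, timezone


def to_epoch_seconds(rfc3339: str) -> float:
    """Convert an RFC 3339 timestamp such as '2023-06-01T12:00:00Z' to epoch seconds."""
    # datetime.fromisoformat() accepts a trailing 'Z' only from Python 3.11 onward,
    # so normalize it to an explicit UTC offset first.
    parsed = datetime.fromisoformat(rfc3339.replace("Z", "+00:00"))
    return parsed.astimezone(timezone.utc).timestamp()


# Hypothetical sample value, not read from a real DRPC.
sample = "2023-06-01T12:00:00Z"
print(to_epoch_seconds(sample))
```

A gauge holding this value lets a query subtract it from the current time to obtain the time elapsed since the last sync.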

6.4. Disaster recovery alerts

This section lists all supported alerts associated with Red Hat OpenShift Data Foundation 4.13 and above within a disaster recovery environment.

Recording rules

Record: ramen_sync_duration_seconds
Expression
sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
Purpose
The time elapsed, in seconds, between the volume group’s last sync time and the current time.

Record: ramen_rpo_difference

Expression

ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})

Purpose

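The ratio of the actual sync delay taken by the volume replication group to the expected sync delay, that is, the scheduling interval from the DRPolicy. A value of 2 means that the sync is taking twice the scheduled interval.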

Record: count_persistentvolumeclaim_total
Expression
count(kube_persistentvolumeclaim_info)
Purpose
Count of all PVCs from the managed cluster.
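
The arithmetic behind the two ramen recording rules above can be sketched for a single time series as follows. This is only an illustration under simplifying assumptions: it ignores the sum by (...) aggregation and the label matching that Prometheus performs, and the function names and sample numbers are hypothetical.

```python
import time
from typing import Optional


def sync_duration_seconds(last_sync_ts: float, now: Optional[float] = None) -> Optional[float]:
    """Mirror time() - (ramen_last_sync_timestamp_seconds > 0) for one series."""
    if last_sync_ts <= 0:
        # The '> 0' filter drops series that have never synced, so no value is produced.
        return None
    current = time.time() if now is None else now
    return current - last_sync_ts


def rpo_ratio(sync_duration: float, schedule_interval: float) -> float:
    """Mirror ramen_sync_duration_seconds / ramen_policy_schedule_interval_seconds."""
    return sync_duration / schedule_interval


# Hypothetical values: the last sync was 900 s ago and the DRPolicy interval is 300 s.
now = 1_700_000_900.0
duration = sync_duration_seconds(1_700_000_000.0, now=now)
print(duration)                    # 900.0
print(rpo_ratio(duration, 300.0))  # 3.0
```

With these sample numbers the result is 3.0, meaning the sync is taking three times the scheduled interval.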

Alerts

Alert: VolumeSynchronizationDelay

Impact

Critical

Purpose

The actual sync delay taken by the volume replication group is at least three times the expected sync delay.

YAML

  alert: VolumeSynchronizationDelay
  expr: ramen_rpo_difference >= 3
  for: 5s
  labels:
    cluster: '{{ $labels.cluster }}'
    severity: critical
  annotations:
    description: >-
      Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{
      $labels.obj_namespace }}) is taking more than thrice the scheduled
      snapshot interval. This may cause data loss and a backlog of replication
      requests.
    alert_type: DisasterRecovery

Alert: VolumeSynchronizationDelay

Impact

Warning

Purpose

The actual sync delay taken by the volume replication group is more than twice, but less than three times, the expected sync delay.

YAML

  alert: VolumeSynchronizationDelay
  expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
  for: 5s
  labels:
    cluster: '{{ $labels.cluster }}'
    severity: warning
  annotations:
    description: >-
      Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{
      $labels.obj_namespace }}) is taking more than twice the scheduled
      snapshot interval. This may cause data loss and a backlog of replication
      requests.
    alert_type: DisasterRecovery
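
How the two VolumeSynchronizationDelay expressions partition the values of ramen_rpo_difference can be sketched as follows. This is a minimal illustration, not how Prometheus or Thanos Ruler evaluates rules: it ignores the for: 5s hold period, and the function name is hypothetical.

```python
from typing import Optional


def classify_sync_delay(rpo_difference: float) -> Optional[str]:
    """Return the severity whose alert expression matches the value, or None."""
    if rpo_difference >= 3:
        return "critical"  # expr: ramen_rpo_difference >= 3
    if 2 < rpo_difference < 3:
        return "warning"   # expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
    return None            # at or below twice the interval: neither expression matches


for value in (1.5, 2.0, 2.5, 3.0):
    print(value, classify_sync_delay(value))
```

Note that a value of exactly 2 matches neither expression, and a value of exactly 3 matches only the critical one.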
