Chapter 6. Monitoring disaster recovery health

6.1. Enable monitoring for disaster recovery

Use this procedure to enable basic monitoring for your disaster recovery setup.

Procedure

On the Hub cluster, open a terminal window.

Add the following label to the openshift-operators namespace.

$ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'

6.2. Enabling disaster recovery dashboard on Hub cluster

This section describes how to enable the disaster recovery dashboard for advanced monitoring on the Hub cluster.

For Regional-DR, the dashboard shows monitoring status cards for operator health, cluster health, metrics, alerts, and application count.

For Metro-DR, you can configure the dashboard to monitor only the Ramen setup health and application count.

Prerequisites

Ensure that you have already installed the following:
- OpenShift Container Platform version 4.15, with administrator privileges.
- ODF Multicluster Orchestrator with the console plugin enabled.
- Red Hat Advanced Cluster Management for Kubernetes (RHACM) 2.10 from Operator Hub. For instructions on how to install, see Installing RHACM.

Ensure that you have enabled observability on RHACM. See Enabling observability guidelines.

Procedure

On the Hub cluster, open a terminal window and perform the next steps.

Create the configmap file named observability-metrics-custom-allowlist.yaml.

You can use the following YAML to list the disaster recovery metrics on the Hub cluster. For details, see Adding custom metrics. To learn more about the Ramen metrics, see Disaster recovery metrics.

kind: ConfigMap
apiVersion: v1
metadata:
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
data:
  metrics_list.yaml: |
    names:
      - ceph_rbd_mirror_snapshot_sync_bytes
      - ceph_rbd_mirror_snapshot_snapshots
    matches:
      - __name__="csv_succeeded",exported_namespace="openshift-dr-system",name=~"odr-cluster-operator.*"
      - __name__="csv_succeeded",exported_namespace="openshift-operators",name=~"volsync.*"

In the open-cluster-management-observability namespace, run the following command:

$ oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml

After the observability-metrics-custom-allowlist ConfigMap is created, RHACM starts collecting the listed OpenShift Data Foundation metrics from all the managed clusters.
To exclude a specific managed cluster from observability data collection, add the following label to that cluster: observability: disabled.

6.3. Viewing health status of disaster recovery replication relationships

Prerequisites

Ensure that you have enabled the disaster recovery dashboard for monitoring. For instructions, see Enabling disaster recovery dashboard on Hub cluster.

Procedure

On the Hub cluster, ensure that the All Clusters option is selected.
Refresh the console to make the DR monitoring dashboard tab accessible.
Navigate to Data Services and click Data policies.
On the Overview tab, you can view the health status of the operators, clusters, and applications. A green tick indicates that the operators are running and available.
Click the Disaster recovery tab to view a list of DR policy details and connected applications.

6.4. Disaster recovery metrics

These are the Ramen metrics that are scraped by Prometheus.

ramen_last_sync_timestamp_seconds
ramen_policy_schedule_interval_seconds
ramen_last_sync_duration_seconds
ramen_last_sync_data_bytes

Query these metrics from the Hub cluster where the Red Hat Advanced Cluster Management for Kubernetes (RHACM) operator is installed.

Last synchronization timestamp in seconds

The time, in seconds, of the most recent successful synchronization of all PVCs per application.

Metric name

ramen_last_sync_timestamp_seconds

Metrics type

Gauge

Labels

obj_type: Type of the object, here it is DRPC
obj_name: Name of the object, here it is DRPC-Name
obj_namespace: DRPC namespace
policyname: Name of the DRPolicy
scheduling_interval: Scheduling interval value from DRPolicy

Metric value

The value is a Unix timestamp in seconds, taken from lastGroupSyncTime in the DRPC status.
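
As a rough illustration of reading this gauge, the following Python sketch converts a sample value into a UTC date and time. The sample value and the helper name last_sync_time are made up for this example and are not part of Ramen.

```python
from datetime import datetime, timezone


def last_sync_time(metric_value: float) -> datetime:
    """Interpret a ramen_last_sync_timestamp_seconds sample as a UTC datetime."""
    return datetime.fromtimestamp(metric_value, tz=timezone.utc)


# Hypothetical sample value, chosen only for illustration.
sample = 1_700_000_000
print(last_sync_time(sample).isoformat())  # 2023-11-14T22:13:20+00:00
```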

Policy schedule interval in seconds

This gives the scheduling interval in seconds from DRPolicy.

Metric name

ramen_policy_schedule_interval_seconds

Metrics type

Gauge

Labels

policyname: Name of the DRPolicy

Metric value

The scheduling interval in seconds, taken from the DRPolicy.

Last synchronization duration in seconds

This represents the longest time taken to sync from the most recent successful synchronization of all PVCs per application.

Metric name

ramen_last_sync_duration_seconds

Metrics type

Gauge

Labels

obj_type: Type of the object, here it is DRPC
obj_name: Name of the object, here it is DRPC-Name
obj_namespace: DRPC namespace
scheduling_interval: Scheduling interval value from DRPolicy

Metric value

The value is taken from lastGroupSyncDuration from DRPC status.

Total bytes transferred from most recent synchronization

This value represents the total bytes transferred from the most recent successful synchronization of all PVCs per application.

Metric name

ramen_last_sync_data_bytes

Metrics type

Gauge

Labels

obj_type: Type of the object, here it is DRPC
obj_name: Name of the object, here it is DRPC-Name
obj_namespace: DRPC namespace
scheduling_interval: Scheduling interval value from DRPolicy

Metric value

The value is taken from lastGroupSyncBytes from DRPC status.

6.5. Disaster recovery alerts

This section provides a list of all supported alerts associated with Red Hat OpenShift Data Foundation within a disaster recovery environment.

Recording rules

Record: ramen_sync_duration_seconds

Expression

sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))

Purpose

The time elapsed, in seconds, between the volume group's last sync time and the current time.
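
The arithmetic behind this recording rule can be sketched in Python for a single series. This is only an illustration: the function name sync_age_seconds and the sample timestamps are hypothetical, and the sum by aggregation over labels is not modeled.

```python
from typing import Optional


def sync_age_seconds(last_sync_timestamp: float, now: float) -> Optional[float]:
    """Mirror `time() - (ramen_last_sync_timestamp_seconds > 0)` for one series.

    The PromQL `> 0` filter drops samples that are not positive (for example
    a zero timestamp), so no result is produced; None stands in for that here.
    """
    if last_sync_timestamp <= 0:
        return None
    return now - last_sync_timestamp


# Hypothetical values: the last sync happened 10 minutes before "now".
print(sync_age_seconds(1_700_000_000, 1_700_000_600))  # 600
print(sync_age_seconds(0, 1_700_000_600))  # None
```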

Record: ramen_rpo_difference

Expression

ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})

Purpose

The ratio of the actual sync delay of the volume replication group to the expected sync delay (the policy schedule interval). A value of 1 means the time since the last sync equals one schedule interval.
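
A small worked example may help when reading this expression. The Python sketch below divides the time since the last sync by the schedule interval for one DRPC. The function name and the numbers are hypothetical, and raising an error for a non-positive interval is a choice made for this sketch, not Prometheus behavior.

```python
def rpo_difference(sync_age_seconds: float, schedule_interval_seconds: float) -> float:
    """Ratio of time since the last sync to the DRPolicy schedule interval."""
    if schedule_interval_seconds <= 0:
        # Sketch-only guard; PromQL itself does not raise on division by zero.
        raise ValueError("schedule interval must be positive")
    return sync_age_seconds / schedule_interval_seconds


# Hypothetical values: a 5-minute schedule and a last sync 15 minutes ago.
print(rpo_difference(900, 300))  # 3.0
```

With these numbers, the last sync is three schedule intervals old, so the result is 3.0.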

Record: count_persistentvolumeclaim_total

Expression

count(kube_persistentvolumeclaim_info)

Purpose

Count of all PVCs from the managed cluster.

Alerts

Alert: VolumeSynchronizationDelay

Impact

Critical

Purpose

The actual sync delay of the volume replication group is at least three times the expected sync delay.

YAML

  alert: VolumeSynchronizationDelay
  expr: ramen_rpo_difference >= 3
  for: 5s
  labels:
    cluster: '{{ $labels.cluster }}'
    severity: critical
  annotations:
    description: >-
      Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{
      $labels.obj_namespace }}) is taking more than thrice the scheduled
      snapshot interval. This may cause data loss and a backlog of replication
      requests.
    alert_type: DisasterRecovery

Alert: VolumeSynchronizationDelay

Impact

Warning

Purpose

The actual sync delay of the volume replication group is more than twice, but less than three times, the expected sync delay.

YAML

  alert: VolumeSynchronizationDelay
  expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
  for: 5s
  labels:
    cluster: '{{ $labels.cluster }}'
    severity: warning
  annotations:
    description: >-
      Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{
      $labels.obj_namespace }}) is taking more than twice the scheduled
      snapshot interval. This may cause data loss and a backlog of replication
      requests.
    alert_type: DisasterRecovery
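
The two alert expressions split the ramen_rpo_difference range into bands. The following Python sketch shows which Impact level, as listed for each alert, a given value would fall under. It is an illustration only: the function name is hypothetical, and the for: 5s hold time is not modeled.

```python
from typing import Optional


def sync_delay_severity(rpo_difference: float) -> Optional[str]:
    """Map a ramen_rpo_difference value to the impact of the alert it matches."""
    if rpo_difference >= 3:
        return "critical"  # expr: ramen_rpo_difference >= 3
    if 2 < rpo_difference < 3:
        return "warning"  # expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
    return None  # neither expression matches


for value in (1.5, 2.0, 2.5, 3.0):
    print(value, sync_delay_severity(value))
```

Note that a value of exactly 2 matches neither expression, because the lower bound of the warning band is strict.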
