Chapter 6. Monitoring disaster recovery health

6.1. Enable monitoring for disaster recovery

Use this procedure to enable basic monitoring for your disaster recovery setup.

Procedure

  1. On the Hub cluster, open a terminal window.
  2. Add the following label to the openshift-operators namespace.

    $ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'
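
To confirm that the label was applied, you can read it back from the namespace. The following is a minimal sketch, not part of the product: the helper name monitoring_label is hypothetical, and it assumes that you are logged in to the Hub cluster with oc.

```shell
# Hypothetical helper: print the value of the cluster-monitoring label
# on the given namespace (empty output means the label is not set).
monitoring_label() {
  oc get namespace "$1" \
    -o jsonpath='{.metadata.labels.openshift\.io/cluster-monitoring}'
}
```

For example, monitoring_label openshift-operators is expected to print true once the label from step 2 is in place.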

6.2. Enabling disaster recovery dashboard on Hub cluster

This section describes how to enable the disaster recovery dashboard for advanced monitoring on the Hub cluster.

For Regional-DR, the dashboard shows monitoring status cards for operator health, cluster health, metrics, alerts and application count.

For Metro-DR, you can configure the dashboard to only monitor the ramen setup health and application count.

Prerequisites

  • Ensure that you have already installed the following:

    • OpenShift Container Platform version 4.15, with administrator privileges.
    • ODF Multicluster Orchestrator with the console plugin enabled.
    • Red Hat Advanced Cluster Management for Kubernetes 2.10 (RHACM) from Operator Hub. For instructions on how to install, see Installing RHACM.
  • Ensure you have enabled observability on RHACM. See Enabling observability guidelines.

Procedure

  1. On the Hub cluster, open a terminal window and perform the next steps.
  2. Create the configmap file named observability-metrics-custom-allowlist.yaml.

    You can use the following YAML to list the disaster recovery metrics on the Hub cluster. For details, see Adding custom metrics. To learn more about the ramen metrics, see Disaster recovery metrics.

    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: observability-metrics-custom-allowlist
      namespace: open-cluster-management-observability
    data:
      metrics_list.yaml: |
        names:
          - ceph_rbd_mirror_snapshot_sync_bytes
          - ceph_rbd_mirror_snapshot_snapshots
        matches:
          - __name__="csv_succeeded",exported_namespace="openshift-dr-system",name=~"odr-cluster-operator.*"
          - __name__="csv_succeeded",exported_namespace="openshift-operators",name=~"volsync.*"
  3. In the open-cluster-management-observability namespace, run the following command:

    $ oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml
  4. After the observability-metrics-custom-allowlist ConfigMap is created, RHACM starts collecting the listed OpenShift Data Foundation metrics from all the managed clusters.

    To stop collecting observability data from a specific managed cluster, add the following label to that cluster: observability: disabled.
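
The label can be added with oc label on the ManagedCluster resource. The following sketch wraps the command in a hypothetical helper, exclude_from_observability; the cluster name that you pass is a placeholder for one of your own managed clusters.

```shell
# Hypothetical helper: add the observability=disabled label to a
# ManagedCluster resource on the Hub cluster.
exclude_from_observability() {
  oc label managedcluster "$1" observability=disabled
}
```

Call it with the name of the managed cluster to exclude, for example exclude_from_observability followed by the cluster name shown by oc get managedcluster.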

6.3. Viewing health status of disaster recovery replication relationships

Prerequisites

Ensure that you have enabled the disaster recovery dashboard for monitoring. For instructions, see Enabling disaster recovery dashboard on Hub cluster.

Procedure

  1. On the Hub cluster, ensure that the All Clusters option is selected.
  2. Refresh the console to make the DR monitoring dashboard tab accessible.
  3. Navigate to Data Services and click Data policies.
  4. On the Overview tab, you can view the health status of the operators, clusters and applications. A green tick indicates that the operators are running and available.
  5. Click the Disaster recovery tab to view a list of DR policy details and connected applications.

6.4. Disaster recovery metrics

These are the ramen metrics that are scraped by Prometheus.

  • ramen_last_sync_timestamp_seconds
  • ramen_policy_schedule_interval_seconds
  • ramen_last_sync_duration_seconds
  • ramen_last_sync_data_bytes

Query these metrics on the Hub cluster where the Red Hat Advanced Cluster Management for Kubernetes (RHACM) operator is installed.

Last synchronization timestamp in seconds

This is the time, in Unix seconds, of the most recent successful synchronization of all PVCs per application.

Metric name
ramen_last_sync_timestamp_seconds
Metrics type
Gauge
Labels
  • obj_type: Type of the object, here it is DRPC
  • obj_name: Name of the object, here it is DRPC-Name
  • obj_namespace: DRPC namespace
  • policyname: Name of the DRPolicy
  • scheduling_interval: Scheduling interval value from DRPolicy
Metric value
Value is set as Unix seconds which is obtained from lastGroupSyncTime from DRPC status.
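
To cross-check the metric value against its source, you can read lastGroupSyncTime from the DRPC status and convert it to Unix seconds yourself. The sketch below is an illustration under stated assumptions: the helper name last_sync_unix_seconds is hypothetical, the drpc resource name is assumed to resolve on your Hub cluster, and the conversion assumes GNU date.

```shell
# Hypothetical helper: print lastGroupSyncTime of a DRPC as Unix seconds.
# Usage: last_sync_unix_seconds <drpc-name> <drpc-namespace>
last_sync_unix_seconds() {
  ts="$(oc get drpc "$1" -n "$2" -o jsonpath='{.status.lastGroupSyncTime}')"
  # Nothing to convert if the DRPC has not reported a sync yet.
  [ -n "$ts" ] || return 1
  # GNU date (assumed) parses the timestamp string returned by the API.
  date -u -d "$ts" +%s
}
```

The printed number should be comparable to the value of ramen_last_sync_timestamp_seconds for the same DRPC.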

Policy schedule interval in seconds

This gives the scheduling interval in seconds from DRPolicy.

Metric name
ramen_policy_schedule_interval_seconds
Metrics type
Gauge
Labels
  • policyname: Name of the DRPolicy
Metric value
This is set to a scheduling interval in seconds which is taken from DRPolicy.
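
DRPolicy expresses the scheduling interval as a short string, while the metric reports seconds. The sketch below shows the conversion, assuming that the interval is a whole number followed by m, h, or d (for example, 5m). That format and the helper name interval_to_seconds are assumptions made for illustration.

```shell
# Hypothetical helper: convert an interval such as 5m, 1h or 1d to seconds.
interval_to_seconds() {
  value="${1%?}"         # everything except the last character
  unit="${1#"$value"}"   # the last character (the unit)
  # Reject empty or non-numeric values before doing arithmetic.
  case "$value" in
    ''|*[!0-9]*) echo "unsupported interval: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    m) echo $((value * 60)) ;;
    h) echo $((value * 3600)) ;;
    d) echo $((value * 86400)) ;;
    *) echo "unsupported interval: $1" >&2; return 1 ;;
  esac
}
```

For example, interval_to_seconds 5m prints 300.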

Last synchronization duration in seconds

This represents the longest time taken to sync from the most recent successful synchronization of all PVCs per application.

Metric name
ramen_last_sync_duration_seconds
Metrics type
Gauge
Labels
  • obj_type: Type of the object, here it is DRPC
  • obj_name: Name of the object, here it is DRPC-Name
  • obj_namespace: DRPC namespace
  • scheduling_interval: Scheduling interval value from DRPolicy
Metric value
The value is taken from lastGroupSyncDuration from DRPC status.

Total bytes transferred from most recent synchronization

This value represents the total bytes transferred from the most recent successful synchronization of all PVCs per application.

Metric name
ramen_last_sync_data_bytes
Metrics type
Gauge
Labels
  • obj_type: Type of the object, here it is DRPC
  • obj_name: Name of the object, here it is DRPC-Name
  • obj_namespace: DRPC namespace
  • scheduling_interval: Scheduling interval value from DRPolicy
Metric value
The value is taken from lastGroupSyncBytes from DRPC status.

6.5. Disaster recovery alerts

This section provides a list of all supported alerts associated with Red Hat OpenShift Data Foundation within a disaster recovery environment.

Recording rules

  • Record: ramen_sync_duration_seconds

    Expression
    sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
    Purpose
    The time, in seconds, that has elapsed between the volume group’s last sync time and the current time.
  • Record: ramen_rpo_difference

    Expression
    ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
    Purpose
    The ratio of the actual sync delay of the volume replication group to the expected sync delay, which is the policy scheduling interval.
  • Record: count_persistentvolumeclaim_total

    Expression
    count(kube_persistentvolumeclaim_info)
    Purpose
    Count of all PVCs in the managed cluster.
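
The arithmetic behind the first two recording rules can be reproduced by hand. The sketch below is only an illustration of the expressions above, not how Prometheus evaluates them: the helper name rpo_difference is hypothetical, and its three arguments stand for the last sync timestamp, the current time, and the schedule interval, all in seconds.

```shell
# Hypothetical helper mirroring the recording rules:
#   ramen_sync_duration_seconds = now - last_sync   (only when last_sync > 0)
#   ramen_rpo_difference        = sync_duration / schedule_interval
# Usage: rpo_difference <last_sync_seconds> <now_seconds> <interval_seconds>
rpo_difference() {
  awk -v last="$1" -v now="$2" -v interval="$3" 'BEGIN {
    # Like the "> 0" filter in the rule, produce no value without a sync.
    if (last <= 0 || interval <= 0) exit 1
    printf "%.2f\n", (now - last) / interval
  }'
}
```

For example, with a last sync 900 seconds ago and a 300 second schedule interval, rpo_difference 1704067200 1704068100 300 prints 3.00.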

Alerts

  • Alert: VolumeSynchronizationDelay

    Impact
    Critical
    Purpose
    The actual sync delay of the volume replication group is at least three times the expected sync delay.
    YAML
      alert: VolumeSynchronizationDelay
      expr: ramen_rpo_difference >= 3
      for: 5s
      labels:
        cluster: '{{ $labels.cluster }}'
        severity: critical
      annotations:
        description: >-
          Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{
          $labels.obj_namespace }}) is taking more than thrice the scheduled
          snapshot interval. This may cause data loss and a backlog of replication
          requests.
        alert_type: DisasterRecovery
  • Alert: VolumeSynchronizationDelay

    Impact
    Warning
    Purpose
    The actual sync delay of the volume replication group is more than twice, but less than three times, the expected sync delay.
    YAML
      alert: VolumeSynchronizationDelay
      expr: ramen_rpo_difference > 2 and ramen_rpo_difference < 3
      for: 5s
      labels:
        cluster: '{{ $labels.cluster }}'
        severity: warning
      annotations:
        description: >-
          Syncing of volumes (DRPC: {{ $labels.obj_name }}, Namespace: {{
          $labels.obj_namespace }}) is taking more than twice the scheduled
          snapshot interval. This may cause data loss and a backlog of replication
          requests.
        alert_type: DisasterRecovery
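
The two alert expressions split the ramen_rpo_difference value into bands. As a rough sketch of that mapping, the hypothetical helper alert_severity below takes a ratio and prints the Impact level that the expressions above would select; it does not model the for: 5s hold time.

```shell
# Hypothetical helper: map a ramen_rpo_difference value to the Impact
# level of the alert whose expression it satisfies.
#   ratio >= 3      -> critical
#   2 < ratio < 3   -> warning
#   otherwise       -> none (no alert expression matches)
alert_severity() {
  awk -v r="$1" 'BEGIN {
    if (r >= 3) print "critical"
    else if (r > 2) print "warning"
    else print "none"
  }'
}
```

For example, alert_severity 2.5 prints warning, and a ratio of exactly 2 matches neither expression.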

© 2024 Red Hat, Inc.