Chapter 6. Monitoring disaster recovery health
6.1. Enable monitoring for disaster recovery
Use this procedure to enable basic monitoring for your disaster recovery setup.
Procedure
- On the Hub cluster, open a terminal window.
- Add the following label to the openshift-operators namespace:
$ oc label namespace openshift-operators openshift.io/cluster-monitoring='true'
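To confirm that the label was applied, you can read it back from the namespace. The following is a minimal sketch, assuming the oc client is logged in to the Hub cluster. The function name monitoring_label_value is hypothetical and not part of any product tooling.

```shell
# Hypothetical helper: print the value of the cluster-monitoring label on the
# openshift-operators namespace (prints nothing if the label is not set).
monitoring_label_value() {
  # The dot inside the label key is escaped so that jsonpath treats
  # "openshift.io/cluster-monitoring" as a single field name.
  oc get namespace openshift-operators \
    -o jsonpath='{.metadata.labels.openshift\.io/cluster-monitoring}'
}

# Usage (requires a logged-in oc session on the Hub cluster):
#   [ "$(monitoring_label_value)" = "true" ] && echo "monitoring label is set"
```

The helper only reads the label, so it is safe to run repeatedly.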
6.2. Enabling disaster recovery dashboard on Hub cluster
This section describes how to enable the disaster recovery dashboard for advanced monitoring on the Hub cluster.
For Regional-DR, the dashboard shows monitoring status cards for operator health, cluster health, metrics, alerts and application count.
For Metro-DR, you can configure the dashboard to only monitor the ramen setup health and application count.
Prerequisites
Ensure that the following prerequisites are met:
- OpenShift Container Platform version 4.15 is installed and you have administrator privileges.
- ODF Multicluster Orchestrator is installed with the console plugin enabled.
- Red Hat Advanced Cluster Management for Kubernetes (RHACM) 2.10 is installed from Operator Hub. For instructions on how to install, see Installing RHACM.
- Observability is enabled on RHACM. See Enabling observability guidelines.
Procedure
- On the Hub cluster, open a terminal window and perform the next steps.
- Create the configmap file named observability-metrics-custom-allowlist.yaml. The file contains YAML that lists the disaster recovery metrics on the Hub cluster. For details, see Adding custom metrics. To learn more about ramen metrics, see Disaster recovery metrics.
- In the open-cluster-management-observability namespace, run the following command:
$ oc apply -n open-cluster-management-observability -f observability-metrics-custom-allowlist.yaml
After the observability-metrics-custom-allowlist ConfigMap is created, RHACM starts collecting the listed OpenShift Data Foundation metrics from all the managed clusters.
To exclude a specific managed cluster from collecting the observability data, add the following cluster label to that cluster: observability: disabled.
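As an illustration of the allowlist file named in the procedure above, the following sketch writes a ConfigMap to disk. It assumes the RHACM custom allowlist layout (a metrics_list.yaml data key holding a names list), so verify the layout against Adding custom metrics before use. The metric names are taken from the Disaster recovery metrics section purely as placeholders; the list your setup needs may differ.

```shell
# Sketch only: write a candidate allowlist ConfigMap to the current directory.
# Assumption: the ConfigMap layout and the metric names below are illustrative
# and must be checked against the RHACM documentation for your version.
cat > observability-metrics-custom-allowlist.yaml <<'EOF'
kind: ConfigMap
apiVersion: v1
metadata:
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
data:
  metrics_list.yaml: |
    names:
      - ramen_last_sync_timestamp_seconds
      - ramen_policy_schedule_interval_seconds
      - ramen_last_sync_duration_seconds
      - ramen_last_sync_data_bytes
EOF
```

The quoted heredoc delimiter ('EOF') keeps the shell from expanding anything inside the YAML, so the file is written exactly as shown.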
6.3. Viewing health status of disaster recovery replication relationships
Prerequisites
Ensure that you have enabled the disaster recovery dashboard for monitoring. For instructions, see Enabling disaster recovery dashboard on Hub cluster.
Procedure
- On the Hub cluster, ensure that the All Clusters option is selected.
- Refresh the console to make the DR monitoring dashboard tab accessible.
- Navigate to Data Services and click Data policies.
- On the Overview tab, you can view the health status of the operators, clusters and applications. A green tick indicates that the operators are running and available.
- Click the Disaster recovery tab to view a list of DR policy details and connected applications.
6.4. Disaster recovery metrics
These are the ramen metrics that are scraped by Prometheus.
- ramen_last_sync_timestamp_seconds
- ramen_policy_schedule_interval_seconds
- ramen_last_sync_duration_seconds
- ramen_last_sync_data_bytes
Query these metrics from the Hub cluster where the Red Hat Advanced Cluster Management for Kubernetes (RHACM) operator is installed.
Last synchronization timestamp in seconds
This metric gives the time, in seconds, of the most recent successful synchronization of all PVCs per application.
- Metric name
- ramen_last_sync_timestamp_seconds
- Metrics type
- Gauge
- Labels
  - obj_type: Type of the object, here it is DRPC
  - obj_name: Name of the object, here it is DRPC-Name
  - obj_namespace: DRPC namespace
  - policyname: Name of the DRPolicy
  - scheduling_interval: Scheduling interval value from DRPolicy
- Metric value
- The value is set in Unix seconds and is obtained from lastGroupSyncTime in the DRPC status.
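Because the metric value is a Unix timestamp in seconds, it is easier to read once converted to a calendar time. The following is a minimal sketch with a hypothetical helper name, assuming GNU date is available.

```shell
# Hypothetical helper: convert a Unix timestamp in seconds to an ISO 8601
# UTC string. Assumes GNU date; BSD date does not accept the -d "@<seconds>" form.
unix_to_utc() {
  date -u -d "@$1" '+%Y-%m-%dT%H:%M:%SZ'
}

unix_to_utc 1700000000   # prints 2023-11-14T22:13:20Z
```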
Policy schedule interval in seconds
This gives the scheduling interval in seconds from DRPolicy.
- Metric name
- ramen_policy_schedule_interval_seconds
- Metrics type
- Gauge
- Labels
  - policyname: Name of the DRPolicy
- Metric value
- The value is the scheduling interval, in seconds, taken from the DRPolicy.
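To relate the metric value to a policy setting, the interval has to be expressed in seconds. The sketch below assumes that the DRPolicy interval is written as a number followed by m, h, or d (for example 5m); that format is an assumption here, and the helper name is hypothetical.

```shell
# Hypothetical helper: convert an interval such as 5m, 1h, or 2d to seconds.
# Assumption: the interval is a plain decimal number followed by m, h, or d.
interval_to_seconds() {
  number=${1%?}          # everything except the last character, e.g. 5
  unit=${1#"$number"}    # the last character, e.g. m
  case "$number" in
    ''|*[!0-9]*) echo "unsupported interval: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    m) echo $(( number * 60 )) ;;
    h) echo $(( number * 3600 )) ;;
    d) echo $(( number * 86400 )) ;;
    *) echo "unsupported interval: $1" >&2; return 1 ;;
  esac
}

interval_to_seconds 5m   # prints 300
```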
Last synchronization duration in seconds
This represents the longest time taken to sync from the most recent successful synchronization of all PVCs per application.
- Metric name
- ramen_last_sync_duration_seconds
- Metrics type
- Gauge
- Labels
  - obj_type: Type of the object, here it is DRPC
  - obj_name: Name of the object, here it is DRPC-Name
  - obj_namespace: DRPC namespace
  - scheduling_interval: Scheduling interval value from DRPolicy
- Metric value
- The value is taken from lastGroupSyncDuration in the DRPC status.
Total bytes transferred from most recent synchronization
This value represents the total bytes transferred from the most recent successful synchronization of all PVCs per application.
- Metric name
- ramen_last_sync_data_bytes
- Metrics type
- Gauge
- Labels
  - obj_type: Type of the object, here it is DRPC
  - obj_name: Name of the object, here it is DRPC-Name
  - obj_namespace: DRPC namespace
  - scheduling_interval: Scheduling interval value from DRPolicy
- Metric value
- The value is taken from lastGroupSyncBytes in the DRPC status.
6.5. Disaster recovery alerts
This section provides a list of all supported alerts associated with Red Hat OpenShift Data Foundation within a disaster recovery environment.
Recording rules
Record: ramen_sync_duration_seconds
- Expression
sum by (obj_name, obj_namespace, obj_type, job, policyname)(time() - (ramen_last_sync_timestamp_seconds > 0))
- Purpose
- The time elapsed, in seconds, between the volume group’s last sync time and the current time.
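The arithmetic behind this rule can be sketched for a single series as follows. The helper name is hypothetical, and the sketch ignores the sum by aggregation, which only matters when several series share the same labels.

```shell
# Hypothetical helper: seconds elapsed since the last sync, mirroring
# time() - (ramen_last_sync_timestamp_seconds > 0) for one series.
# Prints nothing when the timestamp is 0 or negative, because the "> 0"
# filter drops such samples before the subtraction.
sync_age_seconds() {
  last_sync=$1
  now=${2:-$(date +%s)}   # optional second argument makes the result reproducible
  if [ "$last_sync" -gt 0 ]; then
    echo $(( now - last_sync ))
  fi
}

sync_age_seconds 1700000000 1700000900   # prints 900
```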
Record: ramen_rpo_difference
- Expression
ramen_sync_duration_seconds{job="ramen-hub-operator-metrics-service"} / on(policyname, job) group_left() (ramen_policy_schedule_interval_seconds{job="ramen-hub-operator-metrics-service"})
- Purpose
- The ratio of the actual sync delay of the volume replication group to the expected sync delay, that is, the policy schedule interval.
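Since the expression divides one series by the other, the result is a unitless ratio. A minimal sketch of that division for one pair of values, using a hypothetical helper name and awk for floating-point arithmetic:

```shell
# Hypothetical helper: divide the time since the last sync by the policy
# schedule interval, both in seconds, and print the ratio with two decimals.
rpo_ratio() {
  awk -v age="$1" -v interval="$2" 'BEGIN {
    if (interval <= 0) exit 1        # this sketch refuses to divide by zero
    printf "%.2f\n", age / interval
  }'
}

rpo_ratio 900 300   # prints 3.00
```

A result of 1.00 means the last sync is exactly one schedule interval old.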
Record: count_persistentvolumeclaim_total
- Expression
count(kube_persistentvolumeclaim_info)
- Purpose
- Count of all PVCs from the managed cluster.
Alerts
Alert: VolumeSynchronizationDelay
- Impact
- Critical
- Purpose
- The actual sync delay of the volume replication group is three times the expected sync delay.
- YAML
Alert: VolumeSynchronizationDelay
- Impact
- Warning
- Purpose
- The actual sync delay of the volume replication group is twice the expected sync delay.
- YAML
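As a rough illustration of how the two severities relate, the sketch below maps a ratio of actual to expected sync delay to a severity. It assumes that "twice" and "three times" translate to thresholds of 2 and 3 with an inclusive comparison; the actual alert expressions may differ, and the helper name is hypothetical.

```shell
# Hypothetical helper: classify a sync delay ratio.
# Assumption: warning at ratio >= 2 and critical at ratio >= 3.
sync_delay_severity() {
  awk -v ratio="$1" 'BEGIN {
    if (ratio >= 3)      print "critical"
    else if (ratio >= 2) print "warning"
    else                 print "ok"
  }'
}

sync_delay_severity 3.2   # prints critical
```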