9.6. Monitoring the status of cross-site replication
Monitor the site status of your backup locations to detect interruptions in the communication between the sites. When a remote site status changes to offline, Data Grid stops replicating your data to the backup location. Your data becomes out of sync and you must fix the inconsistencies before bringing the clusters back online.
Monitoring cross-site events is necessary for early problem detection. Use one of the following monitoring strategies:
- Monitoring cross-site replication with the REST API
- Monitoring cross-site replication with the Prometheus metrics or any other monitoring system
Monitoring cross-site replication with the REST API
Monitor the status of cross-site replication for all caches using the REST endpoint. You can implement a custom script to poll the REST endpoint or use the following example.
Prerequisites
- Enable cross-site replication.
Procedure
Implement a script to poll the REST endpoint.
The following example demonstrates how you can use a Python script to poll the site status every five seconds.
#!/usr/bin/python3
import time

import requests
from requests.auth import HTTPDigestAuth


class InfinispanConnection:

    # cache_manager is kept for compatibility; the URL below does not use it
    def __init__(self, server: str = 'http://localhost:11222', cache_manager: str = 'default',
                 auth: tuple = ('admin', 'change_me')) -> None:
        super().__init__()
        self.__url = f'{server}/rest/v2/container/x-site/backups/'
        self.__auth = auth
        self.__headers = {
            'accept': 'application/json'
        }

    def get_sites_status(self):
        # Returns the decoded JSON response, or None if the request fails
        try:
            rsp = requests.get(self.__url, headers=self.__headers,
                               auth=HTTPDigestAuth(self.__auth[0], self.__auth[1]))
            if rsp.status_code != 200:
                return None
            return rsp.json()
        except (requests.RequestException, ValueError):
            # Connection errors and invalid JSON; Ctrl+C is no longer swallowed
            return None


# Specify credentials for a Data Grid user with permission to access the REST endpoint
USERNAME = 'admin'
PASSWORD = 'change_me'
# Set an interval between cross-site status checks
POLL_INTERVAL_SEC = 5
# Provide a list of servers
SERVERS = [
    InfinispanConnection('http://127.0.0.1:11222', auth=(USERNAME, PASSWORD)),
    InfinispanConnection('http://127.0.0.1:12222', auth=(USERNAME, PASSWORD))
]
# Specify the names of remote sites
REMOTE_SITES = [
    'nyc'
]
# Provide a list of caches to monitor
CACHES = [
    'work',
    'sessions'
]


def on_event(site: str, cache: str, old_status: str, new_status: str):
    # TODO implement your handling code here
    print(f'site={site} cache={cache} Status changed {old_status} -> {new_status}')


def __handle_mixed_state(state: dict, site: str, site_status: dict):
    # In a mixed state, caches not listed as online are treated as offline
    if site not in state:
        state[site] = {c: 'online' if c in site_status['online'] else 'offline' for c in CACHES}
        return
    for cache in CACHES:
        __update_cache_state(state, site, cache, 'online' if cache in site_status['online'] else 'offline')


def __handle_online_or_offline_state(state: dict, site: str, new_status: str):
    # The first observation only records the status; no event is raised
    if site not in state:
        state[site] = {c: new_status for c in CACHES}
        return
    for cache in CACHES:
        __update_cache_state(state, site, cache, new_status)


def __update_cache_state(state: dict, site: str, cache: str, new_status: str):
    old_status = state[site].get(cache)
    if old_status != new_status:
        on_event(site, cache, old_status, new_status)
        state[site][cache] = new_status


def update_state(state: dict):
    # Use the first server that returns a valid response
    rsp = None
    for conn in SERVERS:
        rsp = conn.get_sites_status()
        if rsp:
            break
    if rsp is None:
        print('Unable to fetch site status from any server')
        return

    for site in REMOTE_SITES:
        site_status = rsp.get(site, {})
        new_status = site_status.get('status')
        if new_status == 'mixed':
            __handle_mixed_state(state, site, site_status)
        else:
            __handle_online_or_offline_state(state, site, new_status)


if __name__ == '__main__':
    _state = {}
    while True:
        update_state(_state)
        time.sleep(POLL_INTERVAL_SEC)
When a site status changes from online to offline or vice-versa, the function on_event is invoked.
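As an illustration of what a handler might do, the following minimal sketch replaces the print call with Python's standard logging module and raises the severity when a cache goes offline. The logger name `xsite-monitor` and the choice of log levels are assumptions made for this example, not part of Data Grid.

```python
import logging

# Hypothetical logger name, chosen only for this sketch
logger = logging.getLogger('xsite-monitor')


def on_event(site: str, cache: str, old_status: str, new_status: str) -> str:
    """Log a status transition and return the logged message."""
    message = f'site={site} cache={cache} status changed {old_status} -> {new_status}'
    if new_status == 'offline':
        # Replication to the backup location has stopped for this cache
        logger.error(message)
    elif new_status == 'online':
        logger.info(message)
    else:
        # Any other value, for example a missing status, is unexpected
        logger.warning(message)
    return message
```

Returning the message keeps the handler easy to test and lets a caller forward the same text to another channel, such as email or a chat webhook.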
To use the polling script, specify the following variables:
- USERNAME and PASSWORD: The username and password of a Data Grid user with permission to access the REST endpoint.
- POLL_INTERVAL_SEC: The number of seconds between polls.
- SERVERS: The list of Data Grid Servers at this site. The script requires only a single valid response, but the list allows for failover.
- REMOTE_SITES: The list of remote sites to monitor on these servers.
- CACHES: The list of cache names to monitor.
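To make the status handling concrete, the following minimal sketch resolves a per-cache status from one decoded response without contacting a server. The sample payload is hypothetical: its shape is assumed from the fields the polling script reads (a `status` value per site, plus an `online` list when the status is `mixed`). The helper name `resolve_cache_status` and the site name `lon` are illustrative only.

```python
from typing import Dict, List, Optional


def resolve_cache_status(response: dict, site: str, caches: List[str]) -> Dict[str, Optional[str]]:
    """Map each monitored cache to its status for one remote site."""
    site_status = response.get(site, {})
    status = site_status.get('status')
    if status == 'mixed':
        # Caches not listed as online are treated as offline
        online = site_status.get('online', [])
        return {c: 'online' if c in online else 'offline' for c in caches}
    # The site-wide status applies to every cache (None if the site is missing)
    return {c: status for c in caches}


# Hypothetical payload, shaped after the fields the polling script reads
sample = {
    'nyc': {'status': 'mixed', 'online': ['work'], 'offline': ['sessions']},
    'lon': {'status': 'online'},
}

print(resolve_cache_status(sample, 'nyc', ['work', 'sessions']))
# prints {'work': 'online', 'sessions': 'offline'}
```

A site that is absent from the response yields `None` for every cache, which a caller can treat as "status unknown" rather than as offline.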
Monitoring cross-site replication with the Prometheus metrics
Prometheus, and other monitoring systems, let you configure alerts to detect when a site status changes to offline.
Monitoring cross-site latency metrics can help you to discover potential issues.
Prerequisites
- Enable cross-site replication.
Procedure
- Configure Data Grid metrics.
- Configure alerting rules using the Prometheus metrics format.
  - For the site status, use 1 for online and 0 for offline.
  - For the expr field, use the following format: infinispan_x_site_admin_status{cache="<cache name>",site="<site name>"}.
In the following example, Prometheus alerts you when the NYC site goes offline for the cache named work or sessions.
    groups:
    - name: Cross Site Rules
      rules:
      - alert: Cache Work and Site NYC
        expr: infinispan_x_site_admin_status{cache="work",site="NYC"} == 0
      - alert: Cache Sessions and Site NYC
        expr: infinispan_x_site_admin_status{cache="sessions",site="NYC"} == 0
The following image shows an alert that the NYC site is offline for cache work.
Figure 9.1. Prometheus Alert
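To show what the `== 0` comparison in an alerting rule selects, the following minimal sketch applies the same check to a sample of scraped metrics text. The sample is made up for illustration, and the pattern assumes exactly the two labels shown in the expr format, in that order; real output may carry additional labels. The helper name `find_offline` is hypothetical.

```python
import re
from typing import List, Tuple

# Assumes the label order (cache, then site) used in the expr format
_PATTERN = re.compile(
    r'^infinispan_x_site_admin_status\{cache="([^"]*)",site="([^"]*)"\}\s+(\S+)$'
)


def find_offline(metrics_text: str) -> List[Tuple[str, str]]:
    """Return (cache, site) pairs whose site status value is 0 (offline)."""
    offline = []
    for line in metrics_text.splitlines():
        match = _PATTERN.match(line.strip())
        if match and float(match.group(3)) == 0:
            offline.append((match.group(1), match.group(2)))
    return offline


# Made-up sample in the Prometheus text exposition format
sample = '''\
# TYPE infinispan_x_site_admin_status gauge
infinispan_x_site_admin_status{cache="work",site="NYC"} 0
infinispan_x_site_admin_status{cache="sessions",site="NYC"} 1
'''

print(find_offline(sample))
# prints [('work', 'NYC')]
```

In a real deployment Prometheus evaluates the rule itself; a script like this is only useful for checking what a rule would match before you deploy it.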