9.6. クロスサイトレプリケーションのステータスの監視

バックアップの場所のサイトステータスを監視して、サイト間の通信の中断を検出します。リモートサイトのステータスが offline に変わると、Data Grid はバックアップの場所へのデータのレプリケーションを停止します。データが同期しなくなるため、クラスターをオンラインに戻す前に不整合を修正する必要があります。

問題を早期に検出するには、クロスサイトのイベントの監視が必要です。以下の監視ストラテジーのいずれかを使用します。

REST API を使用したクロスサイトレプリケーションの監視

REST エンドポイントを使用して、すべてのキャッシュのクロスサイトレプリケーションのステータスを監視します。カスタムスクリプトを実装して REST エンドポイントをポーリングするか、以下の例を使用できます。

前提条件

クロスサイトレプリケーションを有効にする。

手順

REST エンドポイントをポーリングするスクリプトを実装します。
以下の例は、Python スクリプトを使用して、5 秒ごとにサイトのステータスをポーリングする方法を示しています。

#!/usr/bin/python3
import time
import requests
from requests.auth import HTTPDigestAuth


class InfinispanConnection:

    def __init__(self, server: str = 'http://localhost:11222', cache_manager: str = 'default',
                 auth: tuple = ('admin', 'change_me')) -> None:
        super().__init__()
        self.__url = f'{server}/rest/v2/cache-managers/{cache_manager}/x-site/backups/'
        self.__auth = auth
        self.__headers = {
            'accept': 'application/json'
        }

    def get_sites_status(self):
        try:
            rsp = requests.get(self.__url, headers=self.__headers, auth=HTTPDigestAuth(self.__auth[0], self.__auth[1]))
            if rsp.status_code != 200:
                return None
            return rsp.json()
        except:
            return None


# Specify credentials for Data Grid user with permission to access the REST endpoint
USERNAME = 'admin'
PASSWORD = 'change_me'
# Set an interval between cross-site status checks
POLL_INTERVAL_SEC = 5
# Provide a list of servers
SERVERS = [
    InfinispanConnection('http://127.0.0.1:11222', auth=(USERNAME, PASSWORD)),
    InfinispanConnection('http://127.0.0.1:12222', auth=(USERNAME, PASSWORD))
]
#Specify the names of remote sites
REMOTE_SITES = [
    'nyc'
]
#Provide a list of caches to monitor
CACHES = [
    'work',
    'sessions'
]


def on_event(site: str, cache: str, old_status: str, new_status: str):
    # TODO implement your handling code here
    print(f'site={site} cache={cache} Status changed {old_status} -> {new_status}')


def __handle_mixed_state(state: dict, site: str, site_status: dict):
    if site not in state:
        state[site] = {c: 'online' if c in site_status['online'] else 'offline' for c in CACHES}
        return

    for cache in CACHES:
        __update_cache_state(state, site, cache, 'online' if cache in site_status['online'] else 'offline')


def __handle_online_or_offline_state(state: dict, site: str, new_status: str):
    if site not in state:
        state[site] = {c: new_status for c in CACHES}
        return

    for cache in CACHES:
        __update_cache_state(state, site, cache, new_status)


def __update_cache_state(state: dict, site: str, cache: str, new_status: str):
    old_status = state[site].get(cache)
    if old_status != new_status:
        on_event(site, cache, old_status, new_status)
        state[site][cache] = new_status


def update_state(state: dict):
    rsp = None
    for conn in SERVERS:
        rsp = conn.get_sites_status()
        if rsp:
            break
    if rsp is None:
        print('Unable to fetch site status from any server')
        return

    for site in REMOTE_SITES:
        site_status = rsp.get(site, {})
        new_status = site_status.get('status')
        if new_status == 'mixed':
            __handle_mixed_state(state, site, site_status)
        else:
            __handle_online_or_offline_state(state, site, new_status)


if __name__ == '__main__':
    _state = {}
    while True:
        update_state(_state)
        time.sleep(POLL_INTERVAL_SEC)

サイトのステータスが online から offline に変わったり、その逆の場合、関数 on_event が呼び出されます。

このスクリプトを使用する場合は、以下の変数を指定する必要があります。

USERNAME および PASSWORD: REST エンドポイントにアクセスする権限を持つ Data Grid ユーザーのユーザー名およびパスワード。
POLL_INTERVAL_SEC: ポーリング間の秒数。
SERVERS: このサイトの Data Grid Server の一覧。このスクリプトには必要なのは、有効な応答が 1 つだけですが、フェイルオーバーを許可するためにリストが提供されています。
REMOTE_SITES: これらのサーバー上で監視するリモートサイトのリスト。
CACHES : 監視するキャッシュ名のリスト。

関連情報

REST API: バックアップの場所のステータスの取得

Prometheus メトリクスを使用したクロスサイトレプリケーションの監視

Prometheus およびその他の監視システムを使用すると、サイトのステータスが offline に変化したことを検出するアラートを設定できます。

ヒント

クロスサイトレイテンシーメトリックを監視すると、発生する可能性のある問題を検出しやすくなります。

前提条件

クロスサイトレプリケーションを有効にする。

手順

Data Grid メトリクスを設定します。
Prometheus メトリクス形式を使用してアラートルールを設定します。
- サイトの状態については、online の場合は 1 を、offline の場合は 0 を使用します。
- expr フィールドには、
  vendor_cache_manager_default_cache_<cache name>_x_site_admin_<site name>_status の形式を使用します。
  以下の例では、Prometheus は、NYC サイトが work または sessions という名前のキャッシュに対して オフライン になると警告を表示します。
```
groups:
- name: Cross Site Rules
  rules:
  - alert: Cache Work and Site NYC
    expr: vendor_cache_manager_default_cache_work_x_site_admin_nyc_status == 0
  - alert: Cache Sessions and Site NYC
    expr: vendor_cache_manager_default_cache_sessions_x_site_admin_nyc_status == 0
```
  次の図は、NYC サイトがキャッシュ work を行うために offline であるという警告を示しています。
  図9.1 Prometheus 警告

関連情報

9.6. クロスサイトレプリケーションのステータスの監視

REST API を使用したクロスサイトレプリケーションの監視

Prometheus メトリクスを使用したクロスサイトレプリケーションの監視

詳細情報

試用、購入および販売

コミュニティー

Red Hat ドキュメントについて

多様性を受け入れるオープンソースの強化

会社概要

Red Hat legal and privacy links

Red Hat legal and privacy links