Chapter 4. Monitoring and Troubleshooting Global Data Grid Clusters
Data Grid provides statistics for cross-site replication operations via JMX or the /metrics endpoint for Data Grid server.
Cross-site replication statistics are available at cache level, so you must explicitly enable statistics for your caches. Likewise, if you want to collect statistics via JMX, you must configure Data Grid to register MBeans.
Data Grid also includes an org.infinispan.XSITE logging category so you can monitor and troubleshoot common issues with networking and state transfer operations.
4.1. Enabling Data Grid Statistics
Configure Data Grid to export statistics for Cache Managers and caches.
Procedure
Modify your configuration to enable Data Grid statistics in one of the following ways:
- Declarative: Add the statistics="true" attribute.
- Programmatic: Call the .statistics() method.
Declarative
<!-- Enables statistics for the Cache Manager. -->
<cache-container statistics="true">
  <!-- Enables statistics for the named cache. -->
  <local-cache name="mycache" statistics="true"/>
</cache-container>
Programmatic
GlobalConfiguration globalConfig = new GlobalConfigurationBuilder()
  //Enables statistics for the Cache Manager.
  .cacheContainer().statistics(true)
  .build();

Configuration config = new ConfigurationBuilder()
  //Enables statistics for the named cache.
  .statistics().enable()
  .build();
4.2. Configuring Data Grid Metrics
Configure Data Grid to export gauges and histograms via the metrics endpoint.
Procedure
- Turn gauges and histograms on or off in the metrics configuration as appropriate.
Declarative
<!-- Computes and collects statistics for the Cache Manager. -->
<cache-container statistics="true">
  <!-- Exports collected statistics as gauge and histogram metrics. -->
  <metrics gauges="true" histograms="true" />
</cache-container>
Programmatic
GlobalConfiguration globalConfig = new GlobalConfigurationBuilder()
  //Computes and collects statistics for the Cache Manager.
  .statistics().enable()
  //Exports collected statistics as gauge and histogram metrics.
  .metrics().gauges(true).histograms(true)
  .build();
4.2.1. Collecting Data Grid Metrics
Collect Data Grid metrics with monitoring tools such as Prometheus.
Prerequisites
- Enable statistics. If you do not enable statistics, Data Grid provides 0 and -1 values for metrics.
- Optionally enable histograms. By default Data Grid generates gauges but not histograms.
Procedure
Get metrics in Prometheus (OpenMetrics) format:
$ curl -v http://localhost:11222/metrics
Get metrics in MicroProfile JSON format:
$ curl --header "Accept: application/json" http://localhost:11222/metrics
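As a rough illustration of what a monitoring client could do with the Prometheus-format output above, the following sketch parses text-format samples into a map and applies a simple heuristic for the "statistics not enabled" case, where every value is 0 or -1. The class name, the heuristic, and the metric names in the sample are hypothetical and not taken from Data Grid; the parser covers only the basic `name{labels} value [timestamp]` line shape and throws `NumberFormatException` on a malformed value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MetricsText {

    // Parses Prometheus text-format samples into a map keyed by "name{labels}".
    static Map<String, Double> parse(String body) {
        Map<String, Double> samples = new LinkedHashMap<>();
        for (String raw : body.split("\\R")) {
            String line = raw.trim();
            // Skip blank lines and "# HELP" / "# TYPE" comment lines.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            // The key ends at the closing brace of the label set,
            // or at the first whitespace when the sample has no labels.
            int brace = line.lastIndexOf('}');
            int keyEnd = brace >= 0 ? brace + 1 : indexOfWhitespace(line);
            if (keyEnd <= 0 || keyEnd >= line.length()) {
                continue; // no value on this line
            }
            String key = line.substring(0, keyEnd);
            // The value is the first token after the key; a timestamp may follow.
            String[] rest = line.substring(keyEnd).trim().split("\\s+");
            samples.put(key, toDouble(rest[0]));
        }
        return samples;
    }

    private static int indexOfWhitespace(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isWhitespace(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    // The text format writes infinities as +Inf and -Inf.
    private static double toDouble(String token) {
        switch (token) {
            case "+Inf": return Double.POSITIVE_INFINITY;
            case "-Inf": return Double.NEGATIVE_INFINITY;
            default: return Double.parseDouble(token);
        }
    }

    // Heuristic (an assumption, not a Data Grid rule): if every sample is
    // 0 or -1, statistics may not be enabled for the cache or Cache Manager.
    static boolean looksDisabled(Map<String, Double> samples) {
        return !samples.isEmpty()
            && samples.values().stream().allMatch(v -> v == 0.0 || v == -1.0);
    }

    public static void main(String[] args) {
        // Hypothetical metric names, used only to exercise the parser.
        String sample = String.join("\n",
            "# TYPE example_hits gauge",
            "example_hits{cache=\"mycache\"} -1",
            "example_misses{cache=\"mycache\"} 0");
        Map<String, Double> samples = parse(sample);
        samples.forEach((k, v) -> System.out.println(k + " = " + v));
        System.out.println("statistics possibly disabled: " + looksDisabled(samples));
    }
}
```

To try it against a running server, you could save the `curl` output to a file and pass its contents to `parse` instead of the sample string.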
Next steps
Configure monitoring applications to collect Data Grid metrics. For example, add the following to prometheus.yml:
static_configs:
  - targets: ['localhost:11222']
Reference
- Prometheus Configuration
- Enabling Data Grid Statistics
4.3. Configuring Data Grid to Register JMX MBeans
Data Grid can register JMX MBeans that you can use to collect statistics and perform administrative operations. You must enable statistics separately from JMX; otherwise, Data Grid provides 0 values for all statistic attributes.
Procedure
Modify your cache container configuration to enable JMX in one of the following ways:
- Declarative: Add the <jmx enabled="true" /> element to the cache container.
- Programmatic: Call the .jmx().enable() method.
Declarative
<cache-container>
  <jmx enabled="true" />
</cache-container>
Programmatic
GlobalConfiguration globalConfig = new GlobalConfigurationBuilder()
  .jmx().enable()
  .build();
4.3.1. JMX MBeans for Cross-Site Replication
Data Grid provides JMX MBeans for cross-site replication that let you gather statistics and perform remote operations.
The org.infinispan:type=Cache component provides the following JMX MBeans:
- XSiteAdmin exposes cross-site operations that apply to specific cache instances.
- StateTransferManager provides statistics for state transfer operations.
- InboundInvocationHandler provides statistics and operations for asynchronous and synchronous cross-site requests.
The org.infinispan:type=CacheManager component includes the following JMX MBean:
- GlobalXSiteAdminOperations exposes cross-site operations that apply to all caches in a cache container.
For details about JMX MBeans along with descriptions of available operations and statistics, see the Data Grid JMX Components documentation.
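The following sketch shows one way to look up these MBeans with the standard `javax.management` API. It assumes that Data Grid registers each MBean with a `component=<name>` key next to the `type` key, for example `org.infinispan:type=Cache,name="...",manager="...",component=XSiteAdmin`; check the names in a JMX console before relying on that layout. The class and method names are hypothetical. The example queries the platform MBean server of the current JVM, which is only useful when Data Grid runs embedded in the same process; for a remote server you would obtain an `MBeanServerConnection` from a JMX connector instead.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

final class XSiteMBeanQuery {

    // Wraps the checked exception so callers can build names inline.
    static ObjectName name(String value) {
        try {
            return new ObjectName(value);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid JMX object name: " + value, e);
        }
    }

    // Builds a pattern such as org.infinispan:type=Cache,component=XSiteAdmin,*
    // The "component" key is an assumption about how the MBeans are named.
    static ObjectName pattern(String type, String component) {
        return name("org.infinispan:type=" + type + ",component=" + component + ",*");
    }

    // Returns the names of all registered MBeans that match the pattern.
    static Set<ObjectName> find(MBeanServer server, String type, String component) {
        return server.queryNames(pattern(type, component), null);
    }

    public static void main(String[] args) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        String[] cacheComponents = {
            "XSiteAdmin", "StateTransferManager", "InboundInvocationHandler"
        };
        for (String component : cacheComponents) {
            System.out.println(component + " -> " + find(server, "Cache", component));
        }
        System.out.println("GlobalXSiteAdminOperations -> "
            + find(server, "CacheManager", "GlobalXSiteAdminOperations"));
    }
}
```

In a JVM without Data Grid, each query returns an empty set, so the sketch is safe to run as a connectivity check before you add attribute reads or operation calls.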
4.4. Collecting Logs and Troubleshooting Cross-Site Replication
Diagnose and resolve issues related to Data Grid cross-site replication. Use the Data Grid Command Line Interface (CLI) to adjust log levels at run-time and perform cross-site troubleshooting.
Procedure
- Open a terminal in $RHDG_HOME.
- Create a Data Grid CLI connection.
- Adjust run-time logging levels to capture DEBUG messages if necessary.
  For example, the following command enables DEBUG log messages for the org.infinispan.XSITE category:
  [//containers/default]> logging set --level=DEBUG org.infinispan.XSITE
  You can then check the Data Grid log files for cross-site messages in the ${rhdg.server.root}/log directory.
- Use the site command to view status for backup locations and perform troubleshooting.
  For example, check the status of the "customers" cache that uses "LON" as a backup location:
  [//containers/default]> site status --cache=customers
  {
    "LON" : "online"
  }
Another scenario for using the Data Grid CLI to troubleshoot is when the network connection between backup locations is broken during a state transfer operation.
If this occurs, Data Grid clusters that receive state transfer continually wait for the operation to complete. In this case you should cancel the state transfer to the receiving site to return it to a normal operational state.
For example, cancel state transfer for "NYC" as follows:
[//containers/default]> site cancel-receive-state --cache=mycache --site=NYC
4.4.1. Cross-Site Log Messages
Find user actions for log messages related to cross-site replication.
Log level | Identifier | Message | Description |
---|---|---|---|
DEBUG | ISPN000400 | Node null was suspected | Data Grid prints this message when it cannot reach backup locations. Ensure that sites are online and check network status. |
INFO | ISPN000439 | Received new x-site view: ${site.name} | Data Grid prints this message when sites join and leave the global cluster. |
INFO | ISPN100005 | Site ${site.name} is online. | Data Grid prints this message when a site comes online. |
INFO | ISPN100006 | Site ${site.name} is offline. | Data Grid prints this message when a site goes offline. If you did not take the site offline manually, this message could indicate a failure has occurred. Check network status and try to bring the site back online. |
WARN | ISPN000202 | Problems backing up data for cache ${cache.name} to site ${site.name}: | Data Grid prints this message when issues occur with state transfer operations along with the exception. If necessary adjust Data Grid logging to get more fine-grained logging messages. |
WARN | ISPN000289 | Unable to send X-Site state chunk to ${site.name}. | Indicates that Data Grid cannot transfer a batch of cache entries during a state transfer operation. Ensure that sites are online and check network status. |
WARN | ISPN000291 | Unable to apply X-Site state chunk. | Indicates that Data Grid cannot apply a batch of cache entries during a state transfer operation. Ensure that sites are online and check network status. |
WARN | ISPN000322 | Unable to re-start x-site state transfer to site ${site.name} | Indicates that Data Grid cannot resume a state transfer operation to a backup location. Ensure that sites are online and check network status. |
ERROR | ISPN000477 | Unable to perform operation ${operation.name} for site ${site.name} | Indicates that Data Grid cannot successfully complete an operation on a backup location. If necessary adjust Data Grid logging to get more fine-grained logging messages. |
FATAL | ISPN000449 | XSite state transfer timeout must be higher or equals than 1 (one). | Results when the value of the |
FATAL | ISPN000450 | XSite state transfer waiting time between retries must be higher or equals than 1 (one). | Results when the value of the |
FATAL | ISPN000576 | Cross-site Replication not available for local cache. | Cross-site replication does not work with the local cache mode. Either remove the backup configuration from the local cache definition or use a distributed or replicated cache mode. |
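When you review log files for the messages in the table above, a small scanner can help show which cross-site identifiers occur and how often. The sketch below counts occurrences of the identifiers listed in the table. The class name and the sample log line layout are hypothetical; the code matches only the `ISPN` identifier, so it should not depend on the exact log pattern that your server uses.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XSiteLogScan {

    // Identifiers copied from the cross-site log message table.
    static final Set<String> XSITE_IDS = Set.of(
        "ISPN000400", "ISPN000439", "ISPN100005", "ISPN100006",
        "ISPN000202", "ISPN000289", "ISPN000291", "ISPN000322",
        "ISPN000477", "ISPN000449", "ISPN000450", "ISPN000576");

    private static final Pattern ID = Pattern.compile("ISPN\\d{6}");

    // Counts how often each cross-site identifier appears in the given lines.
    // Identifiers that are not in the table are ignored.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = ID.matcher(line);
            while (m.find()) {
                if (XSITE_IDS.contains(m.group())) {
                    counts.merge(m.group(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical log lines; a real log format may differ.
        List<String> sample = List.of(
            "10:15:01 WARN  ISPN000289: Unable to send X-Site state chunk to LON.",
            "10:15:02 INFO  ISPN100006: Site LON is offline.",
            "10:15:09 WARN  ISPN000289: Unable to send X-Site state chunk to LON.");
        count(sample).forEach((id, n) -> System.out.println(id + " x" + n));
    }
}
```

To scan a real file, you could pass the result of `java.nio.file.Files.readAllLines` for a log file in the server log directory to `count`.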