Chapter 9. Monitoring
Monitoring data allows you to track the performance and health of AMQ Streams. You can configure your deployment to capture metrics data for analysis and notifications.
Metrics data is useful when investigating issues with connectivity and data delivery. For example, metrics data can identify under-replicated partitions or the rate at which messages are consumed. Alerting rules can provide time-critical notifications on such metrics through a specified communications channel. Monitoring visualizations present real-time metrics data to help determine when and how to update the configuration of your deployment. Example metrics configuration files are provided with AMQ Streams.
Distributed tracing complements the gathering of metrics data by providing a facility for end-to-end tracking of messages through AMQ Streams.
Cruise Control provides support for rebalancing of Kafka clusters, based on workload data.
Metrics and monitoring tools
AMQ Streams can employ the following tools for metrics and monitoring:
- Prometheus
- Prometheus pulls metrics from Kafka, ZooKeeper and Kafka Connect clusters. The Prometheus Alertmanager plugin handles alerts and routes them to a notification service.
- Kafka Exporter
- Kafka Exporter exposes additional Prometheus metrics, such as those related to offsets, consumer groups, consumer lag, and topics.
- Grafana
- Grafana provides dashboard visualizations of Prometheus metrics.
- Jaeger
- Jaeger provides distributed tracing support to track transactions between applications.
- Cruise Control
- Cruise Control monitors data distribution and performs data rebalances across a Kafka cluster.
9.1. Prometheus
Prometheus can extract metrics data from Kafka components and the AMQ Streams Operators.
To use Prometheus to obtain metrics data and provide alerts, Prometheus and the Prometheus Alertmanager plugin must be deployed. Kafka resources must also be deployed or redeployed with metrics configuration to expose the metrics data.
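As a minimal sketch of what such metrics configuration might look like, the following fragment of a Kafka resource enables the Prometheus JMX Exporter for the Kafka brokers. The cluster name, the ConfigMap name, and the key are assumptions for illustration; they should match whatever metrics ConfigMap you actually deploy. The fragment is not a complete resource.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                # hypothetical cluster name
spec:
  kafka:
    # replicas, listeners, storage, and other broker settings omitted
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics             # assumed ConfigMap holding the exporter rules
          key: kafka-metrics-config.yml   # assumed key within that ConfigMap
```

Other components that support metrics are configured in a similar way through a metricsConfig property in their own part of the specification.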
Prometheus scrapes the exposed metrics data for monitoring. Alertmanager issues alerts when conditions indicate potential problems, based on pre-defined alerting rules.
Sample metrics and alerting rules configuration files are provided with AMQ Streams. The sample alerting mechanism provided with AMQ Streams is configured to send notifications to a Slack channel.
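To illustrate the kind of alerting rule described above, here is a hedged sketch of a rule that fires when under-replicated partitions are detected. It assumes that Prometheus is managed by the Prometheus Operator (hence the PrometheusRule resource) and that the metrics configuration exposes a metric named kafka_server_replicamanager_underreplicatedpartitions. The resource name, labels, and threshold are illustrative and may differ from the sample files.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alert-rules         # hypothetical name
  labels:
    role: alert-rules             # assumed label selected by the Prometheus instance
spec:
  groups:
    - name: kafka
      rules:
        - alert: UnderReplicatedPartitions
          # Metric name depends on the JMX Exporter rules in your metrics configuration
          expr: kafka_server_replicamanager_underreplicatedpartitions > 0
          for: 10s
          labels:
            severity: warning
          annotations:
            summary: "Kafka under-replicated partitions"
            description: "There are {{ $value }} under-replicated partitions"
```

Routing the resulting alert to a notification service, such as a Slack channel, is handled separately in the Alertmanager configuration.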
9.2. Grafana
Grafana uses the metrics data exposed by Prometheus to present dashboard visualizations for monitoring.
A deployment of Grafana is required, with Prometheus added as a data source. Example dashboards, supplied with AMQ Streams as JSON files, are imported through the Grafana interface to present monitoring data.
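As an alternative to adding the data source through the Grafana interface, Grafana also supports provisioning data sources from a file. The following is a minimal sketch of such a file; the data source name and the Prometheus URL are assumptions and must be adapted to the service name and port of your Prometheus deployment.

```yaml
apiVersion: 1
datasources:
  - name: Prometheus              # display name in Grafana (illustrative)
    type: prometheus
    access: proxy                 # Grafana server proxies requests to Prometheus
    url: http://prometheus-operated:9090   # assumed in-cluster Prometheus service URL
    isDefault: true
```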
9.3. Kafka Exporter
Kafka Exporter is an open source project to enhance monitoring of Apache Kafka brokers and clients. Kafka Exporter is deployed with a Kafka cluster to extract additional Prometheus metrics data from Kafka brokers related to offsets, consumer groups, consumer lag, and topics. You can use the Grafana dashboard provided to visualize the data collected by Prometheus from Kafka Exporter.
A sample configuration file, alerting rules and Grafana dashboard for Kafka Exporter are provided with AMQ Streams.
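As a rough illustration, Kafka Exporter is typically enabled by adding a kafkaExporter section to the Kafka resource. The fragment below is a sketch only: the cluster name is hypothetical, and the regular expressions simply select all topics and consumer groups.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                # hypothetical cluster name
spec:
  # kafka, zookeeper, and entityOperator configuration omitted
  kafkaExporter:
    topicRegex: ".*"              # export metrics for all topics
    groupRegex: ".*"              # export metrics for all consumer groups
```

Narrowing the regular expressions reduces the volume of metrics collected, which can help on clusters with many topics or consumer groups.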
9.4. Distributed tracing
Distributed tracing tracks the progress of transactions between applications in a distributed system. In a microservices architecture, tracing tracks the progress of transactions between services. Trace data is useful for monitoring application performance and investigating issues with target systems and end-user applications.
In AMQ Streams, tracing facilitates the end-to-end tracking of messages: from source systems to Kafka, and then from Kafka to target systems and applications. Distributed tracing complements the monitoring of metrics in Grafana dashboards, as well as the component loggers.
Support for tracing is built in to the following Kafka components:
- MirrorMaker to trace messages from a source cluster to a target cluster
- Kafka Connect to trace messages consumed and produced by Kafka Connect
- Kafka Bridge to trace messages between Kafka and HTTP client applications
Tracing is not supported for Kafka brokers.
You enable and configure tracing for these components through their custom resources. You add tracing configuration using spec.template properties.
You enable tracing by specifying a tracing type using the spec.tracing.type property:
- opentelemetry
- Specify type: opentelemetry to use OpenTelemetry. By default, OpenTelemetry uses the OTLP (OpenTelemetry Protocol) exporter and endpoint to get trace data. You can specify other tracing systems supported by OpenTelemetry, including Jaeger tracing. To do this, you change the OpenTelemetry exporter and endpoint in the tracing configuration.
- jaeger
- Specify type: jaeger to use OpenTracing and the Jaeger client to get trace data.
Support for type: jaeger tracing is deprecated. The Jaeger clients are now retired and the OpenTracing project archived. As such, we cannot guarantee their support for future Kafka versions. If possible, we will maintain the support for type: jaeger tracing until June 2023 and remove it afterwards. Please migrate to OpenTelemetry as soon as possible.
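The following sketch shows how the spec.tracing.type and spec.template properties might be combined to enable OpenTelemetry tracing for Kafka Connect. The resource name, the service name, and the endpoint URL are assumptions for illustration; the environment variables are standard OpenTelemetry SDK settings.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect                # hypothetical resource name
spec:
  # replicas, bootstrapServers, and other Kafka Connect settings omitted
  template:
    connectContainer:
      env:
        - name: OTEL_SERVICE_NAME
          value: my-otel-service              # assumed service name shown in traces
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://otlp-host:4317"      # assumed OTLP collector endpoint
  tracing:
    type: opentelemetry
```

MirrorMaker and Kafka Bridge resources take the same tracing property, with the environment variables set on the corresponding container template for that component.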
Tracing for Kafka clients
Client applications, such as Kafka producers and consumers, can also be set up so that transactions are monitored. Clients are configured with a tracing profile, and a tracer is initialized for the client application to use.
9.5. Cruise Control
Cruise Control is an open source system that supports the following Kafka operations:
- Monitoring cluster workload
- Rebalancing a cluster based on predefined constraints
The operations help with running a more balanced Kafka cluster that uses broker pods more efficiently.
A typical cluster can become unevenly loaded over time. Partitions that handle large amounts of message traffic might not be evenly distributed across the available brokers. To rebalance the cluster, administrators must monitor the load on brokers and manually reassign busy partitions to brokers with spare capacity.
Cruise Control automates the cluster rebalancing process. It constructs a workload model of resource utilization for the cluster, based on CPU, disk, and network load, and generates optimization proposals for more balanced partition assignments, which you can approve or reject. A set of configurable optimization goals is used to calculate these proposals.
You can generate optimization proposals in specific modes. The default full mode rebalances partitions across all brokers. You can also use the add-brokers and remove-brokers modes to accommodate changes when scaling a cluster up or down.
When you approve an optimization proposal, Cruise Control applies it to your Kafka cluster. You configure and generate optimization proposals using a KafkaRebalance resource. You can configure the resource using an annotation so that optimization proposals are approved automatically or manually.
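As a hedged sketch of the configuration described above, the following KafkaRebalance resource requests a proposal in add-brokers mode after a scale-up and opts in to automatic approval through an annotation. The resource name, cluster name, and broker IDs are hypothetical, and the annotation name reflects my understanding of the underlying Strimzi API, so verify it against your version.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance                          # hypothetical resource name
  labels:
    strimzi.io/cluster: my-cluster            # hypothetical name of the Kafka cluster to rebalance
  annotations:
    strimzi.io/rebalance-auto-approval: "true"   # assumed annotation for automatic approval
spec:
  mode: add-brokers                           # omit to use the default full mode
  brokers: [3, 4]                             # hypothetical IDs of the newly added brokers
```

Without the auto-approval annotation, the proposal waits for manual approval, which is typically given by annotating the resource with strimzi.io/rebalance=approve once you have reviewed the proposal.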
Prometheus can extract Cruise Control metrics data, including data related to optimization proposals and rebalancing operations. A sample configuration file and Grafana dashboard for Cruise Control are provided with AMQ Streams.
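A minimal sketch of how Cruise Control metrics might be exposed for Prometheus follows. It assumes that Cruise Control is enabled through the cruiseControl section of the Kafka resource; the cluster name, ConfigMap name, and key are illustrative and should match the metrics ConfigMap you deploy.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                # hypothetical cluster name
spec:
  # kafka, zookeeper, and entityOperator configuration omitted
  cruiseControl:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: cruise-control-metrics   # assumed ConfigMap holding the exporter rules
          key: metrics-config.yml        # assumed key within that ConfigMap
```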