Chapter 21. Introducing metrics
Collecting metrics is essential for understanding the health and performance of your Kafka deployment. By monitoring metrics, you can actively identify issues before they become critical and make informed decisions about resource allocation and capacity planning. Without metrics, you may be left with limited visibility into the behavior of your Kafka deployment, which can make troubleshooting more difficult and time-consuming. Setting up metrics can save you time and resources, and help ensure the reliability of your Kafka deployment.
21.1. Monitoring using the Metrics Reporter (technology preview)
This feature is a technology preview and not intended for a production environment. For more information see the release notes.
The Streams for Apache Kafka Metrics Reporter exposes Kafka metrics directly over HTTP in a Prometheus-compatible format. It integrates with Kafka brokers, clients, Kafka Connect, MirrorMaker 2, and Kafka Streams applications. The reporter enables consistent, low-overhead metrics collection and simplifies integration with monitoring systems.
21.1.1. Installing the Metrics Reporter
Streams for Apache Kafka includes the Metrics Reporter to support Prometheus-compatible metrics collection. The reporter is bundled with the distribution and can be enabled by configuration of Kafka components.
This procedure shows how to get the reporter up and running.
Procedure
1. Confirm that the Metrics Reporter JAR files are included in the $KAFKA_HOME/libs/ directory.
   The reporter is provided as part of the amq-streams-<version>-kafka-bin.zip distribution archive.
2. Check that the Kafka component’s classpath includes these reporter libraries.
   This usually happens automatically when you launch Kafka from the Streams for Apache Kafka distribution. If not, you can set it manually:
   export CLASSPATH="$CLASSPATH:$KAFKA_HOME/libs/*"
3. Configure the Kafka component to use the reporter.
   You’ll find all the necessary configuration properties in Section 21.1.2, “Using the Metrics Reporter”.
21.1.2. Using the Metrics Reporter
After installing the Streams for Apache Kafka Metrics Reporter, configure your Kafka components to use the reporter. Metrics are exposed on an HTTP endpoint in Prometheus format and can be scraped by Prometheus.
Procedure
Add the metrics reporter to the component configuration.
The required configuration depends on the type of Kafka component.
Kafka brokers
Add the following properties to the Kafka broker configuration file.
metric.reporters = io.strimzi.kafka.metrics.KafkaPrometheusMetricsReporter
kafka.metrics.reporters = io.strimzi.kafka.metrics.YammerPrometheusMetricsReporter

This configuration enables the reporter and allows collection of internal broker metrics via Yammer.
Kafka clients (producers, consumers, admin clients)
Add the following properties to the client configuration:
metric.reporters = io.strimzi.kafka.metrics.KafkaPrometheusMetricsReporter
Kafka Connect and Kafka Streams
Add the following properties to the Kafka Connect runtime or Streams application configuration:
metric.reporters = io.strimzi.kafka.metrics.KafkaPrometheusMetricsReporter
admin.metric.reporters = io.strimzi.kafka.metrics.KafkaPrometheusMetricsReporter
producer.metric.reporters = io.strimzi.kafka.metrics.KafkaPrometheusMetricsReporter
consumer.metric.reporters = io.strimzi.kafka.metrics.KafkaPrometheusMetricsReporter

The same reporter configuration must be applied to the admin, producer, and consumer clients used by Kafka Connect and Kafka Streams.
MirrorMaker 2 connectors
Add the following properties to the configuration of a MirrorMaker 2 connector, such as MirrorSourceConnector.

This enables metrics collection for MirrorMaker 2 connectors such as the source and checkpoint connectors. MirrorMaker 2 runs on Kafka Connect, so its metrics are exposed through the same HTTP listener and endpoint used by Kafka Connect, using a shared metrics registry. By default, this endpoint is http://localhost:8080/metrics, but it can be changed using the prometheus.metrics.reporter.listener property. Setting prometheus.metrics.reporter.listener.enable to false automatically routes metrics through the Kafka Connect listener, eliminating the need for a separate listener for MirrorMaker.

Tip: In distributed mode, you must provide the connector configuration as JSON through the Kafka Connect REST API. In standalone mode, you can also define the configuration using a properties file.
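In distributed mode, the connector configuration is submitted as JSON through the Kafka Connect REST API. A minimal, illustrative sketch of such a request body follows; the connector name, cluster aliases, and other values are hypothetical placeholders, while the reporter properties are the ones described above:

```json
{
  "name": "mirror-source-connector",
  "config": {
    "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
    "source.cluster.alias": "source-cluster",
    "target.cluster.alias": "target-cluster",
    "metric.reporters": "io.strimzi.kafka.metrics.KafkaPrometheusMetricsReporter",
    "prometheus.metrics.reporter.listener.enable": "false"
  }
}
```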
(Optional) Configure a non-default listener for the metrics endpoint.
By default, metrics are exposed at http://localhost:8080/metrics.
To use a different listener, set the prometheus.metrics.reporter.listener property to the required http://[host]:[port] value:

prometheus.metrics.reporter.listener = http://:8081
For Kafka Connect and Kafka Streams, you must set the same property for the admin, producer, and consumer:
admin.prometheus.metrics.reporter.listener = http://:8081
producer.prometheus.metrics.reporter.listener = http://:8081
consumer.prometheus.metrics.reporter.listener = http://:8081
Note: To stop exposing metrics over HTTP, set prometheus.metrics.reporter.listener.enable = false.

Start the Kafka component using the usual command or deployment method.
Verify that metrics are exposed at the configured endpoint.
Open the URL in a browser or use a curl command:
curl http://localhost:8080/metrics
(Optional) Control which metrics are exposed.
By default, the reporter exposes all available metrics.
To limit the metrics collected, set the prometheus.metrics.reporter.allowlist property with a comma-separated list of regular expressions that match Prometheus metric names:

prometheus.metrics.reporter.allowlist = kafka_log_.*,kafka_server_brokertopicmetrics_bytesin_total

Prometheus metric names are lowercase and use underscores.
For Kafka Connect and Kafka Streams, set the property separately for the admin, producer, and consumer:
admin.prometheus.metrics.reporter.allowlist = ...
producer.prometheus.metrics.reporter.allowlist = ...
consumer.prometheus.metrics.reporter.allowlist = ...

Use this setting to reduce the volume of metrics or to expose only specific metrics.
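The allowlist is a set of regular expressions matched against the full Prometheus metric names. As an illustrative sketch (not the reporter’s actual implementation), the filtering behaves like this:

```python
import re

# Allowlist as it would appear in prometheus.metrics.reporter.allowlist
allowlist = ["kafka_log_.*", "kafka_server_brokertopicmetrics_bytesin_total"]

# Each expression must match the whole metric name
pattern = re.compile("|".join(f"(?:{p})" for p in allowlist))

metrics = [
    "kafka_log_log_size",
    "kafka_server_brokertopicmetrics_bytesin_total",
    "kafka_server_brokertopicmetrics_bytesout_total",
]
exposed = [m for m in metrics if pattern.fullmatch(m)]
print(exposed)  # the bytesout metric is filtered out
```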
Configure Prometheus to scrape the metrics.
Example configuration:
scrape_configs:
  - job_name: 'kafka-metrics'
    static_configs:
      - targets: ['localhost:8080']
21.2. Monitoring using JMX metrics
Kafka components use Java Management Extensions (JMX) to share management information through metrics. Kafka employs Managed Beans (MBeans) to supply metric data to monitoring tools and dashboards. JMX operates at the JVM level, allowing external tools to connect and retrieve management information from Kafka components. To connect to the JVM, these tools typically need to run on the same machine and with the same user privileges by default.
21.2.1. Enabling the JMX agent
Enable JMX monitoring of Kafka components using JVM system properties. Use the KAFKA_JMX_OPTS environment variable to set the JMX system properties required for enabling JMX monitoring. The scripts that run the Kafka component use these properties.
Procedure
Set the KAFKA_JMX_OPTS environment variable with the JMX properties for enabling JMX monitoring.

export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=<port> -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"

Replace <port> with the number of the port on which you want the Kafka component to listen for JMX connections.
Add org.apache.kafka.common.metrics.JmxReporter to metric.reporters in the server.properties file.

metric.reporters=org.apache.kafka.common.metrics.JmxReporter
Start the Kafka component using the appropriate script, such as bin/kafka-server-start.sh for a broker or bin/connect-distributed.sh for Kafka Connect.
It is recommended that you configure authentication and SSL to secure a remote JMX connection. For more information about the system properties needed to do this, see the Oracle documentation.
21.2.2. Disabling the JMX agent
Disable JMX monitoring for Kafka components by updating the KAFKA_JMX_OPTS environment variable.
Procedure
Set the KAFKA_JMX_OPTS environment variable to disable JMX monitoring.

export KAFKA_JMX_OPTS=-Dcom.sun.management.jmxremote=false

Note: Other JMX properties, like port, authentication, and SSL properties, do not need to be specified when disabling JMX monitoring.
Set auto.include.jmx.reporter to false in the Kafka server.properties file.

auto.include.jmx.reporter=false

Note: The auto.include.jmx.reporter property is deprecated. From Kafka 4, the JmxReporter is only enabled if org.apache.kafka.common.metrics.JmxReporter is added to the metric.reporters configuration in the properties file.
Start the Kafka component using the appropriate script, such as bin/kafka-server-start.sh for a broker or bin/connect-distributed.sh for Kafka Connect.
21.2.3. Metrics naming conventions
When working with Kafka JMX metrics, it’s important to understand the naming conventions used to identify and retrieve specific metrics. Kafka JMX metrics use the following format:
Metrics format
<metric_group>:type=<type_name>,name=<metric_name>,<other_attribute>=<value>
- <metric_group> is the name of the metric group
- <type_name> is the name of the type of metric
- <metric_name> is the name of the specific metric
- <other_attribute> represents zero or more additional attributes
For example, the BytesInPerSec metric is a BrokerTopicMetrics type in the kafka.server group:
kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec
In some cases, metrics may include the ID of an entity. For instance, when monitoring a specific client, the metric format includes the client ID:
Metrics for a specific client
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=<client_id>
Similarly, a metric can be further narrowed down to a specific client and topic:
Metrics for a specific client and topic
kafka.consumer:type=consumer-fetch-manager-metrics,client-id=<client_id>,topic=<topic_id>
Understanding these naming conventions will allow you to accurately specify the metrics you want to monitor and analyze.
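As an illustrative sketch of the convention (real JMX tooling uses the javax.management.ObjectName class), a metric name can be split into its group and key properties like this:

```python
def parse_jmx_name(name: str):
    """Split a JMX metric name into its metric group and attribute key/value pairs."""
    group, _, props = name.partition(":")
    attrs = dict(kv.split("=", 1) for kv in props.split(","))
    return group, attrs

group, attrs = parse_jmx_name("kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec")
print(group)           # kafka.server
print(attrs["type"])   # BrokerTopicMetrics
print(attrs["name"])   # BytesInPerSec
```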
To view the full list of available JMX metrics for a Streams for Apache Kafka installation, you can use a graphical tool like JConsole. JConsole is a Java monitoring and management console that allows you to monitor and manage Java applications, including Kafka. By connecting to the JVM running the Kafka component using its process ID, you can browse the list of metrics in the tool’s user interface.
21.3. Using Kafka Exporter
Kafka Exporter is an open source project to enhance monitoring of Apache Kafka brokers and clients. Kafka Exporter is provided with Streams for Apache Kafka for deployment with a Kafka cluster to extract additional metrics data from Kafka brokers related to offsets, consumer groups, consumer lag, and topics.
The metrics data is used, for example, to help identify slow consumers.
Lag data is exposed as Prometheus metrics, which can then be presented in Grafana for analysis.
If you are already using Prometheus and Grafana for monitoring of built-in Kafka metrics, you can configure Prometheus to also scrape the Kafka Exporter Prometheus endpoint.
Kafka exposes metrics through JMX, which can then be exported as Prometheus metrics. For more information, see Monitoring your cluster using JMX.
21.3.1. Consumer lag
Consumer lag indicates the difference in the rate of production and consumption of messages. Specifically, consumer lag for a given consumer group indicates the delay between the last message in the partition and the message being currently picked up by that consumer. The lag reflects the position of the consumer offset in relation to the end of the partition log.
This difference is sometimes referred to as the delta between the producer offset and consumer offset, the read and write positions in the Kafka broker topic partitions.
Suppose a topic streams 100 messages a second. A lag of 1000 messages between the producer offset (the topic partition head) and the last offset the consumer has read means a 10-second delay.
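The arithmetic in this example can be sketched as follows (this assumes the consumer can at least match the produce rate while catching up):

```python
def lag_delay_seconds(lag_messages: int, produce_rate_per_sec: float) -> float:
    """Approximate delay implied by a consumer lag, given a steady produce rate."""
    return lag_messages / produce_rate_per_sec

# A topic streaming 100 messages a second with a lag of 1000 messages
print(lag_delay_seconds(1000, 100))  # 10.0 (a 10-second delay)
```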
The importance of monitoring consumer lag
For applications that rely on the processing of (near) real-time data, it is critical to monitor consumer lag to check that it does not become too big. The greater the lag becomes, the further the process moves from the real-time processing objective.
Consumer lag, for example, might be a result of consuming too much old data that has not been purged, or through unplanned shutdowns.
Reducing consumer lag
Typical actions to reduce lag include:
- Scaling up consumer groups by adding new consumers
- Increasing the retention time for a message to remain in a topic
- Adding more disk capacity to increase the message buffer
Actions to reduce consumer lag depend on the underlying infrastructure and the use cases Streams for Apache Kafka is supporting. For instance, a lagging consumer is less likely to benefit from the broker being able to service a fetch request from its disk cache. And in certain cases, it might be acceptable to automatically drop messages until a consumer has caught up.
21.3.2. Kafka Exporter alerting rule examples
The sample alert notification rules specific to Kafka Exporter are as follows:
- UnderReplicatedPartition: An alert to warn that a topic is under-replicated and the broker is not replicating enough partitions. The default configuration is for an alert if there are one or more under-replicated partitions for a topic. The alert might signify that a Kafka instance is down or the Kafka cluster is overloaded. A planned restart of the Kafka broker may be required to restart the replication process.
- TooLargeConsumerGroupLag: An alert to warn that the lag on a consumer group is too large for a specific topic partition. The default configuration is 1000 records. A large lag might indicate that consumers are too slow and are falling behind the producers.
- NoMessageForTooLong: An alert to warn that a topic has not received messages for a period of time. The default configuration for the time period is 10 minutes. The delay might be a result of a configuration issue preventing a producer from publishing messages to the topic.
You can adapt alerting rules according to your specific needs.
21.3.3. Kafka Exporter metrics
Lag information is exposed by Kafka Exporter as Prometheus metrics for presentation in Grafana.
Kafka Exporter exposes metrics data for brokers, topics, and consumer groups.
| Name | Information |
|---|---|
| kafka_brokers | Number of brokers in the Kafka cluster |
| Name | Information |
|---|---|
| kafka_topic_partitions | Number of partitions for a topic |
| kafka_topic_partition_current_offset | Current topic partition offset for a broker |
| kafka_topic_partition_oldest_offset | Oldest topic partition offset for a broker |
| kafka_topic_partition_in_sync_replica | Number of in-sync replicas for a topic partition |
| kafka_topic_partition_leader | Leader broker ID of a topic partition |
| kafka_topic_partition_leader_is_preferred | Shows 1 if a topic partition is using the preferred broker |
| kafka_topic_partition_replicas | Number of replicas for this topic partition |
| kafka_topic_partition_under_replicated_partition | Shows 1 if a topic partition is under-replicated |
| Name | Information |
|---|---|
| kafka_consumergroup_current_offset | Current topic partition offset for a consumer group |
| kafka_consumergroup_lag | Current approximate lag for a consumer group at a topic partition |
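The lag reported for a consumer group at a topic partition is approximately the difference between the partition’s current (head) offset and the group’s committed offset. As an illustrative sketch of that derivation:

```python
def approximate_lag(partition_head_offset: int, group_committed_offset: int) -> int:
    """Approximate consumer group lag for one topic partition."""
    return max(0, partition_head_offset - group_committed_offset)

# Partition head at offset 5000, consumer group committed at offset 4200
print(approximate_lag(5000, 4200))  # 800
```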
21.3.4. Running Kafka Exporter
Run Kafka Exporter to expose Prometheus metrics for presentation in a Grafana dashboard.
Download and install the Kafka Exporter package to use the Kafka Exporter with Streams for Apache Kafka. You need a Streams for Apache Kafka subscription to be able to download and install the package.
Prerequisites
- Streams for Apache Kafka is installed on each host, and the configuration files are available.
- You have a subscription to Streams for Apache Kafka.
This procedure assumes you already have access to a Grafana user interface and Prometheus is deployed and added as a data source.
Procedure
Install the Kafka Exporter package:
dnf install kafka_exporter
Verify the package has installed:
dnf info kafka_exporter
Run the Kafka Exporter using appropriate configuration parameter values:
kafka_exporter --kafka.server=<kafka_bootstrap_address>:9092 --kafka.version=4.1.0 --<my_other_parameters>
The parameters require a double-hyphen (--) convention.
The --kafka.server parameter specifies a hostname and port to connect to a Kafka instance.
The --kafka.version parameter specifies the Kafka version to ensure compatibility.
Use kafka_exporter --help for information on other available parameters.

Configure Prometheus to monitor the Kafka Exporter metrics.
For more information on configuring Prometheus, see the Prometheus documentation.
Enable Grafana to present the Kafka Exporter metrics data exposed by Prometheus.
For more information, see Presenting Kafka Exporter metrics in Grafana.
Updating Kafka Exporter
Use the latest version of Kafka Exporter with your Streams for Apache Kafka installation.
To check for updates, use:
dnf check-update
To update Kafka Exporter, use:
dnf update kafka_exporter
21.3.5. Presenting Kafka Exporter metrics in Grafana
Using Kafka Exporter Prometheus metrics as a data source, you can create a dashboard of Grafana charts.
For example, from the metrics you can create the following Grafana charts:
- Messages in per second (from topics)
- Messages in per minute (from topics)
- Lag by consumer group
- Messages consumed per minute (by consumer groups)
When metrics data has been collected for some time, the Kafka Exporter charts are populated.
Use the Grafana charts to analyze lag and to check if actions to reduce lag are having an impact on an affected consumer group. If, for example, Kafka brokers are adjusted to reduce lag, the dashboard will show the Lag by consumer group chart going down and the Messages consumed per minute chart going up.
21.4. Analyzing Kafka metrics for troubleshooting
Kafka metrics provide essential insights into the performance and health of your brokers and the wider cluster. By analyzing these metrics, you can identify common issues such as high CPU usage, memory pressure, thread contention, or slow request handling. Some metrics can help pinpoint the root cause of performance bottlenecks or operational anomalies.
Metrics also support broader performance monitoring by helping track throughput, latency, availability, and system resource consumption. Analyzing trends over time can assist with capacity planning and performance tuning.
Collecting and visualizing Kafka metrics using tools such as Prometheus and Grafana enables you to monitor changes, detect issues, and respond proactively. Graphing metrics over time also helps establish performance baselines and forecast future resource needs.
The examples in this section use the JMX naming format, but the same metrics are available when using the Streams for Apache Kafka Metrics Reporter. When exposed through the reporter, metric names take the Prometheus format of lowercase letters and underscores. For example, UnderReplicatedPartitions becomes kafka_server_replicamanager_underreplicatedpartitions.
21.4.1. Checking for under-replicated partitions
A balanced Kafka cluster is important for optimal performance. In a balanced cluster, partitions and leaders are evenly distributed across all brokers, and I/O metrics reflect this. As well as using metrics, you can use the kafka-topics.sh tool to get a list of under-replicated partitions and identify the problematic brokers. If the number of under-replicated partitions is fluctuating or many brokers show high request latency, this typically indicates a performance issue in the cluster that requires investigation. On the other hand, a steady (unchanging) number of under-replicated partitions reported by many of the brokers in a cluster normally indicates that one of the brokers in the cluster is offline.
Use the describe --under-replicated-partitions option from the kafka-topics.sh tool to show information about partitions that are currently under-replicated in the cluster. These are the partitions that have fewer replicas than the configured replication factor.
If the output is blank, the Kafka cluster has no under-replicated partitions. Otherwise, the output shows replicas that are not in sync or available.
In the following example, only 2 of the 3 replicas are in sync for each partition, with a replica missing from the ISR (in-sync replica).
Returning information on under-replicated partitions from the command line
bin/kafka-topics.sh --bootstrap-server :9092 --describe --under-replicated-partitions
Topic: topic-1 Partition: 0 Leader: 4 Replicas: 4,2,3 Isr: 4,3
Topic: topic-1 Partition: 1 Leader: 3 Replicas: 2,3,4 Isr: 3,4
Topic: topic-1 Partition: 2 Leader: 3 Replicas: 3,4,2 Isr: 3,4
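As a sketch, the command output shown above can be parsed to spot which replica is missing from the ISR for each partition (assuming the tool’s whitespace-separated "Key: value" layout):

```python
def missing_isr(line: str) -> set[str]:
    """Return broker IDs that are assigned as replicas but absent from the ISR."""
    tokens = line.split()
    fields = dict(zip(tokens[::2], tokens[1::2]))  # pair "Key:" tokens with values
    replicas = set(fields["Replicas:"].split(","))
    isr = set(fields["Isr:"].split(","))
    return replicas - isr

line = "Topic: topic-1 Partition: 0 Leader: 4 Replicas: 4,2,3 Isr: 4,3"
print(missing_isr(line))  # {'2'} - broker 2 has fallen out of the ISR
```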
Here are some metrics to check for I/O and under-replicated partitions:
Metrics to check for under-replicated partitions
- kafka.server:type=ReplicaManager,name=PartitionCount: Total number of partitions across all topics in the cluster.
- kafka.server:type=ReplicaManager,name=LeaderCount: Total number of leaders across all topics in the cluster.
- kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec: Rate of incoming bytes per second for each broker.
- kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec: Rate of outgoing bytes per second for each broker.
- kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions: Number of under-replicated partitions across all topics in the cluster.
- kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount: Number of partitions below the minimum ISR.
If topic configuration is set for high availability, with a replication factor of at least 3 for topics and a minimum number of in-sync replicas being 1 less than the replication factor, under-replicated partitions can still be usable. Conversely, partitions below the minimum ISR have reduced availability. You can monitor these using the kafka.server:type=ReplicaManager,name=UnderMinIsrPartitionCount metric and the under-min-isr-partitions option from the kafka-topics.sh tool.
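This availability reasoning can be sketched as a simple classification (illustrative only; the broker enforces min.insync.replicas for producers using acks=all):

```python
def partition_status(replication_factor: int, isr_count: int, min_isr: int) -> str:
    """Classify a topic partition by its in-sync replica count."""
    if isr_count < min_isr:
        return "below-min-isr"     # reduced availability for acks=all producers
    if isr_count < replication_factor:
        return "under-replicated"  # still usable, but with less redundancy
    return "healthy"

# Replication factor 3, min.insync.replicas 2
print(partition_status(3, 2, 2))  # under-replicated
print(partition_status(3, 1, 2))  # below-min-isr
```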
Use Cruise Control to automate the task of monitoring and rebalancing a Kafka cluster to ensure that the partition load is evenly distributed. For more information, see Chapter 15, Using Cruise Control for cluster rebalancing.
21.4.2. Identifying performance problems in a Kafka cluster
Spikes in cluster metrics may indicate a broker issue, which is often related to slow or failing storage devices or compute restraints from other processes. If there is no issue at the operating system or hardware level, an imbalance in the load of the Kafka cluster is likely, with some partitions receiving disproportionate traffic compared to others in the same Kafka topic.
To anticipate performance problems in a Kafka cluster, it’s useful to monitor the RequestHandlerAvgIdlePercent metric. RequestHandlerAvgIdlePercent provides a good overall indicator of how the cluster is behaving. The value of this metric is between 0 and 1. A value below 0.7 indicates that threads are busy more than 30% of the time and performance is starting to degrade. If the value drops below 0.5, problems are likely to occur, especially if the cluster needs to scale or rebalance. At 0.3, a cluster is barely usable.
Another useful metric is kafka.network:type=Processor,name=IdlePercent, which you can use to monitor the extent (as a percentage) to which network processors in a Kafka cluster are idle. The metric helps identify whether the processors are over or underutilized.
To ensure optimal performance, set the num.io.threads property equal to the number of processors in the system, including hyper-threaded processors. If the cluster is balanced, but a single client has changed its request pattern and is causing issues, reduce the load on the cluster or increase the number of brokers.
It’s important to note that a single disk failure on a single broker can severely impact the performance of an entire cluster. Since producer clients connect to all brokers that lead partitions for a topic, and those partitions are evenly spread over the entire cluster, a poorly performing broker will slow down produce requests and cause back pressure in the producers, slowing down requests to all brokers. A RAID (Redundant Array of Inexpensive Disks) storage configuration that combines multiple physical disk drives into a single logical unit can help prevent this issue.
Here are some metrics to check the performance of a Kafka cluster:
Metrics to check the performance of a Kafka cluster
- kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent: Average idle percentage of the request handler threads in the Kafka broker’s thread pool. The OneMinuteRate and FifteenMinuteRate attributes show the request rate of the last one minute and fifteen minutes, respectively.
- kafka.server:type=socket-server-metrics,listener=<listener_name>,networkProcessor=<processor_id>: Rate at which new connections are being created on a specific network processor of a specific listener in the Kafka broker. The listener attribute refers to the name of the listener, and the networkProcessor attribute refers to the ID of the network processor. The connection-creation-rate attribute shows the rate of connection creation in connections per second.
- kafka.network:type=RequestChannel,name=RequestQueueSize: Current size of the request queue.
- kafka.network:type=RequestChannel,name=ResponseQueueSize: Current size of the response queue.
- kafka.network:type=Processor,name=IdlePercent,networkProcessor=<processor_id>: Percentage of time the specified network processor is idle. The networkProcessor attribute specifies the ID of the network processor to monitor.
- kafka.server:type=KafkaServer,name=linux-disk-read-bytes: Total number of bytes read from disk by a Kafka server.
- kafka.server:type=KafkaServer,name=linux-disk-write-bytes: Total number of bytes written to disk by a Kafka server.
21.4.3. Identifying performance problems with a Kafka controller
The Kafka controller is responsible for managing the overall state of the cluster, such as broker registration, partition reassignment, and topic management. Problems with the controller in the Kafka cluster are difficult to diagnose and often fall into the category of bugs in Kafka itself. Controller issues might manifest as broker metadata being out of sync, offline replicas when the brokers appear to be fine, or actions on topics like topic creation not happening correctly.
There are not many ways to monitor the controller, but you can monitor the active controller count and the controller queue size. Monitoring these metrics gives a high-level indicator if there is a problem. Although spikes in the queue size are expected, if this value continuously increases, or stays steady at a high value and does not drop, it indicates that the controller may be stuck. If you encounter this problem, you can move the controller to a different broker, which requires shutting down the broker that is currently the controller.
Here are some metrics to check the performance of a Kafka controller:
Metrics to check the performance of a Kafka controller
kafka.controller:type=KafkaController,name=ActiveControllerCount
kafka.controller:type=KafkaController,name=OfflinePartitionsCount
kafka.controller:type=ControllerEventManager,name=EventQueueSize
- 1: Number of active controllers in the Kafka cluster. A value of 1 indicates that there is only one active controller, which is the desired state.
- 2: Number of partitions that are currently offline. If this value is continuously increasing or stays at a high value, there may be a problem with the controller.
- 3: Size of the event queue in the controller. Events are actions that must be performed by the controller, such as creating a new topic or moving a partition to a new broker. If the value continuously increases or stays at a high value, the controller may be stuck and unable to perform the required actions.
21.4.4. Identifying problems with requests
You can use the RequestHandlerAvgIdlePercent metric to determine if requests are slow. Additionally, request metrics can identify which specific requests are experiencing delays and other issues.
To effectively monitor Kafka requests, it is crucial to collect two key metrics: count and 99th percentile latency, also known as tail latency.
The count metric represents the number of requests processed within a specific time interval. It provides insights into the volume of requests handled by your Kafka cluster and helps identify spikes or drops in traffic.
The 99th percentile latency metric measures the request latency, which is the time taken for a request to be processed. It represents the duration within which 99% of requests are handled. However, it does not provide information about the exact duration for the remaining 1% of requests. In other words, the 99th percentile latency metric tells you that 99% of the requests are handled within a certain duration, and the remaining 1% may take even longer, but the precise duration for this remaining 1% is not known. The choice of the 99th percentile is primarily to focus on the majority of requests and exclude outliers that can skew the results.
This metric is particularly useful for identifying performance issues and bottlenecks related to the majority of requests, but it does not give a complete picture of the maximum latency experienced by a small fraction of requests.
By collecting and analyzing both count and 99th percentile latency metrics, you can gain an understanding of the overall performance and health of your Kafka cluster, as well as the latency of the requests being processed.
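To make the tail-latency idea concrete, the following sketch computes the count and 99th percentile over a window of request latencies using the nearest-rank method. The function name and the sample values are invented for illustration:

```python
# Compute request count and 99th percentile (tail) latency for a window
# of observed request latencies. The sample values are invented.
def tail_latency(latencies_ms, percentile=99):
    """Return (count, smallest latency covering `percentile`% of requests)."""
    ordered = sorted(latencies_ms)
    count = len(ordered)
    # Nearest-rank method: the value at rank ceil(count * percentile / 100).
    rank = -(-count * percentile // 100)  # integer ceiling division
    return count, ordered[max(0, rank - 1)]

# 99 fast requests plus one slow outlier:
window = [5] * 99 + [250]
count, p99 = tail_latency(window)
print(count, p99)  # 100 5
```

Note that the single 250 ms outlier is invisible in the 99th percentile value; this is exactly the blind spot described above for the slowest 1% of requests.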
Here are some metrics to check the performance of Kafka requests:
Metrics to check the performance of requests
- 1
- Request types (such as Produce, FetchConsumer, and FetchFollower), carried in the request tag, which break down the request metrics.
- 2
- (RequestsPerSec) Rate at which requests are being processed by the Kafka broker per second.
- 3
- (RequestQueueTimeMs) Time (in milliseconds) that a request spends waiting in the broker’s request queue before being processed.
- 4
- (TotalTimeMs) Total time (in milliseconds) that a request takes to complete, from the time it is received by the broker to the time the response is sent back to the client.
- 5
- (LocalTimeMs) Time (in milliseconds) that a request spends being processed by the broker on the local machine.
- 6
- (RemoteTimeMs) Time (in milliseconds) that a request spends being processed by other brokers in the cluster.
- 7
- (ThrottleTimeMs) Time (in milliseconds) that a request spends being throttled by the broker. Throttling occurs when the broker determines that a client is sending too many requests too quickly and needs to be slowed down.
- 8
- (ResponseQueueTimeMs) Time (in milliseconds) that a response spends waiting in the broker’s response queue before being sent back to the client.
- 9
- (ResponseSendTimeMs) Time (in milliseconds) that a response takes to be sent back to the client after it has been generated by the broker.
- 10
- For all of the request metrics, the Count and 99thPercentile attributes show the total number of requests that have been processed and the latency below which 99% of requests complete, respectively.
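Kafka exposes these request metrics as JMX MBeans with object names such as kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce. As a small illustration of how the per-request-type breakdown works, a few lines of Python can split such an object name into its domain and key properties for use in a monitoring script (the helper function is hypothetical):

```python
# Parse a Kafka JMX object name, such as the request metrics above,
# into its domain and key properties.
def parse_object_name(object_name):
    domain, _, props = object_name.partition(":")
    properties = dict(p.split("=", 1) for p in props.split(","))
    return domain, properties

domain, props = parse_object_name(
    "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce"
)
print(domain)            # kafka.network
print(props["request"])  # Produce
```

Grouping collected values by the request property lets you see which request type (for example, Produce versus fetch requests) is responsible for a latency spike.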
21.4.5. Using metrics to check the performance of clients
By analyzing client metrics, you can monitor the performance of the Kafka clients (producers and consumers) connected to a broker. This can help identify issues highlighted in broker logs, such as consumers being frequently kicked off their consumer groups, high request failure rates, or frequent disconnections.
Here are some metrics to check the performance of Kafka clients:
Metrics to check the performance of client requests
- 1
- (Consumer) Average and maximum time between poll requests, which can help determine if the consumers are polling for messages frequently enough to keep up with the message flow. The time-between-poll-avg and time-between-poll-max attributes show the average and maximum time in milliseconds between successive polls by a consumer, respectively.
- 2
- (Consumer) Metrics to monitor the coordination process between Kafka consumers and the broker coordinator. Attributes relate to the heartbeat, join, and rebalance process.
- 3
- (Producer) Metrics to monitor the performance of Kafka producers. Attributes relate to buffer usage, request latency, in-flight requests, transactional processing, and record handling.
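For instance, the time-between-poll values can be reproduced from raw poll timestamps. The following sketch (with invented sample timestamps and a hypothetical helper name) shows the calculation:

```python
# Derive time-between-poll style values from a list of poll timestamps
# (in seconds). The timestamps are invented sample data.
def poll_gaps(poll_times):
    """Return (average, maximum) gap between successive polls."""
    gaps = [b - a for a, b in zip(poll_times, poll_times[1:])]
    return sum(gaps) / len(gaps), max(gaps)

avg_gap, max_gap = poll_gaps([0.0, 1.0, 2.0, 5.0])
print(avg_gap, max_gap)  # 1.6666666666666667 3.0
```

A maximum gap that approaches the consumer's max.poll.interval.ms setting is a warning sign: the consumer risks being kicked out of its consumer group.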
21.4.6. Using metrics to check the performance of topics and partitions
Metrics for topics and partitions can also be helpful in diagnosing issues in a Kafka cluster. They are useful for debugging issues with a specific client when you are unable to collect client metrics.
Here are some metrics to check the performance of a specific topic and partition:
Metrics to check the performance of topics and partitions
- 1
- (BytesInPerSec) Rate of incoming bytes per second for a specific topic.
- 2
- (BytesOutPerSec) Rate of outgoing bytes per second for a specific topic.
- 3
- (FailedFetchRequestsPerSec) Rate of fetch requests that failed per second for a specific topic.
- 4
- (FailedProduceRequestsPerSec) Rate of produce requests that failed per second for a specific topic.
- 5
- (MessagesInPerSec) Incoming message rate per second for a specific topic.
- 6
- (TotalFetchRequestsPerSec) Total rate of fetch requests (successful and failed) per second for a specific topic.
- 7
- (TotalProduceRequestsPerSec) Total rate of produce requests (successful and failed) per second for a specific topic.
- 8
- (Size) Size of a specific partition’s log in bytes.
- 9
- (NumLogSegments) Number of log segments in a specific partition.
- 10
- (LogEndOffset) Offset of the last message in a specific partition’s log.
- 11
- (LogStartOffset) Offset of the first message in a specific partition’s log.
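As a worked example of how the offset metrics combine, the difference between the log end offset and the log start offset approximates the number of records currently retained in the partition. The helper name and offset values below are invented, and log compaction or transaction markers can make the figure approximate:

```python
# Estimate the number of records currently retained in a partition from
# its log start and end offsets. Offset values are invented sample data.
def retained_records(log_start_offset, log_end_offset):
    # The end offset points just past the last written record, so the
    # difference approximates the retained record count.
    return log_end_offset - log_start_offset

print(retained_records(1200, 4500))  # 3300
```

Watching this difference over time shows whether retention is reclaiming space as expected: a value that only grows may indicate retention is misconfigured for the topic.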