Chapter 4. Kafka consumer configuration tuning
Use a basic consumer configuration with optional properties that are tailored to specific use cases.
When tuning your consumers, your primary concern is ensuring that they cope efficiently with the amount of data ingested. As with producer tuning, be prepared to make incremental changes until the consumers operate as expected.
4.1. Basic consumer configuration
Connection and deserializer properties are required for every consumer. Generally, it is good practice to add a client id for tracking.
In a consumer configuration, irrespective of any subsequent configuration:
- The consumer fetches from a given offset and consumes the messages in order, unless the offset is changed to skip or re-read messages.
- The broker does not know if the consumer processed the responses, even when committing offsets to Kafka, because the offsets might be sent to a different broker in the cluster.
Basic consumer configuration properties
# ...
bootstrap.servers=localhost:9092 1
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer 2
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer 3
client.id=my-client 4
group.id=my-group-id 5
# ...
1. (Required) Tells the consumer to connect to a Kafka cluster using a host:port bootstrap server address for a Kafka broker. The consumer uses the address to discover and connect to all brokers in the cluster. Use a comma-separated list to specify two or three addresses in case a server is down, but it is not necessary to provide a list of all the brokers in the cluster. If you are using a load balancer service to expose the Kafka cluster, you only need the address for the service because the availability is handled by the load balancer.
2. (Required) Deserializer to transform the bytes fetched from the Kafka broker into message keys.
3. (Required) Deserializer to transform the bytes fetched from the Kafka broker into message values.
4. (Optional) The logical name for the client, which is used in logs and metrics to identify the source of a request. The id can also be used to throttle consumers based on processing time quotas.
5. (Conditional) A group id is required for a consumer to be able to join a consumer group.
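For reference, the following is a minimal Java sketch of a consumer built from the basic configuration above. The connection and deserializer properties match the example; the topic name my-topic is an assumption for illustration only.

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BasicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("client.id", "my-client");
        props.put("group.id", "my-group-id");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // "my-topic" is an assumed topic name
            while (true) {
                // fetch a batch of records; ordering is preserved within each partition
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}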
4.2. Scaling data consumption using consumer groups
Consumer groups share a typically large data stream generated by one or multiple producers from a given topic. Consumers are grouped using a group.id property, allowing messages to be spread across the members. One of the consumers in the group is elected leader and decides how the partitions are assigned to the consumers in the group. Each partition can only be assigned to a single consumer.
If you do not already have as many consumers as partitions, you can scale data consumption by adding more consumer instances with the same group.id. Adding more consumers to a group than there are partitions will not help throughput, but it does mean that there are consumers on standby should one stop functioning. If you can meet throughput goals with fewer consumers, you save on resources.
Consumers within the same consumer group send offset commits and heartbeats to the same broker, which acts as the group coordinator. So the greater the number of consumers in the group, the higher the request load on that broker.
# ...
group.id=my-group-id 1
# ...
1. Add a consumer to a consumer group using a group id.
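To illustrate scaling by adding instances, the following Java sketch starts three consumers with the same group.id so that the group coordinator spreads the topic's partitions across them. The topic name my-topic and the thread pool setup are assumptions for illustration.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerGroupExample {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        for (int i = 0; i < 3; i++) {
            final int id = i;
            pool.submit(() -> {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");
                props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
                props.put("group.id", "my-group-id");      // same group id for every instance
                props.put("client.id", "my-client-" + id); // distinct client id per instance
                // KafkaConsumer is not thread-safe, so each thread gets its own instance
                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(List.of("my-topic")); // assumed topic name
                    while (true) {
                        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                        records.forEach(r -> System.out.printf("consumer-%d got offset %d from partition %d%n",
                                id, r.offset(), r.partition()));
                    }
                }
            });
        }
    }
}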
4.3. Message ordering guarantees
Kafka brokers receive fetch requests from consumers that ask the broker to send messages from a list of topics, partitions and offset positions.
A consumer observes messages in a single partition in the same order that they were committed to the broker, which means that Kafka only provides ordering guarantees for messages in a single partition. Conversely, if a consumer is consuming messages from multiple partitions, the order of messages in different partitions as observed by the consumer does not necessarily reflect the order in which they were sent.
If you want a strict ordering of all messages from a topic, use a topic with a single partition, which also means at most one active consumer in the group.
4.4. Optimizing consumers for throughput and latency
Control the number of messages returned when your client application calls KafkaConsumer.poll().
Use the fetch.max.wait.ms and fetch.min.bytes properties to increase the minimum amount of data fetched by the consumer from the Kafka broker. Time-based batching is configured using fetch.max.wait.ms, and size-based batching is configured using fetch.min.bytes.
If CPU utilization in the consumer or broker is high, it might be because there are too many requests from the consumer. You can set the fetch.max.wait.ms and fetch.min.bytes properties higher so that there are fewer requests and messages are delivered in bigger batches. Setting them higher improves throughput at some cost to latency. You can also set them higher if the amount of data being produced is low.
For example, if you set fetch.max.wait.ms to 500 ms and fetch.min.bytes to 16384 bytes, when Kafka receives a fetch request from the consumer it responds as soon as the first of the two thresholds is reached.
Conversely, you can set the fetch.max.wait.ms and fetch.min.bytes properties lower to improve end-to-end latency.
# ...
fetch.max.wait.ms=500 1
fetch.min.bytes=16384 2
# ...
1. The maximum time in milliseconds the broker waits before completing a fetch request. The default is 500 milliseconds.
2. If a minimum batch size in bytes is set, a request is sent when the minimum is reached, or messages have been queued for longer than fetch.max.wait.ms (whichever comes sooner). Adding the delay allows batches to accumulate messages up to the batch size.
Lowering latency by increasing the fetch request size
Use the fetch.max.bytes and max.partition.fetch.bytes properties to increase the maximum amount of data fetched by the consumer from the Kafka broker.
The fetch.max.bytes property sets a maximum limit in bytes on the amount of data fetched from the broker at one time.
The max.partition.fetch.bytes property sets a maximum limit in bytes on how much data is returned for each partition, which must always be larger than the maximum message size allowed by the broker configuration (message.max.bytes) or topic configuration (max.message.bytes).
The maximum amount of memory a client can consume is calculated approximately as:
NUMBER-OF-BROKERS * fetch.max.bytes and NUMBER-OF-PARTITIONS * max.partition.fetch.bytes
If memory usage can accommodate it, you can increase the values of these two properties. By allowing more data in each request, latency is improved as there are fewer fetch requests.
# ...
fetch.max.bytes=52428800 1
max.partition.fetch.bytes=1048576 2
# ...
1. The maximum amount of data in bytes returned for a fetch request.
2. The maximum amount of data in bytes returned for each partition.
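As a worked example of the memory calculation, assume a cluster of 3 brokers and a consumer assigned 50 partitions (both assumed figures): the values above give approximate upper bounds of 3 * 52428800 bytes (150 MiB) buffered across brokers and 50 * 1048576 bytes (50 MiB) buffered across partitions.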
4.5. Avoiding data loss or duplication when committing offsets
The Kafka auto-commit mechanism allows a consumer to commit the offsets of messages automatically. If enabled, the consumer commits offsets received from polling the broker at 5000 ms intervals, the default set by the auto.commit.interval.ms property.
The auto-commit mechanism is convenient, but it introduces a risk of data loss and duplication. If a consumer has fetched and transformed a number of messages, but the system crashes with unprocessed messages still in the consumer buffer after an auto-commit, that data is lost. If the system crashes after processing the messages, but before performing the auto-commit, the data is duplicated on another consumer instance after rebalancing.
Auto-committing can avoid data loss only when all messages are processed before the next poll to the broker, or the consumer closes.
To minimize the likelihood of data loss or duplication, you can set enable.auto.commit to false and develop your client application to have more control over committing offsets. Or you can use auto.commit.interval.ms to decrease the intervals between commits.
# ...
enable.auto.commit=false 1
# ...
1. Auto commit is set to false to provide more control over committing offsets.
By setting enable.auto.commit to false, you can commit offsets after all processing has been performed and the message has been consumed. For example, you can set up your application to call the Kafka commitSync and commitAsync commit APIs.
The commitSync API commits the offsets in a message batch returned from polling. You call the API when you are finished processing all the messages in the batch. If you use the commitSync API, the application will not poll for new messages until the last offset in the batch is committed. If this negatively affects throughput, you can commit less frequently, or you can use the commitAsync API. The commitAsync API does not wait for the broker to respond to a commit request, but risks creating more duplicates when rebalancing. A common approach is to combine both commit APIs in an application, with the commitSync API used just before shutting the consumer down or rebalancing to make sure the final commit is successful. A sketch of this combined pattern follows.
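The following is a minimal Java sketch of the combined pattern, assuming enable.auto.commit=false and a topic named my-topic; the process method is a placeholder for application-specific work.

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("group.id", "my-group-id");
        props.put("enable.auto.commit", "false"); // take control of offset commits

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // wakeup() makes the blocking poll() throw WakeupException on shutdown
        Runtime.getRuntime().addShutdownHook(new Thread(consumer::wakeup));
        try {
            consumer.subscribe(List.of("my-topic")); // assumed topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // application-specific processing
                }
                consumer.commitAsync(); // non-blocking commit of the polled batch
            }
        } catch (WakeupException e) {
            // expected on shutdown
        } finally {
            try {
                consumer.commitSync(); // blocking commit so the final offsets are stored
            } finally {
                consumer.close();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("processed offset %d%n", record.offset());
    }
}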
4.5.1. Controlling transactional messages
Consider using transactional ids and enabling idempotence (enable.idempotence=true) on the producer side to guarantee exactly-once delivery. On the consumer side, you can then use the isolation.level property to control how transactional messages are read by the consumer.
The isolation.level property has two valid values:
- read_committed
- read_uncommitted (default)
Use read_committed to ensure that only transactional messages that have been committed are read by the consumer. However, this will cause an increase in end-to-end latency, because the consumer will not be able to return a message until the brokers have written the transaction markers that record the result of the transaction (committed or aborted).
# ...
enable.auto.commit=false
isolation.level=read_committed 1
# ...
1. Set to read_committed so that only committed messages are read by the consumer.
4.6. Recovering from failure to avoid data loss
Use the session.timeout.ms and heartbeat.interval.ms properties to configure the time taken to check and recover from consumer failure within a consumer group.
The session.timeout.ms property specifies the maximum amount of time in milliseconds a consumer within a consumer group can be out of contact with a broker before being considered inactive and a rebalancing is triggered between the active consumers in the group. When the group rebalances, the partitions are reassigned to the members of the group.
The heartbeat.interval.ms property specifies the interval in milliseconds between heartbeat checks to the consumer group coordinator to indicate that the consumer is active and connected. The heartbeat interval must be lower, usually by a third, than the session timeout interval.
If you set the session.timeout.ms property lower, failing consumers are detected earlier, and rebalancing can take place more quickly. However, take care not to set the timeout so low that the broker fails to receive a heartbeat in time and triggers an unnecessary rebalance.
Decreasing the heartbeat interval reduces the chance of accidental rebalancing, but more frequent heartbeats increase the overhead on broker resources.
4.7. Managing offset policy
Use the auto.offset.reset property to control how a consumer behaves when no offsets have been committed, or a committed offset is no longer valid or deleted.
Suppose you deploy a consumer application for the first time, and it reads messages from an existing topic. Because this is the first time the group.id is used, the __consumer_offsets topic does not contain any offset information for this application. The new application can start processing all existing messages from the start of the log or only new messages. The default reset value is latest, which starts at the end of the partition, and consequently means some messages are missed. To avoid data loss, but increase the amount of processing, set auto.offset.reset to earliest to start at the beginning of the partition.
Also consider using the earliest option to avoid messages being lost when the offsets retention period (offsets.retention.minutes) configured for a broker has ended. If a consumer group or standalone consumer is inactive and commits no offsets during the retention period, previously committed offsets are deleted from __consumer_offsets.
# ...
heartbeat.interval.ms=3000 1
session.timeout.ms=45000 2
auto.offset.reset=earliest 3
# ...
1. Adjust the heartbeat interval lower according to anticipated rebalances.
2. If no heartbeats are received by the Kafka broker before the timeout duration expires, the consumer is removed from the consumer group and a rebalance is initiated. If the broker configuration has a group.min.session.timeout.ms and group.max.session.timeout.ms, the session timeout value must be within that range.
3. Set to earliest to return to the start of a partition and avoid data loss if offsets were not committed.
If the amount of data returned in a single fetch request is large, a timeout might occur before the consumer has processed it. In this case, you can lower max.partition.fetch.bytes or increase session.timeout.ms.
4.8. Minimizing the impact of rebalances
The rebalancing of partitions between active consumers in a group takes the time needed for:
- Consumers to commit their offsets
- The new consumer group to be formed
- The group leader to assign partitions to group members
- The consumers in the group to receive their assignments and start fetching
Clearly, the process increases the downtime of a service, particularly when it happens repeatedly during a rolling restart of a consumer group cluster.
In this situation, you can use the concept of static membership to reduce the number of rebalances. Rebalancing assigns topic partitions evenly among consumer group members. Static membership uses persistence so that a consumer instance is recognized during a restart after a session timeout.
The consumer group coordinator can identify a new consumer instance using a unique id that is specified using the group.instance.id property. During a restart, the consumer is assigned a new member id, but as a static member it continues with the same instance id, and the same assignment of topic partitions is made.
If the consumer application does not make a call to poll at least every max.poll.interval.ms milliseconds, the consumer is considered to be failed, causing a rebalance. If the application cannot process all the records returned from poll in time, you can avoid a rebalance by using the max.poll.interval.ms property to increase the maximum interval in milliseconds allowed between polls for new messages from a consumer. Or you can use the max.poll.records property to set a maximum limit on the number of records returned from the consumer buffer, allowing your application to process fewer records within the max.poll.interval.ms limit.
# ...
group.instance.id=UNIQUE-ID 1
max.poll.interval.ms=300000 2
max.poll.records=500 3
# ...
1. A unique instance id ensures that a restarted consumer instance receives the same assignment of topic partitions.
2. The maximum interval allowed between polls before the consumer is considered failed.
3. The maximum number of records returned from a single poll.
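To illustrate, the following Java sketch combines static membership with the poll settings above; the instance id, the topic name my-topic, and the assumed per-record processing cost of roughly 100 ms are illustrative, not prescriptive.

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StaticMemberConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("group.id", "my-group-id");
        props.put("group.instance.id", "consumer-instance-1"); // static membership, assumed id
        props.put("max.poll.interval.ms", "300000"); // at most 5 minutes between polls
        props.put("max.poll.records", "500");        // e.g. 500 records * ~100 ms each = ~50 s < 300 s

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // assumed topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    // keep per-record work bounded; long-running work here risks
                    // exceeding max.poll.interval.ms and triggering a rebalance
                    System.out.printf("offset %d%n", record.offset());
                }
            }
        }
    }
}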