Chapter 9. Using post processors to modify event messages


Post processors perform lightweight, per-message mutations, similar to the modifications that are performed by single message transformations (SMTs). However, Debezium calls post processors earlier in the event chain than transformations, enabling post processors to act on messages before they are handed off to the messaging runtime. Because post processors can act on messages from within the Debezium context, they are more efficient at modifying event payloads than transformations.

For a transformation to modify a message, it must recreate the message’s immutable ConnectRecord, or more aptly, its SourceRecord. By contrast, because a post processor acts within the Debezium scope, it can operate on mutable Struct types in the event payload of a message, modifying payloads before the construction of the SourceRecord. Close integration with Debezium provides post processors with access to Debezium internals, such as Debezium metadata about database connections, relational schema model, and so forth. In turn, this access enhances efficiency when performing tasks that rely on such internal information. For example, the Reselect columns post processor can automatically re-query the database to reselect a record and retrieve columns that were excluded from the original change event.

Debezium provides the following post processors:

Reselect columns
Re-selects specific columns that may not have been provided by the change event, such as TOASTed columns or Oracle LOB columns that were not modified by the current change.

9.1. Using the reselect columns post processor to add source fields to change event records

To improve performance and reduce storage overhead, databases can use external storage for certain columns. This type of storage is used for columns that store large amounts of data, such as the PostgreSQL TOAST (The Oversized-Attribute Storage Technique), Oracle Large Object (LOB), or the Oracle Exadata Extended String data types. To reduce I/O overhead and increase query speed, when data changes in a table row, the database retrieves only the columns that contain new values, ignoring data in externally stored columns that remain unchanged. As a result, the value of the externally stored column is not recorded in the database log, and Debezium subsequently omits the column when it emits the event record. Downstream consumers that receive event records that omit required values can experience processing errors.

IF a value for an externally stored column is not present in the database log entry for an event, when Debezium emits a record for the event, it replaces the missing value with an unavailable.value.placeholder sentinel value. These sentinel values are inserted into appropriately typed fields, for example, a byte array for bytes, a string for strings, or a key-value map for maps.

To retrieve data for columns that were not available in the initial query, you can apply the Debezium reselect columns post processor (ReselectColumnsPostProcessor). You can configure the post processor to reselect one or more columns from a table. After you configure the post processor, it monitors events that the connector emits for the column names that you designate for reselection. When it detects an event with the specified columns, the post processor re-queries the source tables to retrieve data for the specified columns, and fetches their current state.

You can configure the post processor to reselect the following column types:

  • null columns.
  • Columns that contain the unavailable.value.placeholder sentinel value.
Note

You can use the ReselectColumnsPostProcessor post processor only with Debezium source connectors.
The post processor is not designed to work with the Debezium JDBC sink connector.

For details about using the ReselectColumnsPostProcessor post processor, see the following topics:

9.1.1. Use of the Debezium ReselectColumnsPostProcessor with keyless tables

The reselect columns post processor generates a reselect query that returns the row to be modified. To construct the WHERE clause for the query, by default, the post processor uses a relational table model that is based on the table’s primary key columns or on the unique index that is defined for the table.

For keyless tables, the SELECT query that ReselectColumnsPostProcessor submits might return multiple rows, in which case Debezium always uses only the first row. You cannot prioritize the order of the returned rows. To enable the post processor to return a consistently usable result for a keyless table, it’s best to designate a custom key that can identify a unique row. The custom key must be capable of uniquely identify records in the source table based on a combination of columns.

To define such a custom message key, use the message.key.columns property in the connector configuration. After you define a custom key, set the reselect.use.event.key configuration property to true. Setting this option enables the post processor to use the specified event key fields as selection criteria in lieu of a primary key column. Be sure to test the configuration to ensure that the reselection query provides the expected results.

9.1.2. Example: Debezium ReselectColumnsPostProcessor configuration

Configuring a post processor is similar to configuring a custom converter or single message transformation (SMT). To enable the connector to use the ReselectColumnsPostProcessor, add the following entries to the connector configuration:

  "post.processors": "reselector", 1
  "reselector.type": "io.debezium.processors.reselect.ReselectColumnsPostProcessor", 2
  "reselector.reselect.columns.include.list": "<schema>.<table>:<column>,<schema>.<table>:<column>", 3
  "reselector.reselect.unavailable.values": "true", 4
  "reselector.reselect.null.values": "true" 5
  "reselector.reselect.use.event.key": "false" 6
ItemDescription

1

Comma-separated list of post processor prefixes.

2

The fully-qualified class type name for the post processor.

3

Comma-separated list of column names, specified by using the following format: <schema>.<table>:<column>.

4

Enables or disables reselection of columns that contain the unavailable.value.placeholder sentinel value.

5

Enables or disables reselection of columns that are null.

6

Enables or disables reselection based event key field names.

9.1.3. Descriptions of Debezium reselect columns post processor configuration properties

The following table lists the configuration options that you can set for the Reselect Columns post-processor.

Table 9.1. Reselect columns post processor configuration options

Property

Default

Description

reselect.columns.include.list

No default

Comma-separated list of column names to reselect from the source database. Use the following format to specify column names:

<schema>.<table>:<column>

Do not set this property if you set the reselect.columns.exclude.list property.

reselect.columns.exclude.list

No default

Comma-separated list of column names in the source database to exclude from reselection. Use the following format to specify column names:

<schema>.<table>:<column>

Do not set this property if you set the reselect.columns.include.list property.

reselect.unavailable.values

true

Specifies whether the post processor reselects a column that matches the reselect.columns.include.list filter if the column value is provided by the connector’s unavailable.value.placeholder property.

reselect.null.values

true

Specifies whether the post processor reselects a column that matches the reselect.columns.include.list filter if the column value is null.

reselect.use.event.key

false

Specifies whether the post processor reselects based on the event’s key field names or uses the relational table’s primary key column names.

By default, the reselection query is based on the relational table’s primary key columns or unique key index. For tables that do not have a primary key, set this property to true, and configure the message.key.columns property in the connector configuration to specify a custom key for the connector to use when it creates events. The post processor then uses the specified key field names as the primary key in the SQL reselection query.

Red Hat logoGithubRedditYoutubeTwitter

Learn

Try, buy, & sell

Communities

About Red Hat Documentation

We help Red Hat users innovate and achieve their goals with our products and services with content they can trust.

Making open source more inclusive

Red Hat is committed to replacing problematic language in our code, documentation, and web properties. For more details, see the Red Hat Blog.

About Red Hat

We deliver hardened solutions that make it easier for enterprises to work across platforms and environments, from the core datacenter to the network edge.

© 2024 Red Hat, Inc.