Chapter 9. Using post processors to modify event messages
Post processors perform lightweight, per-message mutations, similar to the modifications that are performed by single message transformations (SMTs). However, Debezium calls post processors earlier in the event chain than transformations, enabling post processors to act on messages before they are handed off to the messaging runtime. Because post processors can act on messages from within the Debezium context, they are more efficient at modifying event payloads than transformations.
For a transformation to modify a message, it must recreate the message’s immutable ConnectRecord, or more aptly, its SourceRecord. By contrast, because a post processor acts within the Debezium scope, it can operate on mutable Struct types in the event payload of a message, modifying payloads before the construction of the SourceRecord. Close integration with Debezium provides post processors with access to Debezium internals, such as Debezium metadata about database connections, relational schema model, and so forth. In turn, this access enhances efficiency when performing tasks that rely on such internal information. For example, the Reselect columns post processor can automatically re-query the database to reselect a record and retrieve columns that were excluded from the original change event.
Debezium provides the following post processors:
- Reselect columns: Re-selects specific columns that may not have been provided by the change event, such as TOASTed columns or Oracle LOB columns that were not modified by the current change.
9.1. Using the reselect columns post processor to add source fields to change event records
To improve performance and reduce storage overhead, databases can use external storage for certain columns. This type of storage is used for columns that store large amounts of data, such as the PostgreSQL TOAST (The Oversized-Attribute Storage Technique), Oracle Large Object (LOB), or the Oracle Exadata Extended String data types. To reduce I/O overhead and increase query speed, when data changes in a table row, the database retrieves only the columns that contain new values, ignoring data in externally stored columns that remain unchanged. As a result, the value of the externally stored column is not recorded in the database log, and Debezium subsequently omits the column when it emits the event record. Downstream consumers that receive event records that omit required values can experience processing errors.
If a value for an externally stored column is not present in the database log entry for an event, when Debezium emits a record for the event, it replaces the missing value with an unavailable.value.placeholder sentinel value. These sentinel values are inserted into appropriately typed fields, for example, a byte array for bytes, a string for strings, or a key-value map for maps.
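As an illustration, the following fragment sketches the after field of an update event for a hypothetical PostgreSQL table whose notes column is TOASTed and was not modified by the update. The field names and values are assumptions made up for this example. The placeholder string shown is the connector's default unavailable.value.placeholder value; it differs if you override that property.

```json
{
  "after": {
    "id": 1001,
    "email": "new.address@example.com",
    "notes": "__debezium_unavailable_value"
  }
}
```

A consumer that expects the real content of notes would instead receive the sentinel string, which is the situation that the reselect columns post processor is meant to address.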
To retrieve data for columns that were not available in the initial query, you can apply the Debezium reselect columns post processor (ReselectColumnsPostProcessor). You can configure the post processor to reselect one or more columns from a table. After you configure the post processor, it monitors events that the connector emits for the column names that you designate for reselection. When it detects an event with the specified columns, the post processor re-queries the source tables to retrieve data for the specified columns, and fetches their current state.
You can configure the post processor to reselect the following column types:
- null columns.
- Columns that contain the unavailable.value.placeholder sentinel value.
You can use the ReselectColumnsPostProcessor post processor only with Debezium source connectors. The post processor is not designed to work with the Debezium JDBC sink connector.
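For details about using the ReselectColumnsPostProcessor post processor, see the following topics:
- Section 9.1.1, Use of the Debezium ReselectColumnsPostProcessor with keyless tables
- Section 9.1.2, Example: Debezium ReselectColumnsPostProcessor configuration
- Section 9.1.3, Descriptions of Debezium reselect columns post processor configuration properties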
9.1.1. Use of the Debezium ReselectColumnsPostProcessor with keyless tables
The reselect columns post processor generates a reselect query that returns the row to be modified. To construct the WHERE clause for the query, by default, the post processor uses a relational table model that is based on the table’s primary key columns or on the unique index that is defined for the table.
For keyless tables, the SELECT query that ReselectColumnsPostProcessor submits might return multiple rows, in which case Debezium always uses only the first row. You cannot prioritize the order of the returned rows. To enable the post processor to return a consistently usable result for a keyless table, it’s best to designate a custom key that can identify a unique row. The custom key must be capable of uniquely identifying records in the source table based on a combination of columns.
To define such a custom message key, use the message.key.columns property in the connector configuration. After you define a custom key, set the reselect.use.event.key configuration property to true. Setting this option enables the post processor to use the specified event key fields as selection criteria in lieu of a primary key column. Be sure to test the configuration to ensure that the reselection query provides the expected results.
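The following fragment is a minimal sketch of such a configuration, not a complete connector definition. It assumes a hypothetical keyless table inventory.audit_log in which the combination of tenant_id and entry_no identifies a single row, and a hypothetical externally stored column named details. The table and column names, and the reselector prefix, are placeholders chosen for this example.

```json
{
  "message.key.columns": "inventory.audit_log:tenant_id,entry_no",
  "post.processors": "reselector",
  "reselector.type": "io.debezium.processors.reselect.ReselectColumnsPostProcessor",
  "reselector.reselect.columns.include.list": "inventory.audit_log:details",
  "reselector.reselect.use.event.key": "true"
}
```

With a configuration along these lines, the reselect query would be expected to filter on tenant_id and entry_no rather than on primary key columns. If that combination is not actually unique in the table, the query can still return multiple rows, so verify the key against real data.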
9.1.2. Example: Debezium ReselectColumnsPostProcessor configuration
Configuring a post processor is similar to configuring a custom converter or single message transformation (SMT). To enable the connector to use the ReselectColumnsPostProcessor, add the following entries to the connector configuration:

    "post.processors": "reselector", (1)
    "reselector.type": "io.debezium.processors.reselect.ReselectColumnsPostProcessor", (2)
    "reselector.reselect.columns.include.list": "<schema>.<table>:<column>,<schema>.<table>:<column>", (3)
    "reselector.reselect.unavailable.values": "true", (4)
    "reselector.reselect.null.values": "true", (5)
    "reselector.reselect.use.event.key": "false" (6)

Item | Description |
---|---|
1 | Comma-separated list of post processor prefixes. |
2 | The fully-qualified class type name for the post processor. |
3 | Comma-separated list of column names, specified by using the following format: <schema>.<table>:<column>. |
4 | Enables or disables reselection of columns that contain the unavailable.value.placeholder sentinel value. |
5 | Enables or disables reselection of columns that are null. |
6 | Enables or disables reselection based on event key field names. |
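For instance, with the placeholders filled in, the entries might look like the following sketch. The schema, table, and column names (inventory.products:description and inventory.customers:biography) are hypothetical. In this variation, reselection of null values is switched off so that only columns that contain the placeholder sentinel are re-queried.

```json
{
  "post.processors": "reselector",
  "reselector.type": "io.debezium.processors.reselect.ReselectColumnsPostProcessor",
  "reselector.reselect.columns.include.list": "inventory.products:description,inventory.customers:biography",
  "reselector.reselect.unavailable.values": "true",
  "reselector.reselect.null.values": "false",
  "reselector.reselect.use.event.key": "false"
}
```

Leaving null reselection off can be a reasonable choice when null is a legitimate stored value for the listed columns, because it avoids an extra query for each such event.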
9.1.3. Descriptions of Debezium reselect columns post processor configuration properties
The following table lists the configuration options that you can set for the reselect columns post processor.
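Property | Default | Description |
---|---|---|
reselect.columns.include.list | No default | Comma-separated list of column names to reselect from the source database. Use the following format to specify column names: <schema>.<table>:<column>. Do not set this property if you set the reselect.columns.exclude.list property. |
reselect.columns.exclude.list | No default | Comma-separated list of column names in the source database to exclude from reselection. Use the following format to specify column names: <schema>.<table>:<column>. Do not set this property if you set the reselect.columns.include.list property. |
reselect.unavailable.values | true | Specifies whether the post processor reselects a column that matches the reselect.columns.include.list filter if the column contains the unavailable.value.placeholder sentinel value. |
reselect.null.values | true | Specifies whether the post processor reselects a column that matches the reselect.columns.include.list filter if the column value is null. |
reselect.use.event.key | false | Specifies whether the post processor reselects based on the event’s key field names or uses the relational table’s primary key column names. |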
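As a sketch of the exclude-list variant, the following fragment assumes that the exclude list acts as the inverse of the include list, so that columns not named in it remain eligible for reselection. The table and column name inventory.orders:raw_payload is hypothetical, and the include list is deliberately left unset because the two list properties are not meant to be combined.

```json
{
  "post.processors": "reselector",
  "reselector.type": "io.debezium.processors.reselect.ReselectColumnsPostProcessor",
  "reselector.reselect.columns.exclude.list": "inventory.orders:raw_payload",
  "reselector.reselect.unavailable.values": "true",
  "reselector.reselect.null.values": "true"
}
```

An exclude list can be the shorter option when most externally stored columns should be reselected and only a few, for example very large ones, should be skipped.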