01 - Change Data Capture

Change Data Capture

Change Data Capture (CDC) is a data engineering technique that identifies and tracks changes to data in a source database, allowing for real-time or near real-time replication of those changes to target systems[1].

Here are the key aspects of CDC in data engineering:

Purpose and functionality:

  • CDC captures inserts, updates, and deletes in a source database as they occur.
  • It enables the efficient transfer of only the changed data, rather than entire datasets, to target systems like data warehouses or data lakes[1].
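As a rough illustration of the points above, the sketch below applies a short list of captured changes to a target that already holds a copy of the source table. The `ChangeEvent` class, its field names, and `apply_changes` are hypothetical names chosen for this example, not the format of any particular CDC tool; a plain dictionary stands in for the target system.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ChangeEvent:
    """One captured change (illustrative shape, not a real tool's format)."""
    op: str                        # "insert", "update", or "delete"
    key: int                       # primary key of the affected row
    row: Optional[Dict[str, Any]]  # new row image; None for deletes


def apply_changes(target: Dict[int, Dict[str, Any]], events: List[ChangeEvent]) -> None:
    """Apply events in order, touching only the rows that changed."""
    for event in events:
        if event.op in ("insert", "update"):
            target[event.key] = event.row
        elif event.op == "delete":
            target.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")


# The target starts as a copy of the source table (assumed sample data).
target = {1: {"name": "Ada"}, 2: {"name": "Bob"}}

events = [
    ChangeEvent("insert", 3, {"name": "Cy"}),
    ChangeEvent("update", 1, {"name": "Ada L."}),
    ChangeEvent("delete", 2, None),
]
apply_changes(target, events)
print(target)
```

Only three small events cross the wire here, regardless of how many rows the table holds, which is the efficiency argument for CDC over reloading entire datasets.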

Real-time data movement:

  • CDC provides near real-time or real-time movement of data by continuously processing changes as new database events occur[1].
  • This allows for up-to-date data in downstream systems, supporting real-time analytics and operational efficiency[2].
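One common way to process changes continuously is to remember the position of the last change that was handled and, on each pass, read only what came after it. The sketch below assumes a hypothetical in-memory change log with increasing sequence numbers; `read_new_changes`, `poll_once`, and the `state` dictionary are illustrative names, and a real pipeline would persist the position durably and run the poll in a loop or react to events.

```python
from typing import Dict, List, Tuple

# Hypothetical change log: (sequence number, description of the change).
change_log: List[Tuple[int, str]] = []


def read_new_changes(log: List[Tuple[int, str]], last_seq: int) -> List[Tuple[int, str]]:
    """Return only the entries recorded after the last processed position."""
    return [entry for entry in log if entry[0] > last_seq]


def poll_once(log: List[Tuple[int, str]], state: Dict[str, int], sink: List[str]) -> int:
    """Forward unseen changes to the sink and advance the stored position."""
    new_entries = read_new_changes(log, state["last_seq"])
    for seq, change in new_entries:
        sink.append(change)
        state["last_seq"] = seq
    return len(new_entries)


state = {"last_seq": 0}  # position of the last change already delivered
sink: List[str] = []     # stands in for the downstream system

change_log.append((1, "insert id=1"))
change_log.append((2, "update id=1"))
first = poll_once(change_log, state, sink)   # picks up both new changes

change_log.append((3, "delete id=1"))
second = poll_once(change_log, state, sink)  # picks up only the delete
third = poll_once(change_log, state, sink)   # nothing new to deliver
```

Because the position advances with every delivered change, each poll does work proportional to the new changes only, which is what keeps downstream data close to current.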

Delta Lake CDF

Delta Lake Change Data Feed (CDF) is a feature that automatically tracks row-level changes between versions of a Delta table. When enabled, it records “change events” for all data written into the table, including inserts, updates, and deletes.

These change events contain the row data along with metadata indicating the type of change that occurred.
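To make the shape of these change events concrete, the sketch below replays a few hand-written rows that mimic what a change feed read returns. In Delta Lake the change type is carried in a `_change_type` column (with values `insert`, `update_preimage`, `update_postimage`, and `delete`) alongside `_commit_version` and `_commit_timestamp`; the `id` and `name` columns, the sample rows, and the `latest_state` helper are assumptions made for illustration, and no Spark session is involved.

```python
from typing import Any, Dict, List

# Hand-written stand-ins for change feed rows (timestamp column omitted for brevity).
changes: List[Dict[str, Any]] = [
    {"id": 1, "name": "Ada", "_change_type": "insert", "_commit_version": 1},
    {"id": 2, "name": "Bob", "_change_type": "insert", "_commit_version": 1},
    {"id": 1, "name": "Ada", "_change_type": "update_preimage", "_commit_version": 2},
    {"id": 1, "name": "Ada L.", "_change_type": "update_postimage", "_commit_version": 2},
    {"id": 2, "name": "Bob", "_change_type": "delete", "_commit_version": 3},
]


def latest_state(rows: List[Dict[str, Any]]) -> Dict[int, Dict[str, Any]]:
    """Replay change rows in commit order and return the surviving rows by id."""
    state: Dict[int, Dict[str, Any]] = {}
    for row in sorted(rows, key=lambda r: r["_commit_version"]):
        kind = row["_change_type"]
        if kind == "update_preimage":
            continue  # old value of an updated row; not needed to rebuild state
        if kind == "delete":
            state.pop(row["id"], None)
        else:  # "insert" or "update_postimage"
            state[row["id"]] = {"id": row["id"], "name": row["name"]}
    return state


result = latest_state(changes)
print(result)
```

Note that an update appears as two rows, the value before and the value after, so a consumer that only wants the current state can skip the pre-image rows as done here.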

https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/

  • Use CDF when the table’s changes include updates and/or deletes.
  • Don’t use CDF when the table’s changes are append-only.