01 - Change Data Capture
Change Data Capture
Change Data Capture (CDC) is a data engineering technique that identifies and tracks changes to data in a source database, allowing for real-time or near real-time replication of those changes to target systems[1].
Here are the key aspects of CDC in data engineering:
Purpose and functionality:
- CDC captures inserts, updates, and deletes in a source database as they occur.
- It enables the efficient transfer of only the changed data, rather than entire datasets, to target systems like data warehouses or data lakes[1].
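To make the "only the changed data" idea concrete, here is a minimal sketch in plain Python. It assumes a hypothetical `ChangeEvent` shape and an in-memory dictionary standing in for the target system; neither comes from a real CDC tool, and a production pipeline would also need ordering guarantees, retries, and schema handling.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class ChangeEvent:
    """One captured change (hypothetical shape, not a real CDC tool's format)."""
    op: str                                # "insert", "update", or "delete"
    key: int                               # primary key of the affected row
    row: Optional[Dict[str, Any]] = None   # new row values; None for deletes


def apply_changes(target: Dict[int, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> None:
    """Apply change events to the target in the order they were captured."""
    for event in events:
        if event.op in ("insert", "update"):
            # Inserts and updates both store the newest version of the row.
            target[event.key] = dict(event.row or {})
        elif event.op == "delete":
            # Deleting a key that is already gone is treated as a no-op.
            target.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op!r}")


# The target already holds a copy of the source table.
target = {1: {"name": "Ada"}, 2: {"name": "Bob"}}

# Only three change events are transferred, not the whole table.
events = [
    ChangeEvent("insert", 3, {"name": "Cy"}),
    ChangeEvent("update", 1, {"name": "Ada L."}),
    ChangeEvent("delete", 2),
]
apply_changes(target, events)
print(target)  # {1: {'name': 'Ada L.'}, 3: {'name': 'Cy'}}
```

The target ends up matching the source after processing three small events, which is the efficiency argument for CDC over reloading entire datasets.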
Real-time data movement:
- CDC provides near real-time or real-time movement of data by continuously processing changes as new database events occur[1].
- This allows for up-to-date data in downstream systems, supporting real-time analytics and operational efficiency[2].
Delta Lake CDF
Delta Lake Change Data Feed (CDF) is a feature that automatically tracks row-level changes between versions of a Delta table. When enabled, it records “change events” for all data written into the table, including inserts, updates, and deletes.
These change events contain the row data along with metadata indicating the type of change that occurred.
https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/
- Use CDF when the table's changes include updates and/or deletes
- Don’t use CDF when the table's changes are append-only
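The change events described above can be pictured with a small sketch in plain Python. The metadata names `_change_type` (with values `insert`, `update_preimage`, `update_postimage`, `delete`) and `_commit_version` follow Delta Lake's change data feed columns; the sample rows and the `latest_change_per_key` helper are made up for illustration, and nothing here calls Delta Lake or Spark.

```python
from typing import Any, Dict, List

# Sample rows shaped like change data feed output (the data is hypothetical).
changes: List[Dict[str, Any]] = [
    {"id": 1, "name": "Ada", "_change_type": "insert", "_commit_version": 1},
    {"id": 2, "name": "Bob", "_change_type": "insert", "_commit_version": 1},
    {"id": 1, "name": "Ada", "_change_type": "update_preimage", "_commit_version": 2},
    {"id": 1, "name": "Ada L.", "_change_type": "update_postimage", "_commit_version": 2},
    {"id": 2, "name": "Bob", "_change_type": "delete", "_commit_version": 3},
]


def latest_change_per_key(rows: List[Dict[str, Any]],
                          key_column: str = "id") -> Dict[Any, Dict[str, Any]]:
    """Return the most recent change per key, ignoring update pre-images."""
    latest: Dict[Any, Dict[str, Any]] = {}
    # Process commits oldest first so later versions overwrite earlier ones.
    for row in sorted(rows, key=lambda r: r["_commit_version"]):
        if row["_change_type"] == "update_preimage":
            continue  # old values are not needed to bring a target up to date
        latest[row[key_column]] = row
    return latest


net = latest_change_per_key(changes)
for key, row in sorted(net.items()):
    print(key, row["_change_type"], row["name"])
# 1 update_postimage Ada L.
# 2 delete Bob
```

A downstream job could then upsert the `update_postimage` rows and remove the `delete` rows. For an append-only table every row would be an `insert`, so the feed adds little over reading the new data directly, which is why CDF is not recommended in that case.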