Delta Live Tables (DLT) is a framework designed to simplify the creation and management of data processing pipelines. It enforces data quality and manages dependencies between tables, making it easier to build reliable, large-scale ETL processes.
Multi-hop Architecture:
- orders_raw: table ingests Parquet data incrementally using Auto Loader.
- orders_cleaned: enriches order data with customer information.
- daily_customer_books: calculates daily book counts per customer.
Databricks Notebooks:
- The LIVE keyword is used to declare DLT tables.
Auto Loader:
- The STREAMING keyword declares streaming tables.
- The cloud_files method specifies the source location, format (e.g., Parquet), and reader options.
Data Quality Constraints:
- DROP ROW: Discards violating records.
- FAIL UPDATE: Fails the pipeline on violations (see the variant sketched after the silver table example below).
Bronze Table Declaration:
CREATE STREAMING LIVE TABLE orders_raw
COMMENT "Raw orders data"
AS SELECT * FROM cloud_files("/path/to/data", "parquet", map("inferSchema", "true"));
Silver Table Declaration with Constraints:
-- Silver: the DROP ROW constraint discards records with a NULL order_id
CREATE LIVE TABLE orders_cleaned (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM LIVE.orders_raw;
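For contrast, a minimal sketch of the FAIL UPDATE variant, which stops the pipeline update as soon as a violating record is found (the table name orders_cleaned_strict is illustrative):
-- FAIL UPDATE variant: a single NULL order_id fails the pipeline update
CREATE LIVE TABLE orders_cleaned_strict (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM LIVE.orders_raw;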
Gold Table Example:
-- Gold: the daily grain assumes orders_cleaned carries an order_timestamp column
CREATE LIVE TABLE daily_customer_books
AS SELECT customer_id, date_trunc("DD", order_timestamp) AS order_date, COUNT(*) AS books_count
FROM LIVE.orders_cleaned GROUP BY customer_id, date_trunc("DD", order_timestamp);
Change Data Capture refers to capturing changes made to source data and applying them to target tables. This process involves handling inserts, updates, and deletes efficiently.
CDC Events:
- Each change event carries the record's key, the changed column values, the operation performed (insert, update, or delete), and a sequencing field such as a timestamp or version number used to order the changes.
Apply Changes Into Command:
- DLT provides the APPLY CHANGES INTO command to apply a CDC feed to a target table, handling inserts, updates, and deletes automatically.
Command Syntax:
APPLY CHANGES INTO LIVE.target_table
FROM STREAM(LIVE.source_cdc_table)
KEYS (primary_key_field)
APPLY AS DELETE WHEN operation = 'Delete'
SEQUENCE BY timestamp_field
COLUMNS * EXCEPT (columns_to_ignore);
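A usage sketch with hypothetical names (customers_silver, customers_cdc, customer_id, row_status, and row_time are all assumptions): the target is declared as a streaming live table first, then the CDC feed is applied into it.
-- Declare the target before applying changes into it
CREATE OR REFRESH STREAMING LIVE TABLE customers_silver;

APPLY CHANGES INTO LIVE.customers_silver
FROM STREAM(LIVE.customers_cdc)
KEYS (customer_id)
APPLY AS DELETE WHEN row_status = 'Delete'
SEQUENCE BY row_time
COLUMNS * EXCEPT (row_status, row_time);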
Slowly Changing Dimensions (SCD):
- APPLY CHANGES INTO can write the target as SCD Type 1 (rows are overwritten in place, the default) or SCD Type 2 (row history is retained).
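A minimal sketch of the SCD Type 2 form, reusing the hypothetical names from the example above; with SCD Type 2, DLT records each row version's validity window in generated __START_AT and __END_AT columns:
-- SCD Type 2: previous row versions are kept instead of overwritten
APPLY CHANGES INTO LIVE.customers_silver
FROM STREAM(LIVE.customers_cdc)
KEYS (customer_id)
APPLY AS DELETE WHEN row_status = 'Delete'
SEQUENCE BY row_time
COLUMNS * EXCEPT (row_status, row_time)
STORED AS SCD TYPE 2;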
Limitations:
- The Apply Changes Into command breaks the append-only requirement of streaming sources, because it performs updates and deletes in the target table; downstream tables therefore cannot read the target as a streaming source.
Use Apply Changes Into to ensure your target table reflects source changes accurately while maintaining historical context if needed.