05 - Production Pipelines

Delta Live Tables (DLT)

Delta Live Tables is a framework designed to simplify the creation and management of data processing pipelines. It ensures data quality and maintains table dependencies, making it easier to build large-scale ETL processes.

Key Concepts of Delta Live Tables

  1. Multi-hop Architecture:

    • Bronze Tables: These tables store raw data in its original form. For example, the orders_raw table ingests Parquet data incrementally using Auto Loader.
    • Silver Tables: These tables contain refined and cleaned data. Operations such as data cleansing and enrichment are performed at this level. For example, orders_cleaned enriches order data with customer information.
    • Gold Tables: These tables provide aggregated or business-level insights. For instance, daily_customer_books calculates daily book counts per customer.
  2. Databricks Notebooks:

    • DLT pipelines are implemented using Databricks notebooks with SQL syntax.
    • The LIVE keyword is used both to declare DLT tables (CREATE LIVE TABLE) and to reference other tables in the same pipeline (e.g., LIVE.orders_raw).
  3. Auto Loader:

    • Used for incremental data ingestion; tables that ingest with Auto Loader are declared with the STREAMING keyword.
    • The cloud_files method specifies the source location, format (e.g., Parquet), and reader options.
  4. Data Quality Constraints:

    • Constraints ensure data quality by specifying the action taken on records that violate them (see the sketch after this list):
      • DROP ROW: Discards violating records.
      • FAIL UPDATE: Fails the pipeline update on violations.
      • Default (no ON VIOLATION clause): Keeps violating records and reports the violations in the pipeline metrics.
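
A minimal sketch of the constraint syntax showing all three behaviors side by side. The table name orders_validated, the constraint names, and the order_timestamp and quantity columns are illustrative assumptions, not taken from the course material:

    CREATE STREAMING LIVE TABLE orders_validated (
      -- Default (no ON VIOLATION clause): violations are reported in metrics, rows are kept
      CONSTRAINT valid_timestamp EXPECT (order_timestamp IS NOT NULL),
      -- DROP ROW: violating records are discarded from the target table
      CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
      -- FAIL UPDATE: the pipeline update fails as soon as a violation is found
      CONSTRAINT positive_quantity EXPECT (quantity > 0) ON VIOLATION FAIL UPDATE
    )
    COMMENT "Illustrative table combining the three constraint behaviors"
    AS SELECT * FROM STREAM(LIVE.orders_raw);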

Creating a DLT Pipeline

  • Navigate to the Workflows tab and select Delta Live Tables.
  • Create a pipeline by providing configurations such as the source dataset path and a storage location (see the sketch after this list).
  • Choose between Triggered Mode (runs once) and Continuous Mode (continuously ingests new data).
  • Use Development Mode for interactive development, which reuses clusters and disables retries for quick error resolution.
  • The execution flow is visualized as a Directed Acyclic Graph (DAG), showing entities and their relationships.
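
One way the pipeline configuration is typically consumed from the notebook: a key/value pair defined in the pipeline settings can be substituted into the SQL with ${key} syntax. A minimal sketch, assuming a hypothetical datasets_path key, table name, and orders-raw subdirectory (none of these names are from the course):

    -- ${datasets_path} is replaced with the value configured in the pipeline settings
    CREATE STREAMING LIVE TABLE configured_raw_data
    AS SELECT * FROM cloud_files("${datasets_path}/orders-raw", "parquet",
                                 map("cloudFiles.inferColumnTypes", "true"));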

Example SQL Commands

  • Bronze Table Declaration:

    CREATE STREAMING LIVE TABLE orders_raw
    COMMENT "Raw orders data"
    AS SELECT * FROM cloud_files("/path/to/data", "parquet", map("cloudFiles.inferColumnTypes", "true"));
    
  • Silver Table Declaration with Constraints:

    CREATE LIVE TABLE orders_cleaned (
      CONSTRAINT valid_order_number EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
    )
    COMMENT "Cleaned orders data"
    AS SELECT * FROM LIVE.orders_raw;
    
  • Gold Table Example:

    CREATE LIVE TABLE daily_customer_books
    AS SELECT customer_id, date_trunc("DD", order_timestamp) AS order_date, COUNT(*) AS books_count
    FROM LIVE.orders_cleaned  -- assumes orders_cleaned carries an order_timestamp column
    GROUP BY customer_id, date_trunc("DD", order_timestamp);
    

Change Data Capture (CDC)

Change Data Capture refers to capturing changes made to source data and applying them to target tables. This process involves handling inserts, updates, and deletes efficiently.

Key Concepts of CDC

  1. CDC Events:

    • Changes are logged as events containing both data and metadata (e.g., operation type, timestamp).
  2. Apply Changes Into Command:

    • Used in DLT to apply CDC events to target tables.
  3. Command Syntax:

    -- The target table must be declared before changes can be applied into it
    CREATE STREAMING LIVE TABLE target_table;

    APPLY CHANGES INTO LIVE.target_table
    FROM STREAM(LIVE.source_cdc_table)
    KEYS (primary_key_field)
    APPLY AS DELETE WHEN operation = 'Delete'
    SEQUENCE BY timestamp_field
    COLUMNS * EXCEPT (columns_to_ignore);
    
  4. Slowly Changing Dimensions (SCD):

    • Type 1 SCD updates records in place.
    • Type 2 SCD keeps historical records by adding new rows for changes (see the example use case below).
  5. Limitations:

    • Because Apply Changes Into performs updates and deletes on the target table, that table is no longer append-only and therefore cannot be used as a source for downstream streaming reads.

Example Use Case

  • Suppose you have a CDC feed with operations like insert, update, and delete.
  • Use Apply Changes Into to ensure your target table reflects these changes accurately, maintaining historical context (SCD Type 2) where needed, as shown in the sketch below.
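
A minimal end-to-end sketch of this use case. The /path/to/books-cdc location, the table names, the book_id key, and the row_status / row_time metadata columns are illustrative assumptions, not taken from the course material:

    -- Bronze: ingest the raw CDC events incrementally with Auto Loader
    CREATE STREAMING LIVE TABLE books_cdc_raw
    COMMENT "Raw CDC feed; each row carries the changed data plus row_status and row_time metadata"
    AS SELECT * FROM cloud_files("/path/to/books-cdc", "json",
                                 map("cloudFiles.inferColumnTypes", "true"));

    -- Silver: declare the target, then apply inserts, updates, and deletes into it
    CREATE STREAMING LIVE TABLE books_silver;

    APPLY CHANGES INTO LIVE.books_silver
    FROM STREAM(LIVE.books_cdc_raw)
    KEYS (book_id)
    APPLY AS DELETE WHEN row_status = 'DELETE'
    SEQUENCE BY row_time
    COLUMNS * EXCEPT (row_status, row_time)
    STORED AS SCD TYPE 2;  -- keep history as new rows; SCD TYPE 1 (the default) updates in place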