02 - Databricks Lakehouse Platform

The Databricks Lakehouse Platform combines the best features of data lakes and data warehouses. It provides a unified platform for data engineering, data science, machine learning, and business analytics.

Delta Lake

Delta Lake is an open-source storage framework that brings reliability, performance, and lifecycle management to data lakes. It provides several key features:

Key Features of Delta Lake

  1. ACID Transactions: Delta Lake enables ACID (Atomicity, Consistency, Isolation, Durability) transactions on data lakes, ensuring data reliability and consistency.
  2. Scalable Metadata Handling: Delta Lake efficiently manages metadata, allowing for quick access and updates.
  3. Audit Trail: Delta Lake maintains an audit trail of all changes to the tables, using transaction logs.
  4. Time Travel: Delta Lake supports versioning, allowing users to query previous versions of the data.
  5. Compaction and Optimization: Delta Lake can compact small files and optimize data layout for better performance.

Delta Lake is:

  • Open source
  • Built on a standard data format (Parquet)
  • Optimized for cloud object storage
  • Built for scalable metadata handling

Delta Lake is not:

  • A proprietary technology
  • A storage format
  • A storage medium
  • A database service or data warehouse

Transaction Logs

Delta Lake uses transaction logs to record every change made to a table. These logs are essential for maintaining ACID properties and enabling features like time travel.
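Each commit to a Delta table is recorded as a numbered JSON file inside the table's _delta_log directory. As a sketch, assuming a managed table stored at the default Hive warehouse path, the log can be listed from a notebook cell:

%fs ls dbfs:/user/hive/warehouse/delta_table/_delta_log/

Each file such as 00000000000000000000.json corresponds to one commit.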

Compaction

Compaction is the process of merging small files into larger ones to optimize read performance. This can be done using the OPTIMIZE command.

Basic Commands

Here are some basic SQL commands to interact with Delta tables:

Describe Database

DESCRIBE DATABASE [ EXTENDED ] database_name

-- Create employees SCHEMA
> CREATE SCHEMA employees COMMENT 'For software companies';

-- Describe employees SCHEMA.
-- Returns Database Name, Description and Root location of the filesystem
-- for the employees SCHEMA.
> DESCRIBE SCHEMA employees;
 database_description_item    database_description_value
 ------------------------- -----------------------------
             Database Name                     employees
               Description        For software companies
                  Location file:/you/Temp/employees.db

Creating a Delta Table

CREATE TABLE delta_table (id INT, name STRING, value DOUBLE)
USING DELTA; -- USING DELTA is optional: Delta is the default table format on Databricks

Inserting Data

INSERT INTO delta_table (id, name, value)
VALUES (1, 'example', 10.5);

Describing Table Details

-- Returns table metadata such as format, location, createdAt,
-- lastModified, numFiles, and sizeInBytes.
DESCRIBE DETAIL delta_table;

Updating a Table

UPDATE delta_table SET value = 20.5 WHERE id = 1;

Describing Table History

-- Returns the commit history: one row per table version, with columns
-- such as version, timestamp, operation, and operationParameters.
DESCRIBE HISTORY delta_table;

Advanced Delta Lake Features

Time Travel

Time travel allows you to query older versions of your data. You can specify a timestamp or version number.

SELECT * FROM delta_table VERSION AS OF 1;
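Time travel also accepts a timestamp (the version numbers come from DESCRIBE HISTORY); a minimal sketch with a hypothetical timestamp:

SELECT * FROM delta_table TIMESTAMP AS OF '2024-01-01';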

Compaction of Small Files

To compact small files, use the OPTIMIZE command:

OPTIMIZE delta_table;
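OPTIMIZE can additionally co-locate related records with Z-ordering, which speeds up queries that filter on the indexed columns; a sketch using the id column from the earlier example:

OPTIMIZE delta_table ZORDER BY (id);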

Vacuuming Delta Tables

Vacuuming removes old files that are no longer needed, freeing up storage space.

The default retention period for a Delta table is 7 days (168 hours).

VACUUM delta_table RETAIN 168 HOURS;
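To preview which files a vacuum would delete without actually removing them, VACUUM supports a dry run:

VACUUM delta_table RETAIN 168 HOURS DRY RUN;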

Understanding Databases and Tables in Lakehouse Platform


Managed Table

Databricks manages the entire lifecycle of a managed table, including both its data and metadata. The table is created under the database directory, and when the table is dropped, the underlying data files are deleted as well.

CREATE TABLE managed_default
  (width INT, length INT, height INT);

INSERT INTO managed_default
VALUES (3, 2, 1);
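To verify how a table is stored, DESCRIBE EXTENDED reports the table type and location; a sketch:

DESCRIBE EXTENDED managed_default;
-- Type shows MANAGED and Location points inside the database
-- directory (by default under dbfs:/user/hive/warehouse/).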

External Table

The data is stored outside of Databricks; only the metadata is managed by Databricks.

When an external table is dropped, only the table definition is removed from the metastore. Everything else, including the data and its metadata (the Delta transaction log and time travel history), remains in storage. The Delta log is considered part of the metadata because, when you drop a column in a Delta table (managed or external), the column is not physically removed from the Parquet files; the change is instead recorded in the Delta log. The Delta log is thus a key metadata layer that makes a Delta table work.

CREATE TABLE external_default
  (width INT, length INT, height INT)
LOCATION 'dbfs:/mnt/demo/external_default';

INSERT INTO external_default
VALUES (3, 2, 1);
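Because dropping an external table removes only the metastore entry, the table can later be re-registered on top of the surviving Delta files; a minimal sketch:

DROP TABLE external_default;
-- The files under dbfs:/mnt/demo/external_default remain in storage,
-- so the table can be recreated from them:
CREATE TABLE external_default
USING DELTA LOCATION 'dbfs:/mnt/demo/external_default';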

CTAS Statement - Create Table As Select

CTAS allows you to create a new table from the result of a query. It automatically infers schema information from the query results and does not support manual schema declaration.

CREATE TABLE new_table AS
SELECT * FROM existing_table WHERE value > 10;

CREATE TABLE new_table
  COMMENT 'Contains PII'
  PARTITIONED BY (city, birth_date)
  LOCATION '/some/path'
  AS SELECT id, name, email, birth_date, city FROM users;
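Although CTAS does not accept an explicit schema, the output schema can still be shaped by renaming and casting columns inside the SELECT; a sketch against the users table above (users_typed is a hypothetical name):

CREATE TABLE users_typed AS
SELECT id, name, CAST(birth_date AS DATE) AS birth_date FROM users;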


Table Constraints

Delta Lake supports constraints like NOT NULL and CHECK to enforce data integrity.

ALTER TABLE table_name ADD CONSTRAINT constraint_name constraint_details

ALTER TABLE orders ADD CONSTRAINT valid_date CHECK (date > '2020-01-01');
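NOT NULL constraints are applied per column with ALTER COLUMN rather than ADD CONSTRAINT; a sketch on the same orders table:

-- Fails if the column already contains NULL values.
ALTER TABLE orders ALTER COLUMN date SET NOT NULL;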

Views


  • Stored View: A saved query that you can treat as a table.
  • Temporary View: A view available only within the current session.
  • Global Temporary View: A view available to all sessions attached to the same cluster, accessed through the global_temp schema.

CREATE VIEW my_view AS
  SELECT id, name FROM delta_table;

CREATE TEMPORARY VIEW temp_view AS
  SELECT id, value FROM delta_table;

CREATE GLOBAL TEMPORARY VIEW global_view AS
  SELECT * FROM delta_table;
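A global temporary view must be referenced through the global_temp schema when queried:

SELECT * FROM global_temp.global_view;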