The Databricks Lakehouse Platform combines the best features of data lakes and data warehouses. It provides a unified platform for data engineering, data science, machine learning, and business analytics.
Delta Lake is an open-source storage framework that brings reliability, performance, and lifecycle management to data lakes. It provides several key features: ACID transactions, scalable metadata handling, time travel (data versioning), schema enforcement, and schema evolution.
Delta Lake is: open source, built on the standard Parquet format, and optimized for cloud object storage.
Delta Lake is not: a proprietary technology, not a storage format or storage medium, and not a database service or data warehouse.
Delta Lake uses transaction logs to record every change made to a table. These logs are essential for maintaining ACID properties and enabling features like time travel.
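As a quick way to see this (a sketch; the path assumes a managed table named delta_table in the default warehouse location), each commit appears both on disk and in the table history:
-- Each commit is a numbered JSON file under the table's _delta_log directory, e.g.:
--   dbfs:/user/hive/warehouse/delta_table/_delta_log/00000000000000000000.json
-- From SQL, the same commits are visible in the table history:
DESCRIBE HISTORY delta_table;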
Compaction is the process of merging small files into larger ones to optimize read performance. This can be done using the OPTIMIZE command.
Here are some basic SQL commands to interact with Delta tables:
DESCRIBE DATABASE [ EXTENDED ] database_name
-- Create employees SCHEMA
> CREATE SCHEMA employees COMMENT 'For software companies';
-- Describe employees SCHEMA.
-- Returns Database Name, Description and Root location of the filesystem
-- for the employees SCHEMA.
> DESCRIBE SCHEMA employees;
database_description_item  database_description_value
-------------------------  ---------------------------
Database Name              employees
Description                For software companies
Location                   file:/you/Temp/employees.db
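The EXTENDED keyword from the syntax above also returns the schema's properties; a minimal example (the extra Properties row is empty unless properties were set):
-- Describe employees SCHEMA with EXTENDED.
> DESCRIBE SCHEMA EXTENDED employees;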
CREATE TABLE delta_table (id INT, name STRING, value DOUBLE)
USING DELTA; -- USING DELTA is optional; Delta is the default table format on Databricks
INSERT INTO delta_table(id,name,value)
VALUES (1,'example',10.5);
DESCRIBE DETAIL delta_table;
UPDATE delta_table SET value = 20.5 WHERE id = 1;
DESCRIBE HISTORY delta_table;
Time travel allows you to query older versions of your data. You can specify a timestamp or version number.
SELECT * FROM delta_table VERSION AS OF 1;
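Time travel can also target a point in time, and RESTORE rolls a table back to an earlier state (the timestamp below is illustrative):
SELECT * FROM delta_table TIMESTAMP AS OF '2024-01-01';
-- Roll the table back to a previous version:
RESTORE TABLE delta_table TO VERSION AS OF 1;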
To compact small files, use the OPTIMIZE command:
OPTIMIZE delta_table;
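OPTIMIZE can also co-locate related rows with Z-ordering, which helps queries that filter on the chosen column; a sketch using the id column from the earlier table:
OPTIMIZE delta_table ZORDER BY (id);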
Vacuuming removes old files that are no longer needed, freeing up storage space.
The default retention period for a Delta table is 7 days (168 hours).
VACUUM delta_table RETAIN 168 HOURS;
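To preview which files VACUUM would delete without actually removing anything, add the DRY RUN clause:
VACUUM delta_table RETAIN 168 HOURS DRY RUN;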
Databricks manages the lifecycle of a managed table, including both its data and metadata. The data files are created under the database directory, and when the table is dropped, the underlying data files are deleted as well.
CREATE TABLE managed_default
(width INT, length INT, height INT);
INSERT INTO managed_default
VALUES (3, 2, 1);
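To confirm the table is managed and see where its files landed, DESCRIBE EXTENDED reports the type and location (the location comment below reflects the typical default; yours may differ):
DESCRIBE EXTENDED managed_default;
-- Look for: Type = MANAGED, Location = <database directory>/managed_default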
For an external table, the data is stored in a location you specify outside the database directory, and only the metadata is managed by Databricks.
When an external table is dropped, only the table definition is removed from the metastore; everything else, including the data and its metadata (the Delta transaction log and time-travel history), remains in storage. The Delta log counts as metadata because dropping a column from a Delta table (managed or external) does not physically remove it from the Parquet files; the drop is simply recorded in the Delta log. This makes the Delta log a key metadata layer that a Delta table needs in order to work.
CREATE TABLE external_default
(width INT, length INT, height INT)
LOCATION 'dbfs:/mnt/demo/external_default';
INSERT INTO external_default
VALUES (3, 2, 1);
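A quick way to verify the external-table behavior described above (a sketch reusing the same path): drop the table, recreate it over the same location, and the old rows are still there.
DROP TABLE external_default; -- removes only the metastore entry
CREATE TABLE external_default
(width INT, length INT, height INT)
LOCATION 'dbfs:/mnt/demo/external_default';
SELECT * FROM external_default; -- previous rows are still present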
CTAS (CREATE TABLE AS SELECT) allows you to create a new table from the result of a query. It automatically infers schema information from the query results and does not support manual schema declaration.
CREATE TABLE new_table AS
SELECT * FROM existing_table WHERE value > 10;
CREATE TABLE new_table COMMENT 'Contains PII' PARTITIONED BY (city, birth_date)
LOCATION '/some/path'
AS SELECT id, name, email, birth_date, city FROM users;
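Because CTAS takes no column list, shape the schema inside the SELECT itself by renaming or casting columns (table and column names below are illustrative):
CREATE TABLE users_cleaned AS
SELECT id AS user_id, name, CAST(birth_date AS DATE) AS birth_date
FROM users;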
Delta Lake supports constraints like NOT NULL and CHECK to enforce data integrity.
ALTER TABLE table_name ADD CONSTRAINT constraint_name constraint_details
ALTER TABLE orders ADD CONSTRAINT valid_date CHECK (date > '2020-01-01');
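NOT NULL is applied per column rather than through ADD CONSTRAINT; a minimal sketch on the same orders table:
ALTER TABLE orders ALTER COLUMN date SET NOT NULL;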
CREATE VIEW my_view AS SELECT id, name FROM delta_table;
CREATE TEMPORARY VIEW temp_view AS
SELECT id, value FROM delta_table;
CREATE GLOBAL TEMPORARY VIEW global_view AS
SELECT * FROM delta_table;
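Global temporary views are registered in the global_temp schema and must be qualified when queried; they persist across sessions on the same cluster, while plain temporary views are dropped when the session ends:
SELECT * FROM global_temp.global_view;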