The Databricks Lakehouse Platform combines the best features of data lakes and data warehouses. It provides a unified platform for data engineering, data science, machine learning, and business analytics.
Delta Lake is an open-source storage framework that brings reliability, performance, and lifecycle management to data lakes. Its key features include ACID transactions, scalable metadata handling, time travel (data versioning), schema enforcement and evolution, and support for both batch and streaming workloads.
Delta Lake uses transaction logs to record every change made to a table. These logs are essential for maintaining ACID properties and enabling features like time travel.
Compaction is the process of merging small files into larger ones to optimize read performance. This can be done using the OPTIMIZE command.
Here are some basic SQL commands to interact with Delta tables:
CREATE TABLE delta_table (id INT, name STRING, value DOUBLE)
USING DELTA; -- optional: Delta is the default table format on Databricks
INSERT INTO delta_table (id, name, value)
VALUES (1, 'example', 10.5);
DESCRIBE DETAIL delta_table;
UPDATE delta_table SET value = 20.5 WHERE id = 1;
DESCRIBE HISTORY delta_table;
Time travel allows you to query older versions of your data. You can specify a timestamp or version number.
SELECT * FROM delta_table VERSION AS OF 1;
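Since a timestamp can also be used, a query pinned to a point in time looks like the following (the timestamp shown is just an illustrative value):

```sql
SELECT * FROM delta_table TIMESTAMP AS OF '2024-01-01';
```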
To compact small files, use the OPTIMIZE command:
OPTIMIZE delta_table;
Vacuuming removes old files that are no longer needed, freeing up storage space.
VACUUM delta_table RETAIN 168 HOURS;
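Because VACUUM permanently deletes files, it can be useful to preview what would be removed first; Delta Lake supports a DRY RUN option for this:

```sql
VACUUM delta_table RETAIN 168 HOURS DRY RUN;
```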
Databricks manages the full lifecycle of a managed table, including both its data and metadata. The data files are created under the database directory, and when the table is dropped, the underlying data files are also deleted.
CREATE TABLE managed_default
(width INT, length INT, height INT);
INSERT INTO managed_default
VALUES (3, 2, 1);
For an external table, the data is stored in a location you specify outside the database directory, and only the metadata is managed by Databricks.
CREATE TABLE external_default
(width INT, length INT, height INT)
LOCATION 'dbfs:/mnt/demo/external_default';
INSERT INTO external_default
VALUES (3, 2, 1);
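The difference between the two table types shows up when they are dropped: a managed table's data files are deleted along with its metadata, while an external table's files remain at the specified LOCATION.

```sql
DROP TABLE managed_default;   -- metadata and underlying data files are both removed
DROP TABLE external_default;  -- metadata removed; data files remain at the LOCATION
```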
CTAS (CREATE TABLE AS SELECT) allows you to create a new table from the result of a query. It automatically infers the schema from the query results and does not support manual schema declaration.
CREATE TABLE new_table AS
SELECT * FROM existing_table WHERE value > 10;
CREATE TABLE new_table
COMMENT "Contains PII"
PARTITIONED BY (city, birth_date)
LOCATION '/some/path'
AS SELECT id, name, email, birth_date, city FROM users;
Delta Lake supports constraints like NOT NULL and CHECK to enforce data integrity.
ALTER TABLE table_name ADD CONSTRAINT constraint_name constraint_details;
ALTER TABLE orders ADD CONSTRAINT valid_date CHECK (date > '2020-01-01');
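NOT NULL is mentioned above but only CHECK is demonstrated; a NOT NULL constraint can be applied to an existing column as shown below (the orders table and order_id column are assumed here for illustration):

```sql
ALTER TABLE orders ALTER COLUMN order_id SET NOT NULL;
```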
CREATE VIEW my_view AS SELECT id, name FROM delta_table;
CREATE TEMPORARY VIEW temp_view AS
SELECT id, value FROM delta_table;
CREATE GLOBAL TEMPORARY VIEW global_view AS
SELECT * FROM delta_table;
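Global temporary views are registered in the global_temp database and remain visible to other sessions on the same cluster, so they must be queried with that qualifier:

```sql
SELECT * FROM global_temp.global_view;
```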