A Databricks Lakehouse, also known as a Data Lakehouse, is a modern data platform architecture that combines the strengths of both data lakes and data warehouses into a single unified system. It removes the need to maintain separate systems for raw data and analytics.
A data lake stores only raw data with no enforced structure, so it can accept any data in its raw form; it therefore supports only a schema-on-read approach. A data warehouse, by contrast, stores data in a structured, tabular format and follows a schema-on-write approach, which means unstructured data cannot be stored there. Because each system has its own limitations, Databricks introduced the Data Lakehouse design, which combines the strengths of both the data lake and the data warehouse.
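The difference between the two approaches can be sketched in plain Python (a hypothetical illustration, not Databricks API code): schema-on-write validates records against a fixed schema before storing them, while schema-on-read stores raw text as-is and applies structure only at read time.

```python
import json

# Schema-on-write (warehouse style): validate BEFORE storing.
SCHEMA = {"id": int, "name": str}

def write_to_warehouse(table, record):
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"schema violation on field '{field}'")
    table.append(record)  # only well-formed rows are ever stored

# Schema-on-read (lake style): store raw text as-is, parse at query time.
def write_to_lake(lake, raw_line):
    lake.append(raw_line)  # anything is accepted, no validation

def read_from_lake(lake):
    # structure is applied (and may fail) only when the data is read
    return [json.loads(line) for line in lake]

warehouse, lake = [], []
write_to_warehouse(warehouse, {"id": 1, "name": "alice"})
write_to_lake(lake, '{"id": 2, "name": "bob"}')
print(read_from_lake(lake)[0]["name"])  # structure resolved at read time
```

A malformed record is rejected at write time in the warehouse path, but would only surface as an error in the lake path when `read_from_lake` tries to parse it.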
Architecture:
Data Lakehouse architecture
A Data Lakehouse typically consists of three main layers: Ingestion Layer, Processing Layer, and Serving Layer.
Ingestion layer: The Ingestion Layer is responsible for collecting raw data from various source systems such as files, OLTP systems, SQL Server databases, APIs, and streaming platforms like Kafka or Pub/Sub. This layer performs key functions including batch ingestion, real-time ingestion, and schema detection. All incoming raw data is typically landed into the Bronze Layer, where it is stored in its original, unprocessed form.
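Landing data unchanged into the Bronze Layer can be sketched as follows (a minimal plain-Python illustration with hypothetical names; in practice this would be Spark or Auto Loader code writing to Delta tables): each incoming record is stored verbatim, tagged only with ingestion metadata for lineage.

```python
from datetime import datetime, timezone

def ingest_to_bronze(bronze, raw_record, source):
    """Land one raw record in the bronze layer without transforming it."""
    bronze.append({
        "raw": raw_record,                                 # original payload, untouched
        "_source": source,                                 # lineage: which system it came from
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
    })

bronze = []
# Both a JSON payload (e.g. from an API) and a CSV chunk (e.g. a file drop)
# land side by side in their raw form -- no schema is enforced here.
ingest_to_bronze(bronze, '{"order_id": 7, "amount": 120}', source="orders_api")
ingest_to_bronze(bronze, "order_id,amount\n8,95", source="orders_csv")
print(len(bronze))  # 2
```

Keeping the payload untouched is the point of the Bronze Layer: if downstream parsing logic changes, the original data can always be reprocessed.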
Processing layer: The Processing Layer is where data transformation, cleaning, and enrichment take place. Batch processing is used for scheduled or periodic data loads, while real-time streaming is used when data is continuously generated at the source. Common tools in this layer include Apache Spark, dbt, Delta Live Tables (DLT) for streaming pipelines, and SQL engines. This layer consists of two sub-layers: the Silver Layer and the Gold Layer. The Silver Layer refines the raw data by performing cleaning, deduplication, joining multiple datasets, and applying basic aggregations. It produces structured and standardized data that is more reliable than the Bronze Layer but not yet fully business-curated. The Gold Layer contains fully processed, business-ready data. The Processing Layer relies on ACID-compliant table formats such as Delta Lake, Iceberg, and Hudi to ensure reliability, versioning, and transaction support.
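The Silver and Gold steps can be illustrated with a minimal Python sketch (hypothetical sample data, standing in for Spark transformations): Silver deduplicates on a business key and standardizes values, and Gold aggregates the result into a business-ready summary.

```python
# Bronze: raw parsed records, possibly with duplicates and messy values.
bronze = [
    {"order_id": 1, "region": " EU ", "amount": 100},
    {"order_id": 1, "region": " EU ", "amount": 100},   # duplicate row
    {"order_id": 2, "region": "US",   "amount": 250},
]

# Silver: clean and deduplicate on the business key (order_id).
seen, silver = set(), []
for row in bronze:
    if row["order_id"] in seen:
        continue                                         # drop duplicates
    seen.add(row["order_id"])
    silver.append({**row, "region": row["region"].strip()})  # standardize

# Gold: business-level aggregation (revenue per region).
gold = {}
for row in silver:
    gold[row["region"]] = gold.get(row["region"], 0) + row["amount"]

print(gold)  # {'EU': 100, 'US': 250}
```

In a real pipeline the same dedup-then-aggregate shape appears as Spark `dropDuplicates` and `groupBy` operations writing to Delta tables, with the ACID table format guaranteeing that readers never see a half-written Silver or Gold state.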
Serving layer: The Serving Layer delivers the curated Gold Layer datasets to end users such as business analysts and data analysts so they can analyse data trends. It supports dashboards, reports, and trend analysis using tools like Power BI and other BI platforms, and it ensures that high-quality, clean, fully processed data is readily accessible for decision-making and analytics such as KPIs.
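As a sketch of what serving means in practice (hypothetical names and data; a BI tool such as Power BI would normally issue SQL against the Gold tables instead), a dashboard KPI is just a lightweight read over already-curated Gold data, with no further transformation logic needed at query time:

```python
# Gold table: business-ready revenue per region (illustrative data).
gold_revenue = {"EU": 100, "US": 250, "APAC": 75}

def kpi_total_revenue(gold):
    """KPI served to dashboards: total revenue across all regions."""
    return sum(gold.values())

def kpi_top_region(gold):
    """KPI: the region with the highest revenue."""
    return max(gold, key=gold.get)

print(kpi_total_revenue(gold_revenue))  # 425
print(kpi_top_region(gold_revenue))     # US
```

Because all cleaning and aggregation already happened in the Processing Layer, the Serving Layer queries stay simple and fast, which is what makes self-service BI on the Lakehouse practical.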
Conclusion:
The Lakehouse architecture is future-ready, supporting streaming workloads, machine learning, and advanced BI at scale.