Query files directly using SQL by prefixing the file path with its format:
SELECT * FROM csv.`/path/to/file.csv`;
SELECT * FROM csv.`dbfs:/location/csv_files/`;
Create an external table over files with USING and OPTIONS:
CREATE TABLE my_table
USING CSV
OPTIONS (path '/path/to/file.csv');
Create a Delta table from a query with CTAS (CREATE TABLE AS SELECT):
CREATE TABLE delta_table
AS SELECT * FROM source_table;
General syntax for a table backed by an external data source:
CREATE TABLE table_name (col_name1 col_type1, ...)
USING data_source
OPTIONS (path 'location');
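For example, a CSV-backed table with common parsing options (the table name here is illustrative; header and delimiter are standard CSV reader options):
CREATE TABLE sales_csv
USING CSV
OPTIONS (path '/path/to/csv_files/', header 'true', delimiter ',');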
Data Sync Limitation: Tables defined directly against an external data source like CSV are not Delta tables. The files are managed outside Databricks, so the table does not get Delta's guarantees, and cached query results can be out of sync with the underlying files.
Alternative Options: Register a temporary view over the source, then load the data into a Delta table with CTAS.
CREATE TEMP VIEW temp_view AS
SELECT * FROM source_table;
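Then materialize the view as a Delta table (a sketch reusing temp_view from above; the table name is illustrative):
CREATE TABLE my_delta_table AS
SELECT * FROM temp_view;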
CREATE OR REPLACE TABLE completely overwrites the table each time it runs:
CREATE OR REPLACE TABLE my_table AS
SELECT * FROM source_table;
INSERT OVERWRITE replaces the existing data with new data while keeping the table's schema and metadata in place:
INSERT OVERWRITE TABLE my_table
SELECT * FROM new_data;
MERGE performs inserts, updates, and deletes in a single atomic transaction:
CREATE OR REPLACE TEMP VIEW Customer_updates AS
SELECT * FROM json.`${dataset}/Customer-json-needs`;

MERGE INTO Customer C
USING Customer_updates U
ON C.customer_id = U.customer_id
WHEN MATCHED AND C.email IS NULL AND U.email IS NOT NULL THEN
  UPDATE SET email = U.email, updated = U.updated
WHEN NOT MATCHED THEN
  INSERT *;
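from_json parses a JSON string column into a struct, given a schema definition: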
SELECT from_json(profile, 'STRUCT<name: STRING, age: INT, email: STRING>') AS profile_struct
FROM customers;
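Nested fields can then be accessed with dot syntax (a minimal sketch building on the query above):
SELECT profile_struct.name, profile_struct.email
FROM (
  SELECT from_json(profile, 'STRUCT<name: STRING, age: INT, email: STRING>') AS profile_struct
  FROM customers
) AS parsed;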
explode: Transforms an array or map into multiple rows, one per element, allowing analysis of individual elements. In Databricks Spark SQL it can be used directly in a SELECT clause or with LATERAL VIEW.
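For example, given a hypothetical orders table with an array column items:
SELECT order_id, explode(items) AS item
FROM orders;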
collect_set: Aggregates distinct elements into an array of unique values for each group. It is useful for eliminating duplicates and summarizing data.
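A sketch using the same hypothetical orders table (category is an assumed column):
SELECT customer_id, collect_set(category) AS categories
FROM orders
GROUP BY customer_id;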
UNION combines the results of two queries and removes duplicate rows (UNION ALL keeps them):
SELECT * FROM table1
UNION
SELECT * FROM table2;
PIVOT rotates row values into columns, aggregating as it goes:
SELECT * FROM (
  SELECT column1, column2 FROM table
) PIVOT (
  SUM(column2) FOR column1 IN ('value1', 'value2')
);
Higher-order functions such as FILTER and TRANSFORM operate directly on array columns. TRANSFORM applies a lambda expression to each element:
SELECT TRANSFORM(array_column, x -> x + 1) FROM table;
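FILTER keeps only the elements that satisfy a predicate (a sketch with the same placeholder names):
SELECT FILTER(array_column, x -> x > 0) FROM table;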
A Python UDF applies custom logic to a DataFrame column:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Plain Python function to wrap as a UDF
def add_one(x):
    return x + 1

# Register the function as a UDF with an explicit return type
add_one_udf = udf(add_one, IntegerType())

# withColumn returns a new DataFrame containing the derived column
df = df.withColumn('new_column', add_one_udf(df['column']))
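To call the same function from SQL, it can also be registered with spark.udf.register (a sketch assuming the add_one function above, an active spark session, and an illustrative table name):
spark.udf.register('add_one', add_one, IntegerType())
spark.sql("SELECT add_one(column) AS new_column FROM my_table")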
MERGE INTO combines INSERT and UPDATE operations into a single statement:
MERGE INTO target_table
USING source_table
ON target_table.id = source_table.id
WHEN MATCHED THEN
  UPDATE SET target_table.value = source_table.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (source_table.id, source_table.value);
These notes and examples cover essential concepts in Spark SQL and the ETL process, focusing on table management, data transformation, and advanced functions, all of which are crucial for the Databricks Certified Data Engineer Associate exam.