In today’s post, we’re going to explore Lakeflow Declarative Pipelines (LDPs) using SQL. I’ve done a couple of videos on LDPs before, but this time I want to explicitly show how it works with SQL. If you haven’t seen the other videos, I’ll make sure to link them in the description.
Introduction: What Are LDPs?
Lakeflow Declarative Pipelines are a way to describe how data should flow into a table without writing out all the manual SQL logic.
Instead of handling a big MERGE statement or manually coding updates, you simply define:
- the source,
- the target,
- and the rules.
Lakeflow then manages everything for you — incremental updates, history tracking, and even performance optimizations.
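For comparison, here is a rough sketch of the kind of hand-written MERGE that LDPs replace. This is only an illustration: the target table users_target is hypothetical, it only covers a simple Type 1 upsert, and it still leaves out-of-order events and history tracking for you to handle yourself.
-- Illustrative manual upsert (Type 1 only) that an LDP would replace
MERGE INTO users_target AS t
USING users_raw AS s
ON t.userId = s.userId
WHEN MATCHED AND s.operation = 'UPDATE' THEN
  UPDATE SET t.name = s.name, t.email = s.email
WHEN MATCHED AND s.operation = 'DELETE' THEN
  DELETE
WHEN NOT MATCHED AND s.operation = 'INSERT' THEN
  INSERT (userId, name, email) VALUES (s.userId, s.name, s.email);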
Key Concepts in LDPs
Here are the main building blocks you’ll see when working with declarative pipelines in SQL:
- CREATE FLOW → defines the pipeline from the source to the target.
- AS AUTO CDC → tells Lakeflow to automatically handle inserts, updates, and deletes.
- APPLY AS DELETE WHEN → specifies when a row should be treated as a delete (often using a utility column like operation).
- SEQUENCE BY → determines the order of events, usually with a timestamp or a sequence number.
- STORED AS SCD TYPE 2 → stores the table with slowly changing dimension history. Old records are closed with an end timestamp, and new versions are added with a start timestamp.
Demo: Building a Pipeline
To simulate this, we start by creating a raw table with some dummy data.
- Create the raw source table. Insert initial records, then add updates and deletes to simulate changes over time.
- Create the target streaming table. This is where Lakeflow will apply transformations.
- Define the flow. We create a flow that streams data from the raw table into the target. It applies CDC rules automatically, keys by userId, sequences events by a chosen column, and stores the target as SCD Type 2.
Once the pipeline is created and run, Lakeflow automatically handles inserts, updates, and deletes, and maintains historical versions of the data.
-- Step 1: Create a dummy CDC source table
CREATE OR REPLACE TABLE users_raw (
userId BIGINT,
name STRING,
email STRING,
operation STRING, -- 'INSERT', 'UPDATE', 'DELETE'
sequenceNum BIGINT
);
-- Step 2: Seed initial users
INSERT INTO users_raw VALUES
(1, 'Alice', 'alice@example.com', 'INSERT', 1),
(2, 'Bob', 'bob@example.com', 'INSERT', 2),
(3, 'Carol', 'carol@example.com', 'INSERT', 3);
-- Step 3: Simulate update + delete
INSERT INTO users_raw VALUES
(2, 'Bob', 'bob_new@example.com', 'UPDATE', 4), -- Bob updates email
(3, 'Carol', 'carol@example.com', 'DELETE', 5); -- Carol deleted
-- Step 4: Create a target SCD Type 2 table
CREATE OR REFRESH STREAMING TABLE users_scd;
-- Step 5: Build flow that applies CDC rules
CREATE FLOW users_flow
AS AUTO CDC INTO users_scd
FROM STREAM(users_raw)
KEYS (userId)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY sequenceNum
COLUMNS * EXCEPT (operation, sequenceNum)
STORED AS SCD TYPE 2;
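After the pipeline has run, you can query the target table to inspect the history. With STORED AS SCD TYPE 2, Databricks tracks each record's validity window in generated columns (named __START_AT and __END_AT in my runs; verify the exact names in your workspace). A quick check could look like this:
-- Inspect the SCD Type 2 history after a pipeline run
SELECT userId, name, email, __START_AT, __END_AT
FROM users_scd
ORDER BY userId, __START_AT;
-- Expected shape: Bob has a closed row (old email) plus an open row (new email),
-- and Carol's row is closed because of the DELETE event.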
Functionality of LDPs
Some of the built-in functionality includes:
- Incremental ingestion → only processes new or changed data (see the example after this list).
- Automatic history tracking → SCD Type 2 support out of the box.
- Performance optimization → designed for streaming and incremental processing.
- Simplicity → far less code compared to manual SQL merges.
- UI integration → pipelines can be monitored and managed visually in Databricks.
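To see the incremental behavior for yourself, append another change event to the raw table and trigger the pipeline again; only the new row should be picked up on the next update. The values below are just an illustration continuing the earlier demo:
-- Append one more change event; the next pipeline update only processes this row
INSERT INTO users_raw VALUES
(1, 'Alice', 'alice_new@example.com', 'UPDATE', 6); -- Alice updates email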
Advantages of LDPs
- Easy to use — minimal code required.
- Handles data quality and history automatically.
- Integrates with Lakeflow Connectors (e.g., Salesforce, SQL Server) for end-to-end processes.
- Works in both Spark and SQL, so you can choose your preferred approach.
Disadvantages of LDPs
- Less control → harder to customize compared to raw SQL or Python.
- Debugging → error messages can be vague and sometimes frustrating.
- Black box effect → the system hides some of the execution details.
- Notifications → limited to email out of the box (workarounds needed for Teams/Slack).
- Learning curve → requires learning new syntax (FLOW, AUTO CDC, etc.).
Conclusion
Lakeflow Declarative Pipelines make common data flows like CDC and SCD Type 2 much easier, faster, and more reliable.
But there is a trade-off: you gain simplicity and automation, while giving up some flexibility and transparency.
If you want a straightforward way to handle incremental ingestion, history tracking, and performance optimizations with minimal SQL, LDPs are a great fit.
