Lakeflow Declarative Pipelines: From DLT to Production

If you’ve been writing Delta Live Tables (DLT) pipelines, you’ve been building with Lakeflow without knowing the new name. In 2026, the rebranding matters because it signals how Databricks now wants you to think about declarative pipeline design.

This isn’t just a rename. The mental model has shifted from “tables and dependencies” to “data flows and transformations.” Let me show you what changed and why it matters.

What Lakeflow Actually Is

Lakeflow Declarative Pipelines is the modern Databricks way to say: “I describe what data I want, and Databricks manages how to get it.”

You define:

Datasets (streaming tables, materialized views, or static tables)
Dependencies (what feeds what)
Update semantics (incremental, CDC, append-only, etc.)

Databricks handles:

Orchestration (running transformations in dependency order)
Incremental execution (only processing new or changed data)
Performance optimization (caching, clustering, partitioning)
Error handling and recovery
Scaling (from local notebook to production pipelines)

This is declarative programming applied to data pipelines. You say what, not how.

Why This Matters: The Evolution from DLT

The Old DLT Mental Model (2020–2024)

CREATE OR REFRESH TABLE bronze_events AS
  SELECT * FROM STREAM(raw_events);

CREATE OR REFRESH TABLE silver_events AS
  SELECT 
    event_id,
    event_type,
    event_ts,
    user_id
  FROM STREAM(bronze_events)
  WHERE event_id IS NOT NULL;

This worked, and it still works. But it had rough edges:

Naming was confusing. “Delta Live Tables” doesn’t communicate the pattern as clearly as “declarative pipelines.”
Scope was table-focused. The unit of thinking was “tables and their dependencies.”
Limited semantics. Streaming tables worked for append-only data, but CDC was bolted on via APPLY CHANGES INTO.
Operator experience was ad hoc. Monitoring, debugging, and recovery required Databricks-specific tools.

The New Lakeflow Mental Model (2026)

CREATE FLOW my_pipeline AS
  SOURCE events = READ 'path/to/events'
  STEP silver_events = TRANSFORM (events -> SELECT event_id, user_id, event_ts WHERE event_id IS NOT NULL)
  SINK WRITE silver_events TO LOCATION 'path/to/silver';

(This is pseudocode; actual Lakeflow syntax is still SQL-based, but the framing has shifted toward flows.)

The reframing means:

Unit of work is a “flow,” not tables. You’re defining data lineage, not individual tables.
Semantics are explicit. CDC, incremental, streaming, static—these are first-class options, not workarounds.
Operations are first-class. Monitoring, debugging, and cost tracking are baked into the flow concept.
Portability hints at the future. Flows might run on engines other than Spark (though they don’t yet).

Key Concepts in Lakeflow 2026

1. Streaming Tables (Append-Only)

Use when: New data is appended; you never update or delete old records.

CREATE OR REFRESH STREAMING TABLE raw_events AS
  SELECT * FROM cloud_files('s3://bucket/events/', 'json');

Lazily processed (waits for new files before running)
Checkpointed for recovery
Ideal for event data, logs, telemetry

When to use: Events, clicks, API calls, anything that’s append-only and time-ordered.

2. Materialized Views (Recomputable)

Use when: Results can be fully recomputed; correctness matters more than latency.

CREATE MATERIALIZED VIEW user_daily_summary AS
  SELECT 
    user_id,
    DATE(event_ts) AS event_date,
    COUNT(*) AS event_count
  FROM raw_events
  GROUP BY user_id, DATE(event_ts);

Fully recalculated on refresh (not incremental by default)
Great for aggregations and slowly-changing dimensions
Lower latency than streaming for complex transformations
More CPU-heavy because it’s not incremental

When to use: Dashboards, aggregations, derived metrics that need to be recalculated nightly or hourly.

3. AUTO CDC (Change Data Capture)

Use when: Source data is being updated/deleted and you need to track those changes in your lakehouse.

CREATE STREAMING TABLE users_scd2;

CREATE FLOW user_cdc AS
  AUTO CDC INTO users_scd2
  FROM stream(bronze_users_cdf)
  KEYS (user_id)
  SEQUENCE BY update_timestamp
  STORED AS SCD TYPE 2;

Tracks inserts, updates, deletes from source system
Produces SCD Type 2 slowly-changing dimensions
Tracks valid_from and valid_to dates automatically
Easier than hand-rolling merge logic

When to use: Syncing databases, data warehouses, CRM data—anything that mutates at the source.

4. Derived Tables and Temp Tables

For intermediate transformations that aren’t exposed to users:

DECLARE VARIABLE silver_users TABLE (
  user_id STRING,
  user_name STRING,
  created_ts TIMESTAMP
);

These are local to the flow, don’t create physical tables, and help organize complex pipelines.

Pattern: Building a Production Lakeflow Pipeline

Here’s how a modern 2026 pipeline looks:

-- BRONZE: Raw ingestion
CREATE OR REFRESH STREAMING TABLE bronze_events AS
  SELECT * FROM cloud_files(
    's3://raw-bucket/events/', 
    'json',
    schema => 'event_id STRING, user_id STRING, event_ts TIMESTAMP, properties STRING'
  );

-- SILVER: Cleaned, typed, deduplicated
CREATE OR REFRESH MATERIALIZED VIEW silver_events AS
  SELECT 
    event_id,
    user_id,
    event_ts,
    from_json(properties, 'key STRING, value STRING') AS properties
  FROM bronze_events
  WHERE event_id IS NOT NULL AND user_id IS NOT NULL
  QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_ts) = 1;

-- GOLD: Business logic (daily aggregation)
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_events AS
  SELECT 
    DATE(event_ts) AS event_date,
    user_id,
    COUNT(*) AS event_count,
    COUNT(DISTINCT event_id) AS unique_event_count
  FROM silver_events
  GROUP BY DATE(event_ts), user_id;

-- CDC Example: Syncing a mutable source
CREATE OR REFRESH STREAMING TABLE bronze_users_raw AS
  SELECT * FROM cloud_files(
    's3://raw-bucket/users/',
    'json'
  );

CREATE OR REFRESH STREAMING TABLE silver_users AS
  SELECT 
    user_id,
    user_name,
    email,
    created_ts,
    _change_type,
    _change_ts
  FROM stream(bronze_users_raw)
  WHERE _change_type IN ('insert', 'update_postimage');

Pattern principles:

Bronze = Raw (minimal transformation, just schema validation)
Silver = Clean (deduplication, type casting, filtering nulls)
Gold = Ready (business aggregations, denormalization for BI)
Use streaming for immutable appends; materialized views for computed aggregations
Version CDC data explicitly (track _change_type and _change_ts)

Lakeflow vs Hand-Rolled Spark Jobs

When would you not use Lakeflow?

Use Spark Jobs (Not Lakeflow) When:

Complex imperative logic (if-then-else branching, stateful transformations)
Multiple outputs from one job (fan-out to many tables with different logic)
External API calls (enrichment from external services; Lakeflow isn’t designed for this)
Non-tabular outputs (writing to databases, APIs, files outside Lakeflow)
GPU/ML workloads (Lakeflow is for data movement; ML pipelines live elsewhere)

# This is a Spark job, not Lakeflow
from pyspark.sql import SparkSession
import requests

spark = SparkSession.builder.appName("enrichment").getOrCreate()

events = spark.read.table("silver_events")

def enrich_with_api(user_id):
    # Call external API—not idiomatic for Lakeflow
    response = requests.get(f"https://api.company.com/users/{user_id}")
    return response.json()

# This is why Spark jobs still exist
enriched = events.rdd.map(lambda row: enrich_with_api(row.user_id)).toDF()
enriched.write.saveAsTable("enriched_events", mode="overwrite")

Rule of thumb: If your transformation is SQL or simple Python dataframe operations, use Lakeflow. If it requires code with state, external calls, or complex logic, use Spark jobs.

Performance and Cost Considerations

Streaming Tables vs Materialized Views: Cost Tradeoffs

Aspect	Streaming Tables	Materialized Views
Processing	Incremental (only new data)	Full recompute (all data)
Latency	Lower (continuous or frequent)	Higher (depends on refresh schedule)
CPU cost	Lower (incremental work)	Higher (full scan)
Best for	Event data, logs, immutable sources	Aggregations, BI, reports
Example	Raw events ingestion	Daily user summary

Optimization in Lakeflow (2026)

Lakeflow now handles much of this automatically:

Liquid clustering is default for new tables (no need to design partitions)
Predictive optimization runs maintenance automatically
Query caching at the materialized view level
Skipping files based on statistics (Z-order is legacy)

CREATE OR REFRESH STREAMING TABLE events
CLUSTER BY (user_id, event_date);  -- Liquid clustering, not partitions

You still need to understand the tradeoff (streaming incremental vs materialized full-recompute), but the physical optimization is handled.

Monitoring and Cost Tracking

In 2026, Lakeflow pipelines expose cost metrics:

Compute cost per pipeline run
Storage cost for each table and intermediate result
Data scan volume to understand shuffle and network costs

This makes it clear which transformations are expensive:

Is your materialized view doing a full table scan that could be incremental? (Cost spike)
Is your streaming table checkpointing massive state? (Check the intermediate storage)
Are you recalculating data that could be cached? (Use materialized views + clustering)

Testing Lakeflow Pipelines

Declarative pipelines are easier to test than Spark jobs because they’re mostly SQL. But testing still matters.

Unit Test Pattern

-- Test: silver_events deduplicates correctly
SELECT COUNT(*) AS total, COUNT(DISTINCT event_id) AS unique
FROM silver_events
HAVING total > unique;  -- Should fail if deduplication works

-- Test: null events are filtered
SELECT COUNT(*) 
FROM silver_events
WHERE event_id IS NULL OR user_id IS NULL;  -- Should return 0

Integration Test Pattern

-- Reset to a known state
TRUNCATE TABLE bronze_events;
INSERT INTO bronze_events VALUES (1, 'u1', CURRENT_TIMESTAMP, '{}');

-- Refresh pipeline
CALL refresh_pipeline('my_pipeline');

-- Assert output
SELECT COUNT(*) AS result FROM gold_daily_events
WHERE user_id = 'u1'
HAVING result = 1;  -- Should pass

The Operational Picture: From Lakeflow to Production

Development → Staging → Production

Lakeflow 2026 supports this cleanly:

-- Development: Narrow dataset, fast feedback
CREATE OR REPLACE STREAMING TABLE bronze_events_dev AS
  SELECT * FROM bronze_events LIMIT 10000;

-- Staging: Full dataset, real-world scale
CREATE OR REPLACE STREAMING TABLE bronze_events_staging AS
  SELECT * FROM bronze_events WHERE event_date >= DATE_SUB(CURRENT_DATE, 30);

-- Production: All data, full history
CREATE OR REPLACE STREAMING TABLE bronze_events AS
  SELECT * FROM cloud_files(
    's3://raw-events/',
    'json'
  );

Pipeline infrastructure supports:

Multi-workspace deployment (dev workspace, prod workspace)
Git versioning (Databricks Repos integration)
Notifications (Slack alerts on pipeline failure)
Manual/scheduled triggers (run on schedule or on-demand)

Lakeflow vs dbt

In 2026, you might ask: “Should I use Lakeflow or dbt?”

They’re complementary, not competitors:

Aspect	Lakeflow	dbt
Native to	Databricks	Any data warehouse
Orchestration	Built-in (Lakeflow service)	Requires external orchestrator (Airflow, dbt Cloud)
Incremental logic	Native (streaming, CDC, materialized views)	Plugin-based (requires dbt-incremental macros)
Testing	SQL assertions and expectations	dbt tests, great dbt ecosystem
Best for	Databricks-native workflows	Multi-warehouse, BI-heavy stacks

Use Lakeflow if: You’re all-in on Databricks and want the simplest path to production.

Use dbt if: You have multiple data warehouses, need stronger testing, or want portable SQL transformations.

Use both if: Lakeflow for ingestion to silver, dbt for silver-to-gold transformations (some teams do this).

The Future of Lakeflow (2026 and Beyond)

What’s coming:

AI-assisted pipeline generation (describe your data, Databricks generates the SQL)
Finer cost tracking (optimize specific transformations)
Cross-warehouse federation (query Snowflake from Lakeflow pipelines)
Better testing frameworks (built-in data quality tools)
Streaming SQL improvements (more flexible windowing and stateful operations)

The direction is clear: make declarative pipelines the default path, and push complexity (ML, external APIs, GPU workloads) to specialized tools.

Putting It Together: A Real Example

Building a recommendation engine’s featurization pipeline:

-- BRONZE: Raw events
CREATE OR REFRESH STREAMING TABLE bronze_events AS
  SELECT * FROM cloud_files('s3://events/', 'parquet');

-- SILVER: Validated events
CREATE OR REFRESH STREAMING TABLE silver_events AS
  SELECT 
    event_id,
    user_id,
    product_id,
    event_type,
    event_ts,
    CURRENT_TIMESTAMP() AS processed_ts
  FROM stream(bronze_events)
  WHERE user_id IS NOT NULL 
    AND product_id IS NOT NULL
    AND event_ts > CURRENT_TIMESTAMP() - INTERVAL 1 YEAR;

-- GOLD: 30-day user features
CREATE OR REFRESH MATERIALIZED VIEW gold_user_features AS
  SELECT 
    user_id,
    COUNT(*) AS events_30d,
    COUNT(DISTINCT product_id) AS products_viewed_30d,
    COUNT(CASE WHEN event_type = 'purchase' THEN 1 END) AS purchases_30d,
    COUNT(CASE WHEN event_type = 'view' THEN 1 END) AS views_30d,
    MAX(event_ts) AS last_event_ts
  FROM silver_events
  WHERE event_ts >= DATE_SUB(CURRENT_DATE(), 30)
  GROUP BY user_id;

-- GOLD: Product popularity (hourly)
CREATE OR REFRESH MATERIALIZED VIEW gold_product_popularity AS
  SELECT 
    product_id,
    DATE_TRUNC('hour', event_ts) AS hour,
    COUNT(*) AS event_count,
    COUNT(CASE WHEN event_type = 'purchase' THEN 1 END) AS purchases
  FROM silver_events
  WHERE event_ts >= DATE_SUB(CURRENT_DATE(), 7)
  GROUP BY product_id, DATE_TRUNC('hour', event_ts);

This entire pipeline is:

Automatically orchestrated (dependencies detected)
Incrementally processed (only new events processed)
Monitored (metrics, lineage, cost tracked)
Scalable (from notebook to production cluster)

And you wrote pure SQL. No Spark tuning, no cluster management, no orchestration code.

That’s the promise of Lakeflow.

Last Updated: April 7, 2026

What Lakeflow Actually Is#

Why This Matters: The Evolution from DLT#

The Old DLT Mental Model (2020–2024)#

The New Lakeflow Mental Model (2026)#

Key Concepts in Lakeflow 2026#

1. Streaming Tables (Append-Only)#

2. Materialized Views (Recomputable)#

3. AUTO CDC (Change Data Capture)#

4. Derived Tables and Temp Tables#

Pattern: Building a Production Lakeflow Pipeline#

Lakeflow vs Hand-Rolled Spark Jobs#

Use Spark Jobs (Not Lakeflow) When:#

Performance and Cost Considerations#

Streaming Tables vs Materialized Views: Cost Tradeoffs#

Optimization in Lakeflow (2026)#

Monitoring and Cost Tracking#

Testing Lakeflow Pipelines#

Unit Test Pattern#

Integration Test Pattern#

The Operational Picture: From Lakeflow to Production#

Development → Staging → Production#

Lakeflow vs dbt#

The Future of Lakeflow (2026 and Beyond)#

Putting It Together: A Real Example#