Following the Money: Databricks vs Snowflake vs the Open-Source Alternative

A deep dive into the shifting economics of the data landscape in 2026. Why the choice between Snowflake and Databricks is increasingly an accounting decision, and where the open-source DIY stack actually saves you money.

April 8, 2026 · 4 min · James Myddelton

Modern Data Engineering on Databricks (2026 Guide)

The 2026 Databricks Baseline

Databricks in 2026 looks much more opinionated than it did just a few years ago. For most new data engineering work, the default stack is now clear:

- Unity Catalog for governance
- managed tables where possible
- serverless compute for notebooks, SQL, pipelines, and jobs
- Lakeflow Declarative Pipelines for batch and streaming data products
- liquid clustering instead of old-style partition design for many workloads

That shift matters because the platform has moved beyond “bring your own clusters and tune everything manually.” The modern Databricks approach is increasingly declarative, governed, and automated. ...
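To make the last point concrete, here is a minimal sketch of liquid clustering in Databricks SQL; the table and column names are hypothetical:

```sql
-- Instead of PARTITIONED BY, declare clustering keys and let the platform
-- manage the physical layout (hypothetical table):
CREATE TABLE sales (
  order_id    BIGINT,
  customer_id BIGINT,
  order_date  DATE,
  amount      DECIMAL(10, 2)
) CLUSTER BY (order_date, customer_id);

-- Clustering keys can be changed later without rewriting the table:
ALTER TABLE sales CLUSTER BY (customer_id);
```

Unlike Hive-style partitioning, changing the clustering keys is a metadata operation, which is a large part of why it has displaced manual partition design.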

April 6, 2026 · 7 min · James M

Lakeflow Declarative Pipelines: From DLT to Production

If you’ve been writing Delta Live Tables (DLT) pipelines, you’ve been building with Lakeflow without knowing the new name. In 2026, the rebranding matters because it signals how Databricks now wants you to think about declarative pipeline design. This isn’t just a rename: the mental model has shifted from “tables and dependencies” to “data flows and transformations.” Let me show you what changed and why it matters.

What Lakeflow Actually Is

Lakeflow Declarative Pipelines is the modern Databricks way to say: “I describe what data I want, and Databricks manages how to get it.” ...
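In Databricks SQL, that declarative contract can be sketched roughly like this (the dataset and path names are made up for illustration):

```sql
-- A streaming table declares incremental ingestion from raw files:
CREATE OR REFRESH STREAMING TABLE raw_orders
AS SELECT * FROM STREAM read_files('/Volumes/main/sales/landing', format => 'json');

-- A materialized view declares the transformed result; the pipeline engine
-- works out dependencies, incremental refresh, and orchestration:
CREATE OR REFRESH MATERIALIZED VIEW daily_orders
AS SELECT order_date, count(*) AS order_count
FROM raw_orders
GROUP BY order_date;
```

You describe the datasets and their relationships; scheduling, retries, and dependency ordering are the engine's problem, not yours.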

April 5, 2026 · 9 min · James M

Unity Catalog in Practice: Lessons From the Field

Unity Catalog sounds straightforward: “one governance layer for all your data and AI assets.” In theory, it’s elegant. In practice, you’ll run into gotchas that the docs don’t prepare you for. This post is from the field - patterns that work, mistakes I’ve seen repeated, and how to actually build a sustainable governance layer in 2026.

What Unity Catalog Is (And Isn’t)

What It Is

A unified access control and metadata layer for: ...
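As a rough sketch of what that layer looks like in SQL: Unity Catalog addresses every object through a three-level catalog.schema.object namespace, and reading a table requires privileges at every level of that hierarchy (the catalog, schema, table, and group names below are hypothetical):

```sql
-- Access to a table requires privileges on its containers too:
GRANT USE CATALOG ON CATALOG prod TO `analysts`;
GRANT USE SCHEMA ON SCHEMA prod.finance TO `analysts`;
GRANT SELECT ON TABLE prod.finance.invoices TO `analysts`;
```

Forgetting the container-level grants is one of the most common gotchas: a SELECT grant alone does nothing if the group cannot USE the catalog and schema above it.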

April 5, 2026 · 10 min · James M

Databricks vs Snowflake in 2026: An Honest Comparison

The question “Databricks or Snowflake?” has dominated data engineering conversations for the past five years. In 2026, it’s still the wrong question. But let me answer it anyway, because sometimes you have to pick one.

The Honest Framing

By 2026, both platforms have converged in surprising ways:

- Databricks started as a Spark compute engine and added warehouse features
- Snowflake started as a cloud data warehouse and added Iceberg support for lakehouse semantics
- Both now claim to be “lakehouses” that combine data lake flexibility with warehouse performance

The difference isn’t in capability - it’s in architectural DNA, operational model, and what they expect you to optimize for. ...

April 5, 2026 · 11 min · James M

Data Engineering & Data Science Courses

How to Use This Guide

This curated list covers courses from beginner to advanced levels across multiple platforms. Choose based on:

- Your role: Data Engineer, Data Analyst, or Data Scientist
- Learning style: Self-paced courses, specializations, or nanodegrees
- Timeline: Single courses (weeks) vs. comprehensive programs (months)
- Hands-on practice: Most include projects and real-world scenarios
- Cloud platform: AWS, GCP, Azure, or multi-cloud approaches

Data Engineering Professional Certificates (Industry-Backed)

Best for: Structured learning with recognized credentials ...

April 4, 2026 · 5 min · James M

Databricks Training & Certification

Overview

Databricks offers certification tracks aligned to common roles: Data Engineer, Data Analyst, Apache Spark Developer, Machine Learning Engineer, and Generative AI Engineer.

All certifications:

- Validity: 2 years from pass date
- Cost: $200 per exam attempt
- Format: Multiple choice, proctored online
- Recent updates (2025): Emphasis on DLT (Delta Live Tables), Unity Catalog, Lakehouse Federation, and Auto Loader

Choose a certification based on your:

- Current role: Align with job responsibilities
- Experience level: Associate = entry-level (or 6-12 months experience), Professional = production experience
- Platform focus: Lakehouse architecture or specific tools (Delta Lake, MLflow)

Official Databricks Resources

Start here for authoritative training materials and exam information: ...

April 4, 2026 · 4 min · James M

Databricks CheatSheet

Quick Start

This cheatsheet covers essential Databricks notebook commands, SQL operations, PySpark transformations, and optimization techniques for the lakehouse platform.

Databricks Notebook Commands

Magic commands provide shortcuts for common operations in Databricks notebooks:

| Command | Purpose | Use Case |
|---------|---------|----------|
| `%python` | Executes Python code (default language) | PySpark transformations, data processing |
| `%sql` | Executes SQL queries | Querying tables and views |
| `%scala` | Executes Scala code | Spark API operations, JVM access |
| `%r` | Executes R code | Statistical analysis and visualization |
| `%sh` | Runs shell commands on cluster nodes | Git operations, system utilities |
| `%fs` | Databricks file system operations | File management, DBFS interactions |
| `%md` | Renders Markdown text | Documentation and cell titles |
| `%pip` | Installs Python packages | Adding Python dependencies |
| `%env` | Sets environment variables | Configuration and secrets |
| `%config` | Notebook configuration options | Display settings, execution parameters |
| `%jobs` | Lists all running jobs | Job monitoring |
| `%load` | Loads external file contents | Including external code |
| `%reload` | Reloads Python modules | Refreshing imports |
| `%run` | Executes another notebook | Code reuse and modularization |
| `%lsmagic` | Lists all available magic commands | Discovery |
| `%who` | Lists variables in current scope | Debugging and variable inspection |
| `%matplotlib` | Configures the matplotlib backend | Visualization setup |

Notebook Widgets

```python
# Create widgets
dbutils.widgets.text("param_name", "default_value", "label")
dbutils.widgets.dropdown("param_name", "default", ["option1", "option2"])
dbutils.widgets.multiselect("param_name", "default", ["option1", "option2"])
dbutils.widgets.combobox("param_name", "default", ["option1", "option2"])

# Get widget values
param_value = dbutils.widgets.get("param_name")

# Remove widgets
dbutils.widgets.remove("param_name")
dbutils.widgets.removeAll()
```

Secrets Management

Secret scopes are created and populated outside the notebook, via the Databricks CLI or the Secrets REST API; `dbutils.secrets` only reads them:

```python
# Retrieve a secret (values are redacted if printed in notebook output)
secret_value = dbutils.secrets.get("scope_name", "secret_key")

# List secrets in a scope, and list available scopes
dbutils.secrets.list("scope_name")
dbutils.secrets.listScopes()
```

Accessing Files

- `/path/to/file` (local)
- `dbfs:/path/to/file` (DBFS)
- `file:/path/to/file` (driver filesystem)
- `s3://path/to/file` (S3)
- `/Volumes/catalog/schema/volume/path` (Unity Catalog Volumes)

Copying Files

```
%fs cp file:/<path> /Volumes/<catalog>/<schema>/<volume>/<path>
```

```python
dbutils.fs.cp("file:/<path>", "/Volumes/<catalog>/<schema>/<volume>/<path>")
dbutils.fs.cp("file:/databricks/driver/test", "dbfs:/repo", True)  # True = recursive
```

```
%sh cp /<path> /Volumes/<catalog>/<schema>/<volume>/<path>
```

SQL Statements

DDL - Data Definition Language (Schema & Table Operations)

Create & Use Schema

```sql
CREATE SCHEMA test;
CREATE SCHEMA custom LOCATION 'dbfs:/custom';
USE SCHEMA test;
```

Unity Catalog (UC)

```sql
-- Create catalog
CREATE CATALOG my_catalog COMMENT "Production catalog";

-- Create schema in UC
CREATE SCHEMA my_catalog.my_schema;
USE CATALOG my_catalog;
USE SCHEMA my_schema;

-- Create volume (for files)
CREATE VOLUME my_catalog.my_schema.my_volume;
ALTER VOLUME my_catalog.my_schema.my_volume OWNER TO `team@company.com`;

-- List catalogs, schemas, volumes
SHOW CATALOGS;
SHOW SCHEMAS IN my_catalog;
SHOW VOLUMES IN my_catalog.my_schema;

-- Grant permissions
GRANT USE CATALOG ON CATALOG my_catalog TO `user@company.com`;
GRANT READ VOLUME ON VOLUME my_catalog.my_schema.my_volume TO `user@company.com`;
```

Create Table

```sql
CREATE TABLE test(col1 INT, col2 STRING, col3 STRING, col4 BIGINT, col5 INT, col6 FLOAT);

CREATE TABLE test AS SELECT * EXCEPT (_rescued_data) FROM read_files('/repo/data/test.csv');
CREATE TABLE test USING CSV LOCATION '/repo/data/test.csv';
CREATE TABLE test USING CSV OPTIONS (header = "true") LOCATION '/repo/data/test.csv';

CREATE TABLE test(id INT, title STRING, col1 STRING, publish_time BIGINT, pages INT, price FLOAT)
COMMENT 'This is a comment for the table itself';

CREATE TABLE test AS SELECT * EXCEPT (_rescued_data) FROM read_files('/repo/data/test.json', format => 'json');
CREATE TABLE test_raw AS SELECT * EXCEPT (_rescued_data) FROM read_files('/repo/data/test.csv', sep => ';');
CREATE TABLE custom_table_test LOCATION 'dbfs:/custom-table' AS SELECT * EXCEPT (_rescued_data) FROM read_files('/repo/data/test.csv');
CREATE TABLE test PARTITIONED BY (col1) AS SELECT * EXCEPT (_rescued_data) FROM read_files('/repo/data/test.csv');

CREATE TABLE users(
  firstname STRING,
  lastname STRING,
  full_name STRING GENERATED ALWAYS AS (concat(firstname, ' ', lastname))
);

CREATE OR REPLACE TABLE test AS SELECT * EXCEPT (_rescued_data) FROM read_files('/repo/data/test.csv');
CREATE OR REPLACE TABLE test AS SELECT * FROM json.`/repo/data/test.json`;
```

Create View

```sql
CREATE VIEW view_test AS SELECT * FROM test WHERE col1 = 'test';
CREATE VIEW view_test AS SELECT test.col1, test2.col1 FROM test JOIN test2 ON test.col2 = test2.col2;

CREATE TEMP VIEW temp_test AS SELECT * FROM test WHERE col1 = 'test';
CREATE TEMP VIEW temp_test AS SELECT * FROM read_files('/repo/data/test.csv');

CREATE GLOBAL TEMP VIEW view_test AS SELECT * FROM test WHERE col1 = 'test';
SELECT * FROM global_temp.view_test;

CREATE TEMP VIEW jdbc_example
USING JDBC
OPTIONS (
  url "<jdbc-url>",
  dbtable "<table-name>",
  user '<username>',
  password '<password>'
);

CREATE OR REPLACE TEMP VIEW test AS SELECT * FROM delta.`<logpath>`;
CREATE VIEW event_log_raw AS SELECT * FROM event_log("<pipeline-id>");

CREATE OR REPLACE TEMP VIEW test_view AS
SELECT col1 FROM test_table
WHERE col1 = 'value1'
ORDER BY timestamp DESC LIMIT 1;
```

Drop & Describe

```sql
DROP TABLE test;
SHOW TABLES;
DESCRIBE EXTENDED test;
```

DML - Data Manipulation Language (Data Operations)

Select

```sql
SELECT * FROM csv.`/repo/data/test.csv`;
SELECT * FROM read_files('/repo/data/test.csv');
SELECT * FROM read_files('/repo/data/test.csv', format => 'csv', header => 'true', sep => ',');
SELECT * FROM json.`/repo/data/test.json`;
SELECT * FROM json.`/repo/data/*.json`;

SELECT * FROM test WHERE year(from_unixtime(test_time)) > 1900;
SELECT * FROM test WHERE title LIKE '%a%';
SELECT * FROM test WHERE title LIKE 'a%';
SELECT * FROM test WHERE title LIKE '%a';

-- Time travel
SELECT * FROM test TIMESTAMP AS OF '2024-01-01T00:00:00.000Z';
SELECT * FROM test VERSION AS OF 2;
SELECT * FROM test@v2;

SELECT * FROM event_log("<pipeline-id>");

-- count(*) includes NULL rows, count(col) does not
SELECT count(*) FROM VALUES (NULL), (10), (10) AS example(col);
SELECT count(col) FROM VALUES (NULL), (10), (10) AS example(col);
SELECT count_if(col1 = 'test') FROM test;

-- Timestamp conversions
SELECT from_unixtime(test_time) FROM test;
SELECT cast(test_time / 1 AS timestamp) FROM test;
SELECT cast(cast(test_time AS BIGINT) AS timestamp) FROM test;

-- Nested and array data
SELECT element.sub_element FROM test;
SELECT flatten(array(array(1, 2), array(3, 4)));

-- Pivot
SELECT * FROM (SELECT col1, col2 FROM test)
PIVOT (sum(col1) FOR col2 IN ('item1', 'item2'));

-- Conditional expressions
SELECT *, CASE WHEN col1 > 10 THEN 'value1' ELSE 'value2' END FROM test;
SELECT * FROM test ORDER BY (CASE WHEN col1 > 10 THEN col2 ELSE col3 END);

-- Common table expression
WITH t(col1, col2) AS (SELECT 1, 2) SELECT * FROM t WHERE col1 = 1;

-- Query pipeline event logs
SELECT details:flow_definition.output_dataset AS output_dataset,
       details:flow_definition.input_datasets AS input_dataset
FROM event_log_raw, latest_update
WHERE event_type = 'flow_definition' AND origin.update_id = latest_update.id;
```

Insert

```sql
INSERT OVERWRITE test SELECT * FROM read_files('/repo/data/test.csv');
INSERT INTO test(col1, col2) VALUES ('value1', 'value2');
```

Merge Into

```sql
MERGE INTO test USING test_to_delete ON test.col1 = test_to_delete.col1
WHEN MATCHED THEN DELETE;

MERGE INTO test USING test_to_update ON test.col1 = test_to_update.col1
WHEN MATCHED THEN UPDATE SET *;

MERGE INTO test USING test_to_insert ON test.col1 = test_to_insert.col1
WHEN NOT MATCHED THEN INSERT *;
```

Copy Into

```sql
COPY INTO test FROM '/repo/data'
FILEFORMAT = CSV
FILES = ('test.csv')
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');
```

Spark DataFrame API

PySpark is the Python API for Apache Spark, enabling distributed data processing on the Databricks platform. ...

April 4, 2026 · 9 min · James M