Data Engineering

Data Engineering Resources

Practical guides and technical deep dives on modern data engineering, covering lakehouse platforms, open table formats, streaming architectures, and real-world implementation lessons. This section includes:

Courses & Certifications - Curated learning paths from Coursera, DataCamp, A Cloud Guru, and specialized providers
Databricks Deep Dives - Cheatsheets, 2026 innovations, certification guides, and Claude Opus 4.7 integration on Databricks
Platform Comparisons - Head-to-head analysis of Databricks vs Snowflake in 2026, including cost and architectural tradeoffs
Open Table Formats - Apache Iceberg in 2026, Snowflake Storage for Iceberg, and choosing the right open format
Modern Lakehouse Architecture - Production-ready stack recommendations, governance patterns, and stream vs batch processing decisions
Governance & Pipelines - Unity Catalog implementation lessons and Lakeflow Declarative Pipelines for production
ETL & Integration Tools - Overview of data integration, orchestration, and real-time processing platforms
Blogs & Communities - Curated list of authoritative data engineering voices

From foundational courses to advanced architecture decisions, these resources guide you through building scalable, cost-effective data pipelines and understanding the tradeoffs that shape real-world engineering decisions.

Snowflake Storage for Apache Iceberg: Enterprise Open Data Comes to AWS and Azure

A New Era for Open Data Formats

Snowflake has announced the general availability of Snowflake Storage for Apache Iceberg on both AWS and Azure, marking a significant shift in how enterprises can build open, interoperable data lakehouses. This development combines Snowflake’s enterprise reliability and governance capabilities with the flexibility and openness of Apache Iceberg, one of the most promising open table formats in the data ecosystem. For a deeper look at Iceberg itself, see Apache Iceberg in 2026, and for where this sits in the broader platform picture see The modern lakehouse stack. ...

Following the Money: Databricks vs Snowflake vs the Open-Source Alternative

The views in this post are my own personal reflections on the data industry, written in my own time. They are not about any specific employer, team, or colleague, past or present, and do not draw on any non-public information.

In 2026, the technical gap between Databricks and Snowflake has narrowed to a sliver. Both offer world-class serverless compute, both support Iceberg/Delta as first-class citizens, and both have integrated AI agents that can write SQL better than your average intern. ...

Lakeflow Declarative Pipelines: From DLT to Production

TL;DR

Lakeflow Declarative Pipelines is the evolution of Delta Live Tables, and the rename signals a real shift in mental model: from “tables and dependencies” to “data flows and transformations”
The three core building blocks are streaming tables (incremental, append-only), materialized views (full recompute, best for aggregations), and AUTO CDC for slowly-changing dimensions without hand-rolled merge logic
Physical optimisation is increasingly automatic in 2026 - liquid clustering is the default, predictive optimization handles maintenance, and Z-order is legacy
Keep hand-rolled Spark jobs for imperative logic, external API calls, and ML workloads; Lakeflow is for SQL-shaped data movement
Lakeflow and dbt are complementary rather than competitors - some teams use Lakeflow for ingestion to silver and dbt for silver-to-gold

If you’ve been writing Delta Live Tables (DLT) pipelines, you’ve been building with Lakeflow without knowing the new name. In 2026, the rebranding matters because it signals how Databricks now wants you to think about declarative pipeline design. ...
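The three building blocks named in the TL;DR differ mainly in how much data each update touches. The sketch below is a toy model in plain Python, not the Lakeflow API: the function names, the row shapes, and the `seq` ordering column are all hypothetical, and the CDC step keeps only the latest version of each key (an SCD type 1 style upsert), which is one assumption about how the dimension is maintained.

```python
from collections import defaultdict

# Hypothetical toy model, NOT the Lakeflow API: plain Python structures stand in
# for pipeline datasets so the three update strategies can be compared.


def append_incremental(target, new_rows):
    """Streaming-table style: only newly arrived rows are processed and appended."""
    target.extend(new_rows)
    return len(new_rows)  # number of rows processed in this update


def recompute_totals(source_rows):
    """Materialized-view style: the aggregate is rebuilt from the full source."""
    totals = defaultdict(int)
    for row in source_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


def apply_cdc(dimension, changes, key="customer", sequence="seq"):
    """CDC upsert: keep the latest version of each key, ordered by a sequence column."""
    for change in sorted(changes, key=lambda c: c[sequence]):
        current = dimension.get(change[key])
        if current is None or change[sequence] >= current[sequence]:
            dimension[change[key]] = change
    return dimension


orders = []
append_incremental(orders, [{"customer": "a", "amount": 10},
                            {"customer": "b", "amount": 5}])
processed = append_incremental(orders, [{"customer": "a", "amount": 7}])
totals = recompute_totals(orders)

# Changes arrive out of order; the sequence column decides which one wins.
customers = apply_cdc({}, [
    {"customer": "a", "city": "Leeds", "seq": 2},
    {"customer": "a", "city": "York", "seq": 1},
])
print(processed, totals, customers["a"]["city"])
```

The second append processes one row, while the aggregate scans all three; that asymmetry is the practical reason to treat the two dataset types differently.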

Modern Data Engineering on Databricks (2026 Guide)

The 2026 Databricks Baseline

Databricks in 2026 looks much more opinionated than it did just a few years ago. For most new data engineering work, the default stack is now clear:

Unity Catalog for governance
managed tables where possible
serverless compute for notebooks, SQL, pipelines, and jobs
Lakeflow Declarative Pipelines for batch and streaming data products
liquid clustering instead of old-style partition design for many workloads

That shift matters because the platform has moved beyond “bring your own clusters and tune everything manually.” The modern Databricks approach is increasingly declarative, governed, and automated. ...

Data Engineering Blogs

Modern Data Stack & Engineering

Core Blogs & Publications

Start Data Engineering - Practical guides, tutorials, and real-world projects for building scalable data platforms from scratch.
Seattle Data Guy - Balance of business strategy and technical implementation in modern data engineering.
Eclectic Data - Deep technical analysis of data infrastructure, distributed systems, and architectural patterns.
Benn Stancil’s Blog - Strategic insights and industry commentary on analytics, data culture, and organizational challenges.

Platform & Tool Blogs

Airbyte Blog - Data integration, ELT approaches, and best practices for data movement at scale.
Databricks Blog - Comprehensive coverage of Apache Spark, Delta Lake, and Lakehouse architectural patterns.
LakeFS Blog - Data versioning, governance, and data lakes as code principles.
dbt Blog - Analytics engineering workflows, SQL best practices, and modern data transformation.
Apache Airflow Blog - Workflow orchestration patterns, DAG design, and production deployment strategies.
Kafka Blog - Stream processing, real-time data architectures, and event-driven systems.
Redpanda Blog - Kafka ecosystem evolution, streaming data pipelines, and cost optimization.

Podcasts & Multimedia

The Data Engineering Podcast - Interviews and deep dives into data tools, techniques, and industry practitioners.
DataFramed Podcast - Conversations on data careers, best practices, and emerging technologies.

Data Warehousing & Analytics

Snowflake Blog - Cloud data warehouse innovations, performance optimization, and enterprise data strategies.
Google Cloud Data Analytics Blog - BigQuery best practices, modern data stack integration, and Google Cloud data solutions.
Restack Blog - Data infrastructure comparisons, architecture patterns, and cost optimization strategies.

Communities & Learning

Online Communities

DataTalks.Club - Free community-driven courses, job board, and peer-to-peer learning for data professionals.
r/dataengineering - Active community discussions, career advice, and industry insights.
dbt Community - Slack workspace, forums, and networking for analytics engineers and data teams.

Learning Resources

Data Engineering Fundamentals - Comprehensive guide covering data architecture, ETL/ELT, and system design.
Engineer Codehouse - Practical tutorials and guides for modern data stack technologies.

Industry News & Trends

The Data Stack News - Weekly roundup of news, funding announcements, and updates across the data ecosystem.
KDnuggets - News, tutorials, and discussions on data science, machine learning, and data engineering.
Data Engineering Weekly - Curated newsletter featuring tools, articles, and thought leadership in data engineering.
The Pragmatic Engineer - Data - Engineering-led analysis with frequent data platform deep dives.

Open Table Format & Lakehouse

Apache Iceberg Blog - Official updates on the open table format increasingly central to the 2026 lakehouse.
Tabular Blog - Deep technical writing on Iceberg internals and multi-engine lakehouse design.
Dremio Blog - Query engines, Iceberg, and open data architecture.
Onehouse Blog - Hudi and open lakehouse patterns.

Transformation & Analytics Engineering

dbt Developer Blog - Analytics engineering patterns and practical SQL modelling guidance.
Tobiko / SQLMesh Blog - Next-generation transformation framework with virtual environments.
Locally Optimistic - Long-form posts on analytics engineering culture and practice.

Related Reading

Data Engineering & Data Science Courses
ETL Tools & Data Integration Platforms
Lakeflow Declarative Pipelines: From DLT to Production
Iceberg vs Delta vs Hudi in 2026 - The Format Wars Are Over

Databricks vs Snowflake in 2026: An Honest Comparison

TL;DR

By 2026, Databricks and Snowflake have converged - both claim lakehouse status; the old binary is outdated
Databricks wins on transformation, ML, and cost at scale; Snowflake wins on SQL simplicity and BI integration
Choose on workload (analytics vs ETL vs ML), team skills (SQL-first vs code-first), budget, and existing ecosystem
Iceberg support on both sides makes multi-engine portability real - the mistake is choosing on hype, not fit
For stack context, see The Modern Lakehouse Stack

The views in this post are my own personal reflections on the data industry, written in my own time. They are not about any specific employer, team, or colleague, past or present, and do not draw on any non-public information. ...

Data Engineering & Data Science Courses

How to Use This Guide

This curated list covers courses from beginner to advanced levels across multiple platforms. Choose based on:

Your role: Data Engineer, Data Analyst, or Data Scientist
Learning style: Self-paced courses, specializations, or nanodegrees
Timeline: Single courses (weeks) vs. comprehensive programs (months)
Hands-on practice: Most include projects and real-world scenarios
Cloud platform: AWS, GCP, Azure, or multi-cloud approaches

Data Engineering Professional Certificates (Industry-Backed)

Best for: Structured learning with recognized credentials ...

Databricks CheatSheet

Quick Start

This cheatsheet covers essential Databricks notebook commands, SQL operations, PySpark transformations, and optimization techniques for the lakehouse platform.

Databricks Notebook Commands

Magic commands provide shortcuts for common operations in Databricks notebooks:

| Command | Purpose | Use Case |
| --- | --- | --- |
| %python | Executes Python code (default language) | PySpark transformations, data processing |
| %sql | Executes SQL queries | Querying tables and views |
| %scala | Executes Scala code | Spark API operations, JVM access |
| %r | Executes R code | Statistical analysis and visualization |
| %sh | Shell commands on cluster nodes | Git operations, system utilities |
| %fs | Databricks file system operations | File management, DBFS interactions |
| %md | Markdown text formatting | Documentation and cell titles |
| %pip | Install Python packages | Adding Python dependencies |
| %env | Set environment variables | Configuration and secrets |
| %config | Notebook configuration options | Display settings, execution parameters |
| %jobs | Lists all running jobs | Job monitoring |
| %load | Load external file contents | Include external code |
| %reload | Reload Python modules | Refresh imports |
| %run | Execute another notebook | Code reuse and modularization |
| %lsmagic | List all available magic commands | Discovery |
| %who | List variables in current scope | Debugging and variable inspection |
| %matplotlib | Configure matplotlib backend | Visualization setup |

Notebook Widgets

    # Create widgets (for dropdown and multiselect, the default must be one of the listed options)
    dbutils.widgets.text("param_name", "default_value", "label")
    dbutils.widgets.dropdown("param_name", "option1", ["option1", "option2"])
    dbutils.widgets.multiselect("param_name", "option1", ["option1", "option2"])
    dbutils.widgets.combobox("param_name", "option1", ["option1", "option2"])

    # Get widget values
    param_value = dbutils.widgets.get("param_name")

    # Remove widget
    dbutils.widgets.remove("param_name")
    dbutils.widgets.removeAll()

Secrets Management

    # dbutils.secrets is read-only: create scopes and store or delete secrets
    # with the Databricks CLI or the Secrets REST API, not from a notebook

    # Retrieve secret
    secret_value = dbutils.secrets.get("scope_name", "secret_key")

    # List secrets
    dbutils.secrets.list("scope_name")

Accessing Files

    /path/to/file (local)
    dbfs:/path/to/file (DBFS)
    file:/path/to/file (driver filesystem)
    s3://path/to/file (S3)
    /Volumes/catalog/schema/volume/path (Unity Catalog Volumes)

Copying Files

    %fs cp file:/<path> /Volumes/<catalog>/<schema>/<volume>/<path>
    %python dbutils.fs.cp("file:/<path>", "/Volumes/<catalog>/<schema>/<volume>/<path>")
    %python dbutils.fs.cp("file:/databricks/driver/test", "dbfs:/repo", True)
    %sh cp /<path> /Volumes/<catalog>/<schema>/<volume>/<path>

SQL Statements

DDL - Data Definition Language (Schema & Table Operations)

Create & Use Schema

    CREATE SCHEMA test;
    CREATE SCHEMA custom LOCATION 'dbfs:/custom';
    USE SCHEMA test;

Unity Catalog (UC)

    -- Create catalog
    CREATE CATALOG my_catalog COMMENT "Production catalog";

    -- Create schema in UC
    CREATE SCHEMA my_catalog.my_schema;
    USE CATALOG my_catalog;
    USE SCHEMA my_schema;

    -- Create volume (for files)
    CREATE VOLUME my_catalog.my_schema.my_volume;
    ALTER VOLUME my_catalog.my_schema.my_volume OWNER TO `team@company.com`;

    -- List catalogs, schemas, volumes
    SHOW CATALOGS;
    SHOW SCHEMAS IN my_catalog;
    SHOW VOLUMES IN my_catalog.my_schema;

    -- Grant permissions (Unity Catalog privilege names)
    GRANT USE CATALOG ON CATALOG my_catalog TO `user@company.com`;
    GRANT READ VOLUME ON VOLUME my_catalog.my_schema.my_volume TO `user@company.com`;

Create Table

    CREATE TABLE test(col1 INT, col2 STRING, col3 STRING, col4 BIGINT, col5 INT, col6 FLOAT);
    CREATE TABLE test AS SELECT * EXCEPT (_rescued_data) FROM read_files('/repo/data/test.csv');
    CREATE TABLE test USING CSV LOCATION '/repo/data/test.csv';
    CREATE TABLE test USING CSV OPTIONS (header="true") LOCATION '/repo/data/test.csv';
    CREATE TABLE test AS ...
    CREATE TABLE test USING ...
    CREATE TABLE test(id INT, title STRING, col1 STRING, publish_time BIGINT, pages INT, price FLOAT) COMMENT 'This is comment for the table itself';
    CREATE TABLE test AS SELECT * EXCEPT (_rescued_data) FROM read_files('/repo/data/test.json', format => 'json');
    CREATE TABLE test_raw AS SELECT * EXCEPT (_rescued_data) FROM read_files('/repo/data/test.csv', sep => ';');
    CREATE TABLE custom_table_test LOCATION 'dbfs:/custom-table' AS SELECT * EXCEPT (_rescued_data) FROM read_files('/repo/data/test.csv');
    CREATE TABLE test PARTITIONED BY (col1) AS SELECT * EXCEPT (_rescued_data) FROM read_files('/repo/data/test.csv');
    CREATE TABLE users(
      firstname STRING,
      lastname STRING,
      full_name STRING GENERATED ALWAYS AS (concat(firstname, ' ', lastname))
    );
    CREATE OR REPLACE TABLE test AS SELECT * EXCEPT (_rescued_data) FROM read_files('/repo/data/test.csv');
    CREATE OR REPLACE TABLE test AS SELECT * FROM json.`/repo/data/test.json`;
    CREATE OR REPLACE TABLE test AS SELECT * FROM read_files('/repo/data/test.csv');

Create View

    CREATE VIEW view_test AS SELECT * FROM test WHERE col1 = 'test';
    -- Qualify and alias columns that exist in both tables; a view cannot expose two columns with the same name
    CREATE VIEW view_test AS SELECT test.col1 AS col1, test2.col1 AS test2_col1 FROM test JOIN test2 ON test.col2 = test2.col2;
    CREATE TEMP VIEW temp_test AS SELECT * FROM test WHERE col1 = 'test';
    CREATE TEMP VIEW temp_test AS SELECT * FROM read_files('/repo/data/test.csv');
    CREATE GLOBAL TEMP VIEW view_test AS SELECT * FROM test WHERE col1 = 'test';
    SELECT * FROM global_temp.view_test;
    CREATE TEMP VIEW jdbc_example USING JDBC OPTIONS (
      url "<jdbc-url>",
      dbtable "<table-name>",
      user '<username>',
      password '<password>');
    CREATE OR REPLACE TEMP VIEW test AS SELECT * FROM delta.`<logpath>`;
    CREATE VIEW event_log_raw AS SELECT * FROM event_log("<pipeline-id>");
    CREATE OR REPLACE TEMP VIEW test_view AS SELECT col1 FROM test_table WHERE col1 = 'value1' ORDER BY timestamp DESC LIMIT 1;

Drop & Describe

    DROP TABLE test;
    SHOW TABLES;
    DESCRIBE EXTENDED test;

DML - Data Manipulation Language (Data Operations)

Select

    SELECT * FROM csv.`/repo/data/test.csv`;
    SELECT * FROM read_files('/repo/data/test.csv');
    SELECT * FROM read_files('/repo/data/test.csv', format => 'csv', header => 'true', sep => ',');
    SELECT * FROM json.`/repo/data/test.json`;
    SELECT * FROM json.`/repo/data/*.json`;
    SELECT * FROM test WHERE year(from_unixtime(test_time)) > 1900;
    SELECT * FROM test WHERE title LIKE '%a%';
    SELECT * FROM test WHERE title LIKE 'a%';
    SELECT * FROM test WHERE title LIKE '%a';
    SELECT * FROM test TIMESTAMP AS OF '2024-01-01T00:00:00.000Z';
    SELECT * FROM test VERSION AS OF 2;
    SELECT * FROM test@v2;
    SELECT * FROM event_log("<pipeline-id>");
    SELECT count(*) FROM VALUES (NULL), (10), (10) AS example(col);
    SELECT count(col) FROM VALUES (NULL), (10), (10) AS example(col);
    SELECT count_if(col1 = 'test') FROM test;
    SELECT from_unixtime(test_time) FROM test;
    SELECT cast(test_time / 1 AS timestamp) FROM test;
    SELECT cast(cast(test_time AS BIGINT) AS timestamp) FROM test;
    SELECT element.sub_element FROM test;
    SELECT flatten(array(array(1, 2), array(3, 4)));
    SELECT * FROM (
      SELECT col1, col2 FROM test
    ) PIVOT (
      sum(col1) FOR col2 IN ('item1', 'item2')
    );
    SELECT *, CASE WHEN col1 > 10 THEN 'value1' ELSE 'value2' END FROM test;
    SELECT * FROM test ORDER BY (CASE WHEN col1 > 10 THEN col2 ELSE col3 END);
    WITH t(col1, col2) AS (SELECT 1, 2) SELECT * FROM t WHERE col1 = 1;
    SELECT
      details:flow_definition.output_dataset AS output_dataset,
      details:flow_definition.input_datasets AS input_dataset
    FROM event_log_raw, latest_update
    WHERE event_type = 'flow_definition' AND origin.update_id = latest_update.id;

Insert

    INSERT OVERWRITE test SELECT * FROM read_files('/repo/data/test.csv');
    INSERT INTO test(col1, col2) VALUES ('value1', 'value2');

Merge Into

    MERGE INTO test USING test_to_delete ON test.col1 = test_to_delete.col1 WHEN MATCHED THEN DELETE;
    MERGE INTO test USING test_to_update ON test.col1 = test_to_update.col1 WHEN MATCHED THEN UPDATE SET *;
    MERGE INTO test USING test_to_insert ON test.col1 = test_to_insert.col1 WHEN NOT MATCHED THEN INSERT *;

Copy Into

    COPY INTO test FROM '/repo/data' FILEFORMAT = CSV FILES = ('test.csv') FORMAT_OPTIONS('header' = 'true', 'inferSchema' = 'true');

Spark DataFrame API

PySpark is the Python API for Apache Spark, enabling distributed data processing on the Databricks platform. ...
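The two `count` queries in the Select list above differ only in how they treat NULL: `count(*)` counts rows, while `count(col)` skips NULL values. As a minimal sketch, the same three values can be checked locally with Python's built-in `sqlite3` module. SQLite stands in for Databricks SQL here purely as an assumption of convenience; the NULL-counting rule is standard SQL, but other functions in the cheatsheet do not exist in SQLite.

```python
import sqlite3

# SQLite is a local stand-in (an assumption); the cheatsheet targets Databricks SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (col INTEGER)")
conn.executemany("INSERT INTO example (col) VALUES (?)", [(None,), (10,), (10,)])

# count(*) counts every row; count(col) ignores NULLs;
# count(DISTINCT col) ignores NULLs and collapses duplicates.
count_all, count_col, count_distinct = conn.execute(
    "SELECT count(*), count(col), count(DISTINCT col) FROM example"
).fetchone()
conn.close()

print(count_all, count_col, count_distinct)
```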

Databricks Training & Certification

Overview

Databricks offers certification tracks aligned to common roles: Data Engineer, Data Analyst, Apache Spark Developer, Machine Learning Engineer, and Generative AI Engineer.

All certifications:

Validity: 2 years from pass date
Cost: $200 per exam attempt
Format: Multiple choice, proctored online
Recent Updates (2026): Emphasis on Lakeflow Declarative Pipelines (the evolution of DLT), Unity Catalog, liquid clustering, predictive optimization, AUTO CDC, Lakehouse Federation, and serverless compute

Choose a certification based on your: ...

Unity Catalog in Practice: Lessons From the Field

The views in this post are my own personal reflections on industry patterns, written in my own time. They are not about any specific employer, team, or colleague, past or present, and do not draw on any non-public information.

TL;DR

Unity Catalog is a unified access-control and metadata layer for tables, volumes, models, and notebooks - it is not a data-quality tool, a discovery engine, or a masking system, and teams expecting those will be disappointed
Migrating from Hive metastore remains the biggest operational challenge in 2026; the hybrid path (migrate reference data first, stage the rest) is the most common in practice
Design catalogs around medallion layers (bronze/silver/gold), not per-environment schema sprawl, and grant permissions only through roles, never directly to users
Budget realistically: $50k-$200k of engineering time for large-organisation migrations, roughly a third to half of one engineer’s time ongoing, and under 5% query overhead
Skip UC for single-team startups, sandbox data, and some streaming workloads; full adoption typically takes 6-12 months

Unity Catalog sounds straightforward: “one governance layer for all your data and AI assets.” In theory, it’s elegant. In practice, you’ll run into gotchas that docs don’t prepare you for. ...
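One way to keep the "grant through roles, never directly to users" rule enforceable is to generate GRANT statements from a small declarative mapping and reject anything that looks like an individual user. The sketch below is illustrative only: the catalog names follow the bronze/silver/gold layers mentioned above, but the group names, the privilege assignments, and the "contains @ means a user" check are all assumptions, with account groups standing in for roles.

```python
# Hypothetical mapping: group names and privilege choices are illustrative
# assumptions, not taken from the post. Catalogs follow the medallion layers.
LAYER_PRIVILEGES = {
    "bronze": {"data_engineers": ["USE CATALOG", "USE SCHEMA", "SELECT", "MODIFY"]},
    "silver": {"data_engineers": ["USE CATALOG", "USE SCHEMA", "SELECT", "MODIFY"]},
    "gold": {
        "data_engineers": ["USE CATALOG", "USE SCHEMA", "SELECT", "MODIFY"],
        "analysts": ["USE CATALOG", "USE SCHEMA", "SELECT"],
    },
}


def render_grants(layer_privileges):
    """Render one GRANT statement per (catalog, group) pair.

    Principals containing '@' are treated as individual users and rejected,
    which is a simplifying assumption about how users are named.
    """
    statements = []
    for catalog, groups in layer_privileges.items():
        for group, privileges in groups.items():
            if "@" in group:
                raise ValueError(f"grant to a group, not a user: {group}")
            privilege_list = ", ".join(privileges)
            statements.append(
                f"GRANT {privilege_list} ON CATALOG {catalog} TO `{group}`;"
            )
    return statements


statements = render_grants(LAYER_PRIVILEGES)
for statement in statements:
    print(statement)
```

Keeping the mapping in version control means access changes go through review, and the generated statements can be diffed before anything is applied.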
