Diagrams as Code: A Practitioner's Guide for Data Engineers

TL;DR

Hand-drawn diagrams in Lucidchart, Visio, draw.io or Confluence rot because they live outside the codebase, cannot be diffed, and have no compiler to flag when they go stale. Diagrams as code closes all three gaps by treating the text source as truth and the rendered image as a build artefact. Pick by the question you are answering, not by taste. Mermaid for embedded docs and anything that has to render in GitHub. D2 for aesthetically polished architecture with real cloud icons. Python diagrams for AWS-heavy decks. PlantUML or Structurizr when you need formal UML or the C4 model. The conventions that make trust explicit: co-locate diagrams with the code they describe, add a metadata header with last_verified and next_review_due, encode confidence visually (verified / stale / proposed), pair each non-obvious diagram with an ADR, and render in CI. The highest-leverage move is to generate diagrams from the system itself - Terraform state, lineage graphs, dbt manifests, Airflow DAGs. A generated diagram is provably current by construction, which is a much stronger guarantee than “I reviewed it last quarter.”

If you have ever opened a Confluence page from two years ago and wondered whether the architecture it shows is still real, you have already met the problem this post is trying to fix. Hand-drawn diagrams in Lucidchart, Visio, draw.io or PowerPoint share three failure modes that no amount of governance ever quite eliminates. They live somewhere your code does not, so nobody updates them in the same PR that changes the system. They cannot be diffed, reviewed, or merged. And they rot silently, because there is no compiler error for “this picture is now a lie.” ...
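The metadata-header convention named in the TL;DR (last_verified and next_review_due) lends itself to a mechanical check in CI. What follows is a minimal sketch, not the post's own implementation. It assumes the header is written as Mermaid comment lines (`%% key: YYYY-MM-DD`); that layout, the function names `read_header` and `review_status`, and the three status strings are all hypothetical choices made for illustration.

```python
import re
from datetime import date

# Assumed (hypothetical) header convention: Mermaid comment lines such as
#   %% last_verified: 2026-01-10
#   %% next_review_due: 2026-07-10
HEADER_RE = re.compile(
    r"^%%\s*(last_verified|next_review_due):\s*(\d{4}-\d{2}-\d{2})\s*$"
)


def read_header(source: str) -> dict[str, date]:
    """Collect the two date fields from the diagram source, if present."""
    fields: dict[str, date] = {}
    for line in source.splitlines():
        match = HEADER_RE.match(line.strip())
        if match:
            fields[match.group(1)] = date.fromisoformat(match.group(2))
    return fields


def review_status(source: str, today: date) -> str:
    """Return 'missing-header', 'stale', or 'verified' for one diagram."""
    header = read_header(source)
    due = header.get("next_review_due")
    if "last_verified" not in header or due is None:
        return "missing-header"
    # A diagram counts as stale once its review date has passed.
    return "stale" if today > due else "verified"


EXAMPLE = """%% last_verified: 2026-01-10
%% next_review_due: 2026-07-10
flowchart LR
    ingest --> transform --> serve
"""
```

A CI job could call `review_status` on every diagram file in the repository and fail the build on anything other than "verified"; whether a stale diagram should fail the build or only warn is a policy choice this sketch leaves open.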

May 18, 2026 · 19 min · James M

Real-Time Data Processing: Stream Processing vs Batch Processing

If you spend enough time in data engineering, you will eventually encounter the conviction that batch processing is dying and streaming is the future. This is the third or fourth time the industry has had this conversation in my career, and the answer has been the same every time. Streaming is not the future. Batch is not the past. They are different tools with different operational profiles, and the systems that age well use both, with discipline about which is the right choice for which problem. ...

May 10, 2026 · 9 min · James M

The Modern Lakehouse Stack: What Actually Belongs in Production

The word “lakehouse” has been doing a lot of work for the last five years. It has been used to describe everything from a thin SQL layer over object storage to a fully integrated platform with governance, lineage, ML training, and BI built on top. As with most umbrella terms, this elasticity has been useful for marketers and confusing for engineers. This post is the version of the conversation I would have with a senior engineer who has been asked to “build out our lakehouse” and wants to know which pieces are load-bearing and which are noise. It draws on what I have actually seen ship and survive in production data platforms in 2026, and it tries to be specific about why each layer is in the stack rather than just describing the picture as a fait accompli. ...

May 8, 2026 · 9 min · James M

AI-Native Pipelines - What Changes When Your Consumer Is an LLM, Not a Dashboard

TL;DR

Data pipelines were optimised for human consumers - dashboards, BI tools, analysts. In 2026 a growing share of pipeline output flows directly to language models, agents, and retrieval systems. That changes the design constraints in ways that catch teams off guard. Aggregation matters less. Context fidelity matters more. Freshness behaves differently. Schema moves from rigid to negotiated. Cost shifts from compute to tokens. The biggest mistake is treating an LLM consumer as if it were just another dashboard. It is not. It does not skim, it does not interpret charts, it does not have working memory across rows. It needs to be fed. The new patterns - retrieval-aware partitioning, embedding pipelines, structured-document outputs, prompt-shaped views, evaluation harnesses for data quality - are the actual subject of “AI-native data engineering” in 2026.

The Underlying Shift

For thirty years the implicit consumer of every data pipeline was a human looking at a screen. Even when the pipeline ended in an API or a CSV, the conceptual end-user was someone who would interpret the output with judgement, context, and skim-reading. ...

May 3, 2026 · 9 min · James M

Iceberg vs Delta vs Hudi in 2026 - The Format Wars Are Over

TL;DR

The open table format war between Apache Iceberg, Delta Lake, and Apache Hudi is effectively over in 2026 - and the outcome is not a single winner but a clear settlement. Iceberg has won the role of the neutral standard that engines and platforms expect to read and write. It is the format you choose when you do not want to be coupled to a single vendor. Delta has won the role of the incumbent default inside the Databricks ecosystem and remains a strong choice if Databricks is your primary engine. Delta UniForm has narrowed the gap by letting Delta tables expose Iceberg metadata. Hudi has not won a category outright. It retains a smaller but loyal user base for streaming-heavy and CDC-heavy workloads, where its design choices still genuinely fit. The interesting battle has moved up the stack to the catalog layer. The format question is mostly settled. The catalog question is the new fight.

The Format Wars - A Brief History

For most of the early 2020s the lakehouse story was a three-way argument about how to put ACID transactions on top of object storage. ...

May 3, 2026 · 8 min · James M

The Catalog Layer Is the New Battleground - Unity, Polaris, Gravitino, Nessie

TL;DR

With the open table format wars largely settled, the strategic fight in 2026 has moved up to the catalog layer - the system that manages tables, namespaces, governance, and access. Four credible open or open-ish catalogs are now in serious play: Unity Catalog (Databricks), Polaris (Snowflake), Apache Gravitino (Datastrato/community), and Project Nessie (Dremio/community). All four implement the Iceberg REST catalog spec to varying degrees, which means clients can talk to them through a common protocol. The differentiation has moved to governance, multi-tenancy, lineage, federation, and developer experience. Unity is the most production-mature and the most coupled to Databricks. Polaris is the cleanest open implementation of the REST spec. Gravitino is the most ambitious in scope - aiming to catalog non-table assets too. Nessie is the most opinionated about Git-style branching for data. The winning catalog will probably not be a single project. It will be the protocol (Iceberg REST) plus multiple compliant implementations plus federation between them. That is the picture 2026 ends with.

Why The Catalog Layer Matters Now

A table format defines how data is laid out on disk. A catalog defines: ...

May 2, 2026 · 8 min · James M

Agent-First Architecture: The Engineer as System Curator

TL;DR

Agent-first architecture imagines a future where the primary unit of work is an AI agent with intent, tools, memory, and a feedback loop - not a human-authored codebase. The engineer’s role may shift from building and maintaining systems line by line to curating, governing, and evolving fleets of agents. Glue code, routine maintenance, first-pass incident triage, and migration work are plausible candidates for automation; deciding what a system is for and holding architectural intent across time probably are not. Managing an agent fleet might resemble logistics fleet management: define intent, set constraints, design feedback loops, curate the roster, and own the outcomes. This is a speculative post, not a description of how anything works today - the author is pinning down a hypothesis to revisit when it turns out to be wrong.

This is a “thinking out loud” post, not a report from the front lines. I have no evidence any of this is happening at scale, and it is not how my current day job looks. These are just ideas I keep turning over, and I wanted to write them down to see if they hold together. ...

April 23, 2026 · 13 min · James M

Apache Iceberg in 2026: The Open Table Format That Won

In 2023, the question was “which open table format will survive - Iceberg, Delta, or Hudi?” In 2026, that debate is over. Apache Iceberg won, and it won for reasons that have almost nothing to do with its raw performance. It won because it is the only format that both Snowflake and Databricks now treat as a first-class citizen, because the vendors picked sides on catalogs rather than table formats, and because enterprise buyers decided that multi-engine portability was worth more than a small performance edge. ...

April 22, 2026 · 11 min · James M

Following the Money: Databricks vs Snowflake vs the Open-Source Alternative

The views in this post are my own personal reflections on the data industry, written in my own time. They are not about any specific employer, team, or colleague, past or present, and do not draw on any non-public information.

In 2026, the technical gap between Databricks and Snowflake has narrowed to a sliver. Both offer world-class serverless compute, both support Iceberg/Delta as first-class citizens, and both have integrated AI agents that can write SQL better than your average intern. ...

April 8, 2026 · 4 min · James M

Lakeflow Declarative Pipelines: From DLT to Production

If you’ve been writing Delta Live Tables (DLT) pipelines, you’ve been building with Lakeflow without knowing the new name. In 2026, the rebranding matters because it signals how Databricks now wants you to think about declarative pipeline design. This isn’t just a rename. The mental model has shifted from “tables and dependencies” to “data flows and transformations.” Let me show you what changed and why it matters. For where Lakeflow fits relative to other orchestration choices and the broader paradigm question, see The modern lakehouse stack and Stream vs batch processing. ...

April 6, 2026 · 9 min · James M