Platform

Diagrams as Code: A Practitioner's Guide for Data Engineers

TL;DR

Hand-drawn diagrams in Lucidchart, Visio, draw.io or Confluence rot because they live outside the codebase, cannot be diffed, and have no compiler to flag when they go stale. Diagrams as code closes all three gaps by treating the text source as truth and the rendered image as a build artefact.

Pick by the question you are answering, not by taste:
Mermaid for embedded docs and anything that has to render in GitHub.
D2 for aesthetically polished architecture with real cloud icons.
Python diagrams for AWS-heavy decks.
PlantUML or Structurizr when you need formal UML or the C4 model.

The conventions that make trust explicit: co-locate diagrams with the code they describe, add a metadata header with last_verified and next_review_due, encode confidence visually (verified / stale / proposed), pair each non-obvious diagram with an ADR, and render in CI.

The highest-leverage move is to generate diagrams from the system itself - Terraform state, lineage graphs, dbt manifests, Airflow DAGs. A generated diagram is provably current by construction, which is a much stronger guarantee than “I reviewed it last quarter.”

If you have ever opened a Confluence page from two years ago and wondered whether the architecture it shows is still real, you have already met the problem this post is trying to fix. Hand-drawn diagrams in Lucidchart, Visio, draw.io or PowerPoint share three failure modes that no amount of governance ever quite eliminates. They live somewhere your code does not, so nobody updates them in the same PR that changes the system. They cannot be diffed, reviewed, or merged. And they rot silently, because there is no compiler error for “this picture is now a lie.” ...
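To make the "generate diagrams from the system itself" idea concrete, here is a minimal sketch in Python. It assumes a child-to-parents mapping shaped like the parent_map section of a dbt manifest.json; the function name mermaid_from_parent_map and the sample node names are hypothetical, chosen only for illustration. The output is Mermaid flowchart text, sorted so that re-running it on an unchanged system produces an unchanged diff.

```python
def mermaid_from_parent_map(parent_map: dict[str, list[str]]) -> str:
    """Render a child -> parents mapping as Mermaid flowchart source text."""

    def node_id(name: str) -> str:
        # Replace anything non-alphanumeric so the id is a plain token.
        return "".join(ch if ch.isalnum() else "_" for ch in name)

    lines = ["flowchart LR"]
    # Sort children and parents so the output is deterministic and diffable.
    for child in sorted(parent_map):
        for parent in sorted(parent_map[child]):
            lines.append(
                f'    {node_id(parent)}["{parent}"] --> {node_id(child)}["{child}"]'
            )
    return "\n".join(lines)


# Hypothetical sample, shaped like the parent_map of a dbt manifest.json.
sample = {
    "model.shop.orders": ["source.shop.raw_orders"],
    "model.shop.revenue": ["model.shop.orders", "model.shop.customers"],
    "model.shop.customers": [],
}

diagram = mermaid_from_parent_map(sample)
print(diagram)
```

In a real pipeline the mapping would be loaded from the generated manifest rather than written inline, and the resulting text committed or rendered in CI alongside the code it describes.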

The Modern Lakehouse Stack: What Actually Belongs in Production

TL;DR

A 2026 lakehouse has seven layers: object storage, open table format, catalog, compute engine, orchestration, transformation, and governance
Apache Iceberg is the default table format; catalog choice (Unity, Polaris, Nessie) depends on your primary engine
Databricks or Snowflake as compute, dbt or SQLMesh for transformation, orchestration via Dagster or Airflow
Governance and observability are the layer most often skipped and most expensive to retrofit
Default stack: ship end-to-end on a small slice first, then expand - do not spend six months evaluating before data flows

The word “lakehouse” has been doing a lot of work for the last five years. It has been used to describe everything from a thin SQL layer over object storage to a fully integrated platform with governance, lineage, ML training, and BI built on top. As with most umbrella terms, this elasticity has been useful for marketers and confusing for engineers. ...

The eBPF Revolution - What Every Platform Engineer Should Know

TL;DR

eBPF is the technology that lets you run safe, sandboxed programs inside the Linux kernel without writing kernel modules. In 2026 it is the foundation under most serious observability, networking, and runtime security tools.

The interesting story is not the technology itself - it is the wave of products built on top of it: Cilium for networking, Tetragon for runtime security, Pixie, Parca, and Coroot for observability, plus a long tail of vendor offerings using eBPF under the hood.

For platform engineers, eBPF is not “a thing you have to learn to write.” It is a thing you have to know about so you can choose tools intelligently and understand what is happening on your nodes when those tools cause problems.

The most important shift eBPF has enabled is observability without instrumentation. You can see what is happening on a system without modifying the application, without restarting it, and with low overhead. That is genuinely new.

What eBPF Actually Is

eBPF stands for “extended Berkeley Packet Filter,” which is historical and confusing because eBPF has long since outgrown packet filtering. The simple version: ...

Kubernetes in 2026 - Is It Still Worth the Complexity Tax?

TL;DR

Kubernetes won the orchestration argument years ago. The question is no longer “should we use Kubernetes.” It is “should this particular team, with this particular workload, with this particular budget, pay the operational tax.”

For genuinely large, multi-tenant, multi-region platforms with dedicated infrastructure teams, the answer is still mostly yes. The ecosystem maturity is unmatched and the alternatives lose at scale.

For mid-sized engineering organisations, the answer in 2026 is probably not, and increasingly not. Managed serverless, container platforms like Fly and Railway, and the new generation of platform-as-a-service offerings are competitive in ways they were not three years ago.

For startups and small teams, the answer is almost always no, and stop pretending otherwise.

The honest read in 2026: Kubernetes is the right answer to fewer questions than it used to be, and being honest about that is now a competitive advantage rather than a heresy.

How We Got Here

Kubernetes was the right idea at the right time. By the late 2010s, every serious engineering team needed an answer to “how do we run containers in production.” Kubernetes provided one, it was open, it was backed by a credible foundation, and the cloud providers all blessed it. Within five years it was the default. Within ten years it was the assumption. ...

Self-Hosted vs Managed in 2026 - The Cost Math Has Changed Again

TL;DR

The self-hosted vs managed decision in 2026 is genuinely different from the same decision in 2022. The math has shifted in three directions: cloud egress costs, AI workload economics, and self-hosted tooling maturity.

Managed remains the right default for most teams. The thing that has changed is that the threshold at which self-hosting becomes worth considering has dropped. Workloads that were obviously managed in 2022 are genuine 50/50 calls in 2026.

The most important shift is that self-hosting is no longer synonymous with on-premises. Modern self-hosting often means renting bare-metal in a colocation, running your own clusters in a hyperscaler, or using sovereign cloud providers - all with different economics.

For specific categories - AI inference at scale, data egress-heavy workloads, predictable steady-state compute, regulated environments - self-hosting now wins on cost more often than people assume.

The honest framing: managed is the right default; self-hosting is the right minority case; the minority is bigger than it used to be.

Why This Decision Got Harder

For most of the 2010s the answer was easy. Managed services were cheaper than self-hosting once you priced in operational overhead. The cloud providers competed aggressively. Self-hosting was for the regulated, the eccentric, and the very large. ...

DevOps Blogs

Good engineering blogs are one of the cheapest forms of mentorship available. The posts below are from teams and individuals I return to when I want to see how real organisations solve real problems - outages, scaling walls, migrations, and the occasional cultural mistake.

Vendor and Platform Blogs

These blogs publish architectural deep-dives and reference implementations. They are partly marketing, but the engineering detail is usually genuine.

Atlassian DevOps Blog - practitioner posts on pipelines, incident response, and team topology
AWS DevOps Blog - pipeline patterns, CDK/CodePipeline how-tos, and multi-account guidance
Google Cloud Blog - the SRE-flavoured material that originated at Google
Microsoft DevOps Blog - Azure DevOps, GitHub Actions, and developer platform posts
GitLab Blog - CI/CD and platform engineering content from the GitLab team
HashiCorp Blog - Terraform, Vault, Consul, and Nomad in production

Individual and Community Voices

Ricard Bejarano - SRE at Cisco with sharp posts on minimal container images and infrastructure hygiene
Charity Majors - co-founder of Honeycomb, writing extensively on observability and on-call culture
Julia Evans - illustrated explainers on Linux, networking, and debugging fundamentals
Gergely Orosz - The Pragmatic Engineer - deep dives into how large engineering organisations actually operate
High Scalability - architecture breakdowns of well-known systems

SRE-Specific Reading

Google SRE Books - the foundational texts on SRE as a discipline
Increment Magazine - long-form essays on on-call, incident response, and reliability
Everything DevOps (Reddit) - less polished, but a useful pulse on what practitioners are struggling with this week

How I Use This List

Blog posts age quickly. A Kubernetes best-practices post from 2019 may actively mislead you in 2026. When I read any of these, I check the date first and treat anything older than three years as historical context rather than current guidance. ...