Architecture

Threat Modeling for Engineers: Finding the Flaws Before Attackers Do

TL;DR
- A scanner finds bugs in code that already exists. Threat modeling finds flaws in a design before the code exists - which is the cheapest possible time to find them.
- It is a structured conversation built around four questions: what are we building, what can go wrong, what are we going to do about it, and did we do a good job.
- STRIDE gives you a vocabulary for “what can go wrong”: Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, and Elevation of privilege.
- You do not need a tool or a certificate. You need a diagram, the people who understand the system, and an hour.
- The highest-value moment to threat model is when the design is still cheap to change - and the most common mistake is treating it as a one-off audit instead of a habit.

Most security work, as people experience it day to day, is reactive. A scanner flags a vulnerable dependency. A penetration test produces a report. An alert fires. Someone patches the thing, closes the ticket, and moves on. This is necessary work, but it has a structural weakness: it can only find problems in systems that already exist. By the time a scanner can see a flaw, you have already built it, shipped it, and possibly run it in production for months. ...

System Design Fundamentals: Making Trade-offs You Won't Regret

TL;DR
- System design has no right answers, only trade-offs chosen deliberately or chosen by accident. The skill is making the choice consciously.
- Most decisions move along a few core axes: consistency against availability, latency against throughput, simplicity against flexibility, and build against buy.
- A good design states its assumptions - expected load, acceptable latency, failure tolerance - because a design is only “good” relative to assumptions.
- The most common self-inflicted wound is designing for scale you do not have. Complexity added for an imagined future is paid for every day until that future arrives, if it ever does.
- Write designs down. A short document that names the options, the choice, and the reason is worth more than any diagram.

There is a particular kind of interview question, and a particular kind of blog post, that treats system design as a body of correct answers - as if there were a known-good way to “design a URL shortener” or “design a news feed” and the job is to recall it. This framing is actively harmful, because it teaches people that system design is about memorising solutions. ...

Diagrams as Code: A Practitioner's Guide for Data Engineers

TL;DR
- Hand-drawn diagrams in Lucidchart, Visio, draw.io or Confluence rot because they live outside the codebase, cannot be diffed, and have no compiler to flag when they go stale. Diagrams as code closes all three gaps by treating the text source as truth and the rendered image as a build artefact.
- Pick by the question you are answering, not by taste. Mermaid for embedded docs and anything that has to render in GitHub. D2 for aesthetically polished architecture with real cloud icons. Python diagrams for AWS-heavy decks. PlantUML or Structurizr when you need formal UML or the C4 model.
- The conventions that make trust explicit: co-locate diagrams with the code they describe, add a metadata header with last_verified and next_review_due, encode confidence visually (verified / stale / proposed), pair each non-obvious diagram with an ADR, and render in CI.
- The highest-leverage move is to generate diagrams from the system itself - Terraform state, lineage graphs, dbt manifests, Airflow DAGs. A generated diagram is provably current by construction, which is a much stronger guarantee than “I reviewed it last quarter.”

If you have ever opened a Confluence page from two years ago and wondered whether the architecture it shows is still real, you have already met the problem this post is trying to fix. Hand-drawn diagrams in Lucidchart, Visio, draw.io or PowerPoint share three failure modes that no amount of governance ever quite eliminates. They live somewhere your code does not, so nobody updates them in the same PR that changes the system. They cannot be diffed, reviewed, or merged. And they rot silently, because there is no compiler error for “this picture is now a lie.” ...
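The metadata-header convention in the summary above could be enforced with a small staleness check run in CI. This is only a sketch under assumptions: the excerpt names the `last_verified` and `next_review_due` fields but not their format, so the `%%` comment-line layout (Mermaid's comment syntax), the ISO dates, and the helper names `parse_header` and `is_stale` are all hypothetical.

```python
from datetime import date


def parse_header(source: str) -> dict[str, str]:
    """Collect `key: value` pairs from leading `%%` comment lines.

    The header layout is an assumption made for this sketch.
    """
    meta: dict[str, str] = {}
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped.startswith("%%"):
            break  # the header ends at the first non-comment line
        key, sep, value = stripped[2:].partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def is_stale(source: str, today: date) -> bool:
    """Return True when the review date has passed or no header exists."""
    due = parse_header(source).get("next_review_due")
    if due is None:
        return True  # no header: nobody has vouched for this diagram
    return date.fromisoformat(due) < today


# Hypothetical diagram source with the assumed header.
EXAMPLE = """%% last_verified: 2026-01-10
%% next_review_due: 2026-04-10
flowchart LR
    ingest --> warehouse
"""
```

A CI job might call `is_stale` on every diagram file and fail the build when it returns True, which gives the diagram something close to the "compiler error" the excerpt says hand-drawn pictures lack.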

Real-Time Data Processing: Stream Processing vs Batch Processing

TL;DR
- Batch processes bounded data on a schedule; streaming processes unbounded data continuously - different operational profiles, not a religious choice.
- Streaming often costs 5-10x more per row than batch for the same volume; you pay for latency.
- Streaming earns its keep when event value decays fast: fraud, ops alerts, live dashboards, inventory sync.
- The lambda hybrid (streaming fast path + batch system of record) is what large platforms actually run.
- Default to batch in 2026; add streaming only where latency genuinely matters, and land raw events in object storage from day one.

If you spend enough time in data engineering, you will eventually encounter the conviction that batch processing is dying and streaming is the future. This is the third or fourth time the industry has had this conversation in my career, and the answer has been the same every time. Streaming is not the future. Batch is not the past. They are different tools with different operational profiles, and the systems that age well use both, with discipline about which is the right choice for which problem. ...

The Modern Lakehouse Stack: What Actually Belongs in Production

TL;DR
- A 2026 lakehouse has seven layers: object storage, open table format, catalog, compute engine, orchestration, transformation, and governance.
- Apache Iceberg is the default table format; catalog choice (Unity, Polaris, Nessie) depends on your primary engine.
- Databricks or Snowflake as compute, dbt or SQLMesh for transformation, orchestration via Dagster or Airflow.
- Governance and observability are the layer most often skipped and most expensive to retrofit.
- Default stack: ship end-to-end on a small slice first, then expand - do not spend six months evaluating before data flows.

The word “lakehouse” has been doing a lot of work for the last five years. It has been used to describe everything from a thin SQL layer over object storage to a fully integrated platform with governance, lineage, ML training, and BI built on top. Like most umbrella terms, this elasticity has been useful for marketers and confusing for engineers. ...

AI Agents That Actually Work: Patterns From Real Projects

TL;DR
- Most agent demos fail in production because demos operate in a regime where the model’s natural behaviour is good enough - production is longer, messier, and largely unobserved.
- Eight patterns separate agents that stay shipped from the ones that fall over: scope the loop, structured tool design, mandatory verification, curated context, first-class human handoff, idempotency, agent-level observability, and real evaluation infrastructure.
- Models confabulate actions - “I ran the tests” does not mean the tests were run; every agent needs explicit verification baked into the control flow, not bolted on as an afterthought.
- The tool layer between the model and underlying systems is where most of the engineering effort actually lives, and exposing raw APIs directly to the agent almost always goes wrong.
- Build agents the same way you would build any other long-running, partially-autonomous system you cannot afford to have fail silently - the novelty is in the failure modes, not the engineering principles.

I have spent the last eighteen months either building, reviewing, or operating systems that some marketing department somewhere has called “agents”. The definition has been so thoroughly stretched that it now means anything from a chatbot with a calculator tool to a long-running autonomous workflow that touches production infrastructure. Underneath the noise there is a real engineering discipline emerging, and the patterns that separate the systems that survive contact with real users from the ones that demo well and fall over are starting to be legible. ...
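One way to read "verification baked into the control flow" is that every step pairs the model's action with an independent check, and the loop refuses to continue on an unverified claim. The sketch below is an illustration only: `StepResult`, `run_step`, and `run_plan` are hypothetical names, and the excerpt does not say how the author's systems implement this.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    claim: str      # what the agent says it did
    verified: bool  # what an independent check actually observed


def run_step(act: Callable[[], str], verify: Callable[[], bool]) -> StepResult:
    """Run one action, then check the outside world instead of trusting the claim."""
    claim = act()
    return StepResult(claim=claim, verified=verify())


def run_plan(steps: list[tuple[Callable[[], str], Callable[[], bool]]]) -> list[StepResult]:
    """Run steps in order and stop at the first one that fails verification."""
    results: list[StepResult] = []
    for act, verify in steps:
        result = run_step(act, verify)
        results.append(result)
        if not result.verified:
            break  # hand off rather than build on an unverified claim
    return results
```

The point of the design is that `verify` reads system state (an exit code, a file, a test report) and never the model's own text, so "I ran the tests" only counts once something other than the model confirms it.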

When to Fine-Tune vs When to RAG: Choosing Your AI Architecture

TL;DR
- The default choice for most teams should be RAG - it is reversible in days, whereas a bad fine-tuning decision is an expensive sunk cost that requires retraining to fix.
- RAG fails when the question requires reasoning across an entire knowledge domain rather than extracting a specific answer from a passage; fine-tuning handles that case better.
- Fine-tuning fails silently when underlying facts change - it produces confidently wrong, stale answers with no warning; RAG automatically picks up changes at query time.
- A practical decision framework: use RAG for volatile facts and cited answers, use fine-tuning for stable style, voice, and cross-domain reasoning.
- The best production systems use both: a fine-tuned base model for stable domain knowledge, augmented with retrieval for current and specific information.

The question I get asked most often by engineers starting to build with language models is some variation of: “should we fine-tune or should we do RAG?” It is almost always the wrong question, but it is the wrong question in an instructive way. The reason it gets asked so much is that the choice feels architectural, and architectural choices feel like the kind of thing you commit to once and live with. In practice, the choice is closer to “should I use a database or a cache” - the answer is usually some of both, applied to different problems, and the ratio shifts as the system matures. ...
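The decision framework in the summary above can be written down as a small rule table. This is a rough sketch rather than the post's own method: the `Need` fields and the `recommend` function are hypothetical, and real decisions would weigh cost, data volume, and evaluation results that this ignores.

```python
from dataclasses import dataclass


@dataclass
class Need:
    facts_change_often: bool
    needs_citations: bool
    needs_stable_style: bool
    needs_cross_domain_reasoning: bool


def recommend(need: Need) -> set[str]:
    """Map requirements to techniques; the answer may be both."""
    choice: set[str] = set()
    if need.facts_change_often or need.needs_citations:
        choice.add("rag")
    if need.needs_stable_style or need.needs_cross_domain_reasoning:
        choice.add("fine-tune")
    # With no strong signal, fall back to the reversible default.
    return choice or {"rag"}
```

Returning a set rather than a single label reflects the excerpt's point that the two are applied to different problems in the same system, not chosen once for the whole architecture.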

Agent-First Architecture: The Engineer as System Curator

TL;DR
- Agent-first architecture imagines a future where the primary unit of work is an AI agent with intent, tools, memory, and a feedback loop - not a human-authored codebase.
- The engineer’s role may shift from building and maintaining systems line by line to curating, governing, and evolving fleets of agents.
- Glue code, routine maintenance, first-pass incident triage, and migration work are plausible candidates for automation; deciding what a system is for and holding architectural intent across time probably are not.
- Managing an agent fleet might resemble logistics fleet management: define intent, set constraints, design feedback loops, curate the roster, and own the outcomes.
- This is a speculative post, not a description of how anything works today - pinning down a hypothesis to revisit when it turns out to be wrong.

This is a “thinking out loud” post, not a report from the front lines. I have no evidence any of this is happening at scale, and it is not how my current day job looks. These are just ideas I keep turning over, and I wanted to write them down to see if they hold together. ...

Platform Engineering in 2026: What It Is and Why DevOps Teams Are Adopting It

TL;DR
- Platform engineering - building an internal developer platform (IDP) of golden paths, self-service environments, a developer portal, policy as code, and paved-road CI/CD - is the default shape of infrastructure teams larger than a dozen people in 2026.
- Four forces drove the convergence: cognitive load (the cloud-native stack is too big for one head), the DORA evidence linking platforms to elite performance, the regulatory ratchet, and AI agents.
- AI agents made 2026 the tipping point: an agent that can open PRs and apply Terraform changes is only safe inside a platform that enforces policy checks, cost caps, and blast-radius limits.
- Platform engineering is not a rebrand of DevOps - the platform team is a product team whose customers are other engineers.
- If you have no platform yet, start with the single most-painful golden path, not a portal.

Platform engineering used to be the title on a few job adverts at Spotify and Netflix. In 2026 it is the default shape of any infrastructure team larger than a dozen people. The shift is worth understanding, because it is not just a rebrand of DevOps - it is a different operating model, with different tools, different incentives, and a different relationship to the developers it serves. ...

What the Amiga Got Right (That We're Still Copying)

TL;DR
- The Amiga, launched in 1985 and dead by 1994, was a commercial failure - and almost every good idea in modern computing traces back to it.
- Preemptive multitasking, graphics compositing, hardware-accelerated audio and video, plug-and-play expansion, and system-wide scripting all shipped on the Amiga while IBM PCs were still effectively 8-bit.
- Jay Miner’s radical design used multiple custom processors, each with its own job - the same philosophy behind today’s GPUs and specialised silicon.
- What killed it was not the technology but Commodore’s management collapse.
- The lesson: deep architectural insight can put a machine a decade ahead, and still lose to distribution and business execution.

The Commodore Amiga was not the most successful computer. It was not the fastest. It was not the cheapest. It was introduced in 1985, bought by Commodore in a panic, and discontinued by 1994 as the company collapsed. By most commercial metrics, it was a failure. ...