Threat Modeling for Engineers - Finding the Flaws Before Attackers Do Banner

Threat Modeling for Engineers: Finding the Flaws Before Attackers Do

TL;DR A scanner finds bugs in code that already exists. Threat modeling finds flaws in a design before the code exists - which is the cheapest possible time to find them It is a structured conversation built around four questions: what are we building, what can go wrong, what are we going to do about it, and did we do a good job STRIDE gives you a vocabulary for “what can go wrong”: Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, and Elevation of privilege You do not need a tool or a certificate. You need a diagram, the people who understand the system, and an hour The highest-value moment to threat model is when the design is still cheap to change - and the most common mistake is treating it as a one-off audit instead of a habit Most security work, as people experience it day to day, is reactive. A scanner flags a vulnerable dependency. A penetration test produces a report. An alert fires. Someone patches the thing, closes the ticket, and moves on. This is necessary work, but it has a structural weakness: it can only find problems in systems that already exist. By the time a scanner can see a flaw, you have already built it, shipped it, and possibly run it in production for months. ...

May 20, 2026 · 9 min · James M
System Design Fundamentals - Making Trade-offs You Won't Regret Banner

System Design Fundamentals: Making Trade-offs You Won't Regret

TL;DR System design has no right answers, only trade-offs chosen deliberately or chosen by accident. The skill is making the choice consciously Most decisions move along a few core axes: consistency against availability, latency against throughput, simplicity against flexibility, and build against buy A good design states its assumptions - expected load, acceptable latency, failure tolerance - because a design is only “good” relative to assumptions The most common self-inflicted wound is designing for scale you do not have. Complexity added for an imagined future is paid for every day until that future arrives, if it ever does Write designs down. A short document that names the options, the choice, and the reason is worth more than any diagram There is a particular kind of interview question, and a particular kind of blog post, that treats system design as a body of correct answers - as if there were a known-good way to “design a URL shortener” or “design a news feed” and the job is to recall it. This framing is actively harmful, because it teaches people that system design is about memorising solutions. ...

May 19, 2026 · 8 min · James M
Diagrams as Code Banner

Diagrams as Code: A Practitioner's Guide for Data Engineers

TL;DR Hand-drawn diagrams in Lucidchart, Visio, draw.io or Confluence rot because they live outside the codebase, cannot be diffed, and have no compiler to flag when they go stale. Diagrams as code closes all three gaps by treating the text source as truth and the rendered image as a build artefact. Pick by the question you are answering, not by taste. Mermaid for embedded docs and anything that has to render in GitHub. D2 for aesthetically polished architecture with real cloud icons. Python diagrams for AWS-heavy decks. PlantUML or Structurizr when you need formal UML or the C4 model. The conventions that make trust explicit: co-locate diagrams with the code they describe, add a metadata header with last_verified and next_review_due, encode confidence visually ( verified / stale / proposed ), pair each non-obvious diagram with an ADR, and render in CI. The highest-leverage move is to generate diagrams from the system itself - Terraform state, lineage graphs, dbt manifests, Airflow DAGs. A generated diagram is provably current by construction, which is a much stronger guarantee than “I reviewed it last quarter.” If you have ever opened a Confluence page from two years ago and wondered whether the architecture it shows is still real, you have already met the problem this post is trying to fix. Hand-drawn diagrams in Lucidchart, Visio, draw.io or PowerPoint share three failure modes that no amount of governance ever quite eliminates. They live somewhere your code does not, so nobody updates them in the same PR that changes the system. They cannot be diffed, reviewed, or merged. And they rot silently, because there is no compiler error for “this picture is now a lie.” ...

May 18, 2026 · 21 min · James M
Stream vs Batch Processing Banner

Real-Time Data Processing: Stream Processing vs Batch Processing

If you spend enough time in data engineering, you will eventually encounter the conviction that batch processing is dying and streaming is the future. This is the third or fourth time the industry has had this conversation in my career, and the answer has been the same every time. Streaming is not the future. Batch is not the past. They are different tools with different operational profiles, and the systems that age well use both, with discipline about which is the right choice for which problem. ...

May 10, 2026 · 9 min · James M
The Modern Lakehouse Stack Banner

The Modern Lakehouse Stack: What Actually Belongs in Production

The word “lakehouse” has been doing a lot of work for the last five years. It has been used to describe everything from a thin SQL layer over object storage to a fully integrated platform with governance, lineage, ML training, and BI built on top. Like most umbrella terms, this elasticity has been useful for marketers and confusing for engineers. This post is the version of the conversation I would have with a senior engineer who has been asked to “build out our lakehouse” and wants to know which pieces are load-bearing and which are noise. It draws on what I have actually seen ship and survive in production data platforms in 2026, and it tries to be specific about why each layer is in the stack rather than just describing the picture as a fait accompli. ...

May 8, 2026 · 9 min · James M
AI Agents That Actually Work Banner

AI Agents That Actually Work: Patterns From Real Projects

TL;DR Most agent demos fail in production because demos operate in a regime where the model’s natural behaviour is good enough - production is longer, messier, and largely unobserved Eight patterns separate agents that stay shipped from the ones that fall over: scope the loop, structured tool design, mandatory verification, curated context, first-class human handoff, idempotency, agent-level observability, and real evaluation infrastructure Models confabulate actions - “I ran the tests” does not mean the tests were run; every agent needs explicit verification baked into the control flow, not bolted on as an afterthought The tool layer between the model and underlying systems is where most of the engineering effort actually lives, and exposing raw APIs directly to the agent almost always goes wrong Build agents the same way you would build any other long-running, partially-autonomous system you cannot afford to have fail silently - the novelty is in the failure modes, not the engineering principles I have spent the last eighteen months either building, reviewing, or operating systems that some marketing department somewhere has called “agents”. The definition has been so thoroughly stretched that it now means anything from a chatbot with a calculator tool to a long-running autonomous workflow that touches production infrastructure. Underneath the noise there is a real engineering discipline emerging, and the patterns that separate the systems that survive contact with real users from the ones that demo well and fall over are starting to be legible. ...

May 1, 2026 · 11 min · James M
When to Fine-Tune vs When to RAG Banner

When to Fine-Tune vs When to RAG: Choosing Your AI Architecture

TL;DR The default choice for most teams should be RAG - it is reversible in days, whereas a bad fine-tuning decision is an expensive sunk cost that requires retraining to fix RAG fails when the question requires reasoning across an entire knowledge domain rather than extracting a specific answer from a passage; fine-tuning handles that case better Fine-tuning fails silently when underlying facts change - it produces confidently wrong, stale answers with no warning; RAG automatically picks up changes at query time A practical decision framework: use RAG for volatile facts and cited answers, use fine-tuning for stable style, voice, and cross-domain reasoning The best production systems use both: a fine-tuned base model for stable domain knowledge, augmented with retrieval for current and specific information The question I get asked most often by engineers starting to build with language models is some variation of: “should we fine-tune or should we do RAG?” It is almost always the wrong question, but it is the wrong question in an instructive way. The reason it gets asked so much is that the choice feels architectural, and architectural choices feel like the kind of thing you commit to once and live with. In practice, the choice is closer to “should I use a database or a cache” - the answer is usually some of both, applied to different problems, and the ratio shifts as the system matures. ...

April 29, 2026 · 11 min · James M
Agent-First Architecture Banner

Agent-First Architecture: The Engineer as System Curator

TL;DR Agent-first architecture imagines a future where the primary unit of work is an AI agent with intent, tools, memory, and a feedback loop - not a human-authored codebase The engineer’s role may shift from building and maintaining systems line by line to curating, governing, and evolving fleets of agents Glue code, routine maintenance, first-pass incident triage, and migration work are plausible candidates for automation; deciding what a system is for and holding architectural intent across time probably are not Managing an agent fleet might resemble logistics fleet management: define intent, set constraints, design feedback loops, curate the roster, and own the outcomes This is a speculative post, not a description of how anything works today - the author is pinning down a hypothesis to revisit when it turns out to be wrong This is a “thinking out loud” post, not a report from the front lines. I have no evidence any of this is happening at scale, and it is not how my current day job looks. These are just ideas I keep turning over, and I wanted to write them down to see if they hold together. ...

April 23, 2026 · 13 min · James M
Platform Engineering in 2026 Banner

Platform Engineering in 2026: What It Is and Why DevOps Teams Are Adopting It

Platform engineering used to be the title on a few job adverts at Spotify and Netflix. In 2026 it is the default shape of any infrastructure team larger than a dozen people. The shift is worth understanding, because it is not just a rebrand of DevOps - it is a different operating model, with different tools, different incentives, and a different relationship to the developers it serves. This post is a plain-language walk through what platform engineering actually is, why the industry has converged on it, and how the arrival of AI agents is reshaping the discipline mid-flight. ...

April 22, 2026 · 8 min · James M

What the Amiga Got Right (That We're Still Copying)

What the Amiga Got Right (That We’re Still Copying) The Commodore Amiga was not the most successful computer. It was not the fastest. It was not the cheapest. It was introduced in 1985, bought by Commodore in a panic, and discontinued by 1994 as the company collapsed. By most commercial metrics, it was a failure. Yet almost every good idea in modern computing traces back to the Amiga. Preemptive multitasking. Graphics layers and compositing. Named pipes. Memory protection. Hardware acceleration. Plug-and-play peripherals. Scripting languages. Digital audio and video editing. Networking. The Amiga did these things in 1985 when IBM PCs were still running in 8-bit mode. ...

April 3, 2026 · 10 min · James M