What I'm Researching in AI Right Now - And Where I'm Going Next

TL;DR

- I treat my own learning like a research agenda - a small set of questions I am actively chasing, not a reading list I feel guilty about
- The work I have been deep in clusters into four areas: agent reliability and non-determinism, context engineering and memory, the economics of intelligence, and the open-weight and small-model frontier
- The areas I have decided to move into next are the ones where I keep hitting questions I cannot answer well: securing agents that hold real tool access, evaluating agents on their trajectory rather than their final answer, world models beyond the language-only era, and the machine-to-machine agent economy
- I treat AGI timelines less as a forecast to win and more as a planning input - what changes for an engineer if capable autonomous systems arrive in three years rather than fifteen
- I am deliberately not chasing every frontier. Quantum machine learning and neuromorphic hardware sit on my watch list, not my work list, and being honest about that line is the whole point

Most people consume AI news. I used to do the same - a feed of model releases, benchmark claims, and launch threads that left me feeling informed and changed nothing about what I could actually build. ...

June 8, 2026 · 12 min · James M

Trust: Conditions for Deploying AI Agents in Production

TL;DR

- The Trust series is my answer to one question: what has to be true before you can hand a non-deterministic system a real job and walk away?
- Read in this order: research map → evals → security → world models → trajectory evaluation
- Supporting posts cover reliability, context engineering, and safety foundations
- Full series index: /series/trust/

Start here

1. What I’m Researching in AI Right Now - the research map and trust through-line
2. AI Evals Are Broken - why public benchmarks stopped measuring real capability
3. Securing AI Agents - MCP hardening, confused deputy, and what I run on my home stack
4. World Models: What Comes After the Language-Only Era - when text-only agents hit their ceiling
5. Evaluating Agents in Production: Trajectory Metrics - step-level scoring, not just final answers

Supporting reading

- AI Agents That Actually Work - patterns from real projects
- The Agent Reliability Problem - debugging non-deterministic systems
- Context Engineering - curating the window across a whole agent run
- AI Reliability Is Weird - why testing LLMs breaks familiar QA
- AI Safety From First Principles - engineering safety vs speculative scenarios

Related paths

- Home Agent Stack - build the stack these defenses protect
- AI Dev Tooling - the coding-agent side of the same problem

June 8, 2026 · 1 min · James M

Context Engineering: The Discipline That Replaced Prompt Engineering

TL;DR

- Prompt engineering optimised the wording of a single human-written request. Context engineering optimises the entire set of tokens in the model’s window across a whole run - system prompt, tool definitions, retrieved documents, tool results, conversation history, and memory
- The shift happened because of agents. The window is no longer one prompt you wrote - it is an accumulation that grows on every step, and most of it is produced by the system, not by you
- More context is not better context. Research on “context rot” and the older lost-in-the-middle effect shows that model accuracy degrades as the window fills, even well below the advertised limit
- The four levers are retrieval (what you pull in), memory (what persists across runs), tool results (what tools dump back), and compaction (what you summarise and discard)
- Treat the window as a budget. Measure its token composition, design tools to return terse output, curate rather than accumulate, and keep the static prefix stable so prompt caching still works

For a few years, “prompt engineering” was the named skill of working with language models. It meant finding the wording, the framing, the few-shot examples, and the role instructions that coaxed the best answer out of a single request. It produced a small industry of prompt libraries, prompt marketplaces, and job titles. And in 2026 it is mostly gone, absorbed into something larger and harder. ...
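As a rough illustration of "treat the window as a budget" and "measure its token composition", here is a minimal sketch. The function names, the category labels, and the 4-characters-per-token estimate are all assumptions made for the example, not anything the post specifies; a real setup would use the model's own tokenizer.

```python
from collections import Counter


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: about 4 characters per token (assumption)."""
    if not text:
        return 0
    return max(1, len(text) // 4)


def window_composition(segments):
    """Summarise estimated tokens per category of the context window.

    segments: iterable of (category, text) pairs, e.g. ("tool_results", "...").
    Returns {category: (estimated_tokens, share_of_window)}.
    """
    counts = Counter()
    for category, text in segments:
        counts[category] += estimate_tokens(text)
    total = sum(counts.values())
    return {cat: (n, n / total if total else 0.0) for cat, n in counts.items()}


def over_budget(composition, category, max_share):
    """True if one category takes more than max_share of the window."""
    _tokens, share = composition.get(category, (0, 0.0))
    return share > max_share
```

Calling `window_composition` on every agent step and checking `over_budget(comp, "tool_results", 0.5)` is one way to notice when tool output is crowding out everything else; the 0.5 threshold is arbitrary and would need tuning per workload.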

May 20, 2026 · 11 min · James M

AI Dev Tooling: A Reading Path for 2026

TL;DR

- Start with What Actually Belongs in My AI Dev Stack in 2026 - the canonical stack essay
- Then An AI Tooling Learning Path - phased skill-building order
- Deep dives below cover comparisons and spec-driven workflows; single-tool posts are briefs, not entry points

Canonical essays

- What Actually Belongs in My AI Dev Stack in 2026
- An AI Tooling Learning Path: Logical Phases for 2026
- Context Engineering - the production skill behind reliable coding agents
- Spec-Driven Development - when the brief becomes the product

Deep dives

- Claude Code vs Cursor: A 6-Month Comparison
- GitHub Spec Kit and Spec-Driven Development
- GitHub Spec Kit in 2026: SDD Goes Mainstream
- My AI-Augmented Design Workflow
- When to Fine-Tune vs When to RAG

Briefs (moment-in-time)

These are useful snapshots, not the starting point: ...

May 20, 2026 · 1 min · James M

AI Economics and Hardware: A Reading Path

TL;DR

- Cost is a design constraint, not an afterthought - model tier, context size, and deployment location are economic decisions
- Read the essays below in any order; start with Token Economics if you only have time for one
- Pairs with open-weight models and local inference guides

Core essays

- Token Economics: Why the Cost of AI Isn’t Going Down
- GPU Servers vs AI API Credits: The Real Cost Breakdown
- Local AI vs Cloud AI: The Tradeoff Landscape in 2026
- The AI Energy Crisis: Why Data Center Power Will Define the Next Decade
- Cerebras, Groq, SambaNova: The Inference Hardware Insurgents

Adjacent

- The State of Open-Weight Models in 2026 - when open weights beat closed APIs on price
- Prompt Caching - the quiet latency and cost win
- The Token Efficiency Mindset - curating spend per conversation
- Is the $20 AI Subscription Era Over? We Are Learning to Buy Intelligence

May 18, 2026 · 1 min · James M

Home Agent Stack: From Mac Studio to Secured MCP Tools

TL;DR

- This path walks through the full stack I run on a Mac Studio: local models → MCP tools → memory → remote access → security
- Almost no other blogs document the build and the hardening layer together
- Finish with Securing AI Agents before giving the agent real filesystem or mail access
- Part of the broader Trust series

Read in order

1. Which Mac Studio Should You Buy for Running LLMs Locally? - hardware and model sizing
2. Giving Your Home AI Agent Real Tools: MCP Servers on a Mac Studio - wiring the tool layer
3. Giving Your Home AI Agent Memory That Lasts - persistence across sessions
4. How to Phone Your Home AI Agent - remote access when you are away
5. Securing AI Agents - least privilege, confirmation gates, audit logs

Adjacent guides

- Running AI Models Locally with Ollama - lighter-weight local inference option
- Agent Protocols in 2026: MCP, A2A, and ACP - the protocol layer
- Local AI vs Cloud AI - when to host vs call APIs
- DGX Spark vs Mac Studio - if you are sizing a dedicated inference box

May 15, 2026 · 1 min · James M

The Agent Reliability Problem: Debugging Non-Deterministic Systems

The conventional reliability engineering toolkit was built for systems that behaved the same way each time given the same input. AI agents do not behave the same way each time given the same input. The classic tools - unit tests, integration tests, deterministic replay, traditional monitoring - all assume a property that the systems being operated do not have. This mismatch is not a small operational annoyance; it is the central challenge of running AI agents in production, and the patterns for handling it are still being worked out. ...
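One common way to test a system that does not return the same output twice is to replace a single pass/fail assertion with a pass rate over repeated trials. The sketch below shows that general idea only; it is not presented as the approach the post goes on to describe, and `pass_rate`, `make_flaky_agent`, and the canned outputs are hypothetical names invented for the example.

```python
from itertools import cycle


def pass_rate(agent, task, check, trials=20):
    """Run the same task several times and return the fraction of runs that pass.

    agent: callable taking a task and returning an output.
    check: callable taking an output and returning True or False.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if check(agent(task)))
    return passes / trials


def make_flaky_agent(outputs):
    """Hypothetical stub: cycles through canned outputs to imitate run-to-run variation."""
    canned = cycle(outputs)
    return lambda task: next(canned)
```

A test then asserts a threshold, for example `pass_rate(agent, task, check) >= 0.9`, rather than exact equality on one run; where to set the threshold is a judgment call that depends on the cost of a failure.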

May 15, 2026 · 7 min · James M

Multimodal AI in 2026: Vision + Text + Audio - What's Actually Useful

TL;DR

- Document understanding is the unglamorous killer application - invoices, contracts, and scanned PDFs that were painful to extract data from are now tractable without dedicated pipelines
- Vision models still under-deliver on precise spatial reasoning, object counting, and subtle medical or scientific imagery - these remain jobs for specialist models
- Audio is the modality with the most upside: beyond transcription, it carries tone, pace, and hesitation that text loses, enabling fault detection, emotional analysis, and richer inputs
- The teams getting real value treat multimodal as an invisible enabling capability within a workflow, not a feature to demo - and they verify high-stakes outputs just as they would text
- The right question when evaluating multimodal is not “can we use this” but “what specific user problem becomes tractable that previously was not”

When the first multimodal frontier models shipped, the demos were genuinely impressive. A photo of a fridge interior with the model suggesting a recipe. A handwritten napkin sketch becoming working code. A short audio clip of a meeting being transcribed, summarised, and structured. It looked, briefly, like the boundary between modalities had collapsed and we were entering a new regime in which models could reason fluidly across text, images, and sound. ...

May 9, 2026 · 10 min · James M

Prompt Caching: The Quiet Performance Win for LLM Applications

TL;DR

- Prompt caching saves the computed representation of a prompt’s static prefix so subsequent requests reuse it rather than recompute it - cached tokens cost roughly 10% of normal input token prices
- The savings are highest when prompts have a long, identical prefix across requests - system prompts, tool definitions, and few-shot examples can make up 80-90% of total input cost
- The most common mistake is interpolating variables into the system prompt, which breaks caching silently; fix it by moving all static content to the top and dynamic content to the end
- Cache lifetimes are bounded (minutes to a few hours per provider) and any change to the prefix - including whitespace - creates a new cache miss
- Track your cache hit rate explicitly on every LLM dashboard; a dropping hit rate usually signals unintended prompt construction changes, and fixing it is the highest-leverage cost optimisation available

If you build LLM applications for any length of time, you eventually notice that you are paying to have the model read the same instructions over and over again. The system prompt, the tool definitions, the few-shot examples, the structured output schema - all of it goes back into the model on every single request, and you pay for the input tokens every single time. For a chatbot doing one or two thousand requests a day this is annoying. For an agent doing tens of thousands of requests with long contexts, it is the dominant cost line. ...
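The advice to keep static content first, watch for silent prefix changes, and track hit rate can be sketched without any provider SDK. Everything below is illustrative: the function names are invented, the hit rate is defined here as the cached share of input tokens (one possible definition), and the cost estimate uses the "roughly 10%" figure quoted above while ignoring any cache-write surcharge a provider may apply.

```python
import hashlib

# Assumption: cached input tokens cost about 10% of the normal price,
# taken from the "roughly 10%" figure in the summary above.
CACHED_PRICE_RATIO = 0.10


def prefix_fingerprint(static_prefix: str) -> str:
    """Hash the static prefix; any change, including whitespace, changes the hash."""
    return hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()


def build_prompt(static_prefix: str, dynamic_suffix: str) -> str:
    """Static content first, per-request content last, so the prefix stays identical."""
    return static_prefix + "\n\n" + dynamic_suffix


def cache_hit_rate(cached_tokens: int, total_input_tokens: int) -> float:
    """Share of input tokens served from cache (one possible definition)."""
    if total_input_tokens == 0:
        return 0.0
    return cached_tokens / total_input_tokens


def estimated_input_cost(cached_tokens: int, uncached_tokens: int, price_per_token: float) -> float:
    """Estimated input cost; cache-write surcharges are deliberately ignored."""
    return (cached_tokens * CACHED_PRICE_RATIO + uncached_tokens) * price_per_token
```

Logging `prefix_fingerprint(...)` alongside each request makes an accidental interpolation into the system prompt visible as a fingerprint that changes from call to call, which is usually easier to spot than a slowly falling hit rate.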

May 9, 2026 · 10 min · James M

AI Agents That Actually Work: Patterns From Real Projects

TL;DR

- Most agent demos fail in production because demos operate in a regime where the model’s natural behaviour is good enough - production is longer, messier, and largely unobserved
- Eight patterns separate agents that stay shipped from the ones that fall over: scope the loop, structured tool design, mandatory verification, curated context, first-class human handoff, idempotency, agent-level observability, and real evaluation infrastructure
- Models confabulate actions - “I ran the tests” does not mean the tests were run; every agent needs explicit verification baked into the control flow, not bolted on as an afterthought
- The tool layer between the model and underlying systems is where most of the engineering effort actually lives, and exposing raw APIs directly to the agent almost always goes wrong
- Build agents the same way you would build any other long-running, partially-autonomous system you cannot afford to have fail silently - the novelty is in the failure modes, not the engineering principles

I have spent the last eighteen months either building, reviewing, or operating systems that some marketing department somewhere has called “agents”. The definition has been so thoroughly stretched that it now means anything from a chatbot with a calculator tool to a long-running autonomous workflow that touches production infrastructure. Underneath the noise there is a real engineering discipline emerging, and the patterns that separate the systems that survive contact with real users from the ones that demo well and fall over are starting to be legible. ...
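The point that "I ran the tests" does not mean the tests were run suggests checking the model's claims against a record the model cannot write to. The following is a minimal sketch of that idea under stated assumptions: `ToolLog`, `unverified_claims`, and the action names are hypothetical, and a real agent would populate the log from its tool-execution layer rather than by hand.

```python
from dataclasses import dataclass, field


@dataclass
class ToolLog:
    """Record of tool calls written by the control loop, never by the model."""
    calls: list = field(default_factory=list)

    def record(self, name: str, ok: bool) -> None:
        self.calls.append((name, ok))

    def succeeded(self, name: str) -> bool:
        """True if at least one call to this tool completed successfully."""
        return any(n == name and ok for n, ok in self.calls)


def unverified_claims(claimed_actions, log: ToolLog):
    """Return the actions the model claims to have done with no successful call behind them."""
    return [action for action in claimed_actions if not log.succeeded(action)]
```

In a control loop, a non-empty result from `unverified_claims` would block the step from being marked complete and either retry the tool call or hand off to a human, so verification sits in the flow rather than in a later review.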

May 1, 2026 · 11 min · James M