jamesm.blog

AI Agents That Actually Work: Patterns From Real Projects

TL;DR
- Most agent demos fail in production because demos operate in a regime where the model’s natural behaviour is good enough - production is longer, messier, and largely unobserved
- Eight patterns separate agents that stay shipped from the ones that fall over: scope the loop, structured tool design, mandatory verification, curated context, first-class human handoff, idempotency, agent-level observability, and real evaluation infrastructure
- Models confabulate actions - “I ran the tests” does not mean the tests were run; every agent needs explicit verification baked into the control flow, not bolted on as an afterthought
- The tool layer between the model and underlying systems is where most of the engineering effort actually lives, and exposing raw APIs directly to the agent almost always goes wrong
- Build agents the same way you would build any other long-running, partially-autonomous system you cannot afford to have fail silently - the novelty is in the failure modes, not the engineering principles

I have spent the last eighteen months either building, reviewing, or operating systems that some marketing department somewhere has called “agents”. The definition has been so thoroughly stretched that it now means anything from a chatbot with a calculator tool to a long-running autonomous workflow that touches production infrastructure. Underneath the noise there is a real engineering discipline emerging, and the patterns that separate the systems that survive contact with real users from the ones that demo well and fall over are starting to be legible. ...
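As a rough illustration of the verification point in the summary above, here is a minimal sketch in which the control flow decides from an independent check, never from the model's own report. The names `StepOutcome` and `accept_step` are hypothetical and not taken from the post; the only assumption is that the check can be expressed as a command whose exit code signals success.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class StepOutcome:
    model_claim: str  # what the model said it did (kept for the log only)
    verified: bool    # what the independent check actually observed


def accept_step(model_claim: str, check_cmd: list[str]) -> StepOutcome:
    """Run the check ourselves and trust only its exit code.

    The claim text is recorded but never consulted for the decision, so
    "I ran the tests" cannot pass a step on its own.
    """
    completed = subprocess.run(check_cmd, capture_output=True)
    return StepOutcome(model_claim=model_claim, verified=completed.returncode == 0)


if __name__ == "__main__":
    # Stand-in for a real test command: a process that exits non-zero.
    failing_check = [sys.executable, "-c", "import sys; sys.exit(1)"]
    outcome = accept_step("I ran the tests and they pass.", failing_check)
    print(outcome.verified)
```

In a real agent loop the check command would be the project's own test runner or a read-back of the system state, and a failed verification would feed back into the loop or trigger a human handoff rather than just being printed.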

My Tracks - April 2026

A selection of my music production work from April 2026. I move freely between funky house, chillsynth, ballads, techno, hard house and instrumental soundscapes. I build tracks around rhythm, mood and tiny sparks of emotion that grow into something bigger. Some tunes hit hard, some float, some just wander in and make themselves at home. Many of the tracks have been remastered and almost all album art has been updated, so the tracks have been republished. ...

Connecting Claude to Ableton: Why the New Knowledge Connector Matters

On 28 April 2026 Anthropic shipped a batch of nine creative-tool connectors for Claude, and one of them is the Ableton Knowledge connector. It is a small thing on the surface and a big thing underneath. Here is what it does, what it does not do, and why it matters if you spend your evenings inside Live or staring at a Push.

What the Connector Actually Does

The official Ableton connector grounds Claude’s answers in Ableton’s own product documentation for Live and Push. That is the whole pitch, and it is more useful than it sounds. ...

AI Safety From First Principles: What Actually Matters vs What's Hype

TL;DR
- “AI safety” covers four distinct layers - product safety, system safety, model alignment, and civilisational safety - and conflating them produces incoherent debates
- For engineers building production systems today, system safety dominates: most real incidents trace back to flawed system design around the model, not the model itself
- Practical mitigations are unglamorous: scope tool permissions, bound blast radius, require human approval for irreversible actions, validate outputs, and observe everything
- The hype conflates capability with intent, existential risk with ordinary risk, and refusal with safety - all three conflations make the conversation harder to act on
- The load-bearing principle across all four layers is the same: a system should fail in ways that are detectable, recoverable, and bounded

The AI safety conversation has reached the point where the phrase has stopped meaning anything specific. In the same week, you will see “AI safety” used to describe content moderation on a chat product, the alignment of frontier models toward human values, the question of whether superintelligence ends civilisation, and a regulatory paper about copyright. These are not the same problem. Treating them as one conversation is the reason the conversation never resolves. ...

AI Skills: One Folder, Any Model

TL;DR
- A Claude Code skill is just a folder with a SKILL.md file - YAML frontmatter plus natural-language instructions - and the same folder works across Cursor, Gemini CLI, Codex, and a dozen other tools
- The format is model-agnostic because it contains no provider-specific syntax; any instruction-following model can read it, and any harness that loads markdown can execute it
- Progressive disclosure keeps large skill libraries cheap: only names and descriptions load at session start, with full instructions loading only when a skill is activated
- The portability is practically valuable - version-controlled runbooks that survive tool switches, model upgrades, and team growth without being rewritten
- Core skills are genuinely portable; advanced frontmatter extensions (like allowed-tools or context: fork) are tool-specific and may need tuning across harnesses

Most of the tooling I have written about over the last year has been provider-specific. A particular model, a particular harness, a particular set of features. The thing I find interesting about agent skills is that they are not. ...
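The progressive-disclosure idea in the summary above can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular harness implements it: the function names (`split_skill`, `index_skills`, `activate`) are hypothetical, the layout is assumed to be one folder per skill with a SKILL.md inside, and the frontmatter is read with a deliberately naive `key: value` parser rather than a real YAML library.

```python
from pathlib import Path


def split_skill(text: str) -> tuple[dict[str, str], str]:
    """Split SKILL.md text into (frontmatter fields, instruction body).

    Naive parser: handles only flat `key: value` lines between two `---`
    markers. Real YAML (nested keys, lists, quoting) needs a YAML library.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        return {}, text
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and not line.startswith((" ", "\t")):
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


def index_skills(root: Path) -> list[dict[str, str]]:
    """Session start: keep only each skill's name and description."""
    index = []
    for skill_file in sorted(root.glob("*/SKILL.md")):
        meta, _ = split_skill(skill_file.read_text(encoding="utf-8"))
        index.append({
            "name": meta.get("name", skill_file.parent.name),
            "description": meta.get("description", ""),
            "path": str(skill_file),
        })
    return index


def activate(entry: dict[str, str]) -> str:
    """On activation: load the full instruction body for one skill."""
    _, body = split_skill(Path(entry["path"]).read_text(encoding="utf-8"))
    return body
```

The index is what would sit in the model's context for the whole session; `activate` is called only for the skill that gets picked, which is what keeps a large library cheap.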

Suno in May 2026: where the platform actually is

TL;DR - Suno v5.5 (March 2026) is the most expressive model yet, and three personalisation features finally make the platform usable as a real workflow: Voices (clone your own verified singing voice), Custom Models (fine-tune v5.5 on your own catalogue), and My Taste (lightweight preference learning for everyone). The Warner Music deal is now visible in the product - older models are being deprecated, free accounts have lost commercial download rights, and the ownership language has softened from “you own this” to “you have commercial rights.” Best used for demos, stem libraries, and personal sound signatures; still risky for releases that need clean copyright provenance. ...

My AI-Augmented Design Workflow: A 10-Minute Loop From Discussion to Documented Decision

TL;DR
- A combination of Cursor in the IDE, Claude Code and Codex in the terminal, and GitHub Spec Kit as the living contract has collapsed the discuss-design-document loop from days to under ten minutes
- Every meeting is transcribed and checked into GitHub alongside the design corpus, giving AI agents access to the full historical record - not just curated decisions but the debates that shaped them
- Model selection matters: cheaper, faster models for throwaway sketches and small refactors; expensive models (Opus) for large cross-repo work where the cost of a wrong answer is high
- The real transformation is cognitive flow - removing friction between thinking and recording means decisions get made and captured while the problem is still fresh, with almost no context switching
- AI is now suggesting improvements faster than the author can implement them; the next bottleneck is compaction, not generation - asking the model to reduce documents to their load-bearing claims rather than produce more content

Since making a combination of Cursor in the IDE and Claude Code and Codex in the terminal the centre of my working day - with ChatGPT for general questions and GitHub Spec Kit holding the design contract - the way I move from a question on Slack to a documented design decision has changed beyond recognition. ...

When to Fine-Tune vs When to RAG: Choosing Your AI Architecture

TL;DR
- The default choice for most teams should be RAG - it is reversible in days, whereas a bad fine-tuning decision is an expensive sunk cost that requires retraining to fix
- RAG fails when the question requires reasoning across an entire knowledge domain rather than extracting a specific answer from a passage; fine-tuning handles that case better
- Fine-tuning fails silently when underlying facts change - it produces confidently wrong, stale answers with no warning; RAG automatically picks up changes at query time
- A practical decision framework: use RAG for volatile facts and cited answers, use fine-tuning for stable style, voice, and cross-domain reasoning
- The best production systems use both: a fine-tuned base model for stable domain knowledge, augmented with retrieval for current and specific information

The question I get asked most often by engineers starting to build with language models is some variation of: “should we fine-tune or should we do RAG?” It is almost always the wrong question, but it is the wrong question in an instructive way. The reason it gets asked so much is that the choice feels architectural, and architectural choices feel like the kind of thing you commit to once and live with. In practice, the choice is closer to “should I use a database or a cache” - the answer is usually some of both, applied to different problems, and the ratio shifts as the system matures. ...

The Free Intelligence Era: What Breaks When Thinking Costs Nothing

TL;DR
- The marginal cost of AI intelligence is halving roughly every two months and heading toward a level where rationing stops making sense - similar to how bandwidth and storage became effectively unconstrained
- This will break pricing models built on scarce cognition: anything billed per word, per hour, or per consult faces a hard ceiling set by what machines charge for the same work
- The Jevons paradox means total cognitive work in the economy likely goes up, not down - cheaper thinking means we apply thinking to far more problems, not the same problems more cheaply
- Three categories of human work survive: accountability (being the named responsible party), taste (choosing well from infinite AI-generated options), and real-world coupling (a body in a place, a relationship that took years to build)
- The political question of who captures the surplus and who absorbs the transition cost is still open - it will be decided by institutions and policy, not by the technology itself

This is a personal reflection, not a forecast dressed up as one. I am writing about a trend I think is real, but the second-order consequences are guesses, and I am sure some of them are wrong. ...

The Quiet Discipline of Self-Honesty

TL;DR
- Nearly all self-improvement advice assumes you have already looked at yourself clearly - and that step is the bottleneck almost nobody completes
- Goals that match a fictional version of you cannot be reached by the real one, which is why systems and habit trackers fail for people with an inaccurate self-picture
- Self-honesty is hard because the mind protects identity, not accuracy; the signs of low self-honesty are recognisable and worth checking for
- It can be practised deliberately - small, regular, low-stakes acts of telling yourself the truth compound like any other habit
- Self-honesty and self-compassion are not opposites; honesty without compassion becomes self-attack, which is just another form of avoidance

Most self-improvement advice assumes a step that almost nobody actually completes. ...