LLM | jamesm.blog

An AI Tooling Learning Path: Logical Phases for 2026

TL;DR
- The order you learn AI tools matters as much as which tools you learn - most people start with terminal agents or editors before they understand how models actually fail
- The seven-phase path runs: fundamentals, chat interfaces, AI-native editors, terminal agents, local models, orchestration, and review and evaluation
- Terminal agents (Claude Code, Cline, Aider) represent the biggest mindset shift - you move from driving with suggestions to specifying and letting the model execute
- Local models via Ollama belong in phase five, once you have felt the pain of API costs and know which tasks actually need frontier capability
- Review, evaluation, and capture (phase seven) is the phase most developers skip - and the one that separates AI-curious from AI-competent

The hardest part of learning AI tooling in 2026 is not any single tool. It is the order you meet them in. ...

DGX Spark vs Mac Studio: Which Personal AI Supercomputer Should You Buy?

TL;DR
- Best value: Mac Studio M4 Max at $1,999 for most local LLM work
- Best prefill speed: DGX Spark at $4,699 (3.8× faster prompt processing)
- Best token generation: Mac Studio M3 Ultra at $3,999 (819 GB/s bandwidth)
- Best for fine-tuning: DGX Spark (CUDA ecosystem wins)
- Best combined setup: DGX Spark + M3 Ultra = 2.8× faster than either alone

Introduction

The market for personal AI supercomputers has exploded in 2025-2026. Two standout options have emerged: NVIDIA’s DGX Spark and Apple’s Mac Studio lineup. Both promise desktop-scale AI compute, but they approach the problem very differently. This guide breaks down the specs, costs, and real-world performance to help you decide which is right for you. ...

Which Mac Studio Should You Buy for Running LLMs Locally?

TL;DR
- Best entry point: M2 Max 32-64 GB (~£1.4k-£2k) for 7B-13B models at 25-40 tok/s
- Best sweet spot: M2 Ultra 64-128 GB (~£3k-£4.5k) handles 30B+ models comfortably
- Best for 70B models: M3 Ultra 128 GB+ (~£5.5k+) with 800+ GB/s bandwidth
- Newer alternative: M4 Max (£2k-£4k) - lower bandwidth (410-546 GB/s) than Ultra chips, but still solid for 7B-13B models
- Key rule: Memory bandwidth matters more than raw compute for token generation
- Reality check: An RTX 5090 rig is 2-3× faster for similar money - buy a Mac for simplicity and unified memory

You want to run large language models locally on a Mac Studio. Good idea - unified memory is genuinely useful for LLMs. But the specs matter, and there are some hard truths about what “works” versus what feels responsive. More importantly: the right Mac depends entirely on which model you want to run. ...
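
The "memory bandwidth matters more than raw compute" rule can be made concrete with a rough ceiling: if generating each token requires reading every weight once, tokens per second cannot exceed bandwidth divided by model size in memory. The sketch below is a back-of-the-envelope estimate, not a benchmark. The function name and the bytes-per-parameter figure (0.5 for 4-bit quantisation) are assumptions for illustration, and real throughput will land below this ceiling once KV cache reads and compute overhead are included.

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          params_billion: float,
                          bytes_per_param: float) -> float:
    """Rough upper bound on generation speed for a memory-bound model.

    Assumes every weight is read once per generated token and ignores
    KV cache traffic, compute limits, and software overhead.
    """
    model_size_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_size_gb


# Illustrative inputs: 819 GB/s (the M3 Ultra figure quoted in this listing)
# and a 70B model at an assumed 4-bit quantisation (0.5 bytes per parameter).
ceiling = max_tokens_per_second(819, 70, 0.5)
print(f"70B @ 4-bit on 819 GB/s: at most ~{ceiling:.1f} tok/s")
```

Doubling the bandwidth doubles this ceiling, while doubling compute leaves it unchanged, which is the point of the rule.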

Open WebUI: A Polished Interface for Local and Remote LLMs

TL;DR
- Open WebUI is an open-source, ChatGPT-style web interface that connects to local Ollama instances, OpenAI’s API, or any OpenAI-compatible backend
- It eliminates the friction of command-line LLM tools and supports features like RAG with document uploads, web search, custom prompts, model switching, and multi-user permissions
- Deployment is a single Docker command; maintenance is lightweight with persistent storage and optional PostgreSQL for multi-instance setups
- The primary appeal is full data ownership - queries never leave your infrastructure - making it well suited for privacy-conscious users and compliance-bound organizations
- Open WebUI adds minimal latency since the bottleneck is always the inference engine behind it, not the web interface itself

If you’ve spent time running language models locally through Ollama or another inference engine, you’ve probably discovered the same friction point: the command-line experience works, but it’s clunky. You’re juggling terminal windows, tracking conversation context manually, navigating files through the filesystem. ...

Structured Outputs: When Your AI Needs to Follow a Schema

TL;DR
- Structured outputs constrain an LLM’s response to match a JSON schema during generation, eliminating the entire class of post-processing parse failures (which occur 2-5% of the time with free-form output)
- They produce simpler code, more reliable pipelines, and modest inference cost savings (typically 5-15% fewer tokens) in high-volume systems
- Use structured outputs for data extraction, classification, entity recognition, and API payload generation - not for creative writing or open-ended reasoning
- Common mistakes include over-constraining schemas with too-strict enums, forgetting that the response format changes, and mistaking schema validity for semantic correctness
- The trajectory is toward structured outputs becoming the default: schemas will be inferred from English descriptions, and TypeScript types will auto-generate schemas

For years, extracting structured data from LLMs meant post-processing their text output: parse JSON, handle edge cases where the model forgot to close a bracket, write validation code to check if the output matched your schema, implement fallback logic when parsing failed. ...
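
The gap between schema validity and semantic correctness is easier to see in code. The sketch below uses only the Python standard library and a made-up review-classification payload. The `Review` type, the field names, and the consistency rule are all hypothetical, chosen for illustration rather than taken from any provider's API. The first function checks shape, which a structured-output mode would already guarantee; the second checks whether the values make sense together, which it would not.

```python
import json
from dataclasses import dataclass

ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}


@dataclass
class Review:
    sentiment: str
    score: float


def parse_review(raw: str) -> Review:
    """Shape check: required keys, types, and enum membership."""
    data = json.loads(raw)
    review = Review(sentiment=str(data["sentiment"]), score=float(data["score"]))
    if review.sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unknown sentiment: {review.sentiment!r}")
    return review


def semantic_problems(review: Review) -> list[str]:
    """Meaning check: values that are well-formed but inconsistent."""
    problems = []
    if not 0.0 <= review.score <= 1.0:
        problems.append("score outside the 0-1 range")
    if review.sentiment == "positive" and review.score < 0.5:
        problems.append("positive label with a low score")
    return problems


# Parses cleanly against the schema, yet the two fields contradict each other.
review = parse_review('{"sentiment": "positive", "score": 0.1}')
print(semantic_problems(review))
```

In this sketch the payload passes the shape check and still gets flagged, which is the failure mode the excerpt warns about.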

The LLM Context Window Arms Race: Does It Actually Matter?

TL;DR
- Context window size is the wrong metric to optimise for - attention scales quadratically, so larger windows mean dramatically higher latency and cost with diminishing quality gains
- Retrieval-augmented generation consistently outperforms stuffing entire documents into a prompt, because focused context beats diluted context
- What actually matters in production: token efficiency, prompt caching, structured output formats, and intelligent retrieval - not raw window size
- Large context windows are genuinely useful for whole-document analysis and complex cross-file code review, but wasteful for Q&A, structured extraction, and high-volume routine tasks
- The teams that will ship faster and scale further are those building intelligent architecture around a 200K context window, not those waiting for 1M-token models

Every week brings a new headline: “Model X reaches 1M token context!” “Model Y supports 2M tokens!” The LLM industry seems locked in an arms race where the stated goal is always “bigger context window,” as if this single metric determines whether a model is useful. ...

AI Reliability Is Weird: Why Testing LLMs Breaks Everything You Know

TL;DR
- Traditional testing assumes determinism - given input X, function f always returns Y - but LLMs are non-deterministic, which breaks assertion-based testing at its foundation
- The same agentic task run twice may produce different but equally correct code, making exact-output assertions brittle and often useless
- The new paradigm shifts from “test the code” to “verify the intent”: property-based testing, LLM-as-a-Judge evaluation, golden datasets for regression, and human review for overall correctness
- Structured outputs enforce syntactic correctness at generation time, but semantic correctness - whether the output actually solves the right problem - still requires layered verification on top
- The future of AI quality assurance is designing robust evaluation frameworks and measuring properties of acceptable outputs, not writing exhaustive unit tests for code the model may generate differently next time

AI agents like Cline are now the primary “builders” of software in many workflows, executing complex engineering plans from high-level specifications. As I have argued in “The Architect vs The Builder”, the human role is shifting from execution to architectural oversight and defining intent. The patterns that determine whether agents stay shipped are covered in “AI agents that actually work”, and the wider safety framing sits in “AI safety from first principles”. ...
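
To illustrate "verify the intent" rather than "test the code", here is a minimal sketch of a property check. The two candidate functions stand in for two different but equally valid implementations an agent might produce for the same spec ("remove duplicates, keep first-seen order"). The task, the function names, and the chosen properties are assumptions for illustration, not the post's own example.

```python
from typing import Callable, List


def candidate_a(items: List[int]) -> List[int]:
    # One plausible generated implementation: explicit loop with a seen-set.
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def candidate_b(items: List[int]) -> List[int]:
    # A textually different implementation relying on dict insertion order.
    return list(dict.fromkeys(items))


def satisfies_properties(fn: Callable[[List[int]], List[int]],
                         cases: List[List[int]]) -> bool:
    """Check properties of an acceptable output, not its exact source text."""
    for case in cases:
        result = fn(list(case))
        if len(result) != len(set(result)):   # no duplicates remain
            return False
        if set(result) != set(case):          # nothing lost, nothing invented
            return False
        if result != sorted(set(case), key=case.index):  # first-seen order kept
            return False
    return True


cases = [[3, 1, 3, 2, 1], [], [5, 5, 5], [1, 2, 3]]
print(satisfies_properties(candidate_a, cases))
print(satisfies_properties(candidate_b, cases))
```

An implementation that sorts the values instead of preserving order would fail the third property, even though its output contains no duplicates.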

Claude Code vs Cursor: A 6-Month Comparison

TL;DR
- After six months of daily use, neither Cursor nor Claude Code wins outright - they represent two distinct philosophies that complement each other in a hybrid workflow
- Cursor’s strength is deep IDE integration: seamless codebase indexing, best-in-class multi-file Composer Mode, and zero context switching for feature development and UI work
- Claude Code’s strength is agentic execution: it runs tests, reads output, fixes code, and loops until passing - ideal for debugging, test-driven fixes, and housekeeping tasks
- The real winner underlying both tools is the Claude 4 family (Sonnet 4.6 for most work, Opus 4.7 for the harder agentic loops); the choice of tool determines how you interact with that intelligence, not which intelligence you get
- The practical split: use Cursor as your primary environment for feature work, use Claude Code when you need something to just run and fix itself

It’s been six months since the landscape of AI coding tools shifted from “helpful autocomplete” to “autonomous agents.” During this time, I’ve used both Cursor and Claude Code (Anthropic’s CLI tool) for every major project. ...

GPU Servers vs AI API Credits: The Real Cost Breakdown (2026)

TL;DR
- The core trade-off is pay-per-use (APIs) vs pay-for-capacity (GPUs) - APIs are cheaper at low volume, GPUs win massively at high volume (100M+ tokens/day)
- The break-even point for GPU self-hosting sits around 2 to 5 million tokens per day for premium-model workloads - below that, APIs almost always win
- GPU utilisation is the most important variable: at less than 50-60% utilisation, self-hosted inference costs more per token than just calling an API
- Hidden costs matter - real GPU spend is 2x to 5x the raw hardware price once you add DevOps, scaling, monitoring, and networking; API costs can also balloon from poor prompt design and multi-step agent loops
- Most serious production systems land on a hybrid architecture: APIs for complex reasoning and long-context work, GPUs for bulk inference, embeddings, and fine-tuned models

If you’re building anything with LLMs right now, you’ll hit this question sooner than you expect: ...
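
The utilisation point reduces to simple arithmetic: a GPU costs the same per hour whether it is busy or idle, so its cost per token rises as utilisation falls. The sketch below shows the shape of that calculation. Every number in it (hourly GPU cost, peak throughput, API price) is a hypothetical placeholder picked to make the arithmetic easy to follow, not a figure from the post, and it ignores the hidden operational costs listed above.

```python
def gpu_cost_per_million_tokens(gpu_cost_per_hour: float,
                                peak_tokens_per_second: float,
                                utilisation: float) -> float:
    """Self-hosted cost per 1M tokens at a given utilisation (0 < u <= 1)."""
    tokens_per_hour = peak_tokens_per_second * 3600 * utilisation
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


def break_even_utilisation(gpu_cost_per_hour: float,
                           peak_tokens_per_second: float,
                           api_price_per_million: float) -> float:
    """Utilisation at which self-hosting matches the API price per token."""
    full_load_cost = gpu_cost_per_million_tokens(
        gpu_cost_per_hour, peak_tokens_per_second, 1.0)
    return full_load_cost / api_price_per_million


# Hypothetical inputs: $2.00/hour GPU, 1,000 tok/s peak, $1.00 per 1M API tokens.
u = break_even_utilisation(2.00, 1000, 1.00)
print(f"Break-even utilisation: {u:.0%}")
for util in (0.25, 0.5, 0.9):
    cost = gpu_cost_per_million_tokens(2.00, 1000, util)
    print(f"  {util:.0%} utilised -> ${cost:.2f} per 1M tokens")
```

Below the break-even utilisation the API is cheaper per token. Folding in the 2x to 5x overhead the excerpt mentions would push the break-even point higher still.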

What Actually Belongs in My AI Dev Stack in 2026

TL;DR
- A single AI tool cannot handle everything - a proper AI dev stack in 2026 needs distinct layers for spec writing, fast editing, heavy agentic work, cheap model tasks, review, research, and capture
- Spec-driven development is the most underused part: writing requirements and acceptance criteria before generation dramatically improves AI output and reduces wasted iterations
- Tools like Cursor AI handle fast, in-flow editing while Claude Code or Cline are better suited to multi-file refactors and autonomous implementation from specs
- Letting the same model that generated code also review it is a weak loop - a separate review pass with a different model or explicitly critical prompt is essential
- The real shift is treating AI not as a bolt-on assistant but as part of the workflow architecture itself, with each tool assigned a clear, specific responsibility

There is a big difference between using AI for development and having an actual AI development stack. ...