This section is organised around one question: what has to be true before you can trust AI to do real work? Reliability, context, economics, security, evaluation, and eventually physical action - each post is a different angle on the same problem.

AI reliability - testing non-deterministic systems

AI Reliability Is Weird: Why Testing LLMs Breaks Everything You Know

TL;DR
- Traditional testing assumes determinism - given input X, function f always returns Y - but LLMs are non-deterministic, which breaks assertion-based testing at its foundation
- The same agentic task run twice may produce different but equally correct code, making exact-output assertions brittle and often useless
- The new paradigm shifts from “test the code” to “verify the intent”: property-based testing, LLM-as-a-Judge evaluation, golden datasets for regression, and human review for overall correctness
- Structured outputs enforce syntactic correctness at generation time, but semantic correctness - whether the output actually solves the right problem - still requires layered verification on top
- The future of AI quality assurance is designing robust evaluation frameworks and measuring properties of acceptable outputs, not writing exhaustive unit tests for code the model may generate differently next time

AI agents like Cline are now the primary “builders” of software in many workflows, executing complex engineering plans from high-level specifications. As I have argued in “The Architect vs The Builder”, the human role is shifting from execution to architectural oversight and defining intent. The patterns that determine whether agents stay shipped are covered in “AI agents that actually work”, and the wider safety framing sits in “AI safety from first principles”. ...

April 9, 2026 · 7 min · James M
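
The shift from exact-output assertions to property checks described in the reliability summary above can be sketched in a few lines. This is a minimal illustration, not the post's own method: `fake_generate`, `CANDIDATES`, and `sort_numbers` are hypothetical names, and the "model" is a stub that returns one of two different but equally valid implementations.

```python
import random
from collections import Counter

# Hypothetical stand-in for a model call: two different-but-valid
# implementations of the same task, simulating non-deterministic output.
CANDIDATES = [
    "def sort_numbers(xs):\n    return sorted(xs)\n",
    "def sort_numbers(xs):\n    out = list(xs)\n    out.sort()\n    return out\n",
]


def fake_generate(rng):
    """Pretend to call a model; returns one candidate source string."""
    return rng.choice(CANDIDATES)


def satisfies_properties(source, trials=50, seed=0):
    """Check properties of the generated function, not its exact text."""
    namespace = {}
    # exec is acceptable here only because the source is our own stub;
    # real model output would need sandboxing.
    exec(source, namespace)
    fn = namespace["sort_numbers"]

    rng = random.Random(seed)
    cases = [[], [1], [3, 1, 2], [2, 2, 1]]  # fixed edge cases
    cases += [
        [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        for _ in range(trials)
    ]
    for xs in cases:
        result = fn(list(xs))
        # Property 1: the output is in non-decreasing order.
        if any(a > b for a, b in zip(result, result[1:])):
            return False
        # Property 2: the output holds exactly the same elements as the input.
        if Counter(result) != Counter(xs):
            return False
    return True


if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(3):
        print(satisfies_properties(fake_generate(rng)))
```

An assertion on the exact source text would fail whenever the stub picks the other candidate, while the property check accepts both because it measures what an acceptable output must do rather than how it is written.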
Career-Ops - AI-powered career decision tools

Career-Ops: Flipping the Script on AI-Powered Job Search

TL;DR
- Career-Ops is an open-source tool built on Claude Code that inverts the job search power dynamic - giving candidates AI-powered evaluation and application tools to match what companies use to filter them
- Each opportunity is scored across 10 weighted dimensions on an A-F scale, producing a structured comparison that replaces the ad hoc spreadsheet most candidates rely on
- The system generates ATS-optimized resumes dynamically tailored to each job description and auto-discovers new postings from 45+ pre-configured job boards
- A key design principle is human-in-control: nothing auto-submits, the AI recommends and the candidate decides, making it a decision-support system rather than an automation
- Career-Ops is a clean example of the broader pattern of AI tools that amplify individual judgment rather than replace it - worth studying for its architecture as much as its use case

The job search has long been a one-way mirror - companies deploy AI to filter applications while candidates manually juggle spreadsheets, tailor cover letters, and hope their resume gets past the automated screener. Career-Ops flips that script entirely. Built on Claude Code, it’s an open-source system that gives job seekers their own AI advantage: intelligent evaluation of opportunities, automated customized applications, and systematic candidate strategy. ...

April 9, 2026 · 5 min · James M
Claude Mythos benchmark performance

Claude Mythos: The AI Benchmark Breaker That Won't Be Released

TL;DR
- Claude Mythos Preview set new records across coding, mathematics, and reasoning: 93.9% on SWE-bench Verified, 97.6% on USAMO 2026, and leads GPT-5.4 on every shared benchmark
- The USAMO result - a 55-point jump over Claude Opus 4.6 - suggests genuinely different reasoning capabilities, not just incremental improvement, and Anthropic screened against memorization concerns
- Despite dominating benchmarks, Mythos is not publicly available because it autonomously discovered thousands of zero-day vulnerabilities across every major OS and browser
- Access is restricted to 12 major tech and finance companies via Project Glasswing, a defensive cybersecurity research initiative backed by $100M in Anthropic usage credits
- The wider implication: we have entered an era where “the best model” and “the publicly available model” may be permanently different things, with security becoming a deployment constraint alongside capability

Anthropic released Claude Mythos Preview on April 7, 2026 - and immediately announced it won’t be publicly available. ...

April 8, 2026 · 4 min · James M
Claude Code vs Cursor comparison

Claude Code vs Cursor: A 6-Month Comparison

TL;DR
- After six months of daily use, neither Cursor nor Claude Code wins outright - they represent two distinct philosophies that complement each other in a hybrid workflow
- Cursor’s strength is deep IDE integration: seamless codebase indexing, best-in-class multi-file Composer Mode, and zero context switching for feature development and UI work
- Claude Code’s strength is agentic execution: it runs tests, reads output, fixes code, and loops until passing - ideal for debugging, test-driven fixes, and housekeeping tasks
- The real winner underlying both tools is the Claude 4 family (Sonnet 4.6 for most work, Opus 4.7 for the harder agentic loops); the choice of tool determines how you interact with that intelligence, not which intelligence you get
- The practical split: use Cursor as your primary environment for feature work, use Claude Code when you need something to just run and fix itself

It’s been six months since the landscape of AI coding tools shifted from “helpful autocomplete” to “autonomous agents.” During this time, I’ve used both Cursor and Claude Code (Anthropic’s CLI tool) for every major project. ...

April 8, 2026 · 3 min · James M

The Automation Paradox: Why More AI Makes Human Judgment More Valuable

TL;DR
- Every time AI automates a specific task, the monetary value of doing that task falls - the scarce resource shifts from execution to the judgment of what is worth doing at all
- Historical precedent holds: Deep Blue did not kill professional chess, calculators did not kill accountants - automation raises the value of the thinking above the automated layer
- The new hierarchy of work puts judgment first (irreplaceable), direction second (human but scalable), and execution last (increasingly commodity)
- Judgment is constrained opinion - it requires trade-off awareness, skin in the game, pattern recognition, and willingness to be wrong - none of which AI can replicate
- The economic inversion means hiring shifts from paying for output to paying for prevention: the bad decisions not made, the features not built, the wrong paths not taken

The automation paradox is quietly reshaping what we pay for. ...

April 7, 2026 · 6 min · James M
Spec-driven development - when the brief becomes the product

Spec-Driven Development: When the Brief Becomes the Product

TL;DR
- Spec-driven development means making specifications iteratively precise enough that handing them to an AI produces the right result without further iteration
- AI makes hidden specification costs visible - ambiguous briefs now produce wrong code instantly rather than surfacing bugs slowly during implementation
- The spec becomes the product because it is where all the thinking lives; implementation is just the reflection of the spec in runnable form
- Good specs must be honest, not just precise - they should explain trade-offs accepted, constraints being solved for, and how you will know if the spec was wrong
- Developers in 2026 need to shift from implementing specs to writing specs that are clear enough to implement themselves

There’s a moment in every developer’s career when you realize the code is not the product. The product is the decision. ...

April 7, 2026 · 6 min · James M
The architect vs builder split in AI-assisted development

The Architect vs The Builder: Redefining Engineering Roles in 2026

TL;DR
- AI has collapsed the middle rungs of the engineering ladder by automating execution - the junior-to-architect progression no longer works the way it did
- The emerging split is two human roles: Architects who decide what to build and why, and Builders who turn architectural decisions into precise, testable specifications
- Neither role exists to write code - code-writing is incidental to both, and AI handles the bulk of implementation
- The two paths require genuinely different skills that do not build cleanly on each other; taste for architectural judgment and clarity for specification are separate capabilities
- If you are a junior engineer in 2026, you need to choose your path now - the traditional ladder is a trap, and “I write good code” is no longer a sufficient value proposition

For forty years, the engineering career ladder has looked like this: ...

April 6, 2026 · 7 min · James M
What expertise means when AI can pass any exam

What Does 'Expertise' Mean When AI Can Pass Any Exam?

TL;DR
- AI can now pass virtually every professional exam, breaking the long-held assumption that passing an exam equals having expertise
- What exams actually tested was knowledge retrieval under pressure - a bottleneck that no longer exists when machines can retrieve and apply knowledge better than any human
- Real expertise is what remains after knowledge retrieval is automated: judgment, integration of context, responsibility, and taste - none of which appear on any exam
- Professions built on credentialing (law, medicine, engineering) are being forced to confront that their proxies for expertise never measured the thing they cared about
- New models of assessment - portfolio-based credentialing, apprenticeship, outcomes tracking, and community reputation - will replace exams, but none of them scale as easily

In 2023, Claude passed the bar exam. In 2024, it passed the CPA exam and medical licensing exams. By 2026, there’s barely an exam left that AI can’t pass, often on the first try. ...

April 6, 2026 · 7 min · James M
GPU servers vs API credits cost breakdown

GPU Servers vs AI API Credits: The Real Cost Breakdown (2026)

TL;DR
- The core trade-off is pay-per-use (APIs) vs pay-for-capacity (GPUs) - APIs are cheaper at low volume, GPUs win massively at high volume (100M+ tokens/day)
- The break-even point for GPU self-hosting sits around 2 to 5 million tokens per day for premium-model workloads - below that, APIs almost always win
- GPU utilisation is the most important variable: at less than 50-60% utilisation, self-hosted inference costs more per token than just calling an API
- Hidden costs matter - real GPU spend is 2x to 5x the raw hardware price once you add DevOps, scaling, monitoring, and networking; API costs can also balloon from poor prompt design and multi-step agent loops
- Most serious production systems land on a hybrid architecture: APIs for complex reasoning and long-context work, GPUs for bulk inference, embeddings, and fine-tuned models

If you’re building anything with LLMs right now, you’ll hit this question sooner than you expect: ...

April 5, 2026 · 5 min · James M
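
The break-even and utilisation claims in the GPU vs API summary above come down to simple arithmetic, sketched here. Every number in the example is a placeholder chosen to keep the sums readable, not a measured price or throughput: the hourly GPU rate, the overhead multiplier (taken from the low end of the 2x to 5x range the summary mentions), the API price per million tokens, and the GPU's daily token capacity are all assumptions, as are the function names.

```python
def gpu_daily_cost(hourly_rate, overhead_multiplier):
    """Fixed daily GPU spend: raw hardware price scaled by hidden costs."""
    return hourly_rate * 24 * overhead_multiplier


def break_even_tokens_per_day(daily_gpu_cost, api_price_per_million):
    """Daily token volume at which API spend equals the fixed GPU spend."""
    return daily_gpu_cost * 1_000_000 / api_price_per_million


def gpu_cost_per_million(daily_gpu_cost, capacity_tokens_per_day, utilisation):
    """Effective self-hosted price per million tokens at a given utilisation."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    served = capacity_tokens_per_day * utilisation
    return daily_gpu_cost * 1_000_000 / served


if __name__ == "__main__":
    # Placeholder inputs (assumptions, not quotes from any provider).
    daily = gpu_daily_cost(hourly_rate=2.0, overhead_multiplier=2.0)
    api_price = 30.0        # dollars per million tokens
    capacity = 6_400_000    # tokens per day the GPU could serve at full load

    print("GPU cost per day:", daily)
    print("Break-even tokens/day:", break_even_tokens_per_day(daily, api_price))
    for u in (0.25, 0.5, 1.0):
        print(u, gpu_cost_per_million(daily, capacity, u))
```

Because the GPU bill is fixed while the API bill scales with volume, halving utilisation doubles the effective self-hosted price per token, which is why utilisation dominates the comparison.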
What belongs in an AI dev stack in 2026

What Actually Belongs in My AI Dev Stack in 2026

TL;DR
- A single AI tool cannot handle everything - a proper AI dev stack in 2026 needs distinct layers for spec writing, fast editing, heavy agentic work, cheap model tasks, review, research, and capture
- Spec-driven development is the most underused part: writing requirements and acceptance criteria before generation dramatically improves AI output and reduces wasted iterations
- Tools like Cursor AI handle fast, in-flow editing while Claude Code or Cline are better suited to multi-file refactors and autonomous implementation from specs
- Letting the same model that generated code also review it is a weak loop - a separate review pass with a different model or explicitly critical prompt is essential
- The real shift is treating AI not as a bolt-on assistant but as part of the workflow architecture itself, with each tool assigned a clear, specific responsibility

There is a big difference between using AI for development and having an actual AI development stack. ...

April 5, 2026 · 9 min · James M