The Forbidden Frontier: Claude Mythos and the Dawn of Restricted AI Power

TL;DR: Claude Mythos is Anthropic’s most powerful model to date, scoring 93.9% on SWE-bench and 97.6% on USAMO 2026 - a 55-point leap over rival models. It is not publicly available; Anthropic restricted access to 12 vetted companies through Project Glasswing, focused on defensive cybersecurity. Mythos autonomously identified thousands of zero-day vulnerabilities, including a 27-year-old unpatched OpenBSD bug - making its offensive potential too dangerous to democratize. This marks a shift away from open innovation toward controlled deployment, where the most capable AI may never be publicly released. The Mythos story forces a rethink of how we evaluate AI: benchmark performance and public availability are no longer the same thing.

Imagine an artificial intelligence so profoundly capable, so far beyond anything we’ve seen, that its creators deem it too risky for public release. This isn’t a dystopian fantasy, but the real-world scenario presented by Anthropic’s Claude Mythos. When Anthropic first unveiled Mythos, the AI community was abuzz - not just with its mind-bending benchmarks, but with the immediate caveat: it would not be publicly available. This decision heralds a new era in AI, one where raw power intersects with paramount security concerns. ...

April 13, 2026 · 4 min · James M

The LLM Context Window Arms Race: Does It Actually Matter?

TL;DR: Context window size is the wrong metric to optimise for - attention scales quadratically, so larger windows mean dramatically higher latency and cost with diminishing quality gains. Retrieval-augmented generation consistently outperforms stuffing entire documents into a prompt, because focused context beats diluted context. What actually matters in production: token efficiency, prompt caching, structured output formats, and intelligent retrieval - not raw window size. Large context windows are genuinely useful for whole-document analysis and complex cross-file code review, but wasteful for Q&A, structured extraction, and high-volume routine tasks. The teams that will ship faster and scale further are those building intelligent architecture around a 200K context window, not those waiting for 1M-token models.

Every week brings a new headline: “Model X reaches 1M token context!” “Model Y supports 2M tokens!” The LLM industry seems locked in an arms race where the stated goal is always “bigger context window,” as if this single metric determines whether a model is useful. ...

April 11, 2026 · 7 min · James M

Local AI vs Cloud AI: The Tradeoff Landscape in 2026

By early 2026, the “Local vs. Cloud” debate has moved past the experimental phase. We are no longer just “trying to see if Llama runs on a Mac.” Instead, professional engineers are building sophisticated Hybrid AI Stacks where local and cloud models work in tandem. The landscape has shifted because the hardware caught up to the software. With the prevalence of unified memory on Apple Silicon and the accessibility of 24GB+ VRAM cards like the RTX 50-series, the “local” ceiling has been smashed. ...

April 11, 2026 · 5 min · James M

Cline: The Next Generation AI Coding Assistant

An exploration of Cline, the autonomous AI coding agent that lives in your IDE and handles complex, multi-step engineering tasks through tool-use and agency.

April 10, 2026 · 4 min · James M

Cline + Kanban: Autonomous Development Meets Project Management

TL;DR: Cline integrates with Kanban boards (Linear, GitHub Projects, Jira, Trello) via Model Context Protocol (MCP), closing the gap between project management and code execution. Instead of manually copy-pasting tasks, Cline reads directly from your board, works through the implementation, and updates the task status automatically when done. This makes the Kanban board the single source of truth - it stays in sync with reality rather than being an afterthought you update when you remember. Works best with clear, testable acceptance criteria; vague tasks like “improve performance” need refinement before Cline can act on them autonomously. Even with full autonomy, human code review remains essential - Cline completing a task means it is “Ready for Review”, not that it ships.

In the evolution of agentic software engineering, one critical gap remains: the disconnect between project management and code execution. Your Kanban board tracks what needs doing, but your AI assistant lives in your IDE. Cline + Kanban closes that gap. ...

April 10, 2026 · 5 min · James M

AI Reliability Is Weird: Why Testing LLMs Breaks Everything You Know

TL;DR: Traditional testing assumes determinism - given input X, function f always returns Y - but LLMs are non-deterministic, which breaks assertion-based testing at its foundation. The same agentic task run twice may produce different but equally correct code, making exact-output assertions brittle and often useless. The new paradigm shifts from “test the code” to “verify the intent”: property-based testing, LLM-as-a-Judge evaluation, golden datasets for regression, and human review for overall correctness. Structured outputs enforce syntactic correctness at generation time, but semantic correctness - whether the output actually solves the right problem - still requires layered verification on top. The future of AI quality assurance is designing robust evaluation frameworks and measuring properties of acceptable outputs, not writing exhaustive unit tests for code the model may generate differently next time.

We’ve embraced the future. AI agents like Cline are now the primary “builders” of software, executing complex engineering plans from high-level specifications. As I’ve argued in “The Architect vs The Builder”, the human role is shifting from execution to architectural oversight and defining intent. The patterns that determine whether agents stay shipped are covered in “AI agents that actually work”, and the wider safety framing sits in “AI safety from first principles”. ...

April 9, 2026 · 7 min · James M

Career-Ops: Flipping the Script on AI-Powered Job Search

TL;DR: Career-Ops is an open-source tool built on Claude Code that inverts the job search power dynamic - giving candidates AI-powered evaluation and application tools to match what companies use to filter them. Each opportunity is scored across 10 weighted dimensions on an A-F scale, producing a structured comparison that replaces the ad hoc spreadsheet most candidates rely on. The system generates ATS-optimized resumes dynamically tailored to each job description and auto-discovers new postings from 45+ pre-configured job boards. A key design principle is human-in-control: nothing auto-submits, the AI recommends and the candidate decides, making it a decision-support system rather than an automation. Career-Ops is a clean example of the broader pattern of AI tools that amplify individual judgment rather than replace it - worth studying for its architecture as much as its use case.

The job search has long been a one-way mirror - companies deploy AI to filter applications while candidates manually juggle spreadsheets, tailor cover letters, and hope their resume gets past the automated screener. Career-Ops flips that script entirely. Built on Claude Code, it’s an open-source system that gives job seekers their own AI advantage: intelligent evaluation of opportunities, automated customized applications, and systematic candidate strategy. ...

April 9, 2026 · 5 min · James M

Claude Code vs Cursor: A 6-Month Comparison

After six months of daily use, here is how the two heavyweights of AI-assisted coding compare: the terminal-native Claude Code and the IDE-integrated Cursor.

April 8, 2026 · 3 min · James M

The Automation Paradox: Why More AI Makes Human Judgment More Valuable

TL;DR: Every time AI automates a specific task, the monetary value of doing that task falls - the scarce resource shifts from execution to the judgment of what is worth doing at all. Historical precedent holds: Deep Blue did not kill professional chess, calculators did not kill accountants - automation raises the value of the thinking above the automated layer. The new hierarchy of work puts judgment first (irreplaceable), direction second (human but scalable), and execution last (increasingly commodity). Judgment is constrained opinion - it requires trade-off awareness, skin in the game, pattern recognition, and willingness to be wrong - none of which AI can replicate. The economic inversion means hiring shifts from paying for output to paying for prevention: the bad decisions not made, the features not built, the wrong paths not taken.

The automation paradox is quietly reshaping what we pay for. ...

April 7, 2026 · 6 min · James M

Spec-Driven Development: When the Brief Becomes the Product

TL;DR: Spec-driven development means making specifications iteratively precise enough that handing them to an AI produces the right result without further iteration. AI makes hidden specification costs visible - ambiguous briefs now produce wrong code instantly rather than surfacing bugs slowly during implementation. The spec becomes the product because it is where all the thinking lives; implementation is just the reflection of the spec in runnable form. Good specs must be honest, not just precise - they should explain trade-offs accepted, constraints being solved for, and how you will know if the spec was wrong. Developers in 2026 need to shift from implementing specs to writing specs that are clear enough to implement themselves.

There’s a moment in every developer’s career when you realize the code is not the product. The product is the decision. ...

April 7, 2026 · 6 min · James M