AI Reliability Is Weird: Why Testing LLMs Breaks Everything You Know

We’ve embraced the future. AI agents like Cline are now the primary “builders” of software, executing complex engineering plans from high-level specifications. As I’ve argued in “The Architect vs The Builder”, the human role is shifting from execution to architectural oversight and defining intent. The patterns that determine whether agents stay shipped are covered in “AI agents that actually work”, and the wider safety framing sits in “AI safety from first principles”. But this shift introduces a profound, often uncomfortable, question: How do we know it actually works? ...

April 9, 2026 · 6 min · James M