Verification

TL;DR AI hallucinations are not perceptual errors - they are confident pattern completions that happen to be unanchored in the world, and no model will ever stop producing them entirely because truth is not what the training objective optimises for Hallucinations cluster into five distinct types: factual, citation, code and API, instruction (claiming to have done something it did not), and reasoning - each with a different root cause and a different mitigation The mitigations that genuinely move the dial are structural: retrieval-augmented generation, tool use over recall, constrained structured outputs, explicit verification layers, and lower temperature for factual tasks The model is not the product; the model surrounded by retrieval, verification, structured outputs, calibration, and human-in-the-loop review is the product Hallucination is not the bug - the absence of a system around the model is the bug, and treating it as an engineering problem rather than a model problem is what separates demos from production The word “hallucination” is one of the most successful pieces of accidental marketing in our industry. It is a soft, almost endearing way to describe an LLM stating with full confidence that a function exists when it does not, that a court case was decided when it was not, that a paper was written by an author who has never published in that field. It makes the failure sound like a quirk rather than the central reliability problem of the entire technology. ...

TL;DR Traditional testing assumes determinism - given input X, function f always returns Y - but LLMs are non-deterministic, which breaks assertion-based testing at its foundation The same agentic task run twice may produce different but equally correct code, making exact-output assertions brittle and often useless The new paradigm shifts from “test the code” to “verify the intent”: property-based testing, LLM-as-a-Judge evaluation, golden datasets for regression, and human review for overall correctness Structured outputs enforce syntactic correctness at generation time, but semantic correctness - whether the output actually solves the right problem - still requires layered verification on top The future of AI quality assurance is designing robust evaluation frameworks and measuring properties of acceptable outputs, not writing exhaustive unit tests for code the model may generate differently next time We’ve embraced the future. AI agents like Cline are now the primary “builders” of software, executing complex engineering plans from high-level specifications. As I’ve argued in “The Architect vs The Builder”, the human role is shifting from execution to architectural oversight and defining intent. The patterns that determine whether agents stay shipped are covered in “AI agents that actually work”, and the wider safety framing sits in “AI safety from first principles”. ...

Verification

AI Hallucinations: Understanding and Mitigating False Outputs

AI Reliability Is Weird: Why Testing LLMs Breaks Everything You Know