Safety

Mechanistic Interpretability: Reading the Mind of a Model

TL;DR Mechanistic interpretability is the attempt to reverse-engineer a trained neural network into human-understandable parts - to say not just what a model does but which internal machinery makes it do that. The core obstacle is superposition: models pack far more concepts than they have neurons by smearing each concept across many neurons and each neuron across many concepts, so a single neuron almost never means one clean thing. Sparse autoencoders were the breakthrough that undid the smearing, pulling millions of monosemantic features out of a production model - Anthropic’s “Golden Gate Claude” demonstration proved these features are causal, not just correlational. Circuit tracing went further, showing that models plan ahead when writing poetry, share a language-independent “space of thought,” and sometimes reason backwards from a desired answer while narrating a plausible-but-fake chain of thought. I am a data engineer and an enthusiast here, not an interpretability researcher, but I think this is the single most under-watched thread in AI: it is the only path I know of to a model we can audit rather than merely test, and it quietly reshapes how I think about the mind question too.

Every other reliability technique I have written about treats the model as a black box. Retrieval, verification, structured outputs, evals - they all wrap machinery you cannot see and try to make its outputs trustworthy from the outside. That is the correct engineering stance today, and I stand by all of it. But it is also, if you sit with it, a slightly desperate stance. We are building the most consequential technology of the century and our primary safety strategy is to poke it from the outside and see what comes out. ...

Inside Anthropic: What The Bloomberg Documentary Reveals

TL;DR Bloomberg’s The Circuit with Emily Chang went inside Anthropic in a rare, in-depth episode released June 10, 2026. Dario and Daniela Amodei discuss the founding story, the Pentagon dispute, and why they say safety and commercial success are the same bet. Anthropic is now valued at $965 billion, eclipsing OpenAI’s $852 billion for the first time, after an 80-fold revenue surge in Q1 2026. The Pentagon story is not PR - Anthropic refused to remove safety guardrails from its military contract, was blacklisted by the Trump administration, and sued. A federal judge sided with Anthropic. A confidential S-1 IPO filing in June 2026 means this stops being a private company conversation soon.

The Bloomberg Documentary: Emily Chang Inside Anthropic

Bloomberg’s The Circuit has done this kind of access piece before - Zuckerberg, Musk, Jensen Huang. But the Anthropic episode feels different in tone. Emily Chang is not sitting across from a founder who has already won. She is sitting across from two founders in the middle of one of the most consequential moments in the company’s short history: record valuation, Pentagon litigation, IPO on the horizon, and model releases arriving fast enough that the competitive landscape changes every few months. ...

Policy on the AI Exponential: Dario Amodei's Case for Acting While the Window Is Open

Dario Amodei has published a new essay, Policy on the AI Exponential, and it reads like the third act of a trilogy. Machines of Loving Grace made the case for what powerful AI could give us. The Adolescence of Technology catalogued what could go wrong. This one is about the machinery in between - the laws, agencies, and international arrangements that will decide which of those two essays turns out to be the better prediction. ...

Recursive Self-Improvement: Can AI Bootstrap Its Own Intelligence?

TL;DR Recursive self-improvement (RSI) is the idea of an AI that improves its own ability to improve - each round producing a smarter system that does the next round better. It is the engine behind every “intelligence explosion” story since I.J. Good described it in 1965. The narrow version is already real. Systems like AlphaEvolve and the AI Scientist measurably improve algorithms, code, and even research output - including, in AlphaEvolve’s case, the infrastructure that trains the models themselves. The leap people fear is different: improving an algorithm is not the same as improving general intelligence. Nothing in 2026 has crossed that line, and the gap is structural, not just a matter of scale. Four bottlenecks decide whether RSI runs away or fizzles: compute, data, verification, and diminishing returns. Each is a hard physical or informational limit, not a temporary engineering nuisance. The realistic picture is steady, human-paced acceleration - AI assisting AI research - not an overnight takeoff. METR’s time-horizon data shows fast but smooth exponential progress, which is exactly what a bottlenecked process looks like. In May 2026 Anthropic put numbers on this from inside a frontier lab. Its essay When AI Builds Itself reports that over 80% of the code it merges is now written by Claude and that task horizons are doubling roughly every four months rather than seven, and it lays out a candid three-way bet on where this ends. None of it overturns the bottlenecked-flywheel picture - but it sharpens it. It still deserves serious safety attention, because a slow takeoff is the one we can actually govern.

There is a particular shape of argument that has haunted artificial intelligence since before the field had a settled name. It goes like this: build a machine slightly better than humans at designing machines, and it will design a machine better than itself. That machine designs a better one. The loop tightens, each turn faster than the last, and intelligence runs away from us in an afternoon. ...
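
To make the doubling-time figures in the TL;DR concrete, here is a minimal arithmetic sketch. It assumes clean exponential growth, so a quantity that doubles every T months grows by a factor of 2^(months / T); the function name `growth_factor` is hypothetical, and the 12-month window is chosen only for illustration.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months`, assuming a fixed doubling time."""
    return 2 ** (months / doubling_months)


# Compare one year of growth under the two doubling times quoted above.
for doubling_months in (7, 4):
    factor = growth_factor(12, doubling_months)
    print(f"doubling every {doubling_months} months -> x{factor:.2f} per year")
```

Under that assumption, a seven-month doubling time works out to roughly 3.3x per year, while a four-month doubling time works out to 8x per year. Both curves are smooth exponentials; only the rate differs.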

Personal Universes: Yampolskiy's Strangest Answer to the AI Alignment Problem

First, the thing this is all in service of. The AI alignment problem is the challenge of making a powerful AI system reliably pursue what we actually want it to pursue - getting its goals, values, and behaviour to line up with human intentions, and to stay lined up even as the system becomes more capable than the people supervising it. It sounds simple and is not: we struggle to state our own values precisely, those values conflict between people, and an AI optimising hard for a slightly-wrong objective can produce outcomes nobody asked for. The multi-agent version - aligning one system with all of humanity at once, rather than a single person - is harder still, and it is the specific version Personal Universes is trying to dodge. ...

Dario Amodei: The Anthropic CEO Betting on Safety as Strategy

Dario Amodei is one of the few frontier-lab CEOs whose public talking points have not changed materially in five years. The same message he gave to small audiences in 2021 - that powerful AI is coming faster than people think, that the safety problem is real, and that the companies building it have an obligation to do so carefully - is the message he is giving to Congress and Davos in 2026. The thing that has changed is that he now runs the company most aggressively turning that message into a commercial position. ...

Roman Yampolskiy: The Researcher Who Thinks AI Cannot Be Controlled

Most people writing about AI risk in 2026 are recent arrivals. Roman Yampolskiy is not. He has been making the same argument - that advanced AI systems may be fundamentally uncontrollable - since before the field of AI safety had a settled name, which is partly because he is the one who gave it that name. Whether you find his conclusions alarmist, prescient, or somewhere in between depends mostly on how you read the gap between current systems and the ones he writes about. This post is an attempt to lay out the man, the argument, and the reasons it deserves more than a dismissal. ...

AI Safety From First Principles: What Actually Matters vs What's Hype

TL;DR “AI safety” covers four distinct layers - product safety, system safety, model alignment, and civilisational safety - and conflating them produces incoherent debates. For engineers building production systems today, system safety dominates: most real incidents trace back to flawed system design around the model, not the model itself. Practical mitigations are unglamorous: scope tool permissions, bound blast radius, require human approval for irreversible actions, validate outputs, and observe everything. The hype conflates capability with intent, existential risk with ordinary risk, and refusal with safety - all three conflations make the conversation harder to act on. The load-bearing principle across all four layers is the same: a system should fail in ways that are detectable, recoverable, and bounded.

The AI safety conversation has reached the point where the phrase has stopped meaning anything specific. In the same week, you will see “AI safety” used to describe content moderation on a chat product, the alignment of frontier models toward human values, the question of whether superintelligence ends civilisation, and a regulatory paper about copyright. These are not the same problem. Treating them as one conversation is the reason the conversation never resolves. ...
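
Three of the mitigations named in the TL;DR (scoping tool permissions, requiring human approval for irreversible actions, and observing everything) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the post's own implementation: `Tool`, `ToolGate`, the `approve` callback, and the two demo tools are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Tool:
    name: str
    run: Callable[[str], str]
    irreversible: bool = False  # hypothetical flag: needs human sign-off


@dataclass
class ToolGate:
    allowed: dict[str, Tool]              # scope: only these tools exist
    approve: Callable[[str, str], bool]   # human-approval hook (assumed)
    audit_log: list[tuple[str, str, str]] = field(default_factory=list)

    def call(self, tool_name: str, arg: str) -> str:
        tool = self.allowed.get(tool_name)
        if tool is None:
            # Out-of-scope requests are refused and recorded.
            self.audit_log.append((tool_name, arg, "denied: not in scope"))
            raise PermissionError(f"tool {tool_name!r} is not permitted")
        if tool.irreversible and not self.approve(tool_name, arg):
            # Irreversible actions fail closed without approval.
            self.audit_log.append((tool_name, arg, "denied: no approval"))
            raise PermissionError(f"{tool_name!r} requires human approval")
        result = tool.run(arg)
        self.audit_log.append((tool_name, arg, "ok"))
        return result


# Demo wiring: the approver here always says no, standing in for a human.
gate = ToolGate(
    allowed={
        "read_file": Tool("read_file", lambda p: f"contents of {p}"),
        "delete_file": Tool("delete_file", lambda p: f"deleted {p}", irreversible=True),
    },
    approve=lambda name, arg: False,
)
```

The design choice worth noting is that every path, allowed or denied, writes to the audit log before returning or raising, so failures stay detectable, and the irreversible path fails closed, so they stay bounded.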