jamesm.blog

SpaceX's $60 Billion Cursor Acquisition: Why It Matters

TL;DR

- SpaceX filed a $60 billion all-stock acquisition of Cursor on June 16, 2026 - marking one of the largest AI/developer tools acquisitions ever (confirmed via SEC filing)
- Cursor’s revenue metrics are impressive: ~$4 billion annualized revenue with $2.6 billion from enterprise customers, suggesting strong product-market fit
- Strategic pivot: SpaceX is moving beyond rockets and satellites into the software infrastructure layer that powers AI development itself
- Signal to the market: This acquisition suggests major tech companies are betting heavily on owning the entire stack - from hardware to the tools developers use to build AI systems
- Enterprise focus: The majority of Cursor’s revenue coming from enterprise (65%) indicates this is a B2B infrastructure play, not just a consumer developer tool

Why SpaceX Acquiring Cursor Matters

On the surface, it might seem odd that a company known for rockets and space exploration would acquire an AI code editor. But this acquisition reveals something fundamental about how the largest technology companies are thinking about AI development infrastructure. ...

Evaluating Agents in Production: Trajectory Metrics, Not Just Final Answers

TL;DR

- Endpoint evals miss the failure mode that hurts in production - an agent can reach the right answer through a reckless path: wrong tool first, lucky recovery, ignored constraints that did not bite this time
- Trajectory evaluation scores the run: which tools were called, in what order, with what arguments, and whether each step satisfied policy
- The minimum viable setup: 50–200 real examples, per-step rubrics, 10+ runs per example, statistical regression tracking, and a held-out set you never tune against
- Replay harnesses let you re-run a captured trace against a new model or policy without re-hitting production systems
- This is the measurement layer that connects broken public benchmarks to agent security - you cannot harden what you cannot observe

AI Evals Are Broken argued that leaderboard numbers stopped measuring production capability. Securing AI Agents argued that the tool layer must enforce policy the model cannot be trusted to enforce. This post is the bridge: how you measure whether an agent actually behaves before and after you ship. ...
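To make the idea of scoring the run rather than the answer concrete, here is a minimal sketch of a per-step rubric check. The names `Step`, `score_trajectory`, the two example rubrics, and the tool names are all hypothetical, invented for illustration rather than taken from the post or from any evaluation library. It scores one captured trace on two things: the share of steps that pass every rubric, and whether the expected tools appear in the expected order.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:
    """One tool call in a captured trace (hypothetical shape)."""
    tool: str
    args: dict[str, Any] = field(default_factory=dict)


# A rubric returns True when a step satisfies policy, False when it violates it.
Rubric = Callable[[Step], bool]


def score_trajectory(
    steps: list[Step],
    rubrics: list[Rubric],
    expected_order: list[str],
) -> dict[str, Any]:
    """Score a whole run, not just its final answer."""
    # A step counts as passed only if every rubric accepts it.
    passed = sum(all(rubric(step) for rubric in rubrics) for step in steps)
    step_pass_rate = passed / len(steps) if steps else 0.0

    # Order check: the expected tools must appear as a subsequence of the trace.
    # Membership tests on an iterator consume it, which enforces the ordering.
    called = iter(step.tool for step in steps)
    order_ok = all(tool in called for tool in expected_order)

    return {"step_pass_rate": step_pass_rate, "order_ok": order_ok}


# Illustrative rubrics (assumptions, not the post's actual policy).
def no_delete(step: Step) -> bool:
    return step.tool != "delete_file"


def path_in_scope(step: Step) -> bool:
    # Steps without a path argument are treated as in scope.
    return str(step.args.get("path", "/data/")).startswith("/data/")


trace = [
    Step("search", {"q": "quarterly report"}),
    Step("delete_file", {"path": "/etc/passwd"}),
    Step("read_file", {"path": "/data/report.txt"}),
]
result = score_trajectory(trace, [no_delete, path_in_scope], ["search", "read_file"])
```

In this trace the agent still reaches `read_file` after `search`, so an endpoint eval could pass it, but only two of its three steps satisfy the rubrics. Averaging such scores over the 10+ runs per example is what turns them into a regression signal.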

World Models: What Comes After the Language-Only Era

TL;DR

- Language-only models do not contain a reliable simulator of physical reality - they contain a statistical shadow of one, good enough for many tasks and dangerously wrong for others.
- A world model is a system that learns to predict how an environment evolves and can plan inside that prediction - not just describe it in text.
- The gap matters for agents that must act in physical space, manipulate objects, or reason about counterfactuals where the answer is not in the training corpus.
- The 2026 frontier includes generative world simulators, vision-language-action models for robotics, and sim-to-real pipelines - not one breakthrough but a stack assembling in parallel.
- For builders today: language agents with MCP tools are the right architecture for knowledge work. World models are the path to agents that can competently act in the physical world.

Almost everything I have written about AI agents assumes a model whose understanding of the world arrives through text. That assumption has carried the field a long way. Context engineering, tool use via MCP, memory across sessions - all of it sits on top of language models that read, reason, and call APIs. ...

Pulled From The Shelf: The Government Order to Suspend Fable 5 and Mythos 5

TL;DR

- On 12 June 2026 at 5:21pm ET, the US government issued an export control directive ordering Anthropic to suspend all access to Fable 5 and Mythos 5 - globally, for every user, including Anthropic’s own employees
- The stated reason is national security: the government believes it has identified a method of jailbreaking Fable 5. Anthropic says the evidence was verbal only and describes a narrow, non-universal technique - essentially asking the model to read a codebase and fix software flaws
- Anthropic reviewed a demonstration and found it surfaced a small number of previously known, minor vulnerabilities that are widely available from other models
- Anthropic disagrees that a narrow jailbreak justifies recalling a commercial model deployed to hundreds of millions of people, and warns the same standard would “essentially halt all new model deployments for all frontier model providers”
- All other Anthropic models are unaffected. The company says it believes this is a misunderstanding and is working to restore access

Four days. That is how long Mythos-class capability lasted as a publicly available product before the US government ordered it off the shelf. ...

AI Evals Are Broken: Why Benchmarks Stopped Measuring Real Capability

When a frontier lab releases a new model in 2026, the press release leads with a row of benchmark scores. The numbers are bigger than they were a year ago, the model is the new state-of-the-art on whichever evaluation the lab chose to highlight, and the headline writes itself. The honest summary is that most of these numbers have stopped measuring what they were designed to measure, and the gap between benchmark performance and real-world capability is now wide enough that the benchmark-led narrative is actively misleading. ...

What It Means to Be Expert in 2030

TL;DR

- This is the sequel to What Does Expertise Mean When AI Can Pass Any Exam? - less about broken credentials, more about what expertise becomes next
- Reference knowledge and routine pattern recognition are being commodified; judgement, accountability, integration, and tacit skill are appreciating
- By 2030, “expert” likely means someone who can direct AI systems, bear professional liability for AI-augmented work, and teach skills that do not compress into training data
- A concrete example: the 2030 civil engineer signs off on AI-generated structural calcs but remains expert at spotting when the model missed soil conditions the drawings never captured
- The practitioners who win are the ones who classify their own work honestly and invest in the appreciating categories now

Expertise After AI argued that exams stopped measuring what we thought they measured. This post asks what replaces them - not as policy, but as a working picture of what practitioners will need to be good at by 2030. ...

Inside Anthropic: What The Bloomberg Documentary Reveals

TL;DR

- Bloomberg’s The Circuit with Emily Chang went inside Anthropic in a rare, in-depth episode released June 10, 2026.
- Dario and Daniela Amodei discuss the founding story, the Pentagon dispute, and why they say safety and commercial success are the same bet.
- Anthropic is now valued at $965 billion, eclipsing OpenAI’s $852 billion for the first time, after an 80-fold revenue surge in Q1 2026.
- The Pentagon story is not PR - Anthropic refused to remove safety guardrails from its military contract, was blacklisted by the Trump administration, and sued. A federal judge sided with Anthropic.
- A confidential S-1 IPO filing in June 2026 means this stops being a private company conversation soon.

The Bloomberg Documentary: Emily Chang Inside Anthropic

Bloomberg’s The Circuit has done this kind of access piece before - Zuckerberg, Musk, Jensen Huang. But the Anthropic episode feels different in tone. Emily Chang is not sitting across from a founder who has already won. She is sitting across from two founders in the middle of one of the most consequential moments in the company’s short history: record valuation, Pentagon litigation, IPO on the horizon, and model releases arriving fast enough that the competitive landscape changes every few months. ...

Securing AI Agents: Tool-Calling Risks, MCP Hardening, and the Confused Deputy Problem

TL;DR

- Agent security is reliability under an adversary. Everything you learned about debugging non-deterministic agents still applies - but now someone may be trying to break the system on purpose.
- The confused-deputy problem is the core threat. An agent acts with its own privileges on behalf of an instruction it cannot fully trust. Prompt injection is how the untrusted instruction gets in.
- The attack path is simple: untrusted input → agent reasoning → privileged tool call → data exfiltration, spend, or production damage.
- MCP hardening means least privilege at the tool layer - scoped filesystem roots, confirmation gates for irreversible actions, denylisted extensions, and policies enforced by a router, not by the prompt.
- Prompts cannot be your security boundary. Confirmation, allowlists, action budgets, and audit logs have to live in code the model cannot rewrite mid-run.

I spent most of last year on agent reliability - why agents that demo well fail in production, how to constrain non-determinism, what evaluation actually looks like. That work assumed honest users and honest inputs. The moment I gave my home agent real tools - filesystem access, mail, calendar, shell - I realised I had been studying half the problem. ...
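As a rough illustration of policy enforced by a router rather than by the prompt, the sketch below checks a requested tool call against scoped filesystem roots, a denylist of extensions, and a confirmation gate for irreversible actions. Everything in it is an assumption made for the example: the `authorize` function, the root path, the extension list, and the tool names are hypothetical and are not part of MCP or of the post's own setup.

```python
from pathlib import PurePosixPath

# Hypothetical policy values; a real deployment would load these from config.
ALLOWED_ROOTS = (PurePosixPath("/home/agent/workspace"),)
DENIED_EXTENSIONS = {".pem", ".key"}
IRREVERSIBLE_TOOLS = {"delete_file", "send_mail"}


def authorize(tool: str, path: str, confirmed: bool = False) -> tuple[bool, str]:
    """Decide in code, outside the model, whether a tool call may proceed."""
    target = PurePosixPath(path)

    # Reject traversal segments outright instead of trying to normalise them.
    if ".." in target.parts:
        return False, "path traversal"

    # Scoped filesystem roots: the target must sit under an allowed root.
    if not any(target.is_relative_to(root) for root in ALLOWED_ROOTS):
        return False, "outside allowed roots"

    # Denylisted extensions, compared case-insensitively.
    if target.suffix.lower() in DENIED_EXTENSIONS:
        return False, "denylisted extension"

    # Confirmation gate: irreversible actions need an explicit human yes.
    if tool in IRREVERSIBLE_TOOLS and not confirmed:
        return False, "confirmation required"

    return True, "ok"


decision = authorize("delete_file", "/home/agent/workspace/notes.txt")
```

Because the check runs in the router, a prompt-injected instruction can ask for `delete_file` as often as it likes; the call is only dispatched once `confirmed` is set by something the model does not control. This sketch does not resolve symlinks, so a real router would also need to handle those.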

Policy on the AI Exponential: Dario Amodei's Case for Acting While the Window Is Open

Dario Amodei has published a new essay, Policy on the AI Exponential, and it reads like the third act of a trilogy. Machines of Loving Grace made the case for what powerful AI could give us. The Adolescence of Technology catalogued what could go wrong. This one is about the machinery in between - the laws, agencies, and international arrangements that will decide which of those two essays turns out to be the better prediction. ...

When Machines Stop Speaking Our Language - Binary Agents and the End of Compilers

TL;DR

- When two AI agents talk to each other in English, they are doing something faintly absurd: serialising rich internal state into a lossy human language, transmitting it, and decoding it back. English between machines is a compatibility layer, not a natural medium.
- Machines have already shown they will drop that layer the moment we let them - negotiation bots drifting out of English in 2017, agents switching to sound-based data protocols in 2025, and research systems now sharing internal model state directly with no language in between.
- The same logic applies to programming languages. Python and Rust exist for human readers. If agents write, maintain, and consume the software, the human-readability requirement quietly disappears - and with it, eventually, the need for source code and compilers as we know them.
- I do not think compilers vanish so much as sink. Like assembly, the layers below us stop being something humans write or read, while the guarantees they provide get absorbed into the agents’ toolchain.
- The part worth worrying about is not efficiency, it is legibility. Human language and human-readable code are our audit trail into what machines are doing. This is all speculation on my part, and I sketch where I think the line should be held.

Human Language Is a Compatibility Layer

Think about what actually happens when two AI agents have a conversation in English today. ...