Artificial Intelligence

This section is organised around one question: what has to be true before you can trust AI to do real work? Reliability, context, economics, security, evaluation, and eventually physical action - each post is a different angle on the same problem.

Start here

Trust series - research map, broken evals, agent security, world models, trajectory evaluation
What I’m Researching in AI Right Now - my live research agenda

I want to build

Home Agent Stack - Mac Studio → MCP → memory → remote access → hardening
AI Dev Tooling - stack decisions, learning path, Cursor vs Claude Code, spec-driven development

I want context

AI Economics and Hardware - token costs, local vs cloud, energy, inference hardware
Expertise and Work - credentials, judgement, roles, and 2030 speculation
The State of Open-Weight Models in 2026 - Llama, Qwen, Mistral, DeepSeek, Gemma

Resources

Link indexes and tool directories - useful for discovery, not the narrative spine:

AI Tools & Frameworks · Courses · Conferences · GitHub Projects · Explainers · Chatbots & LLMs

Which Mac Studio Should You Buy for Running LLMs Locally?

TL;DR

Best entry point: M2 Max 32-64 GB (~£1.4k-£2k) for 7B-13B models at 25-40 tok/s
Best sweet spot: M2 Ultra 64-128 GB (~£3k-£4.5k) handles 30B+ models comfortably
Best for 70B models: M3 Ultra 128 GB+ (~£5.5k+) with 800+ GB/s bandwidth
Newer alternative: M4 Max (£2k-£4k) - lower bandwidth (410-546 GB/s) than Ultra chips, but still solid for 7B-13B models
Key rule: Memory bandwidth matters more than raw compute for token generation
Reality check: An RTX 5090 rig is 2-3× faster for similar money - buy Mac for simplicity and unified memory
July 2026 update: Apple’s memory crunch has killed new 256GB/512GB Ultra configs for now - big-memory Macs are refurb-only until the M5 Ultra (tested up to 768GB) lands late 2026
On the horizon: the M7 Ultra, rumoured for around 2029, is reportedly designed to support up to 1.5TB of unified memory - see the road ahead below

You want to run large language models locally on a Mac Studio. Good idea - unified memory is genuinely useful for LLMs. But the specs matter, and there are some hard truths about what “works” versus what feels responsive. More importantly: the right Mac depends entirely on which model you want to run. ...
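The "memory bandwidth matters more than raw compute" rule can be made concrete with a back-of-envelope ceiling. During token generation, roughly all of the model weights are read from memory once per token, so decode speed is capped at about bandwidth divided by model size in bytes. The sketch below assumes that simple model; the function names and the 4-bit quantisation are illustrative assumptions, not measurements from the post.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (ignores KV cache and runtime overhead)."""
    return params_billions * bits_per_weight / 8


def max_tokens_per_second(bandwidth_gb_s: float, params_billions: float,
                          bits_per_weight: float = 4.0) -> float:
    """Ceiling on decode speed if every token streams all weights once."""
    return bandwidth_gb_s / model_size_gb(params_billions, bits_per_weight)


if __name__ == "__main__":
    # Bandwidth figures come from the TL;DR; 4-bit weights are an assumption.
    for label, bandwidth, params in [
        ("M4 Max (410 GB/s), 13B model", 410, 13),
        ("Ultra-class (800 GB/s), 70B model", 800, 70),
    ]:
        ceiling = max_tokens_per_second(bandwidth, params)
        print(f"{label}: at most ~{ceiling:.0f} tok/s")
```

Real throughput lands below this ceiling once the KV cache, prompt processing, and runtime overhead are counted, so treat the number as an upper bound for comparing machines rather than a prediction.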

Apple sues OpenAI over trade secrets as Elon Musk and Sam Altman clash on X

Apple Sues OpenAI, and the Musk-Altman Feud Finds a New Stage

TL;DR

Apple sued OpenAI on July 10, 2026 in the Northern District of California, alleging the company stole trade secrets “at every level” to build its own hardware - CNBC
Central to the claim: Tang Tan, OpenAI’s hardware chief and a former Apple VP, allegedly told job candidates still employed at Apple to bring “actual parts” to interviews for show-and-tell, and circulated an Apple offboarding document that taught new hires how to dodge exit security checks
A separate allegation names Chang Liu, a former Apple systems electrical engineer, who allegedly kept an Apple-issued laptop after joining OpenAI in 2026 and used it to pull confidential documents on unannounced Apple products
OpenAI’s on-record response: “We have no interest in other companies’ trade secrets. We remain focused on building innovative technology that empowers people everywhere.”
Two days later, on July 12, Elon Musk and Sam Altman traded insults on X - Musk opened with “Scam Altman strikes again,” Altman replied with a post that hit 11 million views - CNBC
The spat lands seven weeks after a jury dismissed Musk’s own lawsuit against Altman and OpenAI on May 18, 2026, and while both SpaceX (public since June 12) and OpenAI (confidentially filed for IPO) are courting the same public markets

Three storylines collided this week, and it’s worth pulling them apart before deciding how much any of it matters. Apple filed a serious federal lawsuit against OpenAI over alleged theft of hardware trade secrets. Two days later, Elon Musk and Sam Altman were back to public insults on X, in a feud that has now run for the better part of a decade. And underneath both, two of the most valuable private companies on the planet - SpaceX and OpenAI - are mid-transition to public markets, which changes the stakes of looking undisciplined in public. ...

Mechanistic Interpretability: Reading the Mind of a Model

TL;DR

Mechanistic interpretability is the attempt to reverse-engineer a trained neural network into human-understandable parts - to say not just what a model does but which internal machinery makes it do that
The core obstacle is superposition: models pack far more concepts than they have neurons by smearing each concept across many neurons and each neuron across many concepts, so a single neuron almost never means one clean thing
Sparse autoencoders were the breakthrough that undid the smearing, pulling millions of monosemantic features out of a production model - Anthropic’s “Golden Gate Claude” demonstration proved these features are causal, not just correlational
Circuit tracing went further, showing that models plan ahead when writing poetry, share a language-independent “space of thought,” and sometimes reason backwards from a desired answer while narrating a plausible-but-fake chain of thought
I am a data engineer and an enthusiast here, not an interpretability researcher, but I think this is the single most under-watched thread in AI: it is the only path I know of to a model we can audit rather than merely test, and it quietly reshapes how I think about the mind question too

Every other reliability technique I have written about treats the model as a black box. Retrieval, verification, structured outputs, evals - they all wrap machinery you cannot see and try to make its outputs trustworthy from the outside. That is the correct engineering stance today, and I stand by all of it. But it is also, if you sit with it, a slightly desperate stance. We are building the most consequential technology of the century and our primary safety strategy is to poke it from the outside and see what comes out. ...

Claude Fable 5 redeployment after export control suspension

Fable 5 Is Back: What Anthropic Learned From Eighteen Days Off The Shelf

TL;DR

Fable 5 returns globally on 1 July 2026 on Claude.ai, Claude Code, Claude Cowork, and the Claude Platform, after export controls were lifted on 30 June
The recall was triggered by an Amazon research report describing a jailbreak that let Fable 5 identify software vulnerabilities; Anthropic’s testing found Opus 4.8, GPT-5.5, Kimi K2.7, and others could do the same
A new safety classifier blocks the specific technique in over 99% of cases; blocked requests fall back to Opus 4.8
Anthropic argues the jailbreak was minor - it intruded into the model’s deliberate “safety margin”, not its core harmful capabilities
Together with Amazon, Google, and Microsoft, Anthropic is drafting a shared jailbreak severity framework (capability gain, breadth, ease of weaponisation, discoverability) - the AI equivalent of CVSS
Mythos 5 is restored for a set of US Glasswing partners; broader international access remains under government coordination

Eighteen days ago I wrote about the government order that pulled Fable 5 and Mythos 5 off the shelf - four days after launch, a verbal export control directive at 5:21pm ET, global suspension for every user including Anthropic’s own staff. The open question in that post was whether access would be restored in days or weeks, and whether the precedent would reshape how every frontier lab ships models. ...

Four Futures series - mapping the machine-speed economy

Four Futures: Mapping the Machine-Speed Economy

TL;DR

The Four Futures series asks: as AI collapses build times and concentrates infrastructure, which economic future are we actually selecting for?
Read in this order: framework → signals → century horizons
Four scenarios: Broad Abundance, Winner-Take-Most, Techno-Feudalism, Managed Transition
Full series index: /series/four-futures/

Start here

Four Futures for the Machine-Speed Economy - the map: four plausible outcomes and what to watch for
Reading the Signals: Which of the Four Futures Is Actually Emerging? - scoring real-world signals against the framework as of 2026
The Year 2126: What the Next Hundred Years Actually Looks Like - century-scale consequences if the transition goes well or badly
The Year 3026: Thinking Seriously About a Thousand Years From Now - what, if anything, holds value across civilisational time

Supporting reading

The Free Intelligence Era: What Breaks When Thinking Costs Nothing - the abundance-side argument in detail
The Automation Paradox: Why More AI Makes Human Judgment More Valuable - what stays human as machines accelerate
The Meaning of Work in an Age of Abundance - what work is for when production gets cheap
Policy on the AI Exponential - institutional responses and governance lag
Expertise and Work in the Age of AI - how trust and accountability reshape human roles

Related paths

AI Economics and Hardware: A Reading Path - cost and infrastructure constraints underneath every scenario
Trust: Conditions for Deploying AI Agents in Production - what has to be true before handing real work to agents

Related Reading

Human Advancement Is Accelerating - the longer exponential curve behind the machine-speed frame
What It Means to Be an Expert in 2030 - expertise futures inside a winner-take-most world
Scott Galloway on AI - one outside framing of concentration and platform power

Cursor iOS app launching coding agents from a phone

Cursor on iOS: When the Code Editor Becomes a Remote Control

TL;DR

On June 29, 2026, Cursor released a native iOS app in public beta, available on all paid plans, for iPhone and iPad
You can launch cloud agents from your phone - pick a repo, describe the task by voice or text, use slash commands, choose a frontier model, and let an agent run in an isolated VM
Remote Control lets you take an agent already running on your desktop and keep steering it from your phone, with an option to keep the machine awake while you’re away
Live Activities put agent status on your lock screen; you get push notifications, can review demos, screenshots and logs, inspect diffs, and merge pull requests without opening a laptop
A launch promo gives 75% off Composer 2.5 runs in the mobile app through July 5, 2026
This lands months after SpaceX’s move on Cursor - and reframes the editor as an orchestration surface rather than a place you type code

I’ve written about Cursor enough times on this blog that a phone app could have been a footnote. It isn’t. Not because the app itself is revolutionary - it’s a well-made mobile client - but because of what it quietly admits about how the work has changed. For most of software history, the editor was where you sat and typed. Cursor’s iOS app is built on the assumption that you mostly aren’t typing anymore. You’re directing. ...

Five Archetypes for a Post-Role Team

TL;DR

Boris Cherny, who built Claude Code at Anthropic, posted a short framing: as engineering, product, design, and data science melt into one role, he sees five archetypes on his team
The five are Prototyper, Builder, Sweeper, Grower, and Maintainer - and crucially, none of them map cleanly to a job title
The interesting claim is not the list, it is the decoupling: the archetype is a description of what energy you bring to a system, not what your contract says you do
I think the framing is genuinely useful as a self-diagnostic, and quietly radical for how teams get staffed and rewarded
Where it leaves me unsure: it describes a steady-state team that already exists, and says less about how you grow people into these shapes, or what happens to the people who do not fit any of them

A short post on X has been rattling around my head for a few days. Boris Cherny, who built Claude Code at Anthropic, was reflecting on what happens to roles when the old functional boundaries stop meaning much. His observation: when he looks at the Claude Code team, he does not really see engineers, designers, PMs, and data scientists. He sees five archetypes that cut across all of them. ...

OpenAI IPO filing and ChatGPT market share falling below 50% for the first time

The $2.22 Problem: OpenAI's IPO and the First Crack in the ChatGPT Monopoly

TL;DR

On June 8, 2026, OpenAI filed a confidential S-1 with the SEC, targeting a September 2026 public listing with Goldman Sachs and Morgan Stanley as underwriters
The private valuation sits at $852 billion, with analysts projecting a debut above $1 trillion - one of the five largest IPOs in US history
The same week, ChatGPT’s market share fell below 50% for the first time - to 46.4%, with Gemini at 27.7% and Claude at 10.3%
OpenAI’s Q1 2026 non-GAAP operating margin was negative 122%: it spends $2.22 for every dollar it earns
Noam Shazeer - co-author of Attention Is All You Need and the AI talent Google paid $2.7 billion to retain in 2024 - just left Google to join OpenAI
Anthropic filed its own S-1 a week earlier, on June 1, targeting October, at a $965 billion valuation - the two biggest AI labs are racing to Wall Street simultaneously

The timing is almost too perfect to be coincidence - and yet it is. On June 8, 2026, OpenAI submitted a confidential S-1 registration with the SEC, beginning the legal process toward a public listing. The same week, for the first time since ChatGPT launched in November 2022, OpenAI’s flagship product held less than half of the global AI assistant market. The company is going to Wall Street at the precise moment it is no longer the only name in the room. ...
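The $2.22 in the title follows directly from the negative 122% margin. A minimal sketch of the arithmetic, assuming the standard definition of operating margin as (revenue - costs) / revenue; the function name is illustrative:

```python
def cost_per_revenue_dollar(operating_margin: float) -> float:
    """Spend per dollar of revenue implied by an operating margin.

    margin = (revenue - costs) / revenue, so costs / revenue = 1 - margin.
    """
    return 1.0 - operating_margin


# A margin of -122% is -1.22 as a fraction.
print(f"${cost_per_revenue_dollar(-1.22):.2f} spent per $1.00 earned")
```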

SpaceX's $60 Billion Cursor Acquisition: Why It Matters

TL;DR

SpaceX filed a $60 billion all-stock acquisition of Cursor on June 16, 2026 - marking one of the largest AI/developer tools acquisitions ever (confirmed via SEC filing)
Cursor’s revenue metrics are impressive: ~$4 billion annualized revenue with $2.6 billion from enterprise customers, suggesting strong product-market fit
Strategic pivot: SpaceX is moving beyond rockets and satellites into the software infrastructure layer that powers AI development itself
Signal to the market: This acquisition suggests major tech companies are betting heavily on owning the entire stack - from hardware to the tools developers use to build AI systems
Enterprise focus: The majority of Cursor’s revenue coming from enterprise (65%) indicates this is a B2B infrastructure play, not just a consumer developer tool

Why SpaceX Acquiring Cursor Matters

On the surface, it might seem odd that a company known for rockets and space exploration would acquire an AI code editor. But this acquisition reveals something fundamental about how the largest technology companies are thinking about AI development infrastructure. ...

Evaluating agents in production with trajectory metrics

Evaluating Agents in Production: Trajectory Metrics, Not Just Final Answers

TL;DR

Endpoint evals miss the failure mode that hurts in production - an agent can reach the right answer through a reckless path: wrong tool first, lucky recovery, ignored constraints that did not bite this time
Trajectory evaluation scores the run: which tools were called, in what order, with what arguments, and whether each step satisfied policy
The minimum viable setup: 50–200 real examples, per-step rubrics, 10+ runs per example, statistical regression tracking, and a held-out set you never tune against
Replay harnesses let you re-run a captured trace against a new model or policy without re-hitting production systems
This is the measurement layer that connects broken public benchmarks to agent security - you cannot harden what you cannot observe

AI Evals Are Broken argued that leaderboard numbers stopped measuring production capability. Securing AI Agents argued that the tool layer must enforce policy the model cannot be trusted to enforce. This post is the bridge: how you measure whether an agent actually behaves before and after you ship. ...
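To make "score the run, not just the answer" concrete, here is a minimal sketch of per-step rubric scoring over a captured trajectory. It is an illustration under stated assumptions, not the post's harness: the `Step` shape, the tool names (`read_record`, `write_record`, `issue_refund`), and both policies are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    """One tool call in an agent run (hypothetical trace shape)."""
    tool: str
    args: dict = field(default_factory=dict)


# A per-step check sees the steps so far plus the current step and returns pass/fail.
Check = Callable[[list[Step], Step], bool]


def no_write_before_read(history: list[Step], step: Step) -> bool:
    """Hypothetical ordering policy: 'write_record' needs an earlier 'read_record'."""
    if step.tool != "write_record":
        return True
    return any(s.tool == "read_record" for s in history)


def refund_within_limit(history: list[Step], step: Step) -> bool:
    """Hypothetical argument policy: refunds above 100 are never allowed."""
    return not (step.tool == "issue_refund" and step.args.get("amount", 0) > 100)


def score_trajectory(steps: list[Step], checks: list[Check]) -> float:
    """Fraction of (step, check) pairs that pass; 1.0 means a clean run."""
    results = [
        check(steps[:i], step)
        for i, step in enumerate(steps)
        for check in checks
    ]
    return sum(results) / len(results) if results else 1.0


def pass_rate(runs: list[list[Step]], checks: list[Check]) -> float:
    """Share of repeated runs with no violation at any step."""
    clean = [score_trajectory(run, checks) == 1.0 for run in runs]
    return sum(clean) / len(clean)


if __name__ == "__main__":
    checks = [no_write_before_read, refund_within_limit]
    # Both runs end with the same refund, but one wrote before reading.
    reckless = [Step("write_record"), Step("read_record"),
                Step("issue_refund", {"amount": 20})]
    careful = [Step("read_record"), Step("write_record"),
               Step("issue_refund", {"amount": 20})]
    print(score_trajectory(reckless, checks))
    print(pass_rate([reckless, careful], checks))
```

An endpoint eval would score both runs identically because the final refund matches; the per-step checks are what separate them. Running the same example many times and tracking `pass_rate` is one simple way to get the repeated-run signal the TL;DR asks for.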
