Ai | jamesm.blog

Recursive Self-Improvement: Can AI Bootstrap Its Own Intelligence?

TL;DR
- Recursive self-improvement (RSI) is the idea of an AI that improves its own ability to improve - each round producing a smarter system that does the next round better. It is the engine behind every “intelligence explosion” story since I.J. Good described it in 1965.
- The narrow version is already real. Systems like AlphaEvolve and the AI Scientist measurably improve algorithms, code, and even research output - including, in AlphaEvolve’s case, the infrastructure that trains the models themselves.
- The leap people fear is different: improving an algorithm is not the same as improving general intelligence. Nothing in 2026 has crossed that line, and the gap is structural, not just a matter of scale.
- Four bottlenecks decide whether RSI runs away or fizzles: compute, data, verification, and diminishing returns. Each is a hard physical or informational limit, not a temporary engineering nuisance.
- The realistic picture is steady, human-paced acceleration - AI assisting AI research - not an overnight takeoff. METR’s time-horizon data shows fast but smooth exponential progress, which is exactly what a bottlenecked process looks like.
- It still deserves serious safety attention, because a slow takeoff is the one we can actually govern.

There is a particular shape of argument that has haunted artificial intelligence since before the field had a settled name. It goes like this: build a machine slightly better than humans at designing machines, and it will design a machine better than itself. That machine designs a better one. The loop tightens, each turn faster than the last, and intelligence runs away from us in an afternoon. ...

Context Engineering: The Discipline That Replaced Prompt Engineering

TL;DR
- Prompt engineering optimised the wording of a single human-written request. Context engineering optimises the entire set of tokens in the model’s window across a whole run - system prompt, tool definitions, retrieved documents, tool results, conversation history, and memory.
- The shift happened because of agents. The window is no longer one prompt you wrote - it is an accumulation that grows on every step, and most of it is produced by the system, not by you.
- More context is not better context. Research on “context rot” and the older lost-in-the-middle effect shows that model accuracy degrades as the window fills, even well below the advertised limit.
- The four levers are retrieval (what you pull in), memory (what persists across runs), tool results (what tools dump back), and compaction (what you summarise and discard).
- Treat the window as a budget. Measure its token composition, design tools to return terse output, curate rather than accumulate, and keep the static prefix stable so prompt caching still works.

For a few years, “prompt engineering” was the named skill of working with language models. It meant finding the wording, the framing, the few-shot examples, and the role instructions that coaxed the best answer out of a single request. It produced a small industry of prompt libraries, prompt marketplaces, and job titles. And in 2026 it is mostly gone, absorbed into something larger and harder. ...
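To make "measure its token composition" concrete, here is a minimal sketch of what that measurement might look like. It is not taken from the post: the function names are hypothetical, and the whitespace word count is only a crude stand-in for whatever tokenizer your model actually uses.

```python
from collections import Counter


def approx_tokens(text: str) -> int:
    # Assumption: whitespace-separated words as a rough proxy for tokens.
    # Swap in your model's real tokenizer for meaningful numbers.
    return len(text.split())


def window_composition(segments):
    """Return {source: (tokens, share_of_window)} for (source, text) segments."""
    counts = Counter()
    for source, text in segments:
        counts[source] += approx_tokens(text)
    total = sum(counts.values())
    return {
        source: (n, n / total if total else 0.0)
        for source, n in counts.items()
    }


def over_budget(segments, budget: int) -> bool:
    # True when the approximate token total exceeds the budget you set.
    return sum(approx_tokens(text) for _, text in segments) > budget
```

Feeding it a list such as `[("system_prompt", ...), ("tool_result", ...), ("history", ...)]` on every agent step would show which source is eating the window, which is the point at which terse tool output or compaction starts to pay off.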

Composer 2.5: Cursor's In-House Model Grows Up

TL;DR
- Composer 2.5 is Cursor’s most capable in-house coding model yet, built on Moonshot’s open-source Kimi K2.5 checkpoint with about 85% of total training compute spent on Cursor’s own continued pretraining and RL.
- The model is purpose-built for the agent loop inside Cursor - long-horizon tasks, hundreds of tool calls, multi-step instructions - rather than as a general-purpose chat model.
- Cursor claims parity with Claude Opus 4.7 and GPT-5.5 on its own CursorBench v3.1 (63.2%) and a strong 79.8% on SWE-Bench Multilingual.
- Pricing is dramatically lower: $0.50 / $2.50 per million input/output tokens on the default variant, with included usage doubled for the first week.
- Together with SpaceXAI, Cursor is now training a much larger successor model from scratch on Colossus 2 with around 10x the compute - so 2.5 is a waypoint, not the endgame.

For a while, Cursor was an IDE wrapped around someone else’s models - Claude, GPT, Gemini. That story has shifted. With Composer 2.5, released this week, Cursor has shipped its most capable first-party coding model yet, and it is a serious enough piece of work that it deserves real consideration as a daily driver rather than a budget fallback. ...
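As a rough illustration of what the quoted pricing means for a long agent run, the arithmetic can be sketched as below. The two prices come from the summary above; the function name and the token counts in the example are made up for illustration, and the sketch ignores caching discounts and any other billing details the post may cover.

```python
# Prices quoted above for the default variant, in USD per million tokens.
INPUT_PRICE_PER_M = 0.50
OUTPUT_PRICE_PER_M = 2.50


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one run, assuming flat per-token pricing."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_M
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    )


# Hypothetical long-horizon task: 2M input tokens, 200k output tokens.
example_cost = run_cost(2_000_000, 200_000)  # 1.00 + 0.50 = 1.50 USD
```

Agent loops tend to be input-heavy because the window is re-sent on every tool call, so the input price usually dominates the bill even though the output price is five times higher per token.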

AI as Analogy Engine: Synthesis, Invention, and the Combinatorial Frontier

A common dismissal of modern AI goes like this: “It is just a fancy autocomplete. It memorises text and stitches it back together. There is no real understanding, only retrieval.” It is a comforting story, and it has the shape of a critique that ought to be true. But spend enough time with frontier systems and a different picture starts to form. The thing that large models actually seem to be good at is not memorisation. It is something stranger and arguably more important: the formation of analogies, the combination of distant concepts, and the generation of conceptual relationships that were not explicitly present in any one place in the training data. ...

The Agent Reliability Problem: Debugging Non-Deterministic Systems

The conventional reliability engineering toolkit was built for systems that behaved the same way each time given the same input. AI agents do not behave the same way each time given the same input. The classic tools - unit tests, integration tests, deterministic replay, traditional monitoring - all assume a property that the systems being operated do not have. This mismatch is not a small operational annoyance; it is the central challenge of running AI agents in production, and the patterns for handling it are still being worked out. ...
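One way to see the mismatch: a single passing test says little about a system that can answer differently on the next run. The sketch below is an assumption about how one might test such a system, not a pattern taken from the post. It replaces pass/fail with a pass rate over repeated trials plus a Wilson score interval, and `agent` and `check` are hypothetical callables you would supply.

```python
import math
from typing import Callable, Tuple


def pass_rate(
    agent: Callable[[str], str],
    task: str,
    check: Callable[[str], bool],
    trials: int = 20,
) -> float:
    """Run the same task repeatedly and return the fraction of passing runs."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if check(agent(task)))
    return passes / trials


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half
```

With this shape, a regression check becomes "the interval's lower bound stayed above our threshold" rather than "the test passed once", which is a better fit for a system whose outputs vary from run to run.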

Dario Amodei: The Anthropic CEO Betting on Safety as Strategy

Dario Amodei is one of the few frontier-lab CEOs whose public talking points have not changed materially in five years. The same message he gave to small audiences in 2021 - that powerful AI is coming faster than people think, that the safety problem is real, and that the companies building it have an obligation to do so carefully - is the message he is giving to Congress and Davos in 2026. The thing that has changed is that he now runs the company most aggressively turning that message into a commercial position. ...

AI in Scientific Research: From AlphaFold to the Long Tail

AlphaFold’s release in 2021 was the AI-for-science moment that broke through to the general public. A computational solution to a 50-year-old problem in biology - predicting protein structure from sequence - that produced a tool used by hundreds of thousands of researchers. The narrative around AI-for-science crystallised: deep learning would produce a series of similar breakthroughs across scientific domains. The 2026 reality is more interesting and less clean. AlphaFold-class breakthroughs have been rarer than the early narrative suggested. But AI has spread across scientific practice in subtler ways that, in aggregate, have done more to change how science is actually done than the few headline breakthroughs. ...

The AI Energy Crisis: Why Data Center Power Will Define the Next Decade

For most of the AI conversation in 2024 and 2025, the binding constraints on the build-out were chips and capital. By 2026 the conversation has shifted, and the constraint that gets discussed most seriously inside the hyperscalers is electricity. Not the cost of electricity. The actual physical availability of electrons - at gigawatt scale, in the places where the data centres need to be, on the schedule the model labs need them to be. The story does not have a single villain or a single number, but it has a shape, and the shape is becoming the story of the second half of the decade. ...

Cerebras, Groq, SambaNova: The Inference Hardware Insurgents

For most of the last decade, talking about AI hardware meant talking about Nvidia. In 2026 that has stopped being true at the inference layer. Three companies - Cerebras, Groq, and SambaNova - have built genuinely different chips around the same insight: that the workload economics of running models in production are not the same as the workload economics of training them, and that the chip architecture should follow the workload. The bet has been right enough that Nvidia has now licensed pieces of it. ...

The Open Weight Models Renaissance: Llama, Mistral, Qwen, DeepSeek

For most of the LLM era the open-weight story was framed as a trailing one. Open models were cheaper, smaller, and a generation behind. That framing has not survived 2026. The gap between the best open-weight model and the best closed model is now narrow enough on most workloads that the choice is no longer “settle for less” - it is “decide what you actually need.”

TL;DR
- Open weights have closed the headline gap. Top open-weight models are within striking distance of closed frontier models on reasoning, coding, and general knowledge benchmarks.
- The economics changed first. DeepSeek’s R1 made it credible that a frontier model could be trained for tens of millions, not billions - and that the weights could be released for free.
- Llama, Mistral, Qwen, and DeepSeek lead on different axes: Llama for broad ecosystem support, Mistral for European deployment and tool use, Qwen for multilingual and long-context work, DeepSeek for raw reasoning.
- Inference flexibility is the underrated win. Open weights mean you can run on your own hardware, fine-tune freely, and avoid surprises from a closed provider’s roadmap.
- The remaining closed-model advantages are real but narrowing - agentic depth, multimodal performance, and the polished tool-use stacks around them.

Where the gap actually is in 2026

Benchmarks are imperfect, but the picture they sketch is consistent. On standard reasoning suites - MMLU, GPQA, MATH - open-weight models are within a few percentage points of the closed frontier. On coding - HumanEval, SWE-Bench - the gap is similar. On long-context retrieval, the gap is mostly gone. ...