MMLU | jamesm.blog

When a frontier lab releases a new model in 2026, the press release leads with a row of benchmark scores. The numbers are bigger than they were a year ago, the model is the new state of the art on whichever evaluation the lab chose to highlight, and the headline writes itself. The honest summary is that most of these numbers have stopped measuring what they were designed to measure, and the gap between benchmark performance and real-world capability is now wide enough that the benchmark-led narrative is actively misleading. ...