Inference

Which Mac Studio Should You Buy for Running LLMs Locally?

TL;DR
Best entry point: M2 Max 32-64 GB (~£1.4k-£2k) for 7B-13B models at 25-40 tok/s
Best sweet spot: M2 Ultra 64-128 GB (~£3k-£4.5k) handles 30B+ models comfortably
Best for 70B models: M3 Ultra 128 GB+ (~£5.5k+) with 800+ GB/s bandwidth
Newer alternative: M4 Max (£2k-£4k) - lower bandwidth (410-546 GB/s) than Ultra chips, but still solid for 7B-13B models
Key rule: Memory bandwidth matters more than raw compute for token generation
Reality check: An RTX 5090 rig is 2-3× faster for similar money - buy a Mac for simplicity and unified memory
July 2026 update: Apple’s memory crunch has killed new 256 GB/512 GB Ultra configs for now - big-memory Macs are refurb-only until the M5 Ultra (tested up to 768 GB) lands late 2026
On the horizon: the M7 Ultra, rumoured for around 2029, is reportedly designed to support up to 1.5 TB of unified memory - see the road ahead below

You want to run large language models locally on a Mac Studio. Good idea - unified memory is genuinely useful for LLMs. But the specs matter, and there are some hard truths about what “works” versus what feels responsive. More importantly: the right Mac depends entirely on which model you want to run. ...
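The key rule above (memory bandwidth matters more than raw compute for token generation) can be turned into a rough back-of-envelope estimate. The sketch below is not from the article: it assumes a dense model whose weights are all read once per generated token, ignores KV-cache traffic and runtime overhead, and uses 4-bit quantization as an illustrative choice. The function name `decode_ceiling_tok_s` is hypothetical. Treat the result as a ceiling, not a benchmark; real throughput will be lower.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         params_billion: float,
                         bits_per_weight: float) -> float:
    """Rough upper bound on tokens/s for memory-bound decoding.

    Assumption: every generated token requires streaming all model
    weights from memory once, so throughput <= bandwidth / weight size.
    """
    # 1e9 parameters * (bits / 8) bytes each = size in GB
    weight_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    # Bandwidth figures are the ones quoted in the TL;DR;
    # the 4-bit quantization is an assumption for illustration.
    print(round(decode_ceiling_tok_s(800, 70, 4), 1))  # 70B on an 800 GB/s Ultra -> 22.9
    print(round(decode_ceiling_tok_s(410, 13, 4), 1))  # 13B on a 410 GB/s M4 Max -> 63.1
```

Under these assumptions, doubling memory bandwidth doubles the ceiling, while extra compute does not move it at all, which is why the bandwidth column is the one to compare first.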

AI Economics and Hardware: A Reading Path

TL;DR
Cost is a design constraint, not an afterthought - model tier, context size, and deployment location are economic decisions
Read the essays below in any order; start with Token Economics if you only have time for one
Pairs with open-weight models and local inference guides

Core essays
Token Economics: Why the Cost of AI Isn’t Going Down
GPU Servers vs AI API Credits: The Real Cost Breakdown
Local AI vs Cloud AI: The Tradeoff Landscape in 2026
The AI Energy Crisis: Why Data Center Power Will Define the Next Decade
Cerebras, Groq, SambaNova: The Inference Hardware Insurgents

Adjacent
The State of Open-Weight Models in 2026 - when open weights beat closed APIs on price
Prompt Caching - the quiet latency and cost win
The Token Efficiency Mindset - curating spend per conversation
Is the $20 AI Subscription Era Over? We Are Learning to Buy Intelligence

Related Reading
AI Dev Tooling: A Reading Path for 2026 - canonical path for coding agents and stack decisions that depend on these cost constraints
Home Agent Stack: From Mac Studio to Secured MCP Tools - building the hardware and software layer these economics govern
Reasoning Models in 2026: What Changed and What Didn’t - why reasoning models carry a different cost profile than base models
The Free Intelligence Era - the macro argument for where intelligence costs are headed

Cerebras, Groq, SambaNova: The Inference Hardware Insurgents

For most of the last decade, talking about AI hardware meant talking about Nvidia. In 2026 that has stopped being true at the inference layer. Three companies - Cerebras, Groq, and SambaNova - have built genuinely different chips around the same insight: that the workload economics of running models in production are not the same as the workload economics of training them, and that the chip architecture should follow the workload. The bet has been right enough that Nvidia has now licensed pieces of it. ...

Reasoning Models in 2026: o3, R2, and the Compute-at-Inference Shift

Two years ago the way to make a model better was to train a bigger one. By the start of 2026 that recipe has stopped being the most interesting answer. The frontier has moved to a different lever - letting the model think for longer at inference time, generating intermediate reasoning, and only then producing the final answer. The category has a name now (reasoning models) and a family of products built around it. The interesting questions are no longer whether the trick works, because it clearly does, but when to reach for one, where it lands in production, and what the costs actually look like once the demo glow wears off. ...