Artificial Intelligence

This section is organised around one question: what has to be true before you can trust AI to do real work? Reliability, context, economics, security, evaluation, and eventually physical action - each post is a different angle on the same problem.

Start here

Trust series - research map, broken evals, agent security, world models, trajectory evaluation
What I’m Researching in AI Right Now - my live research agenda

I want to build

Home Agent Stack - Mac Studio → MCP → memory → remote access → hardening
AI Dev Tooling - stack decisions, learning path, Cursor vs Claude Code, spec-driven development

I want context

AI Economics and Hardware - token costs, local vs cloud, energy, inference hardware
Expertise and Work - credentials, judgement, roles, and 2030 speculation
The State of Open-Weight Models in 2026 - Llama, Qwen, Mistral, DeepSeek, Gemma

Resources

Link indexes and tool directories - useful for discovery, not the narrative spine:

AI Tools & Frameworks · Courses · Conferences · GitHub Projects · Explainers · Chatbots & LLMs

MacWhisper vs Wispr Flow vs Superwhisper: The 2026 Dictation Stack Compared

TL;DR
MacWhisper is a file transcription tool (audio in, text out) that runs entirely on-device - the right pick for journalists, researchers, and anyone transcribing recordings
Wispr Flow is the easiest system-wide dictation option, with AI-powered prose cleanup and cross-platform sync, but it sends audio to the cloud with no on-device option
Superwhisper matches Wispr Flow’s system-wide dictation but processes audio locally, with bring-your-own-key LLM cleanup and deep customisation for power users
The core decision is simple: if your audio can leave your machine, use Wispr Flow; if it must stay local, use Superwhisper; if you just need transcription, use MacWhisper
The real product differentiation is no longer the underlying Whisper model - it is hotkey ergonomics, auto-edit prompts, and workflow integration

Voice input on the Mac used to mean fighting with the built-in Dictation feature or paying Nuance a small fortune. In 2026, the landscape looks completely different. A handful of indie and venture-backed apps have turned Whisper-class models into genuinely fast, accurate tools that sit quietly in your menu bar until you hold a hotkey. ...

AI Cloud Subscriptions: Comparing Pricing and Features in 2026

AI cloud subscriptions have fragmented into a crowded market. Frontier-lab APIs compete with open-weights challengers, consumer chat plans compete with agent platforms, and every provider is reshuffling model tiers every few months. This guide organizes the 2026 landscape so you can pick a plan without reading six pricing pages. For background on how these costs behave over time, see Token Economics: Why Costs Aren’t Going Down and Local vs Cloud AI in 2026. ...

DGX Spark vs Mac Studio: Which Personal AI Supercomputer Should You Buy?

TL;DR
Best value: Mac Studio M4 Max at $1,999 for most local LLM work
Best prefill speed: DGX Spark at $4,699 (3.8× faster prompt processing)
Best token generation: Mac Studio M3 Ultra at $3,999 (819 GB/s bandwidth)
Best for fine-tuning: DGX Spark (CUDA ecosystem wins)
Best combined setup: DGX Spark + M3 Ultra = 2.8× faster than either alone

Introduction

The market for personal AI supercomputers has exploded in 2025-2026. Two standout options have emerged: NVIDIA’s DGX Spark and Apple’s Mac Studio lineup. Both promise desktop-scale AI compute, but they approach the problem very differently. This guide breaks down the specs, costs, and real-world performance to help you decide which is right for you. ...

The Complete AI Developer's Guide: Resources and Best Practices

TL;DR
Prompt engineering, token efficiency, and structured outputs are the core skills for working effectively with any AI model
System design patterns - streaming, caching, structured outputs, graceful fallbacks - matter as much as prompting fluency
Testing and validation in AI systems requires clear evaluation criteria and production monitoring, not just pre-launch checks
Official documentation from model providers (Anthropic, OpenAI, Google) is the most reliable source of best practices
The curated resources table covers everything from GitHub Copilot to local model deployment with Ollama

Most AI tutorials teach you how to get started. Few teach you how to get it right. This post curates the most valuable resources and practices for working effectively with modern AI systems - from prompt engineering fundamentals through to production system design and evaluation. ...

Which Mac Studio Should You Buy for Running LLMs Locally?

TL;DR
Best entry point: M2 Max 32-64 GB (~£1.4k-£2k) for 7B-13B models at 25-40 tok/s
Best sweet spot: M2 Ultra 64-128 GB (~£3k-£4.5k) handles 30B+ models comfortably
Best for 70B models: M3 Ultra 128 GB+ (~£5.5k+) with 800+ GB/s bandwidth
Newer alternative: M4 Max (£2k-£4k) - lower bandwidth (410-546 GB/s) than Ultra chips, but still solid for 7B-13B models
Key rule: Memory bandwidth matters more than raw compute for token generation
Reality check: An RTX 5090 rig is 2-3× faster for similar money - buy Mac for simplicity and unified memory

You want to run large language models locally on a Mac Studio. Good idea - unified memory is genuinely useful for LLMs. But the specs matter, and there are some hard truths about what “works” versus what feels responsive. More importantly: the right Mac depends entirely on which model you want to run. ...
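
The "memory bandwidth matters more than raw compute" rule can be made concrete with a back-of-envelope estimate. The sketch below is not from the post: it assumes a dense model whose weights are all read once per generated token, and the function name, the 4-bit quantisation (0.5 bytes per parameter), and the example figures are illustrative assumptions.

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          params_billions: float,
                          bytes_per_param: float) -> float:
    """Rough upper bound on decode speed when generation is memory-bound.

    Assumption: every weight is read from memory once per generated token,
    so speed is capped at bandwidth divided by model size.
    """
    model_size_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_size_gb


# Hypothetical comparison: a 70B model at 4-bit (0.5 bytes per parameter)
# on an 800 GB/s machine versus a 410 GB/s machine.
for bandwidth in (800, 410):
    ceiling = max_tokens_per_second(bandwidth, 70, 0.5)
    print(f"{bandwidth} GB/s: at most {ceiling:.1f} tok/s")
```

This is a ceiling rather than a prediction: KV-cache reads, runtime overhead, and prompt processing all push real throughput lower, which is why measured numbers should take priority over the estimate.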

The Token Efficiency Mindset - Why Your Claude Conversations Cost More Than They Should

TL;DR
Token costs don’t scale linearly with productivity - the context window compounds with every follow-up message, so a five-message conversation can cost 2-3x more than one well-structured request
Compression is your biggest lever: cutting a prompt in half before sending it reduces cost and often improves answer quality by removing noise
Batch tasks that share context together; don’t batch unrelated tasks - real batching spreads the setup cost across related work
Build reusable systems (templates, project files, prompt prefixes) instead of solving the same problem repeatedly and paying the context cost each time
Prompt caching can cut input token costs by 80-90% on workloads with stable prefixes - the single biggest structural saving most teams are missing

If you’re paying attention to your Claude usage, you’ve probably noticed something: your token bills don’t scale linearly with your productivity. Sometimes a conversation that feels quick costs three times more than expected. Other conversations that took hours feel suspiciously cheap. ...
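
The compounding effect can be sketched with simple arithmetic. The example below is an illustration, not the post's own calculation: it assumes each turn resends the full conversation history as input, and the function name and token counts are hypothetical. The exact cost multiple depends on message lengths, reply lengths, and output pricing.

```python
def conversation_input_tokens(user_tokens: list[int],
                              reply_tokens: list[int]) -> int:
    """Total input tokens billed across a conversation.

    Assumption: every turn sends the whole history so far (all earlier
    user messages and replies) plus the new user message as input.
    """
    total_input = 0
    history = 0
    for user, reply in zip(user_tokens, reply_tokens):
        history += user          # new message joins the history
        total_input += history   # full history is sent as input this turn
        history += reply         # the reply is carried into later turns
    return total_input


# Hypothetical figures: five 200-token messages with 300-token replies,
# versus one 1,000-token request that asks for everything at once.
multi_turn = conversation_input_tokens([200] * 5, [300] * 5)
single_shot = conversation_input_tokens([1000], [1500])
print(multi_turn, single_shot)  # 6000 1000
```

Under these assumed numbers the same 1,000 tokens of user content generate six times the input tokens when spread over five turns. Output tokens are unchanged, so the overall bill rises by a smaller factor.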

Claude Design: Closing the Design-to-Code Gap

TL;DR
Claude Design is Anthropic’s new design collaboration tool that lets designers and engineers work in the same environment, with Claude as the bridge between intent and implementation
It reads your codebase and existing design files during onboarding so generated designs respect your team’s actual constraints, not hypothetical best practices
The strongest feature is its integration with Claude Code: designs are packaged into handoff bundles that encode intent and context, not just pixels and spacing values
Collaboration happens inside the tool - inline comments, on-the-fly adjustments, and consistent application of changes across the whole design - removing the need for scattered Figma comments and DMs
Currently in research preview for paid Claude tiers; works best for teams already using Claude across writing, coding, and research rather than teams deeply embedded in the Figma ecosystem

Design-to-development handoff has always been a friction point. Designers create something beautiful. Engineers interpret Figma specs, argue about spacing, squint at color values. SVG assets get lost. Responsive behavior gets reimplemented. By the time the code matches the design, half the polish is gone. ...

Four Futures for the Machine-Speed Economy

TL;DR
AI is collapsing build times across the entire software stack, meaning small teams can now ship in weeks what once required 50-person organisations working for a year
Four plausible futures are mapped: Broad Abundance (gains widely distributed), Winner-Take-Most (rents accrue to infrastructure owners), Techno-Feudalism (intelligence rented from platform landlords), and Managed Transition (governments respond with UBI and regulation)
Signals to watch include open-source model performance, vertical integration of chips and data centres, platform lock-in of agentic workflows, and serious UBI pilots at national scale
Leading AI researchers including Geoffrey Hinton and Yoshua Bengio argue the critical variable is no longer how capable models become, but how gains are distributed and how fast institutions adapt
Across most scenarios, the things that hold their value are consistent: trust, relationships, physical presence, and creativity rooted in specific human experience

The pace of AI development over the past three years is genuinely unlike anything in recent economic history. The Stanford AI Index has tracked frontier model capability roughly doubling on a yearly cadence, and private AI investment has reached levels that dwarf the dot-com peak in inflation-adjusted terms. What’s less widely understood is what that pace actually means for competition, investment, and the structure of the economy. ...

Claude Opus 4.7: Autonomy and Vision at Scale

TL;DR
Claude Opus 4.7 raises the vision ceiling to roughly 3.75 megapixels (up to 2,576 pixels on the long edge), letting Claude read dense screenshots and complex charts without losing detail
Autonomous software engineering is the headline upgrade - Opus 4.7 can handle complex, long-running tasks with reduced need for constant direction
A new xhigh effort level for extended thinking gives developers explicit control over the speed-versus-reasoning tradeoff
Improved instruction-following and resistance to prompt injection make it safer for production use
Pricing remains unchanged at $5 per million input tokens and $25 per million output tokens - this is the new standard, not a premium tier

Opus 4.7 is a meaningful step forward. Not a revolutionary rewrite, but a targeted upgrade that addresses friction points developers actually experience: vision quality, autonomous task handling, and creative output. ...

Open WebUI: A Polished Interface for Local and Remote LLMs

TL;DR
Open WebUI is an open-source, ChatGPT-style web interface that connects to local Ollama instances, OpenAI’s API, or any OpenAI-compatible backend
It eliminates the friction of command-line LLM tools and supports features like RAG with document uploads, web search, custom prompts, model switching, and multi-user permissions
Deployment is a single Docker command; maintenance is lightweight with persistent storage and optional PostgreSQL for multi-instance setups
The primary appeal is full data ownership - queries never leave your infrastructure - making it well suited for privacy-conscious users and compliance-bound organizations
Open WebUI adds minimal latency since the bottleneck is always the inference engine behind it, not the web interface itself

If you’ve spent time running language models locally through Ollama or another inference engine, you’ve probably discovered the same friction point: the command-line experience works, but it’s clunky. You’re juggling terminal windows, tracking conversation context manually, navigating files through the filesystem. ...
