In-depth exploration of AI in practice: building and deploying AI agents that work, designing developer workflows around Claude and other LLMs, critical analysis of AI safety and reliability, and the real shifts happening in careers, skills, and how we work. This section mixes tactical guides (how to actually build with AI), strategic analysis (what’s hype vs. what matters), and deeper dives into the tools and systems reshaping software development and knowledge work.


Claude Design: Closing the Design-to-Code Gap

TL;DR

- Claude Design is Anthropic’s new design collaboration tool that lets designers and engineers work in the same environment, with Claude as the bridge between intent and implementation
- It reads your codebase and existing design files during onboarding so generated designs respect your team’s actual constraints, not hypothetical best practices
- The strongest feature is its integration with Claude Code: designs are packaged into handoff bundles that encode intent and context, not just pixels and spacing values
- Collaboration happens inside the tool - inline comments, on-the-fly adjustments, and consistent application of changes across the whole design - removing the need for scattered Figma comments and DMs
- Currently in research preview for paid Claude tiers; works best for teams already using Claude across writing, coding, and research rather than teams deeply embedded in the Figma ecosystem

Design-to-development handoff has always been a friction point. Designers create something beautiful. Engineers interpret Figma specs, argue about spacing, squint at color values. SVG assets get lost. Responsive behavior gets reimplemented. By the time the code matches the design, half the polish is gone. ...

April 17, 2026 · 5 min · James M

Claude Opus 4.7: Autonomy and Vision at Scale

TL;DR

- Claude Opus 4.7 raises the vision ceiling to 3.75 megapixels (2,576 pixels), letting Claude read dense screenshots and complex charts without losing detail
- Autonomous software engineering is the headline upgrade - Opus 4.7 can handle complex, long-running tasks with reduced need for constant direction
- A new xhigh effort level for extended thinking gives developers explicit control over the speed-versus-reasoning tradeoff
- Improved instruction-following and resistance to prompt injection make it safer for production use
- Pricing remains unchanged at $5 per million input tokens and $25 per million output tokens - this is the new standard, not a premium tier

Opus 4.7 is a meaningful step forward. Not a revolutionary rewrite, but a targeted upgrade that addresses friction points developers actually experience: vision quality, autonomous task handling, and creative output. ...

April 16, 2026 · 5 min · James M

AI Reliability Is Weird: Why Testing LLMs Breaks Everything You Know

TL;DR

- Traditional testing assumes determinism - given input X, function f always returns Y - but LLMs are non-deterministic, which breaks assertion-based testing at its foundation
- The same agentic task run twice may produce different but equally correct code, making exact-output assertions brittle and often useless
- The new paradigm shifts from “test the code” to “verify the intent”: property-based testing, LLM-as-a-Judge evaluation, golden datasets for regression, and human review for overall correctness
- Structured outputs enforce syntactic correctness at generation time, but semantic correctness - whether the output actually solves the right problem - still requires layered verification on top
- The future of AI quality assurance is designing robust evaluation frameworks and measuring properties of acceptable outputs, not writing exhaustive unit tests for code the model may generate differently next time

We’ve embraced the future. AI agents like Cline are now the primary “builders” of software, executing complex engineering plans from high-level specifications. As I’ve argued in “The Architect vs The Builder”, the human role is shifting from execution to architectural oversight and defining intent. The patterns that determine whether agents stay shipped are covered in “AI agents that actually work”, and the wider safety framing sits in “AI safety from first principles”. ...
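The shift from exact-output assertions to property checks can be sketched in a few lines. Everything here is illustrative: `generate_summary` is a stand-in for a real model call, and the three properties are examples of the kind of invariants a team would define for its own task, not a fixed list.

```python
# Property-based verification for non-deterministic LLM output: instead of
# asserting one exact string, assert properties that ANY acceptable answer
# must satisfy. `generate_summary` is a hypothetical stand-in for an LLM call.
def generate_summary(text: str) -> str:
    # Stand-in: a real system would call the model here and get
    # a different-but-valid string on every run.
    return f"Summary: {text[:40]}"

def check_summary_properties(source: str, summary: str) -> list[str]:
    """Return the list of violated properties (empty list = acceptable)."""
    failures = []
    if not summary.strip():
        failures.append("summary is empty")
    if len(summary) > len(source):
        failures.append("summary longer than source")
    # Crude grounding check: at least one content word shared with the source.
    if not any(word in source for word in summary.split() if len(word) > 4):
        failures.append("summary shares no content words with source")
    return failures

src = "Structured outputs constrain generation to a schema, removing parse failures."
assert check_summary_properties(src, generate_summary(src)) == []
```

The same pattern generalises: for generated code, the properties might be “compiles”, “passes the golden dataset”, and “an LLM judge rates it equivalent to the spec”.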

April 9, 2026 · 7 min · James M

Cline: The Next Generation AI Coding Assistant

An exploration of Cline, the autonomous AI coding agent that lives in your IDE and handles complex, multi-step engineering tasks through tool-use and agency.

April 9, 2026 · 4 min · James Myddelton

Career-Ops: Flipping the Script on AI-Powered Job Search

TL;DR

- Career-Ops is an open-source tool built on Claude Code that inverts the job search power dynamic - giving candidates AI-powered evaluation and application tools to match what companies use to filter them
- Each opportunity is scored across 10 weighted dimensions on an A-F scale, producing a structured comparison that replaces the ad hoc spreadsheet most candidates rely on
- The system generates ATS-optimized resumes dynamically tailored to each job description and auto-discovers new postings from 45+ pre-configured job boards
- A key design principle is human-in-control: nothing auto-submits, the AI recommends and the candidate decides, making it a decision-support system rather than an automation
- Career-Ops is a clean example of the broader pattern of AI tools that amplify individual judgment rather than replace it - worth studying for its architecture as much as its use case

The job search has long been a one-way mirror - companies deploy AI to filter applications while candidates manually juggle spreadsheets, tailor cover letters, and hope their resume gets past the automated screener. Career-Ops flips that script entirely. Built on Claude Code, it’s an open-source system that gives job seekers their own AI advantage: intelligent evaluation of opportunities, automated customized applications, and systematic candidate strategy. ...
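The weighted A-F scoring the post describes reduces to familiar GPA arithmetic. A minimal sketch, with the caveat that the dimension names and weights below are invented for illustration, not Career-Ops’ actual ten-dimension rubric:

```python
# Weighted multi-dimension grading: per-dimension letter grades are mapped to
# points, combined as a weighted average, then mapped back to a letter.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

def overall_grade(grades: dict[str, str], weights: dict[str, float]) -> str:
    total_weight = sum(weights.values())
    gpa = sum(GRADE_POINTS[grades[d]] * w for d, w in weights.items()) / total_weight
    for letter, cutoff in [("A", 3.5), ("B", 2.5), ("C", 1.5), ("D", 0.5)]:
        if gpa >= cutoff:
            return letter
    return "F"

# Hypothetical dimensions and weights for one job posting:
weights = {"compensation": 3.0, "growth": 2.0, "commute": 1.0}
grades = {"compensation": "A", "growth": "B", "commute": "C"}
print(overall_grade(grades, weights))  # weighted GPA = 20/6 ≈ 3.33 → prints "B"
```

The value is less in the arithmetic than in forcing every opportunity through the same structured comparison, which is exactly what the ad hoc spreadsheet fails to do.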

April 9, 2026 · 5 min · James M

Cline + Kanban: Autonomous Development Meets Project Management

TL;DR

- Cline integrates with Kanban boards (Linear, GitHub Projects, Jira, Trello) via Model Context Protocol (MCP), closing the gap between project management and code execution
- Instead of manually copy-pasting tasks, Cline reads directly from your board, works through the implementation, and updates the task status automatically when done
- This makes the Kanban board the single source of truth - it stays in sync with reality rather than being an afterthought you update when you remember
- Works best with clear, testable acceptance criteria; vague tasks like “improve performance” need refinement before Cline can act on them autonomously
- Even with full autonomy, human code review remains essential - Cline completing a task means it is “Ready for Review”, not that it ships

In the evolution of agentic software engineering, one critical gap remains: the disconnect between project management and code execution. Your Kanban board tracks what needs doing, but your AI assistant lives in your IDE. Cline + Kanban closes that gap. ...

April 9, 2026 · 5 min · James M

Structured Outputs: When Your AI Needs to Follow a Schema

TL;DR

- Structured outputs constrain an LLM’s response to match a JSON schema during generation, eliminating the entire class of post-processing parse failures (which occur 2-5% of the time with free-form output)
- They produce simpler code, more reliable pipelines, and modest inference cost savings (typically 5-15% fewer tokens) in high-volume systems
- Use structured outputs for data extraction, classification, entity recognition, and API payload generation - not for creative writing or open-ended reasoning
- Common mistakes include over-constraining schemas with too-strict enums, forgetting that the response format changes, and mistaking schema validity for semantic correctness
- The trajectory is toward structured outputs becoming the default: schemas will be inferred from English descriptions, and TypeScript types will auto-generate schemas

For years, extracting structured data from LLMs meant post-processing their text output: parse JSON, handle edge cases where the model forgot to close a bracket, write validation code to check if the output matched your schema, implement fallback logic when parsing failed. ...
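To make the contrast concrete, here is the post-hoc parse-and-validate dance that structured outputs replace: with free-form output, this is where the 2-5% of failures surface. The toy two-field schema is invented for the example; a real pipeline would validate against a full JSON Schema.

```python
# Post-hoc validation of free-form model output - the fragile pattern that
# schema-constrained generation eliminates. With structured outputs, the model
# cannot emit anything that fails these checks in the first place.
import json

SCHEMA_FIELDS = {"name": str, "age": int}  # toy schema for illustration

def parse_and_validate(raw: str) -> dict:
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    for field, expected_type in SCHEMA_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"field {field!r} missing or wrong type")
    return data

assert parse_and_validate('{"name": "Ada", "age": 36}') == {"name": "Ada", "age": 36}
```

Note the TL;DR’s caveat still applies even with constrained generation: this check (and its schema-level equivalent) proves the output is well-formed, not that `age` is the right person’s age.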

April 9, 2026 · 7 min · James M

The LLM Context Window Arms Race: Does It Actually Matter?

TL;DR

- Context window size is the wrong metric to optimise for - attention scales quadratically, so larger windows mean dramatically higher latency and cost with diminishing quality gains
- Retrieval-augmented generation consistently outperforms stuffing entire documents into a prompt, because focused context beats diluted context
- What actually matters in production: token efficiency, prompt caching, structured output formats, and intelligent retrieval - not raw window size
- Large context windows are genuinely useful for whole-document analysis and complex cross-file code review, but wasteful for Q&A, structured extraction, and high-volume routine tasks
- The teams that will ship faster and scale further are those building intelligent architecture around a 200K context window, not those waiting for 1M-token models

Every week brings a new headline: “Model X reaches 1M token context!” “Model Y supports 2M tokens!” The LLM industry seems locked in an arms race where the stated goal is always “bigger context window,” as if this single metric determines whether a model is useful. ...
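“Focused context beats diluted context” is easy to sketch. The retrieval below is a deliberately crude keyword-overlap ranking (real systems use embeddings and a vector index); the point is the shape of the architecture: select a few relevant chunks, not everything.

```python
# Toy retrieval step for a RAG pipeline: rank document chunks by word overlap
# with the question and keep only the top k, instead of stuffing every chunk
# into the prompt. Chunk texts and k are invented for illustration.
def top_chunks(question: str, chunks: list[str], k: int = 2) -> list[str]:
    q_words = set(question.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )[:k]

chunks = [
    "Prompt caching reduces the cost of repeated prompt prefixes.",
    "Attention cost grows quadratically with context length.",
    "The office closes at 6pm on Fridays.",
]
picked = top_chunks("Why does a larger context window cost more?", chunks)
# `picked` now holds the attention-scaling chunk first; the irrelevant
# office-hours chunk never reaches the model.
```

A 200K window plus this selection step handles most Q&A and extraction workloads with a fraction of the tokens a stuff-everything prompt would burn.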

April 9, 2026 · 7 min · James M

Token Economics: Why the Cost of AI Isn't Going Down

TL;DR

- Inference cost is architectural - generating each token requires loading massive models into GPU memory, and that fundamental constraint doesn’t disappear with scale or competition
- Despite Moore’s Law expectations, flagship model prices (Claude 3, GPT-4) have remained flat for 18+ months because demand growth absorbs any efficiency gains
- The true cost of using AI is 1.5-2.5x the raw token price once you factor in monitoring, retries, fine-tuning, and compliance overhead
- Providers convert efficiency gains into better features (longer context, faster inference, multimodal) rather than lower prices - you get more value per dollar, not fewer dollars
- Stop waiting for cheaper AI; treat token costs as fixed infrastructure spend and optimise usage with tools like prompt caching instead

There’s a persistent myth in tech: AI will get cheaper. The argument is straightforward - Moore’s Law, scale effects, competition, and raw compute efficiency improvements mean costs should plummet. Yet in April 2026, Claude costs roughly what it did in 2024. GPT-4 Turbo pricing hasn’t moved in eighteen months. Gemini’s cost structure remains sticky. Why? ...
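The 1.5-2.5x multiplier is worth doing as back-of-envelope arithmetic. Using the $5/$25 per-million-token Opus rates quoted earlier on this page and the post’s own overhead estimate (the workload volumes below are invented for the example):

```python
# Back-of-envelope "true cost" of an LLM workload: raw token spend times the
# 1.5-2.5x overhead factor for monitoring, retries, fine-tuning, compliance.
INPUT_RATE_PER_M = 5.00    # $ per million input tokens (Opus pricing above)
OUTPUT_RATE_PER_M = 25.00  # $ per million output tokens

def monthly_cost(input_tokens: int, output_tokens: int, overhead: float = 2.0) -> float:
    raw = (input_tokens / 1e6) * INPUT_RATE_PER_M + (output_tokens / 1e6) * OUTPUT_RATE_PER_M
    return raw * overhead

# Hypothetical workload: 100M input + 10M output tokens/month at 2x overhead.
print(monthly_cost(100_000_000, 10_000_000))  # raw $750 -> prints 1500.0
```

The takeaway from the post applies here: budget the $1,500, not the $750, and chase savings on the usage side (prompt caching, shorter prompts) rather than waiting for the rates to drop.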

April 9, 2026 · 8 min · James M

Local AI vs Cloud AI: The Tradeoff Landscape in 2026

By early 2026, the “Local vs. Cloud” debate has moved past the experimental phase. We are no longer just “trying to see if Llama runs on a Mac.” Instead, professional engineers are building sophisticated Hybrid AI Stacks where local and cloud models work in tandem. The landscape has shifted because the hardware caught up to the software. With the prevalence of unified memory on Apple Silicon and the accessibility of 24GB+ VRAM cards like the RTX 50-series, the “local” ceiling has been smashed. ...
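The tandem arrangement usually comes down to a small routing decision in front of both models. A minimal sketch of that idea, with the caveat that the threshold, the sensitivity flag, and the two-tier split are all invented for illustration, not a description of any particular stack:

```python
# Hypothetical router for a hybrid local/cloud AI stack: privacy-sensitive or
# small jobs stay on the local model, heavy big-context jobs escalate to cloud.
def route(prompt_tokens: int, sensitive: bool) -> str:
    if sensitive:
        return "local"   # data never leaves the machine
    if prompt_tokens > 32_000:
        return "cloud"   # assumed cutoff where the hosted model wins
    return "local"       # default: cheap, private, fast enough

assert route(1_000, sensitive=True) == "local"
assert route(100_000, sensitive=False) == "cloud"
```

In practice the routing signal can be richer (latency budget, required tool access, offline mode), but the structure - a cheap deterministic gate in front of two model tiers - stays the same.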

April 9, 2026 · 5 min · James M