Open WebUI: A Polished Interface for Local and Remote LLMs

TL;DR: Open WebUI is an open-source, ChatGPT-style web interface that connects to local Ollama instances, OpenAI’s API, or any OpenAI-compatible backend. It eliminates the friction of command-line LLM tools and supports features like RAG with document uploads, web search, custom prompts, model switching, and multi-user permissions. Deployment is a single Docker command; maintenance is lightweight with persistent storage and optional PostgreSQL for multi-instance setups. The primary appeal is full data ownership - queries never leave your infrastructure - making it well suited for privacy-conscious users and compliance-bound organizations. Open WebUI adds minimal latency since the bottleneck is always the inference engine behind it, not the web interface itself.

If you’ve spent time running language models locally through Ollama or another inference engine, you’ve probably discovered the same friction point: the command-line experience works, but it’s clunky. You’re juggling terminal windows, managing conversation context manually, managing files through the filesystem. ...

April 15, 2026 · 6 min · James M

Structured Outputs: When Your AI Needs to Follow a Schema

TL;DR: Structured outputs constrain an LLM’s response to match a JSON schema during generation, eliminating the entire class of post-processing parse failures (which occur 2-5% of the time with free-form output). They produce simpler code, more reliable pipelines, and modest inference cost savings (typically 5-15% fewer tokens) in high-volume systems. Use structured outputs for data extraction, classification, entity recognition, and API payload generation - not for creative writing or open-ended reasoning. Common mistakes include over-constraining schemas with too-strict enums, forgetting that the response format changes, and mistaking schema validity for semantic correctness. The trajectory is toward structured outputs becoming the default: schemas will be inferred from English descriptions, and TypeScript types will auto-generate schemas.

For years, extracting structured data from LLMs meant post-processing their text output: parse JSON, handle edge cases where the model forgot to close a bracket, write validation code to check if the output matched your schema, implement fallback logic when parsing failed. ...

April 12, 2026 · 7 min · James M
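
To make the "schema validity is not semantic correctness" caveat from the structured outputs entry above concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `sentiment`/`confidence` fields, the hand-rolled `is_schema_valid` check (standing in for a real JSON Schema validator), and the crude keyword heuristic in `is_semantically_plausible`. It is not taken from the article itself.

```python
import json
import re

# Hypothetical schema: a sentiment label from a fixed enum plus a confidence score.
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}
NEGATIVE_WORDS = {"terrible", "awful", "broke", "refund"}


def is_schema_valid(payload: dict) -> bool:
    """Shape check only: right keys, right types, values in range."""
    sentiment = payload.get("sentiment")
    confidence = payload.get("confidence")
    return (
        isinstance(sentiment, str)
        and sentiment in ALLOWED_SENTIMENTS
        and isinstance(confidence, (int, float))
        and not isinstance(confidence, bool)
        and 0.0 <= confidence <= 1.0
    )


def is_semantically_plausible(review_text: str, payload: dict) -> bool:
    """Crude illustrative check: a 'positive' label on clearly negative text is suspect."""
    words = set(re.findall(r"[a-z']+", review_text.lower()))
    if payload["sentiment"] == "positive" and words & NEGATIVE_WORDS:
        return False
    return True


# A response that parses and matches the schema, yet mislabels the review.
raw_response = '{"sentiment": "positive", "confidence": 0.93}'
payload = json.loads(raw_response)
review = "Terrible, broke after a day"

schema_ok = is_schema_valid(payload)
semantics_ok = is_semantically_plausible(review, payload)
print(f"schema valid: {schema_ok}, semantically plausible: {semantics_ok}")
```

The point of the sketch is only that the two checks are separate layers: constrained generation can guarantee the first, while the second still needs its own verification.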

The LLM Context Window Arms Race: Does It Actually Matter?

TL;DR: Context window size is the wrong metric to optimise for - attention scales quadratically, so larger windows mean dramatically higher latency and cost with diminishing quality gains. Retrieval-augmented generation consistently outperforms stuffing entire documents into a prompt, because focused context beats diluted context. What actually matters in production: token efficiency, prompt caching, structured output formats, and intelligent retrieval - not raw window size. Large context windows are genuinely useful for whole-document analysis and complex cross-file code review, but wasteful for Q&A, structured extraction, and high-volume routine tasks. The teams that will ship faster and scale further are those building intelligent architecture around a 200K context window, not those waiting for 1M-token models.

Every week brings a new headline: “Model X reaches 1M token context!” “Model Y supports 2M tokens!” The LLM industry seems locked in an arms race where the stated goal is always “bigger context window,” as if this single metric determines whether a model is useful. ...

April 11, 2026 · 7 min · James M
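
As a back-of-the-envelope illustration of the "attention scales quadratically" point in the context window entry above, the Python sketch below compares the pairwise attention work of a 1M-token prompt against a 200K-token one. It assumes plain self-attention with cost proportional to the square of the sequence length and ignores everything else (caching, optimised kernels, the linear parts of the model), so treat the ratio as a rough intuition rather than a latency prediction. The function name is hypothetical.

```python
def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Ratio of pairwise attention work versus a baseline, assuming cost ~ n**2."""
    return (tokens / baseline_tokens) ** 2


# 5x more tokens means roughly 5**2 = 25x more attention work under this assumption.
ratio = relative_attention_cost(1_000_000, 200_000)
print(f"1M vs 200K tokens: about {ratio:.0f}x the attention work")
```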

AI Reliability Is Weird: Why Testing LLMs Breaks Everything You Know

TL;DR: Traditional testing assumes determinism - given input X, function f always returns Y - but LLMs are non-deterministic, which breaks assertion-based testing at its foundation. The same agentic task run twice may produce different but equally correct code, making exact-output assertions brittle and often useless. The new paradigm shifts from “test the code” to “verify the intent”: property-based testing, LLM-as-a-Judge evaluation, golden datasets for regression, and human review for overall correctness. Structured outputs enforce syntactic correctness at generation time, but semantic correctness - whether the output actually solves the right problem - still requires layered verification on top. The future of AI quality assurance is designing robust evaluation frameworks and measuring properties of acceptable outputs, not writing exhaustive unit tests for code the model may generate differently next time.

We’ve embraced the future. AI agents like Cline are now the primary “builders” of software, executing complex engineering plans from high-level specifications. As I’ve argued in “The Architect vs The Builder”, the human role is shifting from execution to architectural oversight and defining intent. The patterns that determine whether agents stay shipped are covered in “AI agents that actually work”, and the wider safety framing sits in “AI safety from first principles”. ...

April 9, 2026 · 7 min · James M
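
The "verify the intent" idea from the testing entry above can be sketched in a few lines of Python. The example assumes a hypothetical task ("deduplicate a list, keeping first occurrences") and two hypothetical model-generated implementations that differ in source but are both correct. Instead of asserting on exact output text, the checker tests properties any acceptable answer must have. This is a hand-rolled stand-in for a property-based testing library, not the article's own harness.

```python
import random
from typing import Callable, List


def candidate_a(items: List[int]) -> List[int]:
    """One plausible generated solution: explicit loop with a seen-set."""
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def candidate_b(items: List[int]) -> List[int]:
    """A different but equally valid solution: dicts preserve insertion order."""
    return list(dict.fromkeys(items))


def broken_candidate(items: List[int]) -> List[int]:
    """Removes duplicates but loses the original order."""
    return sorted(set(items))


def satisfies_dedupe_properties(fn: Callable[[List[int]], List[int]],
                                trials: int = 200, seed: int = 0) -> bool:
    """Check properties of the result rather than comparing to one golden output."""
    rng = random.Random(seed)
    cases = [[3, 1, 3, 2, 1]]  # one fixed case so the check is deterministic
    cases += [[rng.randint(0, 9) for _ in range(rng.randint(0, 20))]
              for _ in range(trials)]
    for data in cases:
        result = fn(list(data))
        if len(result) != len(set(result)):   # no duplicates remain
            return False
        if set(result) != set(data):          # nothing lost, nothing invented
            return False
        if result != sorted(set(data), key=data.index):  # first-occurrence order kept
            return False
    return True


for fn in (candidate_a, candidate_b, broken_candidate):
    print(fn.__name__, satisfies_dedupe_properties(fn))
```

Both correct candidates pass despite having different source code, while the order-losing one is rejected, which is the behaviour an exact-output assertion cannot give you.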

Claude Code vs Cursor: A 6-Month Comparison

After six months of daily use, here is how the two heavyweights of AI-assisted coding compare: the terminal-native Claude Code and the IDE-integrated Cursor.

April 8, 2026 · 3 min · James M

GPU Servers vs AI API Credits: The Real Cost Breakdown (2026)

TL;DR: The core trade-off is pay-per-use (APIs) vs pay-for-capacity (GPUs) - APIs are cheaper at low volume, GPUs win massively at high volume (100M+ tokens/day). The break-even point for GPU self-hosting sits around 2 to 5 million tokens per day for premium-model workloads - below that, APIs almost always win. GPU utilisation is the most important variable: at less than 50-60% utilisation, self-hosted inference costs more per token than just calling an API. Hidden costs matter - real GPU spend is 2x to 5x the raw hardware price once you add DevOps, scaling, monitoring, and networking; API costs can also balloon from poor prompt design and multi-step agent loops. Most serious production systems land on a hybrid architecture: APIs for complex reasoning and long-context work, GPUs for bulk inference, embeddings, and fine-tuned models.

If you’re building anything with LLMs right now, you’ll hit this question sooner than you expect: ...

April 5, 2026 · 5 min · James M
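
The break-even reasoning in the GPU vs API entry above reduces to simple arithmetic, sketched here in Python. All inputs are made-up illustrative numbers, not figures from the article: a hypothetical $1,500/month GPU server, a 2x overhead multiplier (the low end of the 2x to 5x range quoted above), and a hypothetical blended API price of $30 per million tokens. The model also assumes the GPU cost is fixed regardless of load and that the server can actually serve the volume, so it ignores the utilisation effect entirely.

```python
def break_even_tokens_per_day(gpu_monthly_usd: float,
                              overhead_multiplier: float,
                              api_usd_per_million_tokens: float,
                              days_per_month: int = 30) -> float:
    """Daily token volume at which fixed GPU spend equals pay-per-use API spend."""
    total_daily_usd = gpu_monthly_usd * overhead_multiplier / days_per_month
    return total_daily_usd / api_usd_per_million_tokens * 1_000_000


# Illustrative inputs only; substitute your own quotes and prices.
tokens = break_even_tokens_per_day(
    gpu_monthly_usd=1500.0,
    overhead_multiplier=2.0,
    api_usd_per_million_tokens=30.0,
)
print(f"Break-even at roughly {tokens / 1e6:.1f}M tokens/day")
```

Raising the overhead multiplier or lowering the API price pushes the break-even volume up, which is why the hidden-cost and model-tier assumptions matter more than the raw hardware quote.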

What Actually Belongs in My AI Dev Stack in 2026

TL;DR: A single AI tool cannot handle everything - a proper AI dev stack in 2026 needs distinct layers for spec writing, fast editing, heavy agentic work, cheap model tasks, review, research, and capture. Spec-driven development is the most underused part: writing requirements and acceptance criteria before generation dramatically improves AI output and reduces wasted iterations. Tools like Cursor AI handle fast, in-flow editing while Claude Code or Cline are better suited to multi-file refactors and autonomous implementation from specs. Letting the same model that generated code also review it is a weak loop - a separate review pass with a different model or explicitly critical prompt is essential. The real shift is treating AI not as a bolt-on assistant but as part of the workflow architecture itself, with each tool assigned a clear, specific responsibility.

There is a big difference between using AI for development and having an actual AI development stack. ...

April 5, 2026 · 9 min · James M

Chatbots & Large Language Models (LLMs)

TL;DR: An LLM is the underlying reasoning engine; a chatbot is the product experience wrapped around it - they are related but not the same thing. LLMs excel at summarizing, rewriting, generating drafts, and coding, but should be treated as fast collaborators rather than infallible oracles. The main model families are frontier models (GPT, Claude, Gemini), open-weight / self-hostable models (Llama), and product-specific assistants (ChatGPT, Cursor, Copilot). Choose the right tool for the job: chatbots for convenience and exploration, APIs for automation, coding-native tools for repo-aware work. The market is now split between AI as a consumer product and AI as programmable infrastructure - understanding both layers makes the landscape far less confusing.

Most people still talk about chatbots and large language models as if they are the same thing. ...

May 17, 2024 · 6 min · James M

Why the ChatGPT iPhone App Mattered

TL;DR: The ChatGPT iPhone app mattered not as a feature release but as the moment AI shifted from a desktop destination to an everyday mobile utility. Key additions - native interface, synced history, and voice input - were modest, but they fundamentally changed when and how people reached for AI. Mobile AI is intimate in a way desktop AI is not, fitting into gaps throughout the day rather than dedicated work sessions. The launch signalled that ChatGPT was becoming a platform and a habit, not just a viral website or a one-off experiment. It pointed toward the future that followed: voice as a primary interface, cross-device continuity, and conversational AI as persistent infrastructure.

When OpenAI launched the ChatGPT app for iPhone, it was easy to see it as a simple mobile companion to a popular web product. ...

May 18, 2023 · 4 min · James M

OpenAI has released GPT-4

Fully intent on being the next Skynet, OpenAI has released GPT-4, its most robust AI to date that the company claims is even more accurate while generating language and even better at solving problems. GPT-4 is so good at its job, in fact, that it report… https://t.co/Q2btYgtWSA — GrindZero Tribe (@Snapzu_Blogs) March 16, 2023

March 16, 2023 · 1 min · James M