Cost | jamesm.blog

Prompt Caching: The Quiet Performance Win for LLM Applications

TL;DR

- Prompt caching saves the computed representation of a prompt’s static prefix so subsequent requests reuse it rather than recompute it - cached tokens cost roughly 10% of normal input token prices.
- The savings are highest when prompts have a long, identical prefix across requests - system prompts, tool definitions, and few-shot examples can make up 80-90% of total input cost.
- The most common mistake is interpolating variables into the system prompt, which breaks caching silently; fix it by moving all static content to the top and dynamic content to the end.
- Cache lifetimes are bounded (minutes to a few hours per provider) and any change to the prefix - including whitespace - creates a new cache miss.
- Track your cache hit rate explicitly on every LLM dashboard; a dropping hit rate usually signals unintended prompt construction changes, and fixing it is the highest-leverage cost optimisation available.

If you build LLM applications for any length of time, you eventually notice that you are paying to have the model read the same instructions over and over again. The system prompt, the tool definitions, the few-shot examples, the structured output schema - all of it goes back into the model on every single request, and you pay for the input tokens every single time. For a chatbot doing one or two thousand requests a day this is annoying. For an agent doing tens of thousands of requests with long contexts, it is the dominant cost line. ...
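The prefix-ordering advice in the TL;DR can be sketched in a few lines of Python. This is a minimal illustration, not any provider's API: the prompt strings, `build_prompt`, `prefix_fingerprint`, and `cache_hit_rate` are all hypothetical names, and the token counts are assumed to come from whatever usage fields your provider returns.

```python
import hashlib

# Hypothetical static content: identical bytes on every request.
STATIC_SYSTEM_PROMPT = "You are a support assistant. Follow the policy below.\n"
TOOL_DEFINITIONS = "tools: search_orders, refund_order\n"


def build_prompt(user_name: str, question: str) -> tuple[str, str]:
    """Return (static_prefix, dynamic_suffix).

    Nothing request-specific is interpolated into the prefix, so the
    prefix stays byte-identical across requests and remains cacheable.
    """
    prefix = STATIC_SYSTEM_PROMPT + TOOL_DEFINITIONS
    suffix = f"User: {user_name}\nQuestion: {question}\n"
    return prefix, suffix


def prefix_fingerprint(prefix: str) -> str:
    """Short hash of the prefix; log it to spot unintended changes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


def cache_hit_rate(cached_tokens: int, total_input_tokens: int) -> float:
    """Fraction of input tokens that were served from the cache."""
    if total_input_tokens == 0:
        return 0.0
    return cached_tokens / total_input_tokens
```

Logging the fingerprint alongside the hit rate makes a silent prefix change visible: if the fingerprint differs between two deployments, even by a single space, every request after the change starts with a cache miss.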

Self-Hosted vs Managed in 2026 - The Cost Math Has Changed Again

TL;DR

- The self-hosted vs managed decision in 2026 is genuinely different from the same decision in 2022. The math has shifted in three directions: cloud egress costs, AI workload economics, and self-hosted tooling maturity.
- Managed remains the right default for most teams. The thing that has changed is that the threshold at which self-hosting becomes worth considering has dropped. Workloads that were obviously managed in 2022 are genuine 50/50 calls in 2026.
- The most important shift is that self-hosting is no longer synonymous with on-premises. Modern self-hosting often means renting bare-metal in a colocation, running your own clusters in a hyperscaler, or using sovereign cloud providers - all with different economics.
- For specific categories - AI inference at scale, data egress-heavy workloads, predictable steady-state compute, regulated environments - self-hosting now wins on cost more often than people assume.
- The honest framing: managed is the right default; self-hosting is the right minority case; the minority is bigger than it used to be.

Why This Decision Got Harder

For most of the 2010s the answer was easy. Managed services were cheaper than self-hosting once you priced in operational overhead. The cloud providers competed aggressively. Self-hosting was for the regulated, the eccentric, and the very large. ...

GPU Servers vs AI API Credits: The Real Cost Breakdown (2026)

TL;DR

- The core trade-off is pay-per-use (APIs) vs pay-for-capacity (GPUs) - APIs are cheaper at low volume, GPUs win massively at high volume (100M+ tokens/day).
- The break-even point for GPU self-hosting sits around 2 to 5 million tokens per day for premium-model workloads - below that, APIs almost always win.
- GPU utilisation is the most important variable: at less than 50-60% utilisation, self-hosted inference costs more per token than just calling an API.
- Hidden costs matter - real GPU spend is 2x to 5x the raw hardware price once you add DevOps, scaling, monitoring, and networking; API costs can also balloon from poor prompt design and multi-step agent loops.
- Most serious production systems land on a hybrid architecture: APIs for complex reasoning and long-context work, GPUs for bulk inference, embeddings, and fine-tuned models.

If you’re building anything with LLMs right now, you’ll hit this question sooner than you expect: ...
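The break-even and utilisation points in the TL;DR come down to simple arithmetic, sketched below. Every number and name here is an illustrative assumption (the hourly GPU rate, the API price per million tokens, the overhead multiplier, and the function names are not from the post); plug in your own figures.

```python
def api_cost_per_day(tokens_per_day: float, price_per_million: float) -> float:
    """Pay-per-use: cost scales linearly with tokens processed."""
    return tokens_per_day / 1_000_000 * price_per_million


def gpu_cost_per_day(hourly_rate: float, overhead_multiplier: float = 2.0) -> float:
    """Pay-for-capacity: fixed daily cost regardless of traffic.

    overhead_multiplier stands in for the hidden costs (DevOps, scaling,
    monitoring, networking); the post puts it between 2x and 5x.
    """
    return hourly_rate * 24 * overhead_multiplier


def break_even_tokens_per_day(
    hourly_rate: float, price_per_million: float, overhead_multiplier: float = 2.0
) -> float:
    """Daily token volume at which API spend equals GPU spend."""
    daily_gpu = gpu_cost_per_day(hourly_rate, overhead_multiplier)
    return daily_gpu * 1_000_000 / price_per_million


def self_hosted_cost_per_million(
    hourly_rate: float,
    capacity_tokens_per_day: float,
    utilisation: float,
    overhead_multiplier: float = 2.0,
) -> float:
    """Effective self-hosted price per million tokens at a given utilisation."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    served = capacity_tokens_per_day * utilisation
    return gpu_cost_per_day(hourly_rate, overhead_multiplier) * 1_000_000 / served


if __name__ == "__main__":
    # Hypothetical inputs: $2/hour GPU, $30 per million API tokens, 2x overhead.
    print(break_even_tokens_per_day(hourly_rate=2.0, price_per_million=30.0))
```

Because the GPU cost is fixed per day, halving utilisation doubles the effective per-token price, which is why utilisation dominates the comparison.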