Prompt Caching: The Quiet Performance Win for LLM Applications

TL;DR

- Prompt caching saves the computed representation of a prompt’s static prefix so subsequent requests reuse it rather than recompute it - cached tokens cost roughly 10% of normal input token prices.
- The savings are highest when prompts have a long, identical prefix across requests - system prompts, tool definitions, and few-shot examples can make up 80-90% of total input cost.
- The most common mistake is interpolating variables into the system prompt, which breaks caching silently; fix it by moving all static content to the top and dynamic content to the end.
- Cache lifetimes are bounded (minutes to a few hours per provider) and any change to the prefix - including whitespace - creates a new cache miss.
- Track your cache hit rate explicitly on every LLM dashboard; a dropping hit rate usually signals unintended prompt construction changes, and fixing it is the highest-leverage cost optimisation available.

If you build LLM applications for any length of time, you eventually notice that you are paying to have the model read the same instructions over and over again. The system prompt, the tool definitions, the few-shot examples, the structured output schema - all of it goes back into the model on every single request, and you pay for the input tokens every single time. For a chatbot doing one or two thousand requests a day, this is annoying. For an agent doing tens of thousands of requests with long contexts, it is the dominant cost line. ...
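
To make the "static first, dynamic last" advice concrete, here is a minimal sketch in Python. Everything in it is illustrative rather than any provider's API: `STATIC_PREFIX`, `build_prompt`, `build_prompt_interpolated`, `shared_prefix_len`, and `estimated_input_cost` are hypothetical names, the prompt text is made up, and the 10% cached-token ratio is simply the rough figure quoted above. Real providers match prefixes on tokens, not characters, so the character comparison is only a stand-in.

```python
# Hypothetical static content: identical on every request, so it is cacheable.
STATIC_PREFIX = (
    "You are a support assistant.\n"
    "Answer briefly and cite the relevant policy.\n"
)


def build_prompt(user_name: str, question: str) -> str:
    """Cache-friendly layout: static prefix first, per-request values last."""
    return f"{STATIC_PREFIX}User: {user_name}\nQuestion: {question}\n"


def build_prompt_interpolated(user_name: str, question: str) -> str:
    """The common mistake: a variable near the top changes the prefix."""
    return f"Hello {user_name}.\n{STATIC_PREFIX}Question: {question}\n"


def shared_prefix_len(a: str, b: str) -> int:
    """Count leading characters two prompts share (a stand-in for tokens)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def estimated_input_cost(cached_tokens: int, fresh_tokens: int,
                         price_per_token: float,
                         cached_ratio: float = 0.10) -> float:
    """Input cost when cached tokens are billed at cached_ratio of full price.

    cached_ratio=0.10 is an assumption taken from the rough figure above.
    """
    return (cached_tokens * cached_ratio + fresh_tokens) * price_per_token


if __name__ == "__main__":
    good = shared_prefix_len(build_prompt("Ana", "Refund?"),
                             build_prompt("Bo", "Shipping?"))
    bad = shared_prefix_len(build_prompt_interpolated("Ana", "Refund?"),
                            build_prompt_interpolated("Bo", "Shipping?"))
    print("shared prefix, static first:", good)
    print("shared prefix, variable first:", bad)
    print("cost with 9000 of 10000 tokens cached:",
          estimated_input_cost(9000, 1000, price_per_token=1.0))
```

With the static-first layout, two different requests share the whole static block as a common prefix; with the interpolated layout, the shared prefix ends at the first variable, so nothing after it can be reused.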

May 9, 2026 · 10 min · James M

MPE Deep Dive: Why Expressive MIDI Changes Everything

If you have spent any time around electronic music in the last decade, you have probably seen the letters MPE written on the side of a controller and not thought too much about them. The acronym sounds like a feature bullet. It is not. It is a quiet but fundamental reframing of what an electronic instrument can do, and once you have spent serious time playing one, going back to a fixed-velocity keyboard feels like trading a touch screen for a number pad. ...

May 2, 2026 · 9 min · James M

Scaling Graph Algorithms: From Prototypes to Production

Graph algorithms work great on your laptop. PageRank on a 100,000-node graph finishes in seconds. Louvain finds communities instantly. Then you try it on production data - a graph with 5 billion nodes and 50 billion edges - and suddenly everything takes hours, consumes terabytes of memory, and melts your infrastructure. The jump from prototyping to production in graph algorithms is steep. But it’s a known problem with known solutions. ...

March 9, 2026 · 7 min · James M