Performance

Prompt Caching: The Quiet Performance Win for LLM Applications

TL;DR
- Prompt caching saves the computed representation of a prompt’s static prefix so subsequent requests reuse it rather than recompute it - cached tokens cost roughly 10% of normal input token prices.
- The savings are highest when prompts have a long, identical prefix across requests - system prompts, tool definitions, and few-shot examples can make up 80-90% of total input cost.
- The most common mistake is interpolating variables into the system prompt, which breaks caching silently; fix it by moving all static content to the top and dynamic content to the end.
- Cache lifetimes are bounded (minutes to a few hours per provider) and any change to the prefix - including whitespace - causes a cache miss.
- Track your cache hit rate explicitly on every LLM dashboard; a dropping hit rate usually signals unintended prompt construction changes, and fixing it is the highest-leverage cost optimisation available.

If you build LLM applications for any length of time, you eventually notice that you are paying to have the model read the same instructions over and over again. The system prompt, the tool definitions, the few-shot examples, the structured output schema - all of it goes back into the model on every single request, and you pay for the input tokens every single time. For a chatbot doing one or two thousand requests a day this is annoying. For an agent doing tens of thousands of requests with long contexts, it is the dominant cost line. ...
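To make the static-first layout concrete, here is a minimal Python sketch. The prompt strings and the helpers `build_prompt` and `prefix_fingerprint` are hypothetical names invented for illustration, and no provider API is called. It only shows that keeping dynamic values out of the prefix leaves it byte-identical across requests, and that a single extra space is enough to change it.

```python
import hashlib

# Static content: identical on every request, so it can form a cacheable prefix.
# All three strings are placeholders invented for this sketch.
SYSTEM_PROMPT = "You are a support assistant. Answer concisely."
TOOL_DEFINITIONS = '[{"name": "lookup_order", "parameters": {"order_id": "string"}}]'
FEW_SHOT_EXAMPLES = "Q: Where is my order?\nA: Let me look that up for you."


def build_prompt(user_name: str, question: str) -> tuple[str, str]:
    """Return (static_prefix, dynamic_suffix); the full prompt is prefix + suffix."""
    # Nothing request-specific is interpolated into the prefix.
    static_prefix = "\n\n".join([SYSTEM_PROMPT, TOOL_DEFINITIONS, FEW_SHOT_EXAMPLES])
    # Everything that varies per request goes at the end.
    dynamic_suffix = f"\n\nUser name: {user_name}\nQuestion: {question}"
    return static_prefix, dynamic_suffix


def prefix_fingerprint(prefix: str) -> str:
    """Short hash of the prefix; logging it per request exposes unintended changes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


prefix_a, suffix_a = build_prompt("Ada", "Where is order 1234?")
prefix_b, suffix_b = build_prompt("Grace", "Can I change my address?")

# Different users and questions, same prefix: the condition caching relies on.
same_prefix = prefix_fingerprint(prefix_a) == prefix_fingerprint(prefix_b)

# One trailing space is enough to produce a different prefix.
whitespace_changes_prefix = prefix_fingerprint(prefix_a) != prefix_fingerprint(prefix_a + " ")

print(same_prefix, whitespace_changes_prefix)
```

Logging a fingerprint like this next to each request is one cheap way to notice when a prompt-construction change has quietly altered the prefix.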

MPE Deep Dive: Why Expressive MIDI Changes Everything

TL;DR
- MPE (MIDI Polyphonic Expression) gives every sounding note its own continuous pitch, pressure, and timbre control, where standard MIDI shares those controls across the whole channel.
- The trick is channel rotation: a master channel carries global messages while each note gets its own channel, making per-voice expression a first-class citizen.
- Hardware and synth must both participate - an MPE controller into a non-MPE synth gets you nothing.
- It is genuinely worth the cost for solo lines, expressive pads, and modelled acoustic instruments; it is overkill for step-sequenced and quantised material.
- Once you have spent serious time on an MPE instrument, a fixed-velocity keyboard feels like trading a touch screen for a number pad.

If you have spent any time around electronic music in the last decade, you have probably seen the letters MPE written on the side of a controller and not thought too much about them. The acronym sounds like a feature bullet. It is not. It is a quiet but fundamental reframing of what an electronic instrument can do, and once you have spent serious time playing one, going back to a fixed-velocity keyboard feels like trading a touch screen for a number pad. ...
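Channel rotation can be sketched in a few lines of Python. This is an illustrative allocator, not code from any real controller or synth: the class name `MpeChannelAllocator` is hypothetical, it assumes a single lower zone with channel 1 as master and channels 2-16 as member channels, and it raises an error when channels run out where a real device would apply a voice-stealing policy.

```python
from collections import deque


class MpeChannelAllocator:
    """Gives each sounding note its own member channel (lower-zone layout assumed)."""

    MASTER_CHANNEL = 0  # MIDI channel 1, zero-indexed; reserved for global messages

    def __init__(self, member_channels: int = 15) -> None:
        # Member channels are MIDI channels 2..16, zero-indexed here as 1..15.
        self._free = deque(range(1, member_channels + 1))
        self._by_note: dict[int, int] = {}

    def note_on(self, note: int, velocity: int) -> bytes:
        if not self._free:
            raise RuntimeError("no free member channel (a real device would steal a voice)")
        channel = self._free.popleft()
        self._by_note[note] = channel
        return bytes([0x90 | channel, note, velocity])  # Note On on the note's own channel

    def pitch_bend(self, note: int, value: int) -> bytes:
        """Bend only this note. `value` is 14-bit (0..16383); 8192 is centre."""
        channel = self._by_note[note]
        return bytes([0xE0 | channel, value & 0x7F, (value >> 7) & 0x7F])

    def note_off(self, note: int) -> bytes:
        channel = self._by_note.pop(note)
        self._free.append(channel)  # goes to the back of the queue, so channels rotate
        return bytes([0x80 | channel, note, 0])


alloc = MpeChannelAllocator()
c_on = alloc.note_on(60, 100)     # middle C lands on the first member channel
e_on = alloc.note_on(64, 100)     # E lands on the next one
bend = alloc.pitch_bend(64, 9000) # bends the E only; the C is untouched
c_off = alloc.note_off(60)
```

Because pitch bend is a per-channel message in MIDI, putting each note on its own channel is what lets the bend reach one note without touching the others.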

The Rise of Small Language Models: Why Size Isn't Everything

TL;DR
- Small language models (typically under 15B parameters) trained on high-quality data can match or outperform much larger models on many real-world tasks, thanks to distillation, instruction tuning, and quantization.
- The key advantages are speed (milliseconds vs seconds), cost (no per-token API charges), privacy (data stays on your hardware), and offline capability.
- Standout models include Mistral 7B for speed, Phi-3 for edge devices, and OpenClaw for code and reasoning - all usable locally via Ollama.
- The industry is moving toward a multi-tier approach: small models (7-13B) for 80% of workloads, medium models as a step-up, and large models reserved only for complex reasoning tasks where they genuinely outperform.
- Large models still win on deep multi-step reasoning, breadth of knowledge, and few-shot generalization - the shift is about matching model size to task, not replacing large models entirely.

For years, the narrative was simple: bigger is better. GPT-4 was massive, Claude was massive, and the race seemed to be about who could train the largest model on the most data. But that story is changing. Small language models - typically under 15 billion parameters - are proving that you don’t need 175 billion parameters to solve real problems. ...

Scaling Graph Algorithms: From Prototypes to Production

TL;DR
- Graph algorithms are memory-bound, not CPU-bound: a 5 billion node, 50 billion edge graph needs 500+ GB just for adjacency, scores, and working memory - it simply does not fit on one machine.
- Four scaling strategies cover most production cases: distributed processing (vertex-cut or edge-cut partitioning), approximate algorithms (sampling, sketching, early stopping), incremental/streaming approaches, and storage-level optimisation (columnar layout, compression, caching).
- The hidden costs dominate at scale: communication overhead, synchronisation barriers, debugging distributed state, and the operational burden of keeping it all running.
- Full-graph recompute is not feasible at billions of nodes - incremental algorithms and clever approximations are the norm, not the exception.
- Managed services like Neptune Analytics remove much of the partitioning and operations work, at the price of less control.

Graph algorithms work great on your laptop. PageRank on a 100,000-node graph finishes in seconds. Louvain finds communities instantly. ...
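The 500+ GB figure is easy to sanity-check with back-of-envelope arithmetic. The sketch below is one plausible breakdown, not the post's own calculation: it assumes a compressed sparse row (CSR) adjacency with 8-byte IDs (5 billion nodes will not fit in a 32-bit index) and two 8-byte score arrays, as an iterative algorithm such as PageRank would keep for current and next values. The function name `estimate_graph_memory_gb` is hypothetical.

```python
def estimate_graph_memory_gb(
    nodes: int,
    edges: int,
    id_bytes: int = 8,      # assumed: 64-bit IDs, since 5e9 nodes exceed 2**32
    score_bytes: int = 8,   # assumed: float64 scores
    score_arrays: int = 2,  # assumed: current and next score vectors
) -> dict[str, float]:
    """Rough in-memory footprint, in decimal GB, of a CSR graph plus score arrays."""
    gb = 1e9
    adjacency = edges * id_bytes / gb      # one neighbour ID per edge
    offsets = (nodes + 1) * id_bytes / gb  # CSR row pointers
    scores = nodes * score_bytes * score_arrays / gb
    return {
        "adjacency": adjacency,
        "offsets": offsets,
        "scores": scores,
        "total": adjacency + offsets + scores,
    }


estimate = estimate_graph_memory_gb(nodes=5_000_000_000, edges=50_000_000_000)
for part, size in estimate.items():
    print(f"{part:>9}: {size:7.1f} GB")
```

Under these assumptions the total comes to about 520 GB, before counting edge weights, message buffers, or any other working memory, so the edge list alone already accounts for most of the footprint.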

Polkadot's Agile Coretime: A Plain-English Explainer

TL;DR
- Agile Coretime (Polkadot 2.0) replaces two-year reserved parachain slots with pay-as-you-go blockspace - reserved parking spots become parking meters.
- The old model cost $2-5 million in DOT collateral per slot and left the network at maybe 30-40% effective capacity, because idle parachains still held their lanes.
- Cores now sell by the week, month, or season - roughly $100-500 per week as of 2026, around 100x cheaper than the per-week equivalent of a traditional slot.
- The secondary market is the quiet game-changer: unused capacity can be resold instead of wasted.
- The trade-offs are real - bulk-sale price volatility and less long-term certainty for teams that want guaranteed capacity.

If you’ve been following Polkadot, you’ve probably heard “Agile Coretime” mentioned alongside “Elastic Scaling” and “Asynchronous Backing.” It sounds technical, important, and confusing. This post explains what it actually is, why it matters, and what it means for the network. ...