How LLM Prompt Caching Works: KV Cache & TTL Explained

May 22, 2026 · prompt-cache · transformer · llm-architecture

Contents

Why Your AI App’s Token Bill Is Growing Faster Than Your Users
1. Why LLMs Even Have a Cache: A Transformer Inference Walkthrough
1.1 Self-Attention in One Equation
1.2 The Two Phases of Inference
1.3 The KV Cache: Saving Prefill Work for Decode
1.4 The Memory–Compute Tradeoff (Why TTLs Exist)
1.5 Two Layers of Caching
2. The Two Wins: Cost AND Latency
2.1 The Cost Math
2.2 The Latency Win (Often the Bigger Story)
2.3 Why This Matters for Product Strategy
3. Cache Freshness, TTL, and the Operational Model
3.1 Freshness Has Two Meanings — Don’t Confuse Them
3.2 TTL Behavior Across Providers
3.3 Designing for TTL
4. Universal Principles Every Developer Should Know
4.1 Caching Is Prefix-Based — Order Matters
4.2 The Cache Stores K/V, Not Answers
4.3 Cache Writes Are Investments, Not Free
4.4 Caching APIs Don’t Port Across Providers
5. Is Prompt Caching Free Money?
Quickstart: Use the OpenAI SDK Against Every Provider
FAQ

TL;DR: LLM prompt caching is not a bolt-on optimization; it falls out of how the Transformer architecture actually computes attention. Once you understand why the Key/Value vectors of a stable prefix are mathematically reusable, the real surprise becomes the two-headed benefit: dramatic cost reduction (50–90%) and a faster time to first token (up to ~5× in the measurements in §2.2, with larger gains reported for very long prompts). This article, Part 1 of a five-part series, covers the architectural reason caching exists, the memory-vs-compute tradeoff that determines whether a cache pays off, and the TTL behavior every developer needs to understand. Part 2 digs into provider-specific implementations.

Series: Part 1 of 5 — Caching Principles · Next: Part 2 — Provider Comparison & Evaluation · Part 3 — Working Code Tutorial · Part 4 — Best LLM by Use Case · Part 5 — LangChain Integration

Why Your AI App’s Token Bill Is Growing Faster Than Your Users

If you’re shipping a chatbot, a RAG app, or an AI agent, you’ve probably hit the same wall: your invoice doubles while your usage doesn’t. Open the request log and you’ll find the same multi-thousand-token system prompt, the same tool descriptions, the same knowledge-base chunks — re-sent on every call.

That is the central economic problem of LLM inference: the model is stateless. Every request re-processes the entire context from scratch. An 8K-token system prompt called 1,000 times equals 8 million tokens of repeated work. You pay for every one of them — and your users wait for every one of them.

Prompt caching fixes this. But unlike most performance tricks, it isn’t added to the architecture — it’s a natural consequence of how Transformer attention is defined. Once you see that, the rest of the article (pricing, TTL, provider differences) lines up cleanly.

1. Why LLMs Even Have a Cache: A Transformer Inference Walkthrough

This section is the part that almost every “prompt caching” tutorial skips. It’s the part that explains why the cache exists in the first place — and why the discounts providers offer aren’t arbitrary marketing numbers but reflect real GPU economics.

1.1 Self-Attention in One Equation

A decoder-only Transformer (the family GPT-4, Claude, Gemini, DeepSeek, Qwen all belong to) processes tokens by repeatedly applying self-attention. For a sequence of N tokens, the attention output, with one row per token, is:

Attention(Q, K, V) = softmax( Q · Kᵀ / √d ) · V

where Q, K, V are matrices of shape [N × d], with d the per-head dimension, derived from the input embeddings by three learned linear projections (one per layer per head). The original definition is from Attention Is All You Need (Vaswani et al., 2017).

Two properties of this equation matter enormously for caching:

Property 1 — Causal masking. During generation, token i can only attend to tokens at positions ≤ i. The attention matrix is lower-triangular: the K and V vectors for early tokens are used by every later token, but later tokens never modify them.

Property 2 — K and V depend only on the prefix. Because they’re computed from the input embeddings of positions 1…i via fixed weight matrices, the K and V vectors at position i are a deterministic function of the tokens at positions 1…i — and only those tokens. Nothing about position i+1 can change K_i or V_i.

The implication is immediate: if two requests share an identical prefix of length P, the first P rows of K and V are bit-for-bit identical.

That’s the entire theoretical basis for prompt caching. Everything else is engineering.
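
The prefix property can be checked on a toy model. The sketch below is a minimal pure-Python illustration, not a real Transformer: it assumes two single-head causal attention layers with no residuals, MLPs, or positional encodings, and the embeddings and weights (EMBED, W1, W2) are arbitrary made-up values. It compares the second-layer K and V rows of two sequences that share a three-token prefix.

```python
import math

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def causal_attention(X, Wq, Wk, Wv):
    """Single-head causal self-attention. Returns (outputs, K, V)."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    scale = math.sqrt(len(Q[0]))
    out = []
    for i, q in enumerate(Q):
        # Position i only sees positions 0..i (the causal mask).
        scores = [sum(a * b for a, b in zip(q, K[j])) / scale for j in range(i + 1)]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * V[j][c] for j, w in enumerate(weights))
            for c in range(len(V[0]))
        ])
    return out, K, V

# Hypothetical toy embeddings and (Wq, Wk, Wv) weights with d = 2.
EMBED = {1: [0.1, 0.3], 2: [0.7, -0.2], 3: [-0.5, 0.4], 4: [0.9, 0.9], 5: [-0.3, -0.8]}
W1 = ([[0.2, -0.1], [0.4, 0.3]], [[0.5, 0.1], [-0.2, 0.6]], [[0.3, 0.3], [0.1, -0.4]])
W2 = ([[0.1, 0.2], [-0.3, 0.5]], [[0.6, -0.4], [0.2, 0.2]], [[-0.1, 0.7], [0.4, 0.1]])

def layer2_kv(token_ids):
    x = [EMBED[t] for t in token_ids]
    h, _, _ = causal_attention(x, *W1)   # layer 1 output feeds layer 2
    _, K2, V2 = causal_attention(h, *W2)
    return K2, V2

prefix = [1, 2, 3]
K_a, V_a = layer2_kv(prefix + [4])
K_b, V_b = layer2_kv(prefix + [5, 1])

P = len(prefix)
prefix_identical = K_a[:P] == K_b[:P] and V_a[:P] == V_b[:P]
suffix_differs = K_a[P] != K_b[P]
print(prefix_identical, suffix_differs)
```

The comparison uses exact float equality on purpose: in this sketch the prefix rows go through the same operations in the same order in both runs, so nothing downstream can perturb them.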

1.2 The Two Phases of Inference

Modern LLM inference runs in two distinct phases that consume GPU time very differently. This split is documented thoroughly in Efficiently Scaling Transformer Inference (Pope et al., 2022).

Prefill phase. The model ingests the full prompt at once. For each layer, it computes Q, K, V for every input token and runs the self-attention. Prefill is compute-bound: it saturates the GPU’s matrix-multiply units. Cost scales as O(N²) in prompt length because of the attention matrix.

Decode phase. The model produces one output token at a time, autoregressively. At step t, only the new token’s Q is computed; it attends against the K/V of all previous tokens. Decode is memory-bandwidth-bound — most time is spent reading K/V from GPU memory, not multiplying. Cost scales as O(N) per token (linear in current context length).

For a typical chatbot workload (8K-token system prompt + 100-token user query + 300-token response), prefill dominates wall-clock time and dollar cost by roughly 4:1. That’s the part caching saves.

Per call breakdown (8K prompt, 300 output tokens, Claude-class model):

  ████████████████████████████████░░░░░░░░  Prefill: ~80% of compute
  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░████████  Decode:  ~20% of compute

1.3 The KV Cache: Saving Prefill Work for Decode

The “KV cache” originally referred to a within-request optimization. During decoding, each newly generated token needs to attend to the K and V of every previous token. Recomputing those every step would turn an O(N) decode into O(N²) decode. So every inference engine stores K and V from prefill in GPU memory and reuses them for the entire decode phase. This is universal — every commercial LLM does it. It’s what makes generation tractable at all.

What providers expose as “prompt caching” is the next generalization: keep the KV cache around after the request ends, and reuse it for the next request that shares the same prefix.
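
To make the within-request cache concrete, here is a minimal single-layer sketch in pure Python. The weights and the input vectors in xs are hypothetical, and the inputs stand in for tokens that are already known. It runs the same decode loop twice, once recomputing K and V for the whole context at every step and once appending to a cache, then compares the outputs and counts the K/V projections each version performed.

```python
import math

def project(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, K, V):
    """Attention output for one query over the given K/V rows."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]

# Hypothetical toy weights and inputs (d = 2).
Wq, Wk, Wv = [[0.2, -0.1], [0.4, 0.3]], [[0.5, 0.1], [-0.2, 0.6]], [[0.3, 0.3], [0.1, -0.4]]
xs = [[0.1, 0.3], [0.7, -0.2], [-0.5, 0.4], [0.9, 0.9], [-0.3, -0.8]]

def decode_without_cache(xs):
    outputs, projections = [], 0
    for t in range(len(xs)):
        # Recompute K and V for every token seen so far, at every step.
        K = [project(Wk, x) for x in xs[: t + 1]]
        V = [project(Wv, x) for x in xs[: t + 1]]
        projections += t + 1
        outputs.append(attend(project(Wq, xs[t]), K, V))
    return outputs, projections

def decode_with_cache(xs):
    outputs, projections = [], 0
    K_cache, V_cache = [], []
    for x in xs:
        # Project only the new token and append it to the cache.
        K_cache.append(project(Wk, x))
        V_cache.append(project(Wv, x))
        projections += 1
        outputs.append(attend(project(Wq, x), K_cache, V_cache))
    return outputs, projections

out_slow, work_slow = decode_without_cache(xs)
out_fast, work_fast = decode_with_cache(xs)
print(out_slow == out_fast, work_slow, work_fast)  # True 15 5
```

For five tokens the uncached loop does 1 + 2 + 3 + 4 + 5 = 15 projection steps against 5 for the cached loop, which is the quadratic-versus-linear gap described above.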

1.4 The Memory–Compute Tradeoff (Why TTLs Exist)

So why doesn’t every provider just cache everything forever? Because KV cache is enormous.

For a model with L transformer layers, H key/value heads per layer, head dimension D, and B bytes per stored value (typically 2 for fp16), the KV cache size for N tokens is:

KV cache size  =  2 × L × H × D × B × N

(2 = one K and one V vector, L = layers, H = KV heads, D = head dimension, B = bytes per value, N = tokens)

For a 70B-class model with 80 layers, 8 KV heads (with grouped-query attention), a head dimension of 128, and fp16 values, that’s 2 × 80 × 8 × 128 × 2 = 327,680 bytes, or roughly 320 KB, per token. A 32K-token context is therefore ~10 GB of KV cache, just for one request. A modern H100 GPU has 80 GB; you can fit at most a handful of these simultaneously.
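
The same arithmetic as a small Python helper, using only the figures quoted above. The function name kv_cache_bytes is made up for this sketch, and the results are printed in binary units (KiB and GiB).

```python
def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_value, tokens):
    """Bytes of K and V stored for `tokens` positions (the factor 2 is K plus V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

per_token = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, bytes_per_value=2, tokens=1)
full_context = kv_cache_bytes(80, 8, 128, 2, tokens=32 * 1024)

print(per_token / 1024)        # KiB per token
print(full_context / 1024**3)  # GiB for one 32K-token context
```

Swapping in a different layer count, head count, or precision shows how quickly the per-request footprint moves; halving bytes_per_value, for instance, halves the cache.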

This is the core constraint PagedAttention (Kwon et al., 2023, the paper behind vLLM) was designed to solve at the batch level — and the same constraint is what bounds prompt caching at the cross-request level:

Resource	Cost of recomputing prefix	Cost of storing prefix
GPU compute time	High (O(N²) attention)	Low (just memory loads)
GPU memory	Free (computed then discarded)	High (10 GB per 32K context)

So a provider’s cache TTL is essentially a memory eviction policy: at some point the GPU needs that memory for other users’ active workloads, and the cached prefix gets evicted. Typical lifetimes follow the storage tier: about 5 minutes for HBM-resident caches, up to 1 hour for caches paged out to DRAM, and hours for disk-backed caches.

DeepSeek’s trick. DeepSeek-V2 introduced Multi-head Latent Attention (MLA), which compresses KV cache by roughly 4× compared to standard grouped-query attention (DeepSeek-AI, 2024). That compression is exactly what lets them persist KV cache to disk instead of HBM — which in turn enables a much smaller minimum cache unit (64 tokens vs. 1,024 for HBM-resident caches) and much longer effective TTLs.

This is also why caching across requests requires identical token-by-token prefixes. The cache is indexed by a hash of token IDs, and any divergence — even one character that re-tokenizes differently — produces a different K and V from that point onward. There’s no “fuzzy match” at this layer (that’s what semantic caching does, but that’s a different mechanism in the gateway).
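
Providers do not publish how they index cached prefixes, so the following is only one plausible scheme, sketched under stated assumptions: token IDs are grouped into fixed-size blocks, and each block’s key is a running hash of every token up to the end of that block. BLOCK_SIZE, block_keys, and cached_prefix_tokens are hypothetical names, and the token IDs are invented.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical; chosen small so the example is easy to read

def block_keys(token_ids, block_size=BLOCK_SIZE):
    """One chained hash per full block; each key covers all tokens up to that block's end."""
    keys, running = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        running.update(",".join(map(str, block)).encode() + b";")
        keys.append(running.hexdigest())
    return keys

def cached_prefix_tokens(token_ids, cache):
    """Number of leading tokens whose blocks are all present in `cache`."""
    hits = 0
    for key in block_keys(token_ids):
        if key not in cache:
            break
        hits += 1
    return hits * BLOCK_SIZE

first = [11, 12, 13, 14, 21, 22, 23, 24, 31, 32, 33, 34]
cache = set(block_keys(first))               # "write" the first request's blocks

same_prefix = first[:8] + [91, 92, 93, 94]   # diverges in the third block
early_change = [99] + first[1:]              # diverges at the very first token

print(cached_prefix_tokens(same_prefix, cache))   # 8
print(cached_prefix_tokens(early_change, cache))  # 0
```

Because each key depends on everything before it, a single changed token early in the prompt leaves nothing reusable, even though eleven of the twelve tokens are unchanged.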

1.5 Two Layers of Caching

┌──────────────────────────────────────────────────────────────┐
│  Layer 1: Per-request KV cache (always on, every provider)    │
│  → keeps decode O(N) instead of O(N²)                        │
│  → you don't pay attention to it; the provider just does it  │
└──────────────────────────────────────────────────────────────┘
                              ↓
┌──────────────────────────────────────────────────────────────┐
│  Layer 2: Cross-request Prompt Cache (the money-and-time      │
│           saver this series is about)                         │
│  → reuses prefill K/V across requests with matching prefixes  │
│  → exposed as: explicit / fully automatic / hybrid           │
│  → bounded by TTL (memory-eviction-driven)                   │
└──────────────────────────────────────────────────────────────┘

The rest of the series — and most of what you’ll tune as a developer — lives in Layer 2.

2. The Two Wins: Cost AND Latency

Most articles frame caching as a cost optimization. That undersells it. The latency win is often the bigger reason production teams adopt caching, especially for user-facing chat.

2.1 The Cost Math

Pricing pages give the headline numbers but rarely show them applied to a realistic workload. Take a customer-support bot with an 8,000-token system prompt, 100K queries/day, 200-token user messages. Here it is priced on claude-sonnet-4-5 using Anthropic’s published 2026 rates (cached reads at 10% of the base input rate, cache writes at 125% of it):

Without caching

Input per call: 8,200 tokens × base input rate
Per-call cost (measured single-call): ~$0.022
Monthly cost: 100K × 30 × $0.022 = ~$66,000

With prompt caching

One-time cache write: 8,000 tokens × 125% of the base input rate (negligible against monthly volume; at an average of roughly one request per second, the 5-minute TTL rarely lapses)
Per call thereafter: 8,000 tokens × 10% base + 200 tokens × base + output
Effective per-call cost: ~$0.003
Monthly cost: ~$9,000

~86% saved. That number is Anthropic’s published discount applied to a realistic input shape. Part 3 of this series (Working Code Tutorial) shows real measured numbers across the rest of the providers.
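
The arithmetic can be reproduced in a few lines of Python. This sketch takes the per-call dollar figures exactly as quoted above rather than deriving them from a price list, and computes the input-side saving in units of the base input rate so that no dollar price has to be assumed. All variable names are illustrative.

```python
CALLS_PER_MONTH = 100_000 * 30

# Per-call figures quoted in the text above.
uncached_per_call = 0.022
cached_per_call = 0.003

monthly_uncached = CALLS_PER_MONTH * uncached_per_call
monthly_cached = CALLS_PER_MONTH * cached_per_call
total_saving = 1 - monthly_cached / monthly_uncached

# Input-side saving, measured in "base-rate tokens" (no dollar price needed).
prefix_tokens, user_tokens, cached_read_multiplier = 8_000, 200, 0.10
input_uncached = prefix_tokens + user_tokens
input_cached = prefix_tokens * cached_read_multiplier + user_tokens
input_saving = 1 - input_cached / input_uncached

print(round(monthly_uncached), round(monthly_cached))
print(f"{total_saving:.0%} total, {input_saving:.0%} on input tokens")
```

The input-only saving comes out slightly higher than the total because output tokens are billed the same with or without a cache hit.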

2.2 The Latency Win (Often the Bigger Story)

Prefill isn’t only expensive — it’s the single biggest contributor to time-to-first-token for any prompt longer than a few hundred tokens. Cache hits let you skip almost all of it.

Measured streaming TTFT against the public Synthorai gateway, 2026-05-25, ~7,300-token stable system prompt:

Model	Cold total	Warm TTFT	Improvement
`gpt-5.4-mini`	~3.6 s	0.73 s	~5×
`gpt-5.4-nano`	~2.2 s	1.00 s	~2×
`claude-haiku-4-5`	~3.0 s	1.31 s	~2×
`claude-sonnet-4-5`	~2.0 s	1.76 s	~1.2×
`claude-opus-4-5`	~2.2 s	2.08 s	~1.05×
`deepseek-v4-flash`	~4.0 s	2.93 s	~1.4×
`qwen3-max`	~4.8 s	1.53 s	~3×

Single-run, single-tenant. The TTFT win is most visible on long prompts (>5K tokens); for short prompts the prefill portion is too small to dominate latency. Claude’s biggest measured win is cost (~88–89% off input on cache read) — for prompt sizes in the 100K+ range, the TTFT win compounds substantially per Anthropic’s published numbers.

For chat UIs, the threshold above which users consciously perceive a delay is around 1 s for TTFT and ~2 s for first useful text. A 10K-token RAG prompt without caching is firmly above that line. With caching, the same workload feels instant.

For agent loops with 15+ steps, the cost story is good (50% saved), but the latency story is what makes the product actually shippable: 15 steps × 5 s of prefill = 75 s of dead time per task; with caching, 15 × 0.5 s = 7.5 s.

2.3 Why This Matters for Product Strategy

A common mistake is to treat caching as “ops doing cost optimization” — a thing you bolt on after launch. But the latency win means caching is also part of the UX surface:

A chatbot with sub-1-s TTFT feels alive; the same bot at 3 s feels broken.
A RAG product where retrieval+prefill takes 4 s loses to the same product where it takes 1 s.
An agent that completes a task in 20 s wins against one that takes 90 s.

You should be deciding cache strategy at the same time you decide your model and prompt structure — not three sprints after launch.

3. Cache Freshness, TTL, and the Operational Model

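The TTL question is one of the most-asked and least-explained parts of prompt caching. Three things to understand: what “freshness” actually means, how TTLs behave across providers, and how to design around them.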

3.1 Freshness Has Two Meanings — Don’t Confuse Them

Cache freshness ≠ response freshness. Two distinct concepts often get conflated:

Concept	What it means	Risk
KV cache freshness	Whether the cached K/V vectors are still the same bytes as a fresh computation	Zero risk. K/V are deterministic — a cached value at position `i` is bit-identical to a freshly recomputed value.
Prompt content freshness	Whether the information in your prompt is still current (e.g., “today’s weather”, “current stock price”)	Your problem. The cache doesn’t know your data is stale. You need to bust it deliberately.

In other words: responses produced from a cache hit are not “stale” in any model-quality sense. They’re mathematically identical to uncached ones. But if you put “the current time is 14:32:05” into your system prompt and keep resending that frozen prompt to preserve cache hits, your “current time” stays at 14:32:05 and your model will confidently lie to users about it.
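
One way to keep time-sensitive data out of the cached prefix is to leave the system prompt byte-stable and attach the volatile value to the user turn instead. The sketch below assumes an OpenAI-style messages list; SYSTEM_PROMPT and build_messages are hypothetical names, and the prompt text is a placeholder for a real multi-thousand-token prefix.

```python
from datetime import datetime, timezone

# Hypothetical stable instructions; in practice this is the long cached prefix.
SYSTEM_PROMPT = "You are a support assistant for Example Corp. Follow the policies below."

def build_messages(user_text, now):
    """Keep the system prompt unchanged; put the timestamp in the user turn."""
    stamped = f"[current time: {now.isoformat(timespec='seconds')}]\n{user_text}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": stamped},
    ]

a = build_messages("What time is it?", datetime(2026, 5, 22, 14, 32, 5, tzinfo=timezone.utc))
b = build_messages("And now?", datetime(2026, 5, 22, 14, 40, 0, tzinfo=timezone.utc))

print(a[0] == b[0])      # the cacheable prefix is identical across calls
print(a[1]["content"])   # the timestamp travels with the volatile part
```

The model still sees the correct time on every call, while the part of the prompt that is expensive to prefill never changes.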

3.2 TTL Behavior Across Providers

Provider	Default TTL	Refresh on hit?	Extended option
Anthropic Claude	5 min	Yes (sliding window)	1-hour option
OpenAI	~5 min	Yes	Up to ~60 min for high-traffic prefixes
Google Gemini	Developer-chosen (default 1 hour)	No (fixed)	Up to 24 hours via API
DeepSeek	Hours (tier-dependent)	Yes	—
Alibaba Qwen	5 min default	Yes	Configurable per cache

The 5-minute default isn’t arbitrary: it’s roughly the GPU memory pressure horizon for popular models at peak load. As calculated in §1.4, the KV cache for a single 32K-token context is around 10 GB, and longer contexts need proportionally more; providers can’t afford to hold them indefinitely.
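
The “Refresh on hit?” column is easier to reason about with a small model. The function below is a simplified sketch, not any provider’s actual eviction logic: it assumes a single cache entry, times in seconds, and an expiry that either slides forward on each hit or stays fixed at write time plus TTL. The name expiry_time and the sample hit times are made up.

```python
def expiry_time(written_at, hit_times, ttl, refresh_on_hit):
    """When an entry written at `written_at` expires, given later request times."""
    expires = written_at + ttl
    for t in sorted(hit_times):
        if t >= expires:
            break                 # already evicted; later requests are misses
        if refresh_on_hit:
            expires = t + ttl     # sliding window: each hit restarts the clock
    return expires

TTL = 300               # 5 minutes
hits = [120, 360, 590]  # requests 2, 6, and roughly 10 minutes after the write

print(expiry_time(0, hits, TTL, refresh_on_hit=True))   # 890
print(expiry_time(0, hits, TTL, refresh_on_hit=False))  # 300
```

With a sliding window, steady traffic keeps the entry alive well past the nominal TTL; with a fixed TTL, the second and third requests in this example arrive after the entry is gone.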

3.3 Designing for TTL

Three patterns that work in production:

Pattern A: Keep sessions warm. For chat, the natural request cadence (seconds to minutes between turns) keeps the cache alive on its own. Don’t worry about TTL; just don’t put dynamic data in the prefix.

Pattern B: Heartbeat for batch. For batch jobs that span hours, send a minimal request every TTL/2 to keep the cache warm. This only works on providers that refresh the TTL on a hit (§3.2), and the request has to repeat the cached prefix to count as a hit. Each heartbeat therefore costs the prefix at the cached-read rate plus a few new tokens, which is still far cheaper than a cache-eviction storm.

Pattern C: Use long-TTL providers for cold storage. If you have a 50K-token document that’s queried sporadically (e.g., once an hour for a week), Gemini explicit caches (TTL configurable up to 24 hours) or DeepSeek disk caches will outperform short-TTL alternatives despite the storage fee.
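
Pattern B can be sketched as a small loop. Everything here is an assumption for illustration: keep_warm is a hypothetical helper, send_ping stands for whatever minimal request your client issues (it must repeat the cached prefix exactly), and the dry run uses a fake clock so the example finishes instantly instead of sleeping.

```python
import time

def keep_warm(send_ping, ttl_seconds, job_done, sleep=time.sleep):
    """Call send_ping() every ttl/2 seconds until job_done() returns True."""
    interval = ttl_seconds / 2
    pings = 0
    while not job_done():
        sleep(interval)
        if job_done():
            break
        send_ping()
        pings += 1
    return pings

# Dry run with a fake clock: a 1,000-second job against a 300-second TTL.
clock = {"now": 0.0}
sent = []

def fake_sleep(seconds):
    clock["now"] += seconds

pings = keep_warm(
    send_ping=lambda: sent.append(clock["now"]),
    ttl_seconds=300,
    job_done=lambda: clock["now"] >= 1000,
    sleep=fake_sleep,
)
print(pings, sent)  # 6 [150.0, 300.0, 450.0, 600.0, 750.0, 900.0]
```

In a real batch job this loop would typically run in a background thread or task alongside the main work, with job_done wired to the job’s completion flag.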

4. Universal Principles Every Developer Should Know

Providers expose caching in five very different shapes — explicit markers, fully automatic, hybrid, architectural disk-backing, or none at all. We dedicate the next article to that comparison (Part 2 — Provider Comparison & Evaluation). But four principles apply regardless of provider and follow directly from the architecture we just walked through:

4.1 Caching Is Prefix-Based — Order Matters

Because K/V at position i depends on tokens at positions 1…i, providers can only match a contiguous prefix starting from the first token. Change a single character in that first token and the entire prefix invalidates. Stable content goes first, volatile content goes last. This is not a heuristic; it’s a direct consequence of self-attention’s causal structure (§1.1).
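
The effect of ordering is easy to see by measuring the shared prefix of two requests. In this sketch the token IDs are invented stand-ins for real tokenizer output, and common_prefix_len is a hypothetical helper.

```python
def common_prefix_len(a, b):
    """Number of leading items that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical token IDs standing in for real tokenizer output.
system = [101, 102, 103, 104, 105, 106]
question_1 = [201, 202]
question_2 = [301, 302, 303]
date_stamp_1, date_stamp_2 = [901], [902]

stable_first = common_prefix_len(system + date_stamp_1 + question_1,
                                 system + date_stamp_2 + question_2)
volatile_first = common_prefix_len(date_stamp_1 + system + question_1,
                                   date_stamp_2 + system + question_2)
print(stable_first, volatile_first)  # 6 0
```

The two layouts contain exactly the same tokens; only the position of the one-token date stamp decides whether six tokens or zero are reusable.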

4.2 The Cache Stores K/V, Not Answers

A cache hit doesn’t return a previously generated answer — it returns previously computed K and V vectors, which the model then uses to generate a fresh response to the current question. This means:

Output quality is identical to an uncached call (§1.1).
Output is non-deterministic in the usual ways — temperature, top-p, etc. still apply.
Responses generated from cached K/V are never “stale” in the model-quality sense; only your prompt’s content (timestamps, prices) can be stale. See §3.1.

4.3 Cache Writes Are Investments, Not Free

For providers that charge a write premium (Anthropic 125%, Gemini explicit 125%), the first call with a new prefix costs more than no caching. The break-even is fast (usually one hit), but if your “stable” prefix changes every request you’ll pay write costs over and over with no payoff. Watch this if you’re sorting retrieved documents by relevance: the order changes with every query, so the prefix diverges at the first document and never gets reused. That’s the classic anti-pattern.
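
The break-even claim follows from the two multipliers quoted in this article (writes at 125% and cached reads at 10% of the base input rate). The sketch below expresses prefix cost in units of one uncached prefix read; relative_cost is a hypothetical helper and ignores output tokens and the non-cached part of the prompt.

```python
def relative_cost(requests, write_multiplier=1.25, read_multiplier=0.10, prefix_stable=True):
    """Prefix cost in units of one uncached prefix read.

    With a stable prefix the first request pays the write premium and the rest
    pay the cached-read rate; with an unstable prefix every request is a write.
    """
    if requests == 0:
        return 0.0
    if not prefix_stable:
        return requests * write_multiplier
    return write_multiplier + (requests - 1) * read_multiplier

for n in (1, 2, 10):
    # columns: requests, no caching, stable prefix, prefix that changes every call
    print(n, float(n), relative_cost(n), relative_cost(n, prefix_stable=False))
```

After a single hit (two requests) the cached path is already cheaper than paying full price twice, while a prefix that changes on every call costs 25% more than not caching at all.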

4.4 Caching APIs Don’t Port Across Providers

cache_control (Anthropic) ≠ cached_content (Gemini) ≠ cache_id (Qwen). If your application must run against multiple providers, either you maintain three integrations or you put a Token Gateway in front to unify them. Part 2 covers this in detail.

5. Is Prompt Caching Free Money?

Almost. It pays off when:

Your prompts have a stable prefix — system prompt, knowledge base, tool schemas
Your calls are frequent or connected — same session, batch workloads, in-progress agent runs
You can structure prompts so that stable content sits at the front

Hit those three and you’ll typically see 50–90% lower spend and a noticeably faster TTFT (up to ~5× in the §2.2 measurements) without changing models.

Coming next: Part 2 — Provider Caching Comparison & Evaluation Framework takes the architectural picture above and turns it into a feature-by-feature comparison of Claude, OpenAI, Gemini, DeepSeek, and Qwen, with an evaluation rubric for picking the right provider for your workload.

Quickstart: Use the OpenAI SDK Against Every Provider

Synthorai exposes an OpenAI-compatible endpoint — point the official openai SDK at it and every model (Claude, GPT, Gemini, DeepSeek, Qwen) becomes a one-line model swap. The gateway translates cache_control into each provider’s native caching syntax.

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["SYNTHORAI_KEY"],
    base_url="https://synthorai.io/v1",
)

resp = client.chat.completions.create(
    model="claude-sonnet-4-5",                       # swap freely
    max_tokens=256,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "Hello"},
    ],
)

print(resp.choices[0].message.content)
print(resp.usage.prompt_tokens_details)  # cached_tokens when upstream reports it
print(resp.usage.cost)                   # USD per call (gateway-computed)

Same call works for gpt-5.4-mini, gemini-2.5-pro, deepseek-v4-flash, qwen3-max — only the model field changes. The gateway returns the prompt-cache hit metadata in the standard OpenAI prompt_tokens_details.cached_tokens field, plus a cost field in USD so you don’t need to maintain a per-vendor pricing matrix locally.

FAQ

Is LLM prompt caching the same as semantic caching? No. Prompt caching is prefix-based — it reuses K/V values for an exact token-level match at the start of the prompt. Semantic caching matches at the meaning level (via embeddings) and returns a previous response. Both are useful, and a good Token Gateway combines them in layers.

Does prompt caching change the model’s output? No. K and V are deterministic functions of the input tokens (§1.1). The logits the model produces from a cached K/V are mathematically identical to those from a freshly recomputed K/V. Caching is a pure efficiency optimization with no quality impact.

Why is the cache TTL so short — can’t they just keep it forever? KV cache is enormous (§1.4: ~10 GB per 32K context for a 70B model). GPU memory is the bottleneck; caches get evicted whenever the server needs that memory for active workloads. Disk-backed caches (DeepSeek) can live for hours, but in-memory caches typically can’t.

What’s the difference between KV cache and prompt cache? KV cache is the in-memory data structure used during inference. “Prompt cache” is the cross-request reuse of that KV cache. Layer 1 vs Layer 2 in §1.5 above.

Do cached prompts ever go stale in a quality-degrading way? No, from the model’s perspective. Yes, from your content’s perspective if your prompt encodes time-sensitive information. The cache stores K/V vectors, not facts about the world. See §3.1.

How do I measure cache hit rate? Every provider reports the number of cached input tokens in the response usage object: cache_read_input_tokens (Anthropic), cached_tokens (OpenAI), cached_content_token_count (Gemini), prompt_cache_hit_tokens (DeepSeek). Divide that count by the total input tokens for the same requests to get a hit rate, and track both in your logging pipeline.
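
As a minimal sketch of that tracking, the function below aggregates a token-weighted hit rate from OpenAI-style usage objects stored as plain dicts. It assumes the prompt_tokens and prompt_tokens_details.cached_tokens fields; other providers use the field names listed above and may define the total differently, so the extraction would need adapting. The logged values are invented.

```python
def cache_hit_rate(usages):
    """Share of prompt tokens served from cache across OpenAI-style usage dicts."""
    cached = total = 0
    for usage in usages:
        details = usage.get("prompt_tokens_details") or {}
        cached += details.get("cached_tokens") or 0
        total += usage.get("prompt_tokens") or 0
    return cached / total if total else 0.0

# Hypothetical logged usage objects: one cold call, then two warm calls.
logged = [
    {"prompt_tokens": 8200, "prompt_tokens_details": {"cached_tokens": 0}},
    {"prompt_tokens": 8200, "prompt_tokens_details": {"cached_tokens": 8000}},
    {"prompt_tokens": 8300, "prompt_tokens_details": {"cached_tokens": 8000}},
]

print(f"{cache_hit_rate(logged):.1%}")  # 64.8%
```

Weighting by tokens rather than counting requests matches how the bill is computed: a hit on an 8,000-token prefix matters far more than a hit on a short one.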

References & sources: Vaswani et al., “Attention Is All You Need” (NeurIPS 2017) · Pope et al., “Efficiently Scaling Transformer Inference” (2022) · Kwon et al., “Efficient Memory Management for LLM Serving with PagedAttention” (SOSP 2023, vLLM) · DeepSeek-AI, “DeepSeek-V2: A Strong, Economical, and Efficient MoE Language Model” (2024) — MLA architecture · Anthropic Prompt Caching docs · OpenAI Prompt Caching docs · Google Gemini Context Caching docs · DeepSeek KV Cache guide · Alibaba Bailian Context Cache
