LLM Prompt Caching #2: Compare Claude, GPT, Gemini, DeepSeek

Contents
  1. A Taxonomy of LLM Cache Types
     1.1 Control: Explicit vs Implicit vs Hybrid
     1.2 Persistence: In-Memory vs Disk-Backed
     1.3 Granularity: Match Resolution
     1.4 Object Model: Per-Call Markers vs Named Cache Objects
  2. Per-Provider Deep-Dive
     2.1 Anthropic Claude — Explicit, In-Memory, 1,024-Token Granularity
     2.2 OpenAI GPT-5.x — Automatic, In-Memory, 1,024-Token Granularity
     2.3 Google Gemini — Hybrid, In-Memory, Named Cache Objects
     2.4 DeepSeek-v4 — Automatic, Disk-Backed, 64-Token Granularity
     2.5 Alibaba Qwen3 — Hybrid, In-Memory, Named Cache Objects + Implicit
  3. Side-by-Side Comparison
     3.1 Discount Structure (vendor docs, 2026-05)
     3.2 TTL, Granularity & Persistence
     3.3 Measured Latency on a 7K-Token Prefix (2026-05-25)
  4. The 5-Dimension Evaluation Framework
     4.1 Effective Cost per Million Tokens (Hit-Rate-Weighted)
     4.2 Hit-Rate Predictability
     4.3 TTL ↔ Traffic Cadence Fit
     4.4 Latency Under Cache Miss
     4.5 API Ergonomics & Migration Cost
  5. Quick Verdicts by Workload Shape
  6. Migration Considerations
  7. What Changes Over Time
  FAQ

TL;DR: Five major LLM providers expose prompt caching in very different shapes: explicit markers (Claude), fully automatic (GPT-5, DeepSeek-v4), and hybrid implicit+explicit (Gemini, Qwen), with DeepSeek additionally standing apart architecturally through its MLA-enabled disk-backed cache. This article gives you a feature-by-feature comparison and a 5-dimension evaluation framework to score them for your workload: cost, hit-rate predictability, TTL fit, latency under cache miss, and API ergonomics. The architectural background is in Part 1: Caching Principles; measured numbers and working Python are in Part 3: Tutorial.

Series: Part 2 of 4 · Previously: Part 1 — Caching Principles · Next: Part 3 — Working Code Tutorial · Part 4 — Best LLM by Use Case


1. A Taxonomy of LLM Cache Types

Before going provider-by-provider, four design axes are worth pinning down:

1.1 Control: Explicit vs Implicit vs Hybrid

  • Explicit — developer marks which parts of the prompt to cache (Anthropic Claude cache_control). Maximum control; requires code changes.
  • Implicit / automatic — provider auto-detects matching prefixes (OpenAI GPT-5, DeepSeek-v4). Zero code changes; no way to force a hit.
  • Hybrid — both modes available; pick per call (Gemini, Qwen).

1.2 Persistence: In-Memory vs Disk-Backed

Set by the provider’s KV cache architecture, not by the API surface.

  • In-memory (HBM) — caches live in GPU memory, short-lived (minutes), large minimum chunks (1,024 tokens). Default for most providers.
  • Disk-backed — caches persist to SSD/NVMe with much longer TTLs and finer granularity. DeepSeek ships this at scale, enabled by their Multi-head Latent Attention (MLA) compression that shrinks KV cache by ~4× (DeepSeek-AI, 2024).

1.3 Granularity: Match Resolution

How small a prefix can earn a discount?

  • 64 tokens — DeepSeek (industry-finest)
  • 128 tokens — OpenAI (matching increment)
  • 1,024 tokens — minimum cacheable chunk for Claude, OpenAI, Gemini, Qwen

Finer granularity means partial-prefix overlap also counts — much more forgiving of small prompt variations.
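To make the effect of match resolution concrete, here is a minimal sketch of how a shared prefix might be quantized into billable cached tokens. It assumes a deliberately simplified model (a minimum cacheable length, then the match rounded down to a whole number of increments); the function name and the model are illustrative assumptions, not any vendor's documented algorithm.

```python
def quantized_cached_tokens(shared_prefix_tokens: int, minimum: int, increment: int) -> int:
    """Estimate cached tokens under a simplified minimum-plus-increment model.

    This model is an assumption made for illustration only.
    """
    if shared_prefix_tokens < minimum:
        return 0  # prefix is too short to be cached at all
    # Round the match down to a whole number of increments.
    return (shared_prefix_tokens // increment) * increment


# Two requests that share only their first 900 tokens.
coarse = quantized_cached_tokens(900, minimum=1024, increment=128)
fine = quantized_cached_tokens(900, minimum=64, increment=64)
print("1,024-token minimum:", coarse)
print("64-token minimum:", fine)
```

Under this model, a 900-token shared prefix earns nothing at a 1,024-token minimum but 896 cached tokens at 64-token granularity, which is the "forgiving" behavior described above.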

1.4 Object Model: Per-Call Markers vs Named Cache Objects

  • Per-call markers — every request inlines the content to be cached, provider hashes it (Claude, OpenAI, DeepSeek, Qwen implicit).
  • Named cache objects — developer creates a cache via a separate API call, gets a cache_id, references it later (Gemini explicit, Qwen explicit). Trades extra ceremony for explicit lifecycle control.

These four axes interact. A provider’s offering is described by where it sits on each. The next section walks through each provider individually.


2. Per-Provider Deep-Dive

2.1 Anthropic Claude — Explicit, In-Memory, 1,024-Token Granularity

Lead models (2026-05): claude-haiku-4-5, claude-sonnet-4-5 / 4-6, claude-opus-4-5 / 4-6 / 4-7.

Cache API. Mark up to four cache_control breakpoints anywhere in your system or messages array. Cache hits cost ~10% of base input rate; cache writes cost 125% (a 25% premium). Default TTL is 5 minutes sliding (every hit resets it), with a 1-hour option.

Pricing shape. Anthropic publishes per-million-token rates per model on their pricing page; the cache discount is consistent across the family. For an 8,000-token system prompt at 100K calls/day on claude-sonnet-4-5, the per-call cost drops by roughly 8–10× once the prefix is warm — break-even after a single hit.

TTL behavior. The 5-minute default is sliding: every hit pushes expiry forward by another 5 minutes. The 1-hour TTL raises the write cost to 2× the base input rate (versus 1.25× for the 5-minute default) but is essential for any workload with idle gaps longer than 5 minutes.
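The interaction between a sliding TTL and traffic cadence is easy to check with a small simulation. The sketch below is a toy model, not Anthropic's implementation: it assumes a single prefix, treats a call that arrives exactly at expiry as a miss, and uses a hypothetical function name.

```python
def simulate_sliding_ttl(call_times_s, ttl_s=300.0):
    """Count cache writes and hits for one prefix under a sliding TTL.

    call_times_s: request timestamps in seconds (any order).
    Returns (writes, hits).
    """
    writes = 0
    hits = 0
    expires_at = None
    for t in sorted(call_times_s):
        if expires_at is not None and t < expires_at:
            hits += 1      # prefix still warm
        else:
            writes += 1    # cold start or expired: pay the write premium
        expires_at = t + ttl_s  # both writes and hits refresh the expiry
    return writes, hits


# Three calls a minute apart, an 8-minute gap, then two more calls.
calls = [0, 60, 120, 600, 660]
print("5-minute TTL:", simulate_sliding_ttl(calls, ttl_s=300))
print("1-hour TTL:  ", simulate_sliding_ttl(calls, ttl_s=3600))
```

With the 5-minute TTL the idle gap forces a second write; with the 1-hour TTL the same traffic writes once and hits four times.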

Granularity. 1,024-token minimum. The hash covers the exact token sequence, so a one-character change at the front invalidates the entire prefix.

API ergonomics. Highest. Multi-breakpoint design lets you cache “never changes” + “rarely changes” + “per-task changes” independently — best-in-class for agent and RAG workloads where prompt sections change at different cadences.

Gotchas.

  • Forgetting to add cache_control means no caching at all — unlike GPT or DeepSeek, there’s no implicit fallback.
  • Cache hashing is order-sensitive even within tool/function arrays — sort them deterministically.
  • The 5-min default makes Claude a poor fit for sporadic batch jobs without an explicit keep-alive.
  • If you call Claude through a gateway, verify the gateway supports Anthropic’s native /v1/messages path with cache_control markers (the OpenAI-compat /chat/completions path generally does not propagate the markers — use the Anthropic SDK pointed at the gateway’s base URL).
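The layered breakpoints and the deterministic tool ordering described above can be sketched as a plain request-body builder. This is a minimal illustration, not a reference implementation: the function names and the three-layer split (tools, stable system text, slow-changing system text) are assumptions, and the body is only assembled, never sent.

```python
import copy

MAX_BREAKPOINTS = 4  # limit described in the text above


def count_breakpoints(node) -> int:
    """Recursively count cache_control markers in a request body."""
    if isinstance(node, dict):
        own = 1 if "cache_control" in node else 0
        return own + sum(count_breakpoints(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_breakpoints(item) for item in node)
    return 0


def build_claude_payload(model, tools, stable_system, slow_changing_system,
                         messages, max_tokens=1024):
    """Assemble a /v1/messages request body with layered cache breakpoints."""
    marker = {"type": "ephemeral"}

    # Sort a copy of the tools by name so the prefix is identical on every call.
    ordered_tools = sorted(copy.deepcopy(tools), key=lambda tool: tool["name"])
    if ordered_tools:
        ordered_tools[-1]["cache_control"] = dict(marker)  # layer 1: tool definitions

    system_blocks = [
        # layer 2: text that never changes
        {"type": "text", "text": stable_system, "cache_control": dict(marker)},
        # layer 3: text that changes rarely
        {"type": "text", "text": slow_changing_system, "cache_control": dict(marker)},
    ]

    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "system": system_blocks,
        "messages": messages,
    }
    if ordered_tools:
        payload["tools"] = ordered_tools

    if count_breakpoints(payload) > MAX_BREAKPOINTS:
        raise ValueError("more than four cache_control breakpoints in one request")
    return payload
```

With the Anthropic Python SDK, a body like this would typically be passed as keyword arguments to the messages endpoint; per-task content stays in `messages` without a marker so that it never invalidates the cached layers in front of it.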

Best fit. Long-context agents, multi-turn chat with stable system prompts, structured RAG with layered caching.


2.2 OpenAI GPT-5.x — Automatic, In-Memory, 1,024-Token Granularity

Lead models (2026-05): gpt-5.4-nano, gpt-5.4-mini, gpt-5.2, gpt-5.4-pro, gpt-5.5-pro. Codex variants for code: gpt-5.2-codex, gpt-5.3-codex.

Cache API. Nothing to do — automatic on every request ≥1,024 tokens. Cache hits billed at 50% of input rate; no write premium. Match increment: 128 tokens.

Pricing shape. OpenAI publishes per-million-token rates on their pricing page. Cached input is 50% off; output is unchanged.

Measured (2026-05-25, ~6,900-token system prompt):

| Model | Miss total cost | Hit total cost | Hit cache rate | Hit stream TTFT |
|---|---|---|---|---|
| gpt-5.4-nano | $0.00131 | $0.00074 (−44%) | 5,888 / 6,887 (85%) | 1.00 s |
| gpt-5.4-mini | $0.00267 | $0.00257* | 6,400 / 6,887 (93%) | 0.73 s |

* gpt-5.4-mini’s hit-pass completion was much shorter than the miss-pass; the cost delta there mixes cache discount with completion-length variation. The 5× latency drop (3.63 → 0.73 s) is the cleaner signal.

TTL behavior. Undocumented exact value; field reports suggest 5–60 minutes depending on load and prefix popularity. Popular shared prefixes live longer (LRU favors them).

API ergonomics. Trivial — existing code keeps working. Log prompt_tokens_details.cached_tokens to measure hit rate.
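A small helper makes that logging concrete. The sketch below assumes the usage block has already been converted to a plain dict with the `prompt_tokens` and `prompt_tokens_details.cached_tokens` fields named above; the function name is hypothetical.

```python
def cache_hit_rate(usage: dict) -> float:
    """Return cached_tokens / prompt_tokens for one response's usage block."""
    prompt_tokens = usage.get("prompt_tokens") or 0
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens") or 0
    if prompt_tokens == 0:
        return 0.0  # nothing to measure; avoid dividing by zero
    return cached_tokens / prompt_tokens


# Usage block shaped like the gpt-5.4-nano hit pass measured above.
usage = {"prompt_tokens": 6887, "prompt_tokens_details": {"cached_tokens": 5888}}
print(f"hit rate: {cache_hit_rate(usage):.1%}")
```

The helper tolerates a missing or null `prompt_tokens_details`, which matters because not every gateway or model populates it.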

Gotchas.

  • No way to force a hit. If your traffic produces unique prefixes, you get nothing.
  • The 50% discount is shallower than Claude’s ~90% and the ~75% offered by DeepSeek and Gemini implicit.
  • Streaming sometimes reports cache hits in the final chunk only — instrument carefully and pass stream_options={"include_usage": True}.

Best fit. Existing GPT-using codebases where retrofit cost outweighs marginal savings. Bursty traffic where prefix repetition is naturally high.


2.3 Google Gemini — Hybrid, In-Memory, Named Cache Objects

Lead models (2026-05): gemini-2.5-flash, gemini-2.5-pro, gemini-3-flash-preview, gemini-3.1-pro-preview, gemini-3.1-flash-lite-preview.

Cache API. Two modes:

  • Implicit: automatic, like GPT. Cached tokens billed at ~25% of input rate. No storage fee, no setup.
  • Explicit: create a cachedContent object via a separate API call. Reference it by name in subsequent requests. Cached tokens billed at ~10% (lower) but you pay an hourly storage fee per million tokens.

Pricing shape. Long context is Gemini’s strength; pricing scales with context length category (sub-200K vs above-200K thresholds at higher per-token rates).

Measured (2026-05-25):

| Model | Miss cost | Hit cost (stream) | Hit cache rate |
|---|---|---|---|
| gemini-2.5-flash | $0.00198 | $0.00024 (−88%) | 7,140 / 7,322 (97%) |
| gemini-2.5-pro | $0.00824 | $0.00205 (−75%) | 6,120 / 7,328 (84%) |

TTL behavior. Implicit: minutes, undisclosed. Explicit: developer-set, default 1 hour, up to 24 hours.

API ergonomics. Explicit cache requires a 2-step flow (create → reference). The cachedContent lifecycle (create, update TTL, delete) is your responsibility.

Gotchas.

  • Storage fee is the killer for low-volume explicit caches. Always compute break-even for your call frequency.
  • Implicit cache hit rate is variable; don’t rely on it for cost modeling.
  • Cache objects are region-bound — multi-region apps need duplicate caches.
  • gemini-*-pro is a reasoning model: at small max_tokens, completion is consumed by hidden thinking and you’ll see completion_tokens=0. Bump max_tokens to ≥256 in any user-facing path.
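The storage-fee break-even mentioned in the first gotcha reduces to one division. The sketch below is a simplified model under stated assumptions: it ignores the one-time cost of creating the cache, assumes every call reuses the full cached prefix, and uses placeholder prices rather than Google's published rates. The function name is hypothetical.

```python
def breakeven_calls_per_hour(base_rate_per_mtok: float,
                             storage_fee_per_mtok_hour: float,
                             explicit_rate: float = 0.10,
                             alternative_rate: float = 0.25) -> float:
    """Calls per hour above which an explicit cache beats the alternative.

    explicit_rate:    fraction of the base input rate billed on explicit hits.
    alternative_rate: fraction billed otherwise (0.25 for implicit hits,
                      1.0 for no caching at all).
    Prefix size cancels out, so the result does not depend on it.
    """
    saving_per_call = base_rate_per_mtok * (alternative_rate - explicit_rate)
    if saving_per_call <= 0:
        raise ValueError("explicit caching never pays off at these rates")
    return storage_fee_per_mtok_hour / saving_per_call


# Placeholder prices only: $1.00 per million input tokens,
# $1.50 per million cached tokens per hour of storage.
vs_implicit = breakeven_calls_per_hour(1.00, 1.50)
vs_uncached = breakeven_calls_per_hour(1.00, 1.50, alternative_rate=1.0)
print(f"vs implicit hits: {vs_implicit:.1f} calls/hour")
print(f"vs no caching:    {vs_uncached:.1f} calls/hour")
```

With these placeholder numbers the explicit cache needs roughly 10 calls per hour to beat reliable implicit hits, but fewer than 2 per hour to beat uncached calls. Substitute the current rates from the vendor pricing page before drawing conclusions.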

Best fit. One large document (>20K tokens) queried 10+ times/hour. Video Q&A. Multi-modal RAG over enterprise PDFs.


2.4 DeepSeek-v4 — Automatic, Disk-Backed, 64-Token Granularity

Lead models (2026-05): deepseek-v4-flash (general-purpose; also covers coder workloads on this generation).

Cache API. Automatic, like GPT — but powered by MLA compression that makes the cache compact enough to persist to disk. Cache hits billed at ~25% of input rate; no write premium. Minimum match: 64 tokens.

Pricing shape. Yuan-denominated rates on DeepSeek’s pricing page. Hit rate translates roughly to a 75% input-cost reduction.

Measured (2026-05-25):

| Model | Miss cost | Hit cost | Hit cache rate | Hit TTFT |
|---|---|---|---|---|
| deepseek-v4-flash | $0.00091 | $0.00023 (−74%) | 6,784 / 7,101 (96%) | 2.93 s |

TTL behavior. Hours, sometimes longer for high-traffic prefixes. Disk-backed storage means caches survive GPU memory pressure that would evict in-memory caches at other vendors.

Granularity. The 64-token minimum is the smallest in the industry. Small prompt edits leave most of the prefix matching, rather than invalidating it wholesale as they can with 1,024-token providers.

API ergonomics. OpenAI-shaped API; swap base URL. Standard prompt_tokens_details.cached_tokens field.

Gotchas.

  • DeepSeek-family models only. No way to use this cache with other model families.
  • Quality on English is excellent but trails Claude/GPT-5 on the hardest reasoning benchmarks.

Best fit. Chinese-language workloads (cost). High-frequency-prefix workloads where granularity matters (RAG with unstable retrieval order). Cost-sensitive batch jobs.


2.5 Alibaba Qwen3 — Hybrid, In-Memory, Named Cache Objects + Implicit

Lead models (2026-05): qwen3-max, qwen3.5-plus, qwen3.5-flash. Vision variants: qwen3-vl-plus, qwen3-vl-flash.

Cache API. Two modes:

  • Implicit: always-on, like GPT. Cached portion billed at ~20% of input rate.
  • Explicit: create cache via API with custom TTL. Hits at ~10%, writes at 125%.

Measured (2026-05-25):

| Model | Miss cost | Hit cost | Hit cache rate | Hit TTFT | Notes |
|---|---|---|---|---|---|
| qwen3-max | $0.00553 | $0.00549 | 7,040 / 7,234 (97%) | 1.53 s | Cache hit reported, gateway cost field didn’t reflect discount on this date (verify in production) |

TTL behavior. Default 5 minutes, configurable per cache object. Sliding window for explicit; short fixed TTL for implicit.

API ergonomics. Implicit is GPT-shaped (zero work). Explicit is a 2-step flow with cache lifecycle.

Gotchas.

  • Only qwen3-max and qwen3.5-plus support explicit caching at the moment.
  • Multi-region (Singapore, US) availability rolling out — confirm region before relying on it for non-China data.
  • Documentation gaps relative to Anthropic/OpenAI — empirical testing recommended.

Best fit. Chinese enterprise workloads needing tight cache control. Customers already on Alibaba Cloud.



3. Side-by-Side Comparison

3.1 Discount Structure (vendor docs, 2026-05)

| Provider | Cache write premium | Cached input rate | Effective discount |
|---|---|---|---|
| Anthropic Claude | +25% | 10% of base | ~90% off |
| OpenAI GPT-5 | none | 50% of base | 50% off |
| Google Gemini (implicit) | none | ~25% of base | ~75% off |
| Google Gemini (explicit) | none, but hourly storage fee | ~10% of base | ~90% off if amortized |
| DeepSeek-v4 | none | ~25% of base | ~75% off |
| Alibaba Qwen3 (implicit) | none | ~20% of base | ~80% off |
| Alibaba Qwen3 (explicit) | +25% | ~10% of base | ~90% off |

3.2 TTL, Granularity & Persistence

| Provider | Default TTL | Max TTL | Persistence | Min match unit |
|---|---|---|---|---|
| Claude | 5 min sliding | 1 hour | In-memory (HBM) | 1,024 tok |
| GPT-5 | ~5 min | ~60 min | In-memory (HBM) | 1,024 tok / 128-tok increment |
| Gemini (implicit) | minutes | undisclosed | In-memory | 1,024 tok |
| Gemini (explicit) | 1 hour | 24 hours | In-memory | 1,024 tok |
| DeepSeek-v4 | hours | hours+ | Disk (SSD) | 64 tok |
| Qwen3 | 5 min | configurable | In-memory | ~1,024 tok |

3.3 Measured Latency on a 7K-Token Prefix (2026-05-25)

| Provider / model | Miss total | Hit TTFT (stream) | Latency win |
|---|---|---|---|
| claude-haiku-4-5 † | ~3.0 s | 1.31 s | ~2× |
| claude-sonnet-4-5 † | ~2.0 s | 1.76 s | ~1.2× |
| claude-opus-4-5 † | ~2.2 s | 2.08 s | ~1.05× |
| gpt-5.4-mini | ~3.6 s | 0.73 s | ~5× |
| gpt-5.4-nano | ~2.2 s | 1.00 s | ~2× |
| gemini-2.5-flash | ~2.5 s | ~1.4 s | ~1.8× |
| gemini-2.5-pro | ~3.0 s | ~1.8 s | ~1.7× |
| deepseek-v4-flash | ~4.0 s | 2.93 s | ~1.4× |
| qwen3-max | ~4.8 s | 1.53 s | ~3× |

† Claude rows are measured with cache_control markers via the Anthropic-native /v1/messages endpoint (see Part 3 §2). Claude’s biggest win is on cost (~88–89% off input — see Part 3 §2 for the full cost table); the TTFT improvement scales up dramatically for 100K+ token prompts per Anthropic’s published numbers.

Single sequential run, no concurrent load. Your numbers will move with region, time-of-day, and competing tenant load.


4. The 5-Dimension Evaluation Framework

Headlines like “Claude saves 90%” are interesting but rarely tell you what to pick. Score each provider on these five dimensions for your workload, then weigh them by what you care about.

4.1 Effective Cost per Million Tokens (Hit-Rate-Weighted)

Don’t compare base prices — compare expected cost at your real hit rate:

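effective_cost = base × (1 − hit_rate × discount + write_premium × write_rate)

Here discount is the fractional saving on cached input (0.90 for Claude), write_premium is the surcharge on cache writes (0.25 for Claude, 0 for the automatic cachers), and write_rate is the share of input that gets written to cache, approximated below by the miss rate (1 − hit_rate).

Worked example for 70% prefix repetition (typical chatbot):

  • Claude: 1 − 0.90 × 0.7 + 0.25 × 0.3 = 0.445 → effective ≈ base × 0.45
  • GPT-5: 1 − 0.50 × 0.7 = 0.65 → effective ≈ base × 0.65
  • Gemini implicit: 1 − 0.75 × 0.7 = 0.475 → effective ≈ base × 0.48
  • DeepSeek-v4: 1 − 0.75 × 0.7 = 0.475 → effective ≈ base × 0.48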

Multiply by each vendor’s actual base rate (different across providers) to get the comparable dollar number. Score: calculate effective_cost for your workload; lower is better.

4.2 Hit-Rate Predictability

  • Explicit cachers (Claude, Qwen explicit, Gemini explicit) — high predictability. You marked it, it hits (within TTL).
  • Automatic cachers (GPT-5, DeepSeek-v4, Gemini implicit, Qwen implicit) — depends on prefix similarity and provider load (LRU eviction).

For SLAs tied to cost, prefer explicit. For best-effort optimization, automatic is fine.

4.3 TTL ↔ Traffic Cadence Fit

| Traffic pattern | What you need |
|---|---|
| Continuous (seconds between calls) | Any provider’s default works |
| Session-bound (minutes) | 5–60 min TTL (Claude, GPT-5, Qwen) |
| Bursty (hours between bursts) | 1-hour+ TTL (Claude 1h, Gemini explicit, DeepSeek-v4) |
| Sporadic (queries per day) | 24-hour TTL (Gemini explicit) or accept cold writes |

4.4 Latency Under Cache Miss

A provider fast on hits but slow on misses is still problematic if your hit rate isn’t high. Compare both numbers from §3.3 and weight by expected hit rate.
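A hit-rate-weighted latency is a one-line expectation. The sketch below uses rounded figures from the §3.3 table for two models; note that the table pairs miss total time with hit TTFT, so the result is a rough ranking aid rather than a true end-to-end latency. The function name is hypothetical.

```python
def expected_latency_s(hit_rate: float, hit_latency_s: float, miss_latency_s: float) -> float:
    """Latency expectation weighted by the probability of a cache hit."""
    return hit_rate * hit_latency_s + (1.0 - hit_rate) * miss_latency_s


# (hit TTFT, miss total) in seconds, taken from the section 3.3 table.
measured = {
    "gpt-5.4-mini": (0.73, 3.6),
    "claude-sonnet-4-5": (1.76, 2.0),
}

for hit_rate in (0.7, 0.3):
    for model, (hit_s, miss_s) in measured.items():
        print(f"hit rate {hit_rate:.0%} {model}: {expected_latency_s(hit_rate, hit_s, miss_s):.2f} s")
```

On these numbers gpt-5.4-mini comes out ahead at a 70% hit rate, while claude-sonnet-4-5 comes out ahead at 30%, which is exactly the reversal this dimension is meant to catch.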

4.5 API Ergonomics & Migration Cost

  • Lowest migration: GPT-5 ↔ DeepSeek-v4 (both OpenAI-shaped, both auto-cache).
  • Medium: GPT-5 → Gemini implicit (different SDK, no cache code to rewrite).
  • High: GPT-5 → Claude (must add cache_control, restructure prompt layers).
  • Highest: any single → multi-provider without a gateway (multiple cache APIs).
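Once each dimension has a score, the weighing step described at the start of this section can be sketched as a normalized weighted sum. Everything concrete here is a placeholder: the dimension keys, the 1-to-5 scale, and the example scores and weights are hypothetical and should be replaced with your own assessment.

```python
DIMENSIONS = ("cost", "predictability", "ttl_fit", "miss_latency", "ergonomics")


def weighted_score(scores: dict, weights: dict) -> float:
    """Combine per-dimension scores into one number using normalized weights."""
    missing = [d for d in DIMENSIONS if d not in scores or d not in weights]
    if missing:
        raise KeyError(f"missing dimensions: {missing}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


# Placeholder values on a 1-5 scale for an imaginary cost-sensitive workload.
example_weights = {"cost": 3, "predictability": 2, "ttl_fit": 2, "miss_latency": 1, "ergonomics": 2}
example_scores = {"cost": 4, "predictability": 5, "ttl_fit": 3, "miss_latency": 3, "ergonomics": 2}
print(f"weighted score: {weighted_score(example_scores, example_weights):.2f}")
```

Normalizing by the total weight keeps the result on the same scale as the inputs, so scores for different providers stay directly comparable even if you change the weights.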

5. Quick Verdicts by Workload Shape

| Workload | Pick | Why |
|---|---|---|
| English chat, global users | claude-haiku-4-5 or gpt-5.4-nano | Deep cache discount + small fast model |
| Chinese chat, mainland | deepseek-v4-flash or qwen3.5-flash | Hour-scale cache + low cost on CN-language |
| English RAG (high quality) | claude-sonnet-4-5 + multi-breakpoint | Layered prompt structure caches efficiently |
| Chinese RAG (cost-sensitive) | deepseek-v4-flash | 64-token granularity tolerates retrieval reordering |
| Long-document Q&A (sporadic) | gemini-2.5-pro explicit | 24-hour TTL, designed for this |
| Existing GPT codebase, no rewrite | gpt-5.4-mini (status quo) | ~50% savings free |
| Complex agents (15+ steps) | claude-sonnet-4-5 + 4-BP cache_control | 85%+ hit rate on agent traffic |
| Multi-provider portability | Gateway, any model | One SDK, one auth header |

6. Migration Considerations

If your scoring says switch, three things to plan for:

Data movement. Cached prefixes don’t transfer between providers — every switch is a cold start. Budget for several hours of higher-than-normal cost during warm-up.

Prompt re-architecting. Anthropic’s multi-breakpoint design encourages a layered prompt structure that’s actually better for any provider — refactoring once benefits non-Claude paths too.

Hedging through a gateway. If you’re not sure, route through a Token Gateway. You keep optionality without committing to a single vendor, at the cost of one extra hop and (depending on the gateway) potentially losing access to vendor-specific cache controls. See Part 3 §9 for what the Synthorai gateway actually does vs claims you should be skeptical of.


7. What Changes Over Time

A note on durability of these comparisons: the numbers in this article will move. Caching has become a price-competitive feature, and providers update their offerings every few months. Two things to watch:

  • TTL extensions. Anthropic’s 1-hour option is GA; Gemini may stretch to multi-day. Expect TTL anxiety to ease.
  • Granularity. OpenAI and Anthropic are likely to drop their 1,024-token minimum eventually; DeepSeek’s 64-token bar set the new expectation.

When discounts converge, the differentiator becomes API ergonomics and latency — not headline savings.


Coming next: Part 3 — Prompt Caching Tutorial: Working Python takes the architectural picture above and turns it into runnable code with the §3.3 latency table reproduced as a benchmark you can run yourself.


FAQ

Which LLM provider has the cheapest prompt caching, all things considered? At equal hit rate (~75%), deepseek-v4-flash for Chinese workloads and gemini-2.5-flash implicit for English are the cheapest effective-cost-per-million in our 2026-05 measurements. claude-sonnet-4-5 has the deepest single-call discount (~90%) but a higher base price — it wins when hit rate is >85%. Plug your own hit rate into the §4.1 formula.

Why does Gemini cost more on low-volume workloads? The hourly storage fee for explicit caches eats the discount unless you query the cache frequently. For low-volume workloads, use Gemini implicit caching (no storage fee, cached tokens billed at ~25% of the input rate, i.e. a ~75% discount).

Can I use Claude’s cache_control with OpenAI? Not directly — they’re separate cache implementations. On the OpenAI-compat /chat/completions endpoint, the field is typically a no-op against non-Anthropic models (those auto-cache anyway). For Claude specifically, use the Anthropic-native /v1/messages endpoint with the markers.

Is DeepSeek’s MLA architecture proprietary? The paper (DeepSeek-AI 2024) is public. Other providers could adopt MLA-style KV compression, but it requires retraining the base model — not a runtime switch. As of 2026-05, DeepSeek remains the only major provider shipping it in production.

What about open-source self-hosted models? vLLM, SGLang, and other inference engines support prefix caching natively (the PagedAttention paper is the basis). If you self-host on H100s/H200s, you can implement disk-backed caching with LMCache or similar. The pricing analysis here only applies to managed services — self-hosted economics are entirely different.

Why no Mistral, Cohere, or Llama API providers in this comparison? Their caching offerings are less mature as of 2026-05. Mistral’s caching is in early access; Cohere doesn’t expose explicit caching; Llama API providers (Groq, Together, Replicate) vary widely. Revisit when their feature sets stabilize.


Sources: Anthropic Prompt Caching · OpenAI Prompt Caching · Google Gemini Context Caching · DeepSeek KV Cache · Alibaba Bailian Context Cache · DeepSeek-V2 / MLA paper · PagedAttention / vLLM (Kwon et al. 2023). Measured numbers from https://synthorai.io/v1 on 2026-05-25.
