LLM Prompt Caching #4: Best Model for Chat, RAG & Agents
Contents
- 0. The Universal Cost Formula
- Use Case 1: Chatbots, Customer Support, and Assistants
- Traffic Profile
- Why Chat Caches Almost By Itself
- Model Recommendations (Measured 2026-05)
- Minimal Production Code
- Chatbot Pitfalls
- Use Case 2: API Workloads (RAG, Content Generation, Batch Processing)
- Traffic Profile
- The Hard Problem: Retrieval Reorders Your Prefix
- TTL Considerations for API Workloads
- Model Recommendations by Task
- RAG Cost Sketch (100K queries/day)
- RAG / API Pitfalls
- Use Case 3: AI Agents (Multi-Step Reasoning, Tool Use, Long Chains)
- Traffic Profile
- Why Agents Depend on Caching
- TTL Match — The One Use Case Where It Matters
- Agent Model Recommendations
- Real Cost Estimate: A 15-Step Agent Task
- Agent Pitfalls
- The Master Decision Matrix
- TTL Quick Reference by Use Case
- What This Gateway Does and Doesn’t Do
- Final Takeaway
- FAQ
TL;DR — Picking the “best” LLM isn’t one benchmark question — it depends on whether you’re shipping a chatbot, a RAG/batch API, or an AI agent. Each shape has a different prompt structure, hit-rate profile, TTL fit, and latency tolerance, which dictates a different optimal pairing of model + caching strategy. This guide builds on the measured numbers in Part 3 — same gateway, same OpenAI SDK, swap the model field per call.
Series: Part 4 of 4 · Previously: Part 1 — Caching Principles · Part 2 — Provider Comparison & Evaluation · Part 3 — Working Code Tutorial
0. The Universal Cost Formula
Before we dive into use cases, here’s the equation every choice should optimize:
per-call cost = (input_uncached × P_in)
              + (input_cached × P_in × cache_discount)
              + (output × P_out)
per-call TTFT ≈ prefill_time × (1 - hit_rate)
              + decode_time
Four levers:
- Lower the unit price (P_in/P_out) → pick a cheaper model.
- Raise hit rate → restructure your prompt; match TTL to your traffic cadence.
- Lower the cache discount coefficient → pick a provider with stronger caching.
- Pick a provider whose cached prefill is fastest → latency matters for UX.
Each use case below pulls these levers differently.
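As a rough illustration of the cost formula above, here is a minimal Python sketch. The function name is ours, and the prices and the 0.1 discount are made-up placeholders rather than any provider's real rates.

```python
def per_call_cost(input_uncached: int, input_cached: int, output: int,
                  p_in: float, p_out: float, cache_discount: float) -> float:
    """Per-call cost in USD; p_in and p_out are USD per token."""
    return (input_uncached * p_in
            + input_cached * p_in * cache_discount
            + output * p_out)


# Hypothetical prices: $1 per 1M input tokens, $4 per 1M output tokens,
# cached input billed at 10% of the normal input rate.
P_IN, P_OUT, DISCOUNT = 1e-6, 4e-6, 0.1

# Same 8,000-token prompt and 300-token answer, cold vs. 90% cached.
cold = per_call_cost(8_000, 0, 300, P_IN, P_OUT, DISCOUNT)
warm = per_call_cost(800, 7_200, 300, P_IN, P_OUT, DISCOUNT)
print(f"cold=${cold:.5f}  warm=${warm:.5f}")
```

With these placeholder numbers the output cost is identical in both calls, so the whole difference comes from the hit-rate and discount levers.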
Use Case 1: Chatbots, Customer Support, and Assistants
Traffic Profile
- Each request = long system prompt (persona + knowledge + rules) + multi-turn history + new user message.
- Average context: 4K–20K tokens.
- Users are extremely sensitive to time-to-first-token (>2 s feels broken).
- Within a session, requests come seconds-to-minutes apart — well inside any provider’s cache TTL.
Why Chat Caches Almost By Itself
Chat is the most cache-friendly workload. Within a single session:
Request 1: [system: 8K] + [history: 0] + [user: Q1]
Request 2: [system: 8K] + [history: 200] + [user: Q2]
Request 3: [system: 8K] + [history: 400] + [user: Q3]
↑──────── prefix is monotonically growing ────────↑
If inter-message gaps stay under TTL (a few minutes at every provider), the system prompt portion clears 90%+ hit rate without effort. You don’t need keep-alives.
Model Recommendations (Measured 2026-05)
| User segment | Recommended model | Typical cached TTFT* | Notes |
|---|---|---|---|
| Global, cost-first | gpt-5.4-nano | 1.0 s | Cheapest in our measured set; 85% cache hit |
| Global, balanced quality/cost | gpt-5.4-mini | 0.73 s | Fastest cached TTFT we measured |
| Global, premium feel | claude-haiku-4-5 | 1.35 s | Strong instruction-following at modest premium |
| Chinese-language, cost-first | deepseek-v4-flash | 2.9 s | Disk-backed cache survives hour-scale idle |
| Chinese-language, quality | qwen3-max | 1.5 s | Reports cache hits; verify cost discount on your tenant |
| Premium English reasoning | claude-sonnet-4-5, gpt-5.5-pro, gemini-2.5-pro | model-dependent | Reasoning models — budget max_tokens ≥ 256 |
* Measured against a 7,300-token stable system prompt, single sequential run, no concurrent load. See Part 3 §6 for the full table.
Minimal Production Code
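import os
from openai import OpenAI

# Placeholder: your persona + knowledge + rules. Keep it byte-identical across calls.
STABLE_SYSTEM_PROMPT = "You are a helpful support assistant."

client = OpenAI(
    api_key=os.environ["SYNTHORAI_KEY"],
    base_url="https://synthorai.io/v1",
)

def chat(history: list, user_msg: str):
    """Send one turn; stable content goes first so the prefix can be cached."""
    return client.chat.completions.create(
        model="gpt-5.4-mini",
        max_tokens=512,
        messages=[
            {"role": "system", "content": STABLE_SYSTEM_PROMPT},  # front
            *history,                                             # middle
            {"role": "user", "content": user_msg},                # back
        ],
    )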
That’s it. Caching is automatic for every model listed above; no marker is required. Read resp.usage.prompt_tokens_details.cached_tokens to confirm hits during development.
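A small helper can make that development check less error-prone. The sketch below assumes the usage object exposes `prompt_tokens` and `prompt_tokens_details.cached_tokens` as described above; the helper name is ours, and a `SimpleNamespace` stands in for `resp.usage` so the example runs without a network call.

```python
from types import SimpleNamespace


def cached_fraction(usage) -> float:
    """Share of prompt tokens served from cache; 0.0 if the fields are absent."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    prompt = getattr(usage, "prompt_tokens", 0) or 0
    return cached / prompt if prompt else 0.0


# Stand-in for resp.usage (hypothetical numbers).
fake_usage = SimpleNamespace(
    prompt_tokens=7_500,
    prompt_tokens_details=SimpleNamespace(cached_tokens=6_000),
)
print(f"cache hit share: {cached_fraction(fake_usage):.0%}")
```

In real code you would pass `resp.usage` from the `chat()` call instead of the stand-in.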
Chatbot Pitfalls
- ❌ Don’t bake the current timestamp into the system prompt ("Today is 2026-05-25 14:30:25"). Second-precision invalidates every cache.
- ❌ Don’t re-stitch history each turn; keep message-array ordering byte-identical and append-only.
- ✅ Put user-persona data in the first user message, not the system prompt — per-user variation then doesn’t poison the shared prefix.
- ✅ For sessions that go cold past TTL, send a 1-token keep-alive ping (see Part 3 §8.2) before the user’s next message lands.
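One way to apply the first and third points together is sketched below. It is an assumption-laden example, not the series' reference code: the helper name, the placeholder system prompt, and the choice to carry a day-precision date in the first user message are all ours.

```python
from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a support assistant."  # placeholder, shared by all users


def build_messages(persona: str, history: list, user_msg: str, now: datetime) -> list:
    """Keep the system prompt byte-identical; put per-user data after it."""
    # Day precision only, so the text is stable for every call on the same day.
    context = f"User profile: {persona}\nDate: {now:%Y-%m-%d}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # shared, cacheable prefix
        {"role": "user", "content": context},          # per-user, stable within a day
        *history,
        {"role": "user", "content": user_msg},
    ]


t1 = datetime(2026, 5, 25, 14, 30, 25, tzinfo=timezone.utc)
t2 = datetime(2026, 5, 25, 14, 31, 2, tzinfo=timezone.utc)
a = build_messages("Pro plan, prefers email", [], "Hi", t1)
b = build_messages("Pro plan, prefers email", [], "One more question", t2)
print(a[:2] == b[:2])  # the first two messages do not change between calls
```

The system prompt stays shared across all users, and within a session the first two messages stay identical from turn to turn.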
Use Case 2: API Workloads (RAG, Content Generation, Batch Processing)
Traffic Profile
- RAG Q&A: input = stable system + variable retrieved docs + variable query.
- Content generation (marketing copy, code, translation): stable template, varying data.
- Batch processing (document classification, data cleaning): same task at high volume.
- Latency is secondary; per-call cost dominates.
The Hard Problem: Retrieval Reorders Your Prefix
RAG’s central caching problem: retrieved docs change between calls, breaking the prefix mid-prompt.
Request 1: [system: 3K] + [doc_A, doc_B, doc_C] + [user: Q1]
Request 2: [system: 3K] + [doc_B, doc_D, doc_A] + [user: Q2]
↑─ hits ─────↑ ↑──── miss ─────────↑
Three fixes, in increasing complexity:
Fix A — Push retrieved docs to the back, not the front.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},         # ~3K, stable
    {"role": "system", "content": INSTRUCTION_TEMPLATE},  # ~500, stable
    {"role": "user", "content": f"References:\n{retrieved_docs}\n\nQuestion: {q}"},
]
Result: the entire system portion (the stable ~3.5K tokens) caches. Only the user-facing portion misses on each call. This is enough for most production RAG. Measured hit rate on this pattern with gpt-5.4-mini: 80%+ on the system tokens.
Fix B — Deterministic retrieval ordering. Sort retrieved chunks by a stable key (doc_id ascending) rather than relevance score. High-frequency chunks stay at consistent positions and the prefix matches more often. Costs a small accuracy hit on the ranker; usually irrelevant.
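A minimal sketch of Fix B, assuming each retrieved chunk is a dict with a `doc_id` and a relevance `score` (both field names are hypothetical): the ranker still decides which chunks are included, but the order they are written into the prompt comes from the stable key.

```python
def order_chunks(chunks: list[dict]) -> list[dict]:
    """Order retrieved chunks by doc_id so the same set always serializes identically."""
    return sorted(chunks, key=lambda c: c["doc_id"])


# Two queries retrieve the same three docs with different relevance rankings.
run_1 = [{"doc_id": "B", "score": 0.91}, {"doc_id": "A", "score": 0.88}, {"doc_id": "C", "score": 0.70}]
run_2 = [{"doc_id": "C", "score": 0.95}, {"doc_id": "A", "score": 0.81}, {"doc_id": "B", "score": 0.79}]

ids_1 = [c["doc_id"] for c in order_chunks(run_1)]
ids_2 = [c["doc_id"] for c in order_chunks(run_2)]
print(ids_1, ids_2)
```

Both runs now put the documents into the prompt in the same order, so the shared portion can match as a prefix.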
Fix C — Native explicit-cache markers via direct vendor SDKs. If you’re on Anthropic Claude directly (not via this gateway), the multi-cache_control pattern lets you cache “never changes” + “rarely changes” + “per-task changes” as separate breakpoints. Excellent for complex RAG when you can carry an extra SDK.
TTL Considerations for API Workloads
- Continuous traffic (24/7 RAG endpoint): 5-min TTLs work fine — there’s always a next request inside the window.
- Bursty / cron (daily 09:00 batch): use a long-TTL provider (deepseek-v4-flash is the longest-lived in our test set) or run a 1-token keep-alive every TTL/2 during the run window. Pattern in Part 3 §8.2.
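To make the TTL/2 cadence concrete, the sketch below only computes when pings would fire; the ping request itself is the pattern from Part 3 §8.2 and is not reproduced here. The function name is ours, and the 300-second TTL is an assumption based on the 5-minute default mentioned above.

```python
def keepalive_offsets(window_s: int, ttl_s: int) -> list[int]:
    """Seconds from window start at which to send a ping, spaced TTL/2 apart."""
    interval = max(1, ttl_s // 2)
    return list(range(interval, window_s, interval))


# A 15-minute batch window against an assumed 5-minute TTL.
print(keepalive_offsets(window_s=900, ttl_s=300))  # [150, 300, 450, 600, 750]
```

If real requests are already arriving more often than every TTL/2, the pings are redundant and can be skipped.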
Model Recommendations by Task
| Task type | Recommended model | Why |
|---|---|---|
| RAG, English / global | gpt-5.4-mini, gemini-2.5-pro, claude-sonnet-4-5† | Quality + low cached cost |
| RAG, Chinese-heavy | deepseek-v4-flash, qwen3-max | Best Chinese quality at lowest cost |
| Code generation | claude-sonnet-4-5, gpt-5.2-codex / 5.3-codex | Strong reasoning on long code contexts |
| Batch translation | gpt-5.4-nano, gemini-2.5-flash | Cheapest input rate; template caches |
| Structured doc classification | qwen3.5-flash | Cheap, fast, well-suited to short rule prompts |
† Claude’s multi-cache_control markers are unmatched for layered RAG — use the anthropic SDK pointed at the gateway, see Part 3 §2.
RAG Cost Sketch (100K queries/day)
3K system + 5K retrieved docs + 200-token query + 300-token output. Numbers are scaled from the measured single-call costs in Part 3 §6 — single-tenant, no concurrent load.
| Approach | Per-call estimate | Monthly (100K/day) |
|---|---|---|
| gpt-5.4-mini, no cache | ~$0.005 | ~$15K |
| gpt-5.4-mini, 80% hit on system tokens | ~$0.0035 | ~$10K |
| claude-sonnet-4-5, 80% hit (multi-cache_control BP) | ~$0.004 | ~$12K |
| deepseek-v4-flash, 80% hit | ~$0.0009 | ~$2.7K |
Treat as order-of-magnitude. Real production has concurrent calls, bursts, and your retrieved-doc length distribution will dominate the math.
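The monthly column is the per-call estimate multiplied by 100K calls/day over a 30-day month (the month length is our assumption). A short sketch of that arithmetic, using the per-call figures from the table:

```python
def monthly_cost(per_call_usd: float, calls_per_day: int, days: int = 30) -> float:
    """Scale a per-call estimate to a month of steady traffic."""
    return per_call_usd * calls_per_day * days


estimates = {
    "gpt-5.4-mini, no cache": 0.005,
    "gpt-5.4-mini, 80% hit": 0.0035,
    "claude-sonnet-4-5, 80% hit": 0.004,
    "deepseek-v4-flash, 80% hit": 0.0009,
}
for label, per_call in estimates.items():
    print(f"{label}: ~${monthly_cost(per_call, 100_000):,.0f}/month")
```

Swapping in your own per-call figure and daily volume gives a first-pass budget before any production measurement.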
RAG / API Pitfalls
- ❌ Don’t sort retrieved chunks by dynamic relevance score — every request gets a unique prefix.
- ❌ Don’t drop usage logs when streaming, or your cost attribution falls apart. Pass stream_options={"include_usage": True} and store prompt_tokens_details.cached_tokens and usage.cost.
- ✅ For batch tasks, stack vendor Batch APIs (OpenAI Batch, Anthropic Message Batches) on top of caching for another ~50% off. This is done outside this gateway by calling the provider directly.
Use Case 3: AI Agents (Multi-Step Reasoning, Tool Use, Long Chains)
Traffic Profile
- One agent task = many LLM calls, interleaved with tool results.
- Very long context (system + tools + accumulated history): typically 30K–100K tokens by step 10.
- Highly structured prompts: long stable prefix, small variable tail.
- Both latency and cost matter — each extra second of prefill adds visible wait, and a 15-step agent multiplies that 15×.
Why Agents Depend on Caching
Each step appends the prior step’s tool call and result to the prompt. Without caching, every step re-pays prefill on tens of thousands of tokens.
Step 1: [system: 5K] + [tools: 3K]
Step 2: [system: 5K] + [tools: 3K] + [call_1: 1K] + [result_1: 2K]
Step 3: [system: 5K] + [tools: 3K] + [call_1: 1K] + [result_1: 2K]
+ [call_2: 1K] + [result_2: 5K]
↑──── prefix grows monotonically — perfect for caching ────↑
Critical rule: tool calls and results must be append-only and byte-identical across steps. Any rewrite or reorder kills the cache from that point onward. The single most common agent bug is “I cleaned up the tool result before re-sending” → cache rate drops to zero → cost and latency multiply.
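A cheap guard against this bug is to assert, before each call, that the new message list is a pure extension of the previous one. The sketch below is one possible check, not part of the gateway or any SDK; it compares serialized messages as a proxy for the bytes actually sent, and the message contents are invented.

```python
import json


def is_append_only(prev: list[dict], new: list[dict]) -> bool:
    """True if `new` starts with `prev`, message by message, in serialized form."""
    if len(new) < len(prev):
        return False

    def dump(message: dict) -> str:
        return json.dumps(message, ensure_ascii=False)

    return all(dump(a) == dump(b) for a, b in zip(prev, new))


step_1 = [
    {"role": "system", "content": "You are an agent."},
    {"role": "tool", "content": '{"rows": 3,  "status": "ok"}'},
]
good = step_1 + [{"role": "tool", "content": '{"rows": 9}'}]
# "Cleaned up" variant: whitespace in an earlier tool result was normalized.
bad = [step_1[0], {"role": "tool", "content": '{"rows": 3, "status": "ok"}'}, good[-1]]

print(is_append_only(step_1, good), is_append_only(step_1, bad))  # True False
```

Running a check like this in tests or behind a debug flag catches the "cleaned up the tool result" mistake before it shows up as a cost spike.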
TTL Match — The One Use Case Where It Matters
A typical agent task runs 10–60 seconds; inside a single task, default 5-min TTLs are fine. But agents that wait on a human approval (“review this plan and respond”) can sit idle for minutes. If the human pauses for 10 minutes and the cache has gone cold, your follow-up step re-pays prefill on 50K tokens. For those workflows, either:
- Use a provider with longer TTL (deepseek-v4-flash is the longest-lived in our test set), or
- Send a TTL/2 keep-alive ping while waiting (see Part 3 §8.2).
Agent Model Recommendations
Agents demand reasoning capability — pick on quality first, then optimize cost.
| Complexity | Primary model | Why |
|---|---|---|
| Simple ReAct (≤5 steps) | gpt-5.4-mini, qwen3-max | Fast, cheap, enough quality |
| Mid-complexity (5–15 steps) | claude-sonnet-4-5†, gpt-5.4-mini, gemini-2.5-pro | Better reasoning at moderate cost |
| Complex multi-modal / long planning | claude-opus-4-5†, gpt-5.5-pro, gemini-3.1-pro-preview | Top-tier; budget accordingly |
| Chinese-language stack | qwen3-max (planning), deepseek-v4-flash (execution) | Strongest Chinese reasoning + lowest execution cost |
† Claude’s 4-cache_control-marker pattern remains the strongest setup for agent caching (cumulative prefix discount across 10+ steps). Use the anthropic SDK pointed at the gateway — see Part 3 §2 for the exact payload shape and TTL options.
Real Cost Estimate: A 15-Step Agent Task
Assume 5K system + 3K tools + ~3K appended per step, 15 steps total. Per-call cost from Part 3 §6 scaled to the agent shape:
| Approach | Per step (cached) | 15-step task |
|---|---|---|
| claude-sonnet-4-5 + 4-BP cache_control, ~90% hit | ~$0.003 | ~$0.05 |
| gpt-5.4-mini, prefix-stable, ~90% hit | ~$0.003 | ~$0.05 |
| gpt-5.5-pro, prefix-stable, ~90% hit | ~$0.025 | ~$0.40 |
| deepseek-v4-flash, prefix-stable, ~90% hit | ~$0.0005 | ~$0.01 |
| gpt-5.4-mini, no cache discipline | ~$0.025 | ~$0.40 |
Once again, sketch numbers. The dominant variable is whether you actually keep the prefix byte-identical step to step.
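To see where a figure like "~90% hit" can come from, the sketch below counts tokens for the assumed shape (8K of system + tools, ~3K appended per step, 15 steps). It is an idealized model of our own: it assumes every token from the previous step's prompt is served from cache and ignores TTL expiry and minimum cacheable sizes.

```python
def agent_token_totals(base: int, per_step: int, steps: int) -> dict:
    """Total, fresh (uncached), and cached input tokens over one agent task."""
    total_input = 0
    fresh = 0
    for k in range(steps):
        total_input += base + per_step * k        # prompt size at step k + 1
        fresh += base if k == 0 else per_step     # only the newly appended part is uncached
    cached = total_input - fresh
    return {"total": total_input, "fresh": fresh, "cached": cached,
            "hit_rate": cached / total_input}


totals = agent_token_totals(base=8_000, per_step=3_000, steps=15)
print(totals["total"], totals["fresh"], totals["cached"], f"{totals['hit_rate']:.1%}")
```

Under these assumptions the task sends 435,000 input tokens in total, of which only 50,000 are new, which is a hit rate of about 88.5%. Without a stable prefix, all 435,000 are billed at the full input rate.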
Agent Pitfalls
- ❌ Don’t rebuild the messages list each step — keep the array byte-identical, append only.
- ❌ Don’t trim or reformat tool results — any byte change invalidates downstream cache.
- ❌ Don’t share a cache key across concurrent agent instances — their step orderings diverge and contaminate each other.
- ✅ Monitor cache_creation_tokens : cache_read_tokens per task; a healthy ratio is 1:50 or better by step 10.
The Master Decision Matrix
┌─ Chinese-heavy ─→ deepseek-v4-flash + auto cache
┌─ High ─→│
│ └─ Global users ──→ gpt-5.4-nano / claude-haiku-4-5
Chatbot ──────→│
│ ┌─ Quality-first ─→ gpt-5.4-mini / claude-sonnet-4-5
└─ Mid ──→│
└─ Balanced ──────→ gemini-2.5-flash / qwen3-max
┌─ Chinese RAG ───→ deepseek-v4-flash / qwen3-max
┌─ Live ─→│
│ └─ English RAG ───→ gpt-5.4-mini / claude-sonnet-4-5†
API ──────────→│
│ ┌─ Translation ───→ gpt-5.4-nano (template caches)
└─ Batch →│
└─ Doc review ────→ qwen3.5-flash + Batch APIs
┌─ Simple ────────→ deepseek-v4-flash / qwen3-max
┌─ China ─→│
│ └─ Complex ───────→ qwen3-max (plan) + deepseek (execute)
Agent ────────→│
│ ┌─ Simple ────────→ gpt-5.4-mini + auto
└─ Global →│
└─ Complex ───────→ claude-sonnet-4-5† / gpt-5.5-pro
† Claude with multi-`cache_control` breakpoints via the `anthropic` SDK pointed at the gateway (see Part 3 §2)
TTL Quick Reference by Use Case
| Use case | TTL strategy | Why |
|---|---|---|
| Live chat | Auto (5 min default) | Natural cadence keeps cache warm |
| RAG API (continuous) | Auto | High request rate; no need for longer |
| RAG API (bursty / cron) | Keep-alive ping | Avoid cold-start writes between bursts |
| Agent (no human-in-loop) | Auto | Task duration < TTL anyway |
| Agent (with approval steps) | Keep-alive or deepseek-v4-flash | Survive review wait time |
| Cold storage (large doc, sporadic queries) | deepseek-v4-flash (disk-backed) | Survives hour-scale idle |
What This Gateway Does and Doesn’t Do
To set expectations honestly:
| The gateway does | The gateway does not |
|---|---|
| One base_url, one auth header, every model | Auto-pick a model for you (no meta-router) |
| usage.cost in USD per call, no pricing matrix | Inject cache_control markers into your prompts |
| Standard cached_tokens field across providers | Provide a hosted explicit-cache create endpoint |
| Streaming, function calling, vision per upstream support | Cross-provider failover with cache state migration |
If you need any of the items on the right side today, do it in your application layer or directly against the vendor SDK. The gateway is a thin proxy plus a pricing layer; everything caching-related happens at the model layer upstream.
Final Takeaway
The four-part series compresses to four lines:
- Caching is two wins, not one. Cost AND latency.
- Stable content first, volatile content last. Prefix discipline is free, do it everywhere.
- Match model + cache behavior to the use case. Chat ≠ RAG ≠ Agents.
- Measure on your own traffic. Single-run benchmarks are a starting point, not the answer.
The fastest path from here: pick the use case closest to yours from the matrix above, apply the structural changes (stable-first prefix, deterministic retrieval, byte-identical agent state), log cached_tokens and usage.cost for a week, then re-evaluate.
FAQ
Which LLM is cheapest for a Chinese-language chatbot?
deepseek-v4-flash and qwen3.5-flash are an order of magnitude cheaper than English-tuned models on Chinese text in our test set, while matching gpt-5.4-mini in quality on typical chat workloads.
What’s the best LLM for RAG in 2026?
For English: gpt-5.4-mini with the Fix A prompt layout (system tokens at the front, references at the bottom) gives 80%+ hit rate on the stable portion. For Chinese: deepseek-v4-flash. For very long documents queried often: gemini-2.5-pro (handles 1M+ token context natively).
Should I use GPT or Claude for agents?
Both are strong; the choice depends on how much cache discipline you want to invest in. Claude’s 4-cache_control-marker pattern (via the anthropic SDK against the gateway) is uniquely powerful for cumulative agent prefixes — ~90% input-cost reduction once the prefix is warm, across 10+ steps. If you’d rather stay on the OpenAI-shaped client and accept ~50% cache savings without any markers, gpt-5.4-mini or gpt-5.5-pro is the lower-friction choice.
How much can I realistically save by switching from “naive” to “optimized” LLM usage?
On the measured runs in this series: 50–88% cost reduction and 30–60% TTFT reduction for the same model. Most of the win is from getting your hit rate above 80%, not from picking a different model.
Where do I start?
Pick the use case closest to yours from the matrix. Apply the structural prompt changes. Measure cached_tokens and usage.cost for a week of production traffic. Only then consider switching models.
Sources & verification: Measured numbers from Part 3 §6, https://synthorai.io/v1 on 2026-05-25, openai SDK 2.38.0. Vendor pricing pages: OpenAI · Anthropic · Google Gemini · DeepSeek · Alibaba Bailian.