LLM Prompt Caching #4: Best Model for Chat, RAG & Agents

Contents
  1. 0. The Universal Cost Formula
  2. Use Case 1: Chatbots, Customer Support, and Assistants
  3. Traffic Profile
  4. Why Chat Caches Almost By Itself
  5. Model Recommendations (Measured 2026-05)
  6. Minimal Production Code
  7. Chatbot Pitfalls
  8. Use Case 2: API Workloads (RAG, Content Generation, Batch Processing)
  9. Traffic Profile
  10. The Hard Problem: Retrieval Reorders Your Prefix
  11. TTL Considerations for API Workloads
  12. Model Recommendations by Task
  13. RAG Cost Sketch (100K queries/day)
  14. RAG / API Pitfalls
  15. Use Case 3: AI Agents (Multi-Step Reasoning, Tool Use, Long Chains)
  16. Traffic Profile
  17. Why Agents Depend on Caching
  18. TTL Match — The One Use Case Where It Matters
  19. Agent Model Recommendations
  20. Real Cost Estimate: A 15-Step Agent Task
  21. Agent Pitfalls
  22. The Master Decision Matrix
  23. TTL Quick Reference by Use Case
  24. What This Gateway Does and Doesn’t Do
  25. Final Takeaway
  26. FAQ

TL;DR — Picking the “best” LLM isn’t one benchmark question — it depends on whether you’re shipping a chatbot, a RAG/batch API, or an AI agent. Each shape has a different prompt structure, hit-rate profile, TTL fit, and latency tolerance, which dictates a different optimal pairing of model + caching strategy. This guide builds on the measured numbers in Part 3 — same gateway, same OpenAI SDK, swap the model field per call.

Series: Part 4 of 4 · Previously: Part 1 — Caching Principles · Part 2 — Provider Comparison & Evaluation · Part 3 — Working Code Tutorial


0. The Universal Cost Formula

Before we dive into use cases, here’s the equation every choice should optimize:

per-call cost = (input_uncached × P_in)
              + (input_cached   × P_in × cache_discount)
              + (output × P_out)

per-call TTFT ≈ prefill_time × (1 - hit_rate)
              + decode_time

Four levers:

  1. Lower the unit price (P_in / P_out) → pick a cheaper model.
  2. Raise hit rate → restructure your prompt; match TTL to your traffic cadence.
  3. Lower the cache discount coefficient (the fraction of P_in billed for cached tokens) → pick a provider with stronger caching.
  4. Pick a provider whose cached prefill is fastest → latency matters for UX.

Each use case below pulls these levers differently.
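The two formulas above can be sketched as plain functions. The argument names and the example prices are illustrative assumptions, not measured values; cache_discount is treated as the multiplier on P_in for cached tokens, as in the formula.

```python
def per_call_cost(input_tokens: int, cached_tokens: int, output_tokens: int,
                  p_in: float, p_out: float, cache_discount: float) -> float:
    """Per-call cost in USD; prices are USD per token.

    cache_discount is the multiplier applied to p_in for cached tokens
    (0.1 means cached input is billed at 10% of the normal rate).
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    return (uncached * p_in
            + cached_tokens * p_in * cache_discount
            + output_tokens * p_out)


def per_call_ttft(prefill_time: float, decode_time: float, hit_rate: float) -> float:
    """Approximate time-to-first-token in seconds, per the formula above."""
    return prefill_time * (1 - hit_rate) + decode_time


# Hypothetical prices: $1 per million input tokens, $4 per million output tokens.
P_IN, P_OUT = 1e-6, 4e-6
cold = per_call_cost(8_000, 0, 300, P_IN, P_OUT, cache_discount=0.1)
warm = per_call_cost(8_000, 7_000, 300, P_IN, P_OUT, cache_discount=0.1)
print(f"cold=${cold:.4f}  warm=${warm:.4f}")
```

Plugging in your own provider's prices and an observed hit rate shows which of the four levers moves the total most for your traffic.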


Use Case 1: Chatbots, Customer Support, and Assistants

Traffic Profile

  • Each request = long system prompt (persona + knowledge + rules) + multi-turn history + new user message.
  • Average context: 4K–20K tokens.
  • Users are extremely sensitive to time-to-first-token (>2 s feels broken).
  • Within a session, requests come seconds-to-minutes apart — well inside any provider’s cache TTL.

Why Chat Caches Almost By Itself

Chat is the most cache-friendly workload. Within a single session:

Request 1: [system: 8K] + [history: 0]   + [user: Q1]
Request 2: [system: 8K] + [history: 200] + [user: Q2]
Request 3: [system: 8K] + [history: 400] + [user: Q3]
           ↑──────── prefix is monotonically growing ────────↑

If inter-message gaps stay under the TTL (at least a few minutes at every provider), the system-prompt portion clears a 90%+ hit rate without effort, and an active session needs no keep-alives.

Model Recommendations (Measured 2026-05)

| User segment | Recommended model | Typical cached TTFT* | Notes |
| --- | --- | --- | --- |
| Global, cost-first | gpt-5.4-nano | 1.0 s | Cheapest in our measured set; 85% cache hit |
| Global, balanced quality/cost | gpt-5.4-mini | 0.73 s | Fastest cached TTFT we measured |
| Global, premium feel | claude-haiku-4-5 | 1.35 s | Strong instruction-following at modest premium |
| Chinese-language, cost-first | deepseek-v4-flash | 2.9 s | Disk-backed cache survives hour-scale idle |
| Chinese-language, quality | qwen3-max | 1.5 s | Reports cache hits; verify cost discount on your tenant |
| Premium English reasoning | claude-sonnet-4-5, gpt-5.5-pro, gemini-2.5-pro | model-dependent | Reasoning models; budget max_tokens ≥ 256 |

* Measured against a 7,300-token stable system prompt, single sequential run, no concurrent load. See Part 3 §6 for the full table.

Minimal Production Code

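import os
from openai import OpenAI

# Placeholder: in production this is the long, byte-stable persona + knowledge + rules text.
STABLE_SYSTEM_PROMPT = "You are a helpful support assistant."

# The OpenAI SDK pointed at the gateway; the key is read from the environment.
client = OpenAI(
    api_key=os.environ["SYNTHORAI_KEY"],
    base_url="https://synthorai.io/v1",
)

def chat(history: list[dict], user_msg: str):
    """Send one turn; the stable prompt goes first so the prefix stays cacheable."""
    return client.chat.completions.create(
        model="gpt-5.4-mini",
        max_tokens=512,
        messages=[
            {"role": "system", "content": STABLE_SYSTEM_PROMPT},   # front: stable, cached
            *history,                                              # middle: append-only
            {"role": "user", "content": user_msg},                 # back: new each turn
        ],
    )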

That’s it. Caching is automatic for every model listed above; no marker is required. Read resp.usage.prompt_tokens_details.cached_tokens to confirm hits during development.
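A small helper makes that check repeatable. The sketch below reads the same usage fields named above and returns the fraction of prompt tokens served from cache. The function name is hypothetical, and a stand-in object replaces resp.usage so the example runs without a network call.

```python
from types import SimpleNamespace


def cache_hit_ratio(usage) -> float:
    """Fraction of prompt tokens served from cache, or 0.0 if not reported."""
    prompt_tokens = getattr(usage, "prompt_tokens", 0) or 0
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return cached / prompt_tokens if prompt_tokens else 0.0


# Stand-in for resp.usage (made-up numbers, for illustration only).
fake_usage = SimpleNamespace(
    prompt_tokens=7_500,
    prompt_tokens_details=SimpleNamespace(cached_tokens=6_000),
)
print(f"cache hit ratio: {cache_hit_ratio(fake_usage):.0%}")
```

In development you would call cache_hit_ratio(resp.usage) after each request and expect the value to climb from the second turn of a session onward.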

Chatbot Pitfalls

  • ❌ Don’t bake the current timestamp into the system prompt ("Today is 2026-05-25 14:30:25"). A second-precision timestamp changes the prefix on every request, so nothing after it can hit the cache.
  • ❌ Don’t re-stitch history each turn — keep message-array ordering byte-identical and append-only.
  • ✅ Put user-persona data in the first user message, not the system prompt — per-user variation then doesn’t poison the shared prefix.
  • ✅ For sessions that go cold past TTL, send a 1-token keep-alive ping (see Part 3 §8.2) before the user’s next message lands.
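One way to apply the first and third pitfalls together is sketched below: the system prompt stays byte-identical for every user, persona data goes into the first user message, and the clock only appears in the newest message. The function name, the "About me" wording and the timestamp format are assumptions for illustration, and the sketch assumes your model accepts two consecutive user messages.

```python
from datetime import datetime, timezone

# Placeholder; a real system prompt would be thousands of tokens long.
STABLE_SYSTEM_PROMPT = "You are a support assistant."


def build_messages(persona: str, history: list[dict], user_msg: str,
                   now: datetime) -> list[dict]:
    """Stable-first layout: shared prompt, then per-user data, then the new turn."""
    return [
        {"role": "system", "content": STABLE_SYSTEM_PROMPT},    # identical for all users
        {"role": "user", "content": f"About me: {persona}"},    # fixed for the session
        *history,                                               # append-only
        # The clock lives in the newest message, so earlier bytes never change.
        {"role": "user", "content": f"[{now:%Y-%m-%d %H:%M} UTC] {user_msg}"},
    ]


t1 = datetime(2026, 5, 25, 14, 30, 25, tzinfo=timezone.utc)
t2 = datetime(2026, 5, 25, 14, 31, 2, tzinfo=timezone.utc)
alice = build_messages("Alice, premium plan", [], "Where is my order?", t1)
bob = build_messages("Bob, free plan", [], "Where is my order?", t2)
print(alice[0] == bob[0])  # the system message is shared across users
```

When a turn completes, store the user message exactly as it was sent (timestamp included) before appending it to history, otherwise the next request no longer matches the cached prefix.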

Use Case 2: API Workloads (RAG, Content Generation, Batch Processing)

Traffic Profile

  • RAG Q&A: input = stable system + variable retrieved docs + variable query.
  • Content generation (marketing copy, code, translation): stable template, varying data.
  • Batch processing (document classification, data cleaning): same task at high volume.
  • Latency is secondary; per-call cost dominates.

The Hard Problem: Retrieval Reorders Your Prefix

RAG’s central caching problem: retrieved docs change between calls, breaking the prefix mid-prompt.

Request 1: [system: 3K] + [doc_A, doc_B, doc_C] + [user: Q1]
Request 2: [system: 3K] + [doc_B, doc_D, doc_A] + [user: Q2]
           ↑─ hits ─────↑  ↑──── miss ─────────↑

Three fixes, in increasing complexity:

Fix A — Push retrieved docs to the back, not the front.

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},          # ~3K, stable
    {"role": "system", "content": INSTRUCTION_TEMPLATE},   # ~500, stable
    {"role": "user",   "content": f"References:\n{retrieved_docs}\n\nQuestion: {q}"},
]

Result: the entire system portion (the stable ~3.5K tokens) caches, and only the user message (references plus question) misses on each call. This is enough for most production RAG. Measured hit rate on this pattern with gpt-5.4-mini: 80%+ on the system tokens.

Fix B — Deterministic retrieval ordering. Sort retrieved chunks by a stable key (doc_id ascending) rather than relevance score. High-frequency chunks stay at consistent positions and the prefix matches more often. Costs a small accuracy hit on the ranker; usually irrelevant.
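A minimal sketch of Fix B, assuming each retrieved chunk is a dict with hypothetical "doc_id", "score" and "text" fields: relevance still decides which chunks make the cut, but the final order comes from doc_id.

```python
def order_for_cache(chunks: list[dict], top_k: int = 3) -> list[dict]:
    """Select the top_k chunks by relevance, then order them by a stable key."""
    top = sorted(chunks, key=lambda c: c["score"], reverse=True)[:top_k]
    return sorted(top, key=lambda c: c["doc_id"])


# The same three chunks, ranked differently for two different queries.
query_1 = [
    {"doc_id": 7, "score": 0.91, "text": "doc_B"},
    {"doc_id": 3, "score": 0.88, "text": "doc_A"},
    {"doc_id": 9, "score": 0.75, "text": "doc_C"},
]
query_2 = [
    {"doc_id": 9, "score": 0.95, "text": "doc_C"},
    {"doc_id": 7, "score": 0.64, "text": "doc_B"},
    {"doc_id": 3, "score": 0.61, "text": "doc_A"},
]
order_1 = [c["text"] for c in order_for_cache(query_1)]
order_2 = [c["text"] for c in order_for_cache(query_2)]
print(order_1 == order_2)
```

Both queries now serialize the shared chunks in the same order, so any prefix they have in common stays byte-identical.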

Fix C — Native explicit-cache markers via direct vendor SDKs. If you’re on Anthropic Claude directly (not via this gateway), the multi-cache_control pattern lets you cache “never changes” + “rarely changes” + “per-task changes” as separate breakpoints. Excellent for complex RAG when you can carry an extra SDK.
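As a rough illustration of the three-layer idea, the sketch below only builds the request payload in the shape Anthropic documents for prompt caching (content blocks carrying a cache_control field of type "ephemeral"). The function name and the layer names are hypothetical, and where the request is sent is left out of the sketch; see Part 3 §2 for the payload the series actually uses.

```python
def _breakpoint() -> dict:
    """A fresh cache_control marker for one breakpoint."""
    return {"type": "ephemeral"}


def build_claude_request(never_changes: str, rarely_changes: str,
                         per_task: str, question: str) -> dict:
    """Keyword arguments intended for anthropic.Anthropic().messages.create(**kwargs)."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 512,
        "system": [
            # Breakpoint 1: content that never changes.
            {"type": "text", "text": never_changes, "cache_control": _breakpoint()},
            # Breakpoint 2: content that changes rarely (e.g. a knowledge snapshot).
            {"type": "text", "text": rarely_changes, "cache_control": _breakpoint()},
        ],
        "messages": [{
            "role": "user",
            "content": [
                # Breakpoint 3: per-task context shared by several questions.
                {"type": "text", "text": per_task, "cache_control": _breakpoint()},
                # Volatile tail: no marker, so it is never written to the cache.
                {"type": "text", "text": question},
            ],
        }],
    }


request = build_claude_request("RULES", "KNOWLEDGE", "TASK CONTEXT", "What changed in v2?")
print(len(request["system"]), "system blocks")
```

Anthropic documents a minimum cacheable prompt length and a limit on the number of breakpoints per request, so very short layers are better merged than marked separately.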

TTL Considerations for API Workloads

  • Continuous traffic (24/7 RAG endpoint): 5-min TTLs work fine — there’s always a next request inside the window.
  • Bursty / cron (daily 09:00 batch): use a long-TTL provider (deepseek-v4-flash is the longest-lived in our test set) or run a 1-token keep-alive every TTL/2 during the run window. Pattern in Part 3 §8.2.
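The TTL/2 rule for a run window can be sketched as a small scheduling function. This is not the Part 3 §8.2 code; the function name is hypothetical, and the 5-minute TTL in the example is the default mentioned above.

```python
def keepalive_offsets(window_seconds: float, ttl_seconds: float) -> list[float]:
    """Offsets (seconds from window start) at which to send a keep-alive ping."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    interval = ttl_seconds / 2
    offsets = []
    i = 1
    while i * interval < window_seconds:
        offsets.append(i * interval)
        i += 1
    return offsets


# A 20-minute batch window against a 5-minute TTL.
schedule = keepalive_offsets(window_seconds=20 * 60, ttl_seconds=5 * 60)
print(f"{len(schedule)} pings, first at {schedule[0]:.0f}s")
```

In practice a ping is only needed when no real request has touched the prefix within the last TTL/2, so a busy batch may skip most of the schedule.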

Model Recommendations by Task

| Task type | Recommended model | Why |
| --- | --- | --- |
| RAG, English / global | gpt-5.4-mini, gemini-2.5-pro, claude-sonnet-4-5† | Quality + low cached cost |
| RAG, Chinese-heavy | deepseek-v4-flash, qwen3-max | Best Chinese quality at lowest cost |
| Code generation | claude-sonnet-4-5, gpt-5.2-codex / 5.3-codex | Strong reasoning on long code contexts |
| Batch translation | gpt-5.4-nano, gemini-2.5-flash | Cheapest input rate; template caches |
| Structured doc classification | qwen3.5-flash | Cheap, fast, well-suited to short rule prompts |

† Claude’s multi-cache_control markers are unmatched for layered RAG — use the anthropic SDK pointed at the gateway, see Part 3 §2.

RAG Cost Sketch (100K queries/day)

3K system + 5K retrieved docs + 200-token query + 300-token output. Numbers are scaled from the measured single-call costs in Part 3 §6 — single-tenant, no concurrent load.

| Approach | Per-call estimate | Monthly (100K/day) |
| --- | --- | --- |
| gpt-5.4-mini, no cache | ~$0.005 | ~$15K |
| gpt-5.4-mini, 80% hit on system tokens | ~$0.0035 | ~$10K |
| claude-sonnet-4-5, 80% hit (multi-cache_control BP) | ~$0.004 | ~$12K |
| deepseek-v4-flash, 80% hit | ~$0.0009 | ~$2.7K |

Treat as order-of-magnitude. Real production has concurrent calls, bursts, and your retrieved-doc length distribution will dominate the math.
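The monthly column is per-call cost times volume. The sketch below reproduces that arithmetic from the per-call estimates in the table, assuming a 30-day month; the function name is hypothetical.

```python
def monthly_cost(per_call_usd: float, queries_per_day: int, days: int = 30) -> float:
    """Monthly spend in USD for a fixed per-call cost and daily volume."""
    return per_call_usd * queries_per_day * days


# Per-call estimates copied from the table above.
estimates = {
    "gpt-5.4-mini, no cache": 0.005,
    "gpt-5.4-mini, 80% hit on system tokens": 0.0035,
    "claude-sonnet-4-5, 80% hit": 0.004,
    "deepseek-v4-flash, 80% hit": 0.0009,
}
for label, per_call in estimates.items():
    print(f"{label}: ~${monthly_cost(per_call, 100_000):,.0f}/month")
```

Swapping in your own per-call figure (for example the average usage.cost from a day of logs) gives a projection grounded in real traffic rather than a single-call measurement.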

RAG / API Pitfalls

  • ❌ Don’t sort retrieved chunks by dynamic relevance score — every request gets a unique prefix.
  • ❌ Don’t drop usage logs when streaming — your cost attribution falls apart. Pass stream_options={"include_usage": True} and store prompt_tokens_details.cached_tokens and usage.cost.
  • ✅ For batch tasks, stack vendor Batch APIs (OpenAI Batch, Anthropic Message Batches) on top of caching for another ~50% off — done outside this gateway by calling the provider directly.

Use Case 3: AI Agents (Multi-Step Reasoning, Tool Use, Long Chains)

Traffic Profile

  • One agent task = many LLM calls, interleaved with tool results.
  • Very long context (system + tools + accumulated history): typically 30K–100K tokens by step 10.
  • Highly structured prompts: long stable prefix, small variable tail.
  • Both latency and cost matter — each extra second of prefill adds visible wait, and a 15-step agent multiplies that 15×.

Why Agents Depend on Caching

Each step appends the prior step’s tool call and result to the context. Without caching, every step re-pays prefill on tens of thousands of tokens.

Step 1: [system: 5K] + [tools: 3K]
Step 2: [system: 5K] + [tools: 3K] + [call_1: 1K] + [result_1: 2K]
Step 3: [system: 5K] + [tools: 3K] + [call_1: 1K] + [result_1: 2K]
                                   + [call_2: 1K] + [result_2: 5K]
        ↑──── prefix grows monotonically — perfect for caching ────↑

Critical rule: tool calls and results must be append-only and byte-identical across steps. Any rewrite or reorder kills the cache from that point onward. The single most common agent bug is “I cleaned up the tool result before re-sending” → cache rate drops to zero → cost and latency multiply.
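One way to enforce the rule in code is a transcript wrapper that fingerprints everything already sent and refuses to continue if it has changed. This is a sketch with hypothetical names, and it compares JSON serializations as a proxy for the bytes your HTTP client actually sends.

```python
import hashlib
import json


class AppendOnlyTranscript:
    """Message list that can only grow, with a fingerprint check before each send."""

    def __init__(self, messages: list[dict]):
        self._messages = list(messages)
        self._sent_count = 0
        self._sent_digest = self._digest(0)

    def _digest(self, n: int) -> str:
        """SHA-256 of the first n messages, serialized in insertion order."""
        blob = json.dumps(self._messages[:n], ensure_ascii=False)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def append(self, message: dict) -> None:
        self._messages.append(message)

    def snapshot(self) -> list[dict]:
        """Return the messages to send; fail loudly if the sent prefix changed."""
        if self._digest(self._sent_count) != self._sent_digest:
            raise RuntimeError("previously sent prefix was modified; cache will miss")
        self._sent_count = len(self._messages)
        self._sent_digest = self._digest(self._sent_count)
        return list(self._messages)


transcript = AppendOnlyTranscript([{"role": "system", "content": "SYSTEM + TOOLS"}])
step_1 = transcript.snapshot()
transcript.append({"role": "tool", "content": "raw result, exactly as received"})
step_2 = transcript.snapshot()
print(step_2[: len(step_1)] == step_1)  # the earlier prefix is unchanged
```

If any code path "cleans up" a tool result that was already sent, the next snapshot() raises instead of silently dropping the hit rate.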

TTL Match — The One Use Case Where It Matters

A typical agent task runs 10–60 seconds; inside a single task, default 5-min TTLs are fine. But agents that wait on a human approval (“review this plan and respond”) can sit idle for minutes. If the human pauses for 10 minutes and the cache has gone cold, your follow-up step re-pays prefill on 50K tokens. For those workflows, either:

  • Use a provider with longer TTL (deepseek-v4-flash is the longest-lived in our test set), or
  • Send a TTL/2 keep-alive ping while waiting (see Part 3 §8.2).
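The second option can be sketched as a polling loop that pings whenever TTL/2 has passed since the prefix was last touched. This is an assumption-laden outline rather than the Part 3 §8.2 code: is_approved and ping are hypothetical callables you supply (ping would send the 1-token request with the full prefix), and the clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def wait_with_keepalive(is_approved: Callable[[], bool],
                        ping: Callable[[], None],
                        ttl_seconds: float,
                        poll_seconds: float = 1.0,
                        clock: Callable[[], float] = time.monotonic,
                        sleep: Callable[[float], None] = time.sleep) -> int:
    """Block until is_approved() returns True, pinging every TTL/2.

    Returns the number of pings sent while waiting.
    """
    interval = ttl_seconds / 2
    last_touch = clock()
    pings = 0
    while not is_approved():
        if clock() - last_touch >= interval:
            ping()                # keeps the cached prefix warm
            pings += 1
            last_touch = clock()
        sleep(poll_seconds)
    return pings
```

Each ping is itself a billed call, so this only pays off when re-running prefill on the full context would cost more than the pings sent during a typical review wait.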

Agent Model Recommendations

Agents demand reasoning capability — pick on quality first, then optimize cost.

| Complexity | Primary model | Why |
| --- | --- | --- |
| Simple ReAct (≤5 steps) | gpt-5.4-mini, qwen3-max | Fast, cheap, enough quality |
| Mid-complexity (5–15 steps) | claude-sonnet-4-5†, gpt-5.4-mini, gemini-2.5-pro | Better reasoning at moderate cost |
| Complex multi-modal / long planning | claude-opus-4-5†, gpt-5.5-pro, gemini-3.1-pro-preview | Top-tier; budget accordingly |
| Chinese-language stack | qwen3-max (planning), deepseek-v4-flash (execution) | Strongest Chinese reasoning + lowest execution cost |

† Claude’s 4-cache_control-marker pattern remains the strongest setup for agent caching (cumulative prefix discount across 10+ steps). Use the anthropic SDK pointed at the gateway — see Part 3 §2 for the exact payload shape and TTL options.

Real Cost Estimate: A 15-Step Agent Task

Assume 5K system + 3K tools + ~3K appended per step, 15 steps total. Per-call cost from Part 3 §6 scaled to the agent shape:

| Approach | Per step (cached) | 15-step task |
| --- | --- | --- |
| claude-sonnet-4-5 + 4-BP cache_control, ~90% hit | ~$0.003 | ~$0.05 |
| gpt-5.4-mini, prefix-stable, ~90% hit | ~$0.003 | ~$0.05 |
| gpt-5.5-pro, prefix-stable, ~90% hit | ~$0.025 | ~$0.40 |
| deepseek-v4-flash, prefix-stable, ~90% hit | ~$0.0005 | ~$0.01 |
| gpt-5.4-mini, no cache discipline | ~$0.025 | ~$0.40 |

Once again, sketch numbers. The dominant variable is whether you actually keep the prefix byte-identical step to step.
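To see why prefix discipline dominates, the sketch below works out how many input tokens the 15-step shape above sends in total, and how many full-price token equivalents remain at a given hit rate. The function names are hypothetical, and the 0.1 cached-token multiplier is an assumed value, not a measured one.

```python
def agent_input_tokens(system: int, tools: int, appended_per_step: int,
                       steps: int) -> list[int]:
    """Prompt size at each step when every step appends a fixed amount."""
    base = system + tools
    return [base + appended_per_step * k for k in range(steps)]


def billed_equivalent(total_tokens: int, hit_rate: float, cache_discount: float) -> float:
    """Input tokens expressed as full-price equivalents after caching."""
    return total_tokens * ((1 - hit_rate) + hit_rate * cache_discount)


sizes = agent_input_tokens(5_000, 3_000, 3_000, 15)
total = sum(sizes)
print(f"step 1: {sizes[0]:,}  step 15: {sizes[-1]:,}  total prefill: {total:,}")
print(f"at 90% hit: {billed_equivalent(total, 0.9, 0.1):,.0f} full-price token equivalents")
```

The step-15 prompt is 50,000 tokens and the task sends 435,000 input tokens in total, which is the volume that gets re-billed at full price whenever the prefix stops matching.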

Agent Pitfalls

  • ❌ Don’t rebuild the messages list each step — keep the array byte-identical, append only.
  • ❌ Don’t trim or reformat tool results — any byte change invalidates downstream cache.
  • ❌ Don’t share a cache key across concurrent agent instances — their step orderings diverge and contaminate each other.
  • ✅ Monitor cache_creation_tokens : cache_read_tokens per task — healthy ratio is 1:50 or better by step 10.

The Master Decision Matrix

                            ┌─ Chinese-heavy ─→ deepseek-v4-flash + auto cache
                  ┌─ High ─→│
                  │          └─ Global users ──→ gpt-5.4-nano / claude-haiku-4-5
   Chatbot ──────→│
                  │          ┌─ Quality-first ─→ gpt-5.4-mini / claude-sonnet-4-5
                  └─ Mid ──→│
                            └─ Balanced ──────→ gemini-2.5-flash / qwen3-max

                            ┌─ Chinese RAG ───→ deepseek-v4-flash / qwen3-max
                  ┌─ Live ─→│
                  │          └─ English RAG ───→ gpt-5.4-mini / claude-sonnet-4-5†
   API ──────────→│
                  │          ┌─ Translation ───→ gpt-5.4-nano (template caches)
                  └─ Batch →│
                            └─ Doc review ────→ qwen3.5-flash + Batch APIs

                            ┌─ Simple ────────→ deepseek-v4-flash / qwen3-max
                  ┌─ China ─→│
                  │          └─ Complex ───────→ qwen3-max (plan) + deepseek (execute)
   Agent ────────→│
                  │          ┌─ Simple ────────→ gpt-5.4-mini + auto
                  └─ Global →│
                            └─ Complex ───────→ claude-sonnet-4-5† / gpt-5.5-pro

  † Claude with multi-cache_control breakpoints via the anthropic SDK pointed at the gateway (see Part 3 §2)

TTL Quick Reference by Use Case

| Use case | TTL strategy | Why |
| --- | --- | --- |
| Live chat | Auto (5 min default) | Natural cadence keeps cache warm |
| RAG API (continuous) | Auto | High request rate; no need for longer |
| RAG API (bursty / cron) | Keep-alive ping | Avoid cold-start writes between bursts |
| Agent (no human-in-loop) | Auto | Task duration < TTL anyway |
| Agent (with approval steps) | Keep-alive or deepseek-v4-flash | Survive review wait time |
| Cold storage (large doc, sporadic queries) | deepseek-v4-flash (disk-backed) | Survives hour-scale idle |

What This Gateway Does and Doesn’t Do

To set expectations honestly:

| The gateway does | The gateway does not |
| --- | --- |
| One base_url, one auth header, every model | Auto-pick a model for you (no meta-router) |
| usage.cost in USD per call, no pricing matrix | Inject cache_control markers into your prompts |
| Standard cached_tokens field across providers | Provide a hosted explicit-cache create endpoint |
| Streaming, function calling, vision per upstream support | Cross-provider failover with cache state migration |

If you need any of the items on the right side today, do it in your application layer or directly against the vendor SDK. The gateway is a thin proxy plus a pricing layer; everything caching-related happens at the model layer upstream.


Final Takeaway

The four-part series compresses to four lines:

  • Caching is two wins, not one. Cost AND latency.
  • Stable content first, volatile content last. Prefix discipline is free, do it everywhere.
  • Match model + cache behavior to the use case. Chat ≠ RAG ≠ Agents.
  • Measure on your own traffic. Single-run benchmarks are a starting point, not the answer.

The fastest path from here: pick the use case closest to yours from the matrix above, apply the structural changes (stable-first prefix, deterministic retrieval, byte-identical agent state), log cached_tokens and usage.cost for a week, then re-evaluate.


FAQ

Which LLM is cheapest for a Chinese-language chatbot?
deepseek-v4-flash and qwen3.5-flash are an order of magnitude cheaper than English-tuned models on Chinese text in our test set, while matching gpt-5.4-mini in quality on typical chat workloads.

What’s the best LLM for RAG in 2026?
For English: gpt-5.4-mini with the Fix A prompt layout (system tokens at the front, references at the bottom) gives 80%+ hit rate on the stable portion. For Chinese: deepseek-v4-flash. For very long documents queried often: gemini-2.5-pro (handles 1M+ token context natively).

Should I use GPT or Claude for agents?
Both are strong; the choice depends on how much cache discipline you want to invest in. Claude’s 4-cache_control-marker pattern (via the anthropic SDK against the gateway) is uniquely powerful for cumulative agent prefixes: roughly 90% input-cost reduction across 10+ steps once the prefix is warm. If you’d rather stay on the OpenAI-shaped client and accept ~50% cache savings without any markers, gpt-5.4-mini or gpt-5.5-pro is the lower-friction choice.

How much can I realistically save by switching from “naive” to “optimized” LLM usage?
On the measured runs in this series: 50–88% cost reduction and 30–60% TTFT reduction for the same model. Most of the win is from getting your hit rate above 80%, not from picking a different model.

Where do I start?
Pick the use case closest to yours from the matrix. Apply the structural prompt changes. Measure cached_tokens and usage.cost for a week of production traffic. Only then consider switching models.


Sources & verification: Measured numbers from Part 3 §6, https://synthorai.io/v1 on 2026-05-25, openai SDK 2.38.0. Vendor pricing pages: OpenAI · Anthropic · Google Gemini · DeepSeek · Alibaba Bailian.
