LLM Prompt Caching #3: Working Python Tutorial

Contents
  0. Setup
  1. The Cache-Aware Call (Same on Every Provider)
  2. Anthropic Claude — Explicit cache_control Markers
  3. OpenAI GPT-5.x — Automatic Caching
  4. Google Gemini — Implicit Caching
  5. DeepSeek-v4-flash — Disk-Backed Auto Cache
  6. Alibaba Qwen — Hit Reported, Discount Variable
  7. Cross-Provider Benchmark (Measured 2026-05-25)
  8. Pre-Launch Checklist
  9. TTL-Aware Patterns
     9.1 Session-Bound Workloads (chat, IDE assistants)
     9.2 Heartbeat for Batch / Cron
     9.3 Cold-Storage Documents
  10. What the Gateway Actually Adds
  FAQ

TL;DR — One OpenAI SDK, one base_url, every major LLM. The numbers in this article are measured against the live Synthorai gateway on 2026-05-25 with a ~7,300-token stable system prompt. The point of the gateway here is modest and honest: a single endpoint, a single auth header, and a usage.cost field that saves you from maintaining a per-vendor pricing matrix. The Transformer math behind caching is covered in Part 1: Caching Principles; the per-provider design choices are in Part 2: Provider Comparison.

Series: Part 3 of 4 · Previously: Part 1 — Caching Principles · Part 2 — Provider Comparison & Evaluation · Next: Part 4 — Best LLM by Use Case


0. Setup

pip install openai

# common.py: reused across every example
import os, time
from openai import OpenAI

oai = OpenAI(
    api_key=os.environ["SYNTHORAI_KEY"],
    base_url="https://synthorai.io/v1",
)

The gateway speaks OpenAI’s wire format for every model it fronts (GPT, Claude, Gemini, DeepSeek, Qwen). You change the model field, not the SDK. Authentication uses Authorization: Bearer <key>.

Cache-capable model IDs available on the public gateway (2026-05 snapshot): claude-haiku-4-5, claude-sonnet-4-5 / 4-6, claude-opus-4-5 / 4-6 / 4-7, gpt-5.4-mini, gpt-5.4-nano, gpt-5.2, gpt-5.5-pro, gemini-2.5-flash, gemini-2.5-pro, gemini-3.1-pro-preview, deepseek-v4-flash, qwen3-max, qwen3.5-flash. The full live list is at GET /v1/models.


1. The Cache-Aware Call (Same on Every Provider)

For providers that cache automatically you don’t have to opt in (Claude needs explicit markers; see §2). For any model that supports prompt caching upstream, the gateway passes the response metadata through. Two fields tell you what happened:

resp = oai.chat.completions.create(
    model="gpt-5.4-mini",
    max_tokens=128,
    messages=[
        {"role": "system", "content": LONG_STABLE_PROMPT},   # ~7K tokens
        {"role": "user",   "content": "First question"},
    ],
)
print(resp.usage.prompt_tokens_details.cached_tokens)   # cache hit count
print(resp.usage.cost)                                  # USD, gateway-computed

cached_tokens is the count of input tokens that hit the upstream prefix cache. usage.cost is the gateway-computed price for this single call in USD — no need to keep a per-provider rate card locally.
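A minimal sketch of reading both fields defensively, assuming only the response shape shown above. The helper name cache_stats is hypothetical, and the SimpleNamespace object is a stand-in for resp.usage so the example runs without a network call. It returns None for fields that are absent, which can happen when a model sends back no cache metadata.

```python
from types import SimpleNamespace

def cache_stats(usage):
    """Return cached-token count, hit ratio, and cost from a usage object.

    Missing fields come back as None instead of raising.
    """
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", None)
    prompt = getattr(usage, "prompt_tokens", None)
    ratio = cached / prompt if cached is not None and prompt else None
    return {
        "cached_tokens": cached,
        "prompt_tokens": prompt,
        "hit_ratio": ratio,
        "cost": getattr(usage, "cost", None),
    }

# Stand-in for resp.usage (values taken from the gpt-5.4-mini hit in section 7).
fake_usage = SimpleNamespace(
    prompt_tokens=6887,
    prompt_tokens_details=SimpleNamespace(cached_tokens=6400),
    cost=0.00257,
)
stats = cache_stats(fake_usage)
print(stats)
```

In real code you would pass resp.usage instead of fake_usage and log the returned dict alongside the model name.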

Two rules that follow from the architecture and apply to every provider:

  1. Stable content first, volatile content last. The prefix is matched from token zero; a single byte change at the start invalidates the whole prefix.
  2. Keep dynamic data out of the system prompt. Current timestamps, session IDs, and request UUIDs will all bust the cache.
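The two rules can be sketched as a message builder that keeps the system prompt byte-identical and pushes per-request values into the final user message. The names build_messages and prefix_fingerprint, and the placeholder prompt text, are illustrative assumptions. The fingerprint is a local byte-level proxy: the provider matches on tokens, but if the bytes differ the tokens will too.

```python
import hashlib
from datetime import datetime, timezone

STABLE_SYSTEM_PROMPT = "You are a support assistant."  # placeholder text

def build_messages(question, session_id):
    """Stable content first; per-request values go in the last (user) message."""
    now = datetime.now(timezone.utc).isoformat()
    volatile = f"[session={session_id} time={now}]\n{question}"
    return [
        {"role": "system", "content": STABLE_SYSTEM_PROMPT},
        {"role": "user", "content": volatile},
    ]

def prefix_fingerprint(messages):
    """Hash every message except the last, i.e. the part that should be cacheable."""
    h = hashlib.sha256()
    for m in messages[:-1]:
        h.update(m["role"].encode())
        h.update(b"\x00")
        h.update(m["content"].encode())
        h.update(b"\x00")
    return h.hexdigest()

a = build_messages("Which export formats are supported?", "s-1")
b = build_messages("How long is the refund window?", "s-2")
same_prefix = prefix_fingerprint(a) == prefix_fingerprint(b)
print(same_prefix)
```

Logging the fingerprint next to cached_tokens makes it easy to spot a request whose prefix silently changed.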

Everything below is just per-vendor examples of the same pattern.


2. Anthropic Claude — Explicit cache_control Markers

Claude is the explicit-marker family — Anthropic’s API does not auto-cache. To get a cache hit, mark up to four cache_control breakpoints in your system or messages array. Cache reads cost ~10% of input rate; cache writes cost 125% (a 25% premium).

The cleanest way to use cache_control through the gateway is with the official anthropic SDK pointed at the gateway’s Anthropic-native endpoint (the OpenAI-compat /chat/completions path does not currently propagate cache_control markers — use /v1/messages for Claude caching).

import os
from anthropic import Anthropic

anth = Anthropic(
    api_key=os.environ["SYNTHORAI_KEY"],
    base_url="https://synthorai.io/",   # SDK appends /v1/messages
)

msg = anth.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=512,
    system=[
        {"type": "text", "text": SYSTEM_INSTRUCTIONS,
         "cache_control": {"type": "ephemeral"}},       # BP 1: never changes
        {"type": "text", "text": TOOL_DESCRIPTIONS,
         "cache_control": {"type": "ephemeral"}},       # BP 2: rarely changes
        {"type": "text", "text": RETRIEVED_DOCUMENTS},  # changes per call — not cached
    ],
    messages=[{"role": "user", "content": question}],
)

print(msg.usage)
# Usage(input_tokens=18, output_tokens=64,
#       cache_creation_input_tokens=0, cache_read_input_tokens=8123,
#       cost=...)

TTL options. {"type": "ephemeral"} defaults to a 5-minute sliding TTL (every hit pushes expiry forward). For workloads with idle gaps longer than 5 minutes, request the 1-hour TTL on the same marker:

"cache_control": {"type": "ephemeral", "ttl": "1h"}

Layered breakpoints. Up to four markers lets you cache “never changes” + “rarely changes” + “per-task changes” independently — best-in-class for agent and RAG workloads where prompt sections change at different cadences. Even when the trailing layer (e.g. retrieved documents) changes between calls, the earlier layers still hit.
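As a rough sketch of the layering idea, the helper below builds the system list from (text, cache) pairs and refuses to emit more than four markers. The function name system_blocks and the placeholder strings are assumptions for illustration; only the block shape and the cache_control fields come from the example above.

```python
MAX_BREAKPOINTS = 4  # limit stated in the text above

def system_blocks(layers, ttl=None):
    """Build an Anthropic-style `system` list from (text, cache) pairs.

    layers: list of (text, cache) tuples ordered from most to least stable.
    ttl:    optional "1h" to request the longer TTL on every marker.
    """
    marker = {"type": "ephemeral"}
    if ttl is not None:
        marker["ttl"] = ttl
    blocks, used = [], 0
    for text, cache in layers:
        block = {"type": "text", "text": text}
        if cache:
            used += 1
            if used > MAX_BREAKPOINTS:
                raise ValueError("more than four cache_control breakpoints")
            block["cache_control"] = dict(marker)
        blocks.append(block)
    return blocks

blocks = system_blocks(
    [
        ("SYSTEM INSTRUCTIONS ...", True),   # never changes
        ("TOOL DESCRIPTIONS ...", True),     # rarely changes
        ("RETRIEVED DOCUMENTS ...", False),  # changes per call, left uncached
    ],
    ttl="1h",
)
print(blocks)
```

The result can be passed as system=blocks to anth.messages.create.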

Picking a model. Available Claude IDs on the gateway as of 2026-05: claude-haiku-4-5, claude-sonnet-4-5 / 4-6, claude-opus-4-5 / 4-6 / 4-7. Haiku for cheap chat; Sonnet for general-purpose with the strongest agent caching pattern; Opus for the hardest reasoning tasks.

Measured cache hit / write / no-cache reference (2026-05-25, ~7,976-token system prompt, max_tokens=64):

Model              Cache write  Cache read  No-cache ref  Read discount  Hit TTFT (stream)
claude-haiku-4-5   $0.00916     $0.00086    $0.00725      −88%           1.31 s
claude-sonnet-4-5  $0.02713     $0.00247    $0.02175      −89%           1.76 s
claude-sonnet-4-6  $0.02736     $0.00253    $0.02198      −88%           1.81 s
claude-opus-4-5    $0.04522     $0.00409    $0.03624      −89%           2.08 s
claude-opus-4-6    $0.04522     $0.00411    $0.03625      −89%           2.55 s
claude-opus-4-7    $0.06545     $0.00609    $0.05259      −88%           2.30 s

The discount holds uniformly across the family. Write premium is roughly 25% over no-cache (Anthropic’s documented rate); break-even is one cache hit.
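The break-even claim can be checked with a few lines of arithmetic. This sketch compares one cache write plus n reads against n + 1 uncached calls, using per-call costs copied from the table above; break_even_hits is a hypothetical helper name.

```python
def break_even_hits(write, read, no_cache):
    """Smallest number of cache reads after which caching is cheaper overall.

    Caching costs write + n * read; not caching costs (n + 1) * no_cache.
    """
    if read >= no_cache:
        raise ValueError("cache reads must be cheaper than uncached calls")
    n = 0
    while write + n * read > (n + 1) * no_cache:
        n += 1
    return n

# (cache write, cache read, no-cache reference) in USD per call.
measured = {
    "claude-haiku-4-5": (0.00916, 0.00086, 0.00725),
    "claude-sonnet-4-5": (0.02713, 0.00247, 0.02175),
    "claude-opus-4-7": (0.06545, 0.00609, 0.05259),
}
for model, (write, read, no_cache) in measured.items():
    premium = write / no_cache - 1
    hits = break_even_hits(write, read, no_cache)
    print(f"{model}: write premium {premium:.0%}, break-even after {hits} hit(s)")
```

With zero hits the write premium is pure overhead, so a prefix that is written and never read again costs more than not caching at all.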


3. OpenAI GPT-5.x — Automatic Caching

OpenAI auto-caches any request with a sufficiently long prefix. No code change, no marker.

def ask_gpt(question: str):
    t0 = time.perf_counter()
    resp = oai.chat.completions.create(
        model="gpt-5.4-mini",
        max_tokens=64,
        messages=[
            {"role": "system", "content": LONG_STABLE_PROMPT},
            {"role": "user",   "content": question},
        ],
    )
    return resp, time.perf_counter() - t0

r1, t1 = ask_gpt("Which export formats are supported?")
r2, t2 = ask_gpt("How long is the refund window for annual plans?")

print(t1, r1.usage.prompt_tokens_details.cached_tokens, r1.usage.cost)
# 3.63   0       0.00267
print(t2, r2.usage.prompt_tokens_details.cached_tokens, r2.usage.cost)
# 1.23   6400    0.00257

Same system prompt on both calls (6,887 prompt tokens). On the second call, 93% of the prompt hits the cache and total latency drops from 3.6 s to 1.2 s. The cost barely changes here; the two calls also differ in completion length (44 tokens on the miss, 19 on the hit), so this cost delta is not a clean measure of the cache discount. See §7 for cleaner cross-provider numbers.

gpt-5.4-nano shows the discount more clearly (44% cost reduction on the hit). For chat UIs where you only care about time-to-first-token, the streaming numbers are what matter:

def ttft(model, question):
    t0 = time.perf_counter()
    stream = oai.chat.completions.create(
        model=model, max_tokens=64,
        messages=[
            {"role": "system", "content": LONG_STABLE_PROMPT},
            {"role": "user",   "content": question},
        ],
        stream=True, stream_options={"include_usage": True},
    )
    for ev in stream:
        if ev.choices and ev.choices[0].delta and ev.choices[0].delta.content:
            return time.perf_counter() - t0     # first content token

Measured TTFT on the cached pass: 0.73 s for gpt-5.4-mini, 1.00 s for gpt-5.4-nano.
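The ttft function returns at the first content token, so it never reaches the final chunk that carries usage. A possible variant, sketched below under the assumption that chunks have the same choices / delta / usage shape as in the loop above, drains the whole stream and returns TTFT, the text, and the usage object together. The name consume_stream and the fake chunks are illustrative; the fakes let the sketch run offline.

```python
import time
from types import SimpleNamespace

def consume_stream(stream, clock=time.perf_counter):
    """Iterate a chat-completions stream; return (ttft_seconds, text, usage).

    ttft is None if no content chunk arrived; usage is None unless the
    stream was opened with stream_options={"include_usage": True}.
    """
    t0 = clock()
    ttft, parts, usage = None, [], None
    for ev in stream:
        if getattr(ev, "usage", None) is not None:
            usage = ev.usage            # reported on the last chunk
        if ev.choices and ev.choices[0].delta and ev.choices[0].delta.content:
            if ttft is None:
                ttft = clock() - t0     # first content token
            parts.append(ev.choices[0].delta.content)
    return ttft, "".join(parts), usage

# Fake chunks standing in for a real stream.
def _chunk(text=None, usage=None):
    choices = [SimpleNamespace(delta=SimpleNamespace(content=text))] if text else []
    return SimpleNamespace(choices=choices, usage=usage)

fake = [_chunk("CSV"), _chunk(" and JSON"), _chunk(usage=SimpleNamespace(cost=0.00257))]
ttft_s, text, usage = consume_stream(iter(fake))
print(text, usage.cost)
```

Against the gateway you would pass the object returned by oai.chat.completions.create(..., stream=True, stream_options={"include_usage": True}) in place of iter(fake).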


4. Google Gemini — Implicit Caching

Gemini’s cache is also automatic when you go through the gateway. There is no cachedContent create step you need to perform.

r = oai.chat.completions.create(
    model="gemini-2.5-flash",
    max_tokens=128,
    messages=[
        {"role": "system", "content": LONG_STABLE_PROMPT},
        {"role": "user",   "content": "Summarize section 6 in two bullets."},
    ],
)
print(r.usage.prompt_tokens_details.cached_tokens, r.usage.cost)

A measured hit on gemini-2.5-flash for a ~7,300-token system prompt: 7,140 cached tokens (97%), cost drops from $0.00198 to $0.00024 — 88% off for that pass.

Two gotchas worth knowing:

  • Gemini’s *-pro variants are reasoning models. With small max_tokens, you often see completion_tokens=0 because the budget is consumed by hidden thinking. Bump max_tokens to ≥256 for anything user-facing.
  • The implicit cache TTL is short and not officially specified. In our test, a hit between two calls 5 s apart succeeded; a third call ~10 s later sometimes missed. Don’t engineer logic that assumes the hit; check cached_tokens and degrade gracefully.

5. DeepSeek-v4-flash — Disk-Backed Auto Cache

DeepSeek’s auto cache survives longer than the GPU-memory-resident caches at other vendors. Same call shape:

r1 = oai.chat.completions.create(
    model="deepseek-v4-flash", max_tokens=128,
    messages=[{"role": "system", "content": LONG_STABLE_PROMPT},
              {"role": "user",   "content": "Q1"}],
)
# r1.usage.cost = $0.00091, cached_tokens = 0

r2 = oai.chat.completions.create(
    model="deepseek-v4-flash", max_tokens=128,
    messages=[{"role": "system", "content": LONG_STABLE_PROMPT},
              {"role": "user",   "content": "Q2"}],
)
# r2.usage.cost = $0.00023, cached_tokens = 6784  →  74% saved

Streaming TTFT on the cached pass: 2.93 s. DeepSeek is not the lowest-latency option in this set — the wins are in cost and the fact that the cache stays warm across hour-scale gaps.


6. Alibaba Qwen — Hit Reported, Discount Variable

r = oai.chat.completions.create(
    model="qwen3-max", max_tokens=128,
    messages=[{"role": "system", "content": LONG_STABLE_PROMPT},
              {"role": "user",   "content": "Q1"}],
)
print(r.usage.prompt_tokens_details.cached_tokens, r.usage.cost)
# 7040    0.00549

Caveat seen on our test run: cached_tokens reports a hit (7,040 of 7,234 = 97%), but usage.cost did not drop on the cached pass (still ≈ $0.0055). This means the upstream cache hit happened (faster TTFT, 1.53 s vs 3.03 s cold), but the gateway’s cost field for this provider did not yet reflect the cached-rate discount on this date. If you’re cost-sensitive on Qwen, watch cached_tokens and trust upstream pricing pages until this normalizes.
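One way to catch this kind of mismatch automatically is to compare a miss call and a hit call and flag the pair when a large hit is reported but the cost barely moves. This is a sketch only: the function name and both thresholds are assumptions, and the inputs are the figures measured in this run.

```python
def cost_reflects_hit(miss_cost, hit_cost, cached_tokens, prompt_tokens,
                      min_hit_ratio=0.5, min_saving=0.2):
    """Return False when a large cache hit is reported but cost barely moved.

    The two thresholds are illustrative assumptions, not provider guarantees.
    """
    hit_ratio = cached_tokens / prompt_tokens
    saving = 1 - hit_cost / miss_cost
    if hit_ratio < min_hit_ratio:
        return True            # small hit: no discount expected, nothing to flag
    return saving >= min_saving

qwen_ok = cost_reflects_hit(0.00553, 0.00549, 7040, 7234)
gemini_ok = cost_reflects_hit(0.00198, 0.00024, 7140, 7322)
print(qwen_ok, gemini_ok)
```

The check is only meaningful when both calls produce completions of similar length, since output tokens are billed too.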


7. Cross-Provider Benchmark (Measured 2026-05-25)

Single sequential run. 7,284-character (~6,900–7,300 tokens depending on tokenizer) stable system prompt. max_tokens=64. One miss call followed immediately by one hit call.

The auto-cache providers (no marker required):

Model              Miss cost  Hit cost   Cost Δ  Miss total  Hit total  Hit TTFT (stream)  Cache hit rate
gpt-5.4-nano       $0.00131   $0.00074   −44%    2.18 s      1.48 s     1.00 s             5,888 / 6,887 (85%)
gpt-5.4-mini       $0.00267   $0.00257   −4%*    3.63 s      1.23 s     0.73 s             6,400 / 6,887 (93%)
gemini-2.5-flash   $0.00198   $0.00024†  −88%    2.49 s      1.37 s     n/a‡               7,140 / 7,322 (97%)
gemini-2.5-pro     $0.00824   $0.00205†  −75%    2.99 s      1.76 s     n/a‡               6,120 / 7,328 (84%)
deepseek-v4-flash  $0.00091   $0.00023   −74%    4.02 s      3.71 s     2.93 s             6,784 / 7,101 (96%)
qwen3-max          $0.00553   $0.00549   −1%§    4.80 s      2.37 s     1.53 s             7,040 / 7,234 (97%)

* gpt-5.4-mini’s miss-call completion was 44 tokens vs 19 on the hit, so the cost delta mixes cache discount with completion-length difference. The latency drop (3.63 → 1.23 s) is the cleaner signal here.
† Streaming-pass cost (where cached_tokens was reported); the non-stream pass occasionally returned cached_tokens=null for Gemini and the cost did not drop. Gateway metadata for Gemini is currently inconsistent; trust cached_tokens when present.
‡ Gemini *-pro / *-flash reasoning models often emit zero content tokens at small max_tokens, so TTFT is meaningless at that budget. Bump max_tokens if you measure this in production.
§ See §6: the upstream cache hit happened (latency dropped), but the gateway’s usage.cost field did not reflect the discount for qwen3-max on this date.

Anthropic Claude is explicit-marker-driven; numbers live in a separate table because the discount is opt-in via cache_control (see §2 for the pattern). Same prompt, measured cache write vs cache read:

Model              Write cost  Read cost  Read discount  Hit TTFT (stream)
claude-haiku-4-5   $0.00916    $0.00086   −88%           1.31 s
claude-sonnet-4-5  $0.02713    $0.00247   −89%           1.76 s
claude-sonnet-4-6  $0.02736    $0.00253   −88%           1.81 s
claude-opus-4-5    $0.04522    $0.00409   −89%           2.08 s
claude-opus-4-6    $0.04522    $0.00411   −89%           2.55 s
claude-opus-4-7    $0.06545    $0.00609   −88%           2.30 s

Your numbers will differ by region, time of day, and warmth of other tenants’ prefixes. Single-run, single-date — don’t quote these as benchmark gospel.


8. Pre-Launch Checklist

Before shipping a cache-aware prompt:

  1. Stable content first — system prompt, knowledge base, tool schemas at the top of messages.
  2. Volatile content last — user input, retrieved docs, timestamps at the bottom.
  3. No dynamic variables in system — current time, user ID, random seeds will nuke your prefix.
  4. Log cached_tokens on every call. If the hit rate is under 50% in production, your prefix isn’t actually stable. Inspect the prompts that miss.
  5. Don’t trust a single hit pass. TTLs are short; design for hit_rate ∈ [0, 1) rather than “always hit”.
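Items 4 and 5 can be combined into a small rolling monitor. The sketch below assumes you call record() with cached_tokens from each response; the class name, window size, and the choice to count a missing value as a miss are assumptions, while the 50% alert threshold comes from item 4.

```python
from collections import deque

class HitRateMonitor:
    """Rolling share of calls whose cached_tokens was greater than zero."""

    def __init__(self, window=200, alert_below=0.5):
        self.samples = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, cached_tokens):
        # None (no metadata returned) is counted as a miss, the conservative choice.
        self.samples.append(bool(cached_tokens))

    @property
    def hit_rate(self):
        return sum(self.samples) / len(self.samples) if self.samples else None

    def should_alert(self):
        rate = self.hit_rate
        return rate is not None and rate < self.alert_below

mon = HitRateMonitor(window=4)
for cached in (0, 6400, None, 0):
    mon.record(cached)
print(mon.hit_rate, mon.should_alert())
```

When should_alert() fires, the prompts behind the recent misses are the ones to inspect.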

9. TTL-Aware Patterns

The most common production failure mode isn’t “I forgot to enable caching” — it’s “my hit rate is 12% because my requests don’t actually arrive inside the TTL window.”
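A quick way to see whether a workload fits inside a TTL window is to replay its request timestamps against a simulated cache. This sketch assumes a sliding TTL in which every request resets the expiry, and a 5-minute window; both are simplifying assumptions, and the arrival times are made up for illustration.

```python
def simulate_hit_rate(arrival_times, ttl_seconds):
    """Fraction of requests that arrive while the cache entry is still alive.

    Assumes a sliding TTL: every request (hit or miss) resets the expiry,
    and the first request is always a miss.
    """
    hits, expires_at = 0, None
    for t in sorted(arrival_times):
        if expires_at is not None and t <= expires_at:
            hits += 1
        expires_at = t + ttl_seconds   # a miss writes the entry, a hit refreshes it
    return hits / len(arrival_times) if arrival_times else 0.0

TTL = 300  # 5 minutes, in seconds

chat_session = [0, 40, 95, 130, 200, 260]      # a turn every minute or so
hourly_lookups = [h * 3600 for h in range(8)]  # one request per hour

chat_rate = simulate_hit_rate(chat_session, TTL)
hourly_rate = simulate_hit_rate(hourly_lookups, TTL)
print(chat_rate, hourly_rate)
```

Feeding in timestamps from your own request logs gives an upper bound on the hit rate before you change any prompt.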

9.1 Session-Bound Workloads (chat, IDE assistants)

The natural cadence is well below TTL. Structure your prompt right and the cache stays warm by itself — don’t engineer anything else.

9.2 Heartbeat for Batch / Cron

If you run a daily report at 09:00 that calls your model 50 times in a 3-minute burst, the cache has gone cold overnight, so the first calls of the burst pay full prefill. To avoid that, start at 08:55 and send a 1-token “ping” with the cached prefix every TTL/2 to keep it warm:

def keepalive():
    oai.chat.completions.create(
        model="gpt-5.4-mini",
        max_tokens=1,
        messages=[
            {"role": "system", "content": LONG_STABLE_PROMPT},
            {"role": "user",   "content": "."},
        ],
    )

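Once the prefix is warm, the cost per ping is roughly input tokens × cached rate, which for our 7K-token prefix on gpt-5.4-mini is around $0.0026. The first ping is itself a miss and pays the full rate, so the pattern only pays off when it spares several real calls from full prefill, for example when the burst fires its requests concurrently.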
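The timing side of the heartbeat can be sketched as a pure function that lists when to call keepalive(). The 300-second TTL is an assumed value (the text does not state one for this model), and ping_times is a hypothetical helper; plug in whatever TTL your provider documents.

```python
from datetime import datetime, timedelta

def ping_times(warm_from, batch_start, ttl_seconds):
    """Timestamps at which to send a keepalive: warm_from, then every TTL/2
    until (but not including) batch_start."""
    step = timedelta(seconds=ttl_seconds / 2)
    times, t = [], warm_from
    while t < batch_start:
        times.append(t)
        t += step
    return times

batch = datetime(2026, 5, 25, 9, 0)
schedule = ping_times(batch - timedelta(minutes=5), batch, ttl_seconds=300)
for t in schedule:
    print(t.isoformat())
```

A scheduler or cron wrapper would then invoke keepalive() at each returned timestamp.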

9.3 Cold-Storage Documents

For documents queried sporadically (once an hour throughout the day), in-memory caches will be cold most of the time. As of this writing, the gateway does not expose a hosted explicit-cache create endpoint — for long-TTL needs use deepseek-v4-flash (disk-backed; survives hour-scale gaps in practice) or call Google’s native cachedContent API directly outside the gateway.


10. What the Gateway Actually Adds

It would be dishonest to claim the gateway “does caching for you”. Caching happens at the model layer — the gateway exposes what’s there. What it does add, measured against using each vendor’s native SDK directly, is three things:

  1. One base_url, one auth header, every model. Swap the model field and the call shape is unchanged. Same messages array, same usage field structure. You don’t carry five SDKs for five providers.
  2. usage.cost in USD per call. The gateway computes the dollar cost using current upstream rates and includes it in every response. You don’t maintain a pricing matrix in your code, and you don’t have to subscribe to per-vendor price-change notifications.
  3. Uniform cached_tokens field. Anthropic reports cache hits as cache_read_input_tokens, OpenAI as prompt_tokens_details.cached_tokens, DeepSeek as prompt_cache_hit_tokens. The gateway normalizes these into the OpenAI shape so your observability code doesn’t branch on provider.
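To make point 3 concrete, here is roughly the branching you would otherwise write yourself when talking to each vendor directly. It is an illustration built from the three field names listed above, operating on plain dicts, and is not a description of how the gateway implements its normalization.

```python
def normalize_cached_tokens(usage):
    """Map provider-specific usage dicts onto a single cached-token count.

    Checks the three field names listed above, in order; returns None when
    none of them is present.
    """
    details = usage.get("prompt_tokens_details") or {}
    if details.get("cached_tokens") is not None:
        return details["cached_tokens"]              # OpenAI shape
    if usage.get("cache_read_input_tokens") is not None:
        return usage["cache_read_input_tokens"]      # Anthropic shape
    if usage.get("prompt_cache_hit_tokens") is not None:
        return usage["prompt_cache_hit_tokens"]      # DeepSeek shape
    return None

# Sample payloads with token counts taken from this article's measurements.
samples = {
    "openai": {"prompt_tokens": 6887, "prompt_tokens_details": {"cached_tokens": 6400}},
    "anthropic": {"input_tokens": 18, "cache_read_input_tokens": 8123},
    "deepseek": {"prompt_tokens": 7101, "prompt_cache_hit_tokens": 6784},
}
normalized = {name: normalize_cached_tokens(u) for name, u in samples.items()}
print(normalized)
```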

That’s the entire pitch. Everything else — when to cache, how to structure prompts, which model to pick — is the work of the next article.


Next: Part 4 — How to Choose the Best LLM by Use Case: Chat, API & AI Agents — a decision matrix matching workload type to the optimal model + caching strategy, with cost math.


FAQ

Why use the OpenAI SDK for non-OpenAI models? The gateway speaks OpenAI’s wire format for every provider it fronts. The official openai SDK gives you typed responses, automatic retries, and streaming helpers — there’s no reason to hand-roll five HTTP clients.

Does caching work with streaming responses? Yes. The usage object in the final chunk reports cache hit counts (when you pass stream_options={"include_usage": True}). The latency win is most visible on streaming because TTFT is what users see.

Which provider has the deepest cache discount on my workload? At 2026-05 prices and a 70%+ hit rate, gemini-2.5-flash and deepseek-v4-flash are the cheapest in the §7 table. gpt-5.4-mini wins on TTFT. For Claude’s documented 90% cache discount, mark up to four cache_control breakpoints (see §2). Run the same benchmark against your own prompt — that’s a one-day exercise, not a multi-week migration.

When do I need cache_control markers? Only when calling Anthropic Claude — see §2. For OpenAI/Gemini/DeepSeek/Qwen the upstream auto-caches any sufficiently long prefix, so no marker is required; the field is silently ignored against those providers.

How fresh are these numbers? Measured 2026-05-25 on the public gateway. Treat them as a single data point — pricing and latency change every release cycle.

What about Anthropic Claude? Claude is supported through the gateway with explicit cache_control markers — use the anthropic SDK with base_url="https://synthorai.io/" (the SDK appends /v1/messages). The OpenAI-compat /chat/completions path doesn’t propagate the markers today; for Claude caching specifically, use the Anthropic-native path shown in §2.


Sources & verification: All numbers measured against https://synthorai.io/v1 on 2026-05-25 using openai SDK 2.38.0. Vendor pricing pages: Anthropic Prompt Caching · OpenAI Prompt Caching · Google Gemini Context Caching · DeepSeek KV Cache Guide · Alibaba Bailian Context Cache.
