Claude Opus 4.8 on Synthorai: Caching & TTL vs 4.7/4.6

May 29, 2026 · claude-opus-4-8 · prompt-cache · model-update

Contents

Availability
Caching behavior: unchanged from 4.7/4.6
TTL behavior: unchanged from 4.7/4.6
Time-to-first-token: flat across the line
The one real change: tokenization (since 4.7)
Migration checklist (4.6/4.7 → 4.8)
Bottom line
FAQ

claude-opus-4-8 is now available on the Synthorai gateway. If you already run prompt caching against the Opus line, the headline is reassuring and slightly boring: nothing about the caching or TTL contract changed from 4.7 or 4.6. Same cache_control markers, same 5-minute and 1-hour TTLs, same read discount, same write premiums. Your caching code is a drop-in carry-over.

Exactly one thing affects your token budget, and it changed at 4.7, not at 4.8. This post measures it so you don’t have to.

TL;DR

Claude Opus 4.8 keeps the caching contract of 4.7/4.6 unchanged: an 89% measured read discount, about 1.25x write premium on the 5-minute TTL and about 2x on the 1-hour.
The same system text reports about 43% more input tokens on Opus 4.7/4.8 than on 4.5/4.6 (11,394 vs 7,976 tokens).
The per-token price is identical across the Opus line: the 4.8/4.5 cost ratio of 1.43 matches the token ratio of 1.429.
Warm-read TTFT sits in a 2.2-2.8s band across Opus 4.5-4.8; the differences are jitter.

All numbers below were measured against https://synthorai.io/ (Anthropic-native /v1/messages) on 2026-05-29 with an English system prompt of ~8K tokens on 4.5/4.6 (~11.4K on 4.7/4.8), a small max_tokens, and a single sequential run. Reproduce against your own prompt before quoting them.

Availability

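import os
from anthropic import Anthropic

# Placeholders: substitute your own long, stable system prompt and user question.
# The prefix must meet the cache-eligibility minimum, or nothing is cached.
SYSTEM_PROMPT = "<your long, stable system prompt>"
question = "<the user's question>"

anth = Anthropic(
    api_key=os.environ["SYNTHORAI_KEY"],
    base_url="https://synthorai.io/",   # SDK appends /v1/messages
)

msg = anth.messages.create(
    model="claude-opus-4-8",            # the only line that changes
    max_tokens=512,
    system=[
        # Mark the system prompt as a cacheable prefix (5-minute TTL by default).
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": question}],
)
print(msg.usage)   # cache_creation_input_tokens, cache_read_input_tokens, cost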

Swap claude-opus-4-7 → claude-opus-4-8 and nothing else in your caching path needs to move. The mechanics behind cache_control are covered in the caching tutorial; the architecture of why the cache exists is in Part 1 of the series.

Caching behavior: unchanged from 4.7/4.6

We ran the same cache write / cache read / no-cache sequence across the recent Opus line. The discount structure is identical end to end.

Model	No-cache cost	5m cache write	Cache read	Read discount
`claude-opus-4-5`	$0.0364	$0.0452	$0.0041	88.8%
`claude-opus-4-6`	$0.0364	$0.0452	$0.0041	88.7%
`claude-opus-4-7`	$0.0522	$0.0654	$0.0059	88.7%
`claude-opus-4-8`	$0.0520	$0.0654	$0.0059	88.6%

Two invariants hold across all four versions:

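Read discount ≈ 89%. A warm cache read costs ~11% of the no-cache price for the same request. That is in line with Anthropic’s documented 10% cached-read rate, which is unchanged; the extra point is plausibly the uncached part of each call (the user message and output tokens), which is billed at the full rate.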
Write premium ≈ 25%. The first (cold) call costs ~1.25× the no-cache price to populate the cache. Break-even is one hit.
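Both invariants can be recomputed from the table. The sketch below hard-codes the four rows above; the function names are ours, and the break-even model (one cold write plus n warm reads versus n + 1 uncached calls) is a simplification. Because the table's dollar figures are rounded to four decimals, the recomputed percentages can differ from the printed ones in the last digit.

```python
# Costs copied from the table above, in USD: (no-cache, 5m cache write, cache read).
ROWS = {
    "claude-opus-4-5": (0.0364, 0.0452, 0.0041),
    "claude-opus-4-6": (0.0364, 0.0452, 0.0041),
    "claude-opus-4-7": (0.0522, 0.0654, 0.0059),
    "claude-opus-4-8": (0.0520, 0.0654, 0.0059),
}


def read_discount(no_cache, read):
    """Fraction saved by a warm read relative to an uncached call."""
    return 1.0 - read / no_cache


def write_premium(no_cache, write):
    """Cost of the cold (cache-populating) call as a multiple of an uncached call."""
    return write / no_cache


def cached_is_cheaper(no_cache, write, read, hits):
    """Compare one cold write plus `hits` warm reads against hits + 1 uncached calls."""
    return write + hits * read < (hits + 1) * no_cache


for model, (no_cache, write, read) in ROWS.items():
    print(
        model,
        f"discount={read_discount(no_cache, read):.1%}",
        f"premium={write_premium(no_cache, write):.2f}x",
        f"cheaper_after_one_hit={cached_is_cheaper(no_cache, write, read, 1)}",
    )
```

With zero hits the cached path is strictly more expensive (you pay the premium and get nothing back), which is why caching only pays off for prefixes that are actually reused inside the TTL.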

The absolute dollar figures for 4.7 and 4.8 are higher than 4.5/4.6, but as we’ll see in a moment that’s a token-count story, not a cache-economics story — the percentages are flat.

TTL behavior: unchanged from 4.7/4.6

Opus 4.8 honors the same two TTLs as the rest of the line: a 5-minute sliding default and an opt-in 1-hour window. We isolated the TTL path with a unique prefix per call (so no stale cache entry could contaminate the result) and measured the write premium for each TTL:

Model	TTL	Cache write	Write premium vs no-cache
`claude-opus-4-7`	5m	$0.0650	~1.25×
`claude-opus-4-7`	1h	$0.1036	~2×
`claude-opus-4-8`	5m	$0.0650	~1.25×
`claude-opus-4-8`	1h	$0.1036	~2×

# 1-hour TTL — same marker syntax on 4.8 as on 4.7/4.6
"cache_control": {"type": "ephemeral", "ttl": "1h"}

The usage object reports the TTL bucket exactly as before — cache_creation.ephemeral_5m_input_tokens or ephemeral_1h_input_tokens. The 1-hour write costs ~2× no-cache (vs ~1.25× for the 5-minute write), and reads stay at ~11% regardless of TTL. Identical to 4.7. If you picked 5m for live chat and 1h for agents with human-in-the-loop pauses on 4.7, keep those choices on 4.8.
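A small sketch of how the marker and the TTL buckets fit together. The helper names are hypothetical, and the usage payload is assumed to be available as a plain dict (for example after converting the SDK's usage object); only the field names come from the text above, and the token count in the sample payload is illustrative.

```python
def cached_system_block(text, ttl="5m"):
    """Build a system content block with a cache marker for the requested TTL."""
    if ttl not in ("5m", "1h"):
        raise ValueError(f"unsupported TTL: {ttl!r}")
    cache_control = {"type": "ephemeral"}
    if ttl == "1h":
        # The 5-minute TTL is the default, so only the 1-hour case is spelled out.
        cache_control["ttl"] = "1h"
    return {"type": "text", "text": text, "cache_control": cache_control}


def ttl_buckets(usage):
    """Return tokens written to each TTL bucket from a usage payload (as a dict)."""
    creation = usage.get("cache_creation") or {}
    return {
        "5m": creation.get("ephemeral_5m_input_tokens") or 0,
        "1h": creation.get("ephemeral_1h_input_tokens") or 0,
    }


# Illustrative payload shaped like the fields named above (not a captured response).
sample_usage = {"cache_creation": {"ephemeral_5m_input_tokens": 0,
                                   "ephemeral_1h_input_tokens": 11394}}
print(cached_system_block("long system prompt", ttl="1h")["cache_control"])
print(ttl_buckets(sample_usage))
```

Checking the bucket on the first call after a deploy is a cheap way to confirm that a 1-hour marker was honored rather than silently falling back to the 5-minute default.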

Time-to-first-token: flat across the line

We measured warm-read TTFT with a streaming call (5 samples per model after a gateway warm-up, median reported). On this ~8–11K-token prompt, TTFT sits in a ~2.2–2.8 s band with no material per-version trend — the sample ranges overlap, so the differences are jitter, not a version effect.

Model	Warm-read TTFT (median)	Range (n=5)
`claude-opus-4-5`	2.72 s	2.58 – 2.78 s
`claude-opus-4-6`	2.76 s	2.65 – 3.01 s
`claude-opus-4-7`	2.21 s	1.98 – 2.97 s
`claude-opus-4-8`	2.47 s	2.23 – 4.38 s

Two caveats worth stating plainly:

Don’t read a ranking into this. The ranges overlap heavily (4.8’s high sample was an outlier at 4.38 s); on this prompt size TTFT is dominated by network and queueing jitter, not the model version. Treat ~2.2–2.8 s as the warm band for all four.
The cache TTFT win scales with prompt length. At ~8–11K tokens the prefill saved by a cache hit is small, so cold and warm TTFT are close (both ~2–3 s on a warmed gateway). The gap widens substantially at 100K+ tokens, where prefill dominates — that’s where a warm cache turns a multi-second wait into a fast first token. The mechanics are in Part 1: How KV Cache & TTL Work.
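The measurement described above (several samples per model, median and range reported) can be sketched as a small harness. This is not the script used for the table: `open_stream` is a hypothetical callable that starts a streaming request and returns an iterator of chunks, and a local fake stream stands in for the gateway so the example runs offline.

```python
import statistics
import time


def time_to_first_chunk(open_stream):
    """Seconds from starting the request until the first chunk arrives."""
    start = time.perf_counter()
    for _chunk in open_stream():
        return time.perf_counter() - start
    raise RuntimeError("stream ended without producing a chunk")


def ttft_summary(open_stream, samples=5):
    """Collect `samples` sequential TTFT measurements and summarize them."""
    timings = [time_to_first_chunk(open_stream) for _ in range(samples)]
    return {
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
    }


def fake_stream():
    """Stand-in for a real streaming call: a short delay, then two chunks."""
    time.sleep(0.01)
    yield "hello"
    yield " world"


summary = ttft_summary(fake_stream)
print(summary)
```

To use it against a real endpoint, `open_stream` would wrap a streaming messages call and yield its text chunks; that wiring is left out here, and a real stream should be closed once the first chunk has been timed. Reporting min and max next to the median makes overlapping ranges, like the ones in the table, visible at a glance.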

The one real change: tokenization (since 4.7)

Here is the thing to re-check before you migrate. The same system text reports ~43% more input tokens on 4.7/4.8 than on 4.5/4.6.

Model	Input tokens (identical text)	No-cache cost
`claude-opus-4-5`	~7,976	$0.0364
`claude-opus-4-6`	~7,977	$0.0364
`claude-opus-4-7`	~11,393	$0.0522
`claude-opus-4-8`	~11,394	$0.0520

The token count jumps at the 4.7 generation and carries into 4.8. The cost tracks the token count almost exactly: the cost ratio (4.8 / 4.5) is 1.43, and the token ratio is 1.429. In other words, the per-token price is the same across the whole line — the higher bill on 4.7/4.8 comes entirely from the same text counting as more tokens.
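The ratio argument is short enough to check by hand, or with a few lines. The sketch below uses only the figures from the table above. The "effective cost per token" is simply no-cache cost divided by reported input tokens, so it folds in whatever output tokens each call produced and is not a list price.

```python
# Figures copied from the table above: reported input tokens and no-cache cost (USD).
TOKENS = {
    "claude-opus-4-5": 7976,
    "claude-opus-4-6": 7977,
    "claude-opus-4-7": 11393,
    "claude-opus-4-8": 11394,
}
COST = {
    "claude-opus-4-5": 0.0364,
    "claude-opus-4-6": 0.0364,
    "claude-opus-4-7": 0.0522,
    "claude-opus-4-8": 0.0520,
}

token_ratio = TOKENS["claude-opus-4-8"] / TOKENS["claude-opus-4-5"]
cost_ratio = COST["claude-opus-4-8"] / COST["claude-opus-4-5"]
print(f"token ratio 4.8/4.5: {token_ratio:.3f}")
print(f"cost ratio  4.8/4.5: {cost_ratio:.3f}")

# If the per-token price is flat, cost / tokens should be nearly constant per model.
effective = {model: COST[model] / TOKENS[model] for model in TOKENS}
for model, per_token in effective.items():
    print(f"{model}: ${per_token * 1000:.5f} per 1K reported input tokens")

spread = max(effective.values()) / min(effective.values())
print(f"max/min effective per-token cost: {spread:.4f}")
```

A spread close to 1.0 is what "same per-token price, more tokens" looks like in the data; a spread near 1.43 would instead have pointed to a price change.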

Two practical consequences:

Re-budget on absolute cost, not on discount. Your cache discount is unchanged (~89% read), but the same English prompt is ~43% more expensive in absolute terms on 4.7/4.8 than it was on 4.6. If you sized a per-call budget against 4.6 token counts, it will be off.
Re-check the 1,024-token cache-eligibility floor. Anthropic only caches prefixes at or above a minimum size, and that minimum varies by model, so confirm the value for your target model in Anthropic’s docs. A prompt that sat just under the floor on 4.6 may clear it on 4.7/4.8 (more tokens), and a prompt sized in tokens for the old tokenizer needs re-measuring. Always read cache_creation_input_tokens / cache_read_input_tokens from the live response rather than estimating from a local tokenizer that may not match.
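For a first-pass estimate before re-measuring, both consequences reduce to scaling old token counts by the observed ratio. This is a rough sketch: the default ratio comes from this post's single English prompt and may differ for code or other languages, the 1,024 floor is the value quoted above, and the 800-token prefix is a made-up example. The live usage fields remain the source of truth.

```python
# Ratio observed in this post for one English prompt (4.8 tokens / 4.5 tokens).
OBSERVED_RATIO = 11394 / 7976


def rescale_tokens(old_tokens, ratio=OBSERVED_RATIO):
    """Estimate the 4.7/4.8 token count for text measured on 4.5/4.6."""
    return round(old_tokens * ratio)


def clears_floor(tokens, floor=1024):
    """True if a prefix of this size meets the cache-eligibility minimum."""
    return tokens >= floor


old_prefix = 800  # hypothetical prefix size measured on 4.6
new_prefix = rescale_tokens(old_prefix)
print(f"4.6: {old_prefix} tokens, cacheable={clears_floor(old_prefix)}")
print(f"4.8 (estimated): {new_prefix} tokens, cacheable={clears_floor(new_prefix)}")
```

The same helper can rescale a per-call token budget, but treat the result as a planning number only and replace it with measured counts once you have a live response from the new model.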

We’re describing a measured observation — identical text, ~43% more reported input tokens on 4.7/4.8 — most consistent with a tokenizer/vocabulary update at the 4.7 generation. The takeaway doesn’t depend on the root cause: re-measure token counts when you migrate, because the cache math is token-based.

Migration checklist (4.6/4.7 → 4.8)

✅ Caching code carries over verbatim. cache_control markers, breakpoint count (up to 4), ttl: "1h", usage-field names — all identical.
✅ TTL choices carry over. 5m for live/session workloads, 1h for bursty/agent-with-pauses.
✅ Discount economics carry over. ~89% read, ~1.25× write (5m), ~2× write (1h).
⚠️ Re-measure token counts. If you’re coming from 4.5/4.6, expect ~40%+ more input tokens for the same text (this happened at 4.7). Coming from 4.7, expect parity.
⚠️ Re-validate cost dashboards. Trust usage.cost and the *_input_tokens fields from the live response, not a cached estimate from the old generation.
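The last two checklist items can be turned into a quick smoke test: send the same prefix twice and compare the usage fields. A minimal sketch, assuming both usage payloads are available as plain dicts; the function name and messages are ours, and the sample numbers are illustrative rather than captured responses.

```python
def check_cache_roundtrip(cold_usage, warm_usage):
    """Return a list of problems; an empty list means the cache behaved as expected."""
    problems = []
    written = cold_usage.get("cache_creation_input_tokens") or 0
    read = warm_usage.get("cache_read_input_tokens") or 0
    if written == 0:
        problems.append("cold call wrote nothing to the cache (prefix below the floor?)")
    if read == 0:
        problems.append("warm call read nothing from the cache (prefix changed or TTL expired?)")
    elif written and read != written:
        # With an identical prefix the two counts are expected to match.
        problems.append(f"read {read} tokens but wrote {written}; prefix may differ between calls")
    return problems


# Illustrative payloads for a cold call followed by a warm call.
cold = {"input_tokens": 12, "cache_creation_input_tokens": 11394, "cache_read_input_tokens": 0}
warm = {"input_tokens": 12, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 11394}
print(check_cache_roundtrip(cold, warm) or "cache round-trip looks healthy")
```

Running this once per model during migration also gives you the new token count for the prefix, which is the number your dashboards and budgets should be rebuilt on.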

Bottom line

For an engineering team already caching against Opus, claude-opus-4-8 is the easy kind of upgrade: the entire caching and TTL surface is stable, so there’s nothing to relearn and no code to rewrite. Budget for the tokenizer shift if you’re jumping from 4.6 or earlier, confirm your numbers against the live usage object, and ship.

For the full caching playbook — prompt structure, hit-rate debugging, TTL-aware patterns — see the prompt-caching series starting with How KV Cache & TTL Work and the working Python tutorial.

FAQ

Do I need to change my cache_control code to use Opus 4.8? No. The marker syntax, breakpoint limit, and TTL options are identical to 4.7/4.6. Change the model field and nothing else.

Did the cache read discount change on 4.8? No. A warm read is ~11% of the no-cache input price (~89% off) on 4.5 through 4.8, matching Anthropic’s documented rate.

Did the 1-hour TTL premium change? No. The 1-hour write costs ~2× the no-cache input price; the 5-minute write costs ~1.25×. Reads are ~11% regardless of TTL. Same as 4.7.

Why is the same prompt more expensive on 4.8 than on 4.6? The per-token price is the same — the prompt simply counts as more tokens. Identical text reported ~8.0K tokens on 4.5/4.6 and ~11.4K on 4.7/4.8 in our measurements (a ~43% increase), most consistent with a tokenizer change at the 4.7 generation. The cache discount is unchanged.

Is 4.8 a drop-in replacement for 4.7? On the caching/TTL surface, yes — token counts and economics were already at the 4.7 level, so migration from 4.7 is parity. We don’t publish capability benchmarks we haven’t run; for quality and reasoning claims, see Anthropic’s model card.

Verification: all caching, TTL, token-count, cost, and TTFT figures measured against https://synthorai.io/ on 2026-05-29 using the official anthropic SDK, single tenant. Cost/token figures are a single sequential run; TTFT is a 5-sample median per model after gateway warm-up. Discount/premium ratios cross-checked against Anthropic Prompt Caching docs. Your numbers will vary with prompt, region, and load.
