Claude Opus 4.8 on Synthorai: Caching & TTL vs 4.7/4.6

Contents
  1. Availability
  2. Caching behavior: unchanged from 4.7/4.6
  3. TTL behavior: unchanged from 4.7/4.6
  4. Time-to-first-token: flat across the line
  5. The one real change: tokenization (since 4.7)
  6. Migration checklist (4.6/4.7 → 4.8)
  7. Bottom line
  8. FAQ

claude-opus-4-8 is now available on the Synthorai gateway. If you already run prompt caching against the Opus line, the headline is reassuring and slightly boring: nothing about the caching or TTL contract changed from 4.7 or 4.6. Same cache_control markers, same 5-minute and 1-hour TTLs, same read discount, same write premiums. Your caching code is a drop-in carry-over.

Exactly one thing that affects your token budget did change, and it changed at the 4.7 generation, not at 4.8. This post measures it so you don’t have to.

All numbers below were measured against https://synthorai.io/ (Anthropic-native /v1/messages) on 2026-05-29 in a single sequential run, using an English system prompt of ~8K tokens as counted on 4.5/4.6 and a small max_tokens. Reproduce against your own prompt before quoting them.


Availability

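import os
from anthropic import Anthropic

# Placeholders: substitute your own long, stable system prompt and user question.
SYSTEM_PROMPT = "..."
question = "..."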

anth = Anthropic(
    api_key=os.environ["SYNTHORAI_KEY"],
    base_url="https://synthorai.io/",   # SDK appends /v1/messages
)

msg = anth.messages.create(
    model="claude-opus-4-8",            # the only line that changes
    max_tokens=512,
    system=[
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": question}],
)
print(msg.usage)   # cache_creation_input_tokens, cache_read_input_tokens, cost

Swap claude-opus-4-7 → claude-opus-4-8 and nothing else in your caching path needs to move. The mechanics behind cache_control are covered in the caching tutorial; the architecture of why the cache exists is in Part 1 of the series.


Caching behavior: unchanged from 4.7/4.6

We ran the same cache write / cache read / no-cache sequence across the recent Opus line. The discount structure is identical end to end.

| Model | No-cache cost | 5m cache write | Cache read | Read discount |
|---|---|---|---|---|
| claude-opus-4-5 | $0.0364 | $0.0452 | $0.0041 | 88.8% |
| claude-opus-4-6 | $0.0364 | $0.0452 | $0.0041 | 88.7% |
| claude-opus-4-7 | $0.0522 | $0.0654 | $0.0059 | 88.7% |
| claude-opus-4-8 | $0.0520 | $0.0654 | $0.0059 | 88.6% |

Two invariants hold across all four versions:

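  • Read discount ≈ 89%. A warm cache read costs ~11% of the no-cache call. That is consistent with Anthropic’s documented cached-read rate of 10% of the base input price, unchanged; the measured figure sits slightly higher, most likely because each call also bills the uncached user message and the output tokens at full price.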
  • Write premium ≈ 25%. The first (cold) call costs ~1.25× the no-cache price to populate the cache. Break-even is one hit.
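
The two ratios can be recomputed from the per-call dollar figures reported for claude-opus-4-8 in this section. The sketch below is only an arithmetic check; the variable names are illustrative, and because the inputs are rounded to four decimals the last digit may differ slightly from the table.

```python
# Per-call costs for claude-opus-4-8, copied from the figures in this section.
no_cache = 0.0520    # same prompt, no cache_control marker
write_5m = 0.0654    # cold call that populates a 5-minute cache entry
cache_read = 0.0059  # warm call served from the cache

read_discount = 1 - cache_read / no_cache   # fraction saved on a warm read
write_premium = write_5m / no_cache - 1     # extra fraction paid on the cold write

# Number of warm reads needed to repay the extra cost of the cold write.
extra_paid = write_5m - no_cache
saved_per_hit = no_cache - cache_read
break_even_hits = extra_paid / saved_per_hit

print(f"read discount:   {read_discount:.1%}")
print(f"write premium:   {write_premium:.1%}")
print(f"break-even hits: {break_even_hits:.2f}")
```

A break-even value below 1 means the first warm read already more than repays the write premium, which is what "break-even is one hit" amounts to.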

The absolute dollar figures for 4.7 and 4.8 are higher than 4.5/4.6, but as we’ll see in a moment that’s a token-count story, not a cache-economics story — the percentages are flat.


TTL behavior: unchanged from 4.7/4.6

Opus 4.8 honors the same two TTLs as the rest of the line: a 5-minute sliding default and an opt-in 1-hour window. We isolated the TTL path with a unique prefix per call (so no stale cache entry could contaminate the result) and measured the write premium for each TTL:

| Model | TTL | Cache write | Write premium vs no-cache |
|---|---|---|---|
| claude-opus-4-7 | 5m | $0.0650 | ~1.25× |
| claude-opus-4-7 | 1h | $0.1036 | ~2× |
| claude-opus-4-8 | 5m | $0.0650 | ~1.25× |
| claude-opus-4-8 | 1h | $0.1036 | ~2× |

# 1-hour TTL: same marker syntax on 4.8 as on 4.7/4.6
"cache_control": {"type": "ephemeral", "ttl": "1h"}

The usage object reports the TTL bucket exactly as before — cache_creation.ephemeral_5m_input_tokens or ephemeral_1h_input_tokens. The 1-hour write costs ~2× no-cache (vs ~1.25× for the 5-minute write), and reads stay at ~11% regardless of TTL. Identical to 4.7. If you picked 5m for live chat and 1h for agents with human-in-the-loop pauses on 4.7, keep those choices on 4.8.
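
To confirm which TTL bucket a write landed in, the nested fields named above can be read straight from the usage payload. The following is a minimal sketch that works on a plain dict; the helper name ttl_buckets and the sample payload are hypothetical, and the token counts in the sample are illustrative rather than a captured response.

```python
def ttl_buckets(usage: dict) -> dict:
    """Return cache-write token counts per TTL bucket from a usage payload."""
    creation = usage.get("cache_creation") or {}
    return {
        "5m": creation.get("ephemeral_5m_input_tokens", 0),
        "1h": creation.get("ephemeral_1h_input_tokens", 0),
    }

# Hypothetical payload shaped like the usage object of a 1-hour cache write.
sample_usage = {
    "input_tokens": 12,
    "cache_creation_input_tokens": 11394,
    "cache_read_input_tokens": 0,
    "cache_creation": {
        "ephemeral_5m_input_tokens": 0,
        "ephemeral_1h_input_tokens": 11394,
    },
}

buckets = ttl_buckets(sample_usage)
print(buckets)  # {'5m': 0, '1h': 11394}
```

With the SDK, converting the response first (for example via msg.usage.model_dump(), assuming the usage object is a pydantic model in your SDK version) should give a dict of this shape.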


Time-to-first-token: flat across the line

We measured warm-read TTFT with a streaming call (5 samples per model after a gateway warm-up, median reported). On this ~8–11K-token prompt, TTFT sits in a ~2.2–2.8 s band with no material per-version trend — the sample ranges overlap, so the differences are jitter, not a version effect.

| Model | Warm-read TTFT (median) | Range (n=5) |
|---|---|---|
| claude-opus-4-5 | 2.72 s | 2.58 – 2.78 s |
| claude-opus-4-6 | 2.76 s | 2.65 – 3.01 s |
| claude-opus-4-7 | 2.21 s | 1.98 – 2.97 s |
| claude-opus-4-8 | 2.47 s | 2.23 – 4.38 s |

Two caveats worth stating plainly:

  • Don’t read a ranking into this. The ranges overlap heavily (4.8’s high sample was an outlier at 4.38 s); on this prompt size TTFT is dominated by network and queueing jitter, not the model version. Treat ~2.2–2.8 s as the warm band for all four.
  • The cache TTFT win scales with prompt length. At ~8–11K tokens the prefill saved by a cache hit is small, so cold and warm TTFT are close (both ~2–3 s on a warmed gateway). The gap widens substantially at 100K+ tokens, where prefill dominates — that’s where a warm cache turns a multi-second wait into a fast first token. The mechanics are in Part 1: How KV Cache & TTL Work.
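
For readers who want to reproduce a median-of-5 TTFT measurement, here is one way it could be structured. This is a sketch, not the harness used for the numbers in this post: the helper names are hypothetical, and a stand-in generator replaces the real streaming call so the block runs offline.

```python
import statistics
import time
from typing import Callable, Iterable, Tuple

def time_to_first_chunk(open_stream: Callable[[], Iterable]) -> float:
    """Seconds from opening the stream until its first item arrives."""
    start = time.perf_counter()
    for _ in open_stream():
        return time.perf_counter() - start
    raise RuntimeError("stream ended without producing any item")

def median_ttft(open_stream: Callable[[], Iterable],
                samples: int = 5) -> Tuple[float, float, float]:
    """Run several sequential samples; return (median, min, max) in seconds."""
    timings = [time_to_first_chunk(open_stream) for _ in range(samples)]
    return statistics.median(timings), min(timings), max(timings)

def fake_stream():
    """Stand-in for a streaming model call: waits, then yields text chunks."""
    time.sleep(0.05)
    yield "first"
    yield "second"

median, low, high = median_ttft(fake_stream)
print(f"median {median:.3f} s, range {low:.3f} to {high:.3f} s")
```

To measure a real endpoint, fake_stream would be replaced by a generator function that opens the SDK's streaming call and yields its text chunks. Because a generator body only starts executing on the first iteration, the request itself falls inside the timed window.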

The one real change: tokenization (since 4.7)

Here is the thing to re-check before you migrate. The same system text reports ~43% more input tokens on 4.7/4.8 than on 4.5/4.6.

| Model | Input tokens (identical text) | No-cache cost |
|---|---|---|
| claude-opus-4-5 | ~7,976 | $0.0364 |
| claude-opus-4-6 | ~7,977 | $0.0364 |
| claude-opus-4-7 | ~11,393 | $0.0522 |
| claude-opus-4-8 | ~11,394 | $0.0520 |

The token count jumps at the 4.7 generation and carries into 4.8. The cost tracks the token count almost exactly: the cost ratio (4.8 / 4.5) is 1.43, and the token ratio is 1.429. In other words, the per-token price is the same across the whole line — the higher bill on 4.7/4.8 comes entirely from the same text counting as more tokens.
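
The ratio argument can be checked with a few lines of arithmetic on the figures reported in this section. This is a sketch using the rounded published numbers; the implied per-token value is a blended per-call figure (it includes the small output cost), so it is only useful for comparing versions, not as a price quote.

```python
# Input tokens and no-cache cost for the identical text, from this section.
tokens = {"claude-opus-4-5": 7976, "claude-opus-4-8": 11394}
cost = {"claude-opus-4-5": 0.0364, "claude-opus-4-8": 0.0520}

token_ratio = tokens["claude-opus-4-8"] / tokens["claude-opus-4-5"]
cost_ratio = cost["claude-opus-4-8"] / cost["claude-opus-4-5"]

# Implied blended cost per reported input token, per model.
per_token = {model: cost[model] / tokens[model] for model in tokens}

print(f"token ratio: {token_ratio:.3f}")
print(f"cost ratio:  {cost_ratio:.3f}")
for model, value in per_token.items():
    print(f"{model}: ${value * 1e6:.2f} per million reported input tokens")
```

If the two ratios agree and the implied per-token values match, the higher bill is explained by token count alone.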

Two practical consequences:

  1. Re-budget on absolute cost, not on discount. Your cache discount is unchanged (~89% read), but the same English prompt is ~43% more expensive in absolute terms on 4.7/4.8 than it was on 4.6. If you sized a per-call budget against 4.6 token counts, it will be off.
  2. Re-check the 1,024-token cache-eligibility floor. Anthropic only caches prefixes at or above a minimum token count, and that minimum varies by model, so confirm the figure for your model in Anthropic’s docs. A prompt that sat just under the floor on 4.6 may clear it on 4.7/4.8 because the same text now counts as more tokens, and any prompt sized in tokens against the old tokenizer needs re-measuring. Always read cache_creation_input_tokens / cache_read_input_tokens from the live response rather than estimating from a local tokenizer that may not match.
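
One way to act on the second point is to classify each response by what the cache actually did, using only the two usage fields named above. The sketch below assumes the usage payload is available as a plain dict; the function name cache_status and the sample values are hypothetical.

```python
def cache_status(usage: dict) -> str:
    """Classify a response as a cache read, a cache write, or not cached."""
    written = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    if read > 0:
        return "read"        # at least part of the prefix came from the cache
    if written > 0:
        return "write"       # prefix was eligible and has just been stored
    return "not-cached"      # no marker, or prefix below the eligibility floor

# Hypothetical usage payloads for a cold call, a warm call, and a short prompt.
cold = {"input_tokens": 12, "cache_creation_input_tokens": 11394,
        "cache_read_input_tokens": 0}
warm = {"input_tokens": 12, "cache_creation_input_tokens": 0,
        "cache_read_input_tokens": 11394}
short = {"input_tokens": 900, "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 0}

for name, usage in [("cold", cold), ("warm", warm), ("short", short)]:
    print(name, cache_status(usage))
```

A prompt that carries a cache_control marker but keeps returning "not-cached" is a sign that the prefix is under the minimum size for that model.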

We’re describing a measured observation — identical text, ~43% more reported input tokens on 4.7/4.8 — most consistent with a tokenizer/vocabulary update at the 4.7 generation. The takeaway doesn’t depend on the root cause: re-measure token counts when you migrate, because the cache math is token-based.


Migration checklist (4.6/4.7 → 4.8)

  • Caching code carries over verbatim. cache_control markers, breakpoint count (up to 4), ttl: "1h", usage-field names — all identical.
  • TTL choices carry over. 5m for live/session workloads, 1h for bursty/agent-with-pauses.
  • Discount economics carry over. ~89% read, ~1.25× write (5m), ~2× write (1h).
  • ⚠️ Re-measure token counts. If you’re coming from 4.5/4.6, expect ~40%+ more input tokens for the same text (this happened at 4.7). Coming from 4.7, expect parity.
  • ⚠️ Re-validate cost dashboards. Trust usage.cost and the *_input_tokens fields from the live response, not a cached estimate from the old generation.

Bottom line

For an engineering team already caching against Opus, claude-opus-4-8 is the easy kind of upgrade: the entire caching and TTL surface is stable, so there’s nothing to relearn and no code to rewrite. Budget for the tokenizer shift if you’re jumping from 4.6 or earlier, confirm your numbers against the live usage object, and ship.

For the full caching playbook — prompt structure, hit-rate debugging, TTL-aware patterns — see the four-part series starting with How KV Cache & TTL Work and the working Python tutorial.


FAQ

Do I need to change my cache_control code to use Opus 4.8?
No. The marker syntax, breakpoint limit, and TTL options are identical to 4.7/4.6. Change the model field and nothing else.

Did the cache read discount change on 4.8?
No. A warm read is ~11% of the no-cache input price (~89% off) on 4.5 through 4.8, matching Anthropic’s documented rate.

Did the 1-hour TTL premium change?
No. The 1-hour write costs ~2× the no-cache input price; the 5-minute write costs ~1.25×. Reads are ~11% regardless of TTL. Same as 4.7.

Why is the same prompt more expensive on 4.8 than on 4.6?
The per-token price is the same; the prompt simply counts as more tokens. Identical text reported ~8.0K tokens on 4.5/4.6 and ~11.4K on 4.7/4.8 in our measurements (a ~43% increase), most consistent with a tokenizer change at the 4.7 generation. The cache discount is unchanged.

Is 4.8 a drop-in replacement for 4.7?
On the caching/TTL surface, yes: token counts and economics were already at the 4.7 level, so migration from 4.7 is parity. We don’t publish capability benchmarks we haven’t run; for quality and reasoning claims, see Anthropic’s model card.


Verification: all caching, TTL, token-count, cost, and TTFT figures measured against https://synthorai.io/ on 2026-05-29 using the official anthropic SDK, single tenant. Cost/token figures are a single sequential run; TTFT is a 5-sample median per model after gateway warm-up. Discount/premium ratios cross-checked against Anthropic Prompt Caching docs. Your numbers will vary with prompt, region, and load.
