What a Simple Transcription Test Can and Can't Tell You
Synthorai now transcribes audio, with seven dedicated speech-to-text models behind one OpenAI-compatible endpoint.
That one endpoint hides a lot of work, because natively these models barely resemble each other. whisper-1 takes a multipart file upload and returns {text}. gpt-4o-transcribe uses the same upload but adds token usage. ByteDance’s seed-asr speaks the BytePlus AUC protocol, Alibaba’s qwen3-asr-flash has its own, and Google’s chirp models are Cloud Speech-to-Text recognizers reached with OAuth.
Different endpoints, different auth, different response shapes, one more integration each. Through the gateway it is one OpenAI-compatible call: swap gpt-4o-mini-transcribe for seed-asr-bigmodel or chirp-3, and nothing else in your code changes.
The call is the OpenAI-compatible transcription endpoint, so it is a drop-in if you already use Whisper:
    curl https://synthorai.io/v1/audio/transcriptions \
      -H "Authorization: Bearer $SYNTHORAI_API_KEY" \
      -F file=@meeting.mp3 \
      -F model=gpt-4o-mini-transcribe

    from openai import OpenAI

    client = OpenAI(base_url="https://synthorai.io/v1", api_key="sk-syn-...")

    # Open the recording in binary mode and upload it for transcription.
    with open("meeting.mp3", "rb") as f:
        result = client.audio.transcriptions.create(model="gpt-4o-mini-transcribe", file=f)

    print(result.text)
The transcript comes back in the text field of the response, and the billed cost is in the x-total-cost-usd response header.
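One way to read that header from Python is through the SDK's raw-response wrapper, which exposes the HTTP headers next to the parsed body. This is a minimal sketch, not the gateway's documented client: the helper names `cost_from_headers` and `transcribe_with_cost` are hypothetical, and it assumes the header value is a plain decimal string, which the text above does not specify.

```python
from decimal import Decimal
from typing import Mapping, Optional


def cost_from_headers(headers: Mapping[str, str]) -> Optional[Decimal]:
    """Return the billed cost in USD, or None if the header is absent.

    Assumption: the value is a plain decimal string such as "0.0031".
    """
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("x-total-cost-usd")
    return Decimal(raw) if raw is not None else None


def transcribe_with_cost(client, path: str, model: str):
    """Transcribe one file and return (transcript text, billed cost).

    `client` is an openai.OpenAI instance whose base_url points at the gateway.
    """
    with open(path, "rb") as f:
        # with_raw_response returns the HTTP response instead of the parsed object.
        raw = client.audio.transcriptions.with_raw_response.create(model=model, file=f)
    return raw.parse().text, cost_from_headers(raw.headers)
```

Parsing into `Decimal` rather than `float` keeps small per-clip charges exact when you sum them over many files.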
We put all seven through the same simple test, and what that test is shapes every number below.
What this test is, and isn’t
We generated everyday passages with no proper nouns (a morning, the weather, a trip to the market) with a standard text-to-speech voice in each of the world’s five most-spoken languages, then transcribed each clip through all seven models. Each clip runs about 12 to 15 seconds, roughly 40 words of normal-paced speech with no long silences, encoded as 16 kHz mono 16-bit PCM WAV (256 kbps, about 2 MB a minute). The text is the ground truth and the durations are exact.
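The bitrate and file-size figures follow directly from the encoding parameters. The arithmetic below is a quick check of them; it counts raw PCM samples only and ignores the small WAV header.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz
BITS_PER_SAMPLE = 16      # 16-bit PCM
CHANNELS = 1              # mono

# Uncompressed PCM bitrate is simply rate x depth x channels.
bits_per_second = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS
kbps = bits_per_second / 1_000
megabytes_per_minute = bits_per_second / 8 * 60 / 1_000_000

print(f"{kbps:.0f} kbps, {megabytes_per_minute:.2f} MB per minute")
# prints: 256 kbps, 1.92 MB per minute
```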
This is a deliberately easy case: clean, scripted, single-speaker audio with no accents, noise, or jargon. That makes it good for the things that do not depend on how hard the audio is. It measures cost, latency, which languages a model accepts at all, and whether it can stream, and those are stable facts.
It is not a quality benchmark. Real recordings with accents, background noise, domain vocabulary, overlapping speakers, and an hour of runtime separate these models in ways clean speech never will, and nothing here predicts that. Read the accuracy numbers as a floor check, not a ranking, and treat the cost, coverage, and streaming results as the baseline you can actually rely on.
Dedicated ASR, not multimodal
All seven models here are dedicated speech-to-text systems: OpenAI’s whisper-1, gpt-4o-transcribe, and gpt-4o-mini-transcribe; ByteDance’s seed-asr-bigmodel; Alibaba’s qwen3-asr-flash; and Google’s chirp-2 and chirp-3.
The gateway can also point the transcription endpoint at a general multimodal model like Gemini, which transcribes as a side effect of understanding audio. We left those out, and we would not reach for them for transcription. Google says the same about its own lineup: its Gemini audio guide frames Gemini as a way to describe, summarize, or answer questions about audio, then sends you elsewhere for the transcript itself, “for dedicated speech to text models with support for real-time transcription, you should use the Google Cloud Speech-to-Text API” (the Chirp models). A multimodal model can emit a transcript, but it is built to understand audio rather than write down exactly what was said, and it is reached through a chat-style generateContent call instead of a transcription API. For verbatim speech-to-text, a dedicated model is the right tool, and the vendor that ships both agrees.
There are three ways to send the audio:
- File in, batch out: upload a complete recording, get the full transcript in one response. Every model supports it.
- File in, streamed text out: the same upload, but the transcript streams back over SSE as it is produced. Some models support this; others are batch-only.
- Audio stream in, text stream out: real-time recognition of a live mic or call. In development, not yet available, so everything below is the first two modes.
How transcription is billed
There are two billing shapes. Per audio-minute (whisper-1, seed-asr, qwen3-asr-flash, the Chirp models): you pay for the duration of the recording, whatever is in it. Per token (the gpt-4o models): audio tokenizes at a flat rate, and you pay for those input tokens plus the transcript output tokens, so silence is cheaper than dense speech.
The per-token shape has a trap: the listed input rate is for text, but audio bills higher (gpt-4o-mini-transcribe lists $1.25/M input but bills audio at $3/M). Estimate from the text rate and you undershoot. The gateway returns the real charge in an x-total-cost-usd header, so read that rather than guessing from a price page.
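To see the size of that undershoot, here is a small sketch using the two gpt-4o-mini-transcribe rates quoted above. The token count is a made-up figure for illustration only, and the calculation covers input tokens alone, so it leaves out the transcript output tokens that are billed on top.

```python
TEXT_INPUT_USD_PER_M = 1.25   # listed input rate (text)
AUDIO_INPUT_USD_PER_M = 3.00  # rate billed for audio input


def input_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD for a number of input tokens at a per-million rate."""
    return tokens * usd_per_million / 1_000_000


audio_tokens = 10_000  # hypothetical token count, not a measured value
naive = input_cost(audio_tokens, TEXT_INPUT_USD_PER_M)
actual = input_cost(audio_tokens, AUDIO_INPUT_USD_PER_M)

print(f"naive ${naive:.4f} vs billed ${actual:.4f} ({actual / naive:.1f}x)")
```

The ratio is the same whatever the token count: an estimate built on the text rate comes out 2.4 times too low on the input side.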
Cost
This is the part the test pins down cleanly, and it varies the most. Cost per minute, from the billed header:
| Model | Cost / min | Latency | Streams |
|---|---|---|---|
| seed-asr-bigmodel | $0.0020 | ≈10s | no |
| qwen3-asr-flash | $0.0021 | ≈3s | no |
| gpt-4o-mini-transcribe | $0.0031 | ≈3s | token-by-token |
| whisper-1 | $0.0060 | ≈4s | no |
| gpt-4o-transcribe | $0.0062 | ≈2s | token-by-token |
| chirp-2 | $0.0164 | ≈3s | no |
| chirp-3 | $0.0164 | ≈4s | no |
The spread is about 8x, from seed-asr at $0.0020 a minute to the Chirp models at $0.0164. The cheapest model, seed-asr, only handles English and Chinese (more on that below), so the cheapest one that covers all five languages we tested is qwen3-asr-flash at $0.0021. The Chirp models are the most expensive by a wide margin, and chirp-3 is the one to reach for if you use Chirp at all: it matches chirp-2’s price and roughly its speed but transcribes Mandarin far better, as the accuracy table shows.
How these numbers move with your own files depends on the billing shape. The per-minute models (whisper-1, seed-asr, qwen3-asr-flash, the Chirps) bill by duration alone, so the rate is portable: ten minutes of audio costs ten times the per-minute figure, whatever the format or content.
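For the per-minute models, projecting a bill is a single multiplication. The sketch below uses the measured rates from the table and a hypothetical helper, `projected_cost`. It assumes billing is strictly proportional to duration, with no rounding to whole seconds or minutes and no minimum charge, which the measurements here do not confirm.

```python
# Measured per-minute rates for the duration-billed models, in USD.
PER_MINUTE_USD = {
    "seed-asr-bigmodel": 0.0020,
    "qwen3-asr-flash": 0.0021,
    "whisper-1": 0.0060,
    "chirp-2": 0.0164,
    "chirp-3": 0.0164,
}


def projected_cost(model: str, seconds: float) -> float:
    """Estimate the charge for a recording of the given length."""
    return PER_MINUTE_USD[model] * seconds / 60


# Project a one-hour recording across the per-minute models.
for model in PER_MINUTE_USD:
    print(f"{model:<20} ${projected_cost(model, 3600):.3f} per hour")
```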
The per-token models (the gpt-4o rows) scale their input cost with duration, not file size, because the provider resamples the audio before tokenizing. A heavy 320 kbps MP3 and our lean 16 kHz WAV of the same words tokenize to about the same cost, so compressing your files saves storage, not transcription spend. What does move a per-token bill is how much is actually spoken: our clips are normal-paced with no dead air, so audio with denser speech bills a little more on the output tokens, and audio with more silence a little less. The x-total-cost-usd header is the ground truth in every case.
Accuracy and language coverage
On English, Spanish, and French, every model that accepts the language scored about 0% error. That is the floor, and everyone clears it. Mandarin and Hindi are where even this easy test starts to show cracks, but read that as a hint about where to point your own testing, not a verdict:
| Model | Mandarin (CER) | Hindi (WER) | Coverage |
|---|---|---|---|
| seed-asr-bigmodel | 0% | fails | English + Chinese only |
| qwen3-asr-flash | 0% | 15% | all five |
| gpt-4o-mini-transcribe | 0% | 4% | all five |
| whisper-1 | 0% | 22% | all five |
| gpt-4o-transcribe | 0% | 13% | all five |
| chirp-2 | 16% | 15% | all five |
| chirp-3 | 2% | 15% | all five |
The hard fact here is coverage, not accuracy. seed-asr returns a useless transcript for Hindi, Spanish, and French: it is an English-and-Chinese model, so it is only an option if your audio is one of those two languages. The other six handled all five.
The Hindi spread and the Mandarin slip (chirp-2 at 16%, which chirp-3 fixes) say those models are worth testing on your harder languages before you trust them, not that one is better than another. The absolute numbers are inflated by the synthetic voice and the scoring and move from run to run. The honest read is that on clean speech in major languages, accuracy is not where these models separate, so it is not where this test can tell you to choose.
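For running the same check on your own recordings, word error rate (WER) and character error rate (CER) are both an edit distance divided by the reference length. The sketch below is the textbook definition, not necessarily the scoring used for the table above, which is not described here. It applies no normalization, so differences in punctuation or casing count as errors, which is one way scoring can inflate the figures.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free if equal
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


print(f"{wer('the cat sat on the mat', 'the cat sat on a mat'):.1%}")
```

CER is the usual choice for Mandarin because written Chinese has no spaces between words, so whitespace splitting would treat a whole sentence as one token.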
Streaming output
Whether a model can stream its transcript is a capability, not a quality call, and it splits the lineup. The per-minute models (whisper-1, seed-asr, qwen3-asr-flash, and both Chirps) are batch-only; the gateway returns a 400 if you ask them to stream. The gpt-4o models stream token by token: gpt-4o-transcribe returns its first words in about a second and fills in the rest, which is what a live-feel UI needs. Cost is unchanged from batch. To stream, add stream=true:
    curl -N https://synthorai.io/v1/audio/transcriptions \
      -H "Authorization: Bearer $SYNTHORAI_API_KEY" \
      -F file=@meeting.mp3 -F model=gpt-4o-transcribe -F stream=true
    # data: {"type":"transcript.text.delta","delta":"When"}
    # data: {"type":"transcript.text.delta","delta":" you"} ...
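On the client side, consuming that stream means reading the `data:` lines and joining the deltas. This is a minimal sketch built around the two event lines shown above. The function name `iter_deltas` is hypothetical, and skipping a `[DONE]` sentinel and any non-delta event types is a defensive assumption rather than documented gateway behavior.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield transcript fragments from the lines of an SSE response body."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue  # assumed end-of-stream sentinel, if the server sends one
        event = json.loads(payload)
        if event.get("type") == "transcript.text.delta":
            yield event["delta"]


# The two sample events from the curl output above.
sample = [
    'data: {"type":"transcript.text.delta","delta":"When"}',
    "",
    'data: {"type":"transcript.text.delta","delta":" you"}',
]
print("".join(iter_deltas(sample)))  # prints: When you
```

The deltas carry their own leading spaces, so they are joined with an empty string rather than a space. In a real client, `lines` would be the decoded line iterator of a streaming HTTP response.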
Caching
There is effectively no caching for these models. The per-minute ones bill by duration, so repetition earns no discount: we sent the same clip to whisper-1 five times and paid an identical $0.015478 every time. The gpt-4o token-billed models list no cache rate and showed only ordinary run-to-run variation on repeats. So plan around the per-minute rate; re-sending the same file does not get cheaper.
What to check first, and what to test yourself
This test cannot tell you which model is most accurate on your recordings. It can tell you what to filter on before you run your own evaluation:
- Languages. Check that the model accepts every language you need. seed-asr is English and Chinese only; the other six handled all five we tried. This is a hard gate, not a preference.
- Streaming. If you need a live transcript, only the gpt-4o models stream token by token; the per-minute models are batch-only.
- Cost. The spread is about 8x. The cheapest model that covers every language is qwen3-asr-flash at $0.0021; the Chirps are the most expensive, and chirp-3 is the only reason to pick Chirp over the cheaper models. There is no caching to lean on, so the per-minute rate is what you pay.
- Model type. Use a dedicated speech-to-text model, not a general multimodal one. It is the right tool for verbatim transcription, and the vendors that ship both say so.
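The first three filters can be sketched as a small lookup over the measured results. The rates, streaming flags, and coverage come from the tables above; the `Model` class, the `shortlist` function, and the two-letter language codes are illustrative labels, and coverage here means only the five languages tested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_min: float
    streams: bool
    languages: frozenset


# Language codes are labels for the five tested languages only.
ALL_FIVE = frozenset({"en", "zh", "hi", "es", "fr"})

MODELS = [
    Model("seed-asr-bigmodel", 0.0020, False, frozenset({"en", "zh"})),
    Model("qwen3-asr-flash", 0.0021, False, ALL_FIVE),
    Model("gpt-4o-mini-transcribe", 0.0031, True, ALL_FIVE),
    Model("whisper-1", 0.0060, False, ALL_FIVE),
    Model("gpt-4o-transcribe", 0.0062, True, ALL_FIVE),
    Model("chirp-2", 0.0164, False, ALL_FIVE),
    Model("chirp-3", 0.0164, False, ALL_FIVE),
]


def shortlist(needed_languages, need_streaming=False):
    """Return models that pass the hard gates, cheapest first."""
    needed = frozenset(needed_languages)
    survivors = [
        m for m in MODELS
        if needed <= m.languages and (m.streams or not need_streaming)
    ]
    return sorted(survivors, key=lambda m: m.usd_per_min)


for m in shortlist({"en", "hi"}, need_streaming=True):
    print(m.name, m.usd_per_min)
```

The output is a candidate list for your own evaluation, not a ranking: it says which models are eligible and what they cost, nothing about accuracy on real audio.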
Once a few models clear those, the question that is left, how accurate each one is on your own audio with its accents, noise, and vocabulary, is the one you have to answer yourself. No clean-speech benchmark substitutes for running the survivors on real recordings.
Bottom line
On clean, scripted speech in major languages, all seven models are about equally accurate, which is the most useful thing this test says: accuracy is not the axis to choose on. What it does pin down, and what genuinely varies, is the baseline: cost spans about 8x, one model covers only two languages, and the per-minute models cannot stream. Use those to narrow the field, not to declare a winner, then run the two or three survivors on your own audio. And reach for a dedicated speech-to-text model rather than a general multimodal one, which is what the vendors building both recommend.
Sources
- OpenAI: Speech to text guide
- OpenAI: API pricing
- Google: Gemini API audio understanding (use Cloud Speech-to-Text for dedicated transcription)
- Google Cloud: Chirp 3 transcription model
- BytePlus: Seed-ASR (ByteDance) overview
Costs and latencies were measured on Synthorai on 2026-06-25 across seven dedicated transcription models and five languages (English, Mandarin, Hindi, Spanish, French), via the x-total-cost-usd header and SSE timing. The audio was text-to-speech generated and deliberately easy, so the accuracy figures are a floor check rather than a quality benchmark; real-world speech with accents and noise would separate these models differently. Latency varies run to run. Listed prices are this platform’s rates as of that date. Verify current pricing before relying on it.