What Actually Drives Your Image-Generation Bill

Contents
  1. How image models differ
  2. We measured it
  3. The decision rule
  4. Why you can trust these numbers
  5. Bottom line
  6. FAQ
  7. Sources

We added image generation to a gateway built for text LLMs and measured what drives the cost across four variables: model, resolution, image count, and quality. The largest lever is quality, a parameter most image APIs expose and most callers leave on default. Resolution, prompt caching, and batching matter far less than people expect.


How image models differ

Image models aren’t drop-in swaps for one another. They diverge on several axes, and only one of them (billing shape) is about price. The active catalog at a glance:

Family                         Billing     Quality knob  Batch n>1  Resolution
gpt-image (OpenAI)             per-token   low/med/high  ✓          up to ≈2K
gemini-image (Google)          per-token   ✗             ✗ 1/call   1K (gemini-3: to 4K)
qwen-image / wan2.7 (Alibaba)  flat/image  ✗             ✓          512²–2048²
seedream (BytePlus)            flat/image  ✗             ✗ 1/call   ≥1920² (4.5/5.0)

The axes that bite if you assume one model behaves like another:

  • Billing shape. Per-token (gpt-image, gemini) or flat-per-image (qwen, wan, seedream). This is the axis that decides your bill, and it’s the subject of the next section.
  • The quality knob. Only gpt-image has it (low/medium/high). Gemini changes fidelity by model tier (flash to pro) or image_size; flat models have no such dial. That one knob swings the bill about 36×, so it’s the main cost lever, covered below.
  • Batch (n>1) isn’t universal. gpt-image, qwen, and wan return several images per call. Every Gemini and Seedream image model is one-image-per-call: n=2 returns a 400, so you issue N requests and orchestrate the batch yourself.
  • Resolution limits cut both ways. gemini-2.5-flash-image caps at 1K (1 MP), while gemini-3 reaches 2K/4K (and its bill roughly doubles from 1K to 4K). Seedream 4.5/5.0 enforce a floor of about 1920² and reject anything smaller. qwen-image lives in a 512²–2048² band. Higher resolution isn’t always available, and dropping resolution to save money isn’t always allowed.
  • Control knobs and image-to-image differ. Only some models accept seed, negative_prompt, or guidance_scale, and the reference-image limit for editing runs from 3 (gemini-2.5) to 16 (gpt-image).
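The batch caveat above is the one most likely to need code on the caller's side. A minimal sketch of the fan-out, assuming a hypothetical `generate(model, prompt, n)` client function that returns a list of images (it stands in for whatever SDK or gateway call you use, and is not a real API):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Families the article reports as one-image-per-call (n>1 returns a 400).
ONE_PER_CALL_PREFIXES = ("gemini", "seedream")


def generate_batch(
    generate: Callable[[str, str, int], List[bytes]],
    model: str,
    prompt: str,
    n: int,
    max_workers: int = 4,
) -> List[bytes]:
    """Return n images, fanning out to n single-image calls when required.

    `generate` is a hypothetical client function, supplied by the caller.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not model.startswith(ONE_PER_CALL_PREFIXES):
        # The model accepts n>1 natively, so one request is enough.
        return generate(model, prompt, n)
    # One request per image, issued concurrently; results keep request order.
    with ThreadPoolExecutor(max_workers=min(max_workers, n)) as pool:
        futures = [pool.submit(generate, model, prompt, 1) for _ in range(n)]
        return [future.result()[0] for future in futures]
```

Matching on a name prefix is a shortcut for the sketch; a real gateway would more likely keep this flag in its model catalog.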

The quality knob has one non-obvious property. For gpt-image, an output token is a billing unit, not a measure of the file you get. OpenAI assigns the count from a published per-(quality × size) rate table (272 / 1,056 / 4,160 tokens for low / medium / high at 1024² on gpt-image-1), so the count is set by quality, not derived from the bytes returned. We checked on gpt-image-2: the same prompt at 1024² across all three tiers produced 1024×1024 PNGs of roughly the same file size (about 0.9 MB), yet billed 196, 1,756, and 7,024 output tokens. Same resolution, same byte size, 36× the cost. You pay for rendering effort, not pixels, which is why you read usage rather than eyeball the output.
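Reading usage rather than the file comes down to a few lines of arithmetic. The sketch below prices the three measured tiers from their token counts. The $30 per million output rate is the one used in this article's crossover arithmetic; the $5 per million input rate is an assumption, inferred from 21 input tokens costing about $0.0001.

```python
# Rates in USD per million tokens.
OUTPUT_USD_PER_M = 30.0  # the rate used in the article's crossover arithmetic
INPUT_USD_PER_M = 5.0    # ASSUMPTION: inferred from 21 input tokens costing about $0.0001


def image_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Price one generation from the token counts reported in its usage."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Output tokens measured at 1024² on gpt-image-2, with a 21-token prompt.
MEASURED_OUTPUT_TOKENS = {"low": 196, "medium": 1756, "high": 7024}

for quality, output_tokens in MEASURED_OUTPUT_TOKENS.items():
    cost = image_cost_usd(21, output_tokens)
    print(f"{quality:<6} {output_tokens:>5} tok  ${cost:.4f}")
```

Under those rates the three tiers come out at roughly $0.006, $0.053, and $0.211, with the prompt contributing about a hundredth of a cent to each.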

One capability none of these models has is prompt caching, usually the first cost-saving idea people reach for. Image generation is stateless: there’s no conversation or KV state to reuse, the usage object carries no cache fields, and (as we measure below) batching doesn’t share the prompt either. Caching is a chat feature, not an image one, which rules out a common assumption about cutting image cost.


We measured it

Same e-commerce-style product prompt, real generations through the gateway, with cost computed from the returned usage against each model’s published rates. Five findings, each from a separate sweep.

1. The image is the cost, not the prompt. In text-to-image (a prompt in, an image out), the bill is 97–100% output tokens: a 1024² gpt-image-2 generation at low quality uses 21 input and 196 output tokens (about $0.0001 plus $0.0059), and gemini-2.5-flash-image takes just 10 input tokens against 1,290 output. The prompt you write is a rounding error, but only because it’s text. Feed an image instead (image-to-image, like “make this mug blue”) and the input tokenizes large:

Model                   t2i input  i2i input (1 ref)  Output
gpt-image-2 (low)       21 tok     1,043 tok          196 tok
gemini-2.5-flash-image  10 tok     1,297 tok          1,290 tok

The input jumps 50–130×, and it scales linearly: each extra reference adds about 1,025 tokens on gpt-image-2 (1, 2, and 3 references measured at 1,043, 2,068, and 3,093). At low quality those input tokens outnumber the generated output five-to-one. The principle holds either way: an image is the cost, whether you generate it or supply it, and the prompt never is. The rest of this article stays in text-to-image; the fuller image-to-image economics are their own follow-up.
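The linear scaling can be checked directly from the three gpt-image-2 measurements. This sketch fits a line through the first and last points and confirms the middle one lands on it; the function names are illustrative only.

```python
# References supplied -> input tokens billed (gpt-image-2, as measured above).
MEASURED_INPUT_TOKENS = {1: 1043, 2: 2068, 3: 3093}
LOW_QUALITY_OUTPUT_TOKENS = 196


def fit_line(points: dict) -> tuple:
    """Return (slope, intercept) of the line through the first and last point."""
    xs = sorted(points)
    x0, x1 = xs[0], xs[-1]
    slope = (points[x1] - points[x0]) / (x1 - x0)
    intercept = points[x0] - slope * x0
    return slope, intercept


slope, intercept = fit_line(MEASURED_INPUT_TOKENS)


def predicted_input_tokens(references: int) -> float:
    """Input tokens the fitted line predicts for a given reference count."""
    return intercept + slope * references


print(f"tokens per reference: {slope:.0f}")
print(f"fixed overhead:       {intercept:.0f}")
print(f"predicted for 2 refs: {predicted_input_tokens(2):.0f}")
ratio = MEASURED_INPUT_TOKENS[1] / LOW_QUALITY_OUTPUT_TOKENS
print(f"input/output at low:  {ratio:.1f}x")
```

The fit gives 1,025 tokens per reference plus a fixed 18, which is presumably the text prompt, and one reference alone is a little over five times the 196 output tokens.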

2. Model choice is a 6× lever. Identical 1024² request, default quality:

Model                   Billing                  Cost / image
gpt-image-2             token · quality knob     $0.0060
gpt-image-1-mini        token · quality knob     $0.0085
seedream-4-0            per-request flat         $0.030
qwen-image-2.0          per-request flat         $0.035
gemini-2.5-flash-image  token · no quality knob  $0.0387

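A 6.4× spread between the cheapest and priciest path. Both ends of that range are per-token models, so the gap comes down to how many output tokens each one emits: 196 for gpt-image-2 against 1,290 for gemini-2.5-flash-image.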

3. Resolution barely moves it. Sweeping gpt-image-2 from 1024² to 2048² quadrupled the pixel count but only doubled per-image cost ($0.0060 to $0.0121, or 196 to 397 output tokens); output tokens aren’t proportional to pixels. gemini-2.5-flash-image returned the same 1,290 tokens whatever size we requested, because it’s 1K-only and size only changes the aspect ratio. (The gemini-3 image tiers do honor image_size, roughly doubling cost from 1K to 4K, but 2.5-flash-image, the model we cost here, does not.) Per-image flat models are resolution-independent by definition. So far the per-token model looks hard to beat.

4. Quality is the crossover. Sweep gpt-image-2 across quality tiers:

Quality  1024²                2048²
low      $0.0060 (196 tok)    $0.0121 (397 tok)
medium   $0.053 (1,756 tok)   $0.107 (3,568 tok)
high     $0.211 (7,024 tok)   $0.428 (14,272 tok)

Output tokens scale about 9× from low to medium and about 36× from low to high. At low quality the per-token model is the cheapest option; at medium or high it passes the flat per-image price ($0.03–0.035). The crossover sits where the arithmetic puts it, around 1,000 output tokens ($0.03 ÷ $30/M): low is under it, medium is over. This also corrects an earlier conclusion of ours. “Per-token is always cheapest” was an artifact of testing at default low quality.
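The crossover is simple enough to compute for any flat price. A short sketch, using the $30 per million output rate from the arithmetic above and ignoring the text prompt, which is a rounding error in text-to-image; the function names are illustrative.

```python
OUTPUT_USD_PER_M = 30.0  # per-token output rate used in the article's arithmetic


def breakeven_output_tokens(flat_price_usd: float) -> float:
    """Output-token count at which per-token cost equals the flat price."""
    return flat_price_usd / OUTPUT_USD_PER_M * 1_000_000


def cheaper_billing(output_tokens: int, flat_price_usd: float) -> str:
    """Return which billing shape is cheaper for one text-to-image call."""
    per_token_cost = output_tokens * OUTPUT_USD_PER_M / 1_000_000
    return "per-token" if per_token_cost < flat_price_usd else "flat"


# Measured gpt-image-2 output tokens at 1024², compared with a $0.03 flat image.
for quality, tokens in {"low": 196, "medium": 1756, "high": 7024}.items():
    print(f"{quality:<6} {tokens:>5} tok -> {cheaper_billing(tokens, 0.03)}")
```

At the $0.035 flat price the break-even moves to about 1,167 tokens, which still leaves low under it and medium over it.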

[Figure: the same prompt rendered by gpt-image-2 at low, medium, and high quality, as three equally sharp 1024² product photos.]

Same prompt, gpt-image-2, 1024². Low / medium / high bill 196 / 1,756 / 7,024 output tokens, or $0.006 / $0.053 / $0.211: a 36× spread at identical resolution. For a clean product shot like this the three are hard to tell apart, so the cheapest tier is often enough. Set quality to the job instead of defaulting to high.

5. You can’t share a prompt across images. Generating n images in one call doesn’t amortize the prompt. gpt-image-2 bills it N times: input tokens went from 28 to 112 at n=4, and a long brand prompt went from 499 to 1,996. Per-image cost was identical at n=1 and n=4. With no caching either, there’s no prompt-cost-sharing mechanism for image generation. You pay per output image, and the prompt is re-billed each time.
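The batching result is easy to model: input tokens multiply by n, so the per-image figure never moves. A sketch under two assumptions, a $5 per million input rate (inferred from the 21-token prompt costing about $0.0001) and low-quality output of 196 tokens per image:

```python
INPUT_USD_PER_M = 5.0    # ASSUMPTION: inferred from 21 input tokens costing about $0.0001
OUTPUT_USD_PER_M = 30.0  # per-token output rate used in the article's arithmetic


def batch_usage(prompt_tokens: int, output_tokens_per_image: int, n: int) -> dict:
    """Model a batch call in which the prompt is re-billed for every image."""
    input_tokens = prompt_tokens * n
    output_tokens = output_tokens_per_image * n
    total = (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000
    return {
        "input_tokens": input_tokens,
        "total_usd": total,
        "per_image_usd": total / n,
    }


short_single = batch_usage(28, 196, n=1)
short_batch = batch_usage(28, 196, n=4)
long_batch = batch_usage(499, 196, n=4)

print(short_batch["input_tokens"])  # 112, as measured
print(long_batch["input_tokens"])   # 1996, as measured
print(f"{short_single['per_image_usd']:.5f} vs {short_batch['per_image_usd']:.5f}")
```

If a provider did share the prompt across a batch, the input count would stay at the single-call figure and the per-image cost would fall as n grew; the measured counts show the opposite.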


The decision rule

For text-to-image, it comes down to quality, not the things people assume:

  • Low / draft / thumbnail quality: a token-with-quality model (gpt-image, about $0.006–0.012). Cheapest at any resolution up to about 2K.
  • Medium / high quality: per-request flat (seedream / qwen, $0.03–0.035). The per-token bill runs away ($0.05–0.43 in our sweep), and flat is both cheaper and quality-independent.
  • gemini (about $0.039 at default 1K) is rarely the cost-optimal pick. It’s undercut by gpt-image at low quality and by per-request flat at medium and high. It has no quality dial; you’d choose its Pro tier or a higher image_size for output quality, not for price.
  • Resolution moves cost about 2× within a quality tier, not enough to flip the choice. Quality flips it.
  • n>1, caching, and batching never reduce per-image cost. There’s nothing to share.
  • Image-to-image: default to flat per-image. A reference image is input, and only per-token models surcharge it (about 1,025 tokens each); flat models include it for free. For editing, seedream / qwen usually win. gpt-image stays cheaper only for low-quality edits with a few references (around 5 crosses the flat price), and loses once quality or reference count climbs.
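The reference-count threshold in the last rule can be reproduced from the measured token counts. This is a sketch under stated assumptions: reference-image tokens are billed at the same $5 per million as text input (inferred from the 21-token prompt, not a published rate), each reference adds 1,025 tokens on top of a fixed 18, and flat models charge nothing extra for references.

```python
from typing import Optional

INPUT_USD_PER_M = 5.0        # ASSUMPTION: same rate for text and reference-image tokens
OUTPUT_USD_PER_M = 30.0      # per-token output rate used in the article's arithmetic
TOKENS_PER_REFERENCE = 1025  # measured on gpt-image-2
FIXED_INPUT_TOKENS = 18      # remainder of the measured 1,043 after one reference
MAX_REFERENCES = 16          # gpt-image reference limit cited in the article


def first_losing_reference_count(flat_price_usd: float, output_tokens: int) -> Optional[int]:
    """Smallest reference count at which per-token costs more than the flat price."""
    for references in range(MAX_REFERENCES + 1):
        input_tokens = FIXED_INPUT_TOKENS + TOKENS_PER_REFERENCE * references
        cost = (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000
        if cost > flat_price_usd:
            return references
    return None  # per-token stays cheaper across the whole allowed range


print(first_losing_reference_count(0.03, 196))    # low quality against a $0.03 flat image
print(first_losing_reference_count(0.035, 196))   # low quality against a $0.035 flat image
print(first_losing_reference_count(0.03, 1756))   # medium quality loses before any reference
```

With these assumptions the low-quality crossover lands at 5 references against a $0.03 flat price and 6 against $0.035; a higher image-input rate would pull it lower.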

E-commerce is the clearest example. Say you generate product photos by sending the same long brand prompt for every item in the catalog, and you assume caching that repeated prompt will save money. That fails for two reasons: the prompt was never the cost (the image is), and there’s no caching for generation anyway. If your catalog imagery needs medium quality or higher, as production product photos often do, the right choice is a flat per-image model, which is both cheaper and more predictable regardless of how repetitive your prompts are.

The capability gates from the opening section can still override the choice: one-image-per-call models, resolution floors and ceilings, data-residency limits, and which knobs (seed, negative_prompt, guidance_scale) a model exposes. Pick on cost, then confirm the capability fits.
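Put together, "pick on cost, then confirm the capability fits" might look like the sketch below. The catalog is a simplified, hypothetical subset of the table in the opening section (sides in pixels, with ≈2K taken as 2048), and the five-reference threshold is the approximate figure quoted in the rule above rather than an exact cutoff.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Family:
    name: str
    billing: str             # "per-token" or "flat"
    min_side: Optional[int]  # None: no floor stated in the article
    max_side: Optional[int]  # None: no ceiling stated in the article


# Hypothetical, simplified catalog for illustration only.
CATALOG = [
    Family("gpt-image", "per-token", None, 2048),
    Family("qwen-image", "flat", 512, 2048),
    Family("seedream-4.5", "flat", 1920, None),
]


def billing_for(task: str, quality: str, references: int = 0) -> str:
    """Step 1: pick the billing shape on cost alone."""
    if task == "text-to-image":
        return "per-token" if quality == "low" else "flat"
    if task == "image-to-image":
        # Per-token only wins for low-quality edits with a few references.
        return "per-token" if quality == "low" and references < 5 else "flat"
    raise ValueError(f"unknown task: {task}")


def candidates(task: str, quality: str, side: int, references: int = 0) -> List[str]:
    """Step 2: keep only families whose resolution limits admit the request."""
    wanted = billing_for(task, quality, references)
    return [
        family.name
        for family in CATALOG
        if family.billing == wanted
        and (family.min_side is None or side >= family.min_side)
        and (family.max_side is None or side <= family.max_side)
    ]


print(candidates("text-to-image", "medium", side=1024))  # ['qwen-image']
```

Here the resolution floor removes seedream-4.5 from a 1024² request even though its billing shape is the right one, which is the kind of override the paragraph above describes.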


Why you can trust these numbers

These figures come from real usage against each vendor’s list rates, not estimates. Image billing on our gateway is sessionless: it settles only on a 2xx (a failed generation is never charged), pre-checks the worst-case cost before any spend, and bills a missing-usage response at the ceiling rather than silently $0. The principle is the same one we apply everywhere: trust the cost, not a figure the vendor hands you. It’s the method we used to audit whether a gateway lies about cache.
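The three billing rules described above fit in a few lines. This is an illustrative sketch, not the gateway's actual code; the function names, parameters, and the example ceiling are hypothetical.

```python
from typing import Optional


def precheck(balance_usd: float, ceiling_usd: float) -> bool:
    """Admit a request only if the balance covers its worst-case cost."""
    return balance_usd >= ceiling_usd


def settle_charge(
    status_code: int,
    usage_cost_usd: Optional[float],
    ceiling_usd: float,
) -> float:
    """Return the amount to charge once the upstream response is known."""
    if not 200 <= status_code < 300:
        return 0.0  # a failed generation is never charged
    if usage_cost_usd is None:
        return ceiling_usd  # missing usage bills at the ceiling, not at $0
    return usage_cost_usd


# Example ceiling: the high-quality 2048² figure from the quality sweep.
CEILING_USD = 0.428
print(precheck(balance_usd=1.00, ceiling_usd=CEILING_USD))  # True
print(settle_charge(200, 0.0060, CEILING_USD))              # 0.006
print(settle_charge(200, None, CEILING_USD))                # 0.428
print(settle_charge(500, None, CEILING_USD))                # 0.0
```

Billing a missing-usage response at the ceiling is the conservative choice: it makes a dropped usage object visible on the bill instead of letting it pass as a free image.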


Bottom line

Image generation looks like just another endpoint, but the billing unit changed. For text-to-image the lever isn’t the prompt (no caching, no batch sharing) or the resolution. It’s quality: gpt-image is cheapest at low, per-image flat (seedream / qwen) wins at medium and high, with the crossover near 1,000 output tokens. Set quality deliberately, match the model to it, and check the cost. When you move from generating to editing (feeding in a reference image), re-run the math, because the input image becomes the cost.


FAQ

Does prompt caching reduce image-generation cost?
No. Generation is stateless: the usage object has no cache fields, and batching re-bills the prompt per image. The cost is the output image, not the text.

Per-token or per-image, which is cheaper?
It depends on quality. For low or draft quality, pick a quality-knob model like gpt-image (about $0.006–0.012). For medium or high, pick per-image flat like seedream/qwen ($0.03–0.035), because the per-token bill runs away. For image-to-image the answer tilts further toward flat: flat models include reference images for free, while per-token models surcharge about 1,025 tokens each.


Sources

All checked 2026-06-19. Not financial advice; verify current pricing before relying on it.
