GLM 5.2: Reasoning Effort Is the Cost Lever

GLM 5.2: Reasoning Effort Is the Cost Lever

Contents
  1. What GLM 5.2 is
  2. Where it sits on price
  3. The reasoning-effort dial
  4. An easy task: reasoning just adds cost
  5. A hard task: reasoning earns its keep, the default does not
  6. The decision rule
  7. Caching helps the input, not the reasoning
  8. Using it on Synthorai
  9. Bottom line
  10. Sources

GLM 5.2 is now on Synthorai at about a sixth of frontier per-token prices, and the open-weight, frontier-benchmark headline is real. But the per-token price is the wrong number to anchor on. What a coding task actually costs on GLM 5.2 swings by more than an order of magnitude depending on a single knob, reasoning effort, and the default leaves that knob in the worst position. Set it well and GLM 5.2 is correct and cheaper than frontier on both easy and hard work. Leave it on the default and the same answer costs twenty times more and takes minutes. We measured it.


What GLM 5.2 is

GLM 5.2 is Zhipu’s open-weight frontier model, released 2026-06-13: a mixture-of-experts network (~744B total, ~40B active), a usable 1M-token context, and an MIT license you can self-host. It targets coding and agentic work, with strong published benchmarks (SWE-bench Pro 62.1, Terminal-Bench 2.1 81.0, AIME 2026 99.2, GPQA Diamond 91.2). On Synthorai it’s glm-5.2, priced at $1.40 per million input tokens and $4.40 per million output.

The detail that drives everything below: it is a reasoning model, and how much it reasons is something you set.

Where it sits on price

On per-token listing price, GLM 5.2 sits well below the Western frontier and among the cheaper Chinese models. Synthorai’s rates for a representative set:

ModelInput ($/M)Output ($/M)Cache read ($/M)
deepseek-v4-pro0.440.870.0036
kimi-k2.50.573.010.12
glm-5.21.404.400.26
qwen3-max1.206.000.36
gemini-3.1-pro2.0012.000.20
claude-opus-4-85.0025.000.50
gpt-5.55.0030.000.50

Its $4.40 output rate is about a seventh of gpt-5.5 and a sixth of claude-opus-4-8, though deepseek-v4-pro and kimi-k2.5 undercut it. So GLM 5.2 is frontier-class capability at roughly Chinese-model prices, not the absolute floor. There is no separate cache-write charge: a cache write bills at the input rate, and only the cache read is discounted to the rate above. The discount varies by vendor, with GLM 5.2’s cache read about a fifth of its input rate and the frontier models (gpt-5.5, claude-opus-4-8, gemini-3.1-pro) discounting reads to roughly a tenth.

It is also a step up from its own predecessors. The previous GLM generation was extraordinarily cheap; the GLM 5 line raised prices, and GLM 5.2 lands at about 3x the input rate of GLM-4.6 (Zhipu’s official rates):

GLM modelReleasedInput ($/M)Output ($/M)
GLM-4.52025-070.602.20
GLM-4.62025-090.431.74
GLM-520261.003.20
GLM-5.22026-061.404.40

That buys the 1M context and the frontier benchmarks. But the per-token rate is only the headline. What you actually pay per task is set by the reasoning effort.

The reasoning-effort dial

GLM 5.2’s reasoning is a dial, not a switch. You can turn it off (enable_thinking: false), set reasoning_effort to low, medium, or high, or leave it on the default, which runs reasoning unbounded. That setting changes cost and latency by far more than the price does. We ran one easy and one hard coding task across the settings, checking every answer against a reference on hundreds of randomized cases.

An easy task: reasoning just adds cost

Weighted interval scheduling, a moderate dynamic-programming problem:

ModeReasoning tokensAnswer tokensCostLatencyCorrect
glm-5.2, thinking off0169$0.0008≈5syes
glm-5.2, reasoning_effort: low1,563150$0.007639syes
glm-5.2, unbounded default≈6,290≈150$0.0285137syes
gpt-5.5 (reference)59141$0.00644.8syes
claude-opus-4-8 (reference)0201$0.00573.3syes

Two things stand out. Thinking off is correct and the cheapest thing on the board, about 8x under the frontier models, and every step up the dial just adds cost for the same answer. And the bill tracks the reasoning, not the answer: the code GLM returns is roughly 150 tokens every time, while the reasoning in front of it grows from nothing to about 6,300, billed at the same $4.40/M output rate. The unbounded default spends that reasoning to reach the same answer thinking off produced with none, and the gap is the entire cost difference. The frontier models answer here with little or no reported reasoning: gpt-5.5 spends 59 reasoning tokens, and claude-opus-4-8’s usage reports none.

A hard task: reasoning earns its keep, the default does not

Wildcard string matching (? and *), the classic problem that is easy to get subtly wrong. Here thinking off broke. It returned a memoized recursion:

def is_match(s, p):
    memo = {}
    def match(i, j):
        if (i, j) in memo:
            return memo[(i, j)]
        if j == len(p):
            result = i == len(s)
        elif i < len(s) and p[j] in (s[i], '?'):
            result = match(i + 1, j + 1)
        elif p[j] == '*':
            result = match(i + 1, j) or match(i, j + 1)
        else:
            result = False
        memo[(i, j)] = result
        return result
    return match(0, 0)

It looks right, and the memo even suggests some care. But the * branch recurses match(i + 1, j) without bounding i. Once the string is consumed and the pattern still has a *, i climbs forever and the stack overflows. Fast, cheap, and wrong.

Turn the dial up and it returns the correct iterative two-pointer algorithm, which backtracks to the last * instead of recursing:

def is_match(s, p):
    s_idx, p_idx, star_idx, match_idx = 0, 0, -1, 0
    while s_idx < len(s):
        if p_idx < len(p) and (p[p_idx] == '?' or p[p_idx] == s[s_idx]):
            s_idx += 1
            p_idx += 1
        elif p_idx < len(p) and p[p_idx] == '*':
            star_idx = p_idx
            match_idx = s_idx
            p_idx += 1
        elif star_idx != -1:
            p_idx = star_idx + 1
            match_idx += 1
            s_idx = match_idx
        else:
            return False
    while p_idx < len(p) and p[p_idx] == '*':
        p_idx += 1
    return p_idx == len(p)

The full dial on this task:

GLM 5.2 settingCostLatencyCorrect
thinking off$0.00076sno (stack overflow)
reasoning_effort: high$0.003113syes
reasoning_effort: medium$0.003216syes
reasoning_effort: low$0.006840syes
unbounded default$0.062405syes
gpt-5.5 (reference)$0.00645.4syes
claude-opus-4-8 (reference)$0.00694.6syes

Every explicit effort level solved it. reasoning_effort: high did it for $0.0031 in 13 seconds, about twenty times cheaper and thirty times faster than the unbounded default for the same answer, and it undercuts the frontier models on cost, just a few seconds slower. One quirk worth knowing: GLM’s low produced more reasoning than high, consistently across both tasks, so the names don’t track token count. Medium and high were the cheap, fast settings.

The unbounded default is the one setting to avoid. It is the worst of both worlds: it buys reasoning the task may not need and takes minutes to do it, reaching the same answer reasoning_effort: high gave for twenty times the cost.

The decision rule

The lever is the reasoning effort, and the right setting belongs to the task, not the model:

  • Simple or high-volume work where correctness is easy: thinking off (enable_thinking: false). Correct and about 8x under frontier.
  • Harder problems where thinking off fails: reasoning_effort: medium or high. Correct, around $0.003 a task, under frontier on cost and only a few seconds slower.
  • Never the unbounded default. Leaving reasoning on with no effort cap is how a $0.003 answer becomes a $0.06, seven-minute one.

If you cannot tell in advance whether a task needs reasoning, reasoning_effort: high is a safe default: it was cheap, it solved both tasks, and it never ran away.

Caching helps the input, not the reasoning

GLM 5.2 supports caching on the gateway, and it helps where you’d expect. We sent a 1,494-token shared prefix (a code module to review) with several different questions:

CallPrompt tokensCachedOutputCostLatency
new question, prefix not yet cached1,4930120$0.00266.5s
new question, prefix cached1,4941,472120$0.00095.1s
exact repeat (semantic hit)1,4941,494120$0.00091.0s

Once a large prefix has been seen, it caches. The cached input tokens bill at roughly a fifth of the normal input rate, which cut an otherwise identical request from $0.0026 to $0.0009, about 64%. An exact repeat is served straight from the semantic cache: the same answer at the same cost as the cached call, but back in about a second instead of five.

The catch is the same one the dial taught: caching discounts the input, and the moment reasoning is on, the cost and latency live in the reasoning output, which is not cached. So caching is a real win for thinking-off, high-context work (the same system prompt or codebase on every call), and a small one once reasoning is on.

Using it on Synthorai

glm-5.2 is live on the gateway. Three practical notes from our testing:

  • Set the reasoning effort explicitly. Use enable_thinking: false for simple work and reasoning_effort: medium or high for harder problems. The one thing to avoid is leaving reasoning on with no effort cap (the unbounded default), which is the $0.06, seven-minute trap.
  • Stream when reasoning is on. Reasoning responses can run for minutes, and a non-streaming request sits on a silent connection long enough that your client will likely time out before the answer arrives. Use stream: true and you get incremental output and the full result.
  • Reuse your context. If you send the same large system prompt or codebase on every call, prefix caching cuts the input cost, and pairing it with thinking off makes the whole request cheap.

Pricing is $1.40 / $4.40 per million tokens, and the gateway returns a cost field per call so you can see exactly what each request cost.

Bottom line

GLM 5.2 is a genuinely cheap, capable coding model, and configured well it beats frontier prices on both easy and hard work. The catch is the configuration. Its reasoning is a dial, and the default leaves it unbounded, which is how a task that should cost $0.003 becomes a $0.06, seven-minute call. Set enable_thinking: false for simple work and reasoning_effort: medium or high for the rest, and GLM 5.2 is cheap and correct across the board. Leave reasoning on its default, and it is the slowest, priciest option you could have picked.


Sources

(Synthorai listing prices above are this platform’s rates as of 2026-06-24; GLM generational rates are Zhipu’s official list.)

Costs measured on Synthorai on 2026-06-24 (glm-5.2 at $1.40 / $4.40 per M tokens); verify current pricing before relying on it.

← Back to blog