Inference

GLM-5.2 benchmarks, price, and speed compared

GLM-5.2 tops the open-weights leaderboard with a 1M context, near-frontier coding scores, and the fastest time to first token in its class.

By Andy Muns

GLM 5.2 - New open weight model from z.ai

What is GLM-5.2

GLM-5.2 is the most competitive open-weights language model released so far. It sits at the top of open-model leaderboards and lands close to the best closed models from Claude and GPT on several coding and agentic benchmarks, though the strongest closed models still hold an edge on a few of the hardest tests. For teams that want frontier-class reasoning with the deployment flexibility of open weights, it is the model to understand right now.

GLM-5.2 benchmark metrics

The case for GLM-5.2 is its performance-to-cost profile. Across the four measures most teams actually weigh, it is the top open-weights model on intelligence, the cheapest model in the frontier field on blended price, the fastest to first token by a wide margin, and second only to Gemini 3.5 on raw output speed. The four charts below compare it against Claude Opus 4.8, GPT-5.5, Gemini 3.5, and Claude Sonnet 4.6, with data from Artificial Analysis.

Intelligence

GLM-5.2 intelligence index versus frontier models

On the Artificial Analysis Intelligence Index, GLM-5.2 scores 51, the highest figure of any open-weights model. Two closed models sit ahead, Opus 4.8 at 56 and GPT-5.5 at 53, while GLM-5.2 edges out Gemini 3.5 at 50 and Claude Sonnet 4.6 at 47. So the open-weights leader is already inside the closed-frontier pack rather than a tier below it.

The coding results back this up. GLM-5.2 reaches 81.0 on Terminal-Bench 2.1 against 62.0 for GLM-5.1, and 62.1 on SWE-bench Pro against 58.4 for the prior version. On FrontierSWE it trails Opus 4.8 by about one percent while outperforming GPT-5.5 and Opus 4.7, and it stays the highest-ranked open-source model across FrontierSWE, PostTrainBench, and SWE-Marathon. We see the same pattern in our own model benchmarks of open-weight releases.

Price

GLM-5.2 blended price per 1M tokens

GLM-5.2 is the cheapest model in the comparison at $0.90 blended per 1M tokens. That is below Gemini 3.5 at $1.31 and Sonnet 4.6 at $2.31, and well under the two closed frontier leaders, Opus 4.8 at $3.85 and GPT-5.5 at $4.35. Against GPT-5.5, GLM-5.2 delivers near-frontier intelligence at roughly a fifth of the blended token price.

Output speed

GLM-5.2 output speed in tokens per second

On sustained output, GLM-5.2 runs at 106 tokens per second. Only Gemini 3.5 is faster at 163, and GLM-5.2 sits comfortably ahead of every closed model from Claude and GPT in the set: GPT-5.5 at 65, Opus 4.8 at 61, and Sonnet 4.6 at 51. For high-throughput batch and agentic work, that gap compounds across long runs, and sustained throughput depends as much on the GPU network behind the model as on the model itself.

Latency to first token

GLM-5.2 latency to first token in seconds

This is where GLM-5.2 separates from the field. It returns its first token in 1.36 seconds. Every other model in the comparison takes more than 20 seconds: Gemini 3.5 at 20.06s, GPT-5.5 at 23.82s, Opus 4.8 at 30.79s, and Sonnet 4.6 at 104.56s. For interactive and agentic loops where a user or a tool is waiting on the first response, that difference reshapes what the product feels like. A model that starts responding in just over a second reads as live, while one that pauses for twenty seconds reads as broken, even when the eventual answer is identical. Time to first token is the inference latency number that users actually feel.

Metric	GLM-5.2	Best of the rest
Intelligence Index	51, top open-weights	Opus 4.8 at 56
Blended price per 1M	$0.90, cheapest	Gemini 3.5 at $1.31
Latency to first token	1.36s, fastest	Gemini 3.5 at 20.06s

GLM-5.2 reaches close to frontier performance in several benchmarks, especially for coding, but it is not a universal winner across every benchmark.

What GLM-5.2 can do

The performance comes with a feature set built for long-horizon work. GLM-5.2 ships a usable 1M-token context window and produces up to 128K output tokens in a single response. Rather than stretching a context length that frays under load, it was trained for months on long-horizon coding agent scenarios, so the long window holds up across project-scale tasks instead of falling apart late in a run.

On top of that, it supports thinking modes that adjust reasoning depth, function calling for tool use, MCP for connecting external tools and data, structured JSON output for clean integration, context caching, and streaming. Those map to real workflows: project-level codebase takeover, long-horizon refactoring across many files, production-standards stress tests, mobile on-device debugging, and research reproduction that turns a paper into runnable code. A single run can carry module boundaries, API contracts, and earlier engineering decisions forward instead of losing the thread.

How to run GLM-5.2

Because the weights are open, you can self-host. For most teams that is harder than it sounds. GLM-5.2 is a very large model, and practical local use means heavy quantization, often down to around 2-bit on local hardware, which carries a measurable quality cost on top of the ongoing burden of serving, scaling, and reliability. A hosted API is the simpler path to the numbers above, and the machine learning inference primer covers the trade-offs in more depth.

Telnyx Inference runs frontier open-weight models on owned GPU infrastructure, priced per token, behind an OpenAI-compatible API, so adoption is a base URL swap rather than a rewrite. It runs in-region by default and uses FP8 precision for throughput and accuracy at once. For real-time voice, LLM Router serves models on edge GPUs co-located with the carrier network. The efficient frontier guide and the inference latency benchmark are good next reads on cost and speed, and if you are migrating off another host, see these Fireworks alternatives.

GLM-5.2 FAQ

Is GLM-5.2 cheaper than GPT-5.5 and Claude?

Yes. At $0.90 blended per 1M tokens (z.ai pricing), GLM-5.2 is the cheapest model in the comparison, well below GPT-5.5 at $4.35 and Opus 4.8 at $3.85, according to Artificial Analysis.

How large is GLM-5.2's context window?

GLM-5.2 has a usable 1M-token context window and can produce up to 128K output tokens in a single response. It was trained for long-horizon tasks, so the long context stays stable across project-scale work.

Is GLM-5.2 open source?

GLM-5.2 is an open-weights model, so the weights are available to run on your own infrastructure. That gives it more deployment flexibility than closed-platform models, along with the option to keep data on your own hardware.

The short version

GLM-5.2 narrows the gap with the closed frontier on intelligence and coding while undercutting it on price and beating it badly on time to first token. Running it well is mostly an infrastructure question. If you want those numbers without the hardware burden, swap your base URL and run it on infrastructure Telnyx owns.

Share on Social

Andy Muns

Director of AEO

Andy Muns is the Director of AEO at Telnyx, helping make AI and communications products clearer for builders. He previously ran a front-end team behind an Alexa Top 100 organic site, gaining hands-on experience shipping and scaling high-traffic apps. He lives in Colorado.