Telnyx

Four new open-weight LLMs just dropped, and here's what they mean for Voice AI

By Eli Mogul

The first months of 2026 have been relentless for open-weight model releases. Four frontier-class LLMs shipped within weeks of each other, each with architecture decisions that matter for anyone building conversational AI at scale. We broke down what's new, what's different, and why Voice AI builders should pay attention.

| Model | Developer | Total params | Active params | Context window | License | API input pricing |
|---|---|---|---|---|---|---|
| DeepSeek V3.2 | DeepSeek AI | 685B | 37B | 128K | MIT | $0.28/1M tokens |
| Kimi K2.5 | Moonshot AI | 1T | 32B | 256K | Modified MIT | $0.60/1M tokens |
| GLM-5 | Z.AI (Zhipu) | 744B | 40B | 200K | MIT | $1.00/1M tokens |
| MiniMax-M2.5 | MiniMax | 230B | 10B | 200K | Modified MIT | $0.30/1M tokens |
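The table above can be turned into back-of-the-envelope call economics. The sketch below uses the published per-token prices; the per-call token volumes (a 5-minute support call with the transcript re-sent each turn) are illustrative assumptions, not measured figures.

```python
# Rough LLM cost comparison for a single Voice AI call, using the API
# prices from the table above. Token volumes are illustrative assumptions.

PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "DeepSeek V3.2": (0.28, 0.42),
    "Kimi K2.5": (0.60, 2.50),
    "GLM-5": (1.00, 3.20),
    "MiniMax-M2.5": (0.30, 1.20),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of the LLM portion of one call."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Assume a 5-minute support call: ~8,000 input tokens (system prompt, tool
# schemas, transcript re-sent each turn) and ~1,500 output tokens.
for model in PRICES:
    print(f"{model}: ${call_cost(model, 8_000, 1_500):.4f}")
```

Even under these assumptions the spread is visible: the cheapest model here comes in at well under a cent per call, which is what makes always-on agents viable at contact-center volumes.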

DeepSeek V3.2


DeepSeek V3.2 landed in December 2025 and set the tone for what followed. Built on 685 billion total parameters with 37 billion active per token, V3.2 introduced DeepSeek Sparse Attention (DSA), a mechanism that selectively computes attention weights rather than processing every token against every other token. The result: inference costs for long-context tasks dropped by roughly 50%, and the model maintained performance parity with its predecessor V3.1-Terminus across public benchmarks. For Voice AI applications that require reasoning over long conversation histories or multi-step tool orchestration, that efficiency gain directly translates to faster response times and lower per-minute costs.

Where V3.2 stands out is the combination of strong agentic capability and aggressive pricing. It scores 70% on SWE-bench Verified, 94.2% on AIME 2026, and earned gold-medal results on the 2025 International Mathematical Olympiad. More relevant for voice workflows: V3.2 is DeepSeek's first model to integrate reasoning directly into tool use, supporting both thinking and non-thinking modes when calling external functions. That means a Voice AI agent powered by V3.2 can reason through a complex customer query, pull data from a CRM, and formulate a response, all within a single inference pass. At $0.28 per million input tokens and $0.42 per million output tokens, it costs roughly 10–25x less than comparable proprietary models.


The open MIT license means you can self-host, fine-tune, or integrate V3.2 into proprietary pipelines without restriction. For teams building Voice AI agents that need strong reasoning at high concurrency (think contact center automation or real-time financial advisory), DeepSeek V3.2 offers a compelling economics story. The tradeoff: it can be verbose, and at 27 tokens per second on the first-party API, it's slower than several competitors. For latency-sensitive voice interactions, that's worth benchmarking against your specific use case.
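A tool-augmented turn like the CRM example above maps onto a standard OpenAI-compatible chat completions request. The sketch below only builds the payload; the model identifier, endpoint, and the `lookup_customer` tool schema are illustrative assumptions, not a documented Telnyx or DeepSeek API surface.

```python
import json

# Sketch of a tool-augmented chat completions payload for one voice turn.
# Model name and tool schema are hypothetical placeholders.
payload = {
    "model": "deepseek-v3.2",  # hypothetical identifier; check your provider's model list
    "messages": [
        {"role": "system", "content": "You are a voice support agent. Keep replies short and speakable."},
        {"role": "user", "content": "Why was my invoice higher this month?"},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "lookup_customer",  # hypothetical CRM lookup tool
                "description": "Fetch a customer's recent invoices by account ID.",
                "parameters": {
                    "type": "object",
                    "properties": {"account_id": {"type": "string"}},
                    "required": ["account_id"],
                },
            },
        }
    ],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}

print(json.dumps(payload)[:60])
```

With a model like V3.2 that reasons inside tool use, the response to this request may contain a `tool_calls` entry your application executes before the model composes the spoken answer.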




All four of these models sit on the efficient frontier of cost per intelligence, and we now support them over our chat completions endpoint. DeepSeek V3.2 set the benchmark for what that frontier looks like: strong reasoning, integrated tool use, and pricing that makes always-on AI agents economically viable.

  • James Whedbee, VP of Engineering @ Telnyx


Less than a month later, Moonshot AI answered with a model that takes a fundamentally different approach to scale.


Kimi K2.5


Kimi K2.5, released January 27, 2026, is a trillion-parameter Mixture-of-Experts model that activates just 32 billion parameters per token. What makes it distinct from everything else on this list: it's natively multimodal. Moonshot AI continued pre-training the K2 base model on approximately 15 trillion mixed visual and text tokens, using a proprietary 400-million-parameter vision encoder called MoonViT. For Voice AI builders, this opens up a category of use cases where an agent needs to interpret visual inputs alongside speech. Think insurance claims processing where a caller describes damage while the agent analyzes uploaded photos, or technical support where a customer shares a screenshot mid-call.

The headline feature is Agent Swarm. Rather than processing tasks sequentially, K2.5 can decompose complex requests into parallel sub-tasks and delegate them to up to 100 dynamically instantiated sub-agents, each handling its own tool calls and reasoning chains. Moonshot AI reports wall-clock time reductions of up to 4.5x compared to single-agent execution. On BrowseComp, a benchmark measuring multi-step web research, K2.5 outperformed GPT-5.2 Pro. On Humanity's Last Exam with tools enabled, it scored 50.2%, at 76% lower cost than Claude Opus 4.5. The model supports four operational modes: Instant (3–8 second responses), Thinking (deep reasoning with traces), Agent (tool-augmented task completion), and Agent Swarm (parallel multi-agent orchestration).


For voice applications, K2.5's 256K-token context window is the largest in this group, which matters for long-running conversations or scenarios where an agent needs to reference extensive prior context. The tradeoff is latency: Thinking Mode responses typically take 8–25 seconds, which won't work for real-time conversational voice. Instant Mode is faster but sacrifices the deep reasoning that makes K2.5 competitive on hard benchmarks. Teams building asynchronous voice workflows (voicemail triage, post-call summarization, batch outbound campaigns), where latency tolerance is higher, will get the most from K2.5's capabilities. API pricing sits at $0.60 per million input tokens and $2.50 per million output, with the model weights available on Hugging Face for self-hosted deployment.
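The wall-clock win behind Agent Swarm is concurrency: independent sub-tasks run in parallel instead of back-to-back. The sketch below illustrates that fan-out pattern with `asyncio`; it is a conceptual illustration, not Moonshot AI's actual Agent Swarm API, and the sub-agent is a stub.

```python
import asyncio

async def sub_agent(task: str) -> str:
    """Stub sub-agent: stands in for a delegated chain of tool calls."""
    await asyncio.sleep(0.01)  # simulated tool-call latency
    return f"result for {task!r}"

async def swarm(tasks: list[str]) -> list[str]:
    # Run every sub-task concurrently: wall-clock time is bounded by the
    # slowest sub-task, not the sum of all of them.
    return await asyncio.gather(*(sub_agent(t) for t in tasks))

results = asyncio.run(swarm([
    "check order status",
    "fetch shipping ETA",
    "draft reply",
]))
print(results)
```

With 100 stub tasks at 10 ms each, the sequential version takes about a second while the concurrent one finishes in roughly one task's latency, which is the shape of the 4.5x reduction Moonshot reports on real workloads.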




Kimi K2.5 hits a great balance between intelligence and cost. The non-reasoning version is ideal for real-time Voice AI, and we also recommend it for AI Assistants. It represents a big step up over Qwen 235B in intelligence at effectively the same latency and price point. Many use cases that became a prompt engineering hassle with Qwen will simply work with Kimi.

  • James Whedbee, VP of Engineering @ Telnyx


The same week K2.5 launched, Z.AI quietly stress-tested their own contender on OpenRouter under a codename—before pulling the curtain back entirely.


GLM-5


Z.AI (formerly Zhipu AI) released GLM-5 on February 11, 2026, after an unconventional stealth launch on OpenRouter under the alias "Pony Alpha," a nod to 2026 being the Year of the Horse. With 744 billion total parameters and 40 billion active per token, GLM-5 represents a 2x scale-up from GLM-4.5 and is trained on 28.5 trillion tokens. It debuted as the top-ranked open-weight model on both Artificial Analysis and LMArena's Text Arena. The architecture integrates DeepSeek Sparse Attention for efficient long-context handling across a 200K-token window, combined with a novel asynchronous reinforcement learning framework called "Slime" that improved post-training throughput enough to enable significantly more granular optimization iterations.

GLM-5 is explicitly positioned for complex systems engineering and long-horizon agentic tasks. On SWE-bench Verified it scores 77.8%, on AIME 2026 it hits 92.7%, and it ranks first among open-source models on Vending Bench 2—a benchmark that measures long-term operational capability by simulating a full year of business decisions. Z.AI frames the model's strength as a shift from simple code generation to end-to-end agentic engineering: on their internal CC-Bench-V2 suite, GLM-5 achieves a 98% frontend build success rate and 74.8% end-to-end correctness, a 26% improvement over its predecessor on frontend tasks. For Voice AI, this kind of sustained coherence over multi-step workflows is what separates a demo from a production deployment.


The pricing sits at $1.00 per million input tokens and $3.20 per million output, roughly 3x more than DeepSeek V3.2 on input and significantly more on output, but still a fraction of proprietary alternatives. It's fully MIT-licensed and already supported by vLLM, SGLang, KTransformers, and xLLM for self-hosted deployment. For teams running Voice AI agents that need to handle complex, multi-turn conversations with tool calls (scheduling, order management, technical troubleshooting), GLM-5's combination of strong agentic performance and low hallucination rates makes it a strong candidate. The main consideration: deploying the full model requires 8x NVIDIA B200 GPUs, so self-hosting is a serious infrastructure commitment.
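For teams weighing that self-hosting commitment, a vLLM deployment would look roughly like the sketch below. The Hugging Face repo id is an assumption (check Z.AI's model card for the published name and recommended flags); `--tensor-parallel-size` shards the model across the 8 GPUs and `--max-model-len` caps the context at the model's 200K window.

```shell
# Sketch: serving GLM-5 with vLLM across 8 GPUs via tensor parallelism.
# Repo id "zai-org/GLM-5" is a placeholder assumption, not a confirmed name.
vllm serve zai-org/GLM-5 \
  --tensor-parallel-size 8 \
  --max-model-len 200000
```

This exposes an OpenAI-compatible endpoint on localhost, so the same chat completions payloads used against a hosted API work unchanged against the self-hosted deployment.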




GLM-5 is the highest-intelligence open-source LLM in the world right now. We support it over our chat completions endpoint, and for teams that need maximum reasoning capability in their Voice AI pipelines, it's the one to beat.

  • James Whedbee, VP of Engineering @ Telnyx


One day later, MiniMax made a case that you don't need a trillion parameters to compete at the frontier.


MiniMax-M2.5


MiniMax-M2.5, released February 12, 2026, is the smallest model in this roundup by a wide margin: 230 billion total parameters with just 10 billion active per token. Don't let the size fool you. Trained with reinforcement learning across more than 200,000 real-world environments, M2.5 posts an 80.2% score on SWE-bench Verified, the highest of any model on this list and competitive with Claude Opus 4.6 on multiple evaluation scaffolds. It also completes those benchmark tasks 37% faster than its predecessor M2.1, matching Claude Opus 4.6's speed. MiniMax describes M2.5 as the first frontier model where cost is no longer a constraint, and the pricing backs that up: $0.30 per million input tokens and $1.20 per million output, with a Lightning variant that doubles throughput to 100 tokens per second.
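Throughput numbers like these translate directly into perceived voice latency. The sketch below does the arithmetic; the 40-token reply length is an illustrative assumption, while the decode speeds are the figures quoted in this article (27 tok/s for DeepSeek's first-party API, 100 tok/s for M2.5 Lightning).

```python
# Back-of-the-envelope decode-time check for real-time voice.
# Ignores network round-trip, time-to-first-token, and TTS synthesis.

def generation_time(reply_tokens: int, tokens_per_sec: float) -> float:
    """Seconds to decode a full reply at a given throughput."""
    return reply_tokens / tokens_per_sec

REPLY_TOKENS = 40  # assumed length of a short spoken reply

for name, tps in [("DeepSeek V3.2 (first-party API)", 27),
                  ("MiniMax-M2.5 Lightning", 100)]:
    print(f"{name}: {generation_time(REPLY_TOKENS, tps):.2f}s")
```

In practice streaming the reply into TTS hides much of this, since playback can begin after the first sentence, but the raw decode rate still bounds how quickly the agent can get ahead of the caller.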

What sets M2.5 apart for Voice AI is its "architect-level" planning behavior. During training, the model developed a tendency to decompose and plan tasks before executing, writing specs for feature structure, UI design, and component architecture before producing code. That planning-first approach extends beyond coding: in agentic search tasks, M2.5 uses 20% fewer search rounds than M2.1 with better token efficiency, reaching results through more precise reasoning paths. MiniMax also trained M2.5 in collaboration with domain experts in finance, law, and social sciences, targeting genuinely deliverable outputs for office productivity tasks. On advanced document tasks (Word formatting, PowerPoint editing, Excel financial modeling), M2.5 achieved a 59.0% average win rate against mainstream models.


For Voice AI at scale, M2.5's efficiency story is hard to ignore. At 10 billion active parameters, it's small enough to self-host on consumer-grade multi-GPU setups or even high-end Apple Silicon machines with sufficient unified memory. That means you can run inference locally, keeping conversation data on-premises, a meaningful advantage for industries like healthcare or financial services where data residency requirements rule out third-party API calls. The model supports a 200K context window, function calling, and both thinking and instant modes. The only notable gap compared to the others in this group: M2.5 is text-only, with no native vision support. For pure voice-to-voice conversational AI, that's not a limitation. For multimodal workflows, you'd pair it with a separate vision pipeline.




MiniMax-M2.5 is highly intelligent at a lower cost. It's another model sitting right on that efficient frontier, and we've added it to our chat completions endpoint. For teams optimizing Voice AI spend at scale, the intelligence-per-dollar ratio here is top-tier.

  • James Whedbee, VP of Engineering @ Telnyx



What this means for Voice AI builders

These four releases signal a clear trend: open-weight models now match or exceed proprietary alternatives on the capabilities that matter most for conversational AI (tool use, multi-step reasoning, long-context coherence, and agentic task completion). The economics have shifted, too. Running a Voice AI agent on MiniMax-M2.5 or DeepSeek V3.2 costs a fraction of what equivalent proprietary models charge, and MIT licensing gives teams full control over deployment, fine-tuning, and data handling.

For teams building on Telnyx's Voice AI infrastructure, these models slot directly into the stack. Colocated GPU infrastructure adjacent to global telecom PoPs means you can run inference on these open models with the low latency that real-time voice demands, without routing data through third-party APIs or sacrificing control over the pipeline. The question isn't whether open models are ready for production Voice AI. It's which one fits your specific latency, cost, and capability requirements.
