GPU as a service (GPUaaS) used to be a hyperscaler product category. Rent an H100 by the hour, get a Jupyter notebook, train a model. That definition is already out of date.
In 2026, the question isn't whether you can rent a GPU. It's where that GPU sits, what it sits next to, and how fast it can serve an inference request to the application that needs it. For real-time workloads like voice AI agents, those three questions decide whether the product works at all.
This is a guide to what GPUaaS is, why telecom operators are emerging as one of the most consequential new categories of GPUaaS providers, and how to evaluate the difference between a generic cloud GPU and a co-located, carrier-grade one.
GPU as a service is the on-demand rental of GPU compute, typically through an API or orchestration platform, instead of buying and operating the hardware yourself. Customers pay per hour, per minute, or per inference call. The provider handles procurement, power, cooling, drivers, and the surrounding software stack.
The category exists because GPUs are expensive and frequently idle. The UN University Campus Computing Centre noted in 2025 that more than half of existing GPUs sit idle at any given time. For most teams, the math against ownership is straightforward: a high-end accelerator costs hundreds of thousands of dollars, depreciates fast, and only earns its keep if you run it close to capacity.
The pattern itself isn't new. A peer-reviewed paper from Fermilab and MIT published in Frontiers in Big Data in 2021 formalized the "GPU coprocessor as a web service" model for neutrino experiment reconstruction, accelerating the most time-consuming task by a factor of 17 and reducing total processing time by 2.7x compared to CPU-only production. The federal lab build-out continued: the US Department of Energy's OSTI archive holds Fermilab's 2023 technical report on DGaaS, a distributed GPUaaS implementation using the Triton Inference Server alongside HTCondor and HEPCloud to serve scientific compute at scale.
What changed isn't the architecture. It's the demand. Once large language models, real-time speech recognition, and voice synthesis became production workloads, every team building AI products needed inference compute they didn't have to own.
The market analyst view is that GPUaaS is a single growing category. MarketsandMarkets projects the GPU as a service market will grow from $8.21 billion in 2025 to $26.62 billion by 2030, a 26.5% CAGR. The buyer view is more useful: GPUaaS is splitting into two products that serve very different workloads.
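As a quick sanity check, the growth rate in that projection can be recomputed from its two endpoints. This is a minimal sketch: the function name `cagr` is ours, and only the dollar figures and years come from the projection cited above.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.265 means 26.5%)."""
    return (end_value / start_value) ** (1 / years) - 1


# Endpoints from the MarketsandMarkets projection, in billions of USD.
rate = cagr(8.21, 26.62, 2030 - 2025)
print(f"Implied CAGR: {rate:.1%}")
```

The result rounds to the 26.5% quoted, so the projection is internally consistent.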
The first is training and batch inference. The customer is an ML team that needs a lot of compute for a finite period of time. Latency between the GPU and the user doesn't matter much. The model trains overnight, the dataset processes in a window, the job finishes. Hyperscalers serve this well.
The second is real-time inference. The customer is a product team running voice AI, fraud detection, recommendation, or any workload where a human or downstream system is waiting on the response. Here, the distance between the GPU and the end user is the product. According to Deepgram's analysis of voice AI latency, real-time transcription needs to land under 300ms and text-to-speech needs to deliver the first audible syllable in roughly 150ms for the experience to feel like a natural conversation. Miss the window and the conversation feels mechanical regardless of how good the underlying model is.
Generic cloud GPUaaS isn't built for the second product. The compute sits in a handful of mega-regions. Voice data crosses oceans, gets transcribed, gets sent to an LLM, gets synthesized back to audio, and gets returned to a caller who has already started to think the agent isn't listening. The architecture works for training. It breaks for conversation.
This is where the operator side of the market gets interesting. In its January 2026 paper "The Engine of Transformation: Data and Infrastructure in the AI-Centric Telco," GSMA outlined three emerging AI revenue models for mobile operators: AI connectivity provider, AI compute provider through GPU-as-a-Service partnerships, and AI solutions partner. The compute-provider lane is where carriers are putting the most weight, because they already have the assets the workload needs.
The financial case follows. According to a McKinsey analysis covered by Light Reading, the addressable GPUaaS market for telcos could range from $35 billion to $70 billion globally by 2030. The report frames operator inaction as a larger risk than any specific bet.
Carriers are already moving. Singtel announced its GPUaaS offering powered by NVIDIA H100 clusters in Singapore in 2024 and has since expanded through a partnership with US-based GMI Cloud to combine capacity across the US and Asia Pacific. KT in Korea, Verizon's AI Connect in the US, and operators across Europe and the Middle East have launched comparable products. As Forrester wrote in its MWC 2026 analysis, operators want AI embedded at the infrastructure layer so they can move beyond the "dumb pipe" position and reclaim value from over-the-top players, including by offering trusted GPU-as-a-service.
The reason this is more than a revenue story is architectural. Telecom operators own the network. They terminate the call. They route the SMS. They peer with the public switched telephone network. If they place GPUs at the same physical site as their points of presence, the inference workload runs adjacent to the network event that triggered it, not five hops and an ocean away.
Most GPUaaS pages on the web are selling rented bare metal in a few large data centers. Co-located GPUaaS, the model that matters for real-time AI, is structurally different.
It means GPUs are physically deployed inside or directly adjacent to the network points of presence that handle the originating traffic. For a voice AI agent answering a call from Sydney, the GPU running inference is in Sydney. For a call from Frankfurt, it's in Frankfurt. The audio doesn't get hairpinned to a US region, transcribed, sent to a model 1,500 miles away, synthesized back, and returned.
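The distance argument can be put in rough numbers. The sketch below assumes signals in optical fiber travel at about two-thirds of the speed of light in vacuum, roughly 200 km per millisecond, and that the path is a straight line. Both are simplifying assumptions, and the 10 km metro distance is a hypothetical value for comparison.

```python
KM_PER_MILE = 1.609344

# Assumption: light in optical fiber covers roughly 200 km per millisecond
# (about two-thirds of its speed in vacuum).
FIBER_KM_PER_MS = 200.0


def fiber_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip propagation delay over a straight fiber path."""
    return 2 * distance_km / FIBER_KM_PER_MS


remote = fiber_rtt_ms(1500 * KM_PER_MILE)  # the 1,500-mile case from the text
local = fiber_rtt_ms(10)                   # hypothetical same-metro path

print(f"1,500 miles away: {remote:.1f} ms per round trip")
print(f"same metro:       {local:.1f} ms per round trip")
```

This is a floor, not an estimate. Routing detours, queuing, and per-hop processing all add to it, and a pipeline that makes several remote calls per conversational turn pays the round trip more than once.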
Telnyx Senior Software Engineer James Whedbee summarized the operational consequence in a recent product walkthrough: "You can cache that if you are on the same physical infrastructure, right, if you can actually get the request to go to the exact same GPU that it did last time, it's cached, and you don't have to process it again." Cache locality stops being a software trick when the GPU and the network are the same piece of infrastructure.
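One common way to get a request back to "the exact same GPU" is to route by a stable hash of the session. The sketch below uses rendezvous (highest-random-weight) hashing as an illustration. It is not a description of how Telnyx routes traffic, and the function name, session ID, and worker names are all hypothetical.

```python
import hashlib


def pick_gpu(session_id: str, gpus: list[str]) -> str:
    """Rendezvous hashing: a session maps to the same GPU on every call,
    as long as that GPU remains in the pool."""
    def weight(gpu: str) -> int:
        digest = hashlib.sha256(f"{gpu}|{session_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(gpus, key=weight)


pool = ["gpu-syd-1", "gpu-syd-2", "gpu-syd-3"]  # hypothetical worker names

first = pick_gpu("call-1234", pool)
again = pick_gpu("call-1234", pool)
print(first == again)  # repeat requests for one call land on one GPU
```

A useful property of this scheme is that removing some other GPU from the pool does not move the session, so a warm cache survives unrelated capacity changes.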
This is the architectural pattern behind the Telnyx GPU network, and it's how Telnyx fixed voice AI latency with co-located infrastructure rather than trying to optimize around it. The same logic drives Telnyx's regional builds, including the Sydney GPU deployment, which puts inference compute next to PSTN termination in the Asia-Pacific region.
The differences look small in a feature matrix and large in production.
| Dimension | Generic cloud GPUaaS | Carrier-grade co-located GPUaaS |
|---|---|---|
| Primary workload | Training, batch inference, dev/experimentation | Real-time inference, voice AI, in-call decisioning |
| Network proximity | GPU lives in a hyperscaler region, often far from end user | GPU lives at a carrier point of presence, adjacent to the network event |
| PSTN connection | Requires a third-party telephony provider to bridge calls | Native to the same platform that handles the call |
| Pricing model | Per GPU-hour, often with long commits for best rates | Per minute of usage, matched to the application |
There is a third category worth naming, because it's the one most likely to be confused with carrier-grade: generic edge compute. Providers like Cloudflare run GPUs at edge cities, which is closer to the user than a hyperscaler region but still not adjacent to the carrier network. Edge compute next to other compute is not the same as edge compute next to a PSTN termination point. For voice AI, the carrier proximity is the part that closes the latency gap, and it's the part generic edge providers structurally cannot replicate without becoming a carrier.
The carrier-grade column is what GSMA's "AI compute provider" model actually looks like in production. It's also what closes the latency gap for voice AI: there is no third-party telephony hop between the call and the inference layer, because the same platform is doing both.
If you're choosing a GPUaaS provider for a training pipeline, the hyperscalers are reasonable defaults and the academic and government reference architectures still apply. If you're building real-time AI, especially anything involving voice, the decision tree is different.
Three questions are worth asking explicitly:
First, where does the GPU physically sit relative to the application that calls it? "Cloud region" is not granular enough. Real-time inference cares about milliseconds, and milliseconds are decided by physical distance and the number of hops.
Second, what's the integration cost to get the network event to the GPU? If your voice AI stack needs to combine a telephony provider, a transcription API, an LLM endpoint, a TTS service, and a separate observability layer, the latency budget is consumed before any of those services do any work. That stack has a name: the Frankenstack. STT vendor, LLM vendor, TTS vendor, carrier, and orchestration layer, each one a vendor boundary, a margin layer, and a hop in the latency budget. The integration cost is not a one-time tax. It is a permanent drag on every call.
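A latency budget makes the hop cost visible. The sketch below separates the time each stage spends doing work from the time spent reaching it. Every number is a placeholder chosen for illustration, not a measurement of any vendor, and the `Stage` type and stage names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    processing_ms: float  # time the service spends doing useful work
    network_ms: float     # time spent crossing a boundary to reach the service


def turn_latency_ms(stages: list[Stage]) -> float:
    """Total latency for one conversational turn through every stage."""
    return sum(s.processing_ms + s.network_ms for s in stages)


def network_overhead_ms(stages: list[Stage]) -> float:
    """Portion of the turn spent moving data rather than processing it."""
    return sum(s.network_ms for s in stages)


# Placeholder numbers for illustration only, not measurements of any vendor.
PROCESSING = {"telephony": 20, "stt": 150, "llm": 250, "tts": 100}

multi_vendor = [Stage(n, p, network_ms=40) for n, p in PROCESSING.items()]
co_located = [Stage(n, p, network_ms=2) for n, p in PROCESSING.items()]

for label, stack in (("multi-vendor", multi_vendor), ("co-located", co_located)):
    print(f"{label}: {turn_latency_ms(stack)} ms per turn, "
          f"{network_overhead_ms(stack)} ms of it in transit")
```

With identical processing times, the only difference between the two stacks is the transit term, and it is paid on every turn of every call.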
Third, what's the cost structure at scale? Open-source models have closed enough of the quality gap with proprietary alternatives to change the math, and inference costs have collapsed across the board. Stanford HAI's 2026 AI Index Report found that the cost to run an AI model at GPT-3.5 performance level dropped more than 280-fold between November 2022 and October 2024, from roughly $20 per million tokens to $0.07. The top closed model leads the top open model by 3.3% on benchmarks as of March 2026, which is small enough that for most production workloads, open-source models are the better economic choice.

A GPUaaS provider that runs open-source models next to your voice infrastructure changes per-minute economics. The numbers are concrete: running open models on co-located compute, Telnyx TTS is priced roughly 10x lower than ElevenLabs, and Telnyx SIP roughly 2x lower than Twilio. The savings are structural, not promotional, because there is no reseller margin between the model and the network. That's the foundation underneath Telnyx Inference and the broader evolution of AI infrastructure toward open, transparent stacks.
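The fold-change in the Stanford HAI figure cited above can be checked from the two prices it quotes. A minimal sketch, with variable names of our own choosing:

```python
# Prices per million tokens, as quoted from the AI Index Report.
price_nov_2022 = 20.00
price_oct_2024 = 0.07

fold = price_nov_2022 / price_oct_2024
percent_drop = (1 - price_oct_2024 / price_nov_2022) * 100

print(f"{fold:.0f}-fold cheaper, a {percent_drop:.2f}% drop")
```

The ratio comes out just above 285, consistent with "more than 280-fold".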
Increasingly, the entity asking these three questions isn't a human architect. It's the agent itself. AI agents evaluating inference infrastructure measure first-call latency and per-minute cost directly, then route to the provider that wins on both. The architecture decision is being made at machine speed, against criteria a feature page doesn't address.
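"Wins on both" has a precise meaning: one provider is at least as good as every other on latency and on cost. The sketch below shows that check. The provider names and measurements are hypothetical, and an actual agent would need a policy for the case where no single provider wins on both.

```python
from typing import NamedTuple, Optional


class Measurement(NamedTuple):
    provider: str
    first_call_latency_ms: float
    cost_per_minute_usd: float


def winner_on_both(candidates: list[Measurement]) -> Optional[Measurement]:
    """Return the provider that is at least as good as every other on both
    metrics, or None when the choice involves a trade-off."""
    for c in candidates:
        if all(
            c.first_call_latency_ms <= o.first_call_latency_ms
            and c.cost_per_minute_usd <= o.cost_per_minute_usd
            for o in candidates
        ):
            return c
    return None


# Hypothetical measurements for illustration only.
clear_case = [Measurement("provider-a", 420, 0.09), Measurement("provider-b", 310, 0.06)]
trade_off = [Measurement("provider-a", 300, 0.10), Measurement("provider-b", 400, 0.05)]

print(winner_on_both(clear_case))  # provider-b is faster and cheaper
print(winner_on_both(trade_off))   # None: faster and cheaper are different providers
```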
The next two years will accelerate three shifts already visible in the data.
Compute will keep moving closer to data. Hyperscaler regions are not the end state. Regional and sovereign builds, often run by carriers and specialist providers, will absorb a growing share of inference workloads as data residency requirements harden. The pattern is already visible across the EU AI Act, APAC sovereignty rules, and LATAM data localization regimes. This is also why carrier-grade infrastructure improves with global expansion while multi-vendor stacks degrade. Most platforms get worse as they expand: more regions, more hops, more vendor permutations. A carrier that places compute at each new point of presence gets better, because each region adds capability rather than complexity.
The line between communications infrastructure and AI infrastructure will keep dissolving. As GSMA's analysis makes clear, telcos that own physical infrastructure are positioned to provide the substrate AI applications run on. Telnyx has been operating from that thesis for several years, and the Voice AI Agents product is what it looks like when carrier infrastructure and AI inference are designed as one platform rather than two integrations.
Open-source models will keep eating the proprietary model market. The economics are too lopsided to ignore. The providers that win the inference layer will be the ones who run open models well and can swap them as fast as the open-source community ships them.
If you're prototyping a voice AI agent, generic cloud GPUaaS is fine. If you're putting one in production, the question is whether your stack can hit conversational latency targets at scale and at a unit cost that lets you grow.
Co-located GPU compute is one layer of a larger system. Real-time AI runs on three: edge compute (where physics decides latency), the agent platform (where orchestration runs as one runtime), and global communications (where the carrier network and identity live). Telnyx is the only platform that owns all three. The PSTN connection, the inference engine, the open-source LLM library, and the voice models all run on the same physical infrastructure, which is why voice AI agents built on Telnyx respond in real time at a unit cost that scales.
Voice is the wedge, not the ceiling. The same co-located compute that powers real-time voice powers SMS, email, and async agent operations as your stack grows.
Talk to our team about what your voice AI stack would look like running on Telnyx, or start building on the platform and provision your first voice AI agent today.