Telnyx

Multicloud redundancy: Design for real outages

Multicloud redundancy means that when Provider A fails, Provider B picks up the load automatically, with no manual intervention and minimal data loss.

By Telnyx Team

Cloud outages are real. In October 2025, a single DNS resolution failure in AWS's us-east-1 region disrupted over 3,500 companies across 60+ countries. Months earlier, a Google Cloud IAM failure knocked out more than 50 services globally for over seven hours, cascading into outages at Cloudflare, Spotify, Snapchat, and Discord. Azure experienced its own multi-day disruption, with a networking configuration issue in East US 2 lasting roughly 50 hours.

For businesses running real-time voice, messaging, or AI workloads, minutes of downtime translate directly to lost revenue, broken customer trust, and SLA penalties. A 2024 report from Splunk and Oxford Economics found that unplanned downtime costs Global 2000 companies $400 billion annually, roughly 9% of profits. According to the Uptime Institute's 2025 Annual Outage Analysis, more than half of major outages now cost organizations over $100,000, with one in five exceeding $1 million.

The question isn't whether your cloud provider will go down. It's whether you've architected for it.

Multicloud adoption is the norm, but redundancy lags

Most enterprises already operate across multiple clouds. Flexera's 2025 State of the Cloud Report found that 70% of organizations run hybrid cloud strategies, using at least one public and one private cloud. Meanwhile, Gartner forecasts that 90% of organizations will adopt hybrid cloud by 2027, with public cloud spending reaching $723.4 billion in 2025.

But "multicloud" doesn't automatically mean "redundant." Many teams use multiple providers for different workloads (analytics on one, compute on another) without building cross-provider failover for any single workload. That's a multicloud strategy, not multicloud redundancy. Multicloud redundancy means that when Provider A fails, Provider B picks up the load automatically, with no manual intervention and minimal data loss.

The distinction matters. During the October 2025 AWS outage, companies that had distributed workloads but lacked automated failover still experienced hours of disruption. Those with genuine cross-cloud redundancy kept running.

Active-active vs. active-passive: picking the right pattern

The two dominant approaches to multicloud redundancy are active-active and active-passive. Each involves real tradeoffs in cost, complexity, and recovery time.

In an active-active setup, multiple cloud environments simultaneously handle live production traffic. A load balancer or DNS-based routing distributes requests across all active nodes. If one environment fails, the others absorb the load without a switchover delay. This is standard for workloads where even seconds of downtime carry financial or safety consequences: payment processing, real-time voice, contact center operations.

In an active-passive configuration, one environment handles all traffic while a standby environment waits in reserve. When the primary fails, a failover process promotes the standby to active. This approach is simpler and cheaper to operate, but it introduces a recovery gap. Failover is not instantaneous, and the passive environment may require warm-up time before it can handle full traffic.

Factor	Active-active	Active-passive
Failover time	Near-zero (traffic reroutes automatically)	Minutes to hours (depends on promotion process)
Resource utilization	All nodes serve production traffic	Standby resources sit idle during normal operation
Relative cost	Higher (all environments run at capacity)	Lower (passive environment can run reduced capacity)
Best fit	Real-time voice, payments, customer-facing APIs	Internal tools, batch processing, analytics dashboards
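The routing difference between the two patterns can be sketched in a few lines of Python. This is a simplified illustration, not a production load balancer: the Environment class, the function names, and the "provider-a" and "provider-b" labels are all hypothetical, and health is a plain boolean rather than the result of a real health check.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    name: str
    healthy: bool = True


def route_active_active(envs, request_id):
    """Spread requests across every healthy environment."""
    healthy = [e for e in envs if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy environment available")
    # Simple modulo spread; a real load balancer would also weigh capacity.
    return healthy[request_id % len(healthy)].name


def route_active_passive(primary, standby):
    """Send everything to the primary; use the standby only on failure."""
    if primary.healthy:
        return primary.name
    if standby.healthy:
        # In practice this promotion step is where warm-up delay is paid.
        return standby.name
    raise RuntimeError("no healthy environment available")


env_a = Environment("provider-a")
env_b = Environment("provider-b")

print([route_active_active([env_a, env_b], i) for i in range(4)])
env_a.healthy = False  # simulate an outage at provider A
print([route_active_active([env_a, env_b], i) for i in range(4)])
print(route_active_passive(env_a, env_b))
```

In the active-active case the surviving environment was already serving traffic, so there is nothing to promote. In the active-passive case the standby only starts receiving requests after the primary is marked unhealthy.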

For communications infrastructure, active-active is typically the right call. Voice calls and AI agent interactions cannot pause and resume. A dropped call is a lost customer, not a retry. Telnyx distributes workloads across multiple cloud providers, including AWS, Google Cloud, and private data centers, using active redundancy and automated failover at every layer of its stack.

What multicloud redundancy actually requires

Running workloads on two clouds is the starting point, not the finish line. Real redundancy requires coordination across several layers.

Data replication and consistency. If your primary database is in AWS and your failover is in Google Cloud, those databases need to stay synchronized. Asynchronous replication introduces the risk of data loss during failover: any writes the replica has not yet received are gone. The maximum amount of data loss you are willing to accept is your recovery point objective, or RPO. Synchronous replication eliminates that risk but adds latency and cost. The right choice depends on what your business can tolerate: zero data loss, or a gap of a few seconds.
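As a rough worked example, the data lost in a failover is the gap between the last write committed on the primary and the last write the replica had applied. The sketch below compares that gap with an RPO target. The timestamps, the four-second lag, and the function names are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_primary_write, last_replicated_write):
    """Writes committed on the primary after the replica's position are lost."""
    return max(last_primary_write - last_replicated_write, timedelta(0))


def meets_rpo(last_primary_write, last_replicated_write, rpo):
    """True if the loss window fits inside the recovery point objective."""
    return data_loss_window(last_primary_write, last_replicated_write) <= rpo


# Hypothetical incident: the replica was 4 seconds behind when the primary failed.
failure_time = datetime(2025, 1, 1, 12, 0, 0)
replica_caught_up_to = failure_time - timedelta(seconds=4)

print(data_loss_window(failure_time, replica_caught_up_to))                     # 0:00:04
print(meets_rpo(failure_time, replica_caught_up_to, rpo=timedelta(seconds=5)))  # True
print(meets_rpo(failure_time, replica_caught_up_to, rpo=timedelta(0)))          # False
```

A zero RPO fails here by construction, which is why a zero-data-loss requirement generally points toward synchronous replication.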

DNS and traffic routing. When an environment goes down, traffic needs to move. DNS-based failover (using health checks and weighted routing) is the most common approach, but DNS TTLs and propagation can slow the switch. BGP anycast routing, where the same IP address is announced from multiple locations, provides faster failover because routing changes happen at the network layer. Telnyx's Global Edge Router uses BGP anycast across major cloud providers for instantaneous failover, so services never appear offline during a provider outage.
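To see why TTLs matter, consider a back-of-the-envelope estimate of worst-case DNS failover time: the time to detect the failure plus the time cached answers can keep pointing at the dead environment. The numbers below (30-second checks, 3 failed checks, 300-second TTL) are illustrative assumptions, not any provider's defaults, and the model ignores resolvers that cache longer than the TTL allows.

```python
def dns_failover_worst_case(check_interval_s, failures_before_unhealthy, record_ttl_s):
    """Rough upper bound, in seconds, before all clients reach the new target."""
    # Time for consecutive failed health checks to mark the endpoint unhealthy.
    detection = check_interval_s * failures_before_unhealthy
    # Resolvers may keep serving the cached record for up to one full TTL.
    return detection + record_ttl_s


print(dns_failover_worst_case(30, 3, 300))  # 390
print(dns_failover_worst_case(10, 3, 60))   # 90
```

Under these assumptions, tightening the check interval and TTL cuts the worst case from six and a half minutes to a minute and a half, which is still far too long for a live voice call.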

Health checks and monitoring. Automated failover only works if you can detect a failure. Health checks need to probe the right things: not just "is the server responding?" but "is the application returning correct results?" Shallow health checks miss partial failures, such as database connection pool exhaustion that returns errors while the server still responds.
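The gap between a shallow and a deep health check is easy to demonstrate with a toy service. In this sketch the App class, its connection pool counter, and both check functions are hypothetical stand-ins. A real deep check would run a cheap but representative query against the actual dependency.

```python
class App:
    """Toy service whose process stays up even when its database pool is empty."""

    def __init__(self, pool_size):
        self.free_connections = pool_size

    def ping(self):
        # Succeeds whenever the process is running.
        return "pong"

    def lookup_user(self, user_id):
        if self.free_connections == 0:
            raise ConnectionError("database connection pool exhausted")
        return {"id": user_id}


def shallow_check(app):
    """Answers only: is the server responding?"""
    return app.ping() == "pong"


def deep_check(app):
    """Answers: is the application returning correct results?"""
    try:
        return app.lookup_user(1) == {"id": 1}
    except Exception:
        return False


degraded = App(pool_size=0)
print(shallow_check(degraded))  # True
print(deep_check(degraded))     # False
```

Failover driven by the shallow check would leave traffic on the degraded environment. The deep check is the one that would trigger a reroute.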

Runbooks and chaos testing. A failover process that's never been tested is a failover process that doesn't work. Runbooks should document every step of a failover, including who gets paged, what decisions need to be made, and how to validate that the secondary environment is serving correctly. Chaos engineering (intentionally injecting failures in production or in realistic pre-prod) validates that your redundancy works under real conditions, not just in architecture diagrams.
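One thing a game day should produce is a number: how many requests failed between the start of the outage and the moment failover completed. The simulation below is a deliberately simplified model of that measurement. Time is discrete ticks, one probe request is sent per tick, and the detection delay is an assumed parameter rather than a measured one.

```python
def run_game_day(total_ticks, outage_starts_at, detection_delay_ticks):
    """Return the ticks at which a probe request failed."""
    failed_ticks = []
    active = "primary"
    for tick in range(total_ticks):
        primary_up = tick < outage_starts_at
        # The failover controller only reacts after the detection delay.
        if not primary_up and tick >= outage_starts_at + detection_delay_ticks:
            active = "secondary"
        # The secondary is assumed healthy for the whole exercise.
        served = primary_up if active == "primary" else True
        if not served:
            failed_ticks.append(tick)
    return failed_ticks


print(run_game_day(total_ticks=10, outage_starts_at=3, detection_delay_ticks=2))  # [3, 4]
print(run_game_day(total_ticks=10, outage_starts_at=3, detection_delay_ticks=0))  # []
```

In a real exercise the same comparison applies: record the failed probes, check them against what the runbook predicts, and treat any difference as a finding.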

Voice and communications: where redundancy is non-negotiable

Multicloud redundancy is important for web applications and data services, but it's existential for real-time communications. HTTP requests can be retried. A dropped voice call, or a silent AI agent, cannot.

For SIP trunking and voice workloads, redundancy means geo-redundant SIP endpoints with automatic re-registration when a trunk goes down. It means having SBCs (Session Border Controllers) in multiple regions that can independently route calls to the PSTN. And it means that your communications provider isn't itself a single point of failure.
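On the client side, automatic re-registration boils down to walking a prioritized list of SIP endpoints and registering with the first one that answers. The sketch below shows only that selection logic. The hostnames, the priority field, and the is_reachable callback are hypothetical, and no real SIP library or Telnyx API is used.

```python
# Hypothetical geo-redundant registrars, lowest priority number first.
ENDPOINTS = [
    {"host": "sip-east.example.com", "priority": 1},
    {"host": "sip-west.example.com", "priority": 2},
    {"host": "sip-eu.example.com", "priority": 3},
]


def pick_registrar(endpoints, is_reachable):
    """Return the highest-priority reachable endpoint, or None if all are down."""
    for endpoint in sorted(endpoints, key=lambda e: e["priority"]):
        if is_reachable(endpoint["host"]):
            return endpoint["host"]
    return None


# Simulate the east region going down; a real probe might send a SIP OPTIONS ping.
down = {"sip-east.example.com"}
print(pick_registrar(ENDPOINTS, lambda host: host not in down))  # sip-west.example.com
```

If every endpoint in the list lives on the same hyperscaler, this logic cannot help, which is the provider-level dependency discussed next.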

This is where provider architecture matters as much as your own. Many CPaaS providers depend entirely on a single hyperscaler. When that hyperscaler goes down, so does your voice service, regardless of how well you've architected your application layer. Telnyx runs on a private, multicloud network with carrier-grade voice infrastructure, including direct PSTN connectivity in 100+ countries. During the October 2025 AWS outage, Telnyx customers experienced no disruption because the platform automatically rerouted workloads across independent nodes and alternate cloud providers.

For teams building Voice AI agents, the stakes are even higher. An AI agent that goes silent mid-conversation doesn't just create a bad user experience. It erodes the trust that makes customers willing to interact with AI in the first place. Building on a platform that provides edge routing with instantaneous failover removes a failure mode that no amount of application-level engineering can compensate for.

The cost tradeoff: redundancy vs. downtime

Multicloud redundancy costs more than single-cloud deployment. You're paying for infrastructure in multiple environments, data replication bandwidth, and the engineering time to maintain parity across providers. For active-active deployments, you're running production capacity in two or more clouds simultaneously.

But the math favors redundancy for most customer-facing workloads. Flexera's 2025 report found that organizations exceeded cloud budgets by 17% on average, with 27% of cloud spend still classified as wasted. In many cases, the money lost to idle or poorly utilized resources already exceeds what a well-designed redundancy layer would cost.

Average outage duration by provider

According to Cherry Servers' analysis of outage data from August 2024 to August 2025, Azure outages averaged 14.6 hours, Google Cloud disruptions averaged 5.8 hours, and AWS incidents averaged 1.5 hours. Even the shortest of these can cost a mid-size enterprise hundreds of thousands of dollars in lost revenue and SLA penalties.
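To put those durations in dollar terms, multiply each average by an hourly cost of downtime. The durations below come from the analysis cited above. The $100,000 per hour figure is purely an assumption for illustration, so substitute your own number.

```python
# Average outage duration in hours, as reported in the analysis cited above.
AVERAGE_OUTAGE_HOURS = {"Azure": 14.6, "Google Cloud": 5.8, "AWS": 1.5}

# ASSUMPTION: hypothetical hourly cost of downtime for one business.
HOURLY_DOWNTIME_COST = 100_000


def outage_cost(hours, hourly_cost):
    """Estimated cost of one outage of average length, rounded to whole dollars."""
    return round(hours * hourly_cost)


for provider, hours in AVERAGE_OUTAGE_HOURS.items():
    print(f"{provider}: ${outage_cost(hours, HOURLY_DOWNTIME_COST):,}")
```

At that assumed rate, a single average-length incident ranges from $150,000 to $1,460,000. That per-incident figure is what the yearly cost of a second environment should be weighed against.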

The most expensive redundancy strategy is the one you implement after a major outage.

Getting started: a practical checklist

Multicloud redundancy doesn't have to be all-or-nothing. Start with the workloads where downtime is most costly and expand from there.

Identify your critical path. Which services, if they went down for an hour, would cost you the most in revenue, reputation, or regulatory penalties? Those are your first candidates for cross-cloud failover.

Audit your provider dependencies. Map every external service your application relies on. If several run on the same hyperscaler, that's a concentration risk, even if your own infrastructure is multicloud.
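A dependency audit can start as a simple tally. The sketch below takes a map of external services to the cloud each one runs on and reports any cloud that hosts more than a chosen share of them. The service names, the provider labels, and the 50% threshold are all made up for illustration.

```python
from collections import Counter

# Hypothetical audit result: external dependency -> underlying cloud.
DEPENDENCIES = {
    "payments-api": "cloud-a",
    "auth-provider": "cloud-a",
    "email-service": "cloud-a",
    "voice-platform": "cloud-b",
}


def concentration_by_provider(dependencies):
    """Share of dependencies hosted on each underlying cloud."""
    counts = Counter(dependencies.values())
    total = len(dependencies)
    return {provider: count / total for provider, count in counts.items()}


def concentration_risks(dependencies, threshold=0.5):
    """Clouds hosting more than `threshold` of all dependencies."""
    shares = concentration_by_provider(dependencies)
    return sorted(p for p, share in shares.items() if share > threshold)


print(concentration_by_provider(DEPENDENCIES))  # {'cloud-a': 0.75, 'cloud-b': 0.25}
print(concentration_risks(DEPENDENCIES))        # ['cloud-a']
```

Counting services is a crude proxy. Weighting each dependency by how critical it is to your critical path would give a more useful picture.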

Choose a communications platform built for resilience. Your CPaaS provider should run on multicloud infrastructure with automated failover, not depend on a single cloud. Look for direct PSTN connectivity, geo-distributed SIP endpoints, and a private network backbone.

Test failover regularly. Schedule quarterly game days where you simulate a provider outage and validate that your failover processes work. Document what breaks and fix it before a real incident forces you to.

Build for the outage you haven't seen yet. The pattern across 2024–2025 cloud incidents is clear: outages are becoming more complex, affecting control planes and authentication layers rather than just individual instances. Designing for today's known failure modes isn't enough; you need architecture that absorbs unknown failures gracefully.

Build on infrastructure that stays up when your cloud goes down

Cloud infrastructure is reliable until it isn't. When that moment comes, your architecture either absorbs the failure or becomes the failure.

Telnyx gives you the foundation to stay operational. Our private, multicloud network spans multiple cloud providers and private data centers, with carrier-grade voice infrastructure, direct PSTN connectivity in 100+ countries, and automated failover at every layer of the stack. No single hyperscaler dependency. No manual intervention required.

Talk to our team to see how Telnyx can make your communications infrastructure resilient by design, or explore our multicloud architecture to learn how it works under the hood.
