Cloud Failover Patterns: Pilot Light, Warm Standby, and Multi-Site Active-Active
Three failover patterns dominate production cloud DR designs in 2026: pilot light, warm standby, and multi-site active-active. The pattern names come from the AWS DR whitepaper, but the concepts apply identically on Azure and GCP. Each pattern represents a different point on the cost-versus-RTO/RPO curve, and choosing wrong is the most expensive mistake in DR architecture. This article walks through each pattern with real configuration, real failover sequences, and real cost ranges, so you can pick the right one for each workload tier instead of defaulting to one across the portfolio.
Backup-and-restore exists as a fourth pattern but is functionally a non-failover model — you rebuild rather than fail over. We cover it briefly at the end for completeness; the meat of this article is the three patterns where standby infrastructure is always present in some form.
Pattern 1: Pilot Light
Pilot light keeps the data layer continuously replicated and the compute layer scaled to zero or near-zero in the recovery region. Replication runs hot; compute is cold. On failover, the compute layer scales up, the database is promoted, and traffic is redirected. Typical figures: RTO of 10-30 minutes, RPO in single-digit minutes, and cost at 15-25% of production.
The reference AWS shape:
```hcl
# Recovery region (eu-north-1): pilot light
resource "aws_rds_cluster" "recovery" {
  engine        = "aurora-postgresql"
  source_region = "eu-west-1"

  # Aurora Global Database: replication lag is typically sub-second.
  # Membership in the global cluster is what establishes replication, so
  # replication_source_identifier is not configured on a global secondary.
  global_cluster_identifier = aws_rds_global_cluster.app.id
}

resource "aws_autoscaling_group" "recovery_app" {
  min_size         = 0 # scaled to zero in steady state
  max_size         = 20
  desired_capacity = 0

  # Subnet placement (vpc_zone_identifier) is omitted here for brevity.
  launch_template {
    id = aws_launch_template.app.id
  }

  # Target group is pre-created so warm-up is fast on failover
  target_group_arns = [aws_lb_target_group.recovery.arn]
}

resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.app.example.com"
  type              = "HTTPS"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "app_failover" {
  zone_id         = aws_route53_zone.app.zone_id
  name            = "app.example.com"
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}
```
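A failover routing policy only has somewhere to send traffic if a matching SECONDARY record exists. The snippet above shows the PRIMARY side only, so here is a minimal sketch of the companion record. It assumes a recovery-region load balancer named `aws_lb.recovery`, which is a hypothetical name that does not appear in the original configuration.

```hcl
# Sketch only: aws_lb.recovery is an assumed resource name for the
# recovery-region load balancer.
resource "aws_route53_record" "app_failover_secondary" {
  zone_id        = aws_route53_zone.app.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = aws_lb.recovery.dns_name
    zone_id                = aws_lb.recovery.zone_id
    evaluate_target_health = true
  }
}
```

With both records in place, Route 53 answers with the secondary alias once the primary health check is failing. In a pilot-light design that can happen before the recovery ASG has scaled up, so the scale-up step should not wait on DNS.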
The failover sequence (10-30 minutes typical):
- Page fires from CloudWatch / Route 53 health check (1-2 minutes)
- Incident Commander declares failover (3-5 minutes — human decision)
- Promote the Aurora Global Database secondary cluster to primary (1-2 minutes for a managed planned switchover, 3-5 minutes for an unplanned failover, which is the case in a real regional outage)
- Scale the recovery ASG to operational size (5-15 minutes, depending on AMI cache and warm-pool config)
- Route 53 starts answering with the secondary failover record once the primary fails its health check (TTL + propagation = 1-3 minutes)
- Validate end-to-end with synthetics; declare service restored
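The sequence above can be checked as a simple time budget. The sketch below is illustrative rather than part of any real deployment: the step names are made up for this example, the bounds are transcribed from the list, the validation step is left out because no duration is given for it, and the steps are assumed to run strictly one after another.

```hcl
# Sketch only: pure arithmetic in locals, no AWS resources involved.
locals {
  failover_steps_minutes = {
    detect_and_page  = { low = 1, high = 2 }
    declare_failover = { low = 3, high = 5 }
    promote_database = { low = 1, high = 5 } # 1-2 planned, 3-5 unplanned
    scale_recovery   = { low = 5, high = 15 }
    dns_shift        = { low = 1, high = 3 }
  }

  rto_low_minutes  = sum([for step in values(local.failover_steps_minutes) : step.low])
  rto_high_minutes = sum([for step in values(local.failover_steps_minutes) : step.high])
}

output "pilot_light_rto_minutes" {
  # Evaluates to "11-30" with the bounds above
  value = "${local.rto_low_minutes}-${local.rto_high_minutes}"
}
```

The total lands on the 10-30 minute range quoted for pilot light, and it makes the two largest contributors visible: the human decision and the ASG scale-up.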
Cost example: a 20-instance production stack ($28K/month) run as pilot light costs roughly $5K-$7K/month. The line items named here are the full-size replicated database ($1,800), staged AMIs and ELB ($150), Aurora cross-region replication traffic ($400-$800), warm-pool reservations ($300), the idle ASG ($0), and KMS multi-region keys plus Secrets Manager replicas ($50). Those items sum to about $2,700-$3,100, so the rest of the quoted range covers costs not broken out in this list. Pilot light is the right pattern for tier-2 workloads where a 20-minute outage is unpleasant but tolerable.
Pattern 2: Warm Standby
Warm standby runs a smaller-scale but always-live copy of the production stack in the recovery region. Compute is at, say, 30% of production capacity; the data layer is fully replicated; synthetic traffic keeps both regions warm. Failover is a traffic shift to the warm side, not a cold start. RTO seconds-to-minutes; RPO sub-second to seconds; cost typically 50-70% of production.
The reference Azure shape:
```hcl
# Primary: West Europe at full scale
# Secondary: North Europe at 30% scale, always live
resource "azurerm_traffic_manager_profile" "app" {
  name                   = "app-tm"
  resource_group_name    = azurerm_resource_group.app.name
  traffic_routing_method = "Priority"

  dns_config {
    relative_name = "app-example"
    ttl           = 30
  }

  monitor_config {
    protocol                     = "HTTPS"
    port                         = 443
    path                         = "/healthz"
    interval_in_seconds          = 10
    timeout_in_seconds           = 5
    tolerated_number_of_failures = 3
  }
}

# azurerm_lb exposes public_ip_address_id, not the address itself, so the
# endpoints read the address from the public IP resources. The names
# azurerm_public_ip.primary and azurerm_public_ip.warm are assumed.
resource "azurerm_traffic_manager_external_endpoint" "primary" {
  name       = "primary"
  profile_id = azurerm_traffic_manager_profile.app.id
  target     = azurerm_public_ip.primary.ip_address
  priority   = 1
}

resource "azurerm_traffic_manager_external_endpoint" "warm" {
  name       = "warm"
  profile_id = azurerm_traffic_manager_profile.app.id
  target     = azurerm_public_ip.warm.ip_address
  priority   = 2
}

# Azure SQL active geo-replication: asynchronous, so RPO is seconds, not zero
resource "azurerm_mssql_database" "warm_replica" {
  name                        = "app-db-replica"
  server_id                   = azurerm_mssql_server.warm.id
  create_mode                 = "Secondary"
  creation_source_database_id = azurerm_mssql_database.primary.id
}
```
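Traffic failover for warm standby is automatic: Traffic Manager priority routing marks the primary endpoint as failed at the configured thresholds, and traffic shifts to the warm region within 30-90 seconds (TTL + Traffic Manager polling cycle). Traffic Manager only changes DNS answers, so the Azure SQL geo-secondary still has to be promoted, manually or through an auto-failover group, before the warm region can accept writes. For the probe to be trustworthy, the health-check endpoint cannot be a static "ok"; it has to exercise the database, cache, and downstream dependencies. The application also has to handle in-flight requests gracefully during the shift.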
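Database promotion can be automated with an Azure SQL auto-failover group, since Traffic Manager itself only moves traffic. The sketch below is one possible shape under stated assumptions, not the configuration used above: `azurerm_mssql_server.primary` is an assumed name for the primary logical server, and because a failover group creates and manages the geo-secondary itself, it would replace the standalone `create_mode = "Secondary"` database rather than sit alongside it.

```hcl
# Sketch only: azurerm_mssql_server.primary is an assumed resource name.
resource "azurerm_mssql_failover_group" "app" {
  name      = "app-db-fog"
  server_id = azurerm_mssql_server.primary.id
  databases = [azurerm_mssql_database.primary.id]

  partner_server {
    id = azurerm_mssql_server.warm.id
  }

  read_write_endpoint_failover_policy {
    mode          = "Automatic"
    grace_minutes = 60
  }
}
```

The application would then connect through the failover group's listener endpoint instead of a region-specific server name. Note that the grace period delays automatic failover (60 minutes in this sketch), so a minutes-level RTO usually still needs a runbook-triggered failover of the group.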
Cost example: the same 20-instance stack run as warm standby costs $14K-$20K/month. The line items named here are 30%-sized compute that is always running ($8K), a full Azure SQL replica ($2K), full storage ($1K), Traffic Manager plus bandwidth ($200), synthetic traffic load ($400), and monitoring and observability ($300). Those items sum to about $11.9K, so the quoted range also covers costs not itemized here. Warm standby is the right pattern for tier-1 customer-facing workloads where a few minutes of downtime is the hard limit.
Pattern 3: Multi-Site Active-Active
Two or more regions both serve production traffic. There is no failover event; there is only a degradation of one region, which the traffic-routing layer drains traffic away from. RTO and RPO both approach zero. The added cost is 100-140% of single-region production: the second region runs at full size, and each region carries extra headroom so it can absorb the load when the other one drains.
The architectural challenge is not infrastructure: Route 53 latency-based routing, Azure Traffic Manager performance routing, and the GCP Global External HTTPS Load Balancer all support multi-region routing. The challenge is application-layer correctness:
- Stateful writes: either route writes by tenant/key to a "home region" with cross-region read replicas (most common), or use a globally distributed write database (Spanner, Cosmos DB multi-region writes, or Aurora Global Database with write forwarding, which still commits writes in a single primary region) and accept the latency and cost
- Idempotency: every request handler tolerates being retried, often against a different region
- Cache coherence: cross-region cache invalidation via Pub/Sub / SNS+SQS / EventBridge cross-region rules
- Session affinity: sticky sessions are usually replaced by stateless tokens (JWT) so a session can be served from either region
- Continuous chaos engineering: AWS FIS, Azure Chaos Studio, or Gremlin running cross-region failover tests in production weekly
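On the routing side, the Route 53 latency-based option mentioned above can be sketched as two records that share a name and differ only in region. This is a minimal illustration under assumptions: `aws_lb.secondary` and the `eu-north-1` placement of the second region are hypothetical, and the application-layer concerns in the list are not addressed by DNS at all.

```hcl
# Sketch only: aws_lb.secondary is an assumed resource name.
resource "aws_route53_record" "app_eu_west" {
  zone_id        = aws_route53_zone.app.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "eu-west-1"

  latency_routing_policy {
    region = "eu-west-1"
  }

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "app_eu_north" {
  zone_id        = aws_route53_zone.app.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "eu-north-1"

  latency_routing_policy {
    region = "eu-north-1"
  }

  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }
}
```

With `evaluate_target_health` enabled on both aliases, an unhealthy region drops out of the answers and clients are served by the remaining one, which is the "drain" behavior described for this pattern.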
Active-active is the right pattern for a small set of workloads (payments, trading, real-time telemedicine) and the wrong pattern for everything else. Most enterprises that try to run "all tier-1 active-active" end up with a tier-1 list that is too long, an engineering team that can't keep up, and a DR posture that is worse than well-implemented warm standby would have been.
The Backup-and-Restore Floor
For tier-3 and tier-4 workloads, backup-and-restore remains the right pattern. The point is to keep recovery cheap. On AWS that means AWS Backup with cross-region copy, EBS snapshot lifecycle policies, and S3 cross-region replication; on Azure, Azure Backup with GRS storage; on GCP, persistent disk snapshots stored in a multi-region location. RTO is hours, RPO is hours to a day, and cost is a single-digit percentage of production. Anything that does not need a hotter pattern should sit here.
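For the AWS variant, a daily backup with a cross-region copy might look like the sketch below. Everything named here is an assumption for illustration: the `aws.recovery` provider alias, the vault names, the 03:00 UTC schedule, and the 35-day retention are not taken from the article.

```hcl
# Sketch only: provider alias, vault names, schedule, and retention are
# illustrative assumptions.
provider "aws" {
  alias  = "recovery"
  region = "eu-north-1"
}

resource "aws_backup_vault" "primary" {
  name = "tier3-primary"
}

resource "aws_backup_vault" "recovery" {
  provider = aws.recovery
  name     = "tier3-recovery"
}

resource "aws_backup_plan" "tier3" {
  name = "tier3-daily"

  rule {
    rule_name         = "daily-with-cross-region-copy"
    target_vault_name = aws_backup_vault.primary.name
    schedule          = "cron(0 3 * * ? *)" # daily at 03:00 UTC

    lifecycle {
      delete_after = 35 # days
    }

    copy_action {
      destination_vault_arn = aws_backup_vault.recovery.arn

      lifecycle {
        delete_after = 35 # days
      }
    }
  }
}
```

A daily schedule lines up with the "hours to a day" RPO for this tier. The plan still needs an `aws_backup_selection` to attach resources to it, which is omitted here.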
Picking the Right Pattern Per Workload
| Tier | RTO target | Pattern | Examples |
|---|---|---|---|
| 0 (mission-critical) | ~0 | Multi-site active-active | Payments, real-time fraud, trading |
| 1 (customer-facing) | 1-5 min | Warm standby | Storefront, customer portal, regulated customer service |
| 2 (internal critical) | 10-30 min | Pilot light | Internal CRM, ERP, BI |
| 3 (batch / non-critical) | 4-24 h | Backup & restore | Reporting, dev / test, archive |
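If the topology is defined in Terraform, the tier table can be carried into code as a lookup so that each workload's pattern is derived rather than hard-coded. This is a hedged sketch: the tier data is transcribed from the table, while the workload inventory and all local names are hypothetical.

```hcl
locals {
  # Transcribed from the table above; keys are tier numbers as strings.
  dr_pattern_by_tier = {
    "0" = { pattern = "multi-site-active-active", rto = "~0" }
    "1" = { pattern = "warm-standby", rto = "1-5 min" }
    "2" = { pattern = "pilot-light", rto = "10-30 min" }
    "3" = { pattern = "backup-and-restore", rto = "4-24 h" }
  }

  # Hypothetical workload inventory, for illustration only.
  workload_tiers = {
    payments   = "0"
    storefront = "1"
    erp        = "2"
    reporting  = "3"
  }

  workload_patterns = {
    for name, tier in local.workload_tiers :
    name => local.dr_pattern_by_tier[tier].pattern
  }
}

output "workload_patterns" {
  value = local.workload_patterns
}
```

Keeping the mapping in one place makes a tier change a one-line edit and makes it harder for a workload to drift into a hotter pattern than its tier justifies.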
How Opsio Helps
Opsio designs and operates failover-pattern architectures across AWS, Azure, and GCP under our cloud disaster recovery service. We size each workload to the right pattern, build the topology in Terraform, codify failover into runbooks, and run quarterly tests. For AWS-native pilot-light and warm-standby designs, we integrate with our broader AWS cloud platforms practice; for Azure designs the equivalent sits in Azure cloud platform. Most engagements move customers from a single-region production estate with a paper DR plan to a tiered, code-defined failover topology in 10-14 weeks.
Editorial standards: This article was written by a certified practitioner and peer-reviewed by our engineering team. We update content quarterly to ensure technical accuracy. Opsio maintains editorial independence — we recommend solutions based on technical merit, not commercial relationships.