Cloud Failover Patterns: Pilot Light, Warm Standby, and Multi-Site Active-Active

Published: April 28, 2026 · Updated: April 28, 2026 · Reviewed by Opsio Engineering Team

Oscar Bergenbrink

CTO

Technology leadership, cloud architecture, and digital transformation strategy

Three failover patterns dominate production cloud DR designs in 2026: pilot light, warm standby, and multi-site active-active. The pattern names come from the AWS DR whitepaper, but the concepts carry over to Azure and GCP. Each pattern sits at a different point on the cost-versus-RTO/RPO curve, and picking the wrong one is expensive in either direction: too hot wastes budget, too cold misses the recovery target. This article walks through each pattern with example configuration, failover sequences, and cost ranges, so you can pick the right one for each workload tier instead of defaulting to one across the portfolio.

Backup-and-restore exists as a fourth pattern but is functionally a non-failover model: you rebuild rather than fail over. We cover it briefly at the end for completeness; the focus of this article is the three patterns where standby infrastructure is always present in some form.

Pattern 1: Pilot Light

Pilot light keeps the data layer continuously replicated and the compute layer scaled to zero or near-zero in the recovery region. Replication runs hot; compute is cold. On failover, the compute layer scales up, the database is promoted, and traffic is redirected. Typical figures: RTO 10-30 minutes, RPO single-digit minutes, cost 15-25% of production.

The reference AWS shape:

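# Recovery region (eu-north-1): pilot light.
# Assumes a provider alias "aws.recovery" for eu-north-1 and that
# aws_rds_global_cluster.app already has its primary cluster in eu-west-1.
resource "aws_rds_cluster" "recovery" {
  provider           = aws.recovery
  cluster_identifier = "app-recovery" # assumed name
  engine             = "aurora-postgresql"
  engine_version     = aws_rds_global_cluster.app.engine_version
  # Aurora Global Database: joining the global cluster sets up replication
  # (lag is typically under a second), so replication_source_identifier and
  # source_region are not set on the secondary.
  global_cluster_identifier = aws_rds_global_cluster.app.id
}

resource "aws_autoscaling_group" "recovery_app" {
  provider            = aws.recovery
  min_size            = 0 # scaled to zero in steady state
  max_size            = 20
  desired_capacity    = 0
  vpc_zone_identifier = var.recovery_subnet_ids # assumed variable
  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
  # Target group is pre-created so instances register as soon as they boot
  target_group_arns = [aws_lb_target_group.recovery.arn]
}

resource "aws_route53_health_check" "primary" {
  fqdn              = "primary.app.example.com"
  type              = "HTTPS"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "app_failover" {
  zone_id        = aws_route53_zone.app.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "primary"
  failover_routing_policy { type = "PRIMARY" }
  health_check_id = aws_route53_health_check.primary.id
  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}

# Failover routing needs a matching SECONDARY record to answer with.
# aws_lb.recovery is an assumed name for the recovery-region load balancer.
resource "aws_route53_record" "app_failover_secondary" {
  zone_id        = aws_route53_zone.app.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "secondary"
  failover_routing_policy { type = "SECONDARY" }
  alias {
    name                   = aws_lb.recovery.dns_name
    zone_id                = aws_lb.recovery.zone_id
    evaluate_target_health = true
  }
}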

The failover sequence (10-30 minutes typical):

1. Page fires from CloudWatch / Route 53 health check (1-2 minutes)
2. Incident Commander declares failover (3-5 minutes; this is a human decision)
3. Promote the Aurora Global Database secondary to primary (1-2 minutes for a managed planned failover, 3-5 minutes unplanned)
4. Scale the recovery ASG to operational size (5-15 minutes, depending on AMI cache and warm-pool config)
5. Route 53 starts answering with the secondary record once the primary fails its health check (TTL + propagation = 1-3 minutes)
6. Validate end-to-end with synthetics; declare service restored
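
The step timings above can be added up to check that they match the quoted 10-30 minute window. The sketch below is only arithmetic over the ranges listed in the sequence; it assumes the steps run strictly one after another, merges the planned and unplanned promotion times into a single 1-5 minute range, and leaves out the validation step because the sequence gives no duration for it. The step labels are shorthand, not names from any tool.

```python
# (label, low_minutes, high_minutes) taken from the failover sequence above.
# "promote database" merges the planned (1-2) and unplanned (3-5) cases.
STEPS = [
    ("detect and page", 1, 2),
    ("declare failover", 3, 5),
    ("promote database", 1, 5),
    ("scale recovery ASG", 5, 15),
    ("DNS failover", 1, 3),
]


def rto_range(steps):
    """Sum best-case and worst-case minutes, assuming sequential steps."""
    low = sum(lo for _, lo, _ in steps)
    high = sum(hi for _, _, hi in steps)
    return low, high


def slowest_step(steps):
    """Return the label of the step with the largest worst-case duration."""
    return max(steps, key=lambda step: step[2])[0]


if __name__ == "__main__":
    low, high = rto_range(STEPS)
    print(f"RTO budget: {low}-{high} minutes")
    print(f"Largest contributor: {slowest_step(STEPS)}")
```

Under these assumptions the compute scale-up is the largest single contributor, which is why AMI caching and warm pools are the first place to look when the budget needs to shrink.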

Cost example: a 20-instance production stack ($28K/month) costs roughly $5K-$7K/month as pilot light. The itemized components are the replicated database at full size ($1,800), staged AMIs and ELB ($150), Aurora cross-region replication traffic ($400-$800), warm-pool reservations ($300), the idle ASG ($0), and KMS multi-region keys plus Secrets Manager replicas ($50). These lines sum to about $2.7K-$3.1K, so the remainder of the quoted range comes from costs that are not itemized here. Pilot light is the right pattern for tier-2 workloads where a 20-minute outage is unpleasant but tolerable.
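
A quick way to sanity-check a cost breakdown like this one is to sum the line items and express the result as a share of production spend. The sketch below uses only the dollar figures quoted in the cost example; the dictionary keys are shorthand labels, and nothing here models real AWS pricing.

```python
PRODUCTION_MONTHLY = 28_000  # USD per month, from the cost example

# Itemized pilot-light costs as (low, high) USD per month.
ITEMS = {
    "replicated database": (1_800, 1_800),
    "staged AMIs and ELB": (150, 150),
    "cross-region replication traffic": (400, 800),
    "warm-pool reservations": (300, 300),
    "idle ASG": (0, 0),
    "KMS keys and Secrets Manager replicas": (50, 50),
}


def itemized_total(items):
    """Return the (low, high) sum over all line items."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


def share_of_production(amount, production=PRODUCTION_MONTHLY):
    """Express a monthly amount as a percentage of production spend."""
    return round(100 * amount / production, 1)


if __name__ == "__main__":
    low, high = itemized_total(ITEMS)
    print(f"Itemized total: ${low}-${high} per month")
    print(f"Share of production: {share_of_production(low)}%"
          f"-{share_of_production(high)}%")
```

The itemized lines come to less than the quoted $5K-$7K range, so a real estimate should list whatever makes up the difference before the number is used for budgeting.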

Pattern 2: Warm Standby

Warm standby runs a smaller-scale but always-live copy of the production stack in the recovery region. Compute is at, say, 30% of production capacity; the data layer is fully replicated; synthetic traffic keeps both regions warm. Failover is a traffic shift to the warm side, not a cold start. RTO seconds-to-minutes; RPO sub-second to seconds; cost typically 50-70% of production.

The reference Azure shape:

# Primary: West Europe at full scale
# Secondary: North Europe at 30% scale, always live

resource "azurerm_traffic_manager_profile" "app" {
  name                = "app-tm"
  resource_group_name = azurerm_resource_group.app.name
  traffic_routing_method = "Priority"
  dns_config {
    relative_name = "app-example"
    ttl           = 30
  }
  monitor_config {
    protocol = "HTTPS"
    port     = 443
    path     = "/healthz"
    interval_in_seconds = 10
    timeout_in_seconds  = 5
    tolerated_number_of_failures = 3
  }
}

# Assumes azurerm_public_ip.primary and azurerm_public_ip.warm front the two
# load balancers and have domain_name_label set, so that fqdn is populated.
resource "azurerm_traffic_manager_external_endpoint" "primary" {
  name       = "primary"
  profile_id = azurerm_traffic_manager_profile.app.id
  target     = azurerm_public_ip.primary.fqdn
  priority   = 1
}

resource "azurerm_traffic_manager_external_endpoint" "warm" {
  name       = "warm"
  profile_id = azurerm_traffic_manager_profile.app.id
  target     = azurerm_public_ip.warm.fqdn
  priority   = 2
}

# Azure SQL active geo-replication (asynchronous; RPO is typically a few seconds)
resource "azurerm_mssql_database" "warm_replica" {
  name        = "app-db-replica"
  server_id   = azurerm_mssql_server.warm.id
  create_mode = "Secondary"
  creation_source_database_id = azurerm_mssql_database.primary.id
}
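
The monitor and DNS settings in the profile above largely determine how long clients keep hitting a failed primary. The sketch below is a simplified model, not Azure's documented formula: it assumes an endpoint is marked unhealthy after `tolerated_number_of_failures + 1` consecutive failed probes, that a hung probe can take up to the full timeout, and that clients may hold a cached DNS answer for up to the TTL. The function name is made up for illustration.

```python
def failover_window(interval_s, timeout_s, tolerated_failures, ttl_s):
    """Estimate (best, worst) seconds until clients reach the warm region.

    Simplified model: the endpoint is marked unhealthy after
    tolerated_failures + 1 consecutive failed probes.
    """
    probes_needed = tolerated_failures + 1
    # Best case: the first failed probe lands right after the outage starts
    # and the client's cached DNS answer expires immediately afterwards.
    best = (probes_needed - 1) * interval_s
    # Worst case: the outage starts just after a successful probe, the last
    # probe hangs for the full timeout, and the client cached a fresh answer.
    worst = probes_needed * interval_s + timeout_s + ttl_s
    return best, worst


if __name__ == "__main__":
    # Values from monitor_config and dns_config in the profile above.
    best, worst = failover_window(
        interval_s=10, timeout_s=5, tolerated_failures=3, ttl_s=30
    )
    print(f"Estimated traffic shift: {best}-{worst} seconds")
```

Lowering the TTL or the tolerated failure count shrinks the window, at the price of more DNS queries and a higher chance of failing over on a transient blip.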

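Traffic failover for warm standby is automatic: Traffic Manager priority routing marks the primary endpoint unhealthy at the configured thresholds, and traffic shifts to the warm region within 30-90 seconds (Traffic Manager polling cycle plus DNS TTL). The database is a separate step. A geo-replicated secondary stays read-only until it is promoted, so promotion has to be scripted in the runbook or delegated to a failover group. Two application requirements follow. In-flight requests must fail cleanly and be safe to retry, and the health-check endpoint must not be a static "ok": it should actually exercise the database, cache, and downstream dependencies, so that the probe reflects whether the region can serve real traffic.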

Cost example: the same 20-instance stack as warm standby costs $14K-$20K/month. The itemized components are 30%-sized compute that is always running ($8K), a full Azure SQL replica ($2K), full storage ($1K), Traffic Manager plus bandwidth ($200), synthetic traffic load ($400), and monitoring and observability ($300). These lines sum to about $11.9K, so the remainder of the quoted range comes from costs that are not itemized here. Warm standby is the right pattern for tier-1 customer-facing workloads where a few minutes of downtime is the hard limit.
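
When a quoted range and its itemization disagree, it helps to make the gap explicit before the figure is used in a business case. The sketch below takes the warm-standby line items and the $14K-$20K range from the cost example and reports how much of the range is not explained by the listed items. The labels are shorthand; nothing here reflects actual Azure pricing.

```python
PRODUCTION_MONTHLY = 28_000      # USD per month, from the earlier example
STATED_RANGE = (14_000, 20_000)  # quoted warm-standby range, USD per month

# Itemized warm-standby costs, USD per month.
ITEMS = {
    "compute at 30% scale": 8_000,
    "Azure SQL replica": 2_000,
    "storage": 1_000,
    "Traffic Manager and bandwidth": 200,
    "synthetic traffic": 400,
    "monitoring and observability": 300,
}


def unexplained_gap(items, stated_range):
    """Return (low, high) USD in the stated range not covered by the items."""
    itemized = sum(items.values())
    low, high = stated_range
    return max(low - itemized, 0), max(high - itemized, 0)


if __name__ == "__main__":
    itemized = sum(ITEMS.values())
    pct = round(100 * itemized / PRODUCTION_MONTHLY, 1)
    gap_low, gap_high = unexplained_gap(ITEMS, STATED_RANGE)
    print(f"Itemized: ${itemized} per month ({pct}% of production)")
    print(f"Not itemized: ${gap_low}-${gap_high} per month")
```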

Pattern 3: Multi-Site Active-Active

Two or more regions both serve production traffic. There is no failover event, only a degraded region that the traffic-routing layer drains traffic away from. RTO and RPO both approach zero. The added cost is 100-140% of single-region production: the second region runs at full size, and the first region is provisioned slightly larger than it otherwise would be so it can absorb spikes when one region drains.

The architectural challenge is not infrastructure: Route 53 latency-based routing, Azure Traffic Manager performance routing, and the GCP global external Application Load Balancer (formerly the Global External HTTPS Load Balancer) all support multi-region routing. The challenge is application-layer correctness:

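Stateful writes: either route writes by tenant/key to a "home region" with cross-region read replicas (most common), or use a database that accepts writes in more than one region (Spanner, Cosmos DB multi-region writes) and accept the latency / cost. Aurora Global Database write forwarding sits in between: secondaries accept write statements but forward them to the single primary region, so it simplifies the application without making the database multi-writer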
Idempotency: every request handler tolerates being retried, often against a different region
Cache coherence: cross-region cache invalidation via Pub/Sub / SNS+SQS / EventBridge cross-region rules
Session affinity: sticky sessions are usually replaced by stateless tokens (JWT) so a session can serve from either region
Continuous chaos engineering: AWS FIS, Azure Chaos Studio, or Gremlin running cross-region failover tests in production weekly
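
Two of the items above, home-region routing for writes and idempotent request handling, can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: the region list is hypothetical, `home_region` and `IdempotentHandler` are made-up names, and the in-memory dictionary stands in for a store that would have to be replicated across regions for retries against a different region to be recognized.

```python
import hashlib

REGIONS = ["eu-west-1", "eu-north-1"]  # hypothetical region list


def home_region(tenant_id, regions, drained=frozenset()):
    """Map a tenant to a stable home region, skipping drained regions."""
    if all(region in drained for region in regions):
        raise RuntimeError("no healthy region available")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(regions)
    # Walk the list from the tenant's home slot to the first healthy region.
    for offset in range(len(regions)):
        region = regions[(start + offset) % len(regions)]
        if region not in drained:
            return region


class IdempotentHandler:
    """Apply each idempotency key at most once; replay the stored result."""

    def __init__(self, apply):
        self._apply = apply
        self._results = {}  # stand-in for a cross-region replicated store

    def handle(self, idempotency_key, payload):
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._apply(payload)
        self._results[idempotency_key] = result
        return result


if __name__ == "__main__":
    home = home_region("tenant-42", REGIONS)
    fallback = home_region("tenant-42", REGIONS, drained={home})
    print(f"home={home} fallback={fallback}")

    handler = IdempotentHandler(lambda payload: {"charged": payload["amount"]})
    first = handler.handle("req-1", {"amount": 100})
    retry = handler.handle("req-1", {"amount": 100})
    print(first == retry)
```

Hashing the tenant ID keeps the mapping stable without a lookup table, and the drained set lets the routing layer move a tenant's writes only while its home region is out of rotation.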

Active-active is the right pattern for a small set of workloads (payments, trading, real-time telemedicine) and the wrong pattern for everything else. Enterprises that try to run every customer-facing workload active-active tend to end up with a list that is too long, an engineering team that can't keep up, and a DR posture that is worse than well-implemented warm standby would have been.

The Backup-and-Restore Floor

For tier-3 and tier-4 workloads, backup-and-restore remains the right pattern. The point is to keep recovery cheap. Typical building blocks are AWS Backup with cross-region copy, EBS snapshot lifecycle policies, and S3 cross-region replication on AWS; Azure Backup with GRS storage on Azure; and persistent disk snapshots stored in a multi-region location on GCP. RTO is hours, RPO is hours to a day, and cost is a single-digit percentage of production. Anything that does not need a hotter pattern should sit here.

Picking the Right Pattern Per Workload

Tier	RTO target	Pattern	Examples
0 (mission-critical)	~0	Multi-site active-active	Payments, real-time fraud, trading
1 (customer-facing)	1-5 min	Warm standby	Storefront, customer portal, regulated customer service
2 (internal critical)	10-30 min	Pilot light	Internal CRM, ERP, BI
3 (batch / non-critical)	4-24 h	Backup & restore	Reporting, dev / test, archive
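
The table can be read as a lookup: given a workload's RTO target, choose the cheapest pattern whose worst-case RTO still meets it. The sketch below encodes that rule under two assumptions that are not in the table itself: the upper end of each RTO range is treated as the pattern's worst case, and "~0" is treated as zero minutes. The function name is illustrative.

```python
# (pattern, worst-case RTO in minutes), ordered from cheapest to most expensive.
# Worst cases are the upper ends of the RTO ranges in the table above.
PATTERNS = [
    ("Backup & restore", 24 * 60),
    ("Pilot light", 30),
    ("Warm standby", 5),
    ("Multi-site active-active", 0),
]


def pick_pattern(rto_target_minutes):
    """Return the cheapest pattern whose worst-case RTO meets the target."""
    for name, worst_case_rto in PATTERNS:
        if worst_case_rto <= rto_target_minutes:
            return name
    # Nothing cheaper qualifies, so fall back to the hottest pattern.
    return PATTERNS[-1][0]


if __name__ == "__main__":
    for target in (0, 2, 5, 60, 24 * 60):
        print(f"RTO target {target} min: {pick_pattern(target)}")
```

RTO is only one input; RPO, data gravity, and team capacity can each push a workload up or down a tier, so treat the output as a starting point for the tiering discussion.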

How Opsio Helps

Opsio designs and operates failover-pattern architectures across AWS, Azure, and GCP under our cloud disaster recovery service. We size each workload to the right pattern, build the topology in Terraform, codify failover into runbooks, and run quarterly tests. For AWS-native pilot-light and warm-standby designs, we integrate with our broader AWS cloud platforms practice; for Azure designs the equivalent sits in Azure cloud platform. Most engagements move customers from a single-region production estate with a paper DR plan to a tiered, code-defined failover topology in 10-14 weeks.
