
Disaster Recovery & Business Continuity in the Cloud: Planning Guide for India

Fredrik Karlsson

Group COO & CISO

Reviewed by Opsio Engineering Team


Disaster recovery and business continuity (BCDR) planning determines whether an organisation survives a major outage or spirals into extended downtime, data loss, and regulatory penalties. In cloud environments, BCDR shifts from expensive idle hardware to elastic, software-defined resilience — but only if the planning is rigorous. This guide covers how to design, implement, and test DR/BC across AWS, Azure, and GCP, with specific attention to Indian regulatory requirements (DPDPA 2023, RBI cloud circulars, SEBI guidelines, MeitY directives) and multi-region considerations for organisations operating out of India.

Key Takeaways

  • Business continuity is the strategic umbrella; disaster recovery is the technical subset that restores IT systems after an outage.
  • RTO and RPO are the two numbers that drive every architecture and budget decision in DR planning.
  • DPDPA 2023, RBI cloud circulars, and SEBI guidelines impose enforceable obligations on data residency, incident response, and recovery capabilities that directly shape DR design for India-operating organisations.
  • Multi-cloud DR is achievable but operationally expensive — most organisations get better resilience from multi-region within a single provider, leveraging ap-south-1 (Mumbai) and ap-south-2 (Hyderabad).
  • Untested DR plans fail. Quarterly game-day exercises that simulate real failures are the single highest-value investment in resilience.

Business Continuity vs. Disaster Recovery: Drawing the Line

These terms get used interchangeably, and that creates real confusion during an actual incident. Here is the operational distinction:

Business continuity (BC) is the organisational strategy for maintaining essential functions during and after a disruption. It covers people (succession planning, remote work enablement), processes (manual workarounds, alternate suppliers), communications (stakeholder notification, crisis PR), and technology.

Disaster recovery (DR) is the technical execution plan for restoring IT systems, applications, and data. It sits inside BC the way an engine sits inside a vehicle — critical, but not the whole machine.

| Dimension | Business Continuity | Disaster Recovery |
|---|---|---|
| Scope | Entire organisation | IT infrastructure and data |
| Primary owner | C-suite / risk management | CTO / VP Infrastructure / DevOps lead |
| Key metric | Minimum Business Continuity Objective (MBCO) | RTO and RPO |
| Output | Business Continuity Plan (BCP) | DR runbooks, failover automation |
| Standards | ISO 22301, BS 25999 | ISO 27031, NIST SP 800-34 |
| Regulatory drivers | DPDPA 2023, RBI BC/BCP guidelines, SEBI cloud framework | DPDPA 2023, RBI IT outsourcing circulars, GDPR Article 32 (for EU data) |

The practical mistake we see from Opsio's NOC operations: organisations invest heavily in DR tooling (replication, automated failover) but skip the BC layer. When an incident hits, the systems recover to a secondary region in twelve minutes — and then nobody knows who authorises the DNS cutover, customers get no status page update for two hours, and the finance team cannot process payments because they never documented the manual workaround. DR without BC is half a plan.


RTO, RPO, and the Tier Model That Drives Everything

Every BCDR architecture decision flows from two numbers:

  • Recovery Time Objective (RTO): Maximum acceptable downtime. If your RTO is 15 minutes, you need hot standby. If it is 24 hours, a pilot-light or backup-and-restore approach works.
  • Recovery Point Objective (RPO): Maximum acceptable data loss measured in time. An RPO of zero means synchronous replication. An RPO of one hour means you can tolerate losing the last hour of transactions.

Tiering Your Applications

Not every system deserves the same investment. We recommend a four-tier model:

| Tier | RTO | RPO | Architecture | Example |
|---|---|---|---|---|
| Tier 1 — Mission Critical | < 15 min | Near-zero | Multi-region active-active or hot standby | UPI/payment processing, core banking platform, core SaaS product |
| Tier 2 — Business Critical | 1–4 hours | < 1 hour | Warm standby with automated failover | ERP, CRM, internal APIs |
| Tier 3 — Important | 12–24 hours | < 24 hours | Pilot light or infrastructure-as-code redeploy | Staging environments, reporting systems |
| Tier 4 — Non-Critical | 48–72 hours | < 72 hours | Backup and restore from snapshots | Dev/test, archival systems |
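To make the tiering mechanical rather than a judgment call during a workshop, the four-tier model can be expressed as a small classification helper. This is our own illustrative sketch — the thresholds mirror the table above, but the function name and return shape are not a standard API:

```python
# Map an application's agreed RTO/RPO (in hours) to a DR tier and a
# recommended architecture. Thresholds mirror the four-tier model above;
# names and structure are illustrative, not a standard API.

def classify_tier(rto_hours: float, rpo_hours: float) -> tuple[int, str]:
    """Return (tier, recommended architecture) for the given objectives."""
    if rto_hours <= 0.25 and rpo_hours <= 0.01:  # < 15 min RTO, near-zero RPO
        return 1, "multi-region active-active or hot standby"
    if rto_hours <= 4 and rpo_hours <= 1:
        return 2, "warm standby with automated failover"
    if rto_hours <= 24 and rpo_hours <= 24:
        return 3, "pilot light or infrastructure-as-code redeploy"
    return 4, "backup and restore from snapshots"

# Example: a payments API with a 10-minute RTO and near-zero RPO lands in Tier 1
tier, arch = classify_tier(10 / 60, 0.0)
```

Running every inventoried workload through a rule like this during the business impact analysis also makes the "everything is Tier 1" mistake visible: the tier follows from the objectives, not from a checkbox.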

The biggest budgetary mistake: classifying everything as Tier 1. Opsio's Cloud FinOps practice regularly finds organisations spending three to five times more than necessary on DR because someone ticked "mission critical" on every system during a risk assessment checkbox exercise years ago. We have seen Indian enterprises spending upwards of ₹50 lakh per annum on standby infrastructure for systems that could comfortably tolerate a 24-hour RTO. Revisit tiers annually against actual business impact data.

Cloud DR Architectures: What Each Provider Offers

AWS

AWS provides the most mature native DR tooling, and with two Indian regions — ap-south-1 (Mumbai) and ap-south-2 (Hyderabad) — organisations subject to Indian data residency requirements can architect DR entirely within the country. Key services:

  • AWS Elastic Disaster Recovery (AWS DRS): Continuous block-level replication of on-premises or cloud servers to a staging area in a target AWS Region. Launches recovery instances within minutes. For Indian BFSI workloads, replicating from Mumbai to Hyderabad keeps data within Indian borders.
  • S3 Cross-Region Replication (CRR): Asynchronous object replication for data-tier DR. Configure CRR between ap-south-1 and ap-south-2 for domestic DR.
  • Aurora Global Database: Sub-second replication across up to five Regions with managed failover for relational workloads.
  • Route 53 health checks + failover routing: DNS-level traffic shifting during regional outages.
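As a concrete example of the data-tier piece, here is a minimal sketch of the replication configuration document that S3 CRR expects — the dict you would pass to boto3's `s3.put_bucket_replication(Bucket=..., ReplicationConfiguration=...)` after enabling versioning on both buckets. The bucket name and IAM role ARN are placeholders:

```python
# Sketch of an S3 Cross-Region Replication configuration for domestic DR
# (e.g. a bucket in ap-south-1 replicating to one in ap-south-2).
# Bucket names and the IAM role ARN below are placeholders.

def crr_config(role_arn: str, dest_bucket: str) -> dict:
    """Build the ReplicationConfiguration document for put_bucket_replication."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate objects
        "Rules": [{
            "ID": "domestic-dr",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},  # empty filter = replicate all objects in the bucket
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {
                "Bucket": f"arn:aws:s3:::{dest_bucket}",
                "StorageClass": "STANDARD_IA",  # cheaper class for DR copies
            },
        }],
    }

cfg = crr_config("arn:aws:iam::123456789012:role/s3-crr-role", "dr-bucket-hyd")
```

Both buckets must have versioning enabled before CRR will activate, and replication is not retroactive — existing objects need a separate batch replication job.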

AWS Well-Architected Framework's Reliability Pillar defines four DR strategies explicitly — backup & restore, pilot light, warm standby, and multi-site active-active — and maps them to RTO/RPO ranges. This is the best vendor-provided DR reference document available and should be required reading for any DR architect.

Azure

Azure offers Central India (Pune) and South India (Chennai) regions, with Central India being the primary region for most Indian enterprise workloads.

  • Azure Site Recovery (ASR): VM replication between Azure regions or from on-premises to Azure. Supports orchestrated recovery plans with sequenced startup. For Indian data residency, replicate between Central India and South India.
  • Azure Paired Regions: Microsoft designates region pairs with guaranteed sequential updates and prioritised recovery. Note that Indian region pairing specifics should be validated against current Azure documentation.
  • Cosmos DB multi-region writes: Active-active at the data layer with configurable consistency levels.
  • Azure Front Door: Global load balancing with automatic failover.
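The "sequenced startup" idea behind ASR recovery plans — data tier up before the app tier, app tier before the web tier — is just dependency-ordered boot. A minimal sketch of that ordering logic, with illustrative group names (this models the concept, not the ASR API itself):

```python
# Minimal sketch of sequenced startup, the concept behind ASR recovery plans:
# boot VM groups in dependency order so nothing starts before what it needs.
# Group names and dependencies are illustrative.

from graphlib import TopologicalSorter

# group -> set of groups that must be healthy before it starts
dependencies = {
    "database": set(),
    "cache": set(),
    "app-tier": {"database", "cache"},
    "web-tier": {"app-tier"},
}

boot_order = list(TopologicalSorter(dependencies).static_order())
# data stores first, web tier last
```

In a real recovery plan each group would also carry health-check gates between stages; starting the next tier on a timer rather than on a passed health check is a common source of cascading failed failovers.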

One operational nuance: ASR's replication lag for large-disk VMs can exceed published guidelines under heavy I/O. Test with production-representative workloads, not empty VMs.

GCP

  • Cross-region managed instance groups: Auto-scaling across regions with global HTTP(S) load balancing.
  • Cloud Spanner: Globally distributed relational database with synchronous replication — effectively built-in Tier 1 DR for the data layer.
  • Backup and DR Service: Managed backup for Compute Engine, GKE, and databases with orchestrated recovery.

GCP's region count in India is more limited (asia-south1 in Mumbai and asia-south2 in Delhi). Organisations with strict data residency requirements should verify that both primary and DR target regions meet their regulatory obligations before committing to GCP-only architectures.


Regulatory Landscape: DPDPA 2023, RBI, SEBI, MeitY, and What They Require

DPDPA 2023 (Digital Personal Data Protection Act)

India's DPDPA 2023 requires Data Fiduciaries to implement "reasonable security safeguards" to protect personal data, which explicitly includes the ability to restore availability. While detailed rules are still being notified in 2026, the direction is clear:

  • Data localisation: The Act empowers the Central Government to restrict transfer of personal data to certain jurisdictions. Organisations should design DR architectures that can keep data within India — replicating between ap-south-1 (Mumbai) and ap-south-2 (Hyderabad), or between Azure Central India and South India — to avoid being caught out by future notifications.
  • Breach notification: Data Fiduciaries must notify the Data Protection Board of India of breaches. Your DR plan must include automated detection and escalation to meet reporting timelines once they are formally prescribed.
  • Documented security measures: Expect regulatory audits to review your DR and BCP documentation. Having tested, current runbooks is not optional — it is compliance evidence.

RBI Cloud Circulars (BFSI)

The Reserve Bank of India has issued multiple circulars governing cloud adoption by regulated entities (banks, NBFCs, payment aggregators). Key DR implications:

  • Data residency within India is mandatory for BFSI workloads. DR targets must be in Indian cloud regions. Replicating banking data to Singapore or Ireland for DR purposes is non-compliant.
  • BCP/DR testing must be conducted periodically, with results reported to the Board. RBI examiners increasingly ask for evidence of DR test outcomes during inspections.
  • Vendor risk management: If Opsio or another MSP manages your DR, the regulated entity remains responsible. Service-level agreements must contractually specify RTO, RPO, and testing frequency. Opsio's ISO 27001 and SOC 2 certifications support this compliance chain.
  • Outsourcing guidelines require that regulated entities maintain the ability to switch providers or bring operations in-house — which means DR documentation must be provider-portable.

SEBI Cloud Guidelines

SEBI's framework for cloud adoption by regulated entities (stock exchanges, depositories, brokers, mutual funds) imposes similar data localisation requirements:

  • Critical systems (trading, clearing, settlement) must have data stored and processed within India.
  • DR drills must be conducted at prescribed frequencies, with results shared with SEBI.
  • Encryption and access controls in the DR environment must match production-grade security posture.

MeitY Guidelines

MeitY's cloud policy directives for government workloads mandate the use of empanelled cloud service providers and data hosting within India. Government and public-sector organisations must ensure their DR architectures use MeitY-empanelled providers and Indian data centres exclusively.

GDPR (for India-Based Organisations with EU Customers)

For Indian IT services and SaaS companies serving EU customers, GDPR Article 32(1)(c) requires "the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident." If your DR plan replicates EU citizen data to an Indian DR region, GDPR's cross-border transfer rules (Chapter V) require appropriate safeguards — Standard Contractual Clauses (SCCs) at minimum, as India does not currently have an EU adequacy decision. Many Indian organisations find it simpler to keep EU customer data DR within EU regions (e.g., AWS eu-west-1 to eu-central-1) and Indian customer data DR within Indian regions.


Building the DR Runbook: From Document to Executable Plan

A DR plan that lives in a Confluence page nobody has read since it was written is not a plan. It is a liability. Here is what a production-grade DR runbook contains:

1. Scope and Activation Criteria

Define exactly what events trigger DR activation. "Major outage" is not specific enough. Examples: "Complete loss of availability in ap-south-1 (Mumbai) lasting more than 15 minutes as confirmed by CloudWatch composite alarms and PagerDuty incident." Include who authorises activation (by name and backup), because the worst time to debate authority is during an incident.
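Activation criteria like the example above can be encoded as an executable check rather than left to interpretation mid-incident: two independent signals (monitoring and paging) plus a duration gate, so a transient blip never triggers failover. The signal names here are illustrative:

```python
# Activation criteria as an executable check, mirroring the example above:
# declare DR only when monitoring AND paging both confirm the outage, and it
# has lasted past the threshold. Signal names are illustrative.

from datetime import datetime, timedelta, timezone

def should_activate_dr(alarm_breached: bool,
                       pagerduty_incident_open: bool,
                       outage_started_at: datetime,
                       now: datetime,
                       threshold: timedelta = timedelta(minutes=15)) -> bool:
    """Two independent signals plus a duration gate, to avoid false failovers."""
    confirmed = alarm_breached and pagerduty_incident_open
    long_enough = (now - outage_started_at) >= threshold
    return confirmed and long_enough

start = datetime(2026, 1, 10, 9, 0, tzinfo=timezone.utc)
later = start + timedelta(minutes=20)
should_activate_dr(True, True, start, later)   # activate: both signals, 20 min in
should_activate_dr(True, False, start, later)  # hold: paging has not confirmed
```

Whether the decision is automated or a named human runs this check by hand, the point is the same: the criteria are written down before the incident, not debated during it.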

2. Communication Plan

  • Internal: PagerDuty / Opsgenie escalation policies, Slack or Microsoft Teams war-room channels (pre-created, not created during the incident), bridge call details
  • External: Status page update procedures (Statuspage, Instatus), customer email templates pre-approved by legal, regulatory notification checklist (Data Protection Board of India for DPDPA breaches, RBI notification for BFSI incidents, GDPR 72-hour breach notification if EU personal data is affected)

3. Recovery Procedures — Step by Step

Each Tier 1 and Tier 2 system needs a numbered procedure, not a paragraph of prose. Include:

  • Pre-failover validation checks (is the target region healthy? are replicas in sync?)
  • Failover execution commands or automation references (Terraform workspaces, AWS DRS launch templates, ASR recovery plans)
  • Post-failover validation (smoke tests, synthetic monitoring via Datadog or Dynatrace, database integrity checks)
  • DNS cutover procedure with TTL considerations (lower TTLs to 60 seconds before planned tests; document current TTLs for unplanned events)
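The TTL consideration in the last step is worth making explicit with arithmetic: a resolver may keep serving the old record for up to the TTL that was in effect when it cached the answer. A small planning helper (our own sketch):

```python
# Worst-case DNS propagation math for the TTL guidance above: a resolver can
# serve the old record for up to the TTL in effect when it cached it.

def cutover_window_seconds(current_ttl: int, lowered_ttl: int) -> dict:
    """Plan a TTL lowering ahead of a DNS cutover. All values in seconds."""
    return {
        # Lower the TTL at least one old-TTL period before the cutover,
        # so every cached copy of the high-TTL record has expired.
        "lower_ttl_this_long_before_cutover": current_ttl,
        # After the cutover, stale answers persist at most this long.
        "worst_case_staleness_after_cutover": lowered_ttl,
    }

plan = cutover_window_seconds(current_ttl=3600, lowered_ttl=60)
# With a 3600 s TTL, lower it at least an hour before the test;
# afterwards, clients see stale answers for at most 60 s.
```

This is also why the 3600-second default TTL is so damaging in unplanned events: the failover may complete in minutes while users keep resolving the dead region for up to an hour.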

4. Failback Procedures

Everyone plans failover. Almost nobody documents failback — the process of returning to the primary region once it is healthy. Failback is often more dangerous than failover because data has diverged. Document replication reversal, data reconciliation steps, and the criteria for declaring the primary region "recovered."

5. Contact Sheet and Vendor Escalation

Cloud provider support plans (AWS Enterprise Support, Azure Unified Support), third-party SaaS vendor contacts, DNS registrar emergency procedures. Print a physical copy. During a major cloud outage, your password manager might also be down.

Testing: The Part Everyone Skips

Flexera's State of the Cloud report consistently ranks managing cloud spend among the top challenges organisations face; DR testing rarely gets the same attention, and most organisations simply do not do it enough. From what Opsio's NOC team observes across our managed customers, organisations that test DR quarterly have a median recovery time during real incidents that is dramatically lower than those testing annually or not at all.

Types of DR Tests

| Test Type | Effort | Disruption | Value |
|---|---|---|---|
| Tabletop exercise | Low | None | Validates roles, communication, decision-making |
| Component test | Medium | Minimal | Tests individual recovery steps (restore a single database) |
| Parallel recovery test | Medium-High | None to production | Spins up full DR environment alongside production |
| Full failover test | High | Production traffic shifts | The only test that proves real-world recovery; schedule quarterly for Tier 1 |

Game Day Recommendations

  • Inject real chaos: Use AWS Fault Injection Service, Azure Chaos Studio, or Gremlin to simulate AZ failures, network partitions, and disk corruption.
  • Time it: Measure actual RTO and RPO against objectives. Track trends over quarters.
  • Include non-technical staff: BC is not just IT. Have the finance team execute their manual payment workaround. Have customer support use the crisis communication templates.
  • Write a post-mortem for the test — not just for real incidents. Every test reveals gaps. Document them, assign owners, and fix them before the next test.
  • Maintain RBI/SEBI evidence: For regulated entities, archive DR test results, participant lists, and remediation actions. Regulators expect this documentation during inspections.
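The "time it" recommendation above implies scorekeeping: record the measured RTO/RPO for every test, compare against objectives, and watch the margin across quarters. A minimal sketch of that evaluation (names are our own):

```python
# Game-day scorekeeping, per the recommendation above: record measured RTO/RPO
# for each test and flag objective misses so trends are visible over quarters.

def evaluate_test(measured_rto_min: float, measured_rpo_min: float,
                  target_rto_min: float, target_rpo_min: float) -> dict:
    """Compare one test's measurements against the tier's objectives."""
    return {
        "rto_met": measured_rto_min <= target_rto_min,
        "rpo_met": measured_rpo_min <= target_rpo_min,
        "rto_margin_min": target_rto_min - measured_rto_min,  # headroom (can be negative)
        "rpo_margin_min": target_rpo_min - measured_rpo_min,
    }

# Tier 1 targets: 15 min RTO, near-zero RPO (here treated as 1 min)
q1 = evaluate_test(measured_rto_min=12, measured_rpo_min=0.5,
                   target_rto_min=15, target_rpo_min=1)
```

A shrinking margin quarter over quarter is an early warning that infrastructure drift is eroding the DR capability — exactly the signal an annual test would miss, and the trend line doubles as RBI/SEBI audit evidence.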


Multi-Cloud DR: Honest Trade-Offs

The idea of failing over from AWS to Azure during a regional outage sounds resilient on a whiteboard. In production, it is extraordinarily complex:

  • Identity and IAM must work across both providers. Federated identity via Entra ID or Okta helps but does not solve service-level authorisation.
  • Data replication between providers requires application-level logic or third-party tools (e.g., Commvault, Cohesity). Native cross-provider replication does not exist for most services.
  • Infrastructure-as-code diverges. Terraform modules for AWS and Azure are structurally different. Maintaining parity is a full-time job.
  • Network architecture (VPN tunnels, peering, DNS) adds latency and operational surface area.

Opsio's position: For most organisations, multi-region DR within a single cloud provider delivers better resilience at lower cost and complexity than multi-cloud DR. With two AWS regions in India (Mumbai and Hyderabad) and two Azure regions (Central India and South India), Indian organisations now have viable domestic multi-region DR options within a single provider. Reserve true multi-cloud DR for scenarios where regulatory requirements mandate it (e.g., certain government workloads requiring MeitY-empanelled providers) or where vendor lock-in risk justifies the operational overhead.

The exception: data-layer DR. Replicating encrypted backups to a second provider's object storage (e.g., production on AWS ap-south-1, backup copies to Azure Blob Storage in Central India) is straightforward, inexpensive, and protects against catastrophic single-provider failure without the complexity of full application-level multi-cloud failover. This approach also satisfies RBI's expectation of exit-strategy readiness.


What Opsio's SOC/NOC Sees in Practice

Running 24/7 operations across India and Europe, patterns emerge:

  • DNS TTL neglect is the most common cause of extended apparent downtime after a successful failover. The systems recover in 10 minutes; users experience 45 minutes of disruption because TTLs were left at 3600 seconds.
  • Expired credentials in DR regions. Service accounts, certificates, and API keys that rotate in production but were never configured to rotate in the standby environment. First failover test after six months? Guaranteed authentication failures.
  • Snapshot-only "DR" for databases. Nightly snapshots with no replication means an RPO of up to 24 hours. For many workloads this is fine — but only if the business has explicitly accepted that data loss window. For BFSI workloads, RBI expects much tighter RPOs for core banking systems.
  • No monitoring in the DR region. CloudWatch alarms, Datadog dashboards, and PagerDuty integrations that exist only in the primary region. After failover, you are flying blind.
  • Data residency blind spots. Organisations running production in ap-south-1 (Mumbai) with DR configured to ap-southeast-1 (Singapore) without realising this violates RBI data localisation requirements. Always verify that your DR target region meets your regulatory obligations.

These are not exotic edge cases. They appear in the majority of environments we onboard. A proper Cloud Security assessment catches them before an incident forces discovery.

Getting Started: A Pragmatic 90-Day Roadmap

Days 1–30: Discovery and Business Impact Analysis

  • Inventory all production workloads and classify into tiers
  • Document current RTO/RPO for each tier (even if the answer is "we don't know")
  • Identify regulatory obligations (DPDPA applicability, RBI/SEBI cloud circular scope, GDPR data flows for EU customers)
  • Confirm data residency requirements and validate that current and planned DR regions comply

Days 31–60: Architecture and Tooling

  • Select DR architecture per tier (backup/restore, pilot light, warm standby, active-active)
  • Implement replication for Tier 1 systems (e.g., AWS DRS from ap-south-1 to ap-south-2, or ASR from Central India to South India)
  • Configure monitoring, alerting, and runbook automation in the DR region
  • Lower DNS TTLs for critical services

Days 61–90: Runbook, Test, Iterate

  • Write step-by-step runbooks for Tier 1 and Tier 2 failover and failback
  • Conduct first tabletop exercise with all stakeholders
  • Execute first parallel recovery test for Tier 1 systems
  • Document gaps, assign remediation owners, schedule quarterly game days
  • Archive test evidence for regulatory audits (RBI, SEBI, DPDPA Board)

This is not a one-time project. BCDR is a continuous practice, like security. The plan degrades every time infrastructure changes and the runbook does not.

Frequently Asked Questions

Does business continuity include disaster recovery?

Yes. Business continuity is the broader discipline covering people, processes, supply chain, and communications. Disaster recovery is the IT-focused subset that deals specifically with restoring technology systems, data, and infrastructure after a disruptive event. A BC plan without a DR plan has no way to recover systems; a DR plan without BC context may restore the wrong systems first.

What are the 4 phases of a business continuity plan in disaster recovery?

The four phases are: (1) Risk Assessment and Business Impact Analysis — identify threats and rank systems by criticality; (2) Strategy Development — define RTOs, RPOs, and select recovery architectures; (3) Plan Development and Implementation — write runbooks, configure replication, assign roles; (4) Testing, Maintenance, and Continuous Improvement — run game days, update documentation, and re-assess after every incident or infrastructure change.

What are the 4 C's of disaster recovery?

The 4 C's are Communication, Coordination, Continuity, and Compliance. Communication ensures stakeholders are informed through predefined channels. Coordination assigns clear roles and escalation paths. Continuity keeps critical business functions running during recovery. Compliance ensures that recovery actions meet regulatory obligations such as DPDPA 2023 requirements, RBI incident reporting mandates, or GDPR breach notification timelines for organisations handling EU citizen data.

Does ISO 22301 cover disaster recovery?

ISO 22301 is the international standard for business continuity management systems (BCMS). It covers disaster recovery as part of its broader scope, requiring organisations to identify critical activities, set recovery objectives, and implement and test recovery procedures. It does not prescribe specific technical DR architectures but mandates that recovery capabilities exist, are documented, and are regularly exercised.

How much does cloud-based disaster recovery cost compared to traditional DR?

Cloud DR typically costs a fraction of traditional hot-site DR because you pay for standby compute only when you need it. A pilot-light architecture on AWS or Azure might cost 5–15% of the production environment's monthly spend — for instance, if your production workload runs at ₹10 lakh per month, pilot-light DR may add only ₹50,000–₹1.5 lakh. Costs rise sharply as you move toward warm or hot standby. The biggest hidden cost is operational: maintaining runbooks, testing failover, and training staff.
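The pilot-light arithmetic in that answer is simple enough to turn into a quick estimator. The 5–15% band is a rough planning range, not a quote:

```python
# The pilot-light cost arithmetic from the answer above, as a quick estimator.
# The 5–15% band is a rough planning range, not a quote.

def pilot_light_cost_range(monthly_prod_spend: float,
                           low_pct: float = 0.05,
                           high_pct: float = 0.15) -> tuple[float, float]:
    """Estimated (low, high) monthly DR add-on for a pilot-light architecture."""
    return monthly_prod_spend * low_pct, monthly_prod_spend * high_pct

low, high = pilot_light_cost_range(1_000_000)  # ₹10 lakh production spend
# → (50000.0, 150000.0): ₹50,000 to ₹1.5 lakh per month
```

The same function with higher percentages gives a first-order feel for warm (roughly 20–50%) and hot standby (approaching 100%+), though the operational costs of runbooks, testing, and training sit outside any infrastructure estimate.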

Written By

Fredrik Karlsson

Group COO & CISO at Opsio

Fredrik is the Group Chief Operating Officer and Chief Information Security Officer at Opsio. He focuses on operational excellence, governance, and information security, working closely with delivery and leadership teams to align technology, risk, and business outcomes in complex IT environments. He leads Opsio's security practice including SOC services, penetration testing, and compliance frameworks.

Editorial standards: This article was written by cloud practitioners and peer-reviewed by our engineering team. Content is reviewed quarterly for technical accuracy and relevance to Indian compliance requirements including DPDPA, CERT-In directives, and RBI guidelines. Opsio maintains editorial independence.