Business Continuity and Disaster Recovery for on-premises workloads in Microsoft Azure Cloud
In this article, we'll briefly explore how you can use Microsoft Azure services to plan and orchestrate your disaster recovery strategy.
What is BCDR
BCDR stands for Business Continuity and Disaster Recovery. It encompasses strategies and plans that businesses put in place to ensure continuous operation and swift recovery in the face of unexpected events or disasters, such as human errors, natural calamities, cyberattacks, or equipment failures. BCDR includes measures like data backup, redundancy, alternate communication channels, and recovery protocols to minimize downtime, protect valuable assets, and maintain business operations even during challenging circumstances.
Why it's crucial for organizations to have a fully functional BCDR strategy and solution in place in this fast-paced digital era
- Minimize Downtime: Keep operations running smoothly during disasters or system failures.
- Protect Data: Safeguard valuable information through regular backups.
- Ensure Continuity: Maintain critical business functions and services without interruption.
- Regulatory Compliance: Adhere to industry regulations and standards for data protection.
- Risk Mitigation: Identify and address potential threats to IT infrastructure.
- Business Reputation: Safeguard against damage to reputation and stability.
Different Types of DR
There are several types of Disaster Recovery (DR) strategies that organizations can implement based on their specific needs and requirements. A few are listed below:
- Backup and Restore: This involves regularly backing up data to a secondary storage location and restoring it in case of data loss or corruption. It's typically used for non-critical data and applications that can tolerate longer recovery time objectives (RTOs) and recovery point objectives (RPOs).
- Cold DR: Cold Disaster Recovery (DR) in the cloud involves replicating primary site data and infrastructure configurations and keeping them dormant, usually offline or powered off, until they're required for recovery. Unlike hot DR, where resources are continuously running and ready for immediate failover, cold DR relies on manual intervention (unless it is fully automated with scripts) to activate resources and restore operations in the event of a disaster. This typically results in longer recovery times, because resources need to be provisioned, data needs to be restored, and systems need to be brought online. Cold DR is often chosen for its cost-effectiveness and suits less critical workloads, where longer downtime is acceptable in exchange for lower operational costs.
- Warm DR: Warm Disaster Recovery (DR) in the cloud is an intermediate approach between cold and hot DR. In a warm DR setup, standby resources are partially active, meaning they're provisioned and configured but not actively processing workloads. These resources are in a semi-dormant state, ready to be quickly activated and brought online when needed. This allows for faster recovery than cold DR, since resources don't need to be provisioned from scratch. However, warm DR may still require manual intervention or automation to become fully operational, so recovery takes slightly longer than with hot DR. Warm DR strikes a balance between cost-effectiveness and recovery speed, making it suitable for workloads that require a quicker recovery but can tolerate a short downtime window.
- Hot DR: Hot Disaster Recovery (DR) in the cloud is the highest level of readiness for disaster scenarios. In a hot DR setup, standby resources are fully active and running in parallel with primary production systems, constantly synchronized and ready to take over instantly in the event of a disaster. This involves real-time or near-real-time replication of data and configurations to the standby environment. When a disaster occurs, failover to the hot standby resources is automatic and seamless, with minimal to no interruption in service. Hot DR offers the fastest recovery times and highest level of availability, but comes at a higher cost due to the continuous operation of redundant resources. It's typically used for mission-critical workloads where even the slightest downtime is unacceptable.
Each type of DR strategy has its advantages and considerations, and organizations should evaluate their requirements, budget, and risk tolerance to determine the most suitable approach for their needs.
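To make the trade-off concrete, the following Python sketch maps a workload's RTO to one of the four approaches described above. The threshold values, the workload names, and the `choose_dr_tier` helper are illustrative assumptions, not Azure guidance; substitute limits that reflect your own requirements, budget, and risk tolerance.

```python
from datetime import timedelta

# Hypothetical thresholds for illustration only; tune them to your own
# business requirements, budget, and risk tolerance.
HOT_MAX_RTO = timedelta(minutes=15)
WARM_MAX_RTO = timedelta(hours=4)
COLD_MAX_RTO = timedelta(hours=24)


def choose_dr_tier(rto: timedelta) -> str:
    """Map a workload's RTO to a DR approach using the thresholds above."""
    if rto <= HOT_MAX_RTO:
        return "hot"
    if rto <= WARM_MAX_RTO:
        return "warm"
    if rto <= COLD_MAX_RTO:
        return "cold"
    return "backup-and-restore"


# Hypothetical workloads and their agreed RTOs.
workloads = {
    "payments-api": timedelta(minutes=5),
    "intranet-portal": timedelta(hours=2),
    "reporting": timedelta(hours=12),
    "archive-file-share": timedelta(days=3),
}

for name, rto in workloads.items():
    print(f"{name}: {choose_dr_tier(rto)}")
```

A real assessment would also weigh the RPO, cost, and compliance constraints; this sketch only shows how a single recovery objective can drive the choice.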
Now let's discuss how to achieve these DR types.
On-premises to Cloud Disaster Recovery for server-based workloads
Planning a BCDR (Business Continuity and Disaster Recovery) strategy from on-premises to Azure involves several technical steps:
- Assessment and Inventory:
  - Identify critical on-premises systems, applications, and data.
  - Assess dependencies and interconnections between different components.
  - Define the compliance and technical requirements for RTO and RPO. The Recovery Time Objective (RTO) is the maximum acceptable time within which a business process or service must be restored after a disruption or disaster. The Recovery Point Objective (RPO) is the maximum amount of data loss the business can tolerate, expressed as the point in time to which data must be recovered in order to resume normal operations.
  - Design the DR architecture based on the assessment and the organization's RTO and RPO requirements.
  - Choose the DR site location with seismic zones in mind, as required by compliance and best-practice recommendations.
- Azure Subscription Setup:
  - Create an Azure subscription if you haven't already.
  - Set up the necessary Azure resources, such as virtual networks, storage accounts, virtual machines, and Recovery Services vaults, in the desired Azure region.
- Connectivity:
  - Establish connectivity between your on-premises environment and Azure, using technologies like Azure ExpressRoute or VPN Gateway.
- Data Replication:
  - Implement mechanisms that continuously replicate data from on-premises to Azure, such as Azure Site Recovery (ASR), native replication mechanisms (for databases, domain controllers, RDS servers, MFA servers, and so on), or Azure Blob Storage replication.
- Failover and Failback Planning:
  - Define failover and failback procedures, including the sequence of steps to follow during failover and failback events.
  - Test failover and failback procedures to ensure they work as expected and meet recovery time objectives (RTOs) and recovery point objectives (RPOs).
- Network Configuration:
  - Configure network settings in Azure to match those of your on-premises environment, including subnets, IP addresses, routing, and security settings.
- Application Dependencies:
  - Identify and address any dependencies or requirements specific to your applications, such as licensing, authentication, or integration with other systems.
- Monitoring and Alerting:
  - Set up monitoring and alerting mechanisms to monitor the health and performance of your BCDR setup in Azure.
  - Configure alerts to notify you of any issues or failures in replication, connectivity, or resource availability.
- Documentation and Runbooks:
  - Document the BCDR setup, including configuration details, procedures, and contact information.
  - Create runbooks with step-by-step instructions for executing failover and failback procedures.
- Testing and Validation:
  - Regularly test the BCDR setup to ensure it meets your recovery objectives and performs as expected.
  - Conduct periodic drills and simulations of disaster scenarios to validate the effectiveness of your BCDR strategy.
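The dependency assessment and the failover sequence mentioned in the steps above can be tied together in a runbook. As a minimal sketch, assuming a hypothetical dependency map between servers, Python's standard `graphlib` module (Python 3.9 or later) can derive a start-up order in which every dependency is brought online before the systems that need it. The server names and the `failover_order` helper are made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the servers in its set,
# so those servers must be online in the DR site first.
dependencies = {
    "domain-controller": set(),
    "sql-database": {"domain-controller"},
    "app-server": {"sql-database", "domain-controller"},
    "web-frontend": {"app-server"},
}


def failover_order(deps: dict[str, set[str]]) -> list[str]:
    """Return a start-up order in which every dependency comes first."""
    return list(TopologicalSorter(deps).static_order())


# Print the order as numbered runbook steps.
for step, server in enumerate(failover_order(dependencies), start=1):
    print(f"Step {step}: bring {server} online")
```

If the map contains a circular dependency, `static_order()` raises `graphlib.CycleError`, which is a useful signal that the inventory needs another look before the runbook is written.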
Major components involved in designing a BCDR solution from on-premises to Azure for server-based workloads
- Recovery Services Vaults: Used for backup and Azure Site Recovery.
- Storage: Required for storing replicated data and other resources.
- Compute: Necessary for running workloads during failover in warm and hot DR scenarios.
- Networking Components: Including connectivity solutions like VPN Gateway, Azure ExpressRoute, and SD-WAN.
- Traffic Manager: Helps route traffic to the appropriate resources during failover.
- Security Components: Such as Web Application Firewall (WAF), Firewall, DDoS protection, Key Vaults, Defender, and API Management for ensuring security during disaster recovery operations.
What ASR is and how it works
In 2018, Azure became the first large public cloud provider to launch a first-class cloud native disaster recovery solution with Azure to Azure Disaster Recovery. Azure Site Recovery is a cloud-based disaster recovery service provided by Microsoft Azure. It enables businesses to replicate and recover virtual machines, physical servers, and workloads from on-premises datacenters to Azure or between Azure regions, ensuring business continuity in the event of a disaster.
ASR Architectural components
Key points for choosing ASR as your DR solution
- There is no need to maintain DR site infrastructure while it is not in use, which saves significant cost and maintenance effort. When workloads replicate to Azure, you avoid the cost of deploying, monitoring, patching, and maintaining an on-premises disaster recovery infrastructure, because you do not have to build or maintain a costly secondary datacenter. Such datacenters bring a range of costs, from lengthy contracts to expensive network links.
- There are no long-term contracts for ASR, and the cost is based only on consumption. Unlike with an expensive secondary datacenter, you pay only for what you use.
- Accessibility is a requirement of any successful DR tool. With ASR, you can replicate, recover, and conduct failover testing directly from the Azure portal. This gives you a straightforward way to test applications and services during a DR drill without impacting production workloads or end users.
- ASR helps you comply with standards such as ISO 27001 by enabling Site Recovery between separate Azure regions. You can meet compliance requirements by ensuring that all metadata needed to enable and orchestrate replication and failover remains within that region's geographic boundary.
- DR drills for compliance audit reports are easy. With ASR as your DR solution, you can run DR drills without interfering with the production environment or the DR site. In ASR, a DR drill is called a test failover, and it runs in a sandbox environment to validate the DR replication and workload functionality.
Considerations to keep in mind while designing ASR as a DR solution
- It is recommended to keep the management layer (databases, domain controllers, MFA, RDS servers, and so on) up and running as hot or warm DR in the DR site.
- The best practice is to set up the network for the DR site in advance and keep it in active or passive mode, whichever suits the organization's practices.
- Set up Application Gateway with WAF in the DR site (if Layer 7 protection for HTTPS traffic is needed and recommended), and keep its IP addresses defined in the Traffic Manager profiles. Try to use CNAME records for the DNS entries so that DNS resolution is handled automatically in the backend when the DR site is spun up.
- Always refer to the support matrix for the workloads and configurations that are recommended for use with ASR as DR.
- Try to use a retention policy longer than 24 hours for critical workloads (ASR now supports retention of up to 15 days).
- Avoid fully automated failover with a cold DR strategy, so that false alarms do not trigger an unnecessary failover.
- Always monitor the health of the replication and take immediate action to resolve any errors.
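As a rough illustration of the monitoring point above, the sketch below flags protected machines whose most recent recovery point is older than the agreed RPO. The machine names, timestamps, and the `find_rpo_breaches` helper are hypothetical; in practice the timestamps would come from your monitoring data rather than a hard-coded dictionary.

```python
from datetime import datetime, timedelta, timezone


def find_rpo_breaches(last_recovery_points, rpo, now):
    """Return names of machines whose latest recovery point is older than the RPO."""
    return sorted(
        name
        for name, recovered_at in last_recovery_points.items()
        if now - recovered_at > rpo
    )


# Fixed reference time so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical latest recovery point per protected machine.
last_recovery_points = {
    "sql-database": now - timedelta(minutes=5),
    "app-server": now - timedelta(minutes=45),
    "web-frontend": now - timedelta(hours=3),
}

breaches = find_rpo_breaches(last_recovery_points, rpo=timedelta(minutes=30), now=now)
print(breaches)  # ['app-server', 'web-frontend']
```

A check like this could feed an alert so that replication problems are picked up before a real failover depends on them.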
SLA for Site Recovery
- For each Protected Instance configured for On-Premises-to-On-Premises Failover, Microsoft guarantees at least 99.9% availability of the Site Recovery service.
- For each Protected Instance configured for On-Premises-to-Azure planned and unplanned Failover, Microsoft guarantees a two-hour Recovery Time Objective.
Please refer to the ASR support matrix for the supported workloads and configurations.
Workload summary while using ASR for the replication
Site Recovery can replicate any app running on a supported machine. Microsoft has partnered with product teams to do additional testing for the apps specified in the following table.
Key inputs to consider for a smooth BCDR strategy
- Conduct a POC for the BCDR architecture and document DR drill outcomes.
- Use native replication mechanisms (for example, log shipping, Always On, or Data Guard) for database replication.
- Avoid IP-based hardening for applications and end users.
- Use different IP ranges for the DR site to prevent conflicts during failover and failback. Many organizations aim to keep the same IP addresses in the primary site and the DR site, which leads to complexities and limitations in failover and failback because of IP range conflicts.
- Employ a mix of DR approaches (cold, warm, hot) based on requirements.
- Schedule DR drills or actual DR tests every three to six months to ensure DR functionality.
- Ensure a strong network design for the DR site, because DR success depends on it.
- Plan to fail back to a virtual environment only, because machines replicated with ASR cannot fail back to physical servers.
- ASR can complement your existing replication or DR tools if you have already invested in them and would like to follow a mixed approach.
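The recommendation to use different IP ranges for the DR site can be checked before any failover. Below is a minimal sketch using Python's standard `ipaddress` module; the address ranges and the `find_overlaps` helper are hypothetical examples, not values from a real environment.

```python
import ipaddress


def find_overlaps(primary_ranges, dr_ranges):
    """Return (primary, dr) pairs of CIDR ranges that overlap."""
    overlaps = []
    for primary in primary_ranges:
        primary_net = ipaddress.ip_network(primary)
        for dr in dr_ranges:
            if primary_net.overlaps(ipaddress.ip_network(dr)):
                overlaps.append((primary, dr))
    return overlaps


# Hypothetical address plans for the primary site and the DR site.
primary_site = ["10.0.0.0/16", "192.168.10.0/24"]
dr_site = ["10.1.0.0/16", "192.168.10.128/25"]

for primary, dr in find_overlaps(primary_site, dr_site):
    print(f"Conflict: primary {primary} overlaps DR {dr}")
```

Here only `192.168.10.0/24` and `192.168.10.128/25` conflict, so that DR subnet would need a new range before the two sites are connected.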
ASR FAQ - General questions about the Azure Site Recovery service | Microsoft Learn
ASR Pricing - Pricing - Site Recovery | Microsoft Azure
Failover and failback process detailed- About failover and failback in Azure Site Recovery - Modernized - Azure Site Recovery | Microsoft Learn
Microsoft BCDR CAF- Business continuity and disaster recovery - Cloud Adoption Framework | Microsoft Learn