The top 5 reasons why backup and recovery in the cloud goes wrong and how to avoid them

If you have experience running workloads both in the cloud and on-premises, you will know that failures can occur for a variety of reasons. You will also know that in some cases the best thing to do is to restore your workload from an existing backup to meet your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) requirements.

Many organizations have implemented processes to regularly create backups of their workloads. In most cases, however, the recovery plan associated with these backups is only seen in action when problems occur.

In many cases it’s not the lack of an existing backup that complicates things, but rather an insufficient or untested recovery plan.

At Microsoft we have observed many common problems relating to backup and recovery. We want to share the top 5 issues that we identified from our customer support data, so that you can take mitigating steps when you plan your workload’s recovery procedures.

Do I need a recovery plan at all?

To answer this question with confidence, you need to determine how quickly a workload should recover if it faces an issue. We can define metrics around this process, usually referred to as the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). The RTO is the maximum acceptable time to restore service, and the RPO is the maximum acceptable window of data loss.

There is definitely a tradeoff to be had here. For example, you may be running a workload that can easily be redeployed if anything goes wrong and where the re-deployment time is sufficiently short to take away the need for a recovery plan.

On the other end of the spectrum, you have mission critical applications that need to be deployed in a very highly available manner. Restoring a backup might simply take too long and you may instead choose a more advanced setup. You might, for example, choose an architecture that involves containers that can recover their state from a highly available data store.
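
The tradeoff described above can be sketched as a small decision helper. This is only a rough illustration, not an official sizing method: the function name, its parameters, and the three outcome labels are assumptions made for this example, and the time estimates should come from your own measured deployment and restore drills.

```python
from datetime import timedelta


def choose_recovery_approach(
    rto: timedelta,
    rpo: timedelta,
    redeploy_time: timedelta,
    restore_time: timedelta,
    backup_interval: timedelta,
    workload_has_state: bool,
) -> str:
    """Return a rough recommendation for how a workload should recover.

    All inputs are caller-supplied estimates (hypothetical names).
    """
    # A stateless workload that redeploys within the RTO needs no restore step.
    if not workload_has_state and redeploy_time <= rto:
        return "redeploy"

    # A restore works if it fits within the RTO and the worst-case data loss
    # (the time between two backups) fits within the RPO.
    if restore_time <= rto and backup_interval <= rpo:
        return "restore-from-backup"

    # Otherwise a backup alone cannot meet the objectives.
    return "high-availability-architecture"


# Example: a stateful workload with nightly backups and a 4-hour RTO.
print(choose_recovery_approach(
    rto=timedelta(hours=4),
    rpo=timedelta(hours=24),
    redeploy_time=timedelta(minutes=30),
    restore_time=timedelta(hours=2),
    backup_interval=timedelta(hours=24),
    workload_has_state=True,
))  # prints "restore-from-backup"
```

If the measured restore time in this example grew beyond four hours, the same call would point towards a highly available architecture instead.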

The Azure Well-Architected Framework can help you architect both mission critical solutions and solutions with highly automated deployment routines that are easily redeployed in the event of failure.

If your workload has a backup strategy, then it needs a recovery plan. Continue reading to find out how you can optimize it based on the top 5 common support issues that we have seen at Microsoft.

Azure Site Recovery is not configured in a way that allows backups to complete

Azure Site Recovery is a native disaster recovery as a service (DRaaS) offering that allows virtual machines in Azure and other clouds, as well as physical machines, to transmit regular backups to Azure. The service allows customers to quickly spin up an Azure-based failover region if their workload’s primary region goes down.

A common configuration issue in this setup relates to access control and encryption. If your virtual machines (VMs) are encrypted, make sure that the user has the right permissions to create the key vault in the target region and to copy keys from the source region key vault to the target region key vault. Check how to replicate VMs with Azure Disk Encryption (ADE) enabled for the specific permissions required. To enforce Site Recovery configuration on VMs at scale automatically, you can use the built-in capabilities of Azure Policy.

Another common problem is that backups fail because of configuration errors on the client machines. Machines that send backup data via Site Recovery should have continuous monitoring of the backup process in place, so that you can quickly detect when backups start failing.

In addition, you can use the built-in Azure Site Recovery monitoring features to get a quick view of your registered machines’ connection health and other key performance indicators of the service. Review the Azure Site Recovery service limits and make sure you are not exceeding them.
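
The kind of continuous check described in this section can be sketched as follows. This is a minimal illustration that does not call any Azure API: it assumes you already collect, by whatever means you use, the time of the last successful backup for each machine. The function name and the input mapping are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stale_machines(last_success, max_age, now=None):
    """Return the names of machines whose last successful backup is too old.

    last_success maps a machine name to a timezone-aware datetime, or to None
    if no successful backup has ever been recorded (hypothetical input shape).
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for machine in sorted(last_success):
        finished_at = last_success[machine]
        # A missing timestamp is treated the same as an overdue backup.
        if finished_at is None or now - finished_at > max_age:
            stale.append(machine)
    return stale
```

Running a check like this on a schedule, with max_age set slightly above the expected backup interval, surfaces machines that have silently stopped backing up before you need to restore them.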

Machines are unable to send their backup to Azure

If you are using Azure Site Recovery for machines both in and outside Azure, we recommend that you review its network connectivity requirements in detail with your wider team. For Site Recovery replication to work, a VM that is behind a firewall or uses network security group (NSG) rules needs outbound connectivity to specific URLs or IP ranges. The article Networking in Azure VM disaster recovery discusses these network connectivity requirements in detail.

For a smooth failover, make sure that the DNS server is reachable from the disaster recovery (DR) region once the VM is recovered. Some customers use a proxy for their network connectivity. In such cases, the Azure Site Recovery Mobility service agent must also be configured to use the proxy for recovery to succeed.

When it comes to backing up file shares, you can use Azure File Sync to replicate your on-premises files. To use shares in a failover scenario, you must make sure that your firewall allows TCP port 445, which the SMB protocol requires.
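
A quick way to verify the connectivity requirements in this section from a given machine is a plain TCP connection test. The sketch below uses only the Python standard library; the function name is an assumption for this example, and the host names and ports you pass in must come from the official Site Recovery networking documentation and your own environment.

```python
import socket


def check_tcp_endpoints(endpoints, timeout=3.0):
    """Try a TCP connection to each (host, port) pair and report the result.

    Returns a dict that maps each (host, port) pair to True if the connection
    succeeded and False otherwise.
    """
    results = {}
    for host, port in endpoints:
        try:
            # create_connection resolves the name and completes the handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = True
        except OSError:
            # Covers DNS failures, refused connections, and timeouts.
            results[(host, port)] = False
    return results
```

For example, passing a pair such as ("fileshare.example.internal", 445), where the host name is a placeholder, shows whether the firewall lets SMB traffic through. A successful TCP connection only proves that the port is reachable. It does not validate proxy settings or authentication.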

Database applications fail to restore due to failed backups

When it comes to backing up and recovering databases, you need to make sure that your recovery strategy matches the needs of the application that your workload is running.

We regularly see organizations configure standard virtual machine backups for their SQL Server VMs, for example, when the application using the database needs something more specific. Applications often have their own backup procedures that timestamp the database correctly and ensure that all tables can be restored to the same timestamp.

If you do not have control over the code of the application that you are running, contact the application vendor about any special backup procedures that need to be in place for your database, so that you can restore your application after an unexpected outage or disaster scenario.

Wherever possible, we recommend that you use managed database solutions and their built-in backup and recovery options, as far as these are available and compatible with your application.

Azure Backup agent is not set up correctly

Azure Backup uses the Microsoft Azure Recovery Services (MARS) agent to back up data from on-premises machines and Azure VMs to a Recovery Services vault in Azure.

Azure Backup backs up Azure VMs by installing an extension to the Azure VM agent running on the machine. If you're updating the agent, make sure that no backup operations are running, and reinstall the latest version of the agent according to the best practices.

Reinstalling the VM agent gets you the latest version and re-establishes the connection if the agent has become unresponsive.

To resolve recovery issues due to backup agent failures, check out the recovery troubleshooting guide. Note that the agent does not support cross-region restoration.

Backup vaults are used in the wrong location, or require additional permission assignments

A Recovery Services vault is a management entity that stores recovery points created over time and provides an interface to perform backups and restores.

For successful backup and restore operations, the Backup vault’s managed identity requires role assignments. Proper role-based access control (RBAC) assignments resolve most issues with recovering data from a Backup vault.

Note that the vault must be in the same region as the data source. If you need to change the storage replication type (locally redundant or geo-redundant) of a Recovery Services vault, do so before you configure backups.

Resource health checks for the Recovery Services vault also help you monitor successful recovery. We recommend that you follow them as required for your scenario.
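
The region and replication constraints mentioned in this section lend themselves to a simple pre-flight check before you configure a backup. The sketch below is illustrative only and does not query Azure: the Vault class, its fields, and the function name are hypothetical, and the region comparison assumes that names differ only in case and spacing.

```python
from dataclasses import dataclass


@dataclass
class Vault:
    # Hypothetical, minimal description of a vault for this example.
    name: str
    region: str
    has_protected_items: bool


def _normalize(region: str) -> str:
    # Treat "West Europe" and "westeurope" as the same region name.
    return region.replace(" ", "").lower()


def preflight_issues(vault, source_region, change_replication_type):
    """Return a list of human-readable problems (empty if none were found)."""
    issues = []
    if _normalize(vault.region) != _normalize(source_region):
        issues.append(
            f"Vault {vault.name} is in {vault.region}, "
            f"but the data source is in {source_region}."
        )
    if change_replication_type and vault.has_protected_items:
        issues.append(
            f"Vault {vault.name} already has backups configured, so the "
            "storage replication type should have been set earlier."
        )
    return issues
```

An empty list means that neither of the two constraints described above was violated for the values you supplied. It does not confirm that the required role assignments are in place.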

Conclusion

Building a suitable recovery plan for your workload is not an easy task. We hope that the pointers in this article help you avoid common pitfalls that many customers will hit on their journey to maximizing their operational efficiency.

In a cloud environment like Azure, you cannot always avoid all failure. However, insights like the ones shared in this article can help you anticipate failure.

Many customers are working towards establishing a culture of chaos engineering, in which DR drills become more frequent and are used as a tool to test the resiliency of a workload.

For more detailed guidance on how to run high efficiency workloads, review our Azure Well-Architected guidance.

Want to know how you are doing against the best practices? Start one of our reviews today or reach out to your Microsoft representative to arrange targeted help from a Microsoft professional.

About the authors

Harshitha Putta

Harshitha Putta is a Senior Cloud Solutions Architect in the Customer Success Unit at Microsoft. As the cloud business continues to experience hyper-growth, she helps customers build, grow and enable their cloud teams.

Daniel Stocker

Daniel Stocker is a Senior Program Manager in Microsoft's App Innovation Tech Strategy team. His background is in DevOps, operational excellence, and organizational change. He leads several programs including the Operational Excellence pillar for Well-Architected.  

Published on: November 16, 2022
