
Demystifying Azure VM Maintenance: A practical guide to minimizing disruptions


Azure regularly updates its platform to enhance the host infrastructure for virtual machines, focusing on reliability, performance, and security. These updates range from the host operating system, the hypervisor, and the networking components and agents deployed on the host, to hardware decommissioning:

 

davidsantiago_0-1709739191070.png

 

There are two types of VM maintenance:

  • Planned maintenance events are periodic updates made by Microsoft to the underlying Azure platform. Most of these updates are totally transparent to customers. However, some maintenance may result in brief freezes or performance degradation, and on very rare occasions, a reboot might be required.
  • Unplanned maintenance events occur when the hardware or physical infrastructure underlying your VM has faulted in some way. When such a failure is predicted (thanks to predictive ML), the Azure platform automatically live-migrates your VM from the unhealthy physical host to a new, healthy physical host.

In cases where Live Migration can't be used, the VM experiences unexpected downtime (reboot).

 

In this article, we take a deep dive into the techniques used to apply planned maintenance, and into what customers can and cannot control.

 

Azure – Palette of choices to apply updates depending on update & constraints

 

To go further into detail, Azure uses different techniques for updates depending on the type of update and the constraints to ensure that updates are minimally impactful:

davidsantiago_0-1709740234100.png

Source: Inside Azure Innovations with Mark Russinovich | BKR214H

 

  • Hot Patching – This provides the ability to make targeted changes to running code without downtime. All new invocations of a function on the host are redirected to an updated version of that function.
  • Live migration – This involves moving a running customer VM from one host to another.
  • VM-PHU – Virtual Machine Preserving Host Update
    • This suspends the VMs in memory, soft-reboots the host OS, and then resumes the VMs.
    • This is the most impactful of the rebootless techniques but, fortunately, also the least common.

 

Azure can use one of the above techniques to minimize impact during unplanned hardware maintenance, unexpected downtime and planned maintenance.

 

Note: Reboot is not mentioned above – Guest VMs are only rebooted when the previous techniques cannot be used.

 

Here is a summary of the procedure used to manage updates and maintenance for the host:

 

davidsantiago_2-1709740330717.png

 


 

An important point to mention here is that all the maintenance operations we just discussed can happen at any time in Azure, and customers do not have control over when these kinds of updates (which generate freezes) happen.

 

Maintenance Control – Shared Host vs Dedicated Host

 

By default, when you provision a VM in Azure, it lands on a random host in the targeted region and availability zone, and this host is shared by multiple VMs from multiple customers. This is what we call Shared hosts.

 

In addition to our default hosting model using shared hosts, Microsoft also gives you the capability to have a dedicated host on which only your VMs can be hosted. This offer is named Azure Dedicated Host and offers various advantages among which are:

  • Workload isolation to dedicated physical servers
  • Workload placement control and visibility
  • More maintenance control than what’s available on Shared hosts.

While both solutions have their pros and cons, let’s focus on this last item: maintenance controls.

As explained above, several techniques can be used to update various infrastructure components and we can distinguish two major categories of impacts:

  • Rebootless (aka. Freeze) updates
  • Rebootful updates

On a shared host, customers get a 35-day window during which they can choose when their VM is rebooted, which gives them control over Rebootful updates during that window. Once the window expires, Microsoft schedules the reboot itself and notifies the customer a few minutes before it occurs through Scheduled Events (we describe this mechanism in the next section).

 

As mentioned before, customers have no control over Rebootless updates on Shared hosts and cannot schedule them. This means that, on VMs running on Shared hosts, freezes (generally lasting a couple of seconds) can happen at any time.

 

If such uncontrolled freezes are unacceptable for a customer’s workloads, Azure Dedicated Hosts can be of great help. On these hosts, Maintenance Control allows customers to schedule all kinds of updates (rebootless and rebootful) and to apply them at a preferred time within a 35-day window.
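The control rules described above can be summarized as a small lookup. This is only an illustrative sketch of the article's own rules; the names `HostModel`, `UpdateKind`, and `customer_can_schedule` are hypothetical and do not correspond to any Azure API.

```python
from enum import Enum


class HostModel(Enum):
    SHARED = "shared"
    DEDICATED = "dedicated"


class UpdateKind(Enum):
    REBOOTLESS = "rebootless"  # freeze-style updates
    REBOOTFUL = "rebootful"


def customer_can_schedule(host: HostModel, update: UpdateKind) -> bool:
    """Return True when the customer can choose when the update is applied."""
    if host is HostModel.DEDICATED:
        # Maintenance Control on Dedicated Hosts covers both kinds of update.
        return True
    # On shared hosts, only rebootful updates can be planned by the customer.
    return update is UpdateKind.REBOOTFUL


# Print the full matrix of host model versus update kind.
for host in HostModel:
    for update in UpdateKind:
        answer = customer_can_schedule(host, update)
        print(f"{host.value:<10}{update.value:<11}{answer}")
```

In both cases where scheduling is possible, the choice applies within the 35-day window mentioned above.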

 

Now that you have better visibility into the options available in terms of maintenance control, let’s see how to manage updates on Shared hosts.

 

Shared Host – How to minimize VM impact on maintenance?

 

If a reboot is needed, customers are notified and given a time frame to initiate the maintenance themselves, typically within 35 days unless it is urgent. See Handling planned maintenance notification.

 

If no reboot is required, the VM is either paused or live-migrated to an already updated host.

Some applications may not tolerate a pause, even for a few seconds. For these applications, an alternative is possible:

 

1)     Catch Scheduled Events up to 15 minutes before the pause

 

Scheduled Events is an Azure Instance Metadata Service (IMDS) API that gives your application time to prepare for VM maintenance. It provides up to 15 minutes of advance notice before maintenance events (Reboot, Redeploy, Freeze, Preempt, Terminate) so that your application can prepare for them and limit disruption:

 

davidsantiago_3-1709740354915.png

 

Services running on the VM can monitor this API to perform a graceful shutdown (and connection draining) before the event is carried out.

Note: Scheduled Events is enabled when a service makes its first request to query events, so expect some delay in the first response (~1 min). It is disabled again if no request reaches the endpoint for 24 hours.
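As a minimal sketch of how a service inside the VM might watch for these events, the Python below queries the IMDS Scheduled Events endpoint and filters the events that affect a given VM. The `api-version` value, the VM name `my-vm`, the sample document, and the `drain_connections` hook are assumptions made for illustration; adapt them to your environment.

```python
import json
import urllib.request

# Scheduled Events endpoint of IMDS, reachable only from inside an Azure VM.
# The api-version value is an assumption; check which versions your VM supports.
IMDS_URL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

# Event types listed in this article.
DISRUPTIVE_TYPES = {"Reboot", "Redeploy", "Freeze", "Preempt", "Terminate"}


def fetch_scheduled_events(timeout: float = 120.0) -> dict:
    """Query IMDS once; the first call can be slow while the service is enabled."""
    request = urllib.request.Request(IMDS_URL, headers={"Metadata": "true"})
    # IMDS is a link-local address, so bypass any configured HTTP proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def events_needing_drain(document: dict, vm_name: str) -> list:
    """Return events that are scheduled (not yet started) for the given VM."""
    return [
        event
        for event in document.get("Events", [])
        if event.get("EventType") in DISRUPTIVE_TYPES
        and vm_name in event.get("Resources", [])
        and event.get("EventStatus") == "Scheduled"
    ]


def drain_connections(event: dict) -> None:
    """Hypothetical hook: stop accepting work and drain connections here."""
    print(f"Draining before {event['EventType']} (not before {event.get('NotBefore')})")


# Offline example with a hand-written document shaped like an IMDS response.
sample = {
    "DocumentIncarnation": 1,
    "Events": [
        {
            "EventId": "hypothetical-event-id",
            "EventType": "Freeze",
            "ResourceType": "VirtualMachine",
            "Resources": ["my-vm"],
            "EventStatus": "Scheduled",
            "NotBefore": "Mon, 19 Sep 2016 18:29:47 GMT",
        }
    ],
}
for event in events_needing_drain(sample, "my-vm"):
    drain_connections(event)
```

On a real VM, the service would call `fetch_scheduled_events()` at a regular interval and pass the result to `events_needing_drain`. Polling regularly also keeps the endpoint from being disabled after 24 hours without requests.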

 

2)     Opt for Azure Dedicated Host

 

As previously explained, Azure Dedicated Host can be a solution for keeping control over when maintenance is applied.

 

How to diagnose disruptions to VM availability?

 

Project Flash enables Azure customers to detect & diagnose ongoing and completed availability disruptions, including VM degradation.

Azure VM availability can be monitored using:

  • Azure Resource Graph – For investigation at scale, centralized resource repository and history lookup.
  • Event Grid system topics – To trigger time-sensitive and critical mitigations (redeploy or restart VM actions).
  • Azure Monitor – To track trends, aggregate platform metrics (CPU, disk etc), and set up precise threshold-based alerts.
  • Azure Resource Health – To perform instantaneous and convenient Portal UI health checks per resource.

Using Azure Resource Graph, there are two types of events populated in the HealthResources table:

  • resourcehealth/availabilitystatuses

Denotes the availability state of the Virtual Machine.


It takes one of the following values: Available | Unavailable | Unknown | Degraded:

 

 

{
  "targetResourceType": "Microsoft.Compute/virtualMachines",
  "previousAvailabilityState": "Unavailable",
  "targetResourceId": <ARM Id>,
  "occurredTime": <Precise Time stamp of transition>,
  "availabilityState": "Available"
}

 

  • resourcehealth/resourceannotations

Provides context to interpret why a change in VM availability has occurred, to decisively take actions if needed.

  • Reason: Brief statement on why VM availability has changed
  • Context: Platform initiated | Customer initiated | VM initiated | Unknown | Not Applicable
  • Category: Planned | Unplanned | Unknown | Not Applicable
  • Summary: Detailed statement on the activity and cause for VM availability to change
  • ImpactType: Downtime Reboot | Downtime freeze | Degraded | Informational

 

{
  "targetResourceType": "Microsoft.Compute/virtualMachines",
  "targetResourceId": <ARM Id>,
  "annotationName": "VirtualMachineHostRebootedForRepair",
  "occurredTime": "2022-09-25T20:21:37.5280000Z",
  "category": "Unplanned",
  "summary": "We're sorry, your virtual machine isn't available because an unexpected failure on the host server. Azure has begun the auto-recovery process and is currently rebooting the host server. No additional action is required from you at this time. The virtual machine will be back online after the reboot completes.",
  "context": "Platform Initiated",
  "reason": "Unexpected host failure",
  "impactType": "Downtime Reboot"
}
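Once these records have been retrieved, they can be post-processed with a few lines of code. The Python below is a sketch under stated assumptions: it takes records shaped like the two samples above, pairs availability transitions to estimate outage durations, and keeps the annotations that report unplanned downtime. The function names and the sample records (including the `vm-a` placeholder standing in for a full ARM resource ID) are hypothetical.

```python
from datetime import datetime, timezone


def parse_time(value: str) -> datetime:
    """Parse timestamps such as 2022-09-25T20:21:37.5280000Z (7 fractional digits)."""
    head, _, fraction = value.rstrip("Z").partition(".")
    fraction = (fraction + "000000")[:6]  # strptime accepts at most microseconds
    parsed = datetime.strptime(f"{head}.{fraction}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def downtime_windows(statuses: list) -> list:
    """Pair each Unavailable transition with the next Available one, per VM."""
    open_since = {}
    windows = []
    for record in sorted(statuses, key=lambda r: parse_time(r["occurredTime"])):
        vm = record["targetResourceId"]
        when = parse_time(record["occurredTime"])
        if record["availabilityState"] == "Unavailable":
            open_since.setdefault(vm, when)
        elif record["availabilityState"] == "Available" and vm in open_since:
            windows.append((vm, when - open_since.pop(vm)))
        # Unknown and Degraded states neither open nor close a window here.
    return windows


def unplanned_downtime(annotations: list) -> list:
    """Keep annotations that report unplanned downtime (reboot or freeze)."""
    return [
        a for a in annotations
        if a.get("category") == "Unplanned"
        and a.get("impactType", "").lower().startswith("downtime")
    ]


# Hypothetical records; "vm-a" stands in for a full ARM resource ID.
statuses = [
    {"targetResourceId": "vm-a", "occurredTime": "2022-09-25T20:21:37.5280000Z",
     "previousAvailabilityState": "Available", "availabilityState": "Unavailable"},
    {"targetResourceId": "vm-a", "occurredTime": "2022-09-25T20:26:37.5280000Z",
     "previousAvailabilityState": "Unavailable", "availabilityState": "Available"},
]
annotations = [
    {"targetResourceId": "vm-a", "annotationName": "VirtualMachineHostRebootedForRepair",
     "category": "Unplanned", "reason": "Unexpected host failure",
     "impactType": "Downtime Reboot"},
]
for vm, duration in downtime_windows(statuses):
    print(vm, "was unavailable for", duration)
for annotation in unplanned_downtime(annotations):
    print(annotation["annotationName"], "-", annotation["reason"])
```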

 

Examples of useful KQL queries against the HealthResources table are available.
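As one hedged illustration of what such a query could look like, the Python sketch below embeds a KQL query as a string and builds the JSON body for the Azure Resource Graph REST query endpoint. The `api-version`, the exact property paths under `properties`, and the placeholder subscription ID are assumptions to verify against your own data; authentication and the HTTP call itself are left out.

```python
import json

# Resource Graph REST endpoint; the api-version is an assumption to verify.
ARG_URL = (
    "https://management.azure.com/providers/Microsoft.ResourceGraph/resources"
    "?api-version=2021-03-01"
)

# Illustrative KQL: resources whose reported availability state is not Available.
UNHEALTHY_QUERY = """
healthresources
| where type =~ 'microsoft.resourcehealth/availabilitystatuses'
| where tostring(properties.availabilityState) != 'Available'
| project targetResourceId = tostring(properties.targetResourceId),
          availabilityState = tostring(properties.availabilityState)
""".strip()


def build_query_body(subscription_ids: list, query: str = UNHEALTHY_QUERY) -> str:
    """Serialize the JSON body to POST to the Resource Graph query endpoint."""
    return json.dumps({"subscriptions": list(subscription_ids), "query": query})


# Hypothetical subscription ID, shown only to illustrate the payload shape.
body = build_query_body(["00000000-0000-0000-0000-000000000000"])
print(body)
```

The resulting body would be sent as a POST to `ARG_URL` with a bearer token for Azure Resource Manager.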

 

Conclusion

 

In this article, we detailed the two hosting models (shared hosts and dedicated hosts) and their respective options regarding planned maintenance, unplanned maintenance, and their maintenance controls.

 

If your workload can tolerate infrequent freezes lasting a couple of seconds (which is generally the case), or if 15 minutes is enough for you to prepare for these freezes (for example, by draining current connections and refusing new ones), the default hosting model with shared hosts is ideal, as it provides scalability and ease of management.

 

On the other hand, if your workloads are highly sensitive to freezes, even of a few seconds, you should consider using Azure Dedicated Hosts, which give you control over maintenance.

 

Microsoft is still working to improve the maintenance control experience on shared hosts, so there’s no doubt that it will become better and better with the shared hosts model.

 

 

 
