Zero Day Patching Without Rebooting - A Cloud Solution


Explore how Azure developed Virtual Machine Preserving Host Updates (VM-PHU) to update host OS without rebooting millions of servers, ensuring reliability, security, and minimal downtime. Learn about hardware acceleration, blackout time optimization, fast reboot strategies, and more for efficient zero-day patching in public clouds.


Uploaded on Sep 13, 2024



Presentation Transcript


  1. Virtual Machine Preserving Host Updates for Zero Day Patching in Public Cloud
     Mark Russinovich, Naga Govindaraju, Melur Raghuraman, David Hepkin, Jamie Schwartz, Arun Kishan
     Microsoft Corporation

  2. Motivation
     - Host OS must be updated for reliability, security, and new features
     - Zero-day fixes need to be deployed within hours
     - Full host OS reboots lead to major application impact
     - Goal: zero-day updates for the host OS of millions of servers without rebooting VMs

  3. Azure OS updates
     - Azure developed Virtual Machine Preserving Host Updates (VM-PHU) in 2012
       - Suspend VMs in memory
       - Soft reboot the OS
       - Resume the VMs
     [Diagram labels: VM 1, Host OS A, Loader, Host OS B]
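
The three VM-PHU steps on this slide (suspend, soft reboot, resume) can be pictured as a small state machine. The following is a minimal Python sketch under stated assumptions, not Azure's implementation: `VM`, `Host`, `vm_phu`, and the `preserved` dictionary are hypothetical names, and the "soft reboot" is reduced to swapping a version string.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    state: str = "running"


@dataclass
class Host:
    os_version: str
    vms: list = field(default_factory=list)
    # Hypothetical stand-in for the memory region that survives a soft reboot.
    preserved: dict = field(default_factory=dict)


def vm_phu(host: Host, new_os_version: str) -> Host:
    # Step 1: suspend every VM in memory and record it in the preserved region.
    for vm in host.vms:
        vm.state = "suspended"
        host.preserved[vm.name] = vm
    # Step 2: soft reboot. The old host OS state is discarded, but the
    # preserved region is left untouched.
    host.vms = []
    host.os_version = new_os_version
    # Step 3: the new host OS resumes the VMs from the preserved region.
    for vm in host.preserved.values():
        vm.state = "running"
        host.vms.append(vm)
    host.preserved.clear()
    return host
```

The point of the sketch is only the ordering: guest state is parked before the host OS is replaced, and the new OS picks it up afterwards, so the VM itself is never rebooted.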

  4. Hardware Acceleration
     - Single root input/output virtualization (SR-IOV) network devices:
       - Fall back to software shadow vNIC
       - VM-PHU
       - Enable hardware acceleration
     - Discrete device assignment (DDA) devices:
       - No software fallback
       - Need to handle DMA and interrupt processing
     - Solution: keep the IOMMU active during the soft reboot
       - Maintain the DMA and interrupt remapping in the IOMMU across reboot
       - Inject interrupts for each device in the VM to handle interrupts missed while the hypervisor is offline

  5. Rebootless Update Demo
     - Production system with a GPU DDA VM
     - Update storvsp.sys in 1 second

  6. Blackout time
     - Low blackout time
     - VM-PHU blackout time
       - Reboot time
       - Time to close VM devices
       - Memory preservation time

  7. Blackout: Fast Reboot
     - Reboot time impacts the blackout time for all VMs
     - IO intensive, with hundreds of MBs of reads for Windows hosts
     - Issue: can take tens of seconds on hard disks
     - Solution: cache the reads and writes in memory and preserve them across reboot
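
The fast-reboot idea on this slide, keeping boot-time reads and writes in memory that survives the soft reboot, can be illustrated with a toy cache. This is a minimal sketch under assumptions: `BootCache`, its `disk` dictionary (standing in for a slow hard disk), and the `boot` helper are all hypothetical, and how the real cache is preserved across reboot is not described in the slides.

```python
class BootCache:
    """Toy in-memory cache, assumed to survive a soft reboot."""

    def __init__(self, disk):
        self.disk = disk        # dict of path -> bytes, stands in for a slow disk
        self.pages = {}         # cached contents, preserved across reboots
        self.disk_reads = 0     # how many reads actually hit the disk

    def read(self, path):
        if path not in self.pages:
            self.disk_reads += 1
            self.pages[path] = self.disk[path]
        return self.pages[path]

    def write(self, path, data):
        # Writes are held in memory here; flushing to disk is out of scope.
        self.pages[path] = data


def boot(cache, boot_files):
    # A "boot" in this sketch is just reading every file the OS needs.
    return [cache.read(path) for path in boot_files]
```

In this model the first boot pays for every disk read, and any later soft reboot that reuses the same `BootCache` object is served from memory.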

  8. Blackout: Fast Close
     - Soft reboot can proceed only after all VMs close their devices
     - Issue: straggler devices increase blackout time for all VMs
     - Solution: fast close devices
       - Serialize the in-flight I/Os during VM save and pend their status
       - After serialization, devices can be closed
       - Before restoring the VM, replay the serialized I/Os
     - Implemented in the Windows Server 2016 storage resiliency feature
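
The fast-close sequence above (serialize in-flight I/Os, close the device, replay before restore) can be sketched as follows. This is an illustrative Python model only: `Device`, `fast_close`, and `restore` are hypothetical names, I/O requests are plain dictionaries, and JSON is an assumed serialization format, not what Windows uses.

```python
import json
from collections import deque


class Device:
    """Toy virtual device that tracks I/Os submitted but not yet completed."""

    def __init__(self):
        self.is_open = True
        self.in_flight = deque()

    def submit(self, io):
        if not self.is_open:
            raise RuntimeError("device is closed")
        self.in_flight.append(io)


def fast_close(device):
    # Serialize the in-flight I/Os instead of waiting for them to complete;
    # their completion status stays pending from the guest's point of view.
    saved = json.dumps(list(device.in_flight))
    device.in_flight.clear()
    # With nothing left outstanding, the device can close immediately.
    device.is_open = False
    return saved


def restore(saved):
    # Before the VM resumes, replay the serialized I/Os on a fresh device.
    device = Device()
    for io in json.loads(saved):
        device.submit(io)
    return device
```

The design choice this models is that the close path never waits on a slow device, so one straggler no longer holds up the soft reboot for every VM on the host.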

  9. Blackout: Memory Preservation
     - Windows memory manager uses a linked list for maintaining free regions
     - Marking the pages as in use can be O(n^2), where n is the number of memory runs for the VMs
     - Straggler VMs increase blackout for all VMs on the host
     - Host and VM memory can be very fragmented
     - O(n^2) time to persist can take hundreds or thousands of seconds
     - Memory manager data structure changes are risky and introduce trade-offs
     - Need an O(n) algorithm
     - Solution: use a sorted list for the VM memory runs
     [Chart: blackout in seconds (log scale, 1 to 10000) for VM1 through VM4, Simple vs. Optimized]
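
To see why sorting the VM memory runs helps, compare two ways of marking runs as in use against an address-ordered free list. This is a minimal Python sketch, not the Windows memory manager: the free list is modeled as a sorted Python list of half-open `(start, end)` ranges, every run is assumed to lie entirely inside one free region, and `mark_naive`, `mark_sorted`, and the `steps` counters are hypothetical. Note that the sort itself costs O(n log n) unless the runs are already kept in order.

```python
def mark_naive(free, runs):
    """Rescan the free list from its head for every run: O(n^2) in the worst case."""
    free = list(free)
    steps = 0
    for start, end in runs:
        for i, (fs, fe) in enumerate(free):
            steps += 1
            if fs <= start and end <= fe:
                # Split the free region around the run, dropping empty pieces.
                pieces = [(fs, start), (end, fe)]
                free[i:i + 1] = [p for p in pieces if p[0] < p[1]]
                break
    return free, steps


def mark_sorted(free, runs):
    """Sort the runs once, then walk the free list and the runs together."""
    pending = sorted(runs)
    result = []
    steps = 0
    j = 0
    for fs, fe in free:
        steps += 1
        cursor = fs
        # Consume every run that falls inside this free region, in address order.
        while j < len(pending) and pending[j][1] <= fe:
            steps += 1
            start, end = pending[j]
            if cursor < start:
                result.append((cursor, start))
            cursor = end
            j += 1
        if cursor < fe:
            result.append((cursor, fe))
    return result, steps
```

Both functions return the same remaining free list. On a fragmented layout with runs arriving in descending address order, the naive version rescans a growing prefix of the free list for each run, while the sorted version touches each free region and each run once.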

  10. Demo: Rebootful Updates

  11. Results
      - Rebootless: 2-3 times per year
      - Rebootful: <1 per year
      - Failure rate: ~0.1%
      [Chart: blackout in seconds (0 to 25) vs. VM percentile (0 to 100), Rebootful vs. Rebootless]

  12. Limitations
      - Designed primarily for updates
      - Cannot be leveraged for other data center scenarios
        - Decommissioning of hardware
        - Dynamic load balancing for better resource utilization
        - Hardware failures
      - A few applications may not tolerate a pause
        - Scheduled events 15 minutes prior to VM-PHU
        - Opt-out to dedicated hosts

  13. Conclusions
      - A fast in-place update mechanism for Azure host OS upgrade without rebooting VMs
      - Designed to handle hardware acceleration
      - Several optimizations to improve reliability and blackout time
      - Applied in production environment on millions of servers and VMs
      - Achieved low blackout time within seconds
      - Can be deployed across the entire fleet within hours

  14. Questions/Feedback
      MarkRuss@microsoft.com
      NagaG@microsoft.com
