Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks
TLB misses in virtual machines can cause high overheads under a hardware-virtualized MMU. This paper proposes segmentation techniques that bypass paging to optimize memory virtualization, achieving near-native or better-than-native performance. It analyzes the overheads of virtualizing memory, showing that TLB miss handling costs are the largest contributor to the performance gap between native and virtual servers. The presentation covers unvirtualized x86 translation, the two levels of translation under virtualization, and hardware support for virtualizing memory, with applications to graph analytics, key-value stores, HPC apps, databases, and other big-memory and compute workloads. The cost of virtualization is quantified with execution time overheads across different memory configurations.
Presentation Transcript
Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks
Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift (MICRO-47)
Executive Summary
- Problem: TLB misses in virtual machines; the hardware-virtualized MMU has high overheads.
- Prior work: Direct Segments (unvirtualized case).
- Solution: segmentation to bypass paging. Extend Direct Segments for virtualization, with three modes offering different tradeoffs and two optimizations that make segmentation flexible.
- Results: near-native or better-than-native performance.
Overheads of Virtualizing Memory
"We will show that the increase in translation lookaside buffer (TLB) miss handling costs due to the hardware-assisted memory management unit (MMU) is the largest contributor to the performance gap between native and virtual servers." (Buell et al., VMware Technical Journal, 2013)
Unvirtualized x86 Translation
A virtual address (VA) is translated to a physical address (PA) by walking the four-level page table rooted at CR3: up to 4 memory accesses per TLB miss.
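As a concrete illustration, here is a simplified model of that walk in C. This is a sketch only: read_pte is a hypothetical helper standing in for one memory access, and permission bits and large pages are ignored.

```c
#include <stdint.h>

#define LEVELS 4      /* PML4, PDPT, PD, PT */
#define ENTRY_BITS 9  /* 512 entries per level */
#define PAGE_SHIFT 12 /* 4 KB pages */

/* Hypothetical helper: read one page-table entry (one memory access). */
extern uint64_t read_pte(uint64_t table_pa, uint64_t index);

/* Walk the 4-level x86-64 page table rooted at cr3.
 * Each level costs one memory access, so a TLB miss costs up to 4. */
uint64_t translate(uint64_t cr3, uint64_t va)
{
    uint64_t table = cr3;
    for (int level = LEVELS - 1; level >= 0; level--) {
        unsigned shift = PAGE_SHIFT + level * ENTRY_BITS;
        uint64_t index = (va >> shift) & ((1ull << ENTRY_BITS) - 1);
        table = read_pte(table, index);   /* one memory access */
    }
    /* 'table' now holds the physical frame; append the page offset. */
    return table | (va & ((1ull << PAGE_SHIFT) - 1));
}
```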
Two Levels of Translation
Under virtualization there are two translations: (1) the guest page table maps a guest virtual address (gVA) to a guest physical address (gPA), and (2) the nested page table maps the gPA to a host physical address (hPA). This is the base virtualized configuration.
Support for Virtualizing Memory
A hardware 2D page walk starts at the guest CR3, but every one of the four guest page-table references is a gPA that must first be translated through the nested page table (rooted at nCR3, 4 accesses) before the guest entry itself can be read (1 access); the final gPA needs one more nested walk. Up to 5 + 5 + 5 + 5 + 4 = 24 memory accesses.
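The arithmetic above can be made concrete with a small counting sketch (illustrative only; the constants assume 4-level guest and nested tables):

```c
/* Count memory accesses for a 2D nested page walk (sketch).
 * Each of the 4 guest page-table references is a gPA that must itself be
 * translated through the 4-level nested page table (4 accesses) before the
 * guest entry can be read (1 access); the final gPA needs one more nested
 * walk: 4 * (4 + 1) + 4 = 24. */
int nested_walk_accesses(void)
{
    const int guest_levels = 4, nested_levels = 4;
    int accesses = 0;
    for (int i = 0; i < guest_levels; i++) {
        accesses += nested_levels;  /* translate gPA of guest table */
        accesses += 1;              /* read the guest entry itself */
    }
    accesses += nested_levels;      /* translate the final gPA to hPA */
    return accesses;                /* 24 */
}
```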
Applications
Big-memory (in-memory) applications: graph analytics, key-value stores, HPC apps, databases; plus compute workloads.
Cost of Virtualization
(Bar chart: execution time overhead for graph500, memcached, NPB: CG, and GUPS under native page-size configurations 4K, 2M, and 1G.) Takeaway: overheads of virtual memory can be high even on native machines.
Cost of Virtualization
(Same chart, adding virtualized configurations such as 4K+4K, 2M+2M, and 1G+1G; one off-scale bar is labeled 113%.) Takeaway: overheads increase drastically with virtualization.
Cost of Virtualization
(Same chart; off-scale bars labeled 113% and 1556%.) Takeaway: virtualization increases overheads by ~3.6x (geometric mean).
Outline: Motivation; Review: Direct Segments; Virtualized Direct Segments; Optimizations; Evaluation; Summary
Unvirtualized Direct Segments (Basu et al. [ISCA 2013])
A direct segment maps a contiguous virtual range [BASE, LIMIT) to physical memory with a simple OFFSET (PA = VA + OFFSET); addresses outside the segment fall back to conventional paging. Why direct segments? They match big-memory workload needs: no TLB lookups, hence no TLB misses.
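A minimal sketch of the segment check in C (the register names BASE, LIMIT, and OFFSET come from the slide; the struct and function names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Per-core direct-segment registers. */
struct direct_segment { uint64_t base, limit, offset; };

/* Translate va with the direct segment if it falls in [BASE, LIMIT);
 * otherwise the caller falls back to the conventional page walk.
 * No TLB is involved on the segment path, so it can never miss. */
bool ds_translate(const struct direct_segment *ds, uint64_t va, uint64_t *pa)
{
    if (va >= ds->base && va < ds->limit) {
        *pa = va + ds->offset;   /* single addition, zero memory accesses */
        return true;
    }
    return false;                /* caller invokes the page-table walker */
}
```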
Direct Segments
Native paging is a one-dimensional (1D) walk from VA to PA through the page table rooted at CR3; a direct segment reduces translation to 0D, with no page-walk memory accesses.
Outline: Motivation; Review: Direct Segments; Virtualized Direct Segments; Evaluation (Methodology, Results); Summary
Using Segments in the VMM
The VM allocates guest physical memory at boot, so the contiguous gPA range can be mapped with a direct segment. The VMM has a smaller code base to change, and this converts the 2D page walk into a 1D walk (gVA through the guest page table to gPA, then gPA plus offset to hPA). But can we do better?
Three Virtualized Modes: 1. VMM Direct (2D to 1D)
The guest walks its page table (gVA to gPA); a host direct segment maps gPA to hPA. Features: maps almost the whole gPA space; 4 memory accesses; near-native performance; helps any application.
Three Virtualized Modes: 2. Dual Direct (2D to 0D)
Direct segments at both levels (gVA to gPA and gPA to hPA). Features: 0 memory accesses; better-than-native performance; suits big-memory applications.
Three Virtualized Modes: 3. Guest Direct (2D to 1D)
A guest direct segment maps gVA to gPA; the nested page table maps gPA to hPA. Features: 4 memory accesses; suits big-memory applications; flexible enough to preserve VMM services. (A sketch of the per-mode access counts follows below.)
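The per-mode access counts quoted above (24, 4, 0, and 4) follow mechanically from which level uses paging and which uses a segment. A small illustrative sketch, assuming 4-level guest and nested page tables:

```c
#include <stdio.h>

/* Each of the two translation levels uses paging or a direct segment. */
enum mech { PAGING, SEGMENT };

struct mode { const char *name; enum mech guest, host; };

/* Guest paging cost depends on the host mechanism, because every guest
 * page-table reference is a gPA that must itself be translated. */
int accesses(struct mode m)
{
    int host_cost = (m.host == PAGING) ? 4 : 0;
    if (m.guest == SEGMENT)
        return host_cost;                    /* only the final gPA */
    return 4 * (host_cost + 1) + host_cost; /* 4 guest refs + final gPA */
}

int main(void)
{
    struct mode modes[] = {
        { "Base Virtualized", PAGING,  PAGING  },  /* 2D: 24 */
        { "VMM Direct",       PAGING,  SEGMENT },  /* 1D:  4 */
        { "Dual Direct",      SEGMENT, SEGMENT },  /* 0D:  0 */
        { "Guest Direct",     SEGMENT, PAGING  },  /* 1D:  4 */
    };
    for (int i = 0; i < 4; i++)
        printf("%-16s up to %2d memory accesses\n",
               modes[i].name, accesses(modes[i]));
    return 0;
}
```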
Compatibility: VMM Direct
Unmodified application and guest OS; modified VMM; modified hardware.
Compatibility: Dual Direct
Big-memory application; modified guest OS; modified VMM; modified hardware. (VMM Direct, for comparison: unmodified app and guest OS; modified VMM and hardware.)
Compatibility: all three modes
VMM Direct: unmodified app and guest OS; modified VMM; modified hardware. Dual Direct: big-memory app; modified guest OS; modified VMM; modified hardware. Guest Direct: big-memory app; modified guest OS; minimally modified VMM; modified hardware.
Tradeoffs: Summary

Property                     | Base Virtualized | VMM Direct | Dual Direct | Guest Direct
Dimension / memory accesses  | 2D / 24          | 1D / 4     | 0D / 0      | 1D / 4
Guest OS modifications       | none             | none       | required    | required
VMM modifications            | none             | required   | required    | minimal
Applications                 | any              | any        | big-memory  | big-memory
VMM services allowed         | yes              | no         | no          | yes
Outline: Motivation; Review: Direct Segments; Virtualized Direct Segments; Optimizations (Self-Ballooning, Escape Filter); Evaluation; Summary
Self-Ballooning
Handles fragmentation in guest physical memory: used pages inside the would-be segment are (1) removed by the balloon driver, with the VMM informed, and (2) the VMM hot-adds new memory to replace them.
Escape Filter
Allows creation of segments in the presence of hard faults (bad physical pages). A Bloom filter stores the few faulty pages. Operation: on a direct segment hit, if the escape filter also hits, translate with paging; if the escape filter misses, translate with the segment. (A sketch of the filter follows below.)
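The slide names a Bloom filter but gives no parameters, so the following is a minimal illustrative sketch: the filter size and hash functions are invented, not the paper's. Note that a false positive merely falls back to paging, which is slower but always safe.

```c
#include <stdint.h>
#include <stdbool.h>

#define FILTER_BITS 4096   /* sizing is illustrative */

/* Minimal Bloom filter over faulty guest-physical page numbers. */
static uint64_t filter[FILTER_BITS / 64];

/* Two hypothetical hash functions; real hardware would use cheap ones. */
static unsigned h1(uint64_t pfn) { return (pfn * 0x9E3779B97F4A7C15ull) % FILTER_BITS; }
static unsigned h2(uint64_t pfn) { return (pfn ^ (pfn >> 17)) % FILTER_BITS; }

static void set_bit(unsigned b) { filter[b / 64] |= 1ull << (b % 64); }
static bool get_bit(unsigned b) { return filter[b / 64] & (1ull << (b % 64)); }

/* Record a page with a hard fault so that it escapes the segment. */
void escape_insert(uint64_t pfn) { set_bit(h1(pfn)); set_bit(h2(pfn)); }

/* On a direct-segment hit: filter hit => translate with paging (possible
 * false positive, safe but slower); filter miss => the page is definitely
 * not faulty, so translate with the segment. */
bool escape_hit(uint64_t pfn) { return get_bit(h1(pfn)) && get_bit(h2(pfn)); }
```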
Outline: Motivation; Review: Direct Segments; Virtualized Direct Segments; Optimizations; Evaluation; Summary
Methodology
- Measure the cost of page walks on real hardware.
- Find TLB misses that lie in the direct segment: BadgerTrap, a tool for online analysis of TLB misses, released at http://research.cs.wisc.edu/multifacet/BadgerTrap
- Linear model to predict performance.
- Prototype: Linux host/guest + QEMU-KVM hypervisor on a 12-core Intel Sandy Bridge machine with 96 GB of memory.
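The slide does not give the model's exact form, so the following is only a plausible sketch of such a linear model, assuming the modeled configuration eliminates the page-walk cycles of TLB misses that fall inside the direct segment and leaves everything else unchanged. The function name and parameters are illustrative, not the paper's.

```c
/* Linear performance model (sketch).
 * total_cycles:        measured execution cycles on real hardware
 * walk_cycles:         measured cycles spent in page walks
 * frac_in_segment:     fraction of TLB misses BadgerTrap attributes
 *                      to addresses inside the direct segment */
double modeled_overhead(double total_cycles, double walk_cycles,
                        double frac_in_segment)
{
    double ideal = total_cycles - walk_cycles;             /* no-miss baseline */
    double remaining = walk_cycles * (1.0 - frac_in_segment);
    return remaining / ideal;   /* predicted fractional execution overhead */
}
```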
Results
(Bar chart: execution time overhead, native vs. virtual vs. modeled, for graph500, memcached, NPB: CG, and GUPS under 4K and 4K+4K paging and the VMM Direct, Dual Direct, and Guest Direct modes; one off-scale bar labeled 113%.) Takeaway: VMM Direct achieves near-native performance.
Results
(Same chart; a Dual Direct bar drops to 0.01%.) Takeaway: Dual Direct eliminates most of the TLB misses, achieving better-than-native performance.
Results
(Same chart.) Takeaway: Guest Direct achieves near-native performance while preserving flexibility at the VMM.
Results
(Same chart, with data labels 0.01%, 0.00%, 0.01%, and 0.17%, and an off-scale bar labeled 1556%.) Takeaway: the same trend holds across all workloads (more workloads in the paper).
Summary
- Problem: TLB misses in virtual machines; the hardware-virtualized MMU has high overheads.
- Solution: segmentation to bypass paging. Extend Direct Segments for virtualization, with three modes offering different tradeoffs and two optimizations that make segmentation flexible.
- Results: near-native or better-than-native performance.
Questions? For more details, come to the poster or see the paper.
Hardware
(Flowchart of the lookup pipeline in Dual Direct mode: virtual-address bits [V47..V12] index the L1 D-TLB; on a miss, the L2 D-TLB; on a miss there, the page table walk. The result supplies physical-frame bits [P47..P12], while page-offset bits [V11..V0] pass through as [P11..P0].)
Page Walk Hardware
(Diagram of the 2D walk hardware: the guest page table (gPT) is walked from cr3, each gPA it produces is translated through the nested page table (nPT, rooted at ncr3) to an hPA, with annotations marking which translations the VMM Direct and Guest Direct modes replace with a direct segment.)
Translation with Direct Segments (Basu et al. [ISCA 2013])
(Flowchart, shown in two builds: virtual-address bits [V47..V12] are compared against BASE and LIMIT in parallel with the D-TLB lookup, while offset bits [V11..V0] pass through. If BASE <= VA and VA < LIMIT, the direct segment supplies the translation by adding OFFSET to produce frame bits [P40..P12], and the D-TLB hit/miss is ignored. Otherwise paging is used: the D-TLB result stands, and a miss invokes the page-table walker.)
Modes
(Diagram: the two translation levels, (1) the guest page table mapping gVA to gPA and (2) the nested page table mapping gPA to hPA, and which level each mode (Base Virtualized, Guest Direct, VMM Direct, Dual Direct) replaces with a direct segment.)
Base Virtualized: Translation
gVA is translated to gPA through the guest page table (gPT, rooted at the guest cr3), then gPA to hPA through the nested page table (nPT).
Dual Direct: Translation
gVA is translated to gPA by the guest direct segment, then gPA to hPA by the host direct segment; both the gPT and the nPT are bypassed.
VMM Direct: Translation
gVA is translated to gPA through the guest page table (gPT), then gPA to hPA by the host direct segment, bypassing the nPT.
Guest Direct: Translation
gVA is translated to gPA by the guest direct segment, then gPA to hPA through the nested page table (nPT).
Tradeoffs: Memory Overcommit

Property           | Base Virtualized | VMM Direct | Dual Direct | Guest Direct
Page Sharing       | yes              | limited    | limited     | yes
Ballooning         | yes              | limited    | limited     | yes
Guest OS Swapping  | yes              | yes        | limited     | limited
VMM Swapping       | yes              | limited    | limited     | yes
Results
(Bar chart: execution time overhead, native vs. virtual vs. modeled, for graph500, memcached, NPB: CG, and GUPS under 4K, 4K+4K, 4K+2M, and 4K+1G paging and the VMM Direct, Dual Direct, and Guest Direct modes; off-scale bars labeled 113%, 908%, 880%, and 1556%.)
Results
(Same chart.) Takeaway: VMM Direct and Guest Direct achieve near-native performance.
Results
(Same chart, with data labels 641.78%, 28.60%, 12.43%, and 9.11%.) Takeaway: Dual Direct eliminates most of the TLB misses, achieving better-than-native performance.
Results
(Bar chart: execution cycle overhead for cactusADM, canneal, GemsFDTD, mcf, omnetpp, and streamcluster under 4K, 4K+4K, 4K+2M, 4K+1G, and 4K+VD configurations; off-scale bars carry labels including 103%, 280%, 149%, 160%, 82%, and 88%.) Takeaway: VMM Direct achieves near-native performance for standard workloads.