Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks

TLB misses in virtual machines incur high overheads with a hardware-virtualized MMU. This presentation of the MICRO-47 paper proposes segmentation techniques that bypass paging to optimize memory virtualization, achieving near-native or better-than-native performance. It analyzes the overheads of virtualizing memory, showing that TLB miss handling cost is the largest contributor to the performance gap between native and virtual servers, reviews unvirtualized x86 translation and the two levels of translation used to virtualize memory, and targets graph analytics, key-value stores, HPC apps, databases, and other big-memory and compute workloads. The cost of virtualization is quantified as execution time overhead across different page-size configurations.



Presentation Transcript


  1. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks. Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift. MICRO-47.

  2. Executive Summary. Problem: TLB misses in virtual machines; the hardware-virtualized MMU has high overheads. Prior work: Direct Segments addressed the unvirtualized case. Solution: use segmentation to bypass paging by extending Direct Segments to virtualization, with three modes offering different tradeoffs and two optimizations that make segmentation flexible. Results: near-native or better-than-native performance.

  3. Overheads of Virtualizing Memory. "We will show that the increase in translation lookaside buffer (TLB) miss handling costs due to the hardware-assisted memory management unit (MMU) is the largest contributor to the performance gap between native and virtual servers." Buell et al., VMware Technical Journal, 2013.

  4. Unvirtualized x86 Translation. [Diagram: CR3 points to the page table, which translates a virtual address (VA) to a physical address (PA).] Up to 4 memory accesses per translation.
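To make the 4-access worst case concrete, here is a minimal sketch of a four-level x86-64 walk; `read_phys` is a hypothetical helper, PTE present/permission checks are omitted, and only 4 KB pages are handled.

```c
#include <stdint.h>

/* Hypothetical helper: read one 8-byte page-table entry from physical memory. */
extern uint64_t read_phys(uint64_t paddr);

#define PAGE_SHIFT  12
#define IDX_MASK    0x1FFULL                 /* 9 index bits per level */
#define FRAME_MASK  0x000FFFFFFFFFF000ULL    /* frame bits of an entry */

/* Unvirtualized x86-64 translation: four radix levels, so a TLB miss costs
 * up to 4 memory accesses. */
uint64_t translate(uint64_t cr3, uint64_t va)
{
    uint64_t table = cr3 & FRAME_MASK;
    for (int level = 3; level >= 0; level--) {
        uint64_t idx = (va >> (PAGE_SHIFT + 9 * level)) & IDX_MASK;
        uint64_t pte = read_phys(table + idx * 8);   /* one memory access */
        table = pte & FRAME_MASK;
    }
    return table | (va & ((1ULL << PAGE_SHIFT) - 1));
}
```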

  5. Two Levels of Translation. [Diagram: (1) the guest page table, rooted at the guest cr3, maps a guest virtual address (gVA) to a guest physical address (gPA); (2) the nested page table maps the gPA to a host physical address (hPA). This is the Base Virtualized configuration.]

  6. Support for Virtualizing Memory. [Diagram: the two-dimensional nested page walk; each of the four guest page-table accesses yields a gPA that must itself be walked through the nested page table rooted at ncr3.] Up to 5 + 5 + 5 + 5 + 4 = 24 memory accesses.
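The 24 comes from the walk being two-dimensional: each of the four guest page-table reads is at a guest-physical address that must first be translated through the four-level nested page table (4 + 1 = 5 accesses per guest level), and the final leaf gPA needs one more nested walk (4), giving 4 * 5 + 4 = 24. A sketch reusing the hypothetical helpers and macros of the previous snippet; `nested_translate` likewise stands in for the hardware's nested walk.

```c
/* Hypothetical helper: walk the nested page table rooted at ncr3 to map a
 * guest-physical address to a host-physical address (up to 4 accesses). */
extern uint64_t nested_translate(uint64_t ncr3, uint64_t gpa);

/* Two-dimensional (nested) walk: up to 5 + 5 + 5 + 5 + 4 = 24 accesses. */
uint64_t translate_2d(uint64_t gcr3, uint64_t ncr3, uint64_t gva)
{
    uint64_t table_gpa = gcr3 & FRAME_MASK;
    for (int level = 3; level >= 0; level--) {
        uint64_t idx = (gva >> (PAGE_SHIFT + 9 * level)) & IDX_MASK;
        /* The guest PTE lives at a guest-physical address, so it must be
         * translated by the nested walk before it can be read:
         * up to 4 + 1 = 5 memory accesses per guest level. */
        uint64_t pte_hpa = nested_translate(ncr3, table_gpa + idx * 8);
        uint64_t gpte    = read_phys(pte_hpa);
        table_gpa = gpte & FRAME_MASK;
    }
    /* One final nested walk for the leaf guest-physical address: up to 4. */
    return nested_translate(ncr3, table_gpa | (gva & ((1ULL << PAGE_SHIFT) - 1)));
}
```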

  7. Applications: graph analytics, key-value store, HPC apps, database; big-memory (in-memory) applications and compute workloads.

  8. Cost of Virtualization. Overheads of virtual memory can be high even on native machines. [Bar chart: execution time overhead of graph500, memcached, NPB: CG, and GUPS for native page sizes (4K, 2M, 1G) and virtualized guest+host combinations (4K+4K, 2M+2M, 1G+1G).]

  9. Cost of Virtualization. Overheads increase drastically with virtualization. [Same bar chart with native and virtual bars; one virtual bar is off the scale at 113%.]

  10. Cost of Virtualization. The increase in overheads is ~3.6x (geometric mean). [Same bar chart; off-scale virtual bars reach 113% and 1556%.]

  11. Outline: Motivation; Review: Direct Segments; Virtualized Direct Segments; Optimizations; Evaluation; Summary.

  12. Unvirtualized Direct Segments. [Diagram: (1) a direct segment defined by BASE, LIMIT, and OFFSET translates VAs falling in [BASE, LIMIT) by adding OFFSET; (2) all other VAs use conventional paging.] Why a direct segment? It matches big-memory workload needs: no TLB lookups, hence no TLB misses. Basu et al. [ISCA 2013].
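The translation itself reduces to a range check and an add. A minimal sketch assuming three hypothetical per-process registers (base, limit, offset), following the description above:

```c
#include <stdint.h>
#include <stdbool.h>

/* Direct-segment check: VAs in [base, limit) are translated by addition,
 * touching neither the TLB nor the page table; everything else falls back
 * to conventional paging. */
struct direct_segment { uint64_t base, limit, offset; };

static bool ds_translate(const struct direct_segment *ds,
                         uint64_t va, uint64_t *pa)
{
    if (va >= ds->base && va < ds->limit) {
        *pa = va + ds->offset;   /* no TLB lookup, hence no TLB miss */
        return true;
    }
    return false;                /* caller uses the paging path instead */
}
```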

  13. Direct Segments. [Diagram: Base Native translation walks the page table from cr3 (1D); with Direct Segments the VA maps straight to the PA, reducing 1D to 0D.]

  14. Outline: Motivation; Review: Direct Segments; Virtualized Direct Segments; Evaluation (Methodology, Results); Summary.

  15. Using Segments in the VMM. The VM's guest physical memory is allocated at boot, so contiguous gPA can be mapped with a direct segment, and the VMM is a smaller code base to change. This converts the 2D page walk into a 1D walk. [Diagram: gVA walks the guest page table from cr3 to a gPA, which the host direct segment maps to an hPA.] But can we do better?

  16. Three Virtualized Modes (1): VMM Direct (2D to 1D). Features: (1) maps almost the whole gPA; (2) 4 memory accesses; (3) near-native performance; (4) helps any application. [Diagram: gVA walks the guest page table to a gPA; a host direct segment maps gPA to hPA.]

  17. Three Virtualized Modes (2): Dual Direct (2D to 0D). Features: (1) 0 memory accesses; (2) better-than-native performance; (3) suits big-memory applications. [Diagram: a guest direct segment maps gVA to gPA and a host direct segment maps gPA to hPA; VMM Direct shown alongside.]

  18. Three Virtualized Modes (3): Guest Direct (2D to 1D). Features: (1) 4 memory accesses; (2) suits big-memory applications; (3) flexible enough to provide VMM services. [Diagram: a guest direct segment maps gVA to gPA; the nested page table maps gPA to hPA; Dual Direct and VMM Direct shown alongside.]
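To summarize how the three modes differ, here is a sketch of how each composes the two translation steps, reusing the hypothetical `ds_translate`, `read_phys`, and `nested_translate` helpers from the earlier snippets; the guest-walk helper below is likewise illustrative, not the paper's hardware.

```c
enum vmode { BASE_VIRTUALIZED, VMM_DIRECT, DUAL_DIRECT, GUEST_DIRECT };

/* Hypothetical helper: the guest-dimension page walk (gVA -> gPA). In VMM
 * Direct each guest PTE's gPA is resolved by the host segment, so the walk
 * costs up to 4 accesses; in Base Virtualized it costs up to 20. */
extern uint64_t guest_walk(enum vmode mode, uint64_t gcr3, uint64_t ncr3,
                           const struct direct_segment *host_ds, uint64_t gva);

uint64_t translate_mode(enum vmode mode, uint64_t gcr3, uint64_t ncr3,
                        const struct direct_segment *guest_ds,
                        const struct direct_segment *host_ds, uint64_t gva)
{
    uint64_t gpa, hpa;

    /* Step 1: gVA -> gPA. Dual Direct and Guest Direct use the guest segment. */
    if (!((mode == DUAL_DIRECT || mode == GUEST_DIRECT) &&
          ds_translate(guest_ds, gva, &gpa)))
        gpa = guest_walk(mode, gcr3, ncr3, host_ds, gva);

    /* Step 2: gPA -> hPA. Dual Direct and VMM Direct use the host segment. */
    if (!((mode == DUAL_DIRECT || mode == VMM_DIRECT) &&
          ds_translate(host_ds, gpa, &hpa)))
        hpa = nested_translate(ncr3, gpa);          /* up to 4 accesses */

    return hpa;
}
```

This matches the slides' counts: Dual Direct hits both segments (0 accesses), VMM Direct and Guest Direct each keep one dimension paged (4 accesses), and Base Virtualized keeps both (up to 24).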

  19. Compatibility (1): VMM Direct. [Diagram: unmodified applications and guest OS run on a modified VMM and modified hardware.]

  20. Compatibility (2): Dual Direct. [Diagram: two software stacks. Dual Direct: a big-memory application on a modified guest OS, modified VMM, and modified hardware. VMM Direct: any application on an unmodified guest OS, with a modified VMM and modified hardware.]

  21. Compatibility (3): Guest Direct. [Diagram: three software stacks. Dual Direct: big-memory application, modified guest OS, modified VMM. Guest Direct: big-memory application, modified guest OS, minimally modified VMM. VMM Direct: any application, unmodified guest OS, modified VMM. All three run on modified hardware.]

  22. Tradeoffs: Summary.
      Property                      Base        VMM Direct   Dual Direct   Guest Direct
      Dimension / memory accesses   2D / 24     1D / 4       0D / 0        1D / 4
      Guest OS modifications        none        none         required      required
      VMM modifications             none        required     required      minimal
      Applications                  any         any          big-memory    big-memory
      VMM services allowed          yes         no           no            yes

  23. Outline: Motivation; Review: Direct Segments; Virtualized Direct Segments; Optimizations (Self-Ballooning, Escape Filter); Evaluation; Summary.

  24. Self-Ballooning: handles fragmentation in guest physical memory. [Diagram: scattered in-use frames inside the prospective gPA segment range are (1) removed by the balloon driver, then (2) the VMM hot-adds new memory and is informed.]
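A rough sketch of the control flow the slide describes; every function name here is hypothetical, standing in for the guest balloon driver and the VMM's hot-add and segment-registration interfaces.

```c
#include <stdint.h>

/* Hypothetical interfaces to the balloon driver and the VMM. */
extern int  gfn_in_use(uint64_t gfn);
extern void balloon_out(uint64_t gfn);                  /* return frame to the VMM */
extern void vmm_hot_add(uint64_t nframes);              /* add memory elsewhere    */
extern void vmm_register_segment(uint64_t start_gfn, uint64_t end_gfn);

/* Self-ballooning: clear the scattered in-use frames out of the would-be
 * segment range so a contiguous gPA region can be backed by a direct segment. */
void self_balloon(uint64_t seg_start_gfn, uint64_t seg_end_gfn)
{
    uint64_t removed = 0;

    for (uint64_t gfn = seg_start_gfn; gfn < seg_end_gfn; gfn++)
        if (gfn_in_use(gfn)) {
            balloon_out(gfn);          /* 1. removed by the balloon driver */
            removed++;
        }

    vmm_hot_add(removed);              /* 2. VMM hot-adds replacement memory */
    vmm_register_segment(seg_start_gfn, seg_end_gfn);   /* VMM informed */
}
```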

  25. Escape Filter: enables creation of segments in the presence of hard faults. A Bloom filter stores the few faulty pages. Operation: direct segment hit + escape filter hit: translate with paging; direct segment hit + escape filter miss: translate with the segment.
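A minimal Bloom-filter sketch of that check; the size, the number of hash functions, and the hash mix are illustrative assumptions, not the paper's hardware parameters.

```c
#include <stdint.h>
#include <stdbool.h>

#define EF_BITS 4096                       /* illustrative filter size */
static uint8_t ef_bits[EF_BITS / 8];

/* Simple illustrative hash over a guest frame number. */
static uint32_t ef_hash(uint64_t gfn, uint32_t seed)
{
    uint64_t h = gfn * 0x9E3779B97F4A7C15ULL + seed;
    return (uint32_t)(h >> 32) % EF_BITS;
}

/* Record a faulty page inside the direct segment. */
void escape_filter_add(uint64_t gfn)
{
    for (uint32_t k = 0; k < 3; k++) {
        uint32_t bit = ef_hash(gfn, k);
        ef_bits[bit / 8] |= 1u << (bit % 8);
    }
}

/* Probe the filter on a direct-segment hit. */
bool escape_filter_hit(uint64_t gfn)
{
    for (uint32_t k = 0; k < 3; k++) {
        uint32_t bit = ef_hash(gfn, k);
        if (!(ef_bits[bit / 8] & (1u << (bit % 8))))
            return false;                  /* definitely not a faulty page */
    }
    return true;                           /* possibly faulty: use paging */
}
```

A segment-range translation would consult it exactly as the slide's table says: a filter hit forces the paging path, a miss keeps the segment translation. A Bloom filter can give false positives (an unnecessary page walk) but never false negatives.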

  26. Outline: Motivation; Review: Direct Segments; Virtualized Direct Segments; Optimizations; Evaluation; Summary.

  27. Methodology. Measure the cost of page walks on real hardware and find the TLB misses that lie in the direct segment, using BadgerTrap for online analysis of TLB misses (released at http://research.cs.wisc.edu/multifacet/BadgerTrap). A linear model predicts performance. Prototype: Linux host/guest with the QEMU-KVM hypervisor on a 12-core Intel Sandy Bridge machine with 96 GB of memory.
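The slide does not give the model's exact form, so the following is only an illustrative sketch of such a linear model: keep the measured non-walk cycles and scale the measured page-walk cycles by the fraction of TLB misses (from BadgerTrap) that the direct segment does not cover.

```c
/* Illustrative linear performance model (assumed form, not the paper's):
 * predicted cycles = non-walk cycles
 *                  + walk cycles * (1 - fraction of misses covered). */
double modeled_cycles(double nonwalk_cycles,   /* measured: everything but walks */
                      double walk_cycles,      /* measured: page-walk cycles     */
                      double covered_fraction) /* from BadgerTrap, in [0, 1]     */
{
    return nonwalk_cycles + walk_cycles * (1.0 - covered_fraction);
}
```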

  28. Results: VMM Direct achieves near-native performance. [Bar chart: native, virtual, and modeled execution time overhead for graph500, memcached, NPB: CG, and GUPS under 4K, 4K+4K, VMM Direct, Dual Direct, and Guest Direct; one off-scale virtual bar reaches 113%.]

  29. Results: Dual Direct eliminates most of the TLB misses, achieving better-than-native performance. [Same bar chart; Dual Direct overhead drops as low as 0.01%.]

  30. Results: Guest Direct achieves near-native performance while providing flexibility at the VMM. [Same bar chart.]

  31. Results: the same trend holds across all workloads (more workloads in the paper). [Same bar chart; off-scale bars at 113% and 1556%; Dual Direct overheads of 0.01%, 0.00%, 0.01%, and 0.17% across the four workloads.]

  32. Summary. Problem: TLB misses in virtual machines; the hardware-virtualized MMU has high overheads. Solution: segmentation to bypass paging by extending Direct Segments to virtualization, with three modes offering different tradeoffs and two optimizations that make segmentation flexible. Results: near-native or better-than-native performance.

  33. Questions? For more details, come to the poster or see the paper.

  34. Backup Slides

  35. Hardware. [Diagram: lookup flow for the VA bits [47:12]; in Dual Direct mode the segment path is taken, otherwise the L1 D-TLB is looked up, then the L2 D-TLB on a miss, and finally the page-table walker, producing the physical bits [47:12]; the page offset [11:0] passes through unchanged.]

  36. Page Walk Hardware. [Diagram: the nested page-walk circuit, showing where Guest Direct mode bypasses the guest page table (gPT) and where VMM Direct mode bypasses the nested page-table walks rooted at ncr3.]

  37. Translation with Direct Segments. [Diagram: the VA bits [47:12] are compared against BASE and LIMIT in parallel with the D-TLB lookup; on a direct-segment hit the TLB result is ignored and OFFSET is added to produce the physical frame bits, bypassing the page-table walker; the page offset [11:0] passes through. Basu et al., ISCA 2013.]

  38. Translation with Direct Segments. [Diagram: the same hardware when the VA falls outside [BASE, LIMIT): the direct segment misses, so the D-TLB result is used and, on a TLB miss, the page-table walker runs. Basu et al., ISCA 2013.]

  39. Modes. [Diagram: the gVA-to-gPA (guest page table) and gPA-to-hPA (nested page table) steps, annotated for Base Virtualized, Guest Direct, VMM Direct, and Dual Direct.]

  40. Base Virtualized: Translation. [Diagram: (1) the guest page table (gPT) rooted at cr3 maps gVA to gPA; (2) the nested page table (nPT) maps gPA to hPA.]

  41. Dual Direct: Translation. [Diagram: (1) a guest direct segment maps gVA to gPA, bypassing the guest page table; (2) a host direct segment maps gPA to hPA, bypassing the nested page table.]

  42. VMM Direct: Translation. [Diagram: (1) the guest page table maps gVA to gPA; (2) a host direct segment maps gPA to hPA, bypassing the nested page table.]

  43. Guest Direct: Translation. [Diagram: (1) a guest direct segment maps gVA to gPA, bypassing the guest page table; (2) the nested page table maps gPA to hPA.]

  44. Tradeoffs: Memory Overcommit. [Table: memory-overcommit features (page sharing, ballooning, guest OS swapping, VMM swapping) compared across Base Virtualized, VMM Direct, Dual Direct, and Guest Direct; several features become limited under the direct-segment modes.]

  45. Results. [Bar chart: native, virtual, and modeled execution time overhead for graph500, memcached, NPB: CG, and GUPS under 4K+4K, 4K+2M, and 4K+1G paging and the VMM Direct, Dual Direct, and Guest Direct modes; off-scale bars reach 113%, 880%, 908%, and 1556%.]

  46. Results: VMM Direct and Guest Direct achieve near-native performance. [Same bar chart as slide 45.]

  47. Results: Dual Direct eliminates most of the TLB misses, achieving better-than-native performance. [Same bar chart; individual bars annotated at 9.11%, 12.43%, 28.60%, and 641.78%.]

  48. Results: VMM Direct achieves near-native performance for standard workloads. [Bar chart: execution cycle overhead for cactusADM, canneal, GemsFDTD, mcf, omnetpp, and streamcluster across 4K, 4K+4K, 4K+2M, 4K+1G, and 4K+VD configurations, native vs. virtual vs. modeled; off-scale bars include 70%, 82%, 83%, 88%, 103%, 149%, 160%, and 280%.]
