Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks
Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift
MICRO-47
Executive Summary
Problem: TLB misses in virtual machines
  Hardware-virtualized MMU has high overheads
Prior Work: Direct Segments – unvirtualized case
  Solution: segmentation to bypass paging
Extend Direct Segments for virtualization
  Three modes with different tradeoffs
  Two optimizations to make segmentation flexible
Results: Near- or better-than-native performance
Overheads of Virtualizing Memory
"We will show that the increase in translation lookaside buffer (TLB) miss handling costs due to the hardware-assisted memory management unit (MMU) is the largest contributor to the performance gap between native and virtual servers."
(Buell et al., VMware Technical Journal 2013)
Unvirtualized x86 Translation
A virtual address (VA) is translated to a physical address (PA) by walking the four-level page table rooted at CR3.
Up to memory accesses = 4 (see the sketch below)
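To make the four-access bound concrete, here is a minimal C sketch of a four-level x86-64 walk. It is an illustration only: phys_read64() is an assumed helper standing in for the MMU's physical-memory reads, and large pages and permission bits are ignored.

```c
#include <stdint.h>

#define PTE_PRESENT 0x1ULL
#define PFN_MASK    0x000FFFFFFFFFF000ULL   /* physical frame bits 51..12 */

extern uint64_t phys_read64(uint64_t paddr); /* assumed helper: read physical memory */

/* Translate a canonical 48-bit VA; assumes 4KB pages, no permission checks. */
uint64_t walk(uint64_t cr3, uint64_t va)
{
    uint64_t table = cr3 & PFN_MASK;
    for (int level = 3; level >= 0; level--) {            /* PML4, PDPT, PD, PT */
        uint64_t index = (va >> (12 + 9 * level)) & 0x1FF;
        uint64_t pte   = phys_read64(table + index * 8);  /* one memory access */
        if (!(pte & PTE_PRESENT))
            return 0;                                     /* page fault */
        table = pte & PFN_MASK;
    }
    return table | (va & 0xFFF);                          /* frame base + page offset */
}
```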
Two Levels of Translation (Base Virtualized)
(1) The guest page table maps a guest virtual address (gVA) to a guest physical address (gPA).
(2) The nested page table maps the gPA to a host physical address (hPA).
Support for Virtualizing Memory
A TLB miss now triggers a 2D page walk from gVA (via CR3) to hPA: each of the four guest page-table levels costs a four-step nested walk plus one access to the guest entry itself (5 accesses), and the final data gPA costs one more nested walk (4 accesses). A sketch follows below.
Up to memory accesses = 5 + 5 + 5 + 5 + 4 = 24
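The 24-access bound falls out of nesting the earlier walk inside itself. The sketch below assumes a nested_walk() helper that behaves like walk() above (up to four accesses, rooted at nCR3); every guest page-table pointer is a gPA, so each guest level costs a nested walk plus one read of the guest entry.

```c
#include <stdint.h>

#define PTE_PRESENT 0x1ULL
#define PFN_MASK    0x000FFFFFFFFFF000ULL

extern uint64_t phys_read64(uint64_t paddr);              /* assumed helper */
extern uint64_t nested_walk(uint64_t ncr3, uint64_t gpa); /* like walk(): up to 4 accesses */

/* 4 guest levels x (4 nested accesses + 1 guest PTE read) + 4 nested
 * accesses for the final data gPA = up to 24 memory accesses. */
uint64_t nested_2d_walk(uint64_t gcr3, uint64_t ncr3, uint64_t gva)
{
    uint64_t gtable = gcr3 & PFN_MASK;                    /* guest table root is a gPA */
    for (int level = 3; level >= 0; level--) {
        uint64_t index  = (gva >> (12 + 9 * level)) & 0x1FF;
        /* The guest page table lives in guest-physical memory, so its
         * address must first go through the nested page table. */
        uint64_t htable = nested_walk(ncr3, gtable);        /* up to 4 accesses */
        uint64_t gpte   = phys_read64(htable + index * 8);  /* +1 access */
        if (!(gpte & PTE_PRESENT))
            return 0;                                       /* guest page fault */
        gtable = gpte & PFN_MASK;                           /* next level: still a gPA */
    }
    return nested_walk(ncr3, gtable) | (gva & 0xFFF);       /* final gPA -> hPA */
}
```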
Applications
Big-memory applications (in-memory apps)
Database
Graph-analytics
Key-value store
HPC apps
Compute workloads
Cost of Virtualization
Execution-time overheads measured for graph500, memcached, NPB: CG, and GUPS across 4KB, 2MB, and 1GB page configurations, native and virtual.
Overheads of virtual memory can be high even on native machines.
Overheads increase drastically with virtualization.
Increase in overheads: ~3.6x (geometric mean).
Outline
Motivation
Review: Direct Segments 
Virtualized Direct Segments
Optimizations
Evaluation
Summary
Unvirtualized Direct Segments
(1) Conventional paging translates VAs outside the segment.
(2) A direct segment maps the contiguous VA range [BASE, LIMIT) to PA by adding OFFSET (sketched below).
Why Direct Segments?
  Matches big-memory workload needs
  NO TLB lookups => NO TLB misses for segment-mapped addresses
Basu et al. [ISCA 2013]
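The hardware check is simple enough to sketch; the struct and function below are illustrative stand-ins for the three per-core registers (BASE, LIMIT, OFFSET) described by Basu et al., not a real interface.

```c
#include <stdint.h>
#include <stdbool.h>

struct direct_segment {
    uint64_t base;    /* first VA covered by the segment   */
    uint64_t limit;   /* first VA past the segment         */
    uint64_t offset;  /* PA = VA + offset inside the range */
};

/* Returns true and fills *pa if the VA hits the segment;
 * otherwise hardware falls back to the normal TLB / page walk. */
bool ds_translate(const struct direct_segment *ds, uint64_t va, uint64_t *pa)
{
    if (va >= ds->base && va < ds->limit) {
        *pa = va + ds->offset;   /* no TLB lookup, no TLB miss */
        return true;
    }
    return false;                /* fall back to conventional paging */
}
```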
Direct Segments
Base native: 1D translation (VA -> PA through the CR3 page walk).
With Direct Segments: still 1D for paged addresses, but 0D (no page walk at all) for addresses inside the segment.
Outline
Motivation
Review: Direct Segments
Virtualized Direct Segments 
Evaluation
Methodology
Results
Summary
Using Segments in VMM
VM allocates guest physical memory at boot
Contiguous gPA can be mapped with a direct segment
VMM has a smaller code base to change
This converts the 2D page walk into a 1D walk
But, can we do better?
Three Virtualized Modes (see the sketch after this list)
Mode 1: VMM Direct (2D -> 1D)
  1. Maps almost the whole gPA
  2. 4 memory accesses
  3. Near-native performance
  4. Helps any application
Mode 2: Dual Direct (2D -> 0D)
  1. 0 memory accesses
  2. Better-than-native performance
  3. Suits big-memory applications
Mode 3: Guest Direct (2D -> 1D)
  1. 4 memory accesses
  2. Suits big-memory applications
  3. Flexible to provide VMM services
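As an illustration of how the three modes differ, the sketch below composes the earlier ds_translate() and walk() helpers into one translation function per mode; it is my own rendering of the dimensionality argument, not the paper's hardware description.

```c
#include <stdint.h>
#include <stdbool.h>

struct direct_segment { uint64_t base, limit, offset; };

extern bool     ds_translate(const struct direct_segment *ds,
                             uint64_t addr, uint64_t *out);  /* earlier sketch */
extern uint64_t walk(uint64_t cr3, uint64_t addr);           /* earlier sketch, 4 accesses */

/* VMM Direct (2D -> 1D): guest page table + host direct segment.
 * Guest page-table entries are gPAs covered by the host segment,
 * so the 1D guest walk costs 4 accesses and gPA -> hPA costs 0. */
uint64_t vmm_direct(uint64_t gcr3, const struct direct_segment *host_ds, uint64_t gva)
{
    uint64_t gpa = walk(gcr3, gva);                       /* 4 accesses */
    uint64_t hpa;
    return ds_translate(host_ds, gpa, &hpa) ? hpa : 0;    /* +0 accesses */
}

/* Guest Direct (2D -> 1D): guest direct segment + nested page table. */
uint64_t guest_direct(const struct direct_segment *guest_ds, uint64_t ncr3, uint64_t gva)
{
    uint64_t gpa;
    if (!ds_translate(guest_ds, gva, &gpa))               /* 0 accesses */
        return 0;                                         /* outside the segment */
    return walk(ncr3, gpa);                               /* 4 accesses */
}

/* Dual Direct (2D -> 0D): direct segments at both levels, 0 accesses. */
uint64_t dual_direct(const struct direct_segment *guest_ds,
                     const struct direct_segment *host_ds, uint64_t gva)
{
    uint64_t gpa, hpa;
    if (ds_translate(guest_ds, gva, &gpa) && ds_translate(host_ds, gpa, &hpa))
        return hpa;
    return 0;   /* addresses outside the segments fall back to the 2D walk */
}
```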
Compatibility
Mode 1 (VMM Direct): unmodified applications and guest OS; modified VMM.
Mode 2 (Dual Direct): big-memory application; minimally modified guest OS; modified VMM.
Mode 3 (Guest Direct): big-memory application; modified guest OS; minimally modified VMM.
(Each mode relies on the corresponding hardware support.)
Tradeoffs: Summary

Property                    | Base Virtualized | VMM Direct | Dual Direct | Guest Direct
Dimension / memory accesses | 2D / 24          | 1D / 4     | 0D / 0      | 1D / 4
Guest OS modifications      | none             | none       | required    | required
VMM modifications           | none             | required   | required    | minimal
Applications                | Any              | Any        | Big-memory  | Big-memory
VMM services allowed        | Yes              | No         | No          | Yes
Outline
Motivation
Review: Direct Segments
Virtualized Direct Segments
Optimizations 
Self-Ballooning
Escape Filter
Evaluation
Summary
Self-Ballooning
Tackles fragmentation in guest physical memory (a conceptual sketch follows below).
1. Fragmented (used) gPA regions are removed by the balloon driver, and the VMM is informed.
2. The VMM hot-adds new memory, leaving a contiguous gPA region that can back a segment.
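A conceptual sketch of the self-ballooning handshake; every helper function in it is hypothetical (this is not the Linux balloon-driver or memory-hotplug API), and it only illustrates the two steps above.

```c
#include <stdint.h>
#include <stddef.h>

struct gpa_range { uint64_t start, len; };

/* Hypothetical interfaces, named only for this illustration. */
extern size_t find_fragmented_ranges(struct gpa_range *out, size_t max);
extern void   balloon_release(struct gpa_range r);   /* guest frees pages, VMM is informed */
extern struct gpa_range vmm_hot_add(uint64_t len);   /* VMM hot-adds contiguous gPA */
extern void   create_guest_direct_segment(struct gpa_range r);

void self_balloon(void)
{
    struct gpa_range frag[64];
    size_t n = find_fragmented_ranges(frag, 64);

    uint64_t reclaimed = 0;
    for (size_t i = 0; i < n; i++) {        /* step 1: balloon out the fragments */
        balloon_release(frag[i]);
        reclaimed += frag[i].len;
    }

    /* step 2: the VMM hot-adds an equally sized contiguous region,
     * which can then back a guest direct segment. */
    struct gpa_range fresh = vmm_hot_add(reclaimed);
    create_guest_direct_segment(fresh);
}
```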
Escape Filter
Allows creation of segments in the presence of "hard" faults: a Bloom filter stores the few faulty pages inside the segment (sketched below).
Operation:
  Direct segment hit + escape filter hit  -> translate with paging
  Direct segment hit + escape filter miss -> translate with segment
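A minimal Bloom-filter sketch of the escape filter idea; the filter size and hash mix are illustrative, not the paper's hardware parameters. False positives merely force a harmless fallback to paging, and false negatives cannot occur.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define EF_BITS   4096   /* illustrative filter size */
#define EF_HASHES 3      /* illustrative number of hash functions */

struct escape_filter {
    uint8_t bits[EF_BITS / 8];
};

void ef_init(struct escape_filter *ef)
{
    memset(ef->bits, 0, sizeof ef->bits);
}

static unsigned ef_hash(uint64_t pfn, unsigned seed)
{
    uint64_t h = (pfn ^ seed) * 0x9E3779B97F4A7C15ULL;   /* simple multiplicative mix */
    return (unsigned)(h >> 40) % EF_BITS;
}

/* Mark a "hard"-faulted page inside the segment as escaped to paging. */
void ef_add(struct escape_filter *ef, uint64_t pfn)
{
    for (unsigned s = 0; s < EF_HASHES; s++) {
        unsigned b = ef_hash(pfn, s);
        ef->bits[b / 8] |= (uint8_t)(1u << (b % 8));
    }
}

/* Segment hit + filter hit  -> translate with paging.
 * Segment hit + filter miss -> translate with the segment. */
bool ef_hit(const struct escape_filter *ef, uint64_t pfn)
{
    for (unsigned s = 0; s < EF_HASHES; s++) {
        unsigned b = ef_hash(pfn, s);
        if (!(ef->bits[b / 8] & (1u << (b % 8))))
            return false;
    }
    return true;
}
```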
Outline
Motivation
Review: Direct Segments
Virtualized Direct Segments
Optimizations
Evaluation 
Summary
Methodology
Measure the cost of page walks on real hardware
Find TLB misses that lie in the direct segment
  BadgerTrap for online analysis of TLB misses
  Released: http://research.cs.wisc.edu/multifacet/BadgerTrap
Linear model to predict performance (one possible reading is sketched below)
Prototype
  Linux host/guest + QEMU-KVM hypervisor
  Intel 12-core Sandy Bridge with 96GB memory
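One plausible reading of "linear model to predict performance" is sketched below: scale the measured page-walk cycles by the fraction of TLB misses that still fall outside the direct segment. The formula, field names, and numbers are assumptions for illustration, not the paper's exact model.

```c
#include <stdio.h>

struct workload_profile {
    double total_cycles;      /* measured total execution cycles            */
    double walk_cycles;       /* measured cycles spent servicing page walks */
    double segment_coverage;  /* fraction of TLB misses inside the segment  */
};

/* Predicted execution-time overhead under a direct-segment mode:
 * only the walks outside the segment remain, expressed relative
 * to the walk-free execution time. (Assumed model.) */
double predicted_overhead(const struct workload_profile *w)
{
    double remaining_walk = w->walk_cycles * (1.0 - w->segment_coverage);
    double walk_free      = w->total_cycles - w->walk_cycles;
    return remaining_walk / walk_free;
}

int main(void)
{
    /* Purely illustrative numbers, not measured data. */
    struct workload_profile w = { 100e9, 30e9, 0.99 };
    printf("predicted overhead: %.2f%%\n", 100.0 * predicted_overhead(&w));
    return 0;
}
```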
Results
(Charts: execution-time overhead for graph500, memcached, NPB: CG, and GUPS; native vs. virtual vs. modeled.)
VMM Direct achieves near-native performance.
Dual Direct eliminates most of the TLB misses, achieving better-than-native performance.
Guest Direct achieves near-native performance while preserving flexibility at the VMM.
The same trend holds across all workloads (more workloads in the paper).
Summary
Problem: TLB misses in virtual machines
  Hardware-virtualized MMU has high overheads
Solution: segmentation to bypass paging
  Extend Direct Segments for virtualization
  Three modes with different tradeoffs
  Two optimizations to make segmentation flexible
Results: Near- or better-than-native performance
Questions?
For more details: come to the poster or see the paper.
 
 
Backup Slides
 
Hardware
(Figure: D-TLB lookup path. In Dual Direct mode the L1/L2 D-TLB lookups and the page-table walk are bypassed entirely.)
Page Walk Hardware
(Figure: the 2D page-walk hardware, showing where VMM Direct mode replaces the nested (nCR3) translations and where Guest Direct mode replaces the guest page-table walk.)
Translation with Direct Segments: address outside the segment
(Figure: VA bits [V47..V12] are checked against BASE and LIMIT in parallel with the D-TLB lookup. Outside the segment (N), the direct segment is ignored and the D-TLB hit/miss path and page-table walker supply the frame bits [P40..P12]; the page offset [V11..V0] passes through as [P11..P0].)
Basu et al. [ISCA 2013]
Translation with Direct Segments: address inside the segment
(Figure: when the VA falls between BASE and LIMIT (Y), paging is ignored; the frame bits [P40..P12] come from adding OFFSET, and the page offset [V11..V0] passes through as [P11..P0].)
Basu et al. [ISCA 2013]
Modes
(Figure: the base virtualized path gVA -> gPA -> hPA, with (1) the guest page table and (2) the nested page table, annotated with which step each of Guest Direct, VMM Direct, and Dual Direct replaces.)
Base Virtualized: Translation
(1) gVA -> gPA via the guest page table (gPT); (2) gPA -> hPA via the nested page table (nPT).
Dual Direct: Translation
(1) gVA -> gPA via the guest direct segment (gPT bypassed); (2) gPA -> hPA via the host direct segment (nPT bypassed).
VMM Direct: Translation
(1) gVA -> gPA via the guest page table (gPT); (2) gPA -> hPA via the host direct segment in place of the nPT.
Guest Direct: Translation
(1) gVA -> gPA via the guest direct segment; (2) gPA -> hPA via the nested page table (nPT).
Tradeoffs: Memory Overcommit
(Table: Base Virtualized, VMM Direct, Dual Direct, and Guest Direct compared on page sharing, ballooning, guest OS swapping, and VMM swapping; several of these services are limited under the direct-segment modes.)
Results
(Charts: execution-time overheads with large-page nested mappings, 4K+2M and 4K+1G, for graph500, memcached, NPB: CG, and GUPS.)
VMM Direct and Guest Direct achieve near-native performance.
Dual Direct eliminates most of the TLB misses, achieving better-than-native performance.
VMM Direct achieves near-native performance for standard workloads (cactusADM, canneal, GemsFDTD, mcf, omnetpp, streamcluster).