Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks
TLB misses in virtual machines can cause high overheads under a hardware-virtualized MMU. This paper proposes segmentation techniques that bypass paging to optimize memory virtualization, achieving near-native or better-than-native performance. It analyzes the overheads of virtualizing memory, showing that TLB miss handling costs are the largest contributor to the performance gap between native and virtual servers. The presentation covers unvirtualized x86 translation, the two levels of translation under virtualization, and hardware support for virtualizing memory, with applications to graph analytics, key-value stores, HPC apps, databases, and other big-memory and compute workloads. The cost of virtualization is quantified with execution time overheads across different memory configurations.
Presentation Transcript
Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks
Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, Michael M. Swift (MICRO-47)
Executive Summary
- Problem: TLB misses in virtual machines; the hardware-virtualized MMU has high overheads.
- Prior work: Direct Segments (unvirtualized case).
- Solution: segmentation to bypass paging. Extend Direct Segments for virtualization, with three modes offering different tradeoffs and two optimizations that make segmentation flexible.
- Results: near-native or better-than-native performance.
Overheads of Virtualizing Memory
"We will show that the increase in translation lookaside buffer (TLB) miss handling costs due to the hardware-assisted memory management unit (MMU) is the largest contributor to the performance gap between native and virtual servers." (Buell et al., VMware Technical Journal, 2013)
Unvirtualized x86 Translation
A virtual address (VA) is translated to a physical address (PA) by walking the four-level page table rooted at CR3: up to 4 memory accesses per TLB miss.
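As a concrete illustration, here is a simplified model of that walk in C. This is a sketch only: read_pte is a hypothetical helper standing in for one memory access, and permission bits and large pages are ignored.

```c
#include <stdint.h>

#define LEVELS 4      /* PML4, PDPT, PD, PT */
#define ENTRY_BITS 9  /* 512 entries per level */
#define PAGE_SHIFT 12 /* 4 KB pages */

/* Hypothetical helper: read one page-table entry (one memory access). */
extern uint64_t read_pte(uint64_t table_pa, uint64_t index);

/* Walk the 4-level x86-64 page table rooted at cr3.
 * Each level costs one memory access, so a TLB miss costs up to 4. */
uint64_t translate(uint64_t cr3, uint64_t va)
{
    uint64_t table = cr3;
    for (int level = LEVELS - 1; level >= 0; level--) {
        unsigned shift = PAGE_SHIFT + level * ENTRY_BITS;
        uint64_t index = (va >> shift) & ((1ull << ENTRY_BITS) - 1);
        table = read_pte(table, index);   /* one memory access */
    }
    /* 'table' now holds the physical frame; append the page offset. */
    return table | (va & ((1ull << PAGE_SHIFT) - 1));
}
```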
Two Levels of Translation
Under virtualization there are two translations: (1) the guest page table maps a guest virtual address (gVA) to a guest physical address (gPA), and (2) the nested page table maps the gPA to a host physical address (hPA). This is the base virtualized configuration.
Support for Virtualizing Memory
A hardware 2D page walk starts at the guest CR3, but every one of the four guest page-table references is a gPA that must first be translated through the nested page table (rooted at nCR3, 4 accesses) before the guest entry itself can be read (1 access); the final gPA needs one more nested walk. Up to 5 + 5 + 5 + 5 + 4 = 24 memory accesses.
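The arithmetic above can be made concrete with a small counting sketch (illustrative only; the constants assume 4-level guest and nested tables):

```c
/* Count memory accesses for a 2D nested page walk (sketch).
 * Each of the 4 guest page-table references is a gPA that must itself be
 * translated through the 4-level nested page table (4 accesses) before the
 * guest entry can be read (1 access); the final gPA needs one more nested
 * walk: 4 * (4 + 1) + 4 = 24. */
int nested_walk_accesses(void)
{
    const int guest_levels = 4, nested_levels = 4;
    int accesses = 0;
    for (int i = 0; i < guest_levels; i++) {
        accesses += nested_levels;  /* translate gPA of guest table */
        accesses += 1;              /* read the guest entry itself */
    }
    accesses += nested_levels;      /* translate the final gPA to hPA */
    return accesses;                /* 24 */
}
```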
Applications
Big-memory (in-memory) applications: graph analytics, key-value stores, HPC apps, databases; plus compute workloads.
Cost of Virtualization
(Bar chart: execution time overhead for graph500, memcached, NPB: CG, and GUPS under native page-size configurations 4K, 2M, and 1G.) Takeaway: overheads of virtual memory can be high even on native machines.
Cost of Virtualization
(Same chart, adding virtualized configurations such as 4K+4K, 2M+2M, and 1G+1G; one off-scale bar is labeled 113%.) Takeaway: overheads increase drastically with virtualization.
Cost of Virtualization
(Same chart; off-scale bars labeled 113% and 1556%.) Takeaway: virtualization increases overheads by ~3.6x (geometric mean).
Outline: Motivation; Review: Direct Segments; Virtualized Direct Segments; Optimizations; Evaluation; Summary
Unvirtualized Direct Segments (Basu et al. [ISCA 2013])
A direct segment maps a contiguous virtual range [BASE, LIMIT) to physical memory with a simple OFFSET (PA = VA + OFFSET); addresses outside the segment fall back to conventional paging. Why direct segments? They match big-memory workload needs: no TLB lookups, hence no TLB misses.
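A minimal sketch of the segment check in C (the register names BASE, LIMIT, and OFFSET come from the slide; the struct and function names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Per-core direct-segment registers. */
struct direct_segment { uint64_t base, limit, offset; };

/* Translate va with the direct segment if it falls in [BASE, LIMIT);
 * otherwise the caller falls back to the conventional page walk.
 * No TLB is involved on the segment path, so it can never miss. */
bool ds_translate(const struct direct_segment *ds, uint64_t va, uint64_t *pa)
{
    if (va >= ds->base && va < ds->limit) {
        *pa = va + ds->offset;   /* single addition, zero memory accesses */
        return true;
    }
    return false;                /* caller invokes the page-table walker */
}
```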
Direct Segments
Native paging is a one-dimensional (1D) walk from VA to PA through the page table rooted at CR3; a direct segment reduces translation to 0D, with no page-walk memory accesses.
Outline: Motivation; Review: Direct Segments; Virtualized Direct Segments; Evaluation (Methodology, Results); Summary
Using Segments in the VMM
The VM allocates guest physical memory at boot, so the contiguous gPA range can be mapped with a direct segment. The VMM has a smaller code base to change, and this converts the 2D page walk into a 1D walk (gVA through the guest page table to gPA, then gPA plus offset to hPA). But can we do better?
Three Virtualized Modes: 1. VMM Direct (2D to 1D)
The guest walks its page table (gVA to gPA); a host direct segment maps gPA to hPA. Features: maps almost the whole gPA space; 4 memory accesses; near-native performance; helps any application.
Three Virtualized Modes: 2. Dual Direct (2D to 0D)
Direct segments at both levels (gVA to gPA and gPA to hPA). Features: 0 memory accesses; better-than-native performance; suits big-memory applications.
Three Virtualized Modes: 3. Guest Direct (2D to 1D)
A guest direct segment maps gVA to gPA; the nested page table maps gPA to hPA. Features: 4 memory accesses; suits big-memory applications; flexible enough to preserve VMM services. (A sketch of the per-mode access counts follows below.)
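The per-mode access counts quoted above (24, 4, 0, and 4) follow mechanically from which level uses paging and which uses a segment. A small illustrative sketch, assuming 4-level guest and nested page tables:

```c
#include <stdio.h>

/* Each of the two translation levels uses paging or a direct segment. */
enum mech { PAGING, SEGMENT };

struct mode { const char *name; enum mech guest, host; };

/* Guest paging cost depends on the host mechanism, because every guest
 * page-table reference is a gPA that must itself be translated. */
int accesses(struct mode m)
{
    int host_cost = (m.host == PAGING) ? 4 : 0;
    if (m.guest == SEGMENT)
        return host_cost;                    /* only the final gPA */
    return 4 * (host_cost + 1) + host_cost; /* 4 guest refs + final gPA */
}

int main(void)
{
    struct mode modes[] = {
        { "Base Virtualized", PAGING,  PAGING  },  /* 2D: 24 */
        { "VMM Direct",       PAGING,  SEGMENT },  /* 1D:  4 */
        { "Dual Direct",      SEGMENT, SEGMENT },  /* 0D:  0 */
        { "Guest Direct",     SEGMENT, PAGING  },  /* 1D:  4 */
    };
    for (int i = 0; i < 4; i++)
        printf("%-16s up to %2d memory accesses\n",
               modes[i].name, accesses(modes[i]));
    return 0;
}
```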
Compatibility: VMM Direct
Unmodified application and guest OS; modified VMM; modified hardware.
Compatibility: Dual Direct
Big-memory application; modified guest OS; modified VMM; modified hardware. (VMM Direct, for comparison: unmodified app and guest OS; modified VMM and hardware.)
Compatibility: all three modes
VMM Direct: unmodified app and guest OS; modified VMM; modified hardware. Dual Direct: big-memory app; modified guest OS; modified VMM; modified hardware. Guest Direct: big-memory app; modified guest OS; minimally modified VMM; modified hardware.
Tradeoffs: Summary

Property                     | Base Virtualized | VMM Direct | Dual Direct | Guest Direct
Dimension / memory accesses  | 2D / 24          | 1D / 4     | 0D / 0      | 1D / 4
Guest OS modifications       | none             | none       | required    | required
VMM modifications            | none             | required   | required    | minimal
Applications                 | any              | any        | big-memory  | big-memory
VMM services allowed         | yes              | no         | no          | yes
Outline: Motivation; Review: Direct Segments; Virtualized Direct Segments; Optimizations (Self-Ballooning, Escape Filter); Evaluation; Summary
Self-Ballooning
Handles fragmentation in guest physical memory: used pages inside the would-be segment are (1) removed by the balloon driver, with the VMM informed, and (2) the VMM hot-adds new memory to replace them.
Escape Filter
Allows creation of segments in the presence of hard faults (bad physical pages). A Bloom filter stores the few faulty pages. Operation: on a direct segment hit, if the escape filter also hits, translate with paging; if the escape filter misses, translate with the segment. (A sketch of the filter follows below.)
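The slide names a Bloom filter but gives no parameters, so the following is a minimal illustrative sketch: the filter size and hash functions are invented, not the paper's. Note that a false positive merely falls back to paging, which is slower but always safe.

```c
#include <stdint.h>
#include <stdbool.h>

#define FILTER_BITS 4096   /* sizing is illustrative */

/* Minimal Bloom filter over faulty guest-physical page numbers. */
static uint64_t filter[FILTER_BITS / 64];

/* Two hypothetical hash functions; real hardware would use cheap ones. */
static unsigned h1(uint64_t pfn) { return (pfn * 0x9E3779B97F4A7C15ull) % FILTER_BITS; }
static unsigned h2(uint64_t pfn) { return (pfn ^ (pfn >> 17)) % FILTER_BITS; }

static void set_bit(unsigned b) { filter[b / 64] |= 1ull << (b % 64); }
static bool get_bit(unsigned b) { return filter[b / 64] & (1ull << (b % 64)); }

/* Record a page with a hard fault so that it escapes the segment. */
void escape_insert(uint64_t pfn) { set_bit(h1(pfn)); set_bit(h2(pfn)); }

/* On a direct-segment hit: filter hit => translate with paging (possible
 * false positive, safe but slower); filter miss => the page is definitely
 * not faulty, so translate with the segment. */
bool escape_hit(uint64_t pfn) { return get_bit(h1(pfn)) && get_bit(h2(pfn)); }
```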
Outline: Motivation; Review: Direct Segments; Virtualized Direct Segments; Optimizations; Evaluation; Summary
Methodology
- Measure the cost of page walks on real hardware.
- Find TLB misses that lie in the direct segment: BadgerTrap, a tool for online analysis of TLB misses, released at http://research.cs.wisc.edu/multifacet/BadgerTrap
- Linear model to predict performance.
- Prototype: Linux host/guest + QEMU-KVM hypervisor on a 12-core Intel Sandy Bridge machine with 96 GB of memory.
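The slide does not give the model's exact form, so the following is only a plausible sketch of such a linear model, assuming the modeled configuration eliminates the page-walk cycles of TLB misses that fall inside the direct segment and leaves everything else unchanged. The function name and parameters are illustrative, not the paper's.

```c
/* Linear performance model (sketch).
 * total_cycles:        measured execution cycles on real hardware
 * walk_cycles:         measured cycles spent in page walks
 * frac_in_segment:     fraction of TLB misses BadgerTrap attributes
 *                      to addresses inside the direct segment */
double modeled_overhead(double total_cycles, double walk_cycles,
                        double frac_in_segment)
{
    double ideal = total_cycles - walk_cycles;             /* no-miss baseline */
    double remaining = walk_cycles * (1.0 - frac_in_segment);
    return remaining / ideal;   /* predicted fractional execution overhead */
}
```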
Results
(Bar chart: execution time overhead, native vs. virtual vs. modeled, for graph500, memcached, NPB: CG, and GUPS under 4K and 4K+4K paging and the VMM Direct, Dual Direct, and Guest Direct modes; one off-scale bar labeled 113%.) Takeaway: VMM Direct achieves near-native performance.
Results
(Same chart; a Dual Direct bar drops to 0.01%.) Takeaway: Dual Direct eliminates most of the TLB misses, achieving better-than-native performance.
Results
(Same chart.) Takeaway: Guest Direct achieves near-native performance while preserving flexibility at the VMM.
Results
(Same chart, with data labels 0.01%, 0.00%, 0.01%, and 0.17%, and an off-scale bar labeled 1556%.) Takeaway: the same trend holds across all workloads (more workloads in the paper).
Summary
- Problem: TLB misses in virtual machines; the hardware-virtualized MMU has high overheads.
- Solution: segmentation to bypass paging. Extend Direct Segments for virtualization, with three modes offering different tradeoffs and two optimizations that make segmentation flexible.
- Results: near-native or better-than-native performance.
Questions? For more details, come to the poster or see the paper.
Hardware
(Flowchart of the lookup pipeline in Dual Direct mode: virtual-address bits [V47..V12] index the L1 D-TLB; on a miss, the L2 D-TLB; on a miss there, the page table walk. The result supplies physical-frame bits [P47..P12], while page-offset bits [V11..V0] pass through as [P11..P0].)
Page Walk Hardware
(Diagram of the 2D walk hardware: the guest page table (gPT) is walked from cr3, each gPA it produces is translated through the nested page table (nPT, rooted at ncr3) to an hPA, with annotations marking which translations the VMM Direct and Guest Direct modes replace with a direct segment.)
Translation with Direct Segments (Basu et al. [ISCA 2013])
(Flowchart, shown in two builds: virtual-address bits [V47..V12] are compared against BASE and LIMIT in parallel with the D-TLB lookup, while offset bits [V11..V0] pass through. If BASE <= VA and VA < LIMIT, the direct segment supplies the translation by adding OFFSET to produce frame bits [P40..P12], and the D-TLB hit/miss is ignored. Otherwise paging is used: the D-TLB result stands, and a miss invokes the page-table walker.)
Modes
(Diagram: the two translation levels, (1) the guest page table mapping gVA to gPA and (2) the nested page table mapping gPA to hPA, and which level each mode (Base Virtualized, Guest Direct, VMM Direct, Dual Direct) replaces with a direct segment.)
Base Virtualized: Translation
gVA is translated to gPA through the guest page table (gPT, rooted at the guest cr3), then gPA to hPA through the nested page table (nPT).
Dual Direct: Translation
gVA is translated to gPA by the guest direct segment, then gPA to hPA by the host direct segment; both the gPT and the nPT are bypassed.
VMM Direct: Translation
gVA is translated to gPA through the guest page table (gPT), then gPA to hPA by the host direct segment, bypassing the nPT.
Guest Direct: Translation
gVA is translated to gPA by the guest direct segment, then gPA to hPA through the nested page table (nPT).
Tradeoffs: Memory Overcommit

Property           | Base Virtualized | VMM Direct | Dual Direct | Guest Direct
Page Sharing       | yes              | limited    | limited     | yes
Ballooning         | yes              | limited    | limited     | yes
Guest OS Swapping  | yes              | yes        | limited     | limited
VMM Swapping       | yes              | limited    | limited     | yes
Results
(Bar chart: execution time overhead, native vs. virtual vs. modeled, for graph500, memcached, NPB: CG, and GUPS under 4K, 4K+4K, 4K+2M, and 4K+1G paging and the VMM Direct, Dual Direct, and Guest Direct modes; off-scale bars labeled 113%, 908%, 880%, and 1556%.)
Results
(Same chart.) Takeaway: VMM Direct and Guest Direct achieve near-native performance.
Results
(Same chart, with data labels 641.78%, 28.60%, 12.43%, and 9.11%.) Takeaway: Dual Direct eliminates most of the TLB misses, achieving better-than-native performance.
Results
(Bar chart: execution cycle overhead for cactusADM, canneal, GemsFDTD, mcf, omnetpp, and streamcluster under 4K, 4K+4K, 4K+2M, 4K+1G, and 4K+VD configurations; off-scale bars carry labels including 103%, 280%, 149%, 160%, 82%, and 88%.) Takeaway: VMM Direct achieves near-native performance for standard workloads.