
Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes

Rachata Ausavarungnirun    Joshua Landgraf    Vance Miller    Saugata Ghose
Jayneel Gandhi    Christopher J. Rossbach    Onur Mutlu
Executive Summary
2
Problem: No single best page size for GPU virtual memory
- Large pages: Better TLB reach
- Small pages: Lower demand paging latency
Our goal: Transparently enable both page sizes
Key observations:
- An application's contiguously-allocated small pages can easily be coalesced into a large page
- Interleaved memory allocation across applications breaks page contiguity
Key idea: Preserve virtual address contiguity of small pages when allocating physical memory, to simplify coalescing
Mosaic is a hardware/software cooperative framework that:
- Coalesces small pages into a large page without data movement
- Enables the benefits of both small and large pages
Key result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism
GPU Support for Virtual Memory
3
- Improves programmability with a unified address space
- Enables large data sets to be processed in the GPU
- Allows multiple applications to run on a GPU
- Virtual memory can enforce memory protection
State-of-the-Art Virtual Memory on GPUs
4
[Diagram: each GPU core has a private TLB; a shared TLB and page table walkers sit behind them on the GPU side; the page table and data live in GPU main memory, with CPU-side memory reached over a high-latency I/O bus]
- Limited TLB reach
- High-latency page walks
Trade-Off with Page Size
5
Larger pages:
- Better TLB reach
- High demand paging latency
Smaller pages:
- Lower demand paging latency
- Limited TLB reach
Trade-Off with Page Size
6
[Chart: normalized performance with small (4KB) vs. large (2MB) pages; large pages perform 52% better with no paging overhead, but 93% worse once paging overhead is included]
Can we get the best of both page sizes?
Outline
7
Background
Key challenges and our goal
Mosaic
Experimental evaluations
Conclusions
Challenges with Multiple Page Sizes
8
[Diagram: with state-of-the-art allocation, App 1 and App 2 allocations interleave over time, so Large Page Frames 1-5 in GPU memory each end up holding pages from both applications, plus unallocated gaps]
- Cannot coalesce (without migrating multiple 4KB pages)
- Need to search for which pages to coalesce
Desirable Allocation
9
[Diagram: with the desirable behavior, the same interleaved allocations each go to their own large page frames, so every frame holds pages from only one application]
- Can coalesce (without moving data)
Our Goals
10
- High TLB reach
- Low demand paging latency
- Application transparency: programmers do not need to modify their applications
Outline
11
Background
Key challenges and our goal
Mosaic
Experimental evaluation
Conclusions
Mosaic
12
- Contiguity-Conserving Allocation
- In-Place Coalescer
- Contiguity-Aware Compaction
Outline
13
Background
Key challenges and our goal
Mosaic
Contiguity-Conserving Allocation
In-Place Coalescer
Contiguity-Aware Compaction
Experimental evaluations
Conclusions
 
 
 
 
 
 
 
 
 
Mosaic: Data Allocation
14
[Diagram: the application demands data (1); Contiguity-Conserving Allocation allocates memory (2) in a large page frame and sets up the page table]
Soft guarantee: a large page frame contains pages from only a single address space
- Conserves contiguity within the large page frame
 
 
 
 
 
 
 
 
 
 
 
Mosaic: Data Allocation
15
[Diagram: after allocation (2), data is transferred (3) from CPU memory over the system I/O bus into the large page frame]
- Data transfer is done at a small page granularity
- A page that is transferred is immediately ready to use
Mosaic: Data Allocation
16
[Diagram: once the transfer is done (4), the large page frame in GPU memory holds the data and the page table points to it]
Outline
17
Background
Key challenges and our goal
Mosaic
Contiguity-Conserving Allocation
In-Place Coalescer
Contiguity-Aware Compaction
Experimental evaluations
Conclusions
 
 
 
 
 
 
 
 
Mosaic: Coalescing
18
[Diagram: Contiguity-Conserving Allocation sends a list of large pages (1) to the In-Place Coalescer]
- A fully-allocated large page frame is coalesceable
- The allocator sends the list of coalesceable pages to the In-Place Coalescer
 
 
 
 
 
Mosaic: Coalescing
19
[Diagram: given the list of large pages (1), the In-Place Coalescer updates the page tables (2)]
The In-Place Coalescer has the list of coalesceable large pages
Key task: perform coalescing without moving data
- Simply need to update the page tables
Mosaic: Coalescing
20
[Diagram: updating the page tables (2) sets a coalesced bit that links the small page table and the large page table]
- Application-transparent
- Data can be accessed using either page size
- No TLB flush
Outline
21
Background
Key challenges and our goal
Mosaic
Contiguity-Conserving Allocation
In-Place Coalescer
Contiguity-Aware Compaction
Experimental evaluations
Conclusions
 
 
 
 
 
Mosaic: Data Deallocation
22
Key task: free up not-fully-used large page frames
- Splinter pages: break down a large page into small pages
- Compaction: combine fragmented large page frames
 
 
 
 
 
 
 
 
 
 
Mosaic: Data Deallocation
23
[Diagram: when the application deallocates data (1), Mosaic splinters the affected pages (2) by resetting the coalesced bit in the page table]
- Splinter only frames with deallocated pages
Mosaic: Compaction
24
Key task: free up not-fully-used large page frames
- Splinter pages: break down a large page into small pages
- Compaction: combine fragmented large page frames
Mosaic: Compaction
25
[Diagram: Contiguity-Aware Compaction compacts pages (1) out of fragmented large page frames, then sends the list of free pages (2) to the allocator, freeing whole large page frames]
- Compaction decreases memory bloat
- Happens only when memory is highly fragmented
 
 
 
 
 
Mosaic: Compaction
26
- Once pages are compacted, they become non-coalesceable (no virtual contiguity)
- Maximizes the number of free large page frames
Outline
27
Background
Key challenges and our goal
Mosaic
Contiguity-Conserving Allocation
In-Place Coalescer
Contiguity-Aware Compaction
Experimental evaluations
Conclusions
Baseline: State-of-the-Art GPU Virtual Memory
28
[Same GPU virtual memory architecture diagram as Slide 4: private and shared TLBs, page table walkers, and the page table and data in main memory]
Methodology
29
- GPGPU-Sim (MAFIA) modeling a GTX 750 Ti
  - 30 GPU cores
  - Multiple GPGPU applications execute concurrently
  - 64KB 4-way L1, 2048KB 16-way L2
  - 64-entry L1 TLB, 1024-entry L2 TLB
  - 8-entry large page L1 TLB, 64-entry large page L2 TLB
  - 3GB main memory
- Model sequential page walks
- Model page tables and virtual-to-physical mapping
- CUDA-SDK, Rodinia, Parboil, LULESH, SHOC suites
  - 235 total workloads evaluated
- Available at: https://github.com/CMU-SAFARI/Mosaic
Comparison Points
30
State-of-the-art CPU-GPU memory management: GPU-MMU, based on [Power et al., HPCA'14]
- Upside: utilizes parallel page walks, TLB request coalescing, and a page walk cache to improve performance
- Downside: limited TLB reach
Ideal TLB: every TLB access is an L1 TLB hit
Performance
31
[Chart: weighted speedup of GPU-MMU, Mosaic, and the Ideal TLB for homogeneous and heterogeneous workloads of 1-5 concurrently-executing applications; Mosaic's improvement over GPU-MMU ranges from 21.4% to 95.0% across configurations]
- Mosaic consistently improves performance across a wide variety of workloads
- Mosaic performs within 10% of the Ideal TLB
Other Results in the Paper
32
- TLB hit rate: Mosaic achieves an average TLB hit rate of 99%
- Per-application IPC: 97% of all applications perform faster
- Sensitivity to different TLB sizes: Mosaic is effective for various TLB configurations
- Memory fragmentation analysis: Mosaic reduces memory fragmentation and improves performance regardless of the original fragmentation
- Performance with and without demand paging
Outline
33
Background
Key challenges and our goal
Mosaic
Contiguity-Conserving Allocation
In-Place Coalescer
Contiguity-Aware Compaction
Experimental evaluations
Conclusions
Summary
34
Problem: No single best page size for GPU virtual memory
- Large pages: Better TLB reach
- Small pages: Lower demand paging latency
Our goal: Transparently enable both page sizes
Key observations:
- An application's contiguously-allocated small pages can easily be coalesced into a large page
- Interleaved memory allocation across applications breaks page contiguity
Key idea: Preserve virtual address contiguity of small pages when allocating physical memory, to simplify coalescing
Mosaic is a hardware/software cooperative framework that:
- Coalesces small pages into a large page without data movement
- Enables the benefits of both small and large pages
Key result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism
Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes

Rachata Ausavarungnirun    Joshua Landgraf    Vance Miller    Saugata Ghose
Jayneel Gandhi    Christopher J. Rossbach    Onur Mutlu
Backup Slides
 
Current Methods to Share GPUs
37
Time sharing:
- Fine-grained context switching
- Coarse-grained context switching
Spatial sharing:
- NVIDIA GRID
- Multi-Process Service
Other Methods to Enforce Protection
38
- Segmented paging
- Static memory partitioning
TLB Flush
39
- With Mosaic, the contents of the page tables remain the same
- A TLB flush in Mosaic occurs when page table content is modified
- Modification invalidates content in the TLB, which then needs to be flushed
- Both large and small page TLBs are flushed
Performance with Demand Paging
40
[Chart: normalized performance of GPU-MMU without paging, GPU-MMU with paging, and Mosaic with paging, for homogeneous and heterogeneous workloads]
In-Place Coalescer: Coalescing
41
Key assumption (soft guarantee): a large page range always contains pages of the same application
[Diagram: coalescing sets the large page bit in the L1 page table entry and the disabled bit in each corresponding L2 entry]
- Benefit: no data movement
- Q: How to access the large page base entry?
In-Place Coalescer: Large Page Walk
42
[Diagram: same L1/L2 page table structure as the previous slide, with the large page bit set in the L1 entry and the disabled bit set in the L2 entries]
- The large page index is available at the leaf PTE
Sample Application Pairs
[Chart: weighted speedup of GPU-MMU, Mosaic, and the Ideal TLB for TLB-friendly and TLB-sensitive application pairs]
TLB Hit Rate
[Chart: L1 and L2 TLB hit rates of GPU-MMU and Mosaic for 1-5 concurrently-executing applications]
Pre-Fragmenting DRAM
Page Occupancy Experiment
Memory Bloat
Individual Application IPC
Mosaic: Putting Everything Together
50
[Diagram: the GPU runtime's Contiguity-Conserving Allocation and the hardware In-Place Coalescer and Contiguity-Aware Compaction cooperate through the page table and data in GPU memory]
Mosaic: Data Allocation
51
[Diagram: allocation walkthrough, repeated from the earlier Data Allocation slides]
Mosaic: Data Deallocation
52
[Diagram: deallocation walkthrough, repeated from the earlier Data Deallocation slides]