
Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes

Rachata Ausavarungnirun    Joshua Landgraf    Vance Miller    Saugata Ghose
Jayneel Gandhi    Christopher J. Rossbach    Onur Mutlu
Executive Summary
2
Problem: No single best page size for GPU virtual memory
- Large pages: Better TLB reach
- Small pages: Lower demand paging latency
Our goal: Transparently enable both page sizes
Key observations:
- An application's contiguously-allocated small pages can easily be coalesced into a large page
- Interleaved memory allocation across applications breaks page contiguity
Key idea: Preserve virtual address contiguity of small pages when allocating physical memory, to simplify coalescing
Mosaic is a hardware/software cooperative framework that:
- Coalesces small pages into a large page without data movement
- Enables the benefits of both small and large pages
Key result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism
GPU Support for Virtual Memory
3
- Improves programmability with a unified address space
- Enables large data sets to be processed in the GPU
- Allows multiple applications to run on a GPU
- Virtual memory can enforce memory protection
State-of-the-Art Virtual Memory on GPUs
4
[Diagram: each GPU core has a private TLB; a shared TLB and page table walkers sit behind them on the GPU side; the page table and data live in GPU main memory, with CPU-side memory reached over a high-latency I/O bus]
- Limited TLB reach
- High-latency page walks
Trade-Off with Page Size
5
Larger pages:
- Better TLB reach
- High demand paging latency
Smaller pages:
- Lower demand paging latency
- Limited TLB reach
Trade-Off with Page Size
6
[Chart: normalized performance with small (4KB) vs. large (2MB) pages; large pages perform 52% better with no paging overhead, but 93% worse once paging overhead is included]
Can we get the best of both page sizes?
Outline
7
Background
Key challenges and our goal
Mosaic
Experimental evaluations
Conclusions
Challenges with Multiple Page Sizes
8
[Diagram: with state-of-the-art allocation, App 1 and App 2 allocations interleave over time, so Large Page Frames 1-5 in GPU memory each end up holding pages from both applications, plus unallocated gaps]
- Cannot coalesce (without migrating multiple 4KB pages)
- Need to search for which pages to coalesce
Desirable Allocation
9
[Diagram: with the desirable behavior, the same interleaved allocations each go to their own large page frames, so every frame holds pages from only one application]
- Can coalesce (without moving data)
Our Goals
10
- High TLB reach
- Low demand paging latency
- Application transparency: programmers do not need to modify their applications
Outline
11
Background
Key challenges and our goal
Mosaic
Experimental evaluation
Conclusions
Mosaic
12
- Contiguity-Conserving Allocation
- In-Place Coalescer
- Contiguity-Aware Compaction
Outline
13
Background
Key challenges and our goal
Mosaic
Contiguity-Conserving Allocation
In-Place Coalescer
Contiguity-Aware Compaction
Experimental evaluations
Conclusions
 
 
 
 
 
 
 
 
 
Mosaic: Data Allocation
14
[Diagram: the application demands data (1); Contiguity-Conserving Allocation allocates memory (2) in a large page frame and sets up the page table]
Soft guarantee: a large page frame contains pages from only a single address space
- Conserves contiguity within the large page frame
 
 
 
 
 
 
 
 
 
 
 
Mosaic: Data Allocation
15
[Diagram: after allocation (2), data is transferred (3) from CPU memory over the system I/O bus into the large page frame]
- Data transfer is done at a small page granularity
- A page that is transferred is immediately ready to use
Mosaic: Data Allocation
16
[Diagram: once the transfer is done (4), the large page frame in GPU memory holds the data and the page table points to it]
Outline
17
Background
Key challenges and our goal
Mosaic
Contiguity-Conserving Allocation
In-Place Coalescer
Contiguity-Aware Compaction
Experimental evaluations
Conclusions
 
 
 
 
 
 
 
 
Mosaic: Coalescing
18
[Diagram: Contiguity-Conserving Allocation sends a list of large pages (1) to the In-Place Coalescer]
- A fully-allocated large page frame is coalesceable
- The allocator sends the list of coalesceable pages to the In-Place Coalescer
 
 
 
 
 
Mosaic: Coalescing
19
[Diagram: given the list of large pages (1), the In-Place Coalescer updates the page tables (2)]
The In-Place Coalescer has the list of coalesceable large pages
Key task: perform coalescing without moving data
- Simply need to update the page tables
Mosaic: Coalescing
20
[Diagram: updating the page tables (2) sets a coalesced bit that links the small page table and the large page table]
- Application-transparent
- Data can be accessed using either page size
- No TLB flush
Outline
21
Background
Key challenges and our goal
Mosaic
Contiguity-Conserving Allocation
In-Place Coalescer
Contiguity-Aware Compaction
Experimental evaluations
Conclusions
 
 
 
 
 
Mosaic: Data Deallocation
22
Key task: free up not-fully-used large page frames
- Splinter pages: break down a large page into small pages
- Compaction: combine fragmented large page frames
 
 
 
 
 
 
 
 
 
 
Mosaic: Data Deallocation
23
[Diagram: when the application deallocates data (1), Mosaic splinters the affected pages (2) by resetting the coalesced bit in the page table]
- Splinter only frames with deallocated pages
Mosaic: Compaction
24
Key task: free up not-fully-used large page frames
- Splinter pages: break down a large page into small pages
- Compaction: combine fragmented large page frames
Mosaic: Compaction
25
[Diagram: Contiguity-Aware Compaction compacts pages (1) out of fragmented large page frames, then sends the list of free pages (2) to the allocator, freeing whole large page frames]
- Compaction decreases memory bloat
- Happens only when memory is highly fragmented
 
 
 
 
 
Mosaic: Compaction
26
- Once pages are compacted, they become non-coalesceable (no virtual contiguity)
- Maximizes the number of free large page frames
Outline
27
Background
Key challenges and our goal
Mosaic
Contiguity-Conserving Allocation
In-Place Coalescer
Contiguity-Aware Compaction
Experimental evaluations
Conclusions
Baseline: State-of-the-Art GPU Virtual Memory
28
[Same GPU virtual memory architecture diagram as Slide 4: private and shared TLBs, page table walkers, and the page table and data in main memory]
Methodology
29
- GPGPU-Sim (MAFIA) modeling a GTX 750 Ti
  - 30 GPU cores
  - Multiple GPGPU applications execute concurrently
  - 64KB 4-way L1, 2048KB 16-way L2
  - 64-entry L1 TLB, 1024-entry L2 TLB
  - 8-entry large page L1 TLB, 64-entry large page L2 TLB
  - 3GB main memory
- Model sequential page walks
- Model page tables and virtual-to-physical mapping
- CUDA-SDK, Rodinia, Parboil, LULESH, SHOC suites
  - 235 total workloads evaluated
- Available at: https://github.com/CMU-SAFARI/Mosaic
Comparison Points
30
State-of-the-art CPU-GPU memory management: GPU-MMU, based on [Power et al., HPCA'14]
- Upside: utilizes parallel page walks, TLB request coalescing, and a page walk cache to improve performance
- Downside: limited TLB reach
Ideal TLB: every TLB access is an L1 TLB hit
Performance
31
[Chart: weighted speedup of GPU-MMU, Mosaic, and the Ideal TLB for homogeneous and heterogeneous workloads of 1-5 concurrently-executing applications; Mosaic's improvement over GPU-MMU ranges from 21.4% to 95.0% across configurations]
- Mosaic consistently improves performance across a wide variety of workloads
- Mosaic performs within 10% of the Ideal TLB
Other Results in the Paper
32
- TLB hit rate: Mosaic achieves an average TLB hit rate of 99%
- Per-application IPC: 97% of all applications perform faster
- Sensitivity to different TLB sizes: Mosaic is effective for various TLB configurations
- Memory fragmentation analysis: Mosaic reduces memory fragmentation and improves performance regardless of the original fragmentation
- Performance with and without demand paging
Outline
33
Background
Key challenges and our goal
Mosaic
Contiguity-Conserving Allocation
In-Place Coalescer
Contiguity-Aware Compaction
Experimental evaluations
Conclusions
Summary
34
Problem: No single best page size for GPU virtual memory
- Large pages: Better TLB reach
- Small pages: Lower demand paging latency
Our goal: Transparently enable both page sizes
Key observations:
- An application's contiguously-allocated small pages can easily be coalesced into a large page
- Interleaved memory allocation across applications breaks page contiguity
Key idea: Preserve virtual address contiguity of small pages when allocating physical memory, to simplify coalescing
Mosaic is a hardware/software cooperative framework that:
- Coalesces small pages into a large page without data movement
- Enables the benefits of both small and large pages
Key result: 55% average performance improvement over the state-of-the-art GPU memory management mechanism
Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes

Rachata Ausavarungnirun    Joshua Landgraf    Vance Miller    Saugata Ghose
Jayneel Gandhi    Christopher J. Rossbach    Onur Mutlu
Backup Slides
 
Current Methods to Share GPUs
37
Time sharing:
- Fine-grained context switching
- Coarse-grained context switching
Spatial sharing:
- NVIDIA GRID
- Multi-Process Service
Other Methods to Enforce Protection
38
- Segmented paging
- Static memory partitioning
TLB Flush
39
- With Mosaic, the contents of the page tables remain the same
- A TLB flush in Mosaic occurs when page table content is modified
- Modification invalidates content in the TLB, which then needs to be flushed
- Both large and small page TLBs are flushed
Performance with Demand Paging
40
[Chart: normalized performance of GPU-MMU without paging, GPU-MMU with paging, and Mosaic with paging, for homogeneous and heterogeneous workloads]
In-Place Coalescer: Coalescing
41
Key assumption (soft guarantee): a large page range always contains pages of the same application
[Diagram: coalescing sets the large page bit in the L1 page table entry and the disabled bit in each corresponding L2 entry]
- Benefit: no data movement
- Q: How to access the large page base entry?
In-Place Coalescer: Large Page Walk
42
[Diagram: same L1/L2 page table structure as the previous slide, with the large page bit set in the L1 entry and the disabled bit set in the L2 entries]
- The large page index is available at the leaf PTE
Sample Application Pairs
[Chart: weighted speedup of GPU-MMU, Mosaic, and the Ideal TLB for TLB-friendly and TLB-sensitive application pairs]
TLB Hit Rate
[Chart: L1 and L2 TLB hit rates of GPU-MMU and Mosaic for 1-5 concurrently-executing applications]
Pre-Fragmenting DRAM
Page Occupancy Experiment
Memory Bloat
Individual Application IPC
Mosaic: Putting Everything Together
50
[Diagram: the GPU runtime's Contiguity-Conserving Allocation and the hardware In-Place Coalescer and Contiguity-Aware Compaction cooperate through the page table and data in GPU memory]
Mosaic: Data Allocation
51
[Diagram: allocation walkthrough, repeated from the earlier Data Allocation slides]
Mosaic: Data Deallocation
52
[Diagram: deallocation walkthrough, repeated from the earlier Data Deallocation slides]