Mosaic: A GPU Memory Manager Enhancing Performance Through Adaptive Page Sizes

"Mosaic introduces a GPU memory manager supporting multiple page sizes for improved performance. By coalescing small pages into large ones without data movement, it achieves a 55% average performance boost over existing mechanisms. This innovative framework transparently enables the benefits of both small and large page sizes, addressing the trade-off between TLB reach and demand paging latency."


Uploaded on Sep 15, 2024


Presentation Transcript


  1. Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes. Rachata Ausavarungnirun, Joshua Landgraf, Jayneel Gandhi, Vance Miller, Saugata Ghose, Christopher J. Rossbach, Onur Mutlu

  2. Executive Summary. Problem: there is no single best page size for GPU virtual memory. Large pages give better TLB reach; small pages give lower demand paging latency. Our goal: transparently enable both page sizes. Key observations: (1) an application's contiguously allocated small pages can easily be coalesced into a large page; (2) interleaved memory allocation across applications breaks page contiguity. Key idea: preserve the virtual address contiguity of small pages when allocating physical memory, to simplify coalescing. Mosaic is a hardware/software cooperative framework that coalesces small pages into a large page without data movement and enables the benefits of both small and large pages. Key result: 55% average performance improvement over a state-of-the-art GPU memory management mechanism.

  3. GPU Support for Virtual Memory. Improves programmability with a unified address space. Enables large data sets to be processed in the GPU. Allows multiple applications to run on a GPU. Virtual memory can enforce memory protection.

  4. State-of-the-Art Virtual Memory on GPUs. [Diagram: four GPU cores, each with a private TLB, backed by a shared TLB and page table walkers; GPU-side memory holds the page table and data in main memory; CPU memory sits on the CPU side.] Annotated bottlenecks: limited TLB reach, high-latency page walks, and high-latency I/O between GPU-side and CPU-side memory.

  5. Trade-Off with Page Size. Larger pages: better TLB reach, but high demand paging latency. Smaller pages: lower demand paging latency, but limited TLB reach.

  6. Trade-Off with Page Size. [Figure: two bar charts of normalized performance for small (4KB) and large (2MB) pages, one with no paging overhead and one with paging overhead; annotated differences: 52% and -93%.] Can we get the best of both page sizes?

  7. Outline: Background; Key challenges and our goal; Mosaic; Experimental evaluations; Conclusions

  8. Challenges with Multiple Page Sizes. State of the art: [Diagram: over time, App 1 and App 2 allocations alternate and are placed across large page frames 1 to 5 in GPU memory; legend: App 1, App 2, unallocated.] The pages of App 1 and App 2 cannot be coalesced without migrating multiple 4KB pages, and the memory manager needs to search for which pages to coalesce.

  9. Desirable Allocation. Desirable behavior: [Diagram: the same sequence of App 1 and App 2 allocations over time, placed across large page frames 1 to 5 in GPU memory; legend: App 1, App 2, unallocated.] The pages of App 1 and App 2 can be coalesced without moving data.
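The contrast between slides 8 and 9 can be illustrated with a toy model. This is only a sketch under stated assumptions: the frame size of 4 small pages is chosen for readability (a 2MB frame holds 512 pages of 4KB), and the function names `first_fit_interleaved`, `per_app_frames`, and `coalesceable` are hypothetical, not taken from the presentation.

```python
# Toy model (assumption): a large page frame holds 4 small pages.
PAGES_PER_FRAME = 4

def first_fit_interleaved(requests, num_frames):
    """Baseline-style placement: each request takes the next free small-page
    slot, regardless of which application asked."""
    frames = [[] for _ in range(num_frames)]
    for slot, app in enumerate(requests):
        frames[slot // PAGES_PER_FRAME].append(app)
    return frames

def per_app_frames(requests, num_frames):
    """Desirable placement: every large page frame is owned by one application."""
    frames = [[] for _ in range(num_frames)]
    owner = {}  # frame index -> owning application
    for app in requests:
        # Reuse a frame this application already owns if it still has room.
        target = next((i for i, f in enumerate(frames)
                       if owner.get(i) == app and len(f) < PAGES_PER_FRAME), None)
        if target is None:
            # Otherwise claim the first frame that has no owner yet.
            target = next(i for i in range(num_frames) if i not in owner)
            owner[target] = app
        frames[target].append(app)
    return frames

def coalesceable(frames):
    """A frame can be coalesced in place only if it is full and single-owner."""
    return [i for i, f in enumerate(frames)
            if len(f) == PAGES_PER_FRAME and len(set(f)) == 1]

requests = ["app1", "app2"] * 4          # two applications allocating alternately
mixed = first_fit_interleaved(requests, num_frames=4)
grouped = per_app_frames(requests, num_frames=4)
print(coalesceable(mixed))    # []
print(coalesceable(grouped))  # [0, 1]
```

With interleaved placement every full frame mixes both applications, so none can be coalesced without migration; with per-application frames both full frames qualify.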

  10. Our Goals. High TLB reach. Low demand paging latency. Application transparency: programmers do not need to modify their applications.

  11. Outline: Background; Key challenges and our goal; Mosaic; Experimental evaluation; Conclusions

  12. Mosaic. [Diagram: three components spanning the GPU runtime and the hardware: Contiguity-Conserving Allocation, In-Place Coalescer, and Contiguity-Aware Compaction.]

  13. Outline: Background; Key challenges and our goal; Mosaic (Contiguity-Conserving Allocation, In-Place Coalescer, Contiguity-Aware Compaction); Experimental evaluations; Conclusions

  14. Mosaic: Data Allocation. Step 1: the application demands data. Step 2: Contiguity-Conserving Allocation allocates memory in a large page frame and updates the page table. Soft guarantee: a large page frame contains pages from only a single address space. This conserves contiguity within the large page frame.

  15. Mosaic: Data Allocation. Steps 1 and 2 as before (the application demands data; memory is allocated in a large page frame). Step 3: the data is transferred from CPU memory over the system I/O bus. Data transfer is done at small page granularity. A page that has been transferred is immediately ready to use.

  16. Mosaic: Data Allocation. Step 3: the data is transferred from CPU memory over the system I/O bus. Step 4: transfer done.
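The allocation steps on slides 14 to 16 can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the class name `ContiguityConservingAllocator`, its fields, and the first-free-frame policy are assumptions. Only the 4KB and 2MB page sizes and the soft guarantee (one address space per large page frame) come from the slides.

```python
SMALL = 4 * 1024                  # 4KB small page (size from the slides)
LARGE = 2 * 1024 * 1024           # 2MB large page (size from the slides)
PAGES_PER_FRAME = LARGE // SMALL  # 512 small pages per large page frame

class ContiguityConservingAllocator:
    """Hypothetical sketch: one large page frame per (application, 2MB virtual region)."""

    def __init__(self, num_frames):
        self.free_frames = list(range(num_frames))
        self.frame_of = {}    # (app_id, virtual large-page number) -> frame index
        self.page_table = {}  # (app_id, virtual small-page number) -> physical small-page number

    def allocate(self, app_id, vaddr):
        vpn = vaddr // SMALL
        key = (app_id, vpn // PAGES_PER_FRAME)
        if key not in self.frame_of:
            # Reserve a whole frame for this application's 2MB virtual region,
            # so the frame never mixes address spaces.
            self.frame_of[key] = self.free_frames.pop(0)
        frame = self.frame_of[key]
        # Place the small page at the same offset within the frame as it has
        # within its virtual region, which preserves contiguity.
        ppn = frame * PAGES_PER_FRAME + vpn % PAGES_PER_FRAME
        self.page_table[(app_id, vpn)] = ppn
        return ppn

alloc = ContiguityConservingAllocator(num_frames=4)
print(alloc.allocate(app_id=1, vaddr=0))          # 0   (frame 0, slot 0)
print(alloc.allocate(app_id=2, vaddr=0))          # 512 (frame 1, slot 0)
print(alloc.allocate(app_id=1, vaddr=3 * SMALL))  # 3   (frame 0, slot 3)
```

Because each small page lands at its virtual offset inside a frame owned by one application, a fully populated frame is already laid out as a large page, and no data has to move later.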

  17. Outline: Background; Key challenges and our goal; Mosaic (Contiguity-Conserving Allocation, In-Place Coalescer, Contiguity-Aware Compaction); Experimental evaluations; Conclusions

  18. Mosaic: Coalescing. A fully allocated large page frame is coalesceable. Step 1: the allocator sends the list of coalesceable large pages to the In-Place Coalescer.

  19. Mosaic: Coalescing. Step 1: the In-Place Coalescer receives the list of coalesceable large pages. Step 2: it updates the page tables. Key task: perform coalescing without moving data; only the page tables need to be updated.

  20. Mosaic: Coalescing. [Diagram: a small page table and a large page table, with the coalesced bit shown changing from 0 to 1.] Coalescing is application-transparent: data can be accessed using either page size, and no TLB flush is needed.
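The coalescing steps on slides 18 to 20 amount to a metadata-only update. The sketch below assumes a simplified page table representation (a list of physical page numbers per 2MB region and a dictionary for the large page entry); the function name `try_coalesce` and the field names are hypothetical.

```python
PAGES_PER_FRAME = 512  # 2MB large page / 4KB small page

def try_coalesce(small_ptes, large_pte):
    """Mark one 2MB-aligned virtual region as a large page if possible.

    small_ptes: physical small-page numbers (or None if unallocated) for the
                512 small pages of the region, in virtual order.
    large_pte:  dict with 'frame' and 'coalesced' fields (assumed layout).
    """
    if len(small_ptes) != PAGES_PER_FRAME or any(p is None for p in small_ptes):
        return False  # the large page frame is not fully allocated
    base = small_ptes[0]
    if base % PAGES_PER_FRAME != 0:
        return False  # the first page does not start a large page frame
    if any(p != base + i for i, p in enumerate(small_ptes)):
        return False  # physical placement does not follow virtual order
    # Only page table metadata changes; the small-page entries stay valid,
    # so the data remains reachable through either page size.
    large_pte["frame"] = base // PAGES_PER_FRAME
    large_pte["coalesced"] = True
    return True

region = list(range(1024, 1024 + PAGES_PER_FRAME))
entry = {"frame": None, "coalesced": False}
print(try_coalesce(region, entry), entry)  # True {'frame': 2, 'coalesced': True}
```

The checks in this sketch always pass for frames produced by a contiguity-conserving allocator once they are full, which is why the coalescer on the slides only needs the list of fully allocated frames.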

  21. Outline: Background; Key challenges and our goal; Mosaic (Contiguity-Conserving Allocation, In-Place Coalescer, Contiguity-Aware Compaction); Experimental evaluations; Conclusions

  22. Mosaic: Data Deallocation. Key task: free up large page frames that are not fully used. Splinter pages: break down a large page into small pages. Compaction: combine fragmented large page frames.

  23. Mosaic: Data Deallocation. Step 1: the application deallocates data. Step 2: the affected pages are splintered by resetting the coalesced bit in the page table. Only frames with deallocated pages are splintered.
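Splintering as described on slide 23 can be sketched as clearing the coalesced bit for just the frames touched by a deallocation. The data layout and the name `splinter_after_free` are assumptions made for illustration.

```python
PAGES_PER_FRAME = 512  # 2MB large page / 4KB small page

def splinter_after_free(large_ptes, freed_pages):
    """Reset the coalesced bit of every large page frame that lost a page.

    large_ptes:  dict mapping frame index -> {'coalesced': bool} (assumed layout)
    freed_pages: physical small-page numbers that were just deallocated
    Returns the list of frames that were splintered.
    """
    touched = {ppn // PAGES_PER_FRAME for ppn in freed_pages}
    splintered = []
    for frame in sorted(touched):
        entry = large_ptes[frame]
        if entry["coalesced"]:
            entry["coalesced"] = False  # back to small pages; no data moves
            splintered.append(frame)
    return splintered

large_ptes = {0: {"coalesced": True}, 1: {"coalesced": True}, 2: {"coalesced": False}}
print(splinter_after_free(large_ptes, freed_pages=[513, 1030]))  # [1]
print(large_ptes[0]["coalesced"])                                # True
```

Frame 0 had no deallocated pages and therefore stays coalesced, matching the rule that only frames with deallocated pages are splintered.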

  24. Mosaic: Compaction. Key task: free up large page frames that are not fully used. Splinter pages: break down a large page into small pages. Compaction: combine fragmented large page frames.

  25. Mosaic: Compaction. Step 1: pages in fragmented large page frames are compacted, which frees large page frames. Step 2: a list of free pages is produced. Compaction decreases memory bloat. It happens only when memory is highly fragmented.

  26. Mosaic: Compaction. Once pages are compacted, they become non-coalesceable because they no longer have virtual contiguity. Compaction maximizes the number of free large page frames.
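A rough sketch of the compaction idea on slides 24 to 26 follows. It assumes a greedy policy (drain the emptiest frames into the fullest ones) and a toy frame size of 4 pages; neither is stated in the slides, and the sketch does not model the fragmentation trigger or any per-application restriction. The name `compact` is hypothetical.

```python
PAGES_PER_FRAME = 4  # toy size for readability; 512 for 4KB pages in a 2MB frame

def compact(frames):
    """Greedy compaction sketch (assumed policy).

    frames: dict mapping frame id -> list of resident page ids
    Returns (freed frame ids, frames that received migrated pages).
    Frames that receive pages lose virtual contiguity, so a real system
    would treat them as non-coalesceable.
    """
    order = sorted((f for f in frames if frames[f]),
                   key=lambda f: (len(frames[f]), f))
    freed, received = [], set()
    lo, hi = 0, len(order) - 1
    while lo < hi:
        src, dst = order[lo], order[hi]
        room = PAGES_PER_FRAME - len(frames[dst])
        if room == 0:
            hi -= 1  # destination is full; try the next fullest frame
            continue
        moved, frames[src] = frames[src][:room], frames[src][room:]
        frames[dst].extend(moved)  # this step stands in for the data copy
        received.add(dst)
        if not frames[src]:
            freed.append(src)      # the source is now a free large page frame
            lo += 1
    return freed, received

frames = {0: ["a", "b", "c"], 1: ["d"], 2: ["e", "f"], 3: []}
print(compact(frames))  # ([1], {0})
```

Unlike coalescing and splintering, this step does copy data, which is consistent with the slides restricting it to highly fragmented memory.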

  27. Outline: Background; Key challenges and our goal; Mosaic (Contiguity-Conserving Allocation, In-Place Coalescer, Contiguity-Aware Compaction); Experimental evaluations; Conclusions

  28. Baseline: State-of-the-Art GPU Virtual Memory. [Diagram: four GPU cores, each with a private TLB, backed by a shared TLB and page table walkers; GPU-side memory holds the page table and data in main memory; CPU memory sits on the CPU side.]

  29. Methodology. Simulator: GPGPU-Sim (MAFIA) modeling a GTX750 Ti. 30 GPU cores. Multiple GPGPU applications execute concurrently. 64KB 4-way L1, 2048KB 16-way L2. 64-entry L1 TLB, 1024-entry L2 TLB. 8-entry large page L1 TLB, 64-entry large page L2 TLB. 3GB main memory. Models sequential page walks. Models page tables and virtual-to-physical mapping. Workloads: CUDA-SDK, Rodinia, Parboil, LULESH, and SHOC suites; 235 total workloads evaluated. Available at: https://github.com/CMU-SAFARI/Mosaic
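To make the TLB reach trade-off concrete, the short calculation below applies the usual definition of reach (number of entries times page size) to the TLB sizes listed in the methodology. The helper name `tlb_reach` is hypothetical; the entry counts and the 4KB and 2MB page sizes are taken from the slides.

```python
KB = 1024
MB = 1024 * KB

def tlb_reach(entries, page_size):
    """Amount of memory a TLB can map at once: entries times page size."""
    return entries * page_size

# Base (4KB) page TLBs from the methodology slide.
l1_base = tlb_reach(64, 4 * KB)
l2_base = tlb_reach(1024, 4 * KB)
# Large (2MB) page TLBs from the methodology slide.
l1_large = tlb_reach(8, 2 * MB)
l2_large = tlb_reach(64, 2 * MB)

print(l1_base // KB, "KB")   # 256 KB
print(l2_base // MB, "MB")   # 4 MB
print(l1_large // MB, "MB")  # 16 MB
print(l2_large // MB, "MB")  # 128 MB
```

Even with far fewer entries, the large page TLBs cover 32 to 64 times more memory than the base page TLBs at the same level.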

  30. Comparison Points. State-of-the-art CPU-GPU memory management: GPU-MMU, based on [Power et al., HPCA'14]. Upside: uses parallel page walks, TLB request coalescing, and a page walk cache to improve performance. Downside: limited TLB reach. Ideal TLB: every TLB access is an L1 TLB hit.

  31. Performance. [Figure: weighted speedup of GPU-MMU, Mosaic, and Ideal TLB versus the number of concurrently executing applications, for homogeneous workloads (1 to 5 applications) and heterogeneous workloads (2 to 5 applications); annotated improvements: 39.0%, 23.7%, 33.8%, 43.1%, 55.4%, 31.5%, 61.5%, 21.4%, 95.0%.] Mosaic consistently improves performance across a wide variety of workloads. Mosaic performs within 10% of the ideal TLB.
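The slides report weighted speedup without defining it. The sketch below uses the common definition for multi-application workloads (the sum over applications of IPC when sharing the GPU divided by IPC when running alone); that the evaluation uses exactly this form is an assumption, and the IPC values are made up for illustration.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-application slowdown-normalized throughput.

    ipc_shared: IPC of each application when run concurrently
    ipc_alone:  IPC of the same applications when each runs alone
    """
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one alone-IPC per application")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))

# Illustrative numbers only: two applications, each at half its alone IPC.
print(weighted_speedup([1.0, 1.5], [2.0, 3.0]))  # 1.0
```

Under this definition, n applications that suffer no interference would score n, which is why the values on the chart's vertical axis grow with the number of concurrent applications.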

  32. Other Results in the Paper. TLB hit rate: Mosaic achieves an average TLB hit rate of 99%. Per-application IPC: 97% of all applications perform faster. Sensitivity to different TLB sizes: Mosaic is effective for various TLB configurations. Memory fragmentation analysis: Mosaic reduces memory fragmentation and improves performance regardless of the original fragmentation. Performance with and without demand paging.

  33. Outline: Background; Key challenges and our goal; Mosaic (Contiguity-Conserving Allocation, In-Place Coalescer, Contiguity-Aware Compaction); Experimental evaluations; Conclusions

  34. Summary. Problem: there is no single best page size for GPU virtual memory. Large pages give better TLB reach; small pages give lower demand paging latency. Our goal: transparently enable both page sizes. Key observations: (1) an application's contiguously allocated small pages can easily be coalesced into a large page; (2) interleaved memory allocation across applications breaks page contiguity. Key idea: preserve the virtual address contiguity of small pages when allocating physical memory, to simplify coalescing. Mosaic is a hardware/software cooperative framework that coalesces small pages into a large page without data movement and enables the benefits of both small and large pages. Key result: 55% average performance improvement over a state-of-the-art GPU memory management mechanism.

  35. Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes. Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Christopher J. Rossbach, Jayneel Gandhi, Onur Mutlu

  36. Backup Slides

  37. Current Methods to Share GPUs. Time sharing: fine-grained context switching; coarse-grained context switching. Spatial sharing: NVIDIA GRID; Multi-Process Service.

  38. Other Methods to Enforce Protection. Segmented paging. Static memory partitioning.

  39. TLB Flush. With Mosaic, the contents of the page tables stay the same. A TLB flush in Mosaic occurs when page table content is modified, because this invalidates content in the TLB. In that case, both the large and small page TLBs need to be flushed.

  40. Performance with Demand Paging. [Figure: normalized performance of GPU-MMU without paging, GPU-MMU with paging, and Mosaic with paging, for homogeneous and heterogeneous workloads.]

  41. In-Place Coalescer: Coalescing. Key assumption (soft guarantee): a large page range always contains pages of the same application. [Diagram: coalescing sets the large page bit in the L1 page table and the disabled bit in the corresponding L2 page table entries; the virtual address is shown split into PD, PT, and PO fields.] Q: How to access the large page base entry? Benefit: no data movement.

  42. In-Place Coalescer: Large Page Walk. The large page index is available at the leaf PTE. [Diagram: L1 and L2 page tables, with the large page bit set in the L1 page table and the disabled bit set in the corresponding L2 page table entries.]
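The address fields on slides 41 and 42 (PD, PT, PO) can be related to the two page sizes with a little bit arithmetic. The sketch assumes a 12-bit page offset for 4KB pages and a 21-bit offset for 2MB pages, which follow from the page sizes; treating everything above bit 21 as a single PD field and all function names are simplifications for illustration.

```python
SMALL_SHIFT = 12                     # 4KB page  -> 12 offset bits
LARGE_SHIFT = 21                     # 2MB page  -> 21 offset bits
PT_BITS = LARGE_SHIFT - SMALL_SHIFT  # 9 bits    -> 512 small pages per large page

def split_small(va):
    """Return (PD index, PT index, small-page offset) for a virtual address."""
    return (va >> LARGE_SHIFT,
            (va >> SMALL_SHIFT) & ((1 << PT_BITS) - 1),
            va & ((1 << SMALL_SHIFT) - 1))

def split_large(va):
    """Return (PD index, large-page offset): PT index and PO merge into one offset."""
    return va >> LARGE_SHIFT, va & ((1 << LARGE_SHIFT) - 1)

def translate_small(va, frame):
    """Translate via the small page entry of a region placed in order in `frame`."""
    _, pt, po = split_small(va)
    ppn = (frame << PT_BITS) + pt    # small page keeps its virtual slot in the frame
    return (ppn << SMALL_SHIFT) | po

def translate_large(va, frame):
    """Translate via the large page entry for the same frame."""
    _, po = split_large(va)
    return (frame << LARGE_SHIFT) | po

va = 0x12345678
print(split_small(va))   # (145, 325, 1656)
print(translate_small(va, frame=7) == translate_large(va, frame=7))  # True
```

When small pages sit at their virtual offsets inside one frame, both translation paths yield the same physical address, which is consistent with the slides' claim that coalesced data can be accessed using either page size.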

  43. Sample Application Pairs. [Figure: weighted speedup of GPU-MMU, Mosaic, and Ideal TLB for TLB-friendly and TLB-sensitive application pairs.]

  44. TLB Hit Rate. [Figure: L1 and L2 TLB hit rates (0% to 100%) for GPU-MMU and Mosaic versus the number of concurrently executing applications (1 to 5).]

  45. Pre-Fragmenting DRAM. [Figure: normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal versus fragmentation index (30%, 50%, 70%, 90%, 95%, 97%, 100%).]

  46. Page Occupancy Experiment. [Figure: normalized performance of no CAC, CAC, CAC-BC, and CAC-Ideal versus large page frame occupancy.]

  47. Memory Bloat. [Figure: memory bloat relative to GPU-MMU with 4KB pages, for GPU-MMU and CAC, versus page occupancy (1%, 10%, 25%, 35%, 50%, 75%).]

  48. Individual Application IPC. [Figure: four plots of normalized performance for GPU-MMU, Mosaic, and Ideal-TLB versus sorted application number.]

  49. [Figure: four plots of normalized performance for GPU-MMU and Mosaic versus TLB size: per-SM L1 TLB base page entries (8 to 256), shared L2 TLB base page entries (64 to 4096), per-SM L1 TLB large page entries (4 to 64), and shared L2 TLB large page entries (32 to 512).]

  50. Mosaic: Putting Everything Together. [Diagram: the full flow across the GPU runtime and hardware. Allocation: the application demands data; Contiguity-Conserving Allocation allocates memory; data is transferred over the system I/O bus; transfer done. Coalescing: the list of large pages goes to the In-Place Coalescer, which coalesces pages in the page table. Deallocation: the application deallocates data; pages are splintered; Contiguity-Aware Compaction compacts pages and produces the list of free pages.]
