Exploring Challenges and Opportunities in Processing-in-Memory Architecture

PIM technology aims to enhance performance by moving computation closer to memory, improving bandwidth, latency, and energy efficiency. Despite the failure of its first wave in the 1990s, new strategies focus on cost-effectiveness, programming models, and overcoming implementation challenges. This presentation proposes simple PIM operations exposed as an ISA extension, offering an intuitive programming model, full cache coherence and virtual memory support, and low implementation overhead: simple PIM operations relate to conventional PIM roughly as SSE/AVX relates to GPGPU. A parallel PageRank computation illustrates the potential of the ISA-extension interface.


Presentation Transcript


  1. PIM-Enabled Instructions: A Low-Overhead, Locality-Aware PIM Architecture
Junwhan Ahn, Sungjoo Yoo, and Kiyoung Choi (Seoul National University); Onur Mutlu (Carnegie Mellon University)

  2. Processing-in-Memory
Move computation to memory:
- Higher memory bandwidth
- Lower memory latency
- Better energy efficiency (e.g., off-chip links vs. TSVs)
Originally studied in the 1990s, also known as processor-in-memory (e.g., DIVA, EXECUBE, FlexRAM, IRAM, Active Pages, ...), but not commercialized in the end.
Why was PIM unsuccessful in its first attempt?

  3. Challenges in Processing-in-Memory
Three obstacles: cost-effectiveness, the programming model, and coherence & virtual memory.
[Figure: host processors running many threads alongside a DRAM die that integrates complex logic for in-memory processors.]

  4. Challenges in Processing-in-Memory
Cost-effectiveness: (partially) solved by 3D-stacked DRAM.
Programming model, coherence & VM: still challenging even in recent PIM architectures (e.g., AC-DIMM, NDA, NDC, TOP-PIM, Tesseract, ...).
[Figure: same diagram as Slide 3.]

  5. New Direction of PIM
Objectives:
- Provide an intuitive programming model for PIM
- Full support for cache coherence and virtual memory
- Reduce the implementation overhead of PIM units
Our solution: simple PIM operations as an ISA extension.
- Simple: low-overhead implementation
- PIM operations as host processor instructions: intuitive
Analogy: simple PIM operations are to conventional PIM what SSE/AVX is to GPGPU.

  6. Potential of ISA Extension as PIM Interface
Example: parallel PageRank computation

    for (v: graph.vertices) {
      value = weight * v.rank;
      for (w: v.successors) {
        w.next_rank += value;
      }
    }
    for (v: graph.vertices) {
      v.rank = v.next_rank;
      v.next_rank = alpha;
    }
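
For concreteness, here is a minimal self-contained C++ rendering of the slide's pseudocode. The Graph and Vertex types are hypothetical (the slides leave them undefined), and weight and alpha are taken as given, as on the slide.

    #include <cstddef>
    #include <vector>

    // Hypothetical graph types for the slide's PageRank loops.
    struct Vertex {
        double rank = 1.0;
        double next_rank = 0.0;
        std::vector<std::size_t> successors;  // indices into Graph::vertices
    };

    struct Graph {
        std::vector<Vertex> vertices;
    };

    // One PageRank iteration, following the slide's two loops.
    void pagerank_iteration(Graph& graph, double weight, double alpha) {
        // Scatter phase: each vertex pushes its weighted rank to successors.
        for (Vertex& v : graph.vertices) {
            double value = weight * v.rank;
            for (std::size_t w : v.successors) {
                graph.vertices[w].next_rank += value;  // the += targeted by PIM
            }
        }
        // Commit phase: promote next_rank and reset it for the next iteration.
        for (Vertex& v : graph.vertices) {
            v.rank = v.next_rank;
            v.next_rank = alpha;
        }
    }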

  7. Potential of ISA Extension as PIM Interface

    for (v: graph.vertices) {
      value = weight * v.rank;
      for (w: v.successors) {
        w.next_rank += value;
      }
    }

Conventional architecture: each w.next_rank update moves a full 64-byte cache block from main memory to the host processor and back (64 bytes in, 64 bytes out).

  8. Potential of ISA Extension as PIM Interface

    for (v: graph.vertices) {
      value = weight * v.rank;
      for (w: v.successors) {
        __pim_add(&w.next_rank, value);
      }
    }

    pim.add r1, (r2)

In-memory addition: only the 8-byte operand value travels to memory, and nothing comes back (8 bytes in, 0 bytes out).
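
The slides define __pim_add only by its effect: an atomic 8-byte add performed at the data's location, emitted as pim.add on PEI hardware. As a hedged sketch, on commodity hardware without a pim.add instruction the same semantics can be approximated by a host-side atomic read-modify-write (C++20 std::atomic_ref):

    #include <atomic>

    // Hypothetical emulation of the __pim_add intrinsic for machines
    // without PEI support: an atomic add on one 8-byte value, done with
    // a compare-and-swap loop.
    inline void __pim_add(double* addr, double value) {
        std::atomic_ref<double> target(*addr);
        double expected = target.load(std::memory_order_relaxed);
        // Retry until the addition commits atomically with respect to
        // other __pim_add calls ("atomic between different PEIs").
        while (!target.compare_exchange_weak(expected, expected + value,
                                             std::memory_order_relaxed)) {
            // expected is refreshed by compare_exchange_weak on failure
        }
    }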

  9. Potential of ISA Extension as PIM Interface
[Figure: speedup of in-memory addition over the conventional architecture (y-axis: -20% to 60%) across nine real-world graphs ordered by increasing vertex count: p2p-Gnutella31, web-Stanford, amazon-2008, frwiki-2013, wiki-Talk, cit-Patents, soc-Slashdot0811, ljournal-2008, soc-LiveJournal1. Small graphs slow down because the lack of on-chip caches increases memory bandwidth consumption; graphs with more vertices speed up because in-memory computation reduces memory bandwidth consumption.]

  10. Overview
1. How should simple PIM operations be interfaced to conventional systems?
   Expose PIM operations as cache-coherent, virtually-addressed host processor instructions. No changes to the existing sequential programming model.
2. What is the most efficient way of exploiting such simple PIM operations?
   Dynamically determine the location of PIM execution based on data locality, without software hints.

  11. PIM-Enabled Instructions

    for (v: graph.vertices) {
      value = weight * v.rank;
      for (w: v.successors) {
        w.next_rank += value;
      }
    }

  12. PIM-Enabled Instructions

    for (v: graph.vertices) {
      value = weight * v.rank;
      for (w: v.successors) {
        __pim_add(&w.next_rank, value);
      }
    }

    pim.add r1, (r2)

- Executed either in memory or in the host processor
- Cache-coherent, virtually-addressed
- Atomic between different PEIs
- Not atomic with normal instructions (use pfence)

  13. PIM-Enabled Instructions

    for (v: graph.vertices) {
      value = weight * v.rank;
      for (w: v.successors) {
        __pim_add(&w.next_rank, value);
      }
    }
    pfence();

    pim.add r1, (r2)
    pfence

- Executed either in memory or in the host processor
- Cache-coherent, virtually-addressed
- Atomic between different PEIs
- Not atomic with normal instructions (use pfence)

  14. PIM-Enabled Instructions
Key to practicality: the single-cache-block restriction. Each PEI can access at most one last-level cache block; similar restrictions exist for atomic instructions.
Benefits:
- Localization: each PEI is bounded to one memory module
- Interoperability: easier support for cache coherence and virtual memory
- Simplified locality monitoring: the data locality of PEIs can be identified by LLC tag checks or similar methods
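
As an illustration of the restriction, the following sketch (assuming the 64-byte blocks of the evaluated configuration; the function name is ours) checks whether a PEI operand falls entirely within one cache block; an operand crossing a block boundary could not be a legal PEI target.

    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kBlockBytes = 64;  // LLC block size assumed

    // True if [addr, addr + size) lies within a single cache block
    // (size must be at least 1).
    bool fits_in_one_block(const void* addr, std::size_t size) {
        auto a = reinterpret_cast<std::uintptr_t>(addr);
        return (a / kBlockBytes) == ((a + size - 1) / kBlockBytes);
    }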

  15. Architecture
Proposed PEI architecture:
- Host processor: out-of-order cores, each with private L1 and L2 caches and a PCU (PEI Computation Unit); a shared last-level cache; an HMC controller; and a PMU (PEI Management Unit) containing the PIM directory and the locality monitor.
- HMC: a crossbar network connecting DRAM controllers, each paired with its own PCU.

  16. Memory-side PEI Execution
The host core issues pim.add y, &x, where y is a value in the core and x resides in an HMC DRAM partition.
[Figure: the PEI travels from the out-of-order core through the PMU and HMC controller toward the memory-side PCU nearest to x.]

  17. Memory-side PEI Execution
Address translation for PEIs is done by the host processor TLB, just as for normal instructions: no modifications to existing hardware or OS, and no need for in-memory TLBs.
[Figure: same as Slide 16.]

  18. Memory-side PEI Execution
Step 1: wait until x is writable (enforced by the PIM directory in the PMU).
[Figure: same as Slide 16.]

  19. Memory-side PEI Execution
The PIM directory is a fixed array of reader-writer locks (#0 to #N-1). The target address is XOR-hashed to select a lock; the mapping is inexact, but conservative.
[Figure: the address hashing into the lock array inside the PMU.]
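
A software sketch of this structure might look as follows; the XOR-fold hash and the class name are illustrative (the evaluated PIM directory has 2048 entries). A writing PEI such as pim.add would take its lock exclusively, while read-only PEIs could share it; sharing a lock between unrelated blocks can only serialize, never miss, a conflict.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <shared_mutex>

    constexpr std::size_t kNumLocks = 2048;  // matches the evaluated size

    class PimDirectory {
    public:
        std::shared_mutex& lock_for(std::uintptr_t block_addr) {
            return locks_[hash(block_addr)];
        }
    private:
        static std::size_t hash(std::uintptr_t a) {
            // Illustrative XOR-fold of address bits into a lock index.
            return (a ^ (a >> 11) ^ (a >> 22)) % kNumLocks;
        }
        std::array<std::shared_mutex, kNumLocks> locks_;
    };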

  20. Memory-side PEI Execution
Step 2: check the data locality of x (using the locality monitor in the PMU).
[Figure: same as Slide 16.]

  21. Memory-side PEI Execution
The locality monitor is a partial tag array: a hit means high locality, a miss means low locality. It is updated on each LLC access and on each issue of a PIM operation to memory.
[Figure: the address probing the partial tag array inside the PMU.]
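
A behavioral sketch of such a monitor, with illustrative geometry and LRU replacement standing in for "simulates cache replacement behavior" (Slide 31); block_addr is assumed to be a cache-block address (byte address divided by the block size).

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class LocalityMonitor {
    public:
        LocalityMonitor(std::size_t sets, std::size_t ways)
            : ways_(ways), tags_(sets * ways, 0),
              valid_(sets * ways, false), sets_(sets) {}

        // Called on every LLC access and every PEI issued to memory.
        // Returns true (high locality predicted) on a partial-tag hit.
        bool access(std::uintptr_t block_addr) {
            std::size_t set = block_addr % sets_;
            std::uint16_t tag = partial_tag(block_addr);
            std::size_t base = set * ways_;
            for (std::size_t w = 0; w < ways_; ++w) {
                if (valid_[base + w] && tags_[base + w] == tag) {
                    promote(base, w);    // move hit way to MRU position
                    return true;
                }
            }
            insert(base, tag);           // fill at MRU, evicting the LRU way
            return false;
        }

    private:
        // Ways within a set are kept in recency order: index 0 is MRU.
        std::uint16_t partial_tag(std::uintptr_t a) const {
            return static_cast<std::uint16_t>((a / sets_) & 0xFFFF);
        }
        void promote(std::size_t base, std::size_t w) {
            std::uint16_t t = tags_[base + w];
            for (std::size_t i = w; i > 0; --i) {
                tags_[base + i] = tags_[base + i - 1];
                valid_[base + i] = valid_[base + i - 1];
            }
            tags_[base] = t;
            valid_[base] = true;
        }
        void insert(std::size_t base, std::uint16_t tag) {
            for (std::size_t i = ways_ - 1; i > 0; --i) {
                tags_[base + i] = tags_[base + i - 1];
                valid_[base + i] = valid_[base + i - 1];
            }
            tags_[base] = tag;
            valid_[base] = true;
        }

        std::size_t ways_;
        std::vector<std::uint16_t> tags_;
        std::vector<bool> valid_;
        std::size_t sets_;
    };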

  22. Memory-side PEI Execution
Low locality: the PEI is sent to memory for execution.
[Figure: same as Slide 16, with the locality monitor reporting low locality.]

  23. Memory-side PEI Execution
Before the PEI executes in memory, the host back-invalidates any cached copies of x for cache coherence. No modifications to existing cache coherence protocols.
[Figure: the invalidation of x flowing from the PMU into the cache hierarchy.]
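
A hedged sketch of this step, with stub functions standing in for hardware actions (none of these names come from the slides): the PEI holds the directory's writer lock, cached copies are invalidated through the existing protocol, and only then does the memory-side PCU execute.

    #include <cstdint>
    #include <mutex>
    #include <shared_mutex>

    struct PeiOp { double operand; };  // e.g., the y of pim.add y, &x

    // Stubs for hardware actions (illustrative only).
    void back_invalidate(std::uintptr_t /*block_addr*/) {
        // Real hardware: send the existing protocol's invalidation
        // (or writeback) message for this block to the host caches.
    }
    void send_to_hmc(std::uintptr_t /*block_addr*/, const PeiOp& /*op*/) {
        // Real hardware: dispatch the operation to the memory-side PCU
        // attached to the DRAM controller owning this block.
    }

    void execute_pei_in_memory(std::shared_mutex& block_lock,
                               std::uintptr_t block_addr, const PeiOp& op) {
        // Writer lock from the PIM directory serializes this PEI against
        // other PEIs hashing to the same lock.
        std::unique_lock<std::shared_mutex> lk(block_lock);
        back_invalidate(block_addr);   // keep host caches coherent
        send_to_hmc(block_addr, op);   // memory-side PCU performs x + y
        // A completion notification would release the lock on real HW.
    }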

  24. Memory-side PEI Execution
The memory-side PCU reads x from DRAM, computes x + y, and writes the result back.
[Figure: same as Slide 16, with x+y now stored in the DRAM partition.]

  25. Memory-side PEI Execution
Because each PEI touches a single cache block, PIM memory accesses are completely localized to one DRAM partition, without any special data mapping.
[Figure: same as Slide 24.]

  26. Memory-side PEI Execution
A completion notification is sent back to the host processor once the memory-side PCU finishes.
[Figure: same as Slide 24.]

  27. Host-side PEI Execution
As before, pim.add y, &x waits until x is writable and then checks the data locality of x; this time, x is cached on the host.
[Figure: same diagram, with x resident in the host cache hierarchy.]

  28. Host-side PEI Execution
High locality: a host-side PCU executes the PEI, reading the cached copy of x and writing x + y back through the cache hierarchy.
[Figure: same as Slide 27, with x+y in the host caches.]

  29. Host-side PEI Execution
No cache coherence issues arise: the host-side PCU operates through the caches, like a normal instruction.
[Figure: same as Slide 28.]

  30. Host-side PEI Execution
A completion notification is sent when the host-side PCU finishes.
[Figure: same as Slide 28.]

  31. Mechanism Summary
- Atomicity of PEIs: the PIM directory implements reader-writer locks
- Locality-aware PEI execution: the locality monitor simulates cache replacement behavior
- Cache coherence for PEIs: memory-side execution uses back-invalidation/back-writeback; host-side execution needs no special handling
- Virtual memory for PEIs: the host processor performs address translation before issuing a PEI
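
Putting the pieces together, a minimal sketch of the PMU's dispatch decision, reusing the LocalityMonitor sketch from Slide 21 (names are illustrative):

    #include <cstdint>

    enum class ExecSite { Host, Memory };

    // The locality monitor is consulted (and updated) for every PEI: a
    // partial-tag hit predicts the block is cached, so the PEI runs on a
    // host-side PCU; a miss ships it to a memory-side PCU.
    ExecSite choose_execution_site(LocalityMonitor& monitor,
                                   std::uintptr_t block_addr) {
        return monitor.access(block_addr) ? ExecSite::Host
                                          : ExecSite::Memory;
    }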

  32. Simulation Configuration
- In-house x86-64 simulator based on Pin
- 16 out-of-order cores, 4 GHz, 4-issue
- 32 KB private L1 I/D-caches, 256 KB private L2 caches
- 16 MB shared 16-way L3 cache, 64 B blocks
- 32 GB main memory with 8 daisy-chained HMCs (80 GB/s)
- PCU: 1-issue computation logic, 4-entry operand buffer; 16 host-side PCUs at 4 GHz, 128 memory-side PCUs at 2 GHz
- PMU: PIM directory with 2048 entries (3.25 KB); locality monitor similar to an LLC tag array (512 KB)

  33. Target Applications
Ten emerging data-intensive workloads:
- Large-scale graph processing: average teenage followers, BFS, PageRank, single-source shortest path, weakly connected components
- In-memory data analytics: hash join, histogram, radix partitioning
- Machine learning and data mining: streamcluster, SVM-RFE
Three input sets (small, medium, large) per workload to show the impact of data locality.

  34. Speedup (Large Inputs, Baseline: Host-Only)
[Figure: speedup (0% to 70%) of PIM-Only and Locality-Aware over Host-Only for ATF, BFS, PR, SP, WCC, HJ, HG, RP, SC, SVM, and their geometric mean (GM).]

  35. Speedup (Large Inputs, Baseline: Host-Only)
[Figure: the speedup chart from Slide 34, paired with the normalized amount of off-chip transfer (0 to 1.2, relative to Host-Only) for Host-Only, PIM-Only, and Locality-Aware on the same ten workloads.]

  36. Speedup (Small Inputs, Baseline: Host-Only)
[Figure: speedup (-60% to 60%) of PIM-Only and Locality-Aware over Host-Only for the ten workloads and their geometric mean.]

  37. Speedup (Small Inputs, Baseline: Host-Only)
[Figure: the speedup chart from Slide 36, paired with the normalized amount of off-chip transfer (0 to 8, with outliers of 16.1, 502, and 408 labeled) for Host-Only, PIM-Only, and Locality-Aware.]

  38. Speedup (Medium Inputs, Baseline: Host-Only)
[Figure: speedup (-10% to 70%) of PIM-Only and Locality-Aware over Host-Only for the ten workloads and their geometric mean.]

  39. Sensitivity to Input Size
[Figure: across the nine graph inputs of Slide 9 (ordered by increasing size), speedup (-30% to 60%, left axis) of PIM-Only and Locality-Aware, and PIM % (0% to 100%, right axis), the fraction of PEIs executed in memory under Locality-Aware.]

  40. Multiprogrammed Workloads
[Figure: normalized performance (0 to 1.8, baseline Host-Only) of Host-Only, PIM-Only, and Locality-Aware across randomly generated multiprogrammed workloads indexed 0 to 198.]

  41. Energy Consumption
[Figure: energy normalized to Host-Only (0 to 1.5) for Host-Only, PIM-Only, and Locality-Aware on small, medium, and large inputs, broken down into cache, HMC link, DRAM, host-side PCU, memory-side PCU, and PMU components.]

  42. Conclusion
Challenges of PIM architecture design:
- Cost-effective integration of logic and memory
- Unconventional programming models
- Lack of interoperability with caches and virtual memory
PIM-enabled instructions: a low-cost PIM abstraction and hardware.
- Interfaces PIM operations as an ISA extension
- Simplifies cache coherence and virtual memory support for PIM
- Locality-aware execution of PIM operations
Evaluations:
- 47% speedup over Host-Only on large inputs; 32% speedup over PIM-Only on small inputs
- Good adaptivity across randomly generated multiprogrammed workloads

  43. PIM-Enabled Instructions: A Low-Overhead, Locality-Aware PIM Architecture
Junwhan Ahn, Sungjoo Yoo, and Kiyoung Choi (Seoul National University); Onur Mutlu (Carnegie Mellon University)
