Transparent Hardware Management of Stacked DRAM for Memory Systems
This presentation explores the use of stacked DRAM as Part of Memory (PoM) to increase overall memory capacity and avoid data duplication. It covers OS-managed PoM and its challenges, shows how a hardware-managed PoM can eliminate OS-related overhead, and discusses the practical implications and evaluation of the proposed architecture.
Presentation Transcript
Transparent Hardware Management of Stacked DRAM as Part of Memory
Jaewoong Sim, Alaa R. Alameldeen, Zeshan Chishti, Chris Wilkerson, Hyesoon Kim
MICRO-47 | December 2014
Heterogeneous Memory System
Die-stacking is happening NOW: JEDEC HBM and Wide I/O 2 standards, Micron's Hybrid Memory Cube (HMC).
Two ways to use stacked DRAM as fast memory next to slow off-chip memory:
- Use it as a large cache (DRAM$): fast, but duplicates data held in off-chip memory.
- Use it as part of memory (PoM): a single flat address space spanning fast and slow memory.
Q: How do we design the PoM architecture?
Stacked DRAM as PoM
PoM architecture: increase overall memory capacity by avoiding duplication.
Static PoM: the physical address space is statically mapped to fast and slow memory.
Figure: a flat address space from 0x0 to 0x4FFFFFFFF with 4GB of FAST memory and 16GB of SLOW memory, so only about 20% of the space is fast. Hot data therefore needs migration into fast memory.
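As a rough illustration of the static mapping above, the sketch below (a minimal C example, assuming the 4GB fast / 16GB slow split from the slide and that the low addresses are the ones backed by fast memory) shows how a static PoM decides where an address lives; nothing ever moves, which is exactly the limitation migration is meant to address.

```c
#include <stdint.h>
#include <stdbool.h>

#define FAST_MEM_BYTES (4ULL << 30)    /* 4GB of stacked (fast) DRAM   */
#define SLOW_MEM_BYTES (16ULL << 30)   /* 16GB of off-chip (slow) DRAM */

/* Static PoM: a fixed split of the flat physical address space. Here the
 * low 4GB is assumed to be served by fast memory and everything above it
 * by slow memory, so only ~20% of the 20GB space is fast and hot data
 * outside that range can never be served quickly.                       */
static inline bool is_fast_memory(uint64_t phys_addr) {
    return phys_addr < FAST_MEM_BYTES;
}
```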
Stacked DRAM as PoM
OS-managed PoM (interval-based): during the Nth interval the application runs while hardware counters profile every active page; at the end of the interval an interrupt invokes the OS handler, which migrates pages between the (few) fast-memory slots and slow memory, updates the page table, and flushes TLBs.
Disadvantages:
- The interval must be large enough to amortize this cost, so short-term hot pages are often missed.
- It requires costly monitoring hardware (counters for every active page).
- Migration happens at OS page granularity (4KB or 2MB).
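To make the interval-based flow concrete, here is a hypothetical sketch of the work the OS handler performs at the end of each interval; the page count, the four fast-memory slots, the data structures, and the hottest-first selection rule are illustrative assumptions, and the actual migration, page-table update, and TLB flush are only noted in comments.

```c
#include <stdint.h>
#include <stdlib.h>

#define NUM_PAGES      1024   /* active pages being monitored (assumed)   */
#define NUM_FAST_SLOTS 4      /* fast-memory page slots, as in the slide  */

static uint32_t access_count[NUM_PAGES];    /* per-page HW counters       */
static int      fast_slot[NUM_FAST_SLOTS];  /* page occupying each slot   */

static int cmp_hotness(const void *a, const void *b) {
    uint32_t ca = access_count[*(const int *)a];
    uint32_t cb = access_count[*(const int *)b];
    return (cb > ca) - (cb < ca);             /* descending by hotness    */
}

/* Invoked by the end-of-interval interrupt. */
void os_interval_handler(void) {
    int page_ids[NUM_PAGES];
    for (int i = 0; i < NUM_PAGES; i++) page_ids[i] = i;

    /* Compare/sort every monitored page by its counter (costly at scale). */
    qsort(page_ids, NUM_PAGES, sizeof(int), cmp_hotness);

    /* Keep only the hottest pages in fast memory for the next interval.
     * In a real OS this is where pages are migrated (4KB/2MB granularity),
     * the page table is updated, and TLBs are flushed.                    */
    for (int s = 0; s < NUM_FAST_SLOTS; s++)
        fast_slot[s] = page_ids[s];

    /* Reset counters so the next interval starts fresh. */
    for (int i = 0; i < NUM_PAGES; i++) access_count[i] = 0;
}
```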
Stacked DRAM as PoM
Potential of HW-managed PoM: it eliminates the OS-related overhead, and migration can happen at any time.
Chart: fraction of LLC misses serviced from fast memory as the migration interval shrinks from 10M to 1M, 100K, and 10K cycles; the annotation indicates roughly 40% more LLC misses serviced from fast memory at the shorter intervals (average across workloads).
Goal: enable a practical, hardware-managed PoM architecture.
Outline
Motivation | Hardware-Managed PoM: Challenges, A Practical PoM Architecture | Evaluations | Conclusion
Hardware-Managed PoM
Challenges of HW-Managed PoM: metadata for GBs of memory!
Challenges of HW-Managed PoM (1): Hardware-Managed Indirection
Requirement: relocate memory pages in an OS-transparent manner.
Challenge 1: maintain the integrity of the OS's view of memory.
Approach 1: modify the OS page table via hardware (unattractive).
Approach 2 (our approach): add a second level of indirection through a remapping table, i.e. two-level indirection with a remapping cache. The page table physical address (PTPA) is remapped to a DRAM physical address (DPA) at the remapping granularity.
Where to architect this table? For 2GB of stacked DRAM and a 2KB segment size, the remapping table is tens of MBs, its lookup is added to every memory request, and it adds tens of cycles of latency.
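A minimal sketch of the second level of indirection, assuming the 2KB segment granularity from the slide; the entry width and table layout are assumptions. The point it illustrates is why the table cannot live on chip: with millions of 2KB segments, even a few bytes per entry adds up to tens of MBs, so the table sits in DRAM and only a small remapping cache stays on chip.

```c
#include <stdint.h>

#define SEGMENT_BYTES 2048ULL   /* 2KB remapping granularity (from the slide) */

typedef struct {
    uint32_t *srt;       /* segment remapping table: old segment -> new segment */
    uint64_t  num_segs;  /* e.g. 20GB / 2KB = 10M entries, i.e. tens of MBs     */
} pom_remap_t;

/* Second-level translation: the page table physical address (PTPA) produced
 * by the ordinary OS page table is remapped to the DRAM physical address
 * (DPA) that is actually sent to memory.                                     */
uint64_t ptpa_to_dpa(const pom_remap_t *r, uint64_t ptpa) {
    uint64_t seg    = ptpa / SEGMENT_BYTES;        /* which 2KB segment         */
    uint64_t offset = ptpa % SEGMENT_BYTES;        /* offset within the segment */
    return (uint64_t)r->srt[seg] * SEGMENT_BYTES + offset;   /* DPA             */
}
```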
Challenges of HW-Managed PoM (2): Efficient Memory Activity Tracking/Replacement
Challenge 2: provide efficient memory-usage monitoring and replacement mechanisms.
A naive activity-tracking structure keeps one counter per memory page: with 8GB of total memory and 4KB pages, that is as many as 2M entries, i.e. MBs of storage just for counters, and comparing/sorting all those counters is non-trivial and leads to unresponsive decisions.
Our approach: competing-counter-based tracking and replacement.
Hardware-Managed PoM
A Practical PoM Architecture (1): Two-Level Indirection
A Practical PoM Architecture
Conventional system: the OS page table translates a virtual address (VA) to a page table physical address (PTPA), which is the actual address of the data in memory, and DRAM is accessed with it directly.
PoM system: the OS page table still translates VA to PTPA, but a hardware segment remapping table then remaps the PTPA to the DRAM physical address (DPA) that is actually accessed.
A Practical PoM Architecture
PoM system: VA -> PTPA (OS page table) -> DPA (hardware Segment Remapping Table, SRT).
Example: a request arrives for segment N+27, which was originally mapped to slow memory. The processor first looks up the on-chip Segment Remapping Cache (SRC) with the PTPA; on an SRC miss, the corresponding SRT entry (entry 1 in this example) is read from fast memory and cached in the SRC, and the request is then issued to the remapped location using the DPA.
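The SRC-in-front-of-SRT interaction might look like the sketch below; the SRC organization (direct-mapped, 1024 entries) and the plain array standing in for the SRT stored in fast DRAM are assumptions made purely for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define SRC_ENTRIES 1024          /* on-chip remapping cache size (assumed) */
#define SRT_ENTRIES (1 << 20)     /* SRT entries resident in fast DRAM      */

typedef struct { bool valid; uint64_t seg; uint32_t remap; } src_entry_t;

static src_entry_t src[SRC_ENTRIES];   /* Segment Remapping Cache (SRAM)    */
static uint32_t    srt[SRT_ENTRIES];   /* Segment Remapping Table (fast DRAM,
                                          modeled here as a plain array)    */

/* Translate a segment number via the SRC; an SRC miss costs one extra read
 * of the SRT in fast memory and refills the cache.                         */
uint32_t lookup_remap(uint64_t seg) {
    src_entry_t *e = &src[seg % SRC_ENTRIES];
    if (e->valid && e->seg == seg)
        return e->remap;                       /* SRC hit: no extra access   */

    uint32_t remap = srt[seg % SRT_ENTRIES];   /* SRC miss: fast-DRAM read   */
    e->valid = true; e->seg = seg; e->remap = remap;
    return remap;
}
```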
Segment-Restricted Remapping
Can we simply cache some SRT entries? With unrestricted remapping, the remapping information for a segment (e.g. segment N+27) can be anywhere in the SRT, so locating it on an SRC miss may require up to N look-ups instead of a couple: a single SRC miss may trigger many memory accesses to fast memory.
Segment-Restricted Remapping
How do we minimize the SRC miss cost? Each segment is allowed to be remapped only to certain slots: for example, segments A, C, and Y may only map to entry 0, while segments B, D, and Z may only map to entry 1. On an SRC miss for segment A, C, or Y, only entry 0 is looked up; for segment B, D, or Z, only entry 1.
Segment-restricted remapping reduces the SRC miss cost to a single FAST DRAM access.
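A small sketch of the restriction, assuming a simple modulo grouping rule (the paper describes the actual grouping): because every segment is tied to exactly one SRT entry, an SRC miss knows exactly which entry to fetch from fast memory.

```c
#include <stdint.h>

#define NUM_FAST_SLOTS (1 << 20)   /* one SRT entry per fast-memory slot (assumed) */

typedef struct {
    uint32_t seg_in_fast;   /* which competing segment currently occupies this
                               fast slot; the rest of its group stays at its
                               original slow-memory location                     */
} srt_entry_t;

/* Segment-restricted remapping: each segment may live only in the fast slot
 * its index selects, e.g. SEG A, C, Y -> entry 0 and SEG B, D, Z -> entry 1. */
static inline uint64_t srt_entry_of(uint64_t seg) {
    return seg % NUM_FAST_SLOTS;
}
```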
Hardware-Managed PoM
A Practical PoM Architecture (2): Memory Activity Tracking and Replacement
Competing Counter
How do we compare the counters of all involved segments? The information of interest is a segment's access count relative to its competitors, not its absolute count.
Simple case: one slot exists in fast memory. A single counter is shared between the segment currently in fast memory (SEG Y) and the competing segment in slow memory (SEG A): accesses to one segment increment the counter and accesses to the other decrement it, so the counter's value tells us directly which segment is worth keeping in FAST memory.
Competing Counter
General case: how do we compare the counters of all involved segments? Keeping one counter per fast/slow pair (C1 for SEG Y vs. SEG A, C2 for SEG Z vs. SEG B, C3 for SEG Y vs. SEG C, C4 for SEG Z vs. SEG D) bounds the number of counters by the number of segments in slow memory. With segment-restricted remapping, all segments that compete for the same fast-memory slot can instead share one counter: C1 is shared by SEG Y and its competitors SEG A and SEG C, and C2 by SEG Z and its competitors SEG B and SEG D. Sharing a counter among competing segments bounds the number of counters by the number of segments in fast memory.
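A minimal sketch of one shared competing counter, covering both the simple and the general case above; the threshold value and the exact increment/decrement policy are illustrative assumptions rather than the paper's tuned parameters.

```c
#include <stdint.h>
#include <stdbool.h>

#define SWAP_THRESHOLD 64   /* assumed; the paper tunes this per application */

typedef struct {
    uint32_t seg_in_fast;   /* segment currently occupying this fast slot    */
    int32_t  counter;       /* one counter shared by all competing segments  */
} slot_state_t;

/* Called on every access to a segment in this slot's competing group.
 * Accesses to the resident fast segment push the counter down, accesses to
 * any slow competitor push it up; once a competitor has out-accessed the
 * resident segment by the threshold, it becomes the new resident.          */
bool on_access(slot_state_t *slot, uint32_t seg) {
    if (seg == slot->seg_in_fast) {
        if (slot->counter > 0) slot->counter--;
        return false;
    }
    if (++slot->counter >= SWAP_THRESHOLD) {
        slot->seg_in_fast = seg;    /* record the new resident segment       */
        slot->counter = 0;
        return true;                /* caller swaps the 2KB segments of data */
    }
    return false;
}
```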
More discussions in the paper!
Two-Level Indirection: how to design the Segment Remapping Table and Cache.
Competing Counters: swapping criteria, i.e. how to determine the threshold for different applications.
Swapping Operation: fast swap vs. slow swap, which affects the remapping table size.
Evaluations
Methodology
Workloads: 14 workloads, each a multi-programmed mix of SPEC CPU2006 benchmarks.
System parameters:
- CPU core: 4 cores, 3.2GHz, out-of-order.
- SRC: 4-way, 32KB, LRU policy.
- Die-stacked DRAM: 1.6GHz bus (DDR 3.2GHz), 128 bits per channel; 4 channels / 1 rank / 8 banks; 2KB row buffer; tCAS-tRCD-tRP 8-8-8.
- Off-chip DRAM: 800MHz bus (DDR 1.6GHz), 64 bits per channel; 2 channels / 1 rank / 8 banks; 16KB row buffer; tCAS-tRCD-tRP 11-11-11.
Swapping parameters: 2KB segment granularity; 1.2K CPU cycles swap latency.
Performance
Configurations (speedup over a system with no stacked DRAM): Static (1:8, no migration), OS-Managed (100M-cycle interval, migration cost included), OS-Managed with zero-cost swap (100M-cycle interval, migration cost ignored), and Proposed (HW-managed PoM, migration cost included).
Chart: speedup for WL-1 through WL-14 and the average (AVG.); the average gains annotated on the chart are 7.5%, 19.1%, and 31.6%.
SRC: Address Translation Breakdown
Chart: per-workload breakdown of address translations into HIT_FAST, HIT_SLOW, MISS_FAST, and MISS_SLOW, where HIT/MISS denotes an SRC hit or miss and FAST/SLOW denotes whether the request was serviced from fast or slow memory. On average, the SRC hit rate is above 95%.
Conclusion
Conclusion
Goal: enable a practical, hardware-managed PoM.
Challenge 1: maintaining a large indirection table. Solution: two-level indirection with a remapping cache and segment-restricted remapping.
Challenge 2: providing efficient memory activity tracking/replacement. Solution: competing-counter-based tracking and swapping.
Result: a practical, hardware-managed PoM that is 18.4% faster than static mapping, with very little additional on-chip SRAM storage overhead (7.8% of the SRAM LLC).