Intelligent DRAM Cache Strategies for Bandwidth Optimization


Efficiently managing DRAM caches is crucial as memory demands grow and bandwidth becomes the bottleneck. This talk explores using DRAM as a cache, architecting large DRAM caches, and bandwidth-efficient replacement policies to improve memory bandwidth and capacity while keeping cache utilization high.





Presentation Transcript


  1. To Update or Not To Update? Bandwidth-Efficient Intelligent Replacement Policies for DRAM Caches. Vinson Young, Moinuddin Qureshi.

  2. MOORE'S LAW HITS THE MEMORY WALL
     [Figure: on-chip capacity and bandwidth demand growing against a fixed off-chip channel]
     Need solutions for increased memory bandwidth and capacity.

  3. EMERGING MEMORY TECHNOLOGIES
                          3D-DRAM    3D-XPoint (NVM)
     Memory Bandwidth     high       low
     Memory Capacity      small      large
     (relative to conventional DRAM)
     Emerging memory technologies offer bandwidth/capacity tradeoffs.

  4. DRAM AS A CACHE
     Memory hierarchy, fast to slow: CPU, L1$, L2$, L3$, DRAM-Cache (3D-DRAM / DRAM), System Memory (DRAM / NVM, the OS-visible space).
     Commercial examples: MCDRAM from Intel, Optane DC from Intel, HBCC from AMD.
     Using DRAM as a cache can improve memory bandwidth and capacity (and avoids OS/software changes).

  5. ARCHITECTING LARGE DRAM CACHES
     Organize at line granularity (64B) for high cache utilization.
     A gigascale cache needs a large tag-store (tens of MBs): 2GB of data in 3D-DRAM requires 64MB of tags. Where do the tags go? Too large for SRAM.
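
A quick back-of-the-envelope check of the 64MB figure (the 2B-per-line tag overhead is an assumption chosen to match the slide, not a number from the talk):

```python
# Tag-store sizing for a 2GB cache organized at 64B line granularity.
CACHE_BYTES = 2 * 1024**3      # 2GB DRAM cache
LINE_BYTES  = 64               # 64B cache lines
TAG_BYTES   = 2                # assumed tag+metadata per line (~2B)

num_lines = CACHE_BYTES // LINE_BYTES    # 32M lines
tag_store = num_lines * TAG_BYTES        # 64MB of tags
print(f"{num_lines // 2**20}M lines -> {tag_store // 2**20}MB tag-store")
```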

  6. ARCHITECTING LARGE DRAM CACHES
     Organize at line granularity (64B) for high cache utilization.
     A gigascale cache needs a large tag-store (tens of MBs): 64MB of tags for 2GB of data.
     Practical designs must store the tags in the DRAM itself. How do we architect the tag-store for low-latency tag access?

  7. BACKGROUND: DRAM CACHE ORGANIZATION
     Tag-With-Data (e.g., Alloy Cache): each 64B line is stored alongside its tag, so a single lookup fetches tag and data together (1x hit latency). The cache is direct-mapped, and a lookup is still needed to detect a miss.
     The baseline DRAM cache is 64B line-size, Tag-With-Data, and direct-mapped, optimizing for hit latency; it can have a low hit-rate. Can replacement policies help?
     This Tag-With-Data approach is used in the Intel Knights Landing product (MCDRAM).
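
To make the organization concrete, here is a minimal sketch of a direct-mapped Tag-With-Data lookup, with sizes scaled down for illustration; the names and structure are illustrative, not the paper's implementation:

```python
LINE_BYTES = 64
NUM_SETS = 1024                    # scaled down; one (tag, data) entry per set

cache = [None] * NUM_SETS          # each set holds a single (tag, data) pair

def lookup(addr):
    line = addr // LINE_BYTES      # 64B-aligned line address
    idx, tag = line % NUM_SETS, line // NUM_SETS
    entry = cache[idx]
    if entry is not None and entry[0] == tag:
        return entry[1]            # hit: tag and data stream out together (1x latency)
    return None                    # miss: detected only after the lookup

def install(addr, data):
    line = addr // LINE_BYTES      # direct-mapped: overwrite whatever is resident
    cache[line % NUM_SETS] = (line // NUM_SETS, data)
```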

  8. BACKGROUND: DRAM CACHE REPLACEMENT IS STATELESS
     Recency-based replacement: LRU is Always-Install (installs on 100% of misses).
     Probabilistic replacement: BIP installs bimodally (~10% of misses, 90% bypass), protecting the working set.
     Dueling-based replacement: DIP/BAB set-duels between the two and chooses the best policy, as sketched below.
     Stateless global bypass policies can improve hit-rate and save install bandwidth.
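
A minimal sketch of these stateless policies, assuming a 10% BIP install probability; the leader-set count and counter width in the dueling monitor are my choices, not the talk's:

```python
import random

BIP_INSTALL_PROB = 0.1                  # BIP: install ~10% of misses, bypass 90%

def lru_should_install(addr):
    return True                         # Always-Install

def bip_should_install(addr):
    return random.random() < BIP_INSTALL_PROB

class SetDueling:
    """DIP/BAB-style dueling: a few leader sets always use LRU, a few always
    use BIP; a saturating counter tracks which leader misses more, and all
    follower sets adopt the winner."""
    def __init__(self, num_sets, num_leaders=32, max_psel=1023):
        self.num_sets, self.num_leaders, self.max_psel = num_sets, num_leaders, max_psel
        self.psel = max_psel // 2

    def policy_for(self, set_idx):
        if set_idx < self.num_leaders:
            return "lru"                                # LRU leader sets
        if set_idx >= self.num_sets - self.num_leaders:
            return "bip"                                # BIP leader sets
        return "lru" if self.psel < self.max_psel // 2 else "bip"

    def on_miss(self, set_idx):
        if set_idx < self.num_leaders:                  # LRU leader missed
            self.psel = min(self.psel + 1, self.max_psel)
        elif set_idx >= self.num_sets - self.num_leaders:
            self.psel = max(self.psel - 1, 0)           # BIP leader missed
```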

  9. PERFORMANCE OF STATELESS POLICY
     [Figure: speedup of stateless bypass, ~3% on average]
     DRAM cache bypass can improve performance, but global policies are too coarse-grained to provide significant benefit.
     *System assumes a 2GB HBM-based cache and PCM-based main memory (~4x latency).

  10. DESIRED BEHAVIOR FROM CACHE REPLACEMENT
     Thrash-resistance: when the working set is larger than the cache (Wsize > LLCsize), preserve some of the working set so that part of it hits rather than everything missing.
     Scan-resistance: across recurring scans, preserve the frequently-referenced working set.
     [RRIP (ISCA '10)]

  11. MOST EFFECTIVE POLICIES ARE STATEFUL (RRIP)
     Each line carries a 2-bit re-reference prediction value (RRPV: 00, 01, 10, 11).
     1. Install: insert at an intermediate priority.
     2. Hit: promote to the highest priority (RRPV=0).
     3. Age: when all counters < 3, increment them; evict a line at the lowest priority (RRPV=3).
     Per-line RRIP protects against thrash and scans, but needs 2 bits per line and expensive state-updates. [Jaleel et al., ISCA '10]
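
A sketch of the 2-bit RRIP state machine for one set of a set-associative cache; inserting at RRPV=2 follows SRRIP and is one reading of the slide's diagram:

```python
RRPV_MAX = 3                    # 2-bit counter: 0 = highest priority, 3 = evictable

class RRIPSet:
    def __init__(self, ways=8):
        self.lines = {}         # tag -> RRPV
        self.ways = ways

    def access(self, tag):
        if tag in self.lines:
            self.lines[tag] = 0                  # 2. Hit: promote to highest priority
            return "hit"
        if len(self.lines) == self.ways:
            while all(v < RRPV_MAX for v in self.lines.values()):
                for t in self.lines:             # 3. Age: all counters < 3
                    self.lines[t] += 1
            victim = next(t for t, v in self.lines.items() if v == RRPV_MAX)
            del self.lines[victim]               # evict a lowest-priority line
        self.lines[tag] = RRPV_MAX - 1           # 1. Install at RRPV=2
        return "miss"
```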

  12. POTENTIAL OF IDEALIZED STATEFUL POLICY
     [Figure: speedup; an idealized per-line stateful policy reaches ~17% vs. ~3% for the stateless policy]
     Per-line reuse-based bypassing has higher potential; however, it incurs substantial bandwidth costs to keep its state up to date.
     *System assumes a 2GB HBM-based cache and PCM-based main memory (~4x latency).

  13. ENABLING INTELLIGENT REPLACEMENT POLICY
     Goal: formulate reuse-based bypassing for direct-mapped caches, and reduce the bandwidth cost of maintaining per-line state.

  14. POLICY: RRIP AGE-ON-BYPASS (RRIP-AOB)
     Note: every state-update costs bandwidth.
     Promote on hit (RRPV -> 00); demote on bypass (00 -> 01 -> 10 -> 11); evict at RRPV=11 and install.
     RRIP-AOB protects a reused line for 3 conflicts, enabling reused lines to be protected while cold lines are eventually replaced. But it suffers a high bandwidth cost from constantly updating state. [Jaleel et al., ISCA '10]
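
A sketch of RRIP-AOB on a direct-mapped cache, with a counter for the state-update traffic the slide warns about; the install RRPV and the one-write-per-update charge are assumptions:

```python
RRPV_MAX = 3

class RRIPAOBCache:
    def __init__(self, num_sets=1024):
        self.sets = [None] * num_sets        # each set: (tag, rrpv) or None
        self.update_bw = 0                   # state-update writes (the BW cost)

    def access(self, set_idx, tag):
        entry = self.sets[set_idx]
        if entry and entry[0] == tag:
            if entry[1] != 0:
                self.sets[set_idx] = (tag, 0)          # promote on hit
                self.update_bw += 1                    # ...which costs a state write
            return "hit"
        if entry is None or entry[1] == RRPV_MAX:
            self.sets[set_idx] = (tag, 2)              # evict + install (assumed RRPV=2)
            return "install"
        self.sets[set_idx] = (entry[0], entry[1] + 1)  # demote resident line on bypass
        self.update_bw += 1                            # ...another state write
        return "bypass"            # a reused line (RRPV=0) survives 3 such conflicts
```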

  15. REDUCE BANDWIDTH COSTS: REPLACEMENT BW BREAKDOWN
     Replacement bandwidth: Always-Install spends bandwidth only on installs; RRIP spends bandwidth on installs, promote-state updates, and demote-state updates.
     [Figure: replacement bandwidth w.r.t. Always-Install, broken into Install/Promote/Demote, across workloads (pr_twi, bc_twi, cc_twi, omnet, sphinx, cc_web, pr_web, zeusmp, libq, soplex, milc, nekbone, xalanc, gcc, leslie, mcf, wrf); Amean: RRIP ~111% total, of which installs are ~24%]
     RRIP reduces install bandwidth but spends additional bandwidth maintaining state (~13% speedup). Can we reduce the bandwidth cost?
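
A sketch of how this breakdown could be accounted, in units of DRAM-cache accesses normalized to Always-Install; charging one access per install, promote, and demote event is a simplifying assumption:

```python
from collections import Counter

def replacement_bw(events, always_install_count):
    """events: iterable of 'install' / 'promote' / 'demote' strings."""
    c = Counter(events)
    total = c["install"] + c["promote"] + c["demote"]
    return {**c, "normalized": total / always_install_count}

# RRIP's promote/demote traffic is what pushes its total above Always-Install
# even though it installs fewer lines.
print(replacement_bw(["install", "demote", "promote", "demote"], 4))
```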

  16. REDUCE BANDWIDTH COSTS: INSIGHT: SPATIAL LOCALITY OF REUSE
     Coresidency: a given page has many lines coresident in the cache.
     Eviction locality: coresident lines often have similar state on eviction; when one line of a page is evicted, its coresident lines are 78% likely to be evicted as well.
     [Figure: number of lines coresident on eviction (0-64) per workload]
     Insight: update state for only a representative line, and infer the state of the others.

  17. REDUCE BANDWIDTH COSTS: EXAMPLE
     Access sequence: Page A, Page B, Page B (the two pages conflict; each page spans sets 0-3).
     Time 0: install A; every set holds A at RRPV=2 (BW cost: Install A = 4).
     Time 1: first access to B. The first conflicting set (set 0) makes the update: bypass B and demote A to RRPV=3. Sets 1-3 simply follow without touching their own state, avoiding 3 updates (BW cost: Bypass B = 1).
     Time 2: second access to B. A is now evictable, so B is installed at RRPV=2 in all sets (BW cost: Install B = 4).
     Efficient Tracking of Reuse (ETR) on RRIP-AOB opportunistically reduces state-updates by exploiting spatial locality: a similar install policy at reduced bandwidth cost.

  18. REDUCE BANDWIDTH COSTS: ETR IMPLEMENTATION
     Recent Bypass Table (RBT): maps a page number to that page's last bypass decision (e.g., Page A -> 1, Page B -> 0, Page C -> 0).
     On an RBT miss, the line is a representative: 1. make a new decision, 2. update the RRIP state, 3. update the RBT.
     On an RBT hit, the line is a follower: follow the representative's decision, with no state-update.
     A 128-entry table (512B) is sufficient to implement ETR, as sketched below.
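
A minimal sketch of the RBT mechanism; the 128-entry size is from the slide, while LRU replacement of RBT entries and the 'install'/'bypass' decision encoding are assumptions:

```python
from collections import OrderedDict

RBT_ENTRIES = 128                 # 128 entries (~512B of SRAM)

class RecentBypassTable:
    def __init__(self):
        self.table = OrderedDict()               # page number -> last decision

    def access(self, page, make_decision):
        """make_decision() runs the full RRIP-AOB path and updates line state."""
        if page in self.table:                   # RBT hit: follower line,
            self.table.move_to_end(page)         # follow the representative's
            return self.table[page], False       # decision with no state-update
        decision = make_decision()               # RBT miss: representative makes
        self.table[page] = decision              # a new decision (RRIP state is
        if len(self.table) > RBT_ENTRIES:        # updated inside the callback)
            self.table.popitem(last=False)       # and records it in the RBT
        return decision, True

rbt = RecentBypassTable()
print(rbt.access(0xA, lambda: "bypass"))   # ('bypass', True): representative decides
print(rbt.access(0xA, lambda: "install"))  # ('bypass', False): follower just follows
```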

  19. REDUCE BANDWIDTH COSTS: ETR REPLACEMENT BW
     [Figure: replacement bandwidth w.r.t. Always-Install, split into Install/Promote/Demote per workload, for RRIP-AOB (left) and ETR on RRIP-AOB (right); ETR removes ~70% of the update bandwidth]
     ETR-RRIP eliminates about 70% of the state-update cost.

  20. RRIP-AOB AND EFFICIENT TRACKING OF REUSE (SPEEDUP)
     [Figure: speedup; ETR on RRIP-AOB bridges 90% of the gap to the idealized stateful policy]
     ETR on RRIP-AOB enables intelligent replacement at low bandwidth and storage cost (<1KB SRAM for the RBT), yielding an 18% speedup.
     *System assumes a 2GB HBM-based cache and PCM-based main memory (~4x latency).

  21. INTELLIGENT REPLACEMENT FOR DRAM CACHE
     Intelligent replacement: RRIP Age-On-Bypass (promote on hit; demote on bypass: 00 -> 01 -> 10 -> 11; evict and install at RRPV=11).
     Reduced state-update cost: spatial following via ETR.
     Bandwidth-efficient reuse-based bypassing enables an 18% speedup with <1KB of SRAM (within 2% of the speedup of an idealized scheme requiring 8MB of SRAM).


  23. METHODOLOGY (1/8TH KNIGHTS LANDING)
     CPU: 8 cores.
                    Stacked DRAM          NVM (PCM-based)
     Capacity       2GB                   64GB
     Bus            DDR 1.0GHz, 128-bit   DDR 2.0GHz, 64-bit
     Channels       4 channels            1 channel
     Bandwidth      64 GBps               16 GBps
     Latency        13~30 ns              13~143 ns
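
For reference, the same configuration expressed as a config dict (the structure and key names are mine, not from the talk):

```python
CONFIG = {
    "cpu": {"cores": 8},
    "dram_cache": {                         # Stacked DRAM
        "capacity": "2GB",
        "bus": "DDR 1.0GHz, 128-bit",
        "channels": 4,
        "bandwidth_GBps": 64,
        "latency_ns": (13, 30),
    },
    "main_memory": {                        # NVM (PCM-based)
        "capacity": "64GB",
        "bus": "DDR 2.0GHz, 64-bit",
        "channels": 1,
        "bandwidth_GBps": 16,
        "latency_ns": (13, 143),
    },
}
```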
