
Addressing Prolonged Restore Challenges in DRAM Scaling
Explore the challenges of scaling DRAM in the face of technology advancements, with a focus on prolonged restore issues. Learn about the complexities of DRAM operations, scaling trends, and the demands for continued scaling to meet evolving computational needs.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
PhD Thesis Defense Jul 14, 2017 (Friday) Addressing Prolonged Restore Challenges in Further Scaling DRAMs Xianwei Zhang Committees: Youtao Zhang (advisor) CS, Pitt Bruce R. Childers CS, Pitt Jun Yang ECE, Pitt Guangyong Li ECE, Pitt Wonsun Ahn CS, Pitt
MAIN MEMORY MAIN MEMORY DRAM Processor Memory Storage Main memory is critical for system performance 2
DRAM DRAM Transistor cell Capacitor DRAM DRAM Cell 2D Array DIMM/Chip The simplicity enabled DRAM to continuously scale 3
SCALING SCALING Technology Scaling $80,000 800MHz 3.0V $1,000 1.8V 400 1.2V $10 200 Voltage Perf/BW Cost Do we still need DRAM to continue scale? 4
DEMANDS DEMANDS Data Intensive Apps Tight Power Budgets Increasing Computation DRAM must keep scaling to meet demands 5
SCALING TREND SCALING TREND Process Tech. 8Gb12Gb 4Gb 90nm 1Gb 512Mb 256Mb 128Mb 45nm 64Mb 16Mb 30nm 22nm Sub-20nm 4Mb ? 1Mb 1985 1995 2005 2015 Chip Density Data: IBM 2010 DRAM scaling is getting more difficult 6
DRAM OPERATIONS DRAM OPERATIONS Wordline Transistor abstract Capacitor Capacitor Bitline Bitline SenseAmp Precharged Sharing Sensing/Restoring Restored Precharged Vdd V .5Vdd T RD T PRE T T ACT T tRCD(13.75ns) tRP(13.75ns) tRAS(35ns) 7
WHY DIFFICULT? WHY DIFFICULT? Wordline Transistor Technology Scaling Capacitor Bitline SenseAmp Less charge higher leakage current Larger resistance Lower voltage Nearer cells Process variations Larger resistance Weaker signal More Leaky Longer Sensing Prolonged Restore Severer Noise 8
RESTORE ISSUE RESTORE ISSUE cell dist. Low yield scale Bad perf restore More cells will be violating the JEDEC specifications 9
THESIS STATEMENT THESIS STATEMENT Enable DRAM further scaling without low yield and degraded performance 10
CANDIDATE SOLUTIONS CANDIDATE SOLUTIONS Work on slow ones perf yield perf yield perf yield Relax standard Cutoff slow ones Expose slow cells to architectural levels 11
THESIS OVERVIEW THESIS OVERVIEW Address Restore Issues in Further Scaling DRAMs Mitigate restore w/ approximate computing [DrMP PACT17, Award MemSys16] Fast restore via reorganization and page alloc [CkRemap DATE15, Alloc TODAES17] DDR Partial restore based on refresh distance [RT-Next HPCA16] 12
OUTLINE OUTLINE RT-Next Partial restore based on refresh distance CkRemap Fast restore via reorganization and allocation DDR DrMP Mitigate restore with approximate computing Summary and Research Directions 13
CHARGING CHARGING - - RESTORE RESTORE Wordline Vcell Transistor Vfull Capacitor Bitline tRAStRAS SenseAmp 0V Time(ns) Post-access restore - Fully charge cells - Read (tRAS), Write (tWR) Prolonged restore leads to slow read/write 14
CHARGING CHARGING - - REFRESH REFRESH Wordline Vcell Transistor Vfull Capacitor Vmin Bitline 64ms SenseAmp Time(ms) Charge leakage - Cell charge decays over time Refresh operation - Periodically fully charge cells to avoid data loss Do we still need to fully restore the cell after r/w? 15
PARTIAL PARTIAL- -RESTORE OPPORTUNITIES RESTORE OPPORTUNITIES Vcell Vcell Vfull Vmin tRAS Time(ms) Time(ns) NxtRef Read 1 Answer: YES and NO Read 1: Yes ! 16
PARTIAL PARTIAL- -RESTORE OPPORTUNITIES RESTORE OPPORTUNITIES Vcell Vcell Vfull Vmin Vx 0V Time(ms) Time(ns) Read 1 Read 2 NxtRef Do we always fully restore? Read 1: Yes ! Read 2: No! It is safe to partially charge to Vx But, how should we determine Vx? 17
DETERMINE VX DETERMINE VX Vcell Vcell Vfull V1 V2 Vmin V3 V4 tRAS Time(ms) Time(ns) NxtRef Linear restore curve - Data is safe as long as the voltage is above decay curve Use four sub-windows - Save a set of timings for each Charging goal: Vmax of each sub-window 18
RT RT- -next: RESTORE W.R.T NEXT REFRESH next: RESTORE W.R.T NEXT REFRESH Vcell Vcell V2 Vfull Vmin 64ms 40ms tRAS Time(ms) Time(ns) Read NxtRef Check the sub-window read/write falls into Apply the timings to achieve the charging goal Example: 40ms to the next refresh, 2nd window, charge to V2 19
MULTI MULTI- -RATE REFRESH RATE REFRESH Vcell Vfull Vmin 64ms 128ms 104ms Time(ns) Read NxtRef Multi-rate refresh - Over 64ms row, same four-window division 20
REFRESH UPGRADE REFRESH UPGRADE Vcell win1 win3 (V1 V3) Vfull Vmin 104ms 40ms Time(ns) Read Read NxtRef Multi-rate refresh - Over 64ms row, same four-window division Refresh upgrade - More frequent refresh, the closer distance to next refresh - Lower charging goal for restore 21
UPGRADE REFRESH DESIGNS UPGRADE REFRESH DESIGNS Vcell win1 win3 (V1 V3) Vfull Vmin 104ms 40ms Time(ns) NxtRef Read Read Blindly upgrade (RT-all) - More refreshes, increasing overheads on performance and energy Selectively upgrade (RT-sel) - Only upgrade touched row/bin - Back to low-rate afterwards 22
PERFORMANCE PERFORMANCE 1.4 RT-next RT-all (blind upgrade) RT-sel (select upgrade) Speedup wrt Baseline 19.5% 15% 1.2 1 0.8 COM SPEC PARSEC BIO Gmean RT-next is 15% over Baseline because of restore truncation RT-all becomes worse because of refresh penalty RT-sel achieves the best result by balancing refresh and restore 23
COMPARE TO STATE COMPARE TO STATE- -OF OF- -ARTS ARTS 1.4 PRT-free ArchShield+[ISCA13] MCR[ISCA15] RT-sel Speed up wrt Baseline 1.3 1.2 5.2% 1.1 19.5% 1 COM SPEC PARSEC BIO Gmean While ArchShield+ is close to PRT-free, RT-sel is 5.2% better While losing 50% capacity, MCR is still worse 24
SUMMARY: RT SUMMARY: RT- - Prolonged restore issue in future DRAM Restore and refresh are strongly correlated RT-next: truncate restore w/ refresh distance RT-sel: expose more restore opportunities Balances refresh and restore, beats state-of-arts Performance: 19.5% improvement results 25
OUTLINE OUTLINE RT-Next Partial restore based on refresh distance CkRemap Fast restore via reorganization and allocation DDR DrMP Mitigate restore with approximate computing Summary and Research Directions 26
DRAM ORGANIZATION DRAM ORGANIZATION Logical Bank Rank Chip Physical Bank Physical bank: chip level, a portion of memory arrays Logical bank: rank level, one physical bank from each chip How to utilize the organization to solve restore? 27
MOTIVATION MOTIVATION chip1 bank0 chip0 bank0 rank0 bank0 22 16 bank0 23 18 19 bank1 bank0 18 20 23 bank1 24 bank1 20 19 bank1 17 24 22 bank1 24 16 17 24 Single set of timings for the whole memory Cells are more statistical in smaller nodes Too pessimistic to decide by the worst case 28
CHUNK CHUNK- -SPECIFIC RESTORE SPECIFIC RESTORE chip1 bank0 18 chip0 bank0 23 rank0 bank0 23 rank0 bank0 bank0 bank0 22 23 18 19 19 16 18 20 23 23 24 23 bank1 24 bank1 bank1 24 bank1 19 bank1 bank1 20 24 16 17 17 19 17 24 22 24 24 24 Partition each chip bank into multi chunks Set chunk-level timings Expose timings to memory controller (MC) Slow & fast chunks can still be combined together 29
FAST CHUNK W/ REMAPPING FAST CHUNK W/ REMAPPING chip1 bank0 chip0 bank0 rank0 bank0 18 23 18 24 19 23 bank1 bank1 bank1 19 24 19 24 17 24 Partition bank into chunks Detect chip-chunk timings Remap chunks within each chip-bank Bad chip leads to slow rank even w/ remapping 30
RANK CONSTRUCTION (BIN) RANK CONSTRUCTION (BIN) Formed ranks Clustering bins DRAM chips b0 chip 1 chip n b1 chip N bM Cluster chips into bins using similarity Construct ranks using chips from each bin How to fully utilize the exposed fast regions? 31
RESTORE RESTORE- -AWARE PAGE ALLOCATION AWARE PAGE ALLOCATION 100 Cumulative accesses (%) fast 80 60 hot MMU 40 403.gcc 436.cac 470.lbm 410.bwa 450.sop 20 Virtual Pages 0 Physical Frames 0 20 Portion of pages (%) 40 60 80 100 Accesses come from a small set of pages 32
PERFORMANCE PERFORMANCE PRT-Baseline Spare ECC 54% CkRemapBin 1.6 Norm. Exeution Time 37% 1.4 15% 1.2 1 0.8 470.lbm 401.bzi 400.per Spec-High Spec-All Prolonged restore significantly hurts performance Classical repair approaches offer limited help With chunk remap and rank construction, avg 15% shorter 33
PAGE ALLOCATION EFFECTS PAGE ALLOCATION EFFECTS Chunk ChunkBin CkRemap CkRemapBin 1.3 10.5% Norm. Execution Time 16.5% 1.1 0.9 Spec-All_rand Spec-All_prof Chunk-remap & rank-construction expose more fast chunks - provide more opportunities for page-allocation Restore-aware page allocation effectively reduce time 34
SUMMARY: SUMMARY: CkRemap CkRemap Further scaling restore has serious PV effects Worse-case based approaches are ineffective CkRemap: construct fast chunks via remapping PageAlloc: fully utilize the exposed fast regions results Performance: as high as 25% avg improvement Page alloc: hotness-aware alloc maximize gains 35
OUTLINE OUTLINE RT-Next Partial restore based on refresh distance CkRemap Fast restore via reorganization and allocation DDR DrMP Mitigate restore with approximate computing Summary and Research Directions 36
APPLICATION CHARACTERISTICS APPLICATION CHARACTERISTICS Credit: www-d0.fnal.gov Credit: image-net.org Credit: www.itbusiness.ca/ Machine Learning Computer Vision Big Data Analytics Many applications can tolerate accuracy loss 37
RESTORE RESTORE- -BASED APPROXIMATION BASED APPROXIMATION precise RT-Next CkRemap approximate Just Errors Will the final output always be acceptable? 38
MOTIVATION RESULTS MOTIVATION RESULTS 100 90.95 tWR=30 tWR=15 tWR=20 tWR=12 OUTPUT ACCURACY LOSS 61.36 46.49 24.2 4.48 1.65 0.7 KMEANS LU Accuracy loss steadily enlarges along tWR decrease Final output quality must be controlled Applications show vastly different behaviors 39
CRITICAL DATA CRITICAL DATA error-resilient error-sensitive pixels pointers jump targets neuron weights video frames meta data Critical data cannot be approximated 40
BITS ARE NOT EQUALLY IMPORTANT BITS ARE NOT EQUALLY IMPORTANT msb Int/byte R G B 1 7 sign exponent mantissa Float 1 8 23 Double 1 11 52 25% 50% There is a tradeoff between accuracy and overhead 41
DrMP DrMP: APPROXIMATE DRAM ROW : APPROXIMATE DRAM ROW exponent mantissa exponent mantissa sign sign 1 8 23 1 8 23 Remapping Map-4 17 18 Worst Map-2 15 16 chip0 chip1 chip3 chip4 chip5 chip7 chip2 chip6 tWR=24 tWR=23 14 14 17 17 20 24 15 15 18 17 17 20 18 18 15 15 19 16 16 18 17 17 23 19 8b 8b 8b 8b 8b 8b 8b 8b 64b 2 floating points What if there aren t that much approx data? 42
DrMP DrMP : PRECISE + APPROX : PRECISE + APPROX exponent mantissa exponent mantissa sign sign 1 8 23 1 8 23 Precise + Approx Approx 19 18 Paired 19 24 all precise Worst chip0 chip1 chip3 chip4 chip5 chip7 chip2 chip6 15 15 tWR=24 tWR=23 14 14 17 17 20 X 24 X 17 17 18 20 X 19 19 17 17 18 18 19 19 16 18 15 23 X 8b 8b 8b 8b 8b 8b 8b 8b 64b Pair two rows to re-combine chip segments - Choose smaller one from each location to form a fast one (Precise) Guarantee partial precise for the other slow row 43
OUTPUT QUALITY OUTPUT QUALITY 100 100 100 Base-2(2chips) Base-4(4chips) DrMP-2(2chips) DrMP-4(4chips) Output Accuracy Loss (%) 80 55.22 60 40 16.75 12.08 20 7.51 5.97 4.48 1.98 1.86 1.82 1.56 0.77 4.2 0.45 0.31 0.29 0.28 0.27 0.15 0.04 0.01 0 0 0 kmeans black ray sor lu smm Base-2 Base-4 Precise DrMP-2 DrMP-4 44
PERFORMANCE PERFORMANCE 1.6 DrMP-4 DrMP-2 RT+DrMP PRT-free Speedup wrt Baseline 1.4 8.7% 19.8% 1.2 1 0.8 kmeans black ray sor lu smm Gmean DrMP achieves 19.8% performance improvement - For apps with dominant approx data accesses, DrMP outperforms PRT-free Orthogonal to RT - RT+DrMP is 8.7% better than PRT-free 45
SUMMARY: SUMMARY: DrMP DrMP Many applications can tolerate output quality loss Restore can be used for approximate computing DrMP: balance restore reductions and accuracy DrMP : support both approximate and precise Output quality: no more than 1% accuracy loss Performance: 19.8% improvement results 46
OUTLINE OUTLINE RT-Next Partial restore based on refresh distance CkRemap Fast restore via reorganization and allocation DDR DrMP Mitigate restore with approximate computing Summary and Research Directions 47
SUMMARY SUMMARY DRAM must keep scaling to meet increasing demands Prolonged restore time has become a major hurdle RT-next: truncate restore using the time distance to next refresh CkRemap: construct fast access regions using DRAM organization DrMP: mitigate restore while guarantee acceptable output loss Performed pioneering studies on restore via modeling & simu Developed comprehensive schemes to mitigate restore issue Supported under NSF grants: CCF-1422331, CNS-1012070, CCF-1535755 and CCF-1617071 48
COMPARISON TO PRIOR ARTS COMPARISON TO PRIOR ARTS Sharing/Sensing timing reduction - Optimize DRAM internal structures [CHARM ISCA13, TL-DRAM HPCA13, etc] - Utilize existing timing margins [NUAT HPCA14, AL-DRAM HPCA15, etc] We are working at orthogonal restore issue in future DRAMs sense DRAM restore studies - Identify the restore scaling issue [Co-arch MEM14, tWR Patent15, etc] - Reduce restore timings [AL-DRAM HPCA15, MCR ISCA15, etc] We are working at future DRAMs with more effective solutions restore Memory-based approximate computing - Optimize storage density and lifetime [PCM/SSD MICRO13, PCM ASPLOS16, etc] - Skip DRAM refresh [Flikker ASPLOS11, Alloc CASES15, etc] We are the first work on restore-based approximation approx 49
FUTURE RESEARCH DIRECTIONS FUTURE RESEARCH DIRECTIONS Solve restore from reliability perspective - Treat Slow restore cells as faulty ones - Design stronger error correction codes Study security issues of restore variation - Restore variation info is DRAM s fingerprint - Solve both info leakage and slow restore Explore restore in 3D stacked DRAM - Stacking has thermal management issue - Reduce restore with temperature-aware solutions 50