Challenges and Solutions in Memory Hierarchies for System Performance Growth

 
Moinuddin K. Qureshi
ECE, Georgia Tech
 
 
 
Memory Scaling is Dead,
Long Live Memory Scaling
Le Memoire Scaling est mort, vive le Memoire Scaling!
 
At Yale’s “Mid Career” Celebration at University of Texas at Austin, Sept 19 2014
The Gap in Memory Hierarchy
[Figure: typical access latency in processor cycles (@ 4 GHz) across the hierarchy: L1 (SRAM), EDRAM, DRAM, Flash, HDD, spanning roughly 2^1 to 2^23 cycles, with a large unfilled gap ("?????") between DRAM and storage]
Misses in main memory (page faults) degrade performance severely
Main memory system must scale to maintain performance growth
The Memory Capacity Gap
Trends: Core count doubling every 2 years.
DRAM DIMM capacity doubling every 3 years
Lim+ ISCA'09
Memory capacity per core expected to drop by 30% every two years
Challenges for DRAM: Scaling Wall
DRAM does not scale well to small feature sizes (sub 1x nm)
Increasing error rates can render DRAM scaling infeasible
Two Roads Diverged …
DRAM challenges:
1. Architectural support for DRAM scaling and to reduce refresh overheads
2. Find alternative technology that avoids problems of DRAM
Important to investigate both approaches
 
Outline
 
 
Introduction
 
ArchShield: Yield Aware (arch support for DRAM)
Hybrid Memory: reduce Latency, Energy, Power
Adaptive Tuning of Systems to Workloads
 
Summary
Reasons for DRAM Faults
Unreliability of ultra-thin dielectric material
In addition, DRAM cell failures also from:
Permanently leaky cells
Mechanically unstable cells
Broken links in the DRAM array
[Figure: DRAM cell capacitor with charge (Q) leaking from a permanently leaky cell, a mechanically unstable cell tilting towards ground, and broken links in the array]
Permanent faults for future DRAMs expected to be much higher
 
Row and Column Sparing
DRAM chips (organized into rows and columns) have spares
[Figure: DRAM chip before and after row/column sparing, showing faults, spare rows/columns, replaced rows/columns, and deactivated rows and columns]
Laser fuses enable spare rows/columns
Entire row/column needs to be sacrificed for a few faulty cells
Row and Column Sparing incurs large area overheads
 
Commodity ECC-DIMM
Commodity ECC DIMM with SECDED at 8 bytes (72,64)
Mainly used for soft-error protection
For hard errors, high chance of two errors in the same word (birthday paradox)
For an 8GB DIMM (~1 billion words): expected errors till double-error word = 1.25*Sqrt(N) = 40K errors (~0.5 ppm)
SECDED not enough at high error-rate (what about soft-errors?)
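As a quick sanity check, the birthday-paradox estimate above can be recomputed with a short sketch (my own, not from the talk); reading the 0.5 ppm figure as the corresponding bit error rate is my assumption:

```python
import math

# Birthday-paradox estimate: with N possible 8-byte words, the expected number of
# (single-bit) hard faults before two of them land in the same word is ~1.25*sqrt(N).
words = 2**30                                   # ~1 billion 8B words on an 8GB DIMM
faults_to_first_double_error = 1.25 * math.sqrt(words)
print(f"{faults_to_first_double_error:,.0f} faults")   # ~41,000 (the ~40K on the slide)

# Expressed over the DIMM's data bits (assuming one faulty bit per faulty word),
# this corresponds to roughly half a part per million.
data_bits = words * 64
print(f"{faults_to_first_double_error / data_bits * 1e6:.2f} ppm")   # ~0.6 ppm
```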
Dissecting Fault Probabilities
At Bit Error Rate of 10^-4 (100 ppm) for an 8GB DIMM (1 billion words):

Faulty bits per word (8B)   Probability    Num words in 8GB
0                           99.3%          0.99 Billion
1                           0.7%           7.7 Million
2                           26 x 10^-6     28 K
3                           62 x 10^-9     67
4                           10^-10         0.1

Most faulty words have 1-bit error
The skew in fault probability can be leveraged for low cost resilience
Tolerate high error rates with commodity ECC DIMM while retaining soft-error resilience
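The table above can be closely reproduced with a small sketch, under the assumption (mine, consistent with the numbers shown) that each bit of a 72-bit ECC word fails independently at the 10^-4 rate:

```python
from math import comb

ber = 1e-4               # bit error rate: 10^-4, i.e. 100 ppm
word_bits = 72           # 64 data bits + 8 SECDED check bits per 8B word
words_in_dimm = 2**30    # ~1 billion words in an 8GB DIMM

for k in range(5):
    # Binomial probability of exactly k faulty bits in one ECC word
    p = comb(word_bits, k) * ber**k * (1 - ber)**(word_bits - k)
    print(f"{k} faulty bits: P = {p:.3g}  ->  ~{p * words_in_dimm:,.1f} words")
```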
ArchShield: Overview
Inspired from Solid State Drives (SSD) to tolerate high bit-error rate
Expose faulty cell information to Architecture layer via runtime testing
Most words will be error-free
1-bit error handled with SECDED
Multi-bit error handled with replication
[Figure: main memory augmented with ArchShield's Fault Map and Replication Area, with the Fault Map cached on chip]
ArchShield stores the error mitigation information in memory
 
ArchShield: Yield Aware Design
When DIMM is configured, runtime testing is performed. Each 8B word gets classified into one of three types:

No Error: replication not needed; SECDED can correct a soft error
1-bit Error: SECDED can correct the hard error; need replication for a soft error
Multi-bit Error: word gets decommissioned; only the replica is used

(classification of faulty words can be stored in hard drive for future use)
Tolerates 100ppm fault rate with 1% slowdown and 4% capacity loss
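A minimal sketch of how a read could be serviced under this three-way classification (my own illustration; names such as fault_map and read_and_secded are invented stand-ins, not the paper's interfaces):

```python
from enum import Enum

class WordClass(Enum):
    NO_ERROR = "no_error"     # replication not needed; SECDED covers soft errors
    ONE_BIT = "one_bit"       # SECDED corrects the hard error; replica covers a soft error
    MULTI_BIT = "multi_bit"   # word decommissioned; only the replica is used

def read_word(addr, fault_map, replicas, read_and_secded):
    """Sketch of an ArchShield-style read.

    fault_map:       dict addr -> WordClass (most words are absent, i.e. NO_ERROR)
    replicas:        dict addr -> replica address for faulty words
    read_and_secded: callable(addr) -> (data, uncorrectable), modeling the DRAM
                     read plus SECDED decode
    """
    cls = fault_map.get(addr, WordClass.NO_ERROR)
    if cls is WordClass.MULTI_BIT:
        data, _ = read_and_secded(replicas[addr])   # decommissioned: use replica only
        return data
    data, uncorrectable = read_and_secded(addr)
    if cls is WordClass.ONE_BIT and uncorrectable:
        # A soft error on top of the known hard error exceeds SECDED,
        # so fall back to the replica kept for exactly this case.
        data, _ = read_and_secded(replicas[addr])
    return data
```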
 
Outline
 
 
Introduction
 
ArchShield: Yield Aware (arch support for DRAM)
Hybrid Memory: reduce Latency, Energy, Power
Adaptive Tuning of Systems to Workloads
 
Summary
 
Emerging Technology to aid Scaling
Phase Change Memory (PCM): Scalable to sub 10nm
Resistive memory: High resistance (0), Low resistance (1)
Advantages: scalable, has MLC capability, non-volatile (no leakage)
PCM is attractive for designing scalable memory systems. But …
Challenges for PCM
Key Problems:
1. Higher read latency (compared to DRAM)
2. Limited write endurance (~10-100 million writes per cell)
3. Writes are much slower, and power hungry
Replacing DRAM with PCM causes: high read latency, high power, high energy consumption
How do we design a scalable PCM without these disadvantages?
Hybrid Memory: Best of DRAM and PCM
Hybrid Memory System:
1. DRAM as cache to tolerate PCM Rd/Wr latency and Wr bandwidth
2. PCM as main-memory to provide large capacity at good cost/power
3. Write filtering techniques to reduce wasteful writes to PCM
[Figure: processor backed by a DRAM buffer (with tag-store T), PCM main memory with a PCM write queue, and Flash or HDD as backing storage]
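To make the organization concrete, here is a toy model (my own sketch; the LRU buffer and the "only dirty evictions reach PCM" rule are illustrative choices, not the talk's exact design):

```python
from collections import OrderedDict

class HybridMemory:
    """Toy model of a DRAM buffer in front of PCM with simple write filtering."""

    def __init__(self, dram_lines):
        self.dram_lines = dram_lines
        self.dram = OrderedDict()      # addr -> dirty bit (LRU order)
        self.pcm_write_queue = []      # addresses waiting to be written back to PCM

    def access(self, addr, is_write=False):
        if addr in self.dram:          # DRAM hit: fast path, PCM untouched
            self.dram.move_to_end(addr)
            self.dram[addr] = self.dram[addr] or is_write
            return "dram_hit"
        # DRAM miss: fetch the line from PCM into the DRAM buffer
        if len(self.dram) >= self.dram_lines:
            victim, dirty = self.dram.popitem(last=False)
            if dirty:                  # write filtering: clean lines never reach PCM
                self.pcm_write_queue.append(victim)
        self.dram[addr] = bool(is_write)
        return "pcm_read"
```

For example, `HybridMemory(dram_lines=2).access(0x10, is_write=True)` returns "pcm_read" on the first touch and "dram_hit" on a repeat access, and the line only enters the PCM write queue if it is evicted dirty.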
Latency, Energy, Power: Lowered
[Figure: normalized execution time for db1, db2, qsort, bsearch, kmeans, gauss, daxpy, vdotp, and gmean under 8GB DRAM, 32GB PCM, 32GB DRAM, and 32GB PCM + 1GB DRAM]
Hybrid memory provides performance similar to iso-capacity DRAM
Also avoids the energy/power overheads from frequent writes
 
Outline
 
 
Introduction
 
ArchShield: Yield Aware (arch support for DRAM)
Hybrid Memory: reduce Latency, Energy, Power
Adaptive Tuning of Systems to Workloads
 
Summary
Workload Adaptive Systems
Different policies work well for different workloads:
1. No single replacement policy works well for all workloads
2. Or, the prefetch algorithm
3. Or, the memory scheduling algorithm
4. Or, the coherence algorithm
5. Or, any other policy (write allocate/no allocate?)
Unfortunately: systems are designed to cater to the average case
(a policy that works well enough for all workloads)
Ideal for each workload to have the policy that works best for it
Adaptive Tuning via Runtime Testing
Say we want to select between two policies, P0 and P1. Divide the cache in three:
Dedicated P0 sets
Dedicated P1 sets
Follower sets (use the winner of P0, P1)

An n-bit saturating counter tracks the duel:
misses to P0-sets: counter++
misses to P1-sets: counter--

The counter decides the policy for the Follower sets:
MSB = 0: use P0
MSB = 1: use P1

Monitor, choose, apply (Set Dueling: using a single counter)
Adaptive Tuning can allow dynamic policy selection at low cost
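A compact sketch of this mechanism (my own code; the set counts, counter width, and set-to-group mapping are arbitrary choices for illustration):

```python
class SetDueling:
    """Set dueling between policies P0 and P1 using one n-bit saturating counter."""

    def __init__(self, n_bits=10, num_sets=1024, dedicated_per_policy=32):
        self.n_bits = n_bits
        self.max_count = (1 << n_bits) - 1
        self.counter = 1 << (n_bits - 1)        # start near the midpoint (no winner yet)
        # Simple static mapping: a few sets dedicated to each policy, the rest follow
        self.p0_sets = set(range(dedicated_per_policy))
        self.p1_sets = set(range(num_sets - dedicated_per_policy, num_sets))

    def on_miss(self, set_index):
        # Only misses in the dedicated sets steer the saturating counter
        if set_index in self.p0_sets:
            self.counter = min(self.counter + 1, self.max_count)
        elif set_index in self.p1_sets:
            self.counter = max(self.counter - 1, 0)

    def policy_for(self, set_index):
        if set_index in self.p0_sets:
            return "P0"                         # dedicated sets always use their own policy
        if set_index in self.p1_sets:
            return "P1"
        msb = self.counter >> (self.n_bits - 1)
        return "P1" if msb else "P0"            # followers: MSB=0 -> P0, MSB=1 -> P1
```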
 
Outline
 
 
Introduction
 
ArchShield: Yield Aware (arch support for DRAM)
Hybrid Memory: reduce Latency, Energy, Power
Adaptive Tuning of Systems to Workloads
 
Summary
Challenges for Computer Architects
End of: Technology Scaling, Frequency Scaling, Moore's Law, ????
How do we address these challenges:
Yield Awareness
Hybrid memory: Latency, Energy, Power reduction for PCM
Workload adaptive systems: low cost
The solution for all computer architecture problems is: "Adaptivity Through Testing"
Challenges for Computer Architects
End of: Technology Scaling, Frequency Scaling, Moore's Law, ????
How do we address these challenges:
Yield Awareness
Hybrid memory: Latency, Energy, Power reduction for PCM
Workload adaptive systems: get low cost
The solution for all computer architecture problems is: Adaptivity Through Testing
Happy 75th Yale!
