Understanding DRAM Errors: Implications for System Design

Cosmic Rays Don’t Strike Twice:
Understanding the Nature of DRAM Errors
and the Implications for System Design
Andy A. Hwang, Ioan Stefanovici, Bianca Schroeder
Presented at ASPLOS 2012
Why DRAM errors?
Why DRAM?
One of the most frequently replaced components [DSN’06]
Getting worse in the future?
DRAM errors?
A bit is read differently from how it was written (e.g., a cell written as 1 is read back as 0)
What do we know about DRAM errors?
Soft errors
Transient
Cosmic rays, alpha particles, random noise
Hard errors
Permanent hardware problem
Error protection
None → machine crash / data loss
Error correcting codes
E.g.: SEC-DED, Multi-bit Correct, Chipkill
Redundancy
E.g.: Bit-sparing
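To make the ECC entry above concrete, here is a toy extended-Hamming (8,4) encoder/decoder showing the SEC-DED idea: correct any single-bit error, detect (but not correct) double-bit errors. The function names and the tiny 4-bit data word are illustrative only; production DRAM ECC such as (72,64) SEC-DED or Chipkill operates on much wider words inside the memory controller.

```python
# Toy extended-Hamming (8,4) SEC-DED sketch; real DRAM ECC is wider and in hardware.

def secded_encode(d):
    # d: four data bits. Codeword positions (1-indexed): p1 p2 d1 p4 d2 d3 d4,
    # followed by an overall parity bit p0.
    p1 = d[0] ^ d[1] ^ d[3]          # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]          # covers positions 2,3,6,7
    p4 = d[1] ^ d[2] ^ d[3]          # covers positions 4,5,6,7
    code = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    p0 = 0
    for b in code:
        p0 ^= b
    return code + [p0]

def secded_decode(c):
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # position (1-7) of a single flipped bit
    overall = 0
    for b in c:
        overall ^= b                  # stays 0 when total parity is intact
    if syndrome == 0 and overall == 0:
        status = "no error"
    elif overall == 1:                # odd overall parity -> single-bit error
        if syndrome:
            c[syndrome - 1] ^= 1      # flip it back
        status = "corrected single-bit error"
    else:                             # parity even but syndrome nonzero
        status = "uncorrectable double-bit error detected"
    return [c[2], c[4], c[5], c[6]], status

word = secded_encode([1, 0, 1, 1])
word[5] ^= 1                          # inject a single-bit error
assert secded_decode(word) == ([1, 0, 1, 1], "corrected single-bit error")
```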
What don’t we know about DRAM errors?
Some open questions
What does the error process look like? (Poisson?)
What is the frequency of hard vs. soft errors?
What do errors look like on-chip?
Can we predict errors?
What is the impact on the OS?
How effective are hardware- and software-level error protection mechanisms?
Can we do better?
Previous Work
 
 
Accelerated laboratory testing
Realistic?
Most previous work focused specifically on soft errors
Current field studies are limited
12 machines with errors [ATC’10]
Kernel pages on desktop machines only [EuroSys’11]
Error-count-only information from a homogeneous population [Sigmetrics’09]
The data in our study
[Diagram: DIMMs attached to the memory controller and CPU]
Error events detected upon [read] access and corrected by the memory controller
Data contains error location (node and address), error type (single/multi-bit), and timestamp information.
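The per-event fields described above could be represented roughly as follows; the field names and types are illustrative, not the study's actual trace schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CorrectedErrorEvent:
    node_id: str          # which node's memory controller reported the error
    phys_addr: int        # physical address of the faulty location
    multi_bit: bool       # True for a multi-bit error, False for single-bit
    timestamp: float      # detection time, e.g. seconds since epoch

    def page(self, page_size: int = 4096) -> int:
        # OS-level view: the 4 KB page frame containing this address
        return self.phys_addr // page_size
```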
The systems in our study

System     | Time (days) | DRAM (TB) | DRAM technology  | Protection mechanisms
LLNL BG/L  | 214         | 49        | DDR              | Multi-bit Correct, Bit Sparing
ANL BG/P   | 583         | 80        | DDR2             | Multi-bit Correct, Chipkill, Bit Sparing
SciNet GPC | 211         | 62        | DDR3             | SEC-DED
Google     | 155         | 220       | DDR[1-2], FBDIMM | Multi-bit Correct

Wide range of workloads, DRAM technologies, and protection mechanisms.
Memory controller physical address mappings
In total more than 300 TB-years of data!

How common are DRAM errors?

System     | Nodes with errors | Total # of errors | Average # errors per node/year | Median # errors per node/year
LLNL BG/L  | 1,724 (5.32%)     | 227 x 10^6        | 3,879                          | 19
ANL BG/P   | 1,455 (3.55%)     | 1.96 x 10^9       | 844,922                        | 14
SciNet GPC | 97 (2.51%)        | 49.3 x 10^6       | 263,268                        | 464
Google     | 20,000 (N/A%)     | 27.27 x 10^9      | 880,179                        | 303

Errors happen at a significant rate
Highly variable number of errors per node
 
How are errors distributed in the systems?
Only 2-20% of nodes with errors experience a single error
Top 5% of nodes with errors experience > 1 million errors
Top 10% of nodes with CEs make up ~90% of all errors
After 2 errors, probability of future errors > 90%
Distribution of errors is highly skewed
Very different from a Poisson distribution (a quick sanity check follows below)
Could hard errors be the dominant failure mode?
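The Poisson comparison can be sanity-checked with a few lines of arithmetic. A minimal sketch, using illustrative per-node rates rather than the study's data: under a Poisson error process, most nodes that see any errors at all should see exactly one, which is far from the observed 2-20% single-error nodes and the heavy tail of nodes with millions of errors.

```python
import math

# Under a Poisson process with mean rate lam errors per node, the fraction of
# nodes *with errors* that see exactly one error is P(1) / P(>=1).
def frac_single_error_given_any(lam: float) -> float:
    p0 = math.exp(-lam)      # P(0 errors)
    p1 = lam * p0            # P(exactly 1 error)
    return p1 / (1.0 - p0)

for lam in (0.05, 0.5, 2.0):  # illustrative rates only
    print(f"lam={lam}: {frac_single_error_given_any(lam):.1%} of error nodes see exactly one error")
# Small rates predict that the vast majority of error nodes see a single error,
# unlike the highly skewed distribution observed in the field data.
```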
What do errors look like on-chip?
[Bank diagrams: errors plotted by row and column within a single bank, illustrating repeat-address, repeat-row, repeat-column, whole-chip, and single-event patterns]

Error mode   | Repeat address | Repeat row | Repeat column | Whole chip | Single event
BG/L banks   | 80.9%          | 4.7%       | 8.8%          | 0.53%      | 17.6%
BG/P banks   | 59.4%          | 31.8%      | 22.7%         | 3.20%      | 29.2%
Google banks | 58.7%          | 7.4%       | 14.5%         | 2.02%      | 34.9%

The patterns on the majority of banks can be linked to hard errors.

What is the time between repeat errors?
[Figure: CDF of the time between repeat errors; the 2-week mark is highlighted]
Repeat errors happen quickly
90% of repeat errors manifest themselves within less than 2 weeks
When are errors detected?
Error detection:
Program [read] access
Hardware memory scrubber: Google only
[Figure: time until a repeat error is detected; the 1-day mark is highlighted]
Hardware scrubbers may not shorten the time until a repeat error is detected
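For context, a patrol scrubber simply walks memory in the background so that latent errors are found and corrected by ECC instead of waiting for a program read. A purely conceptual software sketch of the idea; read_word and ecc_check_and_correct are hypothetical stand-ins for memory-controller behaviour, not a real API.

```python
import time

# Conceptual only: real scrubbers run inside the memory controller hardware.
def scrub_loop(addresses, read_word, ecc_check_and_correct, pass_interval_s=86400):
    while True:
        for addr in addresses:
            word = read_word(addr)              # the read itself triggers ECC checking
            ecc_check_and_correct(addr, word)   # correct / log any latent error found
        time.sleep(pass_interval_s)             # e.g. one full pass per day
```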
 
How does memory degrade?
1/3 – 1/2 of error addresses develop additional errors
Top 5-10% develop a large number of repeats
[Figures: probability of errors on a row/column as a function of previous errors]
3-4 orders of magnitude increase in error probability once an error occurs, and an even greater increase after repeat errors
For both columns and rows
 
How do multi-bit errors impact the system?
In the absence of sufficiently powerful ECC, multi-bit errors can cause data corruption / machine crash.
Can we predict multi-bit errors?
> 100-fold increase in MBE probability after repeat errors
50-90% of MBEs had prior warning
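As a rough illustration of how "prior warning" could be quantified, the sketch below replays a time-ordered error trace and counts how many multi-bit errors were preceded by an earlier error on the same address, row, or column. The event format and the single-bank assumption are illustrative, not the study's actual methodology.

```python
# trace: iterable of (timestamp, addr, row, col, is_mbe) tuples for one DRAM bank
# (hypothetical format).
def fraction_of_mbes_with_warning(trace):
    seen_addr, seen_row, seen_col = set(), set(), set()
    mbes = warned = 0
    for ts, addr, row, col, is_mbe in sorted(trace):   # process in time order
        if is_mbe:
            mbes += 1
            if addr in seen_addr or row in seen_row or col in seen_col:
                warned += 1                            # had a prior warning sign
        seen_addr.add(addr)
        seen_row.add(row)
        seen_col.add(col)
    return warned / mbes if mbes else 0.0
```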
Are some areas of a bank more likely to fail?
[Figures: error probabilities across the rows and columns of a bank, per system]
Errors are not uniformly distributed
Some patterns are consistent across systems
Lower rows have higher error probabilities
Summary so far
 
Similar error behavior across ~300 TB-years of DRAM from different types of systems
Strong correlations (in space and time) exist between errors
On-chip error patterns confirm hard errors as the dominant failure mode
Early errors are highly indicative warning signs for future problems
 
What does this all mean?
 
What do errors look like from the OS’ p.o.v.?
For typical 4 KB pages:
Errors are highly localized on a small number of pages
~85% of errors in the system fall on 10% of the pages impacted by errors
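A back-of-the-envelope way to compute this kind of locality figure from a list of corrected-error addresses; the input list and the 10% cut-off are illustrative, not the study's exact procedure.

```python
from collections import Counter

# error_addrs: hypothetical list of physical addresses of corrected errors on one node.
def error_concentration(error_addrs, page_size=4096, top_frac=0.10):
    per_page = Counter(addr // page_size for addr in error_addrs)
    counts = sorted(per_page.values(), reverse=True)
    top_n = max(1, int(len(counts) * top_frac))
    return sum(counts[:top_n]) / sum(counts)  # share of errors on the top pages

# The slide's ~85% figure corresponds to error_concentration(...) being roughly
# 0.85 when computed over all pages that ever saw an error.
```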
Can we retire pages containing errors?
 
 
Page Retirement
Move the page’s contents to a different page and mark it as bad to prevent future use
Some page retirement mechanisms exist
Solaris
BadRAM patch for Linux
But rarely used in practice
No page retirement policy evaluation on realistic error traces
What sorts of policies should we use?
Retirement policies (a sketch of each follows below):
Repeat-on-address
1-error-on-page
2-errors-on-page
Repeat-on-row
Repeat-on-column
[Diagrams: example errors plotted in the physical address space and in the on-chip row/column layout, showing which errors trigger each policy]
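A minimal sketch of the policies listed above, replayed over a time-ordered error trace. Simplifying assumptions: the trace covers a single bank, each event is a (phys_addr, row, col) tuple, and every policy retires the 4 KB page containing the error that triggered it; this is not the paper's actual simulator.

```python
def evaluate_policy(trace, policy, page_size=4096):
    retired, seen_addr, seen_row, seen_col = set(), set(), set(), set()
    errors_on_page, prevented = {}, 0
    for addr, row, col in trace:
        page = addr // page_size
        if page in retired:
            prevented += 1                      # this error would have been avoided
            continue
        errors_on_page[page] = errors_on_page.get(page, 0) + 1
        if policy == "1-error-on-page":
            trigger = True
        elif policy == "2-errors-on-page":
            trigger = errors_on_page[page] >= 2
        elif policy == "repeat-on-address":
            trigger = addr in seen_addr
        elif policy == "repeat-on-row":
            trigger = row in seen_row
        elif policy == "repeat-on-column":
            trigger = col in seen_col
        else:
            raise ValueError(f"unknown policy: {policy}")
        if trigger:
            retired.add(page)
        seen_addr.add(addr)
        seen_row.add(row)
        seen_col.add(col)
    return prevented, len(retired)              # errors prevented, pages sacrificed
```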
 
How effective is page retirement?
[Figures: for typical 4 KB pages, fraction of errors (and of multi-bit errors) prevented vs. memory sacrificed per node, for each policy (1-error-on-page, 2-errors-on-page, Repeat-on-address, Repeat-on-row, Repeat-on-column); an effective policy prevents most errors while sacrificing well under 1 MB]
More than 90% of errors can be prevented with < 1 MB sacrificed per node
Similar for multi-bit errors
Implications for future system design

OS-level page retirement can be highly effective
Different areas on chip are more susceptible to errors than others
Selective error protection
Potential for error prediction based on early warning signs
Memory scrubbers may not be effective in practice
Using server idle time to run memory tests (e.g., memtest86); a toy illustration follows below
Realistic DRAM error process needs to be incorporated into future reliability research
Physical-level error traces have been made public on the Usenix Failure Repository
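A toy write/read/verify pass in the spirit of memtest-style tools. Real testers such as memtest86 run from firmware or bare metal with many more patterns and address tests; this only illustrates the basic idea on an ordinary Python bytearray standing in for a memory region.

```python
def check_pattern(buf, pattern):
    bad_offsets = []
    for i in range(len(buf)):
        buf[i] = pattern                       # write the pattern everywhere
    for i in range(len(buf)):
        if buf[i] != pattern:                  # a stuck or flipped bit shows up here
            bad_offsets.append(i)
    return bad_offsets

buf = bytearray(1 << 20)                       # 1 MB test region
for pattern in (0x00, 0xFF, 0xAA, 0x55):       # all-zeros, all-ones, alternating bits
    assert check_pattern(buf, pattern) == []
```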
Thank you!
Please read the paper for more results! 
Questions?
