Storage Systems Dependability

 
Reza Eftekhary
eftekhary@ce.sharif.edu
 
In the name of GOD
 
Storage Architecture
 
Dependability Criteria

Data reliability
Fault-tolerant techniques
Error detection and correction
Modular redundancy
Error recovery
Checkpoints: backup level
Restart software, restore data: task level
Data availability

Level of Evaluation

Backup level
Component level
Disk
Processors
Enclosure
Interconnects

Component Level: Disk

Types of fault
Protection techniques
Reliability model
 
Disk: Types of Fault

Operational failures (data cannot be found)
Bad servo track
Bad electronics
Can't stay on track
Bad read head

Disk: Types of Fault

Latent defects (data missing)
Error during writing
Bad media
Inherent bit-error rate
High-fly writes

Disk: Types of Fault

Undetected disk error (UDE)
Undetected write error (UWE)
Undetected read error

Low rate of occurrence on a single disk
High rate of occurrence across a large storage system

Kinds of UWE
Dropped write
Near off-track write
Far off-track write
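The contrast between a low per-disk rate and a high system-wide rate follows from basic probability. The sketch below is only illustrative: it assumes errors on different disks are independent, and the per-disk probability `P` is a hypothetical value, since the slides give no figures.

```python
def prob_at_least_one_ude(per_disk_prob: float, n_disks: int) -> float:
    """Probability that at least one of n_disks experiences an undetected error,
    assuming each disk is affected independently with probability per_disk_prob."""
    return 1.0 - (1.0 - per_disk_prob) ** n_disks


# Hypothetical per-disk probability over some fixed period (not from the slides).
P = 1e-3
for n in (1, 100, 10_000):
    print(n, round(prob_at_least_one_ude(P, n), 4))
```

Under these assumptions an event that is rare for one disk becomes close to certain once the system holds thousands of disks.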
 
Protection Techniques

ECC
Parity
RAID
Disk scrubbing
Intra-disk redundancy

Redundant Array of Independent Disks (RAID)

Data striping
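The slide names only data striping. As a rough illustration of how striping combines with the parity technique listed earlier, the sketch below splits data into blocks across several disks and adds one XOR parity block per stripe. The function names, block size, and disk count are assumptions made for the example, not taken from the slides.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripes(data: bytes, n_data_disks: int, block_size: int):
    """Split data into stripes; each stripe holds n_data_disks blocks plus one parity block."""
    stripe_bytes = n_data_disks * block_size
    # Pad with zero bytes so the data fills a whole number of stripes.
    padded = data + b"\x00" * (-len(data) % stripe_bytes)
    stripes = []
    for start in range(0, len(padded), stripe_bytes):
        blocks = [padded[start + i * block_size: start + (i + 1) * block_size]
                  for i in range(n_data_disks)]
        stripes.append(blocks + [xor_blocks(blocks)])
    return stripes


def rebuild_block(stripe, lost_index: int) -> bytes:
    """Recover one lost block of a stripe by XOR-ing all surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


stripes = make_stripes(b"storage dependability", n_data_disks=3, block_size=4)
print(rebuild_block(stripes[0], lost_index=1) == stripes[0][1])
```

Because the XOR of all blocks in a stripe (parity included) is zero, any single missing block equals the XOR of the others. That is why one parity block tolerates exactly one lost disk.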
 
Double Disk Failure (DDF)

Two scenarios result in a DDF:
Two simultaneous operational failures
An operational failure that occurs after a latent defect, before the latent defect has been corrected

Multiple simultaneous latent defects do not constitute a failure.

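The two DDF scenarios can be written as a small predicate. This is a sketch under stated assumptions: a single-parity (N+1) RAID group, and a hypothetical `DiskState` record that stores when a disk failed operationally and when it acquired a latent defect that is still uncorrected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DiskState:
    # Time of an unrepaired operational failure, or None if the disk is working.
    operational_failure_at: Optional[float] = None
    # Time an as-yet-uncorrected latent defect appeared, or None if there is none.
    latent_defect_at: Optional[float] = None


def is_double_disk_failure(disks: List[DiskState]) -> bool:
    failed = [d for d in disks if d.operational_failure_at is not None]
    # Scenario 1: two simultaneous operational failures.
    if len(failed) >= 2:
        return True
    # Scenario 2: an operational failure after an uncorrected latent defect on another disk.
    for f in failed:
        for d in disks:
            if d is f or d.latent_defect_at is None:
                continue
            if d.latent_defect_at <= f.operational_failure_at:
                return True
    # Latent defects alone, however many, are not counted as a DDF.
    return False


group = [DiskState(latent_defect_at=10.0), DiskState(operational_failure_at=25.0), DiskState()]
print(is_double_disk_failure(group))
```

The predicate follows the slide's list literally, so a latent defect that appears only after the operational failure is not counted.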
Disk Scrubbing

A background process
Reads sector data and recovers damaged data via:
ECC
Parity from the other disks in the RAID group
Benefits: fewer DDFs, higher reliability
Problems: less bandwidth for foreground I/O, longer data access delay

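A scrubbing pass over one stripe might look like the sketch below. It makes several simplifying assumptions that are not in the slides: a CRC32 checksum per block stands in for the drive's ECC as the error detector, the stripe uses single XOR parity, and the function names are hypothetical.

```python
import zlib
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub_stripe(blocks, checksums):
    """Verify every block of one stripe and repair at most one bad block in place.

    blocks    -- data blocks followed by their XOR parity block (list of bytes)
    checksums -- CRC32 recorded for each block when it was written
    Returns the list of repaired block indices.
    """
    bad = [i for i, (block, crc) in enumerate(zip(blocks, checksums))
           if zlib.crc32(block) != crc]
    if len(bad) > 1:
        raise ValueError("more than one bad block: single parity cannot repair this stripe")
    for i in bad:
        survivors = [b for j, b in enumerate(blocks) if j != i]
        blocks[i] = xor_blocks(survivors)
    return bad


data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_blocks(data)]
crcs = [zlib.crc32(b) for b in stripe]
stripe[1] = b"BxBB"  # simulate a latent defect found by the scrubber
print(scrub_stripe(stripe, crcs), stripe[1])
```

Repairing latent defects in the background shortens the window in which a later operational failure would turn them into a DDF. The cost is the extra reads the pass issues.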
Intra-Disk Redundancy

Adds redundancy information within each disk
Codes that tolerate 2 or 3 faults
Parity stored by row and by column

Storage overhead
Lower bandwidth overhead
Less delay than scrubbing

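To illustrate the idea of storing parity by row and by column, the sketch below uses plain two-dimensional XOR parity over a small grid of sector values and repairs a single corrupted sector. It is a simplified stand-in, not the 2- or 3-fault-tolerant codes the slide refers to, and all names are hypothetical.

```python
from functools import reduce
from operator import xor


def parities(grid):
    """Return (row parities, column parities) of a grid of integer sector values."""
    rows = [reduce(xor, row) for row in grid]
    cols = [reduce(xor, col) for col in zip(*grid)]
    return rows, cols


def locate_and_fix(grid, row_par, col_par):
    """Find and repair a single corrupted sector; return its (row, col) or None if clean."""
    rows, cols = parities(grid)
    bad_rows = [i for i, (now, stored) in enumerate(zip(rows, row_par)) if now != stored]
    bad_cols = [j for j, (now, stored) in enumerate(zip(cols, col_par)) if now != stored]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        r, c = bad_rows[0], bad_cols[0]
        # The row syndrome is exactly the bit pattern that was flipped in the sector.
        grid[r][c] ^= rows[r] ^ row_par[r]
        return r, c
    raise ValueError("more than one damaged sector: beyond this simple sketch")


grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
row_par, col_par = parities(grid)
grid[1][2] ^= 0xFF  # corrupt one sector
print(locate_and_fix(grid, row_par, col_par), grid[1][2])
```

The mismatching row and column intersect at the damaged sector, so the disk can repair it locally without reading the other disks in the array.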
Backup Level
 
Backup Level: Sample Failure Types

Storage hardware failures: disk drive, disk array, file server, rack, building, site, region
Common shared infrastructure failures: power, air conditioning
Software-induced data corruption: storage device firmware, application defects, viruses
Human, operator, or user error: accidental deletion or overwriting, external attacks

Backup Level: Data Protection Techniques

Inside one storage device: partial redundancy or full redundancy
Between storage devices: mirroring at the same level; array to tape, or array to slower storage, across different levels
Update frequency: synchronous (in lock-step with foreground updates); continuous asynchronous (remote mirroring, logging); batched (e.g., nightly backup)

Backup Level: Tradeoffs Among Protection Techniques

Protection overhead: capacity, performance impact on the foreground workload, and downtime
Recovery time: how long does it take to get the data back?
Recovery points: how much (and which) data can be returned to its pre-loss state?
Cost: direct costs include equipment purchase or rental; indirect costs include lost worker productivity and user dissatisfaction

Backup Level: Required Knowledge

The bandwidth required to mirror data
The number of network links between the two sites
How often to take full backups and incremental backups
The bandwidth required to make backups
When to move backup tapes off-site for disaster recovery
Whether to rebuild a stricken site or build a new one
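Two of these questions, the bandwidth needed for backups and the number of inter-site links, reduce to simple arithmetic once a data volume and a time window are fixed. The sketch below is a back-of-the-envelope estimate. The data size, backup window, and link speed are hypothetical inputs, and it ignores protocol overhead and compression.

```python
import math


def required_bandwidth_mb_s(data_gb: float, window_hours: float) -> float:
    """Sustained throughput (MB/s) needed to copy data_gb within window_hours (1 GB = 1000 MB)."""
    return data_gb * 1000 / (window_hours * 3600)


def links_needed(required_mb_s: float, link_mb_s: float) -> int:
    """Number of equal-speed links needed to carry the required throughput."""
    return math.ceil(required_mb_s / link_mb_s)


# Hypothetical nightly backup: 1440 GB in an 8-hour window over 20 MB/s links.
need = required_bandwidth_mb_s(1440, 8)
print(need, links_needed(need, 20))
```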
 
Backup Level

When a failure occurs, it may be necessary to revert to a consistent point prior to the failure, which entails losing any data written after that point.
 
Backup Level: Recovery Time and Recent Data Loss

Recovery time: the time that elapses after a failure until the system is running again. The recovery time objective (RTO) provides an acceptable upper bound.
Recent data loss: the amount of recent updates lost during recovery from a failure. The recovery point objective (RPO) provides an upper bound.

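A design can be checked against its RTO and RPO as in the sketch below. It assumes batched protection, where the worst-case recent data loss equals the interval between backups, and all names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Objectives:
    rto_hours: float  # acceptable upper bound on recovery time
    rpo_hours: float  # acceptable upper bound on recent data loss


def worst_case_data_loss_hours(backup_interval_hours: float) -> float:
    """With batched backups, a failure just before the next backup loses a full interval."""
    return backup_interval_hours


def meets_objectives(recovery_hours: float, data_loss_hours: float, obj: Objectives) -> bool:
    return recovery_hours <= obj.rto_hours and data_loss_hours <= obj.rpo_hours


objectives = Objectives(rto_hours=4, rpo_hours=1)
nightly_loss = worst_case_data_loss_hours(24)
print(meets_objectives(recovery_hours=3, data_loss_hours=nightly_loss, obj=objectives))
```

In this example the nightly backup restores within the RTO but can lose up to 24 hours of updates, so it misses a one-hour RPO. Meeting that RPO would take a more frequent technique, such as asynchronous mirroring.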
Backup Level: Hierarchy of Primary and Secondary Copies

The primary data copy is level 0.
Each level is responsible for retaining some number of discrete retrieval points (RPs) and propagating RPs to the next level in the hierarchy.

Backup Level

As the level increases, the data protection techniques:
Store less frequent RPs
Offer larger retention capacity
Exhibit longer recovery latencies

Recovery path: the opposite of RP propagation

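The hierarchy can be modeled as a chain of levels, each retaining a bounded number of RPs and forwarding only some of them upward. The sketch below is a toy model under assumed names and parameters (`Level`, `retention`, `propagate_every`); none of them come from the slides.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Level:
    name: str
    retention: int        # how many RPs this level keeps
    propagate_every: int  # forward every k-th received RP to the next level
    rps: deque = field(default_factory=deque)
    received: int = 0


def record_rp(levels, timestamp):
    """A new RP of the primary copy (level 0) enters the first secondary level and may move up."""
    for level in levels:
        level.received += 1
        level.rps.append(timestamp)
        while len(level.rps) > level.retention:
            level.rps.popleft()  # the oldest RP ages out
        if level.received % level.propagate_every != 0:
            break  # this RP is not propagated further


def recovery_source(levels, target_time):
    """Walk from the lowest level upward; return (level name, newest RP <= target_time) or None."""
    for level in levels:
        candidates = [t for t in level.rps if t <= target_time]
        if candidates:
            return level.name, max(candidates)
    return None


levels = [Level("snapshot", retention=2, propagate_every=2),
          Level("tape", retention=10, propagate_every=1)]
for t in range(1, 7):
    record_rp(levels, t)
print(recovery_source(levels, 6), recovery_source(levels, 3))
```

In this toy run the lower level holds only the most recent RPs. Older targets have to be served from the higher level, whose RPs are sparser, which matches the tradeoff the slide describes.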
 
 

This content discusses storage systems dependability, covering topics such as data reliability, fault-tolerant techniques, error detection and correction, component levels, disk protection techniques, types of disk faults, and protection mechanisms like RAID and ECC. It provides insights into ensuring the reliable operation of storage systems.

  • Storage Systems
  • Dependability
  • Fault Tolerance
  • Data Reliability
  • RAID

Uploaded on Sep 07, 2024 | 1 Views

