Understanding Storage Systems Dependability
This content discusses storage systems dependability, covering topics such as data reliability, fault-tolerant techniques, error detection and correction, component levels, disk protection techniques, types of disk faults, and protection mechanisms like RAID and ECC. It provides insights into ensuring the reliable operation of storage systems.
Presentation Transcript
In the name of GOD Reza Eftekhary eftekhary@ce.sharif.edu
Storage Architecture: Storage Systems Dependability
Dependability Criteria
- Data reliability
  - Fault-tolerant techniques
    - Error detection and correction
    - Modular redundancy
  - Error recovery
    - Checkpoints: backup level
    - Restart software, restore data: task level
- Data availability
Level of Evaluation
- Backup level
- Component level
  - Disk
  - Processors
  - Enclosure
  - Interconnects
Component Level: Disk
- Types of faults
- Protection techniques
- Reliability model
Disk: Types of Fault
- Operational failures (cannot find data)
  - Bad servo track
  - Bad electronics
  - Can't stay on track
  - Bad read head
Disk: Types of Fault
- Latent defects (data missing)
  - Error during writing
  - Bad media
  - Inherent bit-error rate
  - High-fly writes
Disk: Types of Fault
- Undetected disk errors (UDEs)
  - Undetected write errors (UWEs)
  - Undetected read errors
- Low rate of occurrence on a single disk, but a high rate across a large storage system
- Types of UWE
  - Dropped write
  - Near off-track write
  - Far off-track write
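Why a rare per-disk event becomes common in a large system can be illustrated with a short calculation. This is a minimal sketch that assumes errors are independent across I/O operations; the per-operation probability and the operation counts are hypothetical values chosen for illustration, not figures from the presentation.

```python
import math

def prob_at_least_one(p_per_op: float, num_ops: float) -> float:
    """Probability of at least one undetected error in num_ops independent operations.

    Evaluates 1 - (1 - p)^n using log1p/expm1 so that very small p stays accurate.
    """
    return -math.expm1(num_ops * math.log1p(-p_per_op))

# Hypothetical inputs (assumptions, not measured rates).
P_UNDETECTED = 1e-12       # chance that a single I/O suffers an undetected error
OPS_PER_DISK = 1e9         # I/O operations issued to one disk over some period
DISKS_IN_SYSTEM = 10_000   # number of disks in a large storage system

single_disk = prob_at_least_one(P_UNDETECTED, OPS_PER_DISK)
whole_system = prob_at_least_one(P_UNDETECTED, OPS_PER_DISK * DISKS_IN_SYSTEM)

print(f"one disk:     {single_disk:.6f}")
print(f"whole system: {whole_system:.6f}")
```

Under these assumed numbers a single disk almost never sees an undetected error, while the system as a whole almost certainly does.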
Protection Techniques
- ECC
- Parity
- RAID
- Disk scrubbing
- Intra-disk redundancy
Redundant Array of Independent Disks (RAID)
- Data striping
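Data striping combined with parity can be sketched in a few lines. The example below is an illustration under simplifying assumptions (one stripe, one XOR parity block, data that fits in the stripe); the function names `stripe`, `xor_blocks`, and `rebuild` are hypothetical and do not come from the presentation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def stripe(data: bytes, num_data_disks: int, block_size: int):
    """Split data into one block per data disk, zero-padding the final block."""
    capacity = num_data_disks * block_size
    if len(data) > capacity:
        raise ValueError("data does not fit in a single stripe")
    padded = data.ljust(capacity, b"\0")
    return [padded[i * block_size:(i + 1) * block_size]
            for i in range(num_data_disks)]

def rebuild(surviving_blocks, parity):
    """Recover the one missing data block from the survivors and the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])

blocks = stripe(b"storage!", num_data_disks=4, block_size=2)
parity = xor_blocks(blocks)

# Simulate losing the disk that holds block 2, then rebuild it.
survivors = blocks[:2] + blocks[3:]
recovered = rebuild(survivors, parity)
print(recovered == blocks[2])
```

Because XOR is its own inverse, the same routine that computes parity also reconstructs a lost block, which is why a single parity disk tolerates exactly one failure per stripe.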
Double Disk Failure (DDF)
- Two scenarios result in a DDF:
  - Two simultaneous operational failures
  - An operational failure that occurs after a latent defect, before the latent defect has been corrected
- Multiple simultaneous latent defects do not constitute a failure.
Disk Scrubbing
- A background process that reads sector data and recovers damaged data via
  - ECC
  - Parity of the other disks in the RAID
- Benefits
  - Decreases DDFs
  - Increases reliability
- Problems
  - Decreases the bandwidth available to foreground work
  - Increases the number of accesses to the data
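A scrubbing pass can be sketched as a loop that verifies every block and repairs a damaged one from the rest of its stripe. This is a simplified model, not a real controller: a CRC32 recorded at write time stands in for the drive's ECC check, each stripe carries a single XOR parity block, and the `scrub` function and its data layout are assumptions made for illustration.

```python
import zlib
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def scrub(stripes, checksums):
    """Verify every block and repair latent defects in place.

    stripes[s][d]   -- block stored on disk d in stripe s (last disk holds parity)
    checksums[s][d] -- CRC32 recorded when that block was written
    Returns the list of (stripe, disk) positions that were repaired.
    """
    repaired = []
    for s, blocks in enumerate(stripes):
        bad = [d for d, block in enumerate(blocks)
               if zlib.crc32(block) != checksums[s][d]]
        if len(bad) == 1:
            d = bad[0]
            others = [block for i, block in enumerate(blocks) if i != d]
            blocks[d] = xor_blocks(others)  # XOR of the rest restores the block
            repaired.append((s, d))
        elif len(bad) > 1:
            raise RuntimeError(f"stripe {s}: {len(bad)} bad blocks, "
                               "beyond single-parity protection")
    return repaired

data = [b"ab", b"cd", b"ef"]
one_stripe = data + [xor_blocks(data)]
sums = [[zlib.crc32(block) for block in one_stripe]]

one_stripe[1] = b"zz"                 # inject a latent defect on disk 1
fixed = scrub([one_stripe], sums)
print(fixed, one_stripe[1])
```

Repairing the defect while all other disks are still healthy is what lowers the DDF risk: the latent defect is gone before an operational failure can coincide with it.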
Intra-Disk Redundancy
- Increases redundancy information
  - 2- or 3-fault-tolerant codes
  - Stores parity in rows and columns
- Storage overhead
- Decreases bandwidth overhead
- Less delay in comparison with scrubbing
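The idea of storing parity in rows and columns can be shown with a toy two-dimensional parity check over bits. This sketch only locates and corrects a single flipped bit; it is a simplified stand-in, not the 2- or 3-fault-tolerant codes the slide refers to, and the function names are hypothetical.

```python
def parities(grid):
    """Return (row_parities, column_parities) for a grid of 0/1 bits."""
    rows = [sum(row) % 2 for row in grid]
    cols = [sum(col) % 2 for col in zip(*grid)]
    return rows, cols

def locate_and_fix(grid, stored_rows, stored_cols):
    """Correct a single flipped bit in place using stored row/column parity.

    Returns the (row, col) that was fixed, or None if the grid is consistent.
    """
    rows, cols = parities(grid)
    bad_rows = [i for i, (now, was) in enumerate(zip(rows, stored_rows)) if now != was]
    bad_cols = [j for j, (now, was) in enumerate(zip(cols, stored_cols)) if now != was]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        grid[i][j] ^= 1  # the bad bit sits where the bad row and column cross
        return (i, j)
    raise ValueError("error pattern not correctable by this simple scheme")

grid = [[1, 0, 1],
        [0, 1, 1],
        [1, 1, 0]]
stored_rows, stored_cols = parities(grid)

grid[1][2] ^= 1                                  # inject a single-bit error
fixed_at = locate_and_fix(grid, stored_rows, stored_cols)
print(fixed_at, grid[1])
```

Because the check uses only redundancy kept alongside the data, the error can be handled on the spot rather than waiting for a scrub pass, which is the intuition behind the lower delay noted above.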
Backup Level
Backup Level: Sample Failure Types
- Storage hardware failures: disk drive, disk array, file server, rack, building, site, region
- Common shared infrastructure failures: power, air conditioning
- Software-induced data corruption: storage device firmware, application defects, viruses
- Human/operator/user error: accidental deletion or overwriting, external attacks
Backup Level: Data Protection Techniques
- Inside one storage device: partial redundancy or full redundancy
- Between storage devices
  - Mirroring, at the same level
  - Array to tape, or array to slow storage, across different levels
- Update frequency
  - Synchronous: in lock-step with foreground updates
  - Continuous asynchronous: remote mirroring, logging
  - Batched (e.g., nightly backup)
Backup Level: Tradeoffs Among Protection Techniques
- Protection overhead: capacity, performance impact on the foreground workload, and downtime
- Recovery time: how long does it take to get the data back?
- Recovery points: how much (and which) data can be returned to its pre-loss state?
- Cost
  - Direct costs: equipment purchase or rental
  - Indirect costs: lost worker productivity, user dissatisfaction
Backup Level: Required Knowledge
- The bandwidth required to mirror data
- The number of network links between the two sites
- How often to take full backups and incremental backups
- The bandwidth required to make backups
- When to move backup tapes off-site for disaster recovery
- Whether to rebuild a stricken site or build a new one
Backup Level
When a failure occurs, it may be necessary to revert to a consistent point prior to the failure, which entails the loss of any data written after that point.
Backup Level
- Recovery time: the time that elapses after a failure until the system is running again. The recovery time objective (RTO) provides an acceptable upper bound.
- Recent data loss: the amount of recent updates lost during recovery from a failure. The recovery point objective (RPO) provides an upper bound.
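Checking a backup policy against an RTO and an RPO comes down to simple arithmetic. The sketch below assumes a batched policy in which each backup captures the state at its start and becomes usable only once the copy completes; the `BackupPolicy` fields, the restore model, and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class BackupPolicy:
    interval_hours: float        # time between the starts of consecutive backups
    copy_hours: float            # time for one backup to finish and become usable
    restore_gb_per_hour: float   # sustained restore bandwidth

def worst_case_data_loss_hours(policy: BackupPolicy) -> float:
    """Worst case: failure strikes just before the next backup completes."""
    return policy.interval_hours + policy.copy_hours

def recovery_time_hours(policy: BackupPolicy, data_gb: float,
                        fixed_delay_hours: float = 0.0) -> float:
    """Fixed delay (detection, provisioning) plus time to copy the data back."""
    return fixed_delay_hours + data_gb / policy.restore_gb_per_hour

def meets_objectives(policy: BackupPolicy, data_gb: float,
                     rto_hours: float, rpo_hours: float,
                     fixed_delay_hours: float = 0.0) -> bool:
    """True when both the recovery time and the data loss stay within bounds."""
    return (recovery_time_hours(policy, data_gb, fixed_delay_hours) <= rto_hours
            and worst_case_data_loss_hours(policy) <= rpo_hours)

nightly = BackupPolicy(interval_hours=24, copy_hours=4, restore_gb_per_hour=500)
print(worst_case_data_loss_hours(nightly))
print(recovery_time_hours(nightly, data_gb=2000, fixed_delay_hours=1))
print(meets_objectives(nightly, 2000, rto_hours=8, rpo_hours=24, fixed_delay_hours=1))
```

With these assumed values the nightly policy restores fast enough but can lose more than 24 hours of updates, so it would miss a 24-hour RPO even though backups run daily.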
Backup Level: Hierarchy of Primary and Secondary Copies
- The primary data copy is level 0
- Each level is responsible for
  - Retaining some number of discrete retrieval points (RPs)
  - Propagating RPs to the next level in the hierarchy
Backup Level
- As the level increases, the data protection techniques
  - Store less frequent retrieval points (RPs)
  - Present larger retention capacity
  - Exhibit longer recovery latencies
- Recovery path: the opposite of RP propagation
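The hierarchy can be modeled as an ordered list of levels, each with its own RP interval, retention count, and recovery latency. The sketch below is one possible reading of the slide: the `Level` fields, the example levels, and the assumption that recovery passes back through every intermediate level are all illustrative rather than taken from the presentation.

```python
from dataclasses import dataclass

@dataclass
class Level:
    name: str
    rp_interval_hours: float       # time between retrieval points at this level
    retained_rps: int              # number of RPs this level keeps
    recovery_latency_hours: float  # time to read an RP back from this level

    @property
    def retention_window_hours(self) -> float:
        """How far back in time this level can reach."""
        return self.rp_interval_hours * self.retained_rps

def level_for_age(levels, age_hours):
    """Index of the lowest level whose retention window covers the requested age."""
    for index, level in enumerate(levels):
        if level.retention_window_hours >= age_hours:
            return index
    return None

def recovery_path(levels, source_index):
    """Names of the levels visited during recovery: the reverse of RP propagation."""
    return [levels[i].name for i in range(source_index, -1, -1)]

# Hypothetical hierarchy; level 0 is the primary copy and holds only current data.
hierarchy = [
    Level("primary", 0, 0, 0.0),
    Level("snapshot", 12, 4, 0.1),
    Level("tape backup", 168, 4, 8.0),
]

source = level_for_age(hierarchy, age_hours=100)
print(source, recovery_path(hierarchy, source))
```

In this assumed hierarchy, reaching back 100 hours is beyond the snapshots' 48-hour window, so recovery has to start from tape and accepts its longer latency.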