Toward I/O-Efficient Protection Against Silent Data Corruptions in RAID Arrays

Mingqiang Li and Patrick P. C. Lee
The Chinese University of Hong Kong
MSST ’14
 
2. RAID

RAID is known to protect data against disk failures and latent sector errors
How it works? Encodes k data chunks into m parity chunks, such that the k data
chunks can be recovered from any k out of n = k + m chunks
[Patterson et al., SIGMOD ’88]
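
To make the encoding concrete, below is a minimal sketch of the m = 1 case
(RAID-5-style XOR parity); general m uses an MDS erasure code such as
Reed-Solomon, and all names here are illustrative rather than taken from the
paper.

    # Minimal sketch of the m = 1 case (RAID-5-style XOR parity); general m
    # uses an MDS erasure code such as Reed-Solomon. Illustrative only.

    def xor_chunks(chunks):
        """XOR a list of equal-length byte strings into one chunk."""
        out = bytearray(len(chunks[0]))
        for chunk in chunks:
            for i, b in enumerate(chunk):
                out[i] ^= b
        return bytes(out)

    data = [b"\x01\x02", b"\x0a\x0b", b"\xf0\x0f"]   # k = 3 data chunks
    parity = xor_chunks(data)                        # m = 1 parity chunk

    # Any k of the n = k + m chunks suffice: rebuild data[1] from the others.
    rebuilt = xor_chunks([data[0], data[2], parity])
    assert rebuilt == data[1]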
 
3. Silent Data Corruptions

Silent data corruptions:
  Data is stale or corrupted without indication from the disk drives, so it
  cannot be detected by RAID
  Generated due to firmware or hardware bugs or malfunctions on the read/write
  paths
  More dangerous than disk failures and latent sector errors
[Kelemen, LCSC ’07; Bairavasundaram et al., FAST ’08; Hafner et al., IBM JRD 2008]
 
4. Silent Data Corruptions

Lost write: the chunk is left stale
Torn write: the chunk is only partially updated, the rest stays stale
Misdirected writes/reads
(Figure: (a) misdirected writes, (b) misdirected reads)
 
5. Silent Data Corruptions

Consequences:
  User read: corrupted data propagated to upper layers
  User write: parity pollution
  Data reconstruction: corruptions of surviving chunks propagated to
  reconstructed chunks

6. Integrity Protection

Protection against silent data corruptions:
  Extend the RAID layer with integrity protection, which adds integrity
  metadata for detection
  Recovery is done by the RAID layer
Goals:
  All types of silent data corruptions should be detected
  Reduce computational and I/O overheads of generating and storing integrity
  metadata
  Reduce computational and I/O overheads of detecting silent data corruptions
 
7. Our Contributions

A taxonomy study of existing integrity primitives on I/O performance and
detection capabilities
An integrity checking model
Two I/O-efficient integrity protection schemes with complementary performance
gains
Extensive trace-driven evaluations
 
8. Assumptions

At most one silently corrupted chunk within a stripe
If a stripe contains a silently corrupted chunk, the stripe has no more than
m-1 failed chunks due to disk failures or latent sector errors
Otherwise, higher-level RAID is needed!
(Figure: an example stripe with data chunks D0-D5 and parity chunks P0 and P1,
illustrating m-1 = 1)
 
9. How RAID Handles Writes?

Full-stripe writes:
  Parity chunks are computed directly from the data chunks to be written (no
  disk reads needed)
Partial-stripe writes:
  RMW (read-modify-write) for small writes:
    Read all touched data chunks and all parity chunks
    Compute the data changes and the parity chunks
    Write all touched data chunks and parity chunks
  RCW (reconstruct-write) for large writes:
    Read all untouched data chunks
    Compute the parity chunks
    Write all touched data chunks and parity chunks
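
As a rough illustration of why RMW suits small writes and RCW suits large
writes, the sketch below counts disk I/Os for a partial-stripe write touching
w of the k data chunks, ignoring integrity metadata; the per-scheme accounting
in the paper is more detailed, so treat this only as an approximation.

    # Rough disk-I/O counts for a partial-stripe write touching w of the k
    # data chunks in a stripe with m parity chunks, ignoring integrity
    # metadata. Approximation only.

    def rmw_ios(w, k, m):
        # Read-modify-write: read the touched data chunks and all parity
        # chunks, then write the touched data chunks and all parity chunks.
        return (w + m) + (w + m)

    def rcw_ios(w, k, m):
        # Reconstruct-write: read the untouched data chunks, recompute the
        # parity, then write the touched data chunks and all parity chunks.
        return (k - w) + (w + m)

    k, m = 6, 2                                  # e.g., RAID-6 with n = 8
    for w in range(1, k + 1):
        cheaper = "RMW" if rmw_ios(w, k, m) < rcw_ios(w, k, m) else "RCW"
        print(f"w={w}: RMW={rmw_ios(w, k, m)}, RCW={rcw_ios(w, k, m)} -> {cheaper}")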
 
10. Existing Integrity Primitives

Self-checksumming / Physical identity [Krioukov et al., FAST ’08]
  Data and metadata are read in a single disk I/O
  Inconsistency implies data corruption
  Cannot detect stale or overwritten data
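
A minimal sketch of these two primitives is given below, assuming a CRC32
checksum and a (disk_id, offset) pair as the physical identity; the field
names are hypothetical, not from the paper.

    # Minimal sketch of self-checksumming + physical identity. Assumes a CRC32
    # checksum and a (disk_id, offset) pair as the identity; both are stored
    # inside the same chunk as the data, so one disk read fetches data and
    # metadata together.
    import zlib

    def write_chunk(data, disk_id, offset):
        return {"data": data,
                "crc": zlib.crc32(data),
                "id": (disk_id, offset)}

    def check_chunk(chunk, disk_id, offset):
        crc_ok = zlib.crc32(chunk["data"]) == chunk["crc"]   # catches bit corruption
        id_ok = chunk["id"] == (disk_id, offset)             # catches misdirected I/O
        # A lost write leaves old data with its old (still matching) checksum
        # and identity, so staleness is not detected by these primitives alone.
        return crc_ok and id_ok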
 
11. Existing Integrity Primitives

Version Mirroring [Krioukov et al., FAST ’08]
  Keep a version number in the same data chunk and the m parity chunks
  Can detect lost writes
  Cannot detect corruptions
 
12. Existing Integrity Primitives

Checksum Mirroring [Hafner et al., IBM JRD 2008]
  Keep a checksum in the neighboring data chunk (buddy) and the m parity chunks
  Can detect all silent data corruptions
  High I/O overhead on checksum updates
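
Below is a minimal sketch of checksum mirroring with CRC32 checksums and a
simple pairwise buddy assignment; the buddy layout and data structures are
assumptions for illustration. It also shows why every data update must touch
the buddy and parity copies, which is the extra I/O noted above.

    # Minimal sketch of checksum mirroring: the checksum of data chunk i is
    # mirrored in a buddy data chunk and in the parity chunk(s), so a stale or
    # corrupted chunk disagrees with its mirrored copies. Buddy layout is an
    # assumption for illustration.
    import zlib

    k = 4
    data = [bytearray(8) for _ in range(k)]      # k data chunks
    buddy_copy = [dict() for _ in range(k)]      # checksums held by each buddy chunk
    parity_copy = {}                             # checksums held by the parity chunk(s)

    def update(i, new_bytes):
        data[i][:] = new_bytes
        crc = zlib.crc32(bytes(data[i]))
        buddy_copy[i ^ 1][i] = crc               # extra write to the buddy chunk
        parity_copy[i] = crc                     # extra write to the parity chunk(s)

    def verify(i):
        crc = zlib.crc32(bytes(data[i]))
        return crc == buddy_copy[i ^ 1][i] == parity_copy[i]

    for i in range(k):
        update(i, b"fresh--%d" % i)
    data[2][0] ^= 0xFF                           # silent corruption of chunk 2
    print(verify(2))                             # False: caught via mirrored copies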
 
13. Comparisons

(Table: the primitives compared by detection capability and by whether they
incur additional I/O overhead or no additional I/O overhead)
Question: How to integrate integrity primitives into I/O-efficient integrity
protection schemes?
 
14. Integrity Checking Model

Two types of disk reads:
  First read: sees all types of silent data corruptions
  Subsequent reads: see a subset of the types of silent data corruptions
Observation: a simpler and lower-overhead integrity checking mechanism is
possible for subsequent reads
 
15. Checking Subsequent-Reads

Subsequent reads can be checked by self-checksumming and physical identity
without additional I/Os
(Figure: corruption types seen by subsequent reads; no additional I/O overhead)
Integrity protection schemes to consider:
  PURE (checksum mirroring only), HYBRID-1, and HYBRID-2
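
One plausible way to realize this first-read/subsequent-read distinction (not
specified in the slides, and listed later as an open implementation issue) is
a per-chunk "verified" flag, sketched below with hypothetical hooks: the first
read after a write runs the full check, and later reads fall back to the
cheaper self-checksumming and physical-identity check.

    # Hypothetical sketch of tracking first vs. subsequent reads with a
    # per-chunk "verified" flag; only one plausible realization, since the
    # talk leaves the tracking mechanism as an open issue.

    verified = set()                  # chunk addresses whose first read passed

    def on_write(addr):
        verified.discard(addr)        # the next read is again a "first read"

    def on_read(addr, full_check, light_check):
        if addr not in verified:
            ok = full_check(addr)     # sees all corruption types (e.g., mirrored checksums)
            if ok:
                verified.add(addr)
        else:
            ok = light_check(addr)    # self-checksumming + physical identity, no extra I/O
        return ok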
 
16. Integrity Protection Schemes

Hybrid-1: physical identity + self-checksumming + version mirroring
  A variant of the scheme in [Krioukov et al., FAST ’08]
 
17. Integrity Protection Schemes

Hybrid-2: physical identity + self-checksumming + checksum mirroring
  A NEW scheme

18. Additional I/O Overhead for a Single User Read/Write

(Figure: additional I/O overhead for a single user read/write under each
scheme, with the switch point between Hybrid-1 and Hybrid-2 marked)
Both Hybrid-1 and Hybrid-2 outperform Pure in subsequent reads
Hybrid-1 and Hybrid-2 provide complementary I/O advantages for different
write sizes
 
19. Choosing the Right Scheme

If … choose Hybrid-1
If … choose Hybrid-2
  … = average write size of a workload (estimated through measurements)
  … = RAID chunk size
The chosen scheme is configured in the RAID layer (offline) during
initialization
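
A sketch of this offline configuration step is shown below. Estimating the
average write size from measured requests is straightforward; the actual
switch-point condition that compares it against the chunk size is given in
the paper and is not reproduced here, so prefer_hybrid_1() is only a labeled
placeholder.

    # Sketch of the offline scheme selection. prefer_hybrid_1() is a
    # placeholder: substitute the switch-point condition from the paper.

    def average_write_size(write_sizes_bytes):
        """Estimate the workload's average write size from measured requests."""
        return sum(write_sizes_bytes) / len(write_sizes_bytes)

    def prefer_hybrid_1(avg_write_size, chunk_size):
        raise NotImplementedError("use the switch-point condition from the paper")

    def configure_scheme(write_sizes_bytes, chunk_size):
        avg = average_write_size(write_sizes_bytes)
        return "Hybrid-1" if prefer_hybrid_1(avg, chunk_size) else "Hybrid-2"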
 
20. Evaluation

Computational overhead for calculating integrity metadata
I/O overhead for updating and checking integrity metadata
Effectiveness of choosing the right scheme
 
21. Computational Overhead

Implementation: GF-Complete [Plank et al., FAST ’13] and Crcutil libraries
Testbed: Intel Xeon E5530 CPU @ 2.4GHz with SSE4.2
Overall results:
  ~4GB/s for RAID-5
  ~2.5GB/s for RAID-6
RAID performance is bottlenecked by disk I/Os rather than the CPU
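
For a rough feel of checksum computation cost on one's own machine (this is
not the paper's testbed, which uses the Crcutil and GF-Complete libraries), a
tiny throughput measurement with Python's zlib.crc32 could look like this:

    # Rough CRC32 throughput measurement; ballpark only, not comparable to the
    # paper's Crcutil/GF-Complete numbers.
    import time
    import zlib

    buf = b"\x5a" * (64 * 1024 * 1024)           # 64 MiB of data
    start = time.perf_counter()
    zlib.crc32(buf)
    elapsed = time.perf_counter() - start
    print(f"CRC32 throughput: {len(buf) / elapsed / 1e9:.2f} GB/s")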
 
22. I/O Overhead

Trace-driven simulation
12 workload traces from production Windows servers [Kavalanekar et al., IISWC ’08]
RAID-6 with n=8 for different chunk sizes
 
23. I/O Overhead

Pure can have high I/O overhead, up to 43.74%
I/O overhead can be kept reasonably low (often below 15%) using the best of
Hybrid-1 and Hybrid-2, due to the I/O gain in subsequent reads
More discussions in the paper
 
24. Choosing the Right Scheme

Accuracy rate: 34/36 = 94.44%
For the two inconsistent cases, the I/O overhead difference between Hybrid-1
and Hybrid-2 is small (below 3%)
 
25. Implementation Issues

Implementation in the RAID layer:
  Leverage RAID redundancy to recover from silent data corruptions
Open issues:
  How to keep track of first reads and subsequent reads?
  How to choose between Hybrid-1 and Hybrid-2 based on workload measurements?
  How to integrate with end-to-end integrity protection?
 
26. Conclusions

A systematic study on I/O-efficient integrity protection schemes against
silent data corruptions in RAID systems
Findings:
  Integrity protection schemes differ in I/O overheads, depending on the
  workloads
  Simpler integrity checking can be used for subsequent reads
Extensive evaluations on computational and I/O overheads of integrity
protection schemes