Amazon Dynamo for Highly Available Key-Value Storage Systems

 
Scaling Out Key-Value Storage: Dynamo
 
CS 240: Computing Systems and Concurrency
Lecture 8
 
Marco Canini
 
Availability: vital for web applications
Web applications are expected to be "always on"
Downtime pisses off customers and costs $
System design considerations relevant to availability:
Scalability: always on under growing demand
Reliability: always on despite failures
Performance: is 10-second latency considered available?
"An availability event can be modeled as a long-lasting performance variation" (Amazon Aurora, SIGMOD '17)
 
Scalability: up or out?
Scale-up (vertical scaling)
Upgrade hardware, e.g., MacBook Air → MacBook Pro
Downtime during upgrade; stops working quickly
Scale-out (horizontal scaling)
Add machines and divide the work, e.g., a supermarket adds more checkout lines
No disruption; works great with careful design
 
Reliability: available under failures
More machines, more likely to fail
p = probability a machine fails in a given period
n = number of machines
Probability of any failure in the given period = 1 - (1 - p)^n
For 50K machines, each 99.99966% available, the data center experiences failures 16% of the time
For 100K machines, failures happen 30% of the time!
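To make the arithmetic concrete, here is a small Python check of the formula above; the 99.99966% per-machine availability is the figure from the slide, and the helper name is just illustrative.

```python
def p_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n machines fails in the period,
    given per-machine failure probability p."""
    return 1 - (1 - p) ** n

p = 1 - 0.9999966  # per-machine failure probability (99.99966% available)
print(f"50K machines:  {p_any_failure(p, 50_000):.1%}")   # ~15.6%
print(f"100K machines: {p_any_failure(p, 100_000):.1%}")  # ~28.8%, roughly the 30% on the slide
```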
 
Two questions (challenges)
How is data partitioned across machines so the system scales?
How are failures handled so the system is always on?
 
Today: Amazon Dynamo
1. Background and system model
2. Data partitioning
3. Failure handling
 
Amazon in 2007
10^4s of servers in multiple DCs (10^6s of servers and 120+ DCs as of now)
10^7s of customers at peaks (89M+ reqs/s on Prime Day '21)
Tiered architecture (similar today)
Service-oriented architecture
Stateless web servers & aggregators
Stateful storage servers
 
Dynamo requirements
Highly available writes despite failures
Despite disks failing, network routes flapping, "data centers destroyed by tornadoes"
Always respond quickly, even during failures → replication
Low request-response latency: focus on the 99.9% SLA
E.g., "provide a response within 300ms for 99.9% of its requests for peak client load of 500 reqs/s"
Incrementally scalable as servers grow to the workload
Adding "nodes" should be seamless
Comprehensible conflict resolution
High availability in the above sense implies conflicts
 
Basics in Dynamo
Basic interface is a key-value store (vs. a relational DB)
get(k) and put(k, v)
Keys and values are opaque to Dynamo
Nodes are symmetric
P2P and DHT context
 
Today: Amazon Dynamo
1. Background and system model
2. Data partitioning
3. Failure handling
 
Consistent hashing recap
Identifiers have m = 3 bits; key space: [0, 2^3 - 1]
Key is stored at its successor: the node with the next-higher ID
(Ring figure: 3-bit ID space 0-7; the nodes store key 1; keys 2, 3; keys 4, 5; key 6; and keys 7, 0, respectively)
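A minimal sketch of successor-based placement on the 3-bit ring; the node positions match the example ring above, and the lookup convention (first node ID >= the key, wrapping around) is the one stated on the slide.

```python
import bisect

M_BITS = 3
node_ids = sorted([0, 1, 3, 5, 6])  # node positions on the 2^3 ring, as in the figure

def successor(key_id: int) -> int:
    """Return the node that stores the key: the node with the next-higher ID, wrapping."""
    idx = bisect.bisect_left(node_ids, key_id % (2 ** M_BITS))
    return node_ids[idx % len(node_ids)]

for k in range(2 ** M_BITS):
    print(f"key {k} -> node {successor(k)}")   # keys 7 and 0 land on node 0, etc.
```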
 
Incremental scalability (why consistent hashing)
Minimum data is moved around when nodes join and leave
Unlike modular hashing (see next slide)
(Figure: node 5 joins the 3-bit ring and only keys 4, 5 are transferred to it)
 
Modulo hashing
Consider the problem of data partitioning: given object id X, choose one of k servers to use
Suppose instead we use modulo hashing: place X on server i = hash(X) mod k
What happens if a server fails or joins (k → k ± 1)?
Or if different clients have different estimates of k?
Problem for modulo hashing: changing the number of servers
h(x) = x + 1 (mod 4); add one machine and it becomes h(x) = x + 1 (mod 5)
(Figure: server assignment plotted against object serial numbers 5, 7, 10, 11, 27, 29, 36, 38, 40)
All entries get remapped to new nodes!
Need to move objects over the network
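A quick check, under the slide's toy hash h(x) = x + 1 mod k, of how many of the listed objects change servers when one machine is added:

```python
def server(x: int, k: int) -> int:
    return (x + 1) % k          # the slide's toy placement function

objects = [5, 7, 10, 11, 27, 29, 36, 38, 40]   # serial numbers from the figure
moved = sum(server(x, 4) != server(x, 5) for x in objects)
print(f"{moved}/{len(objects)} objects move when k goes from 4 to 5")   # 8/9 move
```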
 
Challenge: unbalanced load
Nodes are assigned different # of keys
Unbalanced when nodes join/leave
Some keys are more popular, e.g., a best-seller item
(Figure: example rings showing uneven key assignment across nodes)
 
 
Solution: virtual nodes (vnodes)
An extra level of mapping: from node id in the ring to physical node
Node ids are now virtual nodes (tokens); multiple node ids map to the same physical node
(Figure: 3-bit ID space with 4 physical servers and 2 vnodes per server; same color = same physical node)
If the orange server leaves, its keys move to the blue and red servers
More virtual nodes, more balanced
Faster data transfer for join/leave
Controllable # of vnodes per server, based on server capacity (e.g., CPU, memory, network)
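A sketch of the extra mapping layer; the token-to-server layout below is hypothetical, but it mirrors the figure's idea of several tokens per physical server.

```python
import bisect

# token (vnode) position on the ring -> physical server; hypothetical 2 vnodes/server
vnode_to_server = {0: "green", 4: "green", 1: "red", 5: "red",
                   2: "blue", 6: "blue", 3: "orange", 7: "orange"}
tokens = sorted(vnode_to_server)

def server_for(key_id: int) -> str:
    """Find the key's successor vnode on the ring, then map it to its physical server."""
    idx = bisect.bisect_left(tokens, key_id % 8)
    return vnode_to_server[tokens[idx % len(tokens)]]

print([server_for(k) for k in range(8)])
# If the orange server leaves, its tokens (3 and 7) disappear and only the keys
# they covered fall through to the next vnodes, owned by other servers.
```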
 
Gossip and lookup
Gossip: once per second, each node contacts a randomly chosen other node
They exchange their lists of known nodes (including virtual node IDs)
Assumes all nodes will come back eventually; doesn't repartition
Each node learns which others handle all key ranges
Result: all nodes can send directly to any key's coordinator ("zero-hop DHT")
Reduces variability in response times
 
Today: Amazon Dynamo
1. Background and system model
2. Data partitioning
3. Failure handling
 
 
Preference list (data replication)
Key replicated on M vnodes (remember the "r-successor" in the DHT?)
All M vnodes on distinct servers across different datacenters
(Figure: 3-bit ring, 5 colors = 5 physical nodes)
Example with M = 4: key 0's preference list could be vnodes {0, 1, 3, 5}, mapping to servers {green, red, orange, blue}
Green is the coordinator server of key 0
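A sketch of building such a preference list: walk clockwise from the key's position and keep a vnode only if its physical server has not been used yet. The token layout here is hypothetical, chosen so that key 0 reproduces the slide's example.

```python
import bisect

def preference_list(key_id, vnode_to_server, m, ring_size=8):
    """Return up to m (vnode, server) pairs whose physical servers are all distinct."""
    tokens = sorted(vnode_to_server)
    start = bisect.bisect_left(tokens, key_id % ring_size)
    chosen, seen = [], set()
    for i in range(len(tokens)):
        token = tokens[(start + i) % len(tokens)]
        server = vnode_to_server[token]
        if server not in seen:            # skip vnodes on already-chosen servers
            seen.add(server)
            chosen.append((token, server))
        if len(chosen) == m:
            break
    return chosen

layout = {0: "green", 1: "red", 2: "green", 3: "orange",
          4: "red", 5: "blue", 6: "purple", 7: "blue"}
print(preference_list(0, layout, m=4))
# -> [(0, 'green'), (1, 'red'), (3, 'orange'), (5, 'blue')], i.e. vnodes {0, 1, 3, 5}
```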
 
Read and write requests
Received by the coordinator (this is not Chord)
Either the client (web server) knows the mapping, or the request is re-routed
Sent to the first N "healthy" servers in the preference list (coordinator included)
Durable writes: my updates are recorded on multiple servers
Fast reads: possible to avoid a straggler
A write creates a new immutable version of the key (no overwrite): a multi-versioned data store
Quorum-based protocol
A write succeeds if W out of N servers reply (write quorum)
A read succeeds if R out of N servers reply (read quorum)
W + R > N
 
Quorum implications (W, R, and N)
N determines the durability of data (Dynamo uses N = 3)
W and R adjust the availability-consistency tradeoff
W = 1 (R = 3): fast writes, weak durability, slow reads
R = 1 (W = 3): slow writes, good durability, fast reads
Dynamo: W = R = 2
Why W + R > N? Read and write quorums overlap when there are no failures!
Reads see all updates without failures
What if there are failures?
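A minimal sketch of the W/R counting for a single key; the replica objects and their store/fetch methods are hypothetical, and the point is only the quorum check and the W + R > N overlap.

```python
N, W, R = 3, 2, 2   # Dynamo's settings from the lecture

def quorum_write(replicas, key, versioned_value):
    """Send the write to the N chosen replicas; succeed if at least W acknowledge."""
    acks = sum(1 for rep in replicas if rep.store(key, versioned_value))  # hypothetical API
    return acks >= W

def quorum_read(replicas, key):
    """Collect replies; succeed if at least R replicas answered."""
    replies = [v for rep in replicas if (v := rep.fetch(key)) is not None]  # hypothetical API
    # With W + R > N and no failures, at least one reply comes from a replica
    # that took part in the last successful write quorum.
    return replies if len(replies) >= R else None
```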
 
Failure handling: sloppy quorum + hinted handoff
Sloppy: not always the same servers are used for the N
First N servers in the preference list without failures
Later servers in the list take over if some in the first N fail
Consequences
Good performance: no need to wait for failed servers in N to recover
Eventual (weak) consistency: conflicts are possible, versions diverge
Another decision on the availability-consistency tradeoff!
 
Failure handling: sloppy quorum + hinted handoff
Key 0's preference list: {green, red, orange, blue}
N = 3: {green, red, orange} without failures
If red fails, requests go to {green, orange, blue}
Hinted handoff
Blue temporarily serves requests, hinted that red is the intended recipient
Blue sends the replica back to red when red is back on
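A sketch of choosing write targets under a sloppy quorum, with a hint attached to a stand-in server; the health check and server names are illustrative.

```python
def pick_targets(pref_list, n, is_healthy):
    """First n healthy servers take the write; a stand-in from beyond the
    top-n positions carries a hint naming the failed server it replaces."""
    healthy = [s for s in pref_list if is_healthy(s)]
    failed_top = [s for s in pref_list[:n] if not is_healthy(s)]
    targets = []
    for server in healthy[:n]:
        hint = failed_top.pop(0) if server not in pref_list[:n] and failed_top else None
        targets.append((server, hint))
    return targets

pref = ["green", "red", "orange", "blue"]          # key 0's preference list
print(pick_targets(pref, 3, lambda s: s != "red"))
# -> [('green', None), ('orange', None), ('blue', 'red')]: blue holds a hint for red
```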
 
Wide-area replication
Last, §4.6: preference lists always contain nodes from more than one data center
Consequence: data is likely to survive the failure of an entire data center
Blocking on writes to a remote data center would incur unacceptably high latency
Compromise: W < N, eventual consistency
Better durability & latency, but worse consistency
 
Conflicts
Suppose N = 3, W = R = 2, and the nodes are A, B, C, D, E
CL1's put(k, …) completes on A and B
CL2's put(k, …) completes on C and D
Conflicting results from A, B and from C, D: each has seen a different put(k, …)
How does Dynamo handle conflicting versions?
 
An example of conflicting writes (versions)
Preference list (M = 5, N = 3): A, B, C, D, E
Shopping-cart timeline:
CL1: Add Item x (the write lands on A and B)
A and B fail
CL2: Add Item y (the write lands on C and D)
A and B recover
CL1: Read cart (the read reaches both versions)
Conflicting versions are only possible under failures
Vector clocks: handling conflicting versions
Same timeline, now with version stamps (preference list M = 5, N = 3):
CL1: Add Item x, stored as x (A,1) on A and B
CL2: Add Item y, stored as y (C,1) on C and D
CL1: Read cart returns x (A,1) and y (C,1)
(A,1) and (C,1) are not causally related: conflicts!
Can we use Lamport clocks?
 
Version vectors (vector clocks)
Version vector: a list of (coordinator node, counter) pairs, e.g., [(A, 1), (B, 3), …]
Dynamo stores a version vector with each stored key-value pair
Idea: track the ancestor-descendant relationship between different versions of the data stored under the same key k
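A sketch of version-vector comparison, which is what decides whether one version can be forgotten or the two conflict; representing a vector as a dict of counters is an assumption.

```python
def descends(vv_a: dict, vv_b: dict) -> bool:
    """True if vv_a is greater than or equal to vv_b on every node's counter."""
    return all(vv_a.get(node, 0) >= count for node, count in vv_b.items())

def relate(vv_a: dict, vv_b: dict) -> str:
    if vv_a == vv_b:
        return "equal"
    if descends(vv_a, vv_b):
        return "a supersedes b"        # b is an ancestor; it can be forgotten
    if descends(vv_b, vv_a):
        return "b supersedes a"
    return "conflict"                  # concurrent versions, must be reconciled

print(relate({"A": 1}, {"C": 1}))            # conflict (the shopping-cart example)
print(relate({"A": 2, "C": 1}, {"A": 1}))    # a supersedes b
```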
 
Dynamo's system interface
get(key) → value(s), context
Returns one value or multiple conflicting values
The context describes the version(s) of the value(s)
put(key, context, value) → OK
The context indicates which versions this version supersedes or merges
 
Version vectors: Dynamo's mechanism
Rule: if vector-clock comparison gives v1 < v2, then the first is an ancestor of the second, and Dynamo can forget v1
Each time a put() occurs, Dynamo increments the counter in the version vector for the coordinator node
Each time a get() occurs, Dynamo returns the version vector for the value(s) returned (in the "context")
Users must then supply that context to put()s that modify the same key
 
Conflict resolution (reconciliation)
If vector clocks show the versions are causally related (not really conflicting):
The system overwrites with the later version
For conflicting versions:
The system can handle it automatically, e.g., last-writer-wins (limited use case)
Application-specific resolution (most common): clients resolve the conflict via reads, e.g., merge the shopping cart
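A sketch of application-side reconciliation for the shopping-cart case: read all conflicting versions, merge them (here a simple set union), and write back under the combined context. The get/put shapes follow the interface slide above; the client object itself is hypothetical.

```python
def merge_cart(client, key):
    values, context = client.get(key)      # may return several conflicting carts
    merged = set().union(*values)          # application-specific merge: union of items
    client.put(key, context, merged)       # the new version supersedes everything read
    return merged

# e.g. values = [{"x"}, {"y"}] with context [(A,1), (C,1)] -> merged cart {"x", "y"}
```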
 
Vector clocks: handling conflicting versions
Continuing the shopping-cart timeline (preference list M = 5, N = 3; servers A, B, C, D, E):
CL1: Add Item x, stored as x (A,1)
CL2: Add Item y, stored as y (C,1)
CL1: Read cart returns x (A,1) and y (C,1)
CL1: Add Item z, supplying the read context [(A,1), (C,1)]
The merged cart x, y, z is written with version (A,2, C,1), which supersedes both earlier versions
 
How useful is it to vary N, R, W?
N  R  W  Behavior
3  2  2  Parameters from the paper: good durability, good R/W latency
3  3  1  Slow reads, weak durability, fast writes
3  1  3  Slow writes, strong durability, fast reads
3  3  3  More likely that reads see all prior writes?
3  1  1  Read quorum may not overlap write quorum
 
Failure detection and ring membership
Server A considers B failed if B does not reply to A's message (even if B replies to C)
A then tries alternative nodes
With servers joining and permanently leaving:
Servers periodically send gossip messages to their neighbors to sync who is in the ring
Some servers are chosen as seeds, i.e., common neighbors to all nodes
 
Anti-entropy (replica synchronization)
A hinted-handoff node may crash before it can replicate data to a node in the preference list
Need another way to ensure that each key-value pair is replicated N times
Mechanism: replica synchronization
Nodes nearby on the ring periodically gossip
They compare the (k, v) pairs they hold and copy any missing keys the other has
How to compare and copy replica state quickly and efficiently?
 
Efficient synchronization with Merkle trees
Merkle trees hierarchically summarize the key-value pairs a node holds
One Merkle tree for each virtual node's key range
Leaf node = hash of one key's value (# of leaves = # of keys on the virtual node)
Internal node = hash of the concatenation of its children
Replicas exchange trees from the top down, depth by depth
If the root nodes match, the replicas are identical: stop
Else, go to the next level and compare nodes pair-wise
s
B
 
i
s
 
m
i
s
s
i
n
g
 
o
r
a
n
g
e
 
k
e
y
;
 
A
 
i
s
 
m
i
s
s
i
n
g
 
g
r
e
e
n
 
o
n
e
E
x
c
h
a
n
g
e
 
a
n
d
 
c
o
m
p
a
r
e
 
h
a
s
h
 
n
o
d
e
s
 
f
r
o
m
 
r
o
o
t
d
o
w
n
w
a
r
d
s
,
 
p
r
u
n
i
n
g
 
w
h
e
n
 
h
a
s
h
e
s
 
m
a
t
c
h
50
50
M
e
r
k
l
e
 
t
r
e
e
 
r
e
c
o
n
c
i
l
i
a
t
i
o
n
B
s
 
v
a
l
u
e
s
:
A
s
 
v
a
l
u
e
s
:
[0, 2
128
)
[0, 2
127
)
[2
127
, 2
128
)
[0, 2
128
)
[0, 2
127
)
[2
127
, 2
128
)
F
i
n
d
s
 
d
i
f
f
e
r
i
n
g
 
k
e
y
s
 
q
u
i
c
k
l
y
 
a
n
d
 
w
i
t
h
m
i
n
i
m
u
m
 
i
n
f
o
r
m
a
t
i
o
n
 
e
x
c
h
a
n
g
e
 
Dynamo: take-away ideas
Availability is important
Systems need to be scalable and reliable
Dynamo is eventually consistent
Many design decisions trade consistency for availability
Core techniques
Consistent hashing: data partitioning
Replication, preference lists, sloppy quorum, hinted handoff: availability under failures
Vector clocks: conflict resolution (partly automatic, the rest application-specific)
Anti-entropy: synchronize replicas
Gossip: synchronize ring membership