Scaling Out Key-Value Storage and Dynamo

COS 418: Distributed Systems
Lecture 10
Mike Freedman
Availability: vital for web applications
Web applications are expected to be "always on"
Downtime pisses off customers and costs money
System design considerations relevant to availability:
  Scalability: always on under growing demand
  Reliability: always on despite failures
  Performance: is a 10-second latency considered "available"?
"An availability event can be modeled as a long-lasting performance variation" (Amazon Aurora, SIGMOD '17)
Scalability: up or out?
Scale-up (vertical scaling)
  Upgrade the hardware, e.g., MacBook Air → MacBook Pro
  Downtime during the upgrade; stops working (hits hardware limits) quickly
Scale-out (horizontal scaling)
  Add machines and divide the work, e.g., a supermarket adding more checkout lines
  No disruption; works great with careful design
Reliability: available under failures
More machines, more likely to fail
  p = probability one machine fails; n = # of machines
  Failures happen with probability 1 - (1 - p)^n
For 50K machines, each 99.99966% available, the data center experiences failures 16% of the time
For 100K machines, failures happen 30% of the time!
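A quick sanity check of these numbers (a minimal sketch; the per-machine failure probability is taken to be 1 minus the quoted availability, and failures are assumed independent):

```python
def p_any_failure(availability: float, n: int) -> float:
    """Probability that at least one of n machines is down: 1 - (1 - p)^n."""
    p = 1.0 - availability          # per-machine failure probability
    return 1.0 - (1.0 - p) ** n

print(f"{p_any_failure(0.9999966, 50_000):.0%}")    # ~16%
print(f"{p_any_failure(0.9999966, 100_000):.0%}")   # ~29-30%
```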
Two questions (challenges)
How is data partitioned across machines so the system scales?
How are failures handled so the system is always on?
Today: Amazon Dynamo
1. Background and system model
2. Data partitioning
3. Failure handling
Amazon in 2007
10^4s of servers in multiple datacenters
  (10^6s of servers and 80+ DCs as of now)
10^7s of customers at peak times
  (20M+ purchases in the US on Prime Day 2020)
Tiered architecture (similar today)
  Stateless web servers & aggregators
  Stateful storage servers
Basics in Dynamo
A key-value store (vs. a relational DB): get(key) and put(key, value)
Nodes are symmetric (remember DHTs?)
Service-Level Agreement (SLA), e.g., "provide a response within 300ms for 99.9% of its requests for peak client load of 500 requests/sec"
Today: Amazon Dynamo
1. Background and system model
2. Data partitioning
  Incremental scalability
  Load balancing
3. Failure handling
Consistent hashing recap
Identifiers have m = 3 bits; key space: [0, 2^3 - 1]
A key is stored at its successor: the node with the next-higher ID
[Ring figure, 3-bit ID space: nodes at IDs 0, 1, 3, 5, 6; node 1 stores key 1, node 3 stores keys 2-3, node 5 stores keys 4-5, node 6 stores key 6, node 0 stores keys 7 and 0]
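A minimal consistent-hashing sketch of this successor rule (hypothetical class and helper names; a real system would use a much larger ID space):

```python
import bisect
import hashlib

class Ring:
    """Toy consistent-hash ring: a key lives on its successor,
    the node with the next-higher (or equal) ID, wrapping around."""
    def __init__(self, m: int = 3):
        self.size = 2 ** m              # ID space is [0, 2^m - 1]
        self.node_ids: list[int] = []

    def key_id(self, key: str) -> int:
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest, "big") % self.size

    def add_node(self, node_id: int) -> None:
        bisect.insort(self.node_ids, node_id)

    def lookup(self, key: str) -> int:
        idx = bisect.bisect_left(self.node_ids, self.key_id(key))
        return self.node_ids[idx % len(self.node_ids)]   # wrap to the smallest ID

ring = Ring(m=3)
for node in (0, 1, 3, 5, 6):
    ring.add_node(node)
print(ring.lookup("shopping-cart:alice"))   # one of the node IDs above
```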
Incremental scalability (why consistent hashing)
Minimal data is moved around when nodes join and leave
[Figure: node 5 joins the ring and only keys 4 and 5 are transferred to it; all other keys stay where they are]
Please try modular hashing and see the difference
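For intuition, a small experiment comparing modular hashing with the ring above when one node is added (a sketch; key names and node counts are made up):

```python
import zlib

def owner_mod(key: str, n_nodes: int) -> int:
    """Modular hashing: owner = hash(key) mod #nodes."""
    return zlib.crc32(key.encode()) % n_nodes

keys = [f"item-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if owner_mod(k, 4) != owner_mod(k, 5))
print(f"modular hashing: {moved / len(keys):.0%} of keys move")   # roughly 80%
# With consistent hashing, only the keys falling into the new node's arc
# move: on average about 1/5 of them here.
```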
Challenge: unbalanced load
Nodes are assigned different numbers of keys
Load becomes unbalanced as nodes join and leave
Some keys are more popular than others (e.g., a best-seller item)
[Figure: ring where one node holds keys 7, 0, another keys 1, 2, another keys 5, 6, and the node holding keys 3, 4 also holds the best-seller item]
Solution: virtual nodes (vnodes)
An extra level of mapping: from node ID on the ring to physical node
Node IDs are now virtual nodes (tokens); multiple node IDs map to the same physical node
[Figure: ring where vnodes of the same color belong to the same physical node; 4 physical nodes (servers), 2 vnodes per server. When the gold server leaves, its keys move to the blue and red servers]
More virtual nodes, more balanced load
Faster data transfer on join/leave
Controllable # of vnodes per server, based on server capacity (CPU, memory, network)
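A sketch of the vnode idea (hypothetical names; tokens here are derived by hashing "server#i" rather than being hand-assigned):

```python
import bisect
import hashlib

class VNodeRing:
    """Consistent hashing with virtual nodes: each physical server
    owns several tokens (vnode positions) on the ring."""
    def __init__(self, vnodes_per_server: int = 8):
        self.vnodes_per_server = vnodes_per_server
        self.tokens: list[tuple[int, str]] = []    # (token, server), sorted by token

    def _pos(self, s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def add_server(self, server: str) -> None:
        for i in range(self.vnodes_per_server):
            bisect.insort(self.tokens, (self._pos(f"{server}#{i}"), server))

    def remove_server(self, server: str) -> None:
        # Only the departing server's arcs change hands.
        self.tokens = [(t, s) for (t, s) in self.tokens if s != server]

    def lookup(self, key: str) -> str:
        idx = bisect.bisect_left(self.tokens, (self._pos(key), ""))
        return self.tokens[idx % len(self.tokens)][1]
```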
Today: Amazon Dynamo
1. Background and system model
2. Data partitioning
3. Failure handling
  Data replication
Preference list (data replication)
Each key is replicated on M vnodes (remember "r-successors" in DHTs?)
All M vnodes are on distinct servers, across different datacenters
[Figure: ring with vnodes in 5 colors, i.e., 5 physical nodes]
Example with M = 4: key 0's preference list could be vnodes {0, 1, 3, 5}, mapping to servers {green, red, gold, blue}
Green is the coordinator server of key 0
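A sketch of building a preference list by walking the ring (reuses the VNodeRing sketch above; names are hypothetical):

```python
import bisect

def preference_list(ring: "VNodeRing", key: str, m: int) -> list[str]:
    """Walk clockwise from the key's position, collecting vnodes until
    M distinct physical servers have been found."""
    start = bisect.bisect_left(ring.tokens, (ring._pos(key), ""))
    servers: list[str] = []
    for i in range(len(ring.tokens)):
        _, server = ring.tokens[(start + i) % len(ring.tokens)]
        if server not in servers:
            servers.append(server)
        if len(servers) == m:
            break
    return servers   # servers[0] acts as the key's coordinator
```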
Read and write requests
Received by the coordinator: either the client (web server) knows the mapping, or the request is re-routed (this is not Chord)
Sent to the first N "healthy" servers in the preference list (coordinator included)
  Durable writes: updates are recorded on multiple servers
  Fast reads: possible to avoid stragglers
A write creates a new immutable version of the key instead of overwriting it (a multi-versioned data store)
Quorum-based protocol: W + R > N
  A write succeeds if W out of N servers reply (write quorum)
  A read succeeds if R out of N servers reply (read quorum)
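A sketch of quorum-style fan-out (hypothetical replica objects with a send() RPC stub; a real coordinator would also merge versions and retry):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def quorum_rpc(replicas, request, quorum, timeout=1.0):
    """Send `request` to all N replicas in parallel and return as soon
    as `quorum` of them have replied."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(r.send, request) for r in replicas]
    replies = []
    try:
        for fut in as_completed(futures, timeout=timeout):
            replies.append(fut.result())
            if len(replies) >= quorum:
                return replies                 # quorum reached
        raise TimeoutError("quorum not reached")
    finally:
        pool.shutdown(wait=False)              # don't block on stragglers

# Dynamo-style configuration: N = 3, W = R = 2
# put: quorum_rpc(healthy_replicas, ("put", key, value, context), quorum=2)
# get: quorum_rpc(healthy_replicas, ("get", key), quorum=2)
```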
Quorum implications (W, R, and N)
N determines the durability of data (Dynamo uses N = 3)
W and R tune the availability-consistency tradeoff
  W = 1 (R = 3): fast writes, weak durability, slow reads (read availability)
  R = 1 (W = 3): slow writes (write availability), good durability, fast reads
  Dynamo: W = R = 2
Why W + R > N? Read and write quorums overlap when there are no failures, so reads see all updates
What if there are failures?
Failure handling: sloppy quorum + hinted handoff
Sloppy: the N servers used are not always the same
  The first N servers in the preference list serve requests when there are no failures
  Later servers in the list take over if some of the first N fail
Consequences
  Good performance: no need to wait for failed servers in N to recover
  Eventual (weak) consistency: conflicts are possible, versions diverge
Another decision on the availability-consistency tradeoff!
Failure handling: sloppy quorum + hinted handoff (example)
[Figure: ring with vnodes in 5 colors, i.e., 5 physical nodes; key 0 replicated on 4 vnodes]
Key 0's preference list: {green, red, gold, blue}
N = 3: {green, red, gold} without failures
If red fails, requests go to {green, gold, blue}
Hinted handoff
  Blue temporarily serves requests, hinted that red is the intended recipient
  Blue sends the replica back to red when red comes back online
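A hinted-handoff sketch with hypothetical in-memory stand-ins for Dynamo nodes:

```python
class Node:
    def __init__(self, name: str):
        self.name = name
        self.store = {}      # key -> value: data this node is responsible for
        self.hinted = {}     # intended_owner -> {key: value}: held temporarily

    def put(self, key, value, intended_owner=None):
        if intended_owner and intended_owner != self.name:
            # Standing in for a failed replica: keep the data aside,
            # tagged with the node that should really have it.
            self.hinted.setdefault(intended_owner, {})[key] = value
        else:
            self.store[key] = value

    def handoff(self, recovered: "Node"):
        # When the intended owner is back, push its data and drop the hints.
        for key, value in self.hinted.pop(recovered.name, {}).items():
            recovered.put(key, value)

blue, red = Node("blue"), Node("red")
blue.put("key0", "cart-v2", intended_owner="red")   # red is down
blue.handoff(red)                                   # red recovers
print(red.store)                                    # {'key0': 'cart-v2'}
```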
An example of conflicting writes (versions)
Preference list: A, B, C, D, E (M = 5, N = 3); shopping cart, over time:
  CL1 adds item x: written to A and B; then A and B fail
  CL2 adds item y: written to C and D (the sloppy quorum skips the failed servers)
  A and B recover
  CL1 reads the cart: the read returns both versions, {x} and {y}
Conflicting versions are only possible under failures
Vector clocks: handling conflicting versions
Can we use Lamport clocks? (No: Lamport timestamps can order the versions, but they cannot tell whether two versions are concurrent, i.e., conflicting)
Same timeline, now with vector clocks:
  CL1 adds item x: A and B store x with clock A.1; then A and B fail
  CL2 adds item y: C and D store y with clock C.1
  A and B recover; CL1 reads the cart
  The read returns x(A.1) and y(C.1); A.1 and C.1 are not causally related: conflict!
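A minimal version-vector comparison, matching the A.1 / C.1 notation above (clocks written as dicts; names are illustrative):

```python
def descends(a: dict, b: dict) -> bool:
    """True if version `a` causally descends from (or equals) version `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def conflicting(a: dict, b: dict) -> bool:
    """Concurrent versions: neither descends from the other."""
    return not descends(a, b) and not descends(b, a)

x_version = {"A": 1}    # x(A.1)
y_version = {"C": 1}    # y(C.1)
print(conflicting(x_version, y_version))   # True -> the reader must reconcile
```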
Conflict resolution (reconciliation)
If vector clocks show the versions are causally related (not really conflicting):
  the system simply overwrites with the later version
For truly conflicting versions:
  the system can resolve them automatically, e.g., last-writer-wins (limited use cases)
  Application-specific resolution (most common): clients resolve the conflict via reads, e.g., merging shopping carts
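A sketch of application-level reconciliation for the cart example: take the union of the items and combine the version vectors element-wise (illustrative only):

```python
def merge_carts(cart_a, clock_a, cart_b, clock_b):
    merged_cart = sorted(set(cart_a) | set(cart_b))
    merged_clock = {node: max(clock_a.get(node, 0), clock_b.get(node, 0))
                    for node in set(clock_a) | set(clock_b)}
    return merged_cart, merged_clock

print(merge_carts(["x"], {"A": 1}, ["y"], {"C": 1}))
# (['x', 'y'], {'A': 1, 'C': 1})  -- written back on the client's next put
```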
Vector clocks: handling conflicting versions (continued)
Continuing the example:
  CL1 reads the cart: returns x(A.1) and y(C.1)
  CL1 adds item z: the client merges the versions and writes back the cart {x, y, z} with context (A.1, C.1)
  The coordinator stores the merged cart with clock (A.2, C.1), which descends from both earlier versions
Anti-entropy (replica synchronization)
Each server keeps one Merkle tree per virtual node (i.e., per range of keys)
  A leaf is the hash of a key's value: # of leaves = # of keys on the virtual node
  An internal node is the hash of its children
Replicas exchange trees top-down, depth by depth
  If the root nodes match, the replicas are identical: stop
  Else, go to the next level and compare nodes pair-wise, descending only into subtrees that differ
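A compact Merkle-tree sketch of this comparison (assumes a power-of-two number of keys for brevity; hash choice and class names are illustrative):

```python
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

class MerkleNode:
    def __init__(self, left=None, right=None, leaf=None):
        self.left, self.right = left, right
        self.hash = h(leaf) if leaf is not None else h(left.hash, right.hash)

def build(values: list[bytes]) -> MerkleNode:
    """Leaves are hashes of key values; internal nodes hash their children."""
    nodes = [MerkleNode(leaf=v) for v in values]
    while len(nodes) > 1:
        nodes = [MerkleNode(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]

def diff(a: MerkleNode, b: MerkleNode, out: list):
    """Descend only into subtrees whose hashes differ; collect differing leaves."""
    if a.hash == b.hash:
        return                      # identical subtrees: nothing to synchronize
    if a.left is None:
        out.append(a)               # a leaf whose value differs between replicas
        return
    diff(a.left, b.left, out)
    diff(a.right, b.right, out)
```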
Failure detection and ring membership
Server A considers B failed if B does not reply to A's messages (even if B replies to C)
  A then tries alternative nodes
As servers join and permanently leave:
  Servers periodically send gossip messages to their neighbors to synchronize who is in the ring
  Some servers are chosen as seeds, i.e., common neighbors of all nodes
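A gossip-style membership sketch (hypothetical; each node keeps a versioned view and periodically merges it with a random peer's view):

```python
import random

class Member:
    def __init__(self, name: str):
        self.name = name
        self.view = {name: (1, True)}    # member -> (version, is_in_ring)
        self.peers: list["Member"] = []  # set externally; seeds would appear in every peer list

    def gossip_once(self):
        peer = random.choice(self.peers)
        peer.merge(self.view)            # push my view
        self.merge(peer.view)            # pull the peer's view

    def merge(self, other_view: dict):
        for member, (version, alive) in other_view.items():
            if member not in self.view or version > self.view[member][0]:
                self.view[member] = (version, alive)   # newer information wins
```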
Conclusion
Availability is important: systems need to be scalable and reliable
Dynamo is eventually consistent; many design decisions trade consistency for availability
Core techniques
  Consistent hashing: data partitioning
  Preference list, sloppy quorum, hinted handoff: handling transient failures
  Vector clocks: conflict resolution
  Anti-entropy: synchronize replicas
  Gossip: synchronize ring membership