Fault-Tolerant Replicated Systems in Computing

Putting it all together for SMR:
Two-Phase Commit, Leader Election, RAFT

CS 240: Computing Systems and Concurrency
Lecture 13
Marco Canini

Credits: Michael Freedman and Kyle Jamieson developed much of the original material.
RAFT slides heavily based on those from Diego Ongaro and John Ousterhout.

Recall: Primary-Backup

Mechanism: Replicate and separate servers
Goal #1: Provide a highly reliable service
Goal #2: Servers should behave just like a single, more reliable server

Extend PB for high availability

(Diagram: Client C, Primary P, Backup A)
Primary gets ops, orders into log
Replicates log of ops to backup
Backup executes ops in same order
Backup takes over if primary fails

But what if network partition rather than primary failure?
"View" server to determine primary
But what if view server fails?
"View" determined via consensus!

Extend PB for high availability

(Diagram: Client C, Primary P, Backups A and B)
1. C → P: request <op>
2. P → A, B: prepare <op>
3. A, B → P: prepared or error
4. P → C: result exec<op> or failed
5. P → A, B: commit <op>
"Okay" (i.e., op is stable) if written to > ½ backups

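Not from the slides: a minimal Go sketch of the primary's side of steps 2–5 above, to make the message flow concrete. The Backup interface and its method names are assumptions for illustration.

    package pb

    import "fmt"

    // Backup abstracts one backup replica. Prepare and Commit are
    // hypothetical RPC stubs standing in for steps 2 and 5 above.
    type Backup interface {
        Prepare(op string) error
        Commit(op string)
    }

    // replicate runs steps 2-5 for a single op at the primary: the op
    // is stable ("okay") only if strictly more than half the backups
    // acknowledged the prepare.
    func replicate(op string, backups []Backup) error {
        acks := 0
        for _, b := range backups {
            if err := b.Prepare(op); err == nil { // steps 2/3
                acks++
            }
        }
        if acks*2 <= len(backups) {
            return fmt.Errorf("op %q not stable: only %d of %d prepared", op, acks, len(backups))
        }
        for _, b := range backups {
            b.Commit(op) // step 5
        }
        return nil // step 4: primary can now report the result to C
    }
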
View changes on failure

(Diagram: Primary P, Backups A and B)
1. Backups monitor primary
2. If a backup thinks primary failed, initiate View Change (leader election)

View changes on failure

(Diagram: Primary P, Backup A)
1. Backups monitor primary
2. If a backup thinks primary failed, initiate View Change (leader election)
Requires 2f + 1 nodes to handle f failures
3. Intuitive safety argument:
View change requires f+1 agreement
Op committed once written to f+1 nodes
At least one node both saw the write and is in the new view
4. More advanced: adding or removing nodes ("reconfiguration")

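The counting behind "2f + 1 nodes to handle f failures" is a pigeonhole fact: any two sets of f+1 nodes drawn from 2f+1 must share a member, so some node in the new view saw every committed op. A hedged helper sketch (the set representation is an assumption for illustration):

    package pb

    // quorumsOverlap illustrates why the safety argument works: a view
    // change gathers f+1 votes, a committed op is on f+1 nodes, and
    // (f+1)+(f+1) = 2f+2 > 2f+1 total nodes, so the two sets must
    // intersect.
    func quorumsOverlap(viewVoters, opHolders map[int]bool) bool {
        for id := range viewVoters {
            if opHolders[id] {
                return true // guaranteed when both sets hold f+1 of 2f+1 IDs
            }
        }
        return false
    }
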
Basic fault-tolerant Replicated State Machine (RSM) approach

1. Consensus protocol to elect leader
2. 2PC to replicate operations from leader
3. All replicas execute ops once committed

Why bother with a leader?

Not necessary, but …
Decomposition: normal operation vs. leader changes
Simplifies normal operation (no conflicts)
More efficient than leader-less approaches
Obvious place to handle non-determinism

Raft: A Consensus Algorithm for Replicated Logs

Diego Ongaro and John Ousterhout
Stanford University

Goal: Replicated Log

(Diagram: clients issue commands such as "shl" to servers; each server runs a Consensus Module, a Log, and a State Machine)
Replicated log => replicated state machine
All servers execute same commands in same order
Consensus module ensures proper log replication

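To make "same commands in same order" concrete, here is a minimal sketch of a deterministic state machine replaying a log. The toy command set echoes the slide's add/shl examples; the rest is assumed for illustration.

    package raft

    // Command is a toy log command mirroring the slide's examples.
    type Command struct {
        Op  string
        Arg int
    }

    // apply replays committed commands, in log order, against a trivial
    // accumulator state machine. Determinism is the point: the same
    // commands in the same order yield the same state on every server.
    func apply(log []Command) int {
        state := 0
        for _, c := range log {
            switch c.Op {
            case "add":
                state += c.Arg
            case "shl":
                state <<= uint(c.Arg)
            }
        }
        return state
    }
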
Raft Overview

1. Leader election
2. Normal operation (basic log replication)
3. Safety and consistency after leader changes
4. Neutralizing old leaders
5. Client interactions
6. Reconfiguration

Server States

(State diagram: Follower, Candidate, Leader)
At any given time, each server is either:
Leader: handles all client interactions, log replication
Follower: completely passive
Candidate: used to elect a new leader
Normal operation: 1 leader, N-1 followers

Liveness Validation

(State diagram: start → Follower; timeout → start election → Candidate; receive votes from majority of servers → Leader; timeout → new election; a server that discovers a current leader or a higher term steps down to Follower)
Servers start as followers
Leaders send heartbeats (empty AppendEntries RPCs) to maintain authority
If electionTimeout elapses with no RPCs (100-500 ms), follower assumes leader has crashed and starts new election

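A minimal sketch of the follower's timer loop described above, assuming channel-based plumbing; the [T, 2T) randomization anticipates the election slides.

    package raft

    import (
        "math/rand"
        "time"
    )

    // runFollower re-arms a randomized election timer on every RPC
    // heard from the leader; if the timer ever fires, the follower
    // assumes the leader crashed and stands for election.
    func runFollower(rpcFromLeader <-chan struct{}, T time.Duration, becomeCandidate func()) {
        for {
            timeout := T + time.Duration(rand.Int63n(int64(T))) // uniform in [T, 2T)
            select {
            case <-rpcFromLeader:
                // heartbeat or AppendEntries arrived: leader is alive
            case <-time.After(timeout):
                becomeCandidate() // no RPC within electionTimeout
                return
            }
        }
    }
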
Terms (aka epochs)

(Timeline: Terms 1-5; each term begins with an election and continues with normal operation under the winner, or ends immediately after a split vote)
Time divided into terms
Election (either failed or resulted in 1 leader)
Normal operation under a single leader
Each server maintains current term value
Key role of terms: identify obsolete information

Elections

Start election:
Increment current term, change to candidate state, vote for self
Send RequestVote to all other servers; retry until either:
1. Receive votes from majority of servers:
Become leader
Send AppendEntries heartbeats to all other servers
2. Receive RPC from valid leader:
Return to follower state
3. No one wins election (election timeout elapses):
Increment term, start new election

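A sketch of one candidate round following these steps; requestVote stands in for the RequestVote RPC, and persisting the term and the self-vote to disk is elided.

    package raft

    // startElection runs one election round. It returns whether this
    // server won a majority and may become leader.
    func startElection(me, term int, peers []int,
        requestVote func(peer, term, candidate int) bool) (leader bool) {
        term++     // increment current term (would be persisted)
        votes := 1 // vote for self
        for _, p := range peers {
            if requestVote(p, term, me) {
                votes++
            }
        }
        // Majority of the full cluster (peers plus self) wins; on a
        // split vote the caller times out and retries in a higher term.
        return votes*2 > len(peers)+1
    }
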
Elections

(Diagram: a majority of servers voted for candidate A, so B can't also get a majority)
Safety: allow at most one winner per term
Each server votes only once per term (persists on disk)
Two different candidates can't get majorities in same term

Liveness: some candidate must eventually win
Each server chooses election timeouts randomly in [T, 2T]
One usually initiates and wins election before others start
Works well if T >> network RTT

Log Structure

(Diagram: leader and follower logs at indices 1-8; each slot holds a term number and a command, e.g. term 1 "add", term 3 "jmp"; the prefix stored on a majority forms the committed entries)
Log entry = <index, term, command>
Log stored on stable storage (disk); survives crashes
Entry committed if known to be stored on majority of servers
Durable / stable, will eventually be executed by state machines

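A sketch of the entry layout and the majority test for commitment; matchIndex is a bookkeeping assumption (how much of the leader's log each server is known to store), not slide material.

    package raft

    // Entry mirrors the slide's <index, term, command> triple; real
    // implementations write entries to stable storage before
    // acknowledging them, so the log survives crashes.
    type Entry struct {
        Index   int
        Term    int
        Command string
    }

    // isCommitted reports whether the entry at a 1-based index is known
    // to be stored on a majority, where matchIndex[i] is how much of
    // the leader's log server i holds (the leader counts itself).
    func isCommitted(index int, matchIndex []int) bool {
        stored := 0
        for _, m := range matchIndex {
            if m >= index {
                stored++
            }
        }
        return stored*2 > len(matchIndex)
    }
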
Normal operation

(Diagram: client submits "shl"; each server runs a Consensus Module, a Log, and a State Machine)
Client sends command to leader
Leader appends command to its log
Leader sends AppendEntries RPCs to followers
Once new entry committed:
Leader passes command to its state machine, sends result to client
Leader piggybacks commitment to followers in later AppendEntries
Followers pass committed commands to their state machines

Crashed / slow followers?
Leader retries RPCs until they succeed
Performance is optimal in common case:
One successful RPC to any majority of servers

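A sketch of the leader's normal-operation path for a single command, with the replication and apply steps abstracted behind assumed helper functions (none of these names are Raft-specified).

    package raft

    // leaderHandles: append locally, replicate until committed, then
    // apply and answer the client.
    func leaderHandles(
        cmd string,
        appendLocal func(cmd string) (index int),
        replicateUntilCommitted func(index int),
        applyAndReply func(index int) string,
    ) string {
        idx := appendLocal(cmd)      // leader appends to its own log
        replicateUntilCommitted(idx) // AppendEntries, retried until a majority stores it
        // Followers learn the new commit index piggybacked on later
        // AppendEntries and then apply the command themselves.
        return applyAndReply(idx) // leader executes and answers the client
    }
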
Log Operation: Highly Coherent

(Diagram: server1 and server2 logs agreeing at indices 1-4 and diverging afterwards)
If log entries on different servers have same index and term:
They store the same command
The logs are identical in all preceding entries
If a given entry is committed, all preceding entries are also committed

Log Operation: Consistency Check

(Diagram: leader/follower log pairs at indices 1-5; AppendEntries succeeds on a matching entry, fails on a mismatch)
AppendEntries has <index, term> of entry preceding new ones
Follower must contain matching entry; otherwise it rejects
Implements an induction step, ensures coherency

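A sketch of the follower-side check, reusing the Entry type from the log-structure sketch; indices are 1-based as in the slides.

    package raft

    // acceptAppend: the RPC carries the <index, term> of the entry
    // preceding the new ones; the follower rejects unless its own log
    // matches at that position.
    func acceptAppend(log []Entry, prevIndex, prevTerm int) bool {
        if prevIndex == 0 {
            return true // appending from the very start always matches
        }
        if prevIndex > len(log) {
            return false // follower is missing the preceding entry
        }
        return log[prevIndex-1].Term == prevTerm
    }
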
 
Leader Changes

(Diagram: divergent logs of s1-s5 after a sequence of leader crashes)
New leader's log is truth: no special steps, start normal operation
Will eventually make followers' logs identical to leader's
Old leader may have left entries partially replicated
Multiple crashes can leave many extraneous log entries

Safety Requirement

Once a log entry has been applied to a state machine, no other state machine may apply a different value for that log entry

Raft safety property: If a leader has decided a log entry is committed, the entry will be present in the logs of all future leaders

Why does this guarantee the higher-level goal?
1. Leaders never overwrite entries in their logs
2. Only entries in leader's log can be committed
3. Entries must be committed before applying to state machine

Committed ⇒ present in future leaders' logs
Restrictions on commitment
Restrictions on leader election

Picking the Best Leader

(Diagram: logs of s1 and s2 at indices 1-5; during a leader transition you can't tell which entries are committed!)
Elect candidate most likely to contain all committed entries
In RequestVote, candidates include index + term of last log entry
Voter V denies vote if its log is "more complete":
(newer term) or (entry in higher index of same term)
Leader will have "most complete" log among electing majority

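The "more complete" comparison as a sketch; the four integers are the <last term, last index> pairs of the voter's log and of the candidate's RequestVote.

    package raft

    // grantVote returns false when the voter's own log is more
    // complete than the candidate's.
    func grantVote(voterLastTerm, voterLastIndex, candLastTerm, candLastIndex int) bool {
        if candLastTerm != voterLastTerm {
            return candLastTerm > voterLastTerm // newer last term is more complete
        }
        return candLastIndex >= voterLastIndex // same term: the longer log wins
    }
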
Committing Entry from Current Term

(Diagram: leader for term 2 with logs of s1-s5; AppendEntries for entry 4 just succeeded on a majority, and s5, which lacks it, can't be elected leader for term 3)
Case #1: Leader decides entry in current term is committed
Safe: leader for term 3 must contain entry 4

Committing Entry from Earlier Term

(Diagram: leader for term 4 with logs of s1-s5; entry 3, from term 2, has just reached a majority via AppendEntries, while s5 holds conflicting term-3 entries)
Case #2: Leader trying to finish committing entry from an earlier term
Entry 3 not safely committed:
s5 can be elected as leader for term 5 (how?)
If elected, it will overwrite entry 3 on s1, s2, and s3

New Commitment Rules

(Diagram: leader for term 4; entry 4, from the leader's own term, is now stored on a majority alongside entry 3)
For leader to decide entry is committed:
1. Entry stored on a majority
2. ≥ 1 new entry from leader's term also on majority
Example: Once e4 committed, s5 cannot be elected leader for term 5, and e3 and e4 are both safe
Combination of election rules and commitment rules makes Raft safe

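The two rules as a sketch, reusing Entry and the matchIndex bookkeeping from the earlier sketches.

    package raft

    // safelyCommitted: entry `index` counts as committed only if it is
    // on a majority AND some entry of the leader's current term is also
    // on a majority at or after it.
    func safelyCommitted(log []Entry, index, currentTerm int, matchIndex []int) bool {
        onMajority := func(i int) bool {
            n := 0
            for _, m := range matchIndex {
                if m >= i {
                    n++
                }
            }
            return n*2 > len(matchIndex)
        }
        if !onMajority(index) {
            return false // rule 1 fails
        }
        for i := index; i <= len(log); i++ { // 1-based indices
            if log[i-1].Term == currentTerm && onMajority(i) {
                return true // rule 2: a current-term entry anchors it
            }
        }
        return false
    }
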
Challenge: Log Inconsistencies

(Diagram: log of the leader for term 8 at indices 1-12, and possible followers (a)-(f), some with missing entries and some with extraneous entries)
Leader changes can result in log inconsistencies

Repairing Follower Logs

(Diagram: leader for term 7 with a nextIndex marker per follower; follower (a) is missing entries, follower (b) has extraneous ones)
New leader must make follower logs consistent with its own
Delete extraneous entries
Fill in missing entries
Leader keeps nextIndex for each follower:
Index of next log entry to send to that follower
Initialized to (1 + leader's last index)
If AppendEntries consistency check fails, decrement nextIndex, try again

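The backoff loop as a sketch; sendAppend stands in for the AppendEntries RPC and reports whether the follower's consistency check passed.

    package raft

    // repairFollower starts nextIndex just past the leader's last entry
    // and walks backwards until the follower accepts, then ships the
    // rest of the leader's log from that point on.
    func repairFollower(leaderLog []Entry, sendAppend func(prevIndex int, entries []Entry) bool) {
        nextIndex := len(leaderLog) + 1 // 1 + leader's last index
        for nextIndex > 0 {
            if sendAppend(nextIndex-1, leaderLog[nextIndex-1:]) {
                return // matched: extraneous follower entries get overwritten
            }
            nextIndex-- // mismatch: back up one entry and retry
        }
    }
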
Repairing Follower Logs

(Diagram: follower (f) before repair, holding extraneous term-2 and term-3 entries, and after repair, matching the leader's log up to nextIndex)

Neutralizing Old Leaders

Leader temporarily disconnected
→ other servers elect new leader
→ old leader reconnected
→ old leader attempts to commit log entries

Terms used to detect stale leaders (and candidates)
Every RPC contains term of sender
Sender's term < receiver's: receiver rejects RPC (via ACK, which sender processes…)
Receiver's term < sender's: receiver reverts to follower, updates term, processes RPC

Election updates terms of majority of servers
Deposed server cannot commit new log entries

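The term exchange as a sketch; becomeFollower is an assumed helper for the step-down bookkeeping.

    package raft

    // onRPC applies the term rules above on receipt of any RPC and
    // reports whether the RPC should be processed.
    func onRPC(senderTerm int, currentTerm *int, becomeFollower func()) (process bool) {
        switch {
        case senderTerm < *currentTerm:
            return false // stale sender: reject; the reply carries our newer term
        case senderTerm > *currentTerm:
            *currentTerm = senderTerm // we were behind
            becomeFollower()          // a leader or candidate reverts to follower
        }
        return true
    }
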
Client Protocol

Send commands to leader
If leader unknown, contact any server, which redirects client to leader
Leader only responds after command logged, committed, and executed by leader
If request times out (e.g., leader crashes):
Client reissues command to new leader (after possible redirect)

Ensure exactly-once semantics even with leader failures
E.g., leader can execute command then crash before responding
Client should embed unique ID in each command
This client ID included in log entry
Before accepting request, leader checks log for entry with same ID

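The duplicate-ID check as a sketch; the struct shape is an assumption.

    package raft

    // ClientCmd carries the client-chosen unique ID that is stored in
    // the log entry alongside the command.
    type ClientCmd struct {
        ID  string // unique per client request
        Cmd string
    }

    // alreadyLogged is the leader-side check behind exactly-once
    // semantics: a retried command whose ID already appears in the log
    // is answered from the log rather than re-executed.
    func alreadyLogged(log []ClientCmd, id string) bool {
        for _, e := range log {
            if e.ID == id {
                return true
            }
        }
        return false
    }
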
Reconfiguration

Configuration Changes

View configuration: { leader, { members }, settings }
Consensus must support changes to configuration:
Replace failed machine
Change degree of replication
Cannot switch directly from one config to another: conflicting majorities could arise
(Diagram: during a naive switch, a majority of C_old and a majority of C_new can decide independently at the same time)

2-Phase Approach via Joint Consensus

(Timeline: C_old → C_old+new entry committed → C_new entry committed; C_old can make unilateral decisions before the joint entry is committed, C_new only after the final entry is committed)
Joint consensus in intermediate phase: need majority of both old and new configurations for elections, commitment
Configuration change is just a log entry; applied immediately on receipt (committed or not)
Once joint consensus is committed, begin replicating log entry for final configuration

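The joint-majority rule as a sketch; votes maps server IDs to whether they agreed.

    package raft

    // jointQuorum: during the C_old+new phase, an election or a commit
    // needs a majority of BOTH configurations, never one mixed majority.
    func jointQuorum(votes map[int]bool, oldCfg, newCfg []int) bool {
        majority := func(cfg []int) bool {
            n := 0
            for _, id := range cfg {
                if votes[id] {
                    n++
                }
            }
            return n*2 > len(cfg)
        }
        return majority(oldCfg) && majority(newCfg)
    }
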
2-Phase Approach via Joint Consensus

(Timeline as above; a leader not in C_new steps down at the point where the C_new entry is committed)
Any server from either configuration can serve as leader
If leader not in C_new, must step down once C_new committed

Viewstamped Replication: A new primary copy method to support highly-available distributed systems
Oki and Liskov, PODC 1988

Raft vs. VR

Strong leader
Log entries flow only from leader to other servers
Select leader from limited set so it doesn't need to "catch up"
Leader election
Randomized timers to initiate elections
Membership changes
New joint consensus approach with overlapping majorities
Cluster can operate normally during configuration changes