Raft Consensus Algorithm Overview

undefined
C
o
n
s
e
n
s
u
s
 
P
a
r
t
 
I
:
F
a
i
l
-
s
t
o
p
 
F
a
i
l
u
r
e
s
Slide 1
CPS 514 / ECE 558
undefined
S
e
r
v
e
r
s
 
m
a
y
 
c
r
a
s
h
,
 
b
u
t
 
m
a
i
n
t
a
i
n
 
p
e
r
s
i
s
t
e
n
t
 
s
t
o
r
a
g
e
M
e
s
s
a
g
e
s
 
m
a
y
 
b
e
 
l
o
s
t
N
e
t
w
o
r
k
 
m
a
y
 
b
e
 
d
i
s
c
o
n
n
e
c
t
e
d
S
e
r
v
e
r
s
 
d
o
n
t
 
m
i
s
b
e
h
a
v
e
!
G
o
a
l
 
i
s
 
t
o
 
m
a
k
e
 
p
r
o
g
r
e
s
s
 
w
h
e
n
 
a
 
m
a
j
o
r
i
t
y
 
o
f
 
s
e
r
v
e
r
s
a
r
e
 
u
p
 
a
n
d
 
c
a
n
 
c
o
m
m
u
n
i
c
a
t
e
.
Slide 2
Fail-stop Failures
undefined
G
o
a
l
 
i
s
 
t
o
 
o
p
e
r
a
t
e
 
m
u
l
t
i
p
l
e
 
s
e
r
v
e
r
s
 
i
n
 
i
d
e
n
t
i
c
a
l
 
s
t
a
t
e
f
o
r
 
f
a
u
l
t
-
t
o
l
e
r
a
n
c
e
E
x
a
m
p
l
e
 
s
e
r
v
i
c
e
s
:
File system
Database
Shared memory
Ledger
Key-value store
C
l
i
e
n
t
s
 
i
s
s
u
e
 
r
e
q
u
e
s
t
s
 
t
o
 
c
h
a
n
g
e
 
s
t
a
t
e
 
m
a
c
h
i
n
e
s
S
e
r
v
e
r
s
 
m
a
i
n
t
a
i
n
 
i
d
e
n
t
i
c
a
l
 
l
o
g
s
 
o
f
 
l
i
n
e
a
r
i
z
e
d
 
c
l
i
e
n
t
r
e
q
u
e
s
t
s
S
e
r
v
e
r
s
 
m
a
i
n
t
a
i
n
 
i
d
e
n
t
i
c
a
l
 
s
t
a
t
e
 
m
a
c
h
i
n
e
s
Consensus
undefined
Raft: A Consensus Algorithm
for Replicated Logs
D
i
e
g
o
 
O
n
g
a
r
o
 
a
n
d
 
J
o
h
n
 
O
u
s
t
e
r
h
o
u
t
S
t
a
n
f
o
r
d
 
U
n
i
v
e
r
s
i
t
y
undefined
R
e
p
l
i
c
a
t
e
d
 
l
o
g
 
=
>
 
r
e
p
l
i
c
a
t
e
d
 
s
t
a
t
e
 
m
a
c
h
i
n
e
All servers execute same commands in same order
C
o
n
s
e
n
s
u
s
 
m
o
d
u
l
e
 
e
n
s
u
r
e
s
 
p
r
o
p
e
r
 
l
o
g
 
r
e
p
l
i
c
a
t
i
o
n
S
y
s
t
e
m
 
m
a
k
e
s
 
p
r
o
g
r
e
s
s
 
a
s
 
l
o
n
g
 
a
s
 
a
n
y
 
m
a
j
o
r
i
t
y
 
o
f
 
s
e
r
v
e
r
s
 
a
r
e
 
u
p
F
a
i
l
u
r
e
 
m
o
d
e
l
:
 
f
a
i
l
-
s
t
o
p
 
(
n
o
t
 
B
y
z
a
n
t
i
n
e
)
,
 
d
e
l
a
y
e
d
/
l
o
s
t
 
m
e
s
s
a
g
e
s
March 3, 2013
Raft Consensus Algorithm
Slide 5
Goal: Replicated Log
S
e
r
v
e
r
s
C
l
i
e
n
t
s
shl
undefined
T
w
o
 
g
e
n
e
r
a
l
 
a
p
p
r
o
a
c
h
e
s
 
t
o
 
c
o
n
s
e
n
s
u
s
:
S
y
m
m
e
t
r
i
c
,
 
l
e
a
d
e
r
-
l
e
s
s
:
All servers have equal roles
Clients can contact any server
A
s
y
m
m
e
t
r
i
c
,
 
l
e
a
d
e
r
-
b
a
s
e
d
:
At any given time, one server is in charge, others accept its
decisions
Clients communicate with the leader
R
a
f
t
 
u
s
e
s
 
a
 
l
e
a
d
e
r
:
Decomposes the problem (normal operation, leader changes)
Simplifies normal operation (no conflicts)
More efficient than leader-less approaches
March 3, 2013
Raft Consensus Algorithm
Slide 6
Approaches to Consensus
undefined
1.
L
e
a
d
e
r
 
e
l
e
c
t
i
o
n
:
Select one of the servers to act as leader
Detect crashes, choose new leader
2.
N
o
r
m
a
l
 
o
p
e
r
a
t
i
o
n
 
(
b
a
s
i
c
 
l
o
g
 
r
e
p
l
i
c
a
t
i
o
n
)
3.
S
a
f
e
t
y
 
a
n
d
 
c
o
n
s
i
s
t
e
n
c
y
 
a
f
t
e
r
 
l
e
a
d
e
r
 
c
h
a
n
g
e
s
4.
N
e
u
t
r
a
l
i
z
i
n
g
 
o
l
d
 
l
e
a
d
e
r
s
5.
C
l
i
e
n
t
 
i
n
t
e
r
a
c
t
i
o
n
s
Implementing linearizeable semantics
6.
C
o
n
f
i
g
u
r
a
t
i
o
n
 
c
h
a
n
g
e
s
:
 Adding and removing servers
March 3, 2013
Raft Consensus Algorithm
Slide 7
Raft Overview
undefined
A
t
 
a
n
y
 
g
i
v
e
n
 
t
i
m
e
,
 
e
a
c
h
 
s
e
r
v
e
r
 
i
s
 
e
i
t
h
e
r
:
Leader
: handles all client interactions, log replication
At most 1 viable leader at a time
Follower
: completely passive (issues no RPCs, responds to
incoming RPCs)
Candidate
: used to elect a new leader
N
o
r
m
a
l
 
o
p
e
r
a
t
i
o
n
:
 
1
 
l
e
a
d
e
r
,
 
N
-
1
 
f
o
l
l
o
w
e
r
s
March 3, 2013
Raft Consensus Algorithm
Slide 8
Server States
Follower
Candidate
Leader
start
timeout,
start election
receive votes from
majority of servers
timeout,
new election
discover server with
 higher term
discover current leader
or higher term
“step
down”
undefined
T
i
m
e
 
d
i
v
i
d
e
d
 
i
n
t
o
 
t
e
r
m
s
:
Election
Normal operation under a single leader
A
t
 
m
o
s
t
 
1
 
l
e
a
d
e
r
 
p
e
r
 
t
e
r
m
S
o
m
e
 
t
e
r
m
s
 
h
a
v
e
 
n
o
 
l
e
a
d
e
r
 
(
f
a
i
l
e
d
 
e
l
e
c
t
i
o
n
)
E
a
c
h
 
s
e
r
v
e
r
 
m
a
i
n
t
a
i
n
s
 
c
u
r
r
e
n
t
 
t
e
r
m
 
v
a
l
u
e
K
e
y
 
r
o
l
e
 
o
f
 
t
e
r
m
s
:
 
i
d
e
n
t
i
f
y
 
o
b
s
o
l
e
t
e
 
i
n
f
o
r
m
a
t
i
o
n
March 3, 2013
Raft Consensus Algorithm
Slide 9
Terms
Term 1
Term 2
Term 3
Term 4
Term 5
time
Elections
Normal Operation
Split Vote
March 3, 2013
Raft Consensus Algorithm
Slide 10
R
a
f
t
 
P
r
o
t
o
c
o
l
 
S
u
m
m
a
r
y
undefined
S
e
r
v
e
r
s
 
s
t
a
r
t
 
u
p
 
a
s
 
f
o
l
l
o
w
e
r
s
F
o
l
l
o
w
e
r
s
 
e
x
p
e
c
t
 
t
o
 
r
e
c
e
i
v
e
 
R
P
C
s
 
f
r
o
m
 
l
e
a
d
e
r
s
 
o
r
c
a
n
d
i
d
a
t
e
s
L
e
a
d
e
r
s
 
m
u
s
t
 
s
e
n
d
 
h
e
a
r
t
b
e
a
t
s
 
(
e
m
p
t
y
A
p
p
e
n
d
E
n
t
r
i
e
s
 
R
P
C
s
)
 
t
o
 
m
a
i
n
t
a
i
n
 
a
u
t
h
o
r
i
t
y
I
f
 
e
l
e
c
t
i
o
n
T
i
m
e
o
u
t
 
e
l
a
p
s
e
s
 
w
i
t
h
 
n
o
 
R
P
C
s
:
Follower assumes leader has crashed
Follower starts new election
Timeouts typically 100-500ms
March 3, 2013
Raft Consensus Algorithm
Slide 11
Heartbeats and Timeouts
undefined
I
n
c
r
e
m
e
n
t
 
c
u
r
r
e
n
t
 
t
e
r
m
C
h
a
n
g
e
 
t
o
 
C
a
n
d
i
d
a
t
e
 
s
t
a
t
e
V
o
t
e
 
f
o
r
 
s
e
l
f
S
e
n
d
 
R
e
q
u
e
s
t
V
o
t
e
 
R
P
C
s
 
t
o
 
a
l
l
 
o
t
h
e
r
 
s
e
r
v
e
r
s
,
 
r
e
t
r
y
u
n
t
i
l
 
e
i
t
h
e
r
:
1.
Receive votes from majority of servers:
Become leader
Send AppendEntries heartbeats to all other servers
2.
Receive RPC from valid leader:
Return to follower state
3.
No-one wins election (election timeout elapses):
Increment term, start new election
March 3, 2013
Raft Consensus Algorithm
Slide 12
Election Basics
undefined
S
a
f
e
t
y
:
 
 
a
l
l
o
w
 
a
t
 
m
o
s
t
 
o
n
e
 
w
i
n
n
e
r
 
p
e
r
 
t
e
r
m
Each server gives out only one vote per term (persist on disk)
Two different candidates can’t accumulate majorities in same
term
L
i
v
e
n
e
s
s
:
 
s
o
m
e
 
c
a
n
d
i
d
a
t
e
 
m
u
s
t
 
e
v
e
n
t
u
a
l
l
y
 
w
i
n
Choose election timeouts randomly in [T, 2T]
One server usually times out and wins election before others
wake up
Works well if T >> broadcast time
March 3, 2013
Raft Consensus Algorithm
Slide 13
Elections, cont’d
Servers
Voted for
candidate A
B can’t also
get majority
undefined
L
o
g
 
e
n
t
r
y
 
=
 
i
n
d
e
x
,
 
t
e
r
m
,
 
c
o
m
m
a
n
d
L
o
g
 
s
t
o
r
e
d
 
o
n
 
s
t
a
b
l
e
 
s
t
o
r
a
g
e
 
(
d
i
s
k
)
;
 
s
u
r
v
i
v
e
s
 
c
r
a
s
h
e
s
E
n
t
r
y
 
c
o
m
m
i
t
t
e
d
 
i
f
 
k
n
o
w
n
 
t
o
 
b
e
 
s
t
o
r
e
d
 
o
n
 
m
a
j
o
r
i
t
y
 
o
f
 
s
e
r
v
e
r
s
Durable, will eventually be executed by state machines
March 3, 2013
Raft Consensus Algorithm
Slide 14
Log Structure
1
add
1
2
3
4
5
6
7
8
3
jmp
1
cmp
1
ret
2
mov
3
div
3
shl
3
sub
1
add
3
jmp
1
cmp
1
ret
2
mov
1
add
3
jmp
1
cmp
1
ret
2
mov
3
div
3
shl
3
sub
1
add
1
cmp
1
add
3
jmp
1
cmp
1
ret
2
mov
3
div
3
shl
leader
log index
followers
committed entries
term
command
undefined
C
l
i
e
n
t
 
s
e
n
d
s
 
c
o
m
m
a
n
d
 
t
o
 
l
e
a
d
e
r
L
e
a
d
e
r
 
a
p
p
e
n
d
s
 
c
o
m
m
a
n
d
 
t
o
 
i
t
s
 
l
o
g
L
e
a
d
e
r
 
s
e
n
d
s
 
A
p
p
e
n
d
E
n
t
r
i
e
s
 
R
P
C
s
 
t
o
 
f
o
l
l
o
w
e
r
s
O
n
c
e
 
n
e
w
 
e
n
t
r
y
 
c
o
m
m
i
t
t
e
d
:
Leader passes command to its state machine, returns result to
client
Leader notifies followers of committed entries in subsequent
AppendEntries RPCs
Followers pass committed commands to their state machines
C
r
a
s
h
e
d
/
s
l
o
w
 
f
o
l
l
o
w
e
r
s
?
Leader retries RPCs until they succeed
P
e
r
f
o
r
m
a
n
c
e
 
i
s
 
o
p
t
i
m
a
l
 
i
n
 
c
o
m
m
o
n
 
c
a
s
e
:
One successful RPC to any majority of servers
March 3, 2013
Raft Consensus Algorithm
Slide 15
Normal Operation
undefined
H
i
g
h
 
l
e
v
e
l
 
o
f
 
c
o
h
e
r
e
n
c
y
 
b
e
t
w
e
e
n
 
l
o
g
s
:
I
f
 
l
o
g
 
e
n
t
r
i
e
s
 
o
n
 
d
i
f
f
e
r
e
n
t
 
s
e
r
v
e
r
s
 
h
a
v
e
 
s
a
m
e
 
i
n
d
e
x
a
n
d
 
t
e
r
m
:
They store the same command
The logs are identical in all preceding entries
I
f
 
a
 
g
i
v
e
n
 
e
n
t
r
y
 
i
s
 
c
o
m
m
i
t
t
e
d
,
 
a
l
l
 
p
r
e
c
e
d
i
n
g
 
e
n
t
r
i
e
s
a
r
e
 
a
l
s
o
 
c
o
m
m
i
t
t
e
d
March 3, 2013
Raft Consensus Algorithm
Slide 16
Log Consistency
1
add
1
2
3
4
5
6
3
jmp
1
cmp
1
ret
2
mov
3
div
4
sub
1
add
3
jmp
1
cmp
1
ret
2
mov
undefined
E
a
c
h
 
A
p
p
e
n
d
E
n
t
r
i
e
s
 
R
P
C
 
c
o
n
t
a
i
n
s
 
i
n
d
e
x
,
 
t
e
r
m
 
o
f
e
n
t
r
y
 
p
r
e
c
e
d
i
n
g
 
n
e
w
 
o
n
e
s
F
o
l
l
o
w
e
r
 
m
u
s
t
 
c
o
n
t
a
i
n
 
m
a
t
c
h
i
n
g
 
e
n
t
r
y
;
 
 
o
t
h
e
r
w
i
s
e
 
i
t
r
e
j
e
c
t
s
 
r
e
q
u
e
s
t
I
m
p
l
e
m
e
n
t
s
 
a
n
 
i
n
d
u
c
t
i
o
n
 
s
t
e
p
,
 
e
n
s
u
r
e
s
 
c
o
h
e
r
e
n
c
y
March 3, 2013
Raft Consensus Algorithm
Slide 17
AppendEntries Consistency Check
1
add
3
jmp
1
cmp
1
ret
2
mov
1
add
1
cmp
1
ret
2
mov
leader
follower
1
2
3
4
5
1
add
3
jmp
1
cmp
1
ret
2
mov
1
add
1
cmp
1
ret
1
shl
leader
follower
AppendEntries succeeds:
matching entry
AppendEntries fails:
mismatch
undefined
A
t
 
b
e
g
i
n
n
i
n
g
 
o
f
 
n
e
w
 
l
e
a
d
e
r
s
 
t
e
r
m
:
Old leader may have left entries partially replicated
No special steps by new leader: just start normal operation
Leader’s log is “the truth”
Will eventually make follower’s logs identical to leader’s
Multiple crashes can leave many extraneous log entries:
March 3, 2013
Raft Consensus Algorithm
Slide 18
Leader Changes
1
2
3
4
5
6
7
8
log index
1
1
1
1
5
5
6
6
6
6
1
1
5
5
1
4
1
1
1
7
7
2
2
3
3
3
2
7
term
s
1
s
2
s
3
s
4
s
5
undefined
O
n
c
e
 
a
 
l
o
g
 
e
n
t
r
y
 
h
a
s
 
b
e
e
n
 
a
p
p
l
i
e
d
 
t
o
 
a
 
s
t
a
t
e
 
m
a
c
h
i
n
e
,
n
o
 
o
t
h
e
r
 
s
t
a
t
e
 
m
a
c
h
i
n
e
 
m
u
s
t
 
a
p
p
l
y
 
a
 
d
i
f
f
e
r
e
n
t
 
v
a
l
u
e
 
f
o
r
t
h
a
t
 
l
o
g
 
e
n
t
r
y
R
a
f
t
 
s
a
f
e
t
y
 
p
r
o
p
e
r
t
y
:
If a leader has decided that a log entry is committed, that entry
will be present in the logs of all future leaders
T
h
i
s
 
g
u
a
r
a
n
t
e
e
s
 
t
h
e
 
s
a
f
e
t
y
 
r
e
q
u
i
r
e
m
e
n
t
Leaders never overwrite entries in their logs
Only entries in the leader’s log can be committed
Entries must be committed before applying to state machine
March 3, 2013
Raft Consensus Algorithm
Slide 19
Safety Requirement
C
o
m
m
i
t
t
e
d
 
 
P
r
e
s
e
n
t
 
i
n
 
f
u
t
u
r
e
 
l
e
a
d
e
r
s
 
l
o
g
s
Restrictions on
commitment
Restrictions on
leader election
undefined
C
a
n
t
 
t
e
l
l
 
w
h
i
c
h
 
e
n
t
r
i
e
s
 
a
r
e
 
c
o
m
m
i
t
t
e
d
!
D
u
r
i
n
g
 
e
l
e
c
t
i
o
n
s
,
 
c
h
o
o
s
e
 
c
a
n
d
i
d
a
t
e
 
w
i
t
h
 
l
o
g
 
m
o
s
t
l
i
k
e
l
y
 
t
o
 
c
o
n
t
a
i
n
 
a
l
l
 
c
o
m
m
i
t
t
e
d
 
e
n
t
r
i
e
s
Candidates include log info in RequestVote RPCs
(index & term of last log entry)
Voting server V denies vote if its log is “more complete”:
(lastTerm
V
 > lastTerm
C
) ||
(lastTerm
V
 == lastTerm
C
) && (lastIndex
V
 > lastIndex
C
)
Leader will have “most complete” log among electing majority
March 3, 2013
Raft Consensus Algorithm
Slide 20
Picking the Best Leader
1
2
1
1
2
1
2
3
4
5
1
2
1
1
1
2
1
1
2
unavailable during
leader transition
committed?
undefined
C
a
s
e
 
#
1
/
2
:
 
L
e
a
d
e
r
 
d
e
c
i
d
e
s
 
e
n
t
r
y
 
i
n
 
c
u
r
r
e
n
t
 
t
e
r
m
 
i
s
c
o
m
m
i
t
t
e
d
S
a
f
e
:
 
l
e
a
d
e
r
 
f
o
r
 
t
e
r
m
 
3
 
m
u
s
t
 
c
o
n
t
a
i
n
 
e
n
t
r
y
 
4
March 3, 2013
Raft Consensus Algorithm
Slide 21
Committing Entry from Current Term
1
2
3
4
5
6
1
1
1
1
1
1
1
2
1
1
1
s
1
s
2
s
3
s
4
s
5
2
2
2
2
2
2
2
AppendEntries just
succeeded
Can’t be elected as
leader for term 3
Leader for
term 2
undefined
C
a
s
e
 
#
2
/
2
:
 
L
e
a
d
e
r
 
i
s
 
t
r
y
i
n
g
 
t
o
 
f
i
n
i
s
h
 
c
o
m
m
i
t
t
i
n
g
 
e
n
t
r
y
f
r
o
m
 
a
n
 
e
a
r
l
i
e
r
 
t
e
r
m
E
n
t
r
y
 
3
 
n
o
t
 
s
a
f
e
l
y
 
c
o
m
m
i
t
t
e
d
:
s
5
 can be elected as leader for term 5
If elected, it will overwrite entry 3 on s
1
, s
2
, and s
3
!
March 3, 2013
Raft Consensus Algorithm
Slide 22
Committing Entry from Earlier Term
1
2
3
4
5
6
1
1
1
1
1
1
1
2
1
1
1
s
1
s
2
s
3
s
4
s
5
2
2
AppendEntries just
succeeded
3
4
3
Leader for
term 4
3
undefined
F
o
r
 
a
 
l
e
a
d
e
r
 
t
o
 
d
e
c
i
d
e
 
a
n
e
n
t
r
y
 
i
s
 
c
o
m
m
i
t
t
e
d
:
Must be stored on a majority
of servers
At least one new entry from
leader’s term must also be
stored on majority of servers
O
n
c
e
 
e
n
t
r
y
 
4
 
c
o
m
m
i
t
t
e
d
:
s
5
 cannot be elected leader
for term 5
Entries 3 and 4 both safe
March 3, 2013
Raft Consensus Algorithm
Slide 23
New Commitment Rules
1
2
3
4
5
1
1
1
1
1
1
1
2
1
1
1
s
1
s
2
s
3
s
4
s
5
2
2
3
4
3
Leader for
term 4
4
4
C
o
m
b
i
n
a
t
i
o
n
 
o
f
 
e
l
e
c
t
i
o
n
 
r
u
l
e
s
 
a
n
d
 
c
o
m
m
i
t
m
e
n
t
 
r
u
l
e
s
m
a
k
e
s
 
R
a
f
t
 
s
a
f
e
3
undefined
L
e
a
d
e
r
 
c
h
a
n
g
e
s
 
c
a
n
 
r
e
s
u
l
t
 
i
n
 
l
o
g
 
i
n
c
o
n
s
i
s
t
e
n
c
i
e
s
:
March 3, 2013
Raft Consensus Algorithm
Slide 24
Log Inconsistencies
1
4
1
1
4
5
5
6
6
6
1
2
3
4
5
6
7
8
9
10
11
12
log index
leader for
term 8
1
4
1
1
4
5
5
6
6
1
4
1
1
1
4
1
1
4
5
5
6
6
6
6
1
4
1
1
4
5
5
6
6
6
1
4
1
1
4
1
1
1
possible
followers
4
4
7
7
2
2
3
3
3
3
3
2
(a)
(b)
(c)
(d)
(e)
(f)
Extraneous
Entries
Missing
Entries
undefined
March 3, 2013
Raft Consensus Algorithm
N
e
w
 
l
e
a
d
e
r
 
m
u
s
t
 
m
a
k
e
 
f
o
l
l
o
w
e
r
 
l
o
g
s
 
c
o
n
s
i
s
t
e
n
t
 
w
i
t
h
 
i
t
s
 
o
w
n
Delete extraneous entries
Fill in missing entries
L
e
a
d
e
r
 
k
e
e
p
s
 
n
e
x
t
I
n
d
e
x
 
f
o
r
 
e
a
c
h
 
f
o
l
l
o
w
e
r
:
Index of next log entry to send to that follower
Initialized to (1 + leader’s last index)
W
h
e
n
 
A
p
p
e
n
d
E
n
t
r
i
e
s
 
c
o
n
s
i
s
t
e
n
c
y
 
c
h
e
c
k
 
f
a
i
l
s
,
 
d
e
c
r
e
m
e
n
t
n
e
x
t
I
n
d
e
x
 
a
n
d
 
t
r
y
 
a
g
a
i
n
:
Repairing Follower Logs
1
4
1
1
4
5
5
6
6
6
1
2
3
4
5
6
7
8
9
10
11
12
log index
leader for term 7
1
4
1
1
1
1
1
followers
2
2
3
3
3
3
3
2
(a)
(b)
nextIndex
Slide 25
undefined
W
h
e
n
 
f
o
l
l
o
w
e
r
 
o
v
e
r
w
r
i
t
e
s
 
i
n
c
o
n
s
i
s
t
e
n
t
 
e
n
t
r
y
,
 
i
t
d
e
l
e
t
e
s
 
a
l
l
 
s
u
b
s
e
q
u
e
n
t
 
e
n
t
r
i
e
s
:
March 3, 2013
Raft Consensus Algorithm
Slide 26
Repairing Logs, cont’d
1
4
1
1
4
5
5
6
6
6
1
2
3
4
5
6
7
8
9
10
11
log index
leader for term 7
1
1
1
follower (before)
2
2
3
3
3
3
3
2
nextIndex
1
1
1
follower (after)
4
undefined
D
e
p
o
s
e
d
 
l
e
a
d
e
r
 
m
a
y
 
n
o
t
 
b
e
 
d
e
a
d
:
Temporarily disconnected from network
Other servers elect a new leader
Old leader becomes reconnected, attempts to commit log entries
T
e
r
m
s
 
u
s
e
d
 
t
o
 
d
e
t
e
c
t
 
s
t
a
l
e
 
l
e
a
d
e
r
s
 
(
a
n
d
 
c
a
n
d
i
d
a
t
e
s
)
Every RPC contains term of sender
If sender’s term is older, RPC is rejected, sender reverts to
follower and updates its term
If receiver’s term is older, it reverts to follower, updates its term,
then processes RPC normally
E
l
e
c
t
i
o
n
 
u
p
d
a
t
e
s
 
t
e
r
m
s
 
o
f
 
m
a
j
o
r
i
t
y
 
o
f
 
s
e
r
v
e
r
s
Deposed server cannot commit new log entries
March 3, 2013
Raft Consensus Algorithm
Slide 27
Neutralizing Old Leaders
undefined
S
e
n
d
 
c
o
m
m
a
n
d
s
 
t
o
 
l
e
a
d
e
r
If leader unknown, contact any server
If contacted server not leader, it will redirect to leader
L
e
a
d
e
r
 
d
o
e
s
 
n
o
t
 
r
e
s
p
o
n
d
 
u
n
t
i
l
 
c
o
m
m
a
n
d
 
h
a
s
 
b
e
e
n
l
o
g
g
e
d
,
 
c
o
m
m
i
t
t
e
d
,
 
a
n
d
 
e
x
e
c
u
t
e
d
 
b
y
 
l
e
a
d
e
r
s
 
s
t
a
t
e
m
a
c
h
i
n
e
I
f
 
r
e
q
u
e
s
t
 
t
i
m
e
s
 
o
u
t
 
(
e
.
g
.
,
 
l
e
a
d
e
r
 
c
r
a
s
h
)
:
Client reissues command to some other server
Eventually redirected to new leader
Retry request with new leader
March 3, 2013
Raft Consensus Algorithm
Slide 28
Client Protocol
undefined
W
h
a
t
 
i
f
 
l
e
a
d
e
r
 
c
r
a
s
h
e
s
 
a
f
t
e
r
 
e
x
e
c
u
t
i
n
g
 
c
o
m
m
a
n
d
,
 
b
u
t
b
e
f
o
r
e
 
r
e
s
p
o
n
d
i
n
g
?
 Must not execute command twice
S
o
l
u
t
i
o
n
:
 
c
l
i
e
n
t
 
e
m
b
e
d
s
 
a
 
u
n
i
q
u
e
 
i
d
 
i
n
 
e
a
c
h
c
o
m
m
a
n
d
Server includes id in log entry
Before accepting command, leader checks its log for entry with
that id
If id found in log, ignore new command, return response from old
command
R
e
s
u
l
t
:
 
e
x
a
c
t
l
y
-
o
n
c
e
 
s
e
m
a
n
t
i
c
s
 
a
s
 
l
o
n
g
 
a
s
 
c
l
i
e
n
t
d
o
e
s
n
t
 
c
r
a
s
h
March 3, 2013
Raft Consensus Algorithm
Slide 29
Client Protocol, cont’d
undefined
S
y
s
t
e
m
 
c
o
n
f
i
g
u
r
a
t
i
o
n
:
ID, address for each server
Determines what constitutes a majority
C
o
n
s
e
n
s
u
s
 
m
e
c
h
a
n
i
s
m
 
m
u
s
t
 
s
u
p
p
o
r
t
 
c
h
a
n
g
e
s
 
i
n
 
t
h
e
c
o
n
f
i
g
u
r
a
t
i
o
n
:
Replace failed machine
Change degree of replication
March 3, 2013
Raft Consensus Algorithm
Slide 30
Configuration Changes
undefined
C
a
n
n
o
t
 
s
w
i
t
c
h
 
d
i
r
e
c
t
l
y
 
f
r
o
m
 
o
n
e
 
c
o
n
f
i
g
u
r
a
t
i
o
n
 
t
o
a
n
o
t
h
e
r
:
 
c
o
n
f
l
i
c
t
i
n
g
 
m
a
j
o
r
i
t
i
e
s
 
c
o
u
l
d
 
a
r
i
s
e
March 3, 2013
Raft Consensus Algorithm
Slide 31
Configuration Changes, cont’d
C
old
C
new
Server 1
Server 2
Server 3
Server 4
Server 5
Majority of C
old
Majority of C
new
time
undefined
March 3, 2013
Raft Consensus Algorithm
Slide 32
R
a
f
t
 
u
s
e
s
 
a
 
2
-
p
h
a
s
e
 
a
p
p
r
o
a
c
h
:
Intermediate phase uses 
joint consensus 
(need majority of both
old and new configurations for elections, commitment)
Configuration change is just a log entry; applied immediately on
receipt (committed or not)
Once joint consensus is committed, begin replicating log entry
for final configuration
Joint Consensus
time
C
old+new
 entry
committed
C
new
 entry
committed
C
old
C
old+new
C
new
C
old
 can make
unilateral decisions
C
new
 can make
unilateral decisions
undefined
A
d
d
i
t
i
o
n
a
l
 
d
e
t
a
i
l
s
:
Any server from either configuration can serve as leader
If current leader is not in C
new
, must step down once C
new
 is
committed.
March 3, 2013
Raft Consensus Algorithm
Slide 33
Joint Consensus, cont’d
time
C
old+new
 entry
committed
C
new
 entry
committed
C
old
C
old+new
C
new
C
old
 can make
unilateral decisions
C
new
 can make
unilateral decisions
leader not in C
new
steps down here
undefined
1.
L
e
a
d
e
r
 
e
l
e
c
t
i
o
n
2.
N
o
r
m
a
l
 
o
p
e
r
a
t
i
o
n
3.
S
a
f
e
t
y
 
a
n
d
 
c
o
n
s
i
s
t
e
n
c
y
4.
N
e
u
t
r
a
l
i
z
e
 
o
l
d
 
l
e
a
d
e
r
s
5.
C
l
i
e
n
t
 
p
r
o
t
o
c
o
l
6.
C
o
n
f
i
g
u
r
a
t
i
o
n
 
c
h
a
n
g
e
s
March 3, 2013
Raft Consensus Algorithm
Slide 34
Raft Summary
Slide Note
Embed
Share

Raft is a consensus algorithm designed for fault-tolerant replication of logs in distributed systems. It ensures that multiple servers maintain identical states for fault tolerance in various services like file systems, databases, and key-value stores. Raft employs a leader-based approach where one server acts as the leader, simplifying normal operations and making it more efficient than leader-less approaches. The algorithm focuses on leader election, log replication, safety, consistency, and client interactions, ultimately achieving proper log replication and ensuring system progress even in the presence of failures.


Uploaded on Sep 12, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. CPS 514 / ECE 558 Consensus Part I: Fail-stop Failures Slide 1

  2. Fail-stop Failures Servers may crash, but maintain persistent storage Messages may be lost Network may be disconnected Servers don t misbehave! Goal is to make progress when a majority of servers are up and can communicate. Slide 2

  3. Consensus Goal is to operate multiple servers in identical state for fault-tolerance Example services: File system Database Shared memory Ledger Key-value store Clients issue requests to change state machines Servers maintain identical logs of linearized client requests Servers maintain identical state machines

  4. Raft: A Consensus Algorithm for Replicated Logs Diego Ongaro and John Ousterhout Stanford University

  5. Goal: Replicated Log Clients shl Consensus Module State Machine Consensus Module State Machine Consensus Module State Machine Servers Log Log Log add jmp mov shl add jmp mov shl add jmp mov shl Replicated log => replicated state machine All servers execute same commands in same order Consensus module ensures proper log replication System makes progress as long as any majority of servers are up Failure model: fail-stop (not Byzantine), delayed/lost messages March 3, 2013 Raft Consensus Algorithm Slide 5

  6. Approaches to Consensus Two general approaches to consensus: Symmetric, leader-less: All servers have equal roles Clients can contact any server Asymmetric, leader-based: At any given time, one server is in charge, others accept its decisions Clients communicate with the leader Raft uses a leader: Decomposes the problem (normal operation, leader changes) Simplifies normal operation (no conflicts) More efficient than leader-less approaches March 3, 2013 Raft Consensus Algorithm Slide 6

  7. Raft Overview 1. Leader election: Select one of the servers to act as leader Detect crashes, choose new leader 2. Normal operation (basic log replication) 3. Safety and consistency after leader changes 4. Neutralizing old leaders 5. Client interactions Implementing linearizeable semantics 6. Configuration changes: Adding and removing servers March 3, 2013 Raft Consensus Algorithm Slide 7

  8. Server States At any given time, each server is either: Leader: handles all client interactions, log replication At most 1 viable leader at a time Follower: completely passive (issues no RPCs, responds to incoming RPCs) Candidate: used to elect a new leader Normal operation: 1 leader, N-1 followers timeout, new election timeout, start election receive votes from majority of servers start Follower Candidate Leader step down discover server with higher term March 3, 2013 discover current leader or higher term Raft Consensus Algorithm Slide 8

  9. Terms Term 1 Term 2 Term 3 Term 4 Term 5 time Elections Split Vote Normal Operation Time divided into terms: Election Normal operation under a single leader At most 1 leader per term Some terms have no leader (failed election) Each server maintains current term value Key role of terms: identify obsolete information March 3, 2013 Raft Consensus Algorithm Slide 9

  10. Raft Protocol Summary Followers RequestVote RPC Invoked by candidates to gather votes. Respond to RPCs from candidates and leaders. Convert to candidate if election timeout elapses without either: Receiving valid AppendEntries RPC, or Granting vote to candidate Arguments: candidateId term lastLogIndex lastLogTerm candidate requesting vote candidate's term index of candidate's last log entry term of candidate's last log entry Candidates Results: term voteGranted Increment currentTerm, vote for self Reset election timeout Send RequestVote RPCs to all other servers, wait for either: Votes received from majority of servers: become leader AppendEntries RPC received from new leader: step down Election timeout elapses without election resolution: increment term, start new election Discover higher term: step down currentTerm, for candidate to update itself true means candidate received vote Implementation: 1. If term > currentTerm, currentTerm term (step down if leader or candidate) 2. If term == currentTerm, votedFor is null or candidateId, and candidate's log is at least as complete as local log, grant vote and reset election timeout Leaders Initialize nextIndex for each to last log index + 1 Send initial empty AppendEntries RPCs (heartbeat) to each follower; repeat during idle periods to prevent election timeouts Accept commands from clients, append new entries to local log Whenever last log index nextIndex for a follower, send AppendEntries RPC with log entries starting at nextIndex, update nextIndex if successful If AppendEntries fails because of log inconsistency, decrement nextIndex and retry Mark log entries committed if stored on a majority of servers and at least one entry from current term is stored on a majority of servers Step down if currentTerm changes AppendEntries RPC Invoked by leader to replicate log entries and discover inconsistencies; also used as heartbeat . Arguments: term leaderId prevLogIndex leader's term so follower can redirect clients index of log entry immediately preceding new ones term of prevLogIndex entry log entries to store (empty for heartbeat) last entry known to be committed prevLogTerm entries[] commitIndex Results: term success currentTerm, for leader to update itself true if follower contained entry matching prevLogIndex and prevLogTerm Persistent State Each server persists the following to stable storage synchronously before responding to RPCs: currentTerm latest term server has seen (initialized to 0 on first boot) votedFor candidateId that received vote in current term (or null if none) log[] log entries Implementation: 1. Return if term < currentTerm 2. If term > currentTerm, currentTerm term 3. If candidate or leader, step down 4. Reset election timeout 5. Return failure if log doesn t contain an entry at prevLogIndex whose term matches prevLogTerm 6. If existing entries conflict with new entries, delete all existing entries starting with first conflicting entry 7. Append any new entries not already in the log 8. Advance state machine with newly committed entries Log Entry term index command term when entry was received by leader position of entry in the log command for state machine March 3, 2013 Raft Consensus Algorithm Slide 10

  11. Heartbeats and Timeouts Servers start up as followers Followers expect to receive RPCs from leaders or candidates Leaders must send heartbeats (empty AppendEntries RPCs) to maintain authority If electionTimeout elapses with no RPCs: Follower assumes leader has crashed Follower starts new election Timeouts typically 100-500ms March 3, 2013 Raft Consensus Algorithm Slide 11

  12. Election Basics Increment current term Change to Candidate state Vote for self Send RequestVote RPCs to all other servers, retry until either: 1. Receive votes from majority of servers: Become leader Send AppendEntries heartbeats to all other servers 2. Receive RPC from valid leader: Return to follower state 3. No-one wins election (election timeout elapses): Increment term, start new election March 3, 2013 Raft Consensus Algorithm Slide 12

  13. Elections, contd Safety: allow at most one winner per term Each server gives out only one vote per term (persist on disk) Two different candidates can t accumulate majorities in same term Voted for candidate A B can t also get majority Servers Liveness: some candidate must eventually win Choose election timeouts randomly in [T, 2T] One server usually times out and wins election before others wake up Works well if T >> broadcast time March 3, 2013 Raft Consensus Algorithm Slide 13

  14. Log Structure log index 1 2 1 3 1 ret 4 2 5 3 6 3 7 3 8 3 term 1 leader add cmp mov jmp div shl sub command 1 1 1 2 3 add cmp ret mov jmp 1 1 1 2 3 3 3 3 add cmp ret mov jmp div shl sub followers 1 1 add cmp 1 1 1 2 3 3 3 add cmp ret mov jmp div shl committed entries Log entry = index, term, command Log stored on stable storage (disk); survives crashes Entry committed if known to be stored on majority of servers Durable, will eventually be executed by state machines March 3, 2013 Raft Consensus Algorithm Slide 14

  15. Normal Operation Client sends command to leader Leader appends command to its log Leader sends AppendEntries RPCs to followers Once new entry committed: Leader passes command to its state machine, returns result to client Leader notifies followers of committed entries in subsequent AppendEntries RPCs Followers pass committed commands to their state machines Crashed/slow followers? Leader retries RPCs until they succeed Performance is optimal in common case: One successful RPC to any majority of servers March 3, 2013 Raft Consensus Algorithm Slide 15

  16. Log Consistency High level of coherency between logs: If log entries on different servers have same index and term: They store the same command The logs are identical in all preceding entries 1 2 1 3 1 ret 4 2 5 3 6 3 1 add cmp mov jmp div 1 1 1 2 3 4 add cmp ret mov jmp sub If a given entry is committed, all preceding entries are also committed March 3, 2013 Raft Consensus Algorithm Slide 16

  17. AppendEntries Consistency Check Each AppendEntries RPC contains index, term of entry preceding new ones Follower must contain matching entry; otherwise it rejects request Implements an induction step, ensures coherency 1 2 3 4 5 1 1 1 2 3 leader AppendEntries succeeds: matching entry add cmp ret mov jmp 1 1 1 2 follower add cmp ret mov 1 1 1 2 3 leader add cmp ret mov jmp AppendEntries fails: mismatch 1 1 1 1 follower add cmp ret shl March 3, 2013 Raft Consensus Algorithm Slide 17

  18. Leader Changes At beginning of new leader s term: Old leader may have left entries partially replicated No special steps by new leader: just start normal operation Leader s log is the truth Will eventually make follower s logs identical to leader s Multiple crashes can leave many extraneous log entries: 1 2 3 4 5 6 7 8 log index s1 term 1 1 5 6 6 6 s2 1 1 5 6 7 7 7 s3 1 1 5 5 s4 1 1 2 4 s5 1 1 2 2 3 3 3 March 3, 2013 Raft Consensus Algorithm Slide 18

  19. Safety Requirement Once a log entry has been applied to a state machine, no other state machine must apply a different value for that log entry Raft safety property: If a leader has decided that a log entry is committed, that entry will be present in the logs of all future leaders This guarantees the safety requirement Leaders never overwrite entries in their logs Only entries in the leader s log can be committed Entries must be committed before applying to state machine Committed Present in future leaders logs Restrictions on commitment Restrictions on leader election March 3, 2013 Raft Consensus Algorithm Slide 19

  20. Picking the Best Leader Can t tell which entries are committed! 1 2 3 4 5 committed? 1 1 1 2 2 1 1 1 2 unavailable during leader transition 1 1 1 2 2 During elections, choose candidate with log most likely to contain all committed entries Candidates include log info in RequestVote RPCs (index & term of last log entry) Voting server V denies vote if its log is more complete : (lastTermV > lastTermC) || (lastTermV == lastTermC) && (lastIndexV > lastIndexC) Leader will have most complete log among electing majority March 3, 2013 Raft Consensus Algorithm Slide 20

  21. Committing Entry from Current Term Case #1/2: Leader decides entry in current term is committed 1 2 3 4 5 6 Leader for term 2 s1 1 1 2 2 2 s2 1 1 2 2 AppendEntries just succeeded s3 1 1 2 2 s4 1 1 2 Can t be elected as leader for term 3 s5 1 1 Safe: leader for term 3 must contain entry 4 March 3, 2013 Raft Consensus Algorithm Slide 21

  22. Committing Entry from Earlier Term Case #2/2: Leader is trying to finish committing entry from an earlier term 1 2 3 4 5 6 Leader for term 4 s1 1 1 2 4 s2 1 1 2 AppendEntries just succeeded s3 1 1 2 s4 1 1 s5 1 1 3 3 3 Entry 3 not safely committed: s5 can be elected as leader for term 5 If elected, it will overwrite entry 3 on s1, s2, and s3! March 3, 2013 Raft Consensus Algorithm Slide 22

  23. New Commitment Rules For a leader to decide an entry is committed: Must be stored on a majority of servers At least one new entry from leader s term must also be stored on majority of servers 1 2 3 4 5 Leader for term 4 s1 1 1 2 4 s2 1 1 2 4 s3 1 1 2 4 Once entry 4 committed: s5 cannot be elected leader for term 5 Entries 3 and 4 both safe s4 1 1 s5 1 1 3 3 3 Combination of election rules and commitment rules makes Raft safe March 3, 2013 Raft Consensus Algorithm Slide 23

  24. Log Inconsistencies Leader changes can result in log inconsistencies: log index leader for term 8 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 (a) 1 1 1 4 4 5 5 6 6 Missing Entries (b) 1 1 1 4 (c) 1 1 1 4 4 5 5 6 6 6 6 possible followers (d) 1 1 1 4 4 5 5 6 6 6 7 7 Extraneous Entries (e) 1 1 1 4 4 4 4 (f) 1 1 1 2 2 2 3 3 3 3 3 March 3, 2013 Raft Consensus Algorithm Slide 24

  25. Repairing Follower Logs New leader must make follower logs consistent with its own Delete extraneous entries Fill in missing entries Leader keeps nextIndex for each follower: Index of next log entry to send to that follower Initialized to (1 + leader s last index) When AppendEntries consistency check fails, decrement nextIndex and try again: nextIndex log index 1 2 3 4 5 6 7 8 9 10 11 12 leader for term 7 1 1 1 4 4 5 5 6 6 6 (a) 1 1 1 4 followers (b) 1 1 1 2 2 2 3 3 3 3 3 March 3, 2013 Raft Consensus Algorithm Slide 25

  26. Repairing Logs, contd When follower overwrites inconsistent entry, it deletes all subsequent entries: nextIndex log index 1 2 3 4 5 6 7 8 9 10 11 leader for term 7 1 1 1 4 4 5 5 6 6 6 follower (before) 1 1 1 2 2 2 3 3 3 3 3 follower (after) 1 1 1 4 March 3, 2013 Raft Consensus Algorithm Slide 26

  27. Neutralizing Old Leaders Deposed leader may not be dead: Temporarily disconnected from network Other servers elect a new leader Old leader becomes reconnected, attempts to commit log entries Terms used to detect stale leaders (and candidates) Every RPC contains term of sender If sender s term is older, RPC is rejected, sender reverts to follower and updates its term If receiver s term is older, it reverts to follower, updates its term, then processes RPC normally Election updates terms of majority of servers Deposed server cannot commit new log entries March 3, 2013 Raft Consensus Algorithm Slide 27

  28. Client Protocol Send commands to leader If leader unknown, contact any server If contacted server not leader, it will redirect to leader Leader does not respond until command has been logged, committed, and executed by leader s state machine If request times out (e.g., leader crash): Client reissues command to some other server Eventually redirected to new leader Retry request with new leader March 3, 2013 Raft Consensus Algorithm Slide 28

  29. Client Protocol, contd What if leader crashes after executing command, but before responding? Must not execute command twice Solution: client embeds a unique id in each command Server includes id in log entry Before accepting command, leader checks its log for entry with that id If id found in log, ignore new command, return response from old command Result: exactly-once semantics as long as client doesn t crash March 3, 2013 Raft Consensus Algorithm Slide 29

  30. Configuration Changes System configuration: ID, address for each server Determines what constitutes a majority Consensus mechanism must support changes in the configuration: Replace failed machine Change degree of replication March 3, 2013 Raft Consensus Algorithm Slide 30

  31. Configuration Changes, contd Cannot switch directly from one configuration to another: conflicting majorities could arise Cold Cnew Server 1 Server 2 Server 3 Server 4 Server 5 Majority of Cold Majority of Cnew time March 3, 2013 Raft Consensus Algorithm Slide 31

  32. Joint Consensus Raft uses a 2-phase approach: Intermediate phase uses joint consensus (need majority of both old and new configurations for elections, commitment) Configuration change is just a log entry; applied immediately on receipt (committed or not) Once joint consensus is committed, begin replicating log entry for final configuration Cold can make unilateral decisions Cnew can make unilateral decisions Cnew Cold+new Cold time Cold+new entry committed Cnew entry committed March 3, 2013 Raft Consensus Algorithm Slide 32

  33. Joint Consensus, contd Additional details: Any server from either configuration can serve as leader If current leader is not in Cnew, must step down once Cnew is committed. Cold can make unilateral decisions Cnew can make unilateral decisions Cnew Cold+new leader not in Cnew steps down here Cold time Cold+new entry committed Cnew entry committed March 3, 2013 Raft Consensus Algorithm Slide 33

  34. Raft Summary 1. Leader election 2. Normal operation 3. Safety and consistency 4. Neutralize old leaders 5. Client protocol 6. Configuration changes March 3, 2013 Raft Consensus Algorithm Slide 34

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#