Distributed Systems and Fault Tolerance

CS 3700: Networks and Distributed Systems
Distributed Consensus and Fault Tolerance
(or, why can’t we all just get along?)
Black Box Online Services

Storing and retrieving data from online services is commonplace
We tend to treat these services as black boxes
Data goes in, we assume outputs are correct
We have no idea how the service is implemented

Examples:
debit_transaction(-$75) → OK; get_recent_transactions() → […, “-$75”, …]
add_item_to_cart(“Cheerios”) → OK; get_cart() → [“Lucky Charms”, “Cheerios”]
post_update(“I LOLed”) → OK; get_newsfeed() → […, {“txt”: “I LOLed”, “likes”: 87}]
Peeling Back the Curtain

How are large services implemented?
Different types of services may have different requirements
Leads to different design decisions
Centralization

Advantages of centralization
Easy to set up and deploy
Consistency is guaranteed (assuming a correct software implementation)
Shortcomings
No load balancing
Single point of failure

[Diagram: Bob sends debit_transaction(-$75) to a single server, which updates Bob: $300 → $225 and replies OK; get_account_balance() then returns $225]
Sharding

Advantages of sharding
Better load balancing
If done intelligently, may allow incremental scalability
Shortcomings
Failures are still devastating

[Diagram: a web server routes Bob’s debit_account(-$75) to the <A-M> shard, which updates Bob: $300 → $225 and replies OK; a second shard holds <N-Z>; get_account_balance() returns $225]
Replication

Advantages of replication
Better load balancing of reads (potentially)
Resilience against failure; high availability (with some caveats)
Shortcomings
How do we maintain consistency?

[Diagram: a web server forwards Bob’s debit_account(-$75) to three <A-M> replicas; each updates Bob: $300 → $225 with 100% agreement, so get_account_balance() returns $225]
Consistency Failures

[Diagram: the leader sends Bob: $225 to the replicas; some ACKs never arrive, a timeout fires, and a replica is left holding Bob: $300 — no agreement]

Leader cannot disambiguate cases where requests and responses are lost
Asynchronous networks are problematic
Too few replicas?
Byzantine Failures

[Diagram: two replicas report Bob: $300, but a third reports Bob: $1000 — no agreement]

In some cases, replicas may be buggy or malicious
When discussing Distributed Systems, failures due to malice are known as Byzantine Failures
Name comes from the Byzantine generals problem
More on this later…
Problem and Definitions

Build a distributed system that meets the following goals:
The system should be able to reach consensus
Consensus [n]: general agreement
The system should be consistent
Data should be correct; no integrity violations
The system should be highly available
Data should be accessible even in the face of arbitrary failures

Challenges:
Many, many different failure modes
Theory tells us that these goals are impossible to achieve (more on this later)
 
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
Forcing Consistency

One approach to building distributed systems is to force them to be consistent
Guarantee that all replicas receive an update…
…Or none of them do
If consistency is guaranteed, then reaching consensus is trivial

[Diagram: debit_account(-$75) reaches all three replicas (Bob: $300 → $225), so the client gets OK; debit_account(-$50) fails to reach every replica, so no replica keeps Bob: $175 and the client gets Error]
 
Distributed Commit Problem

Application that performs operations on multiple replicas or databases
We want to guarantee that all replicas get updated, or none do

Distributed commit problem:
1. Operation is committed when all participants can perform the action
2. Once a commit decision is reached, all participants must perform the action

These two steps give rise to the Two Phase Commit protocol
Motivating Transactions

System becomes inconsistent if any individual action fails

[Diagram: transfer_money(Alice, Bob, $100) is implemented as debit_account(Alice, -$100) followed by debit_account(Bob, $100); if either debit returns Error after the other returned OK, Alice: $600 → $500 while Bob stays at $300 (or vice versa)]
Simple Transactions

Actions inside a transaction behave as a single action

[Diagram: transfer_money(Alice, Bob, $100) wraps debit_account(Alice, -$100) and debit_account(Bob, $100) in begin_transaction() … end_transaction(); Alice: $600 → $500 and Bob: $300 → $400 take effect together and the client gets OK]

At this point, if there haven’t been any errors, we say the transaction is committed
Simple Transactions

If any individual action fails, the whole transaction fails
Failed transactions have no side effects
Incomplete results during transactions are hidden

[Diagram: transfer_money(Alice, Chris, $100) begins a transaction; debit_account(Alice, -$100) tentatively moves Alice: $600 → $500, but debit_account(Chris, $100) fails, so the transaction aborts with Error and Alice’s balance is restored]
 
ACID Properties

Traditional transactional databases support the following:
1. Atomicity: all or none; if the transaction fails then no changes are applied to the database
2. Consistency: there are no violations of database integrity
3. Isolation: partial results from incomplete transactions are hidden
4. Durability: the effects of committed transactions are permanent
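To make this concrete, here is a hedged sketch using SQLite via Python’s built-in sqlite3 module; the accounts table and the transfer helper are invented for illustration. The failed transfer demonstrates atomicity: it leaves no side effects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('Alice', 600), ('Bob', 300)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both debits commit or neither does."""
    try:
        with conn:  # begin_transaction() ... end_transaction()
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            cur = conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                               (amount, dst))
            if cur.rowcount == 0:            # destination account missing
                raise ValueError("no such account: " + dst)
        return "OK"
    except ValueError:
        return "Error"                       # rollback happened automatically

print(transfer(conn, "Alice", "Bob", 100))    # OK: Alice 500, Bob 400
print(transfer(conn, "Alice", "Chris", 100))  # Error: Alice's debit rolled back
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```

The `with conn:` block commits on success and rolls back on any exception, which is exactly the all-or-none behavior described above.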
 
Two Phase Commits (2PC)

Well known techniques are used to implement transactions in centralized databases
E.g. journaling (append-only logs)
Out of scope for this class (take a database class, or CS 5600)

Two Phase Commit (2PC) is a protocol for implementing transactions in a distributed setting
Protocol operates in rounds
Assume we have a leader or coordinator that manages transactions
Each replica promises that it is ready to commit
Leader decides the outcome and instructs replicas to commit or abort
Assume no byzantine faults (i.e. nobody is malicious)
2PC Example

[Diagram: leader and three replicas, time flowing downward. The leader multicasts the update (txid = 678; value = y); each replica stages x → y and replies “ready to commit”; the leader then multicasts commit txid = 678 and every replica makes y permanent]

Begin by distributing the update; the txid is a logical clock
Wait to receive “ready to commit” from all replicas (also called promises)
Tell replicas to commit
At this point, all replicas are guaranteed to be up-to-date
Failure Modes
 
Replica Failure
Before or during the initial promise phase
Before or during the commit
 
Leader Failure
Before receiving all promises
Before or during sending commits
Before receiving all committed messages
Replica Failure (1)

[Diagram: the leader sends txid = 678; value = y, but replica 3 fails before replying “ready”; the leader sends abort txid = 678, and the surviving replicas discard y and keep x]

Error: not all replicas are “ready”
The same thing happens if a write or a “ready” is dropped, a replica times out, or a replica returns an error
Replica Failure (2)

[Diagram: the leader sends commit txid = 678; replicas 1 and 2 commit y and reply “committed”, but replica 3 is down — a known inconsistent state. The leader keeps resending commit txid = 678 until replica 3 recovers, applies y, and replies “committed”]

Known inconsistent state
Leader must keep retrying until all commits succeed

[Diagram: when replica 3 reboots, it sends stat txid = 678 to learn the outcome, then commits y; finally, the system is consistent and may proceed]

Replicas attempt to resume unfinished transactions when they reboot
Leader Failure
 
What happens if the leader crashes?
Leader must constantly be writing its state to permanent storage
It must pick up where it left off once it reboots
If there are unconfirmed transactions
Send new write messages, wait for “ready to commit” replies
If there are uncommitted transactions
Send new commit messages, wait for “committed” replies
Replicas may see duplicate messages during this process
Thus, it’s important that every transaction have a unique txid
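A minimal sketch of why unique txids matter, with invented class and method names: the replica remembers which txids it has already committed, so a duplicate commit from a recovering leader is re-acknowledged but applied only once.

```python
class Replica:
    """Minimal sketch of duplicate-message handling via unique txids."""
    def __init__(self):
        self.committed = set()   # txids whose commits were already applied
        self.staged = {}         # txid -> tentative value
        self.value = None        # current committed value

    def on_write(self, txid, value):
        self.staged[txid] = value
        return "ready"

    def on_commit(self, txid):
        if txid in self.committed:       # duplicate from a recovering leader
            return "committed"           # re-acknowledge, apply nothing
        self.value = self.staged.pop(txid)
        self.committed.add(txid)
        return "committed"

r = Replica()
r.on_write(678, "y")
r.on_commit(678)
r.on_commit(678)   # duplicate: acknowledged, but applied only once
print(r.value)     # y
```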
Allowing Progress

Key problem: what if the leader crashes and never recovers?
By default, replicas block until contacted by the leader
Can the system make progress?
Yes, under limited circumstances
After sending a “ready to commit” message, each replica starts a timer
The first replica whose timer expires elects itself as the new leader
Query the other replicas for their status
Send “commits” to all replicas if they are all “ready”
However, this only works if all the replicas are alive and reachable
If a replica crashes or is unreachable, deadlock is unavoidable
New Leader

[Diagram: the leader crashes after sending commit txid = 678 to only replica 1. Replica 2’s timeout expires and it begins the recovery procedure: it sends stat txid = 678 to the other replicas, learns the outcome, and completes the commit; the system is consistent again]
Deadlock

[Diagram: the leader crashes before sending any commits, and replica 3 is also unreachable. Replica 2’s timeout expires and it begins the recovery procedure, sending stat txid = 678; only “ready txid = 678” from replica 1 arrives. With a replica missing, the new leader cannot proceed, but it also cannot abort]
Garbage Collection

2PC is somewhat of a misnomer: there is a third phase, garbage collection
Replicas must retain records of past transactions, just in case the leader fails
For example, suppose the leader crashes, reboots, and attempts to commit a transaction that has already been committed
Replicas must remember that this past transaction was already committed, since committing a second time may lead to inconsistencies
In practice, the leader periodically tells replicas to garbage collect
All transactions <= some txid may be deleted
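The watermark scheme above can be sketched as follows (the function name and log shape are invented for illustration):

```python
def garbage_collect(committed, watermark):
    """Drop records of transactions with txid <= watermark.

    `committed` maps txid -> outcome; the leader broadcasts `watermark`
    once it knows every replica has applied those transactions.
    """
    return {txid: outcome for txid, outcome in committed.items()
            if txid > watermark}

log = {676: "commit", 677: "abort", 678: "commit", 679: "commit"}
log = garbage_collect(log, 678)
print(log)  # {679: 'commit'}
```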
 
2PC Leader Pseudocode

Multicast write message
Collect ‘ready to commit’ replies
  All OK: log ‘commit’ to ‘outcomes’ table, write to journal, send commit
  Else: send abort
Collect ‘committed’ messages
  Repeat until all replicas respond

After failure
  For each pending transaction in ‘outcomes’ table:
    Send outcome (commit or abort)
    Wait for acknowledgments

Periodically
  Query each process: terminated protocols?
  Broadcast garbage collection message
 
2PC Replica Pseudocode

First time message received
  write: save to temp area and reply ‘ready to commit’
  commit: log outcome, make changes permanent, reply ‘committed’
  abort: log outcome, delete temp area, reply ‘aborted’

Message is a duplicate (recovering coordinator)
  Send acknowledgment (‘ready’ or ‘committed’)

After failure
  For each pending transaction:
    Contact coordinator to learn outcome

After timeout in prepare-to-commit state:
  Query other participants about state
    If outcome can be deduced: run leader-recovery protocol
    If outcome uncertain: wait for the leader
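Putting the leader and replica pseudocode together, here is a toy single-process sketch of one 2PC round; there is no real networking, journaling, or recovery, and all class, method, and message names are invented for illustration.

```python
class Replica:
    def __init__(self, fail_write=False):
        self.fail_write = fail_write
        self.staged, self.value = {}, None

    def on_write(self, txid, value):
        if self.fail_write:                  # simulate a crash/error/timeout
            return "error"
        self.staged[txid] = value            # save to temp area
        return "ready"

    def on_commit(self, txid):
        self.value = self.staged.pop(txid)   # make changes permanent

    def on_abort(self, txid):
        self.staged.pop(txid, None)          # delete temp area

def two_phase_commit(leader_log, replicas, txid, value):
    """Toy 2PC round: returns 'commit' or 'abort' and applies the outcome."""
    # Phase 1: multicast the write, collect 'ready to commit' promises
    votes = [r.on_write(txid, value) for r in replicas]
    if all(v == "ready" for v in votes):
        leader_log[txid] = "commit"          # journal the decision first
        for r in replicas:                   # Phase 2: tell replicas to commit
            r.on_commit(txid)
    else:
        leader_log[txid] = "abort"
        for r in replicas:
            r.on_abort(txid)
    return leader_log[txid]

log = {}
good = [Replica(), Replica(), Replica()]
print(two_phase_commit(log, good, 678, "y"))   # commit
print([r.value for r in good])                 # ['y', 'y', 'y']

mixed = [Replica(), Replica(fail_write=True)]
print(two_phase_commit(log, mixed, 679, "z"))  # abort: no replica applies z
```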
2PC Summary

Message complexity: O(2n)
The good: guarantees consistency
The bad:
Write performance suffers if there are failures during the commit phase
Does not scale gracefully (possible, but difficult to do)
A pure 2PC system blocks all writes if the leader fails
Smarter 2PC systems still block all writes if the leader + 1 replica fail
2PC sacrifices availability in favor of consistency
 
Can 2PC be Fixed?

The key issue with 2PC is reliance on the centralized leader
Only the leader knows if a transaction is 100% ready to commit or not
Thus, if the leader + 1 replica fail, recovery is impossible
Potential solution: Three Phase Commit
Add an additional round of communication
Tell all replicas to prepare to commit, before actually committing
State of the system can always be deduced by a subset of alive replicas that can communicate with each other
… unless there are partitions (more on this later)
3PC Example

[Diagram: leader and three replicas. The leader distributes the update (txid = 678; value = y); replicas stage x → y and reply “ready to commit”; the leader sends prepare txid = 678; after collecting “prepared” replies it sends commit txid = 678, and all replicas apply y]

Begin by distributing the update
Wait to receive “ready to commit” from all replicas
Tell all replicas that everyone is “ready to commit”
Tell replicas to commit
At this point, all replicas are guaranteed to be up-to-date
Leader Failures

[Diagram: the leader distributes txid = 678; value = y but crashes before sending any prepare messages. Replica 2’s timeout expires and it begins the recovery procedure: it sends stat txid = 678 and sees only “ready” replies. Since no prepare was sent, replica 3 cannot be in the committed state, thus it is okay to abort; replica 2 sends abort txid = 678 and the system is consistent again]
Leader Failures

[Diagram: the leader crashes after sending prepare txid = 678. Replica 2’s timeout expires and it begins the recovery procedure: it sends stat txid = 678 and receives “prepared txid = 678” replies. All replicas must have been ready to commit, so replica 2 sends commit txid = 678 and the system is consistent again]
Oh Great, We Fixed Everything!

Wrong
3PC is not robust against network partitions
What is a network partition?
A split in the network, such that full n-to-n connectivity is broken
i.e. not all servers can contact each other
Partitions split the network into two or more disjoint subnetworks
How can a network partition occur?
A switch or a router may fail, or it may receive an incorrect routing rule
A cable connecting two racks of servers may develop a fault
Network partitions are very real; they happen all the time
Partitioning

[Diagram: during a 3PC round for txid = 678, the network partitions into two subnets: the leader and replica 1 on one side, replicas 2 and 3 on the other. The leader assumes replicas 2 and 3 have failed, completes prepare and commit with replica 1, and moves on. Meanwhile, replica 2’s side initiates leader recovery and aborts. The two sides have diverged; the system is inconsistent]
3PC Summary

Adds an additional phase vs. 2PC
Message complexity: O(3n)
Really four phases with garbage collection
The good: allows the system to make progress under more failure conditions
The bad:
Extra round of communication makes 3PC even slower than 2PC
Does not work if the network partitions
2PC will simply deadlock if there is a partition, rather than become inconsistent
In practice, nobody uses 3PC
The additional complexity and performance penalty just aren’t worth it
Loss of consistency during partitions is a deal breaker
 
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
A Moment of Reflection

Goals, revisited:
The system should be able to reach consensus
Consensus [n]: general agreement
The system should be consistent
Data should be correct; no integrity violations
The system should be highly available
Data should be accessible even in the face of arbitrary failures

Achieving these goals may be harder than we thought :(
Huge number of failure modes
Network partitions are difficult to cope with
We haven’t even considered byzantine failures
What Can Theory Tell Us?

Let's assume the network is synchronous and reliable
Synchronous: no delay, sent packets arrive immediately
Reliable: no packet drops or corruption

Goal: get n replicas to reach consensus
Algorithm can be divided into discrete rounds
During each round, each host r may send m <= n messages
r might crash before sending all m messages
If a message from host r is not received in a round, then r must be faulty

If we are willing to tolerate f total failures (f < n), how many rounds of communication do we need to guarantee consensus?
 
Consensus in a Synchronous System

Initialization:
All replicas choose a value 0 or 1 (can generalize to more values if you want)

Properties:
Agreement: all non-faulty processes ultimately choose the same value
Either 0 or 1 in this case
Validity: if a replica decides on a value, then at least one replica must have started with that value
This prevents the trivial solution of all replicas always choosing 0, which is technically perfect consensus but is practically useless
Termination: the system must converge in finite time
 
Algorithm Sketch

Each replica maintains a map M of all known values
Initially, the map only contains the replica’s own value
e.g., M = {‘replica1’: 0}
Each round, broadcast M to all other replicas
On receipt, construct the union of the received M and the local M
Algorithm terminates when all non-faulty replicas have the values from all other non-faulty replicas
Example with three non-faulty replicas (1, 3, and 5)
M = {‘replica1’: 0, ‘replica3’: 1, ‘replica5’: 0}
Final value is min(M.values())
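A hedged simulation of this sketch; the replica ids, the crash scheduling, and the “crashed replica reaches only half its peers” model are invented for illustration.

```python
def synchronous_consensus(values, crash_round=None, crash_id=None, f=1):
    """Simulate the flooding algorithm for f+1 synchronous rounds.

    `values` maps replica id -> initial bit. If `crash_id` is set, that
    replica crashes during `crash_round` after sending to only some peers.
    """
    known = {r: {r: v} for r, v in values.items()}   # each replica's map M
    alive = set(values)
    for rnd in range(1, f + 2):                      # f+1 rounds total
        inboxes = {r: [] for r in alive}
        for sender in list(alive):
            peers = [p for p in alive if p != sender]
            if rnd == crash_round and sender == crash_id:
                peers = peers[:len(peers) // 2]      # crash mid-broadcast
                alive.discard(sender)
            for p in peers:
                inboxes[p].append(known[sender])
        for r in alive:
            for m in inboxes[r]:
                known[r].update(m)                   # union of maps
    # every surviving replica decides min over its (now identical) map
    return {r: min(known[r].values()) for r in alive}

decisions = synchronous_consensus({1: 0, 2: 1, 3: 0, 4: 0, 5: 1},
                                  crash_round=1, crash_id=5, f=1)
print(decisions)  # all survivors agree on the same bit
```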
Bounding Convergence Time

How many rounds will it take if we are willing to tolerate f failures?
f + 1 rounds
Key insight: all replicas must be sure that all replicas that did not crash have the same information (so they can make the same decision)

Proof sketch, assuming f = 2
Worst case scenario is that replicas crash during rounds 1 and 2
During round 1, replica x crashes
All other replicas don’t know if x is alive or dead
During round 2, replica y crashes
Clear that x is not alive, but unknown if y is alive or dead
During round 3, no more replicas may crash
All replicas are guaranteed to receive updated info from all other replicas
Final decision can be made
[Diagram: n = 5 replicas, f = 2 maximum failures; vectors are [r1, r2, r3, r4, r5] with initial values 0, 1, 0, 0, 1. With no crashes, after round 1 every replica already holds [0, 1, 0, 0, 1], and rounds 2 and 3 simply reconfirm it]
[Diagram: same setup, but replica 5 crashes during round 1 (only one replica receives its value, so the others hold [0, 1, 0, 0, ?]) and replica 4 crashes during round 2 (some replicas hold [0, 1, 0, ?, X]). By the end of round 3 all survivors hold [0, 1, 0, X, X] and can decide]
 
A More Realistic Model

The previous result is interesting, but totally unrealistic
We assumed that the network is synchronous and reliable
Of course, neither of these things is true in reality

What if the network is asynchronous and reliable?
Asynchronous: replicas may take an arbitrarily long time to respond to messages and/or the network may delay messages

Let’s also assume that all faults are crash faults
i.e., if a replica has a problem, it crashes and never wakes up
No byzantine faults
 
The FLP Result

There is no asynchronous algorithm that achieves consensus on a 1-bit value in the presence of crash faults. The result is true even if no crash actually occurs!

This is known as the FLP result
Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, 1985
Extremely powerful result because:
If you can’t agree on 1 bit, generalizing to larger values isn’t going to help you
If you can’t guarantee convergence with crash faults, there is no way you can guarantee convergence with byzantine faults
If you can’t guarantee convergence on a reliable network, there is no way you can on an unreliable network
FLP Proof Sketch

In an asynchronous system, a replica x cannot tell whether a non-responsive replica y has crashed or is just slow

What can x do?
If x waits, it may block indefinitely, since it might never receive the message from y
If x decides, it may find out later that y made a different decision

The proof constructs a scenario where each attempt to decide is overruled by a delayed, asynchronous message
Thus, the system oscillates between 0 and 1 and never converges
[Diagram: n = 5 replicas with initial values 0, 1, 0, 0, 1; vectors are [r1, r2, r3, r4, r5]. In round 1, replica 5’s message is delayed, so replicas hold [0, 1, 0, 0, ?]; in round 2 the delayed value arrives while another message is delayed, so views keep changing and no decision sticks]
 
Impact of FLP

FLP proves that any fault-tolerant distributed algorithm attempting to reach consensus has runs that never terminate
These runs are extremely unlikely (“probability zero”)
Yet they imply that we can’t find a totally correct solution
And so – “consensus is impossible” (really, “not always possible”)

So what can we do?
Use randomization and probabilistic guarantees (gossip protocols)
Avoid consensus; use quorum systems (Paxos or Raft)
In other words, trade off between consistency and availability
 
Consistency vs. Availability

FLP states that perfect consensus is impossible

Practically, we can get close to perfect consistency, but at significant cost
e.g., using 3PC
Availability begins to suffer dramatically under failure conditions

Is there a way to formalize the tradeoff between consistency and availability?
Eric Brewer’s CAP Theorem

CAP theorem for distributed data replication
Consistency: updates to data are applied to all or none
Availability: must be able to access all data
Network Partition Tolerance: failures can partition the network into disjoint subnetworks

The Brewer Theorem
No system can simultaneously achieve C and A and PT

Typical interpretation: “C, A, and PT: choose 2”
In practice, any network may partition, thus you must choose PT
A better interpretation might be “C or A: choose 1”

Brewer never formally proved or published it
Yet it is widely accepted as a rule of thumb
CAP Examples

[Diagram, A+PT: a client writes (key, 2) to one replica, but the partition prevents replication, so another replica still holds (key, 1). A read may return the stale (key, 1): the client can always read, but the result is not consistent]

[Diagram, C+PT: same write, same partition. Because reads must always return accurate results, the out-of-date replica refuses to answer: “Error: Service Unavailable”. Consistency is preserved, but there is no availability]
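A minimal sketch of the two behaviors above (the class, mode names, and error message are invented): during a partition, a cut-off replica either answers with possibly stale data (A+PT) or refuses to answer (C+PT).

```python
class PartitionedReplica:
    """A replica cut off from the writer during a partition."""
    def __init__(self, value, mode):
        self.value = value        # possibly stale: replication was blocked
        self.mode = mode          # "AP" or "CP"
        self.in_sync = False      # we know we may have missed updates

    def read(self, key):
        if self.mode == "CP" and not self.in_sync:
            raise RuntimeError("Error: Service Unavailable")
        return self.value         # AP: answer anyway, possibly stale

ap = PartitionedReplica(("key", 1), "AP")
print(ap.read("key"))             # ('key', 1) : stale but available

cp = PartitionedReplica(("key", 1), "CP")
try:
    cp.read("key")
except RuntimeError as e:
    print(e)                      # Error: Service Unavailable
```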
“C or A: Choose 1”

Taken to the extreme, CAP suggests a binary division in distributed systems
Your system is consistent or available
In practice, it’s more like a spectrum of possibilities

[Diagram: a spectrum from Perfect Consistency to Always Available. At the consistency end: ACID-compliant SQL databases, where financial information must be correct. At the availability end: NoSQL systems that serve content to all visitors, regardless of correctness. In the middle: systems that attempt to balance correctness with availability]
 
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
Strong Consistency, Revisited

2PC and 3PC achieve strong consistency, but they have significant shortcomings
2PC cannot make progress in the face of leader + 1 replica failures
3PC loses consistency guarantees in the face of network partitions

Where do we go from here?

Observation: 2PC and 3PC attempt to reach 100% agreement
What if 51% of the replicas agree?

Quorum systems
Quorum Systems
 
Advantages of Quorums

Availability: quorum systems are more resilient in the face of failures
Quorum systems can be designed to tolerate both benign and byzantine failures

Efficiency: can significantly reduce communication complexity
Do not require all servers in order to perform an operation
Only a subset of them is required for each operation
High-Level Quorum Example

[Diagram: five replicas store timestamped versions of Bob’s balance (ts 1: $300, ts 2: $400, ts 3: $375). Each write goes to a quorum (a majority) of replicas, so different replicas may hold different prefixes of the history. A read also queries a quorum; because any two quorums overlap, at least one replica in the read quorum holds the latest version, and the reader returns the value with the highest timestamp: ts 3, Bob: $375]
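A minimal sketch of timestamped majority-quorum reads and writes; the Store class and its interface are invented for illustration, and real systems add far more machinery.

```python
class Store:
    def __init__(self, n):
        self.replicas = [dict() for _ in range(n)]  # key -> (ts, value)
        self.quorum = n // 2 + 1                    # majority

    def write(self, key, ts, value, reachable):
        """Write to a quorum of reachable replicas; fail otherwise."""
        if len(reachable) < self.quorum:
            return "Error"
        for i in reachable[:self.quorum]:
            self.replicas[i][key] = (ts, value)
        return "OK"

    def read(self, key, reachable):
        """Read from a quorum and return the highest-timestamped value."""
        if len(reachable) < self.quorum:
            return "Error"
        versions = [self.replicas[i][key] for i in reachable[:self.quorum]
                    if key in self.replicas[i]]
        return max(versions)[1]   # the (ts, value) pair with the largest ts

s = Store(5)
s.write("bob", 1, 300, reachable=[0, 1, 2, 3, 4])
s.write("bob", 2, 400, reachable=[0, 1, 2])       # a different quorum
print(s.read("bob", reachable=[2, 3, 4]))         # 400: quorums overlap at 2
```

Because any two majorities of the five replicas intersect, the read quorum always contains at least one replica that saw the latest write.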
 
Paxos

History of Paxos

Developed by Turing award winner Leslie Lamport
First published as a tech report in 1989
Journal refused to publish it; nobody understood the protocol
Formally published in 1998
Again, nobody understands it
Leslie Lamport publishes “Paxos Made Simple” in 2001
People start to get the protocol
Reaches widespread fame in 2006-2007
Used by Google in their Chubby distributed mutex system
Zookeeper is the open-source version of Chubby
Paxos at a High-Level

1. Replicas elect a leader and agree on the view number
The view is a logical clock that divides time into epochs
During each view, there is a single leader
2. The leader collects promises from the replicas
Replicas promise to only accept proposals from the current or future views
Prevents replicas from going “back in time”
The leader learns about proposed updates from the previous view that haven’t yet been accepted
3. The leader proposes updates and replicas accept them
Start by completing unfinished updates from the previous view
Then move on to new writes from clients
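As a hedged illustration, here is single-decree Paxos (the core “synod” protocol for agreeing on one value), not the full multi-view protocol sketched above; class, method, and message names are invented.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1                 # highest proposal number promised
        self.accepted = None               # (number, value) or None

    def prepare(self, n):
        """Promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value):
        if n >= self.promised:             # honor the promise
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"

def propose(acceptors, n, value):
    """One proposer round; returns the chosen value or None on failure."""
    quorum = len(acceptors) // 2 + 1
    # Phase 1: prepare/promise
    replies = [a.prepare(n) for a in acceptors]
    promises = [acc for tag, acc in replies if tag == "promise"]
    if len(promises) < quorum:
        return None
    # If any acceptor already accepted a value, we must re-propose it
    prior = [acc for acc in promises if acc is not None]
    if prior:
        value = max(prior)[1]              # value of highest-numbered proposal
    # Phase 2: commit/accept
    votes = [a.accept(n, value) for a in acceptors]
    return value if votes.count("accepted") >= quorum else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, 1, "y"))   # y
print(propose(acceptors, 2, "z"))   # y : the earlier choice is preserved
```

Note how a later proposer with a higher number cannot overwrite an already-chosen value: the promises it collects reveal the prior accepted proposal, which it must re-propose.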
View Selection

[Diagram: replicas exchange view numbers over time until they converge on the same view]

All replicas have a view number
The goal is to have all replicas agree on the view
The leader is the replica with ID = hash(view)
Prepare/Promise

[Diagram: the leader sends prepare view=5 clock=13 to the replicas, which reply with promises]

Replicas promise to not accept any messages with view < v
Replicas won’t elect a new leader until the current one fails (measured using a timeout)
Commit/Accept

[Diagram: a client write is forwarded to the leader, which sends commit clock=14 with the new value y to the replicas; the replicas store y and the client gets OK]

All client requests are serialized through the leader
Replicas write the new value to temporary storage
Paxos Review

Failure Modes
1. What happens if a commit fails?
2. What happens during a partition?
3. What happens after the leader fails?
Bad Commit

[Diagram: the leader sends commit clock=14; some replicas accept y, but one rejects or misses the commit]

What happens if a quorum does not accept a commit?
The leader must retry until quorum is reached, or broadcast an abort
Replicas that fall behind can reconcile by downloading missed updates from a peer
Partitions (1)

[Diagram: a partition separates the replicas; the majority side holds a view/elect round and continues committing y]

What happens during a partition?
Once the partition is fixed, either:
Hold a new leader election and move forward
Or, reconcile with up-to-date peers
Partitions (2)

[Diagram: the majority side elects a new leader for view = 1 and proceeds through prepare and commit]

What happens when the view = 0 group attempts to rejoin?
Promises for view = 1 prevent the old leader from interfering with the new quorum
Leader Failure (1)

[Diagram: the old leader fails after sending commit clock=14 to some replicas; the new leader’s prepare clock=13 round reveals the update, so it recommits clock=14 before moving on to clock=14’]

What happens if there is an uncommitted update with no quorum?
The leader is aware of the uncommitted update
The leader must recommit the original clock=14 update
Leader Failure (2)

[Diagram: the old leader fails after sending commit clock=14 to only replica 3; none of the new leader’s prepare clock=13 responses mention it]

What happens if there is an uncommitted update with no quorum?
The leader is unaware of the uncommitted update
The leader announces a new update with clock=14’, which is rejected by replica 3
Replica 3 is desynchronized, and must reconcile with another replica
Leader Failure (3)

[Diagram: the old leader fails after commit clock=14 reached a quorum of replicas; the new leader sends prepares and collects promises, which reveal the update so it can be completed]

What happens if there is an uncommitted update with a quorum?
Send prepares, collect promises; the quorum guarantees the new leader learns about the update
 
The Devil in the Details

Clearly, Paxos is complicated

Things we haven’t covered:
Reconciliation: how to bring a replica up-to-date
Managing the queue of updates from clients
Updates may be sent to any replica
Replicas are responsible for responding to clients who contact them
Replicas may need to re-forward updates if the leader changes
Garbage collection
Replicas need to remember the exact history of updates, in case the leader changes
Periodically, these lists need to be garbage collected
 
Odds and Ends
 
Byzantine Generals
Gossip
 
Byzantine Generals Problem

Name coined by Leslie Lamport
Several Byzantine generals are laying siege to an enemy city
They have to agree on a common strategy: attack or retreat
They can only communicate by messenger
Some generals may be traitors (their identity is unknown)

Do you see the connection with the consensus problem?
Byzantine Distributed Systems

Goals
1. All loyal lieutenants obey the same order
2. If the commanding general is loyal, then every loyal lieutenant obeys the order he sends

Can the problem be solved?
Yes, iff there are at least 3m+1 generals in total in the presence of m traitors
E.g. with only 3 generals, even 1 traitor makes the problem unsolvable

Bazillion variations on the basic problem
What if messages are cryptographically signed (i.e. they are unforgeable)?
What if communication is not g x g (i.e. some pairs of generals cannot communicate)?
Most algos have byzantine versions (e.g. Byzantine Paxos)
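The 3m+1 bound can be expressed as a pair of trivial helpers (function names invented for illustration):

```python
def max_byzantine_traitors(n):
    """Most traitors m that n generals can tolerate: requires n >= 3m + 1."""
    return (n - 1) // 3

def generals_needed(m):
    """Fewest generals required to tolerate m traitors."""
    return 3 * m + 1

print(max_byzantine_traitors(3))   # 0 : three generals cannot survive a traitor
print(max_byzantine_traitors(4))   # 1
print(generals_needed(2))          # 7
```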
Most algos have byzantine versions (e.g. Byzantine Paxos)
Alternatives to Quorums

Quorums favor consistency over availability
If no quorum exists, then the system stops accepting writes
Significant overhead maintaining consistent replicated state

What if eventual consistency is okay?
Favor availability over consistency
Results may be stale or incorrect sometimes (hopefully only in rare cases)

Gossip protocols
Replicas periodically, randomly exchange state with each other
No strong consistency guarantees, but…
Surprisingly fast and reliable convergence to up-to-date state
Requires vector clocks or better in order to causally order events
Extreme cases of divergence may be irreconcilable
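A hedged sketch of push-style gossip (anti-entropy) with per-key timestamps instead of full vector clocks; the state shape, fanout model, and names are invented for illustration.

```python
import random

def gossip_round(states, fanout=1):
    """Each replica pushes its state to `fanout` random peers.

    `states` maps replica id -> {key: (timestamp, value)}; on receipt, a
    replica keeps the version with the newer timestamp for each key.
    """
    ids = list(states)
    for sender in ids:
        for peer in random.sample([i for i in ids if i != sender], fanout):
            for key, (ts, val) in states[sender].items():
                if key not in states[peer] or states[peer][key][0] < ts:
                    states[peer][key] = (ts, val)

random.seed(42)
# Five replicas; only replica 0 has seen the latest write to "bob"
states = {i: {"bob": (1, 300)} for i in range(5)}
states[0]["bob"] = (2, 375)

rounds = 0
while any(s["bob"] != (2, 375) for s in states.values()):
    gossip_round(states)
    rounds += 1
print(rounds)  # converges after a handful of random exchanges
```

Convergence is probabilistic, which is exactly the trade described above: no hard guarantee per round, but the expected time to reach every replica grows only logarithmically with the number of replicas.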
 
Sources

1. Some slides courtesy of Cristina Nita-Rotaru (http://cnitarot.github.io/courses/ds_Fall_2016/)
2. The Part-Time Parliament, Leslie Lamport. http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf
3. Paxos Made Simple, Leslie Lamport. http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
4. Paxos for System Builders, Jonathan Kirsch and Yair Amir. http://www.cs.jhu.edu/~jak/docs/paxos_for_system_builders.pdf
5. The Chubby Lock Service for Loosely-Coupled Distributed Systems, Mike Burrows. http://research.google.com/archive/chubby-osdi06.pdf
6. Paxos Made Live – An Engineering Perspective, Tushar Deepak Chandra, Robert Griesemer, Joshua Redstone. http://research.google.com/archive/paxos_made_live.pdf
7. Apache Zookeeper. https://zookeeper.apache.org/
Slide Note

Last update: 11/14/2023

Embed
Share

Exploring the intricacies of distributed systems and fault tolerance in online services, from black box implementations to centralized systems, sharding, and replication strategies. Dive into the advantages and shortcomings of each approach to data storage and processing.

  • Distributed Systems
  • Fault Tolerance
  • Online Services
  • Centralization
  • Sharding

Uploaded on Sep 29, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why can t we all just get along?)

  2. Black Box Online Services Black Box Service Storing and retrieving data from online services is commonplace We tend to treat these services as black boxes Data goes in, we assume outputs are correct We have no idea how the service is implemented

  3. Black Box Online Services debit_transaction(-$75) OK get_recent_transactions() [ , -$75 , ] Storing and retrieving data from online services is commonplace We tend to treat these services as black boxes Data goes in, we assume outputs are correct We have no idea how the service is implemented

  4. Black Box Online Services post_update( I LOLed ) OK get_newsfeed() [ , { txt : I LOLed , likes : 87}] Storing and retrieving data from online services is commonplace We tend to treat these services as black boxes Data goes in, we assume outputs are correct We have no idea how the service is implemented

  5. Peeling Back the Curtain Black Box Service ? How are large services implemented? Different types of services may have different requirements Leads to different design decisions

  6. Centralization debit_transaction(-$75) ? OK get_account_balance() Bob: $300 Bob: $225 Bob $225 Advantages of centralization Easy to setup and deploy Consistency is guaranteed (assuming correct software implementation) Shortcomings No load balancing Single point of failure

  7. Sharding <A-M> debit_account(-$75) Bob: $300 Bob: $225 OK <N-Z> Web Server get_account_balance() Bob $225 Advantages of sharding Better load balancing If done intelligently, may allow incremental scalability Shortcomings Failures are still devastating

  8. Replication 100% Agreement <A-M> debit_account(-$75) <A-M> Bob: $300 Bob: $225 OK Bob: $300 Bob: $225 get_account_balance() <A-M> Web Server Bob $225 Bob: $300 Bob: $225 Advantages of replication Better load balancing of reads (potentially) Resilience against failure; high availability (with some caveats) Shortcomings How do we maintain consistency?

  9. Leader cannot disambiguate cases where requests and responses are lost Consistency Failures No ACK Bob: $300 Bob: $300 Bob: $225 No No ACK Bob: $300 Bob: $225 Bob: $300 Bob: $225 Agreement Asynchronous networks are problematic Too few replicas? Bob: $300 Bob: $300 Bob: $225 No Bob: $300 Bob: $225 Bob: $300 Bob: $225 Timeout! Agreement

  10. Byzantine Failures In some cases, replicas may be buggy or malicious: one replica may report Bob: $300 while another reports Bob: $1000, with no agreement possible. When discussing Distributed Systems, failures due to malice are known as Byzantine Failures The name comes from the Byzantine generals problem More on this later

  11. Problem and Definitions Build a distributed system that meets the following goals: The system should be able to reach consensus Consensus [n]: general agreement The system should be consistent Data should be correct; no integrity violations The system should be highly available Data should be accessible even in the face of arbitrary failures Challenges: Many, many different failure modes Theory tells us that these goals are impossible to achieve (more on this later)

  12. Distributed Commits (2PC and 3PC) Theory (FLP and CAP) Quorums (Paxos)

  13. Forcing Consistency debit_account(-$75) Bob: $300 Bob: $225 OK debit_account(-$50) Bob: $300 Bob: $225 Bob: $175 Bob Error Bob: $300 Bob: $225 Bob: $175 One approach to building distributed systems is to force them to be consistent Guarantee that all replicas receive an update Or none of them do If consistency is guaranteed, then reaching consensus is trivial

  14. Distributed Commit Problem Application that performs operations on multiple replicas or databases We want to guarantee that all replicas get updated, or none do Distributed commit problem: 1. Operation is committed when all participants can perform the action 2. Once a commit decision is reached, all participants must perform the action These two steps give rise to the Two Phase Commit protocol

  15. Motivating Transactions transfer_money(Alice, Bob, $100) debit_account(Alice, -$100) OK Error Alice: $600 Alice: $500 debit_account(Bob, $100) Bob: $300 Bob: $400 OK Error System becomes inconsistent if any individual action fails

  16. Simple Transactions transfer_money(Alice, Bob, $100) begin_transaction() debit_account(Alice, -$100) Alice: $600 Alice: $500 Alice: $500 At this point, if there haven't been any errors, we say the transaction is committed debit_account(Bob, $100) Bob: $300 Bob: $400 Bob: $400 end_transaction() OK Actions inside a transaction behave as a single action

  17. Simple Transactions transfer_money(Alice, Chris, $100) begin_transaction() debit_account(Alice, -$100) Alice: $600 Alice: $500 debit_account(Chris, $100) Bob: $300 Error If any individual action fails, the whole transaction fails Failed transactions have no side effects Incomplete results during transactions are hidden
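The all-or-nothing behavior described above can be sketched in a few lines of Python. This is a toy in-memory sketch, not a real database: `AccountStore`, `transfer_money`, and `InsufficientFunds` are illustrative names, and "commit" here just means atomically swapping in a snapshot.

```python
class InsufficientFunds(Exception):
    pass

class AccountStore:
    """Toy account store with all-or-nothing transfers (illustrative)."""

    def __init__(self, balances):
        self.balances = dict(balances)

    def transfer_money(self, src, dst, amount):
        # Work on a snapshot; only publish it if every step succeeds.
        snapshot = dict(self.balances)
        snapshot[src] -= amount
        if snapshot[src] < 0:
            # Transaction fails: no side effects are visible.
            raise InsufficientFunds(src)
        snapshot[dst] = snapshot.get(dst, 0) + amount
        # "Commit": the whole update becomes visible at once.
        self.balances = snapshot

store = AccountStore({"Alice": 600, "Bob": 300})
store.transfer_money("Alice", "Bob", 100)
print(store.balances)  # {'Alice': 500, 'Bob': 400}

try:
    store.transfer_money("Alice", "Chris", 1000)  # debit would fail
except InsufficientFunds:
    pass
print(store.balances)  # unchanged: {'Alice': 500, 'Bob': 400}
```

The snapshot-and-swap trick is what gives the failed transfer zero side effects: intermediate results only ever live in the snapshot, so incomplete transactions are hidden.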

  18. ACID Properties Traditional transactional databases support the following: 1. Atomicity: all or none; if transaction fails then no changes are applied to the database 2. Consistency: there are no violations of database integrity 3. Isolation: partial results from incomplete transactions are hidden 4. Durability: the effects of committed transactions are permanent

  19. Two Phase Commits (2PC) Well-known techniques are used to implement transactions in centralized databases E.g. journaling (append-only logs) Out of scope for this class (take a database class, or CS 5600) Two Phase Commit (2PC) is a protocol for implementing transactions in a distributed setting Protocol operates in rounds Assume we have a leader or coordinator that manages transactions Each replica promises that it is ready to commit Leader decides the outcome and instructs replicas to commit or abort Assume no byzantine faults (i.e. nobody is malicious)

  20. 2PC Example Leader Replica 1 Replica 2 Replica 3 Begin by distributing the update Txid is a logical clock x x x txid = 678; value = y x y x y x y Wait to receive ready to commit from all replicas Also called promises ready txid = 678 Time Tell replicas to commit commit txid = 678 y y y At this point, all replicas are guaranteed to be up-to-date committed txid = 678
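The two rounds in the example above can be sketched as a single-process simulation. The names (`Replica`, `two_phase_commit`) are hypothetical, and there is no networking, timeout handling, or durable log here — just the promise-then-commit message order.

```python
class Replica:
    """Toy 2PC participant: stages a value, then commits or aborts it."""

    def __init__(self):
        self.value = None
        self.staged = {}          # txid -> pending value

    def prepare(self, txid, value):
        self.staged[txid] = value
        return "ready"            # promise that it can commit

    def commit(self, txid):
        self.value = self.staged.pop(txid)
        return "committed"

    def abort(self, txid):
        self.staged.pop(txid, None)
        return "aborted"

def two_phase_commit(txid, value, replicas):
    # Phase 1: distribute the update and collect promises.
    votes = [r.prepare(txid, value) for r in replicas]
    if all(v == "ready" for v in votes):
        # Phase 2: everyone promised, so tell them all to commit.
        for r in replicas:
            r.commit(txid)
        return "committed"
    # Any missing or negative vote means the whole transaction aborts.
    for r in replicas:
        r.abort(txid)
    return "aborted"

replicas = [Replica() for _ in range(3)]
print(two_phase_commit(678, "y", replicas))  # committed
print([r.value for r in replicas])           # ['y', 'y', 'y']
```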

  21. Failure Modes Replica Failure Before or during the initial promise phase Before or during the commit Leader Failure Before receiving all promises Before or during sending commits Before receiving all committed messages

  22. Replica Failure (1) Leader Replica 1 Replica 2 Replica 3 x x x txid = 678; value = y x y x y Error: not all replicas are ready ready txid = 678 Time The same thing happens if a write or a ready is dropped, a replica times out, or a replica returns an error abort txid = 678 x x aborted txid = 678

  23. Replica Failure (2) Leader Replica 1 Replica 2 Replica 3 x y x y x y ready txid = 678 commit txid = 678 Time y Known inconsistent state Leader must keep retrying until all commits succeed committed txid = 678 commit txid = 678 y committed txid = 678

  24. Replica Failure (2) Leader Replica 1 Replica 2 Replica 3 y y x y Replicas attempt to resume unfinished transactions when they reboot stat txid = 678 commit txid = 678 Time y Finally, the system is consistent and may proceed committed txid = 678

  25. Leader Failure What happens if the leader crashes? Leader must constantly be writing its state to permanent storage It must pick up where it left off once it reboots If there are unconfirmed transactions Send new write messages, wait for ready to commit replies If there are uncommitted transactions Send new commit messages, wait for committed replies Replicas may see duplicate messages during this process Thus, it's important that every transaction have a unique txid
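The point about duplicate messages and unique txids can be illustrated with a small sketch: a replica that remembers which txids it has already applied can safely ignore the commit that a rebooted leader resends. The `Replica` class here is illustrative, not part of any real system.

```python
class Replica:
    """Toy replica that deduplicates commits by txid (illustrative)."""

    def __init__(self):
        self.value = None
        self.committed = set()    # txids already applied

    def commit(self, txid, value):
        if txid in self.committed:
            # Already applied: a retry from a recovering leader.
            return "duplicate"
        self.value = value
        self.committed.add(txid)
        return "committed"

r = Replica()
print(r.commit(678, "y"))  # committed
print(r.commit(678, "y"))  # duplicate (leader retried after reboot)
```

This is also why garbage collection (slide 29) is delicate: once a replica forgets txid 678, a replayed commit would be applied a second time.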

  26. Allowing Progress Key problem: what if the leader crashes and never recovers? By default, replicas block until contacted by the leader Can the system make progress? Yes, under limited circumstances After sending a ready to commit message, each replica starts a timer The first replica whose timer expires elects itself as the new leader Query the other replicas for their status Send commits to all replicas if they are all ready However, this only works if all the replicas are alive and reachable If a replica crashes or is unreachable, deadlock is unavoidable

  27. New Leader Leader Replica 1 Replica 2 Replica 3 x y x y x y ready txid = 678 Replica 2's timeout expires, begins recovery procedure Time stat txid = 678 ready txid = 678 commit txid = 678 y y y System is consistent again committed txid = 678

  28. Deadlock Leader Replica 1 Replica 2 Replica 3 x y x y x y ready txid = 678 Replica 2's timeout expires, begins recovery procedure Time stat txid = 678 ready txid = 678 Cannot proceed, but cannot abort stat txid = 678

  29. Garbage Collection 2PC is somewhat of a misnomer: there is a third phase Garbage collection Replicas must retain records of past transactions, just in case the leader fails For example, suppose the leader crashes, reboots, and attempts to commit a transaction that has already been committed Replicas must remember that this past transaction was already committed, since committing a second time may lead to inconsistencies In practice, the leader periodically tells replicas to garbage collect All transactions <= some txid may be deleted

  30. 2PC Summary Message complexity: O(2n) The good: guarantees consistency The bad: Write performance suffers if there are failures during the commit phase Does not scale gracefully (possible, but difficult to do) A pure 2PC system blocks all writes if the leader fails Smarter 2PC systems still block all writes if the leader + 1 replica fail 2PC sacrifices availability in favor of consistency

  31. Can 2PC be Fixed? The key issue with 2PC is reliance on the centralized leader Only the leader knows if a transaction is 100% ready to commit or not Thus, if the leader + 1 replica fail, recovery is impossible Potential solution: Three Phase Commit Add an additional round of communication Tell all replicas to prepare to commit, before actually committing State of the system can always be deduced by a subset of alive replicas that can communicate with each other unless there are partitions (more on this later)

  32. Leader Replica 1 Replica 2 Replica 3 3PC Example x x x Begin by distributing the update txid = 678; value = y x y x y x y Wait to receive ready to commit from all replicas ready txid = 678 prepare txid = 678 Time Tell all replicas that everyone is ready to commit prepared txid = 678 Tell replicas to commit commit txid = 678 At this point, all replicas are guaranteed to be up-to-date y y y committed txid = 678

  33. Leader Replica 1 Replica 2 Replica 3 Leader Failures x x x Begin by distributing the update txid = 678; value = y x y x y x y Wait to receive ready to commit from all replicas ready txid = 678 Time Replica 2's timeout expires, begins recovery procedure stat txid = 678 ready txid = 678 Replica 3 cannot be in the committed state, thus okay to abort abort txid = 678 x x System is consistent again aborted txid = 678

  34. Leader Replica 1 Replica 2 Replica 3 Leader Failures prepare txid = 678 prepared txid = 678 Replica 2's timeout expires, begins recovery procedure Time stat txid = 678 All replicas must have been ready to commit prepared txid = 678 commit txid = 678 y y System is consistent again committed txid = 678
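The recovery rule behind the two leader-failure diagrams can be sketched as a tiny decision function: if any reachable replica is in the "prepared" (or "committed") state, every replica must have promised, so the new leader can safely finish the commit; if no replica is past "ready", aborting is safe. This is a sketch of the decision rule only, under no-partition assumptions, not a full 3PC implementation.

```python
def recover(replica_states):
    """Decide commit vs. abort from the states of the reachable replicas.
    States are assumed to be "ready", "prepared", or "committed"."""
    if any(s in ("prepared", "committed") for s in replica_states):
        # A prepare was sent, so every replica must have been ready: commit.
        return "commit"
    # No replica is past "ready", so no replica can have committed: abort.
    return "abort"

print(recover(["ready", "prepared", "ready"]))  # commit
print(recover(["ready", "ready"]))              # abort
```

As the next slides show, this rule breaks under network partitions: two disjoint subsets can each run `recover` on a different view and reach opposite decisions.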

  35. Oh Great, We Fixed Everything! Wrong! 3PC is not robust against network partitions What is a network partition? A split in the network, such that full n-to-n connectivity is broken i.e. not all servers can contact each other Partitions split the network into one or more disjoint subnetworks How can a network partition occur? A switch or a router may fail, or it may receive an incorrect routing rule A cable connecting two racks of servers may develop a fault Network partitions are very real; they happen all the time

  36. Leader Replica 1 Replica 2 Replica 3 Partitioning x x x txid = 678; value = y x y x y x y ready txid = 678 Leader recovery initiated Network partitions into two subnets! prepare txid = 678 Time prepared txid = 678 Leader assumes replicas 2 and 3 have failed, moves on Abort commit txid = 678 x y x System is inconsistent committed txid = 678

  37. 3PC Summary Adds an additional phase vs. 2PC Message complexity: O(3n) Really four phases with garbage collection The good: allows the system to make progress under more failure conditions The bad: Extra round of communication makes 3PC even slower than 2PC Does not work if the network partitions 2PC will simply deadlock if there is a partition, rather than become inconsistent In practice, nobody uses 3PC Additional complexity and performance penalty just isn't worth it Loss of consistency during partitions is a deal breaker

  38. Distributed Commits (2PC and 3PC) Theory (FLP and CAP) Quorums (Paxos)

  39. A Moment of Reflection Goals, revisited: The system should be able to reach consensus Consensus [n]: general agreement The system should be consistent Data should be correct; no integrity violations The system should be highly available Data should be accessible even in the face of arbitrary failures Achieving these goals may be harder than we thought :( Huge number of failure modes Network partitions are difficult to cope with We haven't even considered byzantine failures

  40. What Can Theory Tell Us? Let's assume the network is synchronous and reliable Synchronous: no delay, sent packets arrive immediately Reliable: no packet drops or corruption Goal: get n replicas to reach consensus Algorithm can be divided into discrete rounds During each round, each host r may send m <= n messages r might crash before sending all m messages If a message from host r is not received in a round, then r must be faulty If we are willing to tolerate f total failures (f < n), how many rounds of communication do we need to guarantee consensus?

  41. Consensus in a Synchronous System Initialization: All replicas choose a value 0 or 1 (can generalize to more values if you want) Properties: Agreement: all non-faulty processes ultimately choose the same value Either 0 or 1 in this case Validity: if a replica decides on a value, then at least one replica must have started with that value This prevents the trivial solution of all replicas always choosing 0, which is technically perfect consensus but is practically useless Termination: the system must converge in finite time

  42. Algorithm Sketch Each replica maintains a map M of all known values Initially, the map only contains the replica's own value e.g., M = {"replica1": 0} Each round, broadcast M to all other replicas On receipt, construct the union of the received M and the local M Algorithm terminates when all non-faulty replicas have the values from all other non-faulty replicas Example with three non-faulty replicas (1, 3, and 5): M = {"replica1": 0, "replica3": 1, "replica5": 0} Final value is min(M.values())
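A minimal simulation of this flooding algorithm follows. It is simplified: crashed replicas are assumed to fail before round 1, so the mid-round crash subtleties covered on the next slide are not modeled, and the function name `flood_consensus` is illustrative.

```python
def flood_consensus(initial, crashed, rounds):
    """initial: {replica: 0 or 1}; crashed: replicas that fail before
    round 1 (simplification); rounds: number of broadcast rounds."""
    # Each live replica starts knowing only its own value.
    live = {r: {r: v} for r, v in initial.items() if r not in crashed}
    for _ in range(rounds):
        # Every live replica broadcasts M; receivers take the union.
        union = {}
        for m in live.values():
            union.update(m)
        for r in live:
            live[r].update(union)
    # All live replicas decide deterministically: min over known values.
    return {r: min(m.values()) for r, m in live.items()}

initial = {"r1": 0, "r2": 1, "r3": 0, "r4": 0, "r5": 1}
decisions = flood_consensus(initial, crashed={"r4", "r5"}, rounds=3)
print(decisions)  # every live replica decides the same value
```

Because every live replica ends up with the same map M, applying the same deterministic rule (min) yields agreement; validity holds because min(M.values()) is always some replica's initial value.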

  43. Bounding Convergence Time How many rounds will it take if we are willing to tolerate f failures? f + 1 rounds Key insight: all replicas must be sure that all replicas that did not crash have the same information (so they can make the same decision) Proof sketch, assuming f = 2 Worst case scenario is that replicas crash during rounds 1 and 2 During round 1, replica x crashes All other replicas don't know if x is alive or dead During round 2, replica y crashes Clear that x is not alive, but unknown if y is alive or dead During round 3, no more replicas may crash All replicas are guaranteed to receive updated info from all other replicas Final decision can be made

  44. Replica 4 Replica 5 Replica 1 Replica 2 Replica 3 0 1 0 0 1 Round 1 n = 5 replicas f = 2 maximum failures [0, 1, 0, 0, ?] [0, 1, 0, 0, ?] [0, 1, 0, 0, ?] [0, 1, 0, 0, 1] Time Round 2 Legend [r1, r2, r3, r4, r5] [0, 1, 0, ?, X] [0, 1, 0, ?, X] [0, 1, 0, 0, X] Round 3 [0, 1, 0, X, X] [0, 1, 0, X, X] [0, 1, 0, X, X]

  45. A More Realistic Model The previous result is interesting, but totally unrealistic We assume that the network is synchronous and reliable Of course, neither of these things are true in reality What if the network is asynchronous and reliable? Asynchronous: replicas may take an arbitrarily long time to respond to messages and/or the network may delay messages Let's also assume that all faults are crash faults i.e., if a replica has a problem, it crashes and never wakes up No byzantine faults

  46. The FLP Result There is no asynchronous algorithm that achieves consensus on a 1-bit value in the presence of even a single crash fault. The result is true even if no crash actually occurs! This is known as the FLP result Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, 1985 Extremely powerful result because: If you can't agree on 1 bit, generalizing to larger values isn't going to help you If you can't guarantee convergence with crash faults, no way you can guarantee convergence with byzantine faults If you can't guarantee convergence on a reliable network, no way you can on an unreliable network

  47. FLP Proof Sketch In an asynchronous system, a replica x cannot tell whether a non-responsive replica y has crashed or is just slow What can x do? If x waits, it will block indefinitely since it might never receive the message from y If x decides, it may find out later that y made a different decision The proof constructs a scenario where each attempt to decide is overruled by a delayed, asynchronous message Thus, the system oscillates between 0 and 1 and never converges

  48. Replica 4 Replica 5 Replica 1 Replica 2 Replica 3 0 1 0 0 1 Round 1 n = 5 replicas [0, 1, 0, 0, ?] [0, 1, 0, 0, ?] [0, 1, 0, 0, ?] [0, 1, 0, 0, ?] Legend Time [r1, r2, r3, r4, r5] Round 2 [1, 1, 0, ?, 1] [1, 1, 0, ?, 1] [1, 1, 0, ?, 1] [0, 1, 0, ?, 1]

  49. Impact of FLP FLP proves that any fault-tolerant distributed algorithm attempting to reach consensus has runs that never terminate These runs are extremely unlikely ("probability zero") Yet they imply that we can't find a totally correct solution And so consensus is impossible ("not always possible") So what can we do? Use randomization, probabilistic guarantees (gossip protocols) Avoid consensus, use quorum systems (Paxos or RAFT) In other words, trade off consistency in favor of availability

  50. Consistency vs. Availability FLP states that perfect consistency is impossible Practically, we can get close to perfect consistency, but at significant cost e.g., using 3PC Availability begins to suffer dramatically under failure conditions Is there a way to formalize the tradeoff between consistency and availability?
