Distributed Systems and Fault Tolerance

CS 3700: Networks and Distributed Systems
Distributed Consensus and Fault Tolerance
(or, why can’t we all just get along?)
Black Box Online Services

Storing and retrieving data from online services is commonplace
We tend to treat these services as black boxes
Data goes in, we assume outputs are correct
We have no idea how the service is implemented

Examples:
debit_transaction(-$75) → OK; get_recent_transactions() → […, “-$75”, …]
add_item_to_cart(“Cheerios”) → OK; get_cart() → [“Lucky Charms”, “Cheerios”]
post_update(“I LOLed”) → OK; get_newsfeed() → […, {“txt”: “I LOLed”, “likes”: 87}]
Peeling Back the Curtain

How are large services implemented?
Different types of services may have different requirements
Leads to different design decisions
Centralization

Advantages of centralization
Easy to set up and deploy
Consistency is guaranteed (assuming a correct software implementation)
Shortcomings
No load balancing
Single point of failure

[Diagram: Bob sends debit_transaction(-$75) to a single server, which updates Bob: $300 → $225 and replies OK; get_account_balance() then returns $225]
Sharding

Advantages of sharding
Better load balancing
If done intelligently, may allow incremental scalability
Shortcomings
Failures are still devastating

[Diagram: a web server routes Bob’s debit_account(-$75) to the <A-M> shard, which updates Bob: $300 → $225 and replies OK; a second shard holds <N-Z>; get_account_balance() returns $225]
Replication

Advantages of replication
Better load balancing of reads (potentially)
Resilience against failure; high availability (with some caveats)
Shortcomings
How do we maintain consistency?

[Diagram: a web server forwards Bob’s debit_account(-$75) to three <A-M> replicas; each updates Bob: $300 → $225 with 100% agreement, so get_account_balance() returns $225]
Consistency Failures

[Diagram: the leader sends Bob: $225 to the replicas; some ACKs never arrive, a timeout fires, and a replica is left holding Bob: $300 — no agreement]

Leader cannot disambiguate cases where requests and responses are lost
Asynchronous networks are problematic
Too few replicas?
Byzantine Failures

[Diagram: two replicas report Bob: $300, but a third reports Bob: $1000 — no agreement]

In some cases, replicas may be buggy or malicious
When discussing Distributed Systems, failures due to malice are known as Byzantine Failures
Name comes from the Byzantine generals problem
More on this later…
Problem and Definitions

Build a distributed system that meets the following goals:
The system should be able to reach consensus
Consensus [n]: general agreement
The system should be consistent
Data should be correct; no integrity violations
The system should be highly available
Data should be accessible even in the face of arbitrary failures

Challenges:
Many, many different failure modes
Theory tells us that these goals are impossible to achieve (more on this later)
 
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
Forcing Consistency

One approach to building distributed systems is to force them to be consistent
Guarantee that all replicas receive an update…
…Or none of them do
If consistency is guaranteed, then reaching consensus is trivial

[Diagram: debit_account(-$75) reaches all three replicas (Bob: $300 → $225), so the client gets OK; debit_account(-$50) fails to reach every replica, so no replica keeps Bob: $175 and the client gets Error]
 
Distributed Commit Problem

Application that performs operations on multiple replicas or databases
We want to guarantee that all replicas get updated, or none do

Distributed commit problem:
1. Operation is committed when all participants can perform the action
2. Once a commit decision is reached, all participants must perform the action

These two steps give rise to the Two Phase Commit protocol
Motivating Transactions

System becomes inconsistent if any individual action fails

[Diagram: transfer_money(Alice, Bob, $100) is implemented as debit_account(Alice, -$100) followed by debit_account(Bob, $100); if either debit returns Error after the other returned OK, Alice: $600 → $500 while Bob stays at $300 (or vice versa)]
Simple Transactions

Actions inside a transaction behave as a single action

[Diagram: transfer_money(Alice, Bob, $100) wraps debit_account(Alice, -$100) and debit_account(Bob, $100) in begin_transaction() … end_transaction(); Alice: $600 → $500 and Bob: $300 → $400 take effect together and the client gets OK]

At this point, if there haven’t been any errors, we say the transaction is committed
Simple Transactions

If any individual action fails, the whole transaction fails
Failed transactions have no side effects
Incomplete results during transactions are hidden

[Diagram: transfer_money(Alice, Chris, $100) begins a transaction; debit_account(Alice, -$100) tentatively moves Alice: $600 → $500, but debit_account(Chris, $100) fails, so the transaction aborts with Error and Alice’s balance is restored]
 
ACID Properties

Traditional transactional databases support the following:
1. Atomicity: all or none; if the transaction fails then no changes are applied to the database
2. Consistency: there are no violations of database integrity
3. Isolation: partial results from incomplete transactions are hidden
4. Durability: the effects of committed transactions are permanent
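To make this concrete, here is a hedged sketch using SQLite via Python’s built-in sqlite3 module; the accounts table and the transfer helper are invented for illustration. The failed transfer demonstrates atomicity: it leaves no side effects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('Alice', 600), ('Bob', 300)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both debits commit or neither does."""
    try:
        with conn:  # begin_transaction() ... end_transaction()
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            cur = conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                               (amount, dst))
            if cur.rowcount == 0:            # destination account missing
                raise ValueError("no such account: " + dst)
        return "OK"
    except ValueError:
        return "Error"                       # rollback happened automatically

print(transfer(conn, "Alice", "Bob", 100))    # OK: Alice 500, Bob 400
print(transfer(conn, "Alice", "Chris", 100))  # Error: Alice's debit rolled back
print(dict(conn.execute("SELECT name, balance FROM accounts")))
```

The `with conn:` block commits on success and rolls back on any exception, which is exactly the all-or-none behavior described above.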
 
Two Phase Commits (2PC)

Well known techniques are used to implement transactions in centralized databases
E.g. journaling (append-only logs)
Out of scope for this class (take a database class, or CS 5600)

Two Phase Commit (2PC) is a protocol for implementing transactions in a distributed setting
Protocol operates in rounds
Assume we have a leader or coordinator that manages transactions
Each replica promises that it is ready to commit
Leader decides the outcome and instructs replicas to commit or abort
Assume no byzantine faults (i.e. nobody is malicious)
2PC Example

[Diagram: leader and three replicas, time flowing downward. The leader multicasts the update (txid = 678; value = y); each replica stages x → y and replies “ready to commit”; the leader then multicasts commit txid = 678 and every replica makes y permanent]

Begin by distributing the update; the txid is a logical clock
Wait to receive “ready to commit” from all replicas (also called promises)
Tell replicas to commit
At this point, all replicas are guaranteed to be up-to-date
Failure Modes
 
Replica Failure
Before or during the initial promise phase
Before or during the commit
 
Leader Failure
Before receiving all promises
Before or during sending commits
Before receiving all committed messages
Replica Failure (1)

[Diagram: the leader sends txid = 678; value = y, but replica 3 fails before replying “ready”; the leader sends abort txid = 678, and the surviving replicas discard y and keep x]

Error: not all replicas are “ready”
The same thing happens if a write or a “ready” is dropped, a replica times out, or a replica returns an error
Replica Failure (2)

[Diagram: the leader sends commit txid = 678; replicas 1 and 2 commit y and reply “committed”, but replica 3 is down — a known inconsistent state. The leader keeps resending commit txid = 678 until replica 3 recovers, applies y, and replies “committed”]

Known inconsistent state
Leader must keep retrying until all commits succeed

[Diagram: when replica 3 reboots, it sends stat txid = 678 to learn the outcome, then commits y; finally, the system is consistent and may proceed]

Replicas attempt to resume unfinished transactions when they reboot
Leader Failure
 
What happens if the leader crashes?
Leader must constantly be writing its state to permanent storage
It must pick up where it left off once it reboots
If there are unconfirmed transactions
Send new write messages, wait for “ready to commit” replies
If there are uncommitted transactions
Send new commit messages, wait for “committed” replies
Replicas may see duplicate messages during this process
Thus, it’s important that every transaction have a unique txid
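A minimal sketch of why unique txids matter, with invented class and method names: the replica remembers which txids it has already committed, so a duplicate commit from a recovering leader is re-acknowledged but applied only once.

```python
class Replica:
    """Minimal sketch of duplicate-message handling via unique txids."""
    def __init__(self):
        self.committed = set()   # txids whose commits were already applied
        self.staged = {}         # txid -> tentative value
        self.value = None        # current committed value

    def on_write(self, txid, value):
        self.staged[txid] = value
        return "ready"

    def on_commit(self, txid):
        if txid in self.committed:       # duplicate from a recovering leader
            return "committed"           # re-acknowledge, apply nothing
        self.value = self.staged.pop(txid)
        self.committed.add(txid)
        return "committed"

r = Replica()
r.on_write(678, "y")
r.on_commit(678)
r.on_commit(678)   # duplicate: acknowledged, but applied only once
print(r.value)     # y
```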
Allowing Progress

Key problem: what if the leader crashes and never recovers?
By default, replicas block until contacted by the leader
Can the system make progress?
Yes, under limited circumstances
After sending a “ready to commit” message, each replica starts a timer
The first replica whose timer expires elects itself as the new leader
Query the other replicas for their status
Send “commits” to all replicas if they are all “ready”
However, this only works if all the replicas are alive and reachable
If a replica crashes or is unreachable, deadlock is unavoidable
New Leader

[Diagram: the leader crashes after sending commit txid = 678 to only replica 1. Replica 2’s timeout expires and it begins the recovery procedure: it sends stat txid = 678 to the other replicas, learns the outcome, and completes the commit; the system is consistent again]
Deadlock

[Diagram: the leader crashes before sending any commits, and replica 3 is also unreachable. Replica 2’s timeout expires and it begins the recovery procedure, sending stat txid = 678; only “ready txid = 678” from replica 1 arrives. With a replica missing, the new leader cannot proceed, but it also cannot abort]
Garbage Collection

2PC is somewhat of a misnomer: there is a third phase, garbage collection
Replicas must retain records of past transactions, just in case the leader fails
For example, suppose the leader crashes, reboots, and attempts to commit a transaction that has already been committed
Replicas must remember that this past transaction was already committed, since committing a second time may lead to inconsistencies
In practice, the leader periodically tells replicas to garbage collect
All transactions <= some txid may be deleted
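The watermark scheme above can be sketched as follows (the function name and log shape are invented for illustration):

```python
def garbage_collect(committed, watermark):
    """Drop records of transactions with txid <= watermark.

    `committed` maps txid -> outcome; the leader broadcasts `watermark`
    once it knows every replica has applied those transactions.
    """
    return {txid: outcome for txid, outcome in committed.items()
            if txid > watermark}

log = {676: "commit", 677: "abort", 678: "commit", 679: "commit"}
log = garbage_collect(log, 678)
print(log)  # {679: 'commit'}
```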
 
2PC Leader Pseudocode

Multicast write message
Collect ‘ready to commit’ replies
  All OK: log ‘commit’ to ‘outcomes’ table, write to journal, send commit
  Else: send abort
Collect ‘committed’ messages
  Repeat until all replicas respond

After failure
  For each pending transaction in ‘outcomes’ table:
    Send outcome (commit or abort)
    Wait for acknowledgments

Periodically
  Query each process: terminated protocols?
  Broadcast garbage collection message
 
2PC Replica Pseudocode

First time message received
  write: save to temp area and reply ‘ready to commit’
  commit: log outcome, make changes permanent, reply ‘committed’
  abort: log outcome, delete temp area, reply ‘aborted’

Message is a duplicate (recovering coordinator)
  Send acknowledgment (‘ready’ or ‘committed’)

After failure
  For each pending transaction:
    Contact coordinator to learn outcome

After timeout in prepare-to-commit state:
  Query other participants about state
    If outcome can be deduced: run leader-recovery protocol
    If outcome uncertain: wait for the leader
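Putting the leader and replica pseudocode together, here is a toy single-process sketch of one 2PC round; there is no real networking, journaling, or recovery, and all class, method, and message names are invented for illustration.

```python
class Replica:
    def __init__(self, fail_write=False):
        self.fail_write = fail_write
        self.staged, self.value = {}, None

    def on_write(self, txid, value):
        if self.fail_write:                  # simulate a crash/error/timeout
            return "error"
        self.staged[txid] = value            # save to temp area
        return "ready"

    def on_commit(self, txid):
        self.value = self.staged.pop(txid)   # make changes permanent

    def on_abort(self, txid):
        self.staged.pop(txid, None)          # delete temp area

def two_phase_commit(leader_log, replicas, txid, value):
    """Toy 2PC round: returns 'commit' or 'abort' and applies the outcome."""
    # Phase 1: multicast the write, collect 'ready to commit' promises
    votes = [r.on_write(txid, value) for r in replicas]
    if all(v == "ready" for v in votes):
        leader_log[txid] = "commit"          # journal the decision first
        for r in replicas:                   # Phase 2: tell replicas to commit
            r.on_commit(txid)
    else:
        leader_log[txid] = "abort"
        for r in replicas:
            r.on_abort(txid)
    return leader_log[txid]

log = {}
good = [Replica(), Replica(), Replica()]
print(two_phase_commit(log, good, 678, "y"))   # commit
print([r.value for r in good])                 # ['y', 'y', 'y']

mixed = [Replica(), Replica(fail_write=True)]
print(two_phase_commit(log, mixed, 679, "z"))  # abort: no replica applies z
```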
2PC Summary

Message complexity: O(2n)
The good: guarantees consistency
The bad:
Write performance suffers if there are failures during the commit phase
Does not scale gracefully (possible, but difficult to do)
A pure 2PC system blocks all writes if the leader fails
Smarter 2PC systems still block all writes if the leader + 1 replica fail
2PC sacrifices availability in favor of consistency
 
Can 2PC be Fixed?

The key issue with 2PC is reliance on the centralized leader
Only the leader knows if a transaction is 100% ready to commit or not
Thus, if the leader + 1 replica fail, recovery is impossible
Potential solution: Three Phase Commit
Add an additional round of communication
Tell all replicas to prepare to commit, before actually committing
State of the system can always be deduced by a subset of alive replicas that can communicate with each other
… unless there are partitions (more on this later)
3PC Example

[Diagram: leader and three replicas. The leader distributes the update (txid = 678; value = y); replicas stage x → y and reply “ready to commit”; the leader sends prepare txid = 678; after collecting “prepared” replies it sends commit txid = 678, and all replicas apply y]

Begin by distributing the update
Wait to receive “ready to commit” from all replicas
Tell all replicas that everyone is “ready to commit”
Tell replicas to commit
At this point, all replicas are guaranteed to be up-to-date
Leader Failures

[Diagram: the leader distributes txid = 678; value = y but crashes before sending any prepare messages. Replica 2’s timeout expires and it begins the recovery procedure: it sends stat txid = 678 and sees only “ready” replies. Since no prepare was sent, replica 3 cannot be in the committed state, thus it is okay to abort; replica 2 sends abort txid = 678 and the system is consistent again]
Leader Failures

[Diagram: the leader crashes after sending prepare txid = 678. Replica 2’s timeout expires and it begins the recovery procedure: it sends stat txid = 678 and receives “prepared txid = 678” replies. All replicas must have been ready to commit, so replica 2 sends commit txid = 678 and the system is consistent again]
Oh Great, We Fixed Everything!

Wrong
3PC is not robust against network partitions
What is a network partition?
A split in the network, such that full n-to-n connectivity is broken
i.e. not all servers can contact each other
Partitions split the network into two or more disjoint subnetworks
How can a network partition occur?
A switch or a router may fail, or it may receive an incorrect routing rule
A cable connecting two racks of servers may develop a fault
Network partitions are very real; they happen all the time
Partitioning

[Diagram: during a 3PC round for txid = 678, the network partitions into two subnets: the leader and replica 1 on one side, replicas 2 and 3 on the other. The leader assumes replicas 2 and 3 have failed, completes prepare and commit with replica 1, and moves on. Meanwhile, replica 2’s side initiates leader recovery and aborts. The two sides have diverged; the system is inconsistent]
3PC Summary

Adds an additional phase vs. 2PC
Message complexity: O(3n)
Really four phases with garbage collection
The good: allows the system to make progress under more failure conditions
The bad:
Extra round of communication makes 3PC even slower than 2PC
Does not work if the network partitions
2PC will simply deadlock if there is a partition, rather than become inconsistent
In practice, nobody uses 3PC
The additional complexity and performance penalty just aren’t worth it
Loss of consistency during partitions is a deal breaker
 
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
A Moment of Reflection

Goals, revisited:
The system should be able to reach consensus
Consensus [n]: general agreement
The system should be consistent
Data should be correct; no integrity violations
The system should be highly available
Data should be accessible even in the face of arbitrary failures

Achieving these goals may be harder than we thought :(
Huge number of failure modes
Network partitions are difficult to cope with
We haven’t even considered byzantine failures
What Can Theory Tell Us?

Let's assume the network is synchronous and reliable
Synchronous: no delay, sent packets arrive immediately
Reliable: no packet drops or corruption

Goal: get n replicas to reach consensus
Algorithm can be divided into discrete rounds
During each round, each host r may send m <= n messages
r might crash before sending all m messages
If a message from host r is not received in a round, then r must be faulty

If we are willing to tolerate f total failures (f < n), how many rounds of communication do we need to guarantee consensus?
 
Consensus in a Synchronous System

Initialization:
All replicas choose a value 0 or 1 (can generalize to more values if you want)

Properties:
Agreement: all non-faulty processes ultimately choose the same value
Either 0 or 1 in this case
Validity: if a replica decides on a value, then at least one replica must have started with that value
This prevents the trivial solution of all replicas always choosing 0, which is technically perfect consensus but is practically useless
Termination: the system must converge in finite time
 
Algorithm Sketch

Each replica maintains a map M of all known values
Initially, the map only contains the replica’s own value
e.g., M = {‘replica1’: 0}
Each round, broadcast M to all other replicas
On receipt, construct the union of the received M and the local M
Algorithm terminates when all non-faulty replicas have the values from all other non-faulty replicas
Example with three non-faulty replicas (1, 3, and 5)
M = {‘replica1’: 0, ‘replica3’: 1, ‘replica5’: 0}
Final value is min(M.values())
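A hedged simulation of this sketch; the replica ids, the crash scheduling, and the “crashed replica reaches only half its peers” model are invented for illustration.

```python
def synchronous_consensus(values, crash_round=None, crash_id=None, f=1):
    """Simulate the flooding algorithm for f+1 synchronous rounds.

    `values` maps replica id -> initial bit. If `crash_id` is set, that
    replica crashes during `crash_round` after sending to only some peers.
    """
    known = {r: {r: v} for r, v in values.items()}   # each replica's map M
    alive = set(values)
    for rnd in range(1, f + 2):                      # f+1 rounds total
        inboxes = {r: [] for r in alive}
        for sender in list(alive):
            peers = [p for p in alive if p != sender]
            if rnd == crash_round and sender == crash_id:
                peers = peers[:len(peers) // 2]      # crash mid-broadcast
                alive.discard(sender)
            for p in peers:
                inboxes[p].append(known[sender])
        for r in alive:
            for m in inboxes[r]:
                known[r].update(m)                   # union of maps
    # every surviving replica decides min over its (now identical) map
    return {r: min(known[r].values()) for r in alive}

decisions = synchronous_consensus({1: 0, 2: 1, 3: 0, 4: 0, 5: 1},
                                  crash_round=1, crash_id=5, f=1)
print(decisions)  # all survivors agree on the same bit
```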
Bounding Convergence Time

How many rounds will it take if we are willing to tolerate f failures?
f + 1 rounds
Key insight: all replicas must be sure that all replicas that did not crash have the same information (so they can make the same decision)

Proof sketch, assuming f = 2
Worst case scenario is that replicas crash during rounds 1 and 2
During round 1, replica x crashes
All other replicas don’t know if x is alive or dead
During round 2, replica y crashes
Clear that x is not alive, but unknown if y is alive or dead
During round 3, no more replicas may crash
All replicas are guaranteed to receive updated info from all other replicas
Final decision can be made
[Diagram: n = 5 replicas, f = 2 maximum failures; vectors are [r1, r2, r3, r4, r5] with initial values 0, 1, 0, 0, 1. With no crashes, after round 1 every replica already holds [0, 1, 0, 0, 1], and rounds 2 and 3 simply reconfirm it]
[Diagram: same setup, but replica 5 crashes during round 1 (only one replica receives its value, so the others hold [0, 1, 0, 0, ?]) and replica 4 crashes during round 2 (some replicas hold [0, 1, 0, ?, X]). By the end of round 3 all survivors hold [0, 1, 0, X, X] and can decide]
 
A More Realistic Model

The previous result is interesting, but totally unrealistic
We assumed that the network is synchronous and reliable
Of course, neither of these things is true in reality

What if the network is asynchronous and reliable?
Asynchronous: replicas may take an arbitrarily long time to respond to messages and/or the network may delay messages

Let’s also assume that all faults are crash faults
i.e., if a replica has a problem, it crashes and never wakes up
No byzantine faults
 
The FLP Result

There is no asynchronous algorithm that achieves consensus on a 1-bit value in the presence of crash faults. The result is true even if no crash actually occurs!

This is known as the FLP result
Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, 1985
Extremely powerful result because:
If you can’t agree on 1 bit, generalizing to larger values isn’t going to help you
If you can’t guarantee convergence with crash faults, there is no way you can guarantee convergence with byzantine faults
If you can’t guarantee convergence on a reliable network, there is no way you can on an unreliable network
FLP Proof Sketch

In an asynchronous system, a replica x cannot tell whether a non-responsive replica y has crashed or is just slow

What can x do?
If x waits, it may block indefinitely, since it might never receive the message from y
If x decides, it may find out later that y made a different decision

The proof constructs a scenario where each attempt to decide is overruled by a delayed, asynchronous message
Thus, the system oscillates between 0 and 1 and never converges
[Diagram: n = 5 replicas with initial values 0, 1, 0, 0, 1; vectors are [r1, r2, r3, r4, r5]. In round 1, replica 5’s message is delayed, so replicas hold [0, 1, 0, 0, ?]; in round 2 the delayed value arrives while another message is delayed, so views keep changing and no decision sticks]
 
Impact of FLP

FLP proves that any fault-tolerant distributed algorithm attempting to reach consensus has runs that never terminate
These runs are extremely unlikely (“probability zero”)
Yet they imply that we can’t find a totally correct solution
And so – “consensus is impossible” (really, “not always possible”)

So what can we do?
Use randomization and probabilistic guarantees (gossip protocols)
Avoid consensus; use quorum systems (Paxos or Raft)
In other words, trade off between consistency and availability
 
Consistency vs. Availability

FLP states that perfect consensus is impossible

Practically, we can get close to perfect consistency, but at significant cost
e.g., using 3PC
Availability begins to suffer dramatically under failure conditions

Is there a way to formalize the tradeoff between consistency and availability?
Eric Brewer’s CAP Theorem

CAP theorem for distributed data replication
Consistency: updates to data are applied to all or none
Availability: must be able to access all data
Network Partition Tolerance: failures can partition the network into disjoint subnetworks

The Brewer Theorem
No system can simultaneously achieve C and A and PT

Typical interpretation: “C, A, and PT: choose 2”
In practice, any network may partition, thus you must choose PT
A better interpretation might be “C or A: choose 1”

Brewer never formally proved or published it
Yet it is widely accepted as a rule of thumb
CAP Examples

[Diagram, A+PT: a client writes (key, 2) to one replica, but the partition prevents replication, so another replica still holds (key, 1). A read may return the stale (key, 1): the client can always read, but the result is not consistent]

[Diagram, C+PT: same write, same partition. Because reads must always return accurate results, the out-of-date replica refuses to answer: “Error: Service Unavailable”. Consistency is preserved, but there is no availability]
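A minimal sketch of the two behaviors above (the class, mode names, and error message are invented): during a partition, a cut-off replica either answers with possibly stale data (A+PT) or refuses to answer (C+PT).

```python
class PartitionedReplica:
    """A replica cut off from the writer during a partition."""
    def __init__(self, value, mode):
        self.value = value        # possibly stale: replication was blocked
        self.mode = mode          # "AP" or "CP"
        self.in_sync = False      # we know we may have missed updates

    def read(self, key):
        if self.mode == "CP" and not self.in_sync:
            raise RuntimeError("Error: Service Unavailable")
        return self.value         # AP: answer anyway, possibly stale

ap = PartitionedReplica(("key", 1), "AP")
print(ap.read("key"))             # ('key', 1) : stale but available

cp = PartitionedReplica(("key", 1), "CP")
try:
    cp.read("key")
except RuntimeError as e:
    print(e)                      # Error: Service Unavailable
```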
“C or A: Choose 1”

Taken to the extreme, CAP suggests a binary division in distributed systems
Your system is consistent or available
In practice, it’s more like a spectrum of possibilities

[Diagram: a spectrum from Perfect Consistency to Always Available. At the consistency end: ACID-compliant SQL databases, where financial information must be correct. At the availability end: NoSQL systems that serve content to all visitors, regardless of correctness. In the middle: systems that attempt to balance correctness with availability]
 
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
Strong Consistency, Revisited

2PC and 3PC achieve strong consistency, but they have significant shortcomings
2PC cannot make progress in the face of leader + 1 replica failures
3PC loses consistency guarantees in the face of network partitions

Where do we go from here?

Observation: 2PC and 3PC attempt to reach 100% agreement
What if 51% of the replicas agree?

Quorum systems
Quorum Systems
 
Advantages of Quorums

Availability: quorum systems are more resilient in the face of failures
Quorum systems can be designed to tolerate both benign and byzantine failures

Efficiency: can significantly reduce communication complexity
Do not require all servers in order to perform an operation
Only a subset of them is required for each operation
High-Level Quorum Example

[Diagram: five replicas store timestamped versions of Bob’s balance (ts 1: $300, ts 2: $400, ts 3: $375). Each write goes to a quorum (a majority) of replicas, so different replicas may hold different prefixes of the history. A read also queries a quorum; because any two quorums overlap, at least one replica in the read quorum holds the latest version, and the reader returns the value with the highest timestamp: ts 3, Bob: $375]
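A minimal sketch of timestamped majority-quorum reads and writes; the Store class and its interface are invented for illustration, and real systems add far more machinery.

```python
class Store:
    def __init__(self, n):
        self.replicas = [dict() for _ in range(n)]  # key -> (ts, value)
        self.quorum = n // 2 + 1                    # majority

    def write(self, key, ts, value, reachable):
        """Write to a quorum of reachable replicas; fail otherwise."""
        if len(reachable) < self.quorum:
            return "Error"
        for i in reachable[:self.quorum]:
            self.replicas[i][key] = (ts, value)
        return "OK"

    def read(self, key, reachable):
        """Read from a quorum and return the highest-timestamped value."""
        if len(reachable) < self.quorum:
            return "Error"
        versions = [self.replicas[i][key] for i in reachable[:self.quorum]
                    if key in self.replicas[i]]
        return max(versions)[1]   # the (ts, value) pair with the largest ts

s = Store(5)
s.write("bob", 1, 300, reachable=[0, 1, 2, 3, 4])
s.write("bob", 2, 400, reachable=[0, 1, 2])       # a different quorum
print(s.read("bob", reachable=[2, 3, 4]))         # 400: quorums overlap at 2
```

Because any two majorities of the five replicas intersect, the read quorum always contains at least one replica that saw the latest write.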
 
Paxos

History of Paxos

Developed by Turing award winner Leslie Lamport
First published as a tech report in 1989
Journal refused to publish it; nobody understood the protocol
Formally published in 1998
Again, nobody understands it
Leslie Lamport publishes “Paxos Made Simple” in 2001
People start to get the protocol
Reaches widespread fame in 2006-2007
Used by Google in their Chubby distributed mutex system
Zookeeper is the open-source version of Chubby
Paxos at a High-Level

1. Replicas elect a leader and agree on the view number
The view is a logical clock that divides time into epochs
During each view, there is a single leader
2. The leader collects promises from the replicas
Replicas promise to only accept proposals from the current or future views
Prevents replicas from going “back in time”
The leader learns about proposed updates from the previous view that haven’t yet been accepted
3. The leader proposes updates and replicas accept them
Start by completing unfinished updates from the previous view
Then move on to new writes from clients
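As a hedged illustration, here is single-decree Paxos (the core “synod” protocol for agreeing on one value), not the full multi-view protocol sketched above; class, method, and message names are invented.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1                 # highest proposal number promised
        self.accepted = None               # (number, value) or None

    def prepare(self, n):
        """Promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, n, value):
        if n >= self.promised:             # honor the promise
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "nack"

def propose(acceptors, n, value):
    """One proposer round; returns the chosen value or None on failure."""
    quorum = len(acceptors) // 2 + 1
    # Phase 1: prepare/promise
    replies = [a.prepare(n) for a in acceptors]
    promises = [acc for tag, acc in replies if tag == "promise"]
    if len(promises) < quorum:
        return None
    # If any acceptor already accepted a value, we must re-propose it
    prior = [acc for acc in promises if acc is not None]
    if prior:
        value = max(prior)[1]              # value of highest-numbered proposal
    # Phase 2: commit/accept
    votes = [a.accept(n, value) for a in acceptors]
    return value if votes.count("accepted") >= quorum else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, 1, "y"))   # y
print(propose(acceptors, 2, "z"))   # y : the earlier choice is preserved
```

Note how a later proposer with a higher number cannot overwrite an already-chosen value: the promises it collects reveal the prior accepted proposal, which it must re-propose.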
View Selection

[Diagram: replicas exchange view numbers over time until they converge on the same view]

All replicas have a view number
The goal is to have all replicas agree on the view
The leader is the replica with ID = hash(view)
Prepare/Promise

[Diagram: the leader sends prepare view=5 clock=13 to the replicas, which reply with promises]

Replicas promise to not accept any messages with view < v
Replicas won’t elect a new leader until the current one fails (measured using a timeout)
Commit/Accept

[Diagram: a client write is forwarded to the leader, which sends commit clock=14 with the new value y to the replicas; the replicas store y and the client gets OK]

All client requests are serialized through the leader
Replicas write the new value to temporary storage
Paxos Review

Failure Modes
1. What happens if a commit fails?
2. What happens during a partition?
3. What happens after the leader fails?
Bad Commit

[Diagram: the leader sends commit clock=14; some replicas accept y, but one rejects or misses the commit]

What happens if a quorum does not accept a commit?
The leader must retry until quorum is reached, or broadcast an abort
Replicas that fall behind can reconcile by downloading missed updates from a peer
Partitions (1)

[Diagram: a partition separates the replicas; the majority side holds a view/elect round and continues committing y]

What happens during a partition?
Once the partition is fixed, either:
Hold a new leader election and move forward
Or, reconcile with up-to-date peers
Partitions (2)

[Diagram: the majority side elects a new leader for view = 1 and proceeds through prepare and commit]

What happens when the view = 0 group attempts to rejoin?
Promises for view = 1 prevent the old leader from interfering with the new quorum
Leader Failure (1)

[Diagram: the old leader fails after sending commit clock=14 to some replicas; the new leader’s prepare clock=13 round reveals the update, so it recommits clock=14 before moving on to clock=14’]

What happens if there is an uncommitted update with no quorum?
The leader is aware of the uncommitted update
The leader must recommit the original clock=14 update
Leader Failure (2)

[Diagram: the old leader fails after sending commit clock=14 to only replica 3; none of the new leader’s prepare clock=13 responses mention it]

What happens if there is an uncommitted update with no quorum?
The leader is unaware of the uncommitted update
The leader announces a new update with clock=14’, which is rejected by replica 3
Replica 3 is desynchronized, and must reconcile with another replica
Leader Failure (3)

[Diagram: the old leader fails after commit clock=14 reached a quorum of replicas; the new leader sends prepares and collects promises, which reveal the update so it can be completed]

What happens if there is an uncommitted update with a quorum?
Send prepares, collect promises; the quorum guarantees the new leader learns about the update
 
The Devil in the Details

Clearly, Paxos is complicated

Things we haven’t covered:
Reconciliation: how to bring a replica up-to-date
Managing the queue of updates from clients
Updates may be sent to any replica
Replicas are responsible for responding to clients who contact them
Replicas may need to re-forward updates if the leader changes
Garbage collection
Replicas need to remember the exact history of updates, in case the leader changes
Periodically, these lists need to be garbage collected
 
Odds and Ends
 
Byzantine Generals
Gossip
 
Byzantine Generals Problem

Name coined by Leslie Lamport
Several Byzantine generals are laying siege to an enemy city
They have to agree on a common strategy: attack or retreat
They can only communicate by messenger
Some generals may be traitors (their identity is unknown)

Do you see the connection with the consensus problem?
Byzantine Distributed Systems

Goals
1. All loyal lieutenants obey the same order
2. If the commanding general is loyal, then every loyal lieutenant obeys the order he sends

Can the problem be solved?
Yes, iff there are at least 3m+1 generals in total in the presence of m traitors
E.g. with only 3 generals, even 1 traitor makes the problem unsolvable

Bazillion variations on the basic problem
What if messages are cryptographically signed (i.e. they are unforgeable)?
What if communication is not g x g (i.e. some pairs of generals cannot communicate)?
Most algos have byzantine versions (e.g. Byzantine Paxos)
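The 3m+1 bound can be expressed as a pair of trivial helpers (function names invented for illustration):

```python
def max_byzantine_traitors(n):
    """Most traitors m that n generals can tolerate: requires n >= 3m + 1."""
    return (n - 1) // 3

def generals_needed(m):
    """Fewest generals required to tolerate m traitors."""
    return 3 * m + 1

print(max_byzantine_traitors(3))   # 0 : three generals cannot survive a traitor
print(max_byzantine_traitors(4))   # 1
print(generals_needed(2))          # 7
```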
Most algos have byzantine versions (e.g. Byzantine Paxos)
Alternatives to Quorums

Quorums favor consistency over availability
If no quorum exists, then the system stops accepting writes
Significant overhead maintaining consistent replicated state

What if eventual consistency is okay?
Favor availability over consistency
Results may be stale or incorrect sometimes (hopefully only in rare cases)

Gossip protocols
Replicas periodically, randomly exchange state with each other
No strong consistency guarantees, but…
Surprisingly fast and reliable convergence to up-to-date state
Requires vector clocks or better in order to causally order events
Extreme cases of divergence may be irreconcilable
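A hedged sketch of push-style gossip (anti-entropy) with per-key timestamps instead of full vector clocks; the state shape, fanout model, and names are invented for illustration.

```python
import random

def gossip_round(states, fanout=1):
    """Each replica pushes its state to `fanout` random peers.

    `states` maps replica id -> {key: (timestamp, value)}; on receipt, a
    replica keeps the version with the newer timestamp for each key.
    """
    ids = list(states)
    for sender in ids:
        for peer in random.sample([i for i in ids if i != sender], fanout):
            for key, (ts, val) in states[sender].items():
                if key not in states[peer] or states[peer][key][0] < ts:
                    states[peer][key] = (ts, val)

random.seed(42)
# Five replicas; only replica 0 has seen the latest write to "bob"
states = {i: {"bob": (1, 300)} for i in range(5)}
states[0]["bob"] = (2, 375)

rounds = 0
while any(s["bob"] != (2, 375) for s in states.values()):
    gossip_round(states)
    rounds += 1
print(rounds)  # converges after a handful of random exchanges
```

Convergence is probabilistic, which is exactly the trade described above: no hard guarantee per round, but the expected time to reach every replica grows only logarithmically with the number of replicas.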
 
Sources

1. Some slides courtesy of Cristina Nita-Rotaru (http://cnitarot.github.io/courses/ds_Fall_2016/)
2. The Part-Time Parliament, Leslie Lamport. http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf
3. Paxos Made Simple, Leslie Lamport. http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
4. Paxos for System Builders, Jonathan Kirsch and Yair Amir. http://www.cs.jhu.edu/~jak/docs/paxos_for_system_builders.pdf
5. The Chubby Lock Service for Loosely-Coupled Distributed Systems, Mike Burrows. http://research.google.com/archive/chubby-osdi06.pdf
6. Paxos Made Live – An Engineering Perspective, Tushar Deepak Chandra, Robert Griesemer, Joshua Redstone. http://research.google.com/archive/paxos_made_live.pdf
7. Apache Zookeeper. https://zookeeper.apache.org/
Slide Note

Last update: 11/14/2023

Embed
Share

Exploring the intricacies of distributed systems and fault tolerance in online services, from black box implementations to centralized systems, sharding, and replication strategies. Dive into the advantages and shortcomings of each approach to data storage and processing.

  • Distributed Systems
  • Fault Tolerance
  • Online Services
  • Centralization
  • Sharding

Uploaded on Sep 29, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why can t we all just get along?)

  2. Black Box Online Services Black Box Service Storing and retrieving data from online services is commonplace We tend to treat these services as black boxes Data goes in, we assume outputs are correct We have no idea how the service is implemented

  3. Black Box Online Services debit_transaction(-$75) OK get_recent_transactions() [ , -$75 , ] Storing and retrieving data from online services is commonplace We tend to treat these services as black boxes Data goes in, we assume outputs are correct We have no idea how the service is implemented

  4. Black Box Online Services post_update( I LOLed ) OK get_newsfeed() [ , { txt : I LOLed , likes : 87}] Storing and retrieving data from online services is commonplace We tend to treat these services as black boxes Data goes in, we assume outputs are correct We have no idea how the service is implemented

  5. Peeling Back the Curtain Black Box Service ? How are large services implemented? Different types of services may have different requirements Leads to different design decisions

  6. Centralization debit_transaction(-$75) ? OK get_account_balance() Bob: $300 Bob: $225 Bob $225 Advantages of centralization Easy to setup and deploy Consistency is guaranteed (assuming correct software implementation) Shortcomings No load balancing Single point of failure

  7. Sharding <A-M> debit_account(-$75) Bob: $300 Bob: $225 OK <N-Z> Web Server get_account_balance() Bob $225 Advantages of sharding Better load balancing If done intelligently, may allow incremental scalability Shortcomings Failures are still devastating

  8. Replication 100% Agreement <A-M> debit_account(-$75) <A-M> Bob: $300 Bob: $225 OK Bob: $300 Bob: $225 get_account_balance() <A-M> Web Server Bob $225 Bob: $300 Bob: $225 Advantages of replication Better load balancing of reads (potentially) Resilience against failure; high availability (with some caveats) Shortcomings How do we maintain consistency?

  9. Leader cannot disambiguate cases where requests and responses are lost Consistency Failures No ACK Bob: $300 Bob: $300 Bob: $225 No No ACK Bob: $300 Bob: $225 Bob: $300 Bob: $225 Agreement Asynchronous networks are problematic Too few replicas? Bob: $300 Bob: $300 Bob: $225 No Bob: $300 Bob: $225 Bob: $300 Bob: $225 Timeout! Agreement

  10. Byzantine Failures In some cases, replicas may be buggy or malicious: one replica may report Bob: $300 while another reports Bob: $1000, with no agreement possible. When discussing Distributed Systems, failures due to malice are known as Byzantine Failures The name comes from the Byzantine generals problem More on this later

  11. Problem and Definitions Build a distributed system that meets the following goals: The system should be able to reach consensus Consensus [n]: general agreement The system should be consistent Data should be correct; no integrity violations The system should be highly available Data should be accessible even in the face of arbitrary failures Challenges: Many, many different failure modes Theory tells us that these goals are impossible to achieve (more on this later)

  12. Distributed Commits (2PC and 3PC) Theory (FLP and CAP) Quorums (Paxos)

  13. Forcing Consistency debit_account(-$75) Bob: $300 Bob: $225 OK debit_account(-$50) Bob: $300 Bob: $225 Bob: $175 Bob Error Bob: $300 Bob: $225 Bob: $175 One approach to building distributed systems is to force them to be consistent Guarantee that all replicas receive an update Or none of them do If consistency is guaranteed, then reaching consensus is trivial

  14. Distributed Commit Problem Application that performs operations on multiple replicas or databases We want to guarantee that all replicas get updated, or none do Distributed commit problem: 1. Operation is committed when all participants can perform the action 2. Once a commit decision is reached, all participants must perform the action These two steps give rise to the Two Phase Commit protocol

  15. Motivating Transactions transfer_money(Alice, Bob, $100) debit_account(Alice, -$100) OK Error Alice: $600 Alice: $500 debit_account(Bob, $100) Bob: $300 Bob: $400 OK Error System becomes inconsistent if any individual action fails

  16. Simple Transactions transfer_money(Alice, Bob, $100) begin_transaction() debit_account(Alice, -$100) Alice: $600 Alice: $500 Alice: $500 At this point, if there haven't been any errors, we say the transaction is committed debit_account(Bob, $100) Bob: $300 Bob: $400 Bob: $400 end_transaction() OK Actions inside a transaction behave as a single action

  17. Simple Transactions transfer_money(Alice, Chris, $100) begin_transaction() debit_account(Alice, -$100) Alice: $600 Alice: $500 debit_account(Chris, $100) Bob: $300 Error If any individual action fails, the whole transaction fails Failed transactions have no side effects Incomplete results during transactions are hidden
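The all-or-nothing behavior described above can be sketched in a few lines of Python. This is a toy in-memory sketch, not a real database: `AccountStore`, `transfer_money`, and `InsufficientFunds` are illustrative names, and "commit" here just means atomically swapping in a snapshot.

```python
class InsufficientFunds(Exception):
    pass

class AccountStore:
    """Toy account store with all-or-nothing transfers (illustrative)."""

    def __init__(self, balances):
        self.balances = dict(balances)

    def transfer_money(self, src, dst, amount):
        # Work on a snapshot; only publish it if every step succeeds.
        snapshot = dict(self.balances)
        snapshot[src] -= amount
        if snapshot[src] < 0:
            # Transaction fails: no side effects are visible.
            raise InsufficientFunds(src)
        snapshot[dst] = snapshot.get(dst, 0) + amount
        # "Commit": the whole update becomes visible at once.
        self.balances = snapshot

store = AccountStore({"Alice": 600, "Bob": 300})
store.transfer_money("Alice", "Bob", 100)
print(store.balances)  # {'Alice': 500, 'Bob': 400}

try:
    store.transfer_money("Alice", "Chris", 1000)  # debit would fail
except InsufficientFunds:
    pass
print(store.balances)  # unchanged: {'Alice': 500, 'Bob': 400}
```

The snapshot-and-swap trick is what gives the failed transfer zero side effects: intermediate results only ever live in the snapshot, so incomplete transactions are hidden.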

  18. ACID Properties Traditional transactional databases support the following: 1. Atomicity: all or none; if transaction fails then no changes are applied to the database 2. Consistency: there are no violations of database integrity 3. Isolation: partial results from incomplete transactions are hidden 4. Durability: the effects of committed transactions are permanent

  19. Two Phase Commits (2PC) Well-known techniques are used to implement transactions in centralized databases E.g. journaling (append-only logs) Out of scope for this class (take a database class, or CS 5600) Two Phase Commit (2PC) is a protocol for implementing transactions in a distributed setting Protocol operates in rounds Assume we have a leader or coordinator that manages transactions Each replica promises that it is ready to commit Leader decides the outcome and instructs replicas to commit or abort Assume no byzantine faults (i.e. nobody is malicious)

  20. 2PC Example Leader Replica 1 Replica 2 Replica 3 Begin by distributing the update Txid is a logical clock x x x txid = 678; value = y x y x y x y Wait to receive ready to commit from all replicas Also called promises ready txid = 678 Time Tell replicas to commit commit txid = 678 y y y At this point, all replicas are guaranteed to be up-to-date committed txid = 678
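The two rounds in the example above can be sketched as a single-process simulation. The names (`Replica`, `two_phase_commit`) are hypothetical, and there is no networking, timeout handling, or durable log here — just the promise-then-commit message order.

```python
class Replica:
    """Toy 2PC participant: stages a value, then commits or aborts it."""

    def __init__(self):
        self.value = None
        self.staged = {}          # txid -> pending value

    def prepare(self, txid, value):
        self.staged[txid] = value
        return "ready"            # promise that it can commit

    def commit(self, txid):
        self.value = self.staged.pop(txid)
        return "committed"

    def abort(self, txid):
        self.staged.pop(txid, None)
        return "aborted"

def two_phase_commit(txid, value, replicas):
    # Phase 1: distribute the update and collect promises.
    votes = [r.prepare(txid, value) for r in replicas]
    if all(v == "ready" for v in votes):
        # Phase 2: everyone promised, so tell them all to commit.
        for r in replicas:
            r.commit(txid)
        return "committed"
    # Any missing or negative vote means the whole transaction aborts.
    for r in replicas:
        r.abort(txid)
    return "aborted"

replicas = [Replica() for _ in range(3)]
print(two_phase_commit(678, "y", replicas))  # committed
print([r.value for r in replicas])           # ['y', 'y', 'y']
```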

  21. Failure Modes Replica Failure Before or during the initial promise phase Before or during the commit Leader Failure Before receiving all promises Before or during sending commits Before receiving all committed messages

  22. Replica Failure (1) Leader Replica 1 Replica 2 Replica 3 x x x txid = 678; value = y x y x y Error: not all replicas are ready ready txid = 678 Time The same thing happens if a write or a ready is dropped, a replica times out, or a replica returns an error abort txid = 678 x x aborted txid = 678

  23. Replica Failure (2) Leader Replica 1 Replica 2 Replica 3 x y x y x y ready txid = 678 commit txid = 678 Time y Known inconsistent state Leader must keep retrying until all commits succeed committed txid = 678 commit txid = 678 y committed txid = 678

  24. Replica Failure (2) Leader Replica 1 Replica 2 Replica 3 y y x y Replicas attempt to resume unfinished transactions when they reboot stat txid = 678 commit txid = 678 Time y Finally, the system is consistent and may proceed committed txid = 678

  25. Leader Failure What happens if the leader crashes? Leader must constantly be writing its state to permanent storage It must pick up where it left off once it reboots If there are unconfirmed transactions Send new write messages, wait for ready to commit replies If there are uncommitted transactions Send new commit messages, wait for committed replies Replicas may see duplicate messages during this process Thus, it's important that every transaction have a unique txid
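The point about duplicate messages and unique txids can be illustrated with a small sketch: a replica that remembers which txids it has already applied can safely ignore the commit that a rebooted leader resends. The `Replica` class here is illustrative, not part of any real system.

```python
class Replica:
    """Toy replica that deduplicates commits by txid (illustrative)."""

    def __init__(self):
        self.value = None
        self.committed = set()    # txids already applied

    def commit(self, txid, value):
        if txid in self.committed:
            # Already applied: a retry from a recovering leader.
            return "duplicate"
        self.value = value
        self.committed.add(txid)
        return "committed"

r = Replica()
print(r.commit(678, "y"))  # committed
print(r.commit(678, "y"))  # duplicate (leader retried after reboot)
```

This is also why garbage collection (slide 29) is delicate: once a replica forgets txid 678, a replayed commit would be applied a second time.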

  26. Allowing Progress Key problem: what if the leader crashes and never recovers? By default, replicas block until contacted by the leader Can the system make progress? Yes, under limited circumstances After sending a ready to commit message, each replica starts a timer The first replica whose timer expires elects itself as the new leader Query the other replicas for their status Send commits to all replicas if they are all ready However, this only works if all the replicas are alive and reachable If a replica crashes or is unreachable, deadlock is unavoidable

  27. New Leader Leader Replica 1 Replica 2 Replica 3 x y x y x y ready txid = 678 Replica 2's timeout expires, begins recovery procedure Time stat txid = 678 ready txid = 678 commit txid = 678 y y y System is consistent again committed txid = 678

  28. Deadlock Leader Replica 1 Replica 2 Replica 3 x y x y x y ready txid = 678 Replica 2's timeout expires, begins recovery procedure Time stat txid = 678 ready txid = 678 Cannot proceed, but cannot abort stat txid = 678

  29. Garbage Collection 2PC is somewhat of a misnomer: there is a third phase Garbage collection Replicas must retain records of past transactions, just in case the leader fails For example, suppose the leader crashes, reboots, and attempts to commit a transaction that has already been committed Replicas must remember that this past transaction was already committed, since committing a second time may lead to inconsistencies In practice, the leader periodically tells replicas to garbage collect All transactions <= some txid may be deleted

  30. 2PC Summary Message complexity: O(2n) The good: guarantees consistency The bad: Write performance suffers if there are failures during the commit phase Does not scale gracefully (possible, but difficult to do) A pure 2PC system blocks all writes if the leader fails Smarter 2PC systems still block all writes if the leader + 1 replica fail 2PC sacrifices availability in favor of consistency

  31. Can 2PC be Fixed? The key issue with 2PC is reliance on the centralized leader Only the leader knows if a transaction is 100% ready to commit or not Thus, if the leader + 1 replica fail, recovery is impossible Potential solution: Three Phase Commit Add an additional round of communication Tell all replicas to prepare to commit, before actually committing State of the system can always be deduced by a subset of alive replicas that can communicate with each other unless there are partitions (more on this later)

  32. Leader Replica 1 Replica 2 Replica 3 3PC Example x x x Begin by distributing the update txid = 678; value = y x y x y x y Wait to receive ready to commit from all replicas ready txid = 678 prepare txid = 678 Time Tell all replicas that everyone is ready to commit prepared txid = 678 Tell replicas to commit commit txid = 678 At this point, all replicas are guaranteed to be up-to-date y y y committed txid = 678

  33. Leader Replica 1 Replica 2 Replica 3 Leader Failures x x x Begin by distributing the update txid = 678; value = y x y x y x y Wait to receive ready to commit from all replicas ready txid = 678 Time Replica 2's timeout expires, begins recovery procedure stat txid = 678 ready txid = 678 Replica 3 cannot be in the committed state, thus okay to abort abort txid = 678 x x System is consistent again aborted txid = 678

  34. Leader Replica 1 Replica 2 Replica 3 Leader Failures prepare txid = 678 prepared txid = 678 Replica 2's timeout expires, begins recovery procedure Time stat txid = 678 All replicas must have been ready to commit prepared txid = 678 commit txid = 678 y y System is consistent again committed txid = 678
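The recovery rule behind the two leader-failure diagrams can be sketched as a tiny decision function: if any reachable replica is in the "prepared" (or "committed") state, every replica must have promised, so the new leader can safely finish the commit; if no replica is past "ready", aborting is safe. This is a sketch of the decision rule only, under no-partition assumptions, not a full 3PC implementation.

```python
def recover(replica_states):
    """Decide commit vs. abort from the states of the reachable replicas.
    States are assumed to be "ready", "prepared", or "committed"."""
    if any(s in ("prepared", "committed") for s in replica_states):
        # A prepare was sent, so every replica must have been ready: commit.
        return "commit"
    # No replica is past "ready", so no replica can have committed: abort.
    return "abort"

print(recover(["ready", "prepared", "ready"]))  # commit
print(recover(["ready", "ready"]))              # abort
```

As the next slides show, this rule breaks under network partitions: two disjoint subsets can each run `recover` on a different view and reach opposite decisions.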

  35. Oh Great, We Fixed Everything! Wrong! 3PC is not robust against network partitions What is a network partition? A split in the network, such that full n-to-n connectivity is broken i.e. not all servers can contact each other Partitions split the network into one or more disjoint subnetworks How can a network partition occur? A switch or a router may fail, or it may receive an incorrect routing rule A cable connecting two racks of servers may develop a fault Network partitions are very real; they happen all the time

  36. Leader Replica 1 Replica 2 Replica 3 Partitioning x x x txid = 678; value = y x y x y x y ready txid = 678 Leader recovery initiated Network partitions into two subnets! prepare txid = 678 Time prepared txid = 678 Leader assumes replicas 2 and 3 have failed, moves on Abort commit txid = 678 x y x System is inconsistent committed txid = 678

  37. 3PC Summary Adds an additional phase vs. 2PC Message complexity: O(3n) Really four phases with garbage collection The good: allows the system to make progress under more failure conditions The bad: Extra round of communication makes 3PC even slower than 2PC Does not work if the network partitions 2PC will simply deadlock if there is a partition, rather than become inconsistent In practice, nobody uses 3PC Additional complexity and performance penalty just isn't worth it Loss of consistency during partitions is a deal breaker

  38. Distributed Commits (2PC and 3PC) Theory (FLP and CAP) Quorums (Paxos)

  39. A Moment of Reflection Goals, revisited: The system should be able to reach consensus Consensus [n]: general agreement The system should be consistent Data should be correct; no integrity violations The system should be highly available Data should be accessible even in the face of arbitrary failures Achieving these goals may be harder than we thought :( Huge number of failure modes Network partitions are difficult to cope with We haven't even considered byzantine failures

  40. What Can Theory Tell Us? Let's assume the network is synchronous and reliable Synchronous: no delay, sent packets arrive immediately Reliable: no packet drops or corruption Goal: get n replicas to reach consensus Algorithm can be divided into discrete rounds During each round, each host r may send m <= n messages r might crash before sending all m messages If a message from host r is not received in a round, then r must be faulty If we are willing to tolerate f total failures (f < n), how many rounds of communication do we need to guarantee consensus?

  41. Consensus in a Synchronous System Initialization: All replicas choose a value 0 or 1 (can generalize to more values if you want) Properties: Agreement: all non-faulty processes ultimately choose the same value Either 0 or 1 in this case Validity: if a replica decides on a value, then at least one replica must have started with that value This prevents the trivial solution of all replicas always choosing 0, which is technically perfect consensus but is practically useless Termination: the system must converge in finite time

  42. Algorithm Sketch Each replica maintains a map M of all known values Initially, the map only contains the replica's own value e.g., M = {"replica1": 0} Each round, broadcast M to all other replicas On receipt, construct the union of the received M and the local M Algorithm terminates when all non-faulty replicas have the values from all other non-faulty replicas Example with three non-faulty replicas (1, 3, and 5): M = {"replica1": 0, "replica3": 1, "replica5": 0} Final value is min(M.values())
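A minimal simulation of this flooding algorithm follows. It is simplified: crashed replicas are assumed to fail before round 1, so the mid-round crash subtleties covered on the next slide are not modeled, and the function name `flood_consensus` is illustrative.

```python
def flood_consensus(initial, crashed, rounds):
    """initial: {replica: 0 or 1}; crashed: replicas that fail before
    round 1 (simplification); rounds: number of broadcast rounds."""
    # Each live replica starts knowing only its own value.
    live = {r: {r: v} for r, v in initial.items() if r not in crashed}
    for _ in range(rounds):
        # Every live replica broadcasts M; receivers take the union.
        union = {}
        for m in live.values():
            union.update(m)
        for r in live:
            live[r].update(union)
    # All live replicas decide deterministically: min over known values.
    return {r: min(m.values()) for r, m in live.items()}

initial = {"r1": 0, "r2": 1, "r3": 0, "r4": 0, "r5": 1}
decisions = flood_consensus(initial, crashed={"r4", "r5"}, rounds=3)
print(decisions)  # every live replica decides the same value
```

Because every live replica ends up with the same map M, applying the same deterministic rule (min) yields agreement; validity holds because min(M.values()) is always some replica's initial value.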

  43. Bounding Convergence Time How many rounds will it take if we are willing to tolerate f failures? f + 1 rounds Key insight: all replicas must be sure that all replicas that did not crash have the same information (so they can make the same decision) Proof sketch, assuming f = 2 Worst case scenario is that replicas crash during rounds 1 and 2 During round 1, replica x crashes All other replicas don't know if x is alive or dead During round 2, replica y crashes Clear that x is not alive, but unknown if y is alive or dead During round 3, no more replicas may crash All replicas are guaranteed to receive updated info from all other replicas Final decision can be made

  44. Replica 4 Replica 5 Replica 1 Replica 2 Replica 3 0 1 0 0 1 Round 1 n = 5 replicas f = 2 maximum failures [0, 1, 0, 0, ?] [0, 1, 0, 0, ?] [0, 1, 0, 0, ?] [0, 1, 0, 0, 1] Time Round 2 Legend [r1, r2, r3, r4, r5] [0, 1, 0, ?, X] [0, 1, 0, ?, X] [0, 1, 0, 0, X] Round 3 [0, 1, 0, X, X] [0, 1, 0, X, X] [0, 1, 0, X, X]

  45. A More Realistic Model The previous result is interesting, but totally unrealistic We assume that the network is synchronous and reliable Of course, neither of these things are true in reality What if the network is asynchronous and reliable? Asynchronous: replicas may take an arbitrarily long time to respond to messages and/or the network may delay messages Let's also assume that all faults are crash faults i.e., if a replica has a problem, it crashes and never wakes up No byzantine faults

  46. The FLP Result There is no asynchronous algorithm that achieves consensus on a 1-bit value in the presence of even a single crash fault. The result is true even if no crash actually occurs! This is known as the FLP result Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, 1985 Extremely powerful result because: If you can't agree on 1 bit, generalizing to larger values isn't going to help you If you can't guarantee convergence with crash faults, no way you can guarantee convergence with byzantine faults If you can't guarantee convergence on a reliable network, no way you can on an unreliable network

  47. FLP Proof Sketch In an asynchronous system, a replica x cannot tell whether a non-responsive replica y has crashed or is just slow What can x do? If x waits, it will block indefinitely since it might never receive the message from y If x decides, it may find out later that y made a different decision The proof constructs a scenario where each attempt to decide is overruled by a delayed, asynchronous message Thus, the system oscillates between 0 and 1 and never converges

  48. Replica 4 Replica 5 Replica 1 Replica 2 Replica 3 0 1 0 0 1 Round 1 n = 5 replicas [0, 1, 0, 0, ?] [0, 1, 0, 0, ?] [0, 1, 0, 0, ?] [0, 1, 0, 0, ?] Legend Time [r1, r2, r3, r4, r5] Round 2 [1, 1, 0, ?, 1] [1, 1, 0, ?, 1] [1, 1, 0, ?, 1] [0, 1, 0, ?, 1]

  49. Impact of FLP FLP proves that any fault-tolerant distributed algorithm attempting to reach consensus has runs that never terminate These runs are extremely unlikely ("probability zero") Yet they imply that we can't find a totally correct solution And so consensus is impossible ("not always possible") So what can we do? Use randomization, probabilistic guarantees (gossip protocols) Avoid consensus, use quorum systems (Paxos or RAFT) In other words, trade off consistency in favor of availability

  50. Consistency vs. Availability FLP states that perfect consistency is impossible Practically, we can get close to perfect consistency, but at significant cost e.g., using 3PC Availability begins to suffer dramatically under failure conditions Is there a way to formalize the tradeoff between consistency and availability?
