
Understanding Fault Tolerance in Distributed Systems
Explore the concepts of fault tolerance in distributed systems, including definitions, failure characteristics, and the distinction between transient and persistent failures. Learn how systems can automatically recover from partial failures and continue operating properly even in the face of unexpected events.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Distributed Systems CS 15-440 Fault Tolerance Lecture 26, December 02, 2019 Mohammad Hammoud
Today Last Session: Replication - Part II Today s Session: Fault Tolerance Definitions Detecting and masking failures 2PC Announcements: Project 4 is due on Dec 3 by midnight PS5 is due on December 5 by midnight The final exam is on Thursday, Dec 12 from 8:30 to 11:30AM in room 3044. It will be open book, open notes.
Fault-Tolerance Systems can be designed in a way that can automatically recover from partial failures Tire punctured. Car stopped. Tire punctured. It got masked and car continued. Fault-tolerance is the property that enables a system to continue operating properly even if a failure takes place during operation For example, TCP is designed to allow reliable two-way communications in packet-switched networks, even in the presence of communication links that are imperfect or overloaded 3
What is a Failure? A failure is a deviation from a specified behavior E.g., Pressing brake pedal does not stop car brake failure (could be catastrophic!) E.g., Read of a disk sector does not return content disk failure (not necessarily catastrophic) Many failures are due to incorrect specified behavior This typically happens when the designer misses addressing a scenario that makes the system perform incorrectly It is especially true in complex systems with many subtle interactions
Failure Characteristics Failures can be characterized as either transient or persistent Transient Failures: Also referred to as soft failures or Heisenbugs Occur temporarily then disappear Manifested only in a very unlikely combination of circumstances Typically go away upon rolling back and/or retrying/rebooting E.g., Frozen keyboard or window, race conditions and deadlocks, etc.
Failure Characteristics Failures can be characterized as either transient or persistent Persistent Failures: Persist until explicitly repaired Retrying does not help E.g., Burnt-out chips, software bugs, crashed disks, broken Ethernet cable, etc. Durations of failures and repairs are random variables Means of distributions are Mean Time To Fail (MTTF) and Mean Time To Repair (MTTR) In-Service Out-of-Service MTTF MTTR
Availability vs. Reliability There is a distinction between availability and reliability Availability refers to the probability that a system is operating correctly at any given moment Availability = MTTF/(MTTF+MTTR) Reliability measures how long a system can operate without a breakdown A highly-available (HA) system is one that will most likely be working at a given instant in time A highly-reliable system is one that will most likely continue to work without interruption during a relatively long period of time
Availability vs. Reliability Examples: A system that goes down for 1ms every hour has an availability of over 99.9999%, but is highly unreliable A system that never crashes but is shut down for two weeks every August has high reliability, but only 96% availability System Type Availability (%) Downtime in a Year Conventional Workstation 99 3.6 Days High-Available (HA) System 99.9 8.4 Hours Fault-Resilient System 99.99 1 Hour Fault-Tolerant System 99.999 5 Minutes
Failure Types Type of Failure Type of Failure Type of Failure Type of Failure Type of Failure Type of Failure Type of Failure Type of Failure Type of Failure Description Description Description Description Description Description Description Description Description Crash Failure Crash Failure Crash Failure Crash Failure Crash Failure Crash Failure Crash Failure Crash Failure Crash Failure A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped Omission Failure Omission Failure Omission Failure Omission Failure Omission Failure Omission Failure Omission Failure Omission Failure Receive Omission Send Omission Send Omission Send Omission Send Omission Send Omission Send Omission Send Omission Send Omission Send Omission Omission Failure Receive Omission Receive Omission Receive Omission Receive Omission Receive Omission Receive Omission Receive Omission Receive Omission A server fails to respond to incoming requests A server fails to receive incoming messages A server fails to receive incoming messages A server fails to receive incoming messages A server fails to receive incoming messages A server fails to receive incoming messages A server fails to receive incoming messages A server fails to receive incoming messages A server fails to receive incoming messages A server fails to respond to incoming requests A server fails to receive incoming messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to respond to incoming requests A server fails to respond to incoming requests A server fails to respond to incoming requests A server fails to respond to incoming requests A server fails to respond to incoming requests A server fails to respond to incoming requests A server fails to respond to incoming requests Timing Failure Timing Failure Timing Failure Timing Failure Timing Failure Timing Failure Timing Failure Timing Failure Timing Failure A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval Response Failure Value Failure Value Failure Value Failure Value Failure Value Failure Value Failure Value Failure Value Failure Response Failure Value Failure State Transition Failure State Transition Failure State Transition Failure State Transition Failure State Transition Failure State Transition Failure State Transition Failure State Transition Failure State Transition Failure Response Failure Response Failure Response Failure Response Failure Response Failure Response Failure Response Failure A server s response is incorrect The value of the response is wrong The server deviates from the correct flow of control The server deviates from the correct flow of control The server deviates from the correct flow of control The server deviates from the correct flow of control The server deviates from the correct flow of control The server deviates from the correct flow of control The server deviates from the correct flow of control The server deviates from the correct flow of control The server deviates from the correct flow of control A server s response is incorrect The value of the response is wrong The value of the response is wrong The value of the response is wrong The value of the response is wrong The value of the response is wrong The value of the response is wrong The value of the response is wrong The value of the response is wrong A server s response is incorrect A server s response is incorrect A server s response is incorrect A server s response is incorrect A server s response is incorrect A server s response is incorrect A server s response is incorrect Byzantine Failure Byzantine Failure Byzantine Failure Byzantine Failure Byzantine Failure Byzantine Failure Byzantine Failure Byzantine Failure Byzantine Failure A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times
Failure Types Type of Failure Description Description Description Description Description Description Description Description Description Type of Failure Type of Failure Type of Failure Type of Failure Type of Failure Type of Failure Type of Failure Type of Failure Known generally as Fail-Stop or Fail-Fast Failures Crash Failure A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped A server halts, but was working correctly until it stopped Crash Failure Crash Failure Crash Failure Crash Failure Crash Failure Crash Failure Crash Failure Crash Failure Omission Failure A server fails to respond to incoming requests A server fails to respond to incoming requests A server fails to respond to incoming requests A server fails to respond to incoming requests A server fails to respond to incoming requests A server fails to respond to incoming requests A server fails to respond to incoming requests Omission Failure Receive Omission Receive Omission Receive Omission Receive Omission Receive Omission Receive Omission Receive Omission Receive Omission Omission Failure Receive Omission Send Omission Send Omission Send Omission Send Omission Send Omission Send Omission Send Omission Send Omission Send Omission Omission Failure Omission Failure Omission Failure Omission Failure Omission Failure Omission Failure A server fails to respond to incoming requests A server fails to receive incoming messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to send messages A server fails to respond to incoming requests A server fails to receive incoming messages A server fails to receive incoming messages A server fails to receive incoming messages A server fails to receive incoming messages A server fails to receive incoming messages A server fails to receive incoming messages A server fails to receive incoming messages A server fails to receive incoming messages Timing Failure Timing Failure Timing Failure Timing Failure Timing Failure Timing Failure Timing Failure Timing Failure A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval A server s response lies outside the specified time interval Known generally as Fail-Silent Failures Timing Failure Response Failure Value Failure State Transition Failure State Transition Failure State Transition Failure State Transition Failure State Transition Failure State Transition Failure State Transition Failure State Transition Failure Value Failure State Transition Failure Response Failure Response Failure Response Failure Response Failure Response Failure Response Failure Response Failure Value Failure Value Failure Value Failure Value Failure Value Failure Value Failure Value Failure A server s response is incorrect The value of the response is wrong The value of the response is wrong The value of the response is wrong The value of the response is wrong The value of the response is wrong The value of the response is wrong The value of the response is wrong A server s response is incorrect A server s response is incorrect A server s response is incorrect A server s response is incorrect A server s response is incorrect A server s response is incorrect A server s response is incorrect A server s response is incorrect The value of the response is wrong The server deviates from the correct flow of control The server deviates from the correct flow of control The server deviates from the correct flow of control The server deviates from the correct flow of control The server deviates from the correct flow of control The server deviates from the correct flow of control The server deviates from the correct flow of control The server deviates from the correct flow of control The value of the response is wrong The server deviates from the correct flow of control Response Failure Arbitrary Failure Byzantine Failure Byzantine Failure Byzantine Failure Byzantine Failure Byzantine Failure Byzantine Failure Byzantine Failure A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times A server may produce arbitrary responses at arbitrary times Byzantine Failure A server may produce arbitrary responses at arbitrary times Known generally as Arbitrary or Byzantine Failures
Masking Failures The key technique for masking failures is to use redundancy Usually, extra bits are added to allow recovery from garbled bits Information Usually, extra equipment are added to allow tolerating failed hardware components Usually, extra processes are added to allow tolerating failed processes Software Redundancy Hardware Time Usually, an action is performed, and then, if required, it is performed again
Detecting Failures But, failures need to be detected before they can be masked A detection subsystem (especially, for a fail-stop or fail-silent failure): Can usually be done as a side-effect of regularly exchanging information with servers Should ideally be able to distinguish between network and server failures A process, P, that cannot reach a server can check with other processes on whether they can reach the server If at least one other process indicates that it can reach the server, P can presume that it is a network failure (assuming all processes are non-malicious/non-faulty) 12
Example 1: Speculative Execution in Hadoop A MapReduce job is dominated by the slowest task MapReduce attempts to locate slow tasks (or stragglers) and run replicated (or speculative) tasks that will optimistically commit before corresponding stragglers In general, this strategy is known as task resiliency or task replication (as opposed to data replication) In Hadoop it is called speculative execution Only one copy of a straggler is allowed to be speculated Whichever task (among the two tasks) commits first, its results are exploited, and the other task is killed
But, How to Detect Stragglers? Hadoop monitors the progresses of all tasks and assigns each task a progress score between 0 and 1 A task is marked as a straggler if its progress score (PS) < (average 0.2) after running at least 1 minute Not a straggler T1 PS= 2/3 A straggler T2 PS= 1/12 Time
Example 2: Atomic Multicasting Atomic multicasting requires satisfying two conditions: 1. A message should be delivered to either all or none of the processes (or replica sites) This property is known as atomicity 2. All messages should be delivered in the same order to all processes (or replica sites) This property is known as consistent ordering The atomicity property entails reliable multicasting since it guarantees that ALL (or none) of the processes will receive the multicast message
Message Ordering As discussed before, there are typically three types of message orderings: 1. Sequential (or FIFO) Ordering Messages sent from the same process are delivered in the same order as they were sent at every receiving process 2. Causal Ordering If message m1 causally precedes message m2, m1 is delivered before m2 at every receiving process 3. Total Ordering Messages are delivered in the same order at every receiving process 16
Types of Reliable Multicasting There is usually a distinction between six types of reliable multicasting Multicasting Type Basic Message Ordering Total-Ordered Delivery? Reliable multicasting None No FIFO multicasting FIFO-ordered delivery No Causal multicasting Causal-ordered delivery No Atomic multicasting None Yes FIFO atomic multicasting FIFO-ordered delivery Yes Causal atomic multicasting Causal-ordered delivery Yes 17
Types of Reliable Multicasting There is usually a distinction between six types of reliable multicasting Multicasting Type Basic Message Ordering Total-Ordered Delivery? Reliable multicasting None No FIFO multicasting FIFO-ordered delivery No Causal multicasting Causal-ordered delivery No Atomic multicasting None Yes FIFO atomic multicasting FIFO-ordered delivery Yes Causal atomic multicasting Causal-ordered delivery Yes 18
Types of Reliable Multicasting There is usually a distinction between six types of reliable multicasting Multicasting Type Basic Message Ordering Total-Ordered Delivery? Reliable multicasting None No FIFO multicasting FIFO-ordered delivery No Causal multicasting Causal-ordered delivery No Atomic multicasting None Yes FIFO atomic multicasting FIFO-ordered delivery Yes Causal atomic multicasting Causal-ordered delivery Yes 19
Types of Reliable Multicasting There is usually a distinction between six types of reliable multicasting Multicasting Type Basic Message Ordering Total-Ordered Delivery? Reliable multicasting None No FIFO multicasting FIFO-ordered delivery No Causal multicasting Causal-ordered delivery No Atomic multicasting None Yes FIFO atomic multicasting FIFO-ordered delivery Yes Causal atomic multicasting Causal-ordered delivery Yes 20
Types of Reliable Multicasting There is usually a distinction between six types of reliable multicasting Multicasting Type Basic Message Ordering Total-Ordered Delivery? Reliable multicasting None No FIFO multicasting FIFO-ordered delivery No Causal multicasting Causal-ordered delivery No Atomic multicasting None Yes FIFO atomic multicasting FIFO-ordered delivery Yes Causal atomic multicasting Causal-ordered delivery Yes 21
Types of Reliable Multicasting There is usually a distinction between six types of reliable multicasting Multicasting Type Basic Message Ordering Total-Ordered Delivery? Reliable multicasting None No FIFO multicasting FIFO-ordered delivery No Causal multicasting Causal-ordered delivery No Atomic multicasting None Yes FIFO atomic multicasting FIFO-ordered delivery Yes Causal atomic multicasting Causal-ordered delivery Yes 22
Distributed Atomic Transactions Atomic multicasting is an example of a general problem, known as distributed atomic transactions Given a transaction with multiple actions Either all or none of the actions are committed If all actions are committed, they will be committed in the same order at all replica sites A popular distributed atomic transaction protocol is known as the two-phase commit protocol (2PC), which involves: One coordinator Multiple participants 23
Two-Phase Commit Protocol 2PC is comprised of the following two phases, each involving two steps: Phase I: Voting Phase Phase I: Voting Phase Phase I: Voting Phase Step 1 Step 1 Step 1 The coordinator sends a VOTE_REQUEST message to all participants. The coordinator sends a VOTE_REQUEST message to all participants. The coordinator sends a VOTE_REQUEST message to all participants. When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator telling the coordinator that it is prepared to locally commit its part of the transaction, or otherwise a VOTE_ABORT message VOTE_ABORT message message. When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator telling the coordinator that it is prepared to locally commit its part of the transaction, or otherwise a locally commit its part of the transaction, or otherwise a VOTE_ABORT When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator indicating that it is prepared to Step 2 Step 2 Step 2
Two-Phase Commit Protocol Phase II: Decision Phase The coordinator collects all votes from the participants. If all participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all participants. Step 1 However, if one participant had voted to abort the transaction, the coordinator will also decide to abort the transaction and multicast a GLOBAL_ABORT message. Each participant that voted for a commit waits for the final reaction by the coordinator. If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction. Step 2 If a participant receives a GLOBAL_ABORT message, , it locally aborts the transaction.
2PC Finite State Machines Note: The terms above and below the line indicate what have been received and sent, respectively Vote-request Vote-abort Commit Vote-request INIT INIT Vote-request Vote-commit READY WAIT Vote-commit Global-commit Global-abort ACK Vote-abort Global-abort Global-commit ACK ABORT COMMIT ABORT COMMIT The finite state machine of the COORDINATOR in 2PC The finite state machine of a PARTICIPANT in 2PC
2PC Algorithm Actions by coordinator: write START_2PC to local log; multicast VOTE_REQUEST to all participants; while not all votes have been collected{ wait for any incoming vote; if timeout{ write GLOBAL_ABORT to local log; multicast GLOBAL_ABORT to all participants; exit; } record vote; } If all participants sent VOTE_COMMIT and coordinator votes COMMIT{ write GLOBAL_COMMIT to local log; multicast GLOBAL_COMMIT to all participants; }else{ write GLOBAL_ABORT to local log; multicast GLOBAL_ABORT to all participants; }
Coordinator Recovery The coordinator can fail at any stage in 2PC However, due to logging its state, it can recover as follows: State in Log Action After Recovery INIT Abort WAIT Retransmit VOTE_REQUEST to participants COMMIT Retransmit GLOBAL_COMMIT to all participants ABORT Retransmit GLOBAL_ABORT to all participants
Two-Phase Commit Protocol Actions by participants: write INIT to local log; Wait for VOTE_REQUEST from coordinator; If timeout{ write VOTE_ABORT to local log; exit; } If participant votes COMMIT{ write VOTE_COMMIT to local log; send VOTE_COMMIT to coordinator; wait for DECISION from coordinator; if timeout{ multicast DECISION_RQUEST to other participants; wait until DECISION is received; /*remain blocked*/ write DECISION to local log; } if DECISION == GLOBAL_COMMIT { write GLOBAL_COMMIT to local log;} else if DECISION == GLOBAL_ABORT {write GLOBAL_ABORT to local log}; }else{ write VOTE_ABORT to local log; send VOTE_ABORT to coordinator; } 30
Two-Phase Commit Protocol Actions for handling decision requests: /*executed by separate thread*/ while true{ wait until any incoming DECISION_REQUEST is received; /*remain blocked*/ read most recently recorded STATE from the local log; if STATE == GLOBAL_COMMIT send GLOBAL_COMMIT to requesting participant; else if STATE == INIT or STATE == GLOBAL_ABORT send GLOBAL_ABORT to requesting participant; else skip; /*participant remains blocked*/ } An indefinite blocking window can arise, whereby all sites who voted positively are blocked until outcome is known Can any clever protocol avoid this window? No! All distributed commit protocols have an indefinite blocking window!
Participant Recovery Any participant can fail at any stage in 2PC Due to logging its state, it can recover as follows: State in Log Action After Recovery INIT Locally abort and notify the coordinator Cannot decide on its own what it should do next; hence, contact others READY COMMIT Retransmit its decision to coordinator ABORT Retransmit its decision to the coordinator
Next Class Overview