Fault Tolerance in Distributed Systems

 
Distributed Systems
CS 15-440
 
Fault Tolerance
Lecture 24, November
 28, 2023
 
Mohammad Hammoud
 
1
 
Today
 
Last Session:
Replication – Part II
 
Today’s Session:
Fault Tolerance
 
Announcements:
P4 is due on Nov 30
Final exam (
open books, open notes
) is on Wednesday, December 06
from 2:30PM to 5:30PM in Room 1031
 
 
Course Map
Networks
Communication Paradigms
Architectures
Naming
Synchronization
Replication & Consistency
Fault-tolerance
Programming Models
Applications
 
Correct or
E
f
f
e
c
t
i
v
e
 
D
S
 
F
a
s
t
 
&
 
R
e
l
i
a
b
l
e
 
o
r
 
E
f
f
i
c
i
e
n
t
 
D
S
Course Map
Networks
Communication Paradigms
Architectures
Naming
Synchronization
Replication & Consistency
Fault-tolerance
Programming Models
Applications
Fault-Tolerance
 
Systems can be designed in a way that can 
automatically
 recover
from 
partial
 failures
 
 
 
 
 
Fault-tolerance
 
is the property that enables a system to continue
operating properly even if a 
failure
 takes place during operation
E.g., TCP is designed to allow reliable two-way communications in packet-
switched networks, even in the presence of imperfect or overloaded
communication links
 
5
 
Tire punctured.
Car stopped.
 
 
T
i
r
e
 
p
u
n
c
t
u
r
e
d
.
 
I
t
 
g
o
t
 
m
a
s
k
e
d
 
a
n
d
 
c
a
r
 
c
o
n
t
i
n
u
e
d
.
What is a Failure?
 
A failure is a deviation from a 
specified behavior
E.g., Pressing brake pedal does not stop car 
 
brake failure (
could be
catastrophic!
)
E.g., Reading a disk sector does not return content 
 disk failure (
not
necessarily catastrophic
)
 
Many failures are due to 
incorrect 
specified behavior
This typically happens when the designer misses addressing a scenario that
makes the system perform incorrectly
It is especially true in complex systems with many subtle interactions
 
 
 
Failure Characteristics
 
Failures can be either 
transient
 
or 
persistent
 
Transient Failures
:
Also referred to as “soft failures” or “Heisenbugs”
Occur temporarily then disappear
Manifested only in very unlikely combinations of circumstances
Typically go away upon rolling back and/or retrying/rebooting
E.g., Frozen keyboard or window, race conditions and deadlocks,
etc.,
Failure Characteristics
 
Failures can be either 
transient
 
or 
persistent
 
Persistent Failures
:
Persist until explicitly repaired
Retrying does not help
E.g., Burnt-out chips, software bugs, crashed disks, broken Ethernet cable, etc.,
Durations of failures and repairs are random variables
Means of distributions are 
Mean Time To Fail 
(
MTTF
) and 
Mean Time To Repair
(
MTTR
)
 
MTTF
 
MTTR
 
In-Service
 
Out-of-Service
Availability vs. Reliability
 
There is a subtle distinction between 
availability
 and 
reliability
Availability refers to the probability that 
a system is operating correctly at any
given moment
Availability
 
=
 
MTTF
/(
MTTF
+
MTTR
)
Reliability measures 
how long a system can operate without a breakdown
 
A 
highly-available
 system is one that will most likely be working 
at a
given instant in time
 
A 
highly-reliable
 system is one that will most likely continue to work
without interruption 
during a relatively long period of time
Availability vs. Reliability
 
For example:
A system that goes down for 1ms every hour has an availability of over
99.9999%, but is highly unreliable
A system that never crashes but shuts down for two weeks every August has
high reliability, but only 96% availability
Masking Failures
The key technique for masking failures is to use 
redundancy
Redundancy
Information
Hardware
Time
Software
 
Usually, extra bits are added to allow recovery from garbled bits
 
Usually, an action is performed, and then, if required, it is performed again
 
 Usually, extra
equipments are added
to allow tolerating
failed hardware
components
 
 Usually, extra
processes are added
to allow tolerating
failed processes
Detecting Failures
 
But, failures need to be detected before they can be masked
 
A detection subsystem:
Can usually be involved as a side-effect of regularly exchanging
information with servers
Should ideally be able to distinguish between network and server failures
A process, 
P
, that cannot reach a server can check with other processes
on whether they can reach the server
If at least one other process indicates that it can reach the server, P
can assume that the failure is a network failure
 
 
 
 
 
12
Example: Atomic Multicasting
 
Atomic multicasting 
requires satisfying two conditions:
1.
A message should be delivered to 
either all or none
 of the processes (or
replica sites)
This property is known as 
atomicity, 
which implies 
reliable
 multicasting
because all or none of the processes shall receive the multicast message
2.
All messages should be delivered 
in the same order to all 
the processes (or
replica sites)
This property is known as 
total ordering
Distributed Atomic Transactions
 
A popular distributed atomic transaction protocol is known as the
two-phase commit protocol 
(2PC), which involves:
One 
coordinator
Multiple 
participants
14
The Two-Phase Commit Protocol
2PC is comprised of two phases, the 
voting phase 
and the 
decision
phase
, each involving two steps:
 
The Two-Phase Commit Protocol
 
2PC is comprised of two phases, the 
voting phase 
and the 
decision
phase
, each involving two steps:
 
 
 
The Two-Phase Commit Protocol
 
2PC is comprised of two phases, the 
voting phase 
and the 
decision
phase
, each involving two steps:
 
 
 
 
The Two-Phase Commit Protocol
 
 
The Two-Phase Commit Protocol
 
 
The Two-Phase Commit Protocol
 
 
The Two-Phase Commit Protocol
 
 
The Two-Phase Commit Protocol
 
 
The Two-Phase Commit Protocol
 
 
The Two-Phase Commit Protocol
2PC Finite State Machines
INIT
WAIT
COMMIT
ABORT
    Commit
Vote-request
  Vote-abort
Global-abort
Vote-commit
Global-commit
INIT
READY
COMMIT
ABORT
 
Vote-request
Vote-commit
 
Global-abort
ACK
 
Global-commit
ACK
 
Vote-request
Vote-abort
The finite state machine of the 
C
O
O
R
D
I
N
A
T
O
R
 
i
n
 
2
P
C
The finite state machine of a
P
A
R
T
I
C
I
P
A
N
T
 
i
n
 
2
P
C
 
received
 
sent
The 2PC Algorithm
write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected{
 
wait for any incoming vote;
 
if timeout{
  
write GLOBAL_ABORT to local log;
  
multicast GLOBAL_ABORT to all participants;
  
exit;
 
}
 
record vote;
}
If all participants sent VOTE_COMMIT and coordinator votes COMMIT{
 
write GLOBAL_COMMIT to local log;
 
multicast GLOBAL_COMMIT to all participants;
}else{
 
write GLOBAL_ABORT to local log;
 
multicast GLOBAL_ABORT to all participants;
}
 
 
Actions by coordinator:
Coordinator Recovery
 
The coordinator can fail at any stage in 2PC
 
However, due to 
logging
 
its state, it can recover as follows:
The 2PC Algorithm
write INIT to local log;
Wait for VOTE_REQUEST from coordinator;
If timeout{
 
write VOTE_ABORT to local log;
 
exit;
}
If participant votes COMMIT{
 
write VOTE_COMMIT to local log;
 
send VOTE_COMMIT to coordinator;
 
wait for DECISION from coordinator;
 
if timeout{
  
multicast DECISION_RQUEST to other participants;
  
wait until DECISION is received; /*remain blocked*/
  
write DECISION to local log;
 
}
 
if DECISION == GLOBAL_COMMIT { write GLOBAL_COMMIT to local log;}
 
else if DECISION == GLOBAL_ABORT {write GLOBAL_ABORT to local log};
}else{
 
write VOTE_ABORT to local log;
 
send VOTE_ABORT to coordinator;
}
 
 
28
Actions by participants:
The 2PC Algorithm
/*executed by a separate thread*/
 
while true{
 
wait until any incoming DECISION_REQUEST is received; /*remain blocked*/
 
read most recently recorded STATE from the local log;
 
if STATE == GLOBAL_COMMIT
  
send GLOBAL_COMMIT to requesting participant;
 
else if STATE == INIT or STATE == GLOBAL_ABORT
  
send GLOBAL_ABORT to requesting participant;
 
else
  
skip; /*participant remains blocked*/
}
 
 
Actions for handling decision requests:
An 
indefinite blocking window 
may arise, whereby all
the participants who have voted are blocked until the
final decision is known
Participant Recovery
 
Any participant can fail at any stage in 2PC
 
Due to 
logging
 
its state, it can recover as follows:
 
The End.
 
 
 
 
Thank You!
Slide Note
Embed
Share

Explore the concept of fault tolerance in distributed systems, focusing on system design that can recover from failures. Learn about failure types, characteristics, and the importance of addressing specified behavior to ensure proper system operation. Discover how transient and persistent failures impact system reliability and operation.

  • Fault Tolerance
  • Distributed Systems
  • System Reliability
  • Failure Characteristics
  • System Design

Uploaded on Nov 18, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Distributed Systems CS 15-440 Fault Tolerance Lecture 24, November 28, 2023 Mohammad Hammoud 1

  2. Today Last Session: Replication Part II Today s Session: Fault Tolerance Announcements: P4 is due on Nov 30 Final exam (open books, open notes) is on Wednesday, December 06 from 2:30PM to 5:30PM in Room 1031

  3. Course Map Applications Programming Models Fast & Reliable or Efficient DS Replication & Consistency Fault-tolerance Communication Paradigms Architectures Naming Synchronization Correct or Effective DS Networks

  4. Course Map Applications Programming Models Replication & Consistency Fault-tolerance Communication Paradigms Architectures Naming Synchronization Networks

  5. Fault-Tolerance Systems can be designed in a way that can automatically recover from partial failures Tire punctured. Car stopped. Tire punctured. It got masked and car continued. Fault-tolerance is the property that enables a system to continue operating properly even if a failure takes place during operation E.g., TCP is designed to allow reliable two-way communications in packet- switched networks, even in the presence of imperfect or overloaded communication links 5

  6. What is a Failure? A failure is a deviation from a specified behavior E.g., Pressing brake pedal does not stop car brake failure (could be catastrophic!) E.g., Reading a disk sector does not return content disk failure (not necessarily catastrophic) Many failures are due to incorrect specified behavior This typically happens when the designer misses addressing a scenario that makes the system perform incorrectly It is especially true in complex systems with many subtle interactions

  7. Failure Characteristics Failures can be either transient or persistent Transient Failures: Also referred to as soft failures or Heisenbugs Occur temporarily then disappear Manifested only in very unlikely combinations of circumstances Typically go away upon rolling back and/or retrying/rebooting E.g., Frozen keyboard or window, race conditions and deadlocks, etc.,

  8. Failure Characteristics Failures can be either transient or persistent Persistent Failures: Persist until explicitly repaired Retrying does not help E.g., Burnt-out chips, software bugs, crashed disks, broken Ethernet cable, etc., Durations of failures and repairs are random variables Means of distributions are Mean Time To Fail (MTTF) and Mean Time To Repair (MTTR) In-Service Out-of-Service MTTF MTTR

  9. Availability vs. Reliability There is a subtle distinction between availability and reliability Availability refers to the probability that a system is operating correctly at any given moment Availability = MTTF/(MTTF+MTTR) Reliability measures how long a system can operate without a breakdown A highly-available system is one that will most likely be working at a given instant in time A highly-reliable system is one that will most likely continue to work without interruption during a relatively long period of time

  10. Availability vs. Reliability For example: A system that goes down for 1ms every hour has an availability of over 99.9999%, but is highly unreliable A system that never crashes but shuts down for two weeks every August has high reliability, but only 96% availability System Type Availability (%) Downtime in a Year Conventional Workstation 99 3.6 Days High-Available (HA) System 99.9 8.4 Hours Fault-Resilient System 99.99 1 Hour Fault-Tolerant System 99.999 5 Minutes

  11. Masking Failures The key technique for masking failures is to use redundancy Usually, extra bits are added to allow recovery from garbled bits Information Usually, extra equipments are added to allow tolerating failed hardware components Usually, extra processes are added to allow tolerating failed processes Software Redundancy Hardware Time Usually, an action is performed, and then, if required, it is performed again

  12. Detecting Failures But, failures need to be detected before they can be masked A detection subsystem: Can usually be involved as a side-effect of regularly exchanging information with servers Should ideally be able to distinguish between network and server failures A process, P, that cannot reach a server can check with other processes on whether they can reach the server If at least one other process indicates that it can reach the server, P can assume that the failure is a network failure 12

  13. Example: Atomic Multicasting Atomic multicasting requires satisfying two conditions: 1. A message should be delivered to either all or none of the processes (or replica sites) This property is known as atomicity, which implies reliable multicasting because all or none of the processes shall receive the multicast message 2. All messages should be delivered in the same order to all the processes (or replica sites) This property is known as total ordering

  14. Distributed Atomic Transactions A popular distributed atomic transaction protocol is known as the two-phase commit protocol (2PC), which involves: One coordinator Multiple participants 14

  15. The Two-Phase Commit Protocol 2PC is comprised of two phases, the voting phase and the decision phase, each involving two steps: Phase I: Voting Phase Step 1 The coordinator sends a VOTE_REQUEST message to all participants When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator indicating that it is prepared to locally commit its part of the transaction, or otherwise a VOTE_ABORT message Step 2

  16. The Two-Phase Commit Protocol 2PC is comprised of two phases, the voting phase and the decision phase, each involving two steps: Phase I: Voting Phase Step 1 The coordinator sends a VOTE_REQUEST message to all participants When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator indicating that it is prepared to locally commit its part of the transaction, or otherwise a VOTE_ABORT message Step 2

  17. The Two-Phase Commit Protocol 2PC is comprised of two phases, the voting phase and the decision phase, each involving two steps: Phase I: Voting Phase Step 1 The coordinator sends a VOTE_REQUEST message to all participants When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator indicating that it is prepared to locally commit its part of the transaction, or otherwise a VOTE_ABORT message Step 2

  18. The Two-Phase Commit Protocol Phase II: Decision Phase The coordinator collects all votes from participants If all the participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all the participants Step 1 However, if one participant had voted to abort the transaction, the coordinator will also decide to abort the transaction and multicast a GLOBAL_ABORT message Each participant that voted for a commit waits for the final action by the coordinator If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction Step 2 If a participant receives a GLOBAL_ABORT message, , it locally aborts the transaction

  19. The Two-Phase Commit Protocol Phase II: Decision Phase The coordinator collects all votes from participants If all the participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all the participants Step 1 However, if one participant had voted to abort the transaction, the coordinator will also decide to abort the transaction and multicast a GLOBAL_ABORT message Each participant that voted for a commit waits for the final action by the coordinator If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction Step 2 If a participant receives a GLOBAL_ABORT message, , it locally aborts the transaction

  20. The Two-Phase Commit Protocol Phase II: Decision Phase The coordinator collects all votes from participants If all the participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all the participants Step 1 However, if one participant had voted to abort the transaction, the coordinator will also decide to abort the transaction and multicast a GLOBAL_ABORT message Each participant that voted for a commit waits for the final action by the coordinator If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction Step 2 If a participant receives a GLOBAL_ABORT message, , it locally aborts the transaction

  21. The Two-Phase Commit Protocol Phase II: Decision Phase The coordinator collects all votes from participants If all the participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all the participants Step 1 However, if one participant had voted to abort the transaction, the coordinator will also decide to abort the transaction and multicast a GLOBAL_ABORT message Each participant that voted for a commit waits for the final action by the coordinator If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction Step 2 If a participant receives a GLOBAL_ABORT message, , it locally aborts the transaction

  22. The Two-Phase Commit Protocol Phase II: Decision Phase The coordinator collects all votes from participants If all the participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all the participants Step 1 However, if one participant had voted to abort the transaction, the coordinator will also decide to abort the transaction and multicast a GLOBAL_ABORT message Each participant that voted for a commit waits for the final action by the coordinator If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction Step 2 If a participant receives a GLOBAL_ABORT message, , it locally aborts the transaction

  23. The Two-Phase Commit Protocol Phase II: Decision Phase The coordinator collects all votes from participants If all the participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all the participants Step 1 However, if one participant had voted to abort the transaction, the coordinator will also decide to abort the transaction and multicast a GLOBAL_ABORT message Each participant that voted for a commit waits for the final action by the coordinator If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction Step 2 If a participant receives a GLOBAL_ABORT message, , it locally aborts the transaction

  24. The Two-Phase Commit Protocol Phase II: Decision Phase The coordinator collects all votes from participants If all the participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all the participants Step 1 However, if one participant had voted to abort the transaction, the coordinator will also decide to abort the transaction and multicast a GLOBAL_ABORT message Each participant that voted for a commit waits for the final action by the coordinator If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction Step 2 If a participant receives a GLOBAL_ABORT message, it locally aborts the transaction

  25. 2PC Finite State Machines Vote-request Vote-abort Commit Vote-request INIT INIT received sent Vote-request Vote-commit READY WAIT Vote-commit Global-commit Global-abort ACK Vote-abort Global-abort Global-commit ACK ABORT COMMIT ABORT COMMIT The finite state machine of the COORDINATOR in 2PC The finite state machine of a PARTICIPANTin 2PC

  26. The 2PC Algorithm Actions by coordinator: write START_2PC to local log; multicast VOTE_REQUEST to all participants; while not all votes have been collected{ wait for any incoming vote; if timeout{ write GLOBAL_ABORT to local log; multicast GLOBAL_ABORT to all participants; exit; } record vote; } If all participants sent VOTE_COMMIT and coordinator votes COMMIT{ write GLOBAL_COMMIT to local log; multicast GLOBAL_COMMIT to all participants; }else{ write GLOBAL_ABORT to local log; multicast GLOBAL_ABORT to all participants; }

  27. Coordinator Recovery The coordinator can fail at any stage in 2PC However, due to logging its state, it can recover as follows: State in Log Action After Recovery INIT Abort WAIT Retransmit VOTE_REQUEST to participants COMMIT Retransmit GLOBAL_COMMIT to all participants ABORT Retransmit GLOBAL_ABORT to all participants

  28. The 2PC Algorithm Actions by participants: write INIT to local log; Wait for VOTE_REQUEST from coordinator; If timeout{ write VOTE_ABORT to local log; exit; } If participant votes COMMIT{ write VOTE_COMMIT to local log; send VOTE_COMMIT to coordinator; wait for DECISION from coordinator; if timeout{ multicast DECISION_RQUEST to other participants; wait until DECISION is received; /*remain blocked*/ write DECISION to local log; } if DECISION == GLOBAL_COMMIT { write GLOBAL_COMMIT to local log;} else if DECISION == GLOBAL_ABORT {write GLOBAL_ABORT to local log}; }else{ write VOTE_ABORT to local log; send VOTE_ABORT to coordinator; } 28

  29. The 2PC Algorithm Actions for handling decision requests: /*executed by a separate thread*/ while true{ wait until any incoming DECISION_REQUEST is received; /*remain blocked*/ read most recently recorded STATE from the local log; if STATE == GLOBAL_COMMIT send GLOBAL_COMMIT to requesting participant; else if STATE == INIT or STATE == GLOBAL_ABORT send GLOBAL_ABORT to requesting participant; else skip; /*participant remains blocked*/ } An indefinite blocking window may arise, whereby all the participants who have voted are blocked until the final decision is known

  30. Participant Recovery Any participant can fail at any stage in 2PC Due to logging its state, it can recover as follows: State in Log Action After Recovery INIT Locally abort and notify the coordinator Cannot decide on its own what it should do next; hence, contact others READY COMMIT Retransmit its decision to coordinator ABORT Retransmit its decision to the coordinator

  31. The End. Thank You!

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#