Understanding Remote Procedure Call (RPC) in Operating Systems

 
CS6456: Graduate
Operating Systems
 
Brad Campbell 
 
bradjc@virginia.edu
https://www.cs.virginia.edu/~bjc8c/class/cs6456-f19/
 
1
Remote Procedure Call (RPC)
 
Raw messaging is a bit too low-level for programming
Must wrap up information into message at source
Must decide what to do with message at destination
May need to sit and wait for multiple messages to arrive
 
Another option: Remote Procedure Call (RPC)
Calls a procedure on a remote machine
Client calls:
 
remoteFileSystem→Read("rutabaga");
Translated automatically into call on server:
 
fileSys→Read("rutabaga");
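As a concrete illustration (not part of the original slides), here is a minimal sketch of the same idea using Python's standard xmlrpc module; the read_file procedure, host, and port are made up for the example:

```python
# Minimal RPC sketch: the remote call below looks like a local procedure call.
from xmlrpc.server import SimpleXMLRPCServer
import xmlrpc.client

# --- Server side (Machine B) ---
def read_file(name):
    # Stand-in for the server's fileSys Read(name)
    return "contents of " + name

def run_server():
    server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
    server.register_function(read_file, "read_file")
    server.serve_forever()

# --- Client side (Machine A) ---
def run_client():
    remote_fs = xmlrpc.client.ServerProxy("http://localhost:8000")
    # Marshalled into a request message, executed remotely, reply unmarshalled.
    print(remote_fs.read_file("rutabaga"))
```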
2
 
RPC Implementation
 
Request-response message passing (under covers!)
“Stub” provides glue on client/server
Client stub is responsible for “marshalling” arguments and “unmarshalling”
the return values
Server-side stub is responsible for “unmarshalling” arguments and
“marshalling” the return values.
 
Marshalling involves (depending on system):
Converting values to a canonical form, serializing objects, copying arguments passed by reference, etc.
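For instance, a client stub's marshalling step might look roughly like the following sketch, which uses JSON as the canonical form (the message layout is invented for illustration, not a standard):

```python
import json

def marshal_request(procedure, args):
    # Convert arguments to a canonical, architecture-independent form.
    # Arguments "passed by reference" must be copied by value into the message.
    return json.dumps({"procedure": procedure, "args": args}).encode("utf-8")

def unmarshal_request(message):
    # Server-side stub: recover the procedure name and arguments.
    request = json.loads(message.decode("utf-8"))
    return request["procedure"], request["args"]

# What the client stub might send for Read("rutabaga"):
msg = marshal_request("Read", ["rutabaga"])
proc, args = unmarshal_request(msg)   # -> ("Read", ["rutabaga"])
```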
 
3
 
RPC Information Flow
[Diagram: RPC information flow between Machine A and Machine B. The client (caller) calls the client stub, which bundles the arguments and sends a packet over the network via the packet handler to a mailbox (mbox1/mbox2); the server stub unbundles the arguments and calls the server (callee), then bundles the return values, which the client stub unbundles and returns to the caller.]
 
4
 
RPC Details
 
Equivalence with regular procedure call
Parameters correspond to the request message; results correspond to the reply message
 
Stub generator: Compiler that generates stubs
Input: interface definitions in an “interface definition language (IDL)”
Contains, among other things, types of arguments/return
Output: stub code in the appropriate source language
Code for client to pack message, send it off, wait for result, unpack result and return to caller
Code for server to unpack message, call procedure, pack results, send them off
 
5
 
RPC Details
 
Cross-platform issues:
What if client/server machines are different architectures/
languages?
Convert everything to/from some canonical form
Tag every item with an indication of how it is encoded (avoids unnecessary
conversions)
 
 
 
6
 
Problems with RPC: Non-Atomic Failures
 
Different failure modes in dist. system than on a single machine
Consider many different types of failures
User-level bug causes address space to crash
Machine failure, kernel bug causes all processes on same machine
to fail
Some machine is compromised by malicious party
Before RPC: whole system would crash/die
After RPC: One machine crashes/compromised while others keep
working
Can easily result in inconsistent view of the world
Did my cached data get written back or not?
Did server do what I requested or not?
 
7
 
Problems with RPC: Performance
 
Cost of Procedure call ≪ same-machine RPC ≪ network RPC
 
Means programmers must be aware that RPC is not free
Caching can help, but may make failure handling complex
 
8
 
9
 
Important “ilities”
 
Availability: the ability of the system to accept and process requests

Durability: the ability of a system to recover data despite faults

Reliability: the ability of a system or component to perform its required functions under stated conditions for a specified period of time (IEEE definition)
 
10
One Approach: Geographic Replication
 
Highly durable: Hard to destroy all copies
Highly available for reads: Just talk to any copy
What about for writes? Need every copy online
to update all together?
[Diagram: Replica/Fragment #1, #2, …, #n]
11
 
Centralized vs Distributed

Centralized System: Major functions performed on one physical computer
Distributed System: Physically separate computers working together to perform a single task
 
12
 
Parallel vs Distributed
 
Distributed: different machines responsible for
different parts of task
Usually no centralized state
Usually about different responsibilities or redundancy
 
Parallel: different parts of same task performed on
different machines
Usually about performance
 
13
 
Distributed: Why?
 
Simple, cheaper components
Easy to add capability incrementally
 
Let multiple users cooperate (maybe)
Physical components owned by different users
Enable collaboration between diverse users
 
14
 
The Promise of Dist. Systems
 
Availability: One machine goes down, overall system stays up

Durability: One machine loses data, but system does not lose anything

Security: Easier to secure each component of the system individually?
 
15
 
Distributed: Worst-Case Reality
 
Availability: Failure in one machine brings down entire system

Durability: Any machine can lose your data

Security: More components means more points of attack
 
16
 
Distributed Systems Goal
 
Transparency: Hide "distributed-ness" from any external observer, make system simpler

Types:
Location: Location of resources is invisible
Migration: Resources can move without user knowing
Replication: Invisible extra copies of resources (for reliability,
performance)
Parallelism: Job split into multiple pieces, but looks like a
single task
Fault Tolerance: Components fail without users knowing
 
17
 
Challenge of Coordination
 
Components communicate over the network
Send messages between machines
 
Need to use messages to agree on system state
This issue does not exist in a centralized system
 
18
 
CAP Theorem
 
Originally proposed by Eric Brewer (Berkeley)
 
1. Consistency: changes appear to everyone in same sequential order
2. Availability: can get a result at any time
3. Partition Tolerance: system continues to work even when one part of the network can't communicate with the other
 
Impossible to achieve all 3 at the same time (pick two)
 
19
CAP Theorem Example
 
What do we do if a network partition occurs?
Prefer Availability: Allow the state at some nodes to disagree with the state at other nodes (AP)
Prefer Consistency: Reject requests until the partition is resolved (CP)
20
 
Consistency Preferred
 
Block writes until all nodes able to agree
 
Consistent: Reads never return wrong values

Not Available: Writes block until partition is resolved and unanimous approval is possible
 
21
 
What about AP Systems?
 
Partition occurs, but both groups of nodes continue to
accept requests
Consequence: State might diverge between the two
groups (e.g., different updates are executed)
When communication is restored, there needs to be an explicit recovery process
Resolve conflicting updates so everyone agrees on system state once again
 
22
 
General’s Paradox
 
Two generals located on opposite sides of their
enemy’s position
Can only communicate via messengers
Messengers go through enemy territory: might be
captured
 
Problem: Need to coordinate time of attack
Two generals lose unless they attack at same time
If they attack at same time, they win
 
23
 
General’s Paradox
 
Can messages over an unreliable network be used to
guarantee two entities do something simultaneously?
No, even if all messages go through

[Diagram: General 1 and General 2 exchanging messages across enemy territory]
 
24
 
Two-Phase Commit
 
We can’t solve the General’s Paradox
No simultaneous action
But we can solve a related problem
 
Distributed Transaction: Two (or more) machines agree to do something, or not do it, atomically

Extra tool: Persistent Log
If machine fails, it will remember what happened
Assume log itself can’t be corrupted
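A minimal sketch of such a log (the record format is made up; the key point is forcing the record to disk before acting on it):

```python
import os

def log_decision(path, record):
    # Append the decision and force it to stable storage before
    # sending any message that depends on it.
    with open(path, "a") as log:
        log.write(record + "\n")
        log.flush()
        os.fsync(log.fileno())

# e.g., a participant logs its vote before replying
log_decision("2pc.log", "txn-42 VOTE-COMMIT")
```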
 
25
 
Two-Phase Commit: Setup
 
One machine (coordinator) initiates the protocol
It asks every machine to vote on transaction

Two possible votes:
Commit
Abort
 
Commit transaction only if unanimous approval
 
26
Two-Phase Commit: Preparing
 
Agree to Commit
Machine has guaranteed that it will accept transaction
Must be recorded in log so machine will remember this decision if it fails and restarts

Agree to Abort
Machine has guaranteed that it will never accept this transaction
Must be recorded in log so machine will remember this decision if it fails and restarts
27
 
Two-Phase Commit: Finishing
 
Commit Transaction
Coordinator learns all machines have agreed to commit
Apply transaction, inform voters
Record decision in local log

Abort Transaction
Coordinator learns at least one machine has voted to abort
Do not apply transaction, inform voters
Record decision in local log
 
28
 
Formalizing Two-Phase Commit
 
N workers (replicas): actually perform transaction

One coordinator (may also serve as a worker)
Asks each worker to vote on transaction
Tells every machine result of the vote (workers don’t need to ask each other)
 
29
Messages in Two-Phase Commit
 
Coordinator → Worker:
VOTE-REQ

Worker → Coordinator:
VOTE-COMMIT
VOTE-ABORT

Coordinator → Worker:
GLOBAL-COMMIT
GLOBAL-ABORT
30
Messages in Two-Phase Commit
 
Coordinator → Worker:
VOTE-REQ

Worker → Coordinator:
VOTE-COMMIT
VOTE-ABORT
(No taking back: always logged before sending)

Coordinator → Worker:
GLOBAL-COMMIT
GLOBAL-ABORT
(Actual result of the transaction attempt)
31
Detailed Algorithm
Coordinator Algorithm
Coordinator sends VOTE-REQ to all workers
If receives VOTE-COMMIT from all N workers, send GLOBAL-COMMIT to all workers
If doesn’t receive VOTE-COMMIT from all N workers, send GLOBAL-ABORT to all workers

Worker Algorithm
Wait for VOTE-REQ from coordinator
If ready, send VOTE-COMMIT to coordinator
If not ready, send VOTE-ABORT to coordinator (and immediately abort)
If receive GLOBAL-COMMIT then commit
If receive GLOBAL-ABORT then abort
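A condensed sketch of the two algorithms above; the send/receive methods on the connection objects and the append-only log object are placeholders, not a real transport:

```python
def coordinator(workers, log):
    # Phase 1: solicit votes
    for w in workers:
        w.send("VOTE-REQ")
    votes = [w.receive() for w in workers]

    # Phase 2: decide, log the decision, then announce it
    if all(v == "VOTE-COMMIT" for v in votes):
        decision = "GLOBAL-COMMIT"
    else:
        decision = "GLOBAL-ABORT"
    log.append(decision)              # no taking back once logged and sent
    for w in workers:
        w.send(decision)
    return decision

def worker(coord, ready, log):
    assert coord.receive() == "VOTE-REQ"
    vote = "VOTE-COMMIT" if ready else "VOTE-ABORT"
    log.append(vote)                  # must survive a crash and restart
    coord.send(vote)
    if vote == "VOTE-ABORT":
        return "abort"                # can abort immediately
    # READY: block until the coordinator announces the outcome
    return "commit" if coord.receive() == "GLOBAL-COMMIT" else "abort"
```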
32
Example: Failure-Free 2PC
[Timeline diagram: coordinator sends VOTE-REQ to workers 1-3; all reply VOTE-COMMIT; coordinator sends GLOBAL-COMMIT]

33
 
Example: Failure-Free 2PC
 
[Timeline diagram: coordinator sends VOTE-REQ; one worker replies VOTE-ABORT, the others VOTE-COMMIT; coordinator sends GLOBAL-ABORT]

34
Example of Worker Failure

[Timeline diagram: coordinator sends VOTE-REQ; worker 1 replies VOTE-COMMIT; another worker never responds, so the coordinator times out and sends GLOBAL-ABORT]
35
Example of Coordinator Failure (1)
[Timeline diagram: coordinator sends VOTE-REQ, then fails; a worker replies VOTE-ABORT; the workers time out waiting for the coordinator]
36
Example of Coordinator Failure
[Timeline diagram: coordinator sends VOTE-REQ; workers reply VOTE-COMMIT and block waiting for the coordinator; the coordinator is restarted and sends GLOBAL-ABORT]

37
Failure Recovery
 
Nodes need to know what state they are in when they come back from a failure
How? Log events on local hard disk
Then we have the following recovery rules:
Coordinator aborts transaction if it was in the INIT, WAIT, or
ABORT states
Coordinator commits transaction if it was in COMMIT
Worker aborts if in INIT or ABORT states
Worker commits if it was in COMMIT state
Worker asks coordinator what to do if in READY state
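These rules could be written out roughly as follows (state names as in the slide; ask_coordinator stands in for the query a READY worker must make):

```python
def recover_coordinator(logged_state):
    # Coordinator restarts and consults its local log
    return "commit" if logged_state == "COMMIT" else "abort"   # INIT/WAIT/ABORT -> abort

def recover_worker(logged_state, ask_coordinator):
    if logged_state in ("INIT", "ABORT"):
        return "abort"
    if logged_state == "COMMIT":
        return "commit"
    # READY: this worker voted commit but never saw the outcome; must ask
    return ask_coordinator()
```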
38
Blocking for Coordinator to Recover
 
A worker waiting for global decision can ask fellow workers about their state
If another worker is in ABORT or COMMIT state then coordinator must have sent GLOBAL-*
Thus, worker can safely abort or commit, respectively

If another worker is still in INIT state then both workers can decide to abort

If all workers are in READY, need to BLOCK (don’t know if coordinator wanted to abort or commit)
39
Blocking
 
What if both coordinator and a worker fail?
The remaining workers can still consult each other
But they can’t reach a conclusion on what to do!

Why?
If all workers in INIT, we still don’t know state of failed worker w
w may have been first to be notified of a commit, and then coordinator and w crashed
40
Blocking for Coordinator
 
What if both coordinator and a worker fail?
The remaining workers can still consult each other
But they can’t reach a conclusion on what to do!

This problem motivated Three-Phase Commit
41
 
Paxos: fault tolerant agreement
 
Paxos lets all nodes agree on the same value despite
node failures, network failures and delays
High-level process:
One (or more) node decides to be the leader
Leader proposes a value and solicits acceptance from others
Leader announces result or tries again
 
42
 
Google Spanner
 
James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay
Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak,
Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan,
Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale
Woodford. 2012. Spanner: Google's globally-distributed database. In 
Proceedings of the 10th USENIX
conference on Operating Systems Design and Implementation
 (OSDI'12). USENIX Association, Berkeley,
CA, USA, 251-264.
 
43
 
Basic Spanner Operation
 
Data replicated across datacenters
Paxos groups support transactions
On commit:
Grab Paxos lock
Paxos algorithm decides consensus
If all agree, transaction is committed
 
44
Spanner Operation
[Diagram: a transaction spanning two Paxos groups, coordinated via 2PC]

45
Base operation great for writes…
 
What about reads?
Reads are dominant operations
e.g., FB’s TAO had 500 reads : 1 write [ATC 2013]
e.g., Google Ads (F1) on Spanner from 1? DC in 24h:
         21.5B reads
         31.2M single-shard transactions
         32.1M multi-shard transactions
Want efficient read transactions
46
 
Make Read-Only Txns Efficient
 
Ideal: Read-only transactions that are non-blocking
Arrive at shard, read data, send data back
 
Goal 1: Lock-free read-only transactions
 
Goal 2: Non-blocking stale read-only txns
 
47
TrueTime

“Global wall-clock time” with bounded uncertainty
ε is worst-case clock divergence
Timestamps become intervals, not single values

[Diagram: timeline showing TT.now() returning an interval [earliest, latest] of width 2*ε]

Consider event e_now which invoked tt = TT.now():

Guarantee: tt.earliest <= t_abs(e_now) <= tt.latest

48
Timestamps and TrueTime
 
Key: Need to ensure that all future transactions will get a higher timestamp
Commit wait ensures this

[Diagram: transaction T acquires locks, picks s > TT.now().latest, commit-waits until TT.now().earliest > s (roughly 2 average ε), then releases locks]

49
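A rough sketch of commit wait against a hypothetical TrueTime interface; tt_now() returning an (earliest, latest) pair and the fixed uncertainty are assumptions for illustration, not Spanner's actual API:

```python
import time

def tt_now():
    # Hypothetical TrueTime: bounds on absolute time, interval width 2*epsilon.
    eps = 0.004                      # assumed uncertainty, in seconds
    t = time.time()
    return (t - eps, t + eps)

def commit_with_wait(apply_txn):
    # Pick a commit timestamp at/above the current upper bound
    # (slide: pick s > TT.now().latest)
    s = tt_now()[1]
    # Commit wait: don't finish until s is guaranteed to be in the past everywhere
    while tt_now()[0] <= s:
        time.sleep(0.001)
    apply_txn(s)                     # now safe to release locks / make visible
    return s
```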
Commit Wait and Replication
[Diagram: transaction T acquires locks, picks s, starts consensus, achieves consensus, completes commit wait, notifies followers, releases locks]

50
 
Read-only optimizations

Given global timestamp, can implement read-only transactions lock-free (snapshot isolation)
Step 1: Choose timestamp s_read = TT.now.latest()
Step 2: Snapshot read (at s_read)
Can be served by any up-to-date replica

51
TrueTime for Read-Only Txns
 
Assign all transactions a wall-clock commit time (s)
All replicas of all shards track how up-to-date they are with t_safe: all transactions with s < t_safe have committed on this machine

Goal 1: Lock-free read-only transactions
Current time ≤ TT.now.latest()
s_read = TT.now.latest()
Wait until s_read < t_safe
Read data as of s_read

Goal 2: Non-blocking stale read-only txns
Similar to above, except explicitly choose a time in the past
(Trades away consistency for better perf, e.g., lower latency)
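Goal 1 above could be sketched as follows; the replica object with t_safe() and read_at() methods is a stand-in for illustration, not Spanner's interface:

```python
import time

def lock_free_read(replica, tt_now, keys):
    # Choose a read timestamp no earlier than any already-committed transaction
    s_read = tt_now()[1]                     # TT.now().latest
    # Wait only until this replica has caught up past s_read
    while replica.t_safe() <= s_read:
        time.sleep(0.001)
    # Snapshot read at s_read: no locks needed
    return {k: replica.read_at(k, s_read) for k in keys}
```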
52
 
Commit wait
 
What does this mean for performance?
Larger TrueTime uncertainty bound → longer commit wait
Longer commit wait → locks held longer → can’t process conflicting transactions → lower throughput
i.e., if time is less certain, Spanner is slower!
 
53
 
Question
 
If you notice unacceptable performance using
Spanner, how could you improve it?
 
54