Distributed Transactions in Spanner: Insights and Mechanisms

CS 240: Computing Systems and Concurrency
Lecture 21
Marco Canini
Credits: Michael Freedman and Kyle Jamieson developed much of the original material.
Contents adapted from Haonan Lu, Wyatt Lloyd.
Recap: Spanner is Strictly Serializable

Efficient read-only transactions in strictly serializable systems:
Strict serializability is desirable but costly!
Reads are prevalent! (340x more than write txns)
Efficient rotxns → good overall performance
Recap: TrueTime

Timestamping writes must enforce the invariant:
If T2 starts after T1 commits (finishes), then T2 must have a larger timestamp.

TrueTime: partially-synchronized clock abstraction
Bounded clock skew (uncertainty)
TT.now() → [earliest, latest]; earliest <= T_abs <= latest
Uncertainty (ε) is kept short

TrueTime enforces the invariant by:
Using at least TT.now().latest for timestamps
Commit wait
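The TrueTime interface and commit wait can be sketched as follows. This is a minimal illustrative model, not Google's implementation: the `TrueTime` class, its `epsilon` value, and `commit_wait` are assumptions standing in for the real clock machinery.

```python
import time

# Illustrative model of the TrueTime API: TT.now() returns an interval
# (earliest, latest) guaranteed to contain true absolute time T_abs.
# The fixed epsilon is a stand-in for the real, measured uncertainty.
class TrueTime:
    def __init__(self, epsilon=0.007):   # ~7 ms, a hypothetical bound
        self.epsilon = epsilon

    def now(self):
        t = time.time()
        return (t - self.epsilon, t + self.epsilon)   # (earliest, latest)

    def after(self, ts):
        # True only once ts is guaranteed to have passed on every clock.
        return self.now()[0] > ts

def commit_wait(tt, ts):
    # Commit wait: block until TT.after(ts) == true before finishing.
    while not tt.after(ts):
        time.sleep(tt.epsilon / 10)

tt = TrueTime(epsilon=0.001)
ts = tt.now()[1]          # use at least TT.now().latest as the timestamp
commit_wait(tt, ts)       # takes roughly 2 * epsilon in this model
assert tt.after(ts)       # after commit wait, ts is definitely in the past
```

The key point the sketch shows: because the chosen timestamp is at least `TT.now().latest` and the transaction only finishes once `TT.after(ts)` holds, any transaction that starts afterwards must observe a strictly larger timestamp.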
Enforcing the Invariant with TT

If T2 starts after T1 commits (finishes), then T2 must have a larger timestamp.
Let T1 write S_B and T2 write S_A.

[Figure: timeline over T_abs with shards S_A and S_B. T1 calls TT.now() = [3, 15] at absolute time 8 and commits on S_B with ts = 15, then commit-waits until TT.after(15) == true at absolute time 16; its commit point b thus satisfies b < x for any event x that starts later.]
Enforcing the Invariant with TT (cont.)

If T2 starts after T1 commits (finishes), then T2 must have a larger timestamp.
Let T1 write S_B and T2 write S_A.

[Figure: same timeline, continued. After T1's commit wait ends at absolute time 16, T2 calls TT.now() = [18, 22] at absolute time 20, commits on S_A with ts = 22, and commit-waits in turn. Hence T2.ts = 22 > T1.ts = 15.]
Strictly Serializable Multi-Shard Transactions

How are clocks made "nearly perfect"?
TrueTime
How does Spanner leverage these clocks?
How are writes done and tagged?
How are read-only transactions made efficient?
Scale-out vs. fault tolerance

Spanner mechanisms:
2PL for concurrency control of read-write transactions
2PC for distributed transactions over tables
(Multi)Paxos for replicating every tablet
This Lecture

How write transactions are done
2PL + 2PC (sometimes 2PL for short)
How they are timestamped
How read-only transactions are done
How read timestamps are chosen
How reads are executed
Read-Write Transactions (2PL)

Three phases: Execute → Prepare → Commit
2PC: atomicity
 
Client-driven transactions (multi-shard)

Client: 2PL w/ 2PC
1. Issues reads to the leader of each shard group, which acquires read locks and returns the most recent data
2. Locally performs writes
3. Chooses a coordinator from the set of leaders, initiates commit
4. Sends a commit message to each leader, including the identity of the coordinator and the buffered writes
5. Waits for commit from the coordinator
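The five client-driven steps can be sketched as a toy in-memory model. Everything here is hypothetical (the `ShardLeader` class, method names, and single-machine data structures); real leaders replicate every action via Paxos, which is omitted.

```python
# Hypothetical in-memory shard leader. read() takes a read lock and
# returns the most recent value, mirroring step 1 above.
class ShardLeader:
    def __init__(self, name, data):
        self.name, self.data = name, data
        self.read_locks, self.write_locks = set(), set()

    def read(self, key):
        self.read_locks.add(key)          # step 1: acquire read lock
        return self.data[key]

    def commit(self, coordinator_name, writes):
        # step 4: receive coordinator identity + buffered writes
        for key, value in writes.items():
            self.write_locks.add(key)     # grab write lock
            self.data[key] = value        # apply the buffered write

def run_txn(leaders):
    a = leaders['A'].read('A')                        # step 1: read from leaders
    buffered = {'A': a + 1, 'B': a + 1, 'C': a + 1}   # step 2: buffer writes locally
    coordinator = leaders['A']                        # step 3: pick a coordinator
    for name, ldr in leaders.items():                 # step 4: send commit messages
        ldr.commit(coordinator.name, {name: buffered[name]})
    return {n: l.data[n] for n, l in leaders.items()} # step 5: result after commit

leaders = {n: ShardLeader(n, {n: 0}) for n in 'ABC'}
assert run_txn(leaders) == {'A': 1, 'B': 1, 'C': 1}
```

Note how the client, not a server, drives the protocol: it gathers reads, buffers writes, and only contacts the leaders again at commit time.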
Read-Write Transactions (2PL): Execute

Txn T = {R(A=?), W(A=?+1), W(B=?+1), W(C=?+1)}

[Figure: client executes T against shards A, B, C; R(A) returns A=a.]

Execute:
Perform reads: grab read locks and return the most recent data, e.g., R(A=a)
Client computes and buffers writes locally, e.g., A = a+1, B = a+1, C = a+1
Read-Write Transactions (2PL): Prepare

[Figure: client sends buffered writes W(a+1) to coordinator A and participants B and C; B and C log a Prepare record via Paxos and reply ok.]

Prepare:
Choose a coordinator, e.g., A; the others are participants
Send the buffered writes and the identity of the coordinator; grab write locks
Each participant prepares T by logging a prepare record via Paxos with its replicas; the coordinator skips the prepare (Paxos logging)
Participants send OK to the coordinator once locks are grabbed and Paxos logging is done
Read-Write Transactions (2PL): Commit

[Figure: coordinator A logs Commit and applies W(a+1); participants B and C log Commit and apply W(a+1); the client receives the result and acks.]

Commit:
After hearing from all participants, the coordinator commits T if all voted OK; otherwise, it aborts T
The coordinator logs a commit/abort record via Paxos, applies the writes if commit, and releases all locks
The coordinator sends commit/abort messages to the participants
Participants log commit/abort via Paxos, apply the writes if commit, and release locks
The coordinator sends the result to the client either after its "log commit" or after the acks
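The coordinator's decision rule above can be condensed into a few lines. This is a minimal sketch: the vote strings and the `log` list are illustrative stand-ins for Paxos-replicated records, not Spanner's actual interface.

```python
# Commit only if every participant voted OK; otherwise abort.
def coordinator_decide(votes):
    return 'commit' if all(v == 'ok' for v in votes) else 'abort'

def commit_phase(participants, votes, log):
    decision = coordinator_decide(votes)
    log.append(('coord', decision))      # coord logs commit/abort via Paxos
    for p in participants:               # coord notifies each participant,
        log.append((p, decision))        # which logs the outcome, applies, unlocks
    return decision

log = []
assert commit_phase(['B', 'C'], ['ok', 'ok'], log) == 'commit'
assert coordinator_decide(['ok', 'fail']) == 'abort'
```

A single non-OK vote is enough to abort, which is exactly what makes 2PC atomic: either every shard applies T's writes or none does.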
Timestamping Read-Write Transactions

[Figure: participants B and C return ok with proposed timestamps ts_B and ts_C at prepare; the coordinator picks ts_A, commit-waits, and all shards record T.ts = ts_A.]

Timestamping:
Participant: choose a timestamp, e.g., ts_B and ts_C, larger than any writes it has applied
Coordinator: choose a timestamp, e.g., ts_A, larger than
- any writes it has applied
- any timestamps proposed by the participants, e.g., ts_B and ts_C
- its current TT.now().latest
The coordinator commit-waits: TT.after(ts_A) == true. Commit wait overlaps with Paxos logging
ts_A is T's commit timestamp
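The coordinator's three lower bounds can be sketched directly. Inputs here are plain numbers for illustration; in Spanner they come from applied writes, participant proposals, and TT.now().latest.

```python
# The commit timestamp must be at least as large as every lower bound:
# writes already applied, every participant proposal, and TT.now().latest.
def coordinator_ts(max_applied_ts, proposed, tt_now_latest):
    return max([max_applied_ts, tt_now_latest] + proposed)

ts_b, ts_c = 9, 11                       # participants' proposed timestamps
ts_a = coordinator_ts(max_applied_ts=7, proposed=[ts_b, ts_c], tt_now_latest=10)
assert ts_a == 11                        # dominated by ts_C in this example
assert ts_a >= ts_b and ts_a >= ts_c and ts_a >= 10
```

Because ts_A is at least TT.now().latest and the coordinator then commit-waits on it, the invariant from the recap slides holds across shards.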
Ideas Behind Read-Only Txns

Tag writes with physical timestamps upon commit
Write txns are strictly serializable, e.g., via 2PL
Read-only txns return the writes whose commit timestamps precede the reads' current time
Rotxns are one-round, lock-free, and never abort
Read-Only Transactions (shards part)

Txn T' = R(A=?, B=?, C=?)

[Figure: client reads at ts = 10. Shard A has W1 committed at 5; shard B has W2 prepared at 12; shard C has W3 prepared at 8 (it later commits at 15), so T' doesn't know whether or when W3 commits.]

Client chooses a read timestamp ts = TT.now().latest
If there is no prepared write, return the preceding write, e.g., on A
If a write is prepared with ts' > ts, there is no need to wait; proceed with the read, e.g., on B
If a write is prepared with ts' < ts, wait until that write commits, e.g., on C
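The per-shard read rule can be sketched as follows. `versions` maps commit timestamps to values and `prepared_ts` is the timestamp of a prepared-but-uncommitted write, if any; both names are illustrative.

```python
# Decide what a shard returns for a read at read_ts.
def read_at(versions, prepared_ts, read_ts):
    if prepared_ts is not None and prepared_ts < read_ts:
        return 'wait'   # outcome unknown: block until that write commits
    # Otherwise return the latest write with commit timestamp <= read_ts.
    return versions[max(ts for ts in versions if ts <= read_ts)]

# The scenario from the slide, with read timestamp ts = 10:
assert read_at({0: 'W0', 5: 'W1'}, prepared_ts=None, read_ts=10) == 'W1'  # shard A
assert read_at({0: 'W0'}, prepared_ts=12, read_ts=10) == 'W0'             # shard B
assert read_at({0: 'W0'}, prepared_ts=8, read_ts=10) == 'wait'            # shard C
```

Only case C blocks: a prepared write below the read timestamp might still commit below it, so the read cannot safely return either version yet.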
Read-Only Transactions (Paxos part)

[Figure: client reads at ts = 10. Shard A has W1 committed at 5; shard B's Paxos log holds W2 and a later W3; shard C holds only W0.]

Paxos writes are monotonic, e.g., writes with smaller timestamps must be applied earlier: W2 is applied before W3
T needs to wait until there exists a Paxos write with ts > 10, e.g., W3, so that all writes before 10 are finalized
Putting it together: a shard can process a read at ts if ts <= t_safe
t_safe = min(t_safe_Paxos, t_safe_TM): before t_safe, all system states (writes) have been finalized
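The safe-time rule can be sketched in a few lines. Names are illustrative: `last_applied_paxos_ts` stands for the timestamp of the last applied Paxos write, and `prepared_ts_list` for the prepare timestamps of in-flight transactions at this shard.

```python
# A shard may serve a read at ts only when ts <= t_safe,
# where t_safe = min(t_safe_paxos, t_safe_tm).
def t_safe(last_applied_paxos_ts, prepared_ts_list):
    t_paxos = last_applied_paxos_ts     # Paxos part: applied prefix is final
    # TM part: reads must stay strictly below the earliest prepared txn.
    t_tm = min(prepared_ts_list, default=float('inf')) - 1
    return min(t_paxos, t_tm)

def can_serve(read_ts, last_applied_paxos_ts, prepared_ts_list):
    return read_ts <= t_safe(last_applied_paxos_ts, prepared_ts_list)

assert can_serve(10, last_applied_paxos_ts=15, prepared_ts_list=[])       # no prepared txns
assert not can_serve(10, last_applied_paxos_ts=15, prepared_ts_list=[8])  # blocked by prepare at 8
```

This captures both waits from the previous slides: a prepared transaction below the read timestamp lowers t_safe_TM, and a lagging Paxos log lowers t_safe_Paxos.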
A Puzzle to Help With Understanding

What if there were no replication, only shards? (Not in the paper, not realistic)
Txn T = {W_A, W_C}, T' = R(A, C)

[Figure: T commits with ts_cmt = 8 on shards A and C while T' reads at ts = 10; T' reads A before W_A is applied but reads C after W_C is applied.]

T' sees a partial effect of T, e.g., it sees W_C but not W_A, violating atomicity.
A Puzzle to Help With Understanding (solution)

Solution: uncertainty-wait

[Figure: T' reads at ts = 10 and waits out its uncertainty before "committing"; T then prepares and commits with ts_cmt > 10 on both shards.]

Uncertainty-wait ensures that ts_cmt > readTS because:
T's writes start after T' "commits", and
T' waits out the uncertainty before "commit", e.g., TT.after(10) == true
Serializable Snapshot Reads

Client specifies a read timestamp way in the past, e.g., one hour ago
Read shards at the stale timestamp
Serializable: an old timestamp cannot ensure real-time order
Better performance: no waiting in any case, e.g., non-blocking, not just lock-free
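A snapshot read then degenerates to a simple version lookup: at a sufficiently old timestamp, every version at or below it is already finalized, so no waiting is ever needed. A tiny sketch with illustrative data:

```python
# Return the latest committed version at or below the stale read timestamp.
def snapshot_read(versions, read_ts):
    return versions[max(ts for ts in versions if ts <= read_ts)]

history = {0: 'W0', 5: 'W1', 12: 'W2'}            # committed versions by timestamp
assert snapshot_read(history, read_ts=7) == 'W1'  # e.g., "one hour ago"
```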
Takeaway

Strictly serializable (externally consistent)
Makes it easy for developers to build apps!
Reads are dominant, so make them efficient
One-round, lock-free
TrueTime exposes clock uncertainty
Commit wait and using at least TT.now().latest for timestamps ensure real-time ordering
Globally-distributed database
2PL w/ 2PC over Paxos!