Hands-Off Persistence System (HOPS) Overview

Slide 1: Hands-Off Persistence System (HOPS)
Swapnil Haria¹, Sanketh Nalli¹, Haris Volos², Kim Keeton², Mark D. Hill¹, Mike M. Swift¹
¹ University of Wisconsin-Madison
² Hewlett Packard Labs
Slide 2: WHISPER Analysis → HOPS Design
4% of accesses go to PM, 96% to DRAM → volatile memory hierarchy (almost) unchanged
5-50 epochs per transaction → order epochs without flushing
Self-dependencies common → allow multiple copies of the same cacheline
Cross-dependencies rare → correct, conservative method based on coherence
Slide 3: Outline
Motivation
HOPS Design
Evaluation
A
C
I
D
 
T
r
a
n
s
a
c
t
i
o
n
s
 
(
c
u
r
r
e
n
t
l
y
)
4
 
Acquire Lock
 
Release Lock
 
Prepare Log Entry
 
1
 
N
 
Mutate Data Structure
 
1
 
N
 
Commit Transaction
 
FLUSH EPOCH
 
FLUSH EPOCH
 
FLUSH EPOCH
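As a concrete illustration of this pattern, here is a minimal C sketch of one undo-logged transaction, assuming x86 CLWB/SFENCE intrinsics (the Intel extensions discussed on slide 37) and hypothetical node/log types; the three flush_epoch() calls correspond to the three FLUSH EPOCH steps above.

```c
#include <immintrin.h>   /* _mm_clwb, _mm_sfence (compile with -mclwb) */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical PM-resident types, for illustration only. */
struct node      { long key, val; };
struct log_entry { struct node *target; struct node old; int committed; };

/* Write back every cacheline of [p, p+len) and fence: one FLUSH EPOCH. */
static void flush_epoch(void *p, size_t len)
{
    const uintptr_t line = 64;
    for (uintptr_t a = (uintptr_t)p & ~(line - 1); a < (uintptr_t)p + len; a += line)
        _mm_clwb((void *)a);
    _mm_sfence();   /* order these writebacks before later stores */
}

/* One undo-logged update, following the slide's structure. */
void txn_update(struct node *obj, struct node newval,
                struct log_entry *log, pthread_mutex_t *lock)
{
    pthread_mutex_lock(lock);                  /* Acquire Lock          */

    log->target = obj;                         /* Prepare Log Entry     */
    log->old = *obj;
    log->committed = 0;
    flush_epoch(log, sizeof *log);             /* FLUSH EPOCH           */

    *obj = newval;                             /* Mutate Data Structure */
    flush_epoch(obj, sizeof *obj);             /* FLUSH EPOCH           */

    log->committed = 1;                        /* Commit Transaction    */
    flush_epoch(&log->committed, sizeof log->committed);  /* FLUSH EPOCH */

    pthread_mutex_unlock(lock);                /* Release Lock          */
}
```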
Slide 5: Base System
[Figure: CPU 0 and CPU 1 with private L1 caches and a shared LLC (all volatile), connected to a DRAM controller and a PM controller (persistent).]
Slide 6: Base System: Flush
[Figure: CPU 0 issues "1. Flush A" for cacheline A = 1; the line is written back through the hierarchy and suffers a long-latency PM write before the flush ACK returns and "2. Flush B" can proceed.]
Slide 7: Outline
Motivation
HOPS Design
Evaluation
Slide 8: Hands-off Persistence System (HOPS)
Volatile memory hierarchy (almost) unchanged
Order epochs without flushing
Allows multiple copies of the same cacheline
Correct, conservative method for handling cross-dependencies
Slide 9: Base System + Persist Buffers
[Figure: the base system extended with a persist buffer front end next to each private L1 (stores to PM also enter the front end, while loads and stores use the L1s and shared LLC as before) and a persist buffer back end at the PM controller.]
Slide 10: Persist Buffers
Volatile buffers
Front end (per-thread): address and ordering info
Back end (per memory controller): cacheline data
Enqueue/dequeue only; not fully associative
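To make the front-end/back-end split concrete, here is a hypothetical C model of the two entry types; the field names are illustrative and not taken from the paper.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHELINE 64

/* Front end (per-thread): address and ordering info only. */
struct pb_front_entry {
    uint64_t addr;           /* cacheline-aligned PM address                */
    uint32_t epoch_ts;       /* local timestamp of the owning epoch         */
    bool     has_cross_dep;  /* inter-thread (cross) dependency recorded?   */
    uint16_t dep_thread;     /* <Thread ID : Epoch TS> that must be flushed */
    uint32_t dep_epoch_ts;   /*   before this entry may drain               */
};

/* Back end (per memory controller): the cacheline data, FIFO-drained. */
struct pb_back_entry {
    uint64_t addr;
    uint8_t  data[CACHELINE];
};
```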
Slide 11: Hands-off Persistence System (HOPS)
Volatile memory hierarchy (almost) unchanged
Order epochs without flushing
Allows multiple copies of the same cacheline
Correct, conservative method for handling cross-dependencies
Slide 12: OFENCE: Ordering Fence
Orders stores preceding the OFENCE before later stores.
[Figure: ST A=1 and ST B=2 separated by an OFENCE in volatile memory order; in persistence order, ST A=1 happens before ST B=2.]
Slide 13: Base System + Persist Buffers
[Figure: same diagram as slide 9; PM stores feed the per-thread persist buffer front ends and drain through the back end at the PM controller.]
Slide 14: Ordering Epochs without Flushing
Code: 1. ST A = 1; 2. ST B = 1; 3. LD R1 = A; 4. OFENCE; 5. ST A = 2
[Figure: CPU 1's local TS advances from 25 to 26 at the OFENCE. The L1 cache holds the latest values (A = 2, B = 1); the persist buffer holds (A = 1, 25), (B = 1, 25), and (A = 2, 26). No flushes are needed to order the two epochs.]
Slide 15: ACID Transactions in HOPS
Acquire Lock (volatile writes)
Prepare Log Entries 1..N (persistent writes)
OFENCE
Mutate Data Structure (persistent writes 1..N)
OFENCE
Commit Transaction: NOT DURABLE!
Release Lock (volatile writes)
Slide 16: ACID Transactions in HOPS
Acquire Lock (volatile writes)
Prepare Log Entries 1..N (persistent writes)
OFENCE
Mutate Data Structure (persistent writes 1..N)
OFENCE
Commit Transaction
DFENCE
Release Lock (volatile writes)
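For comparison with the slide-4 sketch, here is the same transaction written against hypothetical ofence()/dfence() wrappers standing in for HOPS's two primitives: no cachelines are flushed by software, ordering comes from the OFENCEs, and durability is requested only once, by the DFENCE before the lock release.

```c
#include <pthread.h>

/* Hypothetical wrappers: real hardware would expose OFENCE/DFENCE as
 * instructions, not library calls; these stubs only mark where the
 * primitives would sit. */
static inline void ofence(void) { /* order prior PM stores before later ones */ }
static inline void dfence(void) { /* make prior PM stores durable            */ }

struct node      { long key, val; };
struct log_entry { struct node *target; struct node old; int committed; };

void txn_update_hops(struct node *obj, struct node newval,
                     struct log_entry *log, pthread_mutex_t *lock)
{
    pthread_mutex_lock(lock);     /* Acquire Lock (volatile)            */

    log->target = obj;            /* Prepare Log Entry (persistent)     */
    log->old = *obj;
    log->committed = 0;
    ofence();                     /* log is ordered before the mutation */

    *obj = newval;                /* Mutate Data Structure (persistent) */
    ofence();                     /* mutation is ordered before commit  */

    log->committed = 1;           /* Commit Transaction                 */
    dfence();                     /* whole transaction made durable     */

    pthread_mutex_unlock(lock);   /* Release Lock (volatile)            */
}
```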
Slide 17: DFENCE: Durability Fence
Makes the stores preceding the DFENCE durable.
[Figure: ST A=1 and ST B=2 precede the DFENCE in volatile memory order; both are persisted (happen before) the DFENCE completes.]
Slide 18: Durability is important too!
Code: 1. ST A = 1; 2. ST B = 1; 3. LD R1 = A; 4. OFENCE; 5. ST A = 2; ... N. DFENCE
[Figure: the DFENCE drains CPU 1's persist buffer, writing A = 1 and B = 1 (epoch 25) and then A = 2 (epoch 26) to PM, making all preceding stores durable.]
Slide 19: Hands-off Persistence System (HOPS)
Volatile memory hierarchy (almost) unchanged
Order epochs without flushing
Allows multiple copies of the same cacheline
Correct, conservative method for handling cross-dependencies
Slide 20: Preserving multiple copies of cachelines
Code: 1. ST A = 1; 2. ST B = 1; 3. LD R1 = A; 4. OFENCE; 5. ST A = 2
[Figure: the L1 cache keeps only the newest copy of A (A = 2), while the persist buffer retains both versions of the line, (A = 1, 25) and (A = 2, 26), along with (B = 1, 25).]
Slide 21: Hands-off Persistence System (HOPS)
Volatile memory hierarchy (almost) unchanged
Orders epochs without flushing
Allows multiple copies of the same cacheline
Correct, conservative method for handling cross-dependencies
Slide 22: Outline
Motivation
HOPS Design
Evaluation
Slide 23: System Configuration
Evaluated using gem5 full-system mode with the Ruby memory model.
CPU cores: 4 cores, OOO, 2 GHz
L1 caches: private, 64 KB, split I/D
L2 caches: private, 2 MB
DRAM: 4 GB, 40-cycle read/write latency
PM: 4 GB, 160-cycle read/write latency
Persist buffers: 64 entries
Slide 24: Performance Evaluation
[Figure: normalized runtime (lower is better) over echo, ycsb, redis, ctree, hashmap, vacation, and their average for: the baseline using clwb + sfence; the baseline + persistent write queue; HOPS; HOPS + persistent write queue; and an ideal configuration that shows best-case performance but is unsafe on crash.]
Slide 25: WHISPER Analysis → HOPS Design
4% of accesses go to PM, 96% to DRAM → volatile memory hierarchy (almost) unchanged
5-50 epochs per transaction → order epochs without flushing
Self-dependencies common → allow multiple copies of the same cacheline
Cross-dependencies rare → correct, conservative method based on coherence
Slide 26: Questions?
Thanks!
Slide 27: BENCHWARMERS (backup slides follow)
Slide 28: Handling Cross Dependencies
[Figure: CPU 0 (local TS 25) holds A = 1; CPU 1 (local TS 14) executes ST A = 4. The request is resolved through the directory (steps 1-3), and the coherence response carries CPU 0's timestamp, so CPU 1's persist buffer records (A = 4, epoch 14) with cross dependency 0:25.]
Slide 29: Comparison with DPO (Micro 16)
Primitives: HOPS provides ordering and durability; DPO provides ordering only.
Conflicts: buffered in both designs.
Effect on volatile accesses: none in HOPS; DPO's fully associative PBs are snooped on every coherence request.
Scaling to multiple cores: HOPS uses lazy (cumulative) updates on PB drain; DPO broadcasts globally on every PB drain.
Multiple memory controllers: HOPS works natively; DPO is designed for one MC.
Slide 30: Comparison with Efficient PB (Micro 15)
Primitives: HOPS provides ordering and durability; Efficient PBs provide ordering only.
Intra-thread conflicts: buffered in HOPS; buffered (up to 5) in Efficient PBs.
Inter-thread conflicts: buffered in HOPS; cause a synchronous flush in Efficient PBs.
Cache modifications: 1 bit in HOPS; proportional to the number of cores and in-flight epochs supported in Efficient PBs.
Slide 31: Comparison with Non-Temporal Stores (x86)
Stores cached: yes in HOPS; no with NT stores (the cache copy is invalidated).
Ordering guarantees: yes in HOPS, with a fast OFENCE; yes with NT stores, but with slower fences.
Durability guarantees: yes in HOPS, with DFENCE; no with NT stores.
Slide 32: Linked List Insertion - Naive
1. Create Node
2. Update Node Pointer
3. Update Head Pointer
[Figure: in the cache, NODE C is linked in between HEAD and NODE A -> NODE B; PM only sees whichever lines happen to be written back.]
Slide 33: Linked List Insertion - Naive
On a system crash the (volatile) caches are wiped clean, and main memory is inconsistent!
[Figure: in PM, HEAD points to NODE C but NODE C's next pointer may never have reached PM (shown as "?"), leaving the list broken.]
Slide 34: Linked List Insertion - Crash Consistent
Epoch 1: 1. Create Node; 2. Update Node Pointer; 3. FLUSH EPOCH 1
Epoch 2: 4. Update Head Pointer; 5. FLUSH EPOCH 2
[Figure: the new node and its next pointer are explicitly written back to PM before the head pointer is updated and flushed, so PM never holds a head that points at a missing node.]
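A minimal C sketch of the two-epoch insertion, again assuming CLWB/SFENCE for FLUSH EPOCH, a PM-resident head pointer, and a node that fits in one cacheline; writing back the node before publishing it through the head pointer is what keeps PM consistent at every crash point.

```c
#include <immintrin.h>   /* _mm_clwb, _mm_sfence */

struct list_node { long value; struct list_node *next; };

/* One FLUSH EPOCH for a single cacheline. */
static void flush_line(void *p) { _mm_clwb(p); _mm_sfence(); }

/* head and node are assumed to live in PM, and the node is assumed to
 * fit within one cacheline. */
void pm_list_insert(struct list_node **head, struct list_node *node, long value)
{
    /* Epoch 1: build the node and link it to the existing list. */
    node->value = value;      /* 1. Create Node         */
    node->next  = *head;      /* 2. Update Node Pointer */
    flush_line(node);         /* 3. FLUSH EPOCH 1       */

    /* Epoch 2: publish it. A crash before this line persists leaves the
     * old, still-consistent list in PM. */
    *head = node;             /* 4. Update Head Pointer */
    flush_line(head);         /* 5. FLUSH EPOCH 2       */
}
```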
Slide 35: Performance Evaluation
[Figure: normalized runtime (lower is better) for the baseline using clwb + sfence, the baseline + persistent write queue, HOPS, HOPS + persistent write queue, and an ideal configuration that shows best-case performance but is incorrect on crash.]
Slide 36: Proposed Use-Cases
File systems: existing FS (ext4); NVM-aware FS (PMFS, BPFS)
Persistent data stores: key-value stores; relational databases
Persistent heap
Various forms of caching: webserver/file/page caching
Low-power data storage for IoT devices
Slide 37: Intel extensions for PM
CLWB (Cache Line Write Back): writes back the cached line (if present in the cache hierarchy) and may retain a clean copy.
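For reference, a small example of the usual recipe built on this instruction, using the compiler intrinsics exposed in immintrin.h (compile with -mclwb); this is the clwb + sfence baseline evaluated later.

```c
#include <immintrin.h>   /* _mm_clwb, _mm_sfence; compile with -mclwb */

/* Persist one value the "Intel extensions" way: store it, write back its
 * cacheline (which may stay cached in a clean state, unlike CLFLUSH),
 * then fence so later stores are ordered after the writeback. */
void persist_store(long *pm_addr, long value)
{
    *pm_addr = value;
    _mm_clwb(pm_addr);
    _mm_sfence();
}
```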
Slide 38: Evaluating Intel extensions
Crash consistency: sufficient to provide crash consistency, if used correctly.
Programmability: pushes the burden of data movement onto the programmer.
Performance: provides ordering and durability mixed into one mechanism.
Slide 39: Evaluating Intel extensions
Crash consistency: sufficient to provide crash consistency, if used correctly.
Programmability: pushes the burden of data movement onto the programmer; complex ordering guarantees and semantics, expressed in imprecise prose; makes the programmer worry about caches.
Slide 40
Program: 1. A = 1; 2. B = 1; 3. CLWB A; 4. CLWB B
[Figure: the stores move from the CPUs through the coherent cache hierarchy into the memory controller queues and the persistent write queue at the PM controller, while the persistent address space still holds the old values (A = 0, B = 0) until the writebacks complete.]
Slide 41: (blank)
Slide 42: (blank)
Slide 43: Intra-thread Dependency (Epoch): Track
Local timestamp (TS) register maintained at the L1 cache
Indicates the epoch TS of the current (incomplete) epoch
Local TS copied as part of the PB entry for incoming PM stores
Local TS incremented on encountering a persist barrier
(A C model of this tracking appears after slide 48.)
[Figure: CPU 1 with its local TS register, L1 cache, and persist buffer, in the initial state.]
Slide 44: Intra-thread Dependency (Epoch): Track
(Animation step, same bullets as slide 43) The local TS reads 25 and the persist buffer is empty.
Slide 45: Intra-thread Dependency (Epoch): Track
(Animation step) ST A = 1: the L1 now holds A = 1 and the persist buffer appends (A = 1, epoch 25).
Slide 46: Intra-thread Dependency (Epoch): Track
(Animation step) ST B = 1: the L1 holds B = 1 and the persist buffer appends (B = 1, epoch 25).
Slide 47: Intra-thread Dependency (Epoch): Track
(Animation step) OFENCE: the local TS is incremented to 26; the buffered epoch-25 entries are unchanged.
Slide 48: Intra-thread Dependency (Epoch): Track
(Animation step) ST A = 2: the L1 now holds A = 2 and the persist buffer appends (A = 2, epoch 26) while still holding (A = 1, epoch 25).
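A small software model of the tracking shown in slides 43-48, purely as a sketch (the real mechanism lives in the L1 and persist buffer hardware); all names and sizes are illustrative.

```c
#include <stdint.h>

#define PB_ENTRIES 64

struct pb_entry { uint64_t addr; long data; uint32_t epoch_ts; };

struct core_state {
    uint32_t        local_ts;        /* TS of the current (incomplete) epoch */
    struct pb_entry pb[PB_ENTRIES];  /* per-thread front end, FIFO           */
    int             tail;
};

/* A PM store enqueues a PB entry tagged with the current local TS. */
static void pm_store(struct core_state *c, uint64_t addr, long data)
{
    struct pb_entry *e = &c->pb[c->tail];
    c->tail = (c->tail + 1) % PB_ENTRIES;   /* enqueue only; no search */
    e->addr = addr;
    e->data = data;
    e->epoch_ts = c->local_ts;
}

/* A persist barrier (OFENCE/DFENCE) closes the epoch by bumping the TS. */
static void persist_barrier(struct core_state *c) { c->local_ts++; }

/* The slides' sequence ST A=1; ST B=1; OFENCE; ST A=2 leaves
 * (A, 25), (B, 25), (A, 26) buffered when local_ts started at 25. */
void example(struct core_state *c)
{
    pm_store(c, /* A */ 0x1000, 1);
    pm_store(c, /* B */ 0x1040, 1);
    persist_barrier(c);
    pm_store(c, /* A */ 0x1000, 2);
}
```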
Slide 49: Intra-thread Dependency (Epoch): Enforce (marked "EXAMPLE TO BE DROPPED")
Drain requests for all entries in an epoch are sent concurrently.
Epoch entries are drained after all drain ACKs are received for the previous epoch.
(A C sketch of this drain rule appears after slide 51.)
[Figure: the persist buffer holds (A = 1, 25), (B = 1, 25), (A = 2, 26); the epoch-25 drain requests ST A = 1 and ST B = 1 are sent concurrently to the two PM controllers.]
Slide 50: Intra-thread Dependency (Epoch): Enforce
(Animation step) A = 1 and B = 1 reach their PM regions and the corresponding ACKs return to the persist buffer.
Slide 51: Intra-thread Dependency (Epoch): Enforce
(Animation step) Once all ACKs for epoch 25 have arrived, the epoch-26 entry is drained with ST A = 2.
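A sketch of the drain rule from slides 49-51 in C, with invented stand-ins for the back end and PM controllers: all entries of the oldest epoch are issued concurrently, and the next epoch starts only after every ACK for the previous one.

```c
#include <stdbool.h>
#include <stdint.h>

struct pb_entry { uint64_t addr; long data; uint32_t epoch_ts; bool acked; };

/* Stand-ins for the back end / PM controllers (illustration only):
 * a real drain request is asynchronous and ACKed later by the MC. */
static void send_drain_request(struct pb_entry *e) { e->acked = true; }
static bool drain_acked(const struct pb_entry *e)  { return e->acked; }

/* Drain a FIFO of n entries; entries of one epoch are contiguous because
 * the buffer is filled in program order and epochs are closed by barriers. */
void drain_persist_buffer(struct pb_entry *pb, int n)
{
    int i = 0;
    while (i < n) {
        uint32_t epoch = pb[i].epoch_ts;

        /* Send drain requests for ALL entries of this epoch concurrently. */
        int j = i;
        for (; j < n && pb[j].epoch_ts == epoch; j++)
            send_drain_request(&pb[j]);

        /* Only after every ACK for this epoch may the next epoch start. */
        for (int k = i; k < j; k++)
            while (!drain_acked(&pb[k]))
                ;   /* spin: stand-in for waiting on hardware ACKs */

        i = j;
    }
}
```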
Slide 52: Inter-thread Dependency: Enforce (marked "EXAMPLE TO BE DROPPED")
Global TS register stored at the LLC
Records <Thread ID : Flushed Epoch TS>
PBs check this before flushing an epoch
PBs update this on DFENCEs
(A C model of this check appears after slide 56.)
[Figure: Global TS = {0:24, 1:14}; Persist Buffer 1 holds (A = 4, epoch 14, cross dependency 0:25) and Persist Buffer 0 holds (C = 2, epoch 24, no dependency).]
Slide 53: Inter-thread Dependency: Enforce
(Animation step) Persist Buffer 1 checks the Global TS before flushing A = 4: thread 0 has flushed only epoch 24 while the entry depends on 0:25, so the flush is stalled (24 < 25).
Slide 54: Inter-thread Dependency: Enforce
(Animation step) Thread 0 executes a DFENCE, so Persist Buffer 0 drains its epoch-24 entry (C = 2) toward the PM region.
Slide 55: Inter-thread Dependency: Enforce
(Animation step) The PM controller ACKs the drained entry, allowing thread 0's DFENCE to complete.
Slide 56: Inter-thread Dependency: Enforce
(Animation step) On the DFENCE, thread 0's entry in the Global TS is updated to 0:25 and Persist Buffer 0 is now empty.
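A small model of the Global TS check from slides 52-56, with invented types: an epoch that recorded a cross dependency <thread : TS> may flush only once the global table shows that thread has flushed at least that epoch, and a DFENCE publishes the issuing thread's own progress.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 64

/* Model of the Global TS at the LLC: last flushed epoch TS per thread. */
static uint32_t global_ts[MAX_THREADS];

struct epoch_tag {
    uint32_t epoch_ts;       /* this epoch's own timestamp              */
    bool     has_cross_dep;
    uint16_t dep_thread;     /* flush allowed only once                 */
    uint32_t dep_epoch_ts;   /*   global_ts[dep_thread] >= dep_epoch_ts */
};

/* PBs check this before flushing an epoch (slide 53: "Flush Stalled (24 < 25)"). */
static bool may_flush(const struct epoch_tag *e)
{
    return !e->has_cross_dep || global_ts[e->dep_thread] >= e->dep_epoch_ts;
}

/* On a DFENCE, a thread publishes the epoch TS it has flushed. */
static void publish_flush(uint16_t thread, uint32_t flushed_epoch_ts)
{
    if (global_ts[thread] < flushed_epoch_ts)
        global_ts[thread] = flushed_epoch_ts;
}
```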
Slide 57: Intra-thread Dependency (Epoch): Track (marked "ANIMATION WIP")
Duplicate of slides 43-48: the same tracking rules shown as a single animated figure, ending with (A = 1, 25), (B = 1, 25), and (A = 2, 26) in the persist buffer and local TS = 26.
Slide 58: Inter-thread Dependency: Enforce
(Animation step) With the Global TS now showing 0:25, Persist Buffer 1's cross dependency is satisfied and its entry (A = 4, epoch 14) is OK to flush.
Slide 59: Draining writes to multiple PM Controllers
[Figure: a single persist buffer drains its entries (A = 1 and B = 2 before an OFENCE, A = 3 after it) to PM regions behind separate PM controllers.]
Slide 60: Loose Ends
PM addresses are identified from the higher-order address bits.
PBs are flushed on context switches using durable persist barriers.
LLC misses to PM are stalled if the address is present in a PB:
tracked using counting Bloom filters at the PM controller (see the sketch below);
rare, as updates stay longer in the cache than in PBs.
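A sketch of a counting Bloom filter of the kind mentioned above for the PM controller: PB insertions increment a few counters per cacheline address, drains decrement them, and an LLC miss to PM is stalled only if every counter for the address is nonzero (false positives are possible, but a line that is really in a PB is never missed). Sizes and the hash are arbitrary choices for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define CBF_SLOTS  4096
#define CBF_HASHES 3

static uint8_t cbf[CBF_SLOTS];   /* small saturating counters */

static uint32_t cbf_hash(uint64_t addr, int i)
{
    /* Cheap mix of the cacheline address; any reasonable hash works. */
    uint64_t x = (addr >> 6) * 0x9E3779B97F4A7C15ull
               + (uint64_t)i * 0xBF58476D1CE4E5B9ull;
    x ^= x >> 29;
    return (uint32_t)(x % CBF_SLOTS);
}

void cbf_insert(uint64_t addr)          /* a PB entry was allocated for this line */
{
    for (int i = 0; i < CBF_HASHES; i++) {
        uint8_t *c = &cbf[cbf_hash(addr, i)];
        if (*c != UINT8_MAX) (*c)++;    /* saturate rather than wrap */
    }
}

void cbf_remove(uint64_t addr)          /* the PB entry was drained to PM */
{
    for (int i = 0; i < CBF_HASHES; i++) {
        uint8_t *c = &cbf[cbf_hash(addr, i)];
        if (*c != 0 && *c != UINT8_MAX) (*c)--;   /* saturated counters stay put */
    }
}

bool cbf_maybe_present(uint64_t addr)   /* stall an LLC miss to PM if true */
{
    for (int i = 0; i < CBF_HASHES; i++)
        if (cbf[cbf_hash(addr, i)] == 0)
            return false;               /* definitely not in any PB */
    return true;                        /* possibly in a PB: stall  */
}
```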
Slide 61: Reorder within an Epoch
[Figure: a thread executes ST A=1, ST B=2 (epoch 1), OFENCE, ST C=3 (epoch 2). Global visibility follows program order over time; in persistence order, ST A=1 and ST B=2 may be reordered within epoch 1, but both persist before ST C=3 of epoch 2.]
Slide 62: Inter-thread Dependency: Track (marked "ANIMATION PENDING")
Identified using coherence activity
Loss of exclusive permissions signals an inter-thread conflict
Local TS and Thread ID sent as part of the coherence response (pessimistic)
Recorded in the next epoch entry
(A C sketch of this tracking appears after slide 65.)
[Figure, step 1: CPU 0 (local TS 25) holds A = 1; CPU 1 (local TS 14) issues ST A = 4, which goes to the directory.]
Slide 63: Inter-thread Dependency: Track
(Step 2) The directory forwards the request to CPU 0, which loses exclusive permission for A.
Slide 64: Inter-thread Dependency: Track
(Step 3) CPU 0's coherence response carries its thread ID and local TS (0:25) back to CPU 1.
Slide 65: Inter-thread Dependency: Track
(Final state) CPU 1's L1 now holds A = 4 and its persist buffer records (A = 4, epoch 14) with cross dependency 0:25.
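A sketch of the tracking side from slides 62-65 with invented structures: the core that loses exclusive permission attaches its <thread ID, local TS> to the coherence response, and the requester records that pair in the persist-buffer entry for its store.

```c
#include <stdbool.h>
#include <stdint.h>

/* Piggybacked on the coherence response when a core gives up a line. */
struct coh_response {
    bool     carries_dep;
    uint16_t src_thread;     /* previous owner's thread ID */
    uint32_t src_local_ts;   /* previous owner's local TS  */
};

struct pb_entry {
    uint64_t addr;
    uint32_t epoch_ts;
    bool     has_cross_dep;
    uint16_t dep_thread;
    uint32_t dep_epoch_ts;
};

/* Previous owner: on losing exclusive permission, attach <tid, local TS>.
 * Pessimistic: no check whether the line really has un-persisted stores. */
void on_lose_exclusive(struct coh_response *rsp,
                       uint16_t my_thread, uint32_t my_local_ts)
{
    rsp->carries_dep  = true;
    rsp->src_thread   = my_thread;
    rsp->src_local_ts = my_local_ts;
}

/* Requester: record the dependency in the PB entry for its own store,
 * e.g. (A = 4, epoch 14) with cross dependency 0:25 on the slides. */
void record_store(struct pb_entry *e, uint64_t addr, uint32_t my_local_ts,
                  const struct coh_response *rsp)
{
    e->addr = addr;
    e->epoch_ts = my_local_ts;
    e->has_cross_dep = rsp->carries_dep;
    if (rsp->carries_dep) {
        e->dep_thread   = rsp->src_thread;
        e->dep_epoch_ts = rsp->src_local_ts;
    }
}
```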
Slide 66: Cross Dependency: Enforce
Global TS register stored at the LLC
Records <Thread ID : Flushed Epoch TS>
PBs check this before flushing an epoch
PBs update this lazily
Slide 67: Key Idea 1
Support separate hardware primitives for ordering and durability.
Slide 68: Epoch Persistency [1]
Stores to different PM addresses within an epoch can be reordered among themselves, but not across epoch boundaries in the same thread.
Stores to the same PM address from any thread must be persisted as observed in global memory order.
[1] Steven Pelley, Peter M. Chen, and Thomas F. Wenisch. Memory Persistency. ISCA '14.
Slide 69: Self Dependency (Epoch): Enforce
Drain requests for all entries in the oldest epoch are sent concurrently.
The next epoch is drained after all drain ACKs have been received for the previous epoch.
Slide 70: Self Dependency (Epoch): Track
Local timestamp (TS) register maintained at the L1 cache
Indicates the epoch TS of the current (incomplete) epoch
Local TS copied as part of the PB entry for incoming PM stores
Local TS incremented on encountering an OFENCE/DFENCE
[Figure: for the code 1. ST A = 1; 2. ST B = 1; 3. OFENCE; 4. ST A = 2, CPU 1's local TS goes from 25 to 26, the L1 holds A = 2 and B = 1, and the persist buffer holds (A = 1, 25), (B = 1, 25), (A = 2, 26).]
Slide 71: Write Ordering in HOPS
Two types of dependencies are preserved:
Cross dependencies between threads (address conflict)
Self dependencies within a thread (epoch)
Dependencies are identified at the time of insertion and enforced at the time of drain into PM.
Slide 72: Draining writes to a single PM Controller
[Figure: the persist buffer holds (A = 1, 25), (B = 1, 25), (A = 2, 26); the epoch-25 entries drain to the single PM controller, and only after their ACKs (A, B) return is the epoch-26 entry drained.]
Speaker note (slide 1): Hey everyone, today I will be describing our new hardware design for simplifying the programming of PM systems.
