Cooperative Cache Scrubbing for Efficient Memory Management in Multicore Systems


Cooperative Cache Scrubbing is a software-hardware technique for managed-language runtimes on multicore systems, where objects are allocated rapidly and die young. The runtime communicates semantic information to the hardware caches: lines that hold dead nursery data are scrubbed (invalidated, or their dirty bit is cleared) so that they are not written back to DRAM, and lines that are about to be zero-initialized are zeroed in the cache without first being fetched. The slides report reductions of 59% in traffic, 14% in DRAM energy, and 4.6% in execution time.

  • Memory Management
  • Multicore Systems
  • Cache Scrubbing
  • Memory Efficiency
  • Energy Savings




Presentation Transcript


  1. Cooperative Cache Scrubbing. Jennifer B. Sartor, Wim Heirman, Steve Blackburn*, Lieven Eeckhout, Kathryn S. McKinley^. PACT 2014.

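  2. Multicore Challenge. Applications running on a managed language runtime environment allocate objects rapidly, and those objects are short-lived. Diagram: application, managed language runtime environment, and operating system on top of a chip with four processors (P), each with a cache ($), a last-level cache (LLC), and off-chip memory (DRAM).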

  3. Problem: Allocation Wall. Objects are rapidly allocated and short-lived. Diagram: the same system stack and four-processor chip, with lines marked DEAD in the LLC and memory (DRAM).

  4. Problem: Bandwidth & Power Wall. Objects are rapidly allocated and short-lived, and new memory is zero-initialized. Diagram: the same system stack and four-processor chip, with zeroed lines (0000000) alongside DEAD lines in the LLC and memory (DRAM).

  5. Cooperative Cache Scrubbing. Objects are rapidly allocated and short-lived, and new memory is zero-initialized. Diagram: the same system stack and four-processor chip, with DEAD and zeroed (0000000) lines in the LLC and memory (DRAM), annotated with write and read labels.

  6. Generational Garbage Collection. Young objects die quickly. The heap is split into a nursery and a mature space. The nursery is traced for live objects, which are copied to the mature space; the nursery is then reclaimed en masse. Diagram: an 8MB LLC holding DEAD lines.
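The three nursery steps on this slide (trace, copy, reclaim en masse) can be sketched as a toy model. This is only an illustration, not the Generational Immix collector used in the talk: the object representation (a dict from object name to the names it references) and the function name `collect_nursery` are assumptions made for the example.

```python
def collect_nursery(nursery, roots, mature):
    """Toy nursery collection: trace, copy survivors, reclaim en masse.

    nursery and mature are dicts mapping an object name to the list of
    names it references (a hypothetical representation for illustration).
    Returns the number of nursery objects that were dead.
    """
    live = set()
    stack = [name for name in roots if name in nursery]
    # Trace: follow references from the roots, staying inside the nursery.
    while stack:
        name = stack.pop()
        if name in live:
            continue
        live.add(name)
        stack.extend(ref for ref in nursery[name] if ref in nursery)
    # Copy: survivors move to the mature space.
    for name in live:
        mature[name] = nursery[name]
    dead = len(nursery) - len(live)
    # Reclaim en masse: the whole nursery range is now dead data.
    nursery.clear()
    return dead
```

After such a collection, everything left in the nursery address range is dead, which is what makes the range a candidate for scrubbing.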

  7. Dead Lines in LLC (8MB) (chart)

  8. Dead Data Written Back? Diagram: application, managed language runtime environment, and operating system on top of a chip with four processors (P), each with a cache ($), with DEAD lines in the LLC and memory (DRAM).

  9. Useless Write Backs (8MB LLC) (chart)

  10. Cooperative Cache Scrubbing. Communicate the managed language's semantic information to hardware. Caches: 'scrub' dead lines by invalidating them or unsetting the dirty bit (targets DRAM writes); zero lines without fetch (targets DRAM reads). Result: better cache management, avoided traffic to DRAM, and saved DRAM energy.
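To make the effect of scrubbing and zeroing without fetch concrete, here is a minimal sketch of a toy write-back cache that counts DRAM reads and writes. It is not the Sniper model from the talk: the fully associative LRU organization, the 64-byte line size, the class name `ToyCache`, and the workload in `run` (a nursery twice the size of the cache, rewritten every round) are all assumptions chosen to keep the example small.

```python
from collections import OrderedDict

LINE = 64  # assumed cache line size in bytes


class ToyCache:
    """Fully associative write-back cache with LRU replacement (toy model)."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()  # line address -> dirty flag, oldest first
        self.dram_reads = 0
        self.dram_writes = 0

    def _install(self, addr, dirty):
        # Evict the least recently used line if the cache is full.
        if addr not in self.lines and len(self.lines) >= self.capacity:
            _, victim_dirty = self.lines.popitem(last=False)
            if victim_dirty:
                self.dram_writes += 1  # write-back of a dirty victim
        self.lines[addr] = dirty
        self.lines.move_to_end(addr)

    def write(self, addr):
        # Ordinary store: write-allocate fetches the line on a miss.
        if addr not in self.lines:
            self.dram_reads += 1
        self._install(addr, True)

    def zero(self, addr):
        # Zero without fetch: the whole line is overwritten, so no DRAM read.
        self._install(addr, True)

    def invalidate(self, addr):
        # Scrub: drop the line without writing it back.
        self.lines.pop(addr, None)


def run(scrub, zero_without_fetch, rounds=4, nursery_lines=16, cache_lines=8):
    """Fill the nursery each round; optionally scrub it after 'collection'."""
    cache = ToyCache(cache_lines)
    for _ in range(rounds):
        for i in range(nursery_lines):
            addr = i * LINE
            if zero_without_fetch:
                cache.zero(addr)
            else:
                cache.write(addr)
        if scrub:
            # After a nursery collection the whole range is dead.
            for i in range(nursery_lines):
                cache.invalidate(i * LINE)
    return cache.dram_reads, cache.dram_writes
```

With the default parameters, `run(False, False)` yields 64 reads and 56 writes, `run(True, False)` cuts the writes to 32, and `run(True, True)` also removes the reads. The remaining writes come from lines evicted while the nursery was still in use, which scrubbing cannot help with in this toy.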

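  11. Dead Data Written in Cache? Young objects die quickly. The heap is split into a nursery and a mature space. The nursery is traced for live objects, which are copied to the mature space; the nursery is then reclaimed en masse. Diagram: an LLC holding DEAD lines, one of which is being overwritten with zeros (0000000).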

  12. Dead Lines Written in LLC (8MB) (chart)

  13. SW-HW Cooperative Scrubbing. Software: identify a cache line-aligned dead/zero region; Generational Immix collector (stop-the-world); after a nursery collection, call the scrub instruction on each line in the entire range; call zero instructions to zero a region (32KB). Hardware: detailed on slide 14.
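The software side described here (find the line-aligned part of a dead region, then issue one instruction per line) can be sketched as follows. This is a hedged illustration, not the Jikes RVM implementation: the 64-byte line size, the function names `aligned_lines` and `scrub_region`, and the `scrub_line` callback standing in for the hardware instruction are all assumptions.

```python
LINE = 64  # assumed cache line size in bytes


def aligned_lines(start, end, line=LINE):
    """Return addresses of cache lines that lie entirely inside [start, end)."""
    first = (start + line - 1) // line * line  # round start up to a line boundary
    last = end // line * line                  # round end down to a line boundary
    return list(range(first, last, line))


def scrub_region(start, end, scrub_line):
    """Call scrub_line(addr) once per fully covered line; return the count.

    scrub_line is a hypothetical stand-in for a per-line instruction such
    as clinvalidate, clclean, or clzero.
    """
    lines = aligned_lines(start, end)
    for addr in lines:
        scrub_line(addr)
    return len(lines)
```

Only lines fully inside the region are touched, so partially covered lines at either end keep any live data they share with neighbouring objects. Under the 64-byte assumption, a 32KB zero region corresponds to 512 per-line calls.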

  14. SW-HW Cooperative Scrubbing. Hardware, scrubbing (LLC): clinvalidate invalidates the cache line; clundirty clears the dirty bit; clclean clears the dirty bit and moves the line to LRU. Hardware, zeroing (L2): clzero zeroes a cache line without fetch. Existing counterparts noted on the slide: PowerPC's dcbi, ARM (invalidate) and PowerPC's dcbz (zero). Modifications to the MESI cache coherence protocol: back-propagation from the LLC to the L1/L2 cache levels; local coherence transitions (no off-chip).

  15. MESI Coherence Transitions (scrubbing). State diagram over M, E, S, and I with edges labeled clclean/- and clinvalidate/-, i.e. neither instruction triggers a bus action.

  16. MESI Coherence Transitions (zeroing). State diagram over M, E, S, and I with edges labeled clzero/- and clzero/BusInvalidate, plus incoming BusInvalidate edges marked as external (from another LLC).
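The transitions on slides 15 and 16 can be written down as a small lookup table. This is a sketch under assumptions, not the authors' protocol specification: the slides confirm that the scrub instructions cause no bus action and that clzero issues BusInvalidate on some edges, but the exact source and destination states below (for example clclean leaving S unchanged, or any instruction being a no-op in I) are my reading of the flattened diagrams and are labeled as assumptions in the code.

```python
from enum import Enum


class State(Enum):
    M = "modified"
    E = "exclusive"
    S = "shared"
    I = "invalid"


NO_BUS = "-"
BUS_INVALIDATE = "BusInvalidate"

# (current state, instruction) -> (next state, bus action).
# ASSUMPTION: the state assignments are reconstructed from the slide labels.
TRANSITIONS = {}
for s in State:
    # Scrub by invalidation: drop the line locally, no write-back, no bus action.
    TRANSITIONS[(s, "clinvalidate")] = (State.I, NO_BUS)
    # Scrub by clearing the dirty bit: a modified line becomes clean (E).
    nxt = State.E if s is State.M else s
    TRANSITIONS[(s, "clundirty")] = (nxt, NO_BUS)
    TRANSITIONS[(s, "clclean")] = (nxt, NO_BUS)  # clclean also moves the line to LRU
    # Zero without fetch: the line ends up modified; other copies must be
    # invalidated when this cache does not already hold it exclusively.
    bus = NO_BUS if s in (State.M, State.E) else BUS_INVALIDATE
    TRANSITIONS[(s, "clzero")] = (State.M, bus)


def step(state, instruction):
    """Return (next_state, bus_action) for a locally issued instruction."""
    return TRANSITIONS[(state, instruction)]


def snoop_bus_invalidate(state):
    """A cache observing an external BusInvalidate drops its copy."""
    return State.I
```

Keeping the table separate from `step` makes it easy to compare each entry against the diagram and to adjust any state assignment that the original figures contradict.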

  17. Methodology. Sniper simulator: 4 cores, 8MB shared L3 (LLC), McPAT; extensions for JVM (works with the JIT compiler, emulates the futex and nanosleep system calls, JVM-simulator communication with a new instruction). Jikes RVM 3.1.2 and DaCapo benchmarks: Generational Immix garbage collector; 4 application threads, 4 GC threads; 2x minimum heap; replay compilation, 2nd invocation.

  18-20. DRAM Writes (8MB nursery). Chart: writes relative to baseline (%), y-axis 0 to 120; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.

  21-25. DRAM Reads (8MB nursery). Chart: reads relative to baseline (%), y-axis 0 to 225; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.

  26-27. Dynamic DRAM Energy (8MB nursery). Chart: energy reduction (%), y-axis 0 to 80, including the mean; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.

  28-29. Total DRAM Energy. Chart: energy reduction (%), y-axis -5 to 25, grouped by 4M, 8M, and 16M; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero; annotation: -22%.

  30. Total DRAM Traffic. Chart: traffic reduction (%), y-axis -50 to 100, grouped by 4M, 8M, and 16M; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero; annotation: -14x.

  31. clclean+clzero Improvements. Chart: y-axis 0% to 100%; series: 4MB, 8MB, 16MB.

  32. Related Work. Cooperative cache management: ESKIMO by Isen & John, Micro 09 (useless reads and writes to DRAM by sequential C programs; reduces energy; requires a large map in hardware and extra cache bits); Wang et al., PACT 02 / ISCA 03, and Sartor et al., 05 (C and Fortran static analysis to give cache hints to evict or keep data). Zero initialization [Yang et al., OOPSLA 11]: studied costs in time, cache, and traffic; use non-temporal writes to DRAM, increase bandwidth.

  33. Conclusions. Software-hardware cooperative cache scrubbing: leverages region allocation semantics; changes to the MESI coherence protocol; new multicore architectural simulation methodology. Reductions: 59% traffic, 14% DRAM energy, 4.6% execution time. http://users.elis.ugent.be/~jsartor/


  35. Execution Time (8MB nursery). Chart: execution time reduction (%), y-axis 0 to 7, including the mean; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.

  36. Changes to MESI coherence protocol. Table of actions per state (M, E, S, I) for clinvalidate, clundirty/clclean, and clzero. Recoverable entries: clinvalidate invalidates the L1/L2 copies and moves the line to I, with no write-back (no WB) for a line in M; clundirty/clclean invalidates the L1/L2 copies, a line in M moves to E with no WB, and clclean also moves the line to LRU; clzero from S or I issues BusInvalidate and moves the line to M.

  37. Total DRAM Energy (8MB nursery). Chart: energy reduction (%), y-axis -10 to 60; series: clinvalidate, clundirty, clclean, clzero, clclean+clzero.

  38. Execution Time Across Nurseries (chart)

  39. Execution Time (chart)

  40. Dynamic DRAM Energy 8MB Nursery (chart)
