
Bandwidth-aware LLC
Efficiently Coordinating Off-chip Read/Write Bandwidth
Mainak Chaudhuri
Indian Institute of Technology Kanpur
Jayesh Gaur, Sreenivas Subramoney
Processor Architecture Research Lab, Intel
Sketch
Talk in one slide
Result highlights
Introduction
Bottleneck analysis
Bandwidth-aware LLC
Prior studies
Simulation infrastructure
Simulation results
Summary
Talk in One Slide
DRAM bandwidth is shared between reads and writes; writes are drained periodically
DRAM writes are generated by the LLC policy; periodic interruption to read servicing
Our BA-LLC proposal maximizes DRAM read stretches to accelerate critical paths and controls exactly when and for how long writes can interrupt the DRAM read stream
BA-LLC relies on run-time analysis of read/write characteristics and bounds policy-related losses with sound analytical models
Result highlights
12% performance improvement in an 8-core system averaged over 50 multi-programmed workloads
Average DRAM read latency decreases by 17% due to better bandwidth scheduling
Comfortably outperforms eager write scheduling and writeback-aware LLC policies
Bridges 75% of the performance gap between the baseline and a system deploying unbounded write buffers (i.e., no interruption to reads)
RTL synthesis confirms low area overhead and timing closure
Sketch
Talk in one slide
Result highlights
Introduction
Bottleneck analysis
Bandwidth-aware LLC
Prior studies
Simulation infrastructure
Simulation results
Summary
Introduction
Detailed analysis shows that there are big holes in read BW demand in several workloads
Challenge: they are very far apart, needing larger than even 8K-entry write buffers
[Figure: read and write BW demand over time, BASELINE vs. GOAL]
Introduction
Maximize read stretch length; leads to less waiting time and improved read latency
Precisely control write-induced interruption
[Figure: BASELINE vs. GOAL BW timelines; maximize read stretches and control when and for how long writes interrupt]
A DRAM bandwidth scheduling problem managed by the LLC
Sketch
Talk in one slide
Result highlights
Introduction
Bottleneck analysis
Bandwidth-aware LLC
Prior studies
Simulation infrastructure
Simulation results
Summary
Bottleneck analysis
DRAM writes introduce three inefficiencies that hamper/delay read servicing
Channel turnaround, poor write locality, bandwidth consumption of writes
No write BW: 36% speedup; no TA: 1.4% speedup; no TA and no ACT/PRE: 11% speedup
Need BW coordination to improve beyond traditional TA and ACT/PRE overhead-reducing techniques
Bottleneck analysis
Big write buffers: a naïve way of reducing write disturbance
Reality: small write buffers due to the address CAM
Why are big write buffers good?
Bottleneck analysis
Three advantages of big write buffers
Less turnaround, improved write locality from increased bunching options, BW coordination
Observations (data-driven analysis in the paper)
Most gains with medium-sized (e.g., 256-entry) write buffers come from less TA and improved locality
Techniques that can reduce TA and improve locality would match the performance of medium-sized write buffers
A good fraction of the gains with large (e.g., 8K-entry) write buffers comes from BW coordination
TA reduction and locality improvement alone are not enough to match the performance of big WBs
Insight: need (or need to emulate) large write buffers to improve beyond traditional TA-reducing and locality-improving techniques
Sketch
Talk in one slide
Result highlights
Introduction
Bottleneck analysis
Bandwidth-aware LLC
Prior studies
Simulation infrastructure
Simulation results
Summary
BA-LLC: Design overview
Need large write buffers
Use LLC space; treat the lowest portion of the LRU stack as a write buffer (how many ways?)
How? Use a clean LRU replacement policy to buffer dirty blocks in the LLC (a la ARI [TACO 2013], NVMs)
Contribution#1: Dynamically maximize in-LLC write buffer capacity so that clean LRU stretches (and hence, DRAM read stretches) get maximized in length
Sound analytical model to bound the sacrificed LLC hit fraction while computing the maximum write buffer width during a phase of the execution
Contribution#2: Scrub LLC dirty blocks when read BW demand is low (controls write BW)
BA-LLC: In-LLC write buffer
Dirty blocks accumulate at the LRU end as clean LRU keeps victimizing clean blocks
Prematurely evicted clean blocks sacrifice hits
Volume of sacrificed hits usually increases monotonically with the in-LLC write buffer (WB) width
[Figure: LLC bank (sets x ways) with the WB ways occupying the LRU end of the stack]
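To make the clean-LRU behavior concrete, here is a minimal sketch of victim selection in one LLC set; the Block structure and the all-dirty fallback rule are illustrative assumptions, not the paper's RTL.

```python
from dataclasses import dataclass

@dataclass
class Block:
    tag: int
    dirty: bool

def pick_victim(lru_stack):
    """Clean-LRU victim selection for one LLC set (index 0 = MRU, last = LRU).
    Preferring the LRU-most clean block lets dirty blocks accumulate at the
    LRU end; fall back to the true LRU block only if every block is dirty."""
    for blk in reversed(lru_stack):   # walk from LRU toward MRU
        if not blk.dirty:
            return blk                # clean victim: no DRAM write generated
    return lru_stack[-1]              # all dirty: forced dirty victim (writeback)
```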
BA-LLC: In-LLC write buffer width
A read hit histogram (RHH) per LLC bank to compute the hit distribution across the LRU stack
LRU stack positions are indexed from i = 0 (LRU) to i = A-1 (MRU); a read hit in LRU position i = k increments RHH(B, k)
Sacrificed hit fraction for a WB of width w in LLC bank B = sum over i = 0 to w-1 of RHH(B, i) / T, where T is the total read hit count of bank B
Compute the maximum w such that this sacrificed hit fraction stays within a threshold
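A small sketch of how the maximum width could be computed from the RHH under the threshold check above; the function name, the LRU-first indexing of rhh, and the threshold parameter are assumptions for illustration.

```python
def max_wb_width(rhh, threshold):
    """Largest w such that the read hits in the w LRU-most stack positions
    (rhh[0] is the LRU position) stay within `threshold` of all read hits."""
    total = sum(rhh)                  # T: total read hits observed in this bank
    if total == 0:
        return len(rhh)               # no read hits observed yet
    sacrificed = 0
    best_w = 0
    for i, hits in enumerate(rhh):    # accumulate RHH(B, i) from the LRU end
        sacrificed += hits
        if sacrificed / total <= threshold:
            best_w = i + 1
        else:
            break
    return best_w
```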
BA-LLC: In-LLC write buffer width
Write buffer width (number of LLC ways at the LRU tail) computed periodically using the RHH
Will be referred to as n_R(B) in LLC bank B
Must avoid dirty inclusion victims (DIVs) to exploit the maximum effectiveness of clean LRU
Observation: DIVs increase in volume as the write buffer width increases
Increasing the write buffer width pushes clean victims more toward the live MRU side of the LLC
Write hits to victim positions are an indication of DIVs
Use a write hit histogram (WHH) to compute a different write buffer width n_D(B) that restricts the write hit fraction to the WB ways in LLC bank B
BA-LLC: In-LLC write buffer width
Dynamic write buffer width in LLC bank B: n(B) = max(3, min(n_R(B), n_D(B)))
At least three LLC ways are given to write buffering
Empirically decided for a 16-way LLC and our set of workloads
Iterative computation of n_R(B)
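The sketch below combines the two histogram-derived widths into n(B); the coverage-based reading of n_D(B) and the 0.9 default are assumptions, only the max(3, min(...)) rule is stated on the slide.

```python
def width_from_whh(whh, coverage=0.9):
    """Smallest width whose WB ways capture at least `coverage` of write hits
    (one plausible reading of how n_D(B) restricts write hits to the WB ways)."""
    total = sum(whh)
    if total == 0:
        return len(whh)
    captured = 0
    for i, hits in enumerate(whh):    # whh[0] is the LRU position
        captured += hits
        if captured / total >= coverage:
            return i + 1
    return len(whh)

def dynamic_wb_width(n_r, n_d, min_ways=3):
    """n(B) = max(3, min(n_R(B), n_D(B))), as given on the slide."""
    return max(min_ways, min(n_r, n_d))
```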
BA-LLC: Dirty block scrubbing
Honoring the computed maximum WB width n(B) requires cleaning the LRU tail of the LLC
Sub-problem#1: when to scrub
Sub-problem#2: actual scrubbing protocol
When to scrub
Could be based on a high water mark on the dirty population within the WB (capacity N·n(B))
Dirty population ≥ N·n(B)·τ_hwm, where N is the number of LLC sets in a bank
Doesn't work: different sets fill up with dirty blocks at different rates; a set-centric criterion is also needed
Clean LRU can be very bad for sets mostly filled with dirty blocks
BA-LLC: Dirty block scrubbing
When to scrub
At least one of the following two criteria must be met in an LLC bank B to trigger a scrub
Criterion#1 [high water mark]: dirty population in the WB of bank B ≥ N·n(B)·τ_hwm
Criterion#2 [overfull sets ∧ minimum dirty population]:
Number of full LLC sets in bank B ≥ N·τ_f
A set is full if all its WB ways are dirty
Rule to pick τ_f (< 1): larger if the LLC hit rate is higher
Dirty population in the WB of bank B ≥ N·n(B)·τ_low
Required to offer enough scrubbing options
Additionally, whenever a DRAM channel has no reads, scrubbing starts; it stops on read arrival
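The two trigger criteria reduce to a periodic check like the sketch below; all threshold values and the rule for scaling τ_f with hit rate are placeholders, not the tuned constants.

```python
def should_trigger_scrub(dirty_in_wb, full_sets, num_sets, n_b, llc_hit_rate,
                         tau_hwm=0.5, tau_low=0.25, tau_f=0.125):
    """Return True if LLC bank B should start scrubbing.
    Criterion 1: dirty population in the WB crosses the high-water mark.
    Criterion 2: enough full sets (all WB ways dirty) AND a minimum dirty
    population so the scrubber has enough candidates."""
    wb_capacity = num_sets * n_b                                # N * n(B) blocks
    crit1 = dirty_in_wb >= wb_capacity * tau_hwm
    tau_f_eff = tau_f * (2.0 if llc_hit_rate > 0.5 else 1.0)    # larger tau_f at higher hit rate
    crit2 = (full_sets >= num_sets * tau_f_eff and
             dirty_in_wb >= wb_capacity * tau_low)
    return crit1 or crit2
```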
BA-LLC: Dirty block scrubbing
Scrubbing protocol
Once triggered, the scrubber takes two passes over the sets in the triggering LLC bank
Each pass scrubs at most one block from each set
Looks up a set during idle LLC cycles
Minimizes premature scrubs (i.e., write hits to scrubbed blocks) by limiting the number of LRU tail ways to scrub from: uses the write hit distribution in the WHH to compute the maximum number of scrub ways
Periodically polls the population of pending reads to a DRAM channel and stops scrubbing if read demand crosses a threshold
Writes can demand DRAM BW only if read BW demand is low
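A rough sketch of the two-pass protocol follows; the bank interface (sets, writeback, pending_reads) is hypothetical, and the real scrubber additionally gates each set lookup on idle LLC cycles.

```python
def scrub_bank(bank, max_scrub_ways, read_threshold, pending_reads):
    """Two passes over the triggering bank. Each pass scrubs at most one dirty
    block per set, only from the max_scrub_ways LRU-tail ways (bounded using
    the WHH to limit premature scrubs), and scrubbing stops as soon as the
    pending read population for the DRAM channel crosses read_threshold."""
    for _ in range(2):                                # two passes over the bank
        for s in range(bank.num_sets):
            if pending_reads() > read_threshold:
                return                                # give DRAM BW back to reads
            tail = bank.sets[s][-max_scrub_ways:]     # LRU-tail scrub ways only
            for blk in reversed(tail):                # start from the LRU-most block
                if blk.dirty:
                    bank.writeback(blk)               # issue the scrub write to DRAM
                    blk.dirty = False
                    break                             # at most one block per set per pass
```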
BA-LLC: Dirty block scrubbing
Set traversal order is important for efficient scrubbing
[Figure: LLC bank with the n(B) scrub ways at the LRU end; sets are partitioned for parallelism and consecutive sets are visited for locality]
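One way to realize the traversal in the figure, as a sketch: emit short runs of consecutive sets (DRAM row locality) while round-robining across partitions of the set index space (parallelism). The partition count and run length are illustrative choices, not values from the design.

```python
def scrub_set_order(num_sets, num_partitions=4, run_length=8):
    """Yield set indices as short consecutive runs taken round-robin from
    contiguous partitions: the runs preserve write locality, the round-robin
    across partitions spreads scrub writes for parallelism."""
    part_size = num_sets // num_partitions
    cursors = [p * part_size for p in range(num_partitions)]
    limits = [(p + 1) * part_size for p in range(num_partitions)]
    limits[-1] = num_sets                     # absorb any remainder sets
    while any(c < l for c, l in zip(cursors, limits)):
        for p in range(num_partitions):
            for _ in range(run_length):
                if cursors[p] < limits[p]:
                    yield cursors[p]
                    cursors[p] += 1
```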
BA-LLC: Design synthesis/Area
FPGA synthesis for rapid testing with a very large set of stimuli
To gain confidence about RTL correctness
ASIC flow with TSMC 45 nm process for a target cycle time of 0.25 ns
Additional logic area when the LLC runs the clean LRU policy: 0.01618 mm² per LLC bank
Additional logic area when the LLC runs the clean NRU policy: 0.01669 mm² per LLC bank
Additional logic area when the LLC runs the clean SRRIP policy: 0.02103 mm² per LLC bank
Storage overhead: slightly over 1 Kbit per LLC bank
Sketch
Talk in one slide
Result highlights
Introduction
Bottleneck analysis
Bandwidth-aware LLC
Prior studies
Simulation infrastructure
Simulation results
Summary
Prior studies
Eager writeback proposals
Eagerly send writes to idle DRAM channel/rank/bank
Eager writeback [MICRO 2000]
Improve write locality by bunching dirty blocks from a few statically predefined LRU ways of the LLC falling on the same DRAM row
Virtual write queue (VWQ) [ISCA 2010]
DRAM-aware writeback (DAWB) [UT TR 2010]
Last write prediction-guided (LWPG) writeback [ISCA 2012]
Dirty block index (DBI) + aggressive writeback (AWB) [ISCA 2014]
Prior studies
Proposals to reduce write volume to DRAM
Clean LRU with adaptive insertion in the LLC to optimize read+write traffic to DRAM: ARI [TACO 2013]
Retain LLC blocks with write reuse: WADE [TACO 2013]
Our proposal (BA-LLC)
Goal is to effect BW coordination between DRAM reads and writes; maximizes read stretch length
Goes beyond eager writebacks, which focus mostly on channel turnaround and write locality
Doesn't attempt to reduce the volume of DRAM writes or reads, but repositions them in time
Sketch
Talk in one slide
Result highlights
Introduction
Bottleneck analysis
Bandwidth-aware LLC
Prior studies
Simulation infrastructure
Simulation results
Summary
Simulation infrastructure
Chip-multiprocessor (CMP) with 8 cores
Each core: private iL1 cache (32 KB, 8-way), dL1 cache (32 KB, 8-way), unified L2 cache (256 KB, 8-way)
Shared LLC: 8 MB 16-way / 16 MB 16-way
Mesh interconnect
Dual-channel DDR3-1600 and DDR3-2133 DRAM modules
Fifty multiprogrammed workloads
35 heterogeneous and 15 rate mixes of SPEC CPU 2006
500M representative dynamic instructions per thread
Sketch
Talk in one slide
Result highlights
Introduction
Bottleneck analysis
Bandwidth-aware LLC
Prior studies
Simulation infrastructure
Simulation results
Summary
Simulation results: Speedup
[Chart: speedup (y-axis 0.97 to 1.15) comparing Base, VWQ, DAWB, LWPG, DBI+AWB, and BA-LLC under LRU, NRU, SRRIP, SRRIP+, SHiP, and ARI LLC policies]
Simulation results: S-curve (LRU)
Average speedup is 12%
Sources of performance in BA-LLC
Average read stretch length increases by 2.4x in BA-LLC
Number of DRAM reads/writes remains almost unchanged on average
Average write buffer capacity: about five LLC ways, ~2.5 MB (~40K-entry WB)
DRAM read latency improves by 17% on average
DRAM write throughput improves by 50%
DRAM write row hit rate improves from 35% to 40%
Approaching unbounded WB
BA-LLC with a 32-entry WB delivers the performance of the baseline with a 1K-entry WB and bridges 75% of the performance gap between the baseline and an infinite WB
BA-LLC with an 8-entry WB delivers better performance than the 32-entry baseline
Less complex WB with better performance
Sketch
Talk in one slide
Result highlights
Introduction
Bottleneck analysis
Bandwidth-aware LLC
Prior studies
Simulation infrastructure
Simulation results
Summary
Summary
Bandwidth-aware LLC policy proposal to intelligently schedule DRAM read and write bandwidth demands from the LLC side
Proposal offers long stretches of exclusive DRAM bandwidth to reads
Enabled by a dynamically computed in-LLC write buffer width to maximize read stretch lengths
Accompanied by a smart dirty block scrubber
12% speedup averaged over fifty eight-way multiprogrammed workloads
Bridges 75% of the performance gap with an unbounded write buffer
Thank you