
Fusion: Design Tradeoffs in Coherent Cache Hierarchies for Accelerators
Snehasish Kumar, Arrvindh Shriraman, and Naveen Vedula
School of Computing Sciences, Simon Fraser University
ISCA '15
Presented by: Keshav Mathur
Outline
Motivation
Proposed Design
Implementation details
Evaluation Methods
Results
Conclusion and Comments
References
Accelerators: We need them, but ..
+ Specialised hardware to reduce power and gain performance
+ Power and clock gating are easier
- Granularity of accelerators: fine-grained fixed-function vs. coarse domain-specific
- Programmability concerns: how easy is the discovery, and how frequent is the invocation
Specialization Spectrum
[Figure: specialization spectrum, from "Dynamically Specialized Datapaths for Energy Efficient Computing", Venkatraman Govindaraju, Chen-Han Ho, Karthikeyan Sankaralingam, Vertical Research Group, University of Wisconsin-Madison]
Fixed-Function Accelerators
Fine-grain off-loading of functions to multiple accelerators
+ Enables datapath reuse
+ Saves control-path power
- Creates a producer-consumer scenario
- Incurs frequent data movement (DMA calls)
- Forwarding buffers? Co-located, shared memory?
- Scratchpad? Cache? Stash? Both? Always?
[Figure: functions func1()..func4() off-loaded from the host to accelerators across the LLC via DMA]
Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures.
 
Architecture: Tile + ?
- Tiled architecture: multiple fixed-function accelerators on a single tile
- Independent memory hierarchy
- Independent coherence protocol
- Multiple tiles per core
Scratchpad
+ Deterministic access
+ Low load-use latency
+ Efficient memory utilization
- Incoherent, private address space
- Software managed
Caches
- Non-deterministic behaviour (hit/miss)
+ Coherent, non-polluting memory
+ Capture locality, enable reuse
- H/W address translation energy and latency
- Implicit data movement, lazy writebacks
Baseline Systems: SCRATCH and SHARED
e.g. SoC systems: AXI - ARM [1], CAPI - IBM [2]
SCRATCH: private scratchpad
- One per accelerator, with a DMA controller
- Good for compute-intensive workloads
- A large scratchpad amortizes DMA overhead
- A large size increases access latency and energy
- High overhead for kernels of sequential programs with high locality
SHARED: shared L1 per tile
- L1X takes part in MESI-based coherence
- L2 at the host maintains inclusion with L1X
- Captures spatial and temporal locality
- Provides a coherent view of memory
- Customized hierarchy needed for efficiency
- Size vs. latency tradeoffs, as it needs to be sized for multiple s/w threads
Fusion Architecture
- Private L0x per accelerator, independently sized
- Banked L1x, shared across the tile
- ACC coherence between L0x and L1x
- L1x coherent with the core via a MESI protocol implementation
- Virtual address space (no TLB) for accelerators
- Timestamp-based coherence between the accelerators' L0x caches and the L1x
- PID tags in the caches (PID1, PID2 in the figure); a structural sketch follows
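As a rough illustration of the tile organization listed above, here is a minimal C sketch of the structures involved. The bank count, sizes, and field names are assumptions for illustration only and are not taken from the paper.

    #include <stdint.h>

    #define MAX_ACC_PER_TILE 6   /* assumption; 6 is the per-tile max seen later (FFT) */
    #define L1X_BANKS        4   /* bank count is an assumption, not from the paper    */

    typedef struct {
        uint32_t size_bytes;     /* each L0x is sized independently per accelerator */
        uint8_t  pid;            /* PID tag identifying the owning accelerator      */
    } l0x_config_t;

    typedef struct {
        l0x_config_t l0x[MAX_ACC_PER_TILE]; /* ACC (timestamp) coherence with the L1x */
        uint32_t     l1x_bank_bytes;        /* shared, banked L1x                     */
        uint32_t     l1x_banks;             /* L1x is MESI-coherent with the host L2  */
    } fusion_tile_config_t;

    /* Example instantiation (sizes are placeholders). */
    static const fusion_tile_config_t tile = {
        .l0x = { { 4096, 1 }, { 8192, 2 } },  /* e.g. PID1 with 4 KB, PID2 with 8 KB */
        .l1x_bank_bytes = 32768,
        .l1x_banks = L1X_BANKS,
    };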
Fusion: Data flow
- Reduced data migration by avoiding DMAs
- Frequent write-backs between L0x and L1x cause energy overhead
- Write forwarding avoids the write-back to the cache (see the sketch below)
- Exploits the producer-consumer scenario
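To make the write-forwarding idea concrete, below is a minimal C sketch of the eviction-time decision, assuming hypothetical line metadata and hook functions (forward_to_l0x, write_back_to_l1x); the paper's actual mechanism for identifying the consumer is not shown on this slide.

    #include <stdint.h>

    typedef struct {
        uint64_t vaddr;        /* virtual block address                          */
        int      dirty;        /* line holds data not yet written back           */
        int      next_reader;  /* PID of the consumer accelerator, -1 if unknown */
    } l0x_line_t;

    void forward_to_l0x(int consumer_pid, const l0x_line_t *line);  /* hypothetical hook */
    void write_back_to_l1x(const l0x_line_t *line);                 /* hypothetical hook */

    /* On eviction from a producer's L0x: forward the dirty line directly to the
     * consumer's L0x when one is known, instead of writing it back to the shared
     * L1x. This exploits the producer-consumer pattern between accelerators. */
    void evict_from_l0x(l0x_line_t *line, int my_pid)
    {
        if (line->dirty && line->next_reader >= 0 && line->next_reader != my_pid)
            forward_to_l0x(line->next_reader, line);
        else if (line->dirty)
            write_back_to_l1x(line);
        line->dirty = 0;
    }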
Virtual Memory
[Figure: ACX-0 accelerators with private L0x caches share an L1x; the AX-TLB translates virtual addresses between the L1x and the shared L2; host-side L1D caches use MESI, the accelerator side uses ACC]
- The AX-TLB provides virtual-to-physical address translation on an L1x miss
- The physical address is used to index the shared L2 and participate in MESI actions
Virtual Memory
[Figure: same tile, with an AX-RMAP translating a physical block address into a virtual cache-line pointer in the L1x]
- Requests are filtered at the L2 directory based on the sharer list (a lookup sketch follows)
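A minimal C sketch of what an AX-RMAP lookup could look like, assuming a simple direct-mapped table; the real structure's size, indexing, and interface are not given on the slide and are hypothetical here.

    #include <stdint.h>

    #define RMAP_ENTRIES 1024    /* table size is an assumption */

    typedef struct {
        uint64_t phys_block;     /* physical block address from a host MESI request */
        uint32_t vline_ptr;      /* set/way pointer to the line in the virtual L1x  */
        uint8_t  valid;
    } rmap_entry_t;

    static rmap_entry_t rmap[RMAP_ENTRIES];

    /* Installed when the L2 response returns data plus the physical block address. */
    void rmap_insert(uint64_t phys_block, uint32_t vline_ptr)
    {
        rmap_entry_t *e = &rmap[phys_block % RMAP_ENTRIES];
        e->phys_block = phys_block;
        e->vline_ptr  = vline_ptr;
        e->valid      = 1;
    }

    /* Consulted when a host-side coherence request (already filtered by the L2
     * directory's sharer list) reaches the tile; returns -1 if the tile no
     * longer caches the block. */
    int rmap_lookup(uint64_t phys_block, uint32_t *vline_ptr)
    {
        const rmap_entry_t *e = &rmap[phys_block % RMAP_ENTRIES];
        if (e->valid && e->phys_block == phys_block) {
            *vline_ptr = e->vline_ptr;
            return 0;
        }
        return -1;
    }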
Accelerator Coherence Protocol
- Timestamp-based, self-invalidation protocol
- 2-hop protocol, saves energy
- Motivation: enable data migration between accelerators rather than concurrent sharing
- Supports sequential consistency for accelerators
- Host side: 3-hop directory-based MESI
- Lease time based on the operation and the known compute latency of the accelerator
An L0x cache line carries LTime (local time): the copy is valid while Time < LTime.
An L1x cache line carries GTime (global time): GTime = max(LTime) over all L0x copies. (A code sketch of these checks follows.)
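A minimal C sketch of the lease checks implied by these rules, assuming a single global cycle counter; the names mirror the slide's LTime/GTime, but the interfaces are illustrative, not the paper's.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { uint64_t ltime; } l0x_lease_t;   /* lease on a line in a private L0x */
    typedef struct { uint64_t gtime; } l1x_lease_t;   /* lease summary for the shared L1x */

    /* L0x: a copy self-invalidates once its lease expires, so no explicit
     * invalidation message is needed (hence the 2-hop protocol). */
    bool l0x_copy_valid(const l0x_lease_t *l, uint64_t now)
    {
        return now < l->ltime;
    }

    /* L1x: GTime is the maximum LTime handed out to any L0x copy, so the L1x
     * knows the time after which no private copy can still be valid. */
    uint64_t l1x_update_gtime(l1x_lease_t *line, const l0x_lease_t *copies, int n)
    {
        for (int i = 0; i < n; i++)
            if (copies[i].ltime > line->gtime)
                line->gtime = copies[i].ltime;
        return line->gtime;
    }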
Accelerator Memory Operations
Request: Load A, #10
- Misses in L0x -> misses in L1x
- Virtual -> physical address translation in the AX-TLB
- MESI read request at the L2
Response:
- Data + physical block address; the tile is added to the sharer list
- Physical address -> line-pointer mapping installed
- Data + line pointer stored at the L1x
- Data consumed by the accelerator (a sketch of this flow follows)
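Putting the steps above together, here is a minimal C sketch of the load-miss walk-through; every helper (l0x_lookup, ax_tlb_translate, l2_mesi_read, rmap_insert, ...) is a hypothetical hook standing in for hardware, not an interface from the paper.

    #include <stdint.h>

    /* Hypothetical hooks standing in for the hardware structures. */
    int      l0x_lookup(uint64_t vaddr, uint64_t *data);
    int      l1x_lookup(uint64_t vaddr, uint64_t *data);
    void     l0x_fill(uint64_t vaddr, uint64_t data, uint64_t ltime);
    uint32_t l1x_fill(uint64_t vaddr, uint64_t data);           /* returns line pointer */
    uint64_t ax_tlb_translate(uint64_t vaddr);                  /* virtual -> physical  */
    uint64_t l2_mesi_read(uint64_t paddr, uint64_t *pblock);    /* adds tile to sharers */
    void     rmap_insert(uint64_t pblock, uint32_t vline_ptr);  /* physical -> line ptr */

    uint64_t accelerator_load(uint64_t vaddr, uint64_t lease_len, uint64_t now)
    {
        uint64_t data;

        if (l0x_lookup(vaddr, &data))                 /* hit in the private L0x */
            return data;

        if (l1x_lookup(vaddr, &data)) {               /* hit in the shared L1x  */
            l0x_fill(vaddr, data, now + lease_len);   /* hand out an LTime lease */
            return data;
        }

        /* Miss in the L1x: translate once at the tile boundary, then issue a
         * MESI read to the shared L2; the accelerator side stays virtual. */
        uint64_t paddr  = ax_tlb_translate(vaddr);
        uint64_t pblock = 0;
        data = l2_mesi_read(paddr, &pblock);          /* response: data + phys. block */
        uint32_t vline = l1x_fill(vaddr, data);       /* data + line pointer @ L1x    */
        rmap_insert(pblock, vline);                   /* for later host-side requests */
        l0x_fill(vaddr, data, now + lease_len);
        return data;
    }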
Host Requests
Fusion vs Fusion-Dx
Evaluation Methods
[Figure: host core (C) and accelerator tile (A), each with L1D caches, sharing the L2]
- MACSIM: simulator for heterogeneous computing systems
- GEMS: simulator for the memory hierarchy
- GPROF
- Aladdin-like flow
Benchmarks Characterisation
- Most memory-intensive: FFT, DISP, TRACK, HIST
- Compute-intensive: ADPCM, SUSAN, FILTER
- Max. data re-use: FFT, Tracking, ADPCM, HIST
- Max. per-tile accelerators: 6 (FFT)
Evaluation Specs
Results
Baseline: Scratchpad
Observation: Shared L1x helps memory-intensive kernels but hurts compute-dominated kernels.
Observation: Private L0x captures spatial locality in SUSAN and ADPCM, better than SHARED.
Observation: L0x caches reduce L2 access energy by filtering DMA calls for memory-intensive programs.
Observation: Fusion's L1x further filters L2 accesses, but increased coherence messages result in no significant energy improvement.
Results
- Address-translation overheads need to be mitigated -> reduced in Fusion by removing the TLB from the critical path
- Large caches at L1x and L0x still may not capture all the working sets, and hence fail to give any energy benefits
- Write-through caches are energy expensive
- Protocol extensions like write forwarding can reduce energy consumption in the Fusion model
Comments
- Not all benchmarks with a high share percentage are evaluated on write forwarding.
- Should kernels that are candidates for write forwarding be designed as single accelerators?
- How does providing a private L0 (4-8 KB) per ACx scale beyond 6 ACx per tile (the max here)?
- Adding timestamp comparison/update logic to the cache is a major change in cache design. Does this affect access latency?
References
1. Goodridge. The Effect and Technique of System Coherence in ARM Multicore Technology.
2. POWER8 Coherent Accelerator Processor Interface. http://www-304.ibm.com/webapp/set2/sas/f/capi/home.html
3. Yakun Sophia Shao, Sam Xi, Viji Srinivasan, Gu-Yeon Wei, and David Brooks. Towards Cache Friendly Hardware Accelerators.
4. Rakesh Komuravelli, Matthew D. Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, Sarita V. Adve, and Vikram S. Adve. 2015. Stash: Have Your Scratchpad and Cache It Too. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 707-719. DOI: http://dx.doi.org/10.1145/2749469.2750374
5. Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures.
6. B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks. MachSuite: Benchmarks for Accelerator Design and Customized Architectures. In IEEE International Symposium on Workload Characterization (IISWC), 2014.
Thank you
Questions ??