
Fusion: Design Tradeoffs in Coherent Cache Hierarchies for Accelerators
Snehasish Kumar, Arrvindh Shriraman, and Naveen Vedula
School of Computing Sciences, Simon Fraser University
ISCA '15
Presented by: Keshav Mathur
Outline
Motivation
Proposed Design
Implementation details
Evaluation Methods
Results
Conclusion and Comments
References
Accelerators: We need them, but ..
+ Specialised hardware to reduce power and gain performance
+ Power and clock gating are easier
- Granularity of accelerators: fine-grained fixed-function vs. coarse domain-specific
- Programmability concerns: how easy is the discovery, and how frequent is the invocation
Specialization Spectrum
[Figure: specialization spectrum, from "Dynamically Specialized Datapaths for Energy Efficient Computing", Venkatraman Govindaraju, Chen-Han Ho, Karthikeyan Sankaralingam, Vertical Research Group, University of Wisconsin-Madison]
Fixed-Function Accelerators
Fine-grain off-loading of functions to multiple accelerators
+ Enables datapath reuse
+ Saves control-path power
- Creates a producer-consumer scenario
- Incurs frequent data movement (DMA calls)
- Forwarding buffers? Co-located, shared memory?
- Scratchpad? Cache? Stash? Both? Always?
[Figure: functions func1()..func4() off-loaded from the host to accelerators across the LLC via DMA]
Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures.
 
Architecture: Tile + ?
- Tiled architecture: multiple fixed-function accelerators on a single tile
- Independent memory hierarchy
- Independent coherence protocol
- Multiple tiles per core
Scratchpad
+ Deterministic access
+ Low load-use latency
+ Efficient memory utilization
- Incoherent, private address space
- Software managed
Caches
- Non-deterministic behaviour (hit/miss)
+ Coherent, non-polluting memory
+ Capture locality, enable reuse
- H/W address translation energy and latency
- Implicit data movement, lazy writebacks
Baseline Systems: SCRATCH and SHARED
e.g. SoC systems: AXI - ARM [1], CAPI - IBM [2]
SCRATCH: private scratchpad
- One per accelerator, with a DMA controller
- Good for compute-intensive workloads
- A large scratchpad amortizes DMA overhead
- A large size increases access latency and energy
- High overhead for kernels of sequential programs with high locality
SHARED: shared L1 per tile
- L1X takes part in MESI-based coherence
- L2 at the host maintains inclusion with L1X
- Captures spatial and temporal locality
- Provides a coherent view of memory
- Customized hierarchy needed for efficiency
- Size vs. latency tradeoffs, as it needs to be sized for multiple s/w threads
Fusion Architecture
- Private L0x per accelerator, independently sized
- Banked L1x, shared across the tile
- ACC coherence between L0x and L1x
- L1x coherent with the core via a MESI protocol implementation
- Virtual address space (no TLB) for accelerators
- Timestamp-based coherence between the accelerators' L0x caches and the L1x
- PID tags in the caches (PID1, PID2 in the figure); a structural sketch follows
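As a rough illustration of the tile organization listed above, here is a minimal C sketch of the structures involved. The bank count, sizes, and field names are assumptions for illustration only and are not taken from the paper.

    #include <stdint.h>

    #define MAX_ACC_PER_TILE 6   /* assumption; 6 is the per-tile max seen later (FFT) */
    #define L1X_BANKS        4   /* bank count is an assumption, not from the paper    */

    typedef struct {
        uint32_t size_bytes;     /* each L0x is sized independently per accelerator */
        uint8_t  pid;            /* PID tag identifying the owning accelerator      */
    } l0x_config_t;

    typedef struct {
        l0x_config_t l0x[MAX_ACC_PER_TILE]; /* ACC (timestamp) coherence with the L1x */
        uint32_t     l1x_bank_bytes;        /* shared, banked L1x                     */
        uint32_t     l1x_banks;             /* L1x is MESI-coherent with the host L2  */
    } fusion_tile_config_t;

    /* Example instantiation (sizes are placeholders). */
    static const fusion_tile_config_t tile = {
        .l0x = { { 4096, 1 }, { 8192, 2 } },  /* e.g. PID1 with 4 KB, PID2 with 8 KB */
        .l1x_bank_bytes = 32768,
        .l1x_banks = L1X_BANKS,
    };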
Fusion: Data flow
- Reduced data migration by avoiding DMAs
- Frequent write-backs between L0x and L1x cause energy overhead
- Write forwarding avoids the write-back to the cache (see the sketch below)
- Exploits the producer-consumer scenario
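To make the write-forwarding idea concrete, below is a minimal C sketch of the eviction-time decision, assuming hypothetical line metadata and hook functions (forward_to_l0x, write_back_to_l1x); the paper's actual mechanism for identifying the consumer is not shown on this slide.

    #include <stdint.h>

    typedef struct {
        uint64_t vaddr;        /* virtual block address                          */
        int      dirty;        /* line holds data not yet written back           */
        int      next_reader;  /* PID of the consumer accelerator, -1 if unknown */
    } l0x_line_t;

    void forward_to_l0x(int consumer_pid, const l0x_line_t *line);  /* hypothetical hook */
    void write_back_to_l1x(const l0x_line_t *line);                 /* hypothetical hook */

    /* On eviction from a producer's L0x: forward the dirty line directly to the
     * consumer's L0x when one is known, instead of writing it back to the shared
     * L1x. This exploits the producer-consumer pattern between accelerators. */
    void evict_from_l0x(l0x_line_t *line, int my_pid)
    {
        if (line->dirty && line->next_reader >= 0 && line->next_reader != my_pid)
            forward_to_l0x(line->next_reader, line);
        else if (line->dirty)
            write_back_to_l1x(line);
        line->dirty = 0;
    }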
Virtual Memory
[Figure: ACX-0 accelerators with private L0x caches share an L1x; the AX-TLB translates virtual addresses between the L1x and the shared L2; host-side L1D caches use MESI, the accelerator side uses ACC]
- The AX-TLB provides virtual-to-physical address translation on an L1x miss
- The physical address is used to index the shared L2 and participate in MESI actions
Virtual Memory
[Figure: same tile, with an AX-RMAP translating a physical block address into a virtual cache-line pointer in the L1x]
- Requests are filtered at the L2 directory based on the sharer list (a lookup sketch follows)
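A minimal C sketch of what an AX-RMAP lookup could look like, assuming a simple direct-mapped table; the real structure's size, indexing, and interface are not given on the slide and are hypothetical here.

    #include <stdint.h>

    #define RMAP_ENTRIES 1024    /* table size is an assumption */

    typedef struct {
        uint64_t phys_block;     /* physical block address from a host MESI request */
        uint32_t vline_ptr;      /* set/way pointer to the line in the virtual L1x  */
        uint8_t  valid;
    } rmap_entry_t;

    static rmap_entry_t rmap[RMAP_ENTRIES];

    /* Installed when the L2 response returns data plus the physical block address. */
    void rmap_insert(uint64_t phys_block, uint32_t vline_ptr)
    {
        rmap_entry_t *e = &rmap[phys_block % RMAP_ENTRIES];
        e->phys_block = phys_block;
        e->vline_ptr  = vline_ptr;
        e->valid      = 1;
    }

    /* Consulted when a host-side coherence request (already filtered by the L2
     * directory's sharer list) reaches the tile; returns -1 if the tile no
     * longer caches the block. */
    int rmap_lookup(uint64_t phys_block, uint32_t *vline_ptr)
    {
        const rmap_entry_t *e = &rmap[phys_block % RMAP_ENTRIES];
        if (e->valid && e->phys_block == phys_block) {
            *vline_ptr = e->vline_ptr;
            return 0;
        }
        return -1;
    }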
Accelerator Coherence Protocol
- Timestamp-based, self-invalidation protocol
- 2-hop protocol, saves energy
- Motivation: enable data migration between accelerators rather than concurrent sharing
- Supports sequential consistency for accelerators
- Host side: 3-hop directory-based MESI
- Lease time based on the operation and the known compute latency of the accelerator
An L0x cache line carries LTime (local time): the copy is valid while Time < LTime.
An L1x cache line carries GTime (global time): GTime = max(LTime) over all L0x copies. (A code sketch of these checks follows.)
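A minimal C sketch of the lease checks implied by these rules, assuming a single global cycle counter; the names mirror the slide's LTime/GTime, but the interfaces are illustrative, not the paper's.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { uint64_t ltime; } l0x_lease_t;   /* lease on a line in a private L0x */
    typedef struct { uint64_t gtime; } l1x_lease_t;   /* lease summary for the shared L1x */

    /* L0x: a copy self-invalidates once its lease expires, so no explicit
     * invalidation message is needed (hence the 2-hop protocol). */
    bool l0x_copy_valid(const l0x_lease_t *l, uint64_t now)
    {
        return now < l->ltime;
    }

    /* L1x: GTime is the maximum LTime handed out to any L0x copy, so the L1x
     * knows the time after which no private copy can still be valid. */
    uint64_t l1x_update_gtime(l1x_lease_t *line, const l0x_lease_t *copies, int n)
    {
        for (int i = 0; i < n; i++)
            if (copies[i].ltime > line->gtime)
                line->gtime = copies[i].ltime;
        return line->gtime;
    }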
Accelerator Memory Operations
Request: Load A, #10
- Misses in L0x -> misses in L1x
- Virtual -> physical address translation in the AX-TLB
- MESI read request at the L2
Response:
- Data + physical block address; the tile is added to the sharer list
- Physical address -> line-pointer mapping installed
- Data + line pointer stored at the L1x
- Data consumed by the accelerator (a sketch of this flow follows)
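Putting the steps above together, here is a minimal C sketch of the load-miss walk-through; every helper (l0x_lookup, ax_tlb_translate, l2_mesi_read, rmap_insert, ...) is a hypothetical hook standing in for hardware, not an interface from the paper.

    #include <stdint.h>

    /* Hypothetical hooks standing in for the hardware structures. */
    int      l0x_lookup(uint64_t vaddr, uint64_t *data);
    int      l1x_lookup(uint64_t vaddr, uint64_t *data);
    void     l0x_fill(uint64_t vaddr, uint64_t data, uint64_t ltime);
    uint32_t l1x_fill(uint64_t vaddr, uint64_t data);           /* returns line pointer */
    uint64_t ax_tlb_translate(uint64_t vaddr);                  /* virtual -> physical  */
    uint64_t l2_mesi_read(uint64_t paddr, uint64_t *pblock);    /* adds tile to sharers */
    void     rmap_insert(uint64_t pblock, uint32_t vline_ptr);  /* physical -> line ptr */

    uint64_t accelerator_load(uint64_t vaddr, uint64_t lease_len, uint64_t now)
    {
        uint64_t data;

        if (l0x_lookup(vaddr, &data))                 /* hit in the private L0x */
            return data;

        if (l1x_lookup(vaddr, &data)) {               /* hit in the shared L1x  */
            l0x_fill(vaddr, data, now + lease_len);   /* hand out an LTime lease */
            return data;
        }

        /* Miss in the L1x: translate once at the tile boundary, then issue a
         * MESI read to the shared L2; the accelerator side stays virtual. */
        uint64_t paddr  = ax_tlb_translate(vaddr);
        uint64_t pblock = 0;
        data = l2_mesi_read(paddr, &pblock);          /* response: data + phys. block */
        uint32_t vline = l1x_fill(vaddr, data);       /* data + line pointer @ L1x    */
        rmap_insert(pblock, vline);                   /* for later host-side requests */
        l0x_fill(vaddr, data, now + lease_len);
        return data;
    }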
Host Requests
Fusion vs Fusion-Dx
Evaluation Methods
[Figure: host core (C) and accelerator tile (A), each with L1D caches, sharing the L2]
- MACSIM: simulator for heterogeneous computing systems
- GEMS: simulator for the memory hierarchy
- GPROF
- Aladdin-like flow
Benchmarks Characterisation
- Most memory-intensive: FFT, DISP, TRACK, HIST
- Compute-intensive: ADPCM, SUSAN, FILTER
- Max. data re-use: FFT, Tracking, ADPCM, HIST
- Max. per-tile accelerators: 6 (FFT)
Evaluation Specs
Results
Baseline: Scratchpad
Observation: Shared L1x helps memory-intensive kernels but hurts compute-dominated kernels.
Observation: Private L0x captures spatial locality in SUSAN and ADPCM, better than SHARED.
Observation: L0x caches reduce L2 access energy by filtering DMA calls for memory-intensive programs.
Observation: Fusion's L1x further filters L2 accesses, but increased coherence messages result in no significant energy improvement.
Results
- Address-translation overheads need to be mitigated -> reduced in Fusion by removing the TLB from the critical path
- Large caches at L1x and L0x still may not capture all the working sets, and hence fail to give any energy benefits
- Write-through caches are energy expensive
- Protocol extensions like write forwarding can reduce energy consumption in the Fusion model
Comments
- Not all benchmarks with a high share percentage are evaluated on write forwarding.
- Should kernels that are candidates for write forwarding be designed as single accelerators?
- How does providing a private L0 (4-8 KB) per ACx scale beyond 6 ACx per tile (the max here)?
- Adding timestamp comparison/update logic to the cache is a major change in cache design. Does this affect access latency?
References
1. Goodridge. The Effect and Technique of System Coherence in ARM Multicore Technology.
2. POWER8 Coherent Accelerator Processor Interface. http://www-304.ibm.com/webapp/set2/sas/f/capi/home.html
3. Yakun Sophia Shao, Sam Xi, Viji Srinivasan, Gu-Yeon Wei, and David Brooks. Towards Cache Friendly Hardware Accelerators.
4. Rakesh Komuravelli, Matthew D. Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, Sarita V. Adve, and Vikram S. Adve. 2015. Stash: Have Your Scratchpad and Cache It Too. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 707-719. DOI: http://dx.doi.org/10.1145/2749469.2750374
5. Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures.
6. B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks. MachSuite: Benchmarks for Accelerator Design and Customized Architectures. In IEEE International Symposium on Workload Characterization (IISWC), 2014.
Thank you
Questions ??