Hybrid Optimization Heuristic Instruction Scheduling for Accelerator Codesign

 
Hybrid Optimization/Heuristic
Instruction Scheduling for
Programmable Accelerator Codesign
 
Tony Nowatzki
   
Newsha Ardalani
 Karthikeyan Sankaralingam
   Jian Weng
 
 
 
 
PACT 2018, Nov. 3rd
 
>
 
Google TPU
R
econfigurable
A
rchitectures
C
oarse
G
rain
R
econfigurable
A
rchitectures
C
oarse
G
rain
 
Systolic Array
 
What if I don’t exactly want to
do dense matrix multiplies
with  large matrices?
 
Google TPU
 
Systolic CGRA
 (dedicated PE CGRA)
Problem: How can we map
computation to this effectively?
 
Perfectly-pipelined
execution
(1 computation
per cycle)
 
Examples: Charm, Camel,
Stream-Dataflow, FPCA,
DySER (almost), LSSD,
Morphosys, etc …
+
+
+
+
+
+
+
>
+
A[8]
B[8]
c[4]
?
o[1]
×
×
×
×
×
×
×
×
-
-
 
Mapping
 
Timing
 
Routing
 
Computation Graph
 
Hardware Graph
+
+
+
+
+
+
+
>
+
A[8]
B[8]
c[4]
?
o[1]
×
×
×
×
×
×
×
×
-
-
Mapping
Timing
Routing
Computation Graph
Hardware Graph
+
+
+
+
+
+
+
>
+
A[8]
B[8]
c[4]
?
o[1]
×
×
×
×
×
×
×
×
-
-
Mapping
Timing
Routing
Computation Graph
Hardware Graph
>
?
+
+
+
+
+
+
+
>
+
A[8]
B[8]
c[4]
?
o[1]
×
×
×
×
×
×
×
×
-
-
Mapping
Timing
Routing
Computation Graph
Hardware Graph
?
>
>
Mapping
Timing
Routing
Joint
Optimization
 
(days for solution)
 
Incremental
Optimization
 
Mapping
 
Timing
 
Routing
 
Mapping
 
Timing
 
Routing
 
Our Approach:
1. Phase Overlapping
2. Hybrid Scheduling
 
(high area increase  or  integer factors performance loss)
 
Joint
Heuristic
 
Timing
 
Routing
 
(full throughput & seconds to minutes & low area overhead)
 
Mapping
 
Routing
Technique Overview
 
Outline
 
“Systolic” CGRAs
Throughput sensitivity and fractional initiation intervals.
Techniques for delay matching
Scheduling Approach 1: Phasing-Overlapping
Tractable Optimization through Overlapping
Scheduling Approach 2: Hybrid Scheduling
Heuristic to reduce search space
Optimization to deal with tricky cases
Evaluation
 
 
Requirements for Full-Pipelining
 
1.
Sufficient Data Bandwidth: 
Data has to arrive at a
bandwidth of sufficient for one set of inputs / cycle.
 
2.
No Compute Contention: 
Each instruction gets a
dedicated compute unit.
3.
No Routing Contention: 
Each dependence gets a
dedicated routing path.
 
4.
No Storage Contention: 
Each operand is guaranteed
a free storage location at each step.
Fully-pipelined Systolic CGRA
×
/
+
 
5
 
2
(2+3)
5
 
Mismatch=3
 
3
 
1
Mismatch=0
1
2
3
4
5
6
Cycle:
Fully-pipelined Systolic CGRA
×
/
+
5
3
1
(2+2)
4
Mismatch=1
1
2
3
4
5
6
Cycle:
 
Throughput Formula
 
Inverse of initiation interval in compiler terminology.
 
=
 
FIFO Length
 
Max. Mismatch + FIFO Length
 
Delay-FIFOs Help
Reduce Latency
Mismatch
 
Provide buffer
space which
increases
throughput
 
But at the cost of area…
 
Throughput
 
Alternatives to Delay-FIFOs (1)
×
/
+
 
5
 
5
 
Use a longer route:  Costs 
Routing
 Resources
 
Alternatives to Delay-FIFOs (2)
×
/
+
 
5
 
5
 
Pass through a PE:  Costs 
Mapping 
Resources
 
This is also just
good for
improving
routing
bandwidth …
 
“Systolic” CGRA Challenge Summary
 
Systolic CGRA throughput is 
very
 sensitive to timing
constraints
Several ways to balance delays:
Pass-through 
(affects mapping)
Long route 
(affects routing)
Delay FIFOs 
(affects timing)
Mapping / Routing / Timing responsibilities are
highly interdependent
Codesign: Either more delay FIFOs or more
advanced scheduler
 
Outline
 
Challenges for pipelining “Systolic” CGRAs
Throughput sensitivity and fractional initiation intervals.
Techniques for delay matching
Scheduling Approach 1: Phasing-Overlapping
Tractable Optimization through Overlapping
Scheduling Approach 2: Hybrid Scheduling
Heuristic to reduce search space
Optimization to deal with tricky cases
Evaluation
 
 
Optimization Background
 
Prior work demonstrates optimization-based spatial
scheduling: Integer Linear Programming [PLDI 2013],
SMT [TOPLAS 2014]
Basic Integer Linear Programming Approach:
Decision Variables:
Assigning instructions to PEs 
(binary)
Assigning dependences to a set of Routes 
(binary)
Assigning delays to each PE 
(integer)
Linear Constraints:
Limit schedule to be legal
We added three forms of delay-matching.
Optimization Phases
 
Joint Optimization: (50% Failure)
Timeout -- Very restricted hardware of systolic CGRA
 
 
Separate Phases:  (85% Failure)
Fast, but doesn’t capture inter-dependence!
 
 
In-between Joint:  (75% Failure, poor throughput)
 
Mapping
 
Timing
 
Routing
Mapping
Timing
Routing
 
Mapping
 
Timing
 
Routing
Our Approach: Phase Overlapping
 
Insight:  Phase Interdependence Matters
But relationship between adjacent phases is more relevant…
 
 
 
Phase Overlapping:
Perform Mapping and Routing together
Discard Routing (keep Mapping)
Perform Routing and Timing together
Good: Only 15% fail, mostly fully-pipelined
Remaining Problems:
Mapping/routing phase takes time (90% or more)
Sometimes, we get an unlucky mapping/routing (looks like
high potential, but cannot be fixed for timing)
Timing
Routing
Mapping
Routing
 
Overview
 
“Systolic” CGRAs
Throughput sensitivity and fractional initiation intervals.
Techniques for delay matching
Scheduling Approach 1: Phasing-Overlapping
Tractable Optimization through Overlapping
Scheduling Approach 2: Hybrid Scheduling
Heuristic to reduce search space
Optimization to deal with tricky cases
Evaluation
 
Hybrid Scheduling: Integrate Heuristic
 
Insight: Mitigate the problems overlapped scheduling by
trying things repeatedly if we got an unlucky mapping.
But … Optimization won’t work for mapping/routing
Still too slow (and do we really need it?)
ILP solver’s don’t tend to generate unique solutions
 
 
 
Hybrid Approach: Use a heuristic for the
mapping/routing phase, to quickly get a plausible
mapping, then optimize the routing and timing together.
 
 
 
 
 
Timing
 
Routing
 
Mapping
 
Routing
 
Heuristic:
 
Optimization:
 
A Heuristic for Hybrid Scheduling
 
Need a heuristic that 
quickly
 generates 
unique
schedules!
Basic Approach: Stochastic Scheduling
Iteratively build schedule in topological order 
(fast)
Normally be rational: Choose a location for each
instruction which minimizing routing resources and
timing mismatch.
Occasionally be irrational: Choose a purposely bad
mapping, hoping that it might help later on 
(unique)
Repeat a few times to find a good schedule 
(good)
Objective function: Minimize total mismatch (to
make routing/timing scheduling easier)
Intuition for Solution Quality
 
Outline
 
“Systolic” CGRAs
Throughput sensitivity and fractional initiation intervals.
Techniques for delay matching
Scheduling Approach 1: Phasing-Overlapping
Tractable Optimization through Overlapping
Scheduling Approach 2: Hybrid Scheduling
Heuristic to reduce search space
Optimization to deal with tricky cases
Evaluation
 
 
Methodology – Accelerator Modeling
 
Softbrain Accelerator
5x5 Processing Elem (PE)
64-bit datapath (subword)
Heterogeneous Grid
All FUs are fully pipelined
Simulation: All blocks simulated at
cycle-level in gem5, single “core”
Area Analysis
Synthesize Chisel-based design in
55nm tech library.
Assumes Control core (single-issue
inorder 16KB I$/D$) + 4KB SPAD
Workloads (accelerator centric)
MachSuite, CNN/DNN, Dense Linear
Algebra QR/Cholesky/FFT)
 
 
IO Interface
 
Softbrain [ISCA 2017]
Stream
SPAD
Streaming
Cache
Ctrl
Core
 
Does delay FIFO area matter?
 
Methodology – Experimental Setup
 
Scheduler Designs:
Joint
 Optimization
Overlapped
 Optimization
Heuristic
 Only
Hybrid 
(overlap/part heuristic)
Scheduling Timeout: 20 Minutes
Problem: How to compare schedulers that across a
set of workloads that may fail?
We don’t!  Instead, we seed all schedulers with an initial
solution from the heuristic.
Joint and Overlapped are also hybrid schedulers in the
evaluation, just using a more traditional approach.
 
Performance vs Time vs Complexity
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Conclusions
 
Systolic CGRAs gain efficiency by simplifying the
execution model.
Consequence is tradeoff between easy scheduling
low hardware overhead, and high performance:
Minimizing delay-FIFO size is key factor in reducing area
and increasing scheduler difficulty in finding minimum II.
Two novel techniques:
Phase Overlapping (capture the phase interdependence)
Hybrid Scheduling (using heuristic to generate solutions)
Broadly, codesign is the key to success of future
reconfigurable architectures.
 
 
 
Thank you
 
Scheduling-time Sensitivity
 
Modeled 
vs
Measured
Throughput
 
 
Heuristic    Hybrid      Overlapped    Joint
Slide Note
Embed
Share

This research presents a hybrid optimization heuristic approach for efficient instruction scheduling in programmable accelerator codesign. It discusses Google's TPU architecture, problem-solving strategies, and computation graph mapping, routing, and timing optimizations. The technique overview highlights joint optimization, incremental optimization, and hybrid scheduling methods for improved performance and reduced area overhead.

  • Optimization
  • Heuristic
  • Instruction Scheduling
  • Accelerator Codesign
  • Computation Graph

Uploaded on Oct 08, 2024 | 3 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Hybrid Optimization/Heuristic Instruction Scheduling for Programmable Accelerator Codesign Tony Nowatzki Newsha Ardalani Karthikeyan Sankaralingam Jian Weng Simple Machines PACT 2018, Nov. 3rd

  2. Google TPU Coarse Grain Reconfigurable Architectures > CPU 2

  3. Google TPU Problem: How can we map computation to this effectively? Circuit Switched Network Processing Element (ALU etc.) Coarse Grain PE PE PE PE PE PE PE PE PE PE Reconfigurable Architectures PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE Perfectly-pipelined execution Stream-Dataflow, FPCA, DySER (almost), LSSD, PE PE PE PE PE PE PE PE PE PE What if I don t exactly want to do dense matrix multiplies with large matrices? (1 computation per cycle) Morphosys, etc Examples: Charm, Camel, PE PE PE PE PE PE PE PE PE PE Systolic Array Systolic CGRA (dedicated PE CGRA) 3

  4. Computation Graph Hardware Graph B[8] c[4] A[8] PE PE PE PE PE - - PE PE PE PE PE + + + + > + + PE PE PE PE PE + ? + PE PE PE PE PE PE PE PE PE PE o[1] Mapping Routing Timing 4

  5. Computation Graph Hardware Graph B[8] c[4] A[8] PE PE PE PE PE - - PE PE PE PE PE + + + + > + + PE PE PE PE PE + ? + PE PE PE PE PE PE PE PE PE PE o[1] Mapping Routing Timing 5

  6. Computation Graph Hardware Graph B[8] c[4] A[8] PE PE PE PE PE - - PE > PE PE PE PE + + + + > + + PE PE PE PE PE + ? + PE PE PE PE PE PE ? PE PE PE PE o[1] Mapping Routing Timing 6

  7. Computation Graph Hardware Graph B[8] c[4] A[8] PE PE PE PE PE - - PE > PE PE PE PE + + + + > > + + PE PE PE PE PE + ? + PE PE PE PE PE PE ? PE PE PE PE o[1] Mapping Routing Timing 7

  8. Technique Overview Joint Optimization Mapping (days for solution) Routing Timing Incremental Optimization Mapping Routing Timing Joint Heuristic Mapping Routing Timing (high area increase or integer factors performance loss) Mapping Routing Our Approach: 1. Phase Overlapping 2. Hybrid Scheduling Routing Timing (full throughput & seconds to minutes & low area overhead) 8

  9. Outline Systolic CGRAs Throughput sensitivity and fractional initiation intervals. Techniques for delay matching Scheduling Approach 1: Phasing-Overlapping Tractable Optimization through Overlapping Scheduling Approach 2: Hybrid Scheduling Heuristic to reduce search space Optimization to deal with tricky cases Evaluation 9

  10. Requirements for Full-Pipelining 1. Sufficient Data Bandwidth: Data has to arrive at a bandwidth of sufficient for one set of inputs / cycle. 2. No Compute Contention: Each instruction gets a dedicated compute unit. 3. No Routing Contention: Each dependence gets a dedicated routing path. 4. No Storage Contention: Each operand is guaranteed a free storage location at each step. 10

  11. Fully-pipelined Systolic CGRA Must move forward! Can t be consumed! 1 PE PE PE 3 / Delay FIFO PE PE PE 5 2 + Mismatch=3 Mismatch=0 2 3 4 1 6 5 Cycle: 11

  12. Fully-pipelined Systolic CGRA 1 PE PE PE 3 / Delay FIFO (2+2) 4 PE PE PE 5 + Mismatch=1 2 3 4 1 6 5 Cycle: 12

  13. Throughput Formula FIFO Length = Throughput Max. Mismatch + FIFO Length Delay-FIFOs Help Reduce Latency Mismatch Provide buffer space which increases throughput But at the cost of area Inverse of initiation interval in compiler terminology. 13

  14. Alternatives to Delay-FIFOs (1) PE PE PE / PE PE PE 5 5 + Use a longer route: Costs Routing Resources 14

  15. Alternatives to Delay-FIFOs (2) Do a NOP PE PE PE This is also just good for improving routing bandwidth / PE PE PE 5 5 + Pass through a PE: Costs Mapping Resources 15

  16. Systolic CGRA Challenge Summary Systolic CGRA throughput is very sensitive to timing constraints Several ways to balance delays: Pass-through (affects mapping) Long route (affects routing) Delay FIFOs (affects timing) Mapping / Routing / Timing responsibilities are highly interdependent Codesign: Either more delay FIFOs or more advanced scheduler 16

  17. Outline Challenges for pipelining Systolic CGRAs Throughput sensitivity and fractional initiation intervals. Techniques for delay matching Scheduling Approach 1: Phasing-Overlapping Tractable Optimization through Overlapping Scheduling Approach 2: Hybrid Scheduling Heuristic to reduce search space Optimization to deal with tricky cases Evaluation 17

  18. Optimization Background Prior work demonstrates optimization-based spatial scheduling: Integer Linear Programming [PLDI 2013], SMT [TOPLAS 2014] Basic Integer Linear Programming Approach: Decision Variables: Assigning instructions to PEs (binary) Assigning dependences to a set of Routes (binary) Assigning delays to each PE (integer) Linear Constraints: Limit schedule to be legal We added three forms of delay-matching. 18

  19. Optimization Phases Joint Optimization: (50% Failure) Timeout -- Very restricted hardware of systolic CGRA Mapping Routing Timing Separate Phases: (85% Failure) Fast, but doesn t capture inter-dependence! Mapping Routing Timing In-between Joint: (75% Failure, poor throughput) Mapping Routing Timing 19

  20. Our Approach: Phase Overlapping Insight: Phase Interdependence Matters But relationship between adjacent phases is more relevant Mapping Routing Routing Timing Phase Overlapping: Perform Mapping and Routing together Discard Routing (keep Mapping) Perform Routing and Timing together Good: Only 15% fail, mostly fully-pipelined Remaining Problems: Mapping/routing phase takes time (90% or more) Sometimes, we get an unlucky mapping/routing (looks like high potential, but cannot be fixed for timing) 20

  21. Overview Systolic CGRAs Throughput sensitivity and fractional initiation intervals. Techniques for delay matching Scheduling Approach 1: Phasing-Overlapping Tractable Optimization through Overlapping Scheduling Approach 2: Hybrid Scheduling Heuristic to reduce search space Optimization to deal with tricky cases Evaluation 21

  22. Hybrid Scheduling: Integrate Heuristic Insight: Mitigate the problems overlapped scheduling by trying things repeatedly if we got an unlucky mapping. But Optimization won t work for mapping/routing Still too slow (and do we really need it?) ILP solver s don t tend to generate unique solutions Heuristic: Mapping Routing Optimization: Routing Timing Hybrid Approach: Use a heuristic for the mapping/routing phase, to quickly get a plausible mapping, then optimize the routing and timing together. 22

  23. A Heuristic for Hybrid Scheduling Need a heuristic that quickly generates unique schedules! Basic Approach: Stochastic Scheduling Iteratively build schedule in topological order (fast) Normally be rational: Choose a location for each instruction which minimizing routing resources and timing mismatch. Occasionally be irrational: Choose a purposely bad mapping, hoping that it might help later on (unique) Repeat a few times to find a good schedule (good) Objective function: Minimize total mismatch (to make routing/timing scheduling easier) 23

  24. Intuition for Solution Quality 16 14 Objective: Mis + Lat/50 Heuristic 12 10 Overlapped 8 Hybrid 6 4 2 0 0 200 400 600 800 Seconds into scheduling (32-1 reduction in CNN) 24

  25. Outline Systolic CGRAs Throughput sensitivity and fractional initiation intervals. Techniques for delay matching Scheduling Approach 1: Phasing-Overlapping Tractable Optimization through Overlapping Scheduling Approach 2: Hybrid Scheduling Heuristic to reduce search space Optimization to deal with tricky cases Evaluation 25

  26. Methodology Accelerator Modeling Softbrain Accelerator 5x5 Processing Elem (PE) 64-bit datapath (subword) Heterogeneous Grid All FUs are fully pipelined Simulation: All blocks simulated at cycle-level in gem5, single core Area Analysis Synthesize Chisel-based design in 55nm tech library. Assumes Control core (single-issue inorder 16KB I$/D$) + 4KB SPAD Workloads (accelerator centric) MachSuite, CNN/DNN, Dense Linear Algebra QR/Cholesky/FFT) Streaming Cache Ctrl Core Stream SPAD PE PE PE PE PE PE PE PE PE PE IO Interface PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE Softbrain [ISCA 2017] 26

  27. Does delay FIFO area matter? Area (mm2) FIFO Length Scheduling Difficulty Overhead 2 3 Very Hard Hard 0.528 0.543 9.67% 12.90% 7 Medium 0.606 25.85% 15 Easy 0.736 52.86% 27

  28. Methodology Experimental Setup Scheduler Designs: Joint Optimization Overlapped Optimization Heuristic Only Hybrid (overlap/part heuristic) Scheduling Timeout: 20 Minutes Problem: How to compare schedulers that across a set of workloads that may fail? We don t! Instead, we seed all schedulers with an initial solution from the heuristic. Joint and Overlapped are also hybrid schedulers in the evaluation, just using a more traditional approach. 28

  29. Performance vs Time vs Complexity Throughput 1 (Iterations/Cycle) 0.95 Firing Rate 0.9 0.85 0.8 0.75 0.7 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 Joint Overlapped Heuristic Hybrid Average Scheduling Time Average Scheduling 600 500 Time (Sec.) 400 300 200 100 0 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 Joint Overlapped Heuristic Hybrid

  30. Conclusions Systolic CGRAs gain efficiency by simplifying the execution model. Consequence is tradeoff between easy scheduling low hardware overhead, and high performance: Minimizing delay-FIFO size is key factor in reducing area and increasing scheduler difficulty in finding minimum II. Two novel techniques: Phase Overlapping (capture the phase interdependence) Hybrid Scheduling (using heuristic to generate solutions) Broadly, codesign is the key to success of future reconfigurable architectures. 30

  31. Thank you 31

  32. Scheduling-time Sensitivity 20 min 40 min 1 hr Throughput 1 (Iterations/Cycle) 0.95 Firing Rate 0.9 0.85 0.8 0.75 0.7 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 Joint Overlapped Heuristic Hybrid 20 min 40 min 1 hr Average Scheduling Time Avg. Scheduling 1000 Time(Sec.) 100 10 1 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 7 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 Joint Overlapped Heuristic Hybrid

  33. Heuristic Hybrid Overlapped Joint Modeled vs Measured Throughput 33

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#