Hybrid Optimization Heuristic Instruction Scheduling for Accelerator Codesign

Slide Note

This research presents a hybrid optimization heuristic approach for efficient instruction scheduling in programmable accelerator codesign. It discusses Google's TPU architecture, problem-solving strategies, and computation graph mapping, routing, and timing optimizations. The technique overview highlights joint optimization, incremental optimization, and hybrid scheduling methods for improved performance and reduced area overhead.

chr_ro Follow

Uploaded on Oct 08, 2024 | 6 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Hybrid Optimization/Heuristic Instruction Scheduling for Programmable Accelerator Codesign Tony Nowatzki Newsha Ardalani Karthikeyan Sankaralingam Jian Weng Simple Machines PACT 2018, Nov. 3rd

Google TPU Coarse Grain Reconfigurable Architectures > CPU 2

Google TPU Problem: How can we map computation to this effectively? Circuit Switched Network Processing Element (ALU etc.) Coarse Grain PE PE PE PE PE PE PE PE PE PE Reconfigurable Architectures PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE Perfectly-pipelined execution Stream-Dataflow, FPCA, DySER (almost), LSSD, PE PE PE PE PE PE PE PE PE PE What if I don t exactly want to do dense matrix multiplies with large matrices? (1 computation per cycle) Morphosys, etc Examples: Charm, Camel, PE PE PE PE PE PE PE PE PE PE Systolic Array Systolic CGRA (dedicated PE CGRA) 3

Computation Graph Hardware Graph B[8] c[4] A[8] PE PE PE PE PE - - PE PE PE PE PE + + + + > + + PE PE PE PE PE + ? + PE PE PE PE PE PE PE PE PE PE o[1] Mapping Routing Timing 4

Computation Graph Hardware Graph B[8] c[4] A[8] PE PE PE PE PE - - PE PE PE PE PE + + + + > + + PE PE PE PE PE + ? + PE PE PE PE PE PE PE PE PE PE o[1] Mapping Routing Timing 5

Computation Graph Hardware Graph B[8] c[4] A[8] PE PE PE PE PE - - PE > PE PE PE PE + + + + > + + PE PE PE PE PE + ? + PE PE PE PE PE PE ? PE PE PE PE o[1] Mapping Routing Timing 6

Computation Graph Hardware Graph B[8] c[4] A[8] PE PE PE PE PE - - PE > PE PE PE PE + + + + > > + + PE PE PE PE PE + ? + PE PE PE PE PE PE ? PE PE PE PE o[1] Mapping Routing Timing 7

Technique Overview Joint Optimization Mapping (days for solution) Routing Timing Incremental Optimization Mapping Routing Timing Joint Heuristic Mapping Routing Timing (high area increase or integer factors performance loss) Mapping Routing Our Approach: 1. Phase Overlapping 2. Hybrid Scheduling Routing Timing (full throughput & seconds to minutes & low area overhead) 8

Outline Systolic CGRAs Throughput sensitivity and fractional initiation intervals. Techniques for delay matching Scheduling Approach 1: Phasing-Overlapping Tractable Optimization through Overlapping Scheduling Approach 2: Hybrid Scheduling Heuristic to reduce search space Optimization to deal with tricky cases Evaluation 9

Requirements for Full-Pipelining 1. Sufficient Data Bandwidth: Data has to arrive at a bandwidth of sufficient for one set of inputs / cycle. 2. No Compute Contention: Each instruction gets a dedicated compute unit. 3. No Routing Contention: Each dependence gets a dedicated routing path. 4. No Storage Contention: Each operand is guaranteed a free storage location at each step. 10

Fully-pipelined Systolic CGRA Must move forward! Can t be consumed! 1 PE PE PE 3 / Delay FIFO PE PE PE 5 2 + Mismatch=3 Mismatch=0 2 3 4 1 6 5 Cycle: 11

Fully-pipelined Systolic CGRA 1 PE PE PE 3 / Delay FIFO (2+2) 4 PE PE PE 5 + Mismatch=1 2 3 4 1 6 5 Cycle: 12

Throughput Formula FIFO Length = Throughput Max. Mismatch + FIFO Length Delay-FIFOs Help Reduce Latency Mismatch Provide buffer space which increases throughput But at the cost of area Inverse of initiation interval in compiler terminology. 13

Alternatives to Delay-FIFOs (1) PE PE PE / PE PE PE 5 5 + Use a longer route: Costs Routing Resources 14

Alternatives to Delay-FIFOs (2) Do a NOP PE PE PE This is also just good for improving routing bandwidth / PE PE PE 5 5 + Pass through a PE: Costs Mapping Resources 15

Systolic CGRA Challenge Summary Systolic CGRA throughput is very sensitive to timing constraints Several ways to balance delays: Pass-through (affects mapping) Long route (affects routing) Delay FIFOs (affects timing) Mapping / Routing / Timing responsibilities are highly interdependent Codesign: Either more delay FIFOs or more advanced scheduler 16

Outline Challenges for pipelining Systolic CGRAs Throughput sensitivity and fractional initiation intervals. Techniques for delay matching Scheduling Approach 1: Phasing-Overlapping Tractable Optimization through Overlapping Scheduling Approach 2: Hybrid Scheduling Heuristic to reduce search space Optimization to deal with tricky cases Evaluation 17

Optimization Background Prior work demonstrates optimization-based spatial scheduling: Integer Linear Programming [PLDI 2013], SMT [TOPLAS 2014] Basic Integer Linear Programming Approach: Decision Variables: Assigning instructions to PEs (binary) Assigning dependences to a set of Routes (binary) Assigning delays to each PE (integer) Linear Constraints: Limit schedule to be legal We added three forms of delay-matching. 18

Optimization Phases Joint Optimization: (50% Failure) Timeout -- Very restricted hardware of systolic CGRA Mapping Routing Timing Separate Phases: (85% Failure) Fast, but doesn t capture inter-dependence! Mapping Routing Timing In-between Joint: (75% Failure, poor throughput) Mapping Routing Timing 19

Our Approach: Phase Overlapping Insight: Phase Interdependence Matters But relationship between adjacent phases is more relevant Mapping Routing Routing Timing Phase Overlapping: Perform Mapping and Routing together Discard Routing (keep Mapping) Perform Routing and Timing together Good: Only 15% fail, mostly fully-pipelined Remaining Problems: Mapping/routing phase takes time (90% or more) Sometimes, we get an unlucky mapping/routing (looks like high potential, but cannot be fixed for timing) 20

Overview Systolic CGRAs Throughput sensitivity and fractional initiation intervals. Techniques for delay matching Scheduling Approach 1: Phasing-Overlapping Tractable Optimization through Overlapping Scheduling Approach 2: Hybrid Scheduling Heuristic to reduce search space Optimization to deal with tricky cases Evaluation 21

Hybrid Scheduling: Integrate Heuristic Insight: Mitigate the problems overlapped scheduling by trying things repeatedly if we got an unlucky mapping. But Optimization won t work for mapping/routing Still too slow (and do we really need it?) ILP solver s don t tend to generate unique solutions Heuristic: Mapping Routing Optimization: Routing Timing Hybrid Approach: Use a heuristic for the mapping/routing phase, to quickly get a plausible mapping, then optimize the routing and timing together. 22

A Heuristic for Hybrid Scheduling Need a heuristic that quickly generates unique schedules! Basic Approach: Stochastic Scheduling Iteratively build schedule in topological order (fast) Normally be rational: Choose a location for each instruction which minimizing routing resources and timing mismatch. Occasionally be irrational: Choose a purposely bad mapping, hoping that it might help later on (unique) Repeat a few times to find a good schedule (good) Objective function: Minimize total mismatch (to make routing/timing scheduling easier) 23

Intuition for Solution Quality 16 14 Objective: Mis + Lat/50 Heuristic 12 10 Overlapped 8 Hybrid 6 4 2 0 0 200 400 600 800 Seconds into scheduling (32-1 reduction in CNN) 24

Outline Systolic CGRAs Throughput sensitivity and fractional initiation intervals. Techniques for delay matching Scheduling Approach 1: Phasing-Overlapping Tractable Optimization through Overlapping Scheduling Approach 2: Hybrid Scheduling Heuristic to reduce search space Optimization to deal with tricky cases Evaluation 25

Methodology Accelerator Modeling Softbrain Accelerator 5x5 Processing Elem (PE) 64-bit datapath (subword) Heterogeneous Grid All FUs are fully pipelined Simulation: All blocks simulated at cycle-level in gem5, single core Area Analysis Synthesize Chisel-based design in 55nm tech library. Assumes Control core (single-issue inorder 16KB I$/D$) + 4KB SPAD Workloads (accelerator centric) MachSuite, CNN/DNN, Dense Linear Algebra QR/Cholesky/FFT) Streaming Cache Ctrl Core Stream SPAD PE PE PE PE PE PE PE PE PE PE IO Interface PE PE PE PE PE PE PE PE PE PE PE PE PE PE PE Softbrain [ISCA 2017] 26

Does delay FIFO area matter? Area (mm2) FIFO Length Scheduling Difficulty Overhead 2 3 Very Hard Hard 0.528 0.543 9.67% 12.90% 7 Medium 0.606 25.85% 15 Easy 0.736 52.86% 27

Methodology Experimental Setup Scheduler Designs: Joint Optimization Overlapped Optimization Heuristic Only Hybrid (overlap/part heuristic) Scheduling Timeout: 20 Minutes Problem: How to compare schedulers that across a set of workloads that may fail? We don t! Instead, we seed all schedulers with an initial solution from the heuristic. Joint and Overlapped are also hybrid schedulers in the evaluation, just using a more traditional approach. 28

Performance vs Time vs Complexity Throughput 1 (Iterations/Cycle) 0.95 Firing Rate 0.9 0.85 0.8 0.75 0.7 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 Joint Overlapped Heuristic Hybrid Average Scheduling Time Average Scheduling 600 500 Time (Sec.) 400 300 200 100 0 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 Joint Overlapped Heuristic Hybrid

Conclusions Systolic CGRAs gain efficiency by simplifying the execution model. Consequence is tradeoff between easy scheduling low hardware overhead, and high performance: Minimizing delay-FIFO size is key factor in reducing area and increasing scheduler difficulty in finding minimum II. Two novel techniques: Phase Overlapping (capture the phase interdependence) Hybrid Scheduling (using heuristic to generate solutions) Broadly, codesign is the key to success of future reconfigurable architectures. 30

Thank you 31

Scheduling-time Sensitivity 20 min 40 min 1 hr Throughput 1 (Iterations/Cycle) 0.95 Firing Rate 0.9 0.85 0.8 0.75 0.7 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 Joint Overlapped Heuristic Hybrid 20 min 40 min 1 hr Average Scheduling Time Avg. Scheduling 1000 Time(Sec.) 100 10 1 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 15 FIFO: 7 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 FIFO: 7 FIFO: 3 FIFO: 2 Joint Overlapped Heuristic Hybrid