Understanding Eyeriss Dataflow for a Digital CNN Accelerator

Explore the Eyeriss architecture and dataflow, including dataflow optimizations, the overall spatial architecture, primitive operations, a convolution example, folding techniques, weight and output stationarity, and key terminology.




Presentation Transcript


  1. Lecture: Eyeriss Dataflow. Topics: Eyeriss architecture and dataflow (a digital CNN accelerator).

  2. Dataflow Optimizations

  3. Overall Spatial Architecture

  4. One Primitive

  5. Row Stationary Dataflow for One 2D Convolution. Example: 4 64x64 inputs; 4x3x3 kernel weights; 8 62x62 outputs; batch of 20 images. Edge primitive: (global buffer) 64 inputs, 3 weights; (registers) 186 MACs on psums (124 register bytes) (62 PEs). Other primitives: (PE) 64 inputs, 3 weights; (registers) 186 MACs on psums (124 register bytes) (62 PEs). The first step is done ~64 times; the second step is done ~122 times. Eventually, ~4K outputs go to the global buffer or DRAM.
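
As a rough illustration (not the Eyeriss RTL; the function name and array widths below are taken from the example or assumed), each row-stationary primitive is a 1D convolution of one 64-wide input row with one 3-wide filter row, producing 62 partial sums with 186 MACs:

    #include <stdint.h>

    /* Hypothetical model of one row-stationary primitive: a 1D convolution of
     * one input row against one filter row, accumulating into a row of psums. */
    #define IN_W  64                 /* input row width from the example  */
    #define K      3                 /* filter row width                  */
    #define OUT_W (IN_W - K + 1)     /* 62 partial sums per primitive     */

    void pe_row_primitive(const int8_t in_row[IN_W],
                          const int8_t wt_row[K],
                          int32_t psum_row[OUT_W])
    {
        /* 62 output positions x 3 MACs each = 186 MACs, as on the slide. */
        for (int x = 0; x < OUT_W; x++)
            for (int k = 0; k < K; k++)
                psum_row[x] += (int32_t)in_row[x + k] * wt_row[k];
    }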

  6. Folding. A 2D convolution may have to be folded over a small physical set of PEs. The 4D convolution must eventually be folded into multiple 2D convolutions: the 2D convolution has to be done C (input filters) x M (output filters) x N (image batch) times. Depending on the order in which this is done, global-buffer reuse and register reuse can be exploited (note that inputs, weights, and partial sums all have to be dealt with); see the loop-nest sketch below.
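
A minimal loop-nest sketch of this folding, assuming the example's dimensions (N=20, M=8, C=4) and a conv2d_accumulate helper standing in for the 2D convolution mapped onto the PE array; the chosen loop order determines which operand is reused:

    #include <stdint.h>

    enum { N = 20, M = 8, C = 4, IN_W = 64, K = 3, OUT_W = IN_W - K + 1 };

    /* One 2D convolution, accumulated into one output map (stands in for the
     * work mapped onto the PE array). */
    static void conv2d_accumulate(int32_t out[OUT_W][OUT_W],
                                  const int8_t in[IN_W][IN_W],
                                  const int8_t wt[K][K])
    {
        for (int y = 0; y < OUT_W; y++)
            for (int x = 0; x < OUT_W; x++)
                for (int ky = 0; ky < K; ky++)
                    for (int kx = 0; kx < K; kx++)
                        out[y][x] += (int32_t)in[y + ky][x + kx] * wt[ky][kx];
    }

    /* The 4D convolution folded into N x M x C calls of the 2D convolution.
     * Reordering these three loops trades off reuse of inputs (in), weights
     * (wt), and partial sums (out) in the global buffer and registers. */
    void conv4d(int32_t out[N][M][OUT_W][OUT_W],
                const int8_t in[N][C][IN_W][IN_W],
                const int8_t wt[M][C][K][K])
    {
        for (int n = 0; n < N; n++)          /* image batch    */
            for (int m = 0; m < M; m++)      /* output filters */
                for (int c = 0; c < C; c++)  /* input filters  */
                    conv2d_accumulate(out[n][m], in[n][c], wt[m][c]);
    }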

  7. Weight Stationary. Example: 4 64x64 inputs; 4x3x3 kernel weights; 8 62x62 outputs; batch of 20 images. Weight Stationary: the weights stay in place. Assume the 3x3 weights are laid out across 9 PEs and the inputs stream over them. Each weight has to be seen 62x62 times, and there is no easy way to move the pixels around to promote this reuse.
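
A loop-nest sketch of the weight-stationary order under the same assumed dimensions (illustrative only, not the Eyeriss mapping; the caller is assumed to have zeroed the output):

    #include <stdint.h>

    enum { IN_W = 64, K = 3, OUT_W = IN_W - K + 1 };

    /* Weight-stationary order: the weight loops are outermost, so each of the
     * 9 weights is fetched once (pinned in a PE register) and reused across
     * all 62x62 output positions.  The cost is that the partial sums in `out`
     * are read and written 62x62 times per weight. */
    void conv2d_weight_stationary(int32_t out[OUT_W][OUT_W],
                                  const int8_t in[IN_W][IN_W],
                                  const int8_t wt[K][K])
    {
        for (int ky = 0; ky < K; ky++)
            for (int kx = 0; kx < K; kx++) {
                const int8_t w = wt[ky][kx];     /* the weight stays in place */
                for (int y = 0; y < OUT_W; y++)
                    for (int x = 0; x < OUT_W; x++)
                        out[y][x] += (int32_t)in[y + ky][x + kx] * w;
            }
    }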

  8. Output Stationary. Example: 4 64x64 inputs; 4x3x3 kernel weights; 8 62x62 outputs; batch of 20 images. Output Stationary: the output neuron stays in place. All PEs are used to compute a subset of the 4D space of output neurons at a time; inputs can be moved around to promote reuse.
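
The corresponding output-stationary loop order, again as an assumed sketch rather than the exact Eyeriss mapping:

    #include <stdint.h>

    enum { IN_W = 64, K = 3, OUT_W = IN_W - K + 1 };

    /* Output-stationary order: the output loops are outermost, so each
     * output's partial sum stays in a PE register (`acc`) until it is
     * complete and psums never move; inputs and weights are streamed to the
     * PE instead. */
    void conv2d_output_stationary(int32_t out[OUT_W][OUT_W],
                                  const int8_t in[IN_W][IN_W],
                                  const int8_t wt[K][K])
    {
        for (int y = 0; y < OUT_W; y++)
            for (int x = 0; x < OUT_W; x++) {
                int32_t acc = 0;                 /* the psum stays in place */
                for (int ky = 0; ky < K; ky++)
                    for (int kx = 0; kx < K; kx++)
                        acc += (int32_t)in[y + ky][x + kx] * wt[ky][kx];
                out[y][x] = acc;
            }
    }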

  9. Terminology

  10. Energy Estimates. Most operations are in the convolutional layers. In conv layers, most energy goes to the ALU and register file (RF); in fully-connected (FC) layers, most energy goes to the buffer. More storage means more delay.

  11. Summary. It is all about reducing energy and data movement; PEs are assumed to be busy most of the time (except for edge effects). Reduced data movement yields low energy and low area (from fewer interconnects). While Row Stationary is best, a detailed design-space exploration is needed to identify the best traversal through the 4D array. It is not always about reducing DRAM accesses; even global-buffer accesses must be reduced. More PEs allow better data reuse, so this is not terrible even if it means a smaller global buffer. Convs are 90% of all ops and growing. Their best designs use 256 PEs, a 0.5KB register file per PE, and a 128KB global buffer, with the register file split as filter/psum/activation = 224/24/12.

  12. WAX

  13. References. "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," Y.-H. Chen et al., ISCA 2016. "Wire-Aware Architecture and Dataflow for CNN Accelerators," S. Gudaparthi et al., MICRO 2019.
