Exploring Efficient Hardware Architectures for Deep Neural Network Processing

Discover new hardware architectures designed for efficient deep neural network processing, including the SCNN accelerator for compressed-sparse convolutional neural networks. Learn about the convolution operation, memory size versus access energy, how dataflow decisions determine reuse, and the Planar Tiled-Input Stationary-Cartesian Product-Sparse (PT-IS-CP) dataflow. Dive into intra- and inter-PE parallelism, and understand how these architectures improve performance for inference and deep learning tasks.



Presentation Transcript


  1. New hardware architectures for efficient deep net processing

  2. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. 9 authors @ NVIDIA, MIT, Berkeley, Stanford. ISCA 2017.

  3. Convolution operation
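
The slide itself only names the operation; as a reference point for the dataflow discussion below, here is a minimal dense convolution loop nest. The variable names (K output channels, C input channels, R x S filter, H x W plane) follow the usual convention for this literature and are my labels, not the slide's.

```python
import numpy as np

def conv_layer(inputs, weights):
    """Dense convolution sketch: inputs are C x H x W, weights are K x C x R x S.
    Unit stride, 'valid' padding; an illustration, not the SCNN implementation."""
    C, H, W = inputs.shape
    K, _, R, S = weights.shape
    out = np.zeros((K, H - R + 1, W - S + 1))
    for k in range(K):                       # output channels
        for c in range(C):                   # input channels
            for y in range(H - R + 1):       # output rows
                for x in range(W - S + 1):   # output columns
                    for r in range(R):       # filter rows
                        for s in range(S):   # filter columns
                            out[k, y, x] += weights[k, c, r, s] * inputs[c, y + r, x + s]
    return out
```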

  4. Reuse

  5. Memory: size vs. access energy

  6. Dataflow decides reuse

  7. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP. N = 1 for inference. Reuse activations: Input Stationary (IS). Reuse filters.
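
A rough sketch of what the Input Stationary order means as a loop nest: each input activation is fetched once and held stationary while every weight that touches it (each output channel k and filter offset (r, s)) streams past, so the activation is reused many times per fetch. This illustrates the reuse pattern under unit stride and 'valid' padding; it is not the paper's exact loop nest.

```python
import numpy as np

def conv_input_stationary(inputs, weights):
    """Input-stationary ordering: inputs[c, y, x] is the stationary operand;
    all weights that use it are streamed past it before the next fetch."""
    C, H, W = inputs.shape
    K, _, R, S = weights.shape
    out = np.zeros((K, H - R + 1, W - S + 1))
    for c in range(C):
        for y in range(H):
            for x in range(W):
                a = inputs[c, y, x]                  # fetched once, reused below
                for k in range(K):
                    for r in range(R):
                        for s in range(S):
                            oy, ox = y - r, x - s    # output coordinate fed by (y, x)
                            if 0 <= oy < H - R + 1 and 0 <= ox < W - S + 1:
                                out[k, oy, ox] += a * weights[k, c, r, s]
    return out
```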

  8. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP. Inter-PE parallelism. Intra-PE parallelism. Output coordinate.

  9. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP. Intra-PE parallelism: Cartesian Product (CP), all-to-all multiplications. Each PE contains F*I multipliers. A vector of F filter weights and a vector of I input activations are fetched. Multiplier outputs are sent to the accumulator to compute partial sums; the accumulator has F*I adders to match multiplier throughput. Each partial sum is written at the matching coordinate in the output activation space.
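
One way to read the slide's multiplier-array description as code, for a single input channel: a vector of F weights and a vector of I activations are fetched, all F*I products are formed (an outer product), and each product is accumulated at the output coordinate implied by its weight's (k, r, s) offset and its activation's (y, x) position. The coordinate bookkeeping and the dense NumPy output array below are simplifying assumptions; in the hardware the F*I partial sums are scattered into an accumulator buffer.

```python
import numpy as np

def cartesian_product_step(out, w_vals, w_coords, a_vals, a_coords):
    """One intra-PE multiply step for a single input channel.
    w_vals: F weight values,     w_coords: list of (k, r, s) per weight
    a_vals: I activation values, a_coords: list of (y, x) per activation
    All F*I products are formed and accumulated at matching output coordinates."""
    products = np.outer(w_vals, a_vals)              # F x I all-to-all multiplications
    for f, (k, r, s) in enumerate(w_coords):
        for i, (y, x) in enumerate(a_coords):
            oy, ox = y - r, x - s                    # output coordinate for this pair
            if 0 <= oy < out.shape[1] and 0 <= ox < out.shape[2]:
                out[k, oy, ox] += products[f, i]     # partial-sum accumulation
```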

  10. Inter-PE parallelism

  11. Last week: Inter-PE parallelism

  12. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP. Planar Tiled (PT): division of the input activation map for inter-PE parallelism. Each PE processes C*Wt*Ht inputs. Output halos: the accumulation buffer contains incomplete partial sums that are communicated to neighboring PEs.
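
A small sketch of the planar tiling and of why output halos arise: each PE owns a Wt x Ht tile of the activation plane (across all C channels), but inputs near a tile edge generate partial sums for output coordinates owned by a neighboring PE, so those sums land in a halo region of the accumulation buffer and must be exchanged. The helpers below assume unit stride and a regular PE grid; they illustrate the bookkeeping, not the paper's implementation.

```python
def tile_bounds(H, W, ph, pw, Ht, Wt):
    """Return the input-tile bounds owned by the PE at grid position (ph, pw)
    when the H x W activation plane is split into Ht x Wt tiles per PE."""
    y0, x0 = ph * Ht, pw * Wt
    return y0, min(y0 + Ht, H), x0, min(x0 + Wt, W)

def needs_halo_exchange(oy, ox, y0, y1, x0, x1):
    """A partial sum at output coordinate (oy, ox) is a halo value if it falls
    outside this PE's own output tile and must be sent to a neighboring PE."""
    return not (y0 <= oy < y1 and x0 <= ox < x1)
```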

  13. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP. Inter-PE parallelism. Intra-PE parallelism. Output coordinate.

  14. Last week: sparsity in the FC layer. Dynamic sparsity (input dependent), created by the ReLU operator on specific activations during inference: activations might be 0. Static sparsity of the network, created by pruning during training: weights might be 0. Weight quantization and sharing: a table of shared weights, addressed by a 4-bit index. Only non-zero weights and non-zero activations are kept.
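
A sketch of the weight quantization and sharing recalled from last week: weights are clustered into a small table of shared values and each weight is stored only as a 4-bit index into that table, so reading a weight becomes a table lookup. The 16-entry table matches the 4-bit index on the slide; the simple k-means clustering below is an assumed stand-in for however the table is actually trained.

```python
import numpy as np

def quantize_weights(weights, bits=4, iters=20):
    """Cluster weights into 2**bits shared values (plain k-means sketch) and
    return (codebook, indices); each weight is then stored as a `bits`-bit index."""
    flat = weights.ravel()
    codebook = np.linspace(flat.min(), flat.max(), 2 ** bits)  # initial centroids
    for _ in range(iters):
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(len(codebook)):
            if np.any(idx == j):
                codebook[j] = flat[idx == j].mean()
    return codebook, idx.reshape(weights.shape)

def decode_weight(codebook, indices):
    """Reconstruct approximate weights from the shared-value table and the indices."""
    return codebook[indices]
```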

  15.–17. Weight sparsity (statically, from pruning) and activation sparsity (dynamically, from ReLU) are present in conv layers too.

  18. Last week: Compressed format to store non-zero weights

  19. Example compressed weights for PE0: v, x, p.

  20. Similar compression of zero weights and activations: not all F weights and I input activations are stored and fetched in the dataflow.
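
A sketch of what "only the non-zero weights and activations are stored and fetched" can look like: each vector is kept as its non-zero values plus their coordinates, and the all-to-all multiply iterates over non-zeros only. The absolute indices and the toy 2-D output used here are illustrative assumptions; the format on slide 19 (presumably the v/x/p columns are a value/index/pointer layout) uses more compact metadata.

```python
import numpy as np

def compress(dense):
    """Keep only the non-zero values of a vector together with their indices."""
    nz = np.nonzero(dense)[0]
    return dense[nz], nz                      # (values, coordinates)

def sparse_outer_accumulate(out, w_vals, w_idx, a_vals, a_idx):
    """All-to-all products of the non-zero weights and non-zero activations only;
    zero operands are never stored, fetched, or multiplied. The output here is a
    simplified 2-D table indexed by (weight index, activation index)."""
    for wv, wi in zip(w_vals, w_idx):
        for av, ai in zip(a_vals, a_idx):
            out[wi, ai] += wv * av            # coordinates recovered from indices
```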

  21. PE hardware architecture

  22. Other details in the paper: implementation, evaluation.

  23. Next Friday: student presentations. Tuesday: minor 1. 3-4 more lectures on architecture.
