Exploring Efficient Hardware Architectures for Deep Neural Network Processing
Discover new hardware architectures designed for efficient deep neural network processing, including SCNN accelerators for compressed-sparse Convolutional Neural Networks. Learn about convolution operations, memory size versus access energy, dataflow decisions for reuse, and Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow strategies. Dive into concepts such as intra and inter PE parallelism, and understand how these architectures enhance performance for inference and deep learning tasks.
Presentation Transcript
New hardware architectures for efficient deep net processing
SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. Nine authors from NVIDIA, MIT, Berkeley, and Stanford; ISCA 2017.
Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
- N = 1 for inference (batch size of one)
- Reuse activations: Input Stationary (IS); see the loop-nest sketch below
- Reuse filters
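To make the Input Stationary idea concrete, here is a minimal loop-nest sketch in Python (my own illustration, not the paper's pseudocode) for a single convolutional layer at batch size N = 1: each input activation is fetched once and held while every weight that touches it streams past, with the products scattered to their output coordinates.

```python
import numpy as np

# Minimal sketch of an input-stationary (IS) loop order for one conv layer at
# batch size N = 1 (inference); stride 1 and no padding are assumed for simplicity.
def conv_input_stationary(inputs, weights):
    """inputs: (C, H, W) activations; weights: (K, C, R, S) filters.
    Returns outputs of shape (K, H - R + 1, W - S + 1)."""
    C, H, W = inputs.shape
    K, _, R, S = weights.shape
    out_h, out_w = H - R + 1, W - S + 1
    outputs = np.zeros((K, out_h, out_w))
    for c in range(C):
        for x in range(H):
            for y in range(W):
                a = inputs[c, x, y]           # the "stationary" activation
                for k in range(K):            # every filter reuses the same activation
                    for r in range(R):
                        for s in range(S):
                            ox, oy = x - r, y - s
                            if 0 <= ox < out_h and 0 <= oy < out_w:
                                outputs[k, ox, oy] += a * weights[k, c, r, s]
    return outputs
```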
Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP: inter-PE parallelism, intra-PE parallelism, output coordinate
Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
Intra-PE parallelism: Cartesian Product (CP), all-to-all multiplications
- Each PE contains F*I multipliers
- A vector of F filter weights is fetched
- A vector of I input activations is fetched
- Multiplier outputs are sent to an accumulator to compute partial sums
- The accumulator has F*I adders to match multiplier throughput
- Each partial sum is written at the matching coordinate in the output activation space (a toy sketch of this step follows below)
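A toy sketch of the Cartesian-product step, using illustrative data structures rather than SCNN's actual buffers: a vector of F weights and a vector of I activations are multiplied all-to-all, and each of the F*I partial products is accumulated at the output coordinate implied by the activation's position and the weight's (r, s) offset.

```python
# Illustrative sketch of the intra-PE Cartesian product (CP).
# weights_vec: list of (k, r, s, wval) non-zero weights      (length F)
# acts_vec:    list of (x, y, aval)    non-zero activations  (length I)
# accum:       dict mapping (k, out_x, out_y) -> partial sum (the accumulation buffer)
def cartesian_product_step(weights_vec, acts_vec, accum):
    for (k, r, s, wval) in weights_vec:           # F weights
        for (x, y, aval) in acts_vec:             # I activations -> F*I products
            out_x, out_y = x - r, y - s           # output coordinate for this pair
            key = (k, out_x, out_y)               # boundary checks omitted for brevity
            accum[key] = accum.get(key, 0.0) + wval * aval   # one of F*I additions
    return accum
```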
Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
Planar Tiled (PT): division of the input activation map for inter-PE parallelism
- Each PE processes C*Wt*Ht inputs
- Output halos: the accumulation buffer contains incomplete partial sums that are communicated to neighboring PEs (see the tiling sketch below)
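A rough sketch of the planar tiling and its halo regions; the helper names and the boundary convention (an output at (out_r, out_c) needs input rows out_r..out_r+R-1 and columns out_c..out_c+S-1) are my own assumptions for illustration, not taken from the paper.

```python
# Each PE owns one Ht x Wt tile of the H x W plane across all C input channels,
# i.e. C * Ht * Wt input activations.
def tile_ranges(H, W, Ht, Wt):
    """Yield (row_start, row_end, col_start, col_end) for each PE's input tile."""
    for r0 in range(0, H, Ht):
        for c0 in range(0, W, Wt):
            yield (r0, min(r0 + Ht, H), c0, min(c0 + Wt, W))

def is_halo_output(out_r, out_c, tile, R, S):
    """True when this output's R x S filter footprint reaches past the tile's
    input boundary: its partial sum in this PE's accumulation buffer is then
    incomplete and must be merged with a neighboring PE's contribution."""
    r0, r1, c0, c1 = tile
    rows_outside = out_r < r0 or out_r + R - 1 >= r1
    cols_outside = out_c < c0 or out_c + S - 1 >= c1
    return rows_outside or cols_outside
```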
Last week: sparsity in the FC layer
- Dynamic sparsity (input dependent): created by the ReLU operator on specific activations during inference, so activations might be 0
- Static sparsity of the network: created by pruning during training, so weights might be 0
- Weight quantization and sharing: each non-zero weight is a 4-bit index into a table of shared weights
- Only non-zero weights and non-zero activations are stored (a toy sketch follows below)
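As a recap of how those two sparsity sources are exploited in an FC layer, here is a toy sketch; the table size, values, and layer shape are illustrative, not taken from the paper.

```python
import numpy as np

# Static sparsity from pruning: surviving weights stored as (row, col, index)
# triples, where a 4-bit index selects one of 16 shared weight values.
# Dynamic sparsity from ReLU: zero activations are skipped at inference time.
shared_weights = np.linspace(-1.0, 1.0, 16)           # 16 entries -> 4-bit index
sparse_weights = [(0, 1, 12), (0, 3, 5), (1, 3, 14)]  # non-pruned weights of a 2x4 layer

def sparse_fc(activations, sparse_weights, out_dim):
    """Multiply only non-zero activations by the stored non-zero weights."""
    out = np.zeros(out_dim)
    nz_acts = {j: a for j, a in enumerate(activations) if a != 0.0}  # skip ReLU zeros
    for row, col, idx in sparse_weights:
        if col in nz_acts:                             # both operands are non-zero
            out[row] += shared_weights[idx] * nz_acts[col]
    return np.maximum(out, 0.0)                        # ReLU creates the next layer's sparsity

acts = np.array([0.0, 0.7, 0.0, 1.2])                  # zeros left by a previous ReLU
print(sparse_fc(acts, sparse_weights, out_dim=2))
```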
Weight sparsity (static, from pruning) and activation sparsity (dynamic, from ReLU) are present in conv layers too.
SCNN applies a similar compression to zero weights and activations: not all F weights and I input activations are stored and fetched in the dataflow, only the non-zero ones.
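The sketch below shows that compression in the simplest possible form. SCNN's actual encoding stores the number of zeros between successive non-zero values; a plain coordinate list is used here only for brevity, and only these compressed vectors of weights and activations would be fetched into a PE's F*I multiplier array.

```python
import numpy as np

def compress(tensor):
    """Return the non-zero values of `tensor` and their coordinates."""
    coords = np.argwhere(tensor != 0)
    values = tensor[tensor != 0]
    return values, coords

def decompress(values, coords, shape):
    """Rebuild the dense tensor from its compressed form."""
    out = np.zeros(shape)
    for v, c in zip(values, coords):
        out[tuple(c)] = v
    return out

acts = np.array([[0.0, 1.5, 0.0],
                 [0.0, 0.0, 2.0]])
vals, idx = compress(acts)            # zeros are neither stored nor fetched
assert np.array_equal(decompress(vals, idx, acts.shape), acts)
```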
Other details in the paper: implementation and evaluation.
Next Friday: student presentations
Tuesday: minor 1
3-4 more lectures on architecture