Exploring Efficient Hardware Architectures for Deep Neural Network Processing
Discover new hardware architectures designed for efficient deep neural network processing, including SCNN accelerators for compressed-sparse Convolutional Neural Networks. Learn about convolution operations, memory size versus access energy, dataflow decisions for reuse, and Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow strategies. Dive into concepts such as intra and inter PE parallelism, and understand how these architectures enhance performance for inference and deep learning tasks.
Presentation Transcript
New hardware architectures for efficient deep net processing
SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. Nine authors from NVIDIA, MIT, Berkeley, and Stanford; ISCA 2017.
Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
- N = 1 for inference (batch size of one)
- Reuse activations: Input Stationary (IS); see the loop-nest sketch below
- Reuse filters
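To make the Input Stationary idea concrete, here is a minimal loop-nest sketch in Python (my own illustration, not the paper's pseudocode) for a single convolutional layer at batch size N = 1: each input activation is fetched once and held while every weight that touches it streams past, with the products scattered to their output coordinates.

```python
import numpy as np

# Minimal sketch of an input-stationary (IS) loop order for one conv layer at
# batch size N = 1 (inference); stride 1 and no padding are assumed for simplicity.
def conv_input_stationary(inputs, weights):
    """inputs: (C, H, W) activations; weights: (K, C, R, S) filters.
    Returns outputs of shape (K, H - R + 1, W - S + 1)."""
    C, H, W = inputs.shape
    K, _, R, S = weights.shape
    out_h, out_w = H - R + 1, W - S + 1
    outputs = np.zeros((K, out_h, out_w))
    for c in range(C):
        for x in range(H):
            for y in range(W):
                a = inputs[c, x, y]           # the "stationary" activation
                for k in range(K):            # every filter reuses the same activation
                    for r in range(R):
                        for s in range(S):
                            ox, oy = x - r, y - s
                            if 0 <= ox < out_h and 0 <= oy < out_w:
                                outputs[k, ox, oy] += a * weights[k, c, r, s]
    return outputs
```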
Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP: inter-PE parallelism, intra-PE parallelism, output coordinate
Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
Intra-PE parallelism: Cartesian Product (CP), all-to-all multiplications
- Each PE contains F*I multipliers
- A vector of F filter weights is fetched
- A vector of I input activations is fetched
- Multiplier outputs are sent to an accumulator to compute partial sums
- The accumulator has F*I adders to match multiplier throughput
- Each partial sum is written at the matching coordinate in the output activation space (a toy sketch of this step follows below)
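A toy sketch of the Cartesian-product step, using illustrative data structures rather than SCNN's actual buffers: a vector of F weights and a vector of I activations are multiplied all-to-all, and each of the F*I partial products is accumulated at the output coordinate implied by the activation's position and the weight's (r, s) offset.

```python
# Illustrative sketch of the intra-PE Cartesian product (CP).
# weights_vec: list of (k, r, s, wval) non-zero weights      (length F)
# acts_vec:    list of (x, y, aval)    non-zero activations  (length I)
# accum:       dict mapping (k, out_x, out_y) -> partial sum (the accumulation buffer)
def cartesian_product_step(weights_vec, acts_vec, accum):
    for (k, r, s, wval) in weights_vec:           # F weights
        for (x, y, aval) in acts_vec:             # I activations -> F*I products
            out_x, out_y = x - r, y - s           # output coordinate for this pair
            key = (k, out_x, out_y)               # boundary checks omitted for brevity
            accum[key] = accum.get(key, 0.0) + wval * aval   # one of F*I additions
    return accum
```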
Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP
Planar Tiled (PT): division of the input activation map for inter-PE parallelism
- Each PE processes C*Wt*Ht inputs
- Output halos: the accumulation buffer contains incomplete partial sums that are communicated to neighboring PEs (see the tiling sketch below)
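A rough sketch of the planar tiling and its halo regions; the helper names and the boundary convention (an output at (out_r, out_c) needs input rows out_r..out_r+R-1 and columns out_c..out_c+S-1) are my own assumptions for illustration, not taken from the paper.

```python
# Each PE owns one Ht x Wt tile of the H x W plane across all C input channels,
# i.e. C * Ht * Wt input activations.
def tile_ranges(H, W, Ht, Wt):
    """Yield (row_start, row_end, col_start, col_end) for each PE's input tile."""
    for r0 in range(0, H, Ht):
        for c0 in range(0, W, Wt):
            yield (r0, min(r0 + Ht, H), c0, min(c0 + Wt, W))

def is_halo_output(out_r, out_c, tile, R, S):
    """True when this output's R x S filter footprint reaches past the tile's
    input boundary: its partial sum in this PE's accumulation buffer is then
    incomplete and must be merged with a neighboring PE's contribution."""
    r0, r1, c0, c1 = tile
    rows_outside = out_r < r0 or out_r + R - 1 >= r1
    cols_outside = out_c < c0 or out_c + S - 1 >= c1
    return rows_outside or cols_outside
```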
Last week: sparsity in the FC layer
- Dynamic sparsity (input dependent): created by the ReLU operator on specific activations during inference, so activations might be 0
- Static sparsity of the network: created by pruning during training, so weights might be 0
- Weight quantization and sharing: each non-zero weight is a 4-bit index into a table of shared weights
- Only non-zero weights and non-zero activations are stored (a toy sketch follows below)
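As a recap of how those two sparsity sources are exploited in an FC layer, here is a toy sketch; the table size, values, and layer shape are illustrative, not taken from the paper.

```python
import numpy as np

# Static sparsity from pruning: surviving weights stored as (row, col, index)
# triples, where a 4-bit index selects one of 16 shared weight values.
# Dynamic sparsity from ReLU: zero activations are skipped at inference time.
shared_weights = np.linspace(-1.0, 1.0, 16)           # 16 entries -> 4-bit index
sparse_weights = [(0, 1, 12), (0, 3, 5), (1, 3, 14)]  # non-pruned weights of a 2x4 layer

def sparse_fc(activations, sparse_weights, out_dim):
    """Multiply only non-zero activations by the stored non-zero weights."""
    out = np.zeros(out_dim)
    nz_acts = {j: a for j, a in enumerate(activations) if a != 0.0}  # skip ReLU zeros
    for row, col, idx in sparse_weights:
        if col in nz_acts:                             # both operands are non-zero
            out[row] += shared_weights[idx] * nz_acts[col]
    return np.maximum(out, 0.0)                        # ReLU creates the next layer's sparsity

acts = np.array([0.0, 0.7, 0.0, 1.2])                  # zeros left by a previous ReLU
print(sparse_fc(acts, sparse_weights, out_dim=2))
```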
Weight sparsity (static, from pruning) and activation sparsity (dynamic, from ReLU) are present in conv layers too.
SCNN applies a similar compression to zero weights and activations: not all F weights and I input activations are stored and fetched in the dataflow, only the non-zero ones.
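The sketch below shows that compression in the simplest possible form. SCNN's actual encoding stores the number of zeros between successive non-zero values; a plain coordinate list is used here only for brevity, and only these compressed vectors of weights and activations would be fetched into a PE's F*I multiplier array.

```python
import numpy as np

def compress(tensor):
    """Return the non-zero values of `tensor` and their coordinates."""
    coords = np.argwhere(tensor != 0)
    values = tensor[tensor != 0]
    return values, coords

def decompress(values, coords, shape):
    """Rebuild the dense tensor from its compressed form."""
    out = np.zeros(shape)
    for v, c in zip(values, coords):
        out[tuple(c)] = v
    return out

acts = np.array([[0.0, 1.5, 0.0],
                 [0.0, 0.0, 2.0]])
vals, idx = compress(acts)            # zeros are neither stored nor fetched
assert np.array_equal(decompress(vals, idx, acts.shape), acts)
```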
Other details in the paper: implementation and evaluation.
Next Friday: student presentations
Tuesday: minor 1
3-4 more lectures on architecture