Optimizing DNN Pruning for Hardware Efficiency


Customizing deep neural network (DNN) pruning to the parallelism of the underlying hardware can significantly reduce storage and computation costs. This presentation covers weight pruning and node pruning and how they behave on different hardware, from microcontrollers to CPUs and GPUs. Traditional pruning has drawbacks that must be managed, such as increased execution time and the extra storage required by sparse formats. The Scalpel approach addresses them by matching the pruning technique to the hardware's level of parallelism and by optimizing matrix-vector multiplication for SIMD instructions.



Presentation Transcript


  1. Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism. Jiecao Yu, Andrew Lukefahr, Reetuparna Das, Scott Mahlke (University of Michigan, Ann Arbor); David Palframan, Ganesh Dasika (ARM Research). ISCA, June 28, 2017

  2. Deep Neural Networks: DNNs have high storage and computation costs (AlexNet: 240 MB, 1.5 GFLOP; VGG: 550 MB, 30.9 GFLOP), so DNN compression, which removes unimportant parameters, is necessary.

  3. DNN Architecture: each layer multiplies a weight matrix by an input vector and applies a nonlinearity to produce the output vector, Y = f(W · X). [Figure: example 6x6 weight matrix multiplied by an input vector.] Weight matrices contain high redundancy.
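
A minimal sketch of the computation this slide depicts, a fully connected layer as a matrix-vector product followed by a nonlinearity, using NumPy; the shapes and the choice of ReLU are illustrative assumptions:

```python
import numpy as np

def fc_layer(W, x):
    """Fully connected layer: y = f(W @ x), here with ReLU as f."""
    return np.maximum(W @ x, 0.0)

# Toy example with a 4x6 weight matrix and a 6-element input vector.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
x = rng.standard_normal(6)
y = fc_layer(W, x)           # output vector of length 4
```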

  4. Weight Pruning*: keep only the weights with |weight| > threshold and set the rest to zero, which reduces both computation and storage. [Figure: example weight matrix before and after magnitude-based pruning.] * Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS 2015.
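
A minimal sketch of magnitude-based weight pruning as described on the slide (keep |weight| > threshold, zero the rest); the threshold value and example matrix are illustrative assumptions:

```python
import numpy as np

def prune_weights(W, threshold):
    """Zero out weights whose magnitude does not exceed the threshold."""
    mask = np.abs(W) > threshold
    return W * mask, mask

W = np.array([[0., 2., 0., 8.],
              [5., 0., 0., 1.],
              [0., 0., 9., 6.]])
W_pruned, kept = prune_weights(W, threshold=4.0)
print(W_pruned)      # only entries with |w| > 4 survive
```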

  5. Drawbacks: the sparse format needs extra storage. In the CSR format, the nonzero weights are stored together with one column index per weight (JA) and a row-pointer array (IA). [Figure: CSR encoding (weights, column indexes JA, row pointers IA) of an example sparse weight matrix.]
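
A minimal sketch of the CSR encoding the slide illustrates, showing that every stored weight carries a column index (JA) in addition to the row pointers (IA); the example matrix is arbitrary:

```python
import numpy as np

def to_csr(W):
    """Convert a dense matrix to CSR: values, column indexes (JA), row pointers (IA)."""
    values, JA, IA = [], [], [0]
    for row in W:
        for col, w in enumerate(row):
            if w != 0:
                values.append(w)
                JA.append(col)          # one column index per nonzero weight
        IA.append(len(values))          # running count of nonzeros after each row
    return np.array(values), np.array(JA), np.array(IA)

W = np.array([[0, 0, 6, 0, 5, 7],
              [0, 0, 0, 4, 0, 0],
              [0, 4, 3, 0, 0, 0],
              [1, 2, 0, 0, 0, 0]])
values, JA, IA = to_csr(W)   # len(JA) == len(values): one index of overhead per weight
```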

  6. Drawbacks: execution time increases, because the computation reduction is not fully utilized and decoding the sparse format adds extra computation. [Chart: pruned AlexNet relative to the unpruned baseline (100%): model size roughly 22%, computation roughly 42%, execution time 125% on the CPU and 334% on the GPU. AlexNet was not tested on the microcontroller.]

  7. Scalpel: Scalpel picks the pruning technique to match the hardware's parallelism. Low parallelism (microcontrollers: no cache, low storage of roughly 100 KB) uses SIMD-aware weight pruning. Moderate parallelism (CPUs: ILP / MLP) uses SIMD-aware weight pruning for FC layers and node pruning for CONV layers. High parallelism (GPUs: TLP, high-bandwidth but long-latency memory) uses node pruning. The trained DNN passes through the selected technique(s) to produce the pruned DNN.
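
A minimal sketch of the selection logic described by the slide's flow chart (low parallelism: SIMD-aware weight pruning; high parallelism: node pruning; moderate parallelism: SIMD-aware pruning for FC layers and node pruning for CONV layers); the function name and string labels are hypothetical:

```python
def choose_pruning(parallelism, layer_type=None):
    """Pick the Scalpel pruning technique for a hardware/layer combination."""
    if parallelism == "low":          # e.g. microcontroller: no cache, ~100 KB storage
        return "simd_aware_weight_pruning"
    if parallelism == "high":         # e.g. GPU: TLP, high-bandwidth / long-latency memory
        return "node_pruning"
    # Moderate parallelism (e.g. CPU with ILP/MLP): decide per layer type.
    return "simd_aware_weight_pruning" if layer_type == "FC" else "node_pruning"

choose_pruning("moderate", layer_type="CONV")   # -> "node_pruning"
```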

  8. Matrix-Vector Multiplication with SIMD: assume a SIMD width of 2, so one instruction performs two loads or two multiply-accumulates (MACs). [Figure: worked example of a dense matrix-vector product computed two elements at a time.] SIMD benefits dense matrix-vector multiplication.
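
A minimal sketch of the width-2 computation in the slide's example: each step consumes two weights and two inputs, mimicking one SIMD load/MAC pair; plain Python is used only to make the grouping explicit:

```python
import numpy as np

def dense_row_dot_simd2(row, x):
    """Dot product of one matrix row with x, processed two elements at a time."""
    acc = 0.0
    for j in range(0, len(row), 2):                    # SIMD width = 2
        acc += row[j] * x[j] + row[j + 1] * x[j + 1]   # two MACs per "instruction"
    return acc

row = np.array([0., 0., 6., 0., 5., 7.])
x   = np.array([2., 0., 1., 2., 1., 1.])
dense_row_dot_simd2(row, x)                            # == 18.0
```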

  9. Sparse Matrix-Vector Multiplication: with the CSR format, each weight is fetched through its own column index and multiplied one at a time, so the SIMD units are not fully utilized and the column indexes require extra storage. [Figure: worked example of a sparse matrix-vector product in CSR form, one MAC per step.]
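
A minimal sketch of the CSR matrix-vector step shown on the slide: each nonzero weight is processed individually through its column index, so only one MAC happens per step; variable names follow the CSR sketch above:

```python
import numpy as np

def csr_matvec(values, JA, IA, x):
    """Sparse matrix-vector product in CSR form, one weight (and one index lookup) per MAC."""
    y = np.zeros(len(IA) - 1)
    for i in range(len(y)):
        for k in range(IA[i], IA[i + 1]):
            y[i] += values[k] * x[JA[k]]     # single MAC; the SIMD lanes sit mostly idle
    return y

values = np.array([6., 5., 7.])
JA     = np.array([2, 4, 5])                 # one column index per weight
IA     = np.array([0, 3])                    # a single row with three nonzeros
x      = np.array([2., 0., 1., 2., 1., 1.])
csr_matvec(values, JA, IA, x)                # == [18.]
```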

  10. Weights in Groups: weights are stored in aligned groups of the SIMD width, with one column index per group. The SIMD units are fully utilized and fewer column indexes are needed. [Figure: worked example of the grouped sparse matrix-vector product, two MACs per step.]
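
A minimal sketch of the grouped format on the slide: weights are kept in aligned pairs (SIMD width = 2) and each pair shares a single starting column index, so every step issues two MACs. The `(start_col, (w0, w1))` encoding is an assumption about how the groups would be laid out:

```python
import numpy as np

def grouped_row_dot(groups, x):
    """Dot product for one row stored as (start_col, (w0, w1)) groups of SIMD width 2."""
    acc = 0.0
    for start, (w0, w1) in groups:
        acc += w0 * x[start] + w1 * x[start + 1]   # two MACs per group, one index per group
    return acc

# Row [0, 0, 6, 0, 5, 7] kept as two aligned groups starting at columns 2 and 4.
groups = [(2, (6.0, 0.0)), (4, (5.0, 7.0))]
x = np.array([2., 0., 1., 2., 1., 1.])
grouped_row_dot(groups, x)                         # == 18.0
```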

  11. SIMD-Aware Weight Pruning: start with the original weight matrix; group the weights with the group size equal to the SIMD width; calculate each group's importance as its root mean square, e.g. sqrt((a² + b²) / 2) for a group (a, b); remove the redundant groups whose importance is below the threshold. This reduces the number of weights, the model size, and the execution time. [Figure: example weight matrix pruned group by group.]
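
A minimal sketch of the pruning step the slide describes: weights are grouped to the SIMD width, each group's importance is its root mean square, and groups below a threshold are removed; the threshold value, group size, and example matrix are illustrative:

```python
import numpy as np

def simd_aware_prune(W, group_size=2, threshold=3.0):
    """Zero out aligned weight groups whose RMS importance is below the threshold."""
    W = W.copy()
    for i in range(W.shape[0]):
        for j in range(0, W.shape[1], group_size):
            group = W[i, j:j + group_size]
            rms = np.sqrt(np.mean(group ** 2))       # e.g. sqrt((a^2 + b^2) / 2)
            if rms < threshold:
                W[i, j:j + group_size] = 0.0         # remove the whole group
    return W

W = np.array([[0., 2., 0., 8., 0., 5.],
              [0., 0., 1., 0., 0., 9.],
              [6., 0., 2., 4., 0., 1.]])
simd_aware_prune(W)    # groups with low RMS importance are zeroed together
```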

  12. Performance Benefit, Low Parallelism (ARM Cortex-M4 microcontroller): [Chart: relative execution time vs. pruning rate (0.3 to 1.0) for traditional weight pruning and SIMD-aware weight pruning against the dense baseline, with labeled points at 68% and 48%.] SIMD-aware weight pruning reaches lower execution times than traditional weight pruning.

  13. Performance, High Parallelism (NVIDIA GTX Titan X): [Chart: relative execution time of traditional weight pruning vs. pruning rate (0.9 to 1.0) against the dense baseline, with a label at a 96.3% pruning rate; the pruned network is slower than the dense baseline over most of this range.] Sparsity hurts performance here, so the network should be pruned without introducing sparsity.

  14. Node Pruning: keep a regular DNN structure with no sparsity by removing redundant nodes, i.e. neurons in FC layers and feature maps in CONV layers. Mask layers find the unimportant nodes. A mask layer is added after a layer; each node's mask value is a = 0 (blocked) if its parameter b < T and a = 1 (kept) if b ≥ T. The mask layers are trained, and then the redundant nodes and the mask layers are removed.
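
A minimal sketch of the mask-layer idea on the slide: each node gets a parameter b, the mask value a is 0 (blocked) when b < T and 1 (kept) when b ≥ T, and blocked nodes' outputs are zeroed. How b is trained is omitted here; the parameter names follow the slide and the example values are made up:

```python
import numpy as np

def mask_layer(activations, b, T=0.5):
    """Apply a node-pruning mask: a = 1 where b >= T (kept), 0 where b < T (blocked)."""
    a = (b >= T).astype(activations.dtype)
    return activations * a              # blocked nodes contribute nothing downstream

acts = np.array([1.3, -0.2, 0.7, 2.1])  # one activation per node (neuron / feature map)
b    = np.array([0.9, 0.1, 0.6, 0.2])   # trained mask parameters, one per node
mask_layer(acts, b)                     # second and fourth nodes blocked -> [1.3, 0., 0.7, 0.]
```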

  15. Combined Pruning, Moderate Parallelism (Intel Core i7-6700 CPU): the impact of sparsity on computation performance differs between FC and CONV layers. [Charts: relative execution time vs. pruning rate under SIMD-aware weight pruning for FC layers (labeled point at 84%) and CONV layers (labeled point at 3%), against the dense baseline.] Scalpel therefore combines SIMD-aware weight pruning (FC layers) with node pruning (CONV layers) on the CPU.

  16. Evaluation Methodology: Networks are LeNet-300-100 and LeNet-5 on MNIST, ConvNet and Network-in-Network on CIFAR-10, and AlexNet on ImageNet. Hardware: low parallelism, ARM Cortex-M4 microcontroller; moderate parallelism, Intel Core i7-6700 CPU; high parallelism, NVIDIA GTX Titan X. Baseline: traditional weight pruning (Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS 2015).

  17. Results: ARM Cortex-M4 Microcontroller. [Charts: relative execution time and relative model size for the original networks, traditional pruning, and Scalpel.] Scalpel achieves 28% of the original execution time and 12% of the original model size.

  18. Results: Intel Core i7-6700 CPU. [Charts: relative execution time and relative model size for the original networks, traditional pruning, and Scalpel.] Scalpel achieves 38% of the original execution time and 18% of the original model size.

  19. Results: NVIDIA GTX Titan X. [Charts: relative execution time and relative model size for the original networks, traditional pruning, and Scalpel.] Scalpel achieves 80% of the original execution time and 47% of the original model size.

  20. Conclusions: traditional weight pruning has drawbacks, namely that the sparse format needs extra storage and execution time increases. Scalpel customizes pruning for hardware with different levels of parallelism: SIMD-aware weight pruning utilizes the SIMD units, and node pruning avoids sparsity. Relative execution time on the microcontroller / CPU / GPU: traditional pruning 53%, 94%, and 242%; Scalpel 28%, 38%, and 80%.

  21. Q & A

  22. Thank you!

  23. Networks: LeNet-300-100 (MNIST), 0 CONV and 3 FC layers, 1.50% error. LeNet-5 (MNIST), 2 CONV and 2 FC layers, 0.68% error. ConvNet (CIFAR-10), 3 CONV and 1 FC layer, 18.14% error. Network-in-Network (CIFAR-10), 9 CONV and 0 FC layers, 10.43% error. AlexNet (ImageNet), 5 CONV and 3 FC layers, 19.73% top-5 error.

  24. Group Importance Measurement: candidate measures are the maximum absolute value (MAX), the mean absolute value (MEAN), and the root-mean-square (RMS). [Chart: relative accuracy vs. pruning rate of fc1 in LeNet-300-100 (0.88 to 1.0) for the RMS, MEAN, and MAX measures.]
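
A minimal sketch of the three group-importance measures compared on the slide (MAX, MEAN, RMS), written for a single weight group; the example group and function name are arbitrary:

```python
import numpy as np

def group_importance(group, metric="RMS"):
    """Importance of one weight group under the MAX / MEAN / RMS measures."""
    group = np.abs(np.asarray(group, dtype=float))
    if metric == "MAX":
        return group.max()                  # maximum absolute value
    if metric == "MEAN":
        return group.mean()                 # mean absolute value
    return np.sqrt(np.mean(group ** 2))     # root-mean-square (the measure used on slide 11)

g = [0.8, -0.1]
{m: group_importance(g, m) for m in ("MAX", "MEAN", "RMS")}
```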

  25. Nodes Removed in Network-in-Network: [Chart: remaining vs. removed nodes (feature maps) per layer for conv1, cccp1, cccp2, conv2, cccp3, cccp4, conv3, cccp5, and cccp6.]

  26. Results: High Parallelism (NVIDIA GTX Titan X). [Charts: relative execution time and relative model size for the original networks, traditional pruning, an optimized traditional variant, and Scalpel.] Scalpel achieves 80% of the original execution time and 47% of the original model size.

  27. Combined Pruning, Moderate Parallelism (Intel Core i7-6700 CPU): repeats slide 15. The impact of sparsity differs between FC and CONV layers, so SIMD-aware weight pruning is applied to FC layers and node pruning to CONV layers.
