Optimizing DNN Pruning for Hardware Efficiency

Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism

Jiecao Yu (1), Andrew Lukefahr (1), David Palframan (2), Ganesh Dasika (2), Reetuparna Das (1), Scott Mahlke (1)
(1) University of Michigan, Ann Arbor
(2) ARM Research

ISCA, June 28, 2017

Deep Neural Networks

High storage and computation cost:
- AlexNet: 240 MB, 1.5 GFLOP
- VGG: 550 MB, 30.9 GFLOP
DNN compression is necessary: remove unimportant parameters

DNN Architecture

Weight Matrix x Input Vector = Output Vector
High redundancy in the weight matrix

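For reference, the fully-connected computation the slide depicts, as a one-line NumPy sketch with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))   # weight matrix (outputs x inputs)
x = rng.normal(size=6)        # input vector
y = W @ x                     # output vector
print(y.shape)                # (4,)
```
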
Weight Pruning *

Remove small weights from the weight matrix: keep only weights with |weight| > threshold (see the sketch below)
Reduces computation and storage

* Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS. 2015.

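A minimal NumPy sketch of this magnitude-based pruning step, assuming an arbitrary threshold (the value below is illustrative, not taken from the paper):

```python
import numpy as np

def prune_by_magnitude(weights, threshold):
    """Keep weights with |w| > threshold; zero out the rest."""
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Illustrative 4x6 weight matrix
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
W_pruned, mask = prune_by_magnitude(W, threshold=0.5)
print("fraction of weights removed:", 1.0 - mask.mean())
```
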
Drawbacks: Sparse Format Needs Extra Storage

The pruned weight matrix is stored in a sparse format such as CSR (see the sketch below):
- A: the remaining (nonzero) weights
- JA: column indexes, one for each remaining weight
- IA: row pointers into A and JA
One column index must be stored for every remaining weight.

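As a sketch of the index overhead, SciPy's CSR representation exposes exactly these three arrays; the matrix values here are arbitrary:

```python
import numpy as np
from scipy.sparse import csr_matrix

W = np.array([[0, 2, 0, 8],
              [0, 0, 1, 0],
              [6, 0, 2, 4]], dtype=np.float32)

W_csr = csr_matrix(W)
print("A  (weights):        ", W_csr.data)     # remaining nonzero weights
print("JA (column indexes): ", W_csr.indices)  # one column index per weight
print("IA (row pointers):   ", W_csr.indptr)   # where each row starts in A / JA
```
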
Drawbacks: Execution Time Increase

- The computation reduction is not fully utilized
- Extra computation is needed to decode the sparse format
Pruned AlexNet * relative to the unpruned baseline: 22% model size, 42% computation, 125% execution time on CPU, 334% execution time on GPU

* AlexNet not tested on the microcontroller

Scalpel

Scalpel takes a trained DNN and selects the pruning technique from the hardware parallelism of the target and the layer type (FC or CONV), producing a pruned DNN:
- Low parallelism - microcontroller (no cache, low storage of ~100 KB): SIMD-aware weight pruning
- Moderate parallelism - CPU (ILP / MLP): SIMD-aware weight pruning for FC layers, node pruning for CONV layers
- High parallelism - GPU (TLP, high-bandwidth / long-latency memory): node pruning

Matrix-Vector Multiplication with SIMD

SIMD benefits dense matrix-vector multiplication
Assume SIMD width = 2: two loads / multiply-accumulates (MAC) in one instruction
(Figure: a dense row multiplied against the input vector, two elements per instruction.)

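A scalar sketch of the access pattern the SIMD unit exploits: with width 2, one instruction loads two adjacent weights and two adjacent inputs and performs two MACs (plain Python, values arbitrary):

```python
import numpy as np

def dense_row_times_vector_simd2(row, x):
    """Dense dot product processed two elements at a time (SIMD width = 2)."""
    acc = 0.0
    for j in range(0, len(row), 2):
        # one 'instruction': two contiguous loads from row, two from x, two MACs
        acc += row[j] * x[j] + row[j + 1] * x[j + 1]
    return acc

row = np.array([0., 6., 0., 5., 7., 0.])
x   = np.array([1., 1., 2., 1., 1., 2.])
print(dense_row_times_vector_simd2(row, x))   # 18.0, matches np.dot(row, x)
```
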
Sparse Matrix-Vector Multiplication

(Figure: a sparse row stored as weights plus per-weight column indexes, multiplied against the input vector.)
- SIMD not fully utilized
- Extra storage for column indexes

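The same dot product in CSR-style sparse form, as a sketch: every remaining weight carries its own column index, and the indexed gather through x is what prevents full SIMD utilization:

```python
import numpy as np

def sparse_row_times_vector(values, col_idx, x):
    """CSR-style dot product: one index lookup per remaining weight."""
    acc = 0.0
    for v, j in zip(values, col_idx):
        acc += v * x[j]   # scalar MAC after an indexed load
    return acc

values  = np.array([6., 5., 7.])   # nonzero weights of one row
col_idx = np.array([1, 3, 4])      # one column index per weight
x       = np.array([1., 1., 2., 1., 1., 2.])
print(sparse_row_times_vector(values, col_idx, x))  # 18.0
```
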
Weights in Groups

(Figure: weights stored in aligned groups of SIMD width, with one column index per group.)
- SIMD units fully utilized
- Fewer column indexes

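A sketch of the grouped storage: weights are kept in aligned groups of SIMD-width elements, so only one column index is stored per group and each group maps onto one SIMD instruction. The layout below is illustrative:

```python
import numpy as np

SIMD_WIDTH = 2

def grouped_row_times_vector(group_values, group_start_cols, x):
    """Each group: SIMD_WIDTH consecutive weights plus one starting column index."""
    acc = 0.0
    for vals, c in zip(group_values, group_start_cols):
        # one SIMD instruction: contiguous loads of vals and x[c:c+SIMD_WIDTH], then MACs
        acc += np.dot(vals, x[c:c + SIMD_WIDTH])
    return acc

# Same row as above, kept as two width-2 groups (the zero rides along because its group is kept)
group_values     = [np.array([0., 6.]), np.array([5., 7.])]
group_start_cols = [0, 3]                       # one column index per group, not per weight
x = np.array([1., 1., 2., 1., 1., 2.])
print(grouped_row_times_vector(group_values, group_start_cols, x))  # 18.0
```
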
 
SIMD-Aware Weight Pruning

1. Start with the original weight matrix
2. Group the weights: group size = SIMD width
3. Calculate the importance of each group: root-mean-square of the group's weights
4. Remove redundant groups: importance < threshold
This reduces the number of weights, the model size, and the execution time (see the sketch below).

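A minimal sketch of one pruning pass along these lines, assuming an illustrative threshold: weights in each row are grouped to the SIMD width, each group's importance is its root-mean-square, and groups below the threshold are zeroed:

```python
import numpy as np

def simd_aware_prune(W, simd_width=2, threshold=0.5):
    """Zero out aligned groups of `simd_width` weights whose RMS falls below `threshold`."""
    rows, cols = W.shape
    assert cols % simd_width == 0, "pad columns to a multiple of the SIMD width first"
    groups = W.reshape(rows, cols // simd_width, simd_width)
    importance = np.sqrt(np.mean(groups ** 2, axis=-1))   # root-mean-square per group
    keep = importance >= threshold                        # groups below threshold are removed
    pruned = groups * keep[..., None]
    return pruned.reshape(rows, cols), keep

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
W_pruned, keep = simd_aware_prune(W, simd_width=2, threshold=0.8)
print("groups kept:", int(keep.sum()), "of", keep.size)
```
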
Performance Benefit: Low Parallelism

ARM Cortex-M4 microcontroller, relative execution time vs. pruning rate against the dense baseline:
- Traditional weight pruning: 68% execution time
- SIMD-aware weight pruning: 48% execution time

Performance: High Parallelism

NVIDIA GTX Titan X, traditional weight pruning, relative execution time vs. pruning rate against the dense baseline:
- A pruning rate of roughly 96.3% is needed before the pruned network runs faster than the dense baseline
- Sparsity hurts performance, so prune without introducing sparsity

Node Pruning

- Remove redundant nodes; nodes are neurons (FC layers) or feature maps (CONV layers)
- Mask layers find the unimportant nodes (see the sketch below): add mask layers, train them, then remove the redundant nodes together with the mask layers
- Keeps the regular DNN structure: no sparsity

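A sketch of the mask-layer idea in NumPy (illustrative, not the paper's exact formulation): each node gets a trainable mask parameter, the mask output is binarized against a threshold T in the forward pass, and nodes whose mask stays at 0 are removed afterwards:

```python
import numpy as np

T = 0.05  # hypothetical blocking threshold

def mask_forward(node_outputs, mask_params):
    """Binarize the mask: a node is blocked (0) if |mask| < T, kept (1) otherwise."""
    a = (np.abs(mask_params) >= T).astype(node_outputs.dtype)
    return node_outputs * a, a

# Illustrative: 5 nodes, three with near-zero mask parameters
outputs = np.array([0.7, -1.2, 0.3, 2.1, -0.4])
masks   = np.array([0.30, 0.01, 0.02, 0.50, 0.03])
masked, kept = mask_forward(outputs, masks)
print("surviving nodes:", np.flatnonzero(kept))   # the remaining nodes; the rest are removed
```
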
Combined Pruning: Moderate Parallelism

Intel Core i7-6700 CPU: impact of sparsity on computation performance
(Figure: relative execution time vs. pruning rate under SIMD-aware weight pruning, for FC layers and for CONV layers, against the dense baseline; callouts at 84% and 3%.)
- FC layers: SIMD-aware weight pruning
- CONV layers: node pruning

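A sketch of the resulting policy as a simple dispatch over hardware parallelism and layer type (the function and layer names are illustrative):

```python
def choose_pruning(layer_type, parallelism):
    """Illustrative dispatch of the pruning technique by hardware parallelism and layer type."""
    if parallelism == "low":       # microcontroller
        return "simd_aware_weight_pruning"
    if parallelism == "high":      # GPU
        return "node_pruning"
    # moderate parallelism (desktop CPU): split by layer type
    return "simd_aware_weight_pruning" if layer_type == "fc" else "node_pruning"

for name, kind in [("conv1", "conv"), ("conv2", "conv"), ("fc1", "fc"), ("fc2", "fc")]:
    print(name, "->", choose_pruning(kind, parallelism="moderate"))
```
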
 
Evaluation Methodology

Networks:
- MNIST: LeNet-300-100, LeNet-5
- CIFAR-10: ConvNet, Network-in-Network
- ImageNet: AlexNet
Hardware:
- Low parallelism: ARM Cortex-M4 microcontroller
- Moderate parallelism: Intel Core i7-6700 CPU
- High parallelism: NVIDIA GTX Titan X
Baseline: traditional weight pruning *

* Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS. 2015.

Results: ARM Cortex-M4 Microcontroller

Relative to the original dense network, Scalpel reaches 28% execution time and 12% model size.

Results: Intel Core i7-6700 CPU

Relative to the original dense network, Scalpel reaches 38% execution time and 18% model size.

Results: NVIDIA GTX Titan X

Relative to the original dense network, Scalpel reaches 80% execution time and 47% model size.

Conclusions

Traditional weight pruning has drawbacks:
- Sparse format needs extra storage
- Execution time can increase
Scalpel customizes pruning for hardware with different parallelism:
- SIMD-aware weight pruning: utilizes SIMD units
- Node pruning: avoids sparsity
Relative execution time on microcontroller / CPU / GPU:
- Traditional pruning: 53%, 94%, and 242%
- Scalpel: 28%, 38%, and 80%
 
Q & A
 
Thank you!
 
Networks

Model               Dataset    CONV layers  FC layers  Error rate
LeNet-300-100       MNIST      0            3          1.50%
LeNet-5             MNIST      2            2          0.68%
ConvNet             CIFAR-10   3            1          18.14%
Network-in-Network  CIFAR-10   9            0          10.43%
AlexNet             ImageNet   5            3          19.73% (top-5)

Group Importance Measurement

Candidate measures (see the sketch below):
- Maximum absolute value (MAX)
- Mean absolute value (MEAN)
- Root-mean-square (RMS)
(Figure: relative accuracy vs. pruning rate of fc1 in LeNet-300-100 under RMS, MEAN, and MAX, with a callout at a 98% pruning rate.)
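A small sketch of the three candidate measures for a single weight group (the group values are arbitrary):

```python
import numpy as np

def group_importance(group, measure="rms"):
    """Importance of one weight group under MAX, MEAN, or RMS."""
    g = np.abs(np.asarray(group, dtype=np.float64))
    if measure == "max":
        return float(g.max())
    if measure == "mean":
        return float(g.mean())
    return float(np.sqrt(np.mean(g ** 2)))   # root-mean-square

group = [0.2, -0.8]
for m in ("max", "mean", "rms"):
    print(m, round(group_importance(group, m), 4))
```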
 
Nodes Removed in Network-in-Network

(Figure: remaining vs. removed nodes for each layer: conv1, cccp1, cccp2, conv2, cccp3, cccp4, conv3, cccp5, cccp6.)

Results: High Parallelism (NVIDIA GTX Titan X)

Relative to the original dense network, Scalpel reaches 80% execution time and 47% model size.
(Figure: relative execution time and model size for Original, Traditional, Optimized, and Scalpel.)

Combined Pruning: Moderate Parallelism

Intel Core i7-6700 CPU: impact of sparsity on computation performance
(Figure: relative execution time vs. pruning rate under SIMD-aware weight pruning, for FC layers and for CONV layers, against the dense baseline; callouts at 84% and 3%.)
- FC layers: SIMD-aware weight pruning
- CONV layers: node pruning

28
Hardware
 
Parallelism
 
Low
 
parallelism
 
-
 
microcontroller
No
 
cache
L
ow storage (~100
 
KB)
 
Moderate
 
parallelism
 
-
 
desktop
 
CPU
ILP / MLP
Deep
 
cache
 
hierarchy
~8
 
MB
 
on-chip
 
SRAM
 
High
 
parallelism
 
-
 
GPU
TLP
High bandwidth / long latency memory
2-12
 
GB
 
storage