Optimizing DNN Pruning for Hardware Efficiency

Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism

Jiecao Yu (1), Andrew Lukefahr (1), David Palframan (2), Ganesh Dasika (2), Reetuparna Das (1), Scott Mahlke (1)
(1) University of Michigan, Ann Arbor
(2) ARM Research

ISCA, June 28, 2017

Deep Neural Networks

High storage and computation cost:
- AlexNet: 240 MB, 1.5 GFLOP
- VGG: 550 MB, 30.9 GFLOP
DNN compression is necessary: remove unimportant parameters

DNN Architecture

Weight Matrix x Input Vector = Output Vector
High redundancy in the weight matrix

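For reference, the fully-connected computation the slide depicts, as a one-line NumPy sketch with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))   # weight matrix (outputs x inputs)
x = rng.normal(size=6)        # input vector
y = W @ x                     # output vector
print(y.shape)                # (4,)
```
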
Weight Pruning *

Remove small weights from the weight matrix: keep only weights with |weight| > threshold (see the sketch below)
Reduces computation and storage

* Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS. 2015.

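A minimal NumPy sketch of this magnitude-based pruning step, assuming an arbitrary threshold (the value below is illustrative, not taken from the paper):

```python
import numpy as np

def prune_by_magnitude(weights, threshold):
    """Keep weights with |w| > threshold; zero out the rest."""
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Illustrative 4x6 weight matrix
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
W_pruned, mask = prune_by_magnitude(W, threshold=0.5)
print("fraction of weights removed:", 1.0 - mask.mean())
```
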
Drawbacks: Sparse Format Needs Extra Storage

The pruned weight matrix is stored in a sparse format such as CSR (see the sketch below):
- A: the remaining (nonzero) weights
- JA: column indexes, one for each remaining weight
- IA: row pointers into A and JA
One column index must be stored for every remaining weight.

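As a sketch of the index overhead, SciPy's CSR representation exposes exactly these three arrays; the matrix values here are arbitrary:

```python
import numpy as np
from scipy.sparse import csr_matrix

W = np.array([[0, 2, 0, 8],
              [0, 0, 1, 0],
              [6, 0, 2, 4]], dtype=np.float32)

W_csr = csr_matrix(W)
print("A  (weights):        ", W_csr.data)     # remaining nonzero weights
print("JA (column indexes): ", W_csr.indices)  # one column index per weight
print("IA (row pointers):   ", W_csr.indptr)   # where each row starts in A / JA
```
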
Drawbacks: Execution Time Increase

- The computation reduction is not fully utilized
- Extra computation is needed to decode the sparse format
Pruned AlexNet * relative to the unpruned baseline: 22% model size, 42% computation, 125% execution time on CPU, 334% execution time on GPU

* AlexNet not tested on the microcontroller

Scalpel

Scalpel takes a trained DNN and selects the pruning technique from the hardware parallelism of the target and the layer type (FC or CONV), producing a pruned DNN:
- Low parallelism - microcontroller (no cache, low storage of ~100 KB): SIMD-aware weight pruning
- Moderate parallelism - CPU (ILP / MLP): SIMD-aware weight pruning for FC layers, node pruning for CONV layers
- High parallelism - GPU (TLP, high-bandwidth / long-latency memory): node pruning

Matrix-Vector Multiplication with SIMD

SIMD benefits dense matrix-vector multiplication
Assume SIMD width = 2: two loads / multiply-accumulates (MAC) in one instruction
(Figure: a dense row multiplied against the input vector, two elements per instruction.)

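A scalar sketch of the access pattern the SIMD unit exploits: with width 2, one instruction loads two adjacent weights and two adjacent inputs and performs two MACs (plain Python, values arbitrary):

```python
import numpy as np

def dense_row_times_vector_simd2(row, x):
    """Dense dot product processed two elements at a time (SIMD width = 2)."""
    acc = 0.0
    for j in range(0, len(row), 2):
        # one 'instruction': two contiguous loads from row, two from x, two MACs
        acc += row[j] * x[j] + row[j + 1] * x[j + 1]
    return acc

row = np.array([0., 6., 0., 5., 7., 0.])
x   = np.array([1., 1., 2., 1., 1., 2.])
print(dense_row_times_vector_simd2(row, x))   # 18.0, matches np.dot(row, x)
```
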
Sparse Matrix-Vector Multiplication

(Figure: a sparse row stored as weights plus per-weight column indexes, multiplied against the input vector.)
- SIMD not fully utilized
- Extra storage for column indexes

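The same dot product in CSR-style sparse form, as a sketch: every remaining weight carries its own column index, and the indexed gather through x is what prevents full SIMD utilization:

```python
import numpy as np

def sparse_row_times_vector(values, col_idx, x):
    """CSR-style dot product: one index lookup per remaining weight."""
    acc = 0.0
    for v, j in zip(values, col_idx):
        acc += v * x[j]   # scalar MAC after an indexed load
    return acc

values  = np.array([6., 5., 7.])   # nonzero weights of one row
col_idx = np.array([1, 3, 4])      # one column index per weight
x       = np.array([1., 1., 2., 1., 1., 2.])
print(sparse_row_times_vector(values, col_idx, x))  # 18.0
```
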
Weights in Groups

(Figure: weights stored in aligned groups of SIMD width, with one column index per group.)
- SIMD units fully utilized
- Fewer column indexes

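A sketch of the grouped storage: weights are kept in aligned groups of SIMD-width elements, so only one column index is stored per group and each group maps onto one SIMD instruction. The layout below is illustrative:

```python
import numpy as np

SIMD_WIDTH = 2

def grouped_row_times_vector(group_values, group_start_cols, x):
    """Each group: SIMD_WIDTH consecutive weights plus one starting column index."""
    acc = 0.0
    for vals, c in zip(group_values, group_start_cols):
        # one SIMD instruction: contiguous loads of vals and x[c:c+SIMD_WIDTH], then MACs
        acc += np.dot(vals, x[c:c + SIMD_WIDTH])
    return acc

# Same row as above, kept as two width-2 groups (the zero rides along because its group is kept)
group_values     = [np.array([0., 6.]), np.array([5., 7.])]
group_start_cols = [0, 3]                       # one column index per group, not per weight
x = np.array([1., 1., 2., 1., 1., 2.])
print(grouped_row_times_vector(group_values, group_start_cols, x))  # 18.0
```
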
 
SIMD-Aware Weight Pruning

1. Start with the original weight matrix
2. Group the weights: group size = SIMD width
3. Calculate the importance of each group: root-mean-square of the group's weights
4. Remove redundant groups: importance < threshold
This reduces the number of weights, the model size, and the execution time (see the sketch below).

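A minimal sketch of one pruning pass along these lines, assuming an illustrative threshold: weights in each row are grouped to the SIMD width, each group's importance is its root-mean-square, and groups below the threshold are zeroed:

```python
import numpy as np

def simd_aware_prune(W, simd_width=2, threshold=0.5):
    """Zero out aligned groups of `simd_width` weights whose RMS falls below `threshold`."""
    rows, cols = W.shape
    assert cols % simd_width == 0, "pad columns to a multiple of the SIMD width first"
    groups = W.reshape(rows, cols // simd_width, simd_width)
    importance = np.sqrt(np.mean(groups ** 2, axis=-1))   # root-mean-square per group
    keep = importance >= threshold                        # groups below threshold are removed
    pruned = groups * keep[..., None]
    return pruned.reshape(rows, cols), keep

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
W_pruned, keep = simd_aware_prune(W, simd_width=2, threshold=0.8)
print("groups kept:", int(keep.sum()), "of", keep.size)
```
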
Performance Benefit: Low Parallelism

ARM Cortex-M4 microcontroller, relative execution time vs. pruning rate against the dense baseline:
- Traditional weight pruning: 68% execution time
- SIMD-aware weight pruning: 48% execution time

Performance: High Parallelism

NVIDIA GTX Titan X, traditional weight pruning, relative execution time vs. pruning rate against the dense baseline:
- A pruning rate of roughly 96.3% is needed before the pruned network runs faster than the dense baseline
- Sparsity hurts performance, so prune without introducing sparsity

Node Pruning

- Remove redundant nodes; nodes are neurons (FC layers) or feature maps (CONV layers)
- Mask layers find the unimportant nodes (see the sketch below): add mask layers, train them, then remove the redundant nodes together with the mask layers
- Keeps the regular DNN structure: no sparsity

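A sketch of the mask-layer idea in NumPy (illustrative, not the paper's exact formulation): each node gets a trainable mask parameter, the mask output is binarized against a threshold T in the forward pass, and nodes whose mask stays at 0 are removed afterwards:

```python
import numpy as np

T = 0.05  # hypothetical blocking threshold

def mask_forward(node_outputs, mask_params):
    """Binarize the mask: a node is blocked (0) if |mask| < T, kept (1) otherwise."""
    a = (np.abs(mask_params) >= T).astype(node_outputs.dtype)
    return node_outputs * a, a

# Illustrative: 5 nodes, three with near-zero mask parameters
outputs = np.array([0.7, -1.2, 0.3, 2.1, -0.4])
masks   = np.array([0.30, 0.01, 0.02, 0.50, 0.03])
masked, kept = mask_forward(outputs, masks)
print("surviving nodes:", np.flatnonzero(kept))   # the remaining nodes; the rest are removed
```
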
Combined Pruning: Moderate Parallelism

Intel Core i7-6700 CPU: impact of sparsity on computation performance
(Figure: relative execution time vs. pruning rate under SIMD-aware weight pruning, for FC layers and for CONV layers, against the dense baseline; callouts at 84% and 3%.)
- FC layers: SIMD-aware weight pruning
- CONV layers: node pruning

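A sketch of the resulting policy as a simple dispatch over hardware parallelism and layer type (the function and layer names are illustrative):

```python
def choose_pruning(layer_type, parallelism):
    """Illustrative dispatch of the pruning technique by hardware parallelism and layer type."""
    if parallelism == "low":       # microcontroller
        return "simd_aware_weight_pruning"
    if parallelism == "high":      # GPU
        return "node_pruning"
    # moderate parallelism (desktop CPU): split by layer type
    return "simd_aware_weight_pruning" if layer_type == "fc" else "node_pruning"

for name, kind in [("conv1", "conv"), ("conv2", "conv"), ("fc1", "fc"), ("fc2", "fc")]:
    print(name, "->", choose_pruning(kind, parallelism="moderate"))
```
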
 
Evaluation Methodology

Networks:
- MNIST: LeNet-300-100, LeNet-5
- CIFAR-10: ConvNet, Network-in-Network
- ImageNet: AlexNet
Hardware:
- Low parallelism: ARM Cortex-M4 microcontroller
- Moderate parallelism: Intel Core i7-6700 CPU
- High parallelism: NVIDIA GTX Titan X
Baseline: traditional weight pruning *

* Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS. 2015.

Results: ARM Cortex-M4 Microcontroller

Relative to the original dense network, Scalpel reaches 28% execution time and 12% model size.

Results: Intel Core i7-6700 CPU

Relative to the original dense network, Scalpel reaches 38% execution time and 18% model size.

Results: NVIDIA GTX Titan X

Relative to the original dense network, Scalpel reaches 80% execution time and 47% model size.

Conclusions

Traditional weight pruning has drawbacks:
- Sparse format needs extra storage
- Execution time can increase
Scalpel customizes pruning for hardware with different parallelism:
- SIMD-aware weight pruning: utilizes SIMD units
- Node pruning: avoids sparsity
Relative execution time on microcontroller / CPU / GPU:
- Traditional pruning: 53%, 94%, and 242%
- Scalpel: 28%, 38%, and 80%
 
Q & A
 
Thank you!
 
Networks

Model               Dataset    CONV layers  FC layers  Error rate
LeNet-300-100       MNIST      0            3          1.50%
LeNet-5             MNIST      2            2          0.68%
ConvNet             CIFAR-10   3            1          18.14%
Network-in-Network  CIFAR-10   9            0          10.43%
AlexNet             ImageNet   5            3          19.73% (top-5)

Group Importance Measurement

Candidate measures (see the sketch below):
- Maximum absolute value (MAX)
- Mean absolute value (MEAN)
- Root-mean-square (RMS)
(Figure: relative accuracy vs. pruning rate of fc1 in LeNet-300-100 under RMS, MEAN, and MAX, with a callout at a 98% pruning rate.)
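A small sketch of the three candidate measures for a single weight group (the group values are arbitrary):

```python
import numpy as np

def group_importance(group, measure="rms"):
    """Importance of one weight group under MAX, MEAN, or RMS."""
    g = np.abs(np.asarray(group, dtype=np.float64))
    if measure == "max":
        return float(g.max())
    if measure == "mean":
        return float(g.mean())
    return float(np.sqrt(np.mean(g ** 2)))   # root-mean-square

group = [0.2, -0.8]
for m in ("max", "mean", "rms"):
    print(m, round(group_importance(group, m), 4))
```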
 
Nodes Removed in Network-in-Network

(Figure: remaining vs. removed nodes for each layer: conv1, cccp1, cccp2, conv2, cccp3, cccp4, conv3, cccp5, cccp6.)

Results: High Parallelism (NVIDIA GTX Titan X)

Relative to the original dense network, Scalpel reaches 80% execution time and 47% model size.
(Figure: relative execution time and model size for Original, Traditional, Optimized, and Scalpel.)

Combined Pruning: Moderate Parallelism

Intel Core i7-6700 CPU: impact of sparsity on computation performance
(Figure: relative execution time vs. pruning rate under SIMD-aware weight pruning, for FC layers and for CONV layers, against the dense baseline; callouts at 84% and 3%.)
- FC layers: SIMD-aware weight pruning
- CONV layers: node pruning

28
Hardware
 
Parallelism
 
Low
 
parallelism
 
-
 
microcontroller
No
 
cache
L
ow storage (~100
 
KB)
 
Moderate
 
parallelism
 
-
 
desktop
 
CPU
ILP / MLP
Deep
 
cache
 
hierarchy
~8
 
MB
 
on-chip
 
SRAM
 
High
 
parallelism
 
-
 
GPU
TLP
High bandwidth / long latency memory
2-12
 
GB
 
storage