Cutting-Edge Training Architecture Overview

Lecture: Training Innovations
 Topics: NVIDIA Volta, Intel NNP-T/I, ScaleDeep, vDNN
      
 No class on Tuesday
NVIDIA Volta GPU
640 tensor cores
Each tensor core performs a multiply-accumulate on 4x4 matrices (D = A x B + C)
Throughput: 128 FLOPs/cycle x 640 tensor cores x ~1.5 GHz ≈ 125 TFLOPS
FP16 multiplies with FP32 accumulation
12x better than Pascal on training and 6x better on inference
Basic matrix-multiply unit: 32 inputs feed 64 parallel
multipliers, followed by 64 parallel add operations
Reference: http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
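To make the throughput arithmetic concrete, here is a small NumPy sketch (illustrative only; ~1.5 GHz is the rounded boost clock, and NVIDIA quotes 125 TFLOPS at 1.53 GHz) of one tensor-core step and the peak-rate calculation:

```python
import numpy as np

# One tensor-core step: a 4x4 matrix multiply-accumulate, D = A x B + C.
# On Volta, A and B are FP16 and the accumulation is done in FP32.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.random.rand(4, 4).astype(np.float32)
D = A.astype(np.float32) @ B.astype(np.float32) + C

# A 4x4x4 MAC is 64 multiplies + 64 adds = 128 FLOPs per core per cycle.
flops_per_core_per_cycle = 4 * 4 * 4 * 2            # 128
tensor_cores = 640
clock_hz = 1.5e9                                     # ~1.5 GHz
peak_tflops = flops_per_core_per_cycle * tensor_cores * clock_hz / 1e12
print(f"peak ~{peak_tflops:.0f} TFLOPS")             # ~123 here; ~125 at 1.53 GHz
```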
Intel NNP-T 
Image Source: Intel presentation  HotChips’19
Intel NNP-T
Each Tensor Processing Cluster (TPC) has two 32x32 MAC grids
     supporting bfloat16 (appears similar to TPU)
Conv engine to marshal data before each step
60MB on-chip SRAM (2.5MB scratchpad per TPC)
Communication is key: grid network among TPCs, 64 SerDes
     lanes (3.6 Tb/s) for inter-chip communication, 4 HBMs
Relatively low utilization for GeMM (< 60%) and Conv (59-87%)
Image Source: Intel presentation  HotChips’19
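Since the MAC grids operate on bfloat16 (as the TPU does), here is a quick illustrative sketch of the format: bfloat16 keeps fp32's sign and 8-bit exponent and truncates the mantissa to 7 bits, so a value can be converted by keeping the top 16 bits of its fp32 encoding (real hardware typically rounds rather than truncates):

```python
import struct

def fp32_to_bfloat16_bits(x: float) -> int:
    """Truncate an fp32 value to bfloat16 by keeping its top 16 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16                     # 1 sign + 8 exponent + 7 mantissa bits

def bfloat16_bits_to_fp32(b: int) -> float:
    """Expand bfloat16 bits back to fp32 by zero-filling the low 16 bits."""
    (x,) = struct.unpack(">f", struct.pack(">I", b << 16))
    return x

b = fp32_to_bfloat16_bits(3.14159265)
print(hex(b), bfloat16_bits_to_fp32(b))   # 0x4049 3.140625: fp32 range, ~3 decimal digits
```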
Intel NNP-I
12 Inference Compute Engines (ICE) that can work together or
     independently; 24MB central cache and 4MB per ICE
Each ICE has 4K 8-bit MACs
10 W, 48 TOPS, 3,600 inferences/s
Image Source: Intel presentation  HotChips’19
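As a sanity check on the quoted 48 TOPS (illustrative arithmetic only; the clock rate below is inferred, not stated on the slide), counting each 8-bit MAC as two ops:

```python
# Infer the clock implied by the quoted peak: 12 ICEs x 4K MACs x 2 ops/MAC.
ices = 12
macs_per_ice = 4096                  # "4K" 8-bit MACs per ICE
ops_per_mac = 2                      # one multiply + one add
peak_tops = 48

ops_per_cycle = ices * macs_per_ice * ops_per_mac        # 98,304 ops/cycle
implied_clock_ghz = peak_tops * 1e12 / ops_per_cycle / 1e9
print(f"implied clock ~{implied_clock_ghz:.2f} GHz")      # ~0.49 GHz
```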
Training Take-Homes
 Create a mini-batch (a random subset of the training inputs); apply the
   fwd pass; compute the error; backprop the errors (using the transpose
   of the weights); compute the deltas for all the weights (using the Xs of
   the previous layer); aggregate the deltas and update the weights at the
   end of the mini-batch; keep creating mini-batches until you run out of
   training samples; this is one epoch.
 On the fwd pass, we use inputs X and weights W to compute output Y
 On the bwd pass, we use dY and weights W to compute dX;
                               we also use dY and X to compute dW
[Figure: fwd pass: X, W → Y;  bwd pass: dY, W, X → dX, dW]
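A minimal NumPy sketch of these relations for a single fully-connected layer (illustrative only; no biases, no activation function, plain SGD):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 100))            # mini-batch of 32 inputs
W = rng.standard_normal((100, 10)) * 0.01
target = rng.standard_normal((32, 10))

# Fwd pass: Y from X and W
Y = X @ W

# Bwd pass: dX from dY and W (transpose); dW from dY and the layer's inputs X
dY = Y - target                               # gradient of a squared-error loss
dX = dY @ W.T                                 # passed back to the previous layer
dW = X.T @ dY                                 # deltas aggregated over the mini-batch

# Weight update at the end of the mini-batch
lr = 0.01
W -= lr * dW / X.shape[0]
```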
ScaleDeep Intro
 Pays attention to training (network, storage)
 Introduces heterogeneity
 Uses a spatial pipeline 
   (instead of time-multiplexed execution)
 Heavy use of batching for efficiency
 Novel interconnect structure (for a 1.4 KW server) that
   helps with batching and training
In the previous lecture, we saw the needs of the training
algorithm.  Here, we’ll look at a hardware architecture (ScaleDeep)
that pays attention to the requirements of training.  This work is
also unusual in that it designs a 1.4 KW server for training in the
datacenter – you’ll be dazzled by the many resources that are
thrown at this problem.
The ScaleDeep architecture allocates storage resources
and designs a network that is suitable for training.
It introduces heterogeneity within and among chips.
It uses a spatial pipeline (also seen later in ISAAC).
Batching is used heavily (since throughput matters more
than response time when doing training)
Layer Breakdown
[Figure: layer breakdown. Initial layers: high weight reuse, large feature maps. Middle layers: high neuron reuse, many feature maps, 80% of all computations. Final layers: large memory, high bytes/flop. Note: FP, BP, WG.]
The authors classify the layers as initial (more reuse of
weights because of large feature maps), middle (more reuse of
input neurons because of many feature maps; also 80% of the
computations), and final (large memory, high bytes/flop).
Also notable are the growing weights as we move right.  The
bytes/flop also increases, i.e., more memory-bound.  Also
note that roughly a third of the time is spent in each of the forward
pass, backward pass, and weight-gradient update.
CompHeavy Tile
Kernels enter from top and bottom
Can do two small kernels or 1
large kernel
Inputs enter from left memory
Accumulation happens on right
Scratchpad used for partial sums
Multiple lanes in each PE (can
handle different kernels)
This figure shows a CompHeavy tile that is used for most of
the dot-product computations.  Small buffers feed inputs
from the left, weights from the top and bottom.  The inputs
flow left-to-right and the dot-products get accumulated in …
… the accumulator array at the right.  Partial sums may have
to be saved in the scratchpad.  There are 2 weight buffers
(top and bottom) in case the kernels are small and you can
cram two kernels into one CompHeavy tile.
The inputs to this tile come from a MemHeavy tile to the left.
The outputs of this tile are written to a MemHeavy tile to the
right.  The feature maps produced may also be written to
DRAM memory (for reuse during the backward pass).
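A rough functional sketch of that dataflow (the dimensions, accumulator width, and chunking policy below are made up for illustration; the real tile has multiple lanes per PE and two kernel buffers):

```python
import numpy as np

def compheavy_tile(x, kernel, acc_width=16):
    """Toy model: input elements stream left-to-right past kernel weights,
    partial dot-products accumulate on the right, and each finished chunk is
    drained (standing in for the scratchpad / MemHeavy tile on the right)."""
    K, n_out = kernel.shape
    outputs = np.zeros(n_out)
    for start in range(0, n_out, acc_width):          # one accumulator-sized chunk of outputs
        acc = np.zeros(min(acc_width, n_out - start))
        for k in range(K):                            # inputs flow left-to-right
            acc += x[k] * kernel[k, start:start + len(acc)]
        outputs[start:start + len(acc)] = acc         # drain the accumulators
    return outputs

x = np.random.rand(64)                 # one input feature vector from the left memory
kernel = np.random.rand(64, 40)        # 40 kernels / output neurons from the top buffer
print(np.allclose(compheavy_tile(x, kernel), x @ kernel))   # True
```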
MemHeavy Tile
Intra-chip heterogeneity
Mainly used for data storage.
Can compute activations and
pooling.
Within a single chip, you can also have MemHeavy tiles.
These MemHeavy tiles store the input/output feature maps.
The SFU computational units are used to compute the
activation function and to perform pooling operations.
ScaleDeep Chip
3 Comp tiles per Mem tile
(for FP, BP, and WG)
Tiles are dedicated to
specific layers (like ISAAC)
Latency/throughput trade-off
A single chip is made up of columns of MemHeavy and
CompHeavy tiles.  There are 3 CompHeavy tiles per
MemHeavy tile, one each for FP, BP, and WG, so all 3 steps
can happen in parallel.
Similar to ISAAC, tiles are dedicated to process a given layer.
This means that weights stay in place for the most part (may
occasionally spill into DRAM), but feature maps move from
column to column.  Such a “spatial pipeline” can potentially ...
... increase response time for one image (since only a fraction
of all resources are being used for any one layer), but it should
have little impact on throughput if we can set up a pipeline
where all tiles are busy all the time.
Also note that each chip is
connected to multiple memory
channels.  It turns out that each chip
has about 50 MB of data storage; so
the memory would be used when
the working set is larger (recall that
feature map storage in vDNN
requires many tens of gigabytes).
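A toy model of the latency/throughput trade-off just mentioned (assuming perfectly balanced stages and ignoring the weight-reload traffic that the spatial mapping actually avoids):

```python
# Spatial pipeline: each of L layers permanently owns 1/L of the chip's tiles.
# Time-multiplexed: the whole chip processes one layer (of one image) at a time.
L = 10                     # layers (arbitrary)
work_per_layer = 1.0       # work units per layer per image (arbitrary)
chip_rate = 1.0            # work units per unit time for the whole chip

tm_latency = L * work_per_layer / chip_rate       # 10.0 time units per image
tm_throughput = 1 / tm_latency                    # 0.1 images per time unit

stage_time = work_per_layer / (chip_rate / L)     # 10.0: each stage has 1/L of the chip
sp_latency = L * stage_time                       # 100.0: response time is L times worse
sp_throughput = 1 / stage_time                    # 0.1: same throughput once the pipe is full

print(tm_latency, tm_throughput, sp_latency, sp_throughput)
```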
Chip Heterogeneity
Tailored for
FC layer 
In addition to the heterogeneity within a chip, they also
offer heterogeneity among chips.  A Conv Chip and an FC
Chip are provisioned differently.  They both have
CompHeavy and MemHeavy tiles, but the number/size of
each tile varies.
ScaleDeep Node
Each ConvChip works
on a different image.
The outputs converge on
the FC chip (spokes) so
it can use batching. 
The wheel arcs are used
for weight updates and
for very large convs.
The FC layer is split
across multiple FCchips;
increases batching,
reduces memory,
low network bw
A single node consists of 16 Conv Chips and 4 FC Chips.  5 of
these are organized into one chip cluster and 4 chip clusters
are organized as a ring.  In one chip cluster, each Conv Chip is
responsible for 1 image.  4 Conv Chips then feed their …
… 4 output feature maps to their central FC Chip (using
“spoke” connections).  This helps improve the batching factor
in the FC Chip (important because it has the worst bytes/flop
ratio).  In fact, the FC layers are partitioned across all …
… 4 FC chips.  Thus, one FC chip does a quarter of the FC layer
for all 16 images.  This further increases the batching factor,
while requiring more feature map communication on the ring
network.
The green ring within a chip cluster (the “wheel”
connections) is used when the convolution layers need more
than one chip.  The wheel and the outer ring are also used to
gather weight gradients across images in the mini-batch.
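Putting the node-level numbers together (a small illustrative calculation using only the figures stated above):

```python
conv_chips, fc_chips = 16, 4
images_in_flight = conv_chips * 1                 # each ConvChip works on one image

# Spokes: 4 ConvChips feed one FC chip, so each FC chip batches 4 images ...
batch_per_fc_chip_via_spokes = conv_chips // fc_chips          # 4

# ... and splitting the FC layer across all 4 FC chips means each quarter of the
# FC weights is reused across all 16 in-flight images (at the cost of moving
# feature maps on the ring).
fc_weight_batching_factor = images_in_flight                   # 16
fc_weight_fraction_per_chip = 1 / fc_chips                     # 0.25

print(batch_per_fc_chip_via_spokes, fc_weight_batching_factor, fc_weight_fraction_per_chip)
```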
Results Summary
 Both chips have about 50 MB capacity
 Convchip consumes 58 W, FCchip consumes 15 W
 Inference is 3x faster than training
 One chip-cluster (320 W, comparable to a GPU) has
   5-11x speedup over the GPU 
 5x speedup over DaDianNao at iso-power
Both Conv and FC chips have about 50 MB capacity in their MemHeavy tiles (they don’t seem to
specify area per chip?).  Each chip consumes modest amounts of power, but 20 of these chips add
up to over a kilowatt of power (including the DDR memory power).  Inference is 3x faster than
training because 3x more CompHeavy tiles can be put to work on the forward pass.  One chip-
cluster consumes about 320 W, which is comparable to a GPU.  In their iso-power comparisons
against a GPU and against DaDianNao, they report about 5-11x speedups.
Note that this architecture targets training.  This was enabled with 50 MB storage and DRAM per chip,
with a novel interconnect structure (rings that pass weight gradients around), and spokes/rings that
improve the batching factor for the FC layers.   The other key contribution is the use of heterogeneity.
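A quick back-of-the-envelope check of those power numbers (the gap to the quoted totals is the DDR memory and other support power, which isn't broken out here):

```python
conv_chip_w, fc_chip_w = 58, 15

node_chip_power = 16 * conv_chip_w + 4 * fc_chip_w    # 988 W for the 20 chips in a node
cluster_chip_power = 4 * conv_chip_w + 1 * fc_chip_w  # 247 W for one chip-cluster

print(node_chip_power)      # 988 -> over a kilowatt (~1.4 kW) once memory power is added
print(cluster_chip_power)   # 247 -> quoted as ~320 W with memory, comparable to a GPU
```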
Training on GPUs
 GPUs have limited memory capacity (max 12 GB at the time)
 When training, in addition to storing weights, we must
   store all the activations as they are required during
   backprop; plus the weight deltas
 In convolution layers, the activations consume way more
   memory than the weights; convolution activations also
   have longer reuse distance than classifier weights
 vDNN is a runtime that automatically moves activations to
   CPU memory and back, so the network fits in GPU memory
   
(avoids constraining the network, using multiple GPUs, shrinking batches, or using slower algorithms)
Now we move to the vDNN paper.  The paper has a
very simple idea.  It says that GPUs are used for
training and must save all the feature maps
produced during their forward pass.  This requires
lots of memory, especially when batch size is high.
But state-of-the-art GPUs only offer 12 GB of memory, which
isn’t enough.  So feature maps that will be reused much later
during the backward pass can be moved to the CPU memory,
thus not hitting the memory capacity limit.  When required,
those feature maps are prefetched back from CPU memory.
Motivation I
This graph shows that these workloads far exceed the 12 GB
capacity and this increases with batch size.  But if those
workloads only had to keep one layer’s worth of feature maps at a
time, they’d be fine, e.g., about 37% of the 28 GB in VGG-16 (batch 256).
Motivation II
This graph shows that the feature maps account for most of
the memory usage.  The convolution can use a faster FFT-based
algorithm that requires extra memory, called workspace.
Motivation III
Can use frequency-domain FFT algorithms that are faster, but need additional workspace
This graph shows that most of the large feature maps are in
the early layers that also have longer reuse distances.
vDNN Proposal
 Runtime system that, during the fwd pass, initiates an offload of a layer’s X
  while it is computing on X (no races because X is read-only)
 The next layer waits if the offload hasn’t finished; the goal is to minimize
  memory capacity pressure
 During back-prop, initiate a prefetch of X for the previous layer; the layer
  must wait if the prefetch hasn’t finished
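A schematic sketch of this offload/prefetch schedule (this is not the paper's implementation, which overlaps asynchronous copies on a separate CUDA stream with the cuDNN kernels; here a worker thread and plain NumPy matmuls stand in for the copy engine and the layers):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

copy_engine = ThreadPoolExecutor(max_workers=1)   # stands in for the PCIe/DMA copy engine
cpu_memory = {}                                   # "CPU memory" holding offloaded X's

def offload(i, x):
    cpu_memory[i] = x.copy()                      # pretend GPU -> CPU copy

def prefetch(i):
    return cpu_memory[i]                          # pretend CPU -> GPU copy

def train_step(layers, x, dy, lr=0.01):
    # Fwd pass: start offloading each layer's input X while computing on it
    # (safe because the offload only reads X), then wait before moving on so
    # X's memory can be released.
    for i, layer in enumerate(layers):
        pending = copy_engine.submit(offload, i, x)
        x = x @ layer["W"]                        # this layer's fwd compute
        pending.result()                          # stall here if the offload isn't done
    # Bwd pass: while working on layer i, prefetch layer i-1's X.
    fetch = copy_engine.submit(prefetch, len(layers) - 1)
    for i in reversed(range(len(layers))):
        x_i = fetch.result()                      # stall here if the prefetch isn't done
        if i > 0:
            fetch = copy_engine.submit(prefetch, i - 1)   # overlap with this layer's compute
        dW = x_i.T @ dy
        dy = dy @ layers[i]["W"].T
        layers[i]["W"] -= lr * dW
    return dy

layers = [{"W": np.random.randn(8, 8) * 0.1} for _ in range(3)]
train_step(layers, np.random.randn(4, 8), np.random.randn(4, 8))
```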
System Parameters
 Baseline: NVIDIA Titan X; 7 TFLOP/s; 336 GB/s GDDR5
                  memory bandwidth; 12 GB capacity; 16 GB/s PCIe link
                  for transfers between GPU and CPU memory
 Baseline with performance features (more memory required)
 vDNNall: all layers perform an offload/prefetch (least memory)
 vDNNconv: only convolutional layers perform offload/prefetch
 vDNNdyn: explores many configs to identify the few layers
                    that need to be offloaded/prefetched, and the few
                    layers that can use the faster algos; exploration
                    isn’t expensive and uses a greedy algorithm
We can apply the offload approach to all layers, to only the conv layers, or just to the layers that
need it most, with or without the faster FFT/workspace algorithms (vDNNdyn).  The vDNNdyn scheme does
the best.  The background offload imposes at most about a 5% penalty, because off-loading can only
happen at 16 GB/s (the PCIe link bandwidth), stealing at most 16/336 ≈ 5% of the GPU’s 336 GB/s memory
bandwidth.  This interference, plus the delays waiting for offloads/prefetches, causes at most an 18% slowdown.
Memory Interference
  Early layers need the most offloading and will interfere
  the most with GDDR access; but early layers also have
  the least memory accesses for weights; at most, the
  interference will be 16/336 = 5%
Performance Results
 The largest benchmark has 18% perf loss; others are near zero
 Dynamic scheme gives best perf while fitting in memory
 Power increase of 1-7% 
References
  “Neural Networks and Deep Learning,” (Chapter 2) Michael Nielsen
  “SCALEDEEP: A Scalable Compute Architecture for Learning and
    Evaluating Deep Networks,” S. Venkataramani et al., ISCA 2017
  “vDNN: Virtualized Deep Neural Networks for Scalable, Memory-
    Efficient Neural Network Design,” M. Rhu et al., MICRO 2016