Cutting-Edge Training Architecture Overview

Delve into the latest training innovations featuring NVIDIA Volta, Intel NNP-T/I, ScaleDeep, and vDNN. Learn about the impressive capabilities of the NVIDIA Volta GPU, Intel NNP-T with Tensor Processing Clusters, and Intel NNP-I for inference tasks. Explore the intricacies of creating mini-batches, forward and backward passes, and weight updates in training algorithms. Uncover the optimizations and design choices in architecture, such as ScaleDeep's focus on training needs, network design, and resource allocation for efficient training processes.

Presentation Transcript


  1. Lecture: Training Innovations. Topics: NVIDIA Volta, Intel NNP-T/I, ScaleDeep, vDNN. (No class on Tuesday.)

  2. NVIDIA Volta GPU: 640 tensor cores; each tensor core performs a MAC on 4x4 tensors. Throughput: 128 FLOPs x 640 x 1.5 GHz = ~125 TFLOPS of FP16 multiply-accumulate operations. 12x better than Pascal on training and 6x better on inference. The basic matrix-multiply unit has 32 inputs feeding 64 parallel multipliers, followed by 64 parallel add operations. Reference: http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
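
To make the peak-throughput arithmetic concrete, here is a small back-of-the-envelope sketch. The 1.53 GHz boost clock is an assumption used to reach the quoted ~125 TFLOPS figure; at a flat 1.5 GHz the product is ~123 TFLOPS.

```python
# Back-of-the-envelope peak throughput for Volta tensor cores.
# Each tensor core does a 4x4x4 matrix multiply-accumulate per cycle:
# 64 multiplies + 64 adds = 128 FLOPs per cycle per tensor core.
flops_per_core_per_cycle = 4 * 4 * 4 * 2      # 128
num_tensor_cores = 640
clock_ghz = 1.53                              # assumed boost clock; 1.5 GHz gives ~123 TFLOPS

peak_tflops = flops_per_core_per_cycle * num_tensor_cores * clock_ghz * 1e9 / 1e12
print(f"Peak FP16 tensor-core throughput: {peak_tflops:.1f} TFLOPS")  # ~125 TFLOPS
```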

  3. Intel NNP-T. Image source: Intel presentation, Hot Chips 2019.

  4. Intel NNP-T: Each Tensor Processing Cluster (TPC) has two 32x32 MAC grids supporting bfloat16 (appears similar to the TPU). A convolution engine marshals data before each step. 60 MB of on-chip SRAM (2.5 MB scratchpad per TPC). Communication is key: a grid network among TPCs, 64 SerDes lanes (3.6 Tb/s) for inter-chip communication, and 4 HBM stacks. Relatively low utilization for GEMM (<60%) and convolution (59-87%). Image source: Intel presentation, Hot Chips 2019.
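
One way to picture a 32x32 MAC grid is as a unit that consumes GEMM work in 32x32 tiles, multiplying bfloat16 operands and accumulating at higher precision. The NumPy sketch below is only an illustration of that tiling, not the chip's actual dataflow: bfloat16 is emulated by truncating float32 mantissas, and everything other than the tile size and the higher-precision accumulation is an assumption.

```python
import numpy as np

TILE = 32  # each Tensor Processing Cluster has two 32x32 MAC grids

def to_bfloat16(x):
    """Emulate bfloat16 by zeroing the low 16 mantissa bits of float32."""
    xi = x.astype(np.float32).view(np.uint32)
    return (xi & np.uint32(0xFFFF0000)).view(np.float32)

def tiled_gemm(a, b):
    """C = A @ B, consumed in 32x32 tiles with FP32 accumulation."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == 0 and n % TILE == 0 and k % TILE == 0
    c = np.zeros((m, n), dtype=np.float32)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            for p in range(0, k, TILE):           # one 32x32x32 MAC-grid step
                c[i:i+TILE, j:j+TILE] += to_bfloat16(a[i:i+TILE, p:p+TILE]) @ \
                                         to_bfloat16(b[p:p+TILE, j:j+TILE])
    return c

a = np.random.randn(64, 96).astype(np.float32)
b = np.random.randn(96, 128).astype(np.float32)
err = np.max(np.abs(tiled_gemm(a, b) - a @ b))
print(f"max abs error vs. float32 GEMM: {err:.3f}")   # small: bfloat16 inputs, FP32 accumulation
```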

  5. Intel NNP-I: 12 Inference Compute Engines (ICEs) that can work together or independently; 24 MB central cache and 4 MB per ICE. Each ICE has a 4K 8-bit MAC unit. 10 W, 48 TOPS, 3600 inferences/s. Image source: Intel presentation, Hot Chips 2019.

  6. Training Take-Homes: Create a mini-batch, i.e., a random set of training inputs; apply the forward pass; compute the error; backpropagate the errors (while using a transpose of the weights); compute the deltas for all the weights (using the Xs of the previous layer); aggregate the deltas and update the weights at the end of the mini-batch; keep creating mini-batches until you run out of training samples. That is one epoch. On the forward pass, we use inputs X and weights W to compute the output Y. On the backward pass, we use dY and the weights W to compute dX; we also use dY and X to compute dW. (Figure: a layer takes X and W and produces Y on the forward pass; it takes dY, W, and X and produces dX and dW on the backward pass.)
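
The take-home maps directly onto a few lines of NumPy. This is a minimal single-layer sketch (one linear layer with a squared-error loss and made-up dimensions); the only point is to show where X, W, Y, dY, dX, and dW appear in the forward and backward passes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_in, n_out, batch = 1024, 64, 10, 32
X_all = rng.standard_normal((n_samples, n_in))
T_all = rng.standard_normal((n_samples, n_out))      # targets
W = 0.01 * rng.standard_normal((n_in, n_out))
lr = 0.01

for epoch in range(3):                               # one epoch = one pass over all mini-batches
    order = rng.permutation(n_samples)               # random mini-batches
    for start in range(0, n_samples, batch):
        idx = order[start:start + batch]
        X, T = X_all[idx], T_all[idx]
        Y = X @ W                                    # forward pass: inputs X and weights W give output Y
        dY = (Y - T) / batch                         # error gradient at the output
        dX = dY @ W.T                                # backward pass: dY and W give dX (sent to earlier layers)
        dW = X.T @ dY                                # dY and the layer's inputs X give dW, aggregated over the batch
        W -= lr * dW                                 # weight update at the end of the mini-batch
```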

  7. In the previous lecture, we saw the needs of the training algorithm. Here, we'll look at a hardware architecture (ScaleDeep) that pays attention to the requirements of training. This work is also unusual in that it designs a 1.4 kW server for training in the datacenter; you'll be dazzled by the many resources that are thrown at this problem. ScaleDeep Intro: The ScaleDeep architecture pays attention to training (network, storage), allocating storage resources and designing a network that is suitable for it. It introduces heterogeneity within and among chips. It uses a spatial pipeline (also seen later in ISAAC) instead of time-multiplexed execution. Batching is used heavily for efficiency (since throughput matters more than response time when training). It also uses a novel interconnect structure (for a 1.4 kW server) that helps with batching and training.

  8. The authors classify the layers as initial (more reuse of weights because of large feature maps), middle (more reuse of input neurons because of many feature maps; also 80% of the computations), and final (large memory footprint; high bytes/flop). Layer Breakdown: middle layers have high neuron reuse, many feature maps, and 80% of all computations; initial layers have high weight reuse and large feature maps; final layers have large memory and high bytes/flop. Note: FP, BP, WG.

  9. Also notable are the growing weights as we move right (toward the classifier layers). The bytes/flop ratio also increases, i.e., the later layers are more memory-bound. Also note that roughly a third of the time is spent in each of the forward pass (FP), backward pass (BP), and weight-gradient (WG) updates. (Same Layer Breakdown figure as the previous slide.)
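
To see why the later (fully-connected) layers are memory-bound, here is an illustrative bytes/flop estimate for a VGG-style convolution layer versus an FC layer. The layer dimensions and 2-byte operands are assumptions chosen only to show the trend; they are not numbers from the paper.

```python
def conv_stats(h, w, cin, cout, k, bytes_per_val=2):
    flops = 2 * h * w * cin * cout * k * k                    # MACs counted as 2 FLOPs
    bytes_moved = bytes_per_val * (k * k * cin * cout         # weights
                                   + h * w * cin              # input feature map
                                   + h * w * cout)            # output feature map
    return flops, bytes_moved

def fc_stats(nin, nout, bytes_per_val=2):
    flops = 2 * nin * nout
    bytes_moved = bytes_per_val * (nin * nout + nin + nout)   # weights dominate
    return flops, bytes_moved

for name, (f, b) in [("conv 56x56, 256->256, 3x3", conv_stats(56, 56, 256, 256, 3)),
                     ("fc 4096->4096",             fc_stats(4096, 4096))]:
    print(f"{name}: {f/1e6:.0f} MFLOPs, {b/1e6:.1f} MB, {b/f:.3f} bytes/flop")
```

At batch size 1 the FC layer moves roughly one byte per flop while the conv layer moves a thousand times less; batching amortizes the FC weights over many inputs, which is exactly why ScaleDeep works so hard to raise the batching factor for its FC chips (slides 18-21).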

  10. This figure shows a CompHeavy tile that is used for most of the dot-product computations. Small buffers feed inputs from the left and weights from the top and bottom. The inputs flow left to right, and the dot products get accumulated in the accumulator array at the right. CompHeavy Tile: kernels enter from the top and bottom; the tile can do two small kernels or one large kernel; inputs enter from the left (from memory); accumulation happens on the right; a scratchpad is used for partial sums; multiple lanes in each PE can handle different kernels.

  11. Partial sums may have to be saved in the scratchpad. There are two weight buffers (top and bottom) in case the kernels are small and you can cram two kernels into one CompHeavy tile. (Same CompHeavy Tile figure.)

  12. The inputs to this tile come from a MemHeavy tile to the left. The outputs of this tile are written to a MemHeavy tile to the right. The feature maps produced may also be written to DRAM (for reuse during the backward pass). (Same CompHeavy Tile figure.)

  13. Within a single chip, you can also have MemHeavy tiles. These MemHeavy tiles store the input/output feature maps. Their SFU computational units are used to compute the activation function and to perform pooling operations. MemHeavy Tile: intra-chip heterogeneity; mainly used for data storage; can compute activations and pooling.

  14. A single chip is made up of columns of MemHeavy and CompHeavy tiles. There are three CompHeavy tiles per MemHeavy tile, one each for FP, BP, and WG, so all three steps can happen in parallel. ScaleDeep Chip: three Comp tiles per Mem tile (for FP, BP, and WG); tiles are dedicated to specific layers (like ISAAC); latency/throughput trade-off.

  15. Similar to ISAAC, tiles are dedicated to processing a given layer. This means that the weights stay in place for the most part (they may occasionally spill into DRAM), but the feature maps move from column to column. Such a spatial pipeline can potentially increase the response time for one image (since only a fraction of all resources is being used for any one layer), but it should have little impact on throughput if we can set up a pipeline where all tiles are busy all the time. (Same ScaleDeep Chip figure.)

  16. Also note that each chip is connected to multiple memory channels. It turns out that each chip has about 50 MB of data storage, so the memory would be used when the working set is larger (recall that feature-map storage in vDNN requires many tens of gigabytes). (Same ScaleDeep Chip figure.)
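
A quick sanity check on the latency/throughput trade-off of dedicating tiles to layers. Under the idealized assumption of perfectly balanced stages and perfectly parallelizable layers (an assumption for illustration, not a claim from the paper), the spatial pipeline leaves throughput unchanged while multiplying single-image latency by the number of stages.

```python
def compare(work_per_image, total_compute, n_layers):
    # Time-multiplexed: every layer gets the whole chip.
    tm_latency = work_per_image / total_compute
    tm_throughput = 1.0 / tm_latency

    # Spatial pipeline: each layer gets 1/n of the chip, stages run concurrently.
    stage_time = (work_per_image / n_layers) / (total_compute / n_layers)
    sp_latency = n_layers * stage_time                 # one image must cross all stages
    sp_throughput = 1.0 / stage_time                   # one image completes per stage time

    return tm_latency, tm_throughput, sp_latency, sp_throughput

print(compare(work_per_image=1.0, total_compute=10.0, n_layers=8))
# latency is 8x worse in the spatial pipeline, throughput is identical
```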

  17. In addition to the heterogeneity within a chip, they also offer heterogeneity among chips. A Conv chip and an FC chip are provisioned differently: both have CompHeavy and MemHeavy tiles, but the number and size of each tile type varies. Chip Heterogeneity: the FC chip is tailored for the FC layers.

  18. A single node consists of 16 Conv chips and 4 FC chips. Five of these are organized into one chip cluster, and four chip clusters are organized as a ring. In one chip cluster, each Conv chip is responsible for one image. The four Conv chips then feed their output feature maps to their central FC chip (using the spoke connections). ScaleDeep Node: each Conv chip works on a different image; the outputs converge on the FC chip (spokes) so it can use batching; the wheel arcs are used for weight updates and for very large convolutions; the FC layer is split across multiple FC chips, which increases batching, reduces memory, and keeps network bandwidth low.

  19. Feeding four images' feature maps to one FC chip improves the batching factor in the FC chip (important because it has the worst bytes/flop ratio). (Same ScaleDeep Node figure.)

  20. In fact, the FC layers are partitioned across all 4 FC chips. Thus, one FC chip does a quarter of the FC layer for all 16 images. This further increases the batching factor, while requiring more feature-map communication on the ring network. (Same ScaleDeep Node figure.)

  21. The green ring within a chip cluster (the wheel connections) is used in case the convolution layers need more than one chip. The wheel and the outer ring are also used to gather weight gradients across the images in the mini-batch. (Same ScaleDeep Node figure.)
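
The FC partitioning is easy to picture in code: each FC chip holds a quarter of the FC weight matrix (partitioned along the output dimension) and applies it to the whole 16-image batch, so its weights are amortized over 16 inputs instead of 4. A minimal NumPy sketch with made-up layer sizes:

```python
import numpy as np

n_images, n_in, n_out, n_fc_chips = 16, 4096, 4096, 4
X = np.random.randn(n_images, n_in)        # feature maps gathered from the 16 Conv chips
W = np.random.randn(n_in, n_out)           # full FC weight matrix

# Each FC chip holds a quarter of the output neurons' weights and sees all 16 images.
chunk = n_out // n_fc_chips
partials = [X @ W[:, i*chunk:(i+1)*chunk] for i in range(n_fc_chips)]
Y = np.concatenate(partials, axis=1)       # the ring network stitches the pieces back together

print(np.allclose(Y, X @ W))               # True
# Per-chip weight storage drops 4x, and each chip's weights are reused across all 16 images.
```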

  22. Results Summary: Both the Conv chip and the FC chip have about 50 MB of capacity in their MemHeavy tiles (the paper doesn't seem to specify area per chip). The Conv chip consumes 58 W and the FC chip consumes 15 W; each chip consumes a modest amount of power, but 20 of these chips add up to over a kilowatt (including the DDR memory power). Inference is 3x faster than training because 3x more CompHeavy tiles can be put to work on the forward pass. One chip cluster consumes about 320 W, which is comparable to a GPU; in their iso-power comparisons, they report a 5-11x speedup over the GPU and about a 5x speedup over DaDianNao. Note that this architecture targets training. This was enabled with 50 MB of storage and DRAM per chip, a novel interconnect structure (rings that pass weight gradients around), and spokes/rings that improve the batching factor for the FC layers. The other key contribution is the use of heterogeneity.
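
The power numbers add up as follows. Treat the DDR/memory share as an inference: it is the remainder needed to reach the quoted 1.4 kW server and ~320 W cluster figures, not a number broken out in the slide.

```python
conv_chips, fc_chips = 16, 4
conv_w, fc_w = 58, 15

chip_power = conv_chips * conv_w + fc_chips * fc_w
print(f"20 chips: {chip_power} W")                 # 988 W; DDR memory power pushes the server to ~1.4 kW

cluster_power = 4 * conv_w + 1 * fc_w
print(f"one chip cluster: {cluster_power} W")      # 247 W; with its share of memory, ~320 W (GPU-comparable)
```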

  23. Now we move to the vDNN paper. The paper has a very simple idea: GPUs are used for training and must save all the feature maps produced during the forward pass. This requires lots of memory, especially when the batch size is high. Training on GPUs: GPUs have limited memory capacity (max 12 GB). When training, in addition to storing the weights, we must store all the activations, since they are required during backprop, plus the weight deltas. In convolutional layers, the activations consume far more memory than the weights; convolutional activations also have longer reuse distances than classifier weights. vDNN is a runtime that automatically moves activations to CPU memory and back, so the network fits in GPU memory (avoiding a constrained network, multiple GPUs, small batches, or slower algorithms).

  24. But state-of-the-art GPUs only offer 12 GB of memory, which isn't enough. So feature maps that will be reused much later, during the backward pass, can be moved to CPU memory, thus not hitting the GPU memory capacity limit. When required, those feature maps are prefetched back from CPU memory. (Same Training on GPUs slide.)

  25. Motivation I: This graph shows that these workloads far exceed the 12 GB capacity, and the footprint increases with batch size. But if those workloads had to save only one layer's worth of memory at a time, they'd be fine, e.g., about 37% of the 28 GB in VGG-16(256).

  26. Motivation II: This graph shows that the feature maps account for most of the memory usage. Convolutions can use a faster FFT-based algorithm that requires extra memory, called the workspace.

  27. Motivation III: This graph shows that most of the large feature maps are in the early layers, which also have longer reuse distances. These layers can use frequency-domain FFT algorithms that are faster but need additional workspace.

  28. vDNN Proposal: A runtime system that, during the forward pass, initiates an offload of a layer's input X while it is computing on X (no races, because X is read-only). The layer waits if the offload hasn't finished; the goal is to optimize memory capacity. During backprop, it initiates a prefetch of X for the previous layer; computation must wait if the prefetch hasn't finished.
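
A conceptual sketch of that schedule, not the actual vDNN implementation (which lives inside a cuDNN-based framework and uses CUDA streams and pinned host memory). Here a background thread stands in for the DMA engine on the PCIe link, and the class and method names are invented for illustration.

```python
import threading

class OffloadRuntime:
    """Toy vDNN-style schedule: offload each layer's input X to host memory
    during the forward pass, then bring it back before it is needed in backprop."""

    def __init__(self):
        self.host_copies = {}     # stands in for pinned CPU memory
        self.pending = {}         # in-flight transfers (threads emulate the DMA engine)

    def _copy(self, layer_id, x):
        self.host_copies[layer_id] = list(x)   # "transfer" over the PCIe link

    def offload(self, layer_id, x):
        t = threading.Thread(target=self._copy, args=(layer_id, x))
        t.start()                              # safe to overlap with compute: X is read-only in the forward pass
        self.pending[layer_id] = t

    def wait_and_free(self, layer_id):
        self.pending.pop(layer_id).join()      # the layer waits here if the offload hasn't finished
        # ...at this point the GPU copy of X could be freed to cap memory usage

    def prefetch(self, layer_id):
        return self.host_copies[layer_id]      # backprop stalls here if the prefetch hasn't finished

rt = OffloadRuntime()
activations = {0: [1.0, 2.0], 1: [3.0, 4.0]}

for lid, x in activations.items():            # forward pass: start offloads while "computing" on X
    rt.offload(lid, x)
    rt.wait_and_free(lid)

for lid in reversed(list(activations)):       # backward pass: bring X back for the weight-gradient step
    assert rt.prefetch(lid) == activations[lid]
print("offload/prefetch schedule completed")
```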

  29. System Parameters: Baseline: NVIDIA Titan X; 7 TFLOPS; 336 GB/s of GDDR5 memory bandwidth; 12 GB of memory; a 16 GB/s PCIe link for GPU-CPU memory transfers. Baseline with performance features (more memory required). vDNNall: all layers perform an offload/prefetch (least memory). vDNNconv: only convolutional layers perform offload/prefetch. vDNNdyn: explores many configurations to identify the few layers that need to be offloaded/prefetched and the few layers that can use the faster algorithms; the exploration isn't expensive and uses a greedy algorithm. We can apply the offload approach to all layers, to all conv layers, or just to the layers that need it the most, with or without the FFT/workspace optimization; the vDNNdyn scheme does the best. The background offload will not impose a performance penalty higher than 5%, because offloading can only happen at 16 GB/s (the PCIe link bandwidth), thus compromising some of the 336 GB/s memory bandwidth available to the GPU. This 5%, plus the delays waiting for prefetches, causes at most an 18% slowdown.

  30. Memory Interference: Early layers need the most offloading and will interfere the most with GDDR accesses; but early layers also have the fewest memory accesses for weights. At most, the interference will be 16/336 ≈ 5%.

  31. Performance Results: The largest benchmark has an 18% performance loss; the others are near zero. The dynamic scheme gives the best performance while fitting in memory. Power increases by 1-7%.

  32. References: "Neural Networks and Deep Learning" (Chapter 2), Michael Nielsen. "SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks," S. Venkataramani et al., ISCA 2017. "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design," M. Rhu et al., MICRO 2016.
