Understanding Model Scalability in Deep Learning Systems

Explore the critical need for model scalability in deep learning systems, especially for sub-families like Transformers and graph neural networks (GNNs). Learn about challenges faced due to GPU memory limits and discover cutting-edge solutions like model parallelism, layer-aligned sharding, and tools like DeepSpeed for efficient DL model scaling.

  • Deep Learning
  • Model Scalability
  • Transformers
  • GPU Memory
  • Model Parallelism


Presentation Transcript


  1. CSE 234: Data Systems for Machine Learning. Arun Kumar. Topic 2: Deep Learning Systems, Part 3. Readings: DL book; Chapters 5 and 6 of the MLSys book.

  2. Outline: Introduction to Deep Learning; Overview of DL Systems; DL Training Systems (Compilation and Execution, Data Scaling, Model Scaling); LLM Systems; DLRM Systems; DL Inference Systems.

  3. Need for Model Scalability. Some DL sub-families, especially Transformers, now face model scalability issues due to GPU memory limits. http://jalammar.github.io/illustrated-transformer/

  4. Need for Model Scalability. Many popular Transformers are much larger than GPU memory: V100: 16-32 GB; A100 and H100: up to 80 GB. GPT-2 has 1.5B parameters (~6 GB); GPT-3 has 175B (~0.7 TB); GPT-4 has ~1.7T parameters (~6 TB)! Need space for the data batch, intermediates, and optimizer state too; the GPU memory footprint can blow up by ~20x!
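
A back-of-the-envelope sketch of the arithmetic above. It assumes fp32 weights at 4 bytes per parameter, and (an assumption, not from the slide) roughly 16 bytes per parameter of training state for mixed-precision Adam (fp16 weights and gradients plus fp32 master weights, momentum, and variance); activations and the data batch add more on top, which is how the ~20x blow-up arises.

```python
# Rough GPU memory estimates for model weights vs. training state.
# Assumptions: fp32 weights (4 bytes/param); ~16 bytes/param for mixed-precision
# Adam training state (fp16 weights + fp16 grads + fp32 master weights,
# momentum, variance). Activations and data batches are excluded.

def weight_gb(num_params, bytes_per_param=4):
    """Memory for the raw weights alone, in GB."""
    return num_params * bytes_per_param / 1e9

def training_state_gb(num_params, bytes_per_param=16):
    """Memory for weights + gradients + Adam state, in GB."""
    return num_params * bytes_per_param / 1e9

for name, n in [("GPT-2", 1.5e9), ("GPT-3", 175e9), ("GPT-4 (reported)", 1.7e12)]:
    print(f"{name}: weights ~{weight_gb(n):,.0f} GB, "
          f"training state ~{training_state_gb(n):,.0f} GB")
```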

  5. Need for Model Scalability. Another DL sub-family facing this issue: graph neural networks (GNNs), e.g., in social graph analytics. GNN+CNN combinations also arise for multimedia data. https://neurohive.io/en/news/attentive-graph-neural-networks-new-method-for-video-object-segmentation/

  6. Model Scalability. Typical approach today: model parallelism. Shard the model itself across multiple GPUs (akin to data shards); exchange features / backprop updates periodically. Layer-aligned sharding is common in model scaling, since it lowers inter-GPU communication costs, but intra-layer sharding is making a comeback (FSDP). https://medium.com/@esaliya/model-parallelism-in-deep-learning-is-not-what-you-think-94d2f81e82ed
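
A minimal PyTorch sketch of layer-aligned sharding, assuming a machine with two GPUs; the module and sizes are illustrative, not from the slides. Each half of the network lives on its own device, and the activations (and their gradients during backprop) cross the GPU boundary, which is the periodic exchange the slide mentions.

```python
import torch
import torch.nn as nn

class TwoGPUShardedNet(nn.Module):
    """Layer-aligned model parallelism: first half on cuda:0, second half on cuda:1."""
    def __init__(self, hidden=4096):
        super().__init__()
        self.shard1 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()).to("cuda:0")
        self.shard2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.shard1(x.to("cuda:0"))
        return self.shard2(x.to("cuda:1"))   # activation hops to the second GPU

model = TwoGPUShardedNet()
out = model(torch.randn(32, 4096))           # autograd routes gradients back across GPUs
```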

  7. Model Scaling: Pipelining. A common optimization with layer-aligned sharding; GPipe from Google is an exemplar. It stages out forward passes (and backward passes) across subsets of the mini-batch called micro-batches. https://arxiv.org/pdf/1811.06965.pdf
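
The ingredient GPipe builds on is splitting each mini-batch into micro-batches and accumulating their gradients; the sketch below shows only that piece on a single device (model, optimizer, and loss_fn are assumed to be defined elsewhere). GPipe additionally schedules different micro-batches on different pipeline stages at the same time, which this sketch omits.

```python
import torch

MICRO_BATCHES = 8   # GPipe splits each mini-batch into this many micro-batches

def train_step(model, optimizer, loss_fn, x, y):
    optimizer.zero_grad()
    for xb, yb in zip(x.chunk(MICRO_BATCHES), y.chunk(MICRO_BATCHES)):
        # Scale so the accumulated gradient equals the full mini-batch gradient
        # (equal-sized micro-batches, mean-reduced loss).
        loss = loss_fn(model(xb), yb) / MICRO_BATCHES
        loss.backward()              # gradients accumulate across micro-batches
    optimizer.step()                 # one exact mini-batch update, as in GPipe
```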

  8. Model Scaling Approach: DeepSpeed. State-of-the-art DL model scaling tool from Microsoft. DRAM offloading: spill shards of a large model (both model state and gradients) to DRAM; can scale to 10B parameters on a single GPU! 3D parallelism: combines data parallelism, model parallelism, and pipelining; mitigates the bubble issue of pure pipelining. Several other systems-level features supported: easier checkpointing, efficient loading, mixed precision, memory layout/bandwidth optimizations, etc. https://www.deepspeed.ai/
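
A hedged sketch of turning on DRAM offloading and mixed precision through a DeepSpeed config; the keys follow DeepSpeed's documented config schema, but the specific values are illustrative and `model` is assumed to be an `nn.Module` defined elsewhere.

```python
import deepspeed

# Illustrative config: ZeRO stage 3 with parameter and optimizer-state offloading
# to CPU DRAM, plus fp16 mixed precision.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},      # spill parameter shards to DRAM
        "offload_optimizer": {"device": "cpu"},  # spill optimizer state to DRAM
    },
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```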

  9. Model Scaling Approach: DeepSpeed. Example of 8 micro-batches hybridized as 2-way data-parallel execution, with 2-GPU pipelining within each replica. AllReduce (AR) is used to sync gradients at the end, before the model update step, akin to Horovod; this yields an exact mini-batch gradient update (no inconsistency). (Figure labels: one pipeline stage holds the model's first half and the other its second half; the boxes are the 8 micro-batches of one mini-batch.) https://www.deepspeed.ai/
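
The end-of-step gradient sync described here is a plain all-reduce over the data-parallel replicas; below is a minimal sketch with torch.distributed (process-group setup omitted, function name illustrative). Summing and then dividing by the number of replicas is what makes the update an exact mini-batch gradient.

```python
import torch.distributed as dist

def allreduce_gradients(model):
    """Average gradients across data-parallel replicas (Horovod/DDP-style sync)."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum grads from all replicas
            p.grad.div_(world_size)                        # average -> exact mini-batch gradient
```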

  10. Model Scaling Approach: DeepSpeed. Here is my (animated) revamp of DeepSpeed's botched illustration. :) Split the data mini-batch into 8 micro-batches D1 to D8, and split the 4-GPU cluster into 2 sub-clusters: {G1, G2} and {G3, G4}. Each sub-cluster holds a full model copy; AllReduce provides the 2-way data parallelism across sub-clusters. Within each sub-cluster, split the model into 2 shards (M1 -> M2) and run them with 2-way pipelining over its 4 micro-batches. Timeline notation: FM1,D1 / BM1,D1 = forward / backward pass of shard M1 on micro-batch D1; AllRed1 / Upd1 = AllReduce of shard M1's gradients / weight update on shard M1. In the timeline, G1 and G3 run shard M1 on micro-batches D1-D4 and D5-D8 respectively, G2 and G4 run shard M2 on the same micro-batches, and each GPU ends with the AllReduce for its shard followed by that shard's weight update. https://www.deepspeed.ai/

  11. Model Scaling Approach: DeepSpeed. Supports 3D/tensor parallelism, hybridizing model sharding, pipelining, and data parallelism. https://www.deepspeed.ai/

  12. Model Scaling Approach: FSDP. A more advanced hybridization than the 3D parallelism seen in DeepSpeed; now the default in PyTorch. Each layer is sharded across GPUs, and the data mini-batch is too. Likely the most scalable OSS model scaling approach today. Stages out the AllReduce (as in Horovod/DDP/DeepSpeed) into a Reduce-Scatter and an All-Gather. https://engineering.fb.com/2021/07/15/open-source/fsdp/
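
A minimal sketch of wrapping a model in PyTorch's FSDP so that each layer's parameters (and their gradients and optimizer state) are sharded across ranks; it assumes the script is launched with torchrun, and `build_model()` is a placeholder for whatever `nn.Module` you are training.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")                 # one process per GPU (e.g., via torchrun)
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_model().cuda()                    # placeholder for your nn.Module
model = FSDP(model)                             # parameters are sharded across all ranks

# The training loop then looks like ordinary DDP: forward, backward, optimizer.step().
# FSDP all-gathers each layer's parameter shards just in time for compute and
# reduce-scatters its gradients so every rank keeps only its own shard.
```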

  13. Model Scaling Approach: FSDP. Such per-layer sharding can raise communication between GPUs; that is the tradeoff for ease of scalability, and it needs a fast GPU-GPU interconnect (NVLink) to work well. Neither GPipe nor DeepSpeed/FSDP dominates; the tradeoffs depend on the NCG, batch size, GPU type, and number of GPUs (NB: Kabir's Saturn talk). https://engineering.fb.com/2021/07/15/open-source/fsdp/
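
To make the "stages out AllReduce" point concrete, here is an illustrative torch.distributed sketch of the two collectives FSDP uses in place of a single all-reduce: a reduce-scatter leaves each rank holding only the summed gradient for its shard, and an all-gather later rebuilds a full tensor from the shards. Shapes are illustrative and assume the tensor size divides evenly by the world size.

```python
import torch
import torch.distributed as dist

def fsdp_style_collectives(full_grad: torch.Tensor) -> torch.Tensor:
    """Illustrative: all_reduce(x) is equivalent to all_gather(reduce_scatter(x))."""
    world = dist.get_world_size()
    shards = list(full_grad.chunk(world))        # one equal-sized shard per rank
    my_shard = torch.empty_like(shards[0])

    # Reduce-scatter: each rank ends up with the summed gradient for its shard only.
    dist.reduce_scatter(my_shard, shards, op=dist.ReduceOp.SUM)

    # All-gather: ranks rebuild the full tensor from everyone's shard when needed.
    gathered = [torch.empty_like(my_shard) for _ in range(world)]
    dist.all_gather(gathered, my_shard)
    return torch.cat(gathered)
```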

  14. Discussion on DL training systems.

  15. Review Questions
      1. Why is PS a poor fit for DL training in general?
      2. Why does Horovod perform better than PS for DL training?
      3. Explain 1 advantage and 1 disadvantage of PyTorch DDP over Horovod.
      4. Why does pure pipeline parallelism for model scaling underutilize GPUs?
      5. Briefly explain 2 systems techniques in DeepSpeed to make model scaling more efficient.
      6. Briefly explain 1 key systems technique in FSDP compared to DeepSpeed that raises model scalability.
      7. [ECQ] Explain the behavior of all the model-parallelism approaches covered when SGD mini-batch size is just 1.

  16. Outline: Introduction to Deep Learning; Overview of DL Systems; DL Training Systems (Compilation and Execution, Data Scaling, Model Scaling); LLM Systems; DLRM Systems; DL Inference Systems.
