PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications
Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin

Slide 2: Deep learning powers intelligent applications in many domains

Slide 3: Training and inference
- Training: high throughput
- Inference: low latency

Slide 4: GPU clusters for DL workloads
(Figure: a cluster of GPUs)

Slide 5: Separate clusters for training and inference
(Figure: one GPU cluster dedicated to training and another dedicated to inference)

Slide 6: Utilization of GPU clusters is low
(Figure: training and inference utilization from daytime to midnight, comparing today's separate clusters with ideal shared clusters)

Slide 7: Context switching overhead is high
(Figure: the GPU switches from an old model to a new model)

Slide 8: Context switching overhead is high
- Example: switching an NVIDIA T4 from training BERT to ResNet inference
- Latency: 6s

Slide 9: Drawbacks of existing solutions
- NVIDIA MPS: high overhead due to contention
- Salus [MLSys '20]: requires all models to be preloaded into GPU memory

Slide 10: Goal: fast context switching
- Enable GPU-efficient multiplexing of multiple DL apps with fine-grained time-sharing
- Achieve millisecond-scale context switching overhead and high throughput

Slide 11: PipeSwitch overview: architecture
(Figure: a new task arrives at the controller; a memory daemon, one active worker, and standby workers sit in front of the GPU)

Slide 12: Sources of context switching overhead
- Model transmission
- Memory allocation
- Task initialization
- Task cleaning

Slide 13: How to reduce the overhead?
- Model transmission → pipelined model transmission
- Memory allocation
- Task initialization
- Task cleaning

Slide 14: DL models have layered structures
- Input → Layer-1 → Layer-2 → ... → Layer-N → Output
- Forward propagation and backward propagation both proceed layer by layer (see the sketch below)

Slide 15: Sequential model transmission and execution
(Figure: timeline in which all layers T0 ... Tn-1 are transmitted over PCIe before any layer E0 ... En-1 is executed on the GPU)

Slides 16-19: Pipelined model transmission and execution
- Overlap transmission with execution: while layer i executes on the GPU, layer i+1 is transmitted over PCIe (a code sketch follows below)
(Figure: timeline with transmissions T0 ... Tn-1 on the PCIe row overlapping executions E0 ... En-1 on the GPU row)

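The slides show the overlap only as a timeline; the sketch below expresses the same idea in PyTorch with two CUDA streams, one for PCIe copies and one for execution. The function name and per-layer structure are illustrative assumptions rather than PipeSwitch's actual API, and the layers are assumed to live in pinned host memory so the copies can be asynchronous.

```python
import torch

def pipelined_load_and_run(cpu_layers, x):
    """Overlap PCIe transfers with GPU execution, one layer at a time (sketch).

    cpu_layers: list of nn.Module layers whose parameters sit in pinned host memory.
    x: input tensor already on the GPU.
    """
    copy_stream = torch.cuda.Stream()               # transmission stream (T0 ... Tn-1)
    compute_stream = torch.cuda.current_stream()    # execution stream (E0 ... En-1)

    gpu_layers, events = [], []
    for layer in cpu_layers:
        with torch.cuda.stream(copy_stream):
            gpu_layer = layer.to("cuda", non_blocking=True)   # async host-to-device copy
            done = torch.cuda.Event()
            done.record(copy_stream)                          # marks Ti as finished
        gpu_layers.append(gpu_layer)
        events.append(done)

    for gpu_layer, done in zip(gpu_layers, events):
        compute_stream.wait_event(done)   # Ei may start only after Ti has completed
        x = gpu_layer(x)
    return x
```
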
Slide 20: Pipelined model transmission and execution
- Per-layer pipelining introduces two overheads:
  1. Multiple calls to PCIe
  2. Synchronization between transmission and execution

Slides 21-22: Pipelined model transmission and execution
- Transmit and execute in groups of consecutive layers: Group (0, i), Group (i+1, j), ..., Group (k, n-1)
- Finding the optimal grouping strategy takes exponential time (see the sketch below)
- Two heuristics prune the search

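To see why the grouping decision is expensive, the sketch below brute-forces all 2^(n-1) ways to cut n layers into consecutive groups under a simplified cost model assumed here for illustration (per-layer transmission and execution times plus a fixed per-call PCIe overhead). PipeSwitch prunes this search with two heuristics, which are not reproduced here.

```python
from itertools import combinations

def pipeline_finish_time(groups, t, e, call_overhead):
    """Completion time when transmission and execution are pipelined per group.

    groups: list of (start, end) layer ranges, inclusive.
    t[i], e[i]: transmission / execution time of layer i.
    call_overhead: fixed cost added to every PCIe transfer call.
    """
    trans_done, exec_done = 0.0, 0.0
    for start, end in groups:
        trans_done += call_overhead + sum(t[start:end + 1])       # PCIe is sequential
        exec_done = max(exec_done, trans_done) + sum(e[start:end + 1])
    return exec_done

def best_grouping_bruteforce(t, e, call_overhead):
    """Enumerate every way to split n layers into consecutive groups (exponential)."""
    n = len(t)
    best = None
    for k in range(n):                                 # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            groups = [(bounds[i], bounds[i + 1] - 1) for i in range(len(bounds) - 1)]
            finish = pipeline_finish_time(groups, t, e, call_overhead)
            if best is None or finish < best[0]:
                best = (finish, groups)
    return best
```

Larger groups amortize the per-call overhead, while smaller groups expose more overlap between transmission and execution; the optimal grouping balances the two.
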
Slide 23: How to reduce the overhead?
- Model transmission → pipelined model transmission
- Memory allocation → unified memory management
- Task initialization
- Task cleaning

Slide 24: Unified memory management
- The memory daemon manages model parameters and allocates GPU memory
- Workers access GPU memory through a pointer and an offset (sketched below)

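A stripped-down, single-process sketch of the pool idea follows: one large GPU buffer is allocated once, and an "allocation" is just an offset into it, so switching tasks never waits on GPU memory allocation or freeing. Class and method names are made up for illustration; the real memory daemon additionally shares the pool across worker processes, which this sketch does not attempt to show.

```python
import math
import torch

class GpuMemoryPool:
    """Minimal bump allocator over one pre-allocated GPU buffer (illustrative only)."""

    ALIGN = 256  # keep every allocation aligned, as a real GPU allocator would

    def __init__(self, capacity_bytes):
        # Pay the GPU allocation cost exactly once, up front.
        self.buffer = torch.empty(capacity_bytes, dtype=torch.uint8, device="cuda")
        self.offset = 0

    def alloc(self, shape, dtype=torch.float32):
        """Return a tensor carved out of the pool at the current offset."""
        nbytes = math.prod(shape) * torch.empty(0, dtype=dtype).element_size()
        if self.offset + nbytes > self.buffer.numel():
            raise MemoryError("pool exhausted")
        view = self.buffer[self.offset:self.offset + nbytes].view(dtype).view(shape)
        self.offset += (nbytes + self.ALIGN - 1) // self.ALIGN * self.ALIGN
        return view

    def reset(self):
        """Reclaim everything at once when a task is switched out."""
        self.offset = 0

# Example: parameters for the incoming task live inside the shared pool.
pool = GpuMemoryPool(capacity_bytes=1 << 30)   # 1 GiB pool, allocated once
weight = pool.alloc((4096, 1024))              # no allocation call on the switching path
```
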
Slide 25: How to reduce the overhead?
- Model transmission → pipelined model transmission
- Memory allocation → unified memory management
- Task initialization, task cleaning → active-standby worker switching

Slides 26-29: Active-standby worker switching
- Task initialization involves launching the process, creating the CUDA context, and allocating GPU memory
- Splitting the new task's initialization into Init-1 and Init-2 lets it overlap with the old task, so the new task starts earlier (see the sketch below)
(Figure: timelines. Without switching, the new task's Init/Execute/Clean begin only after the old task's Clean finishes; with active-standby switching, the new task starts much sooner)

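The worker side of this split can be sketched as follows: standby worker processes perform Init-1 (process launch and CUDA context creation) ahead of time and then block on a queue, so only per-task work remains when the controller switches to them. The queue protocol, function names, and division of work below are illustrative assumptions, not the actual PipeSwitch implementation.

```python
import torch
import torch.multiprocessing as mp

def standby_worker(task_queue, result_queue):
    # Init-1: done once, ahead of time, while another worker is still active.
    torch.cuda.init()                      # create the CUDA context now, not at task time
    result_queue.put("ready")

    while True:
        task = task_queue.get()            # block until the controller switches to us
        if task is None:
            break
        # Only per-task work remains on the critical path; in PipeSwitch, GPU memory
        # comes from the shared pool rather than a fresh allocation.
        model_name, batch = task
        batch = batch.to("cuda", non_blocking=True)
        # ... load the model (pipelined with execution) and run it on `batch` ...
        result_queue.put((model_name, "done"))

if __name__ == "__main__":
    mp.set_start_method("spawn")
    task_q, result_q = mp.Queue(), mp.Queue()
    worker = mp.Process(target=standby_worker, args=(task_q, result_q))
    worker.start()
    assert result_q.get() == "ready"       # standby worker is initialized before any task arrives
    task_q.put(("resnet152", torch.randn(8, 3, 224, 224)))
    print(result_q.get())
    task_q.put(None)
    worker.join()
```
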
Slide 30: Implementation
- Testbed: AWS EC2
  - p3.2xlarge: PCIe 3.0 x16, NVIDIA Tesla V100 GPU
  - g4dn.2xlarge: PCIe 3.0 x8, NVIDIA Tesla T4 GPU
- Software: CUDA 10.1, PyTorch 1.3.0
- Models: ResNet-152, Inception-v3, BERT-base

Slides 31-32: Evaluation
- Can PipeSwitch satisfy SLOs?
- Can PipeSwitch provide high utilization?
- How well do the design choices of PipeSwitch work?

Slides 33-34: PipeSwitch satisfies SLOs
- PipeSwitch achieves low context switching overhead
(Figure: switching latency on NVIDIA Tesla V100 and NVIDIA Tesla T4; reported numbers include 33ms, 39ms, 340ms, and 6.5s)

Slides 35-36: PipeSwitch provides high utilization
- PipeSwitch achieves near 100% utilization
(Figure: utilization across scheduling cycles)

Slide 37: Summary
- GPU clusters for DL applications suffer from low utilization
  - Limited sharing between training and inference workloads
- PipeSwitch introduces pipelined context switching
  - Enable GPU-efficient multiplexing of DL apps with fine-grained time-sharing
  - Achieve millisecond-scale context switching latencies and high throughput

Slide 38: Thank you!
zbai1@jhu.edu

Slide Note

Hello everyone.

Today I am glad to present our work, PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications.

End.
