PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications
Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin

Slide 2: Deep learning powers intelligent applications in many domains

Slide 3: Training and inference
- Training: high throughput
- Inference: low latency

Slide 4: GPU clusters for DL workloads
(Figure: a cluster of GPUs)

Slide 5: Separate clusters for training and inference
(Figure: one GPU cluster dedicated to training and another dedicated to inference)

Slide 6: Utilization of GPU clusters is low
(Figure: training and inference utilization from daytime to midnight, comparing today's separate clusters with ideal shared clusters)

Slide 7: Context switching overhead is high
(Figure: the GPU switches from an old model to a new model)

Slide 8: Context switching overhead is high
- Example: switching an NVIDIA T4 from training BERT to ResNet inference
- Latency: 6s

Slide 9: Drawbacks of existing solutions
- NVIDIA MPS: high overhead due to contention
- Salus [MLSys '20]: requires all models to be preloaded into GPU memory

Slide 10: Goal: fast context switching
- Enable GPU-efficient multiplexing of multiple DL apps with fine-grained time-sharing
- Achieve millisecond-scale context switching overhead and high throughput

Slide 11: PipeSwitch overview: architecture
(Figure: a new task arrives at the controller; a memory daemon, one active worker, and standby workers sit in front of the GPU)

Slide 12: Sources of context switching overhead
- Model transmission
- Memory allocation
- Task initialization
- Task cleaning

Slide 13: How to reduce the overhead?
- Model transmission → pipelined model transmission
- Memory allocation
- Task initialization
- Task cleaning

Slide 14: DL models have layered structures
- Input → Layer-1 → Layer-2 → ... → Layer-N → Output
- Forward propagation and backward propagation both proceed layer by layer (see the sketch below)

Slide 15: Sequential model transmission and execution
(Figure: timeline in which all layers T0 ... Tn-1 are transmitted over PCIe before any layer E0 ... En-1 is executed on the GPU)

Slides 16-19: Pipelined model transmission and execution
- Overlap transmission with execution: while layer i executes on the GPU, layer i+1 is transmitted over PCIe (a code sketch follows below)
(Figure: timeline with transmissions T0 ... Tn-1 on the PCIe row overlapping executions E0 ... En-1 on the GPU row)

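The slides show the overlap only as a timeline; the sketch below expresses the same idea in PyTorch with two CUDA streams, one for PCIe copies and one for execution. The function name and per-layer structure are illustrative assumptions rather than PipeSwitch's actual API, and the layers are assumed to live in pinned host memory so the copies can be asynchronous.

```python
import torch

def pipelined_load_and_run(cpu_layers, x):
    """Overlap PCIe transfers with GPU execution, one layer at a time (sketch).

    cpu_layers: list of nn.Module layers whose parameters sit in pinned host memory.
    x: input tensor already on the GPU.
    """
    copy_stream = torch.cuda.Stream()               # transmission stream (T0 ... Tn-1)
    compute_stream = torch.cuda.current_stream()    # execution stream (E0 ... En-1)

    gpu_layers, events = [], []
    for layer in cpu_layers:
        with torch.cuda.stream(copy_stream):
            gpu_layer = layer.to("cuda", non_blocking=True)   # async host-to-device copy
            done = torch.cuda.Event()
            done.record(copy_stream)                          # marks Ti as finished
        gpu_layers.append(gpu_layer)
        events.append(done)

    for gpu_layer, done in zip(gpu_layers, events):
        compute_stream.wait_event(done)   # Ei may start only after Ti has completed
        x = gpu_layer(x)
    return x
```
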
Slide 20: Pipelined model transmission and execution
- Per-layer pipelining introduces two overheads:
  1. Multiple calls to PCIe
  2. Synchronization between transmission and execution

Slides 21-22: Pipelined model transmission and execution
- Transmit and execute in groups of consecutive layers: Group (0, i), Group (i+1, j), ..., Group (k, n-1)
- Finding the optimal grouping strategy takes exponential time (see the sketch below)
- Two heuristics prune the search

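To see why the grouping decision is expensive, the sketch below brute-forces all 2^(n-1) ways to cut n layers into consecutive groups under a simplified cost model assumed here for illustration (per-layer transmission and execution times plus a fixed per-call PCIe overhead). PipeSwitch prunes this search with two heuristics, which are not reproduced here.

```python
from itertools import combinations

def pipeline_finish_time(groups, t, e, call_overhead):
    """Completion time when transmission and execution are pipelined per group.

    groups: list of (start, end) layer ranges, inclusive.
    t[i], e[i]: transmission / execution time of layer i.
    call_overhead: fixed cost added to every PCIe transfer call.
    """
    trans_done, exec_done = 0.0, 0.0
    for start, end in groups:
        trans_done += call_overhead + sum(t[start:end + 1])       # PCIe is sequential
        exec_done = max(exec_done, trans_done) + sum(e[start:end + 1])
    return exec_done

def best_grouping_bruteforce(t, e, call_overhead):
    """Enumerate every way to split n layers into consecutive groups (exponential)."""
    n = len(t)
    best = None
    for k in range(n):                                 # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            groups = [(bounds[i], bounds[i + 1] - 1) for i in range(len(bounds) - 1)]
            finish = pipeline_finish_time(groups, t, e, call_overhead)
            if best is None or finish < best[0]:
                best = (finish, groups)
    return best
```

Larger groups amortize the per-call overhead, while smaller groups expose more overlap between transmission and execution; the optimal grouping balances the two.
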
Slide 23: How to reduce the overhead?
- Model transmission → pipelined model transmission
- Memory allocation → unified memory management
- Task initialization
- Task cleaning

Slide 24: Unified memory management
- The memory daemon manages model parameters and allocates GPU memory
- Workers access GPU memory through a pointer and an offset (sketched below)

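A stripped-down, single-process sketch of the pool idea follows: one large GPU buffer is allocated once, and an "allocation" is just an offset into it, so switching tasks never waits on GPU memory allocation or freeing. Class and method names are made up for illustration; the real memory daemon additionally shares the pool across worker processes, which this sketch does not attempt to show.

```python
import math
import torch

class GpuMemoryPool:
    """Minimal bump allocator over one pre-allocated GPU buffer (illustrative only)."""

    ALIGN = 256  # keep every allocation aligned, as a real GPU allocator would

    def __init__(self, capacity_bytes):
        # Pay the GPU allocation cost exactly once, up front.
        self.buffer = torch.empty(capacity_bytes, dtype=torch.uint8, device="cuda")
        self.offset = 0

    def alloc(self, shape, dtype=torch.float32):
        """Return a tensor carved out of the pool at the current offset."""
        nbytes = math.prod(shape) * torch.empty(0, dtype=dtype).element_size()
        if self.offset + nbytes > self.buffer.numel():
            raise MemoryError("pool exhausted")
        view = self.buffer[self.offset:self.offset + nbytes].view(dtype).view(shape)
        self.offset += (nbytes + self.ALIGN - 1) // self.ALIGN * self.ALIGN
        return view

    def reset(self):
        """Reclaim everything at once when a task is switched out."""
        self.offset = 0

# Example: parameters for the incoming task live inside the shared pool.
pool = GpuMemoryPool(capacity_bytes=1 << 30)   # 1 GiB pool, allocated once
weight = pool.alloc((4096, 1024))              # no allocation call on the switching path
```
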
Slide 25: How to reduce the overhead?
- Model transmission → pipelined model transmission
- Memory allocation → unified memory management
- Task initialization, task cleaning → active-standby worker switching

Slides 26-29: Active-standby worker switching
- Task initialization involves launching the process, creating the CUDA context, and allocating GPU memory
- Splitting the new task's initialization into Init-1 and Init-2 lets it overlap with the old task, so the new task starts earlier (see the sketch below)
(Figure: timelines. Without switching, the new task's Init/Execute/Clean begin only after the old task's Clean finishes; with active-standby switching, the new task starts much sooner)

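The worker side of this split can be sketched as follows: standby worker processes perform Init-1 (process launch and CUDA context creation) ahead of time and then block on a queue, so only per-task work remains when the controller switches to them. The queue protocol, function names, and division of work below are illustrative assumptions, not the actual PipeSwitch implementation.

```python
import torch
import torch.multiprocessing as mp

def standby_worker(task_queue, result_queue):
    # Init-1: done once, ahead of time, while another worker is still active.
    torch.cuda.init()                      # create the CUDA context now, not at task time
    result_queue.put("ready")

    while True:
        task = task_queue.get()            # block until the controller switches to us
        if task is None:
            break
        # Only per-task work remains on the critical path; in PipeSwitch, GPU memory
        # comes from the shared pool rather than a fresh allocation.
        model_name, batch = task
        batch = batch.to("cuda", non_blocking=True)
        # ... load the model (pipelined with execution) and run it on `batch` ...
        result_queue.put((model_name, "done"))

if __name__ == "__main__":
    mp.set_start_method("spawn")
    task_q, result_q = mp.Queue(), mp.Queue()
    worker = mp.Process(target=standby_worker, args=(task_q, result_q))
    worker.start()
    assert result_q.get() == "ready"       # standby worker is initialized before any task arrives
    task_q.put(("resnet152", torch.randn(8, 3, 224, 224)))
    print(result_q.get())
    task_q.put(None)
    worker.join()
```
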
Slide 30: Implementation
- Testbed: AWS EC2
  - p3.2xlarge: PCIe 3.0 x16, NVIDIA Tesla V100 GPU
  - g4dn.2xlarge: PCIe 3.0 x8, NVIDIA Tesla T4 GPU
- Software: CUDA 10.1, PyTorch 1.3.0
- Models: ResNet-152, Inception-v3, BERT-base

Slides 31-32: Evaluation
- Can PipeSwitch satisfy SLOs?
- Can PipeSwitch provide high utilization?
- How well do the design choices of PipeSwitch work?

Slides 33-34: PipeSwitch satisfies SLOs
- PipeSwitch achieves low context switching overhead
(Figure: switching latency on NVIDIA Tesla V100 and NVIDIA Tesla T4; reported numbers include 33ms, 39ms, 340ms, and 6.5s)

Slides 35-36: PipeSwitch provides high utilization
- PipeSwitch achieves near 100% utilization
(Figure: utilization across scheduling cycles)

Slide 37: Summary
- GPU clusters for DL applications suffer from low utilization
  - Limited sharing between training and inference workloads
- PipeSwitch introduces pipelined context switching
  - Enable GPU-efficient multiplexing of DL apps with fine-grained time-sharing
  - Achieve millisecond-scale context switching latencies and high throughput

Slide 38: Thank you!
zbai1@jhu.edu

Slide Note

Hello everyone.

Today I am glad to present our work, PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications.

End.
