PRETZEL: Opening the Black Box of ML Prediction Serving Systems
Presentation Transcript


  1. PRETZEL: Opening the Black Box of ML Prediction Serving Systems. Yunseong Lee, Alberto Scolari, Byung-Gon Chun, Marco Domenico Santambrogio, Markus Weimer, Matteo Interlandi

  2. Machine Learning Prediction Serving. 1. Models are learned from data. 2. Models are deployed and served together (Data, Training, Learn, Model, Deploy, Server, Users). Performance goals: 1) low latency, 2) high throughput, 3) minimal resource usage.

  3. ML Prediction Serving Systems: State-of-the-art. Clipper, TF Serving, ML.Net. Assumption: models are black boxes (e.g., Text Analysis: "Pretzel is tasty"; Image Recognition: cat vs. car). They re-use the same code as in the training phase, encapsulate all operations into a function call (e.g., predict()), and apply external optimizations: result caching, ensembles, replication, request batching.
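
The black-box contract described above can be sketched as follows (a minimal illustration; the class, the toy pipeline, and the caching wrapper are hypothetical, not the API of any of the systems named):

```python
class BlackBoxModel:
    """A served model exposes only a single opaque entry point."""
    def __init__(self, pipeline):
        self._pipeline = pipeline  # internals are invisible to the server

    def predict(self, raw_input):
        # The serving system cannot see or optimize these steps;
        # it can only apply external tricks (caching, batching, ...).
        return self._pipeline(raw_input)

# External optimization example: result caching around the opaque call.
cache = {}
def serve(model, request):
    if request not in cache:
        cache[request] = model.predict(request)
    return cache[request]

model = BlackBoxModel(lambda text: "positive" if "tasty" in text else "negative")
print(serve(model, "Pretzel is tasty"))  # -> positive
```

The key point is that everything the server can do happens *around* `predict()`, never inside it.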

  4. How do Models Look inside Boxes? <Example: Sentiment Analysis> A model maps an input ("Pretzel is tasty", text) to a prediction (positive vs. negative).

  5. How do Models Look inside Boxes? <Example: Sentiment Analysis> A model is a DAG of Operators: Featurizers (Tokenizer, CharNgram, WordNgram, Concat) followed by a Predictor (Logistic Regression).

  6. How do Models Look inside Boxes? <Example: Sentiment Analysis> In the DAG of operators, the Tokenizer splits text into tokens, CharNgram and WordNgram extract N-grams, Concat merges the two vectors, and Logistic Regression computes the final score.
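
The sentiment-analysis DAG above can be sketched as plain function composition (an illustrative toy, not ML.Net's actual operators; the featurizers and the tiny vocabulary are deliberately simplistic):

```python
import math

def tokenize(text):
    # Tokenizer: split text into word tokens.
    return text.lower().split()

def char_ngrams(text, n=3):
    # CharNgram: extract overlapping character n-grams from the raw text.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n=2):
    # WordNgram: extract overlapping word n-grams from the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(text, vocab):
    # DAG: Tokenizer -> {CharNgram, WordNgram} -> Concat (one count vector).
    tokens = tokenize(text)
    grams = char_ngrams(text) + word_ngrams(tokens)
    vec = [0.0] * len(vocab)
    for g in grams:
        if g in vocab:
            vec[vocab[g]] += 1.0
    return vec

def logistic_regression(vec, weights, bias):
    # Predictor: compute the final score.
    z = sum(x * w for x, w in zip(vec, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

vocab = {"tas": 0, "sty": 1, "is tasty": 2}
score = logistic_regression(featurize("Pretzel is tasty", vocab), [1.0, 1.0, 2.0], -1.0)
print("positive" if score > 0.5 else "negative")  # -> positive
```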

  7. Many Models Have Similar Structures. Many parts of a model can be re-used in other models (customer personalization, templates, transfer learning): an identical set of operators with different parameters.

  8. Outline. Prediction Serving Systems. Limitations of Black Box Approaches. PRETZEL: White-box Prediction Serving System. Evaluation. Conclusion.

  9. Limitation 1: Resource Waste. Resources are isolated across black boxes. 1. Unable to share memory space: memory is wasted on duplicate objects (despite similarities between models). 2. No coordination of CPU resources between boxes: serving many models on one machine can spawn too many threads.

  10. Limitation 2: No Consideration of Operators' Characteristics. 1. Operators have different performance characteristics: Concat materializes a vector, while LogReg takes only 0.3% of latency (contrary to the training phase). 2. A better plan exists if such characteristics are considered: re-use the existing vectors and apply in-place updates in LogReg. [Latency breakdown: CharNgram 23.1%, WordNgram 34.2%, Concat 32.7%, LogReg 0.3%, Others 9.6%]
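
The in-place-update idea can be sketched as follows (an illustration of the optimization the slide describes, not PRETZEL's code): instead of concatenating the two feature vectors into a new buffer and then scoring it, the linear predictor accumulates the dot product directly over the existing vectors.

```python
def score_with_concat(char_vec, word_vec, weights, bias):
    # Naive plan: Concat materializes a new vector, then LogReg scans it.
    concat = char_vec + word_vec          # extra allocation + copy
    return sum(x * w for x, w in zip(concat, weights)) + bias

def score_in_place(char_vec, word_vec, weights, bias):
    # Optimized plan: push the linear predictor past Concat and
    # accumulate over the existing vectors, allocating nothing.
    acc = bias
    for i, x in enumerate(char_vec):
        acc += x * weights[i]
    offset = len(char_vec)
    for i, x in enumerate(word_vec):
        acc += x * weights[offset + i]
    return acc

cv, wv, w = [1.0, 2.0], [3.0], [0.5, 0.5, 1.0]
assert abs(score_with_concat(cv, wv, w, 0.1) - score_in_place(cv, wv, w, 0.1)) < 1e-9
```

Both plans compute the same score; the second simply never builds the concatenated vector.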

  11. Limitation 3: Lazy Initialization. ML.Net initializes code and memory lazily (efficient in the training phase): code analysis, Just-in-Time (JIT) compilation, memory allocation, etc. Running 250 Sentiment Analysis models 100 times (cold: first execution / hot: average of the rest) shows long-tail P99 latency in the cold case, with operators 13x to 444x slower cold than hot. This makes it difficult to provide strong Service Level Agreements (SLAs).

  12. Outline. (Black-box) Prediction Serving Systems. Limitations of Black Box Approaches. PRETZEL: White-box Prediction Serving System. Evaluation. Conclusion.

  13. PRETZEL: White-box Prediction Serving. We analyze models to optimize their internal execution. We let models co-exist on the same runtime, sharing computation and memory resources. We optimize models in two directions: 1. End-to-end optimizations. 2. Multi-model optimizations.

  14. End-to-End Optimizations. Optimize the execution of individual models from start to end. 1. [Ahead-of-time Compilation] Compile operators' code in advance: no JIT overhead. 2. [Vector Pooling] Pre-allocate data structures: no memory allocation on the data path.
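
Vector pooling can be sketched as a free list of pre-allocated buffers (an illustrative sketch under assumed semantics, not PRETZEL's implementation; the class and method names are made up):

```python
class VectorPool:
    """Pre-allocates fixed-size buffers so the data path never allocates."""
    def __init__(self, size, count):
        # All allocation happens once, at initialization time.
        self._free = [[0.0] * size for _ in range(count)]

    def acquire(self):
        # Hand out a pre-allocated buffer; zero it for reuse.
        buf = self._free.pop()
        for i in range(len(buf)):
            buf[i] = 0.0
        return buf

    def release(self, buf):
        # Return the buffer to the pool instead of freeing it.
        self._free.append(buf)

pool = VectorPool(size=4, count=2)
v = pool.acquire()      # no allocation happens on the request path
v[0] = 1.0
pool.release(v)         # buffer is recycled for the next request
```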

  15. Multi-model Optimizations. Share computation and memory across models. 1. [Object Store] Share operators' parameters/weights: maintain only one copy. 2. [Sub-plan Materialization] Reuse intermediate results computed by other models: save computation.
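
Both ideas can be sketched with a pair of hash-keyed caches (illustrative only; the keys, names, and granularity are simplifications of what the slides describe):

```python
object_store = {}   # parameter key -> single shared copy of the weights
subplan_cache = {}  # (stage key, input) -> materialized intermediate result

def load_params(key, loader):
    # Object Store: models with identical parameters share one copy.
    if key not in object_store:
        object_store[key] = loader()
    return object_store[key]

def run_stage(stage_key, fn, inp):
    # Sub-plan Materialization: reuse results other models already computed.
    if (stage_key, inp) not in subplan_cache:
        subplan_cache[(stage_key, inp)] = fn(inp)
    return subplan_cache[(stage_key, inp)]

# Two models with the same featurizer share both memory and work.
w1 = load_params("ngram-dict-v1", lambda: {"tasty": 0})
w2 = load_params("ngram-dict-v1", lambda: {"tasty": 0})
assert w1 is w2                       # one copy in memory

r1 = run_stage("tokenize", str.split, "Pretzel is tasty")
r2 = run_stage("tokenize", str.split, "Pretzel is tasty")
assert r1 is r2                       # computed once, reused
```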

  16. System Components. 1. Flour: Intermediate Representation (var fContext = ...; var Tokenizer = ...; return fPrgm.Plan();). 2. Oven: Compiler/Optimizer. 3. Runtime: Execute inference queries (Object Store, Scheduler). 4. FrontEnd: Handle user requests.

  17. Prediction Serving with PRETZEL. 1. Offline: analyze structural information of models, build a ModelPlan for optimal execution, and register the ModelPlan to the Runtime (Model, Analyze, Register, Runtime). 2. Online: handle prediction requests and coordinate CPU & memory resources (FrontEnd, Runtime).

  18. System Design: Offline Phase. 1. Translate the Model (Tokenizer, CharNgram, WordNgram, Concat, LogReg) into a Flour Program:
     var fContext = new FlourContext(...)
     var tTokenizer = fContext.CSV
         .FromText(fields, fieldsType, sep)
         .Tokenize();
     var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
     var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
     var fPrgrm = tCNgram
         .Concat(tWNgram)
         .ClassifierBinaryLinear(cParams);
     return fPrgrm.Plan();
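
In the same spirit, a Flour-style fluent pipeline builder can be sketched in Python (an imitation of the C#-like API above, not Flour itself; every name here is illustrative):

```python
class Pipeline:
    """Fluent builder that records a DAG of operators, Flour-style."""
    def __init__(self, ops=None):
        self.ops = ops or []

    def _then(self, name):
        # Each call returns a new pipeline with one more operator appended.
        return Pipeline(self.ops + [name])

    def tokenize(self):
        return self._then("Tokenizer")

    def char_ngram(self, n):
        return self._then(f"CharNgram({n})")

    def word_ngram(self, n):
        return self._then(f"WordNgram({n})")

    def concat(self, other):
        return Pipeline(self.ops + other.ops + ["Concat"])

    def classifier_binary_linear(self):
        return self._then("LogReg")

    def plan(self):
        # Hand the recorded operator DAG to an optimizer/compiler.
        return self.ops

t = Pipeline().tokenize()
prgm = (t.char_ngram(3)
         .concat(t.word_ngram(2))
         .classifier_binary_linear())
print(prgm.plan())
```

The point is that the program builds a *description* of the DAG, which an optimizer can inspect, rather than executing operators eagerly.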

  19. System Design: Offline Phase. 2. The Oven optimizer/compiler builds a Model Plan from the Flour Program: a rule-based optimizer groups operators into stages (Stage 1, Stage 2) and rewrites the plan, e.g., pushing the linear predictor down and removing Concat, yielding a Logical DAG (S1, S2).

  20. System Design: Offline Phase. 2. The Oven optimizer/compiler builds a Model Plan. Alongside the Logical DAG (S1, S2), the Model Plan records the Parameters extracted from the Flour Program (e.g., dictionary, N-gram length) and Statistics (e.g., dense vs. sparse, maximum vector size).

  21. System Design: Offline Phase. 3. The Model Plan is registered to the Runtime: 1. store the parameters and the mapping between logical stages in the Object Store; 2. find the most efficient physical implementation of each stage using the parameters & statistics.

  22. System Design: Offline Phase. 3. The Model Plan is registered to the Runtime (continued): 3. register the selected physical stages to the Catalog, chosen using parameters & statistics such as N-gram length (1 vs. 3) and sparse vs. dense vectors.
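
Picking a physical implementation from parameters and statistics can be sketched as a simple rule table (the rules and implementation names here are hypothetical illustrations; PRETZEL's actual selection logic is richer):

```python
def select_physical_stage(params, stats):
    """Choose a physical implementation for a logical stage.

    `params` and `stats` carry the kinds of fields the slides mention:
    N-gram length, dense vs. sparse, etc.
    """
    if stats["sparse"]:
        # Sparse vectors favor a hash-based accumulation kernel.
        return "sparse-hash-kernel"
    if params["ngram_length"] == 1:
        # Unigrams allow a simple per-token dictionary lookup.
        return "dense-unigram-lookup"
    # General case: dense sliding-window n-gram extraction.
    return "dense-ngram-window"

catalog = {}
def register(model, stage, params, stats):
    # Physical stages are shared: identical choices map to one catalog entry.
    impl = select_physical_stage(params, stats)
    catalog.setdefault(impl, []).append((model, stage))
    return impl

register("Model1", "S1", {"ngram_length": 3}, {"sparse": False})
```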

  23. System Design: Online Phase. 1. A prediction request arrives (<Model1, "Pretzel is tasty">). 2. The Runtime instantiates the model's physical stages along with their parameters from the Object Store. 3. It executes the stages using thread pools managed by the Scheduler. 4. The result is sent back to the client.
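
The online path can be sketched with a standard thread pool (a toy sketch of steps 1-4 above; the stage bodies and names are placeholders, not PRETZEL's scheduler):

```python
from concurrent.futures import ThreadPoolExecutor

# Registered physical stages per model (placeholder stage functions).
physical_stages = {
    "Model1": [
        lambda s: s.split(),                                         # S1: featurize
        lambda toks: "positive" if "tasty" in toks else "negative",  # S2: predict
    ],
}

executor = ThreadPoolExecutor(max_workers=4)  # shared, scheduler-managed pool

def handle_request(model_id, text):
    # 2. Instantiate the model's stages; 3. execute them on the pool.
    def run():
        value = text
        for stage in physical_stages[model_id]:
            value = stage(value)
        return value
    # 4. The future's result is sent back to the client.
    return executor.submit(run).result()

print(handle_request("Model1", "Pretzel is tasty"))  # -> positive
```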

  24. Outline. (Black-box) Prediction Serving Systems. Limitations of Black Box Approaches. PRETZEL: White-box Prediction Serving System. Evaluation. Conclusion.

  25. Evaluation. Q. How does PRETZEL improve performance over black-box approaches, in terms of latency, memory, and throughput? 500 models from the Microsoft Machine Learning team: 250 Sentiment Analysis (memory-bound), 250 Attendee Count (compute-bound). System configuration: 16-core CPU, 32GB RAM, Windows 10, .Net Core 2.0.

  26. Evaluation: Latency. Micro-benchmark (no server-client communication): score 250 Sentiment Analysis models 100 times each; compare ML.Net vs. PRETZEL. [Latency CDF, ms, log-scaled] P99 (hot): ML.Net 0.6ms vs. PRETZEL 0.2ms (3x better). P99 (cold): ML.Net 8.1ms vs. PRETZEL 0.8ms (10x better). Worst case (cold): 45x better.

  27. Evaluation: Memory. Measure cumulative memory usage after loading 250 models (Attendee Count models, smaller than Sentiment Analysis); 4 settings compared. [Cumulative memory usage, log-scaled, vs. number of pipelines] ML.Net + Clipper: 9.7GB. ML.Net: 3.7GB. PRETZEL without Object Store: 2.9GB. PRETZEL: 164MB, i.e., 25x less than ML.Net and 62x less than ML.Net + Clipper.

  28. Evaluation: Throughput. Micro-benchmark: score 250 Attendee Count models 1000 times each; request 1000 queries in a batch; compare ML.Net vs. PRETZEL. [Throughput (K QPS) vs. number of CPU cores: 1, 2, 4, 8] PRETZEL achieves about 10x higher throughput than ML.Net and scales close to ideally with the number of cores. More results in the paper!

  29. Conclusion. PRETZEL is the first white-box prediction serving system for ML pipelines. By using models' structural info, we enable two types of optimizations: end-to-end optimizations generate efficient execution plans for a model; multi-model optimizations let models share computation and memory resources. Our evaluation shows that PRETZEL improves performance compared to black-box systems (e.g., ML.Net): it decreases latency and memory footprint, and increases resource utilization and throughput.

  30. PRETZEL: a White-Box ML Prediction Serving System. Thank you! Questions?
