PRETZEL: Opening the Black Box of ML Prediction Serving Systems
Presentation Transcript


  1. PRETZEL: Opening the Black Box of ML Prediction Serving Systems. Yunseong Lee, Alberto Scolari, Byung-Gon Chun, Marco Domenico Santambrogio, Markus Weimer, Matteo Interlandi

  2. Machine Learning Prediction Serving. 1. Models are learned from data. 2. Models are deployed and served together (Data, Training, Learn, Model, Deploy, Server, Users). Performance goals: 1) low latency, 2) high throughput, 3) minimal resource usage.

  3. ML Prediction Serving Systems: State-of-the-art. Clipper, TF Serving, ML.Net. Assumption: models are black boxes (e.g., Text Analysis: "Pretzel is tasty"; Image Recognition: cat vs. car). They re-use the same code as in the training phase, encapsulate all operations into a function call (e.g., predict()), and apply external optimizations: result caching, ensembles, replication, request batching.
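
The black-box contract described above can be sketched as follows (a minimal illustration; the class, the toy pipeline, and the caching wrapper are hypothetical, not the API of any of the systems named):

```python
class BlackBoxModel:
    """A served model exposes only a single opaque entry point."""
    def __init__(self, pipeline):
        self._pipeline = pipeline  # internals are invisible to the server

    def predict(self, raw_input):
        # The serving system cannot see or optimize these steps;
        # it can only apply external tricks (caching, batching, ...).
        return self._pipeline(raw_input)

# External optimization example: result caching around the opaque call.
cache = {}
def serve(model, request):
    if request not in cache:
        cache[request] = model.predict(request)
    return cache[request]

model = BlackBoxModel(lambda text: "positive" if "tasty" in text else "negative")
print(serve(model, "Pretzel is tasty"))  # -> positive
```

The key point is that everything the server can do happens *around* `predict()`, never inside it.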

  4. How do Models Look inside Boxes? <Example: Sentiment Analysis> A model maps an input ("Pretzel is tasty", text) to a prediction (positive vs. negative).

  5. How do Models Look inside Boxes? <Example: Sentiment Analysis> A model is a DAG of Operators: Featurizers (Tokenizer, CharNgram, WordNgram, Concat) followed by a Predictor (Logistic Regression).

  6. How do Models Look inside Boxes? <Example: Sentiment Analysis> In the DAG of operators, the Tokenizer splits text into tokens, CharNgram and WordNgram extract N-grams, Concat merges the two vectors, and Logistic Regression computes the final score.
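
The sentiment-analysis DAG above can be sketched as plain function composition (an illustrative toy, not ML.Net's actual operators; the featurizers and the tiny vocabulary are deliberately simplistic):

```python
import math

def tokenize(text):
    # Tokenizer: split text into word tokens.
    return text.lower().split()

def char_ngrams(text, n=3):
    # CharNgram: extract overlapping character n-grams from the raw text.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n=2):
    # WordNgram: extract overlapping word n-grams from the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(text, vocab):
    # DAG: Tokenizer -> {CharNgram, WordNgram} -> Concat (one count vector).
    tokens = tokenize(text)
    grams = char_ngrams(text) + word_ngrams(tokens)
    vec = [0.0] * len(vocab)
    for g in grams:
        if g in vocab:
            vec[vocab[g]] += 1.0
    return vec

def logistic_regression(vec, weights, bias):
    # Predictor: compute the final score.
    z = sum(x * w for x, w in zip(vec, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

vocab = {"tas": 0, "sty": 1, "is tasty": 2}
score = logistic_regression(featurize("Pretzel is tasty", vocab), [1.0, 1.0, 2.0], -1.0)
print("positive" if score > 0.5 else "negative")  # -> positive
```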

  7. Many Models Have Similar Structures. Many parts of a model can be re-used in other models (customer personalization, templates, transfer learning): an identical set of operators with different parameters.

  8. Outline. Prediction Serving Systems. Limitations of Black Box Approaches. PRETZEL: White-box Prediction Serving System. Evaluation. Conclusion.

  9. Limitation 1: Resource Waste. Resources are isolated across black boxes. 1. Unable to share memory space: memory is wasted on duplicate objects (despite similarities between models). 2. No coordination of CPU resources between boxes: serving many models on one machine can spawn too many threads.

  10. Limitation 2: No Consideration of Operators' Characteristics. 1. Operators have different performance characteristics: Concat materializes a vector, while LogReg takes only 0.3% of latency (contrary to the training phase). 2. A better plan exists if such characteristics are considered: re-use the existing vectors and apply in-place updates in LogReg. [Latency breakdown: CharNgram 23.1%, WordNgram 34.2%, Concat 32.7%, LogReg 0.3%, Others 9.6%]
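
The in-place-update idea can be sketched as follows (an illustration of the optimization the slide describes, not PRETZEL's code): instead of concatenating the two feature vectors into a new buffer and then scoring it, the linear predictor accumulates the dot product directly over the existing vectors.

```python
def score_with_concat(char_vec, word_vec, weights, bias):
    # Naive plan: Concat materializes a new vector, then LogReg scans it.
    concat = char_vec + word_vec          # extra allocation + copy
    return sum(x * w for x, w in zip(concat, weights)) + bias

def score_in_place(char_vec, word_vec, weights, bias):
    # Optimized plan: push the linear predictor past Concat and
    # accumulate over the existing vectors, allocating nothing.
    acc = bias
    for i, x in enumerate(char_vec):
        acc += x * weights[i]
    offset = len(char_vec)
    for i, x in enumerate(word_vec):
        acc += x * weights[offset + i]
    return acc

cv, wv, w = [1.0, 2.0], [3.0], [0.5, 0.5, 1.0]
assert abs(score_with_concat(cv, wv, w, 0.1) - score_in_place(cv, wv, w, 0.1)) < 1e-9
```

Both plans compute the same score; the second simply never builds the concatenated vector.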

  11. Limitation 3: Lazy Initialization. ML.Net initializes code and memory lazily (efficient in the training phase): code analysis, Just-in-Time (JIT) compilation, memory allocation, etc. Running 250 Sentiment Analysis models 100 times (cold: first execution / hot: average of the rest) shows long-tail P99 latency in the cold case, with operators 13x to 444x slower cold than hot. This makes it difficult to provide strong Service Level Agreements (SLAs).

  12. Outline. (Black-box) Prediction Serving Systems. Limitations of Black Box Approaches. PRETZEL: White-box Prediction Serving System. Evaluation. Conclusion.

  13. PRETZEL: White-box Prediction Serving. We analyze models to optimize their internal execution. We let models co-exist on the same runtime, sharing computation and memory resources. We optimize models in two directions: 1. End-to-end optimizations. 2. Multi-model optimizations.

  14. End-to-End Optimizations. Optimize the execution of individual models from start to end. 1. [Ahead-of-time Compilation] Compile operators' code in advance: no JIT overhead. 2. [Vector Pooling] Pre-allocate data structures: no memory allocation on the data path.
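
Vector pooling can be sketched as a free list of pre-allocated buffers (an illustrative sketch under assumed semantics, not PRETZEL's implementation; the class and method names are made up):

```python
class VectorPool:
    """Pre-allocates fixed-size buffers so the data path never allocates."""
    def __init__(self, size, count):
        # All allocation happens once, at initialization time.
        self._free = [[0.0] * size for _ in range(count)]

    def acquire(self):
        # Hand out a pre-allocated buffer; zero it for reuse.
        buf = self._free.pop()
        for i in range(len(buf)):
            buf[i] = 0.0
        return buf

    def release(self, buf):
        # Return the buffer to the pool instead of freeing it.
        self._free.append(buf)

pool = VectorPool(size=4, count=2)
v = pool.acquire()      # no allocation happens on the request path
v[0] = 1.0
pool.release(v)         # buffer is recycled for the next request
```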

  15. Multi-model Optimizations. Share computation and memory across models. 1. [Object Store] Share operators' parameters/weights: maintain only one copy. 2. [Sub-plan Materialization] Reuse intermediate results computed by other models: save computation.
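
Both ideas can be sketched with a pair of hash-keyed caches (illustrative only; the keys, names, and granularity are simplifications of what the slides describe):

```python
object_store = {}   # parameter key -> single shared copy of the weights
subplan_cache = {}  # (stage key, input) -> materialized intermediate result

def load_params(key, loader):
    # Object Store: models with identical parameters share one copy.
    if key not in object_store:
        object_store[key] = loader()
    return object_store[key]

def run_stage(stage_key, fn, inp):
    # Sub-plan Materialization: reuse results other models already computed.
    if (stage_key, inp) not in subplan_cache:
        subplan_cache[(stage_key, inp)] = fn(inp)
    return subplan_cache[(stage_key, inp)]

# Two models with the same featurizer share both memory and work.
w1 = load_params("ngram-dict-v1", lambda: {"tasty": 0})
w2 = load_params("ngram-dict-v1", lambda: {"tasty": 0})
assert w1 is w2                       # one copy in memory

r1 = run_stage("tokenize", str.split, "Pretzel is tasty")
r2 = run_stage("tokenize", str.split, "Pretzel is tasty")
assert r1 is r2                       # computed once, reused
```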

  16. System Components. 1. Flour: Intermediate Representation (var fContext = ...; var Tokenizer = ...; return fPrgm.Plan();). 2. Oven: Compiler/Optimizer. 3. Runtime: Execute inference queries (Object Store, Scheduler). 4. FrontEnd: Handle user requests.

  17. Prediction Serving with PRETZEL. 1. Offline: analyze structural information of models, build a ModelPlan for optimal execution, and register the ModelPlan to the Runtime (Model, Analyze, Register, Runtime). 2. Online: handle prediction requests and coordinate CPU & memory resources (FrontEnd, Runtime).

  18. System Design: Offline Phase. 1. Translate the Model (Tokenizer, CharNgram, WordNgram, Concat, LogReg) into a Flour Program:
     var fContext = new FlourContext(...)
     var tTokenizer = fContext.CSV
         .FromText(fields, fieldsType, sep)
         .Tokenize();
     var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
     var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
     var fPrgrm = tCNgram
         .Concat(tWNgram)
         .ClassifierBinaryLinear(cParams);
     return fPrgrm.Plan();
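
In the same spirit, a Flour-style fluent pipeline builder can be sketched in Python (an imitation of the C#-like API above, not Flour itself; every name here is illustrative):

```python
class Pipeline:
    """Fluent builder that records a DAG of operators, Flour-style."""
    def __init__(self, ops=None):
        self.ops = ops or []

    def _then(self, name):
        # Each call returns a new pipeline with one more operator appended.
        return Pipeline(self.ops + [name])

    def tokenize(self):
        return self._then("Tokenizer")

    def char_ngram(self, n):
        return self._then(f"CharNgram({n})")

    def word_ngram(self, n):
        return self._then(f"WordNgram({n})")

    def concat(self, other):
        return Pipeline(self.ops + other.ops + ["Concat"])

    def classifier_binary_linear(self):
        return self._then("LogReg")

    def plan(self):
        # Hand the recorded operator DAG to an optimizer/compiler.
        return self.ops

t = Pipeline().tokenize()
prgm = (t.char_ngram(3)
         .concat(t.word_ngram(2))
         .classifier_binary_linear())
print(prgm.plan())
```

The point is that the program builds a *description* of the DAG, which an optimizer can inspect, rather than executing operators eagerly.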

  19. System Design: Offline Phase. 2. The Oven optimizer/compiler builds a Model Plan from the Flour Program: a rule-based optimizer groups operators into stages (Stage 1, Stage 2) and rewrites the plan, e.g., pushing the linear predictor down and removing Concat, yielding a Logical DAG (S1, S2).

  20. System Design: Offline Phase. 2. The Oven optimizer/compiler builds a Model Plan. Alongside the Logical DAG (S1, S2), the Model Plan records the Parameters extracted from the Flour Program (e.g., dictionary, N-gram length) and Statistics (e.g., dense vs. sparse, maximum vector size).

  21. System Design: Offline Phase. 3. The Model Plan is registered to the Runtime: 1. store the parameters and the mapping between logical stages in the Object Store; 2. find the most efficient physical implementation of each stage using the parameters & statistics.

  22. System Design: Offline Phase. 3. The Model Plan is registered to the Runtime (continued): 3. register the selected physical stages to the Catalog, chosen using parameters & statistics such as N-gram length (1 vs. 3) and sparse vs. dense vectors.
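
Picking a physical implementation from parameters and statistics can be sketched as a simple rule table (the rules and implementation names here are hypothetical illustrations; PRETZEL's actual selection logic is richer):

```python
def select_physical_stage(params, stats):
    """Choose a physical implementation for a logical stage.

    `params` and `stats` carry the kinds of fields the slides mention:
    N-gram length, dense vs. sparse, etc.
    """
    if stats["sparse"]:
        # Sparse vectors favor a hash-based accumulation kernel.
        return "sparse-hash-kernel"
    if params["ngram_length"] == 1:
        # Unigrams allow a simple per-token dictionary lookup.
        return "dense-unigram-lookup"
    # General case: dense sliding-window n-gram extraction.
    return "dense-ngram-window"

catalog = {}
def register(model, stage, params, stats):
    # Physical stages are shared: identical choices map to one catalog entry.
    impl = select_physical_stage(params, stats)
    catalog.setdefault(impl, []).append((model, stage))
    return impl

register("Model1", "S1", {"ngram_length": 3}, {"sparse": False})
```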

  23. System Design: Online Phase. 1. A prediction request arrives (<Model1, "Pretzel is tasty">). 2. The Runtime instantiates the model's physical stages along with their parameters from the Object Store. 3. It executes the stages using thread pools managed by the Scheduler. 4. The result is sent back to the client.
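
The online path can be sketched with a standard thread pool (a toy sketch of steps 1-4 above; the stage bodies and names are placeholders, not PRETZEL's scheduler):

```python
from concurrent.futures import ThreadPoolExecutor

# Registered physical stages per model (placeholder stage functions).
physical_stages = {
    "Model1": [
        lambda s: s.split(),                                         # S1: featurize
        lambda toks: "positive" if "tasty" in toks else "negative",  # S2: predict
    ],
}

executor = ThreadPoolExecutor(max_workers=4)  # shared, scheduler-managed pool

def handle_request(model_id, text):
    # 2. Instantiate the model's stages; 3. execute them on the pool.
    def run():
        value = text
        for stage in physical_stages[model_id]:
            value = stage(value)
        return value
    # 4. The future's result is sent back to the client.
    return executor.submit(run).result()

print(handle_request("Model1", "Pretzel is tasty"))  # -> positive
```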

  24. Outline. (Black-box) Prediction Serving Systems. Limitations of Black Box Approaches. PRETZEL: White-box Prediction Serving System. Evaluation. Conclusion.

  25. Evaluation. Q. How does PRETZEL improve performance over black-box approaches, in terms of latency, memory, and throughput? 500 models from the Microsoft Machine Learning team: 250 Sentiment Analysis (memory-bound), 250 Attendee Count (compute-bound). System configuration: 16-core CPU, 32GB RAM, Windows 10, .Net Core 2.0.

  26. Evaluation: Latency. Micro-benchmark (no server-client communication): score 250 Sentiment Analysis models 100 times each; compare ML.Net vs. PRETZEL. [Latency CDF, ms, log-scaled] P99 (hot): ML.Net 0.6ms vs. PRETZEL 0.2ms (3x better). P99 (cold): ML.Net 8.1ms vs. PRETZEL 0.8ms (10x better). Worst case (cold): 45x better.

  27. Evaluation: Memory. Measure cumulative memory usage after loading 250 models (Attendee Count models, smaller than Sentiment Analysis); 4 settings compared. [Cumulative memory usage, log-scaled, vs. number of pipelines] ML.Net + Clipper: 9.7GB. ML.Net: 3.7GB. PRETZEL without Object Store: 2.9GB. PRETZEL: 164MB, i.e., 25x less than ML.Net and 62x less than ML.Net + Clipper.

  28. Evaluation: Throughput. Micro-benchmark: score 250 Attendee Count models 1000 times each; request 1000 queries in a batch; compare ML.Net vs. PRETZEL. [Throughput (K QPS) vs. number of CPU cores: 1, 2, 4, 8] PRETZEL achieves about 10x higher throughput than ML.Net and scales close to ideally with the number of cores. More results in the paper!

  29. Conclusion. PRETZEL is the first white-box prediction serving system for ML pipelines. By using models' structural info, we enable two types of optimizations: end-to-end optimizations generate efficient execution plans for a model; multi-model optimizations let models share computation and memory resources. Our evaluation shows that PRETZEL improves performance compared to black-box systems (e.g., ML.Net): it decreases latency and memory footprint, and increases resource utilization and throughput.

  30. PRETZEL: a White-Box ML Prediction Serving System. Thank you! Questions?
