Stream Processing for Incremental Sliding Window Analytics
This content explores the design requirements, state-of-the-art technologies, trade-offs, goals, and approach for achieving efficient incremental processing in stream analytics. It emphasizes the need to balance advantages of batch-based systems with the efficiency of incremental updates for sliding window analytics. The approach involves automatically adapting data-parallel applications for incremental sliding window analytics, with a focus on behind-the-scenes computation steps and the use of batch-based techniques.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Slider Incremental Sliding Window Analytics Pramod Bhatotia MPI-SWS Umut Acar CMU Flavio Junqueira MSR Cambridge Rodrigo Rodrigues NOVA University of Lisbon Middleware 2014
Data analytics systems Hadoop Spark Naiad Storm S4 Data analytics system Information Raw data E.g. Web-crawl E.g. search E.g. computing PageRank
Design requirements Streaming data Recent trends + Incremental updates Sliding window Incremental sliding window analytics for data stream
State-of-the-art: Stream processing .. mutable state Stream Classification based on programming model .. Batch# 1 Batch# 2 Batch# n input records M M M node 1 R R M M input records Trigger-based systems M node 3 Single batch Batch-based systems M M M Input R R Output node 2 E.g. Storm, S4, Naiad E.g. D-Streams
Trade-offs for incremental updates Batch-based systems Trigger-based systems (re-compute from scratch) (require dynamic algorithms) (+) efficient (-) hard to design (-) inefficient (+) easy to design Slider
Goals 1. Retain the advantages/simplicity of batch-based approach 2. Achieve the efficiency of incremental processing for sliding window analytics
Outline Motivation Basic design Slider design Evaluation
Our approach Take an unmodified data-parallel application written assuming unchanging data Automatically adapt it for incremental sliding window analytics
Behind the scenes Step#1 divide Step#2 build Step #3 perform computation sub-computations dependence graph change propagation We follow this high-level approach for batch-based stream processing
Batch-based sliding window analytics Window . .. Stream M M M M Step#1: Divide the computation Map & Reduce tasks R R R Step#2: Build the dependence graph Data-flow graph of MapReduce
Step#3: Change propagation added removed window B5 Stream B1 B2 B3 B4 M5 M5 M1 M1 M2 M3 M4 R1 R2 R3 Contraction tree # 3 Contraction tree # 1 Replace Reduce tasks with contraction trees Contraction tree # 2
Outline Motivation Basic design Slider design Contraction tree Self-adjusting contraction tree Split processing Evaluation
Contraction tree What: Breaks down the work done by a Reduce task to allow fine-grained change propagation How: Leverages Combiners at the Reducer site
Zoom IN with a single Reducer removed added window B5 Stream B1 B2 B3 B4 M2 M3 M4 M1 M5 M1 Replace Contraction tree R
Example of contraction tree Reduce task
Zoom IN with a single Reducer removed added window B5 Stream B1 B2 B3 B4 M2 M3 M4 M1 M5 M1 Replace Contraction tree R
Basic design w/ contraction tree window removed added B5 Stream B1 B2 B3 B4 M2 M3 M4 M1 M5 M1 Path affected by M1 Path affected by M5 Pramod Bhatotia
Limitation of the contraction tree Na ve grouping of Combiner tasks may lead to sub-optimal reuse of the memoized result Self-adjusting contraction tree
Outline Motivation Basic design Slider design Contraction tree Self-adjusting contraction tree Split processing Evaluation
Self-adjusting contraction tree The tree should have low depth (implies short path length for re-computation) Key ingredients: Balanced tree: sublinear updates w.r.t. window size Self-adjusting capability after change propagation
Self-adjusting contraction tree(s) Different modes of operation General case Append-only Fixed-width Fixed-width
Rotating contraction tree B4 Memoized results are reused Update path for bucket 4
Outline Motivation Basic design Slider design Contraction tree Self-adjusting contraction tree Split processing Evaluation
Split processing Change propagation Background pre-processing Foreground processing
Change propagation for bucket#4 Memoized results are reused Update path for bucket 4
Split processing for bucket#4 Background pre-processing Foreground processing
Outline Motivation Basic design Slider design Evaluation
Evaluating Slider Goal: Determine how Slider works in practice 1. What are the performance benefits? 2. How effective is split processing? 3. What is the overhead for the initial run? more results in the paper Case studies @ MPI-SWS
Q1: Performance gains 5% fixed-width change 25% fixed-width change 4 3.5 3 2.5 2 Speedup 1.5 1 0.5 0 Top-K Sub-string Matrix K-means KNN Speedup up to 3.8X w.r.t. basic contraction tree
Q2: Split processing w/o split processing Background Foreground 1 0.8 0.6 Normalized execution time 0.4 0.2 0 K-means Top-K KNN Matrix Sub-string Foreground processing is faster by 30% on avg.
Q3: Performance overheads 40 35 30 25 Initial run overhead (%) 20 15 10 5 0 K-means Top-K KNN Matrix Sub-string Overheads 2% to 38% for the initial run
Case studies Online Social Networks [IMC 11] Information propagation in Twitter Networked Systems [NSDI 10] Glasnost: Detecting traffic shaping Details in the paper Hybrid CDNs [NSDI 12] Reliable client accounting
Information propagation in Twitter Window change (%) Speedups 8 15 7 13 6 11 Change (%) Speedup 5 9 4 7 3 5 2 3 1 1 Week-1 Week-2 Week-3 Week-4 Speedup > 13X for ~5%window change
Summary Slider enables incremental sliding window analytics Transparently & efficiently Slider design includes Self-adjusting contractions trees for sub-linear updates Split processing for background pre-processing Multi-level trees for general data-flow programs (didn t cover in the talk!)
Thanks! Incremental Sliding Window Analytics Transparent + Efficient bhatotia@mpi-sws.org