Stream Processing for Incremental Sliding Window Analytics

Slide Note

This content explores the design requirements, state-of-the-art technologies, trade-offs, goals, and approach for achieving efficient incremental processing in stream analytics. It emphasizes the need to balance advantages of batch-based systems with the efficiency of incremental updates for sliding window analytics. The approach involves automatically adapting data-parallel applications for incremental sliding window analytics, with a focus on behind-the-scenes computation steps and the use of batch-based techniques.

ellanor Follow

Uploaded on Sep 21, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

Slider Incremental Sliding Window Analytics Pramod Bhatotia MPI-SWS Umut Acar CMU Flavio Junqueira MSR Cambridge Rodrigo Rodrigues NOVA University of Lisbon Middleware 2014

Data analytics systems Hadoop Spark Naiad Storm S4 Data analytics system Information Raw data E.g. Web-crawl E.g. search E.g. computing PageRank

Design requirements Streaming data Recent trends + Incremental updates Sliding window Incremental sliding window analytics for data stream

State-of-the-art: Stream processing .. mutable state Stream Classification based on programming model .. Batch# 1 Batch# 2 Batch# n input records M M M node 1 R R M M input records Trigger-based systems M node 3 Single batch Batch-based systems M M M Input R R Output node 2 E.g. Storm, S4, Naiad E.g. D-Streams

Trade-offs for incremental updates Batch-based systems Trigger-based systems (re-compute from scratch) (require dynamic algorithms) (+) efficient (-) hard to design (-) inefficient (+) easy to design Slider

Goals 1. Retain the advantages/simplicity of batch-based approach 2. Achieve the efficiency of incremental processing for sliding window analytics

Outline Motivation Basic design Slider design Evaluation

Our approach Take an unmodified data-parallel application written assuming unchanging data Automatically adapt it for incremental sliding window analytics

Behind the scenes Step#1 divide Step#2 build Step #3 perform computation sub-computations dependence graph change propagation We follow this high-level approach for batch-based stream processing

Batch-based sliding window analytics Window . .. Stream M M M M Step#1: Divide the computation Map & Reduce tasks R R R Step#2: Build the dependence graph Data-flow graph of MapReduce

Step#3: Change propagation added removed window B5 Stream B1 B2 B3 B4 M5 M5 M1 M1 M2 M3 M4 R1 R2 R3 Contraction tree # 3 Contraction tree # 1 Replace Reduce tasks with contraction trees Contraction tree # 2

Outline Motivation Basic design Slider design Contraction tree Self-adjusting contraction tree Split processing Evaluation

Contraction tree What: Breaks down the work done by a Reduce task to allow fine-grained change propagation How: Leverages Combiners at the Reducer site

Zoom IN with a single Reducer removed added window B5 Stream B1 B2 B3 B4 M2 M3 M4 M1 M5 M1 Replace Contraction tree R

Example of contraction tree Reduce task

Zoom IN with a single Reducer removed added window B5 Stream B1 B2 B3 B4 M2 M3 M4 M1 M5 M1 Replace Contraction tree R

Basic design w/ contraction tree window removed added B5 Stream B1 B2 B3 B4 M2 M3 M4 M1 M5 M1 Path affected by M1 Path affected by M5 Pramod Bhatotia

Limitation of the contraction tree Na ve grouping of Combiner tasks may lead to sub-optimal reuse of the memoized result Self-adjusting contraction tree

Outline Motivation Basic design Slider design Contraction tree Self-adjusting contraction tree Split processing Evaluation

Self-adjusting contraction tree The tree should have low depth (implies short path length for re-computation) Key ingredients: Balanced tree: sublinear updates w.r.t. window size Self-adjusting capability after change propagation

Self-adjusting contraction tree(s) Different modes of operation General case Append-only Fixed-width Fixed-width

Fixed-width window slides

Rotating contraction tree

Rotating contraction tree B4 Memoized results are reused Update path for bucket 4

Outline Motivation Basic design Slider design Contraction tree Self-adjusting contraction tree Split processing Evaluation

Split processing Change propagation Background pre-processing Foreground processing

Change propagation for bucket#4 Memoized results are reused Update path for bucket 4

Split processing for bucket#4 Background pre-processing Foreground processing

Outline Motivation Basic design Slider design Evaluation

Evaluating Slider Goal: Determine how Slider works in practice 1. What are the performance benefits? 2. How effective is split processing? 3. What is the overhead for the initial run? more results in the paper Case studies @ MPI-SWS

Q1: Performance gains 5% fixed-width change 25% fixed-width change 4 3.5 3 2.5 2 Speedup 1.5 1 0.5 0 Top-K Sub-string Matrix K-means KNN Speedup up to 3.8X w.r.t. basic contraction tree

Q2: Split processing w/o split processing Background Foreground 1 0.8 0.6 Normalized execution time 0.4 0.2 0 K-means Top-K KNN Matrix Sub-string Foreground processing is faster by 30% on avg.

Q3: Performance overheads 40 35 30 25 Initial run overhead (%) 20 15 10 5 0 K-means Top-K KNN Matrix Sub-string Overheads 2% to 38% for the initial run

Case studies Online Social Networks [IMC 11] Information propagation in Twitter Networked Systems [NSDI 10] Glasnost: Detecting traffic shaping Details in the paper Hybrid CDNs [NSDI 12] Reliable client accounting

Information propagation in Twitter Window change (%) Speedups 8 15 7 13 6 11 Change (%) Speedup 5 9 4 7 3 5 2 3 1 1 Week-1 Week-2 Week-3 Week-4 Speedup > 13X for ~5%window change

Summary Slider enables incremental sliding window analytics Transparently & efficiently Slider design includes Self-adjusting contractions trees for sub-linear updates Split processing for background pre-processing Multi-level trees for general data-flow programs (didn t cover in the talk!)