Stream Processing for Incremental Sliding Window Analytics

Slide Note
Embed
Share

This content explores the design requirements, state-of-the-art technologies, trade-offs, goals, and approach for achieving efficient incremental processing in stream analytics. It emphasizes the need to balance advantages of batch-based systems with the efficiency of incremental updates for sliding window analytics. The approach involves automatically adapting data-parallel applications for incremental sliding window analytics, with a focus on behind-the-scenes computation steps and the use of batch-based techniques.


Uploaded on Sep 21, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Slider Incremental Sliding Window Analytics Pramod Bhatotia MPI-SWS Umut Acar CMU Flavio Junqueira MSR Cambridge Rodrigo Rodrigues NOVA University of Lisbon Middleware 2014

  2. Data analytics systems Hadoop Spark Naiad Storm S4 Data analytics system Information Raw data E.g. Web-crawl E.g. search E.g. computing PageRank

  3. Design requirements Streaming data Recent trends + Incremental updates Sliding window Incremental sliding window analytics for data stream

  4. State-of-the-art: Stream processing .. mutable state Stream Classification based on programming model .. Batch# 1 Batch# 2 Batch# n input records M M M node 1 R R M M input records Trigger-based systems M node 3 Single batch Batch-based systems M M M Input R R Output node 2 E.g. Storm, S4, Naiad E.g. D-Streams

  5. Trade-offs for incremental updates Batch-based systems Trigger-based systems (re-compute from scratch) (require dynamic algorithms) (+) efficient (-) hard to design (-) inefficient (+) easy to design Slider

  6. Goals 1. Retain the advantages/simplicity of batch-based approach 2. Achieve the efficiency of incremental processing for sliding window analytics

  7. Outline Motivation Basic design Slider design Evaluation

  8. Our approach Take an unmodified data-parallel application written assuming unchanging data Automatically adapt it for incremental sliding window analytics

  9. Behind the scenes Step#1 divide Step#2 build Step #3 perform computation sub-computations dependence graph change propagation We follow this high-level approach for batch-based stream processing

  10. Batch-based sliding window analytics Window . .. Stream M M M M Step#1: Divide the computation Map & Reduce tasks R R R Step#2: Build the dependence graph Data-flow graph of MapReduce

  11. Step#3: Change propagation added removed window B5 Stream B1 B2 B3 B4 M5 M5 M1 M1 M2 M3 M4 R1 R2 R3 Contraction tree # 3 Contraction tree # 1 Replace Reduce tasks with contraction trees Contraction tree # 2

  12. Outline Motivation Basic design Slider design Contraction tree Self-adjusting contraction tree Split processing Evaluation

  13. Contraction tree What: Breaks down the work done by a Reduce task to allow fine-grained change propagation How: Leverages Combiners at the Reducer site

  14. Zoom IN with a single Reducer removed added window B5 Stream B1 B2 B3 B4 M2 M3 M4 M1 M5 M1 Replace Contraction tree R

  15. Example of contraction tree Reduce task

  16. Zoom IN with a single Reducer removed added window B5 Stream B1 B2 B3 B4 M2 M3 M4 M1 M5 M1 Replace Contraction tree R

  17. Basic design w/ contraction tree window removed added B5 Stream B1 B2 B3 B4 M2 M3 M4 M1 M5 M1 Path affected by M1 Path affected by M5 Pramod Bhatotia

  18. Limitation of the contraction tree Na ve grouping of Combiner tasks may lead to sub-optimal reuse of the memoized result Self-adjusting contraction tree

  19. Outline Motivation Basic design Slider design Contraction tree Self-adjusting contraction tree Split processing Evaluation

  20. Self-adjusting contraction tree The tree should have low depth (implies short path length for re-computation) Key ingredients: Balanced tree: sublinear updates w.r.t. window size Self-adjusting capability after change propagation

  21. Self-adjusting contraction tree(s) Different modes of operation General case Append-only Fixed-width Fixed-width

  22. Fixed-width window slides

  23. Rotating contraction tree

  24. Rotating contraction tree B4 Memoized results are reused Update path for bucket 4

  25. Outline Motivation Basic design Slider design Contraction tree Self-adjusting contraction tree Split processing Evaluation

  26. Split processing Change propagation Background pre-processing Foreground processing

  27. Change propagation for bucket#4 Memoized results are reused Update path for bucket 4

  28. Split processing for bucket#4 Background pre-processing Foreground processing

  29. Outline Motivation Basic design Slider design Evaluation

  30. Evaluating Slider Goal: Determine how Slider works in practice 1. What are the performance benefits? 2. How effective is split processing? 3. What is the overhead for the initial run? more results in the paper Case studies @ MPI-SWS

  31. Q1: Performance gains 5% fixed-width change 25% fixed-width change 4 3.5 3 2.5 2 Speedup 1.5 1 0.5 0 Top-K Sub-string Matrix K-means KNN Speedup up to 3.8X w.r.t. basic contraction tree

  32. Q2: Split processing w/o split processing Background Foreground 1 0.8 0.6 Normalized execution time 0.4 0.2 0 K-means Top-K KNN Matrix Sub-string Foreground processing is faster by 30% on avg.

  33. Q3: Performance overheads 40 35 30 25 Initial run overhead (%) 20 15 10 5 0 K-means Top-K KNN Matrix Sub-string Overheads 2% to 38% for the initial run

  34. Case studies Online Social Networks [IMC 11] Information propagation in Twitter Networked Systems [NSDI 10] Glasnost: Detecting traffic shaping Details in the paper Hybrid CDNs [NSDI 12] Reliable client accounting

  35. Information propagation in Twitter Window change (%) Speedups 8 15 7 13 6 11 Change (%) Speedup 5 9 4 7 3 5 2 3 1 1 Week-1 Week-2 Week-3 Week-4 Speedup > 13X for ~5%window change

  36. Summary Slider enables incremental sliding window analytics Transparently & efficiently Slider design includes Self-adjusting contractions trees for sub-linear updates Split processing for background pre-processing Multi-level trees for general data-flow programs (didn t cover in the talk!)

  37. Thanks! Incremental Sliding Window Analytics Transparent + Efficient bhatotia@mpi-sws.org

Related


More Related Content