Advancements in Stream Processing with Augmented Sketch Technology
Augmented Sketch is a cutting-edge technology developed by Pratanu Roy, Arijit Khan, and Gustavo Alonso at ETH Zurich and Nanyang Technical University. This technology enables faster and more accurate processing of data streams, including IP traffic, phone calls, sensor measurements, and web interactions. The system addresses challenges in stream processing, such as space, accuracy, and efficiency trade-offs, and offers solutions for load balancing, ranking, and frequent itemset mining. With a focus on improving accuracy for frequent items and reducing misclassification, Augmented Sketch enhances throughput and efficiently handles hot and cold data scenarios.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Augmented Sketch: Faster and More Accurate Stream Processing Pratanu Roy Arijit Khan Gustavo Alonso Systems Group ETH Zurich Nanyang Technical University Singapore Systems Group ETH Zurich
Data Stream Processing f( ) = 3 e f( ) = 2 e . e a a c e e Data Stream IP traffic, phone calls, sensor measurements, web clicks and crawls P. Roy, A. Khan, G. Alonso 1/10
Data Stream Processing f( ) = 3 e f( ) = 2 e . e a a c e e Data Stream IP traffic, phone calls, sensor measurements, web clicks and crawls Applications: Load balancing Ranking Frequent itemsets mining Classification P. Roy, A. Khan, G. Alonso 1/10
Challenges in Stream Processing Trade-off among Space, Accuracy, and Efficiency: -- Increasing space increases accuracy, but reduces throughput Other requirements: -- Build summary in one pass over the stream -- Incremental updates in summary P. Roy, A. Khan, G. Alonso 2/10
Related Work Sketch Space-saving Wavelets Sampling h + f H1(e) (e,f) + f w Hw(e) + f Count-Min Sketch P. Roy, A. Khan, G. Alonso 3/10
Our Motivation Improve accuracy for frequent items -- Critical for threshold checking (service-level agreement ) , ranking, load-balancing Reduce misclassification -- Frequent items mining the first step of frequent itemsets mining -- Even a small number of misclassified items lead to a large number of false-positive itemsets Improve throughput P. Roy, A. Khan, G. Alonso 3/10
Main Takeaway Hot data Frequency Cold data Items Let the common case go faster P. Roy, A. Khan, G. Alonso 4/10
Main Takeaway: Solution Framework Cold data Hot data Optimized Codepath (Filter) State-of-the-art Sketch Algorithms Input Optimized codepath acts like a filter for hot data Improvements: accuracy, throughput P. Roy, A. Khan, G. Alonso 4/10
Main Takeaway: Desired Outcome Solution framework Throughput State of the art Skew P. Roy, A. Khan, G. Alonso 4/10
Augmented Sketch (ASketch) Items with lower count Input Filter Count Min Very small (~0.3%) and make the processing very fast (with SIMD) Items with higher count estimate(k) > minimum in (filter) Challenges -- Removing items from sketch -- Cascading exchanges P. Roy, A. Khan, G. Alonso 5/10
Exchange Mechanism Count-Min filter A 8 B Items 4 3 8 7 H1 A 7 10 new 6 4 6 9 H2 2 1 old Found in filter Update in filter Count-Min filter A B Items 8 9 4 3 7 H1 C 8 10 new 6 4 6 9 H2 10 2 1 old Not found in filter Update in count-min estimate (C) > minimum count (filter) initiate an exchange P. Roy, A. Khan, G. Alonso 6/10
Exchange Mechanism Count-Min filter A C B Items A 9 9 4 3 7 H1 8 9 10 new 8 6 4 6 10 10 H2 2 9 1 old 2 Step 1: Move C to filter Count-Min filter C B Items 4 3 9 13 7 H1 9 10 new 6 10 4 6 10 H2 9 1 old Step 2: Move A to Count-Min with (8-2) = 6 We do not perform multiple exchanges P. Roy, A. Khan, G. Alonso 6/10
Other Technical Contributions Theoretical Error Bounds Four different filter implementation -- Array, Strict Heap, Relaxed Heap, Stream Summary Hardware-conscious filter (SIMD) Pipeline parallelism SPMD parallelism P. Roy, A. Khan, G. Alonso 7/10
Experimental Results: Stream Processing Throughput 128 Throughput ( million items/sec) Asketch 64 FCM [Cormode et al., Algo 05] Count-Min Count-Min 32 (FCM) [Thomas et al, ICDE 09] Frequency Aware Counting holistic UDAFS 16 [Cormode et al, SIGMOD 04] Holistic UDAFs 8 4 0.0 1.0 2.0 3.0 Skew (Zipf parameter z) Synthetic data, 8M data items, stream size: 32M, sketch size = 128 KB, filter size = 0.4 KB P. Roy, A. Khan, G. Alonso 8/10
Experimental Results: Query Processing Throughput 128 Throughput (million items/sec) Asketch 64 FCM Count-Min 32 holistic UDAFs 16 8 4 0.0 0.5 1.0 Skew (Zipf parameter z) 1.5 2.0 2.5 3.0 Synthetic data, 8M data items, stream size: 32M, sketch size = 128 KB, filter size = 0.4 KB Queries are generated by sampling the input distribution P. Roy, A. Khan, G. Alonso 9/10
Experimental Results: Accuracy Improvement 4 3 Observed error (%) 2 1 0 Count-Min ASketch Holistic UDAFs FCM Asketch-FCM IP-Trace, 13M data items, stream size = 461M, sketch size = 128 KB, zipf 0.9 Queries are generated by sampling the input distribution P. Roy, A. Khan, G. Alonso 10/10
Conclusions ASketch dynamically identifies and aggregates most frequent items -- Improves throughput and accuracy of existing sketches -- Allows efficient utilization of modern hardware features such as SIMD, multi-cores, etc. Future work: investigate the use of Asketch in the context of machine learning and data mining applications P. Roy, A. Khan, G. Alonso
Impact of Misclassifications Count-Min ASketch 1000000 Average relative error 10000 100 1 16KB 24KB 32KB Sketch Size 8M data items; stream size = 32M, filter size = 0.4 KB P. Roy, A. Khan, G. Alonso
Dataflow with Pipeline Parallelism Core 1 Core 2 Optimized codepath Count-Min Input Use message passing to communicate across the cores P. Roy, A. Khan, G. Alonso
Parallel-ASketch vs ASketch 256 Throughput (million items/sec) Parallel-ASketch 128 ASketch 64 32 16 8 4 0.0 0.5 1.0 Skew (Zipf parameter z) 1.5 2.0 2.5 3.0 Synthetic data, 8M data items, stream size: 32M, sketch size = 128 KB, filter size = 0.4 KB P. Roy, A. Khan, G. Alonso