
Challenges and Tools for Processing Streaming Data
Explore the challenges of streaming data processing, including consistency, the throughput vs. latency trade-off, and the complexities of handling time. Discover tools such as windowing and watermarks for processing unbounded data streams efficiently.
Presentation Transcript
Streaming
COS 518: Advanced Computer Systems, Lecture 11
Daniel Suo
What is streaming? Fast data! Fast processing! Lots of data!
Streaming = unbounded data (Batch = bounded data)
Other definitions are somewhat misleading
We can use batch frameworks for stream processing (how? see the sketch below)
Batch frameworks can also handle scenarios historically covered by stream frameworks (e.g., low-latency, approximate results)
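One way to answer the "(how?)" above is micro-batching: cut the unbounded stream into small bounded chunks and run an ordinary batch job over each one. Below is a minimal sketch, with names like `micro_batches` and `batch_count` chosen for illustration rather than taken from the lecture.

```python
import time
from typing import Iterator, List


def micro_batches(stream: Iterator[str], interval_s: float,
                  max_size: int) -> Iterator[List[str]]:
    """Cut an unbounded iterator into bounded chunks by size or elapsed time."""
    batch: List[str] = []
    deadline = time.monotonic() + interval_s
    for item in stream:
        batch.append(item)
        if len(batch) >= max_size or time.monotonic() >= deadline:
            yield batch                              # hand a bounded chunk to a batch job
            batch = []
            deadline = time.monotonic() + interval_s
    if batch:
        yield batch                                  # flush whatever is left


def batch_count(records: List[str]) -> int:
    """Stand-in for any existing batch computation (here, just a count)."""
    return len(records)


if __name__ == "__main__":
    fake_stream = (f"event-{i}" for i in range(10))  # pretend this never ends
    for chunk in micro_batches(fake_stream, interval_s=1.0, max_size=4):
        print(batch_count(chunk), chunk)
```

This is also why the trade-off cuts both ways: larger batches raise throughput but every record waits longer before it is processed.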
Three major challenges
Consistency: historically, streaming systems were built to decrease latency and made many sacrifices along the way (at-most-once processing, anyone?)
Throughput vs. latency: typically a trade-off (why?)
Time: as we will soon see, streaming introduces some new challenges here
We've covered consistency in a lot of detail, so let's investigate time.
...but if you give a data scientist some data
Once we move to unbounded data, we need new methods to process it, whether for the sake of capacity (not enough machines) or availability (the data doesn't exist yet)
Easiest thing to do: window by processing time
Windowing by processing time is great
Easy to implement and to verify correctness
Great for applications like filtering or monitoring
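A minimal single-process sketch (my own illustration, not code from the lecture) of why processing-time windowing is so easy: each record is bucketed by the wall-clock time at which it arrives, so no reordering logic is ever needed.

```python
import time
from collections import defaultdict
from typing import Any, DefaultDict, List

WINDOW_SECONDS = 5  # fixed window length in processing time


def window_start(arrival_ts: float) -> int:
    """Map an arrival timestamp to the start of its fixed window."""
    return int(arrival_ts // WINDOW_SECONDS) * WINDOW_SECONDS


windows: DefaultDict[int, List[Any]] = defaultdict(list)


def on_record(record: Any) -> None:
    """Bucket each record by when it shows up (processing time, not event time)."""
    windows[window_start(time.time())].append(record)


if __name__ == "__main__":
    for r in ["a", "b", "c"]:
        on_record(r)
    print(dict(windows))  # all three land in the current window
```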
But what if we care about when events happen?
If we associate event times with records, then items can now arrive out of order! (why?)
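A tiny illustration of the problem (made-up records, my own example): once records carry event times, they can arrive out of order due to retries, buffering, or slow links, so an event-time window that looked complete may still be missing data.

```python
from collections import defaultdict

WINDOW = 10  # seconds of event time per window

# (event_time, value) pairs in *arrival* order; the record with event time 3
# shows up after records from a later window.
arrivals = [(1, "a"), (12, "b"), (15, "c"), (3, "d")]

windows = defaultdict(list)
for event_time, value in arrivals:
    windows[(event_time // WINDOW) * WINDOW].append(value)
    print(f"after {value!r}: window [0, 10) = {windows[0]}")

# Window [0, 10) looked finished after 'a', yet 'd' (event time 3) arrived
# later and changed its contents; this is exactly where the tools on the
# next slide come in.
```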
But that's not the case, so we need tools
Windows: how should we group data together?
Watermarks: how can we tell when the last piece of data for some window has arrived?
Triggers: how can we initiate an early result?
Accumulators: what do we do with the results (correct, modified, or retracted)?
All topics covered in next week's readings!
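As a preview, here is a rough sketch of one of those tools, the watermark. The heuristic below is my own simplification for illustration (the readings cover the real mechanisms): track a watermark that trails the maximum event time seen so far, and treat a window as complete once the watermark passes its end.

```python
from collections import defaultdict

WINDOW = 10           # event-time window size in seconds
ALLOWED_LATENESS = 2  # how far the watermark trails the max event time seen

arrivals = [(1, "a"), (12, "b"), (3, "d"), (15, "c"), (25, "e")]

windows = defaultdict(list)
closed = set()
max_event_time = 0

for event_time, value in arrivals:
    windows[(event_time // WINDOW) * WINDOW].append(value)

    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS  # heuristic watermark

    # Emit every window whose end the watermark has passed.
    for start in sorted(windows):
        if start not in closed and start + WINDOW <= watermark:
            print(f"window [{start}, {start + WINDOW}) -> {windows[start]}")
            closed.add(start)

# Note: 'd' (event time 3) arrives after window [0, 10) was already emitted;
# deciding whether to update or retract that result is what triggers and
# accumulators are for.
```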