Space-Efficient Estimation of Statistics Over Sub-Sampled Streams
This research focuses on efficiently estimating statistics over sub-sampled streams, particularly in the context of IP packet streams for traffic monitoring. Various types of sampling methods are explored, such as Bernoulli sampling, to compute key metrics like frequency moments, number of distinct elements, empirical entropy, and heavy hitters. The study highlights the challenges and tradeoffs involved in processing sampled streams with limited memory resources.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Space-Efficient Estimation of Statistics over Sub-Sampled Streams Andrew McGregor University of Massachusetts mcgregor@cs.umass.edu A. Pavan Iowa State University pavan@cs.iastate.edu Srikanta Tirthapura Iowa State University snt@iastate.edu David Woodruff IBM Research Almaden dpwoodru@us.ibm.com
Sampled IP Packet Streams IP Traffic Monitoring 40Gbps 100s of Gbps Cisco Netflow standard for monitoring Sampled Netflow Network Monitor sees only a random sample of the original packet stream Different Types of Sampling Used IETF Working Group (psamp) 9/13/2024 Estimation from Sub-Sampled Streams 2
Sampled Streams What can we compute over a stream by observing only a random sample of the stream? Two Constraints: Only observe a random sample Streaming, Memory Bound Computation 9/13/2024 Estimation from Sub-Sampled Streams 3
Model: Bernoulli Sampling Original Stream P Bernoulli Sampler selects each ai with probability p Sampled Stream L Stream Processor Aggregates and Analysis of Original Stream P 9/13/2024 Estimation from Sub-Sampled Streams 4
Aggregates The stream is a sequence of items What matters is the vector where fi is frequency of i Frequency Moments Number of Distinct Elements Empirical Entropy Heavy Hitters 9/13/2024 Estimation from Sub-Sampled Streams 5
Preliminaries Let gi be the frequency of item i in the substream L contains the frequency vector Randomized multiplicative approximation, for parameters 9/13/2024 Estimation from Sub-Sampled Streams 6
Results Number of Distinct Elements (Known) Upper Bound on Result Quality Simple Streaming Algorithm that meets the bound Frequency Moments, Fk, k > 0 Smaller values of pincrease streaming space complexity Matching upper and lower bounds (w.r.t p) Tradeoff between processing time and space Entropy Matching Upper and Lower Bounds for Additive Error Relative Error Impossible in small space Heavy Hitters Sampling is a good fit 9/13/2024 Estimation from Sub-Sampled Streams 7
Related Work Duffield, Lund, Thorup, Properties and prediction of flow statistics from sampled packet streams , IMC 2002 Duffield, Lund, and Thorup, Estimating flow distributions from sampled flow statistics , SIGCOMM 2003 Rusu and Dobra, Sketching sampled data streams ICDE 2009 Bar-Yossef, Sampling Lower Bounds via Information Theory , STOC 2003 9/13/2024 Estimation from Sub-Sampled Streams 8
Results: Number of Distinct Elements For constant p, estimate F0(P) from observing L Any algorithm must have relative error (Charikar et al. PODS 2000) A simple streaming algorithm has relative error Other Estimators, such as GEE (Generalized Error Estimator) also possible in single pass 9/13/2024 Estimation from Sub-Sampled Streams 9
Frequency Moments Fk Theorem (Upper Bound): There is a one pass streaming algorithm which observes L and outputs a -estimator to where using space. assuming 9/13/2024 Estimation from Sub-Sampled Streams 10
Computing F2: Why is This Hard? 1 2 3 4 5 6 n sampler 1 3 7 .. 9/13/2024 Estimation from Sub-Sampled Streams 11
Algorithm 1 for F2 Algorithm 1. Estimate using streaming algorithm on L 2. Use in above formula for Z 9/13/2024 Estimation from Sub-Sampled Streams 12
Algorithm 1 for F2 (contd) To get -approximation for Z, space needed is Issue: Need to estimate with very high accuracy to get good relative error for 9/13/2024 Estimation from Sub-Sampled Streams 13
Algorithm 2 for F2 Collisions. The number of 2-wise collisions in P Observation: Algorithm: Estimate and with mult. error 9/13/2024 Estimation from Sub-Sampled Streams 14
Estimating Observation: If estimated accurately, we are done But this is hard in general, in small space 9/13/2024 Estimation from Sub-Sampled Streams 15
Estimating Case I: Case 2: Estimate hence Accurate estimate of not needed with good relative error Get an estimate within a multiplicative error of 3 9/13/2024 Estimation from Sub-Sampled Streams 16
Estimating Estimate with good relative error when Technique due to Indyk & Woodruff (STOC 2005) Divide items {1,2..,m} into classes based on frequency Estimate the size of different classes that contribute to the final result 9/13/2024 Estimation from Sub-Sampled Streams 17
Tight Lower Bound for Theorem: Any constant-pass streaming algo. that -approximates for a sufficiently small constants by observing a sampled stream, in the Bernoulli sampling model, requires bits of space 9/13/2024 Estimation from Sub-Sampled Streams 18
Time-Space Tradeoff With sampled stream at probability p Processing Time: Streaming Space: Product: 9/13/2024 Estimation from Sub-Sampled Streams 19
Entropy No multiplicative error approximation possible with probability 9/10, even if p > The entropy of sampled stream could be zero, while that of the original stream non-zero If there is an approximation H to such that with prob. at least 99/100 9/13/2024 Estimation from Sub-Sampled Streams 20
Heavy Hitters The frequency of heavy hitters are (approximately) proportionately maintained in the sampled stream Precise upper and lower bounds in paper 9/13/2024 Estimation from Sub-Sampled Streams 21
Conclusions Extreme Volume Data Streams Fk Upper and Lower Bounds for Fk, k > 0 Smooth Space-Time tradeoff Number of distinct Elements, Entropy Sampling harms, streaming does not Heavy Hitters: Sampling and Streaming are both ok 9/13/2024 Estimation from Sub-Sampled Streams 22