Space-Efficient Estimation of Statistics Over Sub-Sampled Streams

 
Space-Efficient Estimation of Statistics
over Sub-Sampled Streams
Sampled IP Packet Streams
 
IP Traffic Monitoring
40Gbps – 100s of Gbps
Cisco Netflow standard for monitoring
Sampled Netflow
Network Monitor sees only a random sample of the
original packet stream
Different Types of Sampling Used
IETF Working Group (psamp)
Estimation from Sub-Sampled Streams
2
9/13/2024
Sampled Streams
 
What can we compute over a stream by
observing only a random sample of the
stream?
 
Two Constraints:
Only observe a 
random sample
Streaming
, Memory Bound Computation
Estimation from Sub-Sampled Streams
3
9/13/2024
 
Model: Bernoulli Sampling
 
Estimation from Sub-Sampled Streams
 
4
Bernoulli Sampler
selects each a
i
 with
probability p
 
Sampled
Stream L
 
9/13/2024
 
Original Stream P
 
Aggregates and Analysis
of Original Stream P
 
Aggregates
 
The stream is a sequence of items
What matters is the vector                                                  where 
f
i 
 is
frequency of i
 
 
Frequency Moments
 
Number of Distinct Elements
 
Empirical Entropy
 
Heavy Hitters
 
Estimation from Sub-Sampled Streams
 
5
 
9/13/2024
 
Preliminaries
 
Let 
g
i
 be the frequency of item 
i
 in the
substream
L 
contains the frequency vector
 
Randomized multiplicative approximation, for
parameters
 
Estimation from Sub-Sampled Streams
 
6
 
9/13/2024
Results
 
Number of Distinct Elements
(Known) Upper Bound on Result Quality
Simple Streaming Algorithm that meets the bound
 
Frequency Moments, F
k
, k > 0
Smaller values of 
p
 
increase
 streaming space complexity
Matching upper and lower bounds (w.r.t   
p
)
Tradeoff between processing time and space
 
Entropy
Matching Upper and Lower Bounds for Additive Error
Relative Error Impossible in small space
 
Heavy Hitters
Sampling is a good fit
9/13/2024
Estimation from Sub-Sampled Streams
7
Related Work
 
Duffield, Lund, Thorup, “Properties and prediction of flow
statistics from sampled packet streams”, IMC 2002
 
Duffield, Lund, and Thorup,  “Estimating flow distributions
from sampled flow statistics”, SIGCOMM 2003
 
Rusu and Dobra, “Sketching sampled data streams” ICDE
2009
 
Bar-Yossef, “Sampling Lower Bounds via Information
Theory”, STOC 2003
Estimation from Sub-Sampled Streams
8
9/13/2024
Results: Number of Distinct Elements
 
For constant 
p
, estimate F
0
(P) from observing L
Any algorithm must have relative error
(Charikar et al. PODS 2000)
 
A simple streaming algorithm has relative
error
 
Other Estimators, such as GEE (Generalized
Error Estimator) also possible in single pass
Estimation from Sub-Sampled Streams
9
9/13/2024
 
Frequency Moments F
k
 
Theorem (Upper Bound)
: There is a one pass
streaming algorithm which observes L and
outputs a                    -estimator to
where                 using                               space.
 
   assuming
 
Estimation from Sub-Sampled Streams
 
10
 
9/13/2024
Computing F
2
: Why is This Hard?
9/13/2024
Estimation from Sub-Sampled Streams
11
sampler
1
Algorithm 1 for F
2
 
 
 
 
Algorithm
1.
Estimate                 using streaming algorithm on L
2.
Use in above formula for Z
Estimation from Sub-Sampled Streams
12
9/13/2024
Algorithm 1 for F
2 
(contd)
 
To get                       -approximation for 
Z
, space
needed is
 
Issue:
 Need to estimate                with very high
accuracy to get good relative error for
 
Estimation from Sub-Sampled Streams
13
9/13/2024
Algorithm 2 for F
2
 
Collisions. 
The number of 2-wise collisions in P
 
 
Observation:
 
Algorithm:
Estimate and                               with mult. error
Estimation from Sub-Sampled Streams
14
9/13/2024
Estimating
 
Observation:
 
 
If                  estimated accurately, we are done
 
But this is hard in general, in small space
Estimation from Sub-Sampled Streams
15
9/13/2024
 
Estimating
 
Estimation from Sub-Sampled Streams
 
16
 
Case I:
 
 
 
 
 Estimate                      hence
   with good relative error
 
Case 2:
 
  Accurate estimate of
    not needed
  
Get an estimate within a
    multiplicative error of 3
 
9/13/2024
Estimating
 
Estimate with good relative error when
 
Technique due to Indyk & Woodruff (STOC
2005)
Divide items 
{1,2..,m} 
into classes based on
frequency
Estimate the size of different classes that
contribute to the final result
9/13/2024
Estimation from Sub-Sampled Streams
17
 
Tight Lower Bound for
 
Theorem
: Any constant-pass streaming algo.
that                      -approximates
for a sufficiently small constants
by observing a sampled stream, in the
Bernoulli sampling model, requires
                                      bits of space
 
9/13/2024
 
Estimation from Sub-Sampled Streams
 
18
 
Time-Space Tradeoff
 
With sampled stream at probability 
p
 
Processing Time:
 
Streaming Space:
 
Product:
 
9/13/2024
 
Estimation from Sub-Sampled Streams
 
19
Entropy
 
No multiplicative error approximation possible
with probability 9/10, even if p > ½
The entropy of sampled stream could be zero,
while that of the original stream non-zero
If                          there is an approximation H’
to              such that
                                  with prob. at least 99/100
9/13/2024
Estimation from Sub-Sampled Streams
20
 
Heavy Hitters
 
The frequency of heavy hitters are
(approximately) proportionately maintained in
the sampled stream
 
Precise upper and lower bounds in paper
 
9/13/2024
 
Estimation from Sub-Sampled Streams
 
21
Conclusions
 
Extreme Volume Data Streams
 
F
k
Upper and Lower Bounds for F
k
, k > 0
Smooth Space-Time tradeoff
 
Number of distinct Elements, Entropy
Sampling harms, streaming does not
 
Heavy Hitters:
Sampling and Streaming are both ok
9/13/2024
Estimation from Sub-Sampled Streams
22
Slide Note
Embed
Share

This research focuses on efficiently estimating statistics over sub-sampled streams, particularly in the context of IP packet streams for traffic monitoring. Various types of sampling methods are explored, such as Bernoulli sampling, to compute key metrics like frequency moments, number of distinct elements, empirical entropy, and heavy hitters. The study highlights the challenges and tradeoffs involved in processing sampled streams with limited memory resources.

  • Efficient Estimation
  • Sub-Sampled Streams
  • IP Packet Streams
  • Traffic Monitoring
  • Sampling Methods

Uploaded on Sep 13, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Space-Efficient Estimation of Statistics over Sub-Sampled Streams Andrew McGregor University of Massachusetts mcgregor@cs.umass.edu A. Pavan Iowa State University pavan@cs.iastate.edu Srikanta Tirthapura Iowa State University snt@iastate.edu David Woodruff IBM Research Almaden dpwoodru@us.ibm.com

  2. Sampled IP Packet Streams IP Traffic Monitoring 40Gbps 100s of Gbps Cisco Netflow standard for monitoring Sampled Netflow Network Monitor sees only a random sample of the original packet stream Different Types of Sampling Used IETF Working Group (psamp) 9/13/2024 Estimation from Sub-Sampled Streams 2

  3. Sampled Streams What can we compute over a stream by observing only a random sample of the stream? Two Constraints: Only observe a random sample Streaming, Memory Bound Computation 9/13/2024 Estimation from Sub-Sampled Streams 3

  4. Model: Bernoulli Sampling Original Stream P Bernoulli Sampler selects each ai with probability p Sampled Stream L Stream Processor Aggregates and Analysis of Original Stream P 9/13/2024 Estimation from Sub-Sampled Streams 4

  5. Aggregates The stream is a sequence of items What matters is the vector where fi is frequency of i Frequency Moments Number of Distinct Elements Empirical Entropy Heavy Hitters 9/13/2024 Estimation from Sub-Sampled Streams 5

  6. Preliminaries Let gi be the frequency of item i in the substream L contains the frequency vector Randomized multiplicative approximation, for parameters 9/13/2024 Estimation from Sub-Sampled Streams 6

  7. Results Number of Distinct Elements (Known) Upper Bound on Result Quality Simple Streaming Algorithm that meets the bound Frequency Moments, Fk, k > 0 Smaller values of pincrease streaming space complexity Matching upper and lower bounds (w.r.t p) Tradeoff between processing time and space Entropy Matching Upper and Lower Bounds for Additive Error Relative Error Impossible in small space Heavy Hitters Sampling is a good fit 9/13/2024 Estimation from Sub-Sampled Streams 7

  8. Related Work Duffield, Lund, Thorup, Properties and prediction of flow statistics from sampled packet streams , IMC 2002 Duffield, Lund, and Thorup, Estimating flow distributions from sampled flow statistics , SIGCOMM 2003 Rusu and Dobra, Sketching sampled data streams ICDE 2009 Bar-Yossef, Sampling Lower Bounds via Information Theory , STOC 2003 9/13/2024 Estimation from Sub-Sampled Streams 8

  9. Results: Number of Distinct Elements For constant p, estimate F0(P) from observing L Any algorithm must have relative error (Charikar et al. PODS 2000) A simple streaming algorithm has relative error Other Estimators, such as GEE (Generalized Error Estimator) also possible in single pass 9/13/2024 Estimation from Sub-Sampled Streams 9

  10. Frequency Moments Fk Theorem (Upper Bound): There is a one pass streaming algorithm which observes L and outputs a -estimator to where using space. assuming 9/13/2024 Estimation from Sub-Sampled Streams 10

  11. Computing F2: Why is This Hard? 1 2 3 4 5 6 n sampler 1 3 7 .. 9/13/2024 Estimation from Sub-Sampled Streams 11

  12. Algorithm 1 for F2 Algorithm 1. Estimate using streaming algorithm on L 2. Use in above formula for Z 9/13/2024 Estimation from Sub-Sampled Streams 12

  13. Algorithm 1 for F2 (contd) To get -approximation for Z, space needed is Issue: Need to estimate with very high accuracy to get good relative error for 9/13/2024 Estimation from Sub-Sampled Streams 13

  14. Algorithm 2 for F2 Collisions. The number of 2-wise collisions in P Observation: Algorithm: Estimate and with mult. error 9/13/2024 Estimation from Sub-Sampled Streams 14

  15. Estimating Observation: If estimated accurately, we are done But this is hard in general, in small space 9/13/2024 Estimation from Sub-Sampled Streams 15

  16. Estimating Case I: Case 2: Estimate hence Accurate estimate of not needed with good relative error Get an estimate within a multiplicative error of 3 9/13/2024 Estimation from Sub-Sampled Streams 16

  17. Estimating Estimate with good relative error when Technique due to Indyk & Woodruff (STOC 2005) Divide items {1,2..,m} into classes based on frequency Estimate the size of different classes that contribute to the final result 9/13/2024 Estimation from Sub-Sampled Streams 17

  18. Tight Lower Bound for Theorem: Any constant-pass streaming algo. that -approximates for a sufficiently small constants by observing a sampled stream, in the Bernoulli sampling model, requires bits of space 9/13/2024 Estimation from Sub-Sampled Streams 18

  19. Time-Space Tradeoff With sampled stream at probability p Processing Time: Streaming Space: Product: 9/13/2024 Estimation from Sub-Sampled Streams 19

  20. Entropy No multiplicative error approximation possible with probability 9/10, even if p > The entropy of sampled stream could be zero, while that of the original stream non-zero If there is an approximation H to such that with prob. at least 99/100 9/13/2024 Estimation from Sub-Sampled Streams 20

  21. Heavy Hitters The frequency of heavy hitters are (approximately) proportionately maintained in the sampled stream Precise upper and lower bounds in paper 9/13/2024 Estimation from Sub-Sampled Streams 21

  22. Conclusions Extreme Volume Data Streams Fk Upper and Lower Bounds for Fk, k > 0 Smooth Space-Time tradeoff Number of distinct Elements, Entropy Sampling harms, streaming does not Heavy Hitters: Sampling and Streaming are both ok 9/13/2024 Estimation from Sub-Sampled Streams 22

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#