New Algorithms for Heavy Hitters in Data Streams
This article presents new algorithms for finding heavy hitters in data streams, i.e., for efficiently identifying the items that occur with high frequency. It covers the streaming model, the l1 and l2 guarantees for heavy hitters, algorithms that are optimal across parameter regimes, and the space complexity of the Misra-Gries algorithm. The results achieve optimal space complexity in several settings.
Presentation Transcript
New Algorithms for Heavy Hitters in Data Streams
David Woodruff, IBM Almaden
Joint work with Arnab Bhattacharyya, Vladimir Braverman, Stephen R. Chestnut, Palash Dey, Nikita Ivkin, Jelani Nelson, and Zhengyu Wang
Streaming Model
Stream of elements a1, ..., am in [n] = {1, ..., n}; assume m = poly(n)
Example stream: 4 3 7 3 1 1 2
Arbitrary order, one pass over the data
Minimize the space complexity (in bits) for solving a task
Let fj be the number of occurrences of item j
Heavy Hitters Problem: find those j for which fj is large
Guarantees
l1 guarantee:
  output a set containing all items j for which fj ≥ φm
  the set should not contain any j with fj ≤ (φ − ε)m
l2 guarantee: let F2 = Σj fj^2
  output a set containing all items j for which fj^2 ≥ φF2
  the set should not contain any j with fj^2 ≤ (φ − ε)F2
The l2 guarantee can be much stronger than the l1 guarantee:
  suppose the frequency vector is (n^{1/2}, 1, 1, 1, ..., 1)
  item 1 is an l2-heavy hitter for constant φ, ε, but not an l1-heavy hitter
Outline
Optimal algorithm in all parameters φ, ε for the l1-guarantee
Optimal algorithm for the l2-guarantee for constant φ, ε
Misra-Gries
Maintain a list L of c = O(1/ε) pairs of the form (key, value)
Given an update to item i:
  If i is in L, increment its value by 1
  If i is not in L and there are fewer than c pairs in L, put (i, 1) in L
  Otherwise, subtract 1 from all values in L and remove pairs with value 0
If an item i is never a key, charge each of its updates to c − 1 distinct updates of other items: fi · (c − 1) ≤ m, so fi ≤ εm
Charge each update not included in the stored value f̂i of a key i to c − 1 updates of other items: fi − f̂i ≤ εm, i.e., f̂i ≥ fi − εm
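As an illustration, here is a minimal Python sketch of the update rule above (function and variable names are our own):

```python
import random

def misra_gries(stream, c):
    """Misra-Gries summary with at most c (key, value) pairs.

    Each returned value underestimates the true frequency f_i by at
    most m/c additively, where m is the stream length."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1           # case 1: i is already a key
        elif len(counters) < c:
            counters[item] = 1            # case 2: room for a new key
        else:
            for key in list(counters):    # case 3: decrement all keys,
                counters[key] -= 1        # dropping any that hit zero
                if counters[key] == 0:
                    del counters[key]
    return counters

# Item 7 occurs 60 times in a stream of length 100; with c = 10 its
# stored count is at least 60 - 100/10 = 50, in any stream order.
stream = [7] * 60 + list(range(40))
random.shuffle(stream)
summary = misra_gries(stream, c=10)
assert summary.get(7, 0) >= 50 and len(summary) <= 10
```

Note that the summary never stores more than c pairs, which is where the O(c log n) space bound in the next slide comes from.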
Space Complexity of Misra-Gries
c log n = O((1/ε) log n) bits, assuming stream length m ≤ poly(n)
Optimal if ε = Θ(φ), since the output size is already O((1/φ) log n) bits
But what if, say, φ = 1/2 and ε = 1/log n?
  Misra-Gries uses O(log^2 n) bits, but the lower bound is only Ω(log n) bits
Our Results
Obtain an optimal algorithm using O((1/ε) log(1/φ) + (1/φ) log n) bits
If φ = 1/2 and ε = 1/log n, we obtain the optimal O(log n) bits!
For general stream lengths m, there is an additive O(log log m) in the upper and lower bounds, so this is also optimal
O(1) update and reporting times, provided m > poly(1/ε)
A Simple Initial Improvement
First show an O((1/ε) log(1/ε) + (1/φ) log n) bit algorithm, then improve it to the optimal O((1/ε) log(1/φ) + (1/φ) log n) bits
Idea: use the same number c of (key, value) pairs, but compress each pair
Compress the values by sampling O(1/ε^2) random stream positions: if we sample items with probability p = C/(ε^2 m), then for all i in [n] the new frequency gi satisfies |gi/p − fi| ≤ εm
There are O(1/ε^2) distinct keys after sampling, so hash the identities to a universe of size O(1/ε^4)
Why Sampling Works
Compress the values by sampling O(1/ε^2) random stream positions, i.e., sample items with probability p = C/(ε^2 m); then for all i in [n], the new frequency gi satisfies |gi/p − fi| ≤ εm
E[gi/p] = fi and Var[gi/p] = (fi/p^2) · p(1 − p) ≤ fi/p
By Chebyshev's inequality, Pr[|gi/p − fi| ≥ εm] ≤ (fi/p)/(ε^2 m^2) = fi/(Cm)
Union bound: Pr[∃ i for which |gi/p − fi| ≥ εm] ≤ Σi fi/(Cm) = 1/C
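A quick numerical sanity check of the calculation above (the constant C = 8, the random seed, and the test stream are arbitrary choices of ours):

```python
import random

def sampled_frequencies(stream, p, seed=0):
    """Keep each stream position independently with probability p.
    For each item i, g[i]/p is an unbiased estimate of f_i with
    Var[g_i/p] <= f_i/p."""
    rng = random.Random(seed)
    g = {}
    for item in stream:
        if rng.random() < p:
            g[item] = g.get(item, 0) + 1
    return g

m, eps = 10000, 0.1
stream = [1] * 4000 + [2 + (j % 100) for j in range(6000)]  # f_1 = 4000
p = 8 / (eps ** 2 * m)            # p = C/(eps^2 m) with C = 8
g = sampled_frequencies(stream, p)
estimate = g.get(1, 0) / p
assert abs(estimate - 4000) <= eps * m   # within additive eps*m = 1000
```

Only about C/ε^2 = 800 of the 10000 positions are kept, yet the heavy item's frequency is recovered to within εm.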
Misra-Gries after Hashing
Stream length is O(1/ε^2) after sampling
O(1/ε^2) distinct keys after sampling, so hash the identities pairwise-independently to a universe of size O(1/ε^4)
Misra-Gries on the (key, value) pairs takes O((1/ε) log(1/ε)) bits of space
Heavy hitters in the sampled stream correspond to heavy hitters in the original stream, and frequencies are preserved up to additive εm
Problem: we want the original (non-hashed) identities of the heavy hitters!
Maintaining Identities
For the O(1/φ) items with the largest counts, as reported by our data structure, also maintain their actual log n bit identities
Always possible to maintain, since if we sample an insertion of an item i, we have its actual identity in hand
Example:
  hashed key:  3       1   4   6       5   9
  value:       1000    20  33  5000    1   1
  actual key:  458938  \   \   30903   \   \
Summary of the Initial Improvement
An O((1/ε) log(1/ε) + (1/φ) log n) bit algorithm
Update and reporting times can be made O(1), provided m > poly(1/ε)
Most stream updates are not sampled, so for them we do nothing!
Spread out the computation of expensive operations over future updates for which we do nothing
An Optimal Algorithm
Have O((1/ε) log(1/ε) + (1/φ) log n) space, but want O((1/ε) log(1/φ) + (1/φ) log n)
Too much space for the (key, value) pairs in Misra-Gries!
Instead, run Misra-Gries to find the items with frequency ≥ φm, then use a separate data structure to estimate their frequencies up to additive εm
The Misra-Gries data structure takes O((1/φ) log n) bits of space
The separate data structure will be O(log(1/φ)) independent repetitions of a data structure using O(1/ε) bits. What can you do with O(1/ε) bits?
An Optimal Algorithm
Want to use O(1/ε) bits so that for any given item i, we can report an additive εm approximation to fi with probability > 2/3
The median of the estimates across O(log(1/φ)) repetitions is an additive εm approximation with probability 1 − φ/100; union bound over the 1/φ reported items
Keep O(1/ε) counters as in Misra-Gries, but each uses only O(1) bits on average!
Can't afford to keep item identifiers, even hashed ones...
Can't afford to keep exact counts, even on the sampled stream...
Dealing with Item Identifiers
Choose a pairwise-independent hash function h: [n] -> {1, 2, ..., 1/ε}
Don't keep item identifiers; just treat all items that go to the same hash bucket as one item
Expected noise in a bucket is ε · (sampled stream length) = O(1/ε)
Solves the problem with item identifiers, but what about the counts?
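For concreteness, one standard pairwise-independent family that could serve here (a textbook affine construction, not taken from the paper):

```python
import random

def pairwise_hash(num_buckets, seed=0):
    """h(x) = ((a*x + b) mod P) mod num_buckets, with random a, b and a
    prime P: a standard pairwise-independent family for keys in [0, P)."""
    P = (1 << 61) - 1                 # a Mersenne prime, P > universe size
    rng = random.Random(seed)
    a = rng.randrange(1, P)           # a != 0
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % num_buckets

h = pairwise_hash(16)
assert all(0 <= h(x) < 16 for x in range(10000))
```

Pairwise independence is enough for the variance calculations used here, and the function itself costs only O(log n) bits to store.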
Dealing with Item Counts
We have r = O(1/ε) counters c1, c2, ..., cr with Σi ci = O(1/ε^2), and want to store each ci up to additive error 1/ε
Round each ci to its nearest integer multiple of 1/ε: gives O(1/ε) bits of space in total
But how to maintain this as the stream progresses?
  Classic probabilistic counters do not work
  We design accelerated counters, which become more accurate as the count increases
For more details, please see the paper!
Conclusions on the l1-guarantee
O((1/ε) log(1/φ) + (1/φ) log n) bits of space
If m > poly(1/ε), then the update and reporting times are O(1)
We show a matching lower bound
Is this also a significant practical improvement over Misra-Gries?
Outline
Optimal algorithm in all parameters φ, ε for the l1-guarantee
Optimal algorithm for the l2-guarantee for constant φ, ε
CountSketch achieves the l2 guarantee [CCFC]
Assign each coordinate i a random sign σ(i) in {−1, 1}
Randomly partition the coordinates into B buckets via a hash function h, maintaining cj = Σ_{i: h(i) = j} σ(i) fi in the j-th bucket
Estimate fi as σ(i) c_{h(i)}:
  E[σ(i) c_{h(i)}] = fi + Σ_{i' ≠ i: h(i') = h(i)} E[σ(i) σ(i')] f_{i'} = fi
Repeat this hashing scheme O(log n) times and output the median of the estimates
Ensures every fj is approximated up to an additive (F2/B)^{1/2}
Gives O(log^2 n) bits of space
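A compact Python rendering of CountSketch as just described (buckets, signs, and the median trick; the parameters and test stream below are illustrative choices of ours):

```python
import random
import statistics

def countsketch_estimates(stream, n, B, reps, seed=0):
    """CountSketch: reps independent (hash, sign) pairs with B buckets
    each. Returns estimated frequencies for items 0..n-1 (median of the
    sign-corrected bucket counters), each accurate to roughly an
    additive (F2/B)^{1/2}."""
    rng = random.Random(seed)
    tables = []
    for _ in range(reps):
        h = [rng.randrange(B) for _ in range(n)]      # bucket per item
        s = [rng.choice((-1, 1)) for _ in range(n)]   # sign per item
        tables.append((h, s, [0] * B))
    for item in stream:
        for h, s, counters in tables:
            counters[h[item]] += s[item]
    return [statistics.median(s[i] * counters[h[i]]
                              for h, s, counters in tables)
            for i in range(n)]

# Item 0 occurs 500 times among ~1000 light items, so F2 ~ 500^2.
stream = [0] * 500 + [1 + (j % 999) for j in range(1000)]
est = countsketch_estimates(stream, n=1000, B=64, reps=9)
assert abs(est[0] - 500) <= 150   # noise for item 0 is ~ (1000/64)^{1/2}
```

In a streaming implementation the arrays h and s would of course be replaced by small hash functions rather than stored explicitly.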
Known Space Bounds for l2 Heavy Hitters
CountSketch achieves O(log^2 n) bits of space
If the stream is allowed to have deletions, this is optimal [DPIW]
What about insertion-only streams?
  This is the model originally introduced by Alon, Matias, and Szegedy
  Models internet search logs, network traffic, databases, scientific data, etc.
The only known lower bound is Ω(log n) bits, just to report the identity of the heavy hitter
Our Results [BCIW]
We give an algorithm using O(log n log log n) bits of space!
The same techniques give a number of other results:
(F2 at all times) Estimate F2 at all times in a stream with O(log n log log n) bits of space
  Improves the union bound, which would take O(log^2 n) bits of space
  Improves an algorithm of [HTY], which requires m >> poly(n) to achieve savings
(L∞-estimation) Compute maxi fi up to additive (ε F2)^{1/2} using O(log n log log n) bits of space (resolves IITK Open Question 3)
Simplifications
Output a set containing all items i for which fi^2 ≥ ε F2, for constant ε
There are at most O(1/ε) = O(1) such items i
Hash the items into O(1) buckets: all items i with fi^2 ≥ ε F2 go to different buckets with good probability
The problem reduces to having a single i* in {1, 2, ..., n} with fi* ≥ (ε F2)^{1/2}
Intuition
Suppose first that fi* ≥ n^{1/2} log n and fi in {0, 1} for all i in {1, 2, ..., n} \ {i*}
For the moment, also assume that we have an infinitely long random tape
Assign each coordinate i a random sign σ(i) in {−1, 1}
Randomly partition the items into 2 buckets via a hash function h
Maintain c1 = Σ_{i: h(i) = 1} σ(i) fi and c2 = Σ_{i: h(i) = 2} σ(i) fi
Suppose h(i*) = 1. What do the values c1 and c2 look like?
c1 = σ(i*) fi* + Σ_{i ≠ i*: h(i) = 1} σ(i) fi and c2 = Σ_{i: h(i) = 2} σ(i) fi
c1 − σ(i*) fi* and c2 evolve as random walks as the stream progresses
(Random Walks) There is a constant C > 0 so that with probability 9/10, at all times, |c1 − σ(i*) fi*| < Cn^{1/2} and |c2| < Cn^{1/2}
Eventually fi* > 2Cn^{1/2}, and from then on we know which bucket i* hashed to!
This only gives 1 bit of information about i*. We can't repeat log n times in parallel, but we can repeat log n times sequentially!
Repeating Sequentially
Wait until either |c1| or |c2| exceeds Cn^{1/2}
  If |c1| > Cn^{1/2} then h(i*) = 1, otherwise h(i*) = 2
  This gives 1 bit of information about i*
(Repeat) Initialize 2 new counters to 0 and perform the procedure again!
Assuming fi* = Ω(n^{1/2} log n), we will have at least 10 log n repetitions, and we will be correct in a 2/3 fraction of them
(Chernoff) Only a single value of i* has hash values matching a 2/3 fraction of the repetitions
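The two-counter procedure is easy to simulate. The toy code below (our own construction: fi* = 5000, n = 10^4 distinct light items, threshold C·n^{1/2} = 300) checks that nearly every round declares the correct bucket for i*:

```python
import random

def learn_rounds(stream, heavy, threshold, seed=0):
    """Repeatedly hash items into 2 buckets with fresh random signs,
    wait until one signed counter exceeds the threshold in absolute
    value, record (declared bucket, true bucket of `heavy`), then
    reset all state and start the next round."""
    rng = random.Random(seed)
    results = []
    h, s, c = {}, {}, [0, 0]
    for item in stream:
        if item not in h:
            h[item] = rng.randrange(2)
            s[item] = rng.choice((-1, 1))
        c[h[item]] += s[item]
        if max(abs(c[0]), abs(c[1])) > threshold:
            declared = 0 if abs(c[0]) > threshold else 1
            results.append((declared, h[heavy]))
            h, s, c = {}, {}, [0, 0]   # fresh randomness per round
    return results

# Heavy item 0 (f = 5000) interleaved with 10000 distinct light items.
stream = []
for j in range(5000):
    stream += [0, 2 * j + 1, 2 * j + 2]
rounds = learn_rounds(stream, heavy=0, threshold=300)
correct = sum(d == t for d, t in rounds)
assert len(rounds) >= 5 and correct >= 2 * len(rounds) / 3
```

Each round here consumes roughly 300 occurrences of the heavy item, so a stream with fi* = Ω(n^{1/2} log n) indeed supports Ω(log n) rounds.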
Gaussian Processes
We don't actually have fi* ≥ n^{1/2} log n and fi in {0, 1} for all i in {1, 2, ..., n} \ {i*}
We fix both problems using Gaussian processes
(Gaussian Process) A collection {Xt}_{t in T} of random variables, for an index set T, for which every finite linear combination of the variables is Gaussian
  Assume E[Xt] = 0 for all t
  The process is entirely determined by the covariances E[Xs Xt]
  The distance function d(s, t) = (E[|Xs − Xt|^2])^{1/2} is a pseudo-metric on T
(Connection to Data Streams) Suppose we replace the signs σ(i) with standard normal random variables g(i), and consider a counter c at time t: c(t) = Σi g(i) fi(t)
  fi(t) is the frequency of item i after processing t stream insertions
  c(t) is a Gaussian process!
Chaining Inequality [Fernique, Talagrand]
Let {Xt}_{t in T} be a Gaussian process and let T0 ⊆ T1 ⊆ T2 ⊆ ... ⊆ T be such that |T0| = 1 and |Ti| ≤ 2^{2^i} for i ≥ 1. Then
  E[sup_{t in T} |Xt|] ≤ O(1) · sup_{t in T} Σ_{i ≥ 0} 2^{i/2} d(t, Ti)
How can we apply this to c(t) = Σi g(i) fi(t)?
Let F2(t) be the value of F2 after t stream insertions
Let the Ti be a recursive partitioning of the stream in which F2(t) changes by a factor of 2
Consider the stream a1 a2 a3 ... at ... am
Let at be the first point in the stream for which F2(t) ≥ F2(m)/2; set T0 = {t}
Let Ti be the set of 2^{2^i} times t1, t2, ..., t_{2^{2^i}} in the stream such that tj is the first point in the stream with F2(tj) ≥ (j/2^{2^i}) · F2(m)
Then T0 ⊆ T1 ⊆ T2 ⊆ ... ⊆ T, with |T0| = 1 and |Ti| ≤ 2^{2^i} for i ≥ 1
Apply the chaining inequality!
Applying the Chaining Inequality
Let {Xt}_{t in T} be a Gaussian process and let T0 ⊆ T1 ⊆ T2 ⊆ ... ⊆ T be such that |T0| = 1 and |Ti| ≤ 2^{2^i} for i ≥ 1. Then
  E[sup_{t in T} |Xt|] ≤ O(1) · sup_{t in T} Σ_{i ≥ 0} 2^{i/2} d(t, Ti)
By the construction of the Ti, d(t, Ti) = min_{tj} (E[|c(t) − c(tj)|^2])^{1/2} ≤ (F2/2^{2^i})^{1/2}
Hence, E[sup_{t in T} |Xt|] ≤ O(1) · sup_{t in T} Σ_{i ≥ 0} 2^{i/2} (F2/2^{2^i})^{1/2} = O(F2^{1/2})
Same behavior as for the random walks!
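Spelled out, the series computation behind the final O(F2^{1/2}) bound is:

```latex
\[
\sum_{i \ge 0} 2^{i/2}\, d(t, T_i)
  \;\le\; \sum_{i \ge 0} 2^{i/2} \left( \frac{F_2}{2^{2^i}} \right)^{1/2}
  \;=\; F_2^{1/2} \sum_{i \ge 0} 2^{\, i/2 - 2^{i-1}}
  \;=\; O\!\left( F_2^{1/2} \right),
\]
```

since the exponent i/2 − 2^{i−1} decreases doubly exponentially once i ≥ 2, so the series converges to an absolute constant.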
Removing the Frequency Assumptions
We don't actually have fi* ≥ n^{1/2} log n and fj in {0, 1} for all j in {1, 2, ..., n} \ {i*}
The Gaussian process removes the restriction that fj in {0, 1} for all j ≠ i*
  The random walk bound of Cn^{1/2} that we needed on the counters holds without this restriction
But we still need fi* ≥ (F2^{(−i*)})^{1/2} log n to learn log n bits about the heavy hitter, where F2^{(−i*)} is F2 with i* removed
How do we replace this restriction with fi* ≥ (ε F2^{(−i*)})^{1/2}?
Assume ε ≥ 1/log log n, by hashing into log log n buckets and incurring a log log n factor in space
Amplification
Create O(log log n) pairs of streams from the input stream:
  (streamL1, streamR1), (streamL2, streamR2), ..., (streamL_{log log n}, streamR_{log log n})
For each j in O(log log n), choose a hash function hj: {1, ..., n} -> {0, 1}
  streamLj is the original stream restricted to the items i with hj(i) = 0; streamRj is the remaining part of the input stream
  Maintain the counters cL = Σ_{i: hj(i) = 0} g(i) fi and cR = Σ_{i: hj(i) = 1} g(i) fi
(Chaining Inequality + Chernoff) The larger counter usually belongs to the substream containing i*
  The larger counter stays larger forever if the chaining inequality holds
Run the algorithm on the items corresponding to the larger counts
The expected F2 value of these items, excluding i*, is F2/poly(log n), so i* is now much heavier
Derandomization
We don't have an infinitely long random tape
We need to (1) derandomize the Gaussian process and (2) derandomize the hash functions used to sequentially learn the bits of i*
We achieve (1) by:
(Derandomized Johnson-Lindenstrauss) Defining our counters by first applying a Johnson-Lindenstrauss (JL) transform [KMN] to the frequency vector, reducing n dimensions to log n, then taking the inner product with fully independent Gaussians
(Slepian's Lemma) The counters don't change much, because a Gaussian process is determined by its covariances and all covariances are roughly preserved by JL
For (2), we derandomize an auxiliary algorithm via Nisan's PRG [I]
An Optimal Algorithm [BCINWW]
Want O(log n) bits instead of O(log n log log n) bits
Sources of the O(log log n) factor:
  Amplification: use a tree-based scheme and the fact that the heavy hitter becomes heavier!
  Derandomization: show that 6-wise independence suffices for derandomizing a Gaussian process!
Conclusions on the l2-guarantee
Beat CountSketch for finding l2-heavy hitters in a data stream
Achieve O(log n) bits of space instead of O(log^2 n) bits
New results for estimating F2 at all points and for L∞-estimation
Questions:
  Is this a significant practical improvement over CountSketch as well?
  Can we use Gaussian processes for other insertion-only stream problems?
Accelerated Counters
What if we update a counter c for fi with probability p = ε? Then E[c/p] = fi
The sum of the counts is expected to be O(1/ε)
We have 1/ε counters with total count O(1/ε), so we can store them all with O(1/ε) bits
Problem: very inaccurate when fi is large, e.g., fi = Θ(1/ε^2)
Accelerated Counters
Instead, suppose you knew a value r with r = Θ(fi)
Update the counter c with probability p = ε^2 r and output c/p
  Var[c/p] = (fi/p^2) · p(1 − p) ≤ fi/p = O(1/ε^2)
  So Pr[|fi − c/p| > 1/ε] is at most a small constant, by Chebyshev's inequality
Problem: we don't know r = Θ(fi) in advance!
Accelerated Counters
Solution: increase the sampling probability as the counter increases!
  The opposite of standard probabilistic counters
A frequency fi will in expectation have a count value of about O(ε^2 fi^2)
With 1/ε counters subject to Σi fi = O(1/ε^2), the total space is maximized at O(1/ε) bits
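A toy version of the idea (our own illustrative scheme, not the paper's exact counter): sample each update with probability proportional to the current estimate, so the number of stored samples scales like ε^2 fi^2 as quoted above, while the estimate stays exactly unbiased because each step adds 1 in expectation.

```python
import random

def accelerated_counter(freq, eps, seed=0):
    """Process `freq` updates to one item; sample each update with
    probability p = eps^2 * max(est, 1/eps), adding 1/p to the estimate
    when sampled. Every step adds 1 in expectation, so E[est] = freq,
    and the sampling rate (hence accuracy) grows with the count."""
    rng = random.Random(seed)
    est, samples = 0.0, 0
    for _ in range(freq):
        p = min(1.0, eps * eps * max(est, 1.0 / eps))
        if rng.random() < p:
            est += 1.0 / p
            samples += 1
    return est, samples

eps, freq = 0.01, 5000        # freq is within the regime f_i <= 1/eps^2
est, samples = accelerated_counter(freq, eps)
assert abs(est - freq) <= 1500        # unbiased, well-concentrated
assert 800 <= samples <= 1800         # ~ eps^2 * freq^2 / 2 = 1250
```

Only the sampled count needs to be stored, which is how a counter can use O(1) bits on average while still giving an additive 1/ε-accurate estimate on the sampled stream.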