Understanding Streaming Algorithms: Group 7 Presentation Overview
Group 7 from the National University of Singapore presents on Streaming Algorithms, highlighting the challenges of processing massive data sets sequentially. The presentation covers motivations, a simple example of finding missing numbers, two general approaches - sampling and sketching, and an outline including discussions on frequency moments and count-min sketches.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Streaming Algorithm Presented by: Group 7 Min Chen Zheng Leong Chua Anurag Anshu Samir Kumar Nguyen Duy Anh Tuan Hoo Chin Hau Jingyuan Chen Advanced Algorithm National University of Singapore
Motivation Google gets 117 million searches per day Facebook get 2 billion clicks per day Huge amount of data How to do queries on this huge data set? e.g, how many times a particular page has been visited Impossible to load the data into the random access memory 2
Streaming Algorithm ?0 ?? ?1 ?2 Access the data sequentially Data stream: A data stream we consider here is a sequence of data that is usually too large to be stored in available memory E.g, Network traffic, Database transactions, and Satellite data Streaming algorithm aims for processing such data stream. Usually, the algorithm has limited memory available (much less than the input size) and also limited processing time per item A streaming algorithm is measured by: 1. Number of passes of the data stream 2. Size of memory used 3. Running time 3
Simple Example: Finding the missing number There are n consecutive numbers, where n is a fairly large number 1 n 2 3 A number k is missing now Now the data stream becomes like: n 1 2 k-1 k+1 Suppose you only have log(?) size of memory Can you propose a streaming algorithm to find k? which examine the data stream as less times as possible 4
Two general approach for streaming algorithm 1. Sampling ?? ?? ?0 ?? ?1 ?? m samples, ? ? Choose part of the stream to represent the whole stream ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? ??? 2. Sketching ?0 ?1 ?? Mapping the whole stream into some data structures Difference between these two approach: Sampling: Keep part of the stream with accurate information Sketching: Keep the summary of the whole streaming but not accurately 5
Outline of the presentation 1. Sampling - (Zheng Leong Chua, Anurag anshu) In this part, 1)we will using sampling to calculate the Frequency moment of a data stream Where, the k-th frequency moment is defined as ??? = ?=1 the frequency of ?? 2) We will discuss one algorithm for ?0, which is the count of distinct numbers in a stream, and one algorithm is for ??(? 1), and one algorithm for special case ?2 3)Proof for the algorithms ? ?(??)?, ? ?? is 2. Sketching - (Samir Kumar, Hoo Chin Hau, Tuan Nguyen) In this part, 1)we will formally introduce sketches 2)implementation for count-min sketches 3)Proof for count-min sketches 3. Conclusion and applications - (Jingyuan Chen) 6
Approximating Frequency Moments Chua Zheng Leong & Anurag Anshu Alon, Noga; Matias, Yossi; Szegedy, Mario (1999), "The space complexity of approximating the frequency moments", Journal of Computer and System Sciences 58 (1): 137 147,
Estimating Fk Input: a stream of integers in the range {1 n} Let mibe the number of times i appears in the stream. Objective is to output Fk= i mik Randomized version: given a parameter , output a number in the range [(1- )Fk,(1+ )Fk] with probability atleast 7/8. 14
Analysis Important observation is that E(X) = Fk Proof: Contribution to the expectation for integer i is m/m ((mik)-(mi-1)k + (mi-1)k (mi-2)k 2k 1k + 1k) = mik. Summing up all the contributions gives Fk 16
Analysis Also E(X2) is bounded nicely. E(X2) = m( i (mi)2k (mi-1)2k + (mi-1)2k (mi-2) 2k 22k 12k + 12k) < kn(1-1/k)Fk2 Hence given the random variable Y = X1+..Xs/s E(Y) = E(X) = Fk Var(Y) = Var(X)/s < E(X2)/s = kn(1-1/k)Fk2/s 17
Analysis Hence Pr (|Y-Fk|> Fk) < Var(Y)/ 2Fk < kn(1- 1/k)/s 2 < 1/8 To improve the error, we can use yet more processors. Hence, space complexity is: O((log n + log m)kn(1-1/k)/ 2) 18
Estimating F2 Algorithm (bad space-inefficient way): Generate a random sequence of n independent numbers: e1,e2 en, from the set [-1,1]. Let Z=0 . For the incoming integer i from stream, change Z-> Z+ei . 19
Hence Z= i eimi Output Y=Z2. E(Z2) = F2, since E(ei)=0 and E(eiej)=E(ei)E(ej), for i j E(Z4) E(Z2)2 < 2F22, since E(eiejekel)=E(ei)E(ej)E(ek)E(el), when all i,j,k,l are different. 20
Same process is run in parallel on s independent processors. We choose s= 16/ 2 Thus, by Chebysev s inequality, Pr(|Y-F2|> F2) < Var(Y)/ 2F22< 2/s 2 =1/8 21
Estimating F2 Recall that storing e1,e2 en requires O(n) space. To generate these numbers more efficiently, we notice that only requirement is that the numbers {e1,e2 en} be 4-wise independent. In above method, they were n-wise independent too much. 22
Orthogonal array We use `orthogonal array of strength 4 . OA of n-bits, with K runs, and strength t is an array of K rows and n columns and entries in 0,1 such that in any set of t columns, all possible t bit numbers appear democratically. So simplest OA of n bits and strength 1 is 000000000000000 111111111111111 23
Strength > 1 This is more challenging. Not much help via specializing to strength 2 . So lets consider general strength t. A technique: Consider a matrix G, having k columns, with the property that every set of t columns are linearly independent. Let it have R rows. 24
Technique Then OA with 2R runs and k columns and strength t is obtained as: 1. For each R bit sequence [w1,w2 wR], compute the row vector [w1,w2..wR] G. 2. This gives one of the rows of OA. 3. There are 2R rows. 25
Proof that G gives an OA Pick up any t columns in OA. They came from multiplying [w1,w2 wR]to corresponding t columns in G. Let the matrix formed by these t columns of G be G . Now consider [w1,w2 wR]G = [b1,b2..bt]. 1. For a given [b1,b2..bt], there are 2R-t possible [w1,w2 wR], since G has as many null vectors. 2. Hence there are 2t distinct values of [b1,b2..bt]. 3. Hence, all possible values of [b1,b2..bt] obtained with each value appearing equal number of times. 26
Constructing a G We want strength = 4 for n bit numbers. Assume n to be a power of 2, else change n to the closest bigger power of 2. We show that OA can be obtained using corresponding G having 2log(n)+1 rows and n columns Let X1,X2 Xn be elements of F(n). Look at Xi as a column vector of log(n) length. 27
G is 1 1 1 1 1 X1 X2 X3 X4 Xn X13 X23 X33 X43 Xn3 Property: every 5 columns of G are linearly independent. Hence the OA is of strength 5 => of strength 4. 28
Efficiency To generate the desired random sequence e1,e2 en, we proceed as: 1. Generate a random sequence w1,w2 wR 2. If integer i comes, compute the i-th column of G, which is as easy as computing i-th element of F(n), which has efficiency O(log(n)). 3. Compute vector product of this column and random sequence to obtain ei. 29
Sketches Samir Kumar
What are Sketches? Sketches are data structures that store a summary of the complete data set. Sketches are usually created when the cost of storing the complete data is an expensive operation. Sketches are lossy transformations of the input. The main feature of sketching data structures is that they can answer certain questions about the data extremely efficiently, at the price of the occasional error ( ). 31
How Do Sketches work? The data comes in and a prefixed transformation is applied and a default sketch is created. Each update in the stream causes this synopsis to be modified, so that certain queries can be applied to the original data. Sketches are created by sketching algorithms. Sketching algorithms preform a transform via randomly chosen hash functions. 32
Standard Data Stream Models Input stream a1, a2, . . . . arrives sequentially, item by item, and describes an underlying signal A, a one- dimensional function A : [1...N] R. Models differ on how ai describe A There are 3 broad data stream models. 1. Time Series 2. Cash Register 3. Turnstile 33
Time Series Model The data stream flows in at a regular interval of time. Each ai equals A[i] and they appear in increasing order of i. 34
Cash Register Model The data updates arrive in an arbitrary order. Each update must be non-negative. At[i] = At-1[i]+c where c 0 35
Turnstile Model The data updates arrive in an arbitrary order. There is no restriction on the incoming updates i.e. they can also be negative. At[i] = At-1[i]+c 36
Properties of Sketches Queries Supported:- Each sketch supports a certain set of queries. The answer obtained is an approximate answer to the query. Sketch Size:-Sketch doesn t have a constant size. The sketch is inversely proportional to and (probability of giving inaccurate approximation). 37
Properties of Sketches-2 Update Speed:- When the sketch transform is very dense, each update affects all entries in the sketch and so it takes time linear in sketch size. Query Time:- Again is time linear in sketch size. 38
Comparing Sketching with Sampling Sketch contains a summary of the entire data set. Whereas sample contains a small part of the entire data set. 39
Count-min Sketch Nguyen Duy Anh Tuan & Hoo Chin Hau
Introduction Problem: Given a vector a of a very large dimension n. One arbitrary element ai can be updated at any time by a value c: ai = ai + c. We want to approximate a efficiently in terms of space and time without actually storing a. 41
Count-min Sketch Proposed by Graham and Muthukrishnan [1] Count-min (CM) sketch is a data structure Count = counting or UPDATE Min = computing the minimum or ESTIMATE The structure is determined by 2 parameters: : the error of estimation : the certainty of estimation [1] Cormode, Graham, and S. Muthukrishnan. "An improved data stream summary: the count-min sketch and its applications." Journal of Algorithms 55.1 (2005): 58-75. 42
Definition A CM sketch with parameters ( , ) is represented by two-dimensional d-by-w array count: count[1,1] count[d,w]. In which: = 1 e = ln( ) , d w (e is the natural number) 43
Definition In addition, d hash functions are chosen uniformly at random from a pair-wise independent family: ... 1 { : ... } 1 { ... } h h n w 1 d 44
Update operation UPDATE(i, c): Add value cto the i-th element of a c can be non-negative (cash-register model) or anything (turnstile model). Operations: For each hash function hj: , [ j count + c = ( )] h i j 45
Update Operation UPDATE(23, 2) 23 h1 h2 h3 1 2 3 4 5 6 7 8 1 0 0 0 0 0 0 0 0 d = 3 2 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 w = 8 46
Update Operation UPDATE(23, 2) 23 h1 h2 h3 3 1 7 1 2 3 4 5 6 7 8 1 0 0 2 0 0 0 0 0 d = 3 2 2 0 0 0 0 0 0 0 3 0 0 0 0 0 0 2 0 w = 8 47
Update Operation UPDATE(99, 5) 99 h1 h2 h3 1 2 3 4 5 6 7 8 1 0 0 2 0 0 0 0 0 d = 3 2 2 0 0 0 0 0 0 0 3 0 0 0 0 0 0 2 0 w = 8 48
Update Operation UPDATE(99, 5) 99 h1 h2 h3 5 1 3 1 2 3 4 5 6 7 8 1 0 0 2 0 0 0 0 0 d = 3 2 2 0 0 0 0 0 0 0 3 0 0 0 0 0 0 2 0 w = 8 49
Update Operation UPDATE(99, 5) 99 h1 h2 h3 5 1 3 1 2 3 4 5 6 7 8 1 0 0 2 0 5 0 0 0 d = 3 2 7 0 0 0 0 0 0 0 3 0 0 5 0 0 0 2 0 w = 8 50