Understanding Tradeoff between Sample and Space Complexity in Stochastic Streams
Explore the relationship between sample and space complexity in stochastic streams to estimate distribution properties and solve various problems. The research delves into the tradeoff between the number of samples required to solve a problem and the space needed for the algorithm, covering topics such as collision probability estimation and connectivity decisions in graphs and linear algebra.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Stochastic Streams: Sample Complexity vs. Space Complexity David Woodruff IBM Almaden Joint work with Michael Crouch, Andrew McGregor, and Greg Valiant
Motivation (Well-studied) Statistics question: how many samples from a distribution are needed to estimate a property of a distribution? (Well-studied) Streaming question: for a given fixed stream of samples, how much space is needed to estimate a property of a distribution? Our work: understand the tradeoff between the sample and space complexity
Model 4 3 7 3 1 1 2 Algorithm sees a stream of i.i.d. samples from a distribution Algorithm only has 1 pass over the samples Goal: understand the tradeoff between the number t of samples needed to solve a problem, versus the space s of the algorithm
Problems (Statistics) Given t samples from a distribution p = p1, ,pn on n items, estimate the collision probability ipi 2 up to a 1 + - factor (Graph Problems) Given t independent edges chosen with replacement from a graph G, decide if G is connected (Linear Algebra) Given t independent samples from a subspace S of GF 2d, determine if S has dimension d/2 or dimension d
Talk Outline Sample/Space Tradeoff for Collision Probability Estimation Sample/Space Tradeoff for Deciding Connectivity Sample/Space Tradeoff for Determining if a Subspace is Full Rank
Collision Probability Given t samples from distribution p = p1, ,pn on n items, estimate the collision probability ipi Collision probability is called F2 2 up to a 1 + - factor If t = o(n.5), impossible with any amount of space Distinguish p = 1 n, ,1 nvs. p = (2 n, ,2 n,0, ,0) Our algorithm For any t = ?(n.5), can 1 + -approximate F2 with t samples and O (1 +n t) space
Collision Probability Algorithm Break the t samples into t/w contiguous groups of w samples 1 1 4 3 7 3 Group 3 Group 2 Group 1 For each group of samples a1, ,aw let Xi,j= 1 if ai= aj, and let X = w(w 1) i jXi,j be the probability of a collision on the group 1 Use w log n bits of space to compute an estimate X for a group, and average estimates over t/w groups
Collision Probability Algorithm 2= F2 E X = E Xi,j = kpk w w 1+ (n.5F2 F2 Var X w) Chebyshev s inequality implies a 1 + -approximation to F2 with error probability O wt 2+ t 2 n.5 n Set t >n.5 n ?2, and w = O t 2
Collision Probability Lower Bound Use lower bound for random order streams Case 1: see a stream a1, ,atof t < n distinct items from a universe U Case 2: see a stream a1, ,at of t-r distinct items together with an item i which occurs r times, all from universe U Order of streams is random t [AOMP, GH] any streaming algorithm needs the cases, even with an infinite random tape (conjectured space: r2.5 space to distinguish t r2)
Collision Probability Lower Bound Choose a random function h: U -> [n] Given a stream a1, ,at of items in a random order, feed the algorithm for IID streams the stream h(a1), ,h(at) a1 a2 a3 a4 a5 at h(a5) h(a3) h(a1) h(a2) h(a4) h(at)
Collision Probability Lower Bound If a1, ,at are distinct, obtain IID samples from distribution (1 n, ,1 n) If a1, ,at are distinct together with an item i occurring r times, roughly see IID samples from distribution (1 n Question: Extend to Fk nt, ,1 r r r t ,1 nt, ,1 r r n nt, n n nt) If r > t/n.5, then F2 in two cases differs by a constant factor 5 4 Implies w = (n t1.5) (Conjectured = (n t))
Talk Outline Sample/Space Tradeoff for Collision Probability Estimation Sample/Space Tradeoff for Deciding Connectivity Sample/Space Tradeoff for Determining if a Subspace is Full Rank
Graph Connectivity Given t independent edges chosen with replacement from graph G, decide if G is connected Simulate a random walk starting at node 1 Store current vertex If see an edge not incident to the current vertex, discard it Remember first node i which you haven t seen. Finish when i > n
Graph Connectivity 2 Current Vertex: 1 Start at vertex 1 3 1 First Untouched Vertex: 2 3 4 done 4 See IID Stream: {1, 4}, {2, 3}, {1, 4}, {3,4}, {1,2}, {2, 3}, {1,2}, {3,4} do nothing do nothing do nothing O(log n) space, and O m (mn2) = O m2n2 samples
The Loopy Graph For each vertex v in G, add m dv self-loops to vertex v The resulting loopy graph H is m-regular and has mn edges Perform previous algorithm on H Each sample is important! O(log n) space, but only O mn n2= O mn3samples
Use More Space and Fewer Samples We have an algorithm with O(log n) space and O mn3samples How can we use more space but fewer samples? Take p random walks in loopy graph, and remember current vertex of each one O(p log n) space Use each sample to update all p random walks Issue: random walks not independent Fix: can simulate independence since most walks don t move on a given sample What to do with p independent random walks?
Space/Time Tradeoff for Connectivity [Feige] For any k, the following is a correct algorithm whp: (Phase 1) Ensure graph has no connected components of size k (Phase 2) Sample n log n vertices and verify the samples are connected k If there is a component with k vertices, declare graph is disconnected Will sample a vertex from each group of k vertices Otherwise, suppose we are in phase 2 x x x x
Implementation in the IID Model For any p n, can implement algorithm with O(p) space and O(mn2 p2) samples Set k = O(n/p) Even for p = O(1), this improves our earlier O(log n) space and O mn3 samples Phase 1: sample O(p) nodes at random, run random walk on each of them estimate in O(1) space the number of distinct nodes visited in each walk Phase 2: sample O(p) nodes at random, run random walk on each of them keep track of which sampled nodes are connected
Talk Outline Sample/Space Tradeoff for Collision Probability Estimation Sample/Space Tradeoff for Deciding Connectivity Sample/Space Tradeoff for Determining if a Subspace is Full Rank
Determining if a Subspace Has Full Rank Given t IID samples from a subspace S of GF 2d, is rank(S) = d/2 or rank(S) = d? O(log d) space algorithm: choose a random standard unit vector ei and check if any sample equals ei After O 2d samples, if S has full rank, will see ei If S has rank d/2, with probability , will never see ei Our Lower Bound: Any algorithm succeeding with probability > 2/3 and using o(d) space must use 2 d samples
Statistical Query Framework A statistical query algorithm (s.q. algorithm) is adaptively proposes functions f1,f2, with fi: 0,1d 1,1 and receives estimates of Exfi xthat are corrupted via adversarial noise An s-query s.q. algorithm with tolerance is an algorithm that: for every rank d/2 subspace S, after querying f1, ,fs with responses r1, ,rs with ri Ex uniform in S[fix ] , it outputs rank d/2 with probability 3/4 if the responses r1, ,rs satisfy ri Ex uniform in 0,1d[fi(x) , the algorithm outputs full rank with probability 3/4 Main Lemma: For any fi, > 2 d 8 2 d Pr S Ex uniform on 0,1d fix Ex uniform on Sfix 4
Statistical Query Framework Each of t players in a communication game receives a sample from GF 2d P1 P3 Pt P2 Each message depends on all previous messages and is at most s bits [SVG] If there is a communication protocol with probability 1- , then there is an O(st) query s.q. algorithm with probability 1-2 with tolerance t2s Set s = o d and t = 2 (d) and apply our main lemma
Conclusions Studied space versus sample tradeoffs in the data stream model Obtained tradeoffs for statistical, graph, and linear algebra problems Open questions: tighten our bounds General question: unify the techniques for the different problems