Exploring Statistics, Big Data, and High-Dimensional Distributions
Delve into the realms of statistics, big data, and high-dimensional distributions in this visual journey that touches on topics ranging from lottery fairness to independence testing in shopping patterns. Discover insights into the properties of BIG distributions and the prevalence of massive data sets across various domains like social media, genomics, and astronomical observations.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Statistics vs Big Data Constantinos Daskalakis CSAIL and EECS, MIT Greek Stochastics ?
YOU WANT BIG DATA? I LL GIVE YOU BIG DATA!
BIG Data small Facebook: 20petabytes images daily Human genome: 40 exabytes storage by 2025 SKA Telescope: 1 exabyte daily
High-dimensional Expensive DNA microarray Experimental drugs Computer vision Financial records
What properties do your BIG distributions have?
e.g.1: play the lottery? Is it uniform?
Is the lottery unfair? from Hitlotto.com: Lotteryexperts agree, past number histories can be the key to predicting future winners.
True Story! Polish lottery Multilotek Choose uniformly at random distinct 20 numbers out of 1 to 80. Initial machine biased e.g., probability of 50-59 too small
New Jersey Pick 3,4 Lottery New Jersey Pick k (=3,4) Lottery. Pick k random digits in order. 10k possible values. Data: Pick 3 - 8522 results from 5/22/75 to 10/15/00 2-test (on Excel) answers "42% confidence Pick 4 - 6544 results from 9/1/77 to 10/15/00. fewer results than possible values not a good idea to run ?2 test convergence to ?2 distribution won t kick in for small sample size (textbook) rule of thumb: expected number of at least 5 observations of each element in the domain under the null hypothesis to run ?2
e.g. 2: Independence Testing Shopping patterns: Independent of zip code?
e.g.2: Linkage Disequilibrium Genome locus ? locus 1 locus 2 Single nucleotide polymorphisms, are they independent? Suppose ? loci, 2 possible states each, then: state of one s genome {0,1}? humans: some distribution ? over {0,1}? Question: Is ? a product dist n OR far from all product dist ns? should we expect the genomes from the 1000 human genomes project to be sufficient? up to how many loci?
e.g. 3: Outbreak of diseases Similar patterns in different years? More prevalent near large airports? Flu 2005 Flu 2006
Distributions on BIG domains Given samples of a distribution, need to know, e.g., entropy number of distinct elements shape (monotone, unimodal, etc) closeness to uniform, Gaussian, Zipfian No assumptions on shape of distribution i.e., parametric, smoothness, monotonicity, normality, Considered in Statistics, information theory, machine learning, databases, algorithms, physics, biology,
Old questions, new challenges Modern Setting Classical Setting Domain: Domain: 1000 tosses One human genome Large domain ? Small domain ? ? ?????, ? ????? ? ?????, ? ????? (comparatively) New challenges: samples, computation, communication, storage Asymptotic analysis Computation not crucial
A Key Question How many samples do you need in terms of domain size? Do you need to estimate the probabilities of each domain item? -- OR -- Can sample complexity be sublinear in size of the domain? Rules out standard statistical techniques
Aim Algorithms with sublinear sample complexity MACHINE LEARNING STATISTICS INFORMATION THEORY ALGORITHMS
The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions
The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions
Problem formulation discrete Model ?: family of distributions over ? may be non-parametric, e.g. unimodal, product, log-concave Problem Given: samples from unknown? with probability 0.9, distinguish ? ?vs?(?,?) > ? Objective Minimize samples Computational efficiency 1?,? 2 min ? ? Sublinear in |?|? max ?????? ?|? ? ?(?)|
Well-studied Problem (Composite) hypothesis testing Neyman-Pearson test Kolmogorov-Smirnov test Pearson s chi-squared test Generalized likelihood ratio test Quantities of Interest ??= Pr ?????? ? ?? ???? ???? ????? ??= Pr(?????? ? ?? ???? ???? ????) - Sublinear in |?|? - Strong control for false positives? Focus Asymptotic regime: Results kick in when? |?| Consistency Error exponents: exp ? ? as ?
The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions
The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions
Testing Fairness of a Coin 1 ? = Bernoulli ? = Bernoulli(?) ? ?vs?TV(?,?) > ? 2 ? : unknown probability of Question: Is ? = 0.5 OR ? 0.5 > ?? Goal: Toss coin several times, deduce correct answer w/ probability > 0.99 Number of samples required? ? Can estimate ?, by tossing ? times, then taking ? = ?=1 ?? ? 1 ?2, ? ? <? By concentration bounds, if ? > ? 3, w/ probability > 0.99 1 ?2 many samples necessary? Suppose there is tester using ? samples Then it can distinguish one sample from ? = (?1, ,??) where each ?? ?(0.5) from one sample from ? = ?1, ,?? where each ?? ? 0.5 + ? w/ probability > 0.99 Claim: Any tester has error probability at least 1 Are 2(1 dTV(?,?)) dTV?,? 2 ? ?,? = ? ? ?
Testing Uniformity ? = Uniform? ? : unknown ? ?vs?TV(?,?) > ? ?: unknown distribution over ? Sample access to ? Question: is ? = ?? or ?TV?,?? > ? ? |?| ?2 [Paninski 03]: samples and time Intuition: (Lower Bound) Suppose ? is uniform distribution over {1, ,?} and ? is either uniform on 1, ,? or uniform on a random ? size subset of {1, ,?} - unless ( ?) samples are observed, there are no collisions, hence cannot distinguish (Upper Bound) Collision statistics suffice to distinguish 2
Proving Lower Bounds [Le Cam 73]: Consider two disjoint sets of distributions ?1, ?2. Suppose algorithm ? is given ? samples from some unknown ? ?1 ?2 and claims to distinguish ? ?1 vs p ?2. Then: Pr ????? 1 2 1 inf ????1,?2 ?1 ??????1 ?2 ??????2 ?????? : all distributions generating samples as follows - choose a random distribution ? from ? (according to some distribution over ?) - then generate ? samples from ?
Proving Lower Bounds [Le Cam 73]: Consider two disjoint sets of distributions ?1, ?2. Suppose algorithm ? is given ? samples from some unknown ? ?1 ?2 and claims to distinguish ? ?1 vs p ?2. Then: Pr ????? 1 2 1 inf ????1,?2 ?1 ??????1 ?2 ??????2 ?????? : all distributions generating samples as follows - choose a random distribution ? from ? (according to some distribution over ?) - then generate ? samples from ?
Proving Lower Bounds [Le Cam 73]: Consider two disjoint sets of distributions ?1, ?2. Suppose algorithm ? is given ? samples from some unknown ? ?1 ?2 and claims to distinguish ? ?1 vs p ?2. Then: Pr ????? 1 2 1 inf ????1,?2 ?1 ??????1 ?2 ??????2 ?????? : all distributions generating samples as follows - choose a random distribution ? from ? (according to some distribution over ?) - then generate ? samples from ? ? To prove lower bound for uniformity ?2 testing take: ?1= Uniform? ?2= ? | ?2?+1=1 ?? ?,?2?+2=1 ?? ?, ?
The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions
Identity Testing (goodness of fit) ? = ? ? : unknown ? ?vs?TV(?,?) > ? ?,?: distributions over ? ?: given; sample access to ? Question: is ? = ? or ?TV?,? > ? ? [Batu-Fisher-Fortnow-Kumar-Rubinfeld-White 01] [Paninski 08, Valiant-Valiant 14]: |?| ?2 samples and time [w/ Acharya-Kamath 15]: a tolerant goodness of fit test with same sample size can distinguish: ?2?,? ?2vs 1 ?? ??2 ?? Cauchy-Schwarz: ?2?,? 1?,?2 2?,? > 4 ?2 ?2?,? ? ?
A new ?2- Goodness of Fit Test Goal: given ? and sample access to ? distinguish: Case 1:?2?,? ?2vs Case 2: 1 Approach: Draw Poisson(?) many samples from ? ??: # of appearances of symbol ? ? ??~Poisson ? ?? ?? ? ? independent random variables ?? ? ??2 ?? ? ?? ? ? = ? ?2(?,?) Case 1:? ? ? ?2; Case 2:? ? ? 4 ?2 2?,? 4 ?2 Side-Note: Pearson s ?2 test uses 2 ?? ? ?? ? ?? statistic ? Subtracting ?? in the numerator gives an unbiased estimator and importantly may hugely decrease variance Statistic:? = ? ? chug chug chug bound variance of ? O samples ?2 suffice to distinguish
The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions
The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions
Testing Properties of Distributions so far ? ={single distribution} restrictive, as rarely know hypothesis distribution exactly natural extension: test structural properties monotonicity: PDF is monotone, e.g. cancer vs radiation unimodality: PDF is single-peaked, e.g. single source of disease log-concavity: logPDFis concave monotone-hazard rate: log (1 CDF)is concave product distribution, e.g. testing linkage disequilibrium Example question: ? = {???????? ????????????? ???? [?]} Sample access to ? Is ? unimodal OR is ??-far from or unimodal distributions?
Testing Properties of Distributions [w/ Acharya and Kamath 2015]: 1. Testing identity, monotonicity, log-concavity, monotone hazard-rate, unimodality |?| ?2 for distributions over (ordered set) ?is doable w/ ? samples and time. Testing monotonicity/independence of a distribution over ? = ??is doable ??/2 ?2 ? ?2 samples and time. 2. |?| w/ ? ?? 0.5 ?4 previous best for monotonicity testing: ? [Bhattacharya-Fisher-Rubinfeld-Valiant 11] previous best for independence: d=2, worst bounds [Batu et al. 01] All bounds above are optimal Matching lower bounds for 1 and 2 via Le Cam. Unified approach, computationally efficient tests 3. 4. N.B. Contemporaneous work of [Canonne et al 2015] provide a different unified approach for testing structure but their results are suboptimal.
A Natural Approach Goal: Given ? and sample access to ?, distinguish ? ? vs 1?,? > ?. Choose a Hypothesis ? ? Test the Hypothesis (how well does ? fit ??)
A Natural Approach (contd) Goal: Given ? and sample access to ?, distinguish ? ? vs 1?,? > ?. A Learning-Followed-By-Testing Algorithm: 1. Learn hypothesis ? ? s.t. ? ? 1?,? <? 1?,? > ? 1?,? > ? (automatic since ? ?) 2. Reduce to tolerant goodness of fit given ?, sample access to ?, and explicit description of ?, distinguish 1?,? <? Tolerant tester requires almost linear #samples in the support of ? namely (|?|/log|?|) samples [Valiant-Valiant 10] Could try investing more samples for more accurate learning, but proper learning complexity vs tolerant testing complexity tradeoff does not work out to give optimal testing complexity 2 (needs cheap proper learner ) 2 vs 1?,? > ?
A Modified Approach Goal: Given ? and sample access to ?, distinguish ? ? vs 1?,? > ?. A Learning-Followed-By-Testing Algorithm: 1. Learn hypothesis ? ? s.t. ? ? 1?,? <? 1?,? > ? 1?,? > ? (automatic since ? ?) 2. Reduce to tolerant goodness of fit given ?, sample access to ?, and explicit description of ?, distinguish 1?,? <? 2 (needs cheap proper learner ) 2 vs 1?,? > ? ? Now tolerant testing has the right complexity of ? ?2 Pertinent Question: are there sublinear proper learners in ?2? We show that the ?2-learning complexity is dominated by the testing complexity for all properties of distributions we consider
Summary so far Hypothesis Testing in the small sample regime. ?unknown distribution over some discrete set ? ?: set of distributions over ? Given:?,?, sample access to ? Goal: w/ prob 1 ? tell ? ? vs 1?,? > ? Properties of interest: Is ? uniform? unimodal? log- concave? MHR? product measure? ? i.i.d. samples ?? ???? ? All above properties can be tested w/ O ? Test samples and time Unified approach based on modifiedPearson s goodness 2 ?? ?? ?? of fit test: statistic ? = ? ? tight control for false positives: want to be able to both assert and reject the null hypothesis accommodate sublinear sample size Pass/Fail? ??
The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions
The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions
Other Distances (beyond 1) So far focused on 1(a.k.a. total variation) distance Given sample access to ?, w/ prob 1 ? distinguish: ? ? vs 1?,? > ? Stronger distances? [Acharya-D-Kamath]: results are actually shown for ?2?,? 1?,?2 Should also extend to KL ?,? 1 ?,?2 Weaker distances? 2 is easy to test for [Goldreich Ron], but makes less sense, e.g. p = (2/m,2/m,2/m, ,2/n,0,0,0,0,0,0,0 .0) q = (0,0,0,0,0,0,0 .0,2/m,2/m,2/m, ,2/m) l1 distance = 2 l2 distance = ? 2
Tolerance So far, focused on non-tolerant version: Given set of distributions ?, and sample access to ? Distinguish: ? ? vs ??? ?,? > ? Tolerant version: ? 2vs ??? ?,? > ? ? log |?| samples are needed Distinguish: ??? ?,? < [Valiant-Valiant 10]: Tolerant version in (?2, 1): Distinguish: ?2?,? < Different ratios change the constants in ?() notation ? 2vs 1 2?,? > ? |?| ?2 [w/ Acharya, Kamath 15]:O samples suffice
Goodness of Fit Our goodness of fit test was given an explicit distribution ? and sample access to a distribution ?, and was asked to test ? = ? vs ????,? > ?. Sometimes both distributions are unknown, e.g. Transactions of 30-40 yr olds Transactions of 20-30 yr olds Same or different?
Goodness of Fit w/ two unknowns Given sample access to two unknown distributions ?,?: Distinguish ? = ? vs ???(?,?) > ? p q i.i.d. samples i.i.d. samples Test Pass/Fail?
Goodness of Fit w/ two unknowns [Batu Fortnow Rubinfeld Smith White], [P. Valiant], [Chan Diakonikolas Valiant Valiant]: Tight upper and lower bound of: 2 3 |?| ? max , . 4 3 ?2 ? Why different? Collision statistics are all that matter Collisions on heavy elements can hide collision statistics of rest of the domain Construct pairs of distributions where heavy elements are identical, but light elements are either identical or very different
Continuous Distributions What can we say about continuous distributions? without extra assumptions such as smoothness of PDF/parametric modeling cannot stick to hard distances ( 1,?2,??) Instead of restricting ?, ?, let us switch distances Can extend results if we switch to Kolmogorov distance recall: ????,? = sup in contrast: ???,? = sup |? ?( )| : ?????????|? ?( )| Now want to distinguish: ? ? vs ???,? > ? Claim:Tolerant testing in Kolmogorov distance of any distribution property (continuous or discrete) of ?-dimensional distributions can be performed from ? ?2 samples. Importantly: Kolmogorov distance allows graceful scaling with dimensionality of data ?
DvoretzkyKieferWolfowitz inequality Suppose ?1, ,?? i.i.d. samples from (single-dimensional) ?, and let ?? be the resulting empirical CDF, namely ??? =1 ? ? 1?? ?. ?=1 Then: Pr ???,?? > ? 2? 2??2, ? > 0. i.e. ? ?2 samples suffice to learn any single-dimensional dist n to within ? in Kolmogorov distance. VC Inequality same is true for ?-dimensional distributions when #samples is at least ? After learning in Kolmogorov, can tolerant test any property. Runtime under investigation. trouble: computing/approximating the Kolmogorov distance of two high- dimensional distributions is generally a hard computational question. 1 ? ?2
The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions