Statistics, Big Data, and High-Dimensional Distributions

Statistics vs Big Data
 
Human genome: 40 exabytes of storage by 2025
SKA Telescope: 1 exabyte daily
Facebook: 20 petabytes of images daily
BIG Data … small
High-dimensional, expensive data: DNA microarrays, computer vision, financial records, experimental drugs
What properties do your BIG distributions have?
e.g. 1: play the lottery? Is the lottery unfair?
From Hitlotto.com: “Lottery experts agree, past number histories can be the key to predicting future winners.”
True Story! Polish lottery Multilotek: choose “uniformly” at random 20 distinct numbers out of 1 to 80. The initial machine was biased: e.g., the probability of numbers in 50-59 was too small.
Thanks to Krzysztof Onak (pointer) and Eric Price (graph)
New Jersey Pick 3,4 Lottery
e.g. 2: Independence Testing
Shopping patterns: independent of zip code?
e.g. 2: Linkage Disequilibrium
Genome: single nucleotide polymorphisms; are they independent? Should we expect the genomes from the 1000 Genomes Project to be sufficient? Up to how many loci?
e.g. 3: Outbreak of diseases
Similar patterns in different years?
More prevalent near large airports?
Flu 2005
Flu 2006
Distributions on BIG domains
Given samples of a distribution, need to know, e.g.: entropy; number of distinct elements; “shape” (monotone, unimodal, etc.); closeness to uniform, Gaussian, Zipfian…
No assumptions on the shape of the distribution, i.e., no parametric, smoothness, monotonicity, or normality assumptions…
Considered in statistics, information theory, machine learning, databases, algorithms, physics, biology, …
Old questions, new challenges
Classical setting: asymptotic analysis; computation not crucial.
Modern setting: new challenges of samples, computation, communication, and storage.
A Key Question
How many samples do you need in terms of domain size? Do you need to estimate the probabilities of each domain item? -- OR -- can sample complexity be sublinear in the size of the domain? Sublinearity rules out standard statistical techniques.
Aim
Algorithms with sublinear sample complexity, at the intersection of information theory, machine learning, statistics, and algorithms.
The Menu
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Problem formulation (discrete)
Well-studied Problem
- Strong control for false positives?
The Menu
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Testing Fairness of a Coin
Testing Uniformity
Proving Lower Bounds
The Menu
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Identity Testing (“goodness of fit”)
[Batu-Fisher-Fortnow-Kumar-Rubinfeld-White’01]…
The Menu
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Testing Properties of Distributions
A Natural Approach
A Natural Approach (cont’d)
A Modified Approach
Tutorial: part 2
Summary so far
Hypothesis testing in the small sample regime.
The Menu
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Tolerance
Goodness of Fit
Transactions of 20-30 yr olds vs. transactions of 30-40 yr olds: same or different?
Goodness of Fit w/ two unknowns
Given i.i.d. samples from each of two unknown distributions p and q, a test must output Pass/Fail.
Continuous Distributions
Dvoretzky–Kiefer–Wolfowitz inequality
The Menu
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Testing in High-Dimensions
High-Dimensional Discrete Distn’s
400-bit images
Ising Model
Ising Model: Strong vs weak ties
“low temperature regime” vs “high temperature regime”
Testing Ising Models
Can localize departure from uniformity (independence) on some edge.
Cheaper non-localizing test? Low temperature; how about high temperature?
Ising Model: Strong vs weak ties
“low temperature regime” vs “high temperature regime”
Exponential mixing of the Glauber dynamics
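The Glauber dynamics referenced above are easy to simulate. Below is a minimal sketch for an Ising model on a cycle; the cycle graph, the β values, and all function names are my own illustrative choices, not from the slides:

```python
import math
import random

def glauber_step(spins, beta, adj, rng):
    """One Glauber update: pick a uniformly random site and resample its
    spin from the conditional distribution given its neighbors."""
    v = rng.randrange(len(spins))
    field = sum(spins[u] for u in adj[v])                 # local field from neighbors
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * field))  # P(spin = +1 | neighbors)
    spins[v] = 1 if rng.random() < p_plus else -1

def average_edge_agreement(n, beta, burn_sweeps=500, sample_sweeps=200, seed=1):
    """Run Glauber dynamics on an n-cycle and return the average fraction of
    edges whose endpoint spins agree, sampled once per sweep after burn-in."""
    rng = random.Random(seed)
    adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}
    spins = [rng.choice([-1, 1]) for _ in range(n)]
    for _ in range(burn_sweeps * n):
        glauber_step(spins, beta, adj, rng)
    total = 0.0
    for _ in range(sample_sweeps):
        for _ in range(n):
            glauber_step(spins, beta, adj, rng)
        total += sum(spins[v] == spins[(v + 1) % n] for v in range(n)) / n
    return total / sample_sweeps
```

At β = 0 (high temperature) updates are independent coin flips, so neighboring spins agree about half the time; at large β (low temperature, strong ties) neighbors agree almost always.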
Testing Ising Models
Exchangeable Pairs
Silly Example
Interesting Example: Ising
Need to show that the function contracts as the Glauber dynamics unravels; requires a good coupling.
Showing Contraction
Lemma 1: …
Lemma 2: …
Generous coupling: choose the same node, but update independently. This differs from the “greedy coupling” typically used, where the same node is chosen and the updates are coordinated to maximize the probability of the same update.
The Menu
Testing Properties of Distributions
Discussion/Road Ahead
Testing in High Dimensions
Future Directions
Markov Chain Testing
Empirical Fact: … vs. … different Markov chains! [Diaconis’03]
Example: …
*riffle shuffle = Gilbert-Shannon-Reeds (GSR) model for the distribution on card permutations.
[Ongoing work with Dikkala, Gravin]
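The GSR model named in the footnote is simple to simulate: cut the deck at a Binomial(n, 1/2) position, then interleave, dropping from each packet with probability proportional to its remaining size. A minimal implementation (function name is mine):

```python
import random

def gsr_riffle(deck, rng=random):
    """One Gilbert-Shannon-Reeds riffle shuffle of `deck`."""
    n = len(deck)
    cut = sum(rng.random() < 0.5 for _ in range(n))  # Binomial(n, 1/2) cut point
    left, right = deck[:cut], deck[cut:]
    out = []
    i = j = 0
    while i < len(left) or j < len(right):
        a, b = len(left) - i, len(right) - j
        # drop the next card from a packet with probability proportional
        # to the packet's remaining size
        if rng.random() < a / (a + b):
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out
```

Repeated application of this map is the Markov chain on permutations whose distance from its mixed state the testing question above concerns.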
Markov Chain Testing
Distance? Spectral radius.
[Ongoing work with Dikkala, Gravin]
Testing Combinatorial Structure
Is the phylogenetic tree assumption true? Sapiens-Neanderthal early interbreeding [Slatkin et al.’13]
Is a graphical model a tree? [ongoing work with Bresler, Acharya]
Efficiently testing combinatorial structure
Testing from a Single Sample
Given one social network, one brain, etc., how can we test the validity of a certain generative model? Get many samples from one sample?
INFORMATION THEORY, MACHINE LEARNING, STATISTICS, ALGORITHMS
Thank You!
Presentation Transcript


  1. Statistics vs Big Data. Constantinos Daskalakis, CSAIL and EECS, MIT. Greek Stochastics.

  2. YOU WANT BIG DATA? I’LL GIVE YOU BIG DATA!

  3. BIG Data … small. Facebook: 20 petabytes of images daily; human genome: 40 exabytes of storage by 2025; SKA Telescope: 1 exabyte daily.

  4. High-dimensional, expensive: DNA microarray, experimental drugs, computer vision, financial records.

  5. What properties do your BIG distributions have?

  6. e.g.1: play the lottery? Is it uniform?

  7. Is the lottery unfair? From Hitlotto.com: “Lottery experts agree, past number histories can be the key to predicting future winners.”

  8. True Story! Polish lottery Multilotek: choose “uniformly” at random 20 distinct numbers out of 1 to 80. Initial machine biased: e.g., the probability of numbers in 50-59 was too small.

  9. Thanks to Krzysztof Onak (pointer) and Eric Price (graph)

  10. New Jersey Pick 3,4 Lottery. New Jersey Pick k (k = 3, 4) lottery: pick k random digits in order; 10^k possible values. Data: Pick 3: 8522 results from 5/22/75 to 10/15/00; a χ² test (on Excel) answers “42% confidence”. Pick 4: 6544 results from 9/1/77 to 10/15/00; fewer results than possible values, so it is not a good idea to run a χ² test: convergence to the χ² distribution won’t kick in at this sample size. (Textbook) rule of thumb for running χ²: an expected number of at least 5 observations of each element in the domain under the null hypothesis.
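A back-of-the-envelope version of this χ² check on lottery digits can be sketched as follows. Since the slide's Excel data is not available, the draws are simulated and the bias model is made up for illustration:

```python
import random

def chisq_stat(observed, expected):
    """Pearson chi-squared statistic: sum_i (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def digit_chisq(draws):
    """Chi-squared statistic of single-digit draws against the uniform null."""
    counts = [0] * 10
    for d in draws:
        counts[d] += 1
    expected = [len(draws) / 10] * 10
    return chisq_stat(counts, expected)

def simulate_draws(n, weights, seed=0):
    """n digit draws in {0, ..., 9} with the given (unnormalized) weights."""
    rng = random.Random(seed)
    return rng.choices(range(10), weights=weights, k=n)
```

With 9 degrees of freedom, the 95% critical value of the χ² distribution is about 16.9, so a statistic far above that rejects uniformity; for uniform draws the statistic concentrates near 9.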

  11. e.g. 2: Independence Testing Shopping patterns: Independent of zip code?

  12. e.g. 2: Linkage Disequilibrium. Genome: locus 1, locus 2, … Single nucleotide polymorphisms: are they independent? Suppose n loci with 2 possible states each; then the state of one’s genome is an element of {0,1}^n, and humans follow some distribution p over {0,1}^n. Question: Is p a product distribution OR far from all product distributions? Should we expect the genomes from the 1000 Genomes Project to be sufficient? Up to how many loci?

  13. e.g. 3: Outbreak of diseases Similar patterns in different years? More prevalent near large airports? Flu 2005 Flu 2006

  14. Distributions on BIG domains. Given samples of a distribution, need to know, e.g.: entropy; number of distinct elements; “shape” (monotone, unimodal, etc.); closeness to uniform, Gaussian, Zipfian… No assumptions on the shape of the distribution, i.e., no parametric, smoothness, monotonicity, or normality assumptions… Considered in statistics, information theory, machine learning, databases, algorithms, physics, biology, …

  15. Old questions, new challenges. Classical setting: small domain (comparatively), e.g. 1000 tosses; asymptotic analysis; computation not crucial. Modern setting: large domain, e.g. one human genome; new challenges: samples, computation, communication, storage.

  16. A Key Question How many samples do you need in terms of domain size? Do you need to estimate the probabilities of each domain item? -- OR -- Can sample complexity be sublinear in size of the domain? Rules out standard statistical techniques

  17. Aim Algorithms with sublinear sample complexity MACHINE LEARNING STATISTICS INFORMATION THEORY ALGORITHMS

  18. The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions


  20. Problem formulation (discrete). Model 𝒫: a family of distributions over Σ; may be non-parametric, e.g. unimodal, product, log-concave. Problem: given samples from an unknown p, with probability 0.9 distinguish p ∈ 𝒫 vs ℓ₁(p, 𝒫) > ε, where ℓ₁(p, q) = Σ_{i∈Σ} |p(i) − q(i)| and ℓ₁(p, 𝒫) = min_{q∈𝒫} ℓ₁(p, q). Objective: minimize samples; computational efficiency. Sublinear in |Σ|?

  21. Well-studied Problem. (Composite) hypothesis testing: Neyman-Pearson test, Kolmogorov-Smirnov test, Pearson’s chi-squared test, generalized likelihood ratio test. Quantities of interest: α = Pr(reject 𝒫 when p ∈ 𝒫), β = Pr(accept 𝒫 when p ∉ 𝒫). Sublinear in |Σ|? Strong control for false positives? Focus: asymptotic regime (results kick in as the number of samples grows relative to |Σ|); consistency; error exponents: error probabilities decaying as exp(−c·n) as n → ∞.

  22. The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions


  24. Testing Fairness of a Coin. 𝒫 = {Bernoulli(1/2)}; p = Bernoulli(θ), θ: unknown probability of heads; distinguish p ∈ 𝒫 vs d_TV(p, 𝒫) > ε. Question: Is θ = 0.5 OR |θ − 0.5| > ε? Goal: toss the coin several times, deduce the correct answer w/ probability > 0.99. Number of samples required? O(1/ε²) suffice: estimate θ by tossing the coin n times and taking θ̂ = (1/n)·Σ_{i=1}^n Xᵢ; by concentration bounds, if n > c/ε² for a suitable constant c, then |θ̂ − θ| < ε/2 w/ probability > 0.99. Are Ω(1/ε²) samples necessary? Suppose there is a tester using n samples. Then it can distinguish one sample of X = (X₁, …, X_n), Xᵢ ~ Bern(0.5), from one sample of Y = (Y₁, …, Y_n), Yᵢ ~ Bern(0.5 + ε), w/ probability > 0.99. Claim: any tester has error probability at least ½·(1 − d_TV(X, Y)), and d_TV(X, Y)² = O(n·ε²), so n = Ω(1/ε²) is needed.
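The upper-bound argument above can be sketched directly. The constant 50 below is my own generous stand-in for the slide's unspecified constant c:

```python
import random

def fair_coin_test(sample, eps, n=None):
    """Accept 'fair' iff the empirical bias lies within eps/2 of 1/2.

    `sample(n)` returns n tosses in {0, 1}.  n = 50/eps^2 is a generous
    instance of the O(1/eps^2) sample complexity discussed above."""
    if n is None:
        n = int(50 / eps ** 2)
    theta_hat = sum(sample(n)) / n   # empirical estimate of theta
    return abs(theta_hat - 0.5) <= eps / 2
```

With the decision threshold at eps/2, a fair coin is accepted and a coin with |θ − 0.5| > ε is rejected, each with high probability, by the same concentration bound the slide invokes.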

  25. Testing Uniformity. 𝒫 = {Uniform(Σ)}; p: unknown distribution over Σ; sample access to p. Question: is p = U_Σ or d_TV(p, U_Σ) > ε? [Paninski’08]: Θ(√|Σ|/ε²) samples and time. Intuition: (Lower bound) Suppose q is the uniform distribution over {1, …, n} and p is either uniform on {1, …, n} or uniform on a random n/2-size subset of {1, …, n}; unless Ω(√n) samples are observed, there are no collisions, hence the two cases cannot be distinguished. (Upper bound) Collision statistics suffice to distinguish.
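The collision idea above can be sketched in a few lines. The threshold (1 + ε²)/n is one common choice for collision-based testers, used here for illustration rather than taken from the slide:

```python
import random
from collections import Counter

def collision_estimate(samples):
    """Unbiased estimate of ||p||_2^2 = sum_i p_i^2 via pairwise collisions."""
    m = len(samples)
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    return collisions / (m * (m - 1) / 2)

def uniformity_test(samples, n, eps):
    """Accept 'uniform over [n]' iff the collision rate is close to its
    minimum possible value 1/n (attained only by the uniform distribution)."""
    return collision_estimate(samples) <= (1 + eps ** 2) / n
```

The uniform distribution minimizes the collision probability at 1/n, while a distribution uniform on half the support doubles it, which is exactly the gap the tester exploits.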

  26-28. Proving Lower Bounds. [Le Cam’73]: Consider two disjoint sets of distributions 𝒫₁, 𝒫₂. Suppose algorithm 𝒜 is given n samples from some unknown p ∈ 𝒫₁ ∪ 𝒫₂ and claims to distinguish p ∈ 𝒫₁ vs p ∈ 𝒫₂. Then: Pr[error] ≥ ½·(1 − inf d_TV(π₁, π₂)), where the infimum is over π₁ ∈ conv(𝒫₁ⁿ) and π₂ ∈ conv(𝒫₂ⁿ), and conv(𝒫ᵢⁿ) is the set of all distributions generating n samples as follows: choose a random distribution p from 𝒫ᵢ (according to some distribution over 𝒫ᵢ), then generate n samples from p. To prove the lower bound for uniformity testing, take 𝒫₁ = {Uniform(Σ)} and 𝒫₂ = the distributions that perturb consecutive pairs of the uniform measure, p(2i+1) = (1 ± ε)/n and p(2i+2) = (1 ∓ ε)/n, with signs chosen at random.

  29. The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions

  30. Identity Testing (“goodness of fit”). 𝒫 = {q}; p: unknown; p, q: distributions over Σ; q given, sample access to p. Question: is p = q or d_TV(p, q) > ε? [Batu-Fisher-Fortnow-Kumar-Rubinfeld-White’01], [Paninski’08, Valiant-Valiant’14]: Θ(√|Σ|/ε²) samples and time. [w/ Acharya-Kamath’15]: a tolerant goodness-of-fit test with the same sample size can distinguish χ²(p, q) ≤ ε²/2 vs ℓ₁(p, q) > ε, where χ²(p, q) = Σᵢ (pᵢ − qᵢ)²/qᵢ. By Cauchy-Schwarz, χ²(p, q) ≥ ℓ₁(p, q)², so the two cases are indeed disjoint.

  31. A new χ² Goodness of Fit Test. Goal: given q and sample access to p, distinguish Case 1: χ²(p, q) ≤ ε²/2 vs Case 2: ℓ₁(p, q) > ε. Approach: Draw Poisson(m) many samples from p; let Xᵢ = # of appearances of symbol i ∈ Σ; then Xᵢ ~ Poisson(m·pᵢ), independent random variables. Statistic: Z = Σᵢ ((Xᵢ − m·qᵢ)² − Xᵢ)/(m·qᵢ), with E[Z] = m·χ²(p, q). Case 1: E[Z] ≤ m·ε²/2; Case 2: E[Z] > m·ε². Side-note: Pearson’s χ² test uses the statistic Σᵢ (Xᵢ − m·qᵢ)²/(m·qᵢ); subtracting Xᵢ in the numerator gives an unbiased estimator and, importantly, may hugely decrease the variance. Chug, chug, chug: bound the variance of Z; O(√|Σ|/ε²) samples suffice to distinguish.
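The statistic on this slide is a one-liner to compute. For simplicity the demo below draws exactly m samples (a multinomial stand-in for the Poissonization the slide uses only to make the counts independent in the analysis):

```python
import random

def chi_sq_statistic(counts, q, m):
    """Z = sum_i ((X_i - m q_i)^2 - X_i) / (m q_i); under Poissonized
    sampling, E[Z] = m * chi^2(p, q)."""
    return sum(((x - m * qi) ** 2 - x) / (m * qi) for x, qi in zip(counts, q))

def draw_counts(p, m, seed=0):
    """Histogram of m i.i.d. draws from the distribution p."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for i in rng.choices(range(len(p)), weights=p, k=m):
        counts[i] += 1
    return counts
```

When p = q the subtracted Xᵢ keeps Z centered near 0; when p is far from q in χ², Z concentrates near m·χ²(p, q), which is the separation the test thresholds on.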

  32. The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions


  34. Testing Properties of Distributions. So far 𝒫 = {single distribution}: restrictive, as we rarely know the hypothesis distribution exactly. Natural extension: test structural properties. Monotonicity: PDF is monotone, e.g. cancer vs radiation. Unimodality: PDF is single-peaked, e.g. single source of disease. Log-concavity: log PDF is concave. Monotone hazard rate: log(1 − CDF) is concave. Product distribution, e.g. testing linkage disequilibrium. Example question: 𝒫 = {unimodal distributions over [n]}; sample access to p; is p unimodal OR is p ε-far from all unimodal distributions?

  35. Testing Properties of Distributions [w/ Acharya and Kamath 2015]: 1. Testing identity, monotonicity, log-concavity, monotone hazard rate, and unimodality for distributions over the (ordered) set [n] is doable w/ O(√n/ε²) samples and time. 2. Testing monotonicity/independence of a distribution over Σ = [n]^d is doable w/ O(n^{d/2}/ε²) samples and time; previous best for monotonicity testing: O(n^{d − 0.5}/ε⁴) [Bhattacharya-Fisher-Rubinfeld-Valiant’11]; previous best for independence: d = 2 only, with worse bounds [Batu et al.’01]. 3. All bounds above are optimal: matching lower bounds for 1 and 2 via Le Cam. 4. Unified approach, computationally efficient tests. N.B. Contemporaneous work of [Canonne et al. 2015] provides a different unified approach for testing structure, but their results are suboptimal.

  36. A Natural Approach. Goal: Given 𝒫 and sample access to p, distinguish p ∈ 𝒫 vs ℓ₁(p, 𝒫) > ε. Choose a hypothesis q ∈ 𝒫; test the hypothesis (how well does q fit p?).

  37. A Natural Approach (cont’d). Goal: Given 𝒫 and sample access to p, distinguish p ∈ 𝒫 vs ℓ₁(p, 𝒫) > ε. A Learning-Followed-By-Testing algorithm: 1. Learn a hypothesis q ∈ 𝒫 s.t. p ∈ 𝒫 ⇒ ℓ₁(p, q) < ε/2, while ℓ₁(p, 𝒫) > ε ⇒ ℓ₁(p, q) > ε (automatic since q ∈ 𝒫); this needs a cheap proper learner. 2. Reduce to tolerant goodness of fit: given ε, sample access to p, and an explicit description of q, distinguish ℓ₁(p, q) < ε/2 vs ℓ₁(p, q) > ε. But a tolerant ℓ₁ tester requires an almost linear number of samples in the support of q, namely Θ(|Σ|/log|Σ|) samples [Valiant-Valiant’10]. Could try investing more samples for more accurate learning, but the proper-learning complexity vs tolerant-testing complexity tradeoff does not work out to give the optimal testing complexity.

  38. A Modified Approach. Goal: Given 𝒫 and sample access to p, distinguish p ∈ 𝒫 vs ℓ₁(p, 𝒫) > ε. A Learning-Followed-By-Testing algorithm: 1. Learn a hypothesis q ∈ 𝒫 s.t. p ∈ 𝒫 ⇒ χ²(p, q) ≤ ε²/2, while ℓ₁(p, 𝒫) > ε ⇒ ℓ₁(p, q) > ε (automatic since q ∈ 𝒫); this needs a cheap proper learner. 2. Reduce to tolerant goodness of fit: given ε, sample access to p, and an explicit description of q, distinguish χ²(p, q) ≤ ε²/2 vs ℓ₁(p, q) > ε. Now tolerant testing has the right complexity of O(√|Σ|/ε²). Pertinent question: are there sublinear proper learners in χ²? We show that the χ²-learning complexity is dominated by the testing complexity for all properties of distributions we consider.

  39. Tutorial: part 2

  40. Summary so far. Hypothesis testing in the small sample regime. p: unknown distribution over some discrete set Σ; 𝒫: set of distributions over Σ. Given: 𝒫, ε, and sample access to p. Goal: w/ probability ≥ 1 − δ tell p ∈ 𝒫 vs ℓ₁(p, 𝒫) > ε. Properties of interest: Is p uniform? unimodal? log-concave? MHR? a product measure? All of the above properties can be tested w/ O(√|Σ|/ε²) i.i.d. samples and time. Unified approach based on a modified Pearson’s goodness-of-fit test, with statistic Z = Σᵢ ((Xᵢ − m·qᵢ)² − Xᵢ)/(m·qᵢ); tight control of false positives: want to be able to both assert and reject the null hypothesis; accommodates sublinear sample size. Test → Pass/Fail?

  41. The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions


  43. Other Distances (beyond ℓ₁). So far focused on ℓ₁ (a.k.a. total variation) distance: given sample access to p, w/ probability ≥ 1 − δ distinguish p ∈ 𝒫 vs ℓ₁(p, 𝒫) > ε. Stronger distances? [Acharya-D-Kamath]: results are actually shown for χ², using χ²(p, q) ≥ ℓ₁(p, q)²; should also extend to KL via Pinsker-type inequalities relating KL(p, q) to ℓ₁(p, q)². Weaker distances? ℓ₂ is easy to test [Goldreich-Ron], but makes less sense, e.g. p = (2/m, 2/m, …, 2/m, 0, 0, …, 0) and q = (0, 0, …, 0, 2/m, 2/m, …, 2/m): the ℓ₁ distance is 2 while the ℓ₂ distance is only O(1/√m).

  44. Tolerance. So far, focused on the non-tolerant version: given a set of distributions 𝒫 and sample access to p, distinguish p ∈ 𝒫 vs d_TV(p, 𝒫) > ε. Tolerant version: distinguish d_TV(p, 𝒫) < ε/2 vs d_TV(p, 𝒫) > ε; Θ(|Σ|/log|Σ|) samples are needed [Valiant-Valiant’10]; different ratios change the constants in the Θ(·) notation. Tolerant version in (χ², ℓ₁): distinguish χ²(p, 𝒫) < ε²/2 vs ℓ₁(p, 𝒫) > ε; [w/ Acharya, Kamath’15]: O(√|Σ|/ε²) samples suffice.

  45. Goodness of Fit. Our goodness-of-fit test was given an explicit distribution q and sample access to a distribution p, and was asked to test p = q vs d_TV(p, q) > ε. Sometimes both distributions are unknown, e.g. transactions of 20-30 yr olds vs transactions of 30-40 yr olds: same or different?

  46. Goodness of Fit w/ two unknowns. Given sample access to two unknown distributions p, q: distinguish p = q vs d_TV(p, q) > ε, from i.i.d. samples of each. Test → Pass/Fail?

  47. Goodness of Fit w/ two unknowns. [Batu-Fortnow-Rubinfeld-Smith-White], [P. Valiant], [Chan-Diakonikolas-Valiant-Valiant]: tight upper and lower bound of Θ(max(|Σ|^{2/3}/ε^{4/3}, |Σ|^{1/2}/ε²)). Why different from the one-unknown case? Collision statistics are all that matter; collisions on heavy elements can hide the collision statistics of the rest of the domain. Construct pairs of distributions where the heavy elements are identical, but the light elements are either identical or very different.
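A standard collision-based primitive behind two-unknown testers of this kind is an unbiased estimate of ||p − q||₂², expanded as ||p||² + ||q||² − 2⟨p, q⟩ with each term estimated from within- and cross-sample collisions. A minimal sketch (my own illustrative implementation, not the cited algorithms):

```python
import random
from collections import Counter

def l2_sq_estimate(xs, ys):
    """Unbiased estimate of ||p - q||_2^2 from samples xs ~ p and ys ~ q."""
    m, k = len(xs), len(ys)
    cx, cy = Counter(xs), Counter(ys)
    # within-sample collisions estimate ||p||^2 and ||q||^2
    self_x = sum(c * (c - 1) for c in cx.values()) / (m * (m - 1))
    self_y = sum(c * (c - 1) for c in cy.values()) / (k * (k - 1))
    # cross-sample collisions estimate the inner product <p, q>
    cross = sum(cx[a] * cy[a] for a in cx) / (m * k)
    return self_x + self_y - 2 * cross
```

When p = q the estimate concentrates near 0; when the light-element structure differs, it is bounded away from 0, which is what the tester thresholds on.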

  48. Continuous Distributions. What can we say about continuous distributions? Without extra assumptions, such as smoothness of the PDF or parametric modeling, we cannot stick to hard distances (ℓ₁, χ², KL). Instead of restricting 𝒫, let us switch distances. Can extend results if we switch to Kolmogorov distance; recall: d_K(p, q) = sup_x |F_p(x) − F_q(x)|, the sup distance between CDFs; in contrast: d_TV(p, q) = sup over measurable sets S of |p(S) − q(S)|. Now want to distinguish: p ∈ 𝒫 vs d_K(p, 𝒫) > ε. Claim: Tolerant testing in Kolmogorov distance of any distribution property (continuous or discrete) of d-dimensional distributions can be performed from O(d/ε²) samples. Importantly: Kolmogorov distance allows graceful scaling with the dimensionality of the data.

  49. Dvoretzky–Kiefer–Wolfowitz inequality. Suppose X₁, …, X_n are i.i.d. samples from a (single-dimensional) p, and let F̂_n be the resulting empirical CDF, namely F̂_n(x) = (1/n)·Σ_{i=1}^n 1{Xᵢ ≤ x}. Then: Pr[d_K(p, F̂_n) > ε] ≤ 2e^{−2nε²}, for all ε > 0, i.e. O(1/ε²) samples suffice to learn any single-dimensional distribution to within ε in Kolmogorov distance. VC inequality: the same is true for d-dimensional distributions when the number of samples is at least Ω(d/ε²). After learning in Kolmogorov distance, can tolerant-test any property. Runtime under investigation; trouble: computing/approximating the Kolmogorov distance of two high-dimensional distributions is generally a hard computational question.
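The DKW bound above is easy to exercise numerically; a minimal sketch for Uniform[0,1] (the uniform target and function names are my own illustrative choices):

```python
import math
import random

def kolmogorov_distance_to_uniform(samples):
    """d_K between the empirical CDF of samples in [0, 1] and the
    Uniform[0,1] CDF, via the usual order-statistics formula."""
    xs = sorted(samples)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def dkw_sample_bound(eps, delta):
    """Smallest n with 2 exp(-2 n eps^2) <= delta, solving the DKW bound."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))
```

For example, guaranteeing d_K ≤ 0.05 except with probability 0.01 needs about a thousand samples, independently of any smoothness of the underlying distribution.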

  50. The Menu Motivation Problem Formulation Uniformity Testing, Goodness of Fit Testing Properties of Distributions Discussion/Road Ahead Testing in High Dimensions
