
Guilt-Free Data Dredging via Privacy
Omer Reingold, SRA
Joint with: Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, and Aaron Roth
 
 
False discovery: Just Getting Worse
“Trouble at the Lab” – The Economist
From: Trustworthy-broker@trustme.com
To: aaroth@cis.upenn.edu
Date: 2/27/15
Subject: Gr8 investment tip!!!
Hi! You don’t know me, but here is a tip! Go long on SHAK – it will go up today.

The same sender follows up daily:
2/28/15: Hi again! Go short today. SHAK is going down.
3/1/15: Down again.
3/2/15: Up!
3/3/15: Down.
3/4/15: Up!
3/5/15: Up!
3/6/15: Up!
3/7/15: Down.
3/8/15: Down.

From: Trustworthy-broker@trustme.com
To: aaroth@cis.upenn.edu
Date: 3/8/15
Subject: Gr8 investment opportunity!!!
Hi there. I’m tired of giving out this great advice for free. Let me manage your money, and I’ll continue giving you my stock prediction tips in exchange for a small cut! BANK ACCOUNT NUMBER PLS!!
Hmm… The chance he was right 10 times in a row if he was just randomly guessing is only 1/2^10 ≈ 0.0009 < 0.05. I can reject the null hypothesis that these predictions were luck. HERE IS MY MONEY!
What happened: 100,000 emails → 50,000 → 25,000 → …
After 10 days… there remain about 100 people who have received 10 perfect predictions. The recipient’s error was his failure to take into account the size of the email pool.
“The Multiple Hypothesis Testing Problem”
A finding with p-value 0.05 has only a 5% probability of being realized by chance. But we expect 5 such findings if we test 100 hypotheses, even if they are all wrong. Even worse: publication bias.
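The arithmetic behind the broker scam is easy to check in a few lines (a sketch; the 100,000-recipient pool size is the figure from the slides):

```python
# The "broker" emails a pool of recipients, telling half "up" and half
# "down"; whichever half received the wrong prediction is dropped.
# After 10 days a small group has seen 10 perfect calls by construction.
pool = 100_000
for day in range(10):
    pool //= 2  # only the half that got a correct "prediction" remains

print(pool)  # 97 recipients are left with a perfect 10-day record
```

Each survivor computes the seemingly damning p-value 1/2^10 on their own inbox, never seeing the other 99,999 mailings.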
 
 
In the real world: does ketchup prevent cancer?
Hack Your Way To Scientific Glory
Try to prove a hunch: the U.S. economy is affected by whether Republicans or Democrats are in office, using real data going back to 1948.
1,800 possible combinations; many possible settings of the learning algorithm.
“P-hacking”, “data dredging”: very tempting.
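A minimal sketch of why the 1,800 combinations are so tempting: under the null hypothesis a p-value is uniform on [0, 1], so trying many specifications on data with no real effect still yields plenty of “significant” results. (The uniform draw below stands in for running a full regression on each specification.)

```python
import random

rng = random.Random(0)
specs = 1800  # analyst choices: party measure, economic indicator, lag, ...

# Under the null, each specification's p-value is uniform on [0, 1],
# so one uniform draw per specification simulates the whole search.
significant = sum(rng.random() < 0.05 for _ in range(specs))

print(significant)  # roughly 0.05 * 1800 = 90 spurious "findings" to pick from
```

The analyst only needs one of those ~90 to write the paper; the other 1,799 specifications are never reported.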
Preventing false discovery
Decades-old subject in statistics; the theory focuses on non-adaptive data analysis.
Powerful results, such as Benjamini-Hochberg, work on controlling the False Discovery Rate.
Lots of tools: cross-validation, bootstrapping, holdout sets.
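The Benjamini-Hochberg step-up procedure mentioned above is short enough to sketch (a standard textbook version, not code from the talk):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # step-up rule: find the largest rank k with p_(k) <= (k/m) * q
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k_max = rank
    return {order[r] for r in range(k_max)}
```

For example, with p-values [0.01, 0.02, 0.03, 0.5] and q = 0.05 it rejects the first three hypotheses, even though 0.02 and 0.03 exceed the Bonferroni cutoff of 0.05/4 = 0.0125; that extra power is why FDR control is so widely used.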
Non-adaptive data analysis
Specify the exact experimental setup, hypotheses to test, …
Collect data → run experiment → observe outcome.
Can’t reuse the data after observing the outcome.
Adaptive data analysis
Specify the exact experimental setup, hypotheses to test, …
Collect data → run experiment → observe outcome → revise experiment.
Why adaptivity is troublesome
It exacerbates the multiple hypothesis testing problem exponentially: we must account for all hypotheses that might have been tested.
An adaptive data analyst is a decision tree: one must account not just for the m queries actually made, but for the roughly 2^m queries that could have been made given all possible outcomes.
Adaptivity arises naturally.
Freedman’s Paradox: an equation 𝑌=𝑋𝛽+𝑧 is fitted erroneously with a little bit of adaptivity. Vast literature on addressing this and other special cases.
Natural learning procedures (like gradient descent) adaptively query the data.
Common practice to first “feel/plot/see the data” before coming up with hypotheses and setting the “right” parameters.
More insidious: studies conducted by researchers who have read papers that used the same data set must be considered adaptive. This will be the norm as large data sets are shared and re-used!
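Freedman’s setup is easy to reproduce (a sketch with made-up sizes; numpy assumed): screen pure-noise variables for correlation with 𝑌, then refit 𝑌=𝑋𝛽+𝑧 on the survivors, and the fit looks far better than the k/n level that k randomly chosen columns would give.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 50, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)       # pure noise: no variable truly matters

# adaptive step: keep the k columns most correlated with y
scores = np.abs(X.T @ y)
keep = np.argsort(scores)[-k:]

# "erroneous" second step: fit Y = X beta + z on the survivors only
beta, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
resid = y - X[:, keep] @ beta
r2 = 1 - (resid @ resid) / (y @ y)
print(round(r2, 2))  # well above the ~k/n = 0.10 chance level
```

The two steps are each harmless on their own; it is the little bit of adaptivity between them that manufactures the apparent fit.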
Better Habits?
For every hypothesis to be tested, gather fresh data?
Bad habits sometimes prevail for a reason: easy, cheap, addictive.
Better if we can provide automatic correctness protection.
Standard holdout method
Split the data into training data (unrestricted access) and a holdout, good for one validation.
Non-reusable: can’t use information from the holdout in the training stage adaptively.
Reusing can cause overfitting: Kaggle’s data analysis competitions; a variant of Freedman’s Paradox.
Our Reusable holdout
Split the data into training data (unrestricted access) and a reusable holdout that can be used many times adaptively: essentially as good as using fresh data each time!
Check validity against the holdout via a (differentially private) mechanism.
Goals: provide useful information about the distribution, and control information leakage sufficiently well to prevent over-fitting.
Differential Privacy [Dwork-McSherry-Nissim-Smith 06]
(Figure: a database D of records, Xavier, Bob, Chris, Donna, Ernie, Alice, fed to a randomized algorithm; for any two databases differing in one record, the probability Pr[r] of each outcome r changes by at most a bounded ratio.)
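The bounded-ratio guarantee can be achieved with the classic Laplace mechanism (a standard construction from the differential privacy literature, not code from the talk): add Laplace noise scaled to the query’s sensitivity divided by ε.

```python
import random

def laplace_noise(scale, rng=random):
    # Laplace(0, scale) sampled as the difference of two iid exponentials
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(data, predicate, epsilon):
    """Epsilon-differentially private count of records satisfying predicate.

    A counting query has sensitivity 1 (changing one record moves the
    count by at most 1), so adding Lap(1/epsilon) noise bounds the ratio
    of output probabilities on neighboring databases by e^epsilon.
    """
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)
```

Answers to repeated queries compose: the total privacy loss is at most the sum of the individual epsilons, which is the composition property that makes differential privacy useful against adaptivity.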
Intuition
Differential privacy is a stability guarantee: changing one data point doesn’t affect the outcome much.
Stability implies generalization: “overfitting is not stable”.
Differential privacy composes well (helps deal with adaptivity).
Rich Algorithmic Literature
Counts, linear queries, histograms, contingency tables (marginals)
Location and spread (e.g., median, interquartile range)
Dimension reduction (PCA, SVD)
Support Vector Machines
Sparse regression/LASSO, logistic and linear regression, gradient descent
Clustering
Boosting, Multiplicative Weights
Combinatorial optimization, mechanism design
Privacy Under Continual Observation, Pan-Privacy
Kalman filtering
Statistical Queries learning model, PAC learning
See: The Algorithmic Foundations of Differential Privacy, Dwork and Roth, August 2014.
Example: Thresholdout, an Instantiation of the Reusable Holdout [relies on HR10]
Input: training set S_t, holdout set S_h.
Budgets: m on the total number of queries, and B on the number of overfitting detections (S_h differs from S_t).
Other parameters: threshold T, tolerance τ.
Given a statistical query φ: X → [−1,1], do:
1. If B < 1 or m < 1, output “⊥” and stop.
2. Else sample η, ξ ∼ Lap(τ):
   (a) If |E_Sh[φ] − E_St[φ]| > T + η, output E_Sh[φ] + ξ and set B ← B − 1.
   (b) Otherwise, output E_St[φ].
3. m ← m − 1.
Parameters: m can be exponential in |S_h|; B quadratic.
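The pseudocode above translates almost line for line into Python (a sketch; the default parameter values are illustrative, not the paper’s recommended settings):

```python
import random

def lap(scale, rng=random):
    # Laplace(0, scale) sampled as the difference of two iid exponentials
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

class Thresholdout:
    def __init__(self, train, holdout, T=0.05, tol=0.01, m=10_000, B=100):
        self.St, self.Sh = train, holdout
        self.T, self.tol = T, tol
        self.m, self.B = m, B          # query budget and overfitting budget

    def query(self, phi):
        """Answer a statistical query phi: X -> [-1, 1], or None if out of budget."""
        if self.B < 1 or self.m < 1:
            return None                # budgets exhausted: refuse to answer
        self.m -= 1
        mean_t = sum(phi(x) for x in self.St) / len(self.St)
        mean_h = sum(phi(x) for x in self.Sh) / len(self.Sh)
        # noisy comparison: pay noise and budget only when overfitting is detected
        if abs(mean_h - mean_t) > self.T + lap(self.tol):
            self.B -= 1
            return mean_h + lap(self.tol)
        return mean_t                  # means agree: the training value is safe to release
```

As long as the analyst’s queries don’t overfit, the mechanism simply returns the training-set mean, so accuracy costs nothing; only detected discrepancies consume the budget B.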
Conclusions and Further Research
Reusable holdout: provable correctness protection. Utility and privacy can be in sync.
Parameters: can we improve? What are the limits? Do we need Differential Privacy (max-information)?
Experiments: is it practical? Design practical methods for real data analysis problems.
Improved parameters and more general settings [Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15].
Limitations for efficient mechanisms [Hardt, Ullman 14] and [Steinke, Ullman 15].
A reusable holdout: Thresholdout
An illustrative experiment
Data set with 2n = 20,000 rows and d = 10,000 variables. Class labels in {-1,1}.
Analyst performs stepwise variable selection:
1. Split the data into training/holdout sets of size n.
2. Select the “best” k variables on the training data.
3. Only use variables that are also good on the holdout.
4. Build a linear predictor out of the k variables.
5. Find the best k = 10, 20, 30, …
No correlation between data and labels: the data are random Gaussians, and the labels are drawn independently at random from {-1,1}. Thresholdout correctly detects overfitting!
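A shrunken version of the no-correlation setup makes the danger concrete (a sketch with much smaller sizes than the slides’ 2n = 20,000 and d = 10,000; numpy assumed). Selecting variables on the training set and then scoring on that same set looks great, while a fresh holdout stays near chance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 1000, 20
X_train, X_hold = rng.standard_normal((n, d)), rng.standard_normal((n, d))
y_train = rng.choice([-1, 1], size=n)   # labels independent of the data
y_hold = rng.choice([-1, 1], size=n)

# step 2: pick the k variables most correlated with the training labels
corr = X_train.T @ y_train / n
keep = np.argsort(np.abs(corr))[-k:]

# step 4: simple linear predictor built from the selected variables
w = np.sign(corr[keep])
train_acc = np.mean(np.sign(X_train[:, keep] @ w) == y_train)
hold_acc = np.mean(np.sign(X_hold[:, keep] @ w) == y_hold)
print(train_acc, hold_acc)  # training looks strong; holdout hovers near 0.5
```

Queried with these per-variable correlations, Thresholdout would notice that the holdout disagrees with the training set and answer with the noisy holdout value instead, which is how it flags the overfitting.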
High correlation: 20 attributes are highly correlated with the target; the remaining attributes are uncorrelated. Thresholdout correctly detects the right model size!
Lots of Future Directions
Theory
Is it possible to obtain improved bounds? (We’ve basically hit the limits via differential privacy, but there may be other approaches.) That is, is differential privacy the best way to achieve generalization? If not, is there an information-theoretic measure that characterizes generalization for adaptive analyses? (See DFHPRR15 for a partial result.)
Practice
Design practical methods for real data analysis problems. Thresholdout is a step in this direction, but there is more to data analysis than holdout sets. Constants matter!
To read more
The main theorem: “Preserving Statistical Validity in Adaptive Data Analysis”, Dwork, Feldman, Hardt, Pitassi, Reingold, Roth. http://arxiv.org/abs/1411.2664, 2014.
Improvements (improved parameters and more general settings):
“More General Queries and Less Generalization Error in Adaptive Data Analysis”, Bassily, Smith, Steinke, Ullman. http://arxiv.org/abs/1503.04843, 2015.
“On the Generalization Properties of Differential Privacy”, Nissim, Stemmer. http://arxiv.org/abs/1504.05800, 2015.
Description length and more general measures, as well as Thresholdout and experiments: “Generalization in Adaptive Data Analysis and Holdout Reuse”, Dwork, Feldman, Hardt, Pitassi, Reingold, Roth. http://arxiv.org/abs/1506.02629, 2015.
For more on Differential Privacy: “The Algorithmic Foundations of Differential Privacy”, Dwork, Roth. http://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf, 2014.
