Understanding Generalization in Adaptive Data Analysis by Vitaly Feldman
This presentation covers adaptive data analysis: the practice in which the steps of an analysis, such as exploratory analysis, feature selection, and hyper-parameter tuning, are chosen based on previous results computed on the same dataset. Classical generalization guarantees from statistical inference (model complexity, stability) assume the analysis is fixed before the data is seen, so adaptivity can invalidate them ("do not test hypotheses suggested by the data"). The goal is to design algorithms that answer adaptively chosen queries with results close to those that would be obtained on fresh samples, covering basic techniques based on differential privacy, new results, and open problems.
Overview
Adaptive data analysis
o Motivation
o Definitions
o Basic techniques
With Dwork, Hardt, Pitassi, Reingold, Roth [DFHPRR 14, 15]
New results [F, Steinke 17]
Open problems
Learning problem
Distribution P over domain X
Data S = (x_1, …, x_n) ∼ P^n
Analysis A outputs a model f = A(S)
Quantity of interest: E_{x∼P}[loss(f, x)]
Statistical inference
Data S: i.i.d. samples from P; an algorithm A outputs f = A(S).
Theory gives generalization guarantees for f:
o Model complexity (e.g. Rademacher complexity)
o Stability
o Online-to-batch conversions
Data analysis is adaptive
Steps depend on previous analyses of the same dataset:
o Exploratory data analysis
o Feature selection
o Model stacking
o Hyper-parameter tuning
o Shared datasets
Data analyst(s) issue a sequence of analyses A_1, A_2, …, A_k against the same dataset S.
"Thou shalt not test hypotheses suggested by data"
The "quiet scandal of statistics" [Leo Breiman, 1992]
ML practice
Split the data into a training set and a test (holdout) set.
Train f on the training set; the test error of f estimates E_{x∼P}[loss(f, x)].
ML practice now
Split the data into training, validation, and test sets.
Candidate models are built on the training set and compared on the validation set; the test error of the final f estimates E_{x∼P}[loss(f, x)].
Adaptive data analysis [DFHPRR 14]
Data analyst(s) issue analyses A_1, …, A_k; an algorithm holding S = (x_1, …, x_n) ∼ P^n returns answers v_1, …, v_k.
Each analysis is a query.
Goal: given S, compute answers v_i close to the result of running A_i on fresh samples.
Design an algorithm for answering adaptively-chosen queries.
Adaptive statistical queries
Statistical query oracle [Kearns 93]: each query is a function φ_i : X → [0,1]; the answer a_i must satisfy |a_i − E_{x∼P}[φ_i(x)]| ≤ τ with probability ≥ 1 − β.
Example: φ_i(x) = loss(f_i, x).
Can measure correlations, moments, accuracy/loss.
Can run any statistical query algorithm.
Answering non-adaptive SQs
Given k non-adaptive query functions φ_1, …, φ_k and n i.i.d. samples from P, estimate E_{x∼P}[φ_i(x)] for each i.
Use the empirical mean E_S[φ_i] = (1/n) Σ_j φ_i(x_j).
n = O(log(k/β) / τ²) samples suffice.
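As a concrete illustration, here is a minimal Python sketch of this baseline: answer every fixed query with its empirical mean on the sample. The function and variable names are illustrative, not from the talk.

import numpy as np

def answer_nonadaptive_sqs(samples, queries):
    """Answer a fixed (non-adaptive) list of statistical queries
    phi: X -> [0, 1] with their empirical means on the sample."""
    return [np.mean([phi(x) for x in samples]) for phi in queries]

# By Hoeffding's inequality plus a union bound over the k queries,
# n = O(log(k / beta) / tau^2) samples give accuracy tau for all of them.
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)                                # i.i.d. draws from P
queries = [lambda x, t=t: float(x > t) for t in np.linspace(-2, 2, 50)]
answers = answer_nonadaptive_sqs(samples, queries)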
Answering adaptively-chosen SQs
What if we keep using the empirical mean E_S[φ_i]?
Adaptive strategies such as variable selection, boosting, bagging, and step-wise regression can overfit: for some constant c > 0, the empirical mean can be off by more than τ once k ≥ c·τ²·n, so the empirical mean needs n = Ω(k/τ²) samples.
Data splitting: n = O(k·log(k) / τ²).
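The sketch below (a toy construction of ours, not from the slides) shows the overfitting effect in two rounds: ask k queries on random bits, then ask one aggregate query built from the ones that happened to look biased on the sample. Its empirical mean noticeably exceeds its true mean of 1/2.

import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 10_000

# Each sample is a uniformly random bit-vector of length k, so the i-th
# query phi_i(x) = x[i] has true mean exactly 1/2.
S = rng.integers(0, 2, size=(n, k))

# Round 1: ask the k queries; keep those whose empirical mean exceeds 1/2.
emp_means = S.mean(axis=0)
selected = emp_means > 0.5

# Round 2: ask the aggregate query phi(x) = mean of the selected coordinates.
aggregate_emp = S[:, selected].mean()
print(f"true mean = 0.5, empirical mean of aggregate = {aggregate_emp:.3f}")
# The gap is roughly sqrt(1 / (2 * pi * n)) ~ 0.04 for n = 100: the empirical
# mean has overfit to the sample after adaptively chosen queries.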
Answering adaptive SQs [DFHPRR 14]
There exists an algorithm that can answer k adaptively chosen SQs with accuracy τ for n = Õ(√k / τ^2.5).
Data splitting: n = Õ(k / τ²).
[Bassily, Nissim, Smith, Steinke, Stemmer, Ullman 15]: n = Õ(√k / τ²).
Generalizes to low-sensitivity analyses φ : X^n → ℝ with |φ(S) − φ(S′)| ≤ 1/n whenever S, S′ differ in a single element: estimates E_{S∼P^n}[φ(S)] within τ.
Differential privacy [Dwork, McSherry, Nissim, Smith 06]
A randomized algorithm M is (ε, δ)-differentially private if for any two datasets S, S′ that differ in one element, and every Z ⊆ range(M),
Pr[M(S) ∈ Z] ≤ e^ε · Pr[M(S′) ∈ Z] + δ
(the output distributions of M on S and S′ have bounded ratio).
DP implies generalization
Differential privacy is a form of stability: it implies strongly uniform replace-one stability and generalization in expectation.
DP composes adaptively: the composition of k ε-DP algorithms is, for every δ > 0, (O(ε·√(k·log(1/δ))), δ)-DP [Dwork, Rothblum, Vadhan 10].
DP implies generalization with high probability [DFHPRR 14, BNSSSU 15].
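The "generalization in expectation" step is the standard one-line calculation below, written out here for completeness (for pure ε-DP and a query with range [0,1]); it is not a new claim of the talk.

% M is epsilon-DP and outputs phi = M(S) : X -> [0,1];
% S = (x_1,...,x_n) ~ P^n and S^{(i)} is S with x_i replaced by a fresh x'_i ~ P.
\begin{align*}
\mathop{\mathbf{E}}_{S,M}\Big[\tfrac{1}{n}\sum_{i=1}^{n}\phi_{M(S)}(x_i)\Big]
 &= \tfrac{1}{n}\sum_{i=1}^{n}\mathop{\mathbf{E}}_{S,x'_i}\Big[\mathop{\mathbf{E}}_{M}\big[\phi_{M(S)}(x_i)\big]\Big] \\
 &\le \tfrac{1}{n}\sum_{i=1}^{n} e^{\varepsilon}\,\mathop{\mathbf{E}}_{S,x'_i}\Big[\mathop{\mathbf{E}}_{M}\big[\phi_{M(S^{(i)})}(x_i)\big]\Big]
   &&\text{($S$ and $S^{(i)}$ differ in one element)}\\
 &= e^{\varepsilon}\,\mathop{\mathbf{E}}_{S,M}\Big[\mathop{\mathbf{E}}_{x\sim P}\big[\phi_{M(S)}(x)\big]\Big],
\end{align*}
% since x_i is independent of S^{(i)} and distributed according to P.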
Value perturbation [DMNS 06]
Answer a low-sensitivity query φ with φ(S) + ζ, where ζ ∼ N(0, σ²) is Gaussian noise.
Δ(φ) is the worst-case sensitivity: max over S, S′ differing in one element of |φ(S) − φ(S′)|.
Given n samples, answering k adaptive queries this way achieves error on the order of Δ(φ)·√n·k^{1/4}.
The worst-case sensitivity reflects the range of φ, which can be much larger than its standard deviation.
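Below is a minimal sketch of value perturbation for statistical queries (worst-case sensitivity 1/n), with an illustrative noise scale; the calibration shown is our own back-of-the-envelope choice, not the exact one from the talk.

import numpy as np

def perturbed_answer(samples, phi, sigma, rng):
    """Answer a low-sensitivity query (here an empirical mean, with
    worst-case sensitivity Delta = 1/n) by adding Gaussian noise."""
    return np.mean([phi(x) for x in samples]) + rng.normal(0.0, sigma)

# Illustrative calibration: to answer k adaptive SQs on n samples, noise of
# standard deviation ~ k**0.25 / sqrt(n) balances the added noise against the
# bias allowed by differential privacy, giving total error on the order of
# k**(1/4) / sqrt(n), i.e. the n = O(sqrt(k) / tau^2) regime.
rng = np.random.default_rng(2)
n, k = 10_000, 256
samples = rng.normal(size=n)
sigma = k ** 0.25 / np.sqrt(n)
ans = perturbed_answer(samples, lambda x: float(x > 0), sigma, rng)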
Beyond low-sensitivity [F, Steinke 17]
There exists an algorithm such that, for any adaptively-chosen sequence of estimators φ_1, …, φ_k : X^m → ℝ, given n = Õ(√k)·m i.i.d. samples from P, it outputs values v_1, …, v_k such that w.h.p. for all i:
|v_i − E_{S∼P^m}[φ_i(S)]| ≤ 2σ_i, where σ_i = σ_{S∼P^m}[φ_i(S)].
For statistical queries φ_i : X → [−B, B], given n samples the error scales as σ[φ_i(x)]·k^{1/4}/√n, versus B·k^{1/4}/√n for value perturbation.
Stable Median
Split S into subsamples S_1, …, S_t and compute y_i = φ(S_i) for each.
Find an approximate median of y_1, …, y_t with differential privacy relative to S (changing one sample changes at most one y_i): a value v greater than the bottom 1/3 and smaller than the top 1/3 of the values.
Median algorithms
Requires discretization: a ground set T with |T| = N.
Upper bound: 2^O(log* N) samples; lower bound: Ω(log* N) samples [Bun, Nissim, Stemmer, Vadhan 15].
Exponential mechanism [McSherry, Talwar 07]: output y ∈ T with probability proportional to exp(−ε·|#{i : y_i < y} − t/2| / 2); uses O(log(N)/ε) samples.
Stability and confidence amplification for the price of one log factor!
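As an illustration, here is a short Python sketch of the subsample-and-aggregate step together with a DP approximate median selected by the exponential mechanism over a finite grid; the grid, score, and parameter values are illustrative choices, not the exact ones from the talk.

import numpy as np

def dp_approx_median(values, grid, eps, rng):
    """Exponential mechanism for an approximate median: score each grid
    point by minus its rank-distance from the median of `values` (a score
    of sensitivity 1), then sample proportionally to exp(eps * score / 2)."""
    ranks = np.array([np.sum(values < g) for g in grid])
    scores = -np.abs(ranks - len(values) / 2)
    probs = np.exp(eps * scores / 2)
    probs /= probs.sum()
    return rng.choice(grid, p=probs)

def stable_median_answer(samples, phi, t, grid, eps, rng):
    """Split the data into t chunks, evaluate phi on each chunk, and release
    a DP approximate median of the t resulting values."""
    chunks = np.array_split(samples, t)
    values = np.array([phi(chunk) for chunk in chunks])
    return dp_approx_median(values, grid, eps, rng)

rng = np.random.default_rng(3)
samples = rng.normal(loc=1.0, size=10_000)
grid = np.linspace(-5, 5, 1001)          # discretized ground set T, |T| = N
est = stable_median_answer(samples, np.mean, t=100, grid=grid, eps=0.5, rng=rng)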
Analysis
Differential privacy approximately preserves quantiles [F, Steinke 17]: let M be a DP algorithm that on input S ∼ P^n outputs a function φ and a value v. Then w.h.p. over S ∼ P^n and (φ, v) = M(S), the empirical quantile of v among the subsample values φ(S_i) is close to its true quantile Pr_{T∼P^m}[φ(T) ≤ v].
In particular, if v is within the [1/3, 2/3] empirical quantiles, then v is within the [1/4, 3/4] true quantiles.
If φ is well-concentrated, a value within the [1/4, 3/4] true quantiles is within mean ± 2σ, and high-probability bounds are then easy to prove.
Limits
Any algorithm for answering k adaptively chosen SQs with accuracy τ requires* n = Ω(√k/τ) samples [Hardt, Ullman 14; Steinke, Ullman 15].
*in sufficiently high dimension or under cryptographic assumptions
Verification of responses to queries needs only n = Õ(√r·log k) samples, where r is the number of queries that failed verification:
o Data splitting if overfitting [DFHPRR 14]
o Reusable holdout [DFHPRR 15]
o Maintaining a public leaderboard in a competition [Blum, Hardt 15]
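For intuition, here is a simplified Python sketch in the spirit of the reusable holdout (Thresholdout) of [DFHPRR 15]: return the training-set estimate of each query unless it disagrees with the holdout estimate by more than a noisy threshold. The class name, threshold, and noise scales are illustrative, not the paper's exact parameters.

import numpy as np

class ReusableHoldout:
    """Simplified Thresholdout-style holdout reuse: answer a query with its
    training-set mean unless that deviates from the holdout mean by more than
    a noisy threshold, in which case answer with a noisy holdout mean."""

    def __init__(self, holdout, threshold=0.04, sigma=0.01, seed=0):
        self.holdout = holdout
        self.threshold = threshold
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)

    def query(self, train, phi):
        train_mean = np.mean([phi(x) for x in train])
        hold_mean = np.mean([phi(x) for x in self.holdout])
        noise = self.rng.laplace(0.0, self.sigma)
        if abs(train_mean - hold_mean) > self.threshold + noise:
            return hold_mean + self.rng.laplace(0.0, self.sigma)   # failed verification
        return train_mean                                          # verified

rng = np.random.default_rng(5)
train, holdout = rng.normal(size=5_000), rng.normal(size=5_000)
oracle = ReusableHoldout(holdout)
estimate = oracle.query(train, lambda x: float(x > 0))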
Open problems
Analysts without side information about P (queries depend only on previous answers): does there exist an SQ analyst whose queries require more than O(log k) samples to answer (with 0.1 accuracy/confidence)?
Fixed natural analyst / learning algorithm:
o Gradient descent for stochastic convex optimization
Stochastic convex optimization
Convex body K = B_2^d, the unit ℓ2 ball in ℝ^d.
Class F of convex 1-Lipschitz functions: F = { f convex : ∀w ∈ K, ‖∇f(w)‖_2 ≤ 1 }.
Given f_1, …, f_n sampled i.i.d. from an unknown distribution P over F.
Minimize the true (expected) objective F_P(w) ≐ E_{f∼P}[f(w)] over K: find w s.t. F_P(w) ≤ min_{w*∈K} F_P(w*) + τ.
Gradient descent
ERM via projected gradient descent on F_S(w) ≐ (1/n) Σ_i f_i(w), with ∇F_S(w) = (1/n) Σ_i ∇f_i(w):
o Initialize w_1 ∈ K
o For t = 1 to T: w_{t+1} = Project_K(w_t − η·∇F_S(w_t))
o Output the average iterate (1/T) Σ_t w_t
With fresh samples in each step (SGD): F_P(w̄) − min_{w*} F_P(w*) ≤ O(1/√n), so n = O(1/τ²) samples suffice [Robbins, Monro 51; Polyak 90].
Uniform convergence: Θ(d/τ²) samples (tight [F. 16]).
The sample complexity of gradient descent on the empirical objective is unknown.
Viewed as adaptive data analysis: overall d/τ² statistical queries with accuracy τ, in 1/τ² adaptive rounds.
Sample splitting: O(log(d)/τ⁴) samples. DP-based answering: O(√d/τ³) samples.
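Below is a minimal Python sketch of the projected-gradient-descent update on the empirical objective over the unit ℓ2 ball; the step size and the linear losses used for the demo are our own illustrative choices.

import numpy as np

def project_l2_ball(w):
    """Project onto the unit l2 ball K = {w : ||w||_2 <= 1}."""
    norm = np.linalg.norm(w)
    return w if norm <= 1.0 else w / norm

def erm_projected_gd(grad_FS, d, steps, eta):
    """Projected gradient descent on the empirical objective
    F_S(w) = (1/n) sum_i f_i(w); grad_FS(w) returns its gradient at w."""
    w = np.zeros(d)
    iterates = []
    for _ in range(steps):
        w = project_l2_ball(w - eta * grad_FS(w))
        iterates.append(w)
    return np.mean(iterates, axis=0)                 # average iterate

# Demo with linear (hence convex, 1-Lipschitz) losses f_i(w) = <g_i, w>,
# whose empirical gradient is constant: (1/n) sum_i g_i.
rng = np.random.default_rng(4)
d, n = 20, 1000
G = rng.normal(size=(n, d))
G /= np.maximum(np.linalg.norm(G, axis=1, keepdims=True), 1.0)   # ||g_i||_2 <= 1
w_hat = erm_projected_gd(lambda w: G.mean(axis=0), d, steps=100, eta=0.1)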
Conclusions
Real-valued analyses (without any assumptions)
Going beyond tools from DP:
o Other notions of stability for outcomes
o Max/mutual information
Generalization beyond uniform convergence
Using these techniques in practice