Understanding Bootstrap: A Powerful Statistical Methodology
The Bootstrap is a general methodology for estimating properties of an unknown distribution. It builds an alternative world from the observed data and applies the plug-in principle, which is especially useful when direct calculation is hard, for example when estimating the variance of an empirical correlation. The slides cover the idea, a graphical representation, worked examples, hypothesis testing, and a phylogenetic case study.
Presentation Transcript
Bootstrap: The Statistician's Magic Wand
Saharon Rosset
Bootstrap is a paradigm, not an algorithm
An abstract view of statistics
There is a world (an unknown distribution) $F$. We observe some data from the world, say the heights ($z$) and weights ($y$) of 100 random people. We want to learn about some property of the world $F$, e.g.:
- the mean of height
- the correlation between height and weight
- the variance of the empirical correlation between height and weight
Standard statistical methodology
Find a way to estimate the property of $F$ of interest directly from the data:
- Mean height is estimated by the sample average.
- Correlation between height and weight is estimated by the empirical correlation.
How do we estimate the variance of the correlation? There are some formulae under some assumptions, but it gets complicated. Instead, we want a general approach that allows estimating every property of $F$ relatively easily (and, hopefully, also well).
Bootstrap idea: the plug-in principle
We are interested in some property of the world, $\theta = t(F)$. We create an alternative world $\hat{F}$ (usually using our data), and in it we estimate $\hat{\theta} = t(\hat{F})$. This is usually done empirically, by drawing data from this world and estimating $t(\hat{F})$ from the draws. The main wisdom lies in how to build the Bootstrap world $\hat{F}$ so that it is similar to $F$ in the ways that matter to us. A secondary problem is how to perform the estimation in the Bootstrap world, which is usually straightforward.
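Graphical representation
Real world: Dist. $F$ → Data $X$ → Statistic $s(X)$; determine $\theta = t(F)$.
Bootstrap world: Dist. $\hat{F}$ → Data $X^*$ → Statistic $s(X^*)$; determine $\hat{\theta} = t(\hat{F})$.
A double arrow (⇒) links the real world to the Bootstrap distribution $\hat{F}$.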
Example: variance of empirical correlation
$F$ is the bivariate distribution of $z$ = height and $y$ = weight; we are given data $X$ with 100 pairs $(z, y)$. The statistic of interest is $s(X) = \mathrm{cor}(z, y)$. The property of $F$ we are interested in is $\theta = \mathrm{Var}_F(s(X))$.
Bootstrap approach:
1. Build the Bootstrap world $\hat{F}$.
2. Repeatedly draw Bootstrap samples $X^*$ from $\hat{F}$.
3. Compute $s(X^*)$ for each sample.
4. Use these values to empirically estimate $\hat{\theta} = \mathrm{Var}_{\hat{F}}(s(X^*))$ in the Bootstrap world.
This is your estimate of $\theta$ in the real world.
How to build $\hat{F}$?
The double arrow is the key to designing a Bootstrap algorithm. The most standard approach is to use the empirical distribution of the data: drawing $X^*$ means drawing 100 pairs $(z^*, y^*)$ with replacement from the original dataset. This is commonly referred to as bootstrap sampling or the nonparametric Bootstrap. But this is not the only approach, and often not the best one!
Parametric Bootstrap example
Assume we know that $F$ (the joint distribution of height and weight) is bivariate normal. Then it makes sense to make $\hat{F}$ bivariate normal, with parameters estimated from the data $X$. We can then repeat exactly the same stages of drawing $X^*$ and estimating the variance empirically.
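Concrete example
Let's choose $F$ of height and weight to be bivariate normal:
$$\begin{pmatrix} z \\ y \end{pmatrix} \sim N\left( \begin{pmatrix} 175 \\ 75 \end{pmatrix}, \begin{pmatrix} 100 & 50 \\ 50 & 50 \end{pmatrix} \right)$$
We start by drawing $10^5$ random samples of size 100 and observing the distribution of $s(X) = \mathrm{cor}(z, y)$; in particular we get $\theta = \mathrm{Var}_F(s(X)) \approx 0.00261$. Now we want to try different Bootstrap approaches for estimating $\theta$.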
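The simulation of the "real world" on this slide can be sketched in a few lines. This is only an illustration using Python's standard library: the function names, the seed, and the reduced number of repetitions (5,000, chosen to keep the run short) are assumptions, not part of the slides.

```python
import math
import random

# Parameters of F as given on the slide: mean (175, 75), covariance [[100, 50], [50, 50]].
MEAN_Z, MEAN_Y = 175.0, 75.0
VAR_Z, VAR_Y, COV_ZY = 100.0, 50.0, 50.0


def draw_sample(n, rng):
    """Draw n (z, y) pairs from the bivariate normal F."""
    sd_z, sd_y = math.sqrt(VAR_Z), math.sqrt(VAR_Y)
    rho = COV_ZY / (sd_z * sd_y)
    pairs = []
    for _ in range(n):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z = MEAN_Z + sd_z * e1
        # Mixing e1 and e2 gives y the required variance and covariance with z.
        y = MEAN_Y + sd_y * (rho * e1 + math.sqrt(1 - rho ** 2) * e2)
        pairs.append((z, y))
    return pairs


def cor(pairs):
    """Empirical (Pearson) correlation of a list of (z, y) pairs."""
    n = len(pairs)
    mz = sum(z for z, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    szz = sum((z - mz) ** 2 for z, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    szy = sum((z - mz) * (y - my) for z, y in pairs)
    return szy / math.sqrt(szz * syy)


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def variance_of_cor(n_rep=5000, n=100, seed=0):
    """Monte Carlo approximation of Var_F(s(X)) from n_rep independent samples."""
    rng = random.Random(seed)
    stats = [cor(draw_sample(n, rng)) for _ in range(n_rep)]
    return sample_variance(stats)


theta = variance_of_cor()
print(f"Monte Carlo approximation of theta: {theta:.5f}")
```

With more repetitions the approximation becomes more stable, at the cost of a longer run.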
Approach 1: standard nonparametric Bootstrap
Define $\hat{F}$ to be the empirical distribution of $X$, then:
1. Sample many $X^*$ (Bootstrap samples).
2. Calculate $s(X^*) = \mathrm{cor}(z^*, y^*)$ for each $X^*$.
3. Estimate the variance of $s$ by the empirical variance of the $s(X^*)$ values.
In simulation we can repeat this whole exercise many times to get a distribution of Bootstrap estimates.
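A minimal sketch of Approach 1, assuming the data are stored as a list of (z, y) pairs. The function names, the seeds, and the choice of 2,000 Bootstrap samples are illustrative assumptions; the dataset X is simulated here only so that the example runs on its own.

```python
import math
import random


def cor(pairs):
    """Empirical (Pearson) correlation of a list of (z, y) pairs."""
    n = len(pairs)
    mz = sum(z for z, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    szz = sum((z - mz) ** 2 for z, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    szy = sum((z - mz) * (y - my) for z, y in pairs)
    return szy / math.sqrt(szz * syy)


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def nonparametric_bootstrap_variance(data, n_boot=2000, seed=0):
    """Estimate Var(s(X)) by resampling pairs with replacement from the data."""
    rng = random.Random(seed)
    n = len(data)
    # Each X* is n pairs drawn with replacement: a draw from the empirical distribution.
    stats = [cor(rng.choices(data, k=n)) for _ in range(n_boot)]
    return sample_variance(stats)


# Hypothetical dataset X of 100 (height, weight) pairs, simulated from a
# bivariate normal with mean (175, 75) and covariance [[100, 50], [50, 50]].
data_rng = random.Random(1)
rho = 50 / math.sqrt(100 * 50)
X = []
for _ in range(100):
    e1, e2 = data_rng.gauss(0, 1), data_rng.gauss(0, 1)
    z = 175 + 10 * e1
    y = 75 + math.sqrt(50) * (rho * e1 + math.sqrt(1 - rho ** 2) * e2)
    X.append((z, y))

theta_hat = nonparametric_bootstrap_variance(X)
print(f"Nonparametric Bootstrap estimate of theta: {theta_hat:.5f}")
```

The pairs are resampled together, never z and y separately, because the statistic depends on their joint distribution.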
Approach 2: parametric Bootstrap using normal distribution
Use $X$ to estimate the mean and covariance of $F$, assuming it is normal, and define $\hat{F}$ to be this normal distribution. The rest proceeds as before:
1. Sample many $X^*$ from this bivariate normal distribution (parametric Bootstrap samples).
2. Calculate $s(X^*) = \mathrm{cor}(z^*, y^*)$ for each $X^*$.
3. Estimate the variance of $s$ by the empirical variance of the $s(X^*)$ values.
Again, in simulation repeat this many times.
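Approach 2 can be sketched the same way. The only change is the Bootstrap world: samples are drawn from a bivariate normal fitted to the data rather than from the data themselves. Function names, seeds, and the number of Bootstrap samples are illustrative assumptions, and the dataset X is again simulated so that the example is self-contained.

```python
import math
import random


def cor(pairs):
    """Empirical (Pearson) correlation of a list of (z, y) pairs."""
    n = len(pairs)
    mz = sum(z for z, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    szz = sum((z - mz) ** 2 for z, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    szy = sum((z - mz) * (y - my) for z, y in pairs)
    return szy / math.sqrt(szz * syy)


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def fit_binormal(data):
    """Estimate means, variances and covariance of (z, y) from the data."""
    n = len(data)
    mz = sum(z for z, _ in data) / n
    my = sum(y for _, y in data) / n
    vz = sum((z - mz) ** 2 for z, _ in data) / (n - 1)
    vy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    czy = sum((z - mz) * (y - my) for z, y in data) / (n - 1)
    return mz, my, vz, vy, czy


def draw_binormal(params, n, rng):
    """Draw n (z, y) pairs from the bivariate normal described by params."""
    mz, my, vz, vy, czy = params
    sd_z, sd_y = math.sqrt(vz), math.sqrt(vy)
    rho = czy / (sd_z * sd_y)
    pairs = []
    for _ in range(n):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z = mz + sd_z * e1
        y = my + sd_y * (rho * e1 + math.sqrt(1 - rho ** 2) * e2)
        pairs.append((z, y))
    return pairs


def parametric_bootstrap_variance(data, n_boot=2000, seed=0):
    """Estimate Var(s(X)) by sampling from the fitted bivariate normal."""
    rng = random.Random(seed)
    params = fit_binormal(data)  # this fitted distribution is the Bootstrap world
    stats = [cor(draw_binormal(params, len(data), rng)) for _ in range(n_boot)]
    return sample_variance(stats)


# Hypothetical dataset X, simulated from the distribution of the concrete example.
X = draw_binormal((175.0, 75.0, 100.0, 50.0, 50.0), 100, random.Random(1))

theta_hat = parametric_bootstrap_variance(X)
print(f"Parametric Bootstrap estimate of theta: {theta_hat:.5f}")
```

If the normality assumption is wrong, this Bootstrap world may differ from the real one in exactly the ways that matter, so the estimate is only as good as the assumption.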
Does Bootstrap always work?
Of course not! From what we already know, it's clear that if we fail to build $\hat{F}$ so that $\hat{\theta} = t(\hat{F})$ is similar to $\theta = t(F)$, then our approach is useless. This can be a result of wrong assumptions on $F$ used in building $\hat{F}$. One can easily devise examples where no Bootstrap approach will give reasonable results. Still, the usefulness of a properly implemented Bootstrap is very general and applies to almost any reasonable problem we encounter.
Hypothesis testing with Bootstrap
Recall the components of a hypothesis testing problem:
- Null hypothesis: $H_0: \theta = \theta_0$
- Test statistic: $m = s(X)$
Performing a test entails calculating quantities like the p-value $= P_{H_0}(s(X) > m)$ and rejecting if it is small. The p-value for a given $m$ is also a property of $F$, but how can we use the Bootstrap to estimate it? If $H_0$ uniquely defines the distribution, then it's trivial: a standard simulation exercise. But if $H_0$ contains many possible $F$'s, we can implement the Bootstrap paradigm: choose $\hat{F}$ as a member of $H_0$ that is "consistent with our data", and calculate the p-value under this distribution.
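As one possible illustration (not taken from the slides), consider testing $H_0$: mean height $= \mu_0$ with the sample mean as test statistic. Many distributions satisfy this null, so we pick one that stays close to the data: the empirical distribution of the data shifted to have mean exactly $\mu_0$. The function name, the simulated heights, and the value of $\mu_0$ are all assumptions made for the sketch.

```python
import random


def bootstrap_p_value(z, mu0, n_boot=2000, seed=0):
    """One-sided Bootstrap p-value for H0: E[z] = mu0, using the sample mean."""
    rng = random.Random(seed)
    n = len(z)
    m = sum(z) / n  # observed test statistic m = s(X)
    # Bootstrap world: the empirical distribution of the data, shifted so that
    # its mean is exactly mu0. It satisfies H0 and keeps the shape of the data.
    shifted = [v - m + mu0 for v in z]
    exceed = 0
    for _ in range(n_boot):
        star = rng.choices(shifted, k=n)  # X* drawn with replacement
        if sum(star) / n > m:
            exceed += 1
    # Fraction of Bootstrap statistics above the observed one.
    return exceed / n_boot


# Hypothetical data: 100 heights simulated with mean 175 and standard deviation 10.
data_rng = random.Random(1)
heights = [data_rng.gauss(175, 10) for _ in range(100)]

p = bootstrap_p_value(heights, mu0=173)
print(f"Bootstrap p-value for H0: mean height = 173: {p:.3f}")
```

Resampling the unshifted data would not work here, because the empirical distribution generally does not satisfy $H_0$; the shift is what places the Bootstrap world inside the null.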
Inference on phylogenetic trees: Felsenstein (1985)
Dataset of malaria genetic sequences from different organisms (11 species, sequences of length 221). [Figure: the sequence dataset]
Result of applying a standard phylogenetic tree learning approach: [Figure: the resulting tree]
Our inference goal: assess confidence in the 9-10 clade (subtree). Is it strongly supported by the data?
Felsenstein's Bootstrap of phylogenetic trees
Given this phylogenetic tree built on this dataset, Felsenstein wanted to answer questions like: how certain am I that subtree $T_0$ (say, 9-10) is real, i.e. exists in the world and not just in my data? He suggested using the Bootstrap as follows:
1. Draw Bootstrap samples of markers.
2. Build a tree on each sample (all species, sampled markers).
3. Use the percentage of times we get the subtree $T_0$ as the confidence in this subtree.
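The resampling scheme above can be sketched on toy data. Everything specific in this example is an assumption: the five short sequences are made up, taxa are labelled by index, and a simple UPGMA clustering on Hamming distances stands in for whatever tree-building method is actually used. Only the structure (resample markers, rebuild the tree, count how often the clade appears) follows the slide.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Fraction of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma_clades(seqs):
    """Build a UPGMA tree and return its clades as frozensets of taxon indices."""
    leaf_dist = {
        frozenset((i, j)): hamming(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }

    def dist(c1, c2):
        # Average linkage: mean distance over all leaf pairs across the two clusters.
        total = sum(leaf_dist[frozenset((i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [frozenset((i,)) for i in range(len(seqs))]
    clades = set()
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(pair[0], pair[1]))
        clusters.remove(c1)
        clusters.remove(c2)
        merged = c1 | c2
        clusters.append(merged)
        clades.add(merged)
    return clades


def clade_support(seqs, clade, n_boot=200, seed=0):
    """Fraction of Bootstrap trees (markers resampled) that contain the clade."""
    rng = random.Random(seed)
    clade = frozenset(clade)
    n_sites = len(seqs[0])
    hits = 0
    for _ in range(n_boot):
        # Resample markers (alignment columns) with replacement; keep all taxa.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = [[s[c] for c in cols] for s in seqs]
        if clade in upgma_clades(resampled):
            hits += 1
    return hits / n_boot


# Made-up alignment of 5 taxa and 20 sites; taxa 3 and 4 differ at a single site.
toy_seqs = [
    "ACGTACGTACGTACGTACGT",
    "ACGAACGTTCGTACGTACCT",
    "TCGTACCTACGTAGGTACGA",
    "GGCTAAGTCCGTACTTGCGT",
    "GGCTAAGTCCGTACTTGCGA",
]

support = clade_support(toy_seqs, {3, 4})
print(f"Bootstrap support for clade (3, 4): {support:.2f}")
```

The returned fraction is the quantity Felsenstein proposed to read as confidence in the clade.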
Is this Bootstrap legit?
We want to know whether the subtree exists in $F$, so we estimate this by the percentage of times it exists in data drawn from $\hat{F}$. This is not exactly a Bootstrap recipe (the details are not critical). But assuming it is a Bootstrap approach, is it a good one? Not at all, because $\hat{F}$ was built based on the sample whose best tree contains the subtree. This basically means that $\hat{F}$ contains the subtree, so we know we are getting over-optimistic results. A more correct formulation of this question is as a hypothesis test of $H_0$: tree does not contain $T_0$. If we reject $H_0$ we can conclude that $T_0$ is reliable.
Efron's solution(s)
In a beautiful paper, Efron et al. (1996, PNAS) reanalyze this problem and show:
- that under some (quite complicated) assumptions, Felsenstein's approach can be considered a legitimate Bootstrap;
- that without these assumptions (but with some complicated math and geometry), an appropriate Bootstrap can be devised for the hypothesis testing view of the problem.
Efron's hypothesis testing view
First task: build a Bootstrap world $\hat{F}$ where:
- $H_0$ holds;
- $\hat{F}$ is as similar as possible to the empirical distribution of our data.
Then we can test $H_0$ by examining what percentage of the time $T_0$ gets selected in this world. If it is smaller than 5%, we reject $H_0$ at level 0.05 and conclude that $T_0$ is well supported. The challenge is the first task, and this is what Efron concentrates on.
Comparing Bootstrap results of Felsenstein and Efron
We recall that Felsenstein's method gave 96.5% confidence for the 9-10 clade. Efron is rewarded for his hard work with the result that 93.8% of trees in his Bootstrap world do not contain the 9-10 clade, so his Bootstrap p-value for $H_0$ is 0.062. The results are only slightly different, but if we treat 95% confidence / 5% p-value as the holy grail then we conclude:
- According to Felsenstein, we are confident in this clade.
- According to Efron, we cannot reject that this clade is a coincidence.
Summary
- Bootstrap is an extremely general and flexible paradigm for statistical inference.
- It allows us to handle complex situations with minimal assumptions and without complicated math.
- Doing theory (and also devising solutions for some problems) can get very complicated, though.
- It has been widely influential in science and industry.
- However, despite its conceptual simplicity, it is often misunderstood and misapplied (well beyond Felsenstein).
Thanks! saharon@post.tau.ac.il