Bootstrap: A Powerful Statistical Methodology

 
Bootstrap – The Statistician’s
Magic Wand
 
Saharon Rosset
Bootstrap is a paradigm, not an
algorithm
 
An abstract view of statistics
 
There is a “world” (=unknown distribution) F
We observe some data from the world, say 100
heights (z) and weights (y) of random people
We want to learn about some property of the
world F, e.g.:
Mean of height
Correlation between height and weight
Variance of the empirical correlation between
height and weight
Standard statistical methodology
 
Find a way to estimate the property of F of
interest directly from the data
Mean height estimated by average
Correlation between height and weight estimated by
empirical correlation
How do we estimate the variance of the correlation?
There are some formulae under some assumptions, but
it gets complicated
Instead, we want to invent a general approach
that will allow estimating every property of F
relatively easily (hopefully, also well)
Bootstrap idea: The Plug-in principle
Graphical representation
Real world
Data  X
Statistic
s(X)
 
Data  X*
 
Statistic
s(X*)
 
Bootstrap world
Is Bootstrap important in practice?
Example: variance of empirical
correlation
 
The “double arrow” is the key to designing a bootstrap
algorithm
The most standard approach: use the empirical
distribution of the data
Drawing X* is drawing 100 pairs (z*,y*) with replacement from
the original dataset
This is commonly referred to as “bootstrap sampling” or
“nonparametric bootstrap”
But this is not the only approach, and often not the
best one!
Parametric Bootstrap example
Concrete example
Approach 1: standard non-parametric
Bootstrap
Approach 2: parametric Bootstrap
using normal distribution
Which one will be better here?
Does Bootstrap always work?
Hypothesis testing with Bootstrap
Inference on phylogenetic trees
Felsenstein (1985)
 
Dataset of malaria genetic sequences from different organisms (11 species,
sequences of length 221):
 
Result of applying standard
phylogenetic tree learning approach:
 
Our inference goal: assess confidence
in the 9-10 clade (subtree) – is it
strongly supported by the data?
Felsenstein’s Bootstrap of Phylogenetic
trees
Is this Bootstrap legit?
Efron’s solution(s)
 
In a beautiful paper, Efron et al. (1996, PNAS) reanalyze this
problem and show:
That under some (quite complicated) assumptions
Felsenstein’s approach can be considered a legitimate
Bootstrap
That without these assumptions (but with some
complicated math and geometry), an appropriate
Bootstrap can be devised for the hypothesis testing view
of the problem
Efron’s hypothesis testing view
 
A peek into Efron’s approach
Comparing Bootstrap results of
Felsenstein and Efron
Summary
 
Bootstrap is an extremely general and flexible paradigm for
statistical inference
Allows us to handle complex situations with minimal
assumptions and without complicated math
Doing theory (and also devising solutions for some problems)
can get very complicated, though
Has been widely influential in science and industry
However, despite the conceptual simplicity it is often
misunderstood and misapplied (well beyond Felsenstein)
 
Thanks!
 
saharon@post.tau.ac.il

Bootstrap is a paradigm-shifting methodology that allows statisticians to estimate various properties of an unknown distribution with ease. By creating alternative worlds based on observed data and applying the Plug-in principle, Bootstrap simplifies the estimation process for various statistical properties. This approach is beneficial in practice, especially when dealing with complex calculations such as estimating variances of correlations. Through graphical representations and examples, Bootstrap demonstrates its importance in statistics.

  • Bootstrap
  • Statistical Methodology
  • Estimation
  • Data Analysis
  • Statistical Inference

Uploaded on Oct 04, 2024 | 0 Views



Presentation Transcript


  1. Bootstrap – The Statistician’s Magic Wand. Saharon Rosset

  2. Bootstrap is a paradigm, not an algorithm

  3. An abstract view of statistics. There is a “world” (= unknown distribution) F. We observe some data from the world, say 100 heights (z) and weights (y) of random people. We want to learn about some property of the world F, e.g.: the mean of height; the correlation between height and weight; the variance of the empirical correlation between height and weight.

  4. Standard statistical methodology. Find a way to estimate the property of F of interest directly from the data: mean height is estimated by the average; correlation between height and weight is estimated by the empirical correlation. How do we estimate the variance of the correlation? There are some formulae under some assumptions, but it gets complicated. Instead, we want to invent a general approach that will allow estimating every property of F relatively easily (hopefully, also well).

  5. Bootstrap idea: The Plug-in principle. We are interested in some property of the world, $\theta = t(F)$. We create an alternative world $\hat{F}$ (usually using our data), and in it we estimate $\hat{\theta} = t(\hat{F})$. This is usually done empirically, by drawing data from this world and empirically estimating $t(\hat{F})$. The main wisdom lies in how to build the Bootstrap world $\hat{F}$ so that it is similar to $F$ in the ways that matter to us. Secondary problem: how to perform the estimation in the bootstrap world, which is usually straightforward.

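  6. Graphical representation. Real world: Dist. $F$ → Data X → Statistic s(X). Bootstrap world: Dist. $\hat{F}$ → Data X* → Statistic s(X*). [Diagram: a “Determine” step links the real world to the bootstrap world; the remaining labels are not recoverable.]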

  7. Is Bootstrap important in practice?

  8. Example: variance of empirical correlation. $F$ is the bivariate distribution of z = height and y = weight; we are given data X with 100 pairs of (z, y). The statistic of interest is $s(X) = \mathrm{cor}(z, y)$. The property of $F$ we are interested in is $\theta = \mathrm{Var}_F(s(X))$. Bootstrap approach: build a bootstrap world $\hat{F}$; repeatedly draw bootstrap samples X* from $\hat{F}$; compute s(X*) for each sample; use these values to empirically estimate $\hat{\theta} = \mathrm{Var}_{\hat{F}}(s(X^*))$ in the bootstrap world. This is your estimate of $\theta$ in the real world.
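
The final two steps of this recipe can be written out explicitly. The following is a sketch, assuming we draw $B$ bootstrap samples $X^*_1, \dots, X^*_B$ (the symbol $B$ and the sample indices are notation introduced here, not taken from the slides) and use the usual unbiased empirical variance:

```latex
\hat{\theta}
  = \operatorname{Var}_{\hat{F}}\bigl(s(X^{*})\bigr)
  \approx \frac{1}{B-1} \sum_{b=1}^{B} \bigl( s(X^{*}_{b}) - \bar{s}^{*} \bigr)^{2},
\qquad
\bar{s}^{*} = \frac{1}{B} \sum_{b=1}^{B} s(X^{*}_{b}).
```

The approximation error in this step shrinks as $B$ grows; the error from replacing $F$ by $\hat{F}$ does not, which is why the choice of $\hat{F}$ matters most.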

  9. How to build $\hat{F}$? The “double arrow” is the key to designing a bootstrap algorithm. The most standard approach: use the empirical distribution of the data. Drawing X* is drawing 100 pairs (z*, y*) with replacement from the original dataset. This is commonly referred to as “bootstrap sampling” or “nonparametric bootstrap”. But this is not the only approach, and often not the best one!

  10. Parametric Bootstrap example. Assume we know that $F$ (the common distribution of height and weight) is bivariate normal. Then it makes sense to make $\hat{F}$ bivariate normal, with parameters estimated from the data X. Then we can repeat exactly the same stages of drawing X* and estimating the variance empirically.

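  11. Concrete example. Let’s choose $F$ of height and weight to be bi-normal: $\begin{pmatrix} z \\ y \end{pmatrix} \sim N\left(\begin{pmatrix} 175 \\ 75 \end{pmatrix}, \begin{pmatrix} 100 & 50 \\ 50 & 50 \end{pmatrix}\right)$. We start by drawing $10^5$ random samples of 100 and observing the distribution of $s(X) = \mathrm{cor}(z, y)$; in particular we get $\theta = \mathrm{Var}(s) \approx 0.00261$. Now we want to try different Bootstrap approaches for estimating $\theta$.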
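
A minimal sketch of this "ground truth" simulation, using only the Python standard library. The function names, the seed, and the default of 2000 replications (far fewer than the slide reports, to keep the run short) are assumptions made for illustration; the mean vector and covariance matrix are the ones given on the slide.

```python
import math
import random
import statistics

MEAN = (175.0, 75.0)                   # (height, weight) means from the slide
COV = ((100.0, 50.0), (50.0, 50.0))    # covariance matrix from the slide


def pearson(pairs):
    """Empirical correlation of a list of (z, y) pairs."""
    n = len(pairs)
    mz = sum(z for z, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    szz = sum((z - mz) ** 2 for z, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    szy = sum((z - mz) * (y - my) for z, y in pairs)
    return szy / math.sqrt(szz * syy)


def draw_binormal(n, mean, cov, rng):
    """Draw n pairs from a bivariate normal using a hand-written 2x2 Cholesky factor."""
    (a, b), (_, c) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 ** 2)
    pairs = []
    for _ in range(n):
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)
        pairs.append((mean[0] + l11 * u, mean[1] + l21 * u + l22 * v))
    return pairs


def simulated_variance_of_correlation(reps=2000, n=100, seed=0):
    """Variance of the empirical correlation across `reps` fresh samples from F."""
    rng = random.Random(seed)
    stats = [pearson(draw_binormal(n, MEAN, COV, rng)) for _ in range(reps)]
    return statistics.variance(stats)


theta = simulated_variance_of_correlation()
print(f"simulated Var(s) = {theta:.5f}")
```

With this few replications the result is itself noisy, so it should land near, but not exactly on, the value quoted on the slide.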

  12. Approach 1: standard non-parametric Bootstrap. Define $\hat{F}$ to be the empirical distribution of X, then: sample many X* (Bootstrap samples); calculate $s(X^*) = \mathrm{cor}(z^*, y^*)$ for each X*; estimate the variance of s by the empirical variance of s*. In simulation we can repeat this whole exercise many times to get a distribution of Bootstrap estimates.
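
The steps of Approach 1 might be sketched as follows, using only the Python standard library. The function names, the number of bootstrap samples, and the toy dataset (generated here to mimic heights and weights) are all assumptions for illustration.

```python
import math
import random
import statistics


def pearson(pairs):
    """Empirical correlation of a list of (z, y) pairs."""
    n = len(pairs)
    mz = sum(z for z, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    szz = sum((z - mz) ** 2 for z, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    szy = sum((z - mz) * (y - my) for z, y in pairs)
    return szy / math.sqrt(szz * syy)


def nonparametric_bootstrap_variance(data, n_boot=1000, seed=0):
    """Estimate Var(s) by resampling whole (z, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(data)
    boot_stats = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=n)   # a draw of size n from the empirical distribution
        boot_stats.append(pearson(resample))
    return statistics.variance(boot_stats)


# Hypothetical toy data: 100 (height, weight) pairs, not the slide's dataset.
_rng = random.Random(42)
toy = []
for _ in range(100):
    z = _rng.gauss(175, 10)
    toy.append((z, 75 + 0.5 * (z - 175) + _rng.gauss(0, 5)))

print(nonparametric_bootstrap_variance(toy))
```

Note that pairs are resampled jointly; resampling z and y separately would destroy the dependence that the correlation measures.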

  13. Approach 2: parametric Bootstrap using normal distribution. Use X to estimate the mean and covariance of $F$, assuming it is normal, and define $\hat{F}$ to be this normal distribution. The rest proceeds as before: sample many X* from this bi-normal distribution (parametric Bootstrap samples); calculate $s(X^*) = \mathrm{cor}(z^*, y^*)$ for each X*; estimate the variance of s by the empirical variance of s*. Again, in simulation repeat this many times.
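
A corresponding sketch of Approach 2, again standard library only. The helper names, the use of the unbiased covariance estimate, and the toy dataset are assumptions for illustration; the slides do not say which estimator of the covariance is used.

```python
import math
import random
import statistics


def pearson(pairs):
    """Empirical correlation of a list of (z, y) pairs."""
    n = len(pairs)
    mz = sum(z for z, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    szz = sum((z - mz) ** 2 for z, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    szy = sum((z - mz) * (y - my) for z, y in pairs)
    return szy / math.sqrt(szz * syy)


def fit_binormal(pairs):
    """Estimate the mean vector and (unbiased) covariance matrix from the data."""
    n = len(pairs)
    mz = sum(z for z, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    szz = sum((z - mz) ** 2 for z, _ in pairs) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in pairs) / (n - 1)
    szy = sum((z - mz) * (y - my) for z, y in pairs) / (n - 1)
    return (mz, my), ((szz, szy), (szy, syy))


def draw_binormal(n, mean, cov, rng):
    """Draw n pairs from a bivariate normal using a hand-written 2x2 Cholesky factor."""
    (a, b), (_, c) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 ** 2)
    pairs = []
    for _ in range(n):
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)
        pairs.append((mean[0] + l11 * u, mean[1] + l21 * u + l22 * v))
    return pairs


def parametric_bootstrap_variance(data, n_boot=1000, seed=0):
    """Estimate Var(s) by sampling from the normal distribution fitted to the data."""
    rng = random.Random(seed)
    mean, cov = fit_binormal(data)      # this fitted normal plays the role of F-hat
    n = len(data)
    boot_stats = [pearson(draw_binormal(n, mean, cov, rng)) for _ in range(n_boot)]
    return statistics.variance(boot_stats)


# Hypothetical toy data: 100 (height, weight) pairs, not the slide's dataset.
_rng = random.Random(42)
toy = []
for _ in range(100):
    z = _rng.gauss(175, 10)
    toy.append((z, 75 + 0.5 * (z - 175) + _rng.gauss(0, 5)))

print(parametric_bootstrap_variance(toy))
```

The only difference from the non-parametric version is where X* comes from: a fitted model instead of the observed pairs.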

  14. Which one will be better here?

  15. Does Bootstrap always work? Of course not! From what we already know it’s clear that if we fail to build $\hat{F}$ so that $\hat{\theta} = t(\hat{F})$ is similar to $\theta = t(F)$ then our approach is useless. This can be a result of wrong assumptions on $F$ used in building $\hat{F}$. We can easily devise examples where no Bootstrap approach will give reasonable results. Still, the usefulness of a properly implemented Bootstrap is very general and applies to almost any reasonable problem we encounter.

  16. Hypothesis testing with Bootstrap. Recall the components of a hypothesis testing problem: a null hypothesis $H_0: \theta = \theta_0$ and a test statistic $m = s(X)$. Performing a test entails calculating quantities like the p-value $= P_{H_0}(s(X) > m)$ and rejecting if it is small. The p-value for a given $m$ is also a property of $F$, but how can we use the Bootstrap to estimate it? If $H_0$ uniquely defines the distribution, then it’s trivial, a standard simulation exercise. But if $H_0$ contains many possible $F$’s, we can implement the bootstrap paradigm: choose $\hat{F}$ as a member of $H_0$ that is “consistent with our data”, and calculate the p-value under this distribution.
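
One way to make this concrete, as a sketch under assumptions that are not in the slides: take the null hypothesis "correlation between z and y is zero" and build the bootstrap world as the product of the two empirical marginals, i.e. resample the z values and the y values independently. That world satisfies the null while keeping each marginal identical to the data. The function names and toy data are hypothetical.

```python
import math
import random


def pearson(pairs):
    """Empirical correlation of a list of (z, y) pairs."""
    n = len(pairs)
    mz = sum(z for z, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    szz = sum((z - mz) ** 2 for z, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    szy = sum((z - mz) * (y - my) for z, y in pairs)
    return szy / math.sqrt(szz * syy)


def bootstrap_p_value(data, n_boot=1000, seed=0):
    """One-sided bootstrap p-value for H0: zero correlation.

    The bootstrap world draws z* and y* independently from their empirical
    marginals, so H0 holds in it by construction.
    """
    rng = random.Random(seed)
    zs = [z for z, _ in data]
    ys = [y for _, y in data]
    n = len(data)
    m = pearson(data)                       # observed test statistic
    exceed = 0
    for _ in range(n_boot):
        star = list(zip(rng.choices(zs, k=n), rng.choices(ys, k=n)))
        if pearson(star) > m:
            exceed += 1
    return exceed / n_boot


# Hypothetical toy data with a strong positive correlation.
_rng = random.Random(42)
correlated = []
for _ in range(100):
    z = _rng.gauss(175, 10)
    correlated.append((z, 75 + 0.5 * (z - 175) + _rng.gauss(0, 5)))

print(bootstrap_p_value(correlated))
```

Other members of the null could be chosen instead; which one counts as most "consistent with our data" is exactly the design question the slide raises.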

  17. Inference on phylogenetic trees, Felsenstein (1985). Dataset of malaria genetic sequences from different organisms (11 species, sequences of length 221) [figure not reproduced]. Result of applying a standard phylogenetic tree learning approach [figure not reproduced]. Our inference goal: assess confidence in the 9-10 clade (subtree). Is it strongly supported by the data?

  18. Felsenstein’s Bootstrap of Phylogenetic trees. Given this phylogenetic tree built on this dataset, Felsenstein wanted to get an answer to questions like: how certain am I that subtree $T_0$ (say, 9-10) is “real” (i.e. exists in the world and not just in my data)? He suggested using the Bootstrap as follows: draw bootstrap samples of markers; build a tree on each sample (all species, sampled markers); use the % of time we get the subtree $T_0$ as confidence in this subtree.
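
The resampling scheme can be sketched without a real tree-building method. In the sketch below, the sequences are made-up toy data, and "the tree contains the two-species clade (i, j)" is replaced by a crude stand-in: i and j are each other's strictly nearest neighbours in Hamming distance. This is an assumption for illustration only, not the tree learning approach used in the slides.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def is_pair_clade(seqs, i, j):
    """Stand-in for tree building: True if i and j are strict mutual nearest neighbours."""
    d_ij = hamming(seqs[i], seqs[j])
    for k in range(len(seqs)):
        if k in (i, j):
            continue
        if hamming(seqs[i], seqs[k]) <= d_ij or hamming(seqs[j], seqs[k]) <= d_ij:
            return False
    return True


def clade_support(seqs, i, j, n_boot=500, seed=0):
    """Fraction of bootstrap samples of markers (columns) in which the clade appears."""
    rng = random.Random(seed)
    length = len(seqs[0])
    hits = 0
    for _ in range(n_boot):
        cols = rng.choices(range(length), k=length)   # resample markers, keep all species
        boot = ["".join(s[c] for c in cols) for s in seqs]
        hits += is_pair_clade(boot, i, j)
    return hits / n_boot


# Hypothetical toy alignment: species 2 and 3 differ at a single site.
toy_seqs = [
    "ACGTACGTACGTACGTACGT",
    "TGCATGCATGCATGCATGCA",
    "AAAACCCCGGGGTTTTAAAA",
    "AAAACCCCGGGGTTTTAAAC",
]

print(clade_support(toy_seqs, 2, 3))
```

The key point carried over from the slide is what gets resampled: the columns (markers) of the alignment, with every species kept in every bootstrap sample.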

  19. Is this Bootstrap legit? We want to know whether the subtree exists in $F$, so we estimate this by the % of time it exists in data drawn from $\hat{F}$. This is not exactly a Bootstrap recipe (details are not critical). But assuming it is a Bootstrap approach, is it a good one? Not at all, because $\hat{F}$ was built based on the sample whose best tree contains the subtree. This basically means that $\hat{F}$ contains the subtree, so we know we are getting over-optimistic results. A more correct formulation of this question is as a hypothesis test of $H_0$: tree does not contain $T_0$. If we reject $H_0$ we can conclude that $T_0$ is reliable.

  20. Efron’s solution(s). In a beautiful paper, Efron et al. (1996, PNAS) reanalyze this problem and show: (a) that under some (quite complicated) assumptions Felsenstein’s approach can be considered a legitimate Bootstrap; (b) that without these assumptions (but with some complicated math and geometry), an appropriate Bootstrap can be devised for the hypothesis testing view of the problem.

  21. Efron’s hypothesis testing view. First task: build a Bootstrap world $\hat{F}$ where (i) $H_0$ holds and (ii) $\hat{F}$ is as similar as possible to the empirical distribution of our data. Then we can test $H_0$ by examining what percentage of the time $T_0$ gets selected in this world. If it is smaller than 5%, we reject $H_0$ at level 0.05 and conclude $T_0$ is well supported. The challenge is the first task, and this is what Efron concentrates on.

  22. A peek into Efron’s approach

  23. Comparing Bootstrap results of Felsenstein and Efron. We recall that Felsenstein’s method gave 96.5% confidence for the 9-10 clade. Efron is rewarded for his hard work with a result that 93.8% of trees in his Bootstrap world do not contain the 9-10 clade. His Bootstrap p-value for $H_0$ is 0.062. The results are only slightly different, but if we treat 95% confidence / 5% p-value as the holy grail then we conclude: according to Felsenstein we are confident in this clade; according to Efron we cannot reject that this clade is a coincidence.

  24. Summary. Bootstrap is an extremely general and flexible paradigm for statistical inference. It allows us to handle complex situations with minimal assumptions and without complicated math. Doing theory (and also devising solutions for some problems) can get very complicated, though. It has been widely influential in science and industry. However, despite the conceptual simplicity it is often misunderstood and misapplied (well beyond Felsenstein).

  25. Thanks! saharon@post.tau.ac.il
