Understanding Maximum Likelihood Estimation

Estimation methods play a crucial role in statistical modeling. Maximum Likelihood Estimation (MLE) is a powerful technique invented by Fisher in 1922 for estimating unknown model parameters. This session explores how MLE works, its applications in different scenarios like genetic analysis, and practical examples of its implementation. By maximizing the likelihood of observed data, MLE provides valuable insights for making informed inferences about population parameters. Dive into the world of MLE and enhance your statistical modeling skills.





Presentation Transcript


  1. Estimation

  2. Estimation. Probability and statistical models depend on parameters: the binomial depends on the probability of success π; the normal depends on the mean μ and standard deviation σ. Parameters are properties of the population and are typically unknown. The process of taking a sample of data to make inferences about these parameters is referred to as estimation. There are a number of different estimation methods; we will study two of them: maximum likelihood (ML) and Bayes.

  3. Maximum Likelihood. Fisher (1922) invented this general method. Problem: the model parameter θ is unknown. Set-up: write the probability of the data, Y, in terms of the model parameter and the data, P(Y | θ). Solution: choose as your estimate the value of the unknown parameter that makes your data look as likely as possible, i.e. pick the θ̂ that maximizes the probability of the observed data. The estimator θ̂ is called the maximum likelihood estimator (MLE).
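In symbols, a compact restatement of the slide (using the θ̂ notation that appears on later slides):
θ̂ = argmax over θ of P(Y | θ), or equivalently the maximizer of the log-likelihood log P(Y | θ).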

  4. Maximum Likelihood - Example. Suppose a man is known to have transmitted allele A1 to his child at a locus that has only two alleles: A1 and A2. What is the maximum likelihood estimate of the man's genotype? Let X represent the data (the paternal allele in the child) and let θ represent the parameter (the man's genotype): X = A1, θ ∈ {A1A1, A1A2, A2A2}. The probability function is based on P(X | θ): P(X = A1 | θ = A1A1) = 1; P(X = A1 | θ = A1A2) = .5; P(X = A1 | θ = A2A2) = 0. Therefore, the MLE is θ̂ = A1A1.
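A minimal R sketch of this discrete maximization (the genotype labels and likelihood values are taken directly from the slide):

# likelihood of the observed paternal allele (X = A1) under each candidate genotype
lik <- c(A1A1 = 1, A1A2 = 0.5, A2A2 = 0)
# the MLE is the candidate genotype with the largest likelihood
names(lik)[which.max(lik)]   # "A1A1"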

  5. Maximum Likelihood - Example. Suppose we have a sample of 20 gametes in which the number of recombinants (Z) and nonrecombinants (20 - Z) for two loci can be counted. Use these data to estimate the recombination fraction (θ) between the two loci. The probability of the data can be modeled using a binomial distribution. The probability distribution function is
P(Z; θ) = (20 choose Z) θ^Z (1 - θ)^(20 - Z)
where Z is the variable and θ is fixed. The likelihood function is the same function,
L(θ; Z) = (20 choose Z) θ^Z (1 - θ)^(20 - Z)
except now θ is the variable and Z is fixed.

  6. Maximum Likelihood - Example. Two ways to look at this:
1) Probability: fix θ (e.g. θ = 0.1) and look at the probability of different values of Z:
Z:           0      1      2      3      4      5
P(Z; 0.1):   0.122  0.270  0.285  0.190  0.090  0.032
2) Likelihood: fix Z (e.g. Z = 3) and look at the likelihood under different values of θ (this is called the likelihood function):
θ:           0.01   0.05   0.10   0.20   0.30   0.40
P(Z = 3; θ): 0.001  0.060  0.190  0.205  0.072  0.012
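Both tables can be reproduced in R (a sketch; dbinom is R's built-in binomial probability function):

# probability view: theta fixed at 0.1, Z varies
round(dbinom(0:5, size = 20, prob = 0.1), 3)
# likelihood view: Z fixed at 3, theta varies
theta <- c(0.01, 0.05, 0.10, 0.20, 0.30, 0.40)
round(dbinom(3, size = 20, prob = theta), 3)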

  7. Maximum Likelihood - Example. For the data Z = 3 the likelihood function is shown in the plots below. [Two plots: "P(Z=3) as a function of" the recombination fraction (vertical axis: likelihood) and "log P(Z=3) as a function of" the recombination fraction (vertical axis: log-likelihood), each plotted over the range 0 to 1; the horizontal axis is labeled pi in the original figure.]
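A short R sketch that draws roughly these two panels (the exact styling of the slide's figure is not available, so the labels below are my own):

theta <- seq(0.001, 0.999, by = 0.001)
lik <- dbinom(3, size = 20, prob = theta)
par(mfrow = c(2, 1))
plot(theta, lik, type = "l", xlab = "theta", ylab = "likelihood",
     main = "P(Z=3) as a function of theta")
plot(theta, log(lik), type = "l", xlab = "theta", ylab = "log-likelihood",
     main = "log P(Z=3) as a function of theta")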

  8. Maximum Likelihood. We can use calculus to find the maximum of the (log) likelihood function by setting d log L / dθ = 0:
d/dθ [ Z log(θ) + (20 - Z) log(1 - θ) ] = 0
Z/θ - (20 - Z)/(1 - θ) = 0
θ̂ = Z/20
Not surprisingly, the likelihood in this example is maximized at the observed proportion, 3/20. Sometimes (e.g. this example) the MLE has a simple closed form. In more complex problems, numerical optimization is used; computers can find these maximum values!
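A quick numerical check of this closed form, sketched with R's one-dimensional optimizer (the grid boundaries are my own choice):

# log-likelihood for Z = 3, N = 20 (the constant binomial coefficient is dropped)
loglik <- function(theta) 3 * log(theta) + 17 * log(1 - theta)
# the maximizer should be close to Z/20 = 3/20 = 0.15
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum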

  9. Maximum Likelihood - Notation.
L(θ) = the likelihood as a function of the parameter, θ.
l(θ) = log(L(θ)), the log-likelihood; usually more convenient to work with analytically and numerically.
S(θ) = dl(θ)/dθ = the score. Set dl(θ)/dθ = 0 and solve for θ to find the MLE.
I(θ) = -d²l(θ)/dθ² = the information. The inverse of the expected information gives the variance of θ̂: Var(θ̂) = E(I(θ))^-1 (in most cases).

  10. Maximum Likelihood - Example.
L(θ) = (20 choose Z) θ^Z (1 - θ)^(20 - Z)
l(θ) = Z log(θ) + (20 - Z) log(1 - θ)
S(θ) = Z/θ - (20 - Z)/(1 - θ)
I(θ) = Z/θ² + (20 - Z)/(1 - θ)²
E(I(θ)) = 20θ/θ² + (20 - 20θ)/(1 - θ)² = 20 / (θ(1 - θ))
(note: the constant is dropped from l(θ))
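Plugging in Z = 3, a small R sketch of the resulting variance and standard error of the MLE (the slide implies this calculation; the specific numbers below are my own):

theta_hat <- 3 / 20                          # MLE from the earlier slide
info <- 20 / (theta_hat * (1 - theta_hat))   # expected information E(I(theta)) evaluated at theta_hat
var_hat <- 1 / info                          # approximate Var(theta_hat), about 0.0064
sqrt(var_hat)                                # standard error, about 0.08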

  11. Numerical Optimization. In complex problems it may not be possible to find the MLE analytically; in that case we use numerical optimization to search for the value of θ that maximizes the likelihood. A common problem with maximum likelihood estimation is accidentally finding a local maximum instead of a global one; the solution is to try multiple starting values. [Figure: a likelihood curve illustrating a local and a global maximum.]

  12. Maximum Likelihood Example (numerical). If you have access to R, here is code to numerically find the MLE for the binomial problem that we solved earlier. Try running it.

# Numerical mle example
loglike = function(theta, z, n){
  # maximize loglike = minimize negative loglike
  -(z*log(theta) + (n-z)*log(1-theta))
}
# initialize theta
init = .5
# numerical optimization with boundaries
# function fails if theta = 0 or 1 so keep away from boundaries
eps = .Machine$double.eps
# optim minimizes the function loglike
optim(init, loglike, method="L-BFGS-B", lower=eps, upper=1-eps, z=3, n=20)
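Following the advice on the previous slide, one way to guard against local maxima is to rerun the optimizer from several starting values and keep the best result. A sketch, reusing loglike and eps from the code above (this particular likelihood is unimodal, so every start should land at about 0.15):

starts <- c(0.05, 0.25, 0.5, 0.75, 0.95)
fits <- sapply(starts, function(init)
  optim(init, loglike, method = "L-BFGS-B",
        lower = eps, upper = 1 - eps, z = 3, n = 20)$par)
fits   # each run should return roughly 0.15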

  13. Maximum Likelihood - Comments. Maximum likelihood estimates (MLEs) are always based on a probability model for the data. Maximum likelihood is the best method of estimation for any situation in which you are willing to write down a probability model (so it generally does not apply to nonparametric problems). Maximum likelihood can be used even when there are multiple unknown parameters, in which case θ has several components (i.e. θ0, θ1, ..., θp). The MLE is a point estimate (i.e. it gives the single most likely value of θ). In lecture 5 we will learn about interval estimates, which describe a range of values that are likely to include the true value of θ. We combine the MLE and Var(θ̂) to generate these intervals. The likelihood function lets us compare different models (next).

  14. Model Comparisons. Q: Suppose we have two alternative models for the data; in each case we use maximum likelihood to estimate the parameters. How do we decide which model fits the data better? A: First thought - compare the likelihoods. A larger likelihood is better, but the tradeoff is that a larger likelihood typically comes from a more complex model. How to choose? A common approach is to penalize the likelihood for more complex models (i.e. more parameters). The AIC and BIC are two examples of penalized likelihood measures.

  15. Model Comparisons - AIC, BIC.
AIC (Akaike's Information Criterion) = 2·l - 2k
BIC (Bayes Information Criterion) = 2·l - k·log(N)
l = log-likelihood, k = number of parameters, N = sample size.
Use AIC, BIC to compare a series of models; pick the model with the largest AIC or BIC. A larger model gives a larger likelihood (typically); therefore, penalize the likelihood for each added parameter. AIC tries to find the model that would have the minimum prediction error on a new set of data. BIC tries to find the model with the highest posterior probability given the data. Typically, BIC is more conservative (picks smaller models).

  16. Example - AIC, BIC. Continue with the recombinant example. We have N = 20 gametes and Z = 3 recombinants. Let θ be the recombination fraction between the two loci. Recall that the data can be modeled using the binomial distribution:
P(Z; θ) = (N choose Z) θ^Z (1 - θ)^(N - Z)
The situation of no linkage corresponds to θ = 0.5, so we can express the models as
Model 1: θ = 0.5
Model 2: θ anywhere between 0 and 0.5

  17. Example - AIC, BIC.
Model 1: The situation of no linkage corresponds to θ = 0.5. If we substitute this into the likelihood equation, we get
l1 = Z log(0.5) + (N - Z) log(0.5) = N log(0.5)
This model has 0 (free) parameters.
Model 2: The log-likelihood when θ is unrestricted is
l2 = Z log(θ) + (N - Z) log(1 - θ)
This model has 1 parameter. Recall, the MLE of θ is θ̂ = Z/N. If we substitute this back into the log-likelihood, we get
l2 = Z log(Z/N) + (N - Z) log(1 - Z/N)

  18. Example - AIC, BIC. AIC = 2·l - 2k; BIC = 2·l - k·log(N). Here are the AIC and BIC calculations for N = 20, Z = 3:
l1 = N log(0.5) = -13.86
l2 = Z log(θ̂) + (N - Z) log(1 - θ̂) = -8.45
Model              θ̂     log-likelihood   k   AIC      BIC
L1 (θ = .5)        .5     -13.86           0   -27.72   -27.72
L2 (θ arbitrary)   .15    -8.45            1   -18.91   -19.90
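The table can be checked with a few lines of R (a sketch; values agree with the slide up to rounding, and under the slide's "pick the largest" convention Model 2 wins on both criteria):

N <- 20; Z <- 3
l1 <- N * log(0.5)                               # Model 1: theta fixed at 0.5
l2 <- Z * log(Z/N) + (N - Z) * log(1 - Z/N)      # Model 2: theta estimated as Z/N = 0.15
aic <- c(2*l1 - 2*0, 2*l2 - 2*1)                 # 2l - 2k
bic <- c(2*l1 - 0*log(N), 2*l2 - 1*log(N))       # 2l - k*log(N)
round(rbind(loglik = c(l1, l2), AIC = aic, BIC = bic), 2)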

  19. Bayes Estimation. Recall Bayes' theorem (written in terms of data X and parameter θ):
P(θ | X) = P(X | θ) P(θ) / P(X),  where P(X) = Σθ P(X | θ) P(θ)
Notice the change in perspective: θ is now treated as a random variable instead of a fixed number. P(X | θ) is the likelihood function, as before. P(θ) is called the prior distribution of θ. P(θ | X) is called the posterior distribution of θ and is used for estimation. Based on P(θ | X) we can define a number of possible estimators of θ. A commonly used estimate is the maximum a posteriori (MAP) estimate:
θ_MAP = argmax over θ of P(θ | X)
We can also use P(θ | X) to define credible intervals for θ.

  20. Bayes Estimation. The MAP estimator is a very simple Bayes estimator. More generally, Bayes estimators minimize a loss function, a penalty based on how far θ̂ is from θ (e.g. loss = (θ̂ - θ)²). The Bayesian procedure provides a convenient way of combining external information or previous data (through the prior distribution) with the current data (through the likelihood) to create a new estimate. As N increases, the data (through the likelihood) overwhelms the prior and the Bayes estimator typically converges to the MLE. Controversy arises when P(θ) is used to incorporate subjective beliefs or opinions. If the prior distribution P(θ) simply says that θ is uniformly distributed over all possible values, this is called an uninformative prior, and the MAP is the same as the MLE.

  21. Bayes Estimation Example. Suppose a man is known to have transmitted allele A1 to his child at a locus that has only two alleles: A1 and A2. What is his most likely genotype? Solution: Let X represent the paternal allele in the child and let θ represent the man's genotype: X = A1, θ ∈ {A1A1, A1A2, A2A2}. We can write the likelihood function as: P(X | θ = A1A1) = 1; P(X | θ = A1A2) = .5; P(X | θ = A2A2) = 0. Therefore, the MLE is θ̂ = A1A1.

  22. Bayes Estimation. Suppose, however, that we know that the frequency of the A1 allele in the general population is only 1%. Assuming Hardy-Weinberg equilibrium we have
P(θ = A1A1) = .0001, P(θ = A1A2) = .0198, P(θ = A2A2) = .9801
Also, P(X) = Σθ P(X | θ) P(θ) = 1(.0001) + .5(.0198) + 0(.9801) = .01
This leads to the posterior distribution
P(θ = A1A1 | X) = P(X | θ = A1A1) P(θ = A1A1) / P(X) = 1 x .0001 / .01 = .01
P(θ = A1A2 | X) = P(X | θ = A1A2) P(θ = A1A2) / P(X) = .5 x .0198 / .01 = .99
P(θ = A2A2 | X) = 0
So the Bayesian MAP estimator is θ̂ = A1A2.
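The same posterior calculation in a few lines of R (a sketch; the prior and likelihood values are those given on the slide):

prior <- c(A1A1 = 0.0001, A1A2 = 0.0198, A2A2 = 0.9801)   # Hardy-Weinberg prior with freq(A1) = 0.01
lik   <- c(A1A1 = 1,      A1A2 = 0.5,    A2A2 = 0)         # P(X = A1 | genotype)
posterior <- lik * prior / sum(lik * prior)
posterior                                 # 0.01, 0.99, 0.00
names(posterior)[which.max(posterior)]    # MAP estimate: "A1A2"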

  23. Summary. Maximum likelihood is a method of estimating parameters from data. ML requires you to write a probability model for the data. MLEs may be found analytically or numerically. The (inverse of the negative of the) second derivative of the log-likelihood gives the variance of the estimates. Comparison of log-likelihoods allows us to choose between alternative models. Bayesian procedures allow us to incorporate additional information about the parameters in the form of prior data, external information or personal beliefs.

  24. Extra Problems
1. Redo the previous problem assuming the man has 2 children who both have the A1 paternal allele.
2. Suppose 197 animals are distributed into five categories with frequencies (95, 30, 18, 20, 34). A genetic model for the population predicts the following frequencies for the categories: (.5, .25*p, .25*(1-p), .25*(1-p), .25*p). Use maximum likelihood to estimate p (Hint: use the multinomial distribution).

  25. Extra Problems
3. Suppose we are interested in estimating the recombination fraction, θ, from the following experiment. We do a series of crosses, AB/ab x AB/ab, and measure the frequency of the various phases in the gametes (assume we can do this). If the recombination fraction is θ then we expect the following probabilities:
phase   probability (x4)
AB      3 - 2θ + θ²
Ab      2θ - θ²
aB      2θ - θ²
ab      1 - 2θ + θ²
Suppose we observe (AB, Ab, aB, ab) = (125, 18, 20, 34). Use maximum likelihood to estimate θ and find the variance of the estimate. (This will require a numerical solution.)

  26. Extra Problems
4. Every human being can be classified into one of four blood groups: O, A, B, AB. Inheritance of these blood groups is controlled by 1 gene with 3 alleles: O, A and B, where O is recessive to A and B. Suppose the frequency of these alleles is r, p, and q, respectively (p+q+r=1). If we observe (O, A, B, AB) = (176, 182, 60, 17), use maximum likelihood to estimate r, p and q.

  27. Extra Problems
5. Suppose we wish to estimate the recombination fraction (θ) for a particular locus. We observe N = 50 and R = 18. Several previously published studies of the recombination fraction in nearby loci (that we believe should have similar recombination fractions) have shown recombination fractions between .22 and .44. We decide to model this prior information as a beta distribution (see http://en.wikipedia.org/wiki/Beta_distribution) with parameters a = 19 and b = 40:
P(θ) = [Γ(a + b) / (Γ(a) Γ(b))] θ^(a-1) (1 - θ)^(b-1)
Find the MLE and Bayesian MAP estimators of the recombination fraction. Also find a 95% confidence interval (for the MLE) and a 95% credible interval (for the MAP).
