Tools and Methods (A. Tilquin)


This content discusses statistical techniques such as frequentist and Bayesian approaches, and tools like Minuit and MCMC. It covers concepts like frequentist statistics, central limit theorem, maximum likelihood estimation, minimization methods, bias estimation, and error computation for physical parameters. The discussion includes practical examples and insights into optimizing statistical models.





Presentation Transcript


  1. Tools and methods (A. Tilquin). Which statistics: frequentist or Bayesian? Which tools: a minimizer (Minuit) or MCMC?

  2. Frequentist statistics. Definition: probability is interpreted as the frequency of the outcome of a repeatable experiment. Central limit theorem: if you repeat a measurement N times, then as N -> infinity the distribution (pdf) of the measurements tends to a Gaussian centered on the mean value, with a half-width equal to the error of your experiment.

  3. Maximum likelihood. What is the best curve? Answer: the most probable curve! The probability of the theoretical curve is the product of the probabilities of each individual point to lie around the curve:

  L = \prod_{i=1}^{n} p(m_i, m(\theta_k, z_i)) = \prod_{i=1}^{n} \frac{1}{\sigma_{m_i}\sqrt{2\pi}} \exp\left(-\frac{(m_i - m(\theta_k, z_i))^2}{2\sigma_{m_i}^2}\right)

  We have to maximize L with respect to \theta_k: \partial L/\partial\theta_k = 0. Because it is simpler to work with a sum, take the logarithm:

  \ln L = \text{const} - \frac{1}{2}\sum_{i=1}^{n} \frac{(m_i - m(\theta_k, z_i))^2}{\sigma_{m_i}^2} = \text{const} - \frac{\chi^2}{2}

  So maximizing L is equivalent to minimizing \chi^2 = \sum_{i=1}^{n} (m_i - m(\theta_k, z_i))^2/\sigma_{m_i}^2 with respect to \theta_k.

  4. Minimization of \chi^2 and bias estimate. With \chi^2(\theta_k) = \sum_i (m_i - m(\theta_k, z_i))^2/\sigma_{m_i}^2, we Taylor expand the \chi^2 around \theta_k^0:

  \chi^2(\theta_k) \approx \chi^2(\theta_k^0) + (\theta_k - \theta_k^0)^T \left.\frac{\partial\chi^2}{\partial\theta_k}\right|_{\theta_k^0} + \frac{1}{2}(\theta_k - \theta_k^0)^T \left.\frac{\partial^2\chi^2}{\partial\theta_k\partial\theta_l}\right|_{\theta_k^0} (\theta_l - \theta_l^0)

  Applying the minimum condition \partial\chi^2/\partial\theta_k = 0 to this expansion, we get the first-order iterative equation:

  \theta_k^1 = \theta_k^0 + (J^T U J)^{-1} J^T U \,(m - m(\theta_k^0, z)), \qquad J_{ik} = \frac{\partial m(\theta_k, z_i)}{\partial\theta_k}, \qquad U = V^{-1}

  where J is the n x p Jacobian (n points, p parameters) and U the weight matrix. If the theoretical model is linear, this equation is exact (no iteration).
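A minimal Python sketch of this iterative equation (numpy assumed); the toy straight-line model, the data and all names are illustrative assumptions, not from the slides:

```python
# One step of theta_1 = theta_0 + (J^T U J)^-1 J^T U (m - m(theta_0, z)),
# assuming uncorrelated Gaussian errors (diagonal weight matrix U).
import numpy as np

def gauss_newton_step(theta, z, m_obs, sigma_m, model, jacobian):
    """One iteration of the chi^2 minimisation equation."""
    U = np.diag(1.0 / sigma_m**2)           # weight matrix = V^-1
    J = jacobian(theta, z)                  # shape (n_points, n_params)
    r = m_obs - model(theta, z)             # residuals m - m(theta, z)
    JTU = J.T @ U
    return theta + np.linalg.solve(JTU @ J, JTU @ r)

# Toy linear model m = a + b*z, so a single step is exact (no iteration).
model = lambda th, z: th[0] + th[1] * z
jacobian = lambda th, z: np.column_stack([np.ones_like(z), z])

rng = np.random.default_rng(1)
z = np.linspace(0.1, 1.0, 20)
sigma_m = 0.1 * np.ones_like(z)
m_obs = model([24.0, 1.5], z) + rng.normal(0, sigma_m)
print(gauss_newton_step(np.array([0.0, 0.0]), z, m_obs, sigma_m, model, jacobian))
```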

  5. Computing errors. When the \chi^2 is defined on measured variables (i.e. magnitudes), how do we compute the errors on the physical parameters \theta_k? With \chi^2 = (m - m_{th})^T V^{-1} (m - m_{th}), we perform a Taylor expansion of the \chi^2 around the minimum:

  \chi^2(\theta) = \chi^2_{min} + \frac{1}{2}(\theta - \theta_{min})^T \left.\frac{\partial^2\chi^2}{\partial\theta_k\partial\theta_l}\right|_{\theta_{min}} (\theta - \theta_{min})

  (the first-derivative term vanishes at the minimum). If the transformation m(\theta_k) is linear, then: the second derivative of the \chi^2 is a symmetric positive matrix; the errors on \theta_k are Gaussian; and the covariance is

  \langle(\theta_k - \theta_k^0)(\theta_l - \theta_l^0)\rangle = (U^{-1})_{kl}, \qquad U_{kl} = \frac{1}{2}\frac{\partial^2\chi^2}{\partial\theta_k\partial\theta_l}

  6. Computing errors (2). Simple case: \chi^2 = \sum_{i=1}^{n} (m_i - m(\theta_k, z_i))^2/\sigma_{m_i}^2. Then

  \frac{1}{2}\frac{\partial^2\chi^2}{\partial\theta_k\partial\theta_l} = \sum_i \frac{1}{\sigma_{m_i}^2}\frac{\partial m(\theta_k, z_i)}{\partial\theta_k}\frac{\partial m(\theta_k, z_i)}{\partial\theta_l} - \sum_i \frac{m_i - m(\theta_k, z_i)}{\sigma_{m_i}^2}\frac{\partial^2 m(\theta_k, z_i)}{\partial\theta_k\partial\theta_l}

  If m(\theta_k, z_i) is linear, the second (curvature) term is exactly zero and only the Jacobian term remains:

  U_{kl} = \sum_i \frac{\partial m(\theta_k, z_i)}{\partial\theta_k}\, V^{-1}_{ii}\, \frac{\partial m(\theta_k, z_i)}{\partial\theta_l}

  This is the Fisher analysis: the errors on the physical parameters are deduced by a simple projection onto the \theta_k parameter space (linear approximation), independent of the measured points. If m(\theta_k, z_i) is not linear, Fisher is a good approximation only if the curvature term is small:

  \left|\sum_i \frac{m_i - m(\theta_k, z_i)}{\sigma_{m_i}^2}\frac{\partial^2 m}{\partial\theta_k\partial\theta_l}\right| \ll U_{kl}
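As an illustration, a minimal numerical Fisher sketch under the linear approximation just described; the central-difference Jacobian, the toy straight-line model and all names are illustrative assumptions, not from the slides:

```python
# Fisher matrix U_kl = sum_i (1/sigma_i^2) dm/dtheta_k dm/dtheta_l,
# covariance of the parameters = U^-1 (linear approximation).
import numpy as np

def fisher_matrix(model, theta, z, sigma_m, eps=1e-6):
    theta = np.asarray(theta, dtype=float)
    n_par = theta.size
    J = np.empty((z.size, n_par))
    for k in range(n_par):                       # numerical Jacobian
        dth = np.zeros(n_par); dth[k] = eps
        J[:, k] = (model(theta + dth, z) - model(theta - dth, z)) / (2 * eps)
    U = (J / sigma_m[:, None]**2).T @ J          # J^T V^-1 J
    return U, np.linalg.inv(U)                   # Fisher matrix, covariance

# Usage with a toy line model: sqrt of the diagonal gives 1-sigma errors.
model = lambda th, z: th[0] + th[1] * z
z = np.linspace(0.1, 1.0, 20); sigma_m = 0.1 * np.ones_like(z)
U, cov = fisher_matrix(model, [24.0, 1.5], z, sigma_m)
print(np.sqrt(np.diag(cov)))
```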

  7. Non-linearity. If m(\theta_k) is linear in \theta_k and the errors on m_i are Gaussian, then: the errors on \theta_k are Gaussian; \chi^2(\theta_k) is exactly a quadratic form; the covariance matrix

  \langle(\theta_k - \theta_k^0)(\theta_l - \theta_l^0)\rangle = (U^{-1})_{kl}, \qquad U_{kl} = \sum_i \frac{\partial m(\theta, z_i)}{\partial\theta_k}\, V^{-1}_{ii}\, \frac{\partial m(\theta, z_i)}{\partial\theta_l}

  is positive and symmetric; and the Fisher analysis is rigorously exact. On the contrary, for a non-linear model the Fisher matrix is only a linear approximation and only the \chi^2_1 prescription is rigorously exact. The only valid properties are: the best fit is given by \partial\chi^2/\partial\theta_k|_{\theta_k^0} = 0, and the s-sigma error is given by solving \chi^2_1 = \chi^2_{min} + s^2.

  8. Non-linearity: computing errors. If m(\theta_k) is not linear, the errors on \theta_k are not Gaussian and the Fisher analysis is no longer correct; if one uses it anyway, the results should be verified a posteriori. To estimate the errors we should come back to the first definition of \chi^2_1 and solve the equation \chi^2_1 = \chi^2_{min} + 1, with

  \chi^2(\theta, M) = (m - m(\theta, M))^T V^{-1} (m - m(\theta, M))

  If we want to estimate \sigma(\theta), what about the nuisance parameter M? How do we take care of the correlation? How do we marginalize over M? Two choices:
  - Average answer (simulation): \chi^2_1(\theta) = \int \chi^2_1(\theta, M)\, p(M)\, dM
  - Most probable answer (data): \chi^2_1(\theta) = \min_M \chi^2_1(\theta, M)
  It can be shown that both methods are equivalent for simulations if the simulated points are not randomized (m_{mes} = m_{th}).
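As an illustration of the "most probable answer", a minimal Python sketch that profiles a nuisance offset M out of a toy linear model and reads off the interval where \chi^2_1 = \chi^2_{min} + 1; the toy model and all names are illustrative assumptions, not the cosmological fit:

```python
# Profile chi^2 over the nuisance parameter M at each fixed theta,
# then take the 1-sigma interval where chi2_profile <= chi2_min + 1.
import numpy as np

def chi2(theta, M, z, m_obs, sigma):
    return np.sum(((m_obs - (M + theta * z)) / sigma) ** 2)

def profile_chi2(theta, z, m_obs, sigma):
    # For this linear toy model, the best M at fixed theta is the
    # weighted mean of the residuals (analytic minimisation over M).
    w = 1.0 / sigma**2
    M_best = np.sum(w * (m_obs - theta * z)) / np.sum(w)
    return chi2(theta, M_best, z, m_obs, sigma)

rng = np.random.default_rng(2)
z = np.linspace(0.1, 1.0, 30); sigma = 0.1 * np.ones_like(z)
m_obs = -19.3 + 1.7 * z + rng.normal(0, sigma)

thetas = np.linspace(1.0, 2.4, 281)
prof = np.array([profile_chi2(t, z, m_obs, sigma) for t in thetas])
inside = thetas[prof <= prof.min() + 1.0]       # chi2_1 = chi2_min + 1
print(f"theta = {thetas[prof.argmin()]:.3f}  [{inside.min():.3f}, {inside.max():.3f}]")
```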

  9. Non-linearity: example. [Figure: evolution of \chi^2_1 - \chi^2_{min} for a SNAP simulation with flatness imposed at 1%, comparing the \chi^2 = \chi^2_{min} + 1 contour with the Fisher analysis. A secondary minimum produces asymmetric +/- errors on the fitted values of w.] Remark: this secondary minimum is largely due to non-linearity.

  10. Non-Gaussianity. When the errors on the observables are not Gaussian, only the iterative minimum equation can be used. So go back to the definition, "probability is interpreted as the frequency of the outcome of a repeatable experiment", and do simulations: Gedanken experiments (sketched in code below).
  a) Determine the cosmological model {\theta_k^0} by looking for the best-fit parameters on the data. This set of parameters is assumed to be the true cosmology.
  b) Compute the expected observables and randomize them within the experimental errors, taking non-Gaussianity into account. Do the same thing with the priors.
  c) For each virtual experiment, compute the new minimum to get a new set of cosmological parameters {\theta_k^i}.
  d) Simulate as many virtual experiments as you can.
  e) The distribution of these best-fit values {\theta_k^i} gives the errors. The error matrix is given by the second-order moments U_{ij} = \langle\theta_i\theta_j\rangle - \langle\theta_i\rangle\langle\theta_j\rangle, which is positive definite. The error on the errors scales as \sigma(\sigma) \sim \sigma/\sqrt{2N}.
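A minimal sketch of this recipe for a toy linear model (illustrative, not the cosmological fit; for non-Gaussian errors the `rng.normal` draw in step (b) would be replaced by the appropriate pdf):

```python
# Gedanken experiments: refit many randomized virtual data sets and
# take second-order moments of the best-fit values.
import numpy as np

rng = np.random.default_rng(3)
z = np.linspace(0.1, 1.0, 20); sigma = 0.1 * np.ones_like(z)
A = np.column_stack([np.ones_like(z), z])        # design matrix (linear model)

def best_fit(m_obs):
    # Weighted least squares = chi^2 minimum for a linear model.
    U = np.diag(1.0 / sigma**2)
    return np.linalg.solve(A.T @ U @ A, A.T @ U @ m_obs)

theta_true = best_fit(24.0 + 1.5 * z + rng.normal(0, sigma))   # step (a)
fits = []
for _ in range(2000):                                          # steps (b)-(d)
    m_virtual = A @ theta_true + rng.normal(0, sigma)          # randomized observables
    fits.append(best_fit(m_virtual))
fits = np.array(fits)
U_err = np.cov(fits.T)                                         # step (e): second moments
print(np.sqrt(np.diag(U_err)))
```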

  11. Frequentist statistics: mathematically perfectly defined; local conservation of probabilities; non-linearity and non-Gaussianity solved by simulation. The main tool: MINUIT. Limitations: the theoretical model must be differentiable; priors such as \Omega_m > 0 are difficult to use; beware of correlations (numerical instability); marginalization (contours) is CPU-hungry.

  12. Bayesian statistics, or the complexity of interpretation. [Portrait: Thomas Bayes, 1702-1761; his paper was only published posthumously, in 1764.]

  13. Bayes' theorem.

  Posterior = (Likelihood x Prior) / Evidence

  where "posterior" means after the measurement (>) and "prior" before it (<). For a model M_i given evidence E:

  p(M_i | E) = \frac{p(E | M_i)\, p(M_i)}{\sum_j p(E | M_j)\, p(M_j)}

  The normalization factor (the evidence, or marginal likelihood) is the sum over all possible posteriors; it ensures the unitarity of the probability.

  14. Example. Question: suppose you have been tested positive for a disease; what is the probability that you actually have the disease?
  Efficiency of the test: L(T^+ | D) = L(T^- | \bar{D}) = 95%, L(T^+ | \bar{D}) = L(T^- | D) = 5%.
  The disease is rare: p(D) = 1%, p(\bar{D}) = 99%.
  What is the Bayesian probability?

  p(D | T^+) = \frac{L(T^+ | D)\, p(D)}{L(T^+ | D)\, p(D) + L(T^+ | \bar{D})\, p(\bar{D})} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} \approx 16\%

  Why is the Bayesian probability (16%) so small compared to the likelihood probability of 95%? Which method is wrong?
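A quick numerical check of this arithmetic (plain Python, using only the numbers on the slide):

```python
# Bayes' theorem for the disease-test example.
p_D, p_notD = 0.01, 0.99            # prior: disease is rare
L_pos_D, L_pos_notD = 0.95, 0.05    # test efficiency
posterior = L_pos_D * p_D / (L_pos_D * p_D + L_pos_notD * p_notD)
print(f"p(D | T+) = {posterior:.1%}")   # ~16.1%
```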

  15. Intuitive argument. Out of 100 people, the doctor expects 1 person to have the disease and 99 not to. If the doctor tests everybody: the 1 diseased person will probably have a positive test; about 5 people will have a positive test while they have no disease; that is ~6 positive tests for only 1 true disease. So the probability for a patient to have the disease when the test is positive is 1/6 ~ 16%. => Is the likelihood wrong? In this argument the doctor used the whole population to compute the probability: 1% diseased and 99% healthy before the test. He has assumed that the patient is a random person, whose state before the measurement is a superposition of two states: |patient> = 0.01 |diseased> + 0.99 |healthy>. But what about yourself before the test? |you> = |diseased> or |healthy>, but not both states at the same time: |you> != |patient>. => Is the Bayesian wrong?

  16. Which statistic is correct? Both! But they do not answer the same question:
  Frequentist: if my test is positive, what is the probability for me to have the disease? 95%.
  Bayesian: if one of the patients has a positive test, what is the probability for this patient to have the disease? 16%.
  Different questions give different answers! Conclusion: in Bayesian statistics the most important ingredient is the prior, because it can change the question! Just playing with the prior can solve a lot of different problems; that is why statisticians like Bayesian statistics, together with the interpretation of the Bayesian probability. In scientific work, on the contrary, we should be careful about the prior.

  17. Bayesian statistics and cosmology. In cosmology a simplified version of Bayes' theorem is used:

  Posterior ~ Likelihood x Prior

  The normalization factor (evidence) is only useful for model comparison (evidence ratios, F-values); to estimate parameters and errors it is not needed. Usual priors in cosmology are:
  - Gaussian prior: \Omega_m = 0.3 +/- 0.05. Usually refers to an experiment; equivalent to the frequentist constraint.
  - Delta-function prior: \Omega_k = \delta(0). Constraints may be artificially tight!
  - Flat (top-hat) prior: p(\Omega_m > 0) != 0 and p(\Omega_m < 0) = 0; or P(w_0, w_1) != 0 if -1.5 < w_0 < 0 and -2 < w_1 < 1. If too narrow it can bias the final results; if wide enough it has no influence and only affects the evidence.

  18. Parameter estimation in Bayesian statistics. Taking the logarithm of Bayes' theorem:

  -2\ln(\text{posterior}) = -2\ln(\text{likelihood}) - 2\ln(\text{prior}), \qquad \text{i.e.} \qquad \chi^2_{post} = \chi^2 - 2\ln(\text{prior})

  But the posterior \chi^2 is no longer a quadratic form, except if the prior is Gaussian. It may not even be defined, for example if the prior is a step function (\Omega_m > 0): the second derivatives

  U_{kl} = \frac{1}{2}\frac{\partial^2\chi^2}{\partial\theta_k\partial\theta_l}

  may not exist. So in general we work on the posterior likelihood directly. But then how do we find the maximum likelihood and how do we compute the errors? Use MCMC to explore the posterior likelihood.

  19. Monte Carlo Markov Chain for Bayesian statistics. MCMC explores the likelihood with respect to the cosmological parameters using a chain of correlated points. It is a random walk using the Metropolis method. How do we simulate a set of points such that the point density is proportional to a given pdf p?
  1) Start from x_0 = x_i.
  2) Simulate a random step \delta and compute x_{i+1} = x_i \pm \delta.
  3) If p(x_{i+1}) > p(x_i), keep the point: x_{i+1} -> x_i, go to 2.
  4) If not, simulate a random number r between 0 and p(x_i).
  5) If r > p(x_{i+1}), reject the point and go to 2.
  6) If r < p(x_{i+1}), keep the point: x_{i+1} -> x_i, go to 2.
  The central limit theorem tells us that for N -> infinity the set of points {x_i} converges toward the true pdf (typically 1 million points for MCMC in cosmology).
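A minimal sketch of this random walk, implementing the acceptance rule literally as stated above (it is equivalent to the standard Metropolis rule of accepting with probability p(x_new)/p(x_old)); the target pdf, step size and names are illustrative:

```python
# Metropolis random walk whose point density follows the target pdf.
import numpy as np

def metropolis(pdf, x0, step, n_samples, rng):
    chain = [np.asarray(x0, dtype=float)]
    for _ in range(n_samples - 1):
        x = chain[-1]
        x_new = x + rng.normal(0.0, step, size=x.shape)   # random step
        if pdf(x_new) > pdf(x):
            chain.append(x_new)                           # always keep uphill moves
        else:
            r = rng.uniform(0.0, pdf(x))                  # r in [0, p(x_i)]
            # On rejection the current point is recorded again
            # (standard Metropolis bookkeeping).
            chain.append(x_new if r < pdf(x_new) else x)
    return np.array(chain)

# Usage: sample a 1-D Gaussian posterior (an unnormalized pdf is enough).
rng = np.random.default_rng(4)
pdf = lambda x: np.exp(-0.5 * np.sum(x**2))
chain = metropolis(pdf, [3.0], step=1.0, n_samples=100_000, rng=rng)
print(chain.mean(), chain.std())                          # ~0 and ~1
```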

  20. Computing errors. The best parameters are given by the maximum-likelihood point reached by the chain. The error matrix is given by the second-order moments U_{ij} = \langle\theta_i\theta_j\rangle - \langle\theta_i\rangle\langle\theta_j\rangle. Because the chain points are correlated, the error on the errors does not scale as \sigma/\sqrt{2N}; convergence must be checked. Convergence test: use two subsets of the chain, compute both error matrices and require U[1] ~ U[2].
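Continuing the Metropolis sketch above (reusing its `chain` array and numpy import), a minimal version of this split-chain check:

```python
# Split-chain convergence check: the second-moment error matrices of
# the two halves should agree.
half = len(chain) // 2
U1 = np.cov(chain[:half].T)    # U_ij = <theta_i theta_j> - <theta_i><theta_j>
U2 = np.cov(chain[half:].T)
print(U1, U2)                  # U[1] ~ U[2] signals convergence
```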

  21. Methods comparison using CMB+BAO+WL+SN in the (w_0, w_a) plane: frequentist (\chi^2 = \chi^2_{min} + s^2 and Gedanken experiments) versus Bayesian with MCMC (MCMC contours from Li et al. 2009, 68% and 95% CL). Gedanken experiments and MCMC give the same results and require about the same computing time (3 to 4 years on a single processor).

  22. Datagrid (DIRAC)

  23. Conclusions. Frequentist or Bayesian? It does not matter: both are acceptable. However, if the theoretical model is not defined over the whole parameter space, or if the correlations are important, Bayesian statistics (MCMC) is better suited. The main tools: Minuit, very efficient and thoroughly debugged, but sometimes unstable; MCMC, a bulldozer, but beware of convergence. In both cases the CPU time is several years (>10): you need either a big server (at least 128 CPUs) or the computing grid (difficult to use).

  24. Outlook. General problem. The frequentist statistics: likelihood, log-likelihood and \chi^2; properties of the \chi^2; Fisher analysis and its limitations; Gedanken experiments. The Bayesian statistics: Bayes' theorem; example and interpretation; priors in cosmology and their consequences; MCMC. Summary.

  25. General problem (1). Let us assume we have N supernovae at different redshifts, (m_i \pm \sigma_{m_i}, z_i), i = 1..N, and a given model m_{th} = m(\theta_k, z). How do we find the best curve?
  1. We look for the closest curve with respect to the data points:

  \min D^2 = \sum_i (m_i - m(\theta_k, z_i))^2

  But the worst-measured points should have less weight:

  \min \chi^2 = \sum_i \frac{(m_i - m(\theta_k, z_i))^2}{\sigma_{m_i}^2}

  26. General problem (2). 2. The problem now is: find the \theta_k values such that the \chi^2 is minimum,

  \frac{\partial}{\partial\theta_k}\sum_i \frac{(m_i - m(\theta_k, z_i))^2}{\sigma_{m_i}^2} = 0

  and compute the errors on \theta_k starting from the errors on m_i: \sigma_{\theta_k} = f(\sigma_{m_i}, m_i, \theta_k). Statistics is necessary.

  27. Gaussian probability. In all the following I will assume the errors are Gaussian, i.e. if we repeat the measurement N times, the distribution of measurements follows a Gaussian distribution:

  p(m; m_0, \sigma_m) = \frac{1}{\sigma_m\sqrt{2\pi}} \exp\left(-\frac{(m - m_0)^2}{2\sigma_m^2}\right)

  where p(m; m_0, \sigma_m)\,dm is the probability of measuring a value in [m, m + dm], with average value

  m_0 = \int_{-\infty}^{+\infty} m\; p(m; m_0, \sigma_m)\, dm

  and variance (error squared)

  \sigma_m^2 = \int_{-\infty}^{+\infty} (m - m_0)^2\; p(m; m_0, \sigma_m)\, dm

  28. N-variable case (correlation). Assume n Gaussian measurements with covariance matrix V: V_{ii} = \sigma_{m_i}^2 and V_{ij} = \rho\,\sigma_{m_i}\sigma_{m_j}, where \rho is the correlation coefficient. The joint pdf is

  p(m) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left(-\frac{1}{2}(m - m_0)^T V^{-1} (m - m_0)\right)

  Example with 2 variables:

  p(m_i, m_j) = \frac{1}{2\pi\sigma_i\sigma_j\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(m_i - m_i^0)^2}{\sigma_i^2} + \frac{(m_j - m_j^0)^2}{\sigma_j^2} - \frac{2\rho(m_i - m_i^0)(m_j - m_j^0)}{\sigma_i\sigma_j}\right]\right)

  If \rho = 0: p(m_i, m_j) = p(m_i)\,p(m_j). If \rho = \pm 1: m_j = f(m_i) and V is singular.

  29. Some \chi^2 properties. Probability and \chi^2: by definition

  \chi^2 = -2\ln(L/L_0) \quad\Longleftrightarrow\quad L = L_0\, e^{-\chi^2/2}

  Definition in matrix form: \chi^2 = (m - m_0)^T V^{-1} (m - m_0). If the measurements are uncorrelated, V_{ij} = \sigma_{m_i}^2\,\delta_{ij} and we recover the simple sum \chi^2 = \sum_i (m_i - m_i^0)^2/\sigma_{m_i}^2.
  Second derivative:

  \frac{1}{2}\frac{\partial^2\chi^2}{\partial m_i\,\partial m_j} = (V^{-1})_{ij}

  The first derivative of the \chi^2 gives the minimum; the second derivative gives the weight matrix, i.e. the inverse of the error matrix, independent of the measured data points.

  30. Probability density of the \chi^2. Assume we repeat the measurement of x N times. For each measurement k we have \chi^2_k = (x_k - x_0)^2/\sigma_x^2. Using probability conservation, p(\chi^2)\,d\chi^2 = g(x)\,dx with g the Gaussian pdf of x, which gives for one degree of freedom

  p(\chi^2) = \frac{1}{\sqrt{2\pi\chi^2}}\; e^{-\chi^2/2}

  In the case of n degrees of freedom, \chi^2 = \sum_{i=1}^{n} (x_i - x_i^0)^2/\sigma_i^2 and

  p(\chi^2, n) = \frac{(\chi^2)^{n/2-1}\; e^{-\chi^2/2}}{2^{n/2}\,\Gamma(n/2)}

  n_{dof} is the number of independent observations minus the number of parameters that must be estimated from the data sample. One can show that: 1. the average of the \chi^2 is n; 2. the variance is 2n. So the first test of compatibility between a model and an experiment is to verify \chi^2 \approx n, keeping in mind the variance of 2n (\chi^2 = n \pm \sqrt{2n}).

  31. P-value and confidence level. We define the p-value as the probability that a new experiment will give a higher \chi^2 than the observed one:

  p\text{-value} = \int_{\chi^2_{obs}}^{+\infty} p(\chi^2, n)\, d\chi^2

  (in the example shown, 10% of experiments give \chi^2 > 16). Definition of the standard deviation (confidence level): a measurement with an error at s sigma satisfies

  \chi^2(m_0 \pm s\,\sigma_m) = \chi^2_{min} + s^2

  For more variables measured simultaneously, a result at s standard deviations is still defined through the \chi^2, but the probabilities depend on the number of variables: for n = 1, \Delta\chi^2 = 1 gives p = 68%; for n = 2 it gives only p = 39%, and a larger \Delta\chi^2 is needed for p = 68%. For many variables we prefer to quote the probability!
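These quantities can be checked numerically; a minimal sketch assuming scipy is available (the degrees of freedom and observed \chi^2 are illustrative):

```python
# p-value and probability content of Delta chi^2 = s^2.
from scipy.stats import chi2

n_dof, chi2_obs = 9, 16.0
print(chi2.sf(chi2_obs, n_dof))   # p-value: integral of p(chi2, n) above chi2_obs
# Probability content of Delta chi^2 = 1 depends on the dimension:
print(chi2.cdf(1.0, df=1))        # ~0.683 for one parameter
print(chi2.cdf(1.0, df=2))        # ~0.393 for two parameters
```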

  32. P-value property. The probability density function h of the p-value: with

  p\text{-value}(\chi^2) = \int_{\chi^2}^{+\infty} p(\chi'^2, n)\, d\chi'^2

  the probability conservation law h(p\text{-value})\,d(p\text{-value}) = p(\chi^2)\,d\chi^2 gives

  h = p(\chi^2)\left|\frac{d(p\text{-value})}{d\chi^2}\right|^{-1} = \frac{p(\chi^2)}{p(\chi^2)} = \text{constant}

  If the model is correct and the errors are statistical only and correctly estimated, the probability density function of the p-value should be flat, i.e. the measured p-values are uniformly distributed between 0 and 1.

  33. Application to light-curve fits. Assume we have N light curves fitted with given templates; for each fit compute the p-value \int_{\chi^2}^{+\infty} p(\chi'^2, n)\,d\chi'^2. Good light curves give a flat p-value distribution; bad fits (wrong SN type, bad template) pile up at small p-values. Selection criterion: p-value(\chi^2) > 0.05, giving an efficiency of 95%. A distribution peaked near 1 means the errors are too big; a peak near 0 means the errors are too small or the template is bad.

  34. Example: change of variables. Assume we know the errors on m and \theta (no correlation). We would like to compute the errors on S = m + \theta and D = m - \theta, i.e. m = (S + D)/2 and \theta = (S - D)/2.
  We construct the weight matrix U = \mathrm{diag}(1/\sigma_m^2, 1/\sigma_\theta^2) and the Jacobian

  J = \begin{pmatrix} \partial m/\partial S & \partial m/\partial D \\ \partial\theta/\partial S & \partial\theta/\partial D \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & -1/2 \end{pmatrix}

  We project: V^{-1} = J^T U J. Inverting V^{-1} gives

  V = \begin{pmatrix} \sigma_m^2 + \sigma_\theta^2 & \sigma_m^2 - \sigma_\theta^2 \\ \sigma_m^2 - \sigma_\theta^2 & \sigma_m^2 + \sigma_\theta^2 \end{pmatrix}

  so \sigma_S^2 = \sigma_D^2 = \sigma_m^2 + \sigma_\theta^2 and the correlation is \rho_{SD} = (\sigma_m^2 - \sigma_\theta^2)/(\sigma_m^2 + \sigma_\theta^2).
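A minimal numpy sketch of this projection, written in the equivalent forward form V = J' U^{-1} J'^T, where J' is the Jacobian of (S, D) with respect to (m, \theta); the numerical values are illustrative:

```python
# Error propagation through a change of variables S = m + theta, D = m - theta.
import numpy as np

sigma_m, sigma_t = 0.3, 0.4                 # illustrative input errors
cov_in = np.diag([sigma_m**2, sigma_t**2])  # covariance of (m, theta) = U^-1
Jp = np.array([[1.0,  1.0],                 # dS/dm, dS/dtheta
               [1.0, -1.0]])                # dD/dm, dD/dtheta
V = Jp @ cov_in @ Jp.T                      # covariance of (S, D)
sig_S, sig_D = np.sqrt(np.diag(V))
rho = V[0, 1] / (sig_S * sig_D)             # (sigma_m^2 - sigma_t^2)/(sigma_m^2 + sigma_t^2)
print(sig_S**2, sig_D**2, rho)
```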

  35. External constraint or prior. Problem: using SN we would like to measure (\Omega_m, \Omega_\Lambda), knowing that from the CMB we have \Omega_T = \Omega_m + \Omega_\Lambda = 1.01 \pm 0.02. This measurement is independent of the SN measurement, so we can add it to the \chi^2:

  \chi^2 = \sum_{i=1}^{n} \frac{(m_i - m(\Omega_m, \Omega_\Lambda, z_i))^2}{\sigma_{m_i}^2} + \frac{(\Omega_m + \Omega_\Lambda - \Omega_T^0)^2}{\sigma_T^2}

  All the previous equations remain correct by replacing the residual vector and the weight matrix:

  m - m(\theta, z) \;\to\; \big(m_1 - m(\theta, z_1), \ldots, m_n - m(\theta, z_n),\; \Omega_m + \Omega_\Lambda - \Omega_T^0\big)^T, \qquad U = \mathrm{diag}(1/\sigma_{m_1}^2, \ldots, 1/\sigma_{m_n}^2, 1/\sigma_T^2)

  and appending one row to the Jacobian: J_{(n+1),k} = \partial(\Omega_m + \Omega_\Lambda)/\partial\theta_k = 1 for \theta_k \in \{\Omega_m, \Omega_\Lambda\}.

  36. Online demonstration of MCMC Only if enough time!

  37. Prior and new discovery. In 1999, two different groups (the SCP and High-z teams) discovered the cosmological constant using SN Ia at high redshift. These two groups were mainly composed of particle physicists and were using frequentist statistics. As a consequence they did not use priors like \Omega_m > 0. And they made a fundamental discovery!!!

  38. Summary. Both statistics are used in cosmology and give similar results if no or weak priors are used. Frequentist statistics are very simple to use for Gaussian errors and a rather linear model; errors can easily be computed using the Fisher analysis. Bayesian statistics might be the only method to solve very complex problems, but beware of the probability interpretation! For complex problems only simulation can be used, and it costs a lot of computing time. In both cases, when using priors, a careful analysis of the results should be done.

  39. References:
  - http://pdg.lbl.gov/2009/reviews/rpp2009-rev-statistics.pdf
  - http://www.inference.phy.cam.ac.uk/mackay/itila/
  - http://ipsur.r-forge.r-project.org/
  - http://www.nu.to.infn.it/Statistics/
  - http://en.wikipedia.org/wiki/F-distribution
  - http://www.danielsoper.com/statcalc/calc07.aspx
  If you have any question or problem, send me an e-mail: tilquin@cppm.in2p3.fr

  40. Why would Bayesian statistics have missed the discovery? Positivity of \Omega_m. The magnitude-redshift relation is

  m(z; \Omega_m, \Omega_\Lambda, m_s) = 5\log_{10} D_L(z; \Omega_m, \Omega_\Lambda) + m_s

  with the luminosity distance (\kappa = \Omega_m + \Omega_\Lambda - 1)

  D_L = \frac{1+z}{\sqrt{\kappa}}\,\sin\!\big(\sqrt{\kappa}\,J\big) \;(\kappa > 0), \qquad D_L = (1+z)\,J \;(\kappa = 0), \qquad D_L = \frac{1+z}{\sqrt{-\kappa}}\,\sinh\!\big(\sqrt{-\kappa}\,J\big) \;(\kappa < 0)

  J = \int_0^z \frac{dz'}{H(z')}, \qquad H(z) = \sqrt{\Omega_m(1+z)^3 + \Omega_\Lambda + (1 - \Omega_m - \Omega_\Lambda)(1+z)^2}

  41. F-test and F-value. A way to compare two different models or hypotheses. Consider two models, 1 and 2, where model 1 (p_1 parameters) is "nested" within model 2 (p_2 > p_1). Model 2 will always give a better fit to the data than model 1, but one often wants to determine whether model 2 gives a significantly better fit. One approach to this problem is to use an F-test. If there are n data points to estimate the parameters of both models from, the F statistic is

  F = \frac{(\chi^2_1 - \chi^2_2)/(p_2 - p_1)}{\chi^2_2/(n - p_2)}

  Under the null hypothesis that model 2 does not provide a significantly better fit than model 1, F follows an F-distribution with (p_2 - p_1, n - p_2) degrees of freedom. The null hypothesis is rejected if the F calculated from the data is greater than the critical value of the F-distribution for some desired false-rejection probability (e.g. 0.05). The test is also known as the likelihood-ratio test. (See http://en.wikipedia.org/wiki/F-distribution, http://www.danielsoper.com/statcalc/calc07.aspx)
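A minimal sketch of this statistic, assuming scipy is available; the \chi^2 values, parameter counts and number of data points are illustrative:

```python
# F-test for nested models: model 1 (p1 params) nested in model 2 (p2 > p1).
from scipy.stats import f as f_dist

def f_test(chi2_1, p1, chi2_2, p2, n):
    """Return the F statistic and its false-rejection probability."""
    F = ((chi2_1 - chi2_2) / (p2 - p1)) / (chi2_2 / (n - p2))
    p_value = f_dist.sf(F, p2 - p1, n - p2)   # tail of the F distribution
    return F, p_value

F, p = f_test(chi2_1=95.0, p1=2, chi2_2=85.0, p2=3, n=100)
print(F, p)   # reject model 1 at 95% CL if p < 0.05
```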

  42. Statistical methods in cosmology. André Tilquin (CPPM), tilquin@cppm.in2p3.fr. The magnitude-redshift relation with a dark-energy equation of state w(z) = w_0 + w_1\, z/(1+z):

  m(z; \Omega_m, \Omega_X, w_0, w_1, m_s) = 5\log_{10} D_L + m_s

  with D_L as before (sin, linear, or sinh depending on the curvature) and

  J = \int_0^z \frac{dz'}{H(z')}, \qquad H(z) = \sqrt{\Omega_m(1+z)^3 + \Omega_X f(z) + \Omega_k(1+z)^2}, \qquad f(z) = (1+z)^{3(1+w_0+w_1)}\, e^{-3 w_1 z/(1+z)}

  43. Bayesian evidence. Bayesian forecast method: (a) define the experiment configuration and the models; (b) simulate data D for all fiducial parameters; (c) compute the evidence (using the data from b); (d) plot the evidence ratio B_01 = E(M_0)/E(M_1). Limits: plot contours of iso-evidence ratio: ln(B_01) = 0 (equal probability), ln(B_01) = -2.5 (1:12, substantial), ln(B_01) = -5 (1:150, strong). Computationally intensive: one needs to calculate hundreds of evidences.

  44. Graphical interpretation (contour). The equation \chi^2 = \chi^2_{min} + s^2 defines an iso-probability ellipse. For two parameters (\Omega_m, \Omega_\Lambda) with covariance matrix V = (J^T U J)^{-1} (errors \sigma_{\Omega_m}, \sigma_{\Omega_\Lambda} and correlation \rho), the expansion around the minimum gives

  \chi^2 = \chi^2_{min} + \frac{1}{1-\rho^2}\left[\frac{(\Omega_m - \Omega_m^0)^2}{\sigma_{\Omega_m}^2} + \frac{(\Omega_\Lambda - \Omega_\Lambda^0)^2}{\sigma_{\Omega_\Lambda}^2} - \frac{2\rho(\Omega_m - \Omega_m^0)(\Omega_\Lambda - \Omega_\Lambda^0)}{\sigma_{\Omega_m}\sigma_{\Omega_\Lambda}}\right]

  The contour \chi^2 = \chi^2_{min} + 1 contains 39% probability for two parameters, while its projections on each axis give the 68% one-dimensional errors. The tilt angle \phi of the ellipse with respect to the axes satisfies

  \tan 2\phi = \frac{2\rho\,\sigma_X\sigma_Y}{\sigma_X^2 - \sigma_Y^2}

  45. Systematic errors. Definition: a systematic error is everything that is not statistical.
  Statistical: if we repeat the measurement of a quantity Q n times with a statistical error \sigma_Q, the average \langle Q\rangle tends to the true value Q_0 with an error \sigma_Q/\sqrt{n}.
  Systematic: whatever the number of experiments, \langle Q\rangle will never tend to Q_0 better than the systematic error S.
  How to deal with it: if the systematic effect is measurable, we correct for it by computing \langle Q - \delta Q\rangle with the error \sigma^2 = \sigma_Q^2 + \sigma_{\delta Q}^2. If not, we add the error matrices, V = V_{stat} + V_{syst}, and use the general formalism.
  Challenge: the systematic error should be kept below the statistical error. If not, just stop the experiment, because the systematics have won!!!

  46. Error on the z parameter. SNAP measures m_i and z_i with errors \sigma_m and \sigma_z. The redshift is used as a parameter of the theoretical model, so its error does not appear directly in the \chi^2:

  \chi^2 = \sum_{i=1}^{n} \frac{(m_i - m(\Omega_m, \Omega_\Lambda, z_i))^2}{\sigma_{m_i}^2}

  But the error on z leads to an error on m(\Omega_m, \Omega_\Lambda, z_i):

  \sigma_m(z_i) = \left.\frac{\partial m(\theta, z)}{\partial z}\right|_{z_i} \sigma_z

  Thus the error on the difference m_i - m_{th} is

  \sigma_i^2 = \sigma_{m_i}^2 + \left(\left.\frac{\partial m}{\partial z}\right|_{z_i} \sigma_z\right)^2

  and this \sigma_i^2 replaces \sigma_{m_i}^2 in the \chi^2.
