Understanding Null Hypothesis Significance Testing (NHST) in Statistics

Null Hypothesis Significance Testing (NHST) is a common statistical method for deciding whether a particular value of a parameter can be rejected, such as testing whether a coin is fair. The decision rests on calculating the probabilities of possible outcomes and summarizing them as a p value. Crucially, the space of possible outcomes is defined by the data-collection intentions of the experimenter, which sits uneasily with the experimental goal of insulating the observed data from those intentions.


Presentation Transcript


  1. Null Hypothesis Significance Testing (NHST). Chapter 11 of the Kruschke text. Darrell A. Worthy, Texas A&M University.

  2. NHST We now have some idea of how Bayesian inference works, so it is appropriate to compare Bayesian inference with NHST. In NHST the goal is to decide whether a particular value of a parameter can be rejected. If we want to decide whether a coin is fair, then we are asking whether we can reject the null hypothesis that the bias of the coin has a specific value, θ = .5.

  3. Logic of NHST Suppose the coin is fair. Then when we flip the coin, we expect that about half the flips should be heads and half tails. To make this precise we need to figure out the exact probabilities of all possible outcomes, which can then be used to compute the probability of getting an outcome as extreme as, or more extreme than, the observed outcome. This probability is called a p value. If it is very small, say less than .05, then we decide to reject the null.
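
As a minimal sketch of this computation (assuming scipy is available, and using 7 heads in 24 flips, the example introduced later in these slides, purely as an illustration), the tail probability under the null θ = .5 is:

```python
from scipy.stats import binom

# Illustrative outcome: z = 7 heads in N = 24 flips, null hypothesis theta = .5.
N, z = 24, 7

# Probability of an outcome as extreme as, or more extreme than, z heads
# (lower tail, since 7 is below the expected 12 heads).
p_one_tailed = binom.cdf(z, N, 0.5)

# A simple two-tailed p value doubles the one-tailed tail probability.
p_two_tailed = 2 * p_one_tailed

print(p_one_tailed)  # ~0.032
print(p_two_tailed)  # ~0.064
```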

  4. Logic of NHST Notice that this reasoning depends on defining a space of all possible outcomes under the null hypothesis. This space of all possible outcomes is based on how we intend to collect the data. Was our intention to flip the coin exactly N times? Was the intention to flip the coin until the zth head appeared? Was the intention to flip for a specified duration? Each of these intentions creates a different space of possible outcomes that we need to use to do NHST. A more explicit (and usually ignored) definition of a p value is the probability of observing an outcome at least that extreme when using the intended sampling and testing procedures.

  5. Logic of NHST This figure illustrates how a p value is defined. The important point is that the space of all possible outcomes differs depending on the sampling intention, and this can yield different conclusions. People often refer to "the" p value for a set of data, when in fact there are many possible p values depending on how the cloud of imaginary outcomes is generated.
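
As a sketch of how two intentions produce two p values for the same observed data (again assuming scipy and the illustrative 7 heads in 24 flips): if the intention was to flip exactly N = 24 times, the reference distribution is binomial; if the intention was to flip until the 7th head appeared, the reference distribution is negative binomial.

```python
from scipy.stats import binom

N, z = 24, 7   # the same observed data: 7 heads in 24 flips

# Intention 1: flip exactly N times. "As extreme or more extreme" means
# observing z or fewer heads out of N under theta = .5.
p_fixed_N = binom.cdf(z, N, 0.5)

# Intention 2: flip until the z-th head appears. "As extreme or more extreme"
# means needing N or more flips, i.e. at most z - 1 heads in the first N - 1 flips.
p_fixed_z = binom.cdf(z - 1, N - 1, 0.5)

print(p_fixed_N)  # ~0.032
print(p_fixed_z)  # ~0.017 -- same data, different p value
```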

  6. Logic of NHST The cloud of possible outcomes depends on the intended stopping criterion. It also depends on the tests that the researcher intends to run, because additional tests expand the cloud with additional possibilities. Do the actually observed data depend on the intended stopping rule or on the tests you plan to run on them? Not in appropriately conducted research. A good experiment is based on the principle that the data are insulated from the experimenter's intentions.

  7. Logic of NHST If our experiment was to flip a coin many times in a row, then the coin only knows how often it has been flipped; it does not know what our intentions are. Our conclusion about the coin should not depend on what the experimenter had in mind while flipping it, nor on what tests the experimenter wanted to run on the data afterwards. The essential constraint on the stopping rule is simply that it should not bias the data that we obtain. Stopping at a fixed number of flips does not bias the data.

  8. Logic of NHST Stopping at a fixed number of heads (or tails) does bias the data, because a random sequence of unrepresentative flips can cause data collection to stop prematurely. Peeking at the data as they accumulate, and continuing to collect additional data only if there is not yet an extreme outcome, can also bias the data, because a random extreme outcome can stop data collection prematurely. However, optional stopping is less problematic with a Bayesian approach, and many argue that it is not problematic at all (Rouder, 2014). We will later look at an example where a Bayes factor indicating strong evidence was used as a stopping rule.
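
The effect of peeking can be illustrated with a small Monte Carlo sketch (hypothetical settings: test after every 10 flips up to a maximum of 200, numpy and scipy assumed). Even though the simulated coin is truly fair, stopping as soon as a test reaches p < .05 produces false alarms far more often than 5% of the time.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n_sims, max_flips, peek_every = 2000, 200, 10

false_alarms = 0
for _ in range(n_sims):
    flips = rng.integers(0, 2, size=max_flips)  # a truly fair coin
    for n in range(peek_every, max_flips + 1, peek_every):
        p = binomtest(int(flips[:n].sum()), n, 0.5).pvalue
        if p < 0.05:           # stop as soon as the result looks "significant"
            false_alarms += 1
            break

# A single fixed-N test would false-alarm about 5% of the time;
# repeated peeking inflates that rate substantially.
print(false_alarms / n_sims)
```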

  9. Soul searching Kruschke shows several simulations where p values from analysis of the same data differ based on the sampling intentions. These differences were somewhat small, ranging from p = .017 to p = .10. A defender of NHST could argue that these are small, perhaps trivial, differences, and that the differences would get smaller as N gets larger. A flaw in this argument is that it does not deny the problem. It also gives no solution for reasoning about data with small N. There are also situations where p values differ more dramatically. One case is multiple comparisons.

  10. Soul searching For NHST practitioners the solution to these quandaries is to establish the true intention of the researcher. This is done explicitly when applying corrections for multiple comparisons. The same approach is taken for stopping rules: the analyst should determine what the truly intended stopping rule was, and then compute the appropriate p value. Determining true intentions can be difficult, so some researchers now pre-register their studies and analysis plans as a way to pin down the appropriate p values.

  11. Soul searching What if we have followed this new set of constantly changing rules, but an unforeseen event interrupts the data collection, or produces a windfall of extra data? What if, after the data have been collected, it becomes clear that there should have been other tests? Under NHST the p values then need to be adjusted. Fundamentally, the intentions should not matter to the interpretation of the data, because the propensity of a coin to come up heads does not depend on the intentions of the coin flipper. We design our experiments to insulate the data from the experimenters' intentions.

  12. Soul searching Howson & Urbach go to great lengths to note the flaws in NHST in their book: "We suggest that such information about experimenters' subjective intentions has no inductive relevance whatever in this context, and that in practice it is never sought or even contemplated. The fact that significance tests and, indeed, all classical inference models require it is a decisive objection to the whole approach."

  13. Bayesian analysis The Bayesian interpretation of data does not depend on the covert sampling and testing intentions of the data collector. In general, for data that are independent across trials, the probability of the set of data is simply the product of the probabilities of the individual outcomes. The likelihood function captures everything we assume to influence the data. In the case of the coin, we assume that the bias parameter (θ) is the only influence on its outcomes, and that trials are independent. The Bernoulli likelihood function completely captures these assumptions.
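
Concretely, the Bernoulli likelihood for a sequence containing z heads in N independent flips is p(D|θ) = θ^z (1 − θ)^(N − z). A minimal sketch:

```python
def bernoulli_likelihood(theta, z, N):
    """Probability of a particular sequence with z heads in N independent
    flips, given bias theta. Only z and N matter, not the order of flips."""
    return theta**z * (1 - theta)**(N - z)

# Illustration: 7 heads in 24 flips, evaluated at a few candidate biases.
for theta in (0.2, 0.3, 0.5):
    print(theta, bernoulli_likelihood(theta, 7, 24))
```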

  14. Prior Knowledge Suppose that we are no longer flipping a coin, but a flat-headed nail. When we flip the nail it can land with its pointy tail touching the ground (tails) or on its head with the pointy part sticking up (heads). From prior experience with nails and from mental modeling, we reason that the nail will land tails much more often than heads. In other words, we have a strong prior belief that the nail is tails-biased. Suppose we flipped it 24 times and it landed heads 7 times. Is the nail "fair"? Would we use it to determine which team kicks off at the Super Bowl?

  15. NHST analysis The NHST analysis has no way to incorporate our prior knowledge or beliefs about nails. If we are just asking "Is the nail fair, is θ = .50?", then we only know something about the distribution of outcomes under the null hypothesis. If we declare that the intention was to flip the nail 24 times, then an outcome of 7 heads means we do not reject the hypothesis that the nail is fair: the one-tailed p value is .032, making the p value for the two-tailed test of fairness .064. Guess we can use it in the Super Bowl!

  16. Bayesian analysis The Bayesian analysis can incorporate prior knowledge into the analysis. In data analysis we normally assume vague priors that cover a range of possible values for the parameter. Bayesian analysis would also allow us to establish a prior by appealing to well-regarded, publicly available research. In our example we may know of a previous nail-flipping study that found 95% tails out of a sample size of 20. That translates to a beta(θ|2,20) prior distribution, which may better represent our prior beliefs than beta(θ|1,1).

  17. Bayesian analysis If a beta(θ|2,20) prior distribution was generally agreeable to everyone, then the lower plot shows the resulting posterior distribution. The 95% HDI spans from .08 to .31, and .50 is clearly nowhere near this credible interval. A Bayesian analyst would strongly urge the NFL not to use a nail in the Super Bowl. A person using NHST would have no statistical reason to argue against using the nail, and collecting more data would violate the stopping rule!
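
A sketch of the conjugate update behind this plot (scipy assumed; the HDI here is approximated by a simple grid search over interval endpoints rather than by Kruschke's own code): the beta(θ|2,20) prior combined with 7 heads and 17 tails gives a beta(θ|9,37) posterior.

```python
import numpy as np
from scipy.stats import beta

a_prior, b_prior = 2, 20                    # prior from the earlier nail study
z, N = 7, 24                                # observed data: 7 heads in 24 flips
post = beta(a_prior + z, b_prior + N - z)   # conjugate posterior: beta(9, 37)

# Approximate the 95% HDI as the narrowest interval containing 95% of the mass.
probs = np.linspace(0, 0.05, 2001)
lowers = post.ppf(probs)
uppers = post.ppf(probs + 0.95)
i = np.argmin(uppers - lowers)
print(lowers[i], uppers[i])   # roughly .08 to .31; theta = .50 lies far outside
```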

  18. Priors are overt and relevant Some have criticized prior beliefs as no less mysterious than the experimenter's stopping and testing intentions. However, prior beliefs are overt, open to debate, and founded on publicly available prior knowledge. The Bayesian researcher must convince his or her audience that the priors used are appropriate. In many of the analyses we will do, the priors are specified beforehand as vague. However, it is important to note that Bayesian analysis provides an intellectually coherent method for determining the degree to which beliefs should change.

  19. Confidence Interval and Highest Density Interval Many people have acknowledged the perils of p values, and have suggested that data analysis would be better if practitioners used confidence intervals (CIs). The idea is to use point estimates for effect sizes and CIs around these estimates. These recommendations have the admirable goal of getting people to understand uncertainty of estimation instead of only a yes/no decision about a null hypothesis. Unfortunately, CIs suffer the same problems as p values because they are defined in terms of p values. Recall that we multiply the standard error by the t value to get CIs. Bayesian posterior distributions can be interpreted directly as the probability for different parameter values.
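
As a generic reminder of how such a CI is constructed (a sketch with made-up normal data, not the nail example), the interval is just the point estimate plus and minus the t critical value times the standard error:

```python
import numpy as np
from scipy.stats import t

# Hypothetical sample of normally distributed measurements.
x = np.array([4.1, 5.3, 4.8, 5.9, 5.1, 4.6, 5.4, 4.9])
n = len(x)
se = x.std(ddof=1) / np.sqrt(n)       # standard error of the mean
t_crit = t.ppf(0.975, df=n - 1)       # two-tailed 95% critical value
ci = (x.mean() - t_crit * se, x.mean() + t_crit * se)
print(ci)                # just two numbers: the CI is not a distribution
```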

  20. CI is not a distribution The CI is merely two points. For our nail example, with the same intentions, the CI is [.126, .511]. A common misconception of a confidence interval is that it indicates some sort of probability distribution over parameter values. It is tempting to believe that values in the middle of a CI should be more believable than values near or beyond the CI limits. Cumming and Fidler (2009) proposed superimposing the sampling distribution over a CI. This may look similar, but it is not the same as the posterior probability distribution of a parameter.

  21. CI is not a distribution Several other methods have been proposed for superimposing a distribution onto a CI. All of them seem to be motivated by a natural Bayesian intuition: parameter values that are consistent with the data should be more credible than parameter values that are not consistent with the data. If we were confined to frequentist methods, we could devise ways to approximate that intuition. But rather than resorting to clever workarounds, we can express our natural Bayesian intuitions in fully Bayesian formalisms.

  22. Bayesian HDI The 95% HDI consists of those values of θ that have at least some minimal level of posterior credibility, such that the total probability of all such values is .95. With 7/24 heads and a beta(θ|11,11) prior, the posterior probability distribution is shown in the bottom panel. The 95% HDI goes from θ = .254 to θ = .531. It also shows exactly how credible each possible bias is.
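
The same conjugate-update-and-grid-search sketch applies here (scipy assumed): the beta(θ|11,11) prior with 7 heads and 17 tails yields a beta(θ|18,28) posterior.

```python
import numpy as np
from scipy.stats import beta

post = beta(11 + 7, 11 + 17)          # posterior: beta(18, 28)
probs = np.linspace(0, 0.05, 2001)
lowers, uppers = post.ppf(probs), post.ppf(probs + 0.95)
i = np.argmin(uppers - lowers)
print(lowers[i], uppers[i])           # roughly theta = .254 to theta = .531

# The posterior density itself shows how credible each possible bias is.
print(post.pdf(np.array([0.25, 0.39, 0.53])))
```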

  23. Bayesian HDI There are at least three advantages of the HDI over an NHST CI. First, the HDI has a direct interpretation in terms of the credibilities of values of the parameter, p(θ|D), which is what we want to know; the CI has no direct relationship with what we want to know. Second, the HDI has no dependence on the sampling and testing intentions of the experimenter; the CI tells us about the probabilities of data relative to the imaginary possibilities generated from the experimenter's intentions. Third, the HDI is responsive to the analyst's prior beliefs, as it should be: the Bayesian analysis indicates how much the new data should alter our beliefs.

  24. Multiple comparisons NHST involves the practice of correcting for false alarms when conducting multiple comparisons. Bayesian analysis is not immune from flukes or false alarms, but in NHST false alarms are allowed to occur 5% of the time by design, and multiple comparisons dramatically increase the chance of a Type I error. If a comparison is planned, and not too many are planned, then it is considered acceptable to run the comparison as normal. If the comparison is post hoc, or only conceived after the data have been collected, then the critical p value for rejecting the null hypothesis must be lowered, making the test more conservative.
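
One common NHST correction (Bonferroni, named here only as a familiar example) simply divides the decision criterion by the number of intended comparisons, as a sketch:

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance criterion after a Bonferroni correction."""
    return alpha / n_comparisons

# A researcher who plans one comparison tests it at .05;
# a researcher who plans four comparisons must test each at .0125.
print(bonferroni_alpha(0.05, 1))   # 0.05
print(bonferroni_alpha(0.05, 4))   # 0.0125
```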

  25. Multiple comparisons The problem with the NHST approach is that the correction that must be made depends on the analyst's intentions. Two researchers can come to the same data and leave with different conclusions because of the variety of comparisons that they find interesting enough to conduct, and what provoked their interest. A creative and inquisitive analyst, who wants to conduct many comparisons because of deep theorizing or to explore some provocative new trend observed in the data, is penalized: this researcher can only do those additional analyses at the cost of using a more stringent threshold for each comparison.

  26. Multiple comparisons The uninquisitive researcher is rewarded with an easier criterion for achieving significance. This researcher may have a higher chance of getting a significant result, and of getting his or her work published, by feigning narrow-mindedness under the pretense of protecting the world from false alarms. Another issue is that researchers may have the data to test another question that could augment or counter the main findings, but they may argue that they cannot test this additional question because they did not conceive of it beforehand.

  27. Multiple comparisons Consider estimating baseball batting abilities for players from different fielding positions. A basic question might be whether batting ability differs for infielders versus outfielders. The uninquisitive researcher might plan to just do the single comparison of infielders versus outfielders. A more inquisitive and knowledgeable researcher might plan to do this plus three additional comparisons: outfielders versus basemen, outfielders versus catchers/pitchers, and basemen versus catchers/pitchers.

  28. Multiple comparisons The more inquisitive and knowledgeable expert is punished with a more stringent criterion for declaring significance, even on the comparison of outfielders versus infielders. Suppose that on seeing the data the inquisitive researcher notices that catchers actually have about the same batting average as basemen, and therefore should be grouped with basemen rather than pitchers. Should this be considered post hoc, or planned, because the researcher should have planned this comparison based on publicly available knowledge? Suppose instead that the researcher sees that catchers aren't that different from outfielders and nixes that comparison to relax the critical threshold. Is this decision post hoc, since the test was planned in advance?

  29. Multiple comparisons All of these rules and heuristics for correcting for multiple comparisons are difficult to enforce, and psychology currently has a crisis of confidence in the reliability of its research. It seems problematic to have so much riding on the intentions of the researcher. The rules and corrections can unnecessarily burden the research enterprise by punishing researchers for looking at their data more closely, or for testing additional questions after exploring the data. The problem is not solved by picking a story and sticking to it, because any story still presumes that the researcher's testing intentions should influence the interpretation of the data.

  30. Just one Bayesian posterior no matter how you look at it The data from an experiment are carefully collected to be insulated from the experimenter's intentions regarding subsequent tests. There should be no way for an individual in one experimental group to be influenced by the presence or absence of any other groups or subjects, before or after the experiment. Moreover, the data are uninfluenced by the experimenter's intentions regarding groups and sample size. The nice thing about a Bayesian analysis is that the interpretation of the data is likewise not influenced by the experimenter's stopping and testing intentions (assuming those intentions do not affect the data).

  31. Just one Bayesian posterior no matter how you look at it A Bayesian analysis yields a posterior distribution over the parameters of the model. The posterior distribution is the complete implication of the data. It can be examined in as many ways as the researcher deems necessary. Various comparisons of groups are merely different perspectives we are taking on the posterior distribution.

  32. Just one Bayesian posterior no matter how you look at it In the baseball example, the comparisons of different positions we could run are examined as marginal distributions that merely summarize the posterior distribution from different perspectives. The posterior distribution itself is unchanged by how we look at it. This is unlike the cloud of imaginary possibilities generated from the null hypothesis. The Bayesian posterior directly tells us the credibilities of the magnitudes of differences; NHST only tells us whether a difference is extreme within a cloud of possibilities determined by the experimenter's intentions.
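
A sketch of the idea with two hypothetical groups (made-up hit counts, and independent beta-binomial posteriors standing in for the full model of batting abilities): any comparison is just a summary of draws from the one posterior.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)

# Hypothetical hit counts for two groups (say, two fielding positions).
hits_A, at_bats_A = 260, 1000
hits_B, at_bats_B = 230, 1000

# Independent beta(1,1)-prior posteriors stand in for the full model here.
theta_A = beta(1 + hits_A, 1 + at_bats_A - hits_A).rvs(50_000, random_state=rng)
theta_B = beta(1 + hits_B, 1 + at_bats_B - hits_B).rvs(50_000, random_state=rng)

diff = theta_A - theta_B     # the "comparison" is just this marginal
print(diff.mean())           # credible magnitude of the difference
print((diff > 0).mean())     # posterior probability that group A bats better
```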

  33. How a Bayesian analysis mitigates false alarms No analysis is immune to false alarms, because randomly sampled data will occasionally contain accidental coincidences of outlying values. Bayesian analysis eschews the use of p values as a criterion for decision-making, however, because p values control false alarms on the basis of the researcher's intentions, not on the basis of the data. Bayesian analysis instead accepts that the posterior distribution is the best inference we can make, given the observed data and prior knowledge. If researchers suspect that a finding is a false alarm, they can collect more data until a greater degree of posterior credibility is reached.

  34. How a Bayesian analysis mitigates false alarms Bayesian analysis addresses the problem of false alarms by incorporating prior knowledge into the structure of the model. If we know that different groups have some overarching commonality, despite different treatments, we can describe the different group parameters as having been drawn from an overarching distribution that expresses that commonality. This is done using hierarchical or multilevel models, which are covered in Chapter 9 of the Kruschke text. We will not cover multilevel models in this short course, but Bayesian estimation is especially seamless and straightforward for implementing and evaluating them.

  35. What a sampling distribution is good for Sampling distributions, or the cloud of imaginary possibilities, are not as useful as posterior distributions for making inferences from a set of observed data. The reason is that sampling distributions tell us the probabilities of possible data if we run an intended experiment given a particular hypothesis, rather than the credibilities of possible hypotheses given a particular set of observed data. That is, sampling distributions give the probability of imaginary outcomes given a parameter value and the researcher's intentions, instead of the probability of parameter values given the actual data.

  36. What a sampling distribution is good for Despite their shortcomings, sampling distributions are good for at least a couple of things. One is planning an experiment: sampling distributions can be used in simulations so we know what to expect. I will not cover goals and power analyses in this short course, but they are discussed in Chapter 13 of the Kruschke text. The second is conducting posterior predictive checks: an inspection of patterns in simulated data generated by posterior parameter values. I will not cover this in extensive detail in this course, as many of the run-of-the-mill analyses we will run have parameters that can be easily estimated.
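
A minimal posterior predictive check sketch for the nail example (numpy and scipy assumed): draw θ values from the posterior, simulate a replicated experiment for each draw, and compare the observed head count with the simulated ones.

```python
import numpy as np
from scipy.stats import beta, binom

rng = np.random.default_rng(2)
z_obs, N = 7, 24

# Posterior from the beta(2,20) prior and the observed 7/24 heads.
theta_draws = beta(2 + z_obs, 20 + N - z_obs).rvs(10_000, random_state=rng)

# Simulate one replicated experiment of N flips for each posterior draw.
z_rep = binom.rvs(N, theta_draws, random_state=rng)

# The observed count should not look surprising under the posterior predictive.
print(np.percentile(z_rep, [2.5, 50, 97.5]))
print((z_rep >= z_obs).mean())
```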

  37. Summary Many people, somewhat rightly, state that the conclusions from NHST and Bayesian analysis are usually in alignment. The issue is that NHST is probably only widely used because of historical precedent and inertia. NHST's logic is less intuitive, and its conclusions depend on the intentions of the experimenter, whereas Bayesian conclusions are based on the data and any prior beliefs. NHST also does not allow us to readily reason about the probability of particular parameter values, given the data. The ability to directly reason about the probabilities of parameter values such as r or θ is one of the main reasons why researchers, and thinking human beings, are attracted to Bayesian reasoning.
