Annotated and Expanded Slides from AAA Webinar on P-values for Inference

An expanded set of annotated slides from William M. Cready's webinar presentation on the usefulness of p-values for inference, addressing confidence intervals, estimation precision, and cross-design differences in cost of capital. The comments provided shed light on interpretations and limitations of the research design used in the analysis.



Presentation Transcript


  1. PRESENTATION SLIDES: ANNOTATED AND EXPANDED
  AAA Webinar on "How Useful Are P-values for Inference?", March 18, 2021. William M. Cready.
  An expanded set of annotated slides pertaining to my webinar presentation. I also provide my answers to the set of questions submitted in advance by webinar attendees.

  2. A GATEWAY EVENT
  Comments: The confidence intervals are from my paper, The Big N Audit Quality Kerfuffle. The intervals loosely identify sets of evidence-compatible difference values identified by each estimation strategy. The large overlap in them is best interpreted as indicating that the underlying estimation precisions here are inadequate to the task of identifying possibly meaningful cross-design differences in design-specific mean difference parameter values.
  In the full paper I also address the range of differences that are evidence-compatible here. These ranges do encompass rather sizable differences. For instance, the PSM parameter value could be 0 while the LMZ Repl. value could be 60, a difference of 60 basis points! But, to flip things around, the PSM parameter value could be 60 and the LMZ Repl. value could be 3, a 57 basis point difference in the opposite direction!
  A more casual interpretation is that there is not a reliable basis in the examined evidence for thinking that the PSM design is producing radically different insights about the cost of capital difference. Recognize, however, that this outcome stems largely from the fact that the research design here lacks the power to identify possibly meaningful cross-design differences in parameter values.
  [Figure: 95% Confidence Intervals for Big N Auditor Implied Cost of Capital Reductions, shown for KR, LMZ Repl., and LMZ PSM. X-axis: Estimated Difference Between Non-Big N and Big N Cost of Capital in Basis Points (-20 to 80). KR: Khurana and Raman (TAR, 2004). LMZ: Lawrence, Minutti-Meza, and Zhang (TAR, 2011). PSM: Propensity Score Matching Multiple Variable Regression Design.]
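As a rough illustration of the cross-design comparison above, here is a minimal Python sketch of building a confidence interval for the difference between two design-specific estimates from their point estimates and standard errors. The numbers are hypothetical, chosen only to mimic the figure's scale, and independence of the two estimates is assumed.

```python
import math
from scipy import stats

def diff_ci(est_a, se_a, est_b, se_b, level=0.95):
    """Confidence interval for the difference (a - b) between two independent estimates."""
    diff = est_a - est_b
    se_diff = math.sqrt(se_a**2 + se_b**2)   # independence of the two estimates assumed
    z = stats.norm.ppf(0.5 + level / 2)
    return diff - z * se_diff, diff + z * se_diff

# Hypothetical (estimate, standard error) pairs in basis points, illustrative only.
lmz_repl = (30.0, 15.0)   # LMZ replication design
lmz_psm  = (25.0, 18.0)   # LMZ PSM design

lo, hi = diff_ci(*lmz_repl, *lmz_psm)
print(f"95% CI for the cross-design difference: [{lo:.1f}, {hi:.1f}] basis points")
```

The resulting interval spans zero yet reaches differences of tens of basis points in either direction, which is the sense in which the overlapping intervals are uninformative about cross-design differences.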

  3. A GATEWAY EVENT (cont.)
  Comments: Lawrence et al. interpret their evidence as indicating that ...
  (The remaining comments and the confidence-interval figure on this slide repeat those shown on slide 2.)

  4. WHAT ABOUT STATISTICAL SIGNIFICANCE?
  Comments: Statistical significance assessment in this setting enters as an arrow targeted at the tested null hypothesis value of 0. (If you wish to think of all this from a one-sided perspective, then the null target area extends all the way from zero to negative infinity.) If the arrow hits its target, then the null is rejected/nullified/had doubt cast upon it. If it gets blocked, then perhaps it is because it is true. True nulls in the NHST approach are (almost) arrow-proof. However, arrows directed at null hypotheses that the difference here is +20, +40, or even +60 are similarly blocked by the PSM CI. Getting blocked simply does not tell us much of anything about the truth of the chosen null hypothesis.
  The blocking of the PSM arrow does, of course, indicate that the evidence is consistent with the possibility that the difference is zero. But we knew this already from the PSM CI. It is also compatible with the idea that the difference is possibly inconsequential. But the lower bound of the LMZ Repl. CI provides a similar insight. Hence, this is a setting where p-values have no material inferential value over basic robust estimation assessment of the evidence.
  [Figure: 95% Confidence Intervals for Big N Auditor Implied Cost of Capital Reductions (same figure as slide 2). KR: Khurana and Raman (TAR, 2004). LMZ: Lawrence, Minutti-Meza, and Zhang (TAR, 2011). PSM: Propensity Score Matching Multiple Variable Regression Design.]
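To make the "blocked arrows" point concrete, the following sketch computes two-sided p-values for the same estimate against several candidate nulls. The estimate and standard error are hypothetical stand-ins (not the paper's actual values); the point is that none of these nulls is rejected, so non-rejection of the zero null carries no special information about zero.

```python
from scipy import stats

# Hypothetical PSM-style estimate, wide enough that many nulls survive.
estimate, se = 25.0, 18.0   # basis points, illustrative only

for null in (0, 20, 40, 60):
    z = (estimate - null) / se
    p = 2 * stats.norm.sf(abs(z))              # two-sided p-value against this null
    verdict = "not rejected" if p > 0.05 else "rejected"
    print(f"H0: difference = {null:>2} bp -> p = {p:.2f} ({verdict})")
# Every one of these nulls is "blocked"; failing to reject H0 = 0 tells us
# nothing that failing to reject H0 = 60 would not also tell us.
```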

  5. THE REGULAR ICOSAHEDRON PERSPECTIVE
  Comments: An instructive way of thinking about null hypothesis significance testing is to view it through the lens of a twenty-sided die roll. Such a roll can produce a valid NHST assessment of any null hypothesis you can possibly imagine. Assuming the die is fair, and your decision rule is to reject only if a 20 is rolled, then you will incorrectly reject a true null only 5% of the time (i.e., p = .05) by employing such roll-of-the-die testing.
  In general, of course, NHST assessments do better than resorting to the rolling of dice. They examine data which contain information about the studied phenomenon. However, it is also the case that no matter how good your data are, how big your N is, or how sophisticated your identification strategies are, the die-roll component to the analysis is still there. You can shrink it, but you can't eliminate it. So an operative way of thinking about any NHST assessment is to pose the question: how close is this to (or far away from) a twenty-sided die roll NHST?
  Here, for instance, the relative widths of the three CIs suggest that the LMZ PSM design is the closest of the three to being a twenty-sided die roll. This is because the underlying parameter is being measured with more error (i.e., twenty-sided die roll material). And twenty-sided die rolls commonly produce 1 to 19 outcomes.
  Finally, if the null is true, a NHST assessment is always equivalent to a twenty-sided die roll. Something you might bear in mind the next time you encounter a so-called falsification test.
  [Figure: 95% Confidence Intervals for Big N Auditor Implied Cost of Capital Reductions (same figure as slide 2). KR: Khurana and Raman (TAR, 2004). LMZ: Lawrence, Minutti-Meza, and Zhang (TAR, 2011). PSM: Propensity Score Matching Multiple Variable Regression Design.]
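A quick simulation of the icosahedron analogy, assuming a fair die and an alpha of 0.05: the "roll a twenty-sided die, reject on a 20" test rejects a true null about 5% of the time, and so does an ordinary data-based test when its null happens to be true. The sample size and cutoff below are illustrative choices, not from the slides.

```python
import random
from statistics import mean, stdev

random.seed(1)
trials = 100_000

# "Test" any true null by rolling a twenty-sided die and rejecting on a 20.
d20_rejections = sum(random.randint(1, 20) == 20 for _ in range(trials))
print(f"d20 test: rejected a true null in {d20_rejections / trials:.1%} of rolls")

# A data-based test of a TRUE null behaves the same way at alpha = 0.05:
# simulate samples in which the true mean effect really is zero.
rejections = 0
for _ in range(2_000):
    x = [random.gauss(0, 1) for _ in range(50)]   # true effect = 0
    t = mean(x) / (stdev(x) / 50 ** 0.5)
    if abs(t) > 2.01:                             # ~5% two-sided cutoff, 49 df
        rejections += 1
print(f"data-based test of a true null: rejected in {rejections / 2000:.1%} of samples")
```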

  6. NULL HYPOTHESIS VIABILITY AND P-VALUE RELEVANCE
  Faux Null Hypothesis Testing in Accounting Research — Abstract: Conventionally, null hypothesis testing, and consequently p-value inference, centers on testing null hypotheses. Commonly such hypotheses are expressed in no-effect-at-all-is-present terms (i.e., the effect is zero). The usefulness of such test-of-hypothesis exercises depends to a great degree on there being some basis for thinking that the tested null is true. Otherwise, type 1 errors (which is what p-values measure/reflect/control) are impossible; it is only possible for an examination to commit type 2 errors. This study examines the degree to which presentations and analyses of null hypothesis rejections reported by articles in a leading accounting journal, The Accounting Review, provide a basis for taking considered null hypotheses as being possibly true. It finds little evidence that null hypotheses are taken seriously. Rather, the evidence is broadly consistent with their use as faux test-of-hypothesis props for advancing specific alternative hypotheses. In particular, articles routinely focus on how they conduct tests of alternative hypotheses, not null hypotheses (only null hypotheses, not alternative hypotheses, are tested in conventional hypothesis testing designs), generally do not clearly state tested null hypotheses or the conditions required for them to hold, and, in those few instances where they actually engage the question of why the null might be true, either shift to discussing why the underlying effect size is possibly quite small or couch the question in the form of (non-mutually exclusive) offsetting effects. Expositions of the former type explicitly embrace the notion that the null is false, while expositions along the latter line are fundamentally advancing multiple reasons for thinking that a zero net effect null is false. Collectively, this examination suggests that accounting research could benefit considerably from more careful attention to the structure in which statistics-based hypothesis testing works. It also implicitly suggests that much of what passes for test-of-hypothesis analysis in the literature is, in fact, a form of locational description by means of p-values. However, as should be readily evident from the recent ASA Statement on Statistical Significance and P-Values principles, p-values are not a particularly useful metric for conducting rigorous locational descriptive examinations of data.
  Comments: This is an abstract for a current work in process that addresses the importance of null hypothesis saliency to p-value usefulness. The analysis part is done; the writing isn't. So this is all I have to share with you about it.
  The core point here is that NHST tests nulls. Hence, it is rather fundamental that such nulls be test-worthy. Nothing useful is possibly learned from shooting rejection arrows into dead hypotheses. Yet accounting research commonly tests null hypotheses that, at a purely conceptual level (i.e., before one even considers the truth of the research design assumptions being made to test them), are knowably false. (I would note that the Bernard and Thomas article discussed by Sanjay is a notable exception. Its market efficiency null hypothesis was very much alive at the time. Indeed, in its more general form it remains so today, despite the legions of rejection arrows embedded in it.)
  For a far more complete treatment of the topic I recommend The Earth is Round (p < .05) by Jacob Cohen (American Psychologist, 1994). To fully appreciate his arguments, I suggest you start with the title, which employs the conventional alternative hypothesis form, and identify what the associated null hypothesis he has in mind must be.

  7. HOW BAD IS IT?
  Comments: The picture here captures the key message in a co-authored working paper, Is There a Confidence Interval for That?, that addresses how accounting research fails to faithfully represent high p-value evidence in research articles.
  The hurricane tracking cone pictured here, formally known as the cone of uncertainty, identifies a region of locations for which a null hypothesis that the hurricane will pass directly overhead within the next 5 days cannot be rejected (p > ~.3). That said, the likelihood that the hurricane does indeed pass directly overhead any location in it, once one gets a day or more out, is actually rather small. Hence, it illustrates the point that a high p-value is not even a remotely sensible basis for thinking a tested null is true.
  Nevertheless, our former president selectively extends the cone out a day to support a prior comment that the hurricane might or would strike Alabama. Given that the extension accurately but incompletely identifies areas that would have been included in a six-day-out cone, had the NWS chosen to produce one, the salient point here is the use of high p-value outcomes to advance preferred locational claims. Specifically, the inability to reject the null hypothesis that the storm will strike Alabama is transmuted into a claim that the storm either will or is likely to strike Alabama. And the cone is selectively, as opposed to broadly, extended to emphasize this point.
  Our analysis demonstrates that accounting research performance on this dimension is, at best, on par with the former president's. Indeed, we are worse because we don't say "might." Nor do we say anything at all, to say nothing of a color-coded chart, in terms of identifying other places an effect could possibly be (besides zero). Our former president received numerous pants-on-fire and Pinocchio awards for this particular performance. What does this say about the sorts of awards accounting research papers would receive should they be subjected to similar levels of scrutiny?
  You can read about the editorial response given by the journal upon being informed about this state of affairs in my Complacency at the Gates article (Significance, 2019).
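The cone-of-uncertainty point — that a high p-value is not a sensible basis for believing the tested null — can be illustrated with a small simulation under assumed numbers. Here the true effect is modest but real (the zero null is false by construction), yet p-values above 0.3 still show up in a large share of samples.

```python
import random
from math import erf, sqrt
from statistics import mean, stdev

random.seed(7)

# Assumed numbers, illustrative only: the true effect is real but modest
# relative to sampling noise, so the zero null is false by construction.
true_effect, n, sims = 0.15, 50, 5_000

high_p = 0
for _ in range(sims):
    x = [random.gauss(true_effect, 1.0) for _ in range(n)]
    t = mean(x) / (stdev(x) / sqrt(n))
    p = 1 - erf(abs(t) / sqrt(2))    # two-sided p-value, normal approximation
    if p > 0.3:
        high_p += 1

print(f"p > 0.3 in {high_p / sims:.0%} of samples, even though the null is false")
```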

  8. Q&A
  Question: What should reviewers and editors do?
  My Answer: I strongly recommend the article "Five Nonobvious Changes in Editorial Practice for Editors and Reviewers to Consider When Evaluating Submissions in a Post p < 0.05 Universe" by David Trafimow. I think we should place a lot less emphasis on (mostly fake) tension and the resolution thereof (via faux null hypothesis testing) and recognize that most of what we do is fundamentally descriptive. Hence, articles should be evaluated more on the basis of whether they are undertaking the description of (potentially) interesting phenomena and whether the descriptive assessment that follows is done well, irrespective of what it does or does not turn up. While NHST can certainly play a role in such descriptive assessment, it should typically not be the central player.
  Question: What should authors do?
  My Answer: This is much harder. In terms of myself, I am senior faculty. I have the freedom to pretty much do what I want. So, in my own work these days I am making every effort to practice what I preach. However, I am also involved in joint work with others whose careers are more at risk. And, no, I don't make them practice what I preach. I do try and moderate things, but that is as far as I go with it. I cannot, particularly seeing the reception my various efforts in the area have received thus far, in good conscience impose my fate on others. So, until journals start sending out signals that they are interested in changing their contribution assessment paradigms, I would focus on avoiding doing really bad things (such as relying on high p-value outcomes as a basis for thinking effects are absent or inconsequential) and look to identify relevant null hypotheses that have some life to them (e.g., test a null hypothesis that some new information item leads to a 10% improvement in forecast accuracy, where a 10% improvement is plausibly argued to be a salient lower bound on economic significance) as opposed to taking the conventional route of testing the faux null that no improvement at all occurs (faux because its truth requires that no forecaster in the population ever finds the new item to be useful in forecasting).
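As a rough sketch of the kind of "live" null suggested above, the snippet below (with made-up forecast-accuracy improvements and scipy's one-sample t-test) contrasts the faux null of no improvement at all with a one-sided test against the 10% lower bound on economic significance used in the example.

```python
from scipy import stats

# Hypothetical per-analyst percentage improvements in forecast accuracy
# after receiving the new information item (illustrative data only).
improvements = [14.2, 9.8, 17.5, 12.1, 8.9, 15.3, 11.0, 13.7, 10.4, 16.1]

# Faux null: no improvement at all ever occurs.
t0, p0 = stats.ttest_1samp(improvements, popmean=0.0)

# Substantive null: the improvement does not exceed the 10% lower bound on
# economic significance; one-sided alternative of exceeding it.
t10, p10 = stats.ttest_1samp(improvements, popmean=10.0, alternative='greater')

print(f"H0: improvement = 0    -> p = {p0:.4f}")
print(f"H0: improvement <= 10  -> one-sided p = {p10:.3f}")
```

Rejecting the first null says almost nothing of interest; the second test at least addresses a hypothesis someone could plausibly believe.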

  9. Q&A
  Question: Give some examples of the right way to use p-values for inference.
  My Answer: The right way starts with a clear identification of the tested null hypothesis, inclusive of the conditions required for it to be true. This requires a lot more than simply stating that you are testing b = 0. Cohen (1994), in fact, argues that in social science/behavioral settings such hypotheses are almost never true. Humans respond to stimuli. So the idea that no human or coalition of humans in a population responds to some proposed behavioral stimulus (e.g., pay, public signals, repeated public signals, news announcements, tweets, days of sunshine, etc.) is inherently implausible. What may be true, of course, is that very few respond or they don't respond very much. But, in my observation of things, we don't test those sorts of null hypotheses.
  Question: For instance, if the estimated coefficient is large in terms of economic magnitude and the two-sided p-value is 0.12, what can we conclude?
  My Answer: Assuming that the estimated coefficient is of descriptive interest, you should describe it. Here I would provide a confidence interval around it. I employ 95% CIs in my papers on significance testing mainly because it allows me to easily transition between test-of-hypothesis assessment and CI assessment. In this setting I would likely opt for +/- one standard error intervals. After all, 1 s.e. intervals seem to work quite well for hurricane tracking. Second, in assessing importance I would focus on the bounds of the interval more than the effect estimate. The likelihood that the true parameter value actually equals, or is even close to, your estimated value is not high. So the operative question is whether the lower bound of your CI is economically meaningful. If so, then you might say something like "the evidence broadly favors the presence of an economically meaningful effect." If not, or if you opt for, say, 95% CIs to satisfy the whims of a NHST-obsessed reviewer or editor, then you simply say something like "there is some support for the effect being consequential, but you cannot reliably rule out the possibility that it is indistinguishable from zero."
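A minimal sketch of the reporting approach described above, using hypothetical numbers chosen so that the two-sided p-value is roughly 0.12: report the +/- one standard error interval (and, if required, the 95% CI) and then ask whether the lower bound is economically meaningful.

```python
from scipy import stats

# Hypothetical coefficient and standard error; beta/se = 1.55, so p ~ 0.12.
beta, se = 0.31, 0.20
p_two_sided = 2 * stats.norm.sf(abs(beta / se))

one_se = (beta - se, beta + se)                   # +/- 1 standard error interval
z95 = stats.norm.ppf(0.975)
ci_95 = (beta - z95 * se, beta + z95 * se)        # conventional 95% CI

print(f"two-sided p = {p_two_sided:.2f}")
print(f"+/- 1 s.e. interval: [{one_se[0]:.2f}, {one_se[1]:.2f}]")
print(f"95% CI:              [{ci_95[0]:.2f}, {ci_95[1]:.2f}]")
# The inferential question is whether the lower bound (0.11 for the 1 s.e.
# interval, below zero for the 95% CI) is economically meaningful, not
# whether the point estimate clears an arbitrary significance threshold.
```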

  10. Q&A
  Question: I've always thought they were useful; I have used them extensively in my research and have never had their use questioned. When aren't they useful, or when is it inappropriate to use them? What should we be using in their place, and why is that preferable?
  My Answer: I suggest you go look at a couple of your more important works. Write out the null hypotheses that are being tested, inclusive of what they are asserting must hold for every member of the population you study. For instance, if the issue concerns earnings management, that no population member ever manages earnings in the fashion that your alternative hypothesis suggests they do. Or, if it is investor trading response (a field I work in), that no investor responds to any firm's disclosure of the studied type of information (e.g., earnings announcements) by trading that firm's stock. Then carefully consider the plausibility of this null possibly being true. Importantly, you cannot do substitutions such as "rare" for "never," "hard to find" for "not present," or "opposite direction effects" for "offsetting effects that to the nth digit exactly equal one another." Also, if you motivate your analysis with alternative-to-the-null anecdotes, recognize that such anecdotes effectively negate your null, thereby rendering NHST assessment superfluous. (Be aware, I have done this exercise myself. It was a rather depressing experience.)
  Question: Is the p-value the most important statistic? The key question I have in mind is the black-and-white nature of its use. Too many reviewers seem to treat the p-value as a threshold to be exceeded rather than a simple indicator of relevance for the reader to assess. I am simply interested in the speaker's comments on this topic.
  My Answer: Is EPS the most important accounting number? I agree. But that said, I really do really, really like seeing those stars in my SAS and STATA outputs. Hopefully you still feel this way.
  Question: How much weight/importance should we place on p-values?
  My Answer: Less would definitely be more.

  11. Q&A
  Question: I have reviewers asking if a subset of coefficients in a regression model are different from each other when one of the coefficients in question is not significant. I was wondering if a chi-square test is appropriate, as the insignificant variable has a large standard error.
  My Answer: Well, is the null hypothesis that they are exactly equal to one another at all plausible? I suspect not as, to quote Tukey, "It is foolish to ask 'Are the effects of A and B different?' They are always different for some decimal place." (Tukey, 1991, The Philosophy of Multiple Comparisons, p. 100). If this is the case, then testing the hypothesis that they are the same is pointless. It would make more sense to produce a confidence interval for the difference in estimates and discuss the bounds you obtain (a rough sketch of this follows below).
  Question: The arguments for using p-values versus giving confidence intervals, given that the former seems to be emphasized in accounting research and the latter in medical research.
  My Answer: Medical research is far more serious about effect sizes. Medical outcome impact matters greatly to them, so while they are concerned with whether the evidence is incompatible with a treatment not working at all, they are far more interested in getting an idea of what the evidence says about how well or poorly it may work. I would note that in my limited observation of things, medical research is quite fond of identifying high p-value outcomes as evidencing absence of effect (e.g., that masks don't stop COVID spread). So they do lean Trump on this dimension.
  Question: Are there disciplines using p-values > 0.10 as benchmarks for statistical significance?
  My Answer: I am sure there are. Indeed, in my experience at least, accounting journals are rather flexible in terms of setting p-value cutoffs. A bit more troubling is when an article obtains a p-value between 0.10 and 0.05 and identifies the underlying effect as being absent (typically because its being zero, and nothing but zero, is a fundamental requirement for the central inferences it advances).
  Question: Curious about the latest discussion.
  My Answer: I hope your curiosity wasn't satisfied, because we really only scratched the surface of things.
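For the coefficient-comparison question, here is a minimal sketch, under assumed coefficient estimates, standard errors, and covariance, of producing a confidence interval for the difference between two coefficients rather than testing the point null that they are exactly equal. In a real application the covariance would come from the model's estimated covariance matrix.

```python
import math
from scipy import stats

# Assumed values, illustrative only.
b1, se1 = 0.42, 0.10      # precisely estimated ("significant") coefficient
b2, se2 = 0.15, 0.30      # imprecise coefficient with a large standard error
cov_b1_b2 = 0.005         # estimated covariance between the two coefficients

diff = b1 - b2
se_diff = math.sqrt(se1**2 + se2**2 - 2 * cov_b1_b2)
z = stats.norm.ppf(0.975)

print(f"difference: {diff:.2f}, 95% CI: [{diff - z * se_diff:.2f}, {diff + z * se_diff:.2f}]")
# Discussing these bounds says far more than a chi-square test of the
# implausible null that the two coefficients are exactly equal.
```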
