Error Correction and Reproducibility in Science

 
Severe Testing: The Key to Error Correction
Deborah G Mayo
Virginia Tech
March 17, 2017
“Understanding Reproducibility and Error Correction in Science”
 
Statistical Crisis of Replication
 
O Statistical ‘findings’ disappear when others look for them.
O Beyond the social sciences to genomics, bioinformatics, and medicine (Big Data)
O Methodological reforms (some welcome, others radical)
O Need to understand philosophical, statistical, historical issues
 
American Statistical Association (ASA): Statement on P-values

The statistical community has been deeply concerned about issues of reproducibility and replicability of scientific conclusions. … much confusion and even doubt about the validity of science is arising. Such doubt can lead to radical choices such as to ban p-values (ASA, Wasserstein & Lazar 2016, p. 129)
 
I was a philosophical observer at the ASA P-value pow wow
 
Don’t throw out the error control baby with the bad statistics bathwater
The American Statistician
 
O The most used methods are most criticized
O Statistical significance tests are a small part of a rich set of “techniques for systematically appraising and bounding the probabilities … of seriously misleading interpretations of data” (Birnbaum 1970, p. 1033)
O These I call error statistical methods (or sampling theory)
 
Error Statistics

O Statistics: Collection, modeling, drawing inferences from data to claims about aspects of processes
O The inference may be in error
O It’s qualified by a claim about the method’s capabilities to control and alert us to erroneous interpretations (error probabilities)
 
p-value: to test the conformity of the particular data under analysis with H0 in some respect: we find a function T = t(y) of the data, to be called the test statistic, such that the larger the value of T the more inconsistent are the data with H0; The random variable T = t(Y) has a (numerically) known probability distribution when H0 is true. …the p-value corresponding to any t_obs as p = p(t) = Pr(T ≥ t_obs; H0)
(Mayo and Cox 2006, p. 81)
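To make the definition concrete, here is a minimal sketch (my own illustration, not from the slides) for a one-sided Normal test of H0: μ = 0 with σ known, where T = √n·M/σ is standard Normal under H0:

```python
from math import erf, sqrt

def p_value(sample_mean, sigma, n):
    """One-sided p-value for H0: mu = 0 vs. H1: mu > 0, sigma known.

    t_obs = sqrt(n) * sample_mean / sigma is standard Normal under H0,
    so p = Pr(T >= t_obs; H0) is the upper tail of the Normal curve.
    """
    t_obs = sqrt(n) * sample_mean / sigma
    # Standard Normal survival function, written via the error function
    return 1 - 0.5 * (1 + erf(t_obs / sqrt(2)))

# A mean of 0.4 with sigma = 1, n = 25 gives t_obs = 2.0 and p ≈ 0.023
print(p_value(0.4, 1.0, 25))
```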
 
 
Testing Reasoning

O If even larger differences than t_obs occur fairly frequently under H0 (i.e., P-value is not small), there’s scarcely evidence of incompatibility with H0
O Small P-value indicates some underlying discrepancy from H0 because very probably you would have seen a less impressive difference than t_obs were H0 true.
O This indication isn’t evidence of a genuine statistical effect H, let alone a scientific conclusion H*

Stat-Sub fallacy: H => H*
 
O I’m not keen to defend many uses of significance tests long lampooned
O I introduce a reformulation of tests in terms of discrepancies (effect sizes) that are and are not severely-tested
O The criticisms are often based on misunderstandings; consequently so are many “reforms”
 
Replication Paradox (for Significance Test Critics)

Critic: It’s much too easy to get a small P-value
You: Why do they find it so difficult to replicate the small P-values others found?

Is it easy or is it hard?
 
Only 36 of 100 psychology experiments yielded small P-values in the Open Science Collaboration on replication in psychology

OSC: Reproducibility Project: Psychology: 2011-15 (Science 2015): Crowd-sourced effort to replicate 100 articles (led by Brian Nosek, U. VA)
 
O R.A. Fisher: it’s easy to lie with statistics by selective reporting, not the test’s fault
O Sufficient finagling (cherry-picking, P-hacking, significance seeking, multiple testing, look elsewhere) may practically guarantee a preferred claim H gets support, even if it’s unwarranted by evidence (biasing selection effects, need to adjust P-values)

Note: Support for some preferred claim H is by rejecting a null hypothesis
O H hasn’t passed a severe test
 
Severity Requirement: If the test procedure had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H (“too cheap to be worth having”, Popper)

O Such a test fails a minimal requirement for a stringent or severe test
O My account: severe testing based on error statistics (requires reinterpreting tests)
 
This alters the role of probability: typically just 2

O Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0 (e.g., Bayesian, likelihoodist), with regard for inner coherency
O Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson)
 
What happened to using probability to assess the error probing capacity by the severity requirement?

O Neither “probabilism” nor “performance” directly captures it
O Good long-run performance is a necessary, not a sufficient, condition for severity
 
O Problems with selective reporting, cherry picking, stopping when the data look good, P-hacking, are not problems about long-runs
O It’s that we cannot say the case at hand has done a good job of avoiding the sources of misinterpreting data

Key to revising the role of error probabilities
 
A claim C is not warranted _______

O Probabilism: unless C is true or probable (gets a probability boost, is made comparatively firmer)
O Performance: unless it stems from a method with low long-run error
O Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C
 
O If you assume probabilism is required for inference, error probabilities are relevant for inference only by misinterpretation. False!
O I claim, error probabilities play a crucial role in appraising well-testedness
O It’s crucial to be able to say, C is highly believable or plausible but poorly tested
 
Biasing selection effects:

O One function of severity is to identify problematic selection effects (not all are)
O Biasing selection effects: when data or hypotheses are selected or generated (or a test criterion is specified) in such a way that the minimal severity requirement is violated, seriously altered, or incapable of being assessed
O Picking up on these alterations is precisely what enables error statistics to be self-correcting
 
Nominal vs actual significance levels

Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ … The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, p. 104)

From Morrison & Henkel’s Significance Test Controversy (1970!)
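Selvin’s 64 percent is easy to check under the simplest reading of the example (an assumption on my part: twenty independent tests, each at the 5 percent level, with all nulls true):

```python
# Chance that at least one of 20 independent tests comes out
# "significant at the 5 percent level" when every null is true
actual_level = 1 - 0.95 ** 20
print(round(actual_level, 2))  # 0.64 -- not the nominal 0.05
```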
 
 
 
O Morrison and Henkel were clear on the fallacy: blurring the computed or nominal significance level and the actual level
O There are many more ways you can be wrong with hunting (different sample space)
 
Spurious P-Value

You report: Such results would be difficult to achieve under the assumption of H0
When in fact such results are common under the assumption of H0

(Formally):
O You say Pr(P-value ≤ P_obs; H0) ~ P_obs, small
O But in fact Pr(P-value ≤ P_obs; H0) = high
 
Scapegoating

O Nowadays, we’re likely to see the tests blamed
O My view: Tests don’t kill inferences, people do
O Even worse are those statistical accounts where the abuse vanishes!
 
On some views, taking account of biasing selection effects defies scientific sense

Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of objectivity that is often made for the P-value (Goodman 1999, p. 1010)

(To his credit, he’s open about this; heads the Meta-Research Innovation Center at Stanford)
 
Technical activism isn’t free of philosophy

Ben Goldacre (of Bad Science), in a 2016 Nature article, is puzzled that bad statistical practices continue even in the face of the new “technical activism”:

The editors at Annals of Internal Medicine repeatedly (but confusedly) argue that it is acceptable to identify prespecified outcomes [from results] produced after a trial began; …they say that their expertise allows them to permit and even solicit undeclared outcome-switching
 
His paper: “Make journals report clinical trials properly”

O He shouldn’t close his eyes to the possibility that some of the pushback he’s seeing has a basis in statistical philosophy!
 
Likelihood Principle (LP)

The vanishing act links to a pivotal disagreement in the philosophy of statistics battles

In probabilisms, the import of the data is via the ratios of likelihoods of hypotheses:
P(x0; H1)/P(x0; H0)
The data x0 are fixed, while the hypotheses vary
 
All error probabilities violate the LP (even without selection effects):

Sampling distributions, significance levels, power, all depend on something more [than the likelihood function], something that is irrelevant in Bayesian inference, namely the sample space (Lindley 1971, p. 436)

The LP implies the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects (Rosenkrantz 1977, p. 122)
 
Paradox of Optional Stopping:

Error probing capacities are altered not just by cherry picking and data dredging, but also via data dependent stopping rules:

Xi ~ N(μ, σ²), 2-sided H0: μ = 0 vs. H1: μ ≠ 0.

Instead of fixing the sample size n in advance, in some tests, n is determined by a stopping rule:
 
Trying and trying again

O Keep sampling until H0 is rejected at the 0.05 level, i.e., keep sampling until |M| ≥ 1.96σ/√n
O Trying and trying again: Having failed to rack up a 1.96-standard-error difference after 10 trials, go to 20, 30 and so on until obtaining a 1.96-standard-error difference
 
Nominal vs. Actual significance levels again:

O With n fixed the Type 1 error probability is 0.05
O With this stopping rule the actual significance level differs from, and will be greater than, 0.05
O Violates Cox and Hinkley’s (1974) “weak repeated sampling principle”
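The inflation is easy to exhibit by simulation; a minimal sketch (details assumed: standard Normal draws with H0 true, peeking after every observation up to a cap on n):

```python
import random
from math import sqrt

def rejects_under_optional_stopping(max_n=1000):
    """Draw from N(0,1) (so H0: mu = 0 is true) and test after every
    observation; stop and 'reject' as soon as |M| >= 1.96/sqrt(n)."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += random.gauss(0.0, 1.0)
        if abs(total / n) >= 1.96 / sqrt(n):
            return True
    return False

trials = 2000
rate = sum(rejects_under_optional_stopping() for _ in range(trials)) / trials
# With n fixed the rejection rate would be about 0.05; with
# try-and-try-again it is several times larger (and tends to 1
# as the cap on n grows)
print(rate)
```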
 
 
O The ASA (p. 131) correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values leads to spurious p-values” (Principle 4)
O They don’t mention that the same p-hacked hypothesis can occur in Bayes factors, credibility intervals, likelihood ratios
 
With One Big Difference:

O “The direct grounds to criticize inferences as flouting error statistical control is lost
O They condition on the actual data,
O Error probabilities take into account other outcomes that could have occurred but did not (sampling distribution)”
 
How might probabilists block intuitively unwarranted inferences (without error probabilities)?

A subjective Bayesian might say:

If our beliefs were mixed into the interpretation of the evidence, we wouldn’t declare there’s statistical evidence of some unbelievable claim (distinguishing shades of grey and being politically moderate, ovulation and voting preferences)
 
Rescued by beliefs

O That could work in some cases (it still wouldn’t show what researchers had done wrong): battle of beliefs
O Besides, researchers sincerely believe their hypotheses
O So now you’ve got two sources of flexibility, priors and biasing selection effects
 
No help with our most important problem

O How to distinguish the warrant for a single hypothesis H with different methods (e.g., one has biasing selection effects, another, pre-registered results and precautions)?
 
Most Bayesians use default priors

O Eliciting subjective priors is too difficult; scientists are reluctant to allow subjective beliefs to overshadow data
O Default, or reference, priors are supposed to prevent prior beliefs from influencing the posteriors (O-Bayesians, 2006)
 
O The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Default priors may not even be probabilities… (Cox and Mayo 2010, p. 299)
O Default Bayesian reforms are touted as free of selection effects
O “…Bayes factors can be used in the complete absence of a sampling plan…” (Bayarri, Benjamin, Berger, Sellke 2016, p. 100)
 
Granted, some are prepared to abandon the LP for model testing

In an attempted meeting of the minds, Andrew Gelman and Cosma Shalizi say:

O “[C]rucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense”, which might be seen as using modern statistics to implement the Popperian criteria of severe tests (2013, p. 10)
O An open question
 
The ASA doc highlights classic foibles that block replication

In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result (Fisher 1935, p. 14)

“isolated” low P-value ≠> H: statistical effect
 
Statistical ≠> substantive (H ≠> H*)

[A]ccording to Fisher, rejecting the null hypothesis is not equivalent to accepting the efficacy of the cause in question. The latter…requires obtaining more significant results when the experiment, or an improvement of it, is repeated at other laboratories or under other conditions (Gigerenzer 1989, pp. 95-6)
 
The Problem is with so-called NHST (“null hypothesis significance testing”)

O NHSTs supposedly allow moving from statistical to substantive hypotheses
O If defined that way, they exist only as abuses of tests
O ASA doc ignores Neyman-Pearson (N-P) tests
 
Neyman-Pearson (N-P) tests:

Null and alternative hypotheses H0, H1 that are exhaustive:
H0: μ ≤ 12 vs H1: μ > 12

O So this fallacy of rejection (H => H*) is impossible
O Rejecting the null only indicates statistical alternatives (how discrepant from null)
 
P-values Don’t Report Effect Sizes (Principle 5)

Who ever said to just report a P-value?

O Tests should be accompanied by interpretive tools that avoid the fallacies of rejection and non-rejection. These correctives can be articulated in either Fisherian or Neyman-Pearson terms (Mayo and Cox 2006, Mayo and Spanos 2006)
 
To Avoid Inferring a Discrepancy Beyond What’s Warranted: large n problem.

O Severity tells us: an α-significant difference is indicative of less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)
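A quick numerical gloss on the large-n point (a sketch, with σ = 1 assumed): the observed mean a just-significant result must exceed shrinks like 1/√n, so the discrepancy such a result can warrant shrinks with it.

```python
from math import sqrt

sigma, z_alpha = 1.0, 1.96  # sigma assumed known; one-sided cutoff
for n in (25, 100, 10_000):
    cutoff = z_alpha * sigma / sqrt(n)
    print(f"n = {n:6d}: a just-significant mean is only {cutoff:.4f}")
# n = 25 -> 0.3920, n = 100 -> 0.1960, n = 10000 -> 0.0196:
# the same alpha-significant difference indicates less discrepancy
# as the sample size grows
```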
 
 
O What’s more indicative of a large effect (fire): a fire alarm that goes off with burnt toast, or one so insensitive that it doesn’t go off unless the house is fully ablaze?
O [The larger sample size is like the one that goes off with burnt toast]
 
What About Fallacies of Non-Significant Results?

O They don’t warrant 0 discrepancy
O Use the same severity reasoning to rule out discrepancies that very probably would have resulted in a larger difference than observed: set upper bounds
O If you very probably would have observed a more impressive (smaller) p-value than you did, if μ > μ1 (μ1 = μ0 + γ), then the data are good evidence that μ < μ1
O Akin to power analysis (Cohen, Neyman) but sensitive to x0
 
 
O There’s another kind of fallacy behind a move that’s supposed to improve replication, but it confuses the notions from significance testing and leads to “Most findings are false”
O Fake replication crisis.
 
Diagnostic Screening Model of Tests: urn of nulls (most findings are false)

O If we imagine randomly selecting a hypothesis from an urn of nulls, 90% of which are true
O Consider just 2 possibilities: H0: no effect, H1: meaningful effect, all else ignored
O Take the prevalence of 90% as Pr(H0 you picked) = .9, Pr(H1) = .1
O Rejecting H0 with a single (just) .05 significant result, cherry-picking to boot
 
 
The unsurprising result is that most “findings” are false: Pr(H0 | findings with a P-value of .05) > .5

Pr(H0 | findings with a P-value of .05) ≠ Pr(P-value of .05 | H0)

Only the second one is a Type 1 error probability.
Major source of confusion….
(Berger and Sellke 1987, Ioannidis 2005, Colquhoun 2014)

O A: Announce a finding (a P-value of .05)
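The arithmetic behind the screening model is just Bayes’ theorem; a sketch follows (the power figure is my assumption, since the slide leaves it open; lower power, or cherry-picking, pushes the result higher):

```python
def prob_null_given_rejection(prev_null=0.9, alpha=0.05, power=0.2):
    """Pr(H0 | H0 rejected) under the urn-of-nulls screening model.

    prev_null: Pr(H0) for a randomly drawn hypothesis (here .9)
    alpha:     Pr(reject; H0), the Type 1 error probability
    power:     Pr(reject; H1), assumed low here
    """
    p_reject = prev_null * alpha + (1 - prev_null) * power
    return prev_null * alpha / p_reject

# With 90% true nulls, alpha = .05 and an assumed power of .2:
print(prob_null_given_rejection())  # ~0.69 -- "most findings are false"
```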
 
 
 
 
 
O Not properly Bayesian (not even empirical Bayes), not properly frequentist
O Where does the high prevalence come from?
 
 
Concluding Remark

O If replication research and reforms are to lead to error correction, they must correct errors: they don’t always do that
O They do when they encourage preregistration, control error probabilities & require good design (RCTs, checking model assumptions)
O They don’t when they permit tools that lack error control
 
Don’t Throw Out the Error Control Baby

O Main source of hand-wringing behind the statistical crisis in science stems from cherry-picking, hunting for significance, multiple testing
O These biasing selection effects are picked up by tools that assess error control (performance or severity)
O Reforms based on “probabilisms” enable rather than check unreliable results due to biasing selection effects
 
Repligate

O Replication research has pushback: some call it methodological terrorism (enforcing good science or bullying?)
O My gripe is that replications, at least in social psychology, should go beyond the statistical criticism
 
Non-replications construed as simply weaker effects

O One of the non-replications: cleanliness and morality: Does unscrambling soap words make you less judgmental?

Ms. Schnall had 40 undergraduates unscramble some words. One group unscrambled words that suggested cleanliness (pure, immaculate, pristine), while the other group unscrambled neutral words. They were then presented with a number of moral dilemmas, like whether it’s cool to eat your dog after it gets run over by a car. (Chronicle of Higher Ed.)
 
…Turns out, it did. Subjects who had unscrambled clean words weren’t as harsh on the guy who chows down on his chow.” (Chronicle of Higher Education)

O Focusing on the P-values ignores larger questions of measurement in psych & the leap from the statistical to the substantive (H ≠> H*)
O Increasingly the basis for experimental philosophy; needs philosophical scrutiny
 
The ASA’s Six Principles

O (1) P-values can indicate how incompatible the data are with a specified statistical model
O (2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone
O (3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold
O (4) Proper inference requires full reporting and transparency
O (5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result
O (6) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis
 
Test T+: Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0, σ known

(FEV/SEV): If d(x) is not statistically significant, then μ ≤ M0 + kε σ/√n passes the test T+ with severity (1 − ε)

(FEV/SEV): If d(x) is statistically significant, then μ > M0 + kε σ/√n passes the test T+ with severity (1 − ε)

where P(d(X) > kε) = ε
59
 
References
 
O
Armitage, P. 1962. “Contribution to Discussion.” In 
The Foundations of Statistical
Inference: A Discussion
, edited by L. J. Savage. London: Methuen.
O
Berger, J. O. 2003. “Could Fisher, Jeffreys and Neyman Have Agreed on Testing?” and “Rejoinder”, Statistical Science 18(1): 1-12; 28-32.
O
Berger, J. O. 2006. “The Case for Objective Bayesian Analysis.” 
Bayesian Analysis
1 (3): 385–402.
O
Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the
Editor).” 
Nature
 225 (5237) (March 14): 1033.
O
Efron, B. 2013. 'A 250-Year Argument: Belief, Behavior, and the Bootstrap', 
Bulletin
of the American Mathematical Society
 50(1): 126-46.
O
Box, G. 1983. “An Apology for Ecumenism in Statistics,” in Box, G.E.P., Leonard, T.
and Wu, D. F. J.  (eds.), pp. 51-84, 
Scientific Inference, Data Analysis, and
Robustness.
 New York: Academic Press.
O
Cox, D. R. and Hinkley, D. 1974. 
Theoretical Statistics
. London: Chapman and
Hall.
O
Cox, D. R., and Deborah G. Mayo. 2010. “Objectivity and Conditionality in
Frequentist Inference.” In 
Error and Inference: Recent Exchanges on Experimental
Reasoning, Reliability, and the Objectivity and Rationality of Science
, edited by
Deborah G. Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University
Press.
O
Fisher, R. A. 1935. 
The Design of Experiments
. Edinburgh: Oliver and Boyd.
O
Fisher, R. A. 1955. “Statistical Methods and Scientific Induction.” 
Journal of the
Royal Statistical Society, Series B (Methodological)
 17 (1) (January 1): 69–78.
 
 
O
Gelman, A. and Shalizi, C. 2013. 'Philosophy and the Practice of Bayesian
Statistics' and 'Rejoinder', 
British Journal of Mathematical and Statistical
Psychology
 66(1): 8–38; 76-80.
O
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., and Krüger, L. 1989. The Empire of Chance. Cambridge: Cambridge University Press.
O
Goldacre, B. 2008. 
Bad Science
. HarperCollins Publishers.
O
Goldacre, B. 2016. “Make journals report clinical trials properly”, 
Nature
530(7588);online 02Feb2016.
O
Goodman, S. N. 1999. “Toward evidence-based medical statistics. 2: The Bayes factor”, Annals of Internal Medicine 130: 1005–1013.
O
Lindley, D. V. 1971. “The Estimation of Many Parameters.” In 
Foundations of
Statistical Inference
, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto:
Holt, Rinehart and Winston.
O
Mayo, D. G. 1996. 
Error and the Growth of Experimental Knowledge
. Science and
Its Conceptual Foundation. Chicago: University of Chicago Press.
O
Mayo, D. G. and Cox, D. R. (2010). "Frequentist Statistics as a Theory of Inductive
Inference" in 
Error and Inference: Recent Exchanges on Experimental Reasoning,
Reliability and the Objectivity and Rationality of Science
 (D Mayo and A. Spanos
eds.), Cambridge: Cambridge University Press: 1-27. This paper appeared in 
The
Second Erich L. Lehmann Symposium: Optimality
, 2006, Lecture Notes-
Monograph Series, Volume 49, Institute of Mathematical Statistics, pp. 247-275.
O
Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a
Neyman–Pearson Philosophy of Induction.” 
British Journal for the Philosophy of
Science
 57 (2) (June 1): 323–357.
 
 
 
 
 
O
Mayo, D. G., and A. Spanos.  2011. “Error Statistics.” In 
Philosophy of Statistics
,
edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook
of the Philosophy of Science. The Netherlands: Elsevier.
O
Meehl, P. E., and N. G. Waller. 2002. “The Path Analysis Controversy: A New
Statistical Approach to Strong Appraisal of Verisimilitude.” 
Psychological Methods
 7
(3): 283–300.
O
Morrison, D. E., and R. E. Henkel, ed. 1970. 
The Significance Test Controversy: A
Reader
. Chicago: Aldine De Gruyter.
O
Pearson, E. S. & Neyman, J. (1930). On the problem of two samples. 
Joint Statistical
Papers
 by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif. Press). First
published in 
Bul. Acad. Pol.Sci. 
73-96.
O
Rosenkrantz, R. 1977. 
Inference, Method and Decision: Towards a Bayesian
Philosophy of Science. 
Dordrecht, The Netherlands: D. Reidel.
O
Savage, L. J. 1962. 
The Foundations of Statistical Inference: A Discussion
. London:
Methuen.
O
Selvin, H. 1970. “A critique of tests of significance in survey research. In 
The
significance test controversy
, edited by D. Morrison and R. Henkel, 94-106. Chicago:
Aldine De Gruyter.
O
Simonsohn, U. 2013, "Just Post It: The Lesson From Two Cases of Fabricated Data
Detected by Statistics Alone", 
Psychological Science, 
vol. 24, no. 10, pp. 1875-1888.
O
Trafimow D. and Marks, M. 2015. “Editorial”, 
Basic and Applied Social Psychology
37(1): pp. 1-2.
O
Wasserstein, R. and Lazar, N. 2016. “The ASA’s statement on p-values: context, process, and purpose”, The American Statistician 70(2): 129-133.
 
 
 
Abstract
 
If a statistical methodology is to be adequate, it needs to register how “questionable research practices” (QRPs) alter a method’s error probing capacities. If little has been done to rule out flaws in taking data as evidence for a claim, then that claim has not passed a stringent or severe test. The goal of severe testing is the linchpin for (re)interpreting frequentist methods so as to avoid long-standing fallacies at the heart of today’s statistics wars. A contrasting philosophy views statistical inference in terms of posterior probabilities in hypotheses: probabilism. Presupposing probabilism, critics mistakenly argue that significance and confidence levels are misinterpreted, exaggerate evidence, or are irrelevant for inference. Recommended replacements–Bayesian updating, Bayes factors, likelihood ratios–fail to control severity.
 