iedorep: Quickly locate reproducibility failures in Stata code

A DIME Analytics Stata command to detect reproducibility issues
Stata Conference 2023
Presented by Benjamin Daniels

Motivation

Journal-led requirements for reproducible code submission
Manually checked by editorial staff after conditional acceptance

https://aeadataeditor.github.io
 
Motivation

“High stakes”: issues with reproducibility at this stage can result in changed results, long slowdowns, or loss of publication after large investments

https://social-science-data-editors.github.io/guidance/Verification_guidance.html

DIME Analytics: Guides to work reproducibly

Development Research in Practice outlines workflows and checklists for writing reproducible and readable code during a project

DIME Analytics: Guides to work reproducibly

DIME Standards provides checklists and materials to prepare for third-party reproducibility checks
AEA-compliant results from DIME

https://github.com/worldbank/dime-standards/tree/master/dime-research-standards/pillar-3-research-reproducibility

Three main outputs with each publication

Research outputs: publish via the journal
Data package: index or publish (WB Microdata)
Reproducibility package: GitHub + Zenodo

https://worldbank.github.io/dime-data-handbook/

Reproducibility package tested by publisher

Development Research in Practice outlines components of final research products
Separate from the archives of work held by the team!

Focus here on code package

Running code “fresh” in Stata is a challenge: user-written packages, Stata versioning, directory setup, etc.

Basic layout of submission and replication pack

Journal receives manuscript, figures and tables, and (zipped) replication package
Full folder structure
May or may not include data files

root (zip)
  README.md
  LICENSE
  code/
    runfile.do
    makedata.do
    exhibits.do
  data/
    raw/
    constructed/  [EMPTY]
  output/
    figures/
    tables/
  manuscript

Today: Tools to quickly check code ex-post

We support peer review and do working paper reproducibility checks
But shouldn’t “basic reproducibility” be easy to check by yourself?

https://worldbank.github.io/dime-data-handbook/
 
Introducing iedorep

VERIFY REPRODUCIBILITY OF STATA CODE INSTANTLY

Introducing iedorep

Detects non-reproducible Stata code instantly
Reports type of issue and line number after second run
Under development to manage projects and sub-do-files

. ssc install ietoolkit
. help iedorep

Why not do this manually?

We could, say, just run code twice using our runfile
Compare the state of outputs using SHA checksums, by eye, or using Git

[Diagram: the replication package folder tree, with the runfile driving the code runs]

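As a minimal sketch of that manual approach, assuming a single output file and illustrative paths (neither is part of any tool):

* Manual two-run comparison; paths and filenames are illustrative.
do "code/runfile.do"                      // first run
checksum "output/tables/table1.tex"
local hash1 = r(checksum)

do "code/runfile.do"                      // second run
checksum "output/tables/table1.tex"
local hash2 = r(checksum)

if `hash1' != `hash2' display as error "table1.tex differs between runs"
else display as text "table1.tex is stable across runs"

One such comparison is needed per output file, which is exactly the tedium that motivates automating the check.
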
Human eyes can only check outputs of code

Time consuming
Can make mistakes
“Existence” check: showing there is a problem doesn’t locate the problem

[Diagram: the same folder tree and code runs as above]

Code tool can monitor code execution

Run code once: watch everything that the code does
Run code again: see if anything is different, and where

[Diagram: the same folder tree, with iedorep monitoring the code runs]

Fast; easy; automatic

No human errors
Easy-to-read report
Location of all possible errors identified in two code runs

[Diagram: the same folder tree, with iedorep monitoring the code runs]

After each code line*, iedorep checks:

Data is in the same state both times
RNG seed is in the same state both times
Sort seed is in the same state both times

*Excluding loops, logic, and sub-files. Updates under active development!

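All three pieces of state can be inspected directly in Stata; the snippet below is a rough sketch of the kind of snapshot being compared, not iedorep’s actual implementation:

* Rough sketch of a per-line state snapshot (illustration only).
sysuse auto, clear
generate draw = runiform()       // unseeded draw: RNG state will differ across runs

datasignature                    // checksum of the dataset in memory
display "`r(datasignature)'"     // data state
display c(sortseed)              // sort-seed state
local rng "`c(rngstate)'"        // RNG state (a long string; compare across runs)
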
iedorep does not:

Check that outputs appear and are identical (use Git / version control)
Check that initial data is unchanged (use iecodebook)
Check that packages are installed appropriately (use ieboilstart)

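For that last gap, ieboilstart’s documented usage pattern is a two-step call, where the returned settings command must be executed immediately (the version number here is illustrative):

. ieboilstart, version(13.1)
. `r(version)'
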
Basic logic of reproducibility

If code state is stable each time the code is run, code outputs should* also be stable

*Exceptions in rare cases
 
How to use iedorep

CHECKING REPRODUCIBILITY OF STATA CODE

Currently, use interactively

In the Command window, type:
iedorep "[filepath]"
The file is run twice, then a report is printed to the Results window in Stata*

*Upcoming: Output a Git-compatible report in Markdown

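For example, from the root of a replication package laid out as above (the filepath is illustrative):

. cd "my-replication-package"
. iedorep "code/runfile.do"
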
Every line independently evaluated

Causes of errors can be far from the outputs they affect
iedorep flags errors at their source in Stata code (although it can’t fix them)

Flags: Unstable, No seeding, Sub-do-file, Non-unique (two are illustrated below)

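As a hedged illustration of two of these flags, the snippet below pairs an unseeded random draw ("No seeding") and a sort on a non-unique key ("Non-unique") with stable alternatives; the dataset and seed are illustrative:

sysuse auto, clear
sort rep78                    // "Non-unique": ties broken arbitrarily each run
generate draw = runiform()    // "No seeding": depends on ambient RNG state

* Stable alternatives:
set seed 20230720             // pin the RNG state
sort rep78 make               // make is unique in auto, so row order is stable
generate draw2 = runiform()
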
Verbosity options allow more details

“Errors”* are reported only if the state changes between runs.
Verbosity options allow detection of all potential error sites for review.

* Considering changing this name. Any ideas?

Run once for each file

When the “Subfile” flag is on, you can re-run targeting the sub-file
However, Stata state can’t be guaranteed identical unless the file is separable*

*Fixes under active development!
 
How to write code for iedorep

CODE EXAMPLES, STYLE GUIDE, AND GENERAL ADVICE

“Modular” coding

Fully separate data creation and analysis tasks.
Write “separable” chunks within a file.
Avoid cascading errors!

“Modular” coding

Avoid sub-do-file dependencies
Save and load intermediate outputs and datasets, as in the sketch below

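A minimal sketch of separable chunks, assuming hypothetical paths and variable names (consent, outcome, treatment):

* Chunk 1 (data construction): reads raw data, writes constructed data.
use "data/raw/survey.dta", clear
keep if consent == 1
save "data/constructed/analysis.dta", replace

* Chunk 2 (analysis): starts from disk, not from whatever is in memory.
use "data/constructed/analysis.dta", clear
regress outcome treatment
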
Sub-do-file dependency

For example, a sub-do-file must use its own data and locals (see the sketch below)
If not, the second run will clear Stata state and run completely differently*

*Fixes under active development!
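A hedged sketch of that pattern, with all file names, variables, and locals illustrative: the sub-do-file loads its own data and defines its own locals instead of inheriting them from the caller.

* main.do, fragile version (commented out): table1.do would break if the
* caller's data and locals were not already in memory:
*   use "data/constructed/analysis.dta", clear
*   local controls "age income"
*   do "code/table1.do"

* table1.do, self-contained version:
use "data/constructed/analysis.dta", clear   // loads its own data
local controls "age income"                  // defines its own locals
regress outcome treatment `controls'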
 
THANK YOU!

iedorep: Quickly locate reproducibility failures in Stata code
A DIME Analytics Stata command to detect reproducibility issues
Stata Conference 2023
Presented by Benjamin Daniels

Thank you!