Importance of Data Reuse and Transparency in the Data Lifecycle

 
Data Reuse and Transparency in the Data Lifecycle
 
Steven Worley
Doug Schuster
National Center for Atmospheric Research
Boulder, CO USA
 
Topics
 
Data Reuse and Transparency
What are these data features?
Why are they important?
Archiving practices
Access practices
 
 
 
 
 
 
 
EGU, 23-27 April, Vienna, Austria
What are these data features?
 
Data reuse implies:
Expanding usage beyond intended primary community
Maintaining reference datasets and building many products from them
Data transparency implies:
Reproducibility – ability to reproduce data files or products for users
Traceability – tagging and preserving access details
 
Why are Reuse and Transparency Important?
 
Data centers/providers are expected to support fact-based outcomes:
Traditionally for science/research
Now also for policy makers, community leaders, individual citizens, and commercial interests.
 
Supporting New Reuse and Transparency
Decisions by policy makers
Traceable open access sources
Actions by community leaders
Planning for societal services
Emergencies, water, energy, etc.
Usage by citizens and educators
Inquisitive science, family activities, safety
Science learning
Collaborative commercial applications
Tighter coupling between engineering and science
Weather (Wx) forecasts for wind energy production
Energy companies contribute mesoscale observations
 
Archiving practices
Curation that assures data authenticity
Preserve original data formats to the maximum extent possible.
Maintaining 100% content and accuracy – serious challenge
Use a “rich” metadata standard
 A local standard?
 Generate discipline and cross-discipline standards
 E.g. ISO, DIF, etc.
Create multiple copies
Data files, metadata, documentation, and software
Disaster recovery – not a secondary concern
Archiving practices
Collection completeness and integrity
Closely monitor data workflow
Account for every file
Read every file
Gather, check, preserve metadata
Compute and preserve file checksums
Maintain dataset lineage / provenance
Use approved processes to delete datasets (never?)
Establish tiered “level of service” for data
Move old / superseded versions to lower level
Keep all metadata on the highest tier – discoverable!
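The "account for every file", "read every file", and "compute and preserve file checksums" practices above can be sketched as a manifest check. This is a minimal illustration, not the archive's actual tooling: the function names, the choice of SHA-256, and the manifest layout (relative path mapped to hex digest) are all assumptions.

```python
import hashlib
import os


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Read every byte of a file and return its hex digest."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Account for every file under root: relative path -> checksum."""
    manifest = {}
    for directory, _subdirs, filenames in os.walk(root):
        for name in sorted(filenames):
            full_path = os.path.join(directory, name)
            manifest[os.path.relpath(full_path, root)] = file_checksum(full_path)
    return manifest


def verify_manifest(root, manifest):
    """Compare the current archive contents against a preserved manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "unexpected": sorted(set(current) - set(manifest)),
        "changed": sorted(
            path for path in manifest
            if path in current and current[path] != manifest[path]
        ),
    }
```

Storing the manifest alongside the metadata, and re-running the verification after every copy or migration, is one way to detect silent loss or corruption.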
Archiving practices
Explicit data version tracking
Sometimes, internal to files
Always, within data management system
Include notations in all documentation
Establish Digital Object Identifiers (DOIs)
Two-way linkage between publications and data
Promotes easy path for follow-on research from publications
Leverages skills / facilities of libraries – richer knowledge base
Create data family tree connections
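Version tracking, two-way publication links, and family tree connections can be pictured as one small catalog structure. The sketch below is hypothetical: the `DatasetRecord` fields, the identifier strings, and the version numbers are placeholders chosen for illustration, not real DOIs or the schema of any actual data management system.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    dataset_id: str   # immutable identifier such as a DOI (placeholders here)
    title: str
    version: str      # explicit version kept in the management system
    parents: list = field(default_factory=list)       # IDs of source datasets
    publications: list = field(default_factory=list)  # IDs of citing publications


def ancestors(catalog, dataset_id):
    """Return all upstream dataset IDs, nearest first, without duplicates."""
    found = []
    queue = list(catalog[dataset_id].parents)
    while queue:
        parent_id = queue.pop(0)
        if parent_id not in found:
            found.append(parent_id)
            queue.extend(catalog[parent_id].parents)
    return found


def datasets_cited_by(catalog, publication_id):
    """Reverse lookup: publication -> datasets (the other half of the link)."""
    return sorted(
        record.dataset_id
        for record in catalog.values()
        if publication_id in record.publications
    )


def example_catalog():
    """Three-generation tree with invented IDs and versions."""
    records = [
        DatasetRecord("example:obs", "Marine surface observations", "1.0"),
        DatasetRecord("example:sst", "Gridded SST analysis", "2.0",
                      parents=["example:obs"], publications=["example:paper-1"]),
        DatasetRecord("example:reanalysis", "Atmospheric reanalysis", "1.1",
                      parents=["example:sst"], publications=["example:paper-1"]),
    ]
    return {record.dataset_id: record for record in records}
```

With this layout, `ancestors(example_catalog(), "example:reanalysis")` walks from grandchild to parent, and `datasets_cited_by` answers the question a reader of a publication would ask.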
Dataset Family Tree Example
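[Figure: dataset family tree, groups as labeled on the slide]
International Comprehensive Ocean Atmosphere Data Set (ICOADS): global marine surface observations (1662-2011)
NOAA ERSST (1854-2011), NOAA OI SST (1981-2011), HadISST (1871-2011), HadSLP (1871-2011), JMA SST (1871-2011), Ocean Clouds (1900-2010), WASwind (1950-2009), NOC Surf. Flux (1973-2009), etc.
Global and regional atmospheric and ocean re-analyses: NCEP/NCAR, NARR, ERA-40, ERA-Interim, 20CR, OARCA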
Dataset Family Tree - Evolution
[Figure: two family tree diagrams, "Data Center Centric" and "Web Centric", each with Parent, Child, and Grand Child nodes]
Challenges:
System of immutable IDs, DOIs?
Multi-institution preservation commitment
Transparency across institutions, accepted standards/governance
Promote discovery by sharing metadata, OAI-PMH
Future, knowledge-based discovery and access via ontologies within semantic web
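For the metadata-sharing challenge, an OAI-PMH harvester asks a repository for records with a plain HTTP request. The sketch below only builds such a request URL. The `ListRecords` verb, the `metadataPrefix`, `set`, and `from` arguments, and the `oai_dc` format come from the OAI-PMH protocol; the function name and the base URL in the usage note are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # optional: restrict to one collection
    if from_date:
        params["from"] = from_date    # optional: only records changed since this date
    return base_url + "?" + urlencode(params)
```

For example, `list_records_url("https://archive.example.org/oai", from_date="2012-01-01")` would request every Dublin Core record changed since the start of 2012 from a placeholder endpoint.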
Access Practices
User Identification – key to reproducibility
Record all data access transactions
Who received what and when
Log product creation constraints from GUIs and web services
Log software IDs used for product creation
Benefits
Reproduce a data access process
Feedback to users about data changes
Usage metrics indicate how to improve access
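The logging practices above can be sketched as an append-only transaction log. This is an illustrative outline under stated assumptions: the field names, the JSON-lines layout, and the helper functions are invented here and do not describe the actual system behind the slides.

```python
import json
from datetime import datetime, timezone


def record_access(log, user_id, dataset_id, dataset_version, files,
                  constraints, software_id, when=None):
    """Append one access transaction: who received what and when."""
    entry = {
        "user_id": user_id,
        "dataset_id": dataset_id,
        "dataset_version": dataset_version,
        "files": sorted(files),
        "constraints": constraints,   # e.g. subsetting choices from a GUI or web service
        "software_id": software_id,   # software that created the product
        "timestamp": (when or datetime.now(timezone.utc)).isoformat(),
    }
    log.append(json.dumps(entry, sort_keys=True))
    return entry


def users_who_received(log, dataset_id, dataset_version):
    """Users to notify when a given dataset version changes."""
    users = set()
    for line in log:
        entry = json.loads(line)
        if (entry["dataset_id"], entry["dataset_version"]) == (dataset_id, dataset_version):
            users.add(entry["user_id"])
    return sorted(users)


def replay_request(log, index):
    """Return the parameters needed to repeat a logged request."""
    entry = json.loads(log[index])
    return {key: entry[key] for key in
            ("dataset_id", "dataset_version", "constraints", "software_id")}
```

Because each entry keeps the dataset version, the constraints, and the software ID together, the same log serves reproducibility (replay a request), feedback (find affected users), and usage metrics.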
Metrics Example
Climate Forecast System Reanalysis (CFSR): 6-hourly, GRIB2, 1979-2011, 75 TB, 28K fields per time step, 168K files
63% of users are non-US
Now exporting 25+ TB monthly
Subsetting, now ~500 requests/month
Track user activity: who accessed what and when
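The figures quoted for this dataset allow a rough back-of-envelope check of its scale. The sketch below assumes decimal terabytes (10^12 bytes) and complete calendar-year coverage from 1979 through 2011; neither assumption is stated on the slide, so the results are order-of-magnitude estimates only.

```python
from datetime import date

TB = 10**12  # assumption: decimal terabytes
MB = 10**6


def mean_file_size_mb(archive_tb, file_count):
    """Average file size in MB for an archive of the given size and file count."""
    return archive_tb * TB / file_count / MB


def six_hourly_steps(first_year, last_year):
    """Number of 6-hourly time steps, assuming full calendar years."""
    days = (date(last_year + 1, 1, 1) - date(first_year, 1, 1)).days
    return days * 4


def total_fields(fields_per_step, steps):
    """Total number of fields across all time steps."""
    return fields_per_step * steps
```

Under these assumptions, `mean_file_size_mb(75, 168_000)` gives an average of roughly 446 MB per file, and `six_hourly_steps(1979, 2011)` gives 48,212 time steps, which at 28K fields per step is on the order of 1.35 billion fields.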
Conclusions
Data reuse and transparency are rapidly expanding in importance
Many “best practices” in archive management support reuse and transparency
Archive access monitoring is necessary for transparency, reproducibility, and traceability
Need significant improvement in linking data family trees and data to publications to advance reuse and transparency
Research Data Archive at NCAR
http://rda.ucar.edu/

Data reuse and transparency play a crucial role in supporting fact-based outcomes for various stakeholders, including scientists, policymakers, community leaders, and commercial interests. By enabling expanded usage beyond intended communities and ensuring reproducibility and traceability, data centers/providers can facilitate informed decisions and services for societal benefit.

  • Data Reuse
  • Transparency
  • Data Lifecycle
  • Stakeholders
  • Fact-based Outcomes

Uploaded on Oct 05, 2024



