Data Documentation Initiative (DDI) for Enhanced EOSC Applications

DDI-CDI: An Introduction for
Possible EOSC Applications
 
Outline
Introduction/Background
Examples from the Specification
Foundational Metadata: Datums and the Variable Cascade
Structural Description of Data
Process Description
Possible Applications
Application: Recognizing Similar Variables in Difficult Cases
Application: Automating Data Integration
Application: The Provenance Browser
Application: Transparency – Replication/Reproduction of Findings
Summary
Questions to Consider…
Can DDI-CDI help you solve problems you face within your EOSC area?
Would DDI-CDI allow you to do things which you are not currently
able to do?
What other standards are important and how would DDI-CDI need to
interface with them?
Are there requirements for integration of data for secondary use or
across domain boundaries? Do we have the needed
standards/metadata to address these today?
DDI-CDI and FAIR
Many people talk about Findability and Access
Not so much about Interoperability and Reuse
DDI-CDI focuses on these aspects of FAIR data
It is also quite useful for data discovery
Interoperability and reuse of data are metadata-intensive
Historically, these aspects of data management are expensive and have not been fully incentivized by research funders
Today’s focus on FAIR data demands that we do more!
Challenges
Increasingly, research is conducted across domain boundaries
“Grand Challenges”: COVID-19, climate change, resilient cities, etc.
Technology is able to scale very efficiently
Cloud computing
Big data technologies
“Data-hungry” approaches to research are becoming common
Social media as a “data source”
Machine-learning
The bottleneck is the metadata: how do we understand the data we can
find and access?
When EOSC becomes a reality, problems of scale will increase massively
FAIR data means more data-sharing than is possible today!
Background: DDI Standards
The Data Documentation Initiative (DDI) has produced metadata
standards for documenting data in the Social, Behavioural, and
Economic (SBE) sciences (also official statistics at the national level)
DDI Codebook, DDI Lifecycle
Granular, machine-actionable metadata for describing data (XML)
DDI Cross-Domain Integration (DDI-CDI) is different
Model for describing data across a wide range of domains/structures
Sensor and event data, key-value data, rectangular data, multi-dimensional data
Model for describing processing of data from one form into another
Model for showing how individual datums are used across the entire chain
Complement to traditional metadata/documentation standards
Typical Data Transformations
Raw Data → “Microdata”/Unit Record Data → Aggregate Data/Data “Cube” → Indicators
DDI-CDI describes the data at each stage, indicating the roles played by each atomic bit of data (“datum”)
DDI-CDI tracks the processing between each stage (implements PROV), reflecting the relationships between atomic datums (uses other standards for describing specific processes, e.g. SDTL)
Cross-Domain Data Sharing
Within a domain, the available data is understood by researchers/users
Familiar with the organizations and projects
Familiar with the domain/literature
Common tools and data structures
Secondary use is already hard
Demands good metadata and documentation
Cross-domain data is even harder to use
Lack of background/context
Different tools and data structures
Different semantics
Using and integrating “external” data is an expensive, manual process
As much as 80% of effort in a research project(!)
DDI-CDI and Data Integration
Data described using the DDI-CDI model comes with a richer set of context: data provenance
What were the inputs?
How were they processed?
Can findings be reproduced?
The integration of data described in DDI-CDI can be automated to a greater extent
The role played by each datum is known (identifier, measure, dimension, etc.)
The relationship between datums is explicit because the processing is described
Changes in the role played by a datum in different structures can be predicted (programmatically) and documented (for future reference)
DDI-CDI Development
DDI-CDI was developed based on real-world use cases
They were already using other standards/specifications
They involved data integration across “domain” boundaries
The goal was to fill in the missing pieces
Tie existing metadata/information together
Avoid duplication where possible
Cross-platform, technology-neutral
UML model is the center-piece
Has XML binding
Can be used with any technology binding
Currently in public review
Release in Q2 of 2021
Actively engaging across domains to validate/refine the model
What Does DDI-CDI Do?
Describes “foundational metadata” in rich detail
Concepts (and their uses)
Classifications/codelists
Variables
Datums/data points (etc.)
Describes a range of data structures
Wide data/rectangular/unit-record
Tall data/events/sensor data
Key-Value data/”big” data/No-SQL data
Multidimensional (“cube”) data/time series/indicators
Describes process
Declarative (“black box” multi-threaded)
Procedural (stepwise)
Ties to data (describing provenance chains)
What’s in the Specification? (1)
DDI-CDI is a formal UML class model
Expressed in a portable form for easy use by many UML tools
Portability is through Canonical XMI – an interchange format for UML models
Uses a limited subset of the UML class model
UML can be easily implemented across technology platforms
Auto-generation of syntax bindings
Familiar to many developers
UML “future-proofs” the specification
Emerging technology platforms can be supported
Content of model is extensible
What’s in the Specification? (2)
An XML representation is provided
Described in W3C XML Schema
Other representations (e.g., RDF, JSON) are being considered
Documentation of model and schemas
Architecture and design documentation
Includes documentation of approach to other standards
Examples of use
Including other standards (e.g., DDI C/L, Schema.org, etc.)
Alignment with Other Standards
DDI-CDI attempts to align with a number of other standards
For process description:
PROV-O (and ProvONE) – directly implements these
SDTL/VTL – provides for combined use
BPMN – formal alignment
For discovery:
Schema.org
DCAT (in progress)
For data description:
DDI Codebook, DDI Lifecycle
GSIM
SDMX/Data Cube
Others (see documentation)
Open to further work in this area!
Examples from the Model
Features of DDI-CDI to Illustrate its Potential Utility
Foundational Metadata: The Variable Cascade
Understanding the roles played by variables is critical in integration of
data
Variables do many, many different things
Not all variables are the same!
We have three levels of variables in our model:
Conceptual Variables
Represented Variables
Instance Variables
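The three levels can be pictured as a chain of references. Below is a minimal Python sketch of the cascade; the class and attribute names are simplified for illustration and are not the normative DDI-CDI classes.

```python
# Illustrative sketch of the three-level variable cascade.
# Names are simplified, not the normative DDI-CDI model.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ConceptualVariable:
    name: str
    concept: str          # the measured concept, independent of representation

@dataclass(frozen=True)
class RepresentedVariable:
    conceptual: ConceptualVariable
    value_domain: str     # e.g. a datatype or code list
    unit: Optional[str] = None

@dataclass(frozen=True)
class InstanceVariable:
    represented: RepresentedVariable
    dataset: str          # where the data actually live
    physical_datatype: str

# Hypothetical example based on the temperature variable used later in the deck
cv = ConceptualVariable("temp", "body temperature")
rv = RepresentedVariable(cv, "decimal", unit="Celsius")
iv = InstanceVariable(rv, "door_screening.csv", "float64")

# Walking up the cascade recovers the concept behind a concrete column
print(iv.represented.conceptual.concept)  # body temperature
```

Two instance variables sharing a RepresentedVariable are directly comparable; sharing only a ConceptualVariable signals that they measure the same thing in different ways.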
Variable Cascade – Conceptual Variable
Variable descriptions at a high level. Useful early in designing data collection and for broad searches. Broadly reusable.
Variable Cascade - RepresentedVariable
More specificity about value domain, units of measurement. Still reusable.
Variable Cascade - InstanceVariable
Describing collected data. Physical datatype and platform. Invariant role of the variable (e.g., a weight).
DDI-CDI and Domain Ontologies
DDI-CDI allows for the use of any concept system
It does not model the semantics of any specific domain
Rich model for classifications, codelists, and controlled vocabularies
Mechanism for referencing domain ontologies
Formal use of concepts is specified
Concepts can be variables, categories, population, units, etc.
Concept systems are modelled generically, but can be external
Semantic mapping is not supported
Semantic mapping has its own standards/mechanisms
DDI-CDI provides a framework where semantic mapping becomes more
meaningful
Data Structures
DDI-CDI currently can describe four different data structures
Wide – as with unit records
Tall – as with event or stream data
Key-value – as in a key-value store
Dimensional – as with aggregate data
Our example
Imagine that a program (Python?) collects Covid-related information at building doors:
Blood pressure (systolic, diastolic)
Position for BP (prone, sitting, standing)
Weight
Temperature
pctO2
Pulse
beenToFloridaEtc? (yes, no)
Exposed? (yes, no)
Wide Example
As a spreadsheet table (or as tab-delimited text lines):

entry  datetime          systolic  diastolic  position  weight    temp   pctO2  pulse  away  exposed
101    2020-07-14T13:54  114       70         2         83,914.6  36.44  98     70     n     n
132    2020-07-14T14:03  125       86         3         68,038.9  37.50  85     92     y     n
Tall Example
One row per datum, with the measure named explicitly:

entry  datetime          position  measure    value
101    2020-07-14T13:54  2         systolic   114
101    2020-07-14T13:54  2         diastolic  70
101    2020-07-14T13:54  2         weight     83,914.6
101    2020-07-14T13:54  2         temp       36.44
101    2020-07-14T13:54  2         pctO2      98
101    2020-07-14T13:54  2         pulse      70
101    2020-07-14T13:54  2         away       n
101    2020-07-14T13:54  2         exposed    n
132    2020-07-14T14:03  3         systolic   125
132    2020-07-14T14:03  3         diastolic  86
132    2020-07-14T14:03  3         weight     68,038.9
132    2020-07-14T14:03  3         temp       37.50
132    2020-07-14T14:03  3         pctO2      85
132    2020-07-14T14:03  3         pulse      92
132    2020-07-14T14:03  3         away       y
132    2020-07-14T14:03  3         exposed    n
Key-Value Example
The key concatenates the identifiers (entry, datetime), the position, and the variable name:

101_2020-07-14T13:54_2_systolic    114
101_2020-07-14T13:54_2_diastolic   70
101_2020-07-14T13:54_2_weight      83,914.6
101_2020-07-14T13:54_2_temp        36.44
101_2020-07-14T13:54_2_pctO2       98
101_2020-07-14T13:54_2_pulse       70
101_2020-07-14T13:54_2_away        n
101_2020-07-14T13:54_2_exposed     n
132_2020-07-14T14:03_3_systolic    125
132_2020-07-14T14:03_3_diastolic   86
132_2020-07-14T14:03_3_weight      68,038.9
132_2020-07-14T14:03_3_temp        37.50
132_2020-07-14T14:03_3_pctO2       85
132_2020-07-14T14:03_3_pulse      92
132_2020-07-14T14:03_3_away        y
132_2020-07-14T14:03_3_exposed     n
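A sketch of how the same datums land in a key-value store, using the key pattern above (only a few rows shown, for brevity):

```python
# Build key-value pairs from tall rows: the key is composed from the
# identifiers, the attribute, and the variable name, as in the slide's keys.
tall = [
    {"entry": 101, "datetime": "2020-07-14T13:54", "position": 2,
     "measure": "systolic", "value": 114},
    {"entry": 101, "datetime": "2020-07-14T13:54", "position": 2,
     "measure": "temp", "value": 36.44},
    {"entry": 132, "datetime": "2020-07-14T14:03", "position": 3,
     "measure": "systolic", "value": 125},
]

store = {
    f'{r["entry"]}_{r["datetime"]}_{r["position"]}_{r["measure"]}': r["value"]
    for r in tall
}
print(store["101_2020-07-14T13:54_2_temp"])  # 36.44
```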
Dimensional Example
Dimensions are defined by away and exposed. For each combination of dimension values there is a summary value – the mean of temp. The dimensional data can be shown in two layouts, a cross-tabulation and a tall structure:

Cross-tabulation (mean of temp):
         exposed=Y  exposed=N
away=Y   38.3       37.2
away=N   37.8       36.6

Tall structure:
away  exposed  meanTemp
Y     Y        38.3
Y     N        37.2
N     Y        37.8
N     N        36.6
Questions:
Have you traveled outside of the county in the last two weeks? (circle one)
Yes    No
Have you had contact with anyone diagnosed with Covid-19? (circle one)
Yes    No
Roles for Data Points
Roles: Identifiers, Measures, and Attributes
These roles are expressed through data structure components:
Identifiers
VariableValueComponent
AttributeComponent
VariableDescriptorComponent
The DDI-CDI Process Model
Describes the use of individual processes, and how they fit together
Supports standard descriptions (SDTL, VTL) and specific languages (SQL, R, Stata, SPSS, Python, SAS, etc.)
Three “modes”:
Procedural: Step-wise, with decision points
Declarative: “Black box” multi-threaded, uses a “playbook” and configurations
Hybrid approaches of the two
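As a loose illustration of the difference between the two modes (the step names and data structures here are invented for the sketch, not taken from the model):

```python
# Procedural: an ordered list of steps, executed in sequence.
procedural = [
    {"step": "anonymize_ids", "inputs": ["raw"],    "outputs": ["staged"]},
    {"step": "derive_6_1",    "inputs": ["staged"], "outputs": ["spec61"]},
]

# Declarative: an unordered "playbook" of functions; a process engine
# resolves the execution order from input/output dependencies.
declarative = {
    "derive_6_1":    {"inputs": ["staged"], "outputs": ["spec61"]},
    "anonymize_ids": {"inputs": ["raw"],    "outputs": ["staged"]},
}

def schedule(playbook, available):
    """Order playbook functions so each runs once its inputs exist."""
    ordered, pending = [], dict(playbook)
    while pending:
        ready = [n for n, s in pending.items()
                 if all(i in available for i in s["inputs"])]
        if not ready:
            raise ValueError("unsatisfiable dependencies")
        for name in ready:
            ordered.append(name)
            available |= set(pending.pop(name)["outputs"])
    return ordered

print(schedule(declarative, {"raw"}))  # ['anonymize_ids', 'derive_6_1']
```

The engine recovers the same order the procedural list states explicitly; a hybrid description mixes fixed sequences with engine-resolved sections.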
Simple Diagram
[Diagram: a PROCEDURAL PROCESS is an Activity containing Steps and Sub-Steps, with Control Logic, Inputs, and Outputs; a DECLARATIVE PROCESS is a Process Engine taking Parameters and a Playbook (Functions), with Inputs and Outputs]
Process Model (High Level)
[UML class diagram: Activity, Step, ControlLogic, and related classes, traced to PROV-O (Entity, Activity, Agent) and ProvONE (Workflow, Program)]
Datum: Bringing it Together
Possible Applications of DDI-CDI
Examples of What Might Be Done
Application: Recognizing Similar Variables in Difficult Cases
Two variables in different data sets might:
Measure the same concept differently
Measure the same concept in the same way with different physical
representations
Exist identically in two data sets, but with no formal link
In all of these cases, understanding the variables at each level
(conceptual, representational, and actual) provides a strong basis for
programmatically identifying them as potential points for joining data
sets
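A sketch of the idea in code, with invented dictionary fields standing in for references into the variable cascade:

```python
# Decide how two instance variables can be linked by walking up the cascade.
# The field names and return strings are illustrative, not the normative model.
def comparability(a, b):
    if a["represented"] == b["represented"]:
        return "same representation: values can be pooled directly"
    if a["conceptual"] == b["conceptual"]:
        return "same concept: harmonize representations, then pool"
    return "no documented link"

# Hypothetical example: two temperature columns with different representations
temp_f = {"conceptual": "body temperature", "represented": "fahrenheit/float"}
temp_c = {"conceptual": "body temperature", "represented": "celsius/float"}
print(comparability(temp_f, temp_c))
# same concept: harmonize representations, then pool
```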
Documenting Comparability among Variables (Simple Example)
[Diagram of the cascade for marital status:]
Conceptual variable – common variable specification without a representation:
maritalstatus (conceptual variable)
Represented variables – common variable specifications with a code representation:
maritalstatus (represented variable): 1 = Married, 2 = Not Married
maritalstatusplus (represented variable): A = Married, B = Not Married, C = Don’t Know
Variables – variable specifications within a dataset context:
Maritalstatus2007 (variable), Maritalstatus2010 (variable), maritalstatus2018 (variable)
Application: Automating Data Integration
If I understand the role played by any given data point in its data set
of origin, I can predict what role it must play in the data set I need to
transform it into for integration purposes
The DDI-CDI model shows us how these relate, and can avoid manual
intervention in performing the needed structural transformations
Reduces the (up to 80%) resource burden on projects for preparing
data for analysis
entry  datetime          systolic  diastolic  position  weight    temp   pctO2  pulse  away  exposed
101    2020-07-14T13:54  114       70         2         83.914,6  36,44  98     70     n     n
132    2020-07-14T14:03  125       86         3         68.038,9  37,50  85     92     y     n

Systolic, diastolic, and position could be defined as a variable collection with a structure indicating that the position value qualifies the blood-pressure measures.
First Dataset Roles (Wide Data)
entry  datetime          systolic  diastolic  position  weight    temp   pctO2  pulse  away  exposed
101    2020-07-14T13:54  114       70         2         83.914,6  36,44  98     70     n     n
132    2020-07-14T14:03  125       86         3         68.038,9  37,50  85     92     y     n

[The slide annotates the columns of this wide structure with their roles: a (Unit) Identifier Component, Variable Value Components, and an Attribute Component]
Second Data Set Roles (Long Data)
Roles annotated in the long structure: (Unit) Identifier Component, Identifier Component, Variable Descriptor Component, Variable Value Component, Attribute Component
The Variable Descriptor Component has values taken from the list of non-Unit Identifiers and Variable Components in the first data set. (This can be programmatically known.)
The “key” for each value is composed from the identifiers plus the Variable Descriptor.
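The role change and key composition described above can be sketched as follows (plain Python, illustrative only):

```python
# The wide column names become values of the Variable Descriptor Component
# in the long layout; the key for each datum is the unit identifiers plus
# that descriptor value.
wide_row = {"entry": 101, "datetime": "2020-07-14T13:54",
            "systolic": 114, "diastolic": 70}
identifiers = ["entry", "datetime"]

long_rows = [
    {"key": tuple(wide_row[i] for i in identifiers) + (var,), "value": val}
    for var, val in wide_row.items() if var not in identifiers
]
print(long_rows[0]["key"])  # (101, '2020-07-14T13:54', 'systolic')
```

Because the roles in both structures are known from the metadata, this mapping needs no human set-up: the transformation can be derived programmatically.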
What is Gained?
Statistical packages have been able to cast between various data
structures for a long time
This requires human input and set-up
Part of your “80%” resource burden
By making structural information at this level explicit
These processes can be automated, lowering the resource burden
Unanticipated/specialized data structure transformations can be supported
This does not solve semantic mapping, but can support it
Application: A Provenance Browser
One question researchers often ask is: “Where did this number come from? What is it really?”
Typically, the answer is unsatisfying and/or non-existent.
The following examples are drawn from an application being
developed using the DDI-CDI model to bring together processes and
data sets, and the metadata attached to them, to give researchers a
useful way to look at any portion of the provenance chain, and to
answer this question.
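A provenance chain of this kind can be represented as a small graph and walked backwards. A sketch with PROV-style "wasGeneratedBy"/"used" edges; the dataset and job names below are hypothetical:

```python
# Provenance as two edge maps: which job generated each dataset,
# and which datasets each job used (PROV-style relationships).
generated_by = {"clean_6_1": "core_etl", "quality_metrics": "qm_job"}
used = {"core_etl": ["raw_events"], "qm_job": ["clean_6_1"]}

def lineage(dataset):
    """Answer 'where did this come from?' by walking the chain backwards."""
    chain = [dataset]
    while dataset in generated_by:
        job = generated_by[dataset]
        chain.append(job)
        inputs = used.get(job, [])
        if not inputs:
            break
        dataset = inputs[0]   # follow the first input for a linear view
        chain.append(dataset)
    return chain

print(lineage("quality_metrics"))
# ['quality_metrics', 'qm_job', 'clean_6_1', 'core_etl', 'raw_events']
```

A provenance browser does essentially this walk, attaching the human-readable metadata (purpose statements, variable descriptions) at each node.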
Application: A Provenance Browser (cont.)
This application was one of the test cases for DDI-CDI specification development
Provenance metadata (the processes) is mined programmatically from the ETL platform, Pentaho
These are often “chains” which leverage Stata scripts to perform the processing itself
Variable descriptions are taken from DDI Codebook XML
Human-readable “Purpose” statements are added manually
The “browser” brings this together in an easy-to-use form, from the
viewpoint of a specific process step or data set
[Wireframe: job view – a navigation pane lists the process’s jobs (each with tasks) and datasets; for the selected job, an Overview, Purpose, and Algorithm are shown alongside its input and output datasets]
[Wireframe: dataset view – the same navigation; for the selected dataset, an Overview, Purpose, and Algorithm are shown alongside the creating job and the consuming jobs]
[Wireframe: task view – the same navigation; the selected task file’s script is displayed, for example:]
use hsb2, clear
* get values for boxplot
summarize write, d
gen f=43             /* set value a little larger than bin with highest
frequency */
gen pmin=r(min)
gen p25=r(p25)
gen p50=r(p50)
gen p75=r(p75)
gen pmax=r(max)
gen pmean=r(mean)
* graph histogram and boxplot on same axes
two (histogram write, start(30) width(5) freq)                  ///
    (rcap pmin pmax f in 1, hor bcolor(dknavy))                 ///
    (rbar p25 p75 f in 1, hor bcolor(dknavy))                   ///
    (rcap p50 p50 f in 1, hor bcolor(white))                    ///
    (rcapsym pmean pmean f in 1, hor msym(plus) mcolor(white)), ///
    legend(off) xtitle("Writing Score") ytitle("Frequency")
* drop variables created for boxplot values
drop f-pmean
EXAMPLE: Data Quality Metrics (“Job”)
[Screenshot: job [03] 6.1 Data Quality Metrics, with input dataset [1] Raw 6.1 Event Format and output datasets [2] DoB Quality Metrics and [3] Illegal Transitions]
ALPHA Data Pipeline Spec. 6.1 – Business Processes:
[02] Core ETL for Raw Input (Overview, Purpose)
    Steps: [2.1] Generate Anonymized IDs; [2.2] Map original IDs to Anonymized IDs; [2.3] Store ID Mapping; [2.4] Create 6.1 Data from Raw Data
[03] 6.1 Data Quality Metrics (Overview, Purpose)
    Steps: [3.1] Compile Quality Metrics; [3.2] Compile Residency Starting Events; [3.3] Compile Residency Ending Events; [3.4] Compile Legal and Illegal Starting Events
Example: Descriptive Metadata about a Job
[Screenshot: the same ALPHA Data Pipeline navigation, with the Purpose of process [02] displayed]
02 Core ETL for Raw Input – Purpose:
Creates staging tables from member-centre-specific data. The staging tables are then transformed further to create the ALPHA specification 6.1.
Example: Conceptual Metadata about a Data Set
[Screenshot: the same navigation, with step [3.2] selected]
[3.2] Compile Residency Starting Events
Description: Identify in the data the events that start a residency episode (birth, external-immigration, enumeration, becoming eligible for a study, found after being lost to follow-up).
Concepts: This algorithm step references the following study concepts: residency, birth, migration.
migration – The change of residence by a registered individual or social group (e.g., a household). There are two types of migration that occur among the registered population. These are internal and external migration.
Example: Structural/Codebook Metadata about a Data Set
[Screenshot: dataset [1] Raw 6.1 Event Format, created by process [01] Site-Specific ETL and consumed by processes [03] 6.1 Data Quality Metrics and [04] Clean 6.1 Data]
View Dataset Structure:
recnr [INT]
study_name [VARCHAR(15)]
idno [VARCHAR(32)]
hhold_id [VARCHAR(32)]
hhold_id_extra [VARCHAR(32)]
sex [INT]
dob [DATETIME]
residence [VARCHAR(5)]
eventnr [INT]
event [INT]
event_date [DATETIME]
type_of_date [INT]
obs_date [DATETIME]
obs_round [VARCHAR(2)]
Application: Transparency – Replication/Reproduction of Findings
With rich provenance metadata, it is much easier for humans to
reproduce findings
With a complete machine-actionable record of data provenance,
reproducibility of findings can be performed by machines –
‘computational replication’
Transparency requires this ability, but…
Problems of scale will demand that these processes be more efficient!
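A toy sketch of the idea of computational replication — the step functions and the “published” value below are invented for illustration:

```python
# If every processing step is recorded in machine-actionable form,
# a verifier can re-run the chain and compare against published results.
steps = [
    ("clean",     lambda xs: [x for x in xs if x is not None]),
    ("mean_temp", lambda xs: sum(xs) / len(xs)),
]
published_result = 36.97   # hypothetical published finding

data = [36.44, None, 37.50]
for name, fn in steps:      # replay the recorded chain, step by step
    data = fn(data)

replicated = round(data, 2) == published_result
print(replicated)  # True
```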
Summary: Questions…
Can DDI-CDI help you solve problems you face within your EOSC area?
Would DDI-CDI allow you to do things which you are not currently
able to do?
Does the variable cascade support functionality that you need?
Could you automate functionality around data integration?
What other standards are important and how would DDI-CDI need to
interface with them?
Are there requirements for integration of data for secondary use or
across domain boundaries? Do we have the needed
standards/metadata to address these today?
Slide Note
Embed
Share

Introduction to DDI-CDI and its relevance in EOSC applications, highlighting examples, foundational metadata, possible applications, questions for consideration, interoperability with FAIR data principles, and challenges in research across domain boundaries. DDI standards aim to improve data documentation in the Social, Behavioural, and Economic sciences to facilitate data sharing and reuse.


Uploaded on Aug 08, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. DDI-CDI: An Introduction for Possible EOSC Applications

  2. Outline Introduction/Background Examples from The Specification Foundational Metadata: Datums and the Variable Cascade Structural Description of Data Process Description Possible Applications Application: Recognizing Similar Variables in Difficult Cases Application: Automating Data Integration Application: The Provenance Browser Application: Transparency Replication/Reproduction of Findings Summary

  3. Questions to Consider Can DDI-CDI help you solve problems you face within your EOSC area? Would DDI-CDI allow you to do things which you are not currently able to do? What other standards are important and how would DDI-CDI need to interface with them? Are there requirements for integration of data for secondary use or across domain boundaries? Do we have the needed standards/metadata to address these today?

  4. DDI-CDI and FAIR Many people talk about Findability and Access Not so much about Interoperability and Reuse DDI-CDI focuses on these aspects of FAIR data It is also quite useful for data discovery Interoperability and reuse of data are metadata-intensive Historically, these aspects of data management are expensive and have not been fully incentivized by research funders Today s focus on FAIR data demands that we do more!

  5. Challenges Increasingly, research is conducted across domain boundaries Grand Challenges : COVID-19, climate change, resilient cities, etc. Technology is able to scale very efficiently Cloud computing Big data technologies Data-hungry approaches to research are becoming common Social media as a data source Machine-learning The bottleneck is the metadata: how do we understand the data we can find and access? When EOSC becomes a reality, problems of scale will increase massively FAIR data means more data-sharing than is possible today!

  6. Background: DDI Standards The Data Documentation Initiative (DDI) has produced metadata standards for documenting data in the Social, Behavioural, and Economic (SBE) sciences (also official statistics at the national level) DDI Codebook, DDI Lifecycle Granular, machine-actionable metadata for describing data (XML) DDI Cross-Domain Integration (DDI - CDI) is different Model for describing data across a wide range of domains/structures Sensor and event data, key-value data, rectangular data, multi-dimensional data Model for describing processing of data from one form into another Model for showing how individual datums are used across the entire chain Complement to traditional metadata/documentation standards

  7. Typical Data Transformations Microdata /Unit Record Data Aggregate Data/ Data Cube Raw Data Indicators DDI CDI describes the data at each stage, indicating the roles played by each atomic bit of data ( datum ) DDI CDI tracks the processing between each stage (implements PROV), reflecting the relationships between atomic datums (uses other standards for describing specific processes - SDTL)

  8. Cross-Domain Data Sharing Within a domain, the available data is understood by researchers/users Familiar with the organizations and projects Familiar with the domain/literature Common tools and data structures Secondary use is already hard Demands good metadata and documentation Cross-domain data is even harder to use Lack of background/context Different tools and data structure Different semantics Using and integrating external data is an expensive, manual process As much as 80% of effort in a research project(!)

  9. DDI CDI and Data Integration Data described using the DDI CDI model comes with a richer set of context: data provenance What were the inputs? How were they processed? Can findings be reproduced? The integration of data described in DDI CDI can be automated to a greater extent The role played by each datum is known (identifier, measure, dimension, etc.) The relationship between datums is explicit because the processing is described Changes in the role played by a datum in different structures can be predicted (programmatically) and documented (for future reference)

  10. DDI-CDI Development DDI-CDI was developed based on real-world use cases They were already using other standards/specifications They involved data integration across domain boundaries The goal was to fill in the missing pieces Tie existing metadata/information together Avoid duplication where possible Cross-platform, technology-neutral UML model is the center-piece Has XML binding Can be used with any technology binding Currently in public review Release in Q2 of 2021 Actively engaging across domains to validate/refine the model

  11. What Does DDI-CDI Do? Describes foundational metadata in rich detail Concepts (and their uses) Classifications/codelists Variables Datums/data points (etc.) Describes a range of data structures Wide data/rectangular/unit-record Tall data/events/sensor data Key-Value data/ big data/No-SQL data Multidimensional ( cube ) data/time series/indicators Describes process Declarative ( black box multi-threaded) Procedural (stepwise) Ties to data (describing provenance chains)

  12. Whats in the Specification? (1) DDI-CDI is a formal UML class model Expressed in a portable form for easy use by many UML tools Portability is through Canonical XMI an interchange format for UML models Uses a limited subset of the UML class model UML can be easily implemented across technology platforms Auto-generation of syntax bindings Familiar to many developers UML future-proofs the specification Emerging technology platforms can be supported Content of model is extensible

  13. Whats in the Specification? (2) An XML representation is provided Described in W3C XML Schema Other representations (e.g., RDF, JSON) are being considered Documentation of model and schemas Architecture and design documentation Includes documentation of approach to other standards Examples of use Including other standards (e.g., DDI C/L, Schema.org, etc.)

  14. Alignment with Other Standards DDI-CDI attempts to align with a number of other standards For process description: PROV-O (and PROV-ONE) . Directly implements these SDTL/VTL provides for combined use BPMN formal alignment For discovery: Schema.org DCAT (in progress) For data description: DDI Codebook, DDI Lifecycle GSIM SDMX/Data Cube Others (see documentation) Open to further work in this area!

  15. Examples from the Model Features of DDI-CDI to Illustrate its Potential Utility

  16. Foundational Metadata: The Variable Cascade Understanding the roles played by variables is critical in integration of data Variables do many, many different things Not all variables are the same! We have three levels of variables in our model: Conceptual Variables Represented Variables Instance Variables

  17. Variable Cascade Conceptual Variable Variable descriptions at a high level. Early in designing data collection, broad searches. Broadly reusable.

  18. Variable Cascade - RepresentedVariable More specificity about value domain, units of measurement. Still reusable.

  19. Variable Cascade - InstanceVariable Describing collected data. Physical datatype and platform. Invariant role of the variable (e.g. a weight)

  20. DDI-CDI and Domain Ontologies DDI-CDI allows for the use of any concept system It does not model the semantics of any specific domain Rich model for classifications, codelists, and controlled vocabularies Mechanism for referencing domain ontologies Formal use of concepts is specified Concepts can be variables, categories, population, units, etc. Concept systems are modelled generically, but can be external Semantic mapping is not supported Semantic mapping has its own standards/mechanisms DDI-CDI provides a framework where semantic mapping becomes more meaningful

  21. Data Structures DDI-CDI currently can describe four different data structures Wide as with unit records Tall - as with event or stream data Key value as in a key-value store Dimensional - as with aggregate data

  22. Our example Imagine that a program (Python?) collects Covid related information at building doors: Blood Pressure( Systolic, diastolic) Position for BP weight temperature pctO2 pulse beenToFloridaEtc? exposed? Position (prone, sitting, standing) beenToFloridaEtc ,Exposed (yes, no)

  23. Wide Example As a spreadsheet table entry datetime systolic diastolic position weight temp pctO2 pulse away exposed 101 2020-07-14T13:54 114 70 2 83914.6 36.44 98 70 n n 132 2020-07-14T14:03 125 86 3 68038.9 37.5 85 92 y n As tab delimited text lines entry datetime systolic exposed diastolic position weight temp pctO2 pulse away 83,914.6 101 2020-07-14T13:54 n 114 70 2 36.44 98 70 n 68,038.9 132 2020-07-14T14:03 y 125 86 3 37.50 85 92 n

  24. Tall Example Entry DateTime 1012020-07-14T13:54 1012020-07-14T13:54 1012020-07-14T13:54 1012020-07-14T13:54 1012020-07-14T13:54 1012020-07-14T13:54 1012020-07-14T13:54 1012020-07-14T13:54 1322020-07-14T14:03 1322020-07-14T14:03 1322020-07-14T14:03 1322020-07-14T14:03 1322020-07-14T14:03 1322020-07-14T14:03 1322020-07-14T14:03 1322020-07-14T14:03 Measure systolic diastolic weight temp pctO2 pulse away exposed systolic diastolic weight temp pctO2 pulse away exposed Position Value 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 114 70 83914.60 36.44 98 70 n n 125 86 68038.90 37.5 85 92 y n

  25. Key-Value Example Key 101_2020-07-14T13:54_2_systolic 101_2020-07-14T13:54_2_diastolic 101_2020-07-14T13:54_2_weight 101_2020-07-14T13:54_2_temp 101_2020-07-14T13:54_2_pctO2 101_2020-07-14T13:54_2_pulse 101_2020-07-14T13:54_2_away 101_2020-07-14T13:54_2_exposed 132_2020-07-14T14:03_3_systolic 132_2020-07-14T14:03_3_diastolic 132_2020-07-14T14:03_3_weight 132_2020-07-14T14:03_3_temp 132_2020-07-14T14:03_3_pctO2 132_2020-07-14T14:03_3_pulse 132_2020-07-14T14:03_3_away 132_2020-07-14T14:03_3_exposed Value 114 70 83914.60 36.44 98 70 n n 125 86 68038.90 37.5 85 92 y n

  26. Dimensional Example away exposed meanTemp Y Y Y N N Y N N meantemp away Y N exposed 38.3 37.2 37.8 36.6 Y N 38.3 37.8 37.2 36.6 Dimensions are defined by away and exposed. For each combination of dimension values there is a summary value the mean of temp. The dimensional data are shown here in two layouts, a cross tabulation and a tall structure. Questions: Have you traveled outside of the county in the last two weeks? (circle one) Yes No Have you had contact with anyone diagnosed with Covid-19? (circle one)? Yes No

  27. Roles for Data Points

  28. Roles: Identifiers, Measures, and Attributes VariableDescriptorComponent AttributeComponent Identifiers Entry DateTime 1012020-07-14T13:54 1012020-07-14T13:54 1012020-07-14T13:54 1012020-07-14T13:54 1012020-07-14T13:54 1012020-07-14T13:54 1012020-07-14T13:54 1012020-07-14T13:54 1322020-07-14T14:03 1322020-07-14T14:03 1322020-07-14T14:03 1322020-07-14T14:03 1322020-07-14T14:03 1322020-07-14T14:03 1322020-07-14T14:03 1322020-07-14T14:03 Position Measure Value 2systolic 2diastolic 2weight 2temp 2pctO2 2pulse 2away 2exposed 3systolic 3diastolic 3weight 3temp 3pctO2 3pulse 3away 3exposed 114 70 VariableValueComponent 83914.60 36.44 98 70 n n 125 86 68038.90 37.5 85 92 y n

  29. The DDI-CDI Process Model

      Describes the use of individual processes, and how they fit together
      Supports standard descriptions (SDTL, VTL) and specific languages (SQL, R, Stata, SPSS, Python, SAS, etc.)
      Three modes:
        Procedural: step-wise, with decision points
        Declarative: black-box, multi-threaded, uses a playbook and configurations
        Hybrid approaches of the two
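A toy illustration of the first two modes (the names and structure here are invented for illustration, not taken from the specification): a procedural pipeline fixes the step order explicitly, while a declarative engine selects whichever playbook function has its inputs available:

```python
# Procedural mode: an explicit, ordered sequence of steps with fixed order.
def procedural(data):
    data = [x * 2 for x in data]  # step 1: scale
    data = [x + 1 for x in data]  # step 2: shift
    return data

# Declarative mode: a playbook of functions declaring what they need and
# produce; a simple engine runs whichever function becomes runnable next.
PLAYBOOK = [
    {"needs": "raw",    "makes": "scaled",  "fn": lambda xs: [x * 2 for x in xs]},
    {"needs": "scaled", "makes": "shifted", "fn": lambda xs: [x + 1 for x in xs]},
]

def declarative(data):
    state = {"raw": data}
    pending = list(PLAYBOOK)
    while pending:
        step = next(s for s in pending if s["needs"] in state)
        state[step["makes"]] = step["fn"](state[step["needs"]])
        pending.remove(step)
    return state["shifted"]

print(procedural([1, 2, 3]))   # [3, 5, 7]
print(declarative([1, 2, 3]))  # [3, 5, 7]
```

Both reach the same result; the declarative version never states an order, only dependencies, which is what lets a real engine schedule work in parallel.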

  30. [Simple Diagram contrasting the two modes. Declarative process: a Process Engine takes Parameters and a Playbook (Functions); an Activity with Control Logic turns Inputs into Outputs. Procedural process: Steps and Sub-Steps, each with their own Inputs and Outputs.]

  31. [Process Model (High Level): UML class diagram. Core classes Process, Activity, Step, ControlLogic (with subtypes Sequence, ConditionalControlLogic, RuleBasedScheduling, and TemporalControlConstruct/TemporalConstraints using AllenIntervalAlgebra), Parameters, InformationFlowDefinition, Service, and ProcessingAgent, with DeterministicImperative and NonDeterministicDeclarative step types; trace relationships link these to the PROV-O classes (Entity, Activity, Agent), the ProvONE classes (Workflow, Program), and DataDescription::InformationObject.]

  32. Datum: Bringing it Together

  33. Possible Applications of DDI-CDI Examples of What Might Be Done

  34. Application: Recognizing Similar Variables in Difficult Cases

      Two variables in different data sets might:
        Measure the same concept differently
        Measure the same concept in the same way, with different physical representations
        Exist identically in two data sets, but with no formal link

      In all of these cases, understanding the variables at each level (conceptual, representational, and actual) provides a strong basis for programmatically identifying them as potential points for joining data sets.

  35. Documenting Comparability among Variables (Simple Example)

      Conceptual variable: common variable specification without a representation
        maritalstatus (conceptual variable)

      Represented variable: common variable specification with a code representation
        maritalstatus (represented variable): 1 = Married, 2 = Not Married
        maritalstatusplus (represented variable): A = Married, B = Not Married, C = Don't Know

      Variable: variable specification within a dataset context
        maritalstatus 2018 (variable), maritalstatus 2007 (variable), maritalstatus 2010 (variable)
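The cascade on this slide can be sketched as a small data structure. This is a hedged illustration only: the class and attribute names below are invented for the sketch, not DDI-CDI class names.

```python
from dataclasses import dataclass

@dataclass
class ConceptualVariable:
    name: str                       # the concept, with no representation

@dataclass
class RepresentedVariable:
    concept: ConceptualVariable
    codes: dict                     # code -> category label

@dataclass
class DatasetVariable:
    represented: RepresentedVariable
    dataset: str                    # the dataset context

maritalstatus = ConceptualVariable("maritalstatus")
rv = RepresentedVariable(maritalstatus, {"1": "Married", "2": "Not Married"})
rv_plus = RepresentedVariable(maritalstatus,
                              {"A": "Married", "B": "Not Married", "C": "Don't Know"})
v2018 = DatasetVariable(rv, "survey2018")
v2007 = DatasetVariable(rv_plus, "survey2007")

def comparable(a, b):
    """Two dataset variables are conceptually comparable if they share
    a conceptual variable, even when their code representations differ."""
    return a.represented.concept is b.represented.concept

print(comparable(v2018, v2007))  # True
```

This is the basis for the "difficult cases" on the previous slide: the shared conceptual variable is what a program can match on, even when codes and datasets differ.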

  36. Application: Automating Data Integration

      If I understand the role played by any given data point in its data set of origin, I can predict what role it must play in the data set I need to transform it into for integration purposes.
      The DDI-CDI model shows us how these roles relate, and can help avoid manual intervention in performing the needed structural transformations.
      This reduces the resource burden (up to 80%) that projects carry for preparing data for analysis.

  37. [The same records shown in two structures. Wide:]

      entry  datetime          systolic  diastolic  position  weight    temp   pctO2  pulse  away  exposed
      101    2020-07-14T13:54  114       70         2         83.914,6  36,44  98     70     n     n
      132    2020-07-14T14:03  125       86         3         68.038,9  37,50  85     92     y     n

      [Tall: the same records with columns Entry, DateTime, Measure, Position, and Value, as in the Tall Example above.]

      Systolic, diastolic and position could be defined as a variable collection with a structure indicating that …

  38. First Dataset Roles (Wide Data)

      [The wide data set with its columns annotated by role: (Unit) Identifier Component (entry), Attribute Component, and Variable Value Component.]

      entry  datetime          systolic  diastolic  position  weight    temp   pctO2  pulse  away  exposed
      101    2020-07-14T13:54  114       70         2         83.914,6  36,44  98     70     n     n
      132    2020-07-14T14:03  125       86         3         68.038,9  37,50  85     92     y     n

  39. Second Data Set Roles (Long Data)

      [The tall data set with its columns annotated by role: (Unit) Identifier Component and Identifier Component (Entry, DateTime), Variable Descriptor Component (Measure), Attribute Component (Position), and Variable Value Component (Value). The data are the same records as in the Tall Example above.]

      The Variable Descriptor Component has values taken from the list of non-Unit Identifiers and Variable Components in the first data set. (This can be programmatically known.)

      The key for each value is composed from the identifiers plus the Variable Descriptor.

  40. What is Gained?

      Statistical packages have been able to cast between various data structures for a long time, but this requires human input and set-up - part of your 80% resource burden.
      By making structural information at this level explicit:
        These processes can be automated, lowering the resource burden
        Unanticipated/specialized data structure transformations can be supported
      This does not solve semantic mapping, but it can support it.
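A sketch of the automation claim, under stated assumptions: once column roles are explicit metadata, a single generic routine can cast long data to wide with no per-dataset set-up. The role names and columns below are illustrative, not DDI-CDI identifiers:

```python
# Long-format datums with explicit roles: identifier columns, a variable
# descriptor column, and a value column (role names are illustrative).
long_rows = [
    {"entry": "101", "measure": "systolic",  "value": 114},
    {"entry": "101", "measure": "diastolic", "value": 70},
    {"entry": "132", "measure": "systolic",  "value": 125},
    {"entry": "132", "measure": "diastolic", "value": 86},
]

ROLES = {"identifier": ["entry"], "descriptor": "measure", "value": "value"}

def cast_wide(rows, roles):
    """Generic long-to-wide cast driven only by the role metadata."""
    wide = {}
    for r in rows:
        key = tuple(r[c] for c in roles["identifier"])
        rec = wide.setdefault(key, dict(zip(roles["identifier"], key)))
        rec[r[roles["descriptor"]]] = r[roles["value"]]
    return list(wide.values())

print(cast_wide(long_rows, ROLES))
# [{'entry': '101', 'systolic': 114, 'diastolic': 70},
#  {'entry': '132', 'systolic': 125, 'diastolic': 86}]
```

Nothing in cast_wide knows about blood pressure or this dataset; only the ROLES metadata changes from one dataset to the next, which is exactly the set-up work the slide argues can be generated rather than hand-written.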

  41. Application: A Provenance Browser One question researchers often ask is: Where did this number come from? What is it really? Typically, the answer is unsatisfying and/or non-existent. The following examples are drawn from an application being developed using the DDI-CDI model to bring together processes and data sets, and the metadata attached to them, to give researchers a useful way to look at any portion of the provenance chain, and to answer this question.
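A minimal sketch of the idea behind such a browser (entirely illustrative: the dataset and step names below are invented, while the real application mines its metadata from Pentaho and DDI Codebook): provenance as a graph that can be walked backwards from any data set:

```python
# Each dataset records the process step that produced it and that
# step's input datasets (all names invented for illustration).
PROVENANCE = {
    "quality_metrics": {"step": "Data Quality Metrics", "inputs": ["clean_data"]},
    "clean_data":      {"step": "Core ETL",             "inputs": ["raw_events"]},
    "raw_events":      {"step": "Site-Specific Extract", "inputs": []},
}

def lineage(dataset, prov):
    """Walk the chain backwards: which steps and inputs led to this dataset?"""
    entry = prov[dataset]
    chain = [(dataset, entry["step"])]
    for source in entry["inputs"]:
        chain += lineage(source, prov)
    return chain

for ds, step in lineage("quality_metrics", PROVENANCE):
    print(f"{ds} <- produced by: {step}")
```

Answering "where did this number come from?" is then a matter of attaching human-readable Purpose statements and variable descriptions to each node in this chain.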

  42. Application: A Provenance Browser (cont.)

      This application was one of the test cases for DDI-CDI specification development.
      Provenance metadata (the processes) are mined programmatically from the ETL platform, Pentaho. These are often chains which leverage Stata scripts to perform the processing itself.
      Variable descriptions are taken from DDI Codebook XML. Human-readable Purpose statements are added manually.
      The browser brings this together in an easy-to-use form, from the viewpoint of a specific process step or data set.

  43. [Wireframe of the provenance browser, process view: jobs (with their tasks) linked by input and output datasets, with Overview / Purpose / Algorithm tabs and a DATASETS panel.]

  44. [Wireframe of the provenance browser, dataset view: a dataset linked to its creating job and its consuming jobs, with Overview / Purpose / Algorithm tabs and a DATASETS panel.]

  45. [Wireframe showing a task file's Stata script alongside the job and task lists:]

      use hsb2, clear

      * get values for boxplot
      summarize write, d
      gen f=43 /* set value a little larger than bin with highest frequency */
      gen pmin=r(min)
      gen p25=r(p25)
      gen p50=r(p50)
      gen p75=r(p75)
      gen pmax=r(max)
      gen pmean=r(mean)

      * graph histogram and boxplot on same axes
      two (histogram write, start(30) width(5) freq) ///
          (rcap pmin pmax f in 1, hor bcolor(dknavy)) ///
          (rbar p25 p75 f in 1, hor bcolor(dknavy)) ///
          (rcap p50 p50 f in 1, hor bcolor(white)) ///
          (rcapsym pmean pmean f in 1, hor msym(plus) mcolor(white)), ///
          legend(off) xtitle("Writing Score") ytitle("Frequency")

      * drop variables created for boxplot values
      drop f-pmean

  46. EXAMPLE: Data Quality Metrics (Job)

      ALPHA Data Pipeline Spec. 6.1 Business Processes:

      [02] Core ETL for Raw Input (Overview, Purpose)
        Steps: [2.1] Generate Anonymized IDs; [2.2] Map original IDs to Anonymized IDs; [2.3] Store ID Mapping; [2.4] Create 6.1 Data from Raw Data

      [03] 6.1 Data Quality Metrics (Overview, Purpose)
        Steps: [3.1] Compile Quality Metrics; [3.2] Compile Residency Starting Events; [3.3] Compile Residency Ending Events; [3.4] Compile Legal and Illegal Starting Events

      Input dataset for [03]: [1] Raw 6.1 Event Format
      Output datasets: [2] DoB Quality Metrics; [3] Illegal Transitions

  47. Example: Descriptive Metadata about a Job

      ALPHA Data Pipeline Spec. 6.1 Business Processes: [02] Core ETL for Raw Input - Purpose

      Purpose: Creates staging tables from member-centre-specific data. The staging tables are then transformed further to create the ALPHA specification 6.1.

  48. Example: Conceptual Metadata about a Data Set

      ALPHA Data Pipeline Spec. 6.1 Business Processes: [3.2] Compile Residency Starting Events

      Description: Identify in the data the events that start a residency episode (birth, external immigration, enumeration, becoming eligible for a study, found after being lost to follow-up).

      Concepts: This algorithm step references the following study concepts: residency, birth, migration, registered individual or social group (e.g., a household). There are two types of migration that occur among the registered population: internal and external migration. The change of residence by a …

  49. Example: Structural/Codebook Metadata about a Data Set

      ALPHA Data Pipeline Spec. 6.1 Business Processes

      Structure of [1] Raw 6.1 Event Format:
        recnr [INT], study_name [VARCHAR(15)], idno [VARCHAR(32)], hhold_id [VARCHAR(32)], hhold_id_extra [VARCHAR(32)], sex [INT], dob [DATETIME], residence [VARCHAR(5)], eventnr [INT], event [INT], event_date [DATETIME], type_of_date [INT], obs_date [DATETIME], obs_round [VARCHAR(2)]

      Creating process: [01] Site-Specific ETL
      Consuming processes: [03] 6.1 Data Quality Metrics; [04] Clean 6.1 Data

      Data Sets: [1] Raw 6.1 Event Format; [2] DoB Quality Metrics; [3] Illegal Transitions

  50. Application: Transparency - Replication/Reproduction of Findings

      With rich provenance metadata, it is much easier for humans to reproduce findings.
      With a complete, machine-actionable record of data provenance, reproduction of findings can be performed by machines: computational replication.
      Transparency requires this ability, but problems of scale will demand that these processes be more efficient!
