Data Documentation Initiative (DDI) for Enhanced EOSC Applications


An introduction to DDI-CDI and its relevance to EOSC applications, covering examples from the specification, foundational metadata, possible applications, questions for consideration, interoperability under the FAIR data principles, and the challenges of research across domain boundaries. The DDI standards aim to improve data documentation in the Social, Behavioural, and Economic sciences to facilitate data sharing and reuse.





Presentation Transcript


  1. DDI-CDI: An Introduction for Possible EOSC Applications

  2. Outline
  - Introduction/Background
  - Examples from the Specification
    - Foundational Metadata: Datums and the Variable Cascade
    - Structural Description of Data
    - Process Description
  - Possible Applications
    - Application: Recognizing Similar Variables in Difficult Cases
    - Application: Automating Data Integration
    - Application: The Provenance Browser
    - Application: Transparency - Replication/Reproduction of Findings
  - Summary

  3. Questions to Consider
  - Can DDI-CDI help you solve problems you face within your EOSC area?
  - Would DDI-CDI allow you to do things which you are not currently able to do?
  - What other standards are important, and how would DDI-CDI need to interface with them?
  - Are there requirements for integration of data for secondary use or across domain boundaries? Do we have the needed standards/metadata to address these today?

  4. DDI-CDI and FAIR
  - Many people talk about Findability and Access, not so much about Interoperability and Reuse
  - DDI-CDI focuses on these aspects of FAIR data (it is also quite useful for data discovery)
  - Interoperability and reuse of data are metadata-intensive
  - Historically, these aspects of data management are expensive and have not been fully incentivized by research funders
  - Today's focus on FAIR data demands that we do more!

  5. Challenges
  - Increasingly, research is conducted across domain boundaries
    - "Grand Challenges": COVID-19, climate change, resilient cities, etc.
  - Technology is able to scale very efficiently: cloud computing, big data technologies
  - Data-hungry approaches to research are becoming common: social media as a data source, machine learning
  - The bottleneck is the metadata: how do we understand the data we can find and access?
  - When EOSC becomes a reality, problems of scale will increase massively
  - FAIR data means more data-sharing than is possible today!

  6. Background: DDI Standards
  - The Data Documentation Initiative (DDI) has produced metadata standards for documenting data in the Social, Behavioural, and Economic (SBE) sciences (also official statistics at the national level)
    - DDI Codebook, DDI Lifecycle: granular, machine-actionable metadata for describing data (XML)
  - DDI Cross-Domain Integration (DDI-CDI) is different:
    - A model for describing data across a wide range of domains/structures: sensor and event data, key-value data, rectangular data, multi-dimensional data
    - A model for describing the processing of data from one form into another
    - A model for showing how individual datums are used across the entire chain
    - A complement to traditional metadata/documentation standards

  7. Typical Data Transformations
  Raw Data -> Microdata/Unit Record Data -> Aggregate Data/Data Cube -> Indicators
  - DDI-CDI describes the data at each stage, indicating the roles played by each atomic bit of data (a "datum")
  - DDI-CDI tracks the processing between each stage (implements PROV), reflecting the relationships between atomic datums (and uses other standards, such as SDTL, for describing specific processes)

  8. Cross-Domain Data Sharing
  - Within a domain, the available data is understood by researchers/users: familiar with the organizations and projects, familiar with the domain/literature, common tools and data structures
  - Secondary use is already hard: it demands good metadata and documentation
  - Cross-domain data is even harder to use: lack of background/context, different tools and data structures, different semantics
  - Using and integrating external data is an expensive, manual process: as much as 80% of effort in a research project(!)

  9. DDI-CDI and Data Integration
  - Data described using the DDI-CDI model comes with a richer set of context: data provenance
    - What were the inputs? How were they processed? Can findings be reproduced?
  - The integration of data described in DDI-CDI can be automated to a greater extent:
    - The role played by each datum is known (identifier, measure, dimension, etc.)
    - The relationship between datums is explicit because the processing is described
    - Changes in the role played by a datum in different structures can be predicted (programmatically) and documented (for future reference)

  10. DDI-CDI Development
  - DDI-CDI was developed based on real-world use cases
    - They were already using other standards/specifications
    - They involved data integration across domain boundaries
  - The goal was to fill in the missing pieces: tie existing metadata/information together, avoid duplication where possible
  - Cross-platform and technology-neutral: the UML model is the centrepiece
    - Has an XML binding; can be used with any technology binding
  - Currently in public review; release in Q2 of 2021
  - Actively engaging across domains to validate/refine the model

  11. What Does DDI-CDI Do?
  - Describes foundational metadata in rich detail: concepts (and their uses), classifications/codelists, variables, datums/data points, etc.
  - Describes a range of data structures: wide/rectangular/unit-record data; tall data/events/sensor data; key-value data ("big data"/NoSQL); multidimensional ("cube") data/time series/indicators
  - Describes process: declarative ("black box" multi-threaded) and procedural (stepwise)
  - Ties process to data (describing provenance chains)

  12. What's in the Specification? (1)
  - DDI-CDI is a formal UML class model, expressed in a portable form for easy use by many UML tools
    - Portability is through Canonical XMI, an interchange format for UML models
    - Uses a limited subset of the UML class model
  - UML can be easily implemented across technology platforms: auto-generation of syntax bindings, familiar to many developers
  - UML future-proofs the specification: emerging technology platforms can be supported, and the content of the model is extensible

  13. What's in the Specification? (2)
  - An XML representation is provided, described in W3C XML Schema
    - Other representations (e.g., RDF, JSON) are being considered
  - Documentation of the model and schemas
    - Architecture and design documentation, including documentation of the approach to other standards
  - Examples of use, including combined use with other standards (e.g., DDI Codebook/Lifecycle, Schema.org, etc.)

  14. Alignment with Other Standards
  DDI-CDI attempts to align with a number of other standards:
  - For process description: PROV-O (and ProvONE), which it directly implements; SDTL/VTL (provides for combined use); BPMN (formal alignment)
  - For discovery: Schema.org; DCAT (in progress)
  - For data description: DDI Codebook, DDI Lifecycle; GSIM; SDMX/Data Cube; others (see documentation)
  - Open to further work in this area!

  15. Examples from the Model: Features of DDI-CDI to Illustrate its Potential Utility

  16. Foundational Metadata: The Variable Cascade
  - Understanding the roles played by variables is critical in the integration of data
  - Variables do many, many different things; not all variables are the same!
  - We have three levels of variables in our model: Conceptual Variables, Represented Variables, and Instance Variables

  17. Variable Cascade - ConceptualVariable Variable descriptions at a high level, used early in designing data collection and for broad searches. Broadly reusable.

  18. Variable Cascade - RepresentedVariable Adds specificity about the value domain and units of measurement. Still reusable.

  19. Variable Cascade - InstanceVariable Describes collected data: the physical datatype and platform, and the invariant role of the variable (e.g., a weight).
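
To make the cascade concrete, here is a minimal Python sketch; the class and attribute names are simplifications for illustration, not the normative DDI-CDI classes:

  # A minimal sketch of the three-level variable cascade.
  # Class and attribute names are illustrative, not the normative DDI-CDI model.
  from dataclasses import dataclass

  @dataclass
  class ConceptualVariable:
      name: str             # e.g. "temperature"
      concept: str          # link to a concept definition; broadly reusable

  @dataclass
  class RepresentedVariable:
      conceptual: ConceptualVariable
      value_domain: str     # e.g. "decimal"
      unit_of_measure: str  # adds specificity, still reusable across datasets

  @dataclass
  class InstanceVariable:
      represented: RepresentedVariable
      dataset: str              # the concrete dataset the variable lives in
      physical_datatype: str    # e.g. "float64" on a given platform
      role: str                 # invariant role, e.g. "measure" or "weight"

  temp_cv = ConceptualVariable("temperature", "body temperature of a person")
  temp_rv = RepresentedVariable(temp_cv, "decimal", "degrees Celsius")
  temp_iv = InstanceVariable(temp_rv, "door_screening_2020", "float64", "measure")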

  20. DDI-CDI and Domain Ontologies
  - DDI-CDI allows for the use of any concept system; it does not model the semantics of any specific domain
  - Rich model for classifications, codelists, and controlled vocabularies
  - Mechanism for referencing domain ontologies
    - The formal use of concepts is specified: concepts can be variables, categories, populations, units, etc.
    - Concept systems are modelled generically, but can be external
  - Semantic mapping is not supported: semantic mapping has its own standards/mechanisms
  - DDI-CDI provides a framework where semantic mapping becomes more meaningful

  21. Data Structures
  DDI-CDI currently can describe four different data structures:
  - Wide: as with unit records
  - Tall: as with event or stream data
  - Key-value: as in a key-value store
  - Dimensional: as with aggregate data

  22. Our Example
  Imagine that a program (Python?) collects COVID-related information at building doors:
  - Blood pressure (systolic, diastolic)
  - Position for BP (prone, sitting, standing)
  - weight
  - temperature
  - pctO2
  - pulse
  - beenToFloridaEtc? (recorded as "away": yes, no)
  - exposed? (yes, no)
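
A single observation from this example might look like the following Python record. The 1/2/3 coding for position and the "away" shorthand are assumptions based on the tables that follow:

  # One screening observation from the door-check example; values are taken
  # from the wide table on the next slide.
  observation = {
      "entry": 101,
      "datetime": "2020-07-14T13:54",
      "systolic": 114, "diastolic": 70,  # blood pressure
      "position": 2,                     # position for BP (assumed: 1=prone, 2=sitting, 3=standing)
      "weight": 83914.6,
      "temp": 36.44,
      "pctO2": 98,
      "pulse": 70,
      "away": "n",                       # beenToFloridaEtc?
      "exposed": "n",
  }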

  23. Wide Example
  As a spreadsheet table:

  entry | datetime         | systolic | diastolic | position | weight  | temp  | pctO2 | pulse | away | exposed
  101   | 2020-07-14T13:54 | 114      | 70        | 2        | 83914.6 | 36.44 | 98    | 70    | n    | n
  132   | 2020-07-14T14:03 | 125      | 86        | 3        | 68038.9 | 37.50 | 85    | 92    | y    | n

  (The same records can equally be serialized as tab-delimited text lines.)

  24. Tall Example

  Entry | DateTime         | Measure   | Position | Value
  101   | 2020-07-14T13:54 | systolic  | 2        | 114
  101   | 2020-07-14T13:54 | diastolic | 2        | 70
  101   | 2020-07-14T13:54 | weight    | 2        | 83914.60
  101   | 2020-07-14T13:54 | temp      | 2        | 36.44
  101   | 2020-07-14T13:54 | pctO2     | 2        | 98
  101   | 2020-07-14T13:54 | pulse     | 2        | 70
  101   | 2020-07-14T13:54 | away      | 2        | n
  101   | 2020-07-14T13:54 | exposed   | 2        | n
  132   | 2020-07-14T14:03 | systolic  | 3        | 125
  132   | 2020-07-14T14:03 | diastolic | 3        | 86
  132   | 2020-07-14T14:03 | weight    | 3        | 68038.90
  132   | 2020-07-14T14:03 | temp      | 3        | 37.5
  132   | 2020-07-14T14:03 | pctO2     | 3        | 85
  132   | 2020-07-14T14:03 | pulse     | 3        | 92
  132   | 2020-07-14T14:03 | away      | 3        | y
  132   | 2020-07-14T14:03 | exposed   | 3        | n
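
For illustration, a short pandas sketch (not part of DDI-CDI) that builds the wide table from slide 23 and casts it into this tall layout:

  import pandas as pd

  # The wide table from slide 23.
  wide = pd.DataFrame([
      {"entry": 101, "datetime": "2020-07-14T13:54", "systolic": 114,
       "diastolic": 70, "position": 2, "weight": 83914.6, "temp": 36.44,
       "pctO2": 98, "pulse": 70, "away": "n", "exposed": "n"},
      {"entry": 132, "datetime": "2020-07-14T14:03", "systolic": 125,
       "diastolic": 86, "position": 3, "weight": 68038.9, "temp": 37.5,
       "pctO2": 85, "pulse": 92, "away": "y", "exposed": "n"},
  ])

  # Melt into the tall layout: one row per (entry, datetime, measure) datum,
  # keeping position as an attribute alongside the identifiers.
  tall = wide.melt(
      id_vars=["entry", "datetime", "position"],
      var_name="measure", value_name="value",
  )
  print(tall)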

  25. Key-Value Example

  Key                              | Value
  101_2020-07-14T13:54_2_systolic  | 114
  101_2020-07-14T13:54_2_diastolic | 70
  101_2020-07-14T13:54_2_weight    | 83914.60
  101_2020-07-14T13:54_2_temp      | 36.44
  101_2020-07-14T13:54_2_pctO2     | 98
  101_2020-07-14T13:54_2_pulse     | 70
  101_2020-07-14T13:54_2_away      | n
  101_2020-07-14T13:54_2_exposed   | n
  132_2020-07-14T14:03_3_systolic  | 125
  132_2020-07-14T14:03_3_diastolic | 86
  132_2020-07-14T14:03_3_weight    | 68038.90
  132_2020-07-14T14:03_3_temp      | 37.5
  132_2020-07-14T14:03_3_pctO2     | 85
  132_2020-07-14T14:03_3_pulse     | 92
  132_2020-07-14T14:03_3_away      | y
  132_2020-07-14T14:03_3_exposed   | n
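
Continuing the pandas sketch from slide 24, the key-value layout can be derived from the tall table by composing each key from the identifiers, the position attribute, and the measure name:

  # Build the key-value layout from the tall DataFrame: each key concatenates
  # the identifiers (entry, datetime), the position attribute, and the measure.
  kv = {
      f"{r.entry}_{r.datetime}_{r.position}_{r.measure}": r.value
      for r in tall.itertuples()
  }
  print(kv["101_2020-07-14T13:54_2_systolic"])  # -> 114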

  26. Dimensional Example
  Dimensions are defined by away and exposed. For each combination of dimension values there is a summary value: the mean of temp. The dimensional data are shown here in two layouts, a tall structure and a cross-tabulation.

  Tall structure:
  away | exposed | meanTemp
  Y    | Y       | 38.3
  Y    | N       | 37.2
  N    | Y       | 37.8
  N    | N       | 36.6

  Cross-tabulation (meanTemp):
            | away=Y | away=N
  exposed=Y | 38.3   | 37.8
  exposed=N | 37.2   | 36.6

  Questions behind the dimensions:
  - away: Have you traveled outside of the county in the last two weeks? (circle one) Yes / No
  - exposed: Have you had contact with anyone diagnosed with Covid-19? (circle one) Yes / No
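
The same aggregation can be sketched with a pandas groupby over the wide microdata built earlier. Note that with only the two example rows this yields two of the four cells shown above; the slide's figures presuppose more observations:

  # Aggregate the microdata into the dimensional (cube) layout: away and
  # exposed define the dimensions, mean temperature is the summary measure.
  cube = wide.groupby(["away", "exposed"], as_index=False)["temp"].mean()
  print(cube)

  # Cross-tabulated layout of the same cube:
  print(cube.pivot(index="exposed", columns="away", values="temp"))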

  27. Roles for Data Points

  28. Roles: Identifiers, Measures, and Attributes
  In the tall layout (slide 24), each column plays a defined role:
  - Entry and DateTime are Identifiers
  - Measure is the VariableDescriptorComponent (systolic, diastolic, weight, temp, pctO2, pulse, away, exposed)
  - Position is an AttributeComponent
  - Value is the VariableValueComponent (114, 70, 83914.60, ...)

  29. The DDI-CDI Process Model
  - Describes the use of individual processes, and how they fit together
  - Supports standard descriptions (SDTL, VTL) and specific languages (SQL, R, Stata, SPSS, Python, SAS, etc.)
  - Three modes (contrasted in the sketch below):
    - Procedural: step-wise, with decision points
    - Declarative: "black box" multi-threaded, uses a playbook and configurations
    - Hybrid approaches of the two
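
An illustrative contrast of the two modes in plain Python (these are not the DDI-CDI classes): a procedural pipeline fixes the order of steps in code, while a declarative engine derives the order from a playbook's declared inputs and outputs:

  def clean(raw):      return [r for r in raw if r is not None]
  def summarize(data): return sum(data) / len(data)

  # Procedural: explicit, stepwise control flow.
  raw = [36.4, None, 37.5]
  result = summarize(clean(raw))

  # Declarative: a playbook maps each output to (function, required inputs);
  # the engine works out the execution order itself.
  playbook = {
      "cleaned": (clean, ["raw"]),
      "mean":    (summarize, ["cleaned"]),
  }

  def run(playbook, state, target):
      if target not in state:
          func, needs = playbook[target]
          state[target] = func(*(run(playbook, state, n) for n in needs))
      return state[target]

  print(result, run(playbook, {"raw": raw}, "mean"))  # same result, two modes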

  30. Simple Diagram
  [Diagram: a DECLARATIVE PROCESS (process engine with a playbook of functions, parameters, inputs, and outputs) contrasted with a PROCEDURAL PROCESS (activity with control logic, steps and sub-steps, inputs, and outputs)]

  31. Process Model (High Level)
  [UML class diagram "Process Hierarchy": Activity, Step (with sub-steps), ControlLogic (Sequence, ConditionalControlLogic, RuleBasedScheduling, TemporalControlConstruct, with TemporalConstraints via AllenIntervalAlgebra), Parameters, InformationFlowDefinition, Service, and ProcessingAgent; ControlLogic divides into DeterministicImperative and NonDeterministicDeclarative; the model traces to PROV-O (Entity, Activity, Agent) and ProvONE (Workflow, Program)]

  32. Datum: Bringing it Together

  33. Possible Applications of DDI-CDI: Examples of What Might Be Done

  34. Application: Recognizing Similar Variables in Difficult Cases
  Two variables in different data sets might:
  - Measure the same concept differently
  - Measure the same concept in the same way, but with different physical representations
  - Exist identically in two data sets, but with no formal link
  In all of these cases, understanding the variables at each level (conceptual, represented, and instance) provides a strong basis for programmatically identifying them as potential points for joining data sets.

  35. Documenting Comparability among Variables (Simple Example)
  - Conceptual variable: a common variable specification without a representation
    - maritalstatus (conceptual variable)
  - Represented variable: a common variable specification with a code representation
    - maritalstatus (represented variable): 1 = Married, 2 = Not Married
    - maritalstatusplus (represented variable): A = Married, B = Not Married, C = Don't Know
  - Variable: a variable specification within a dataset context
    - maritalstatus2018 (variable), maritalstatus2007 (variable), maritalstatus2010 (variable)
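
A hedged sketch of how the shared conceptual variable supports programmatic comparison; the dictionaries are invented stand-ins for the represented variables above:

  # Two instance variables that share a conceptual variable but use
  # different code representations. Names and mappings are illustrative.
  marital_2007 = {"concept": "maritalstatus",
                  "codes": {"1": "Married", "2": "Not Married"}}
  marital_2018 = {"concept": "maritalstatus",
                  "codes": {"A": "Married", "B": "Not Married", "C": "Don't Know"}}

  # A shared conceptual variable flags the pair as candidates for joining.
  comparable = marital_2007["concept"] == marital_2018["concept"]

  # Harmonize by mapping each representation onto the common categories.
  def harmonize(value, variable):
      return variable["codes"].get(value)

  print(comparable, harmonize("A", marital_2018))  # -> True Married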

  36. Application: Automating Data Integration
  - If I understand the role played by any given data point in its data set of origin, I can predict what role it must play in the data set I need to transform it into for integration purposes
  - The DDI-CDI model shows us how these roles relate, and can avoid manual intervention in performing the needed structural transformations
  - This reduces the (up to 80%) resource burden on projects for preparing data for analysis

  37. Systolic, diastolic, and position could be defined as a variable collection with a structure indicating that position qualifies the blood-pressure measures. The records in the tall layout (Entry, DateTime, Measure, Position, Value; see slide 24) correspond to this wide layout:

  entry | datetime         | systolic | diastolic | position | weight   | temp  | pctO2 | pulse | away | exposed
  101   | 2020-07-14T13:54 | 114      | 70        | 2        | 83.914,6 | 36,44 | 98    | 70    | n    | n
  132   | 2020-07-14T14:03 | 125      | 86        | 3        | 68.038,9 | 37,50 | 85    | 92    | y    | n

  (Note the locale-specific decimal formatting here, e.g. 83.914,6: the same datum as 83914.60 in the tall layout, with a different physical representation.)

  38. First Dataset Roles (Wide Data)
  Roles of the columns in the wide data set:
  - entry: (Unit) Identifier Component (101, 132)
  - datetime: Attribute Component (2020-07-14T13:54, 2020-07-14T14:03)
  - systolic, diastolic, position, weight, temp, pctO2, pulse, away, exposed: Variable Value Components

  39. Second Data Set Roles (Long Data)
  Roles of the columns in the tall data set:
  - Entry: (Unit) Identifier Component
  - DateTime: Identifier Component
  - Measure: Variable Descriptor Component (systolic, diastolic, weight, temp, pctO2, pulse, away, exposed)
  - Position: Attribute Component
  - Value: Variable Value Component
  The Variable Descriptor Component has values taken from the list of non-Unit Identifiers and Variable Components in the first data set. (This can be programmatically known.) The key for each value is composed from the identifiers plus the Variable Descriptor.
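
Because these roles can be programmatically known, the cast back to wide form can be driven by metadata rather than hand-written code. A sketch, continuing from the tall DataFrame built earlier; the roles dictionary is an illustrative stand-in for a real DDI-CDI structure description:

  # Use the role metadata to cast the tall data back to wide form: the
  # identifiers and attributes become the index, the variable descriptor
  # supplies the column names, and the value component fills the cells.
  roles = {
      "identifier": ["entry", "datetime"],
      "attribute":  ["position"],
      "descriptor": "measure",
      "value":      "value",
  }
  rebuilt = tall.pivot(
      index=roles["identifier"] + roles["attribute"],
      columns=roles["descriptor"],
      values=roles["value"],
  ).reset_index()
  print(rebuilt)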

  40. What is Gained?
  - Statistical packages have been able to cast between various data structures for a long time, but this requires human input and set-up (part of your 80% resource burden)
  - By making structural information at this level explicit:
    - These processes can be automated, lowering the resource burden
    - Unanticipated/specialized data structure transformations can be supported
  - This does not solve semantic mapping, but can support it

  41. Application: A Provenance Browser
  One question researchers often ask is: "Where did this number come from? What is it really?" Typically, the answer is unsatisfying and/or non-existent. The following examples are drawn from an application being developed using the DDI-CDI model to bring together processes and data sets, and the metadata attached to them, to give researchers a useful way to look at any portion of the provenance chain and answer this question.

  42. Application: A Provenance Browser (cont.)
  - This application was one of the test cases for DDI-CDI specification development
  - Provenance metadata (the processes) is mined programmatically from the ETL platform, Pentaho
    - These are often chains which leverage Stata scripts to perform the processing itself
  - Variable descriptions are taken from DDI Codebook XML
  - Human-readable "Purpose" statements are added manually
  - The browser brings this together in an easy-to-use form, from the viewpoint of a specific process step or data set
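
The underlying idea can be sketched as a tiny graph walk; the dataset and job names below are invented, whereas the real application mines this structure from Pentaho:

  # A minimal provenance graph: which job produced each dataset, and which
  # datasets each job consumed. Walk backwards to answer "where did this
  # dataset come from?".
  produced_by = {"clean_data": "core_etl", "quality_metrics": "qm_job"}
  inputs_of   = {"core_etl": ["raw_events"], "qm_job": ["clean_data"]}

  def provenance(dataset, depth=0):
      print("  " * depth + dataset)
      job = produced_by.get(dataset)
      if job:
          print("  " * depth + f"  <- produced by {job}")
          for upstream in inputs_of.get(job, []):
              provenance(upstream, depth + 1)

  provenance("quality_metrics")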

  43. [Mock-up: provenance browser screen for a named process, showing two jobs (each with their tasks), the input and output datasets flowing between them, a list of datasets, and Overview/Purpose/Algorithm tabs for the selected job]

  44. [Mock-up: the same browser from the viewpoint of a named dataset, showing the job that created it and the jobs that consume it, alongside the dataset list and Overview/Purpose/Algorithm tabs]

  45. [Mock-up: a task file opened in the browser, showing the Stata script behind a task:]

  use hsb2, clear

  * get values for boxplot
  summarize write, d
  gen f=43 /* set value a little larger than bin with highest frequency */
  gen pmin=r(min)
  gen p25=r(p25)
  gen p50=r(p50)
  gen p75=r(p75)
  gen pmax=r(max)
  gen pmean=r(mean)

  * graph histogram and boxplot on same axes
  two (histogram write, start(30) width(5) freq) ///
      (rcap pmin pmax f in 1, hor bcolor(dknavy)) ///
      (rbar p25 p75 f in 1, hor bcolor(dknavy)) ///
      (rcap p50 p50 f in 1, hor bcolor(white)) ///
      (rcapsym pmean pmean f in 1, hor msym(plus) mcolor(white)), ///
      legend(off) xtitle("Writing Score") ytitle("Frequency")

  * drop variables created for boxplot values
  drop f-pmean

  46. EXAMPLE: Data Quality Metrics (Job)
  ALPHA Data Pipeline Spec. 6.1 - Business Processes:
  - [02] Core ETL for Raw Input (Overview | Purpose)
    - Steps: [2.1] Generate Anonymized IDs; [2.2] Map Original IDs to Anonymized IDs; [2.3] Store ID Mapping; [2.4] Create 6.1 Data from Raw Data
  - [03] 6.1 Data Quality Metrics (Overview | Purpose)
    - Steps: [3.1] Compile Quality Metrics; [3.2] Compile Residency Starting Events; [3.3] Compile Residency Ending Events; [3.4] Compile Legal and Illegal Starting Events
    - Input dataset: [1] Raw 6.1 Event Format
    - Output datasets: [2] DoB Quality Metrics; [3] Illegal Transitions

  47. Example: Descriptive Metadata about a Job
  ALPHA Data Pipeline Spec. 6.1 - Business Processes: [02] Core ETL for Raw Input - Purpose:
  "Creates staging tables from member-centre-specific data. The staging tables are then transformed further to create the ALPHA specification 6.1."
  (Shown in the browser alongside the process tree: [02] Core ETL for Raw Input, steps 2.1-2.4; [03] 6.1 Data Quality Metrics, steps 3.1-3.4.)

  48. Example: Conceptual Metadata about a Data Set
  ALPHA Data Pipeline Spec. 6.1 - Business Processes: [3.2] Compile Residency Starting Events
  Description: Identify in the data the events that start a residency episode (birth, external immigration, enumeration, becoming eligible for a study, being found after being lost to follow-up).
  Concepts: This algorithm step references the following study concepts: residency, birth, migration, registered individual or social group (e.g., a household). There are two types of migration that occur among the registered population: internal and external migration. The change of residence by a…

  49. Example: Structural/Codebook Metadata about a Data Set
  ALPHA Data Pipeline Spec. 6.1 - Data Set [1] Raw 6.1 Event Format (creating process: [01] Site-Specific ETL; consuming processes: [03] 6.1 Data Quality Metrics, [04] Clean 6.1 Data):
  - recnr [INT]
  - study_name [VARCHAR(15)]
  - idno [VARCHAR(32)]
  - hhold_id [VARCHAR(32)]
  - hhold_id_extra [VARCHAR(32)]
  - sex [INT]
  - dob [DATETIME]
  - residence [VARCHAR(5)]
  - eventnr [INT]
  - event [INT]
  - event_date [DATETIME]
  - type_of_date [INT]
  - obs_date [DATETIME]
  - obs_round [VARCHAR(2)]

  50. Application: Transparency - Replication/Reproduction of Findings
  - With rich provenance metadata, it is much easier for humans to reproduce findings
  - With a complete machine-actionable record of data provenance, reproduction of findings can be performed by machines: "computational replication" (sketched below)
  - Transparency requires this ability, but problems of scale will demand that these processes be more efficient!
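
A minimal sketch of computational replication, assuming the provenance record stores a checksum of each output; the file name and the stand-in step are invented for illustration:

  import hashlib

  def checksum(path):
      # SHA-256 of a file, as might be stored in the provenance record.
      with open(path, "rb") as f:
          return hashlib.sha256(f.read()).hexdigest()

  def step():
      # Stand-in for a recorded pipeline step.
      with open("output.csv", "w") as f:
          f.write("entry,temp\n101,36.44\n")

  step()
  recorded = checksum("output.csv")   # captured when the pipeline first ran

  step()                              # a machine re-executes the step later
  print("reproduced:", checksum("output.csv") == recorded)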
