Data Documentation Initiative (DDI) for Enhanced EOSC Applications
An introduction to DDI-CDI and its relevance to EOSC applications, covering examples from the specification, foundational metadata, possible applications, questions for consideration, its relationship to the FAIR data principles, and the challenges of research across domain boundaries. The DDI standards aim to improve data documentation in the Social, Behavioural, and Economic sciences in order to facilitate data sharing and reuse.
DDI-CDI: An Introduction for Possible EOSC Applications
Outline
- Introduction/Background
- Examples from the Specification
- Foundational Metadata: Datums and the Variable Cascade
- Structural Description of Data
- Process Description
- Possible Applications
  - Application: Recognizing Similar Variables in Difficult Cases
  - Application: Automating Data Integration
  - Application: The Provenance Browser
  - Application: Transparency (Replication/Reproduction of Findings)
- Summary
Questions to Consider
- Can DDI-CDI help you solve problems you face within your EOSC area?
- Would DDI-CDI allow you to do things which you are not currently able to do?
- What other standards are important, and how would DDI-CDI need to interface with them?
- Are there requirements for integration of data for secondary use or across domain boundaries? Do we have the needed standards/metadata to address these today?
DDI-CDI and FAIR
- Many people talk about Findability and Access, but not so much about Interoperability and Reuse
- DDI-CDI focuses on these aspects of FAIR data (it is also quite useful for data discovery)
- Interoperability and reuse of data are metadata-intensive
- Historically, these aspects of data management are expensive and have not been fully incentivized by research funders
- Today's focus on FAIR data demands that we do more!
Challenges
- Increasingly, research is conducted across domain boundaries: "Grand Challenges" such as COVID-19, climate change, resilient cities, etc.
- Technology is able to scale very efficiently: cloud computing, big data technologies
- Data-hungry approaches to research are becoming common: social media as a data source, machine learning
- The bottleneck is the metadata: how do we understand the data we can find and access?
- When EOSC becomes a reality, problems of scale will increase massively
- FAIR data means more data-sharing than is possible today!
Background: DDI Standards
- The Data Documentation Initiative (DDI) has produced metadata standards for documenting data in the Social, Behavioural, and Economic (SBE) sciences (and also official statistics at the national level)
  - DDI Codebook and DDI Lifecycle: granular, machine-actionable metadata for describing data (XML)
- DDI Cross-Domain Integration (DDI-CDI) is different:
  - A model for describing data across a wide range of domains/structures: sensor and event data, key-value data, rectangular data, multi-dimensional data
  - A model for describing the processing of data from one form into another
  - A model for showing how individual datums are used across the entire chain
  - A complement to traditional metadata/documentation standards
Typical Data Transformations
Typical stages: raw data, microdata/unit-record data, aggregate data/data cubes, indicators.
- DDI-CDI describes the data at each stage, indicating the role played by each atomic bit of data (a "datum")
- DDI-CDI tracks the processing between each stage (implementing PROV), reflecting the relationships between atomic datums, and uses other standards (such as SDTL) for describing specific processes
Cross-Domain Data Sharing
- Within a domain, the available data is understood by researchers/users: they are familiar with the organizations and projects, familiar with the domain/literature, and share common tools and data structures
- Secondary use is already hard: it demands good metadata and documentation
- Cross-domain data is even harder to use: lack of background/context, different tools and data structures, different semantics
- Using and integrating external data is an expensive, manual process: as much as 80% of the effort in a research project(!)
DDI-CDI and Data Integration
- Data described using the DDI-CDI model comes with a richer set of context: data provenance. What were the inputs? How were they processed? Can findings be reproduced?
- The integration of data described in DDI-CDI can be automated to a greater extent:
  - The role played by each datum is known (identifier, measure, dimension, etc.)
  - The relationship between datums is explicit because the processing is described
  - Changes in the role played by a datum in different structures can be predicted (programmatically) and documented (for future reference)
DDI-CDI Development
- DDI-CDI was developed based on real-world use cases: they were already using other standards/specifications and involved data integration across domain boundaries
- The goal was to fill in the missing pieces: tie existing metadata/information together and avoid duplication where possible
- Cross-platform and technology-neutral: the UML model is the centerpiece; it has an XML binding and can be used with any technology binding
- Currently in public review, with release in Q2 of 2021
- Actively engaging across domains to validate/refine the model
What Does DDI-CDI Do?
- Describes foundational metadata in rich detail: concepts (and their uses), classifications/codelists, variables, datums/data points, etc.
- Describes a range of data structures: wide/rectangular/unit-record data, tall/event/sensor data, key-value ("big data"/NoSQL) data, and multidimensional ("cube") data/time series/indicators
- Describes process: declarative ("black box", multi-threaded) and procedural (stepwise)
- Ties process to data, describing provenance chains
What's in the Specification? (1)
- DDI-CDI is a formal UML class model, expressed in a portable form for easy use by many UML tools
  - Portability is through Canonical XMI, an interchange format for UML models
  - Uses a limited subset of the UML class model
- UML can be easily implemented across technology platforms: auto-generation of syntax bindings, and familiar to many developers
- UML future-proofs the specification: emerging technology platforms can be supported, and the content of the model is extensible
What's in the Specification? (2)
- An XML representation is provided, described in W3C XML Schema
- Other representations (e.g., RDF, JSON) are being considered
- Documentation of the model and schemas: architecture and design documentation, including documentation of the approach to other standards
- Examples of use, including with other standards (e.g., DDI Codebook/Lifecycle, Schema.org, etc.)
Alignment with Other Standards
DDI-CDI attempts to align with a number of other standards:
- For process description: PROV-O (and ProvONE), which it directly implements; SDTL/VTL, for which it provides for combined use; BPMN (formal alignment)
- For discovery: Schema.org; DCAT (in progress)
- For data description: DDI Codebook, DDI Lifecycle; GSIM; SDMX/Data Cube; others (see documentation)
- Open to further work in this area!
Examples from the Model: Features of DDI-CDI to Illustrate its Potential Utility
Foundational Metadata: The Variable Cascade
- Understanding the roles played by variables is critical in the integration of data
- Variables do many, many different things; not all variables are the same!
- We have three levels of variables in our model: Conceptual Variables, Represented Variables, and Instance Variables
Variable Cascade: ConceptualVariable. Variable descriptions at a high level, used early in designing data collection and for broad searches. Broadly reusable.
Variable Cascade: RepresentedVariable. Adds more specificity about the value domain and units of measurement. Still reusable.
Variable Cascade: InstanceVariable. Describes the collected data: physical datatype and platform, and the invariant role of the variable (e.g., a weight).
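To make the cascade concrete, here is a minimal sketch in Python; the class names mirror the three DDI-CDI levels, but the attributes and the example values are illustrative, not the specification's actual properties:

    from dataclasses import dataclass

    @dataclass
    class ConceptualVariable:
        # High-level description of what is measured, independent of any representation
        name: str
        concept: str          # e.g. "body temperature"

    @dataclass
    class RepresentedVariable:
        # Adds a value domain and unit of measurement; still reusable across datasets
        conceptual: ConceptualVariable
        value_domain: str     # e.g. "decimal number"
        unit: str             # e.g. "degrees Celsius"

    @dataclass
    class InstanceVariable:
        # Describes actual collected data: dataset context and physical datatype/platform
        represented: RepresentedVariable
        dataset: str
        physical_datatype: str

    temp_concept = ConceptualVariable("temp", "body temperature")
    temp_repr = RepresentedVariable(temp_concept, "decimal number", "degrees Celsius")
    temp_2020 = InstanceVariable(temp_repr, "door_screening_2020.csv", "float64")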
DDI-CDI and Domain Ontologies
- DDI-CDI allows for the use of any concept system; it does not model the semantics of any specific domain
- Rich model for classifications, codelists, and controlled vocabularies
- Mechanism for referencing domain ontologies: the formal use of concepts is specified (concepts can be variables, categories, populations, units, etc.), and concept systems are modelled generically but can be external
- Semantic mapping is not supported: semantic mapping has its own standards/mechanisms, but DDI-CDI provides a framework within which semantic mapping becomes more meaningful
Data Structures
DDI-CDI can currently describe four different data structures:
- Wide: as with unit records
- Tall: as with event or stream data
- Key-value: as in a key-value store
- Dimensional: as with aggregate data
Our Example
Imagine that a program (Python?) collects COVID-related information at building doors:
- Blood pressure (systolic, diastolic)
- Position for BP (prone, sitting, standing)
- Weight
- Temperature
- pctO2
- Pulse
- beenToFloridaEtc? (yes, no)
- Exposed? (yes, no)
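One such screening record might look roughly like this in the collecting program; the field names match the data examples that follow, and the values are taken from the first example row:

    # A single door-screening record (values from the first example row below)
    record = {
        "entry": 101,                      # unit identifier
        "datetime": "2020-07-14T13:54",
        "systolic": 114, "diastolic": 70,  # blood pressure
        "position": 2,                     # position during the BP measurement
        "weight": 83914.6,
        "temp": 36.44,
        "pctO2": 98,
        "pulse": 70,
        "away": "n",                       # beenToFloridaEtc?
        "exposed": "n",
    }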
Wide Example
As a spreadsheet table (or equivalently as tab-delimited text lines):

entry  datetime          systolic  diastolic  position  weight   temp   pctO2  pulse  away  exposed
101    2020-07-14T13:54  114       70         2         83914.6  36.44  98     70     n     n
132    2020-07-14T14:03  125       86         3         68038.9  37.5   85     92     y     n
Tall Example

Entry  DateTime          Measure    Position  Value
101    2020-07-14T13:54  systolic   2         114
101    2020-07-14T13:54  diastolic  2         70
101    2020-07-14T13:54  weight     2         83914.60
101    2020-07-14T13:54  temp       2         36.44
101    2020-07-14T13:54  pctO2      2         98
101    2020-07-14T13:54  pulse      2         70
101    2020-07-14T13:54  away       2         n
101    2020-07-14T13:54  exposed    2         n
132    2020-07-14T14:03  systolic   3         125
132    2020-07-14T14:03  diastolic  3         86
132    2020-07-14T14:03  weight     3         68038.90
132    2020-07-14T14:03  temp       3         37.5
132    2020-07-14T14:03  pctO2      3         85
132    2020-07-14T14:03  pulse      3         92
132    2020-07-14T14:03  away       3         y
132    2020-07-14T14:03  exposed    3         n
Key-Value Example

Key                                Value
101_2020-07-14T13:54_2_systolic    114
101_2020-07-14T13:54_2_diastolic   70
101_2020-07-14T13:54_2_weight      83914.60
101_2020-07-14T13:54_2_temp        36.44
101_2020-07-14T13:54_2_pctO2       98
101_2020-07-14T13:54_2_pulse       70
101_2020-07-14T13:54_2_away        n
101_2020-07-14T13:54_2_exposed     n
132_2020-07-14T14:03_3_systolic    125
132_2020-07-14T14:03_3_diastolic   86
132_2020-07-14T14:03_3_weight      68038.90
132_2020-07-14T14:03_3_temp        37.5
132_2020-07-14T14:03_3_pctO2       85
132_2020-07-14T14:03_3_pulse       92
132_2020-07-14T14:03_3_away        y
132_2020-07-14T14:03_3_exposed     n
Dimensional Example
Dimensions are defined by away and exposed. For each combination of dimension values there is a summary value: the mean of temp. The dimensional data are shown here in two layouts, a cross-tabulation and a tall structure.

Tall structure:
away  exposed  meanTemp
Y     Y        38.3
Y     N        37.8
N     Y        37.2
N     N        36.6

Cross-tabulation (meanTemp):
            away: Y   away: N
exposed: Y  38.3      37.2
exposed: N  37.8      36.6

Questions:
- Have you traveled outside of the county in the last two weeks? (circle one) Yes / No
- Have you had contact with anyone diagnosed with Covid-19? (circle one) Yes / No
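The four layouts can be produced from one another once the structure is known. A minimal sketch with pandas (column names follow the examples above; this is only an illustration, not DDI-CDI tooling):

    import pandas as pd

    # The wide layout from the example above
    wide = pd.DataFrame([
        {"entry": 101, "datetime": "2020-07-14T13:54", "systolic": 114, "diastolic": 70,
         "position": 2, "weight": 83914.6, "temp": 36.44, "pctO2": 98, "pulse": 70,
         "away": "n", "exposed": "n"},
        {"entry": 132, "datetime": "2020-07-14T14:03", "systolic": 125, "diastolic": 86,
         "position": 3, "weight": 68038.9, "temp": 37.5, "pctO2": 85, "pulse": 92,
         "away": "y", "exposed": "n"},
    ])

    # Tall: one row per datum, identified by entry/datetime, with position carried along
    tall = wide.melt(id_vars=["entry", "datetime", "position"],
                     var_name="measure", value_name="value")

    # Key-value: the key is composed from the identifiers, position, and the measure name
    kv = {f"{r.entry}_{r.datetime}_{r.position}_{r.measure}": r.value
          for r in tall.itertuples()}

    # Dimensional: a summary value (mean of temp) for each combination of away and exposed
    cube = wide.groupby(["away", "exposed"])["temp"].mean().reset_index(name="meanTemp")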
Roles: Identifiers, Measures, and Attributes
In the tall layout above, each column plays a defined role: Entry and DateTime are Identifiers, the Measure column is the VariableDescriptorComponent, Position is an AttributeComponent, and the Value column is the VariableValueComponent.
The DDI-CDI Process Model
- Describes the use of individual processes, and how they fit together
- Supports standard descriptions (SDTL, VTL) and specific languages (SQL, R, Stata, SPSS, Python, SAS, etc.)
- Three modes:
  - Procedural: step-wise, with decision points
  - Declarative: "black box", multi-threaded; uses a playbook and configurations
  - Hybrid approaches of the two
[Diagram: a declarative process, in which a process engine executes a playbook of functions with parameters over inputs and outputs, contrasted with a procedural process, in which an activity is governed by control logic and broken into steps and sub-steps, each with its own inputs and outputs.]
Process Model (High Level)
[UML class diagram: Activities, Steps, and ControlLogic (Sequence, ConditionalControlLogic, RuleBasedScheduling, TemporalControlConstruct, AllenIntervalAlgebra) are related to Services, ProcessingAgents, Parameters, InformationFlowDefinitions, and InformationObjects, and are aligned with PROV-O (Entity, Activity, Agent) and ProvONE (Workflow, Program).]
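As a rough illustration of the kind of information the process model captures (the dictionary below is ad hoc rather than the DDI-CDI classes; the step name is borrowed from the pipeline example later in this deck, and the file names are invented):

    # A procedural step with its control logic, inputs, outputs, and implementation
    step = {
        "activity": "Create 6.1 Data from Raw Data",
        "mode": "procedural",                              # vs. declarative / hybrid
        "control_logic": {"type": "Sequence",
                          "sub_steps": ["load", "recode", "save"]},
        "uses": ["raw_events.csv"],                        # input information objects
        "produces": ["spec61_events.csv"],                 # output information objects
        "implemented_by": {"language": "Stata", "script": "make_spec61.do"},
    }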
Possible Applications of DDI-CDI: Examples of What Might Be Done
Application: Recognizing Similar Variables in Difficult Cases
Two variables in different data sets might:
- Measure the same concept differently
- Measure the same concept in the same way, but with different physical representations
- Exist identically in two data sets, but with no formal link
In all of these cases, understanding the variables at each level (conceptual, representational, and actual) provides a strong basis for programmatically identifying them as potential points for joining data sets.
Documenting Comparability among Variables (Simple Example)
- Conceptual variable: a common variable specification without a representation
  - maritalstatus (conceptual variable)
- Represented variable: a common variable specification with a code representation
  - maritalstatus (represented variable): 1 = Married, 2 = Not Married
  - maritalstatusplus (represented variable): A = Married, B = Not Married, C = Don't Know
- Variable: a variable specification within a dataset context
  - maritalstatus2018 (variable), maritalstatus2007 (variable), maritalstatus2010 (variable)
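Because both represented variables point back to the same conceptual variable, a harmonization rule can be expressed as a simple code mapping. A minimal sketch (the common target codes are invented here for illustration):

    # Two code representations of the conceptual variable "maritalstatus"
    maritalstatus_codes = {"1": "Married", "2": "Not Married"}
    maritalstatusplus_codes = {"A": "Married", "B": "Not Married", "C": "Don't Know"}

    # Map both onto a common target representation for integration
    target = {"Married": 1, "Not Married": 2, "Don't Know": 9}

    def harmonize(value, source_codes):
        # Translate a source code to its category, then to the target code
        category = source_codes.get(value)
        return target.get(category)

    harmonize("A", maritalstatusplus_codes)   # -> 1
    harmonize("2", maritalstatus_codes)       # -> 2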
Application: Automating Data Integration
- If I understand the role played by any given data point in its data set of origin, I can predict what role it must play in the data set I need to transform it into for integration purposes
- The DDI-CDI model shows us how these relate, and can avoid manual intervention in performing the needed structural transformations
- This reduces the (up to 80%) resource burden on projects for preparing data for analysis
The example records in both layouts: the tall structure (Entry, DateTime, Measure, Position, Value) and the wide structure (entry, datetime, systolic, diastolic, position, weight, temp, pctO2, pulse, away, exposed), as shown above for entries 101 and 132. Systolic, diastolic, and position could be defined as a variable collection with a structure indicating that ...
First Data Set Roles (Wide Data)
In the wide layout (above), each column plays one of three roles: (Unit) Identifier Component, Attribute Component, or Variable Value Component. Here entry identifies the unit, and the measured columns (systolic, diastolic, position, weight, temp, pctO2, pulse, away, exposed) are the Variable Value Components; the remaining columns play identifier or attribute roles.
Second Data Set Roles (Long Data)
In the tall layout (above), Entry is the (Unit) Identifier Component, DateTime is an Identifier Component, Measure is the Variable Descriptor Component, Position is an Attribute Component, and Value is the Variable Value Component.
- The Variable Descriptor Component has values taken from the list of non-unit Identifiers and Variable Components in the first data set. (This can be programmatically known.)
- The key for each value is composed from the identifiers plus the Variable Descriptor.
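A sketch of how this transformation could be driven by the role information, assuming the component roles have already been read from a DDI-CDI data structure description (the roles dictionary below is hand-written for illustration):

    import pandas as pd

    # Component roles as they might be extracted from a description of the wide data set
    roles = {
        "entry": "UnitIdentifierComponent",
        "datetime": "IdentifierComponent",
        "position": "AttributeComponent",
        # every remaining column plays the measure (VariableValueComponent) role
    }

    def wide_to_tall(wide, roles):
        ids = [c for c in wide.columns if roles.get(c, "").endswith("IdentifierComponent")]
        attrs = [c for c in wide.columns if roles.get(c) == "AttributeComponent"]
        # Remaining columns become values of the Variable Descriptor Component
        measures = [c for c in wide.columns if c not in ids + attrs]
        tall = wide.melt(id_vars=ids + attrs, value_vars=measures,
                         var_name="measure", value_name="value")
        # Key for each value: the identifiers plus the variable descriptor
        tall["key"] = tall[ids + ["measure"]].astype(str).agg("_".join, axis=1)
        return tall

The reverse (tall to wide) casting can be derived from the same role information, which is what makes such transformations automatable.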
What is Gained?
- Statistical packages have been able to cast between various data structures for a long time, but this requires human input and set-up: part of your 80% resource burden
- By making structural information at this level explicit, these processes can be automated, lowering the resource burden, and unanticipated/specialized data structure transformations can be supported
- This does not solve semantic mapping, but it can support it
Application: A Provenance Browser
One question researchers often ask is: "Where did this number come from? What is it really?" Typically, the answer is unsatisfying and/or non-existent. The following examples are drawn from an application being developed using the DDI-CDI model to bring together processes and data sets, and the metadata attached to them, to give researchers a useful way to look at any portion of the provenance chain and to answer this question.
Application: A Provenance Browser (cont.)
- This application was one of the test cases for DDI-CDI specification development
- Provenance metadata (the processes) are mined programmatically from the ETL platform, Pentaho; these are often chains which leverage Stata scripts to perform the processing itself
- Variable descriptions are taken from DDI Codebook XML
- Human-readable "Purpose" statements are added manually
- The browser brings this together in an easy-to-use form, from the viewpoint of a specific process step or data set
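The kind of merged record the browser works from might look roughly like this (the structure and field values are invented; the three metadata sources correspond to the bullets above):

    # One browsable view per process step, merged from three metadata sources
    step_view = {
        "step": "Compile Quality Metrics",             # mined from the Pentaho ETL jobs
        "script": {"language": "Stata", "path": "..."},
        "inputs": ["Raw 6.1 Event Format"],            # data sets read by the step
        "outputs": ["DoB Quality Metrics"],            # data sets written by the step
        "variables": {                                 # descriptions from DDI Codebook XML
            "dob": "Date of birth of the registered individual",
        },
        "purpose": "Compile quality metrics on dates of birth",   # added manually
    }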
[Mock-up: provenance browser view of a process, showing two named jobs with their tasks, the input and output data sets flowing between them, tabs for Overview / Purpose / Algorithm, and a list of the data sets involved.]
[Mock-up: the same browser from the viewpoint of a single data set, showing the job that creates it and the jobs that consume it.]
[Mock-up: the browser view of a task file, showing the Stata script that implements the processing step:]

use hsb2, clear

* get values for boxplot
summarize write, d
gen f=43 /* set value a little larger than bin with highest frequency */
gen pmin=r(min)
gen p25=r(p25)
gen p50=r(p50)
gen p75=r(p75)
gen pmax=r(max)
gen pmean=r(mean)

* graph histogram and boxplot on same axes
two (histogram write, start(30) width(5) freq) ///
    (rcap pmin pmax f in 1, hor bcolor(dknavy)) ///
    (rbar p25 p75 f in 1, hor bcolor(dknavy)) ///
    (rcap p50 p50 f in 1, hor bcolor(white)) ///
    (rcapsym pmean pmean f in 1, hor msym(plus) mcolor(white)), ///
    legend(off) xtitle("Writing Score") ytitle("Frequency")

* drop variables created for boxplot values
drop f-pmean
Example: Data Quality Metrics (Job)
ALPHA Data Pipeline Spec. 6.1 Business Processes:
- [02] Core ETL for Raw Input (Overview, Purpose). Steps: [2.1] Generate Anonymized IDs; [2.2] Map Original IDs to Anonymized IDs; [2.3] Store ID Mapping; [2.4] Create 6.1 Data from Raw Data
- [03] 6.1 Data Quality Metrics (Overview, Purpose). Steps: [3.1] Compile Quality Metrics; [3.2] Compile Residency Starting Events; [3.3] Compile Residency Ending Events; [3.4] Compile Legal and Illegal Starting Events
For job [03] 6.1 Data Quality Metrics: input data set [1] Raw 6.1 Event Format; output data sets [2] DoB Quality Metrics and [3] Illegal Transitions.
Example: Descriptive Metadata about a Job
ALPHA Data Pipeline Spec. 6.1 Business Processes: [02] Core ETL for Raw Input, Purpose:
"Creates staging tables from member-centre-specific data. The staging tables are then transformed further to create the ALPHA specification 6.1."
Example: Conceptual Metadata about a Data Set
ALPHA Data Pipeline Spec. 6.1 Business Processes: [3.2] Compile Residency Starting Events
- Description: Identify in the data the events that start a residency episode (birth, external immigration, enumeration, becoming eligible for a study, found after being lost to follow-up).
- Concepts: This algorithm step references the following study concepts: residency, birth, migration, registered individual or social group (e.g., a household). There are two types of migration that occur among the registered population: internal and external migration. The change of residence by a ...
Example: Structural/Codebook Metadata about a Data Set
ALPHA Data Pipeline Spec. 6.1: data set [1] Raw 6.1 Event Format, created by process [01] Site-Specific ETL and consumed by processes [03] 6.1 Data Quality Metrics and [04] Clean 6.1 Data.
Data set structure:
recnr [INT], study_name [VARCHAR(15)], idno [VARCHAR(32)], hhold_id [VARCHAR(32)], hhold_id_extra [VARCHAR(32)], sex [INT], dob [DATETIME], residence [VARCHAR(5)], eventnr [INT], event [INT], event_date [DATETIME], type_of_date [INT], obs_date [DATETIME], obs_round [VARCHAR(2)]
Data sets in the pipeline: [1] Raw 6.1 Event Format, [2] DoB Quality Metrics, [3] Illegal Transitions.
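Pulling the example together: a sketch of the data-set-centred view as a simple structure a provenance browser could walk (hand-built here from the information on this and the preceding slides):

    dataset_view = {
        "dataset": "[1] Raw 6.1 Event Format",
        "columns": ["recnr", "study_name", "idno", "hhold_id", "hhold_id_extra", "sex",
                    "dob", "residence", "eventnr", "event", "event_date", "type_of_date",
                    "obs_date", "obs_round"],
        "created_by": "[01] Site-Specific ETL",
        "consumed_by": ["[03] 6.1 Data Quality Metrics", "[04] Clean 6.1 Data"],
    }

    def upstream(view):
        # Answer "where did this data set come from?" one link back in the chain
        return f'{view["dataset"]} was created by {view["created_by"]}'

    print(upstream(dataset_view))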
Application: Transparency - Replication/Reproduction of Findings
- With rich provenance metadata, it is much easier for humans to reproduce findings
- With a complete machine-actionable record of data provenance, reproducibility of findings can be performed by machines: "computational replication"
- Transparency requires this ability, but problems of scale will demand that these processes become more efficient!