Preferred Capabilities of Record Linkage Systems for Research
This presentation by Krista Park of the US Census Bureau explores the preferred capabilities of record linkage systems for facilitating research. It covers the researcher stakeholder team, project coordination and leadership oversight, and an eight-phase requirements-gathering and assessment process, with a focus on the intersection of research, data science, and optimization for effective record linkage.
Presentation Transcript
Preferred Capabilities of Record Linkage Systems for Facilitating Research
Record Linkage
Krista Park, US Census Bureau, Center for Optimization and Data Science
Presentation for FEDCASIC 2023
Disclaimers
* This presentation is released to inform interested parties of research and research requirements and to encourage discussion of work in progress. Any views expressed are those of the authors and not those of the U.S. Census Bureau.
* Although all of the contributing authors of the report on which this presentation is based are credited for their large efforts on the project, the contents of this presentation are solely those of the presenter; not all contributors had the opportunity to review this presentation in depth.
* This presentation includes no content derived from restricted data sets.
Researcher Stakeholder Team / Co-Authors

Name | Directorate | Center/Division | Field
Mishal Ahmed | R&M | CES | Economist
Glenn Ambill | Decennial | DITD | IT Specialist (APPSW)
John Cuffe | DEPDIR | DEPDIR | Survey Statistician / MAMBA
Khoa Dong | ECON | ESMD | Mathematical Statistician
Suzanne Dorinski | R&M | CES | Survey Statistician
Juan C. Humud | ECON | ERD | Survey Statistician / PVS Team
Shawn Klimek | R&M | CES | Economist
Daniel Moshinsky | Decennial | ADDC | IT Specialist (APPSW)
Kevin Shaw | Decennial | DITD | Mathematical Statistician / PEARSIS
Damon R. Smith | ECON | ERD | Mathematical Statistician / PVS Team
Yves Thibaudeau | R&M | CSRM | Research Mathematical Statistician
Christine Tomaszewski | ECON | ERD | Mathematical Statistician / PVS Team
Victoria Udalova | DEMO | ADDP | Economist / EHEALTH
Daniel Weinberg | R&M | CSRM | Research Mathematical Statistician
Daniel Whitehead | ECON | ESMD | Mathematical Statistician
Coordination, Leadership Oversight, Computer Environment Support, Program Management Support

Name | Directorate | Center/Division | Field
Krista Park | R&M | Center for Optimization and Data Science (CODS) |
Casey Blalock | R&M | Center for Economic Studies (CES) | Statistician (Data Scientist)
Steven Nesbit | Contractor | OCIO / ADRM |
J. David Brown | R&M | Center for Economic Studies (CES) | Economist
Kristee Camilletti | R&M | ADRM | Contracts / Acquisitions
Jaya Damineni | R&M | Center for Optimization and Data Science (CODS) | CODS ADC, Software Engineering
Ken Haase | R&M | ADRM | ST Computer Science
Anup Mathur | R&M | Center for Optimization and Data Science (CODS) | Chief, CODS
Vincent T. Mule | Decennial | DSSD | Mathematical Statistician

And thank you to the many people I didn't list.
Research v. Production Record Linkage

Research:
* Variety of data set formats / schema
* Easy to experiment with multiple blocking & matching algorithms (sketched below)
* Easy to change settings
* Detailed metadata for both the matching & computational systems

Production:
* Restricted data set formats / schema
* Able to run the planned blocking & matching algorithm
* Metadata required for downstream production
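The flexibility the research side calls for is easy to see in an open-source engine. Below is a minimal sketch, not the Census Bureau's own setup, using the Python recordlinkage toolkit and its bundled febrl sample data: three interchangeable blocking strategies are run against one fixed comparison configuration, the kind of experiment a locked-down production pipeline rarely permits.

```python
# A minimal sketch of research-style experimentation, assuming the
# open-source Python recordlinkage toolkit; not the Bureau's own code.
import recordlinkage
from recordlinkage.datasets import load_febrl4

df_a, df_b = load_febrl4()  # small simulated "original" and "duplicate" files

def make_blocker(strategy):
    """Build an indexer for one of several interchangeable blocking strategies."""
    indexer = recordlinkage.Index()
    if strategy == "block_surname":
        indexer.block("surname")
    elif strategy == "block_postcode":
        indexer.block("postcode")
    else:  # sorted-neighbourhood indexing on given name
        indexer.sortedneighbourhood("given_name", window=5)
    return indexer

# One comparison configuration, reused across blocking experiments.
compare = recordlinkage.Compare()
compare.string("given_name", "given_name", method="jarowinkler", threshold=0.85)
compare.string("surname", "surname", method="jarowinkler", threshold=0.85)
compare.exact("date_of_birth", "date_of_birth")

for strategy in ["block_surname", "block_postcode", "sorted_neighbourhood"]:
    pairs = make_blocker(strategy).index(df_a, df_b)
    features = compare.compute(pairs, df_a, df_b)
    matches = features[features.sum(axis=1) >= 2]  # simple rule-based classifier
    print(f"{strategy}: {len(pairs)} candidate pairs -> {len(matches)} links")
```

Swapping a blocking strategy here is a one-line change; in a production system the equivalent change typically requires re-validating the whole pipeline.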
Requirements Gathering & Refinement Phases
1. Current State Assessment (started March 2021)
2. Elicit Capability Requirements and Weights
3. Demonstrations
4. Gating
5. Technical Solutions Assessment (TSA) Scoring and Selection
6. POCs / QaQI and Score Update
7. TSA Model and Results
8. Findings
Phase 1: Current State
* Repository of Census Bureau record linkage software
* Research papers and internal reports
* External analysis: a list of commercial, academic, and open-source record linkage solutions
* Reviewed reports by research and advisory companies and organizations such as Forrester, Gartner, USDA, and MIT
Phase 2: Elicit Capability Requirements, Criteria, and Weights
* The Subject Matter Experts (SMEs) were divided into five teams. Each team met for 3-6 requirements-gathering workshops, each lasting approximately 2-3 hours.
* The groups were then shuffled into three teams that met for 3-hour workshops (over 10 workshops) to refine the requirements and finalize criteria and weights for each requirement.
* Finally, the entire group met to walk through and approve the requirements in another 3-hour workshop.
* Total: over 35 sessions.
Capability Requirements Categories and Topics
Requirements are organized into six categories aligned with the Census Bureau Technical Solutions Assessment (TSA) Framework published by the Chief Technology Office (CTO)/Office of Systems Engineering (OSE). Within each category, a comprehensive set of topics encapsulates the record linkage capability requirements (number of capability requirements per topic in parentheses):

* Technical (182): Data Handling (81); Outputs/Results (15); Indexing, Blocking, Clustering or Equivalent (16); Use Cases (6); Field Comparison (32); Matching / Classification (31); Technical Risk (1)
* Operations (102): Configurability (16); Operability (21); Supportability (13); Adaptability (1); Scalability (12); QaQI Quality Metrics and Experience (20)*; Maintainability (3); Monitorability (6); Manageability (1); Serviceability (8); Operational Risk (1)
* Performance (58): Diagnosability (7); Reportability (26); Elasticity (2); Response Time (5); Integration (5); Accuracy (2); Interoperability (3); Performance Metrics (6); Risk (2)
* Cost Factors (9): Solution Price Model (1); Commitment Requirement (1); Training Price Model (1); Hidden Costs (1); IT Implementation and Support Costs (1); Maintenance and Support Costs (1); Training (1); Enterprise Licensing Agreement (1); Cost Factor Risk (1)
* User Experience (9): Accessibility 508 (1); Product Documentation (1); Resilience (1); Localization (1); Technical Documentation (1); Usability (4)
* Security (18): Cybersecurity (11); Compliance (6); Source Code Availability (1)

*Developed in a later project phase
Total number of capability requirements: 378
Phase 3: Demonstrations
* 3-hour presentations on commercial, open-source, and internal record linkage solutions:
  * 90-minute demonstration
  * 45 minutes answering questions distributed in advance
  * 45 minutes of interactive Q&A plus suggestions/recommendations
* Demonstrations used simulated data generated by the febrl data generator (200k records in the original file; 300k records in the duplicate file); a small stand-in sketch follows.
* Demonstrations were explicitly NOT conducted as part of an acquisition, with several layers of protection to ensure that Census Bureau staff who participate in future acquisitions in this area will not receive the results of these demonstrations.
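For readers who want to reproduce the flavor of those demonstrations without the original generator output, the Python recordlinkage toolkit ships small febrl-style samples with the same original-file/duplicate-file structure and known truth. A sketch, assuming that toolkit:

```python
# Small stand-in for the febrl generator output used in the demos
# (the demos used 200k originals / 300k duplicates; these samples are far smaller).
from recordlinkage.datasets import load_febrl4

df_original, df_duplicates, true_links = load_febrl4(return_links=True)
print(len(df_original), "originals,", len(df_duplicates), "duplicates,",
      len(true_links), "true links")
```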
Phase 4: Gating / Phase 5: Technical Solutions Assessment (TSA) Scoring and Selection
* 43 key capabilities were identified and used to assess the entire pool of candidate software packages and eliminate (gate) those that would not meet the key requirements.
* The remaining packages were scored against the entire TSA (a schematic sketch of the gate step follows).

Lessons learned:
* Small commercial, internal, and open-source solutions were primarily built for the specific purpose of performing record linkage at a component level, whereas larger commercial solutions attempt to deliver end-to-end enterprise capabilities.
* A large, complex, simulated data set is needed to enable this type of product research; the simulated data set used was too small to fully evaluate the products.
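The gate step itself is a simple set-cover check: any candidate missing one of the key capabilities is eliminated before full TSA scoring. A schematic sketch, with hypothetical package names and capability labels:

```python
# Schematic gating sketch; package names and capability labels are hypothetical.
KEY_CAPABILITIES = {
    "reads_common_data_formats",
    "logging",
    "deduplication",
    "match_quality_indicators",
}

candidates = {
    "package_a": {"reads_common_data_formats", "logging", "deduplication",
                  "match_quality_indicators", "gui"},
    "package_b": {"reads_common_data_formats", "deduplication"},  # no logging: gated out
}

# Keep only packages whose capabilities cover every key capability.
gated_in = [name for name, caps in candidates.items()
            if KEY_CAPABILITIES <= caps]
print(gated_in)  # -> ['package_a']
```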
Reasons for Initial Gating
Criteria that eliminated solutions: Data Formats; Databases Supported; Isolate/Handle Errors w/o Disruption; Logging; Maintain Original Dataset; Turn Off Built-In Standardizer; Technical Risk.

Gating capability requirement topics by category:
* Technical: Data Formats; Databases Supported; Maintain Original Dataset; Local Storage Locations; Turn Off Built-In Standardizer; Process Missing Values; Isolate/Handle Errors w/o Disruption; Logging; Select/Customize Blocking Variables Only; Non-Traditional Indexing Explains Approach; Support Deduplication; Matching a single dataset to another; Matching 1:M datasets; Matching M:M datasets; Select two or more datasets to match; De-Duplication Facility; Identify Best Match Available; Store Possible Matches; Match Quality Indicators; Technical Risk
* Operations: Census Bureau Approved Operating System; Commercial Software Ability to Obtain SWG Approval; Operational Risk (e.g., Stability)
* Performance: Enterprise Standards Profile Ability to be Compatible; Performance Risk (e.g., Stability)
* Cost Factors: Solution Pricing Model; Commitment Requirement; Hidden Costs; Maintenance and Support Costs; Cost Factor Risk (e.g., No Commercial Pricing, No U.S. Sales)
* User Experience: Localization; Minimal Skillset Level; Overall Product Documentation; Technical Documentation
* Security: Internet Connections; Security Monitoring; Enterprise Certificate Authority; RDP or SSH Support; Cryptography; Secure Sockets Layer (SSL); Transport Layer Security (TLS); FedRAMP Solution; National Origin of Solution
Phase 5 (cont.): Condensed TSA
* SMEs realized that Census internal and open-source solutions were not intended to be end-to-end solutions (the model the TSA was designed for).
* Created a subset of the capability requirements (194/358) focused on record linkage engines, better suited to this set of solutions.
* Team recommendations for the hands-on test were informed by, not dictated by, the scores (a generic scoring sketch follows). Considerations:
  * Number of solutions achievable within the schedule
  * Is there value added by more algorithms or features in Records Linkage in Python?
  * The capability to process the number of records defined in the QaQI use cases
  * User base for the solution vs. the language
  * Can the solution run on Spark?
  * Relative speed of the solution
  * Ability to change solution code
  * Amount of debugging required to support the QaQI
  * QaQI team skillset
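The report's actual scoring model is not reproduced here, but the mechanics of a TSA-style assessment can be sketched generically: SME-elicited weights per requirement, a rating per solution per requirement, and a weighted sum used to inform (not dictate) the recommendation. All names and numbers below are hypothetical.

```python
# Generic weighted-sum scoring sketch; weights, ratings, and names are hypothetical.
weights = {"data_formats": 5, "blocking_options": 4, "spark_support": 3}

ratings = {
    "solution_x": {"data_formats": 4, "blocking_options": 5, "spark_support": 2},
    "solution_y": {"data_formats": 3, "blocking_options": 3, "spark_support": 5},
}

def tsa_score(rating):
    """Weighted sum of per-requirement ratings."""
    return sum(weights[req] * rating[req] for req in weights)

for name in sorted(ratings, key=lambda n: tsa_score(ratings[n]), reverse=True):
    print(name, tsa_score(ratings[name]))
```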
Phase 6: QaQI and Score Updates
* Replaced a proof-of-concept test involving commercial products, due to logistical hurdles.
* Hands-on testing with only internal and open-source packages, in a cloud computing environment, using a mix of Decennial Census and business use cases.
* Results documented:
  * Quality metrics calculations (sketched below)
  * Revisions to existing TSA capability requirements
  * 20 new QaQI Quality Metrics and Experience capability requirements defined and scored
  * User experience write-ups
* QaQI results evaluation: forthcoming paper by Yves Thibaudeau.
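The slide does not spell out which quality metrics were calculated; as one plausible reading, pairwise precision, recall, and F-score against known truth are the standard linkage metrics, and the Python recordlinkage toolkit exposes them directly. A sketch, assuming that toolkit and its febrl samples:

```python
# Sketch of standard linkage quality metrics (precision / recall / F-score),
# assuming the Python recordlinkage toolkit and its febrl samples with truth.
import recordlinkage
from recordlinkage.datasets import load_febrl4

df_a, df_b, true_links = load_febrl4(return_links=True)

indexer = recordlinkage.Index()
indexer.block("surname")
pairs = indexer.index(df_a, df_b)

compare = recordlinkage.Compare()
compare.string("given_name", "given_name", method="jarowinkler", threshold=0.85)
compare.exact("date_of_birth", "date_of_birth")
features = compare.compute(pairs, df_a, df_b)
predicted_links = features[features.sum(axis=1) == 2].index  # both fields agree

print("precision:", recordlinkage.precision(true_links, predicted_links))
print("recall:   ", recordlinkage.recall(true_links, predicted_links))
print("f-score:  ", recordlinkage.fscore(true_links, predicted_links))
```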
Phase 7: TSA Model and Results
* Viable commercial options exist that meet current requirements.
* Commercial options are often complete pipelines (they are user-friendly, with GUIs).
* Open-source options are less frequently complete pipelines.
* Performance and quality benchmarks were developed as part of the QaQI effort.
* Internal and open-source solutions that passed the gating review were benchmarked using the process developed during the QaQI effort.
Phase 8: Findings
* Census has emerging requirements that aren't met by existing solutions.
* Census users often need a complete pipeline and would prefer a GUI.
* Users want:
  * A broad range of built-in data transformation / standardization tools
  * A variety of different (1) blocking and (2) linking algorithm options available within the tool
  * Native support for multiprocessing, multi-core processors, and threading (see the sketch below)
  * Real-time monitoring of the environment and resources, including load history, activity, errors, workflows, and services, with the ability to trigger actions when criteria are met
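On the multiprocessing point, some open-source engines already expose this natively; for example, the Python recordlinkage toolkit's Compare object takes an n_jobs argument that spreads field comparisons across CPU cores. A small sketch, reusing the objects from the earlier examples:

```python
# Native multiprocessing in an open-source engine: recordlinkage's Compare
# accepts n_jobs to parallelize field comparisons across worker processes.
import recordlinkage

compare = recordlinkage.Compare(n_jobs=4)  # four worker processes
compare.string("surname", "surname", method="jarowinkler", threshold=0.85)
compare.exact("date_of_birth", "date_of_birth")
# features = compare.compute(pairs, df_a, df_b)  # pairs/df_a/df_b as built earlier
```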