Preferred Capabilities of Record Linkage Systems for Facilitating Research Record Linkage
Krista Park
US Census Bureau
Center for Optimization and Data Science
Presentation for FEDCASIC 2023
Disclaimers
This presentation is released to inform interested parties of research and
research requirements and to encourage discussion of a work in progress.
Any views expressed are those of the authors and not those of the U.S.
Census Bureau.
Further, although all of the contributing authors of the report upon which
this presentation is based are credited for their substantial efforts on the project,
the contents of this presentation are solely those of the presenter. Not all
other contributors may have had the opportunity to review this presentation
in depth.
This presentation includes no content derived from restricted data sets.
Researcher Stakeholder Team / Co-Authors
Name | Directorate | Center/Division | Field
Mishal Ahmed | R&M | CES | Economist
Glenn Ambill | Decennial | DITD | IT Specialist (APPSW)
John Cuffe | DEPDIR | DEPDIR | Survey Statistician / MAMBA
Khoa Dong | ECON | ESMD | Mathematical Statistician
Suzanne Dorinski | R&M | CES | Survey Statistician
Juan C. Humud | ECON | ERD | Survey Statistician / PVS Team
Shawn Klimek | R&M | CES | Economist
Daniel Moshinsky | Decennial | ADDC | IT Specialist (APPSW)
Kevin Shaw | Decennial | DITD | Mathematical Statistician / PEARSIS
Damon R. Smith | ECON | ERD | Mathematical Statistician / PVS Team
Yves Thibaudeau | R&M | CSRM | Research Mathematical Statistician
Christine Tomaszewski | ECON | ERD | Mathematical Statistician / PVS Team
Victoria Udalova | DEMO | ADDP | Economist / EHEALTH
Daniel Weinberg | R&M | CSRM | Research Mathematical Statistician
Daniel Whitehead | ECON | ESMD | Mathematical Statistician

Coordination, Leadership Oversight, Computer Environment Support, Program Management Support
Name | Directorate | Center/Division | Field
Krista Park | R&M | Center for Optimization and Data Science (CODS)
Casey Blalock | R&M | Center for Economic Studies (CES) | Statistician (Data Scientist)
Steven Nesbit | Contractor | OCIO / ADRM
J. David Brown | R&M | Center for Economic Studies (CES) | Economist
Kristee Camilletti | R&M | ADRM | Contracts / Acquisitions
Jaya Damineni | R&M | Center for Optimization and Data Science (CODS) | CODS ADC - Software Engineering
Ken Haase | R&M | ADRM | ST Computer Science
Anup Mathur | R&M | Center for Optimization and Data Science (CODS) | Chief, CODS
Vincent T. Mule | Decennial | DSSD | Mathematical Statistician
And thank you to the many people I didn't list.
Outline of Linkage Process
Research v. Production Record Linkage
Research:
Variety of data set formats / schema
Easy to experiment with multiple blocking & matching algorithms (see the sketch below)
Easy to change settings
Detailed metadata for both the matching & computational systems
Production:
Restricted data set formats / schema
Able to run the planned blocking & matching algorithm
Metadata required for downstream production
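For a research setting, the key capability is swapping blocking and matching choices without rebuilding the pipeline. Below is a minimal sketch of that kind of experimentation, assuming two pandas DataFrames with febrl-style columns and using the open-source Python Record Linkage Toolkit (recordlinkage) purely as an illustrative engine, not as a solution evaluated in this project.

```python
# Minimal sketch of research-style experimentation with blocking & matching.
# Assumes two pandas DataFrames (df_a, df_b) with febrl-style columns such as
# given_name, surname, postcode, date_of_birth. Uses the open-source
# `recordlinkage` toolkit as an illustration only.
import recordlinkage


def run_experiment(df_a, df_b, blocking_key, string_method, threshold=0.85):
    """Swap the blocking key and string comparator without rewriting the pipeline."""
    indexer = recordlinkage.Index()
    indexer.block(blocking_key)                       # candidate-pair generation
    candidate_links = indexer.index(df_a, df_b)

    compare = recordlinkage.Compare()
    compare.string("surname", "surname", method=string_method,
                   threshold=threshold, label="surname")
    compare.exact("date_of_birth", "date_of_birth", label="dob")
    features = compare.compute(candidate_links, df_a, df_b)

    # Simple rule-based classification: call it a match if both comparators agree.
    return features[features.sum(axis=1) == 2]


# Re-running with different settings is a one-line change, which is the point
# of a research-oriented system:
# matches = run_experiment(df_a, df_b, "postcode", "jarowinkler")
# matches = run_experiment(df_a, df_b, "given_name", "levenshtein")
```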
Requirements Gathering & Refinement Phases
1. Current State Assessment (March 2021 start)
2. Elicit Capability Requirements and Weights
3. Demonstration
4. Gating
5. Technical Solutions Assessment (TSA) Scoring and Selection
6. POCs / QaQI and Score Update
7. TSA Model and Results
8. Findings
Phase 1: Current State
Repository of Census Bureau record linkage software, research papers, internal reports, and external analysis
List of commercial, academic, and open source record linkage solutions
Reviewed reports by research and advisory companies or organizations such as Forrester, Gartner, USDA, and MIT
Phase 2: Elicit Capability Requirements, Criteria and Weights
The Subject Matter Experts (SMEs) were divided into five teams. Each team met for 3-6 requirements-gathering workshops, each lasting approximately 2-3 hours.
The groups were then shuffled into three teams that met for 3-hour workshops (more than 10 workshops) to refine the requirements and finalize criteria and weights for each requirement.
Finally, the entire group met to walk through and approve the requirements in another 3-hour workshop.
Total: more than 35 sessions.
Capability Requirements Categories and Topics
Total number of Capability Requirements: 378
Requirements are organized into 6 categories aligned with the Census Bureau Technical Solutions Assessment (TSA) Framework published by the Chief Technology Office (CTO) / Office of Systems Engineering (OSE): Technical (182 requirements), Operations (102), Performance (58), Security (18), Cost Factors (9), and User Experience (9).
Each category was built out into a comprehensive set of topics encapsulating the Records Linkage Capability Requirements, including Data Handling (81), Field Comparison (32), Matching / Classification (31), Reportability (26), Operability (21), QaQI Quality Metrics and Experience (20)*, Indexing, Blocking, Clustering or Equivalent (16), Configurability (16), Outputs/Results (15), Supportability (13), Scalability (12), Cybersecurity (11), and others.
*Developed in a later project phase
Phase 3: Demonstrations
3-hour sessions on commercial, open source, and internal record linkage solutions, consisting of:
A 90-minute demonstration
45 minutes of answers to the distributed questions
45 minutes of interactive Q&A plus suggestions/recommendations
Demonstrations used simulated data generated by the febrl data generator (200k records in the original file; 300k records in the duplicate file); a loading sketch follows below.
Demonstrations were conducted explicitly NOT as part of an acquisition, with several layers of protection to ensure that Census Bureau staff who participate in future acquisitions in this area will not get the results from these demonstrations.
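The febrl generator itself is a separate tool; as a stand-in for readers who want the flavor of the test data, the sketch below loads the much smaller febrl-style sample files bundled with the open-source recordlinkage toolkit. This is an assumption for illustration only; the demonstrations used far larger generated files.

```python
# Minimal sketch: loading febrl-style simulated data (an "original" file, a
# "duplicate" file, and the known true links) from the samples bundled with the
# open-source `recordlinkage` toolkit. The demonstrations described above used
# the febrl data generator at a much larger scale (200k / 300k records).
from recordlinkage.datasets import load_febrl4

df_a, df_b, true_links = load_febrl4(return_links=True)

print(len(df_a), "original records;", len(df_b), "duplicate records;",
      len(true_links), "true links")
print(list(df_a.columns))  # given_name, surname, date_of_birth, soc_sec_id, ...
```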
Phase 4: Gating
Phase 5: Technical Solutions Assessment (TSA) Scoring and Selection
43 key capabilities were identified and used to assess the entire pool of candidate software packages and eliminate (gate) those that wouldn't meet the key requirements.
The remaining packages were scored against the entire TSA (a scoring sketch follows below).
Lessons learned:
Small commercial solutions and the internal and open-source solutions were primarily built for the specific purpose of performing records linkage at a component level, rather than attempting to deliver the end-to-end enterprise capabilities that larger commercial solutions implement.
A large, complex, simulated data set is needed to enable this type of product research; the simulated data set used was too small to fully evaluate the products.
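To make the gate-then-score flow concrete, here is a small hypothetical sketch. The capability names, weights, and scores are invented for illustration and do not reflect the actual 43 key capabilities or the real TSA weights.

```python
# Hypothetical sketch of the gate-then-score flow: eliminate candidates that fail
# any key (gating) capability, then compute a weighted score over the remaining
# requirements. All names, weights, and scores below are invented.

# Key capabilities used for gating: a candidate must satisfy every one of these.
KEY_CAPABILITIES = ["reads_csv_and_parquet", "logging", "preserves_original_dataset"]

# Weighted (non-gating) requirements: requirement -> weight.
WEIGHTS = {"blocking_options": 3.0, "string_comparators": 2.0, "spark_support": 1.0}

candidates = {
    "solution_a": {"reads_csv_and_parquet": True, "logging": True,
                   "preserves_original_dataset": True,
                   "blocking_options": 4, "string_comparators": 5, "spark_support": 2},
    "solution_b": {"reads_csv_and_parquet": True, "logging": False,  # fails the gate
                   "preserves_original_dataset": True,
                   "blocking_options": 5, "string_comparators": 4, "spark_support": 5},
}


def gate(candidate):
    """A candidate passes only if it meets every key capability."""
    return all(candidate[c] for c in KEY_CAPABILITIES)


def weighted_score(candidate):
    """Weighted sum of 0-5 requirement scores, normalized by total weight."""
    total = sum(WEIGHTS.values())
    return sum(WEIGHTS[r] * candidate[r] for r in WEIGHTS) / total


scores = {name: weighted_score(c) for name, c in candidates.items() if gate(c)}
print(scores)  # only solution_a survives the gate and receives a TSA-style score
```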
 
Reasons for Initial Gating
Gating capability requirement topics spanned all six TSA categories (Technical, Operations, Performance, Cost Factors, User Experience, and Security). Topics included: Data Formats; Databases Supported; Maintain Original Dataset; Local Storage Locations; Turn Off Built-In Standardizer; Process Missing Values; Isolate/Handle Errors w/o Disruption; Logging; Select/Customize Blocking Variables Only; Non-Traditional Indexing Explains Approach; Support Deduplication; Matching a single dataset to another; Matching 1:M datasets; Matching M:M datasets; Select two or more datasets to match; De-Duplication Facility; Identify Best Match Available; Store Possible Matches; Match Quality Indicators; Technical Risk; Census Bureau Approved Operating System; Commercial Software Ability to Obtain SWG Approval; Operational Risk (e.g., Stability); Enterprise Standards Profile Ability to be Compatible; Performance Risk (e.g., Stability); Solution Pricing Model; Commitment Requirement; Hidden Costs; Localization; Minimal Skillset Level; Cost Factor Risk (e.g., No Commercial Pricing, No U.S. Sales); Maintenance and Support Costs; Overall Product Documentation; Technical Documentation; Internet Connections; Security Monitoring; Enterprise Certificate Authority; RDP or SSH Support; Cryptography; Secure Sockets Layer (SSL) / Transport Layer Security (TLS); FedRAMP Solution; and National Origin of Solution.
Eliminating criteria cited for the gated solutions included: Data Formats; Databases Supported; Isolate/Handle Errors w/o Disruption; Logging; Maintain Original Dataset; Turn Off Built-In Standardizer; and Technical Risk.
Phase 5 (cont'd): Condensed TSA
SMEs realized that the Census internal and open-source solutions were not intended to result in end-to-end solutions (the model the TSA was designed for).
Created a subset of the Capability Requirements (194/358) focused on records linkage engines, better suited to this set of solutions.
Team recommendations for the hands-on test were informed by, not dictated by, the scores. Considerations:
Number of solutions achievable within the schedule
Is there value added by more algorithms or features in Records Linkage in Python?
The capability to process the number of records defined in the QaQI use cases
User base for the solution vs. the language
Can the solution run on Spark? (see the sketch below)
Relative speed of the solution
Ability to change solution code
Amount of debugging required to support the QaQI
QaQI team skillset
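One consideration above is whether a solution can run on Spark. As a hedged illustration only (file paths and column names such as rec_id, surname, and postcode are hypothetical, and this is not any of the evaluated solutions), blocked candidate-pair generation maps naturally onto a PySpark join:

```python
# Hypothetical PySpark sketch of blocked candidate-pair generation for record
# linkage at scale. File paths and column names are invented for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("blocking-sketch").getOrCreate()

df_a = spark.read.parquet("file_a.parquet")   # hypothetical input files
df_b = spark.read.parquet("file_b.parquet")

# Block on postcode plus the first letter of the surname; only pairs that agree
# on the blocking key are compared downstream.
candidate_pairs = (
    df_a.alias("a")
        .join(
            df_b.alias("b"),
            (F.col("a.postcode") == F.col("b.postcode"))
            & (F.substring(F.col("a.surname"), 1, 1) == F.substring(F.col("b.surname"), 1, 1)),
        )
        .select(F.col("a.rec_id").alias("rec_id_a"), F.col("b.rec_id").alias("rec_id_b"))
)

print(candidate_pairs.count())
```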
Phase 6: QaQI and Score Updates
Replaced a proof-of-concept test involving commercial solutions, due to logistical hurdles
Hands-on testing with only internal and open source packages:
Cloud computing environment
Mix of Decennial Census & business use cases
Results documented
Quality metrics calculations (see the sketch below)
Revisions to existing TSA Capability Requirements
20 new QaQI Quality Metrics and Experience Capability Requirements defined and scored
User experience write-ups
QaQI results evaluation: forthcoming paper by Yves Thibaudeau
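The deck does not show the QaQI metric formulas. Below is a minimal sketch of the standard pairwise linkage quality metrics (precision, recall, F1) computed against known true links, assuming the predicted and true links are available as sets or pandas MultiIndexes of record-id pairs, such as those produced by the toolkit used in the earlier sketches.

```python
# Minimal sketch of pairwise linkage quality metrics against known true links.
# Assumes `predicted_links` and `true_links` are iterables of
# (record_id_a, record_id_b) pairs, e.g., from the febrl sample data above.

def linkage_quality(predicted_links, true_links):
    """Return precision, recall, and F1 for a set of predicted links."""
    predicted = set(predicted_links)
    truth = set(true_links)
    true_positives = len(predicted & truth)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Example with toy pairs (record ids are made up):
p, r, f1 = linkage_quality({("a1", "b1"), ("a2", "b9")}, {("a1", "b1"), ("a2", "b2")})
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # 0.50, 0.50, 0.50
```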
Phase 7: TSA Model and Results
Viable commercial options exist that meet current requirements
Commercial options are often complete pipelines (and are user friendly, with GUIs)
Open source options are less frequently complete pipelines
Performance and quality benchmarks were developed as part of the QaQI effort
Internal and open source solutions that passed the gating review were benchmarked using the process developed during the QaQI effort
Phase 8: Findings
Census has emerging requirements that aren't met by existing solutions
Census users often need a complete pipeline and would prefer a GUI
Users want (see the sketch after this list):
A broad range of data transformation / standardization tools built in
A variety of different (1) blocking and (2) linking algorithm options available within the tool
Native support for multiprocessing, multi-core processors, and threading
Real-time monitoring of the environment and resources, including load history, activity, errors, workflows, and services, with the ability to trigger actions when criteria are met
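As a hedged illustration of the first findings (variety of blocking and linking options, plus multi-core support), the open-source recordlinkage toolkit exposes several blocking strategies, comparators, and classifiers behind one interface, and recent versions of its Compare step accept an n_jobs argument for parallel comparison. This is shown only as an example of the kind of capability users asked for, not as the selected solution.

```python
# Illustrative sketch (not the selected solution): multiple blocking and
# classification options behind one interface in the open-source
# `recordlinkage` toolkit, using the bundled febrl-style sample data.
import recordlinkage
from recordlinkage.datasets import load_febrl4

df_a, df_b = load_febrl4()

# (1) Several blocking strategies can be combined on the same indexer.
indexer = recordlinkage.Index()
indexer.block("postcode")                                # exact blocking
indexer.sortedneighbourhood("surname", window=5)         # sorted neighbourhood
candidate_links = indexer.index(df_a, df_b)

# Field comparison; n_jobs enables multi-core comparison in recent versions.
compare = recordlinkage.Compare(n_jobs=4)
compare.string("given_name", "given_name", method="jarowinkler",
               threshold=0.85, label="given_name")
compare.string("surname", "surname", method="jarowinkler",
               threshold=0.85, label="surname")
compare.exact("date_of_birth", "date_of_birth", label="dob")
features = compare.compute(candidate_links, df_a, df_b)

# (2) Several linking (classification) algorithms over the same feature matrix.
ecm = recordlinkage.ECMClassifier()          # unsupervised, Fellegi-Sunter style
links_ecm = ecm.fit_predict(features)

kmeans = recordlinkage.KMeansClassifier()    # simple unsupervised alternative
links_km = kmeans.fit_predict(features)

print(len(links_ecm), "ECM links;", len(links_km), "k-means links")
```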
