Leveraging Open Source Tools for Custom Data Entry Application Development

 
Using Open Source Tools to Build a Custom Data Entry Application
 
Cecile Murray and Katie Genadek
Business Development Staff, Economic Reimbursable Surveys Division
U.S. Census Bureau
November 4, 2021
 
 
Any views expressed are those of the author(s) and not necessarily those of
the U.S. Census Bureau.
 
Decennial Census Digitization and Data
Linkage (DCDL) project overview
 
Goal:
Capture the names from digitized images of 1960-1990
decennial census forms to enable longitudinal research
Problem:
850+ million records would make hand entry prohibitively
expensive and slow
Solution:
Custom Optical Character Recognition (OCR) machine
learning algorithm to automate name capture
Need to efficiently collect high-quality data to train the OCR
algorithm: have professional keyers hand-enter data using a
custom application
 
 
Roadmap
 
Demonstration
Some technical details
Why build a custom application?
Bringing the application into production
Recommendations
 
 
Collecting a name from a 1970 form
 
 
Demonstration with blank forms
 
 
 
Technical details
 
Django: open source Python web framework
Backend database is PostgreSQL
Runs on an AWS GovCloud server
Keyers access the application in their web browser via a local port
forward
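
As a rough illustration of how a Django project is usually pointed at a PostgreSQL backend, the settings fragment below is a minimal sketch, not the project's actual configuration. The environment variable names (ENTRY_DB_NAME and so on) and their default values are hypothetical. The ALLOWED_HOSTS values assume that keyers reach the application through a port forward to their own machine, so the browser addresses it as localhost.

```python
import os

# Hypothetical settings.py fragment. The environment variable names and
# defaults are placeholders chosen for this sketch.
DATABASES = {
    "default": {
        # Django's built-in PostgreSQL backend.
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("ENTRY_DB_NAME", "data_entry"),
        "USER": os.environ.get("ENTRY_DB_USER", "data_entry_app"),
        # Read the password from the environment so it stays out of git.
        "PASSWORD": os.environ.get("ENTRY_DB_PASSWORD", ""),
        "HOST": os.environ.get("ENTRY_DB_HOST", "localhost"),
        "PORT": os.environ.get("ENTRY_DB_PORT", "5432"),
    }
}

# Assumption: with a local port forward, requests arrive addressed to
# localhost, so only those host names need to be accepted.
ALLOWED_HOSTS = ["localhost", "127.0.0.1"]
```

Keeping credentials in environment variables rather than in the settings file is one common way to let the same code run unchanged in development, test, and production.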
 
 
Data structure
 
Entity-relationship diagram: tool for designing database schema
Color-coding and arrows to illustrate flow of data
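
To make the idea of a schema with related entities concrete, here is a minimal sketch of three linked tables. The table and column names (reel, image, name_entry) are hypothetical and are not taken from the DCDL application. The sketch uses Python's built-in sqlite3 module instead of Django and PostgreSQL only so that it runs on its own.

```python
import sqlite3

# Hypothetical schema: a reel holds many images, and an image holds many
# keyed name entries. Foreign keys express the arrows of an ER diagram.
SCHEMA = """
CREATE TABLE reel (
    id INTEGER PRIMARY KEY,
    label TEXT NOT NULL UNIQUE
);
CREATE TABLE image (
    id INTEGER PRIMARY KEY,
    reel_id INTEGER NOT NULL REFERENCES reel(id),
    file_name TEXT NOT NULL
);
CREATE TABLE name_entry (
    id INTEGER PRIMARY KEY,
    image_id INTEGER NOT NULL REFERENCES image(id),
    line_number INTEGER NOT NULL,
    first_name TEXT,
    last_name TEXT,
    suffix TEXT
);
"""


def build_demo_db():
    """Create an in-memory database with foreign key checks enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def names_on_reel(conn, reel_label):
    """Return (file_name, line_number, first_name, last_name) rows for a reel."""
    rows = conn.execute(
        """
        SELECT image.file_name, name_entry.line_number,
               name_entry.first_name, name_entry.last_name
        FROM name_entry
        JOIN image ON image.id = name_entry.image_id
        JOIN reel ON reel.id = image.reel_id
        WHERE reel.label = ?
        ORDER BY image.file_name, name_entry.line_number
        """,
        (reel_label,),
    )
    return rows.fetchall()


# Made-up demo rows, not real census data.
conn = build_demo_db()
conn.execute("INSERT INTO reel (id, label) VALUES (1, 'demo-reel')")
conn.execute(
    "INSERT INTO image (id, reel_id, file_name) VALUES (1, 1, 'blank_form_001.png')"
)
conn.execute(
    "INSERT INTO name_entry (image_id, line_number, first_name, last_name, suffix) "
    "VALUES (1, 1, 'JANE', 'DOE', NULL)"
)
```

Because each name entry points at its image and each image at its reel, a keyer never has to retype the reel or image identifier for every name, which is the kind of repetitive entry a flat spreadsheet would require.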
 
 
Why build a custom application?
 
Existing Census Bureau systems are not configured for these images
Data are protected, so we cannot use external solutions like Amazon
Mechanical Turk (MTurk)
Entry in MS Excel offers low start-up costs, but a custom application
provides features that pay off in the long run
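
One feature of this kind is handing each keyer the next image automatically instead of tracking assignments by hand. The sketch below is a hypothetical in-memory version of that idea. The class and method names are invented for illustration, and the real application presumably tracks assignments in its database instead.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ImageQueue:
    """Hypothetical assignment queue: one active image per keyer."""

    pending: List[str]
    assigned: Dict[str, str] = field(default_factory=dict)  # image -> keyer

    def next_image_for(self, keyer: str) -> Optional[str]:
        # A keyer with unfinished work gets the same image back, so no
        # image is left half keyed.
        for image, owner in self.assigned.items():
            if owner == keyer:
                return image
        if not self.pending:
            return None
        # Otherwise hand out the oldest pending image and record the owner.
        image = self.pending.pop(0)
        self.assigned[image] = keyer
        return image

    def mark_complete(self, image: str) -> None:
        # Completed images leave the queue entirely.
        self.assigned.pop(image, None)
```

Since the application decides which image to show, the keyer cannot accidentally enter data against the wrong image, and two keyers are never given the same one.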
 
 
Why build a custom application?
 
 
Bringing the application to production
 
Internal testing as development proceeded:
    Small group demos
    Team demos
Rollout to select group of keyers and collecting feedback
    Creating and positioning a suffix field
    What to do about blank values
    Figuring out ways to further avoid repetitive keying
    Fixing lots of bugs!
Eventual rollout to more keyers
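
Two of the feedback items above, the suffix field and the handling of blank values, can be illustrated with a small cleaning step. This is a minimal sketch under stated assumptions: the suffix list, the upper-casing, and the choice to store blanks as None are all hypothetical and do not describe the rules the DCDL application actually applies.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative list only; a real application would define its own.
KNOWN_SUFFIXES = {"JR", "SR", "II", "III", "IV"}


@dataclass(frozen=True)
class NameRecord:
    first_name: Optional[str]
    last_name: Optional[str]
    suffix: Optional[str]


def clean_field(raw: str) -> Optional[str]:
    """Collapse whitespace and upper-case; return None for a blank field."""
    text = " ".join(raw.split()).upper()
    return text or None


def build_record(first: str, last: str, suffix: str = "") -> NameRecord:
    """Build a record, keeping the suffix separate from the last name."""
    cleaned_suffix = clean_field(suffix.replace(".", ""))
    if cleaned_suffix is not None and cleaned_suffix not in KNOWN_SUFFIXES:
        raise ValueError(f"Unrecognized suffix: {suffix!r}")
    return NameRecord(clean_field(first), clean_field(last), cleaned_suffix)
```

Storing a blank as None rather than an empty string keeps "nothing was written on the form" distinct from any real value when the data are later exported.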
 
 
Challenges we faced
 
Steep technical learning curve + substantial up-front time investment
in development
Deployment to production
IT approvals to make sure keyers can access application while protecting
server
Understanding keyer workstation requirements
Training and reference material development
 
 
Recommendations
 
Document installation/setup, maintenance, and update processes
Use an iterative development process: build something, test it, and
incorporate feedback
Do structured demos to test specific features early and often
Use git/version control, and split work into development, test, and
production branches
Use logging and automated testing to help identify bugs
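
The last recommendation can be sketched with Python's standard logging and unittest modules. The validation function, its 10-line limit, and the logger name are hypothetical examples, not code from the DCDL repository.

```python
import logging
import unittest

logger = logging.getLogger("data_entry")


def validate_line_number(value, max_lines=10):
    """Return the line number as an int, or None (with a log entry) if invalid."""
    try:
        number = int(value)
    except (TypeError, ValueError):
        logger.warning("Rejected non-numeric line number: %r", value)
        return None
    if not 1 <= number <= max_lines:
        logger.warning("Rejected out-of-range line number: %d", number)
        return None
    return number


class ValidateLineNumberTests(unittest.TestCase):
    def test_accepts_valid_value(self):
        self.assertEqual(validate_line_number("3"), 3)

    def test_rejects_text_and_logs_it(self):
        # assertLogs checks that the rejection leaves a trace in the log.
        with self.assertLogs("data_entry", level="WARNING") as captured:
            self.assertIsNone(validate_line_number("abc"))
        self.assertIn("non-numeric", captured.output[0])

    def test_rejects_out_of_range_value(self):
        self.assertIsNone(validate_line_number(11))


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ValidateLineNumberTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_tests() after each change gives a quick check that existing behavior still holds, while the log records show which inputs keyers actually submitted when a bug report comes in.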
 
 
Thank you! Questions?
 
Code on GitHub: https://github.com/census-bds/dcdl-data-entry
Cecile Murray, Data Scientist 
cecile.m.murray@census.gov
Katie Genadek, DCDL Project Director 
katie.r.genadek@census.gov
 
Why build a custom application? Excel compared with our custom data entry application

Challenge/feature | Excel | Our custom data entry application
------------------------------------------------------------------
Is the keyer looking at correct image? | Risk of human error | No risk of error
Complex data structure | Would require repetitive entry | Can accommodate complex relationships between data items
Visually relate data in image to entry form | No | Yes
Danger of accidental modification/deletion of records | Substantial | Minimal
Managing reel/image assignment | Manual, cumbersome | Mostly automated
Getting usable data out | Manual, cumbersome | Can be automated
Time required for development | Minimal | Substantial
