Machine Learning in the Commodity Flow Survey: Improving Data Quality

 
Machine Learning and the Commodity Flow Survey

Christian Moscardi
FCSM
11/4/2021

Any views expressed are those of the author(s) and not necessarily those of the U.S. Census Bureau.
 
Overview

Commodity Flow Survey (CFS)
Sponsored by the U.S. DOT Bureau of Transportation Statistics
Conducted every 5 years (2017, 2022)
Many uses, including infrastructure funding decision-making
Respondents provide a sampling of shipments from each quarter
Overview
ML model details and pilot

Training data: 6.4M shipment records from the 2017 CFS
Data provided by respondents, cleaned and edited by analysts
Model output: 5-digit product code (SCTG)
Model input: free-text shipment description, establishment's NAICS (industry) code
Model form: bag-of-words logistic regression (sketched below)
Initial uses: correcting and imputing 2017 CFS response data
Proved the concept on a small scale
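The slides do not include code; as a rough illustration of this model form, here is a minimal scikit-learn sketch of a bag-of-words logistic regression over a free-text description plus a NAICS code. The column names, toy records, and SCTG values are hypothetical, and this is not the Census Bureau's production implementation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def build_model() -> Pipeline:
    """Bag-of-words over the free-text shipment description plus a one-hot NAICS code,
    feeding a multinomial logistic regression over SCTG product codes."""
    features = ColumnTransformer([
        # bag-of-words term counts from the description text
        ("bow", CountVectorizer(), "description"),
        # establishment industry (NAICS) code as a categorical feature
        ("naics", OneHotEncoder(handle_unknown="ignore"), ["naics"]),
    ])
    return Pipeline([("features", features), ("clf", LogisticRegression(max_iter=1000))])

# Toy usage -- the records, NAICS values, and SCTG codes below are made up for illustration.
train = pd.DataFrame({
    "description": ["frozen pizza", "steel pipe fittings", "laptop computers"],
    "naics": ["311412", "332996", "334111"],
    "sctg": ["07xxx", "32xxx", "35xxx"],   # placeholder 5-digit SCTG codes
})
model = build_model()
model.fit(train[["description", "naics"]], train["sctg"])
print(model.predict(pd.DataFrame({"description": ["canned soup"], "naics": ["311422"]})))
```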
 
 
 
 
ML model in production

Upcoming uses (2022 CFS)
Eliminating the need for respondents to provide SCTG product codes

Benefits
Reducing respondent burden -> increased response rates
Reducing Census cost to process data
Improving data quality
 
Improving the Model -- Crowdsourcing

To improve the model, we need improved training data:
More coverage of product types
More variety in description text

To improve the training data, we need humans to label more records.

Solution: crowdsource new data labels.

Labelling loop (from the slide's diagram): labeled data trains the model; the model predicts labels for the unlabeled data; the least confident predictions are sent to humans for labelling; the newly labeled records update the training dataset, and the model is retrained. The sampling step is sketched below.
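A minimal sketch of the "send least confident predictions for labelling" step, assuming a fitted classifier that exposes predict_proba (such as the logistic regression sketched earlier); the function and variable names are illustrative.

```python
import numpy as np

def least_confident_indices(model, unlabeled_X, batch_size=1000):
    """Pick the records whose top predicted class has the lowest probability,
    i.e. where the model is least confident."""
    proba = model.predict_proba(unlabeled_X)      # shape: (n_records, n_classes)
    confidence = proba.max(axis=1)                # probability of each record's best guess
    return np.argsort(confidence)[:batch_size]    # least confident first

# Loop from the diagram: train on labeled data, predict over the unlabeled pool,
# send the least confident records out for human labelling, fold the new labels
# back into the training set, and retrain.
```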
 
Amazon MTurk

Overview
"Crowdsourcing marketplace"
Used for surveys (incl. pre-testing) and data labelling for AI/ML
Anyone can offer a task
Anyone can complete a task
Workers (aka Turkers) are paid per task completed

Alternatives we considered
In-house labelling: prohibitively expensive
For AI/ML labelling, alternatives exist (e.g. CrowdFlower) but are not as flexible or as large-scale
 
Crowdsourcing Task Setup

Product description data sources
NAICS manual
Publicly available online product catalogs
~250,000 descriptions total

Task details
"Which product category best matches the given product description?"
The list of categories comes from the current best model (top 10 shown in random order + "None of the above"); see the sketch below
U.S. workers only
Calculated a reasonable pay rate
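A minimal sketch of how a task's answer choices could be assembled under this setup: the model's top 10 predicted categories, shuffled, plus a "None of the above" option. The names model, description_row, and sctg_labels are assumptions; this is not the project's actual MTurk task code.

```python
import random

def build_choices(model, description_row, sctg_labels, k=10, seed=None):
    """Top-k predicted category labels in random order, plus an escape option.
    sctg_labels[i] is assumed to be the human-readable name of model.classes_[i]."""
    proba = model.predict_proba(description_row)[0]
    top_k = proba.argsort()[::-1][:k]                 # indices of the k most likely codes
    choices = [sctg_labels[i] for i in top_k]
    random.Random(seed).shuffle(choices)              # shuffle to avoid position bias
    choices.append("None of the above")
    return choices
```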
 
Crowdsourcing Quality Control

How do we ensure quality?

Gateway/gold standard task
Label 50 "gold standard" records
A worker must be at least 60% correct on a minimum of 5 records (sketched below)

Production task: "quadruple-key entry"
Only qualifying workers participate
4 workers label each record
Work in batches of 1,000; update the model after each batch

Continuous validation
Keep track of inter-rater reliability
Inject "gold standard" records during the production task
Remove poorly performing workers after each batch of the production task
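A minimal sketch of the qualification rule described above (at least 60% correct on a minimum of 5 gold-standard records); the data layout and names are assumptions, not the project's actual QC code.

```python
from collections import defaultdict

def qualified_workers(gold_responses, min_records=5, min_accuracy=0.60):
    """gold_responses: iterable of (worker_id, gold_label, worker_answer).
    A worker qualifies after answering at least `min_records` gold-standard
    records with at least `min_accuracy` agreement with the gold labels."""
    tallies = defaultdict(lambda: [0, 0])             # worker_id -> [answered, correct]
    for worker_id, gold_label, answer in gold_responses:
        tallies[worker_id][0] += 1
        tallies[worker_id][1] += int(answer == gold_label)
    return {
        worker for worker, (answered, correct) in tallies.items()
        if answered >= min_records and correct / answered >= min_accuracy
    }
```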
 
Crowdsourcing Worker Onboarding

Step 1: Onboard workers with the "gold standard" task
Upload gold standard data to MTurk
Workers label gold standard product descriptions
Evaluate workers for agreement with the gold standard
Qualify high performers for the production task

Gold Standard task (50 records per batch): recruit workers and compare worker quality to the gold standard; workers who correctly label >=60% of the time qualify for the production task.
Production task (1,000 records per batch): label records, with 4 workers labelling each record; records with >50% agreement or a 2-1-1 vote split are considered reliably labelled, while 2-2 and 1-1-1-1 splits are considered difficult and require further review.
 
Crowdsourcing Production Task

Step 2: Label new data in the "production task"
Use the baseline model to predict the top 10 codes for all product descriptions
Sample 1,000 descriptions where the model is not highly confident
Upload the production batch to MTurk; 4 workers label each description
Download the results; incorporate records with majority agreement into the baseline model; perform worker QC (vote aggregation sketched below)
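A minimal sketch of the four-worker vote aggregation described on the worker onboarding slide above: a >50% majority or a 2-1-1 plurality counts as reliably labelled, while 2-2 and 1-1-1-1 splits are flagged for review. The function name and return convention are illustrative.

```python
from collections import Counter

def aggregate_votes(labels):
    """labels: the four workers' answers for one record.
    Returns (label, True) when the record is reliably labelled,
    or (None, False) when it needs further review."""
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    if top_count >= 3:                                # 3-1 or 4-0 majority (>50%)
        return top_label, True
    runner_up = counts[1][1] if len(counts) > 1 else 0
    if top_count == 2 and runner_up == 1:             # 2-1-1 plurality
        return top_label, True
    return None, False                                # 2-2 tie or 1-1-1-1 split

# aggregate_votes(["A", "A", "A", "B"]) -> ("A", True)
# aggregate_votes(["A", "A", "B", "B"]) -> (None, False)
```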
 
Results

To date:
Workers labelled ~5,000 records
~25,000 tasks completed
82% of records classified with high reliability by majority vote
Among reliable classifications, estimated correct classification rate of 95%
16% improvement in the model's classification ability on public product descriptions (~40,000 more records)

Direct cost: ~$3,000

Figure: model's classification power before/after MTurk labels
Note: all results in the figure are derived from public product description data
 
 
Zooming out – Costs of Implementation
 
Thank you!

Email: Christian.L.Moscardi@census.gov
 
Crowdsourcing Best Practices

Provide clear, simple instructions and fair pay to crowd-workers.
Perform preliminary tests to improve instructions; pre-qualify and recruit skilled workers.
Use a gold standard to onboard workers and continuously monitor workers' quality; contact or replace workers if necessary.
Use batches to continuously update workers' qualifications based on gold standard and agreement metrics.
Use a baseline model, if possible, to help limit the choices that workers need to make and simplify the task.

* Thanks to Magdalena Asborno and Sarah Hernandez at the University of Arkansas for collaborating to develop these best practices.