Machine Learning in the Commodity Flow Survey: Improving Data Quality

 
Machine Learning and the Commodity Flow Survey

Christian Moscardi
FCSM
11/4/2021

Any views expressed are those of the author(s) and not necessarily those of the U.S. Census Bureau.
 
Overview

Commodity Flow Survey (CFS)
Sponsored by the U.S. DOT Bureau of Transportation Statistics
Conducted every 5 years (2017, 2022)
Many uses, including infrastructure funding decision-making
Respondents provide a sampling of shipments from each quarter
Overview
ML model details and pilot

Training data: 6.4M shipment records from the 2017 CFS
Data provided by respondents, cleaned and edited by analysts
Model output: 5-digit product code (SCTG)
Model input: free-text shipment description, establishment's NAICS (industry) code
Model form: bag-of-words logistic regression (sketched below)
Initial uses: correcting and imputing 2017 CFS response data
Proved the concept on a small scale
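The slides do not include code; as a rough illustration of this model form, here is a minimal scikit-learn sketch of a bag-of-words logistic regression over a free-text description plus a NAICS code. The column names, toy records, and SCTG values are hypothetical, and this is not the Census Bureau's production implementation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def build_model() -> Pipeline:
    """Bag-of-words over the free-text shipment description plus a one-hot NAICS code,
    feeding a multinomial logistic regression over SCTG product codes."""
    features = ColumnTransformer([
        # bag-of-words term counts from the description text
        ("bow", CountVectorizer(), "description"),
        # establishment industry (NAICS) code as a categorical feature
        ("naics", OneHotEncoder(handle_unknown="ignore"), ["naics"]),
    ])
    return Pipeline([("features", features), ("clf", LogisticRegression(max_iter=1000))])

# Toy usage -- the records, NAICS values, and SCTG codes below are made up for illustration.
train = pd.DataFrame({
    "description": ["frozen pizza", "steel pipe fittings", "laptop computers"],
    "naics": ["311412", "332996", "334111"],
    "sctg": ["07xxx", "32xxx", "35xxx"],   # placeholder 5-digit SCTG codes
})
model = build_model()
model.fit(train[["description", "naics"]], train["sctg"])
print(model.predict(pd.DataFrame({"description": ["canned soup"], "naics": ["311422"]})))
```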
 
 
 
 
ML model in production

Upcoming uses (2022 CFS)
Eliminating the need for respondents to provide SCTG product codes

Benefits
Reducing respondent burden -> increased response rates
Reducing Census cost to process data
Improving data quality
 
Improving the Model -- Crowdsourcing

To improve the model, we need improved training data:
More coverage of product types
More variety in description text

To improve the training data, we need humans to label more records.

Solution: crowdsource new data labels.

Labelling loop (from the slide's diagram): labeled data trains the model; the model predicts labels for the unlabeled data; the least confident predictions are sent to humans for labelling; the newly labeled records update the training dataset, and the model is retrained. The sampling step is sketched below.
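A minimal sketch of the "send least confident predictions for labelling" step, assuming a fitted classifier that exposes predict_proba (such as the logistic regression sketched earlier); the function and variable names are illustrative.

```python
import numpy as np

def least_confident_indices(model, unlabeled_X, batch_size=1000):
    """Pick the records whose top predicted class has the lowest probability,
    i.e. where the model is least confident."""
    proba = model.predict_proba(unlabeled_X)      # shape: (n_records, n_classes)
    confidence = proba.max(axis=1)                # probability of each record's best guess
    return np.argsort(confidence)[:batch_size]    # least confident first

# Loop from the diagram: train on labeled data, predict over the unlabeled pool,
# send the least confident records out for human labelling, fold the new labels
# back into the training set, and retrain.
```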
 
Amazon MTurk

Overview
"Crowdsourcing marketplace"
Used for surveys (incl. pre-testing) and data labelling for AI/ML
Anyone can offer a task
Anyone can complete a task
Workers (aka Turkers) are paid per task completed

Alternatives we considered
In-house labelling: prohibitively expensive
For AI/ML labelling, alternatives exist (e.g. CrowdFlower) but are not as flexible or as large-scale
 
Crowdsourcing Task Setup

Product description data sources
NAICS manual
Publicly available online product catalogs
~250,000 descriptions total

Task details
"Which product category best matches the given product description?"
The list of categories comes from the current best model (top 10 shown in random order + "None of the above"); see the sketch below
U.S. workers only
Calculated a reasonable pay rate
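A minimal sketch of how a task's answer choices could be assembled under this setup: the model's top 10 predicted categories, shuffled, plus a "None of the above" option. The names model, description_row, and sctg_labels are assumptions; this is not the project's actual MTurk task code.

```python
import random

def build_choices(model, description_row, sctg_labels, k=10, seed=None):
    """Top-k predicted category labels in random order, plus an escape option.
    sctg_labels[i] is assumed to be the human-readable name of model.classes_[i]."""
    proba = model.predict_proba(description_row)[0]
    top_k = proba.argsort()[::-1][:k]                 # indices of the k most likely codes
    choices = [sctg_labels[i] for i in top_k]
    random.Random(seed).shuffle(choices)              # shuffle to avoid position bias
    choices.append("None of the above")
    return choices
```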
 
Crowdsourcing Quality Control

How do we ensure quality?

Gateway/gold standard task
Label 50 "gold standard" records
A worker must be at least 60% correct on a minimum of 5 records (sketched below)

Production task: "quadruple-key entry"
Only qualifying workers participate
4 workers label each record
Work in batches of 1,000; update the model after each batch

Continuous validation
Keep track of inter-rater reliability
Inject "gold standard" records during the production task
Remove poorly performing workers after each batch of the production task
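A minimal sketch of the qualification rule described above (at least 60% correct on a minimum of 5 gold-standard records); the data layout and names are assumptions, not the project's actual QC code.

```python
from collections import defaultdict

def qualified_workers(gold_responses, min_records=5, min_accuracy=0.60):
    """gold_responses: iterable of (worker_id, gold_label, worker_answer).
    A worker qualifies after answering at least `min_records` gold-standard
    records with at least `min_accuracy` agreement with the gold labels."""
    tallies = defaultdict(lambda: [0, 0])             # worker_id -> [answered, correct]
    for worker_id, gold_label, answer in gold_responses:
        tallies[worker_id][0] += 1
        tallies[worker_id][1] += int(answer == gold_label)
    return {
        worker for worker, (answered, correct) in tallies.items()
        if answered >= min_records and correct / answered >= min_accuracy
    }
```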
 
Crowdsourcing Worker Onboarding

Step 1: Onboard workers with the "gold standard" task
Upload gold standard data to MTurk
Workers label gold standard product descriptions
Evaluate workers for agreement with the gold standard
Qualify high performers for the production task

Gold Standard task (50 records per batch): recruit workers and compare worker quality to the gold standard; workers who correctly label >=60% of the time qualify for the production task.
Production task (1,000 records per batch): label records, with 4 workers labelling each record; records with >50% agreement or a 2-1-1 vote split are considered reliably labelled, while 2-2 and 1-1-1-1 splits are considered difficult and require further review.
 
Crowdsourcing Production Task

Step 2: Label new data in the "production task"
Use the baseline model to predict the top 10 codes for all product descriptions
Sample 1,000 descriptions where the model is not highly confident
Upload the production batch to MTurk; 4 workers label each description
Download the results; incorporate records with majority agreement into the baseline model; perform worker QC (vote aggregation sketched below)
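A minimal sketch of the four-worker vote aggregation described on the worker onboarding slide above: a >50% majority or a 2-1-1 plurality counts as reliably labelled, while 2-2 and 1-1-1-1 splits are flagged for review. The function name and return convention are illustrative.

```python
from collections import Counter

def aggregate_votes(labels):
    """labels: the four workers' answers for one record.
    Returns (label, True) when the record is reliably labelled,
    or (None, False) when it needs further review."""
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    if top_count >= 3:                                # 3-1 or 4-0 majority (>50%)
        return top_label, True
    runner_up = counts[1][1] if len(counts) > 1 else 0
    if top_count == 2 and runner_up == 1:             # 2-1-1 plurality
        return top_label, True
    return None, False                                # 2-2 tie or 1-1-1-1 split

# aggregate_votes(["A", "A", "A", "B"]) -> ("A", True)
# aggregate_votes(["A", "A", "B", "B"]) -> (None, False)
```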
 
Results

To date:
Workers labelled ~5,000 records
~25,000 tasks completed
82% of records classified with high reliability by majority vote
Among reliable classifications, estimated correct classification rate of 95%
16% improvement in the model's classification ability on public product descriptions (~40,000 more records)

Direct cost: ~$3,000

Figure: model's classification power before/after MTurk labels
Note: all results in the figure are derived from public product description data
 
 
Zooming out – Costs of Implementation
 
Thank you!

Email: Christian.L.Moscardi@census.gov
 
Crowdsourcing Best Practices

Provide clear, simple instructions and fair pay to crowd-workers.
Perform preliminary tests to improve instructions; pre-qualify and recruit skilled workers.
Use a gold standard to onboard workers and continuously monitor workers' quality; contact or replace workers if necessary.
Use batches to continuously update workers' qualifications based on gold standard and agreement metrics.
Use a baseline model, if possible, to help limit the choices that workers need to make and simplify the task.

* Thanks to Magdalena Asborno and Sarah Hernandez at the University of Arkansas for collaborating to develop these best practices.