Enhancing Name and Address Parsing for Data Standardization

 
NAME AND ADDRESS PARSER
USING ACTIVE LEARNING
 
UNIVERSITY OF ARKANSAS AT LITTLE ROCK
PROJECT OF THE US CENSUS BUREAU
APRIL 17TH, 2024
 
IMPROVE THE QUALITY OF NAME & ADDRESS
PARSING AND STANDARDIZATION
 
WHAT IS PARSING?
 
IN GENERAL, PARSING MEANS CLEANING A STRING OR PICKING OUT ITS IMPORTANT OR REQUIRED
ELEMENTS (TOKENS).
 
 
Given Name and Address: John Doe, 234 pine street.
John Doe → PERSON_NAME
234 pine street → PERSON_ADDRESS
 
WHAT IS NAME AND ADDRESS PARSING?
 
WHEN WE PERFORM ENTITY RESOLUTION, THE ADDRESS PLAYS A CRUCIAL ROLE IN IDENTIFYING A
REFERENCE IN A DATA REPOSITORY.
MOST OF THE TIME, ADDRESSES ARE UNSTRUCTURED, MISSPELLED, AND INCOMPLETE.
TO STANDARDIZE AN ADDRESS, WE MUST FIRST IDENTIFY AND LABEL ITS TOKENS,
AND THEN PERFORM STANDARDIZATION.
THE US ADDRESS DATA PREPARATION FUNCTION IS DESIGNED TO PARSE AN UNSTRUCTURED
ADDRESS STRING INTO A SET OF ADDRESS COMPONENTS.
 
SO FIRST, LET'S FOCUS ON ADDRESSES.
 
TYPES OF US ADDRESSES
 
SCOPE: ABILITY TO PARSE SIX OF THE EIGHT BASIC TYPES OF NAME AND ADDRESS STYLES
IDENTIFIED IN USPS PUBLICATION 28 PART A2:
  • INDIVIDUAL
  • RURAL ROUTE
  • ATTENTION LINE
  • HIGHWAY CONTRACT
  • POST OFFICE BOX
  • MILITARY
DEVELOPED AN INITIAL TOKEN-PATTERN, “HUMAN-IN-THE-LOOP” PROOF-OF-CONCEPT
SYSTEM IN PYTHON
 
TOKEN PATTERN APPROACH
 
GIVEN A FILE OF NAME AND ADDRESS RECORDS, THE BASIC PROCESS USES TOKEN “PATTERNS” AS FOLLOWS:
1. USE LIGHT-WEIGHT TOKEN PATTERNS TO IDENTIFY AND SEPARATE NAME TOKENS AND ADDRESS TOKENS
2. SEND NAME TOKENS TO THE US NAME PARSER
  • USE NAME-SPECIFIC PATTERNS TO MAP (PARSE) THE NAME TOKENS INTO 6 STANDARD FIELDS
3. SEND ADDRESS TOKENS TO THE US ADDRESS PARSER
  • USE ADDRESS-SPECIFIC PATTERNS TO PARSE THE ADDRESS TOKENS INTO 15 STANDARD FIELDS
IF A PATTERN IS NOT FOUND, A “BEST GUESS” ALGORITHM PARSES THE DATA.
AT THE SAME TIME, THE DATA ARE SENT TO A PERSON WHO REVIEWS THEM AND CREATES THE “CORRECT” PATTERN,
WHICH IS THEN ADDED TO THE PATTERN KNOWLEDGEBASE
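
Step 1 of the process can be illustrated with a minimal Python sketch. The slides do not say which light-weight patterns separate name tokens from address tokens, so the rule used here (the address starts at the first token that begins with a digit) is purely an assumption for illustration, and the function name split_name_address is hypothetical. It would not handle styles such as post office box or rural route addresses.

```python
import re


def split_name_address(record):
    """Split one record into (name_tokens, address_tokens).

    Assumed rule, not taken from the slides: the address begins at the
    first token whose first character is a digit.
    """
    # Upper-case, drop a trailing period, and split on whitespace and commas.
    tokens = re.findall(r"[^\s,]+", record.upper().rstrip("."))
    for i, tok in enumerate(tokens):
        if tok[0].isdigit():
            return tokens[:i], tokens[i:]
    # No digit-led token: treat the whole record as a name.
    return tokens, []


name_tokens, address_tokens = split_name_address("John Doe, 234 pine street.")
print(name_tokens)     # tokens that would go to the name parser
print(address_tokens)  # tokens that would go to the address parser
```

In the real system each token list would then be handed to its own parser; this sketch stops at the separation step.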
 
EXAMPLE ADDRESS PATTERN
 
CONVERT ADDRESS TO MASK USING CLUE TABLE (@1,200 ENTRIES):
123 OAK ST, ST CLOUD, MN 63646 → “NWF,FW,TN”
“ST” IS IN CLUE TABLE AS STREET SUFFIX (CODE “F”), “MN” IS STATE CLUE (CODE “T”)
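
The mask generation step can be sketched in a few lines of Python. The clue table below is a tiny hypothetical stand-in for the real one, and the meanings assumed for the codes beyond “F” (N for a token starting with a digit, W for any other word, D for directional, S for secondary unit) are inferred from the example masks rather than stated on the slides.

```python
import re

# Hypothetical miniature clue table. Only "F" (street suffix) and the state
# clue are described on the slide; the other entries are assumptions.
CLUES = {"ST": "F", "STREET": "F", "MN": "T", "ARK": "T", "N": "D", "APT": "S"}


def tokenize(address):
    """Upper-case, drop periods, and keep commas as separate tokens."""
    return re.findall(r"[^\s,]+|,", address.upper().replace(".", ""))


def token_code(token):
    """Return the one-character mask code for a token."""
    if token == ",":
        return ","            # commas are carried into the mask unchanged
    if token in CLUES:
        return CLUES[token]   # clue-table hit
    if token[0].isdigit():
        return "N"            # assumed: number-like token
    return "W"                # assumed: any other word


def make_mask(address):
    return "".join(token_code(t) for t in tokenize(address))


print(make_mask("123 OAK ST, ST CLOUD, MN 63646"))
```

Note that “ST” in “ST CLOUD” is coded “F” as well, exactly as in the slide's mask: the mask records clue-table hits only, and the mapping decides later what each position means.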
 
IF MASK IS IN THE MAPPING KNOWLEDGEBASE, USE THE MAPPING TO PARSE:
T1 → STRNBR, T2 → STRNAME, T3 → STRSUFFIX, T4 & T5 → CITYNAME, T6 → STATE, T7 → ZIP
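
A mapping lookup of this kind might look as follows in Python. The knowledge base is shown as a plain dictionary keyed by mask, with 1-based token positions that skip commas as on the slide; the structure and the names MAPPING_KB and parse_with_mapping are assumptions for illustration, not the project's actual storage format.

```python
# Hypothetical in-memory knowledge base: mask -> [(field, token positions)].
MAPPING_KB = {
    "NWF,FW,TN": [
        ("STRNBR", [1]),
        ("STRNAME", [2]),
        ("STRSUFFIX", [3]),
        ("CITYNAME", [4, 5]),   # two tokens are joined into one field
        ("STATE", [6]),
        ("ZIP", [7]),
    ],
}


def parse_with_mapping(tokens, mask):
    """Return {field: value} for a known mask, or None if the mask is unknown."""
    mapping = MAPPING_KB.get(mask)
    if mapping is None:
        return None
    return {
        field: " ".join(tokens[p - 1] for p in positions)
        for field, positions in mapping
    }


tokens = ["123", "OAK", "ST", "ST", "CLOUD", "MN", "63646"]  # commas removed
parsed = parse_with_mapping(tokens, "NWF,FW,TN")
print(parsed)
```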
 
IF MASK IS NOT IN THE KNOWLEDGEBASE, WRITE TO EXCEPTION FILE FOR PERSON TO CREATE
MAPPING, THEN ADD THE MAPPING BACK TO THE KB
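
The human-in-the-loop cycle can be sketched as below. An in-memory list stands in for the exception file and a dictionary for the KB; both, along with the function names, are hypothetical simplifications, and the reviewer's input is hard-coded where the real system would collect it through a UI.

```python
def lookup_mapping(mask, tokens, kb, exception_queue):
    """Return the mapping for a mask, or queue the record for human review."""
    mapping = kb.get(mask)
    if mapping is None:
        exception_queue.append({"mask": mask, "tokens": tokens})
    return mapping


def add_reviewed_mapping(kb, exception_queue, mask, mapping):
    """Store a reviewer's mapping and clear the exceptions it resolves."""
    kb[mask] = mapping
    exception_queue[:] = [e for e in exception_queue if e["mask"] != mask]


kb = {}                # starts empty; grows as reviewers add mappings
exception_queue = []   # stand-in for the exception file
tokens = ["123", "OAK", "ST", "ST", "CLOUD", "MN", "63646"]

# First pass: the mask is unknown, so the record is queued for review.
first_try = lookup_mapping("NWF,FW,TN", tokens, kb, exception_queue)
pending_after_miss = len(exception_queue)

# A reviewer supplies the mapping, which is written back to the KB.
reviewed = [("STRNBR", [1]), ("STRNAME", [2]), ("STRSUFFIX", [3]),
            ("CITYNAME", [4, 5]), ("STATE", [6]), ("ZIP", [7])]
add_reviewed_mapping(kb, exception_queue, "NWF,FW,TN", reviewed)

# Second pass: the same mask now resolves without human help.
second_try = lookup_mapping("NWF,FW,TN", tokens, kb, exception_queue)
```

Each reviewed mask benefits every later record with the same mask, which is why the number of exceptions should shrink as the KB grows.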
 
IMPLEMENTED AS TWO PROCESSES
 
[Flowchart]
Process 1: Name & Address File → Generate Mask & Lookup (using Clues Table and Mask-Mapping KB) → Mask found? Yes → Parse Name/Address → Parsed Information; otherwise → Exception Output
Process 2: Exception Output → Display Mask, Input Mapping (UI) → Update KB → Mask-Mapping KB
 
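STEP BY STEP EXPLANATION USING AN EXAMPLE
 
Address: 123-1/2 N. Oak Street, Apt 3A, Little Rock, ARK 72203-4352
Mask: NDWF,SN,WW,TN
 
Token Table
Pos   Token        Code
1     123-1/2      N
2     N            D
3     OAK          W
4     STREET       F
5     APT          S
6     3A           N
7     LITTLE       W
8     ROCK         W
9     ARK          T
10    72203-4352   N
 
Final Parsing
Comp Code    Value Assigned
@USAD_SNO    123-1/2
@USAD_SPR    N
@USAD_SNM    OAK
@USAD_SFX    STREET
@USAD_ANM    APT
@USAD_ANO    3A
@USAD_CTY    LITTLE ROCK
@USAD_STA    ARK
@USAD_ZIP    72203-4352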
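
The worked example on this slide can be checked with a short Python script. The component codes are copied as they appear in the presentation; the position ranges assigned to them, and the helper names token_table and final_parsing, are assumptions made for this sketch.

```python
ADDRESS_MASK = "NDWF,SN,WW,TN"
TOKENS = ["123-1/2", "N", "OAK", "STREET", "APT", "3A",
          "LITTLE", "ROCK", "ARK", "72203-4352"]

# (component code, first position, last position), positions are 1-based.
COMPONENTS = [
    ("@USAD_SNO", 1, 1), ("@USAD_SPR", 2, 2), ("@USAD_SNM", 3, 3),
    ("@USAD_SFX", 4, 4), ("@USAD_ANM", 5, 5), ("@USAD_ANO", 6, 6),
    ("@USAD_CTY", 7, 8), ("@USAD_STA", 9, 9), ("@USAD_ZIP", 10, 10),
]


def token_table(mask, tokens):
    """Pair each token with its mask code; commas in the mask are separators."""
    codes = [c for c in mask if c != ","]
    if len(codes) != len(tokens):
        raise ValueError("mask and token list do not line up")
    return [(pos, tok, code)
            for pos, (tok, code) in enumerate(zip(tokens, codes), start=1)]


def final_parsing(tokens, components):
    """Join the tokens in each position range into one component value."""
    return {name: " ".join(tokens[first - 1:last])
            for name, first, last in components}


for row in token_table(ADDRESS_MASK, TOKENS):
    print(*row)
result = final_parsing(TOKENS, COMPONENTS)
print(result)
```

The length check in token_table is a cheap guard: a mask whose non-comma characters do not match the token count one-to-one signals a tokenization problem before any mapping is applied.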
 
DEMONSTRATION
 
THANK YOU!
QUESTIONS
Explore the project focused on improving the quality of name and address parsing using active learning methods at the University of Arkansas at Little Rock. Learn about the importance of parsing, entity resolution, and the token pattern approach in standardizing and processing unstructured addresses. Discover the types of US addresses covered and the token pattern process for efficient data parsing and standardization.


Uploaded on Sep 29, 2024



