Enhancing Name and Address Parsing for Data Standardization

This presentation describes a University of Arkansas at Little Rock project, carried out for the US Census Bureau, to improve the quality of name and address parsing using active learning. It covers why parsing matters for entity resolution, which types of US addresses are in scope, and how the token pattern approach parses and standardizes unstructured addresses.


Uploaded on Sep 29, 2024


Presentation Transcript


  1. Name and Address Parser Using Active Learning
     University of Arkansas at Little Rock
     Project of the US Census Bureau
     April 17th, 2024

  2. Improve the quality of name and address parsing and standardization

  3. What Is Parsing?
     In general, parsing means cleaning a string or picking the important or required elements (tokens) out of it.
     Given a name and address: John Doe, 234 pine street.
     PERSON_NAME: John Doe
     PERSON_ADDRESS: 234 pine street

  4. What Is Name and Address Parsing?
     In entity resolution, the address plays a crucial role in identifying a reference in a data repository, yet most of the time addresses are unstructured, misspelled, and incomplete.
     To standardize an address, we must first identify (label) its tokens and then perform the standardization.
     The US address data preparation function is designed to parse an unstructured address string into a set of address components, so let's focus on addresses first.

  5. Types of US Addresses
     Scope: ability to parse six of the eight basic types of name and address styles identified in USPS Publication 28, Part A2:
     - Individual
     - Rural route
     - Attention line
     - Highway contract
     - Post office box
     - Military
     Developed an initial token-pattern, human-in-the-loop proof-of-concept system in Python.

  6. Token Pattern Approach
     Given a file of name and address records, the basic process is:
     1. Use light-weight token patterns to identify and separate name tokens and address tokens.
     2. Send the name tokens to the US Name Parser, which uses name-specific patterns to map (parse) them into 6 standard fields.
     3. Send the address tokens to the US Address Parser, which uses address-specific patterns to parse them into 15 standard fields.
     If a pattern is not found, a best-guess algorithm parses the data. At the same time, the data are sent to a person, who reviews them and creates the correct pattern to be added to the pattern knowledgebase.
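
The routing on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's code: the rule that the address starts at the first token beginning with a digit, the W/I name codes, the field names GIVEN, MIDDLE and SURNAME, and every function name are assumptions made for the example (the slide says only that names map to 6 standard fields, without naming them).

```python
# Hypothetical name patterns: mask -> field for each token position.
NAME_PATTERNS = {
    "WW": ["GIVEN", "SURNAME"],
    "WIW": ["GIVEN", "MIDDLE", "SURNAME"],
}


def split_name_address(record):
    """Split a record into name tokens and address tokens.

    Assumption: the address begins at the first token that starts with a digit.
    """
    tokens = record.upper().replace(",", " ").split()
    for i, token in enumerate(tokens):
        if token[0].isdigit():
            return tokens[:i], tokens[i:]
    return tokens, []


def name_mask(tokens):
    """Code each name token as I (single-letter initial) or W (word)."""
    return "".join("I" if len(t.rstrip(".")) == 1 else "W" for t in tokens)


def parse_name(tokens, review_queue):
    """Parse with a known pattern, or fall back to a best guess and queue a review."""
    mask = name_mask(tokens)
    fields = NAME_PATTERNS.get(mask)
    if fields is not None:
        return dict(zip(fields, tokens))
    # Unknown pattern: send the data to a person while still returning a guess.
    review_queue.append((mask, tokens))
    guess = {}
    if tokens:
        guess["GIVEN"] = tokens[0]
    if len(tokens) > 1:
        guess["SURNAME"] = tokens[-1]
    return guess
```

With the record from slide 3, `split_name_address("John Doe, 234 pine street")` separates `JOHN DOE` from `234 PINE STREET`, and `parse_name` finds the `WW` pattern. A four-word name has no pattern in this toy table, so it gets the best guess and lands in the review queue.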

  7. Example Address Pattern
     Convert the address to a mask using the clue table (about 1,200 entries):
       Address: 123 OAK ST, ST CLOUD, MN 63646
       Mask:    NWF,FW,TN
     ST is in the clue table as a street suffix (code F); MN is a state clue.
     If the mask is in the mapping knowledgebase, use the mapping to parse: T1 STRNBR, T2 STRNAME, T3 STRSUFFIX, T4 & T5 CITYNAME, T6 STATE, T7 ZIP.
     If the mask is not in the knowledgebase, write it to the exception file for a person to create the mapping, then add the mapping back to the KB.
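
A small sketch of the mask-and-lookup idea, using only the example from this slide. The two-entry clue table, the rule that any token starting with a digit gets code N and any other unknown token gets code W, and the names `CLUE_TABLE`, `MASK_KB`, `make_mask` and `parse` are assumptions for illustration; the real clue table and knowledgebase are far larger.

```python
# Hypothetical miniature clue table (the slide cites about 1,200 entries).
CLUE_TABLE = {"ST": "F", "MN": "T"}

# Hypothetical knowledgebase: mask -> component for each token position.
MASK_KB = {
    "NWF,FW,TN": ["STRNBR", "STRNAME", "STRSUFFIX",
                  "CITYNAME", "CITYNAME", "STATE", "ZIP"],
}


def token_code(token):
    """Look the token up in the clue table, else fall back to N or W."""
    if token in CLUE_TABLE:
        return CLUE_TABLE[token]
    return "N" if token[0].isdigit() else "W"


def make_mask(address):
    """Code every token and keep the commas as part of the mask."""
    segments = address.upper().split(",")
    return ",".join(
        "".join(token_code(t) for t in segment.split()) for segment in segments
    )


def parse(address):
    """Return (mask, parsed components), or (mask, None) if the mask is unknown."""
    mask = make_mask(address)
    mapping = MASK_KB.get(mask)
    if mapping is None:
        return mask, None  # would be written to the exception file
    tokens = address.upper().replace(",", " ").split()
    parsed = {}
    for token, component in zip(tokens, mapping):
        # Tokens mapped to the same component are joined, e.g. "ST CLOUD".
        parsed[component] = (parsed.get(component, "") + " " + token).strip()
    return mask, parsed
```

Note that the second ST is still coded F by the clue table even though it belongs to the city name; it is the mapping for the whole mask, not the per-token code, that assigns it to CITYNAME.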

  8. Implemented as Two Processes
     [Flow diagram]
     Process 1: Name & Address File -> Generate Mask & Lookup (using the Clues Table and the Mask-Mapping KB) -> Mask found? If yes: Parse -> Name/Address Parsed Information; otherwise: Exception Output.
     Process 2: Exception Output -> Display Mask (UI) -> Input Mapping -> Update KB (Mask-Mapping KB).
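
One way to read the two processes as code is sketched below. It is an assumption-laden outline rather than the actual system: the knowledgebase is a plain dict, the exception output is an in-memory list instead of a file, and the reviewer UI is stood in for by a callback. The names `process_1`, `process_2`, `make_mask` and `ask_reviewer` are hypothetical.

```python
def process_1(records, kb, make_mask):
    """Parse records whose mask is in the KB; collect the rest as exceptions."""
    parsed, exceptions = [], []
    for record in records:
        mask = make_mask(record)
        if mask in kb:
            parsed.append((record, kb[mask]))
        else:
            exceptions.append((mask, record))
    return parsed, exceptions


def process_2(exceptions, kb, ask_reviewer):
    """Show each unknown mask to a reviewer and store the mapping they supply."""
    for mask, record in exceptions:
        if mask in kb:
            continue  # this mask was already mapped earlier in the loop
        mapping = ask_reviewer(mask, record)
        if mapping:
            kb[mask] = mapping
    return kb
```

Running `process_1`, then `process_2`, then `process_1` again on the same records shows the human-in-the-loop effect: each new mask needs one review, after which every record sharing that mask is parsed automatically.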

  9. Step-by-Step Explanation Using an Example
     Address: 123-1/2 N. Oak Street, Apt 3A, Little Rock, ARK 72203-4352
     Address mask: NDWF,SN,WW,TN
     Token table and final parsing:
     Pos  Token       Code  Comp Code   Value Assigned
     1    123-1/2     N     @USAD_SNO   123-1/2
     2    N           D     @USAD_SPR   N
     3    OAK         W     @USAD_SNM   OAK
     4    STREET      F     @USAD_SFX   STREET
     5    APT         S     @USAD_ANM   APT
     6    3A          N     @USAD_ANO   3A
     7    LITTLE      W     @USAD_CTY   LITTLE
     8    ROCK        W     @USAD_CTY   ROCK
     9    ARK         T     @USAD_STA   ARK
     10   72203-4352  N     @USAD_ZIP   72203-4352
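
The token table on this slide can be represented as data and collapsed into the final parsing, as in the sketch below. The tokens, codes and component codes come from the slide; assigning @USAD_CTY to both LITTLE and ROCK is an assumption (the slide lists nine component codes for ten tokens), and the function names are hypothetical.

```python
TOKENS = ["123-1/2", "N", "OAK", "STREET", "APT", "3A",
          "LITTLE", "ROCK", "ARK", "72203-4352"]
CODES = list("NDWFSNWWTN")  # the mask NDWF,SN,WW,TN without its commas
COMPONENTS = ["@USAD_SNO", "@USAD_SPR", "@USAD_SNM", "@USAD_SFX",
              "@USAD_ANM", "@USAD_ANO", "@USAD_CTY", "@USAD_CTY",
              "@USAD_STA", "@USAD_ZIP"]


def token_table(tokens, codes, components):
    """Build one row per token: position, token, code, component code."""
    return [
        {"pos": pos, "token": token, "code": code, "component": component}
        for pos, (token, code, component)
        in enumerate(zip(tokens, codes, components), start=1)
    ]


def final_parsing(rows):
    """Join the tokens that share a component code, in position order."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["component"], []).append(row["token"])
    return {component: " ".join(parts) for component, parts in grouped.items()}
```

Calling `final_parsing(token_table(TOKENS, CODES, COMPONENTS))` yields nine components, with `@USAD_CTY` holding `LITTLE ROCK` and every other component holding a single token.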

  10. Demonstration

  11. Thank you! Questions?
