Context-Aware Fuzzy Date Matching: BlockDates Solution
BlockDates is a tool designed to improve the extraction and interpretation of date information embedded in text narratives, including date ranges and varied formats. It uses a context-aware approach to interpret single dates and date ranges, helping organizations such as the Bureau of Transportation Statistics parse weekly logs into daily records for more detailed analysis. Written in Julia, an open-source programming language, BlockDates processes and scores blocks of text to generate daily records with interpretation notes.
Presentation Transcript
BlockDates: A Context-Aware Fuzzy Date Matching Solution
Francis Smart, Censeo Consulting Group (formerly Bureau of Transportation Statistics)
Amanda Lemons, Bureau of Transportation Statistics
Allison Fischman, Bureau of Transportation Statistics
FedCASIC Session - Data Science Applications: Classification
April 6, 2022
Problem Statement
How can we make better and easier use of datetime data when it is embedded in narratives and often differs in format?
- Existing tools provide a solution for interpreting single dates, but not date ranges.
- Existing tools do not provide an easy solution for finding datetime information embedded in text.
- A tool is needed to (a) interpret both single dates and date ranges with inconsistent formats, and (b) extract the associated text for each day (the context).
Specific application at the Bureau of Transportation Statistics (BTS)
- BTS uses weekly activity logs to validate submitted safety reports and measure levels of activity.
- BTS requires the capability to parse these logs into daily records.
- Daily-level data allows for more precise measurement of event frequency and time-based trends, as well as error checking and checking for data completeness.
Broader potential application
- Federal statistical agencies routinely use additional data sources to validate survey data and aid in analysis.
- Data may include dates, but sometimes these are harder to use due to varied formats.
- This research could improve the usefulness of such data.
Methodology
BlockDates is a context-aware date matching tool.
- BlockDates is written in Julia, an open-source programming language.
- Julia was selected as a high-level programming language with rapid processing speed due to just-in-time compilation.
Overview of the tool
- Input: start and end dates, and a block of text with embedded dates
- Processing: identifying, interpreting, and correcting dates; scoring each row and the overall transformation
- Output: daily records (dates, text, interpretation notes, and score)
BlockDates Input/Output Flow
Input
- Start Date: 03/18/2022
- End Date: 03/19/2022
- Text Block: 03/18/2022 lorem ipsum etc cetera. Lorem ipsum dolor sit amet, consectetur adipiscing elit, 03/19/2021 ullamco laboris nisi ut aliquip ex ea commodo consequat
BlockDates Output
Daily Text | Daily Date | Score | Modifiers
03/18/2022 lorem ipsum etc cetera. Lorem ipsum dolor sit amet, consectetur adipiscing | 03/18/2022 | 5 |
03/19/2021 ullamco laboris nisi ut aliquip ex ea commodo consequat.. | 03/19/2022 | 4 | Year +1
Overall Score: 4.5
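The flow on this slide can be illustrated with a minimal Python sketch. It is not the BlockDates.jl API (the package is written in Julia), and the names `DATE_RE`, `make_date`, and `split_block` are hypothetical. The sketch assumes only full MM/DD/YYYY dates are recognized and that a year typo is repaired by trying each year covered by the reporting period; the package's actual correction logic may differ, and scoring is left out because the slide does not give the weights.

```python
import re
from datetime import date

# Assumption: only full MM/DD/YYYY dates are recognized in this sketch.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def make_date(year, month, day):
    """Return a date, or None when the parts do not form a valid calendar date."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def split_block(text, start, end):
    """Split a text block at each date and repair years outside [start, end]."""
    hits = []
    for m in DATE_RE.finditer(text):
        month, day, year = (int(g) for g in m.groups())
        found = make_date(year, month, day)
        if found is not None:
            hits.append((m.start(), found))

    records = []
    for i, (pos, found) in enumerate(hits):
        # The daily text runs from this date up to the next date (or the end).
        stop = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        segment = text[pos:stop].strip()
        note = ""
        if not (start <= found <= end):
            # Hypothetical repair rule: keep month and day, try in-period years.
            for y in range(start.year, end.year + 1):
                candidate = make_date(y, found.month, found.day)
                if candidate is not None and start <= candidate <= end:
                    note = f"Year {y - found.year:+d}"
                    found = candidate
                    break
        records.append((found, segment, note))
    return records


block = (
    "03/18/2022 lorem ipsum etc cetera. Lorem ipsum dolor sit amet, "
    "consectetur adipiscing elit, 03/19/2021 ullamco laboris nisi ut "
    "aliquip ex ea commodo consequat"
)
records = split_block(block, date(2022, 3, 18), date(2022, 3, 19))
for daily_date, daily_text, note in records:
    print(daily_date, "|", note or "-", "|", daily_text)
```

With the slide's input, the second record's date 03/19/2021 falls outside the period and is moved to 03/19/2022 with the note "Year +1", matching the modifier shown in the output table.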
Detailed BlockDates Flow
Date variations interpreted:
- Date singletons (MM DD YYYY, YYYY MM DD, DD MM YYYY, etc.)
- Date ranges (MM DD YYYY MM DD YYYY, MM DD MM DD YYYY, etc.)
- Ranges and singletons can be mixed in the input
- Locations: leading or embedded ("MM DD YYYY The event" versus "The event occurred on MM DD YYYY")
Feedback provided:
- Overall mean score
- Daily score
- Daily text
- Daily date
- Date interpretation notes
- Order of dates found in text block ([1], [2], [3])
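To make the singleton/range distinction concrete, here is a small Python sketch that tags each date expression in a text as a range or a single date. It is an illustration only, not the package's parser. The separators are assumptions: the transcript does not show the delimiters, so "/" between date parts and "-", "to", or "through" between range endpoints are guesses, and the names `DATE`, `TOKEN_RE`, and `find_dates` are hypothetical.

```python
import re

# Assumed shape of one date: M/D with an optional 2- or 4-digit year.
DATE = r"\d{1,2}/\d{1,2}(?:/(?:\d{4}|\d{2}))?"

# A range is two dates joined by an assumed connector; it is tried first so
# that "4/4/22 through 4/6/22" is not reported as two singletons.
TOKEN_RE = re.compile(
    rf"(?<![\d/])(?:(?P<range>{DATE}\s*(?:-|to|through)\s*{DATE})"
    rf"|(?P<single>{DATE}))(?![\d/])"
)


def find_dates(text):
    """Return (kind, matched_text) pairs in order of appearance."""
    return [(m.lastgroup, m.group(0)) for m in TOKEN_RE.finditer(text)]


print(find_dates("4/4/22 through 4/6/22 inspections. 4/7/22 travel."))
print(find_dates("The event occurred on 4/4 - 4/6/2022"))
```

The second call shows an embedded range whose first endpoint omits the year, the "MM DD MM DD YYYY" variation listed on the slide.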
Scoring System
Positive Scoring: Rewards (≥ 0) | Negative Scoring: Penalties (< 0) | Message
Unambiguous date formats: 03/21/2022, 03/22/2022, etc. | Ambiguous date formats: (03/21/22, 03/22/22), (3/21, 3/22), etc. | sformat variables, rformat variables
Sequential ordering: 8/5/22, 8/6/22, 8/7/22, 8/8/22 | Non-sequential ordering: 8/7/22, 8/8/22, 8/5/22, 8/4/22 | outOfOrder
Complete coverage: start = 4/4/22, end = 4/7/22, text block = 4/4/22, 4/5/22, 4/6/22, 4/7/22 | Incomplete coverage: start = 4/4/22, end = 4/7/22, text block = 4/4/22, 4/6/22, 4/7/22 | filled (this date is filled in output)
In range: start = 4/4/22, end = 4/5/22, text = 4/4/22, 4/5/22 | Out of range: start = 4/4/22, end = 4/5/22, text = 4/4/22, 4/5/22, 4/6/22, 5/7/22 | outOfRange* (*, ** indicate degree)
Year correctly input: 4/4/2022, 4/5/2022, 4/6/2022 | Year typos: 4/4/2022, 4/5/2022, 4/6/2021, 4/7/2202 | +1year, -180year, etc.
Single-instance dates: 4/4/22, 4/5/22, 4/6/22 | Duplicate dates: 4/4/22, 4/5/22, 4/4/22 | duplicateDate
Range found: 4/4/22 through 4/6/22. Message: rangefound
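The sketch below shows, in Python, one way the messages in this table could be attached to a list of already-parsed dates. The message names (outOfRange, duplicateDate, outOfOrder, filled) come from the slide; the triggering rules, the function name `annotate`, and the row layout are assumptions, and the numeric scores and the `*`/`**` degree markers are deliberately left out because the slide does not define them.

```python
from datetime import date, timedelta


def annotate(found, start, end):
    """Return (day, messages) rows for the dates found in a text block.

    Message names follow the slide; the rules that trigger them are assumed.
    """
    rows = []
    seen = set()
    latest = None
    for d in found:
        messages = []
        if d < start or d > end:
            messages.append("outOfRange")
        if d in seen:
            messages.append("duplicateDate")
        # Assumed rule: a date earlier than any date already seen is out of order.
        if latest is not None and d < latest:
            messages.append("outOfOrder")
        seen.add(d)
        latest = d if latest is None else max(latest, d)
        rows.append((d, messages))

    # Days in the reporting period with no matching date become "filled" rows.
    day = start
    while day <= end:
        if day not in seen:
            rows.append((day, ["filled"]))
        day += timedelta(days=1)
    return rows


# Incomplete-coverage example from the slide: 4/5/22 is missing from the text.
coverage = annotate(
    [date(2022, 4, 4), date(2022, 4, 6), date(2022, 4, 7)],
    date(2022, 4, 4),
    date(2022, 4, 7),
)
for day, messages in coverage:
    print(day, messages)
```

A real scorer would turn these messages into penalties and average the daily scores into the overall score; those weights would need to be taken from the package itself.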
Balancing Tradeoffs: Structured/Unstructured
The more flexible the matching, the more false positives (in this example, the date is not required to appear at the start of a line).
Text Block (Start Date = 4/5/2022): 4/5 We visited a ruins. We went walking. We found a coin 3/4 inches in diameter. 4/6 travel continued .
Daily Text | Daily Date | Message
4/5 We visited a ruins. We went walking. | 4/5/2022 |
4/6 travel continued . | 4/6/2022 |
We found a coin 3/4 inches in diameter. | 3/4/2022 | outOfRange** outOfOrder
Too rigid a structure can create false negatives.
Text Block (Start Date = 8/10/2022): 8 10 2022 Snorkeling. 8 11 Sky diving. 8 12 2022 Sleeping all day
Daily Text | Daily Date | Message
8 10 2022 Snorkeling. 8 11 Sky diving. | 8/10/2022 |
<<filled>> | 8/11/2022 | filled
8 12 2022 Sleeping all day | 8/12/2022 |
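Both failure modes can be reproduced with plain regular expressions. The Python sketch below is an illustration, not the package's matching logic: the three patterns are hypothetical, and the first example assumes each log entry starts on its own line so that a line-start rule has something to anchor on.

```python
import re

# Flexible: accept M/D anywhere in the text.
ANYWHERE = re.compile(r"(?<![\d/])\d{1,2}/\d{1,2}(?![\d/])")
# Stricter: accept M/D only at the start of a line.
LINE_START = re.compile(r"^[ \t]*(\d{1,2}/\d{1,2})(?![\d/])", re.MULTILINE)
# Rigid: require a full, space-separated "M D YYYY".
FULL_DATE = re.compile(r"(?<!\d)\d{1,2} \d{1,2} \d{4}(?!\d)")

log = (
    "4/5 We visited a ruins. We went walking. "
    "We found a coin 3/4 inches in diameter.\n"
    "4/6 travel continued."
)
flexible = ANYWHERE.findall(log)   # also picks up the coin size "3/4"
strict = LINE_START.findall(log)   # only the two entry dates

trip = "8 10 2022 Snorkeling. 8 11 Sky diving. 8 12 2022 Sleeping all day"
rigid = FULL_DATE.findall(trip)    # misses "8 11", which has no year

print(flexible)
print(strict)
print(rigid)
```

The flexible pattern reports the fraction "3/4" as a date (a false positive), while the rigid pattern skips the year-less "8 11" entry (a false negative), which is the tradeoff the slide describes.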
Findings and Next Steps
Application to BTS data: sample of 59,314 narrative records that include dates
- 96.5% of records returned positive scores (high confidence that the date is valid)
- Of those with no matching date, 77.9% were correctly recognized as having no viable date match
Next step: Julia public registration
- In the process of fulfilling public registration requirements
- Expanding tests to cover all features (60% complete)
- Documenting all functions (90% complete)
- Publishing tutorial/handbook
The unofficial package is currently available at https://github.com/EconometricsBySimulation/BlockDates.jl
Contact Us
Francis Smart: Fsmart@censeoconsulting.com
Allison Fischman: Allison.Fischman@dot.gov
Amanda Lemons: Amanda.Lemons@dot.gov