Progressive Approach to Relational Entity Resolution
In this research paper authored by Yasser Altowim, Dmitri Kalashnikov, and Sharad Mehrotra, a progressive approach to relational entity resolution is presented. The study focuses on balancing cost and quality in entity resolution tasks for relational datasets. The goal is to develop a method that achieves high-quality results within a specified cost budget. Various aspects such as ER graphs, resolution plans, and node benefits are explored to provide insights into the proposed approach.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Progressive Approach to Relational Entity Resolution Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra
ProgressiveER Cost vs. Quality Cost vs. Quality Cost vs. Quality Cost vs. Quality Cost vs. Quality Cost vs. Quality Cost vs. Quality Cost vs. Quality Progressive ER Quality Quality Quality Quality Resolution Cost Resolution Cost Resolution Cost Resolution Cost
RelationalDataset Paper Id p1 p2 p3 p4 Title Authors {a1, a2} {a1} {a3, a4} {a3} Venue u1 u2 u3 u4 Transaction Support in Read Optimized Read Optimized File System Designs: Transaction Support in Read Optimized Berkeley DB: A Retrospective .. Author Name Venue Id u1 u2 u3 u4 Name Papers {p1} {p2} {p3} {p4} Id a1 a2 a3 a4 Papers {p1, p2} {p1} {p3, p4} {p3} Very Large Data Bases Marge Seltzer ICDE Conference Michael Stonebraker VLDB Margo I. Seltzer IEEE Data Eng. Bull M. Stonebraker
Graph Representation u1,u3 p1,p3 duplicate Resolve duplicate
Problem Definition Given a relational dataset D, and a cost budget BG, Our goal is to develop a progressive approach that produces a high-quality result using BG units of cost.
ER Graph T1 R1 S1 S2 R2 T2
ER Graph v1 v2 v3 v5 v9 v10 v11 v6 v7 R1 v4 S1 T1 v12 v8 S2 R2 T2
Partially Constructed Graph v1 v2 v3 v5 v9 v10 v11 v6 v7 R1 v4 S1 T1 v12 v8 S2 R2 T2
Resolution Windows BG Windown Window1 Window2 1. Plan Generation. 2. Plan Execution ( ). Resolution Plan ( ) Set of blocks ( ) to be instantiated. Lazy Resolution Strategy Set of nodes ( ) to be resolved.
Node Benefit State State v3 v5 v1 v4 v2 v6 Direct Benefit Indirect Benefit
Plan Generation Phase Benefit-vs-Cost Analysis: Each node and block has an updated cost and benefit. 1. Generate a plan such that: h . 2. NP-hard Oregon-Trail Knapsack is maximized.
Plan Generation Algorithm Instantiated Unresolved Nodes v1 v2 v6 v7 v15 v16 Step#1 v4 v1 v2 v6 v10 v13 v16 v10 v21 Uninstantiated Blocks Step#2 R1 R2 R4 R5 R6 R8 R9
Plan Generation Algorithm R1 R8 R6 R2 Step#3 v1 v2 v30 v40 v32 v42 v34 v45 v30 v10 v38 v48 v36 v47 v1 v2 v6 If > v16 v10 else return and
ExperimentalEvaluation CiteSeerX Dataset 1. Papers (P) 2. Authors (A) 3. Venues (U) = (Title, Abstract, Keywords, Authors, Venue). = (Name, Email, Affiliation, Address, Paper). = (Name, Year, Pages, Papers). Number of Entities 30,000 83,152 30,000 Blocking Functions 2 1 1 Similarity Functions 3 4 3 Resolve Function Na ve Bayes P A U Na ve Bayes Na ve Bayes
ExperimentalEvaluation Algorithms: 1. DepGraph. X. Dong et al. Reference reconciliation in complex information spaces. SIGMOD. 2. Static. S. E. Whang et al. Joint entity resolution. ICDE. R T S S2 S6 S5 T6 T1 T3 R1 R4 R5 3. Full: No lazy resolution strategy. 4. Random: Lazy resolution strategy but with random order.
Lazy Resolution with Workflow Our Approach Our Approach Random Random Full Full Execution Time (sec) Execution Time (sec) 300.33 300.33 396.55 396.55 542.43 542.43 Plan Generation Plan Generation 4.76% 4.76% 3.81% 3.81% 2.58% 2.58% Plan Execution Reading Blocks 95.11% 4.70% 96.17% 3.75% 97.40 2.90% Graph Creation Reading Blocks. Creating Nodes. Resolving Nodes. 8.40% 6.25% 4.72% Node Resolution 82.01% 86.17% 89.78%
Conclusion Progressive Approach to Relational ER. Cost and benefit model for generating a resolution plan. Lazy resolution strategy to resolve nodes with the least amount of cost. Experiments on publication and synthetic datasets to demonstrate the efficiency of our approach.