Debugging Blocker in Entity Matching using MatchCatcher

Slide Note
Embed
Share

Explore how the MatchCatcher solution helps identify and improve matches killed off by blockers in the entity matching process. The solution allows users to quickly find and analyze these matches, enhancing the efficiency of the blocking stage in entity matching.


Uploaded on Nov 23, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. MatchCatcher: A Debugger for Blocking in Entity Matching Han Li University of Wisconsin-Madison Joint work with Pradap Konda, Paul Suganthan G. C., AnHai Doan, Benjamin Snyder, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Vijay Raghavendra @WalmartLabs

  2. Entity Matching Find data records referring to the same real-world entity Table A Table B Name Dave Smith Daniel Smith Joe Welson Charles Williams City Altanta LA New York Chicago Age 18 18 25 45 Name City Atlanta NY LA Chicago Age 18 25 30 45 David Smith Joe Wilson Daniel W. Smith Charles Williams Charlie William Atlanta 28 Critical problem in many domains E-commerce, biomedical, military intelligence, etc.

  3. EM in Practice Two fundamental steps: blocking and matching Table A Name City Age Dave Smith Altanta 18 Daniel Smith LA 18 Joe Welson New York 25 Charles Williams Chicago 45 Charlie William Atlanta 28 a5 a1 a2 a3 a4 (a2, b3) - (a4, b4) + (a5, b1) - (a2, b3) (a4, b4) (a5, b1) block on a.City = b.City match Table B Name City Atlanta NY LA Chicago Age 18 25 30 45 David Smith Joe Wilson Daniel W. Smith Charles Williams b1 b2 b3 b4 Many blocking methods have been developed Hash, similarity-based, rule-based, etc. 3

  4. Messages of This Talk Debugging blocking is important We developed the first blocking debugger It has been released as a part of the Magellan EM system Used extensively in many real-world settings Overwhelmingly positive feedback Try it out! https://sites.google.com/site/anhaidgroup/projects/magellan 4

  5. Debugging Blocking Table A Name Dave Smith Daniel Smith Joe Welson Charles Williams Charlie William Table B Name David Smith Joe Wilson Daniel W. Smith Charles Williams City Altanta LA New York Chicago Atlanta Age 18 18 25 45 28 Dave Smith Altanta a1 a2 a3 a4 a5 18 C (a2, b3) (a4, b4) (a5, b1) blocker Q a.City = b.City City Atlanta NY LA Chicago Age 18 25 30 45 18 Atlanta David Smith b1 b2 b3 b4 Does Q kill off too many matches? What are the killed-off matches? Why are they killed off by Q? Need to debug blocking! 5

  6. The MatchCatcher Solution Find matches killed-off by the blocker User can examine these matches and improve the blocker A B D = A B \ C C = Q(A, B) Quickly find as many killed-off true matches as possible 6

  7. Example Q1: a.City = b.City Table A Name Dave Smith Daniel Smith Joe Welson Charles Williams Charlie William Table B Name David Smith Joe Wilson Daniel W. Smith Charles Williams City Altanta LA New York Chicago Atlanta Age 18 18 25 45 28 Debugger Output (a1, b1) (a3, b2) (a2, b1) a1 a2 a3 a4 a5 Altanta C1 (a2, b3) (a4, b4) (a5, b1) New York City Atlanta NY LA Chicago Age 18 25 30 45 Atlanta NY b1 b2 b3 b4 7

  8. Example Q2: a.City = b.City OR lastword(a.Name) = lastword(b.Name) Table A Name Dave Smith Daniel Smith Joe Welson Charles Williams Charlie William Table B Name David Smith Joe Wilson Daniel W. Smith Charles Williams City Altanta LA New York Chicago Atlanta Age 18 18 25 45 28 C2 (a1, b1) (a1, b3) (a2, b1) (a2, b3) Debugger Output (a3, b2) (a1, b2) (a1, b4) a1 a2 a3 a4 a5 (a4, b4) (a5, b1) Welson City Atlanta NY LA Chicago Age 18 25 30 45 b1 b2 b3 b4 Wilson 8

  9. Example Q3: a.City = b.City OR ed(lastword(a.Name), lastword(b.Name)) 2 Table A Name Dave Smith Daniel Smith Joe Welson Charles Williams Charlie William Table B Name David Smith Joe Wilson Daniel W. Smith Charles Williams City Altanta LA New York Chicago Atlanta Age 18 18 25 45 28 C3 (a1, b1) (a1, b3) (a2, b1) (a2, b3) Debugger Output (a1, b2) (a1, b4) (a2, b2) a1 a2 a3 a4 a5 (a3, b2) (a5, b4) (a5, b1) (a5, b4) City Atlanta NY LA Chicago Age 18 25 30 45 b1 b2 b3 b4 9

  10. MatchCatcher in the Wild Has been open sourced and integrated into Magellan State-of-the-art open sourced system for building end-to-end EM solutions [Konda et al. VLDB 16, Konda et al. SIGMOD Record 18] Available at https://sites.google.com/site/anhaidgroup/projects/magellan Has been used extensively for EM over the past two years 300+ students in 4 data science classes 7 EM teams at 6 organizations Feedback has been overwhelmingly positive Discovering dirty data Finding correct blocker types and attributes Tuning blocker parameters Knowing when to stop revising blocking 10

  11. Key Idea Must find killed-off matches quickly Use string similarity joins (SSJs) Config = {Name, City} A A Dave Smith Altanta Daniel Smith LA Joe Welson New York Charles Williams Chicago Name Dave Smith Daniel Smith Joe Welson Charles Williams Charlie William City Altanta LA New York Chicago Atlanta Age 18 18 25 45 28 a1 a2 a3 a4 a5 Dave Smith Altanta Daniel Smith LA Joe Welson New York Charles Williams Chicago Charlie William Atlanta Charlie William Atlanta (a1, b1) (a3, b2) (a2, b1) B Top-3 SSJ B David Smith Atlanta Joe Wilson NY Daniel W. Smith LA Name City Atlanta NY LA Chicago Age 18 25 30 45 b1 b2 b3 b4 David Smith Joe Wilson Daniel W. Smith Charles Williams David Smith Atlanta Joe Wilson NY Daniel W. Smith LA Charles Williams Chicago Charles Williams Chicago 11

  12. Challenges A Top-k SSJ Name Dave Smith Daniel Smith Joe Welson Charles Williams Charlie William City Altanta LA New York Chicago Atlanta Age 18 18 25 45 28 {Name} Top-k SSJ {City} {Age} B Top-k SSJ {Name, City} Name City Atlanta NY LA Chicago Age 18 25 30 45 David Smith Joe Wilson Daniel W. Smith Charles Williams {Name, City, Age} 12

  13. Selection of Configurations Matching tuples have similar values on certain attributes Tuple b Tuple a Name City Vegas Age 19 Name City Age 18 David Smith DavidA. Smith Las Vegas Name + City Name + City Jaccard( , ) = 0.6 David A. Smith Las Vegas David Smith Vegas Run a top-k string similar join, use top-k pairs as match candidates Use multiple configs to find as many killed-off matches as possible But cannot use all of them as there are too many E.g., 10 attributes -> 1023 configs Need to select a set of configs 13

  14. Generating a Config Tree Procedure for generating a config tree 1. Given schema S, select a set of most promising attributes T Discard numeric, boolean and categorical attributes 2. Generate a tree of configs in a top-down fashion T = {n,c,s,d} {c,s,d} {n,s,d} {n,c,d} {n,c,s} {c,d} {n,d} {n,c} Pick one node to expand Equivalent to pick one attribute to exclude Attributes with more unique values and fewer missing values are more useful E.g. name vs. city Assign a score for each attribute, remove the one with the lowest score E.g. e(n) > e(d) > e(c) > e(s), remove attribute s {d} {n} {n,c,s,d} Manage long string attributes E.g., description, comments Cause multiple configs to generate similar top-k lists Need to remove the long attribute first {c,s,d} {n,s,d} {n,c,d} {n,c,s} {c,s} {n,s} {n,c} {c} {n} 14

  15. Challenges A Top-k SSJ Name Dave Smith Daniel Smith Joe Welson Charles Williams Charlie William City Altanta LA New York Chicago Atlanta Age 18 18 25 45 28 {Name} Top-k SSJ {City} {Age} B Top-k SSJ {Name, City} Name City Atlanta NY LA Chicago Age 18 25 30 45 David Smith Joe Wilson Daniel W. Smith Charles Williams {Name, City, Age} 15

  16. Top-k String Similarity Joins Similarity measure for top-k SSJs |? ?| |? ?| Use Jaccard similarity: ??????? ?,? = MatchCatcher works with all set-based measures: Cosine, Dice, Jaccard, Overlap State of the art on top-k SSJs for a single config TopKJoin [Xiao et al. 2009], based on prefix-filtering Maintain a prefix for each string by inverted index, incrementally extend the prefix Find pairs with prefix overlap and compute similarity scores TopKJoin has a major limitation Compute similarity score immediately if there is an overlap in the prefix for a pair Waste a lot of time for pairs not in the actual top-k list 16

  17. QJoin Algorithm Key idea keep track of the number of token overlaps each pair has, only compute the similarity score if there are q overlaps in the prefix Save unnecessary similarity score computation Select the best q Select the best q empirically as it s difficult to analyze A machine with n CPU cores q = 1 Core 1: best q = 2 top-k top-50 q = 2 Core 2: q = n Core n: 17

  18. Joint Top-k SSJs Across All Configs Need to do joint SSJs on a set of configs Generate the configs in a tree structure Speed up the SSJs by reusing computations and parallelization Reuse similarity score computations Process the configs in a breath-first order Reuse the computation from the parents f1 f2 f3 f3 f1 f2 f1 f3 f2 a b H (a,c): (a,b): o(f1, f1)=2 o(f1, f2)=3 o(f3, f3)=1 f2 f3 f1 f3 f1 f2 f1 f2 Reuse top-k lists Initialize the top-k list by re-adjusting the top-k pair scores from the parent Can make QJoin stop earlier as the top-k list is not generated from scratch Parallel processing of the configs Process one config join per core in the same breath-first order Concurrency control issues 18

  19. Challenges A Top-k SSJ Name Dave Smith Daniel Smith Joe Welson Charles Williams Charlie William City Altanta LA New York Chicago Atlanta Age 18 18 25 45 28 {Name} Top-k SSJ {City} {Age} B Top-k SSJ {Name, City} Name City Atlanta NY LA Chicago Age 18 25 30 45 David Smith Joe Wilson Daniel W. Smith Charles Williams {Name, City, Age} 19

  20. Empirical Evaluation # of # of attrs 5 7 5 7 8 8 Average length 205, 38 76, 179 16, 19 11, 10 9, 9 9, 9 Dataset Tuple type Table A Table B matches 1300 1154 2224 112 2978 73646 Amazon-Google Walmart-Amazon electronic product ACM-DBLP Fodors-Zagats Music1 Music2 software product 1363 2554 2294 533 100000 500000 3226 22074 2616 331 100000 500000 paper restaurant song song 20

  21. Empirical Evaluation Dataset Blocker Q (OL) title_overlap_word<3 (HASH) attr_equal_manuf (SIM) title_cos_word<0.4 (R) title_jac_word<0.2 AND manuf_jac_3gram<0.4 (OL) title_overlap_word<3 (HASH) attr_equal_brand (SIM) title_cos_word<0.4 (R) price_absdiff>20 OR title_jac_word<0.5 (OL) authors_overlap_word<2 (SIM) title_jac_3gram<0.7 (R1) title_cos_word<0.8 AND authors_jac_3gram<0.8 (R2) year_abs_diff>0.5 OR title_jac_word<0.7 (OL) name_overlap_word<2 (HASH) attr_equal_city (SIM) addr_jac_3gram<0.3 (R) (name_cos_word<0.5 AND type_jac_3gram<0.7) ORaddr_jac_3gram<0.3 (OL) artist_name_overlap_word<2 (HASH) attr_equal_artist_name (SIM) title_cos_word<0.5 (R) year_absdiff>0.5 OR title_cos_word<0.7 (HASH1) attr_equal_artist_name (HASH2) attr_equal_release_OR_attr_equal_artist_name (SIM1) title_cos_word<0.6 (SIM2) title_cos_word<0.7 (SIM3) title_cos_word<0.8 Amazon-Google Walmart-Amazon ACM-DBLP Fodors-Zagats Music1 Music2 21

  22. Empirical Evaluation Overall accuracy Top-k SSJs retrieved 12-100% of killed-off matches in D A B D = A B \ C User identified 70-100% of these matches C Examining matches reveals these reasons: Large threshold Different string appearance Missing values Unnormalized attribute Incomplete attribute extraction ... A-G W-A 41.5-83.6 77.1-91.0 A-D F-Z M1 M2 Recall of E (%) Acc. of verifier (%) 54.2-65.3 70.2-97.9 96.7-100 90.2-97.1 92.3-98.1 100.0 48.7-86.5 94.7-100 11.7-45.6 96.0-100 22

  23. Empirical Evaluation Runtime Redhat 7.2 Linux with Intel E5-1650 CPU A-G 6.6-9.4 W-A 97-310 A-D 2.8-3.2 F-Z 0.2 M1 M2 Runtime (sec.) 12.1-24.4 57-230 Debug real blockers to improve accuracy Ask users to develop best possible hash blockers for five datasets Initial Q recall (%) 75.6 95.1 100 97.3 100 Final Q recall (%) 99.7 99.6 - 100 - Datasets Rounds |C| increase (%) A-G W-A A-D F-Z M1 2 1 - 1 - 7.8 3.4 - 16.8 - 23

  24. Conclusion Introduce the problem of debugging blocking Develop MatchCatcher, a solution to debugging blocking Has been integrated into Magellan Has been used extensively with overwhelmingly positive feedback Try it out at https://sites.google.com/site/anhaidgroup/projects/magellan Future work Automatically explain why each match is killed off by the blocker Cluster the killed-off matches such that users can fix most pervasive problems first Extend MatchCatcher for specific blocker types 24

More Related Content