Recent Advances in Crowdsourced Entity Matching Problems

Recent research applies crowdsourcing to entity matching to improve accuracy and efficiency. This presentation examines why crowdsourced workflows that still depend on a developer do not scale to enterprises with many matching problems, and why they exclude users such as journalists, domain scientists, and small business workers. It then introduces hands-off crowdsourcing and Corleone, a system that crowdsources the entire entity matching workflow with no developer.



Presentation Transcript


  1. Corleone: Hands-Off Crowdsourcing for Entity Matching Chaitanya Gokhale University of Wisconsin-Madison Joint work with AnHai Doan, Sanjib Das, Jeffrey Naughton, Ram Rampalli, Jude Shavlik, and Jerry Zhu @WalmartLabs

  2. Entity Matching
Amazon (id, name, brand, price):
  1 | Transcend JetFlash 700         | Transcend | 30.0
  2 | HP Biscotti 17.3 G72 Laptop .. | HP        | 388.0
  ...
Walmart (id, name, brand, price):
  1 | HP Biscotti G72 17.3 Laptop .. | HP        | 395.0
  2 | Transcend 16 GB JetFlash 500   | Transcend | 17.5
  ...
Entity matching has been studied extensively for decades, with no satisfactory solution as yet. Recent work has considered crowdsourcing.

  3. Recent Crowdsourced EM Work
Verifying predicted matches, e.g., [Demartini et al., WWW 12; Wang et al., VLDB 12, SIGMOD 13].
Finding the best questions to ask the crowd, to minimize the number of such questions, e.g., [Whang et al., VLDB 13].
Finding the best UI to pose questions: display 1 question per page, or 10, or some other number? Display record pairs or clusters? E.g., [Marcus et al., VLDB 11; Whang et al., TR 12].

  4. Recent Crowdsourced EM Work
Example: verifying predicted matches. Table A = {a, b, c}, Table B = {d, e}.
Blocking produces the candidate pairs (a,d), (b,e), (c,d), (c,e); a sample blocking rule: if prices differ by at least $50, do not match.
Matching predicts (a,d) Y, (b,e) N, (c,d) Y, (c,e) Y.
Verifying by the crowd confirms (a,d) Y and (c,e) Y as matches.
This shows that crowdsourced EM is highly promising, but it suffers from a major limitation: it crowdsources only parts of the workflow and needs a developer to execute the remaining parts.
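A minimal sketch of the sample blocking rule above, written as a predicate over a record pair (the dict-with-a-"price"-field record layout is assumed here purely for illustration):

```python
# Sketch of the slide's sample blocking rule: discard a candidate pair
# when the two prices differ by at least $50. The record layout is an
# illustrative assumption, not part of any real system's API.
def price_blocking_rule(a, b):
    """Return True if the pair (a, b) should be dropped before matching."""
    return abs(a["price"] - b["price"]) >= 50.0

# Example: a $30 flash drive and a $395 laptop never reach the matcher.
assert price_blocking_rule({"price": 30.0}, {"price": 395.0})
```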

  5. Need for Developer Poses Serious Problems
Does not scale to EM at enterprises: enterprises often have tens to hundreds of EM problems and cannot afford that many developers.
Example: matching products at WalmartLabs. [Figure: two product category taxonomies, one for walmart.com and one for Walmart Stores (brick & mortar), each rooted at "all" with subcategories such as clothes, electronics, and books, down to leaves such as shirts, pants, TVs, science, and romance.]
There are hundreds of major product categories; to obtain high accuracy, each category must be matched separately, so there are hundreds of EM problems, one per category.

  6. Need for Developer Poses Serious Problems
Cannot handle crowdsourcing for the masses: the masses can't be developers, and they can't use crowdsourcing startups either.
E.g., a journalist wants to match two long lists of political donors. They can't use current EM solutions, because they can't act as a developer. They can pay up to $500, but they can't ask a crowdsourcing startup to help: $500 is too little for the startup to engage a developer.
The same problem holds for domain scientists, small business workers, end users, data enthusiasts, and others.

  7. Our Solution: Hands-Off Crowdsourcing
Crowdsource the entire workflow of a task, requiring no developers.
Given a problem P supplied by user U, a crowdsourced solution to P is hands-off iff it uses no developers, only the crowd, and user U does little or no initial setup work, requiring no special skills.
Example: to match two tables A and B, user U supplies the two tables, a short textual instruction to the crowd on what it means for two records to match, and two negative and two positive examples to illustrate the instruction.
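As a concrete illustration, the user-supplied inputs above could be packaged roughly as follows. This is a minimal sketch; the field names and types are illustrative assumptions, not Corleone's actual interface.

```python
# Sketch of what a hands-off EM task specification might look like.
# All names here are illustrative assumptions, not Corleone's real API.
from dataclasses import dataclass
from typing import List, Tuple

Record = dict  # a tuple from table A or B, as attribute -> value

@dataclass
class HandsOffEMTask:
    table_a: List[Record]                            # first table to match
    table_b: List[Record]                            # second table to match
    instruction: str                                 # short text shown to crowd workers
    positive_examples: List[Tuple[Record, Record]]   # 2 pairs that do match
    negative_examples: List[Tuple[Record, Record]]   # 2 pairs that do not match

task = HandsOffEMTask(
    table_a=[{"name": "Transcend JetFlash 700", "price": 30.0}],
    table_b=[{"name": "Transcend 16 GB JetFlash 500", "price": 17.5}],
    instruction="Two records match if they describe the same product.",
    positive_examples=[],   # user supplies 2 matching pairs here
    negative_examples=[],   # and 2 non-matching pairs here
)
```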

  8. Hands-Off Crowdsourcing (HOC)
A next logical direction for EM research: from no crowdsourcing, to partial, to complete crowdsourcing.
Can scale up EM at enterprises. Can open up crowdsourcing for the masses.
E.g., a journalist who wants to match two lists of donors uploads the two lists to an HOC website and specifies a budget of $500 on a credit card; the HOC website uses the crowd to execute the EM workflow and returns the matches to the journalist.
There is very little work so far on crowdsourcing for the masses, even though that is where crowdsourcing can make a lot of impact.

  9. Our Solution: Corleone, an HOC System for EM
[Architecture figure: the user supplies tables A and B, instructions to the crowd, and four examples. Corleone's Blocker turns A and B into candidate tuple pairs, the Matcher produces predicted matches, the Accuracy Estimator produces accuracy estimates (P, R), and the Difficult Pairs' Locator identifies pairs the current matcher handles poorly. All components work with a crowd of workers, e.g., on Amazon Mechanical Turk.]
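The control flow suggested by the figure can be sketched as follows. Every function parameter below is a hypothetical placeholder for the components described on the later slides; this is a sketch of the data flow, not Corleone's actual code.

```python
# A minimal sketch of the pipeline implied by the architecture figure.
# The four component callables are hypothetical placeholders; each one is
# crowd-driven (workers on Amazon Mechanical Turk) rather than developer-driven.
def corleone(task, crowd, blocker, matcher, accuracy_estimator,
             difficult_pairs_locator, max_iterations=3):
    # Blocker: shrink A x B to a manageable candidate set using crowd-selected rules.
    candidates = blocker(task, crowd)

    matches, estimates = [], None
    for _ in range(max_iterations):
        # Matcher: learn a matcher with crowd-labeled pairs and apply it to the candidates.
        matches = matcher(candidates, task, crowd)
        # Accuracy Estimator: estimate precision and recall (P, R) via crowd labels.
        estimates = accuracy_estimator(matches, candidates, crowd)
        # Difficult Pairs' Locator: find pairs the current matcher handles poorly,
        # so the next iteration can focus crowd labeling effort on them.
        hard_pairs = difficult_pairs_locator(matches, candidates, crowd)
        if not hard_pairs:
            break
    return matches, estimates
```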

  10. Blocking
|A x B| is often very large (e.g., 10 billion pairs or more), so a developer writes blocking rules to remove obviously non-matching pairs. For example:
[for matching Citations] trigram(a.title, b.title) < 0.2
[for matching Products] overlap(a.brand, b.brand) = 0 AND cosine(a.title, b.title) <= 0.1 AND (a.price/b.price >= 3 OR b.price/a.price >= 3 OR isNULL(a.price, b.price))
Blocking is a critical step in EM. How do we get the crowd to do it? Ordinary workers can't write machine-readable rules, and if they write rules in English we can't reliably convert them into machine-readable form. Crowdsourced EM so far asks people to label examples; no work has asked people to write machine-readable rules.
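For concreteness, rules like these are just boolean predicates over a record pair. A minimal sketch in Python, where the similarity helpers are deliberately simplified stand-ins for real trigram and TF-IDF cosine similarity, and the attribute names follow the examples above:

```python
# Sketch of the example blocking rules above as predicates that return True
# when a pair should be dropped. The similarity measures are simplified
# stand-ins, not the functions Corleone actually uses.
def trigram_sim(s, t):
    grams = lambda x: {x[i:i + 3] for i in range(max(len(x) - 2, 1))}
    a, b = grams(s.lower()), grams(t.lower())
    return len(a & b) / len(a | b) if a | b else 0.0

def citations_rule(a, b):
    # [for matching Citations] drop pairs whose titles share almost no trigrams.
    return trigram_sim(a["title"], b["title"]) < 0.2

def products_rule(a, b):
    # [for matching Products] drop pairs with no brand overlap, very dissimilar
    # titles, and prices that are missing or differ by a factor of 3 or more.
    brand_overlap = len(set(a["brand"].lower().split()) &
                        set(b["brand"].lower().split()))
    titles_dissimilar = trigram_sim(a["title"], b["title"]) <= 0.1  # stand-in for cosine
    price_missing = a["price"] is None or b["price"] is None
    price_far_apart = (not price_missing and a["price"] > 0 and b["price"] > 0 and
                       (a["price"] / b["price"] >= 3 or b["price"] / a["price"] >= 3))
    return brand_overlap == 0 and titles_dissimilar and (price_far_apart or price_missing)
```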

  11. Our Key Idea
Ask people to label examples, as before. Use those examples to generate many machine-readable rules with machine learning, specifically a random forest. Then ask the crowd to evaluate, select, and apply the best rules.
This has proven highly promising: for example, it reduced the number of tuple pairs from 168M to 38.2K at a cost of $7.20, and from 56M to 173.4K at a cost of $22, with no developer involved. In some cases it did much better than using a developer (bigger reduction, higher accuracy).

  12. Blocking in Corleone
First decide whether blocking is necessary: if |A x B| is below a threshold, skip blocking and return A x B; otherwise do blocking.
Take a sample S from A x B, then train a random forest F on S (to match tuple pairs) using active learning, where the crowd labels pairs: start from the four examples supplied by the user (2 positive, 2 negative); train a random forest F; if the stopping criterion is satisfied, stop; otherwise select the q most informative unlabeled examples from S, have the crowd (on Amazon Mechanical Turk) label them, and retrain.
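The active-learning loop can be sketched roughly as follows using scikit-learn. The crowd_label helper, the precomputed feature matrix, and the uncertainty-based choice of "most informative" examples are assumptions standing in for Corleone's actual crowd interface and example-selection strategy:

```python
# A minimal sketch of active learning for the blocking forest, assuming
# precomputed feature vectors X_sample for the sampled pairs and a hypothetical
# crowd_label(i) helper that posts pair i to Mechanical Turk and returns 0/1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_blocking_forest(X_sample, seed_idx, seed_labels, crowd_label,
                          q=20, max_rounds=10):
    labeled_idx = list(seed_idx)      # starts with the user's 4 examples
    labels = list(seed_labels)        # 2 positive, 2 negative
    for _ in range(max_rounds):
        forest = RandomForestClassifier(n_estimators=10, random_state=0)
        forest.fit(X_sample[labeled_idx], labels)

        unlabeled = np.setdiff1d(np.arange(len(X_sample)), labeled_idx)
        if len(unlabeled) == 0:
            break
        # "Most informative" here = pairs the forest is least sure about
        # (predicted match probability closest to 0.5).
        proba = forest.predict_proba(X_sample[unlabeled])[:, 1]
        ask = unlabeled[np.argsort(np.abs(proba - 0.5))[:q]]

        new_labels = [crowd_label(i) for i in ask]   # crowd labels q pairs
        labeled_idx.extend(ask.tolist())
        labels.extend(new_labels)
        # A real stopping criterion (e.g., converging estimated accuracy) would go here.
    return forest
```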

  13. Blocking in Corleone
Extract candidate rules from the random forest F. Example random forest F for matching books (two trees):
Tree 1: if isbn_match = N then No; else if #pages_match = N then No, else Yes.
Tree 2: if title_match = N then No; else if publisher_match = N then No; else if year_match = N then No, else Yes.
Extracted candidate rules (each root-to-No-leaf path becomes a rule):
(isbn_match = N) => No
(isbn_match = Y) and (#pages_match = N) => No
(title_match = N) => No
(title_match = Y) and (publisher_match = N) => No
(title_match = Y) and (publisher_match = Y) and (year_match = N) => No
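With a scikit-learn forest like the one trained above, this extraction step can be sketched as follows. It is a minimal sketch that assumes class label 0 means "no match"; Corleone's own rule representation differs.

```python
# Sketch of extracting candidate blocking rules from a trained scikit-learn
# random forest: every root-to-leaf path that predicts "no match" becomes one
# conjunctive rule, as in the book-matching example above.
def extract_negative_rules(forest, feature_names, no_match_class=0):
    rules = []
    for estimator in forest.estimators_:
        tree = estimator.tree_
        def walk(node, conditions):
            if tree.children_left[node] == -1:            # leaf node
                predicted = tree.value[node][0].argmax()
                if estimator.classes_[predicted] == no_match_class:
                    rules.append(" and ".join(conditions) or "always")
                return
            name = feature_names[tree.feature[node]]
            thr = tree.threshold[node]
            walk(tree.children_left[node],  conditions + [f"({name} <= {thr:.2f})"])
            walk(tree.children_right[node], conditions + [f"({name} > {thr:.2f})"])
        walk(0, [])
    return rules
```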

  14. Blocking in Corleone
Evaluate the precision of the extracted candidate rules: for each rule R, apply R to predict match / no-match on the sample S, ask the crowd to evaluate R's predictions, and compute R's precision. Select the most precise rules as blocking rules. Apply the blocking rules to A and B using Hadoop, to obtain a smaller set of candidate pairs to be matched.
Blocking involves multiple difficult optimization problems, to minimize crowd effort and to scale to very large tables A and B; see the paper.
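A minimal sketch of this evaluate-select-apply step, assuming each rule is a predicate that returns True when a pair should be dropped, and that crowd answers for the sampled pairs are available in a dictionary keyed by pair. The 0.95 precision cutoff is an illustrative choice, not Corleone's:

```python
# Sketch of rule evaluation and selection on the crowd-labeled sample,
# followed by applying the selected rules to the full pair set.
def select_blocking_rules(rules, sample_pairs, crowd_is_match, min_precision=0.95):
    selected = []
    for rule in rules:
        dropped = [p for p in sample_pairs if rule(*p)]
        if not dropped:
            continue
        # Precision of the rule = fraction of dropped pairs that are truly non-matches,
        # according to the crowd's labels on the sample.
        correct = sum(1 for p in dropped if not crowd_is_match[p])
        if correct / len(dropped) >= min_precision:
            selected.append(rule)
    return selected

def apply_blocking(pairs, blocking_rules):
    # Keep only pairs that no selected rule drops. On real data this step runs
    # in parallel (Corleone uses Hadoop) rather than in a Python loop.
    return [p for p in pairs if not any(rule(*p) for rule in blocking_rules)]
```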

  15. The Rest of Corleone
[Architecture figure repeated: the crowd of workers (e.g., on Amazon Mechanical Turk) drives the Blocker, Matcher, Accuracy Estimator, and Difficult Pairs' Locator, taking the user's tables A and B, instructions to the crowd, and four examples, and producing predicted matches and accuracy estimates.]

  16. Empirical Evaluation
Datasets:
              Table A   Table B   |A x B|     |M|    # attributes   # features
Restaurants       533       331   176,423      112        4             12
Citations       2,616    64,263   168.1 M    5,347        4              7
Products        2,554    21,537     55 M     1,154        9             23
Mechanical Turk settings. Turker qualifications: at least 100 HITs completed, with a 95% approval rate. Payment: 1-2 cents per question. Each experiment was repeated three times on each data set, each run in a different week.

  17. Performance Comparison
Two traditional solutions, Baseline 1 and Baseline 2: a developer performs blocking, then supervised learning matches the candidate set. Baseline 1 labels the same number of pairs as Corleone; Baseline 2 labels 20% of the candidate set (for Products, Corleone labels 3,205 pairs while Baseline 2 labels 36,076). We also compare with results from published work.

  18. Performance Comparison
              Published works    Corleone                      Baseline 1           Baseline 2
Datasets      F1                 P     R     F1    Cost        P     R     F1       P     R     F1
Restaurants   92-97% [1,2]       97.0  96.1  96.5  $9.20       10.0  6.1   7.6      99.2  93.8  96.4
Citations     88-92% [2,3,4]     89.9  94.3  92.1  $69.50      90.4  84.3  87.1     93.0  91.1  92.0
Products      Not available      91.5  87.4  89.3  $256.80     92.9  26.6  40.5     95.0  54.8  69.5
[1] Wang et al., CrowdER: Crowdsourcing Entity Resolution, VLDB 2012.
[2] Köpcke et al., Frameworks for Entity Matching: A Comparison, Data Knowl. Eng., 2010.
[3] Köpcke et al., Evaluation of Entity Resolution Approaches on Real-World Match Problems, PVLDB 2010.
[4] Bellare et al., Active Sampling for Entity Matching, SIGKDD 2012.

  19. Blocking
              Cartesian product   Candidate set   Recall (%)   Total cost   Time
Restaurants        176.4K             176.4K          100          $0          -
Citations      168 million             38.2K           99          $7.20     6.2 hours
Products        56 million            173.4K           92          $22.00    2.7 hours
Comparison against blocking by a developer: Citations, 100% recall with a 202.5K-pair candidate set; Products, 90% recall with a 180.2K-pair candidate set.
See the paper for more experiments on blocking, the matcher, the accuracy estimator, the difficult pairs' locator, etc.

  20. Conclusion
Current crowdsourced EM often requires a developer, and the need for a developer poses serious problems: it does not scale to EM at enterprises, and it cannot handle crowdsourcing for the masses.
We proposed hands-off crowdsourcing (HOC): crowdsource the entire workflow, with no developer.
We developed Corleone, the first HOC system for EM. It is competitive with or outperforms current solutions, with no developer effort and relatively little money, and it is being transitioned into production at WalmartLabs.
Future directions: scaling up to very large data sets, and HOC for other tasks, e.g., joins in crowdsourced RDBMSs and information extraction.
