
LCJoin: Set Containment Join via List Crosscutting
Explore the concept of LCJoin, a method for performing set containment join via list crosscutting. The content covers problem definitions, existing solutions like Intersection-Oriented and Union-Oriented methods, and motivations behind different approaches. Dive into experimental results, conclusion, and examples of 4-bit signatures for better understanding. Discover how LCJoin leverages list cross-cutting to efficiently handle set containment joins.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
LCJoin: Set Containment Join via List Crosscutting Dong Deng, Chengcheng Yang, Shuo Shang, Fan Zhu, Li Liu, Ling Shao Inception Institute of Artificial Intelligence
Contents Set Containment Join Existing Solutions LCJoin Experimental Results Conclusion
Problem Definition Set Containment Join (??,??),(??,??) Find All Pairs Such That ? ? ? Rid Sid ?1 {?1,?2,?4,?5,?6} ?1 {?1,?2,?3,?4} ?2 {?2,?4,?5} ?2 {?2,?3,?5} ?3 {?1,?2,?3,?4,?6} ?3 {?1,?2,?5,?6} Job Advertisements Job Seeker Publish/Subscribe
Existing Solutions Set Containment Join (SCJ) SCJ Intersection- Oriented Method Union-Oriented Method Basic Idea 1 1 Generate Signature ???(?)/ ???(?) for Each Set in and ? Build an Inverted Index for ? 2 Generate Candidate Pairs (R, S) Where ???(?) ???(?) 2 For Each R in , Intersect all Inverted Lists Corresponding to Elements in R 3 Verify Candidates
Motivation Existing Solutions Union-Oriented Methods Example: 4-bit Signatures for R and S. To set a bit, i mod 4 for each ?? ? Rid Sid Sig Sig ?1 {?1,?2,?4,?5,?6} 1110 1111 ?1 {?1,?2,?3,?4} ?2 {?2,?4,?5} 1110 0111 ?2 {?2,?3,?5} ?3 {?1,?2,?3,?4,?6} 1111 ?3 {?1,?2,?5,?6} 0110 Enumerate & Lookup Candidates: (??,??), (??,??), (??,??) Verification Result: (??,??) Large Signature Size High Enumeration Cost Signature Size? Small Signature Size Many Candidates. High Verification Cost
Existing Solutions Intersection-Oriented Methods ? Rid Sid ?1 {?1,?2,?4,?5,?6} ?1 {?1,?2,?3,?4} ?2 {?2,?4,?5} ?2 {?2,?3,?5} ?3 {?1,?2,?3,?4,?6} Build Inverted Index ??(??): ?3 {?1,?2,?5,?6} ??, ?? Result of ??= ??(??) ??(??) ??(??) ???? = ?? ??(??): ??, ??, ?? Result of ??= ??(??): ?? Result of ??= ?? ??(??): ??, ??, ?? ??(??): ??, ?? Verification Free ??(??): ??, ??
LCJoin List Cross-Cutting Based Set Containment Join ?1 ?2 ?3 ?4 ??(?1) ?3 ?4 ?5 ?6 ?7 ?1 ?2 ?3 ?4 ??(?1) ?3 ?4 ?5 ?6 ?7 ? ? ??(?2) ??(?3) ??(?4) ??(?2) ??(?3) ??(?4) ?1 ?2 ?3 ?7 ?1 ?2 ?3 ?5 ?6 ?7 ?1 ?3 ?4 ?5 ?6 ?1 ?2 ?3 ?7 ?1 ?2 ?3 ?5 ?6 ?7 ?1 ?3 ?4 ?5 ?6 Cross-Cutting Rip-Cutting ??(?) ??(?) GAP ??(?) ??(?) GAP GAP
LCJoin List Cross-Cutting Based Set Containment Join Example s1s10s13s14s18s25s27s30 ??(?1) s1s13s17s20s21s23s25 ??(?2) s7s12s13s16s26s29 ??(?3) s1s3s13s15s19 ??(?4) MaxSid
LCJoin List Cross-Cutting Based Set Containment Join Early Termination s1s10s13s14s18s25s27s30 ??(?1) s1s13s17s20s21s23s25 ??(?2) s7s12s13s16s26s29 ??(?3) s1s3s13s15s19 ??(?4) MaxSid
LCJoin List Cross-Cutting Based Set Containment Join Two Optimizations Design a prefix-tree-based method to share the computation between sets in Partition the data to further improve efficiency
LCJoin Prefix-Tree Based Method PostOrder-Traverse root ?(?1) ?(?2) ?(?3) ??????? = ?? ??exists on the nodes from root to leaf ??. result pair (??,??) ?1 ?????? = ?? ??????? = ?1 ?3 ?4 ?5 ?6 ?7 ?1 ?2 ?3 ?5 ?6 ?7 ?1 ?2 ?3 ?7 ??????? = ?? ?2 ?????? = ?? ??????? = ?1 ??????? = ?? ??????? = ?? ?(?4) ?(?5) ?(?6) ?3 ?5 ?????? = ?? ??????? = ? ?????? = ?? ??????? = ?1 ?1 ?3 ?4 ?5 ?6 ?1 ?2 ?4 ?5 ?1 ?3 ?4 ?5 ?6 ?7 ??????? = ?? ??????? = ?? ?6 ?4 ?????? = ?? ??????? = ?? ?????? = ?? ??????? = ?1 ?1 ?2 use the max NextMax on the path from root to leaf as the next MaxSid
SAL-Hashing LCJoin Data Partition ???? ?(?2) ?(?3) ?(?1) ?(?2) ?(?3) ?3 ?7 ?1 ?2 ?3 ?7 ?3 ?4 ?5 ?6 ?7 ?1 ?2 ?3 ?5 ?6 ?7 ?1 ?2 ?3 ?7 ?1 ?2 Use ??,??,??,?? ?3 ?2 ?(?4) ?(?5) ?(?6) Build Local Inverted Index ?1 ?3 ?1 ?2 ?1 ?3 ?7 ?(?4) ?(?5) ?(?6) ?3 ?5 ?5 ?1 ?3 ?4 ?5 ?6 ?1 ?2 ?4 ?5 ?1 ?3 ?4 ?5 ?6 ?7 Shorter Inverted Lists ?6 ?3 ?4 ?1 ?2
SAL-Hashing LCJoin Data Partition ???? Benefit Shorter Inverted List for Each Node on This Partition Cost ?(?1) ?(?2) ?(?3) Build Local Inverted Index ?3 ?4 ?5 ?6 ?7 ?1 ?2 ?3 ?5 ?6 ?7 ?1 ?2 ?3 ?7 ?1 ?2 ?3 ?2 ?(?4) ?(?5) ?(?6) ?3 ?5 ?5 ?1 ?3 ?4 ?5 ?6 ?1 ?2 ?4 ?5 ?1 ?3 ?4 ?5 ?6 ?7 Build Local Inverted Index ?6 ?3 ?4 Use Original Built Inverted Index ?1 ?2
Experimental Results Experimental Setup Ubuntu 16.04 Intel Xeon Gold-6148 Processor 64G DRAM Characteristics of Our Datasets Dataset # of Sets Min/Max/Avg Size # of Elements Z-Value FLICKR 3,546,729 1 / 1230 / 5.4 618,971 0.63 AOL 36,389,577 1 / 125 / 2.5 3,849,556 0.68 ORKUT 15,301,901 2 / 9120 / 7 2,322,299 0.13 TWITTER 28,819,434 2 / 4998 / 9 13,096,918 0.3
Experimental Results Evaluation of the tree-based methods FLICKR TWITTER AOL Tree Based Method Can Share Computations in Common Prefixes of Sets in
Experimental Results Evaluation of data partition methods AOL TWITTER AOL Dynamic Partition achieved the best performance
Experimental Results Comparing with State-of-the-Art ? ?? TWITTER AOL LCJoin Outperforms the State-of-the-Art Method by Up to ??
Conclusion A cross-cutting based list intersection method, which can skip many irrelevant entries. A A A prefix tree on to share the computation across sets. Data partitions to further improve the efficiency and scalability. B B LCJoin C C Experimenting to demonstrate the superiority of LCJoin D D