TopK Interesting Subgraph Discovery in Information Networks

Slide Note
Embed
Share

Discovering top-K interesting subgraphs in information networks is crucial for various applications like network bottlenecks, team selection, resource allocation, and more. This research focuses on developing low-cost indexes and novel algorithms to efficiently detect these subgraphs. The contributions include new indexes, a top-K algorithm, and validation on synthetic and real datasets.


Uploaded on Oct 09, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. TopK Interesting Subgraph Discovery in Information Networks Manish Gupta Jing Gao Xifeng Yan Hasan Cam Jiawei Han gmanish@microsoft.com 10/9/2024 1

  2. Real World Problems Network Bottlenecks Discovery Computer Networks Organization Networks Team Selection Interestingness = Highest Historical Compatibility Interestingness = Lowest Bandwidth Suspicious Relationships Discovery Battlefield Networks Resource Allocation Social Networks Interestingness = Highest Negative Association Strength of Attribute Values Interestingness = Lowest Distance between Entities gmanish@microsoft.com 2

  3. The Basic Underlying Problem Team Selection Network Bottlenecks Discovery Given Edge-weighted Typed Network G Typed Subgraph Query Q Edge Interestingness measure Interestingness = Lowest Bandwidth Interestingness = Highest Historical Compatibility Find Suspicious Relationships Discovery TopK matching subgraphs Resource Allocation Interestingness = Highest Negative Association Strength Interestingness = Lowest Distance gmanish@microsoft.com 3

  4. Nave Solution: Ranking After Matching Match Score ?2 2.2 12 13 0.2 Query Q Network G C C ?4 2.2 ?9 2.1 0.4 0.5 0.4 0.3 2 3 6 5 4 3 2 1 Ranking ?7 2.0 A A B A A A A B ?8 0.6 0.8 0.8 0.7 0.2 1.8 0.9 0.1 0.7 0.1 ?1 1.8 1 4 10 9 8 7 11 ?5 B 1.7 A C A A A B 0.3 0.6 0.5 0.2 ?6 1.6 6 5 Matching ?3 1.4 B A 4 3 2 ?? 10 9 8 7 0.6 4 A A 5 A A A B A A A 0.9 0.8 0.3 0.6 0.5 0.8 0.7 10 9 0.9 0.1 ?? 0.1 A A 9 7 ?? ?? 0.3 7 6 5 4 3 A 6 B B 5 B A A A 4 3 2 5 0.6 0.8 ?? 0.8 A ?? A B A A A ?? 0.8 0.7 0.6 0.9 4 3 2 1 9 8 7 0.7 0.9 ?? 9 8 A A B A ?? A A B 7 0.6 0.5 B A A 0.8 0.7 0.2 0.6 gmanish@microsoft.com 4

  5. Our Contributions New notion: TopK interesting subgraph detection in information networks Three new low-cost indexes Graph topology index Sorted edge lists Graph maximum metapath weight index Novel top-K algorithm to answer interestingness queries on large graphs Detailed effectiveness and efficiency validation on several synthetic and real datasets gmanish@microsoft.com 5

  6. Relationship with Previous Work Subgraph matching Approximate: fuzzy node/edge similarity Exact: Matching without ranking RDF graphs, probabilistic graphs, temporal graphs TopK querying on graphs H-hop aggregate queries Keyword queries on RDF graphs K most frequent patterns Twig queries gmanish@microsoft.com 6

  7. System Overview 2 Network G Breadth First Traversal from each Node up to Distance D Graph Topology Index Offline Index Construction Distance D Sort Edges 3 Graph Maximum MetaPath Weight Index 1 Sorted Edge Lists Find Candidate Nodes Query Q Candidate Nodes Top-K Computation Online Query Processing Top-K Subgraphs gmanish@microsoft.com 7

  8. G=(V,E), B=avg #neighbors, T=#types Index Structures 13 12 0.2 Network G C C AA BB CC AB AC BC (5,9):0.9 (12,13):0.2 (2,7): 0.7 (3,12): 0.5 (7,11): 0.2 0.4 0.5 0.4 0.3 6 5 4 3 2 1 (3,4):0.8 (5,6): 0.6 (4,12): 0.4 (1,11): 0.1 B A A A A B (4,5):0.8 (8,7): 0.5 (3,13): 0.4 0.6 0.8 0.8 0.7 0.2 (2,3):0.7 (2,1): 0.2 (2,13): 0.3 0.9 0.1 0.7 0.1 10 9 8 7 11 (8,9):0.6 (4,7): 0.1 C A A A B 0.3 0.6 0.5 0.2 (9,10):0.3 d Node Id 1 2 3 4 5 6 7 1 2 d Node Id 1 2 3 4 5 6 7 1 2 A B C AABACAABBBCBACBCCC Time Time Time Space Complexity Complexity Complexity Space Space A B C AA BA CA AB BB CB AC BC CC Index Index Index Complexity Complexity Complexity 1 1 2 2 2 1 3 1 1 2 1 1 1 1 2 3 2 3 1 1 1 0.2 0.70.70.3 1.5 1.2 0.7 0.8 0.5 1.6 0.80.10.4 1.7 0.8 0.9 1.4 0.1 0.9 0.9 0.3 0.5 2 2 1 2 1 1 2 2 1 1 1 Sorted edge lists edge lists edge lists Sorted Sorted 1.2 0.9 0.5 1.2 1.3 0.3 0.6 2 ? 2 ? 2 ? ?(|?|) ?(|?|) ?(|?|) ? |?|??? ? |?|??? ? |?|??? 2 2 2 ?(? + 1) ?(? + 1) ?(? + 1) 0.9 1.4 0.7 1 1 2 1 1 Graph topology index index Graph topology 1 ?( ? ??) ?( ? ??) ?( ? ??) ?( ? ??) 0.90.6 0.6 1.6 1.5 0.9 1.2 1 1 1 2 0.7 0.2 1.4 0.9 0.3 1 Graph max metapath weight index 8 1 1 2 2 1 8 0.60.5 1.5 1.2 0.7 9 3 1 2 ?( ? ??) ?( ? ??) 9 0.9 1.7 1.5 10 1 2 10 0.3 1.2 11 2 3 11 0.2 0.9 12 2 1 4 2 1 1 12 0.5 0.2 1.3 0.6 0.5 0.9 gmanish@microsoft.com 8

  9. Find Candidate Nodes Graph Topology Index Query Q Graph Topology Index Query Q ?1 ?1 ?2 ?2 ?3 ?3 ?4 ?4 2 2 2 2 2 2 1 1 d Node Id 1 2 3 4 5 6 7 2 3 1 2 3 3 3 3 3 3 6 6 A B C AABACAABBBCBACBCCC A A 4 4 4 4 4 4 7 7 1 1 2 2 2 1 3 1 1 2 1 1 1 1 2 3 2 3 1 1 1 2 2 1 2 1 1 2 2 1 1 1 1 4 5 5 5 5 5 5 2 2 2 B A 1 1 2 1 1 8 8 8 8 8 8 1 9 9 9 9 9 9 Query Topology 1 1 1 2 10 10 10 10 10 10 8 1 1 2 2 1 d Node Id 1 2 3 4 1 2 9 3 1 2 10 1 2 A B C AABACAABBBCBACBCCC 11 2 3 1 2 1 1 1 12 2 1 4 2 1 1 1 1 1 1 gmanish@microsoft.com 9

  10. Finding and Scoring Matches Key Idea Top-K Computation Query Q Start B A A A 2 3 Y More valid edges? Generate a Size-1 Candidate A A (5,9):0.9 (2,7): 0.7 N (3,4):0.8 (5,6): 0.6 Y 1 4 (4,5):0.8 (8,7): 0.5 Compute Actual and UB Score B A TopK Quit? (2,3):0.7 (2,1): 0.2 N (8,9):0.6 (4,7): 0.1 Y N Candidate Size==|Q|? N Grow Candidates (9,10):0.3 Y Y Compute Actual and UB Score Top-K Heap TopK Quit? ?1 Update Heap ?4 ?2 Compute Max UB Score N Y Done! TopK Quit? ?3 ?5 gmanish@microsoft.com 10

  11. Finding and Scoring Matches Generating Size-1 Candidates B A A A Size-1 Candidates 9 Query Q 2 9 3 5 5 5 9 (5,9):0.9 (2,7): 0.7 A A A A A A A A A A (3,4):0.8 (5,6): 0.6 (4,5):0.8 (8,7): 0.5 5 1 4 9 (2,3):0.7 (2,1): 0.2 B B A A B B B A A A (8,9):0.6 (4,7): 0.1 Multiple query edges of the same type Query Edge with both endpoints of same type (9,10):0.3 Candidate Growth 9 5 9 5 A A A A Order (5,9) (3,4) (4,5) (2,3) (2,7) Heapify? Prune? 9 5 8 8 6 A A Discard? Grow? B A Prune? B A 9 5 9 Grow? 5 B A A A A A Heapify? Prune? 10 6 10 Discard? Grow? B A B A gmanish@microsoft.com 11

  12. Finding and Scoring Matches Actual Score and Upper Bound Score 9 9 9 5 5 5 Candidate Growth A A A A A A Prune? Prune? Heapify? 6 8 8 Grow? Grow? Discard? B B B A A A 9 5 B A A A A A Actual Score= 0.9 (5,9):0.9 (2,7): 0.7 UB Score = 0.9+ UB(NonConsidered Edges) = 0.9+ (0.6+0.6) = 2.1 Partially grown candidate Prune if UBScore< min(heap) Grow otherwise Fully grown candidate Discard if UBScore< min(heap) Update heap otherwise B A (3,4):0.8 (5,6): 0.6 (4,5):0.8 (8,7): 0.5 (2,3):0.7 (2,1): 0.2 (8,9):0.6 (4,7): 0.1 (9,10):0.3 Useful Edge Lists gmanish@microsoft.com 12

  13. Finding and Scoring Matches Global Top-K Quit 12 13 B A A A 0.2 Network G C C (5,9):0.9 (2,7): 0.7 Query Q 2 0.4 0.5 0.4 0.3 6 5 4 3 2 1 (3,4):0.8 (5,6): 0.6 3 B A A A A B A A (4,5):0.8 (8,7): 0.5 0.6 0.8 0.8 0.7 0.2 0.9 0.1 0.7 (2,3):0.7 (2,1): 0.2 0.1 1 4 10 9 8 7 11 B A (8,9):0.6 (4,7): 0.1 C A A A B 0.3 0.6 0.5 0.2 (9,10):0.3 K=2 TopK Heap 0.7+0.6+0.7 = 2 <2.2 Stop (4,3,2,7): 2.2 (3,4,5,6): 2.2 gmanish@microsoft.com 13

  14. Faster Query Processing using Graph Maximum MetaPath Weight Index Slight complication 1 1 1 C 4 3 5 C C 4 3 5 A B C A B C 2 2 2 C C C Query ? 6 7 B C 1 Query ? Partial Instantiation UB Score = Actual Score(1-2) + UB(1-3) + UB(2-3) + UB(3-4) + UB(4-5) C 1 C 4 3 5 2 C A B C 4 B Partial Candidate 7 3 6 7 C A UB Score = Actual Score(1-2) + UB(1-3-4-5) + UB(2-3) B C 2 1 C C 4 3 5 Paths to cover Non-Considered Edges Edges to Consider Separately A B C 3 A Paths to cover Non-Considered Edges UB Score = Actual Score(1-2) + UB(1-3-4-5-7) + UB(2-3) + UB(4-6) +UB(6-7) 2 Using MMW Index! C gmanish@microsoft.com 14

  15. Faster Query Processing using Graph Maximum MetaPath Weight Index B A A A 5 A A (5,9):0.9 (2,7): 0.7 d Node Id 1 2 3 4 5 6 7 1 2 Prune? (3,4):0.8 (5,6): 0.6 A B C AA BA CA AB BB CB AC BC CC 9 (4,5):0.8 (8,7): 0.5 Grow? 0.2 0.70.70.3 1.5 1.2 0.7 0.8 0.5 1.6 0.80.10.4 1.7 0.8 0.9 1.4 0.1 0.9 0.9 0.3 0.5 B A 1.2 0.9 0.5 1.2 1.3 0.3 0.6 (2,3):0.7 (2,1): 0.2 0.9 1.4 0.7 (8,9):0.6 (4,7): 0.1 Edge-based UBScore 0.9+0.8+0.7 =2.4 > 2.0 0.90.6 0.6 1.6 1.5 0.9 1.2 (9,10):0.3 0.7 0.2 1.4 0.9 0.3 1 Grow 8 0.60.5 1.5 1.2 0.7 K=2 9 0.9 1.7 1.5 TopK Heap 10 0.3 1.2 11 0.2 0.9 Path-based UBScore 0.9+UB(5-A-B) =0.9+0.9 =1.8 < 2.0 (8,9,5,6): 2.1 (5,9,8,7): 2.0 12 0.5 0.2 1.3 0.6 0.5 0.9 Prune MMW Index gmanish@microsoft.com 15

  16. Discussions Queries with multiple edge semantics Directed graphs Homogeneous networks Weighted query edges Weights signify expected amount of interestingness Weights signify importance of query edge Faster computations versus index size gmanish@microsoft.com 16

  17. Low-cost Index Structures 10000 Topology+MMW (D=2) SPath (D=2) Sorted Edge Lists 1000 Time (sec) 100 10 |V| 1 100000000 Edge Lists Topology (D=2) Topology (D=3) MMW (D=2) MMW (D=3) SPath (D=2) SPath (D=3) Graph Size 1000 10000 100000 1000000 10000000 Index Size (KBs) 1000000 100000 10000 1000 100 10 |V| 1000 10000 100000 1000000 gmanish@microsoft.com 17

  18. Faster Query Execution |Q|=2 |Q|=3 |Q|=4 |Q|=5 144 8698 34639 174992 10 375 14689 229136 13 446 16754 200065 12 562 19088 201708 156 2277 17182 161533 11 346 13547 199617 |Q|=2 |Q|=3 |Q|=4 |Q|=5 245 2004 14628 169328 15 32 19 36 20 40 218 1733 18 34 RAM RWM0 RWM1 RWM2 RWM3 RWM4 RAM RWM0 RWM1 RWM2 RWM3 RWM4 43 98 442 2337 42 122 178 6887 3933 118 Query Execution Time (msec) for Clique Queries (Graph G2 and indexes with D=2) Query Execution Time (msec) for Path Queries (Graph G2 and indexes with D=2) |Q|=2 |Q|=3 |Q|=4 |Q|=5 158 3186 39294 469962 10 165 12 195 12 212 111 1486 12 165 RAM: Ranking After Matching baseline RWM0: without using the candidate node filtering RWM1: without using the MMW index RWM2: same as RWM1 without the pruning any partially grown candidates RWM3: same as RWM1 without the global top-K quit check RWM4: same as RWM1 with the MMW index RAM RWM0 RWM1 RWM2 RWM3 RWM4 824 1022 3135 27363 3978 791 4660 5891 9972 4518 Query Execution Time (msec) for Subgraph Queries (Graph G2 and indexes with D=2) gmanish@microsoft.com 18

  19. Good Scalability thanks to Effective Good Scalability Pruning QuerySize Graph |V|=1e+3 |V|=1e+4 |V|=1e+5 |V|=1e+6 |Q|=2 |Q|=3 |Q|=4 |Q|=5 |Q|=6 |Q|=7 5 10 52 362 18 90 396 4907 28600 184523 1216893 9786327 77 407 2794 18412 131256 1006773 382 2267 1870 12366 7656 87657 Running time (msec) for different Query Sizes and Graph Sizes (D=2) 10000 Average Query Execution |Q|=2 |Q|=3 |Q|=4 |Q|=5 7.86 28.28 18.31 7.94 #Candidates of Size 2 9.54 #Candidates of Size 3 #Candidates of Size 4 #Candidates of Size 5 4.38 1.63 K=10 K=20 K=50 K=100 24.42 25.5 1000 13.61 Time (msec) 100 Number of Candidates as Percentage of Total Matches for Different Query Sizes and Candidate Sizes 10 1 |Q|=2 |Q|=3 Size of the Query |Q|=4 |Q|=5 Query Execution Time for Different Values of K gmanish@microsoft.com 19

  20. Real Dataset Case Studies Dataset DBLP Wikipedia #Nodes 138K 670K 2 2 4 1 1 Author Conf Author Conf Keyword #Edges 1.6M 4.1M 3 3 #Types 3 10 Author Author Edge List Index Size 50 MB 261 MB Q1 Q2 Topology Index Size 5.8 MB 148 MB 2 2 4 1 1 Person Film Person Company Settlement MMW Index Size 11.4 MB 249 MB 3 3 SPath Index Size 4.3 GB 13.7 GB Person Person Topology+MMW Construction Time 513 minutes 1203 minutes Q3 Q4 Avg Query Time 100 sec 42 sec gmanish@microsoft.com 20

  21. Real Dataset Case Studies DBLP 1: Rohit Gupta, 2: BICoB, 3: Vipin Kumar Rohit Gupta -- computer networking Vipin Kumar -- Data and Information Systems BICoB -- International Conference on Bioinformatics and Computational Biology 1: Jimeng Sun, 2: Operating Systems Review (SIGOPS), 3: Christos Faloutsos, 4: mining Jimeng Sun and Christos Faloutsos -- Data and Information Systems, Artificial intelligence, and Computational biology "mining" -- Data and Information Systems "Operating Systems Review (SIGOPS)" -- Operating systems, Computer architecture, Computer networking gmanish@microsoft.com 21

  22. Real Dataset Case Studies Wikipedia 1: Stacy Keach, 2: The Biggest Battle, 3: John Huston Stacy Keach and John Huston starred in the movie The Biggest Battle Stacy Keach (American), John Huston (American), movie is Italian Stacy (narration, comedy, music), John (drama, documentary, adventure), movie (war) 1: Medha Patkar, 2: BBC, 3: Felix D Alviella, 4: Mogilino Medha Patkar -- Indian social activist -- won Best International Political Campaigner by BBC Felix D Alviella -- Belgian actor in the BBC soap opera Doctors Mogilino -- village in Bulgaria -- BBC showed the popular film "Bulgaria s Abandoned Children" in 2007 British company rewarding an Indian woman, covering a place in Bulgaria or linked to a person from Belgium is rare gmanish@microsoft.com 22

  23. Related Work (1) Theory literature on subgraph isomorphism [Cordella et al., 2004; McKay, 1981; Ullmann, 1976] Exact subgraph matching [Cheng et al., 2008; He and Singh, 2008; Sun et al., 2012; Zhang et al., 2007; Zhang et al., 2009; Zhao and Han, 2010; Zou et al., 2009] Approximate subgraph matching [Zou et al., 2007; Zeng et al., 2012; Tian et al., 2007; Zhang et al., 2010] gmanish@microsoft.com 23

  24. Related Work (2) Matching in graph databases [Ranu and Singh, 2009; Yan et al., 2005; Zhu et al., 2012] Matching for RDF graphs [Liu et al., 2012], probabilistic graphs [Yuan et al., 2012] and temporal graphs [Bogdanov et al., 2011] Top-K queries h-hop aggregate queries [Yan et al., 2010] K most frequent patterns [Yang et al., 2012; Zhu et al., 2011] Top-K keyword queries on RDF graphs [Tran et al., 2009] Top-K similarity queries [Zou et al., 2007] Twig queries [Gou and Chirkova, 2008] gmanish@microsoft.com 24

  25. Conclusion Given Typed unweighted query A heterogeneous edge-weighted information network Edge interestingness measure Find Top-K interesting subgraphs Investigated ranking after matching baseline Proposed three new graph indexes and exploited them for building a top-K solution Showed efficiency, scalability and effectiveness on multiple synthetic and real datasets gmanish@microsoft.com 25

  26. Thanks! gmanish@microsoft.com 26

Related


More Related Content