Challenges in Querying Graph Streams for Friendly Compression
Graph streams pose challenges in querying due to trade-offs among space, accuracy, and efficiency. The need to balance space and accuracy while maintaining throughput presents obstacles in constructing summaries and incorporating incremental updates. Additional challenges include query expressibility and preserving connectivity information of heavy-hitter edges.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Query Query- -Friendly Compression Friendly Compression of Streams of Graph Graph Streams Arijit Khan Charu C. Aggarwal Nanyang Technical University Singapore IBM T. J. Watson Research Lab NY, USA
Graph Streams Graph Stream: Continuous stream of graph edges Telephone network, communication network, social media data, IP traffic e1 e2 e5 e4 e3 . e2 e1 e5 e9 e11 e6 e8 e11 e9e10 e15 e13 Edge Stream e12 e16 Graph Structure With Edge Frequency 1/23 A. Khan, C. Aggarwal
Graph Streams Graph Stream: Continuous stream of graph edges Telephone network, communication network, social media data, IP traffic e1 e2 e5 e4 e3 . e2 e1 e5 e9 e11 e6 e8 e11 e9e10 e15 e13 Edge Stream e12 e16 Graph Structure With Edge Frequency Massive volume and high speed Construct summary to support future queries 1/23
Challenges in Data Streams Querying Trade-off among Space, Accuracy, and Efficiency: -- Increasing space increases accuracy, but reduces throughput Other requirements: -- Build summary in one pass over the stream -- Incremental updates in summary 2/23 A. Khan, C. Aggarwal
Additional Challenges in Graph Streams Querying: Query Expressibility Compute reachability formed by heavy-hitter edges e1 e2 e5 e4 e3 . e2 e1 e5 e9 e11 e6 e8 e11 e9e10 e15 e13 Edge Stream e12 e16 Graph Data: Red Edges are heavy-hitter edges 3/23 A. Khan, C. Aggarwal
Additional Challenges in Graph Streams Querying: Query Expressibility Compute reachability formed by heavy-hitter edges V1 e1 e2 e5 e4 e3 . e2 e1 e5 e9 e11 e6 e8 e11 e9e10 V2 e15 e13 Edge Stream e12 e16 Graph Data: Red Edges are heavy-hitter edges 3/23 A. Khan, C. Aggarwal
Additional Challenges in Graph Streams Querying: Query Expressibility Compute reachability formed by heavy-hitter edges V1 e1 e2 e5 e4 e3 . e2 e1 e5 e9 e11 e6 e8 e11 e9e10 V2 e15 e13 Edge Stream e12 e16 Graph Data: Red Edges are heavy-hitter edges Need to preserve connectivity information of the edges in the graph data 3/23
Related Work Graph Summarization: - Query Preserving Graph Compression (SIGMOD 2012) - Graph Summarization with Bounded Error (SIGMOD 2008) - Representing Web Graphs (ICDE 2003) - The Transitive Reduction of a Directed Graph (SIGCOMP 1972) Data Stream Summarization: - Sketches (SIGMOD 2002, VLDB 2002, SIGMOD 2004) - Histograms (SIGMOD 1996, VLDB 1998) - Wavelets (SIAM Rev. 1996) - Space Saving (ICDT 2005) Graph Streams Querying: - gSketches (VLDB 2012) - Analyzing Graph Structure via Linear Measurements (SODA 2012) - Graph Sketches: Sparsification, Spanners, and Subgraphs (PODS 2012) - TCM Sketch (SIGMOD 2016) 4/23
Related Work Graph Summarization: - Query Preserving Graph Compression (SIGMOD 2012) - Graph Summarization with Bounded Error (SIGMOD 2008) - Representing Web Graphs (ICDE 2003) - The Transitive Reduction of a Directed Graph (SIGCOMP 1972) Data Stream Summarization: - Sketches (SIGMOD 2002, VLDB 2002, SIGMOD 2004) - Histograms (SIGMOD 1996) - Wavelets (SIAM Rev. 1996) - Space Saving (ICDT 2005) Graph Streams Querying: - gSketches (VLDB 2012) - Analyzing Graph Structure via Linear Measurements (SODA 2012) - Graph Sketches: Sparsification, Spanners, and Subgraphs (PODS 2012) - TCM Sketch (SIGMOD 2016) 5/23
Related Work TCM Sketch (SIGMOD 2016): - Does not provide theoretical error bounds - Difficult to answer reachability over heavy-hitter edges 6/23 A. Khan, C. Aggarwal
Count-Min Sketch h + f H1(e) ( e, f ) + f w Hw(e) + f h much smaller than total no of edges Estimate frequency of an edge, find heavy-hitter edges Cannot answer structural queries: are these two nodes connected by only high-frequency edges? 7/23
Our Solution: GMatrix Synopsis H4(.) incoming edge: e = (i,j) H3(.) w h much smaller than total no of nodes H2(.) H1(.) h k-th Hash Function hashes into ( Hk(i), Hk(j)) h (H1(i), H1(j)) 8/23 A. Khan, C. Aggarwal
GMatrix Compression Contract nodes into a total of h super-nodes Different hash functions create different contractions Holds key to effective query processing A graph with 108 nodes, 1010 edges Storage 40 GB GMatrix with h = 103 and w = 10 Storage 40 MB 9/23 A. Khan, C. Aggarwal
Choice of Hash Functions Pair-wise independent, e.g., modular hash function ( ( ) ) = + ( ) mod mod H i ai b P h P is a prime number larger than any node id: (1, 2, , n) a, b chosen uniformly from (1, P-1) 10/23 A. Khan, C. Aggarwal
Reverse Hash Mapping = = 1 ( ) : ( ) H p i H i p 7x mod 9 = 1 x= 4 7*4 = 3*9 + 1 Reverse hash mapping small size and computed efficiently Modular hash function: reverse hash mapping size P/h Can be computed in time O( P/h log P) using extended Euclidean algorithm 11/23 A. Khan, C. Aggarwal
Other Synopsis Options with Same Functionality as GMatrix h2 + f H1(ij) ( ij, f ) + f w Hw(ij) + f Reverse hash mapping computes w . n2/h2 intersections In GMatrix, reverse hash mapping computes 2. w . n/h intersections 12/23 A. Khan, C. Aggarwal
Queries supported by GMatrix (not a comprehensive list) Edge Frequency Query Heavy-hitter Edge Query Node Frequency Query Sub-graph Edge Frequency Query Heavy-hitter Node Query Reachability Query over High-frequency Edges 13/23 A. Khan, C. Aggarwal
Queries supported by GMatrix (not a comprehensive list) Edge Frequency Query Heavy-hitter Edge Query Node Frequency Query Sub-graph Edge Frequency Query Heavy-hitter Node Query Reachability Query over High-frequency Edges 13/23
Edge-Frequency Estimation Query For edge (i, j), compute the frequencies of w different cells: (Hk(i), Hk(j), k) The minimum of these values is returned as the estimated frequency Estimation is good for high-frequency edges If true frequency is significant fraction of total stream size, then relative error is small + . L , ( i ) , ( i ) , ( i ) Q j Q j Q j 14/23 A. Khan, C. Aggarwal
Heavy-Hitter Edge Query Find all edges with frequency greater than F No false negative, but false positive Find all hash-edges with frequency at least F ( p H V k ), ( ) ( ), H q k F k Reverse hash mapping to find real edges : ) , ( j i E k = ) q 1 1 ( ), ( i H p j H k k Intersection of edge sets w k = E k 1 15/23 A. Khan, C. Aggarwal
Heavy-Hitter Edge Query: Optimization First Optimization If a node does not appear as the source node of some potential frequent edge in at least one of the w hash functions, that node and its outgoing edges can be safely eliminated. Second Optimization , ( = ( , ) : ) j , S Q Q i i Q j Q 1 2 1 2 = ( , ) ( , ) ( , ) S Q Q S P P S Q P Q P 1 2 1 2 1 1 2 2 16/23 A. Khan, C. Aggarwal
Heavy-Hitter Edge Query: Time Complexity w P = k + 2 2 log O h w P k c h 1 17/23 A. Khan, C. Aggarwal
Reachability Query Find if two query nodes are connected by a path with edges having frequency at least F Determine all edges for which frequency is at least F using heavy- hitter edge query Answer reachability query with these edges 18/23 A. Khan, C. Aggarwal
Experimental Results #Nodes #Edges Agg. Edge Freq. Max. Edge Freq. 4.43 108 Flat Stream Size Compressed Stream Size 16.47 GB 2.37 GB 250 MB Skew 1.0 66M 3612M 1010 80GB Skew 1.2 1.81 109 3.22 109 Skew 1.4 Friendster Stream (Zipf Frequency Distribution with Varying Skew) GMatrix Size GMatrix Update Time 40MB (h=1000, w=10) 10-6 sec Experiments were performed on a single core of 10GB, 2.4GHz Xeon server 19/23 A. Khan, C. Aggarwal
Edge Frequency Estimation Query 8.0E-04 5 Stream Update Time Observed Error 4 (micro seconds) 6.0E-04 3 GMatrix 4.0E-04 2 2.0E-04 1 Count-Min Sketch 0 1.0E-07 3 # Hash Functions (w) 5 10 1 Skew (Zipf) 1.2 1.4 Query i | Est Freq True Freq | i i = Observed Error Query over top-500 frequent edges Query i True Freq i 20/23 A. Khan, C. Aggarwal
Heavy Hitter Edge Query 10 Frequency Threshold = 0.01% of Total Stream Size False Positive Rate 5 GMatrix Count Min Sketch 0 1 1.2 1.4 Skew (ZipF) # Edges incorrectl reported y heavy hitter = False Positive Rate # True heavy hitter edges Frequency Threshold (% of Total Stream Size) 1 0.1 0.01 GMatrix Count-Min Sketch 28 sec 149 sec 771 sec 1 sec 2 sec 7 sec Query Answering Time 21/23
Reachability Query Skew (ZipF) 1.0 1.2 1.4 Reachability Error 0.012 0.008 0.004 Frequency Threshold = 0.01% of Total Stream Size # incorrectl pairs Node reported y reachable = Reachabili Error ty # Node pairs Each reachability query can be processed in 0.1 sec 22/23 A. Khan, C. Aggarwal
Conclusions GMatrix synopsis for summarizing rapid graph streams Can be leveraged for a variety of frequency and structural queries Future Work: Improving accuracy by hashing high- and low-frequency edges separately? 23/23 A. Khan, C. Aggarwal