Challenges in Querying Graph Streams for Friendly Compression

undefined
Query-Friendly Compression of Graph
Streams
 
Graph Streams
1
/23
Graph Stream:
 Continuous stream of graph edges 
 
Telephone
network, communication network, social media data, IP traffic
e
2
e
1
e
5
e
9
e
11
e
1
e
2
e
3
e
4
e
5
e
6
e
8
e
9
e
10
e
11
e
12
e
13
e
15
e
16
Edge Stream
Graph Structure
With Edge Frequency
….
A.
 Khan
, C. Aggarwal
 
Graph Streams
1
/23
Graph Stream:
 Continuous stream of graph edges 
 
Telephone
network, communication network, social media data, IP traffic
Massive volume and high speed
Construct summary to support future queries
e
2
e
1
e
5
e
9
e
11
e
1
e
2
e
3
e
4
e
5
e
6
e
8
e
9
e
10
e
11
e
12
e
13
e
15
e
16
Edge Stream
Graph Structure
With Edge Frequency
….
Challenges in Data Streams
Querying
2
/23
Trade-off among Space, Accuracy, and Efficiency:
       -- Increasing space increases accuracy, but reduces throughput
Other requirements:
       -- Build summary in one pass over the stream
       --  Incremental updates in summary
A.
 Khan
, C. Aggarwal
Additional Challenges in Graph Streams
Querying: Query Expressibility
3
/23
       
Compute reachability formed by heavy-hitter edges
e
2
e
1
e
5
e
9
e
11
e
1
e
2
e
3
e
4
e
5
e
6
e
8
e
9
e
10
e
11
e
12
e
13
e
15
e
16
Edge Stream
Graph Data:
Red Edges are heavy-hitter edges
….
A.
 Khan
, C. Aggarwal
Additional Challenges in Graph Streams
Querying: Query Expressibility
3
/23
       
Compute reachability formed by heavy-hitter edges
e
2
e
1
e
5
e
9
e
11
e
1
e
2
e
3
e
4
e
5
e
6
e
8
e
9
e
10
e
11
e
12
e
13
e
15
e
16
Edge Stream
Graph Data:
Red Edges are heavy-hitter edges
….
V
1
V
2
A.
 Khan
, C. Aggarwal
Additional Challenges in Graph Streams
Querying: Query Expressibility
3
/23
       
Compute reachability formed by heavy-hitter edges
e
2
e
1
e
5
e
9
e
11
e
1
e
2
e
3
e
4
e
5
e
6
e
8
e
9
e
10
e
11
e
12
e
13
e
15
e
16
Edge Stream
Graph Data:
Red Edges are heavy-hitter edges
….
V
1
V
2
Related Work
4
/23
Graph Summarization:
              
- Query Preserving Graph Compression (SIGMOD 2012)
               
- Graph Summarization with Bounded Error (SIGMOD 2008)
               
- Representing Web Graphs (ICDE 2003)
               - The Transitive Reduction of a Directed Graph (SIGCOMP 1972)
Data Stream Summarization:
               - Sketches (SIGMOD 2002, VLDB 2002, SIGMOD 2004)
               - Histograms (SIGMOD 1996, VLDB 1998)
               - Wavelets (SIAM Rev. 1996)
               - Space Saving (ICDT 2005)
Graph Streams Querying: 
               - gSketches (VLDB 2012)
               - Analyzing Graph Structure via Linear Measurements (SODA 2012)
               
- Graph Sketches: Sparsification, Spanners, and Subgraphs (PODS 2012)
               -
 
TCM Sketch (SIGMOD 2016)
Related Work
5
/23
Graph Summarization:
              
- Query Preserving Graph Compression (SIGMOD 2012)
               
- Graph Summarization with Bounded Error (SIGMOD 2008)
               
- Representing Web Graphs (ICDE 2003)
               - The Transitive Reduction of a Directed Graph (SIGCOMP 1972)
Data Stream Summarization:
               
- Sketches (SIGMOD 2002, VLDB 2002, SIGMOD 2004)
               - Histograms (SIGMOD 1996)
               - Wavelets (SIAM Rev. 1996)
               - Space Saving (ICDT 2005)
Graph Streams Querying: 
               
- gSketches (VLDB 2012)
               - Analyzing Graph Structure via Linear Measurements (SODA 2012)
               
- Graph Sketches: Sparsification, Spanners, and Subgraphs (PODS 2012)
               - 
TCM Sketch (SIGMOD 2016)
Related Work
6
/23
TCM Sketch (SIGMOD 2016):
              
- Does not provide theoretical error bounds
               
               - Difficult to answer reachability over heavy-hitter edges
A.
 Khan
, C. Aggarwal
Count-Min Sketch
h
w
( e, f ) 
+ f
+ f
+ f
H
1
(e)
H
w
(e)
“h”  much smaller than total no of edges
Estimate frequency of an edge, find heavy-hitter edges
Cannot answer structural queries: are these two nodes
connected by only high-frequency edges?
7
/23
Our Solution: GMatrix Synopsis
h
h
 k-th Hash Function
hashes into
( H
k
(i), H
k
(j))
w
(H
1
(i), H
1
(j))
H
1
(
.
)
H
3
(
.
)
H
4
(
.
)
H
2
(
.
)
incoming edge:  e = (i,j)
 
“h”  much smaller than 
         total no of nodes
 
8/23
A.
 Khan
, C. Aggarwal
GMatrix Compression
       
Contract nodes into a total of h super-nodes
Different hash functions create different contractions 
 Holds key to
effective query processing
A graph with 10
8 
nodes, 10
10
 edges 
 Storage  40 GB
 
GMatrix with h = 10
3
 and w = 10 
 Storage  40 MB
9
/23
A.
 Khan
, C. Aggarwal
Choice of Hash Functions
10
/23
       
Pair-wise independent, e.g., modular hash function
P is a prime number larger than any node id:  (1, 2, … , n)
a, b chosen uniformly from (1, P-1)
A.
 Khan
, C. Aggarwal
Reverse Hash Mapping
11
/23
Modular hash function:    reverse hash mapping size ⌊P/h⌋
Reverse hash mapping 
 small size and computed efficiently
7
x
  mod 9 = 1
x
= 4
7*4 = 3*9 + 1
Can be computed in time O(⌊P/h⌋  log P) using extended Euclidean
algorithm
A.
 Khan
, C. Aggarwal
Other Synopsis Options with Same
Functionality as GMatrix
12
/23
Reverse hash mapping computes   w . n
2
/h
2
  intersections
h
2
w
( ij, f ) 
+ f
+ f
+ f
H
1
(ij)
H
w
(ij)
In GMatrix, reverse hash mapping computes  2. w . n/h  intersections
A.
 Khan
, C. Aggarwal
Queries supported by GMatrix
(not a comprehensive list)
13
/23
       
Edge Frequency Query
Heavy-hitter Edge Query
Node Frequency Query
Sub-graph Edge Frequency Query
Heavy-hitter Node Query
Reachability Query over High-frequency Edges
A.
 Khan
, C. Aggarwal
Queries supported by GMatrix
(not a comprehensive list)
13
/23
       
Edge Frequency Query
Heavy-hitter Edge Query
Node Frequency Query
Sub-graph Edge Frequency Query
Heavy-hitter Node Query
Reachability Query over High-frequency Edges
Edge-Frequency Estimation Query
14
/23
       
For edge (i, j), compute the frequencies of w different cells: (H
k
(i), H
k
(j), k)
The minimum of these values is returned as the estimated frequency
Estimation is good for high-frequency edges
If true frequency is significant fraction of total stream size, then relative
error is small
A.
 Khan
, C. Aggarwal
Heavy-Hitter Edge Query
15
/23
       
Find all edges with frequency greater than F
No false negative, but false positive
Find all hash-edges with frequency at least F
Reverse hash mapping to find real edges
Intersection of edge sets
A.
 Khan
, C. Aggarwal
Heavy-Hitter Edge Query: Optimization
16
/23
       
If a node does not appear as the source node of some potential
frequent edge in at least one of the w hash functions, that node and
its outgoing edges can be safely eliminated.
First Optimization
Second
 Optimization
A.
 Khan
, C. Aggarwal
Heavy-Hitter Edge Query: Time Complexity
17
/23
A.
 Khan
, C. Aggarwal
Reachability Query
18
/23
       
Find if two query nodes are connected by a path with edges
having frequency at least F
Determine all edges for which frequency is at least F using heavy-
hitter edge query
Answer reachability query with these edges
A.
 Khan
, C. Aggarwal
Experimental Results
19
/23
Friendster Stream (Zipf Frequency Distribution with Varying Skew)
Experiments were performed on a single core of 10GB, 2.4GHz Xeon server
 Skew 1.0
 Skew 1.2
 Skew 1.4
A.
 Khan
, C. Aggarwal
Edge Frequency Estimation Query
Query over top-500 
frequent edges
20
/23
A.
 Khan
, C. Aggarwal
Heavy Hitter Edge Query
21
/23
Frequency Threshold 
= 0.01% of Total Stream Size
Query Answering Time
Reachability Query
22
/23
Each reachability query can be processed in 0.1 sec
Frequency Threshold 
= 0.01% of Total Stream Size
A.
 Khan
, C. Aggarwal
Conclusions
23
/23
       
GMatrix synopsis for summarizing rapid graph streams
Can be leveraged for a variety of frequency and structural queries
A.
 Khan
, C. Aggarwal
Future Work: 
Improving accuracy by hashing high- and low-frequency
edges separately
?
Slide Note
Embed
Share

Graph streams pose challenges in querying due to trade-offs among space, accuracy, and efficiency. The need to balance space and accuracy while maintaining throughput presents obstacles in constructing summaries and incorporating incremental updates. Additional challenges include query expressibility and preserving connectivity information of heavy-hitter edges.

  • Graph Streams
  • Querying
  • Challenges
  • Friendly Compression
  • Connectivity

Uploaded on Sep 24, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Query Query- -Friendly Compression Friendly Compression of Streams of Graph Graph Streams Arijit Khan Charu C. Aggarwal Nanyang Technical University Singapore IBM T. J. Watson Research Lab NY, USA

  2. Graph Streams Graph Stream: Continuous stream of graph edges Telephone network, communication network, social media data, IP traffic e1 e2 e5 e4 e3 . e2 e1 e5 e9 e11 e6 e8 e11 e9e10 e15 e13 Edge Stream e12 e16 Graph Structure With Edge Frequency 1/23 A. Khan, C. Aggarwal

  3. Graph Streams Graph Stream: Continuous stream of graph edges Telephone network, communication network, social media data, IP traffic e1 e2 e5 e4 e3 . e2 e1 e5 e9 e11 e6 e8 e11 e9e10 e15 e13 Edge Stream e12 e16 Graph Structure With Edge Frequency Massive volume and high speed Construct summary to support future queries 1/23

  4. Challenges in Data Streams Querying Trade-off among Space, Accuracy, and Efficiency: -- Increasing space increases accuracy, but reduces throughput Other requirements: -- Build summary in one pass over the stream -- Incremental updates in summary 2/23 A. Khan, C. Aggarwal

  5. Additional Challenges in Graph Streams Querying: Query Expressibility Compute reachability formed by heavy-hitter edges e1 e2 e5 e4 e3 . e2 e1 e5 e9 e11 e6 e8 e11 e9e10 e15 e13 Edge Stream e12 e16 Graph Data: Red Edges are heavy-hitter edges 3/23 A. Khan, C. Aggarwal

  6. Additional Challenges in Graph Streams Querying: Query Expressibility Compute reachability formed by heavy-hitter edges V1 e1 e2 e5 e4 e3 . e2 e1 e5 e9 e11 e6 e8 e11 e9e10 V2 e15 e13 Edge Stream e12 e16 Graph Data: Red Edges are heavy-hitter edges 3/23 A. Khan, C. Aggarwal

  7. Additional Challenges in Graph Streams Querying: Query Expressibility Compute reachability formed by heavy-hitter edges V1 e1 e2 e5 e4 e3 . e2 e1 e5 e9 e11 e6 e8 e11 e9e10 V2 e15 e13 Edge Stream e12 e16 Graph Data: Red Edges are heavy-hitter edges Need to preserve connectivity information of the edges in the graph data 3/23

  8. Related Work Graph Summarization: - Query Preserving Graph Compression (SIGMOD 2012) - Graph Summarization with Bounded Error (SIGMOD 2008) - Representing Web Graphs (ICDE 2003) - The Transitive Reduction of a Directed Graph (SIGCOMP 1972) Data Stream Summarization: - Sketches (SIGMOD 2002, VLDB 2002, SIGMOD 2004) - Histograms (SIGMOD 1996, VLDB 1998) - Wavelets (SIAM Rev. 1996) - Space Saving (ICDT 2005) Graph Streams Querying: - gSketches (VLDB 2012) - Analyzing Graph Structure via Linear Measurements (SODA 2012) - Graph Sketches: Sparsification, Spanners, and Subgraphs (PODS 2012) - TCM Sketch (SIGMOD 2016) 4/23

  9. Related Work Graph Summarization: - Query Preserving Graph Compression (SIGMOD 2012) - Graph Summarization with Bounded Error (SIGMOD 2008) - Representing Web Graphs (ICDE 2003) - The Transitive Reduction of a Directed Graph (SIGCOMP 1972) Data Stream Summarization: - Sketches (SIGMOD 2002, VLDB 2002, SIGMOD 2004) - Histograms (SIGMOD 1996) - Wavelets (SIAM Rev. 1996) - Space Saving (ICDT 2005) Graph Streams Querying: - gSketches (VLDB 2012) - Analyzing Graph Structure via Linear Measurements (SODA 2012) - Graph Sketches: Sparsification, Spanners, and Subgraphs (PODS 2012) - TCM Sketch (SIGMOD 2016) 5/23

  10. Related Work TCM Sketch (SIGMOD 2016): - Does not provide theoretical error bounds - Difficult to answer reachability over heavy-hitter edges 6/23 A. Khan, C. Aggarwal

  11. Count-Min Sketch h + f H1(e) ( e, f ) + f w Hw(e) + f h much smaller than total no of edges Estimate frequency of an edge, find heavy-hitter edges Cannot answer structural queries: are these two nodes connected by only high-frequency edges? 7/23

  12. Our Solution: GMatrix Synopsis H4(.) incoming edge: e = (i,j) H3(.) w h much smaller than total no of nodes H2(.) H1(.) h k-th Hash Function hashes into ( Hk(i), Hk(j)) h (H1(i), H1(j)) 8/23 A. Khan, C. Aggarwal

  13. GMatrix Compression Contract nodes into a total of h super-nodes Different hash functions create different contractions Holds key to effective query processing A graph with 108 nodes, 1010 edges Storage 40 GB GMatrix with h = 103 and w = 10 Storage 40 MB 9/23 A. Khan, C. Aggarwal

  14. Choice of Hash Functions Pair-wise independent, e.g., modular hash function ( ( ) ) = + ( ) mod mod H i ai b P h P is a prime number larger than any node id: (1, 2, , n) a, b chosen uniformly from (1, P-1) 10/23 A. Khan, C. Aggarwal

  15. Reverse Hash Mapping = = 1 ( ) : ( ) H p i H i p 7x mod 9 = 1 x= 4 7*4 = 3*9 + 1 Reverse hash mapping small size and computed efficiently Modular hash function: reverse hash mapping size P/h Can be computed in time O( P/h log P) using extended Euclidean algorithm 11/23 A. Khan, C. Aggarwal

  16. Other Synopsis Options with Same Functionality as GMatrix h2 + f H1(ij) ( ij, f ) + f w Hw(ij) + f Reverse hash mapping computes w . n2/h2 intersections In GMatrix, reverse hash mapping computes 2. w . n/h intersections 12/23 A. Khan, C. Aggarwal

  17. Queries supported by GMatrix (not a comprehensive list) Edge Frequency Query Heavy-hitter Edge Query Node Frequency Query Sub-graph Edge Frequency Query Heavy-hitter Node Query Reachability Query over High-frequency Edges 13/23 A. Khan, C. Aggarwal

  18. Queries supported by GMatrix (not a comprehensive list) Edge Frequency Query Heavy-hitter Edge Query Node Frequency Query Sub-graph Edge Frequency Query Heavy-hitter Node Query Reachability Query over High-frequency Edges 13/23

  19. Edge-Frequency Estimation Query For edge (i, j), compute the frequencies of w different cells: (Hk(i), Hk(j), k) The minimum of these values is returned as the estimated frequency Estimation is good for high-frequency edges If true frequency is significant fraction of total stream size, then relative error is small + . L , ( i ) , ( i ) , ( i ) Q j Q j Q j 14/23 A. Khan, C. Aggarwal

  20. Heavy-Hitter Edge Query Find all edges with frequency greater than F No false negative, but false positive Find all hash-edges with frequency at least F ( p H V k ), ( ) ( ), H q k F k Reverse hash mapping to find real edges : ) , ( j i E k = ) q 1 1 ( ), ( i H p j H k k Intersection of edge sets w k = E k 1 15/23 A. Khan, C. Aggarwal

  21. Heavy-Hitter Edge Query: Optimization First Optimization If a node does not appear as the source node of some potential frequent edge in at least one of the w hash functions, that node and its outgoing edges can be safely eliminated. Second Optimization , ( = ( , ) : ) j , S Q Q i i Q j Q 1 2 1 2 = ( , ) ( , ) ( , ) S Q Q S P P S Q P Q P 1 2 1 2 1 1 2 2 16/23 A. Khan, C. Aggarwal

  22. Heavy-Hitter Edge Query: Time Complexity w P = k + 2 2 log O h w P k c h 1 17/23 A. Khan, C. Aggarwal

  23. Reachability Query Find if two query nodes are connected by a path with edges having frequency at least F Determine all edges for which frequency is at least F using heavy- hitter edge query Answer reachability query with these edges 18/23 A. Khan, C. Aggarwal

  24. Experimental Results #Nodes #Edges Agg. Edge Freq. Max. Edge Freq. 4.43 108 Flat Stream Size Compressed Stream Size 16.47 GB 2.37 GB 250 MB Skew 1.0 66M 3612M 1010 80GB Skew 1.2 1.81 109 3.22 109 Skew 1.4 Friendster Stream (Zipf Frequency Distribution with Varying Skew) GMatrix Size GMatrix Update Time 40MB (h=1000, w=10) 10-6 sec Experiments were performed on a single core of 10GB, 2.4GHz Xeon server 19/23 A. Khan, C. Aggarwal

  25. Edge Frequency Estimation Query 8.0E-04 5 Stream Update Time Observed Error 4 (micro seconds) 6.0E-04 3 GMatrix 4.0E-04 2 2.0E-04 1 Count-Min Sketch 0 1.0E-07 3 # Hash Functions (w) 5 10 1 Skew (Zipf) 1.2 1.4 Query i | Est Freq True Freq | i i = Observed Error Query over top-500 frequent edges Query i True Freq i 20/23 A. Khan, C. Aggarwal

  26. Heavy Hitter Edge Query 10 Frequency Threshold = 0.01% of Total Stream Size False Positive Rate 5 GMatrix Count Min Sketch 0 1 1.2 1.4 Skew (ZipF) # Edges incorrectl reported y heavy hitter = False Positive Rate # True heavy hitter edges Frequency Threshold (% of Total Stream Size) 1 0.1 0.01 GMatrix Count-Min Sketch 28 sec 149 sec 771 sec 1 sec 2 sec 7 sec Query Answering Time 21/23

  27. Reachability Query Skew (ZipF) 1.0 1.2 1.4 Reachability Error 0.012 0.008 0.004 Frequency Threshold = 0.01% of Total Stream Size # incorrectl pairs Node reported y reachable = Reachabili Error ty # Node pairs Each reachability query can be processed in 0.1 sec 22/23 A. Khan, C. Aggarwal

  28. Conclusions GMatrix synopsis for summarizing rapid graph streams Can be leveraged for a variety of frequency and structural queries Future Work: Improving accuracy by hashing high- and low-frequency edges separately? 23/23 A. Khan, C. Aggarwal

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#