Balanced Graph Edge Partition and Its Practical Applications
Balanced graph edge partitioning is a crucial problem in graph computation, machine learning, and graph databases. It involves partitioning a graph's vertices or edges into balanced components while minimizing cut costs. This process is essential for various real-world applications such as iterative computation platforms like Pregel and Giraph, machine learning algorithms like Graphlab, and graph databases. The partitioning must adhere to certain constraints while being able to adapt to streaming data. Practical algorithms, streaming heuristics, and performance comparisons between edge and vertex partitioning are significant areas of study in this domain.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Balanced Graph Edge Partition Florian Bourse ENS Marc Lelarge INRIA-ENS Milan Vojnovic Microsoft Research ACM KDD 2014
Balanced Graph Partition Much interest in practice: Iterative computation platforms (ex. Pregel, Giraph) Machine learning (ex. Graphlab) Graph databases Problem Given a directed graph ? = (?,?) and a positive integer ?, partition either the set of vertices or the set of edges into balanced components such that a cut cost function is minimized Minimize ?(?) Subject to ?? (1 + ?) ? ?(?) ? is a valid assignment Online problem: vertices or edges presented as a stream: streaming graph partition , ? ? 2
Different Variants VP No Aggregation VPA ? traditional u u EP u Aggregation ? ? u PowerGraph [OSDI 2012] u EPA u Edge partition Vertex partition 3
Questions Performance benefits of using balanced edge partition as opposed to using more traditional balanced vertex partition ? Practical algorithms for balanced edge partition w/o aggregation and their theoretical guarantees ? Streaming heuristics for balanced edge partition ? 4
Costs: Cuts and Loads Master vertex assignment ? ? ????,? = ???? = ???? ??,? ??,? ,? ??,?(1 ??,?) ? ? ?=1 ? ???(?) ?=1 ?,? ? ?? = ???? ??,? ?? = ????,? ? ? ? ? ? ? ????? = ????? = max ? ???(?)??,? ,? ??,? max ? ???(?)??,? ?=1 ? ? ?=1 ? ? ? ? ?? = ????,? ?? = ??,? ? ? ? ? 5
Expected Costs of Random Assignments ? ???= 1 1 ? ???= 1 1 ? ? ? ? ? ????= (? 1)? 1 ? ???,? ? ????= ?? 1 ? ???,? ? ???(?) 1 ? ? ?1 1 ? = number of vertices, ? = number of edges, ? ???,? = ? 6
Random Assignment Comparison ? ????< ? ??? if ? ? ? ? ????< ? ??? ? ????< ? ???? Gap: ? ????= ? ???? ??(???,?) 7
Approximation Guarantees There exists a polynomial-time algorithm such that given a feasible vertex assignment ? with load constraint ? for the problem ?,? -VP with cost ???? outputs a feasible edge assignment ? with load constraint ? + ?2/? for the problem ?-EPA with cost ????(?): 1 ???? ????? ???? ???? Algorithm: given a vertex assignment ?, split the edges that span each pair of clusters evenly between the two clusters Approximation guarantee: ?(???? log ? log(?)) 8
Approximation Guarantees (contd) Edge partition with aggregation is in a one-to-one correspondence to a vertex partition with no aggregation and degree-weighted vertices Algorithm: Find a balanced vertex partition with degree-weighted vertices Assign each edge to any cluster that contains at least one vertex of this edge Approximation guarantee: ? aggregation log?log? for edge partition with no 9
Streaming Heuristics Online assignment of vertices or edges as they are observed in an input stream Irrevocable assignments Reassignments are expensive in web-scale systems (consistency of distributed state) Use local graph knowledge (neighbourhood sets) Scalable One pass through the vertices or edges Previously proposed streaming heuristic: PowerGraph [OSDI 2012] 10
PowerGraph Streaming Heuristic ? 1 1 1 ? ? ? = (?,?) ? 2 2 2 3 3 3 Place e to 1 or 2 whichever contains either ? or ? with the least number of unassigned edges Place e to a least loaded cluster Place e to 1 Prioritizes assignment of edges to clusters that already contain its end vertices: prone to large load imbalance 11
Greedy: Least Incremental Cost Each edge ? has incremental cost 0, 1 or 2 0 : both end vertices of edge ? already in the given cluster 1 : one end vertex of edge ? already in the given cluster 2 : none of end vertices of edge ? are already in the given cluster Combine with the criterion of load balancing Ex 1. adding a penalty function that is convex increasing in the load of cluster Ex 2. restrict application of least incremental cost to a subset of clusters with a small load relative to other clusters 12
Performance of Random Assignment Graph: Amazon 14
Streaming Heuristics Graph: Amazon 15
Performance of Random Assignment (contd) Graph: Youtube 16
Concluding Remarks Balanced graph vertex and edge partiton w/o aggregation Expected communication for random vertex or edge assignment Comparison results Approximation algorithms for edge partition w/o aggregation Without aggregation: ?(???? log?log?) With no aggregation: ?( log?log?), same as for balanced vertex partition In both cases, balanced vertex partition used as subroutine (practical) Empirical evidence suggests greedy incremental cost streaming heuristics performs well on real-world graphs It alleviates high load imbalance of PowerGraph heuristic 17