Managing Large Graphs on Multi-Cores with Graph Awareness

 
Managing Large Graphs on
Multi-Cores With Graph Awareness
 
Vijayan, Ming, Xuetian, Frank, Lidong, Maya
Microsoft Research
 
Motivation
 
Tremendous increase in graph data and applications
New class of graph applications that require real-time
responses
Even batch-processed workloads have strict time-constraints
 
Multi-core revolution
Default standards on most machines
Large-scale multi-cores with terabytes of main memory
Run workloads that are traditionally run on distributed
systems
 
Existing graph-processing systems lack support for both
 
A High-level Description of Grace
Grace is an in-memory graph management and
processing system
Implements several optimizations
Graph-specific
Multi-core-specific
Supports snapshots and transactional updates
on graphs
Evaluation shows that optimizations help Grace
run several times faster than other alternatives
 
Overview
 
 
Details of
optimizations
 
 
Details on
transactions
 
Subset of
results
 
Outline
 
Keeps an entire graph in
memory in smaller parts.
 
 
Exposes C-style API for writing
graph workloads, iterative
workloads, and updates.
 
 
Design driven by two trends
-
Graph-specific locality
-
Partitionable and
parallelizable workloads
Grace API
Core 0
Core 1
Graph and Multi-core Optimizations
An Overview of Grace
C
C
Data Structures
A
B
D
C
Edge Pointer Array
Vertex Index
1
1
1
0
Vertex Allocation Map
A
B
C
B
C
B
C
Edges of A
Edges of B
Edges of C
Vertex
Log
Edge
Log
Data Structures in a Partition
Graph-Aware Partitioning & Placement
 
Partitioning and placement – are they useful on a single machine?
Yes, to take advantage of multi-cores and memory hierarchies
 
Solve them using graph partitioning algorithms
Divide a graph into sub-graphs, minimizing edge-cuts
 
Grace provides an extensible library
Graph-aware: heuristic-based, spectral partitioning, Metis
Graph-agnostic: hash partitioning
 
Achieve better layout by recursive graph partitioning
Recursively run graph partition until a sub-graph can fit in a cache line
Recompose all the sub-graphs to get the vertex layout
 
 
Platform for Parallel Iterative Computations
 
Iterative computation platform implements
“bulk synchronous parallel” model.
 
Barrier
 
Parallel computations
 
Propagate updates
 
Iteration 1
 
Iteration 2
Load Balancing and Updates Batching
 
Solution1: Load balancing is implemented
by sharing a portion of vertices
 
Solution2: Updates batching is implemented by
-
grouping updates by their destination part
-
Issuing updates in a round-robin fashion
 
Grace supports structural changes to a graph
 
      BeginTransaction()
      AddVertex(X)
      AddEdge(X, Y)
      EndTransaction()
 
Transactions use snapshot isolation
-
Instantaneous snapshots using CoW techniques
-
CoW can affect careful memory layout!
 
Transactions on Graphs
 
Graphs:
-
Web  (v:88M, e:275M), sparse
-
Orkut (v:3M, e:223M), dense
 
Workloads:
-
N-hop-neighbor queries, BFS, DFS, PageRank, Weakly-
Connected Components, Shortest Path
 
Architecture:
-
Intel Xeon-12 cores, 2 chips with 6 cores each
-
AMD Opteron-48 cores, 4 chips with 12 cores each
 
Questions:
-
How well partitioning and placement work?
-
How useful are load balancing and updates batching?
-
How does Grace compare to other systems?
 
Evaluation
Partitioning and Placement Performance On Intel
 
Observation: For smaller number of partitions, partition algorithm
didn’t make a big difference
 
Reason: All the partitions fit within cores of single chip minimizing
communication cost
PageRank Speedup
Orkut graph partitions
Web graph partitions
 
Observation: Placing neighboring vertices close together improves
performance significantly
 
Reason: L1, L2, and L3 cache and Data-TLB misses are reduced
 
Observation: Careful vertex arrangement works better when graph
partitioning is used for sparse graphs
 
Reason: graph partitioning puts neighbors under same part
helping better placement
 
1
 
2
 
3
Load Balancing and Updates Batching On Intel
PageRank Speedup
Orkut graph partitions
Web graph partitions
 
Observation: Load balancing and updates batching didn’t improve
performance for web graph
 
Reason: Sparse graphs can be partitioned better and there are
fewer updates to send
 
Observation: Batching updates gives better performance
improvement for Orkut graph
 
Reason: Updates batching reduces remote cache accesses
 
1
 
2
 
Comparing Grace, BDB, and Neo4j
 
Conclusion
 
Grace explores graph-specific and multi-core specific
optimizations
 
What worked and what didn’t (in our setup; your mileage
might differ)
Careful vertex placement in memory gave good
improvements
Partitioning and updates batching worked in most
cases, but not always
Load balancing wasn’t as useful
Slide Note
Embed
Share

This research discusses the challenges in managing large graphs on multi-core systems and introduces Grace, an in-memory graph management and processing system with optimizations for graph-specific and multi-core-specific operations. The system keeps the entire graph in memory in smaller parts and provides a C-style API for graph workloads. Graph-aware partitioning techniques are used to optimize graph placement and achieve better performance.

  • Graph Management
  • Multi-Cores
  • Optimization
  • In-Memory Processing
  • Graph Partitioning

Uploaded on Sep 10, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Managing Large Graphs on Multi-Cores With Graph Awareness Vijayan, Ming, Xuetian, Frank, Lidong, Maya Microsoft Research

  2. Motivation Tremendous increase in graph data and applications New class of graph applications that require real-time responses Even batch-processed workloads have strict time-constraints Multi-core revolution Default standards on most machines Large-scale multi-cores with terabytes of main memory Run workloads that are traditionally run on distributed systems Existing graph-processing systems lack support for both

  3. A High-level Description of Grace Outline Grace is an in-memory graph management and processing system Overview Implements several optimizations Graph-specific Multi-core-specific Details of optimizations Details on transactions Supports snapshots and transactional updates on graphs Evaluation shows that optimizations help Grace run several times faster than other alternatives Subset of results

  4. An Overview of Grace Keeps an entire graph in memory in smaller parts. v = GetVertex(Id) for (i=0; i<v.degree;i++) neigh=v.GetNeighbor(i) Iterative Programs (e.g., PageRank) Grace API Net Exposes C-style API for writing graph workloads, iterative workloads, and updates. RPC Graph and Multi-core Optimizations B A C E Design driven by two trends - Graph-specific locality - Partitionable and parallelizable workloads D Core 0 Core 1

  5. Data Structures Vertex Log A B C Edge Log Edges of A Edges of B Edges of C A 0 B 1 C 2 Edge Pointer Array 0 1 1 1 Vertex Index Vertex Allocation Map Data Structures in a Partition

  6. Graph-Aware Partitioning & Placement Partitioning and placement are they useful on a single machine? Yes, to take advantage of multi-cores and memory hierarchies Solve them using graph partitioning algorithms Divide a graph into sub-graphs, minimizing edge-cuts Grace provides an extensible library Graph-aware: heuristic-based, spectral partitioning, Metis Graph-agnostic: hash partitioning Achieve better layout by recursive graph partitioning Recursively run graph partition until a sub-graph can fit in a cache line Recompose all the sub-graphs to get the vertex layout

  7. Platform for Parallel Iterative Computations Iterative computation platform implements bulk synchronous parallel model. Parallel computations Iteration 1 Propagate updates Barrier Iteration 2

  8. Load Balancing and Updates Batching Problem1: overloaded partitions can affect performance Solution1: Load balancing is implemented by sharing a portion of vertices Problem2: Updates in arbitrary order can increase cache misses Solution2: Updates batching is implemented by - grouping updates by their destination part - Issuing updates in a round-robin fashion Barrier Cache line B D A C Part1 Core1 Part2 Core2 Part0 Core0

  9. Transactions on Graphs Grace supports structural changes to a graph BeginTransaction() AddVertex(X) AddEdge(X, Y) EndTransaction() Transactions use snapshot isolation - Instantaneous snapshots using CoW techniques - CoW can affect careful memory layout!

  10. Evaluation Graphs: - Web (v:88M, e:275M), sparse - Orkut (v:3M, e:223M), dense Workloads: - N-hop-neighbor queries, BFS, DFS, PageRank, Weakly- Connected Components, Shortest Path Architecture: - Intel Xeon-12 cores, 2 chips with 6 cores each - AMD Opteron-48 cores, 4 chips with 12 cores each Questions: - How well partitioning and placement work? - How useful are load balancing and updates batching? - How does Grace compare to other systems?

  11. Partitioning and Placement Performance On Intel 12 30 PageRank Speedup 10 25 8 20 6 15 4 10 2 5 0 0 3 6 12 3 6 12 Orkut graph partitions Web graph partitions Hash Heuristic Metis Hash + Placement Heuristic + Placement Metis + Placement 1 2 3 Observation: Placing neighboring vertices close together improves Observation: Careful vertex arrangement works better when graph Observation: For smaller number of partitions, partition algorithm didn t make a big difference performance significantly partitioning is used for sparse graphs Reason: L1, L2, and L3 cache and Data-TLB misses are reduced Reason: graph partitioning puts neighbors under same part Reason: All the partitions fit within cores of single chip minimizing communication cost helping better placement

  12. Load Balancing and Updates Batching On Intel PageRank Speedup 2 2 1 1 0 0 3 6 12 3 6 12 Orkut graph partitions Web graph partitions Load Balancing Load Balancing + Update Batching 1 2 Observation: Load balancing and updates batching didn t improve performance for web graph improvement for Orkut graph Observation: Batching updates gives better performance Reason: Sparse graphs can be partitioned better and there are fewer updates to send 1500 Reason: Updates batching reduces remote cache accesses Retired Load Balancing Load 1000 500 Load Balancing + Updates Batching 0 Sibling Core Remote Chip

  13. Comparing Grace, BDB, and Neo4j 10000 1000 Running Time (s) 100 BDB 10 Neo4j 1 Grace 0.1

  14. Conclusion Grace explores graph-specific and multi-core specific optimizations What worked and what didn t (in our setup; your mileage might differ) Careful vertex placement in memory gave good improvements Partitioning and updates batching worked in most cases, but not always Load balancing wasn t as useful

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#