Introduction to GraphLab: Large-Scale Distributed Analytics Engine

 
Distributed Systems
CS 15-440
 
GraphLab
Lecture 20, November 11, 2019
 
Mohammad Hammoud
 
Today…
 
Last Session:
Pregel
 
 
Today’s Session:
GraphLab
 
Announcements:
Quiz II is on Wednesday, November 13
PS4 is due on 
November 16 
by midnight
P3 is due on November 20 by midnight
The GraphLab Analytics Engine
GraphLab
Motivation &
Definition
The
Programming
Model
Input, Output
&
Components
The
Architectural
Model
The
Computation
Model
Motivation for GraphLab
 
There is an exponential growth in the 
scale
 of Machine Learning and Data
Mining (MLDM) algorithms
 
Designing, implementing, and testing MLDM at large-scale are challenging
due to:
Synchronization
Deadlocks
Scheduling
Distributed state management
Fault-tolerance
 
The interest on analytics engines that can execute MLDM algorithms
automatically 
and 
efficiently
 is increasing
MapReduce is inefficient with 
iterative jobs 
(common in MLDM algorithms)
Pregel cannot run 
asynchronous problems 
(common  in MLDM algorithms)
What is GraphLab?
 
GraphLab is a large-scale 
graph-parallel
 distributed analytics engine
 
Some Characteristics:
In-Memory (opposite to MapReduce and similar to Pregel)
High scalability
Automatic fault-tolerance
Flexibility in expressing arbitrary graph algorithms (more flexible than Pregel)
Shared-based abstraction (opposite to Pregel but similar to MapReduce)
Peer-to-peer architecture (dissimilar to Pregel and MapReduce)
Asynchronous (dissimilar to Pregel and MapReduce)
GraphLab
Motivation &
Definition
The
Programming
Model
Input, Output
&
Components
The
Architectural
Model
The
Computation
Model
The GraphLab Analytics Engine
 
GraphLab assumes problems modeled as graphs
 
It adopts two phases, the initialization and the execution phases
Input, Graph Flow and Output
Initialization Phase
GraphLab Execution Phase
Distributed
File system
(MapReduce)
Graph Builder
Distributed
File system
Raw Graph
Data
Raw Graph
Data
Parsing +
Partitioning
Atom
Collection
Index
Construction
Atom Index
Atom
File
Atom
File
Atom
File
Atom
File
Atom
File
Atom
File
Cluster
Distributed
File system
TCP RPC Comms
Atom Index
Atom
File
Atom
File
Atom
File
Atom
File
Atom
File
Atom
File
Monitoring +
Atom
Placement
GL Engine
GL Engine
GL Engine
Components of the GraphLab Engine: The Data-Graph
The GraphLab engine incorporates three main parts:
1.
The
 
data-graph
, which represents the user program state at a cluster machine
Data-Graph
Vertex
Edge
Components of the GraphLab Engine: The Update Function
The GraphLab engine incorporates three main parts:
2.
The
 
update function
, which involves two main sub-functions:
2.1- Altering data within a 
scope
 of a vertex
2.2- Scheduling future update functions at neighboring vertices
v
 
S
v
 
The scope of a vertex v (i.e., S
v
)
is the data stored in v and in all
v’s adjacent edges and vertices
Components of the GraphLab Engine: The Update Function
The GraphLab engine incorporates three main parts:
2.
The 
update function
, which involves two main sub-functions:
2.1- 
Altering data within a 
scope
 of a vertex
2.2- Scheduling future update functions at neighboring vertices
 
The update function
 
S
c
h
e
d
u
l
e
 
v
Components of the GraphLab Engine: The Update Function
The GraphLab engine incorporates three main parts:
2.
The
 
update function
, which involves two main sub-functions:
2.1- 
Altering data within a 
scope
 of a vertex
2.2-
 
Scheduling future update functions at neighboring vertices
CPU 1
CPU 2
b
i
h
a
i
b
e
f
j
c
Scheduler
The process repeats until the scheduler is empty
Components of the GraphLab Engine: The Sync Operation
 
The GraphLab engine incorporates three main parts:
3.
The 
sync operation
, which maintains global statistics describing data stored in the data-
graph
 
Global values maintained by the sync operation can be written by 
all
 update
functions across the cluster machines
 
The sync operation is similar to Pregel’s aggregators
 
A mutual exclusion mechanism is applied by the sync operation to avoid write-
write conflicts
 
For scalability reasons, the sync operation is not enabled by default
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
GraphLab
Motivation &
Definition
The
Programming
Model
Input, Output
&
Components
The
Architectural
Model
The
Computation
Model
The GraphLab Analytics Engine
The Architectural Model
 
GraphLab adopts a peer-to-peer architecture
All engine instances are symmetric
Engine instances communicate together using Remote Procedure Call (RPC) protocol
over TCP/IP
The first triggered engine has an additional responsibility of being a
monitoring/master engine
 
Advantages:
Highly scalable
Precludes centralized bottlenecks and single point of failures
 
Main disadvantage:
Complexity
 
 
 
 
 
 
 
 
 
 
 
 
 
 
GraphLab
Motivation &
Definition
The
Programming
Model
Input, Output
&
Components
The
Architectural
Model
The
Computation
Model
The GraphLab Analytics Engine
The Programming Model
 
GraphLab offers a 
shared-based programming model
 
It allows scopes to 
overlap
 and vertices to read/write from/to their scopes
Consistency Models in GraphLab
GraphLab guarantees 
sequential consistency
Provides the same result as a 
sequential
 execution of the computational steps
User-defined consistency models
Full Consistency
Vertex Consistency
Edge Consistency
 
V
e
r
t
e
x
 
v
Consistency Models in GraphLab
D
1
D
2
D
3
D
4
D
5
D
1↔2
D
2↔3
D
3↔4
D
4↔5
D
1
D
2
D
3
D
4
D
5
D
1↔2
D
2↔3
D
3↔4
D
4↔5
 
R
e
a
d
 
W
r
i
t
e
 
R
e
a
d
 
W
r
i
t
e
D
1
D
2
D
3
D
4
D
5
D
1↔2
D
2↔3
D
3↔4
D
4↔5
 
R
e
a
d
 
W
r
i
t
e
F
u
l
l
C
o
n
s
i
s
t
e
n
c
y
M
o
d
e
l
E
d
g
e
C
o
n
s
i
s
t
e
n
c
y
M
o
d
e
l
V
e
r
t
e
x
C
o
n
s
i
s
t
e
n
c
y
M
o
d
e
l
GraphLab
Motivation &
Definition
The
Programming
Model
Input, Output
&
Components
The
Architectural
Model
The
Computation
Model
The GraphLab Analytics Engine
The Computation Model
 
GraphLab employs an 
asynchronous computation model
 
It suggests two asynchronous engines
Chromatic Engine
Locking Engine
 
The 
chromatic engine 
executes vertices 
partially
 asynchronous
It applies 
vertex coloring 
(e.g., no adjacent vertices share the same color)
All vertices with the same color are executed before proceeding to a different color
 
The 
locking engine 
executes vertices 
fully
 asynchronously
Data on vertices and edges are susceptible to corruption
It applies a 
permission-based
 distributed mutual exclusion mechanism to avoid read-
write and write-write hazards
 
How Does GraphLab Compare to
MapReduce and Pregel?
GraphLab vs. Pregel vs. MapReduce
 
Next Week…
 
Caching
 
PageRank: Recap
 
The rank value indicates an importance of a particular web page
 
A hyperlink to a page counts
as a vote of support
 
A page that is linked to by many pages
with high PageRank receives a
high rank itself
 
A PageRank of 0.5 means there is a 50% chance that a person clicking on a
random link will be directed to the document with the 0.5 PageRank
PageRank: Algorithm
Iterate:
Where:
α
 
is the random reset probability
L[j]
 is the number of links on page 
j
1
3
2
4
6
5
PageRank Example in GraphLab
PageRank algorithm is defined as a
 
per-vertex
 operation working on the
 
scope 
of
 
the vertex
Slide Note
Embed
Share

GraphLab is a powerful distributed analytics engine designed for large-scale graph-parallel processing. It offers features like in-memory processing, automatic fault-tolerance, and flexibility in expressing graph algorithms. With characteristics such as high scalability and asynchronous processing, GraphLab addresses challenges in executing Machine Learning and Data Mining algorithms efficiently. By adopting a peer-to-peer architecture and shared-based abstraction, GraphLab complements existing frameworks like MapReduce and Pregel.

  • GraphLab
  • Distributed Systems
  • Large-scale Analytics
  • Machine Learning
  • Data Mining

Uploaded on Oct 08, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Distributed Systems CS 15-440 GraphLab Lecture 20, November 11, 2019 Mohammad Hammoud

  2. Today Last Session: Pregel Today s Session: GraphLab Announcements: Quiz II is on Wednesday, November 13 PS4 is due on November 16 by midnight P3 is due on November 20 by midnight

  3. The GraphLab Analytics Engine GraphLab The Input, Output & Components The The Motivation & Definition Computation Model Architectural Model Programming Model

  4. Motivation for GraphLab There is an exponential growth in the scale of Machine Learning and Data Mining (MLDM) algorithms Designing, implementing, and testing MLDM at large-scale are challenging due to: Synchronization Deadlocks Scheduling Distributed state management Fault-tolerance The interest on analytics engines that can execute MLDM algorithms automatically and efficiently is increasing MapReduce is inefficient with iterative jobs (common in MLDM algorithms) Pregel cannot run asynchronous problems (common in MLDM algorithms)

  5. What is GraphLab? GraphLab is a large-scale graph-parallel distributed analytics engine Some Characteristics: In-Memory (opposite to MapReduce and similar to Pregel) High scalability Automatic fault-tolerance Flexibility in expressing arbitrary graph algorithms (more flexible than Pregel) Shared-based abstraction (opposite to Pregel but similar to MapReduce) Peer-to-peer architecture (dissimilar to Pregel and MapReduce) Asynchronous (dissimilar to Pregel and MapReduce)

  6. The GraphLab Analytics Engine GraphLab The Input, Output & Components The The Motivation & Definition Computation Model Architectural Model Programming Model

  7. Input, Graph Flow and Output GraphLab assumes problems modeled as graphs It adopts two phases, the initialization and the execution phases GraphLab Execution Phase Initialization Phase (MapReduce) Graph Builder Cluster Distributed File system Distributed File system Distributed File system TCP RPC Comms Parsing + Partitioning Monitoring + Atom Placement Atom Index Atom Index Raw Graph Data Atom File Atom File Atom File Atom File Atom Collection GL Engine Atom File Atom File Atom File Atom File Raw Graph Data GL Engine Atom File Atom File Atom File Atom File GL Engine Index Construction

  8. Components of the GraphLab Engine: The Data-Graph The GraphLab engine incorporates three main parts: 1. The data-graph, which represents the user program state at a cluster machine Vertex Edge Data-Graph

  9. Components of the GraphLab Engine: The Update Function The GraphLab engine incorporates three main parts: 2. The update function, which involves two main sub-functions: 2.1- Altering data within a scope of a vertex 2.2- Scheduling future update functions at neighboring vertices Sv The scope of a vertex v (i.e., Sv) is the data stored in v and in all v s adjacent edges and vertices v

  10. Components of the GraphLab Engine: The Update Function The GraphLab engine incorporates three main parts: 2. The update function, which involves two main sub-functions: 2.1- Altering data within a scope of a vertex 2.2- Scheduling future update functions at neighboring vertices Algorithm: TheGraphLab Execution Engine Schedule v The update function

  11. Components of the GraphLab Engine: The Update Function The GraphLab engine incorporates three main parts: 2. The update function, which involves two main sub-functions: 2.1- Altering data within a scope of a vertex 2.2- Scheduling future update functions at neighboring vertices c c b b d a CPU 1 Scheduler e e f f g a b h i i i j j k h CPU 2 The process repeats until the scheduler is empty

  12. Components of the GraphLab Engine: The Sync Operation The GraphLab engine incorporates three main parts: 3. The sync operation, which maintains global statistics describing data stored in the data- graph Global values maintained by the sync operation can be written by all update functions across the cluster machines The sync operation is similar to Pregel s aggregators A mutual exclusion mechanism is applied by the sync operation to avoid write- write conflicts For scalability reasons, the sync operation is not enabled by default

  13. The GraphLab Analytics Engine GraphLab The Input, Output & Components The The Motivation & Definition Computation Model Architectural Model Programming Model

  14. The Architectural Model GraphLab adopts a peer-to-peer architecture All engine instances are symmetric Engine instances communicate together using Remote Procedure Call (RPC) protocol over TCP/IP The first triggered engine has an additional responsibility of being a monitoring/master engine Advantages: Highly scalable Precludes centralized bottlenecks and single point of failures Main disadvantage: Complexity

  15. The GraphLab Analytics Engine GraphLab The Input, Output & Components The The Motivation & Definition Computation Model Architectural Model Programming Model

  16. The Programming Model GraphLab offers a shared-based programming model It allows scopes to overlap and vertices to read/write from/to their scopes

  17. Consistency Models in GraphLab GraphLab guarantees sequential consistency Provides the same result as a sequential execution of the computational steps User-defined consistency models Full Consistency Vertex Consistency Edge Consistency Full Consistency Edge Consistency Vertex Consistency Vertex v

  18. Consistency Models in GraphLab Read Write Consistency Model Full 1 2 3 4 5 D3 D4 D1 D2 4 5 2 3 D3 D4 D1 D2 D5 Read Write Consistency 1 2 3 4 5 Model Edge D3 D4 D1 D2 4 5 2 3 D3 D4 D1 D2 D5 Read Write Consistency 1 2 3 4 5 Vertex Model D3 D4 D1 D2 4 5 2 3 D3 D4 D1 D2 D5

  19. The GraphLab Analytics Engine GraphLab The Input, Output & Components The The Motivation & Definition Computation Model Architectural Model Programming Model

  20. The Computation Model GraphLab employs an asynchronous computation model It suggests two asynchronous engines Chromatic Engine Locking Engine The chromatic engine executes vertices partially asynchronous It applies vertex coloring (e.g., no adjacent vertices share the same color) All vertices with the same color are executed before proceeding to a different color The locking engine executes vertices fully asynchronously Data on vertices and edges are susceptible to corruption It applies a permission-based distributed mutual exclusion mechanism to avoid read- write and write-write hazards

  21. How Does GraphLab Compare to MapReduce and Pregel?

  22. GraphLab vs. Pregel vs. MapReduce Aspect Aspect Aspect Aspect Aspect Aspect Hadoop MapReduce Hadoop MapReduce Hadoop MapReduce Hadoop MapReduce Hadoop MapReduce Hadoop MapReduce Pregel Pregel Pregel Pregel Pregel Pregel GraphLab GraphLab GraphLab GraphLab GraphLab GraphLab Programming Model Programming Model Programming Model Programming Model Programming Model Programming Model Shared-Based Shared-Based Shared-Based Shared-Based Shared-Based Shared-Based Message-Passing Message-Passing Message-Passing Message-Passing Message-Passing Message-Passing Shared-Memory Shared-Memory Shared-Memory Shared-Memory Shared-Memory Shared-Memory Computation Model Computation Model Computation Model Computation Model Computation Model Synchronous Synchronous Synchronous Synchronous Synchronous Synchronous Synchronous Synchronous Synchronous Synchronous Asynchronous Asynchronous Asynchronous Asynchronous Asynchronous Parallelism Model Parallelism Model Parallelism Model Parallelism Model Data-Parallel Data-Parallel Data-Parallel Data-Parallel Graph-Parallel Graph-Parallel Graph-Parallel Graph-Parallel Graph-Parallel Graph-Parallel Graph-Parallel Graph-Parallel Architectural Model Architectural Model Architectural Model Master-Slave Master-Slave Master-Slave Master-Slave Master-Slave Master-Slave Peer-to-Peer Peer-to-Peer Peer-to-Peer Task/Vertex Scheduling Model Scheduling Model Task/Vertex Pull-Based Pull-Based Push-Based Push-Based Push-Based Push-Based Loosely- Application Suitability Strongly-Connected Applications Strongly-Connected Applications (more precisely MLDM apps) Connected/Embarrassing ly Parallel Applications

  23. Next Week Caching

  24. PageRank: Recap The rank value indicates an importance of a particular web page A hyperlink to a page counts as a vote of support A page that is linked to by many pages with high PageRank receives a high rank itself A PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank

  25. PageRank: Algorithm Iterate: Where: is the random reset probability L[j] is the number of links on page j 1 2 3 4 5 6

  26. PageRank Example in GraphLab PageRank algorithm is defined as a per-vertex operation working on the scope of the vertex pagerank(i, scope){ // Get Neighborhood data (R[i], Wij, R[j]) scope; // Update the vertex data 1 ( ] [ + i R [ N j ) [ ; ] j W R ji ] i // Reschedule Neighbors if needed if R[i] changes then reschedule_neighbors_of(i); } Dynamic computation

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#