Accelerated Hypergraph Coarsening Procedure on GPU

An Accelerated Procedure
for Hypergraph Coarsening
on the GPU
Lin Cheng, Hyunsu Cho, and Peter Yoon
Trinity College
Hartford, CT, USA
Outline
Hypergraph coarsening
Implementation challenges
Runtime task planning
Results
Hypergraph
Nodes
H
y
p
e
r
e
d
g
e
s
 
(
n
e
t
s
)
Subsets of nodes
v
2
v
1
v
3
v
5
v
6
v
4
e
1
e
2
e
3
Hypergraph
Hypergraph partitioning
M
i
n
i
m
i
z
e
 
e
d
g
e
 
c
u
t
Balance constraint
NP-complete
v
2
v
1
v
3
v
5
v
6
v
4
e
1
e
2
e
3
Image classification
Similar images should go to same category
N
e
e
d
 
t
o
 
c
o
m
p
a
r
e
 
i
m
a
g
e
s
 
t
o
 
o
n
e
 
a
n
o
t
h
e
r
Hypergraph construction
Compare multiple pictures at a time
Hypergraph coarsening
Heuristic: reduce # nodes by fusing
v
2
v
1
v
3
v
5
v
6
v
4
e
1
e
2
e
3
6 nodes 
Hypergraph coarsening
Heuristic: reduce # nodes by fusing
v
2
v
3
v
5
v
6
v
4
e
1
e
2
e
3
v
1
3
 
n
o
d
e
s
 
Mondriaan algorithm
Given a node, find most “similar” neighbor
S
i
m
i
l
a
r
i
t
y
 
=
 
#
 
h
y
p
e
r
e
d
g
e
s
 
c
o
n
t
a
i
n
i
n
g
 
b
o
t
h
v
2
v
3
Affinity = 2
Mondriaan algorithm
Given a node, find most “similar” neighbor
S
i
m
i
l
a
r
i
t
y
 
=
 
#
 
h
y
p
e
r
e
d
g
e
s
 
c
o
n
t
a
i
n
i
n
g
 
b
o
t
h
Binary
sparse
matrix 
← Nodes
Edges →
Member? (T/F)
Mondriaan algorithm
Given a node, find most “similar” neighbor
S
i
m
i
l
a
r
i
t
y
 
=
 
#
 
h
y
p
e
r
e
d
g
e
s
 
c
o
n
t
a
i
n
i
n
g
 
b
o
t
h
 
1. Find edges
containing 
v
3
Mondriaan algorithm
Given a node, find most “similar” neighbor
S
i
m
i
l
a
r
i
t
y
 
=
 
#
 
h
y
p
e
r
e
d
g
e
s
 
c
o
n
t
a
i
n
i
n
g
 
b
o
t
h
 
1. Find edges
containing 
v
3
2. Collect
nonzeros
Mondriaan algorithm
Given a node, find most “similar” neighbor
S
i
m
i
l
a
r
i
t
y
 
=
 
#
 
h
y
p
e
r
e
d
g
e
s
 
c
o
n
t
a
i
n
i
n
g
 
b
o
t
h
 
2
3. Inspect
column index of
each nonzero
Mondriaan algorithm
Given a node, find most “similar” neighbor
S
i
m
i
l
a
r
i
t
y
 
=
 
#
 
h
y
p
e
r
e
d
g
e
s
 
c
o
n
t
a
i
n
i
n
g
 
b
o
t
h
 
3. Inspect
column index of
each nonzero
2
2
2
1
1
Similarity scores
Mondriaan algorithm
Parallel algorithm:
Inspect edges in parallel
 
2
2
2
1
1
T
h
r
e
a
d
 
0
T
h
r
e
a
d
 
1
T
h
r
e
a
d
 
2
GPU as parallel accelerator
N
V
I
D
I
A
 
G
P
U
s
 
:
 
o
r
g
a
n
i
z
e
d
 
i
n
 
w
a
r
p
s
 
 
3
2
 
t
h
r
e
a
d
s
 
s
h
a
r
e
 
o
n
e
 
i
n
s
t
r
u
c
t
i
o
n
 
c
o
u
n
t
e
r
X
(register)
A[*]
(memory)
LOAD  
A[*]   
INTO   
X
Warp divergence
X
A[*]
Thread 0-9: LOAD  
A[0-9]   
INTO   
X
0
1
2
3
4
5
6
7
8
9
Warp divergence
X
A[*]
Thread 0-4: LOAD  
A[0-4]   
INTO   
X
Thread 5-9:                             NONE
0
1
2
3
4
5
6
7
8
9
Warp divergence
X
A[*]
Thread 0-4: LOAD  
A[0-4]   
INTO   
X
Thread 5-9:                             NONE
0
1
2
3
4
5
6
7
8
9
s
t
a
l
l
e
d
Warp divergence
Serializes execution
C
a
u
s
e
d
 
b
y
 
l
o
a
d
 
i
m
b
a
l
a
n
c
e
Sparse/irregular data
Mondriaan algorithm
A naïve strategy results in load imbalance
Nonzero entry = workload
 
2
2
2
1
1
T
h
r
e
a
d
 
0
T
h
r
e
a
d
 
1
T
h
r
e
a
d
 
2
6
3
2
SHFL to the rescue
Compiler primitive
Shuffles content of adjacent registers
 
x=shfl_down(x,2)
SHFL to the rescue
Compiler primitive
Shuffles content of adjacent registers
S
i
n
g
l
e
 
m
a
c
h
i
n
e
 
i
n
s
t
r
u
c
t
i
o
n
W
a
r
p
-
s
y
n
c
h
r
o
n
o
u
s
;
 
n
o
 
s
y
n
c
.
 
n
e
e
d
e
d
 
a
f
t
e
r
 
x=shfl_down(x,2)
Runtime planning with SHFL
T
h
r
e
a
d
 
0
T
h
r
e
a
d
 
1
T
h
r
e
a
d
 
2
6
3
2
Runtime planning with SHFL
T
h
r
e
a
d
 
0
T
h
r
e
a
d
 
1
T
h
r
e
a
d
 
2
6
3
2
P
r
e
f
i
x
 
s
u
m
1
1
0
6
9
Runtime planning with SHFL
T
h
r
e
a
d
 
0
T
h
r
e
a
d
 
1
T
h
r
e
a
d
 
2
6
3
2
P
r
e
f
i
x
 
s
u
m
0
6
9
1
1
c
e
i
l
(
1
1
 
/
 
3
)
 
=
 
4
Runtime planning with SHFL
T
h
r
e
a
d
 
0
T
h
r
e
a
d
 
1
T
h
r
e
a
d
 
2
6
3
2
1
1
c
e
i
l
(
1
1
 
/
 
3
)
 
=
 
4
R
a
n
g
e
0
 
 
3
4
 
 
7
8
 
 
1
0
Runtime planning with SHFL
T
h
r
e
a
d
 
0
T
h
r
e
a
d
 
1
T
h
r
e
a
d
 
2
R
a
n
g
e
0
 
 
3
4
 
 
7
8
 
 
1
0
Equal # of
nonzeros
Runtime planning with SHFL
T
h
r
e
a
d
 
0
T
h
r
e
a
d
 
1
T
h
r
e
a
d
 
2
R
a
n
g
e
0
 
 
3
4
 
 
7
8
 
 
1
0
2
2
2
1
1
Results
NVIDIA Tesla K20c (5GB mem)
Intel Xeon E5-2620
Reference sequential implementation of
Mondriaan
Results: 
long-tailed distribution
Y-axis: # of nonzeros in columns, descending, log-scale
GPU: 50 s
Seq. : 811 s
1
6
 
x
 
s
p
e
e
d
-
u
p
GPU: 59 s
Seq. : 877 s
1
5
 
x
 
s
p
e
e
d
-
u
p
wikipedia
flickr
1.E+05
1.E+04
Results: 
non-long-tailed distribution
Y-axis: # of nonzeros in columns, descending, log-scale
GPU: 146 s
Seq. :  65 s
0
.
4
5
 
x
GPU: 626 s
Seq. : 41 s
0
.
0
7
 
x
stanford
stanford-berkeley
1.E+03
1.E+03
Analysis of results
G
o
o
d
 
s
p
e
e
d
u
p
 
f
o
r
 
d
a
t
a
 
w
i
t
h
 
l
o
n
g
-
t
a
i
l
e
d
d
i
s
t
r
i
b
u
t
i
o
n
 
o
f
 
n
o
n
z
e
r
o
s
Analysis of results
G
o
o
d
 
s
p
e
e
d
u
p
 
f
o
r
 
d
a
t
a
 
w
i
t
h
 
l
o
n
g
-
t
a
i
l
e
d
d
i
s
t
r
i
b
u
t
i
o
n
 
o
f
 
n
o
n
z
e
r
o
s
Synthetic data
First 1,000 columns : 100,000 nonzeros each
Next 699,000 columns : all zero
Speedup: 123 x
Analysis of results
Most nodes sparsely connected ( << 32 edges)
Now: one warp, one instance of Mondriaan
Work in progress
P
o
o
l
 
m
u
l
t
i
p
l
e
 
i
n
s
t
a
n
c
e
s
 
o
f
 
M
o
n
d
r
i
a
a
n
 
i
n
t
o
 
w
a
r
p
Conclusion
Implemented hypergraph algorithm handling
arbitrary connectivity patterns
Explored SHFL for task planning
Future work: more flexible allocation strategy
Acknowledgment
Trinity College: Student Research Program
NVIDIA: CUDA Teaching Center
Slide Note
Embed
Share

An accelerated procedure for hypergraph coarsening on the GPU, presented by Lin Cheng, Hyunsu Cho, and Peter Yoon from Trinity College, Hartford, CT, USA. The research covers hypergraph coarsening, implementation challenges, runtime task planning, hypergraph nodes, hypergraph partitioning, image classification, hypergraph construction, and the Mondriaan algorithm for node similarity. The study focuses on reducing the number of nodes by fusing them, improving image categorization, and enhancing hypergraph construction efficiency.

  • Hypergraph Coarsening
  • GPU Computing
  • Image Classification
  • Mondriaan Algorithm
  • Hypergraph Partitioning

Uploaded on Sep 15, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. An Accelerated Procedure for Hypergraph Coarsening on the GPU Lin Cheng, Hyunsu Cho, and Peter Yoon Trinity College Hartford, CT, USA

  2. Outline Hypergraph coarsening Implementation challenges Runtime task planning Results

  3. Hypergraph Nodes Hyperedges (nets) Subsets of nodes e1 e2 v2 v1 v3 e3 v4 v5 v6

  4. Hypergraph Hypergraph partitioning Minimize edge cut Balance constraint NP-complete e1 e2 v2 v1 v3 e3 v4 v5 v6

  5. Image classification Similar images should go to same category Need to compare images to one another

  6. Hypergraph construction Compare multiple pictures at a time

  7. Hypergraph coarsening Heuristic: reduce # nodes by fusing e1 e2 v2 v1 v3 6 nodes e3 v4 v5 v6

  8. Hypergraph coarsening Heuristic: reduce # nodes by fusing e1 e2 v2v3 v1 3 nodes v4 e3 v5v6

  9. Mondriaan algorithm Given a node, find most similar neighbor Similarity = # hyperedges containing both v2 Affinity = 2 v3

  10. Mondriaan algorithm Given a node, find most similar neighbor Similarity = # hyperedges containing both Nodes Binary sparse matrix Edges Member? (T/F)

  11. Mondriaan algorithm Given a node, find most similar neighbor Similarity = # hyperedges containing both 1. Find edges containing v3

  12. Mondriaan algorithm Given a node, find most similar neighbor Similarity = # hyperedges containing both 1. Find edges containing v3 2. Collect nonzeros

  13. Mondriaan algorithm Given a node, find most similar neighbor Similarity = # hyperedges containing both 2 3. Inspect column index of each nonzero

  14. Mondriaan algorithm Given a node, find most similar neighbor Similarity = # hyperedges containing both 2 2 1 2 1 Similarity scores 3. Inspect column index of each nonzero

  15. Mondriaan algorithm Parallel algorithm: Inspect edges in parallel 2 2 1 2 1 Thread 0 Thread 1 Thread 2

  16. GPU as parallel accelerator NVIDIA GPUs : organized in warps 32 threads share one instruction counter LOAD A[*] INTO X X (register) A[*] (memory)

  17. Warp divergence Thread 0-9: LOAD A[0-9] INTO X 3 8 9 0 1 2 4 5 6 7 X A[*]

  18. Warp divergence Thread 0-4: LOAD A[0-4] INTO X Thread 5-9: NONE 3 8 9 0 1 2 4 5 6 7 X A[*]

  19. Warp divergence Thread 0-4: LOAD A[0-4] INTO X Thread 5-9: NONE 3 8 9 0 1 2 4 5 6 7 X A[*] stalled

  20. Warp divergence Serializes execution Caused by load imbalance Sparse/irregular data

  21. Mondriaan algorithm A na ve strategy results in load imbalance Nonzero entry = workload 2 2 1 2 1 6 Thread 0 3 Thread 1 Thread 2 2

  22. SHFL to the rescue Compiler primitive Shuffles content of adjacent registers x=shfl_down(x,2)

  23. SHFL to the rescue Compiler primitive Shuffles content of adjacent registers Single machine instruction Warp-synchronous; no sync. needed after x=shfl_down(x,2)

  24. Runtime planning with SHFL 6 Thread 0 3 Thread 1 Thread 2 2

  25. Runtime planning with SHFL Prefix sum 6 0 Thread 0 3 6 Thread 1 Thread 2 9 2 11

  26. Runtime planning with SHFL Prefix sum 6 0 Thread 0 3 6 Thread 1 Thread 2 9 2 11 ceil(11 / 3) = 4

  27. Runtime planning with SHFL Range 6 0 3 Thread 0 4 7 3 Thread 1 8 10 Thread 2 2 11 ceil(11 / 3) = 4

  28. Runtime planning with SHFL Range 0 3 Thread 0 Equal # of nonzeros 4 7 Thread 1 8 10 Thread 2

  29. Runtime planning with SHFL Range 2 2 1 2 1 0 3 Thread 0 4 7 Thread 1 8 10 Thread 2

  30. Results NVIDIA Tesla K20c (5GB mem) Intel Xeon E5-2620 Reference sequential implementation of Mondriaan

  31. Results: long-tailed distribution flickr wikipedia 1.E+04 1.E+05 GPU: 59 s Seq. : 877 s 15 x speed-up GPU: 50 s Seq. : 811 s 16 x speed-up Y-axis: # of nonzeros in columns, descending, log-scale

  32. Results: non-long-tailed distribution stanford-berkeley 1.E+03 stanford 1.E+03 GPU: 146 s Seq. : 65 s 0.45 x GPU: 626 s Seq. : 41 s 0.07 x Y-axis: # of nonzeros in columns, descending, log-scale

  33. Analysis of results Good speedup for data with long-tailed distribution of nonzeros

  34. Analysis of results Good speedup for data with long-tailed distribution of nonzeros Synthetic data First 1,000 columns : 100,000 nonzeros each Next 699,000 columns : all zero Speedup: 123 x

  35. Analysis of results Most nodes sparsely connected ( << 32 edges) Now: one warp, one instance of Mondriaan 2 2 1 2 1 Thread 0 Thread 1 What if this column has < 32 nonzeros? Thread 2

  36. Work in progress Pool multiple instances of Mondriaan into warp 2 1 2 2 1 2 1 1 2 1 Thread 0 Thread 1 Thread 2

  37. Conclusion Implemented hypergraph algorithm handling arbitrary connectivity patterns Explored SHFL for task planning Future work: more flexible allocation strategy

  38. Acknowledgment Trinity College: Student Research Program NVIDIA: CUDA Teaching Center

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#