GPU Performance for NFA Processing

Why GPUs are Slow at Executing NFAs and How to Make Them Faster
Hongyuan Liu (William & Mary),
Sreepathi Pai (University of Rochester),
and Adwait Jog (William & Mary)
Automata (Finite State Machines) Processing
Larger Applications
Efficient Automata Processing
Architectures for automata processing
[Figure: many options, compared by parallelism and availability ✔️]
Deterministic Finite Automata (DFA): Simple, Not compact
Nondeterministic Finite Automata (NFA): Compact, More Parallelism ✔️
Efficiently Processing Large-scale NFAs on GPU
Outline
Introduction
Motivation
Addressing the Data Movement Problem
Addressing the Utilization Problem
Evaluation
Contributions
Accepting Pattern: b*.y*z
[NFA diagram: states S0–S3, with starting and reporting states marked and edges labeled b, *, y, z]
Alphabet-oriented Transition Table in GPU Global Memory
[Table: one row per alphabet symbol; most cells are empty or repeat the same entries (e.g. S2, S3)]
The transition table is sparse and redundant!
Problem #1: Data Movement
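The baseline layout can be sketched in Python. This is an illustrative model, not the paper's CUDA code, and the 3-state NFA used below is a hypothetical stand-in for the slides' b*.y*z example:

```python
# Baseline (sketch): an alphabet-oriented transition table indexed
# table[symbol][state] -> list of next states. One row per alphabet
# symbol means 256 rows, almost all entries empty or identical -- the
# table is sparse and redundant -- and it lives in (simulated) global
# memory that must be read for every active state on every symbol.

ALPHABET = 256

def build_table(transitions, n_states):
    """transitions: list of (src_state, symbol_predicate, dst_state)."""
    table = [[[] for _ in range(n_states)] for _ in range(ALPHABET)]
    for src, pred, dst in transitions:
        for sym in range(ALPHABET):
            if pred(sym):
                table[sym][src].append(dst)
    return table

def run_nfa(table, start_states, input_bytes):
    active = set(start_states)
    loads = 0
    for sym in input_bytes:
        nxt = set(start_states)      # starting states are always active
        for s in active:
            loads += 1               # one table-row access per state per symbol
            nxt.update(table[sym][s])
        active = nxt
    return active, loads

# Hypothetical 3-state NFA approximating b*.y*z:
# S0 loops on 'b', any symbol moves S0 -> S1, S1 loops on 'y',
# and 'z' moves S1 -> S2 (the reporting state).
transitions = [
    (0, lambda c: c == ord('b'), 0),
    (0, lambda c: True, 1),
    (1, lambda c: c == ord('y'), 1),
    (1, lambda c: c == ord('z'), 2),
]
```

The `loads` counter makes the data-movement cost visible: every active state triggers a table access for every input symbol, even though most rows carry no new information.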
* Starting states are always active.
Not all states are active all the time: on average 0.39% of states are active (maximum 3.05%).
Problem #2: Thread Utilization
Input Stream: xyz
Threads must access the transition table for every symbol.
Problem #1: Data Movement
Ideal Case: Threads only load input symbols.
[Chart: up to 25X / 18X potential speedup in the ideal case]
Outline
Introduction
Motivation
Addressing the Data Movement Problem
Addressing the Utilization Problem
Evaluation
Contributions
New Transition Table
Idea: Put the transition table into GPU registers as much as possible.
[Diagram: thread T0 holds its state's entry — outedges (S1) and a matchset bitmap (…00010000…, whose 98th bit, the ASCII code of 'b', is 1) — in registers, spilling to local memory only when needed, instead of reading the alphabet-oriented table in global memory.]
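A minimal sketch of the register-resident record, with Python local variables standing in for CUDA registers; the `State` class and its field names are illustrative, not the paper's API:

```python
# Each thread owns one NFA state, holding its out-edges and a 256-bit
# matchset bitmap (modeled as a Python int). Checking whether a symbol
# matches becomes a bit test on a register-resident value instead of a
# global-memory table load.

def make_matchset(symbols):
    """Build a 256-bit bitmap with one bit set per matching byte value."""
    bits = 0
    for s in symbols:
        bits |= 1 << s
    return bits

class State:
    def __init__(self, outedges, matchset):
        self.outedges = outedges    # ids of next states
        self.matchset = matchset    # 256-bit bitmap

def matches(state, sym):
    """Bit check: is bit `sym` set in the matchset?"""
    return (state.matchset >> sym) & 1 == 1

# State S1 with transition symbol 'b': bit 98 (ASCII of 'b') is set.
s1 = State(outedges=[2], matchset=make_matchset([ord('b')]))
```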
Matchset Compression (MaC)
Idea: Convert memory accesses to compute.
complete: b → matchset ….000010000…. (256 bits) compresses to Start = 98, End = 99 (16 bits)
complement: ^b → matchset ….111101111…. (256 bits) compresses to Start = 98, End = 99 (16 bits)
Matchset Compression (MaC)
Bit Checking → Range Checking
[Diagram: New Transition Table with MaC — outedges S1, start = 98, end = 99]
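The compression step can be sketched as follows — an illustrative reconstruction from the slides' complete/complement examples, with function names that are my own assumptions:

```python
# Matchset Compression (MaC) sketch: a 256-bit matchset whose 1-bits
# form a single contiguous run ("complete"), or whose 0-bits do
# ("complement"), compresses to a (start, end) byte range -- 16 bits
# instead of 256 -- and membership becomes a range comparison (compute
# instead of a memory access).

ALL_ONES = (1 << 256) - 1

def try_compress(matchset_bits):
    """Return (kind, start, end) with a half-open range, or None."""
    for kind, bits in (("complete", matchset_bits),
                       ("complement", matchset_bits ^ ALL_ONES)):
        ones = [i for i in range(256) if (bits >> i) & 1]
        if ones and ones == list(range(ones[0], ones[-1] + 1)):
            return kind, ones[0], ones[-1] + 1
    return None

def range_match(kind, start, end, sym):
    """Range check replacing the bit check on the full bitmap."""
    inside = start <= sym < end
    return inside if kind == "complete" else not inside
```

For the slides' example, both `b` and `^b` compress to Start = 98, End = 99; only the complete/complement flag differs.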
Scope of Matchset Compression (MaC)
[Chart: MaC applies to 70% of matchsets]
Outline
Introduction
Background and Motivation
Addressing the Data Movement Problem
Addressing the Utilization Problem
Evaluation
Contributions
Activity of States
Activity-based Processing
Idea: Hot States vs. Cold States
Hot states are mapped to threads; cold states are kept in worklists.
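The hot/cold split can be sketched as a simple profiling-based partition; the 1% threshold and the profile format here are assumptions for illustration, and the paper's actual classification policy may differ:

```python
# Activity-based processing sketch: split states into hot and cold by
# how often each was active in a profiling run. Hot states each get a
# dedicated thread; cold states are only touched when a worklist entry
# activates them, so idle threads are not wasted on inactive states.

def partition_states(activity, hot_threshold=0.01):
    """activity: {state_id: fraction of input symbols the state was active}."""
    hot = {s for s, a in activity.items() if a >= hot_threshold}
    cold = set(activity) - hot
    return hot, cold
```

This directly targets Problem #2: with an average of 0.39% of states active, mapping every state to a thread leaves almost all threads idle.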
Thread Block (input: xyz)
Hot Mode: S1 → S2, S1 → S3; newly activated cold states S2, S3 go to the Next Cold Worklist.
Sync threads
Cold Mode: Cold Worklist = Next Cold Worklist, and zero Next Cold Worklist.
Thread Block (input: xyz)
Hot Mode: S1 → S2, S1 → S3; Next Cold Worklist = S2, S3.
Sync threads
Cold Mode: Cold Worklist = Next Cold Worklist (S2, S3), and zero Next Cold Worklist.
Draining the Cold Worklist activates S2, S2, S3 — S2 is duplicated.
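One input-symbol step of this scheme can be sketched as follows. Sequential Python stands in for a GPU thread block; for simplicity the sketch assumes every newly activated successor is cold, and it deduplicates while draining, which is one possible way to handle the duplicated-entry case the slide shows:

```python
# One symbol step of hot/cold processing (sketch). Hot states are
# simulated on every symbol; activations are appended to a "next"
# worklist, which becomes the current worklist after a barrier
# (`__syncthreads()` on a real GPU). The same cold state can be
# appended more than once -- the "Duplicated" case -- so entries are
# skipped if already seen while draining.

def step(hot_states, cold_worklist, sym, delta):
    """delta(state, sym) -> list of activated next states (assumed cold)."""
    next_cold = []
    for s in hot_states:              # hot mode: one thread per hot state
        next_cold.extend(delta(s, sym))
    seen = set()
    for s in cold_worklist:           # cold mode: drain the worklist
        if s in seen:
            continue                  # skip duplicated entries
        seen.add(s)
        next_cold.extend(delta(s, sym))
    # barrier, then: Cold Worklist = Next Cold Worklist; zero the latter
    return next_cold
```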
Outline
Introduction
Motivation
Addressing the Data Movement Problem
Addressing the Utilization Problem
Evaluation
Contributions
Evaluation
Baselines:
  iNFAnt [CCR'10]
  NFA-CG [PPoPP'12]
  AP Chip
NVIDIA Quadro P6000
Sixteen Applications
Evaluation
Our artifact is available at https://github.com/bigwater/gpunfa-artifact
Contributions
Two Performance Bottlenecks
  Excessive Data Movement
  Poor Compute Utilization
Our Proposals
  Use on-chip resources when possible
  Convert memory accesses to compute
  Map only active states to threads
Performance Improvement
  26.5X / 5.3X over prior work
  Reduced the gap toward an AP chip
We acknowledge the support of the National Science Foundation (NSF) grants (#1657336, #1750667).
Slide Note

My name is Hongyuan Liu. In this video, I will present our work, "Why GPUs are Slow at Executing NFAs and How to Make Them Faster."

This work was jointly performed with my advisor Prof. Adwait Jog and our collaborator Prof. Sreepathi Pai.


