Understanding GPU Performance for NFA Processing
Hongyuan Liu, Sreepathi Pai, and Adwait Jog delve into the challenges of GPU performance when executing NFAs. They address data movement and utilization issues, proposing solutions and discussing the efficiency of processing large-scale NFAs on GPUs. The research explores architectures and parallelism to optimize NFA processing, shedding light on the complexities involved in improving GPU performance for automata applications.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Why GPUs are Slow at Executing NFAs and How to Make them Faster Hongyuan Liu (William & Mary), Sreepathi Pai (University of Rochester), and Adwait Jog (William & Mary)
Automata (Finite State Machines) Processing Larger Applications Efficient Automata Processing 2
Architectures for automata processing Many Options Unavailable Parallelism 3
More Parallelism Compact Simple Not compact Deterministic Finite Automata (DFA) Nondeterministic Finite Automata (NFA) 4
Efficiently Processing Large-scale NFAs on GPU 5
Outline Introduction Motivation Addressing the Data Movement Problem Addressing the Utilization Problem Evaluation Contributions 6
Starting State S2 y Reporting State z b * S0 S1 S3 Accepting Pattern b*.y*z 7
S2 Starting State y z b * Reporting State S0 S1 S3 S0 S0 S0 S0 S1 S1 S1 S1 S2 S2 S2 S2 S3 S3 S3 S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 b b b Transition table is sparse and redundant! S1 S1 S1 S1 b x x x y x y y z S2, S3 S2, S3 S2, S3 S2, S3 y z z report report report report z Alphabet-oriented Transition Table in GPU Global Memory Problem #1: Data Movement 8
T1 T1 T3 T3 T0 T0 T2 T2 GPU Threads GPU Threads S2 y z b * S0 S1 S2 S3 S0 S1 S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S1 b * Starting states are always active. x S2, S3 y Starting State report z Reporting State Not all states are active all the time. Average = 0.39%, Maximum = 3.05% Active State Matched State Problem #2: Thread Utilization 9
Input Stream: xyz S2 S2 S2 y y y z z z b b b * * * S0 S1 S2 S3 S0 S0 S0 S1 S1 S1 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S2, S3 S3 S3 S3 S1 b Starting State x Reporting State S2, S3 y report z Active State Threads must access the transition table for every symbol. Matched State Problem #1: Data Movement 10
25X 18X Ideal Case: Threads only load input symbols 11
Outline Introduction Motivation Addressing the Data Movement Problem Addressing the Utilization Problem Evaluation Contributions 12
New Transition Table Idea: Put transition table to GPU registers as much as possible. T0 Alphabet-oriented Transition Table Registers New Transition Table S0 Local Memory outedges S1 S1 b matchset 00010000 x y 98th is 1 (ASCII of b ) z 13
Matchset Compression (MaC) Idea: Convert memory accesses to compute. .000010000 .. Start = 98 End = 99 256 bit b 16 bit complete 256 bit .111101111 .. Start = 98 End = 99 ^b 16 bit complement 14
Matchset Compression (MaC) Bit Checking Range Checking New Transition Table with MaC outedges S1 start 98 end 99 15
Outline Introduction Background and Motivation Addressing the Data Movement Problem Addressing the Utilization Problem Evaluation Contributions 17
Activity-based Processing Idea: Hot States Vs. Cold States Hot States Threads Cold States Worklists 19
S2 y S1 z b * xyz S3 S0 Thread Block S1 S2 S1 S3 Hot Mode Next Cold Worklist S2, S3 Cold Worklist Cold Mode Cold Worklist = Next Cold Worklist and Zero Next Cold Worklist Sync threads 20
S2 y S1 z b * xyz S3 S0 Thread Block S1 S2 S1 S3 Hot Mode Next Cold Worklist S2, S3 Duplicated Cold Worklist S2, S3 Cold Mode S2 S2 S2 S3 Cold Worklist = Next Cold Worklist and Zero Next Cold Worklist Sync threads 21
Outline Introduction Motivation Addressing the Data Movement Problem Addressing the Utilization Problem Evaluation Contributions 22
Evaluation Baselines iNFAnt [CCR 10] NFA-CG [PPoPP 12] AP Chip NVIDIA Quadro P6000 Sixteen Applications 23
Evaluation Our artifact is available in https://github.com/bigwater/gpunfa-artifact 24
Contributions Two Performance Bottlenecks Excessive Data Movement Poor Compute Utilization Our Proposals Use on-chip resources when possible Converting memory accesses to compute Mapping only active states to threads Performance Improvement 26.5X / 5.3X over prior work Reduced the gap towards an AP chip 25 We acknowledge the support of the National Science Foundation (NSF) grants (#1657336, #1750667)